Likelihood Adaptively Modified Penalties

08/23/2013 · by Yang Feng, et al.

A new family of penalty functions, adaptive to likelihood, is introduced for model selection in general regression models. It arises naturally through assuming certain types of prior distribution on the regression parameters. To study stability properties of the penalized maximum likelihood estimator, two types of asymptotic stability are defined. Theoretical properties, including the parameter estimation consistency, model selection consistency, and asymptotic stability, are established under suitable regularity conditions. An efficient coordinate-descent algorithm is proposed. Simulation results and real data analysis show that the proposed method has competitive performance in comparison with existing ones.


1 Introduction

Classical work on variable selection dates back to Akaike (1973), who proposed to choose the model that minimizes the Kullback–Leibler (KL) divergence of the fitted model from the true model, leading to the well-known Akaike information criterion (AIC). Schwarz (1978) took a Bayesian approach by assuming prior distributions with nonzero probabilities on lower-dimensional subspaces, which led to what is known as the BIC criterion for model selection. Other penalties include Mallows's $C_p$ (Mallows, 1973), AICc (Hurvich and Tsai, 1989), RIC (Foster and George, 1994) and EBIC (Chen and Chen, 2008), among others.

The $L_0$ regularization has a natural interpretation in the form of best subset selection. It also exhibits good sampling properties (Barron et al., 1999). However, in a high-dimensional setting, the resulting combinatorial problem is NP-hard and computationally prohibitive. As a result, numerous attempts have been made to modify the $L_0$-type regularization to alleviate the computational burden. These include bridge regression (Frank and Friedman, 1993), the non-negative garrote (Breiman, 1995), the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), the adaptive LASSO or ALASSO (Zou, 2006), the Dantzig selector (Candes and Tao, 2007), SICA (Lv and Fan, 2009), and MCP (Zhang, 2010), among others.

To a certain extent, existing penalties can be classified into two categories: convex penalties and nonconvex penalties. Convex penalties, such as the LASSO (Tibshirani, 1996), lead to sparse solutions and are stable because the induced optimization problems are convex. Nonconvex penalties, such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), can lead to sparser solutions as well as the so-called oracle property (the estimator behaves as if the identities of the nonzero regression coefficients were known beforehand). However, the nonconvexity of the penalty can make the entire optimization problem nonconvex, so the algorithm may return a local minimizer, and the solution may not be as stable as with a convex penalty. An important issue for nonconvex penalties is therefore a good balance between sparsity and stability. For example, both SCAD and MCP have an extra tuning parameter that regulates the concavity of the penalty so that, when it exceeds a threshold, the optimization problem becomes convex.
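
For concreteness, the following are the standard forms of these two penalties as given in Fan and Li (2001) and Zhang (2010); the notation here (concavity parameters $a > 2$ for SCAD and $\gamma > 1$ for MCP) is chosen for illustration and may differ slightly from the paper's. For $t \ge 0$,
\[
p'_{\mathrm{SCAD},\lambda,a}(t) = \lambda\Big\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \Big\},
\qquad
p_{\mathrm{MCP},\lambda,\gamma}(t) = \lambda \int_0^{t} \Big(1 - \frac{x}{\gamma\lambda}\Big)_+ \, dx .
\]
Increasing $a$ or $\gamma$ reduces the maximum concavity ($1/(a-1)$ and $1/\gamma$, respectively), and once the concavity falls below a threshold determined by the data, the penalized objective becomes convex.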

It is well known that penalty functions have a Bayesian interpretation. The classical $L_2$ penalty (ridge regression) is equivalent to the Bayesian estimator with a normal prior. The $L_1$-type penalties, such as the LASSO and ALASSO, also have Bayesian counterparts; cf. Park and Casella (2008), Griffin and Brown (2010) and O'Hara and Sillanpää (2009).

Breiman (1996) initiated the discussion of stability in model selection, demonstrating that many model selection methods are unstable but can be stabilized by perturbing the data and averaging over the resulting predictors. Breiman (2001) introduced random forests, providing a way to stabilize the selection process. Bühlmann and Yu (2002) derived theoretical results analyzing the variance reduction effect of bagging in hard decision problems. Meinshausen and Bühlmann (2010) proposed stability selection, which combines subsampling with high-dimensional variable selection methods.

The main objective of this paper is to introduce a family of penalty functions for generalized linear models that achieves a proper balance between sparsity and stability. Because the loss function for generalized linear models is usually the negative log-likelihood, it is natural to take the form of the likelihood into consideration when constructing penalty functions. The Bayesian connection between penalty construction and the likelihood function makes it natural to introduce penalty functions through suitable prior distributions. To this end, we introduce the family of negative absolute priors (NAP) and use it to develop what we call likelihood adaptively modified penalties (LAMP). We define two types of asymptotic stability that cover a wide range of situations and provide a mathematical framework under which the issue of stability can be studied rigorously. We show that, under suitable regularity conditions, LAMP results in an asymptotically stable estimator.

The rest of the paper is organized as follows. Section 2 introduces the LAMP family, with motivation from its likelihood and Bayesian connections; specific examples are given for commonly encountered generalized linear models. In Section 3, we introduce two types of asymptotic stability and study the asymptotic properties of the LAMP family. The choice of tuning parameters and an efficient algorithm are discussed in Section 4. In Section 5, we present simulation results and apply the proposed method to two real data sets. We conclude with a short discussion in Section 6. All technical proofs are relegated to the Appendix.

2 Likelihood adaptively modified penalty

To introduce our approach, we first focus on generalized linear models (Nelder and Wedderburn, 1972). It will be clear that the approach also works for other types of regression models, including various nonlinear regression models. Indeed, our simulation results in Section 5 include the probit model, which does not fall within the class of generalized linear models induced by the exponential family.

Throughout the paper, we shall use $(x_i, y_i)$, $i = 1, \ldots, n$, to denote the i.i.d. observations, where $y_i$ is the response and $x_i$ is the $p$-dimensional covariate vector of the $i$th observation. Let $X$ be the $n \times p$ matrix of observed covariates. Following Nelder and Wedderburn (1972), we assume that the conditional density of $y_i$ given the covariates has the following form

(1)

where $b(\cdot)$ is a smooth convex function and $\phi$ is the dispersion parameter. Then, up to an affine transformation, the log-likelihood function is given by

(2)

Note that the form of the log-likelihood is uniquely determined by $b$ and vice versa.
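
Since the displays (1) and (2) are not reproduced above, the following is the standard canonical form of Nelder and Wedderburn (1972) consistent with the surrounding description; the paper's exact parameterization may differ slightly:
\[
f(y_i \mid x_i; \beta, \phi)
  = \exp\!\Big\{\frac{y_i\,\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi)\Big\},
\qquad \theta_i = x_i^\top \beta,
\]
and, up to an affine transformation, the log-likelihood is
\[
\ell(\beta) = \sum_{i=1}^n \big\{ y_i\, x_i^\top\beta - b(x_i^\top\beta) \big\}.
\]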

For a given $b$, we propose the induced penalty function

(3)

which contains three parameters. The corresponding negative penalized log-likelihood function is

(4)

Because the penalty function defined by (3) is likelihood specific, we will call such a penalty a likelihood adaptively modified penalty (LAMP).

The penalty clearly vanishes at zero and is nonnegative. Furthermore, taking the first and second derivatives, we obtain explicit expressions in terms of $b$ and its derivatives.

Remark 1.

The three parameters have clear interpretations: the first is the usual tuning parameter controlling the overall penalty level; the second is a location parameter, which may be fixed as a constant; the third controls the concavity of the penalty, playing the same role as the concavity parameters in SCAD (Fan and Li, 2001) and MCP (Zhang, 2010).

Remark 2.

The exponential family assumption given by (1) implies that the first and second derivatives of $b$ are the conditional mean and variance functions of the response. When the response is nonnegative, the second derivative of the penalty is negative, and the corresponding LAMP must be concave.

Like many other penalty functions, the family of LAMPs also has a Bayesian interpretation. To see this, we introduce the following prior for any given $b$ that defines the exponential family (1).

Definition 1.

Let the hyperparameters be given constants. Define a prior density, to be called the negative absolute prior (NAP), by

(5)

With a suitable choice of the hyperparameters, the posterior mode is exactly the minimizer of (4), the LAMP-penalized negative log-likelihood. We will show that such a choice leads to good asymptotic properties.
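
The underlying Bayesian connection is the familiar correspondence between a posterior mode and a penalized estimator; written generically (the NAP density (5) itself is not reproduced here):
\[
\hat\beta_{\mathrm{MAP}}
 = \arg\max_{\beta}\big\{\ell(\beta) + \log \pi(\beta)\big\}
 = \arg\min_{\beta}\big\{-\ell(\beta) - \log \pi(\beta)\big\},
\]
so choosing $-\log\pi(\beta)$ equal to the LAMP penalty (up to an additive constant) makes the posterior mode coincide with the minimizer of (4).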

Remark 3.

The additive form of the penalty function implies that the prior is of product form across coordinates. For each coordinate, one hyperparameter scales the prior while another governs its speed of decay; the larger their values, the more information the prior carries about that parameter. Translated into the penalty function, these hyperparameters represent the level of penalization and can be adjusted separately for each component.

Remark 4.

The form of the NAP is similar to that of a conjugate prior, in that the shape of the prior is adapted to that of the likelihood function. However, the NAP differs from a conjugate prior in several respects: it enters with a negative sign, it depends on the coefficients through their absolute values, it is formed separately for each dimension, and a LASSO term is taken away. Whereas a conjugate prior can be interpreted as supplying additional samples, the NAP can be viewed as removing the absolute value of redundant sample information from each dimension.

It is worth looking into commonly encountered generalized linear models and examining the properties of the corresponding LAMPs.

Linear regression. In this case, $b(\theta) = \theta^2/2$. Thus, LAMP reduces to the elastic net (Zou and Hastie, 2005).
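
For reference, the elastic net penalty of Zou and Hastie (2005) has, up to parameterization, the combined $L_1$/$L_2$ form
\[
p_{\lambda_1, \lambda_2}(t) = \lambda_1 |t| + \lambda_2 t^2,
\]
which is the form that LAMP reduces to in this linear case.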

Logistic regression. Here $b(\theta) = \log(1 + e^{\theta})$. Consequently, the penalty function takes the form

(6)

where $\sigma(\cdot)$ denotes the logistic (sigmoid) function; this penalty will be called the sigmoid penalty.

Poisson regression. Here $b(\theta) = e^{\theta}$, and the resulting penalty will be called the Poisson penalty.

Gamma regression. For the gamma regression, we have $b(\theta) = -\log(-\theta)$, and the penalty takes the corresponding form.

Inverse Gaussian regression. For the inverse Gaussian, we have $b(\theta) = -\sqrt{-2\theta}$, and the resulting penalty follows accordingly.

Probit regression. As mentioned earlier, the LAMP approach can also accommodate regression models that are not naturally parametrized exponential families; in particular, it applies to probit regression for binary outcomes. The resulting penalty is expressed in terms of $\Phi$ and $\varphi$, the distribution and density functions of the standard normal random variable, respectively.

Remark 5.

For the above examples, the effect of the tuning parameters in the penalty function varies across settings; for example, the location parameter does not play a role in Poisson regression. In addition, there is a natural default choice of the location parameter: one value for the linear, gamma, and inverse Gaussian cases, and another for the logistic and probit cases.

Figure 1: Penalties and their derivatives.

As the examples show, the LAMP family is fairly rich, and its members differ from the commonly used penalties. Figure 1 plots several penalty functions (left panel) and their derivatives (right panel), including LASSO, SCAD, MCP and two members of the LAMP family (sigmoid and Poisson). The concavity parameters of MCP, SCAD, and the sigmoid and Poisson penalties are chosen to keep the maximum concavity of these penalties the same. Figure 1 shows that the sigmoid and Poisson penalties lie between MCP and SCAD when the maximum concavity is matched. Moreover, the derivative plots show that the LAMP penalties have continuous derivatives (in fact, derivatives of all orders for most common generalized linear models), in contrast to the discontinuous derivatives of SCAD and MCP. It will be shown that this feature makes the optimization problem easier and the estimation more stable.
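
As a small numerical illustration of the matched-concavity comparison in Figure 1, the following Python sketch evaluates the well-known SCAD and MCP derivatives. The LAMP members are omitted because their exact functional form depends on (3), which is not reproduced above, and all parameter values below are illustrative rather than those used in the paper.

import numpy as np

def scad_deriv(t, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li, 2001) for t >= 0."""
    t = np.asarray(t, dtype=float)
    return lam * np.where(t <= lam, 1.0,
                          np.clip(a * lam - t, 0.0, None) / ((a - 1) * lam))

def mcp_deriv(t, lam, gamma=3.0):
    """Derivative of the MCP penalty (Zhang, 2010) for t >= 0."""
    t = np.asarray(t, dtype=float)
    return lam * np.clip(1.0 - t / (gamma * lam), 0.0, None)

# The maximum concavity (largest value of -p'') is 1/(a-1) for SCAD and
# 1/gamma for MCP, so matching them as in Figure 1 amounts to a - 1 = gamma.
lam, gamma = 1.0, 3.0
a = gamma + 1.0
vals = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
print("SCAD max concavity:", 1.0 / (a - 1.0), "MCP max concavity:", 1.0 / gamma)
print("SCAD derivative:", scad_deriv(vals, lam, a))
print("MCP derivative: ", mcp_deriv(vals, lam, gamma))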

Remark 6.

Consider a one-dimensional penalized logistic regression problem. To distinguish a nonzero parameter from zero, the derivative of the penalty should differ as much as possible between the nonzero value and zero; at the same time, the second derivative of the penalty must be controlled so that the penalized negative log-likelihood remains globally convex. Figure 2 illustrates graphically how these two objectives can be met for logistic regression under the sigmoid penalty and MCP. The grey area represents the difference in penalty derivatives for the sigmoid penalty; the darker subset of the grey area represents the corresponding difference under MCP. MCP tends to have a smaller difference than the sigmoid penalty because its second derivative must be controlled by that of the negative log-likelihood in order to maintain global convexity.

Figure 2: Negative second derivative of log-likelihood, the sigmoid penalty and MCP in logistic regression.

3 Theoretical properties

3.1 Asymptotic stability

Recall that the negative log-likelihood considered here is convex, so the maximum likelihood estimate is uniquely defined and stable. Adding a nonconvex penalty may destroy this convexity. To study the stability of the penalized MLE, it is therefore necessary to study the impact of the penalty on the concavity of the objective, especially when the penalty level is large.

For the penalized negative log-likelihood minimization (4) with a nonconvex penalty, the solution may not be unique. It is therefore natural to study how the local minimizers behave when the observations are perturbed. Here we introduce a new concept of asymptotic stability to describe the asymptotic behavior of the local minimizers of penalized likelihood estimates. Note that even for penalized likelihood estimators with a convex penalty, where a unique minimizer exists, the asymptotic stability concept is still useful in characterizing the behavior of the global minimizer.

Suppose we want to minimize a criterion function of the parameter given the observations. We say that a point is a local minimizer if there exists a neighborhood of that point within which the criterion attains its minimum there; the set of all local minimizers is defined accordingly.

Throughout the paper, two norms are used for vectors and matrices (matrices being vectorized), and for any center and radius we write the corresponding centered ball.

It is clear from the definition that the set of local minimizers is random. We characterize its asymptotic behavior in terms of whether or not the set converges to a single point as the sample size grows. For a set, its diameter is defined as the supremum of pairwise distances between its elements.

Definition 2 (Weak asymptotic stability).

We say that the set of local minimizers of the criterion function satisfies weak asymptotic stability if

(7)

Weak asymptotic stability characterizes the asymptotic behavior of local minimizers when the data are perturbed slightly. It shows that, for large sample sizes and small perturbations, the local minimizers stay sufficiently close to each other with high probability. Defined below is a stronger version, which guarantees uniqueness of the minimizer.

Definition 3 (Strong asymptotic stability).

We say that the set of local minimizers of the criterion function satisfies strong asymptotic stability if

(8)
Remark 7.

Under weak asymptotic stability, multiple minimizers, though shrinking toward a single point, may exist with high probability for any finite sample size. Strong asymptotic stability, on the other hand, entails that as the sample size grows, the probability of having multiple minimizers converges to zero, implying that there is a unique minimizer with high probability.
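
Since the displays (7) and (8) are not reproduced above, here is one plausible formalization consistent with the verbal descriptions; the set $M_n(\delta)$ of local minimizers under data perturbations of size at most $\delta$, and the exact limiting scheme, are assumptions rather than the paper's definitions:
\[
\text{(weak)}\qquad
\lim_{\delta\downarrow 0}\ \limsup_{n\to\infty}\
 P\big(\operatorname{diam} M_n(\delta) > \epsilon\big) = 0
 \quad\text{for every } \epsilon>0;
\]
\[
\text{(strong)}\qquad
\text{in addition,}\quad
\lim_{n\to\infty} P\big(\# M_n(\delta_n) > 1\big) = 0
\ \text{ for some sequence } \delta_n\downarrow 0 .
\]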

Remark 8.

$L_0$-type penalties, though they may possess the weak asymptotic stability property if the tuning parameter is adjusted as in AIC, will never possess the strong asymptotic stability property: the solution of each submodel with a constrained number of parameters is a local minimizer, and the probability that all such solutions coincide vanishes as the sample size grows.

We now consider the situation in which the criterion function can be approximated by an i.i.d. sum plus a remainder term, i.e.,

(9)

This general form includes a negative log-likelihood plus a penalty. We assume throughout the rest of the paper that the parameter space is compact, with the true parameter value lying in its interior, and that the criterion is finite. We need the following regularity conditions.

  (C1) For any observation, the criterion summand is convex in the parameter, and its expectation is strictly convex.

  (C2) The summand admits an integrable envelope that controls its variation over the parameter space.

  (C3) The remainder term is asymptotically flat.

  (C4) There exists a local minimizer of (9) that is a consistent estimator of the true parameter.

  (C5) The expected criterion is locally strongly convex around the true parameter.

Remark 9.

When the summand and the remainder are both smooth functions, Conditions (C2) and (C3) can be guaranteed by assuming that the derivative of the summand is uniformly bounded and that the derivative of the remainder tends to 0 uniformly. Condition (C5) is guaranteed by convexity around the true parameter, uniformly in the data.

The next two lemmas provide sufficient conditions for the two types of asymptotic stability.

Lemma 1.

If Conditions (C1)-(C3) are satisfied, then we have weak asymptotic stability.

It is straightforward to verify that for generalized linear models, SCAD, MCP, sigmoid penalty and Poisson penalty all satisfy Conditions (C1)-(C3), leading to the weak asymptotic stability.

Lemma 2.

If Conditions (C1)-(C5) are satisfied, then strong asymptotic stability holds.

3.2 Asymptotic properties

In this subsection, we study asymptotic properties of the proposed LAMP, including parameter estimation consistency, model selection consistency and asymptotic stability. Let the number of nonzero coefficients (signals) be fixed. Without loss of generality, assume that the first block of the true coefficient vector contains the nonzero elements and the remaining block is zero; the minimum and maximum absolute values of the nonzero coefficients determine the range of the signal level. Here we consider the setting in which the dimension is fixed as the sample size grows, though some results may be extended to the diverging-dimension case.

We first introduce certain regularity conditions which are needed for establishing asymptotic properties.

  (C6) The relevant derivatives of $b$ are bounded and the minimum eigenvalue of the associated design matrix is bounded away from zero.

  (C7) The penalty derivative is suitably controlled at zero and at the signal level, and is increasing near the origin, up to a constant.

  (C8)-(C9) The tuning parameters satisfy appropriate rate conditions relative to the sample size and the signal level.

Theorem 1.

Suppose that Conditions (C6)-(C9) are satisfied. Then the penalized maximum likelihood estimator based on the LAMP family is consistent and asymptotically normal, and achieves model selection consistency and strong asymptotic stability.

Lemma 3.

Suppose that (C6) and (C7) are satisfied and that the penalty derivative is, up to a constant, a slowly varying function at infinity. Then an appropriate rate condition on the tuning parameters implies (C8).

It can be verified that this condition is satisfied for the logistic, Poisson and probit regression models. To achieve model selection consistency, the tuning parameters can be chosen accordingly.

We now present the corresponding results for the sigmoid penalty, whose derivative involves a slowly varying function. The following conditions simplify (C6)-(C9).

  (C6')-(C8') The design satisfies the eigenvalue condition in (C6), the concavity parameter is a constant (not depending on the sample size), and the remaining tuning parameters satisfy suitable rate conditions.

The following corollary shows that the results of Theorem 1 carry over to the penalized logistic regression when (C6')-(C8') are assumed.

Corollary 1.

For the penalized maximum likelihood estimator of logistic regression with the sigmoid penalty (6), we have parameter estimation consistency, model selection consistency and strong asymptotic stability under conditions (C6’)-(C8’).

4 Algorithms

An important aspect of penalized likelihood estimation is computational efficiency. For the LASSO penalty, Efron et al. (2004) proposed the path-following LARS algorithm. In Friedman et al. (2007), the coordinate-wise descent method was proposed; it optimizes a target function with respect to a single parameter at a time, cycling through all parameters until convergence. For nonconvex penalties, Fan and Li (2001) used the local quadratic approximation (LQA) approach. In Zou and Li (2008), a local linear approximation method was proposed for maximizing the penalized likelihood for a broad class of penalty functions. In Fan and Lv (2011), coordinate-wise descent was implemented for nonconvex penalties as well. Yu and Feng (2013) proposed a hybrid of Newton-Raphson and coordinate descent for calculating the approximate path of penalized likelihood estimators with both convex and nonconvex penalties.

4.1 Iteratively reweighted least squares

We apply a quadratic approximation and use a coordinate descent algorithm similar to those of Keerthi and Shevade (2007) and Shevade and Keerthi (2003). Recall that our objective function is the penalized negative log-likelihood (4). Following Keerthi and Shevade (2007), we define a coordinate-wise violation function.

We see that the objective function achieves its optimum if and only if the violation is zero for every coordinate. We therefore stop the iteration once the maximum violation falls below a chosen tolerance threshold.

In each step, we use a quadratic approximation to the log-likelihood whose weights and working response depend on the current value of the coefficients. The algorithm is summarized as follows.

Algorithm: Set initial values for the tuning parameters and denote the columns of the design matrix accordingly.

  • S1. Standardize the covariates.

  • S2. Initialize the coefficients. Calculate the coordinate-wise violations; go to S3 if the stopping criterion is not yet met, else go to S5.

  • S3. Choose the coordinate with the largest violation and update it using the current quadratic approximation, setting it to zero if the update falls below the threshold.

  • S4. Recalculate the violations; go to S3 if the stopping criterion is not yet met, else go to S5.

  • S5. Transform the coefficients back to the original scale to undo the standardization.

For S2, the initial solution can be taken as the zero vector, the MLE, or the estimate computed at a neighboring tuning-parameter value in previous steps.

For S4, we first carry out the iterations for the variables in the current active set until convergence, and then check whether additional variables should join the active set. Alternatively, we may speed up the computation by using warm starts. Readers are referred to Keerthi and Shevade (2007) and Friedman et al. (2010) for details of strategies to speed up coordinate descent algorithms.

As an example, consider the logistic regression with the sigmoid penalty. In S3, we have the following two approximation methods for updating.

  1. Quadratic approximation from IRLS (Munk et al., 2006), with weights and working response computed from the current fitted probabilities (a code sketch of this type of update is given after this list).

  2. Quadratic approximation using a Taylor expansion of the log-likelihood around the current estimate, with weights formed using the component-wise product operator.
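
The following Python sketch illustrates how an IRLS-type quadratic approximation (method 1) can be combined with coordinate descent, handling a concave penalty through a local linear approximation in the spirit of the MM framework mentioned in Section 4.2. The interface pen_deriv, the soft-thresholding update, and all names are illustrative assumptions rather than the paper's exact algorithm.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_logistic_cd(X, y, pen_deriv, lam, n_iter=100, tol=1e-6):
    """Sketch: IRLS + coordinate descent for penalized logistic regression.

    The concave penalty is handled by a local linear approximation (LLA):
    at each sweep the penalty is replaced by per-coordinate weights
    lam * pen_deriv(|beta_j|), so each coordinate update is a
    soft-thresholding step on a weighted least-squares problem.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta_old = beta.copy()
        eta = X @ beta
        mu = sigmoid(eta)
        w = np.clip(mu * (1.0 - mu), 1e-5, None)      # IRLS weights
        z = eta + (y - mu) / w                        # working response
        lla = lam * pen_deriv(np.abs(beta))           # LLA penalty weights
        for j in range(p):
            r_j = z - X @ beta + X[:, j] * beta[j]    # partial residual
            num = np.sum(w * X[:, j] * r_j) / n
            den = np.sum(w * X[:, j] ** 2) / n
            beta[j] = np.sign(num) * max(abs(num) - lla[j], 0.0) / den
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Example: a constant penalty derivative recovers the LASSO.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = rng.binomial(1, sigmoid(X[:, 0] - X[:, 1]))
beta_hat = penalized_logistic_cd(X, y, pen_deriv=lambda t: np.ones_like(t), lam=0.05)
print(np.round(beta_hat, 3))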

For an initial estimator, we denote the corresponding quadratic approximation of the objective function at that point.

Theorem 2.

Let the constraint set be closed and the objective function strictly convex. In addition, assume that the quadratic approximation at the current iterate majorizes the objective everywhere on the constraint set. Then the algorithm constrained to this set (minimization within the set) converges to the true minimum. In addition, method 1 for the logistic regression satisfies this condition on the quadratic approximation.
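
The convergence argument follows the usual majorize-minimize (MM) logic; written generically for an objective $F$ and a surrogate $Q(\cdot\,;\beta^{(k)})$ that majorizes $F$ and matches it at the current iterate (a standard fact, not the paper's specific statement):
\[
F(\beta^{(k+1)}) \;\le\; Q(\beta^{(k+1)};\beta^{(k)}) \;\le\; Q(\beta^{(k)};\beta^{(k)}) \;=\; F(\beta^{(k)}),
\]
so the objective value is non-increasing along the iterates.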

4.2 Balance between stability and parsimony

An important issue is how to choose the tuning parameters in penalized likelihood estimation. For the LAMP family, there are three tuning parameters: the penalty level, the location parameter, and the concavity parameter. Our numerical experience shows that the resulting estimator is not sensitive to the choice of the location parameter; in most cases, we may simply fix it at its natural default value, depending on the type of regression (see Remark 5). For the penalty level and the concavity parameter, we recommend cross-validation or BIC, so long as the solutions are stable enough. The approach implemented in the package of Breheny and Huang (2011), which determines a stable region or performs local convexity diagnostics, is recommended. There are two ways to obtain a stable region: controlling the smoothness of the coefficient-estimate curve, or calculating the smallest eigenvalue of the Hessian of the penalized likelihood at each point of the path, as suggested by Theorem 1. We take the second approach in all numerical analyses.

Our algorithm differs from that implementation in two respects: we use the violation function as the convergence criterion, and we do not use adaptive rescaling. Both approximation methods 1 and 2 use a linear approximation of the penalty term at the current estimate. For method 1, concavity of the penalty implies that this linearization majorizes the penalty, so the procedure naturally falls into the MM-algorithm framework.

To choose the (penalty level, concavity) pair, we use the hybrid approach introduced by Breheny and Huang (2011), combining BIC, cross-validation and convexity diagnostics. For a path of solutions with the concavity parameter set large enough, we use AIC/BIC to select the penalty level and use the convexity diagnostics to determine the locally convex regions of the solution path. If the chosen solution lies outside the stable region, we reduce the concavity to make the penalty more convex. After iterating this process a few times, we find a value of the concavity parameter that produces a good balance between sparsity and convexity; we then fix it and use BIC or cross-validation to choose the best penalty level.
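
A minimal sketch of the convexity diagnostic described above, for the logistic case: it computes the smallest eigenvalue of the penalized Hessian at each point of a solution path, with positive values indicating a locally convex (stable) region. The interface pen_second_deriv and the MCP-style example at the end are illustrative assumptions, not the LAMP-specific formulas.

import numpy as np

def logistic_hessian(X, beta):
    """Hessian of the average negative log-likelihood for logistic regression."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = mu * (1.0 - mu)
    return (X * w[:, None]).T @ X / X.shape[0]

def min_eig_along_path(X, betas, pen_second_deriv):
    """Smallest eigenvalue of the penalized Hessian at each point of a path.

    `betas` is a (num_lambda, p) array of estimates; `pen_second_deriv`
    returns the second derivative of the penalty at |beta_j| (an assumed,
    illustrative interface).
    """
    out = []
    for beta in betas:
        H = logistic_hessian(X, beta) + np.diag(pen_second_deriv(np.abs(beta)))
        out.append(np.linalg.eigvalsh(H).min())
    return np.array(out)

# Example with an MCP-type concavity: second derivative is -1/gamma on [0, gamma*lam).
gamma, lam = 3.0, 0.1
mcp_pp = lambda t: np.where(t < gamma * lam, -1.0 / gamma, 0.0)
# min_eig_along_path(X, path_of_betas, mcp_pp) would then flag the stable region.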

5 Simulation results and examples

Simulation studies cover logistic, Poisson and probit cases. The performance of the LAMP family is compared with those of LASSO, SCAD and MCP. Particular attention is given to the logistic regression to demonstrate how sparsity and stability are properly balanced. Two classification examples from microarray experiments involving cancer patients are presented.

5.1 Logistic regression

We simulate from a logistic regression model with correlated covariates. The number of replications is 100 for this and all subsequent simulations.

Table 1 reports the number of true positives (TP), false positives (FP), the proportions of correct fit (cf), over fit (of) and under fit (uf), and the L1 and L2 estimation losses. To compare performance among LASSO, SCAD, MCP and the sigmoid penalty, we use standard package implementations to compute the LASSO, SCAD and MCP solution paths. For all penalties, we use EBIC (Chen and Chen, 2008) to choose the main tuning parameter with the other parameters fixed; the EBIC parameter is fixed throughout.

From Table 1, it is clear that the sigmoid penalty outperforms SCAD and MCP in the sense that, at a similar level of TP, the sigmoid penalty results in a smaller FP. Somewhat surprisingly, the LASSO has competitive performance, which may be attributed to the use of the EBIC selection criterion.

Penalties TP FP cf of uf L1 L2
LASSO 1.87 (0.053) 0.12 (0.041) 0.05 0.01 0.94 2.49 1.46
sig(.02) 1.87 (0.053) 0.11 (0.040) 0.05 0.01 0.94 2.48 1.45
sig(.03) 1.90 (0.052) 0.13 (0.042) 0.06 0.01 0.93 2.41 1.41
sig(.05) 1.95 (0.05) 0.15 (0.041) 0.08 0.01 0.91 2.34 1.36
sig(.07) 1.99 (0.048) 0.20 (0.047) 0.09 0.02 0.89 2.25 1.31
sig(.09) 2.02 (0.047) 0.20 (0.049) 0.1 0.02 0.88 2.17 1.26
sig(.15) 2.21 (0.057) 1.86 (0.43) 0.1 0.19 0.71 3.18 1.36
sig(.38) 2.61 (0.049) 4.65 (0.40) 0.1 0.51 0.39 6.23 2.13
SCAD(300) 1.69 (0.073) 0.14 (0.043) 0.04 0.01 0.95 2.94 1.74
SCAD(7) 1.69 (0.073) 0.14 (0.043) 0.04 0.01 0.95 2.94 1.74
SCAD(5) 1.84 (0.077) 3.20 (0.69) 0.03 0.11 0.86 2.94 1.74
SCAD(4) 2.25 (0.089) 12 (0.75) 0.01 0.45 0.54 91.9 23.5
MCP(300) 1.69 (0.073) 0.14 (0.043) 0.04 0.01 0.95 2.94 1.74
MCP(50) 1.70 (0.073) 0.16 (0.047) 0.04 0.01 0.95 2.94 1.74
MCP(15) 1.75 (0.073) 0.14 (0.043) 0.06 0.01 0.93 2.45 1.45
MCP(7) 1.79 (0.071) 0.14 (0.040) 0.07 0.01 0.92 2.21 1.31
MCP(5) 1.91 (0.081) 2.98 (0.69) 0.09 0.1 0.81 2.07 1.23
MCP(4) 2.29 (0.092) 11.89 (0.80) 0.02 0.50 0.48 109.3 28.7
Table 1:

Simulation results for model selection and estimation under different penalties. Entries are mean values with standard errors in parentheses. Here “sig” represents the sigmoid penalty. The number in parentheses following the penalty name is the concavity parameter.

5.2 Smoothness

        sigmoid                 SCAD                    MCP
param   TP      FP      param   TP      FP      param   TP      FP
.04 1.86 0.28 1.1 1.89 0.21 1.1 1.90 0.21
.05 1.81 0.22 7 1.89 0.21 7 1.90 0.21
.06 1.84 0.24 14 1.89 0.21 14 1.90 0.21
.08 1.89 0.22 20 1.89 0.21 20 1.90 0.21
.10 1.91 0.22 27 1.88 0.23 27 1.89 0.22
.11 1.89 0.21 34 1.83 0.23 34 1.86 0.24
.13 1.89 0.21 41 1.83 0.24 41 1.83 0.23
.15 1.89 0.21 54 1.83 0.28 54 1.83 0.29
.16 1.89 0.21 60 1.86 0.32 60 1.86 0.32
.18 1.89 0.21 67 1.88 0.35 67 1.88 0.36
Table 2: Simulation results for TP and FP under different penalties, with the tuning parameter selected by BIC; within each block, the first column (“param”) gives the concavity parameter of the corresponding penalty. Mean values are presented.
Figure 3: Solution paths under different penalties: (a) sigmoid(0.1), (b) LASSO, (c) SCAD(20), (d) MCP(20).

The same logistic regression model as in Section 5.1 is used, except for a few modified settings. In Figure 3, we compare the smoothness of solution paths generated by the sigmoid penalty, SCAD and MCP with the same concavity at 0; Table 2 shows that this choice also leads to similar sparsity levels. SCAD and MCP use the same algorithm as the sigmoid penalty (rather than the adaptive rescaling used in the package implementation). The shorter vertical line marks the BIC choice of the penalty level and the longer one the choice by 10-fold cross-validation. To avoid variation due to the random division in cross-validation, the results in Table 2 use BIC to choose the penalty level with the other tuning parameter fixed. The subfigures (c) and (d) for SCAD and MCP are both less smooth than subfigure (a) for the sigmoid penalty, which is of similar smoothness to subfigure (b) for LASSO. The sigmoid penalty thus appears to outperform SCAD and MCP in terms of smoothness of the solution path.

5.3 Stability

The data are generated in the same way as in the preceding subsection. For each replication, we repeat the cross-validation 100 times to select the penalty level and calculate its mean sample standard deviation; box-plots of the mean sample standard deviations are then generated. In addition, to evaluate the asymptotic stability introduced in Section 3, we add a small random perturbation to all the observations before conducting the analysis. The box-plot results are presented in Figure 4.

The result without the random error term evaluates the stability with respect to the randomness of cross-validation for each penalty. The result with the random error evaluates the stability with respect to random perturbation of the data. To ensure a fair comparison, we choose the same level of concavity at 0 for SCAD, MCP and the sigmoid penalty. The box-plots show that LASSO is the most stable, while the sigmoid penalty outperforms SCAD and MCP, in terms of both the median and the 75% quantile in the case without error, and the 75% quantile in the case with the error added.

Figure 4: Box-plots of the mean standard deviations in Section 5.3. The left panel shows the box-plots without perturbation and the right panel those with perturbation.

5.4 Poisson and probit

We simulate from a Poisson regression model with correlated covariates, in a setup analogous to that of Section 5.1. For the probit regression, we simulate in the same way with modified coefficient and sample-size settings.

The results for the Poisson and probit regression models are summarized in Tables 3 and 4, respectively. For the Poisson regression, we also report the MRME (the median of the ratios of the model error of a selected model to that of the ordinary maximum likelihood estimate under the full model). We observe similar selection performance for the Poisson penalty, SCAD and MCP; however, the Poisson penalty leads to the smallest MRME. For the probit model, all three nonconvex penalties lead to the same results for all criteria considered.

Penalties TP FP cf of uf L1 L2 MRME
LASSO 2.98 (0.01) 1.40 (0.13) 0.26 0.72 0.02 0.63 0.34 0.49
Poisson 2.80 (0.04) 0.41 (0.08) 0.65 0.15 0.20 0.49 0.30 0.19
SCAD 2.78 (0.04) 0.35 (0.07) 0.54 0.24 0.22 0.47 0.30 0.20
MCP 2.81 (0.04) 0.38 (0.08) 0.66 0.15 0.19 0.48 0.29 0.22
Table 3: Simulation results for Poisson regression under different penalties.
Penalties TP FP cf of uf L1 L2 MRME
LASSO 3.00 (0) 2.13 (0.18) 0.21 0.79 0 2.47 1.35 0.71
probit 2.99 (0.01) 0.20 (0.05) 0.82 0.17 0.01 1.37 0.85 0.27
SCAD 2.99 (0.01) 0.20 (0.05) 0.82 0.17 0.01 1.37 0.85 0.27
MCP 2.99 (0.01) 0.20 (0.05) 0.82 0.17 0.01 1.37 0.85 0.27
Table 4: Simulation results for probit regression under different penalties.

5.5 Examples

We apply the proposed LAMP to two gene expression datasets: the lung cancer data (Gordon et al., 2002) and the prostate cancer data (Singh et al., 2002). The two datasets are downloadable at http://www.chestsurg.org and http://www.broad.mit.edu. The response variable in each dataset is binary.

We aim to use the lung cancer data to classify malignant pleural mesothelioma (MPM) versus adenocarcinoma (ADCA) of the lung. The data consist of 181 tissue samples, 32 of which are for training, with the remaining 149 for testing. Each sample is described by 12533 genes.

After standardizing the predictors to mean zero and variance one, we apply LASSO, SCAD, MCP, and the sigmoid penalty using standard implementations. For each method, 10-fold cross-validation is used to select the best penalty level, and the cross-validation error is computed over 10 repetitions with different random splits. For SCAD and MCP, we evaluate the performance over a range of concavity parameters, and similarly for the sigmoid penalty. The results are summarized in Table 5.

Penalties Training error Test error # of selected genes
LASSO   —   0/32  7/149    15
SCAD (36)   0/32  6/149    13
MCP (36)   0/32  7/149     3
sigmoid (.022–.03)   0/32  7/149     3
sigmoid (.021)   0/32  6/149     5
Table 5: Classifications errors and selected model sizes under different penalties for lung cancer data.

The result for the sigmoid penalty with the larger concavity values is quite similar to that of MCP. With the smaller value, the sigmoid penalty has 6 test errors with only 5 genes selected, which compares favorably with SCAD.

For the prostate cancer data, the goal is to classify prostate tumor samples from normal samples. There are 102 patient samples for training and 34 patient samples for testing, with 12600 genes in total. The results are reported in Table 6. Here the test errors are similar across methods, although the sigmoid penalty leads to the sparsest solution.

Penalties Training error Test error # of selected genes
LASSO   —   0/102  2/34     30
SCAD (20–25)   0/102  2/34     26
MCP (35–50)   0/102  2/34     24
sigmoid (.001)   0/102  1/34     26
sigmoid (.002)   0/102  2/34     23
Table 6: Classifications errors and selected model sizes under different penalties for prostate cancer data.

6 Discussion

Penalty based regularization methods have received much attention in recent years. This paper proposes a family of penalty functions (LAMP) that is adaptive and specific to the shapes of the log-likelihood functions. The proposed LAMP family is different from the well-known LASSO, SCAD and MCP. It can be argued that the new approach provides a good balance between sparsity and stability, two important aspects in model selection. It is shown that the resulting penalized estimation achieves model selection consistency and strong asymptotic stability, in addition to the usual consistency and asymptotic normality.

An important issue is how to choose the three parameters embedded in a LAMP. The location parameter can be chosen in an obvious way for the standard generalized linear models, while the penalty level can be chosen through standard cross-validation or information criteria. For the concavity parameter, cross-validation is computationally intensive, and it is desirable to develop more effective ways to select it. It is also important to study the stability of the solution path.

The LAMP approach can be modified to handle grouped variable selection. It will also be of interest to develop parallel results for semiparametric regression models.

Appendix A General results

We first state a general result about estimation consistency, model selection consistency and asymptotic normality.

Consider the penalized log-likelihood function defined by (4). Recall that the true parameter value is partitioned into its nonzero and zero parts, and we adopt simplified notation for these parts in what follows.

Let the observations be i.i.d. with a density that satisfies the following regularity conditions (see Fan and Li, 2001):

  (C10)-(C12) The density satisfies the standard regularity conditions of Fan and Li (2001): the usual identifiability and Bartlett identities hold, the Fisher information matrix is finite and positive definite at the true value, and the third derivatives of the log-density are dominated in a neighborhood of the true value by functions with finite expectation.

Note that for the generalized linear models, (C10)-(C12) are satisfied under mild assumptions on covariates.

Let the Fisher information matrix at the true value be partitioned according to the intercept, the nonzero part and the zero part of the parameter vector, and define the corresponding submatrices and matrix norm in the usual way.

Lemma 4.

Suppose that both the criterion function and its expectation are strictly convex, satisfying (C10)-(C12), and that the penalty derivatives are continuous at 0. Assume there exist constants such that