Variable Selection Using a Smooth Information Criterion for Multi-Parameter Regression Models

10/06/2021
by   Meadhbh O'Neill, et al.

Modern variable selection procedures make use of penalization methods to execute simultaneous model selection and estimation. A popular method is the LASSO (least absolute shrinkage and selection operator), which contains a tuning parameter. This parameter is typically tuned by minimizing the cross-validation error or Bayesian information criterion (BIC), but this can be computationally intensive as it involves fitting an array of different models and selecting the best one. We have developed a procedure based on the so-called "smooth IC" (SIC) in which the tuning parameter is automatically selected in one step. We also extend this model selection procedure to the so-called "multi-parameter regression" framework, which is more flexible than classical regression modelling. Multi-parameter regression introduces flexibility by taking account of the effect of covariates through multiple distributional parameters simultaneously, e.g., mean and variance. These models are useful in the context of normal linear regression when the process under study exhibits heteroscedastic behaviour. Reformulating the multi-parameter regression estimation problem in terms of penalized likelihood enables us to take advantage of the close relationship between model selection criteria and penalization. Utilizing the SIC is computationally advantageous, as it obviates the issue of having to choose multiple tuning parameters.



1 Introduction

Enhancements in data collection technologies have highlighted the importance of modern variable selection techniques. Traditional methods, such as best subset selection, are suboptimal and computationally expensive when the number of variables is high (Fan and Lv, 2010). Modern approaches make use of penalization methods to execute simultaneous model selection and estimation. A popular method is the LASSO (least absolute shrinkage and selection operator) (Tibshirani, 1996), which involves an $\ell_1$ penalty but leads to biased estimates. In contrast, the adaptive LASSO (Zou, 2006) has adaptive weights, which reduce the bias present in the LASSO estimates. These methods have been developed primarily in the context of normal linear regression and have been extended to generalized linear models (GLMs) (Tibshirani, 1996; Friedman et al., 2010) and Cox's proportional hazards models for survival data (Tibshirani, 1997). In these classical models, covariates enter through the location parameter (or the hazard scale in the Cox model case). One of the goals of this paper is to expand these methods for application in a more flexible setting with multiple distributional parameters.

Originally, methods such as quadratic programming were used to solve these non-differentiable LASSO-type problems, but Efron et al. (2004) and Friedman et al. (2007), respectively, proposed the least angle regression (LARS) and co-ordinate descent algorithms, with the latter proving to be particularly fast for problems of this type. These are somewhat "non-standard" estimation procedures in the context of classical statistical estimation, where non-differentiable objective functions are relatively less common. Alternatively, perturbing the penalty function slightly to render it differentiable (Hunter and Li, 2005; Lloyd-Jones et al., 2018) enables standard optimization methods to be used. Oelker and Tutz (2017) outline a series of smooth approximations of different penalties, which allows the penalized objectives to be treated as smooth functions. These differentiable penalties can be easily implemented and solved using standard gradient based optimization procedures, e.g., Newton-Raphson. The tuning parameter that controls the strength of the penalty is typically obtained by minimizing the cross-validation error or an information criterion (IC), such as the Akaike IC (AIC) (Akaike, 1974) or Bayesian IC (BIC) (Schwarz, 1978). This is a two-step estimation process, which tends to be computationally intensive as it involves fitting an array of different models and selecting the best one.

Su (2015) and Su et al. (2018) present an estimation procedure that is not based on the $\ell_1$ norm, titled "MIC" (minimum approximated information criterion). They exploit the close connection between model selection criteria and penalization (Fan and Lv, 2010) and introduce an approximated information criterion in order to avoid the classic two-step estimation process. At its core, the MIC utilizes an approximation of the $\ell_0$ norm with a continuous unit dent function. The $\ell_0$ norm is discrete in nature, and it is preferable to have a penalty function with a level of smoothness for optimization purposes. Su (2015) describes a "subtle uprooting" method for variable selection, which involves using a smooth surrogate function for approximating cardinality. This is followed by a second technical step for enhancing sparsity, where the final problem becomes non-differentiable. This approach is extended to GLMs in Su et al. (2018). Fixing the tuning parameter at 2 for the AIC or $\log(n)$ for the BIC is computationally advantageous, as it avoids the tuning parameter selection problem. It is not required to compute the whole regularization path of solutions, nor is it necessary to choose the best tuning parameter using cross-validation, as is typically done.

We propose a more straightforward method of approximating the IC function using a smooth approximation of the $\ell_0$ norm, which can be optimized directly. Instead of performing the reparameterization step as outlined in Su (2015), which renders the problem non-differentiable, we achieve sparsity in a different way. Our approach squeezes the coefficient values to zero by optimizing a sequence of objective functions that get successively closer to the non-differentiable one. Consequently, our proposed "smooth IC" (SIC) function can be optimized directly using standard gradient based optimization techniques. Additionally, we extend this new SIC variable selection procedure for use in the developing area of "multi-parameter regression" (MPR). Our proposed methods are implemented in our publicly available R package "smoothic" (O'Neill and Burke, 2021). To date, penalized estimation has been primarily applied in the context of classical regression models, where the covariates are allowed to enter the model through a single parameter (e.g., a location parameter). Other distributional parameters, such as a dispersion parameter, are typically constant. This "single parameter regression" (SPR) does not take into account the possible impact of covariates on the other distributional parameters. MPR is a more flexible approach where multiple parameters are regressed simultaneously on covariates (this is sometimes referred to as "distributional regression" (Stasinopoulos et al., 2018)). For example, covariates can enter the model through the location and dispersion parameters, or the scale and shape parameters of the hazard function in the survival context (see Burke and MacKenzie (2017), Burke et al. (2020) and references therein), or indeed various different distributional parameters as in generalized additive models for location, scale and shape (GAMLSS) (Rigby and Stasinopoulos, 2005). Mayr et al. (2012) address the problem of variable selection by utilizing classical gradient boosting techniques to fit GAMLSS models. More recently, Groll et al. (2019) suggest implementing a LASSO-type penalization in the GAMLSS framework. This regularization approach to GAMLSS is highly flexible, but it has the added complexity of separate tuning parameters for each regression component. Groll et al. state that carrying out the computationally demanding grid search for the optimal tuning parameters is a drawback of their method. In our proposed multi-parameter regression with smooth IC (MPR-SIC) procedure, this issue is circumvented as the values of both tuning parameters are known in advance.

The model formulation, including the introduction of the "smooth $\ell_0$ norm", the estimation procedure and the optimization algorithm, is outlined in Section 2. In Section 3, the performance of our proposed methods is evaluated in terms of both variable selection and parameter estimation through extensive simulation studies. We consider three real data analyses to demonstrate our proposed methods in Section 4. Finally, we close with some concluding remarks in Section 5.

2 Model Formulation

2.1 Preliminaries

The classic normal linear regression is a single-parameter problem that assumes constant error variance. The model is

$$y_i = x_i^T\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad (2.1)$$

for $i = 1, \ldots, n$, where $y_i$ is the response value and $x_i = (1, x_{i1}, \ldots, x_{ip})^T$ is the vector of covariates for the $i$th individual over the $p$ predictor variables. Here, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ is the vector of regression coefficients for the location parameter, and $\sigma_i^2 = \sigma^2$ under the homogeneity assumption. For the multi-parameter regression (MPR) approach, the single parameter model in (2.1) is extended to include heterogeneity of error variance:

$$y_i = x_i^T\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_i^2), \qquad \log\sigma_i^2 = x_i^T\alpha, \qquad (2.2)$$

where the log-linear form ensures that $\sigma_i^2$ remains positive. The vector of regression coefficients for the dispersion parameter is $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)^T$. There may be different (possibly overlapping) sets of covariates impacting the location and dispersion, and although we use $x_i$ for both, a given $\beta_j$ or $\alpha_j$ coefficient may be set to zero, which removes the covariate from that model component. Because we apply penalized variable selection, the regression coefficients need to be on a similar scale, and therefore we assume that the predictors are scaled to have unit variance. The log-likelihood function for the MPR normal model is

$$\ell(\theta) = -\frac{1}{2}\sum_{i=1}^{n}\left[\log(2\pi) + x_i^T\alpha + (y_i - x_i^T\beta)^2 e^{-x_i^T\alpha}\right], \qquad (2.3)$$

where $\theta = (\beta^T, \alpha^T)^T$. Model selection criteria, such as the AIC and BIC, have a penalized functional form similar to regularization, and in the multi-parameter framework they can be formulated as

$$-2\ell(\theta) + \lambda\left(\|\beta_{-0}\|_0 + \|\alpha_{-0}\|_0 + 2\right), \qquad (2.4)$$

where $\lambda$ is a tuning parameter, and $\beta_{-0} = (\beta_1, \ldots, \beta_p)^T$ and $\alpha_{-0} = (\alpha_1, \ldots, \alpha_p)^T$, i.e., the coefficient vectors with the intercepts omitted; there is an addition of two in the penalty to take into account the estimation of the intercept terms $\beta_0$ and $\alpha_0$. Fixing the tuning parameter as $\lambda = 2$ or $\lambda = \log(n)$ results in the AIC and BIC respectively. The $\ell_0$ norm, $\|\beta_{-0}\|_0 = \sum_{j=1}^{p}\mathbb{1}(\beta_j \neq 0)$, indicates the cardinality, or the number of non-zero elements, of $\beta_{-0}$. This is not truly a norm since $\|c\,\beta_{-0}\|_0 \neq |c|\,\|\beta_{-0}\|_0$ when $|c| \notin \{0, 1\}$. The AIC is reported to be asymptotically "selection inconsistent" and "loss-efficient" as a variable selection criterion (Shao, 1997; Yang, 2005; Wang et al., 2009). As a result of its consistency property and superior empirical performance in variable selection, we employ a BIC-type criterion (Wang and Leng, 2007) where $\lambda = \log(n)$. Using the multi-parameter normal likelihood in (2.3) and arranging (2.4) as a penalized likelihood results in

$$\ell_{\text{IC}}(\theta) = \ell(\theta) - \frac{\log(n)}{2}\left(\|\beta_{-0}\|_0 + \|\alpha_{-0}\|_0 + 2\right). \qquad (2.5)$$
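To make the criterion concrete, here is a minimal Python sketch (not the authors' R package "smoothic") that evaluates the MPR normal log-likelihood of (2.3) and a BIC-type criterion in the spirit of (2.4); the toy data and coefficient values are hypothetical.

```python
import math

def mpr_loglik(y, X, beta, alpha):
    """Log-likelihood (2.3) of the normal MPR model:
    observation i has mean x_i'beta and log-variance x_i'alpha."""
    ll = 0.0
    for yi, xi in zip(y, X):
        mu = sum(b * v for b, v in zip(beta, xi))
        logvar = sum(a * v for a, v in zip(alpha, xi))
        ll += -0.5 * (math.log(2 * math.pi) + logvar
                      + (yi - mu) ** 2 * math.exp(-logvar))
    return ll

def ic(y, X, beta, alpha, lam):
    """Criterion (2.4): -2*loglik + lam*(#non-zero non-intercept coefs + 2)."""
    k = sum(b != 0 for b in beta[1:]) + sum(a != 0 for a in alpha[1:])
    return -2.0 * mpr_loglik(y, X, beta, alpha) + lam * (k + 2)

# toy data: two observations, unit variance (alpha = [0, 0]), perfect fit
X = [[1.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0]
beta, alpha = [0.0, 1.0], [0.0, 0.0]
n = len(y)
bic = ic(y, X, beta, alpha, lam=math.log(n))  # BIC-type: lambda = log(n)
```

With zero residuals and unit variances the log-likelihood reduces to $-\log(2\pi)$, so the criterion is simply the penalty plus the constant term.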

2.2 Smooth $\ell_0$ Norm

Due to the non-differentiability of the $\ell_0$ norm, we propose a smooth function to approximate it:

$$\phi_\epsilon(\beta_j) = \frac{\beta_j^2}{\beta_j^2 + \epsilon^2}. \qquad (2.6)$$

This is differentiable for $\epsilon > 0$, and $\phi_\epsilon(\beta_j) \to \mathbb{1}(\beta_j \neq 0)$ as $\epsilon \to 0$. Figure 1 demonstrates how $\phi_\epsilon$ gets closer to the $\ell_0$ indicator as $\epsilon$ decreases. The smallest value of $\epsilon$ shown approximates the $\ell_0$ norm very closely, but it is also near the discontinuity at $\epsilon = 0$, making it unstable. In practice, we have found that fixing the smoothing parameter at a small value generally works well, but telescoping through a decreasing sequence of $\epsilon$ values (outlined in Section 2.4) is recommended for optimal performance. Note that the first and second derivatives (for $\epsilon > 0$) have a simple analytic form and therefore can be used within the gradient based optimization procedure of Section 2.3:

$$\phi_\epsilon'(\beta_j) = \frac{2\beta_j\epsilon^2}{(\beta_j^2 + \epsilon^2)^2}, \qquad \phi_\epsilon''(\beta_j) = \frac{2\epsilon^2(\epsilon^2 - 3\beta_j^2)}{(\beta_j^2 + \epsilon^2)^3}. \qquad (2.7)$$

Figure 1: Smooth $\ell_0$ norm $\phi_\epsilon$ for decreasing values of $\epsilon$.
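A small Python sketch of this idea, assuming the quadratic-ratio form of the smooth approximation as reconstructed in (2.6); the test points are arbitrary. It checks that the function approaches the 0/1 indicator as $\epsilon$ shrinks, and that the analytic derivative matches a finite-difference estimate.

```python
def phi(x, eps):
    """Smooth l0 approximation: phi(x) -> 1(x != 0) as eps -> 0."""
    return x * x / (x * x + eps * eps)

def dphi(x, eps):
    """Analytic first derivative of phi, as in (2.7)."""
    return 2 * x * eps ** 2 / (x * x + eps * eps) ** 2

# phi approaches the 0/1 indicator as eps decreases
vals = [phi(0.5, eps) for eps in (1.0, 1e-2, 1e-5)]

# central-difference check of the analytic derivative at an arbitrary point
h, x0, eps0 = 1e-7, 0.3, 0.1
num = (phi(x0 + h, eps0) - phi(x0 - h, eps0)) / (2 * h)
```

At $\epsilon = 1$ the approximation is far from the indicator ($\phi_1(0.5) = 0.2$), while at $\epsilon = 10^{-5}$ it is essentially 1 for any non-zero argument and exactly 0 at zero.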

Defining $\|\beta_{-0}\|_{0,\epsilon} = \sum_{j=1}^{p}\phi_\epsilon(\beta_j)$ as the "smooth $\ell_0$ norm", and substituting the $\ell_0$ norm in (2.5) with $\|\beta_{-0}\|_{0,\epsilon}$ and $\|\alpha_{-0}\|_{0,\epsilon}$, results in our proposed approach of MPR with smooth IC (MPR-SIC):

$$\ell_{\text{SIC}}(\theta) = \ell(\theta) - \frac{\log(n)}{2}\left(\|\beta_{-0}\|_{0,\epsilon} + \|\alpha_{-0}\|_{0,\epsilon} + 2\right). \qquad (2.8)$$

2.3 Estimation Procedure

We define the penalized estimates as

$$\hat\theta = \operatorname*{arg\,max}_{\theta}\, \ell_{\text{SIC}}(\theta),$$

where $\ell_{\text{SIC}}(\theta)$ is given by (2.8). The first derivatives with respect to the parameters are

$$\frac{\partial \ell_{\text{SIC}}}{\partial \beta} = X^T z_1 - \frac{\log(n)}{2}\dot\phi_\beta, \qquad \frac{\partial \ell_{\text{SIC}}}{\partial \alpha} = X^T z_2 - \frac{\log(n)}{2}\dot\phi_\alpha, \qquad (2.9)$$

where $X$ is an $n \times (p+1)$ matrix whose $i$th row is $x_i^T$; $z_1$ and $z_2$ are vectors of length $n$ such that $z_{1i} = (y_i - x_i^T\beta)\,e^{-x_i^T\alpha}$ and $z_{2i} = \frac{1}{2}\left[(y_i - x_i^T\beta)^2 e^{-x_i^T\alpha} - 1\right]$; and $\dot\phi_\beta$ and $\dot\phi_\alpha$ are vectors whose $j$th elements are given by $\phi_\epsilon'(\beta_j)$ and $\phi_\epsilon'(\alpha_j)$ respectively, except that their first elements are zero due to the fact that the intercepts are not penalized.
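As a sanity check on the score components described above, the short Python sketch below (toy data with one covariate; all numbers illustrative) compares the analytic $\beta$-score $X^T z_1$ against a central-difference derivative of the normal MPR log-likelihood.

```python
import math

def loglik(y, X, beta, alpha):
    """Normal MPR log-likelihood, as in (2.3)."""
    ll = 0.0
    for yi, xi in zip(y, X):
        mu = sum(b * v for b, v in zip(beta, xi))
        lv = sum(a * v for a, v in zip(alpha, xi))
        ll += -0.5 * (math.log(2 * math.pi) + lv
                      + (yi - mu) ** 2 * math.exp(-lv))
    return ll

y = [0.7, -1.2, 2.1]
X = [[1.0, 0.4], [1.0, -0.9], [1.0, 1.5]]
beta, alpha = [0.2, 0.5], [0.1, -0.3]

# analytic score for beta: X^T z1 with z1_i = (y_i - x_i'beta) * exp(-x_i'alpha)
score = [0.0, 0.0]
for yi, xi in zip(y, X):
    mu = sum(b * v for b, v in zip(beta, xi))
    lv = sum(a * v for a, v in zip(alpha, xi))
    z1 = (yi - mu) * math.exp(-lv)
    for j in range(2):
        score[j] += xi[j] * z1

# central-difference check on the slope coefficient beta_1
h = 1e-6
bp = [beta[0], beta[1] + h]
bm = [beta[0], beta[1] - h]
num = (loglik(y, X, bp, alpha) - loglik(y, X, bm, alpha)) / (2 * h)
```

The penalized score of (2.9) would simply subtract $\frac{\log(n)}{2}\phi_\epsilon'(\beta_j)$ from each non-intercept component.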

The matrix of negative second derivatives of $\ell_{\text{SIC}}$, i.e., $-\partial^2\ell_{\text{SIC}}/\partial\theta\,\partial\theta^T$, is given by

$$I_p(\theta) = I(\theta) + \frac{\log(n)}{2}\begin{pmatrix} \Sigma_\beta & 0 \\ 0 & \Sigma_\alpha \end{pmatrix},$$

where $I(\theta)$ is the observed information matrix of the unpenalized likelihood, with blocks $X^T W_1 X$, $X^T W_2 X$ and $X^T W_3 X$; $\Sigma_\beta$ and $\Sigma_\alpha$ are diagonal matrices that appear due to the penalties and whose $j$th diagonal elements are given by $\phi_\epsilon''(\beta_j)$ and $\phi_\epsilon''(\alpha_j)$ respectively, except that their first diagonal elements are zero due to the fact that the intercepts are not penalized; $W_1$, $W_2$ and $W_3$ are diagonal weight matrices whose $i$th diagonal elements are given by $e^{-x_i^T\alpha}$, $(y_i - x_i^T\beta)\,e^{-x_i^T\alpha}$ and $\frac{1}{2}(y_i - x_i^T\beta)^2 e^{-x_i^T\alpha}$ respectively. We employ the "RS" algorithm (Rigby and Stasinopoulos, 2005), which does not use the cross derivatives. This algorithm is motivated by the fact that in many classical models, including location and scale models, the parameters are information orthogonal as discussed in Cox and Reid (1987). However, Stasinopoulos et al. (2007) report that the RS algorithm works well even when the parameters are not information orthogonal.
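The alternating structure of these RS-type updates can be sketched in Python for a single covariate, using the weight matrices described above. For brevity the sketch fits the unpenalized MPR likelihood (the SIC penalty terms are omitted), on simulated data rather than any dataset from the paper; the parameter values are arbitrary.

```python
import math, random

random.seed(1)

# simulate heteroscedastic data: y = b0 + b1*x + e,  Var(e) = exp(a0 + a1*x)
n, b_true, a_true = 2000, (1.0, 2.0), (0.5, 1.0)
x = [random.uniform(0, 1) for _ in range(n)]
y = [b_true[0] + b_true[1] * xi
     + random.gauss(0, math.exp((a_true[0] + a_true[1] * xi) / 2))
     for xi in x]
X = [(1.0, xi) for xi in x]

def solve2(A, g):
    """Solve a 2x2 linear system A s = g."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return ((A[1][1] * g[0] - A[0][1] * g[1]) / det,
            (A[0][0] * g[1] - A[1][0] * g[0]) / det)

# OLS start for beta; alpha starts at (log residual variance, 0)
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n
beta = [b0, b1]
s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
alpha = [math.log(s2), 0.0]

for _ in range(50):
    old = beta + alpha
    # beta step: Hessian X^T W1 X with W1_ii = exp(-eta_i), score X^T z1
    A = [[0.0, 0.0], [0.0, 0.0]]; g = [0.0, 0.0]
    for xi, yi in zip(X, y):
        eta = alpha[0] * xi[0] + alpha[1] * xi[1]
        r = yi - (beta[0] * xi[0] + beta[1] * xi[1])
        w1 = math.exp(-eta)
        for j in range(2):
            g[j] += xi[j] * r * w1
            for k in range(2):
                A[j][k] += xi[j] * xi[k] * w1
    s = solve2(A, g); beta = [beta[0] + s[0], beta[1] + s[1]]
    # alpha step: Hessian X^T W3 X with W3_ii = r^2 exp(-eta)/2, score X^T z2
    A = [[0.0, 0.0], [0.0, 0.0]]; g = [0.0, 0.0]
    for xi, yi in zip(X, y):
        eta = alpha[0] * xi[0] + alpha[1] * xi[1]
        r = yi - (beta[0] * xi[0] + beta[1] * xi[1])
        z2 = 0.5 * (r * r * math.exp(-eta) - 1.0)
        w3 = 0.5 * r * r * math.exp(-eta)
        for j in range(2):
            g[j] += xi[j] * z2
            for k in range(2):
                A[j][k] += xi[j] * xi[k] * w3
    s = solve2(A, g); alpha = [alpha[0] + s[0], alpha[1] + s[1]]
    if max(abs(v - o) for v, o in zip(beta + alpha, old)) < 1e-10:
        break
```

With the penalty included, each block's score and Hessian would simply gain the $-\frac{\log(n)}{2}\dot\phi$ and $+\frac{\log(n)}{2}\Sigma$ terms of (2.9) and $I_p(\theta)$.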

The resulting system of Newton-Raphson equations can be expressed compactly as

$$\theta^{(m+1)} = \theta^{(m)} + \left[I_p\big(\theta^{(m)}\big)\right]^{-1}\frac{\partial \ell_{\text{SIC}}}{\partial \theta}\bigg|_{\theta=\theta^{(m)}}. \qquad (2.10)$$

They are iteratively solved for $m = 0, 1, 2, \ldots$, where the elements super-scripted by $(m)$ depend on $\theta^{(m)}$, but this is excluded for notational convenience. We use the classical ordinary least squares estimates as initial values for the location parameter, i.e., $\beta^{(0)} = (X^T X)^{-1} X^T y$, where $y$ is the response vector. We fix the starting value for the intercept of the dispersion term at $\alpha_0^{(0)} = \log\hat\sigma^2$, where the classical residual variance estimator $\hat\sigma^2 = \sum_{i=1}^{n}(y_i - x_i^T\beta^{(0)})^2/(n - p - 1)$ is used. The remaining elements of the parameter vector are set to zero (Rutemiller and Bowers, 1968; Harvey, 1976), which gives $\alpha^{(0)} = (\log\hat\sigma^2, 0, \ldots, 0)^T$. The standard errors of the estimates can be directly acquired by estimating the covariance of the penalized estimates for the true non-zero parameters using the sandwich formula,

$$\widehat{\operatorname{cov}}(\hat\theta) = \left[I_p(\hat\theta)\right]^{-1} I(\hat\theta)\left[I_p(\hat\theta)\right]^{-1}, \qquad (2.11)$$

which has been shown to be accurate for moderate sample sizes (Fan and Li, 2001, 2002).

2.4 $\epsilon$-telescoping

Although smaller values of $\epsilon$ lead to a better approximation of the $\ell_0$ norm (see Figure 1), and hence IC optimization, we have found that the procedure becomes less numerically stable. On the other hand, larger values of $\epsilon$ lead to a more stable optimization procedure, but one that does not yield coefficients close to zero. Therefore, we propose a method that "telescopes" through a decreasing sequence of $\epsilon$ values and makes use of "warm starts", whereby the solution to the previous optimization problem is used as the initial point for the current nearby problem. The method can obtain final estimates of the true zero coefficients that are extremely close to zero, to the extent that they can be treated as zero for practical purposes; in this paper, we treat estimates below a small threshold as being zero. We have found that using a decreasing sequence of values from $\epsilon_1$ to $\epsilon_T$ performs well, and the performance of the method is not highly impacted by the choice of the sequence. A large enough $\epsilon_1$ must be chosen in order to introduce smoothness and give stable estimates, while the smaller $\epsilon_T$ more closely approximates the $\ell_0$ norm to induce pseudo-sparsity (where we say "pseudo" since the algorithm produces coefficients that can be made arbitrarily close to zero while not being exactly zero). In addition to this, we propose implementing an exponentially decaying sequence of the form $\epsilon_t = \epsilon_1 e^{-\gamma(t-1)}$, where $\epsilon_1$ is the starting value, $\gamma$ is the rate of decay and $t$ is the step number; for a sequence of $T$ steps from $\epsilon_1$ to $\epsilon_T$, the decay parameter is $\gamma = \log(\epsilon_1/\epsilon_T)/(T-1)$. This is advantageous as the optimization begins with large increments from $\epsilon_1$, which provides rapid improvements and estimates that are initially close to the unpenalized values. The smaller increments leading to $\epsilon_T$ allow for smaller refinements, especially with regard to squeezing some coefficients to be close to zero.
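A toy one-dimensional illustration of the telescoping idea, with illustrative values $\epsilon_1 = 10$, $\epsilon_T = 10^{-5}$ and $T = 100$ (assumptions for the sketch, not the paper's settings): a quadratic fit term centred on a weak coefficient plus the smooth-$\ell_0$ penalty. As $\epsilon$ decays, the minimiser of the slice moves from near the unpenalized value to zero.

```python
import math

# exponentially decaying telescope: eps_t = eps1 * exp(-gamma*(t-1)),
# with gamma chosen so that the sequence ends exactly at eps_T
eps1, epsT, T = 10.0, 1e-5, 100  # illustrative values only
gamma = math.log(eps1 / epsT) / (T - 1)
eps_seq = [eps1 * math.exp(-gamma * t) for t in range(T)]

def phi(b, eps):
    return b * b / (b * b + eps * eps)

# 1-D slice of an SIC-type objective for a weak (true zero) coefficient:
# quadratic fit term centred at 0.05 plus the smooth-l0 penalty
lam = math.log(100) / 2  # BIC-type weight for n = 100

def objective(b, eps):
    return 50.0 * (b - 0.05) ** 2 + lam * phi(b, eps)

# track the minimiser of the slice as epsilon telescopes downwards
grid = [i / 10000.0 for i in range(-1000, 1001)]
path = []
for eps in eps_seq:
    bmin = min(grid, key=lambda b: objective(b, eps))
    path.append(bmin)
```

On this slice, the minimiser sits at the near-unpenalized value 0.05 for the largest $\epsilon$ and is pulled to exactly zero on the grid by the final step, mirroring the shrinkage pattern in Table 1.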

Table 1 presents an example of the coefficient estimates for a true zero and a true non-zero coefficient for some simulated data. The shrinkage effect due to decreasing $\epsilon$ values is apparent: the value of the true zero coefficient drastically reduces in magnitude through each step, and the final estimate at $\epsilon_T$ is extremely close to zero. As a result, it can be treated as a zero coefficient without any issues; indeed, it could be shrunk further if desired by reducing $\epsilon_T$. The estimate for the true non-zero coefficient does not vary greatly over the telescoping steps.

Figure 2 provides a visualization of the telescoping effect in terms of the objective function. This is a slice through the objective function, plotted as a function of the coefficient value, with different curves corresponding to the $\epsilon$ values in the telescope sequence. In the case of the true zero coefficient, it is clear that as $\epsilon$ decreases, the curves become narrower and therefore there is less uncertainty around the estimate; it is also evident that the minimum of the curve shifts towards zero as $\epsilon$ decreases. For the true non-zero coefficient, the curves for the different $\epsilon$ values are almost identical, i.e., the telescoping has little impact on the shape of the objective function in this case.

True zero coefficient    True non-zero coefficient
0.0007631062             0.9998
0.0000078397             0.9998
0.0000000815             0.9998
0.0000000008             0.9998

Table 1: Coefficient values of a true zero and a true non-zero coefficient as the method telescopes through decreasing values of $\epsilon$.
Figure 2: Slice through the objective function for different values of $\epsilon$. Dashed vertical lines mark the true value.

2.5 Algorithm

The proposed MPR-SIC variable selection method is summarized in Algorithm 1.

  1. Initialization: Set $\theta^{(0)} = (\beta^{(0)T}, \alpha^{(0)T})^T$, where $\beta^{(0)}$ and $\alpha^{(0)}$ are the initial values for the location and dispersion parameters respectively (see Section 2.3). For BIC-based variable selection, set $\lambda = \log(n)$ as suggested in Section 2.1.

  2. Telescoping: Go through the exponentially decaying sequence of telescope values of length $T$ from $\epsilon_1$ to $\epsilon_T$, where $\epsilon_t = \epsilon_1 e^{-\gamma(t-1)}$ for step $t$ and $\gamma$ is the rate of decay (see Section 2.4).

     For $t = 1, \ldots, T$:
     Optimization: Maximize $\ell_{\text{SIC}}$ in (2.8) by iteratively re-solving the system of equations in (2.10) with initial values $\theta^{(0)}$, to obtain $\hat\theta_{\epsilon_t}$. Convergence is achieved when $\|\theta^{(m+1)} - \theta^{(m)}\|$ falls below some small tolerance. For warm starts, set $\theta^{(0)} = \hat\theta_{\epsilon_t}$ so that the obtained estimates are used as initial values for the next step in the telescope.

  3. Output: At $\epsilon_T$, the final estimates $\hat\theta = \hat\theta_{\epsilon_T}$ are obtained, and any estimates that are very close to zero (below a small threshold) can be treated as being zero. The corresponding standard errors are computed by evaluating (2.11) at $\hat\theta$. Note that because we are applying penalized variable selection, the predictors are scaled to have unit variance; however, the final estimates are converted back to their original scale.

Algorithm 1 Implementation of the MPR-SIC $\epsilon$-telescope Method

3 Simulation Studies

3.1 Setup

We have undertaken a simulation study to investigate the performance of our proposed MPR-SIC method. We simulate data from the normal MPR model of Section 2 with a matrix of 12 covariates, some of which are Bernoulli(0.5) and the remainder $N(0, 1)$; the regression coefficients are provided in Table 2. Three different sample sizes ($n$ = 100, 500 and 1000) are considered. Results of a simulation study carried out in a single parameter setting (not discussed here) are given in Appendix A.

     Int.  x₁    x₂   x₃    x₄   x₅    x₆   x₇    x₈   x₉   x₁₀  x₁₁  x₁₂
β    0     1     0.5  0.5   1    0.5   1    0     0    0    0    0    0
α    0     0.5   1    0.5   1    0     0    0.5   1    0    0    0    0

  • Binary covariates indicated in bold.

Table 2: True parameter values.

The true values of 0, 0.5 and 1 correspond to having no effect, a weak effect and a strong effect respectively. There are several combinations where covariates are entering either through both distributional parameters, a single distributional parameter, or not at all. There are four pure noise covariates in that both the and coefficients are zero.

For each scenario, we perform our proposed methods, MPR-SIC and SPR-SIC (i.e., the SIC implemented in the single-parameter regression framework). To compare performance, we apply the ALASSO method from the "glmnet" package, which pertains to a linear regression model, and select the value of the tuning parameter by minimizing the BIC (ALASSO-IC). Note that only cross-validation minimization is available in the glmnet package, and, therefore, we compute the BIC ourselves by evaluating the normal likelihood at the ALASSO estimates, $\hat\beta$, with the variance estimator

$$\hat\sigma^2 = \frac{1}{n - \hat{s}}\sum_{i=1}^{n}\left(y_i - x_i^T\hat\beta\right)^2, \qquad (3.12)$$

as suggested in Reid et al. (2016), where $\hat{s}$ is the number of non-zero elements in $\hat\beta$.
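A minimal sketch of this degrees-of-freedom-corrected estimator (whether the count $\hat{s}$ includes the intercept is a convention; here it simply counts all non-zero coefficients supplied):

```python
def sparse_sigma2(y, X, beta_hat):
    """Reid et al. (2016)-style estimator after a sparse fit:
    sigma^2_hat = RSS / (n - s_hat), where s_hat is the number
    of non-zero coefficients in beta_hat."""
    n = len(y)
    s_hat = sum(b != 0 for b in beta_hat)
    rss = 0.0
    for yi, xi in zip(y, X):
        fit = sum(b * v for b, v in zip(beta_hat, xi))
        rss += (yi - fit) ** 2
    return rss / (n - s_hat)

# toy check: intercept-only "sparse" fit on four points
y = [1.0, 2.0, 3.0, 4.0]
X = [[1.0], [1.0], [1.0], [1.0]]
sig2 = sparse_sigma2(y, X, [2.5])  # RSS = 5.0, n - s_hat = 3
```

Dividing by $n - \hat{s}$ rather than $n$ corrects for the degrees of freedom consumed by the selected coefficients, analogous to the usual $n - p - 1$ correction in OLS.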

3.2 Simulation Results

Metrics including the average number of true zero coefficients correctly set to zero (C) and the average number of true non-zero coefficients incorrectly set to zero (IC) are investigated to analyse the performance of the method. The probability of choosing the true model (PT) is examined by looking at the proportion of times the true model is selected. The mean squared error (MSE) is computed for each simulation replicate in order to assess in-sample prediction accuracy, and is calculated as $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(x_i^T\hat\beta - x_i^T\beta)^2$ (Tibshirani, 1997). These metrics, averaged over 1000 simulation replicates, are presented in Table 3.

The C values are close to six (the true number of zero coefficients) and improve as the sample size increases. The IC values are zero in all cases apart from in the $\alpha$ parameters for $n = 100$; this is due to the three smaller-valued $\alpha$ coefficients (true value 0.5) sometimes being set to zero incorrectly. This behaviour is also conveyed by the probability of choosing the true model (PT). The PT values are low for $n = 100$, which is due to both $\beta$ and $\alpha$ sometimes having a zero coefficient not set to zero, and, in the case of $\alpha$, sometimes having a non-zero coefficient set to zero. The sample size of $n = 100$ is a challenging scenario as the MPR-SIC method is fitting a total of $2(p + 1) = 26$ parameters for a relatively small sample size. Taking this into account, we suggest that the performance of the method in this setting is reasonable. The PT values are high for $n = 500$ and 1000. Both the SPR-SIC and ALASSO-IC models underperform in the location component compared to the MPR-SIC, even though the location and dispersion parameters are orthogonal in the normal model. Nonetheless, both models do relatively well in the case of $\beta$, with the SPR-SIC model doing better than the ALASSO-IC. Also, their MSE values are higher than in the MPR-SIC approach.

The estimation and inferential performance of our proposed MPR-SIC method is investigated in Table 4. The equivalent results (not discussed here) for the SPR-SIC and ALASSO-IC methods are deferred to Appendix B. The average estimate over simulation replicates is shown along with the true standard error (SE), which is the standard deviation of the estimates over simulation replicates, and the average estimated standard error (SEE) over simulation replicates, where the SEE in a given replicate is computed using (2.11); also shown is the empirical coverage probability (CP) for a nominal 95% confidence interval. We can see that, in all cases, the estimated parameter is close to the true value, albeit there is some bias in the larger $\alpha$ coefficients at $n = 100$. The standard errors for both the $\beta$ and $\alpha$ parameters are underestimated at $n = 100$, leading to CPs below 0.95. However, at $n = 500$ and $n = 1000$ the standard errors are well estimated and the coverage is very close to the desired 0.95 level.

Out-of-sample prediction coverage probabilities (PCPs) for the methods are calculated for a sample 20% the size of the original data and are shown in Table 5. This is calculated as the proportion of times the true response value lies in a nominal 95% prediction interval (PI) in each replicate; the average is then taken over the 1000 replicates. The 95% PIs are calculated as $x_i^T\hat\beta \pm 1.96\,\hat\sigma_i$, where $\hat\sigma_i = \exp(x_i^T\hat\alpha/2)$. Note that for the SPR-SIC and ALASSO-IC, the dispersion parameter is held constant across observations, and so $\hat\alpha$ is a vector of zeros except for its first element, which is the intercept. The methods appear to perform similarly when examining the overall PCP, although the MPR-SIC is somewhat poorer than the others in the $n = 100$ case. However, the observations can be categorized by their variability, and, therefore, we split them into groups with low, medium and high variability, where the thresholds are the tertiles computed numerically from the true underlying distribution of $\sigma_i$. Doing so reveals that none of the methods perform particularly well at $n = 100$ when viewed in terms of these three levels of variability. For $n = 500$ and 1000, the coverage for the MPR-SIC remains at approximately 95%, which is the desired nominal value. In contrast, the PIs of the SPR-SIC and ALASSO-IC methods are too wide for the low and medium variability cases (leading to 100% coverage) and too narrow for the high variability cases (leading to approximately 85% coverage). This is unsurprising since these methods assume a constant $\sigma$, and, therefore, cannot adapt to heterogeneity in the data. This can also be visualized in Figure 3, which shows the coverage for ten variability categories; again we see that the MPR-SIC method lies close to 95%, whereas the other methods are either too high or too low.
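The effect described here can be reproduced in a few lines. The simulation below (hypothetical parameter values, with the mean taken as known) compares 95% prediction intervals built from the true per-observation $\sigma_i$ against intervals built from a single pooled $\sigma$, split by variability group.

```python
import math, random

random.seed(7)

# simulate heteroscedastic responses: sigma_i = exp(x_i' alpha / 2), alpha = (0, 2)
n = 20000
x = [random.uniform(0, 1) for _ in range(n)]
sig = [math.exp((0.0 + 2.0 * xi) / 2) for xi in x]
y = [random.gauss(0.0, s) for s in sig]  # mean taken as known (0)

# constant-variance model: one pooled standard deviation for everyone
pooled = math.sqrt(sum(s * s for s in sig) / n)

def coverage(idx, s_of_i):
    """Proportion of observations inside the 95% PI: |y_i| <= 1.96 * s_i."""
    hits = sum(abs(y[i]) <= 1.96 * s_of_i(i) for i in idx)
    return hits / len(idx)

# split observations into low / high variability halves
order = sorted(range(n), key=lambda i: sig[i])
low, high = order[: n // 2], order[n // 2:]

cov_het_high = coverage(high, lambda i: sig[i])    # adapts to sigma_i
cov_const_high = coverage(high, lambda i: pooled)  # too narrow here
cov_const_low = coverage(low, lambda i: pooled)    # too wide here
```

The interval that adapts to $\sigma_i$ holds its nominal 95% coverage in every group, while the pooled interval over-covers the low-variability observations and under-covers the high-variability ones, which is the pattern seen for the SPR-SIC and ALASSO-IC methods in Table 5.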

              MPR-SIC                     SPR-SIC                     ALASSO-IC
    n      C(6)  IC(0)  PT    MSE     C(6)  IC(0)  PT    MSE     C(6)  IC(0)  PT    MSE
β   100    5.28  0.00   0.53  0.06    5.64  1.46   0.15  0.87    5.36  1.27   0.14  0.82
    500    5.89  0.00   0.90  0.00    5.84  0.15   0.76  0.15    5.66  0.10   0.67  0.16
    1000   5.93  0.00   0.94  0.00    5.90  0.01   0.90  0.06    5.80  0.01   0.83  0.07
α   100    5.50  0.64   0.31  0.59    6.00  6.00   0.00  6.24    6.00  6.00   0.00  6.46
    500    5.91  0.00   0.91  0.04    6.00  6.00   0.00  6.88    6.00  6.00   0.00  6.93
    1000   5.94  0.00   0.94  0.02    6.00  6.00   0.00  6.91    6.00  6.00   0.00  6.93

  • C, average correct zeros; IC, average incorrect zeros; PT, the probability of choosing the true model;
    MSE, the average mean squared error.

Table 3: Simulation results: model selection metrics.
MPR-SIC
            n = 100                    n = 500                    n = 1000
     True  Est   SE    SEE   CP     Est   SE    SEE   CP     Est   SE    SEE   CP
β₁   1.0   1.00  0.07  0.05  0.80   1.00  0.02  0.02  0.95   1.00  0.01  0.01  0.95
β₂   0.5   0.50  0.08  0.05  0.82   0.50  0.02  0.02  0.95   0.50  0.01  0.01  0.95
β₃   0.5   0.50  0.07  0.05  0.80   0.50  0.02  0.02  0.94   0.50  0.01  0.01  0.94
β₄   1.0   1.00  0.09  0.06  0.82   1.00  0.03  0.03  0.92   1.00  0.02  0.02  0.95
β₅   0.5   0.50  0.07  0.04  0.82   0.50  0.02  0.02  0.94   0.50  0.01  0.01  0.93
β₆   1.0   1.00  0.07  0.05  0.80   1.00  0.02  0.02  0.94   1.00  0.01  0.01  0.94
α₁   0.5   0.50  0.31  0.13  0.69   0.51  0.07  0.07  0.95   0.50  0.05  0.05  0.95
α₂   1.0   1.12  0.24  0.17  0.82   1.02  0.07  0.07  0.94   1.01  0.05  0.05  0.94
α₃   0.5   0.49  0.31  0.13  0.67   0.50  0.07  0.07  0.94   0.50  0.04  0.05  0.96
α₄   1.0   1.08  0.21  0.16  0.86   1.01  0.07  0.06  0.94   1.00  0.04  0.05  0.95
α₇   0.5   0.49  0.29  0.13  0.71   0.51  0.07  0.06  0.94   0.50  0.05  0.05  0.94
α₈   1.0   1.12  0.23  0.17  0.81   1.02  0.07  0.07  0.93   1.01  0.04  0.05  0.95

  • SE, standard deviation of estimates over 1000 replications; SEE, average of estimated standard errors
    over 1000 replications; CP, the empirical coverage probability of a nominal 95% confidence interval.

Table 4: Simulation results: estimation and inference metrics.
              MPR-SIC              SPR-SIC              ALASSO-IC
n           100   500   1000     100   500   1000     100   500   1000
Low         0.78  0.93  0.94     1.00  1.00  1.00     1.00  1.00  1.00
Medium      0.89  0.95  0.95     0.98  1.00  1.00     0.98  1.00  1.00
High        0.94  0.95  0.95     0.79  0.84  0.85     0.81  0.85  0.85
Overall     0.86  0.94  0.95     0.93  0.95  0.95     0.93  0.95  0.95

  • Variability categorized as low, medium and high using tertiles of the true distribution of $\sigma_i$.
    Out-of-sample coverage is calculated for a sample 20% the size of the original data.

Table 5: Simulation results: out-of-sample prediction coverage probabilities.
Figure 3: Prediction coverage probabilities (PCPs) of observations split by variability $\sigma_i$. Solid black line indicates the coverage and the red dashed line is a reference line at 0.95.

4 Real Data Analyses

4.1 Overview

We consider three real data analyses to illustrate our proposed MPR-SIC method, which is implemented using the $\epsilon$-telescope (Algorithm 1). For each dataset, the resulting MPR-SIC, SPR-SIC and ALASSO-IC estimates ($\hat\beta$, $\hat\alpha$) and associated standard errors (in brackets) are presented in Table 6 (note that for the SPR-SIC and ALASSO-IC methods, $\hat\alpha$ is a vector of zeros except for its first element, which contains the intercept). Additionally, the change in BIC, $\Delta$BIC, from dropping each variable from the location or dispersion component of the model is computed; this is a measure of the effect of the variable. For the sake of brevity, the standard errors and $\Delta$BIC values are only discussed for the MPR-SIC method as this is the main focus of this paper. The corresponding results for the SPR-SIC and ALASSO-IC methods are given in Appendix C. Table 7 provides the out-of-sample PCPs, overall and split by category of variability, calculated using 10-fold cross-validation.

4.2 Prostate Cancer Data

We examine the prostate cancer data, which come from a study by Stamey et al. (1989) and which appear in Tibshirani (1996) and Zou and Hastie (2005). The correlation between the level of prostate-specific antigen (PSA) and various clinical measures in 97 men who were about to receive a radical prostatectomy is examined. The predictors consist of eight clinical measures: log(cancer volume) (lcavol), log(prostate weight (g)) (lweight), presence of seminal vesicle invasion (SVI) (svi), age of the patient (age), log(amount of benign prostatic hyperplasia) (lbph), log(capsular penetration (cm)) (lcp), Gleason score (gleason) and percentage of Gleason scores four or five (pgg45). The logarithm of PSA (ng/mL) is the response variable. The presence of SVI (svi) is a binary variable (1 = yes, 0 = no) and gleason is a discrete numerical variable with four values. The Gleason score relates to prostate cancer grades, and the pgg45 predictor provides information on the history of the patient: it is the percentage of Gleason scores they received before their final Gleason score in gleason. PSA is a protein that is produced by normal and malignant prostate cells, and is useful as a preoperative marker, as prostate cancer causes PSA to be discharged into the blood.

Table 6 shows that all the methods select the same variables in the location parameter (lcavol, lweight and svi). These selected variables have positive location coefficients (across all methods), indicating that an increase in the log(cancer volume), log(prostate weight) and the presence of SVI are associated with an increase in the response log(PSA), and therefore may be indicative of prostate cancer. The MPR-SIC method also identifies lweight and svi as important variables for the dispersion parameter. In the case of log(prostate weight), although it has a positive coefficient in the location parameter, it has a negative coefficient in the dispersion parameter. Therefore, higher values of log(prostate weight) are associated with higher values of log(PSA) and also lower variability. The svi variable has a positive dispersion coefficient, i.e., the presence of SVI is associated with more variable log(PSA) values. However, removal of svi from the location only increases the BIC by 1.64 units suggesting that svi is less important in terms of the overall level of log(PSA); in contrast, its removal from the dispersion increases the BIC by 4.09 units.

4.3 Sniffer Data

When gasoline is pumped into a tank, hydrocarbon vapours are forced out and emitted into the atmosphere. This is a source of air pollution, and devices that capture the vapour are installed to reduce it. Testing these vapour recovery systems involves a "sniffer" device that measures the amount of vapour recovered; estimating the efficiency of a system also requires a method of estimating the total amount released. A laboratory experiment was carried out to discover factors that affect the amount of hydrocarbon vapour released when gasoline is pumped into a tank. Four factors are varied: vapour pressure (psi) of the dispensed gasoline (gaspres), temperature (°F) of the dispensed gasoline (gastemp), initial tank temperature (°F) (tanktemp) and initial vapour pressure (psi) in the tank (tankpres). The quantity of emitted hydrocarbon (g) is the response variable. There are 125 runs in the data. These data have previously been considered by Weisberg (2013), who noted that the dispersion may depend on the predictors but did not apply a heteroscedastic model, and Bedrick (2000), who used a model with all four predictors in the location along with gastemp and gaspres in the dispersion.

The MPR-SIC and SPR-SIC methods each select a different combination of variables for the location parameter, while the ALASSO-IC does not eliminate any variables (Table 6). In terms of the location, the MPR-SIC method selects tanktemp but not tankpres, whereas the SPR-SIC does the opposite. For all three methods, gaspres and gastemp have positive location coefficients, indicating that an increase in these variables is related to an increase in the quantity of hydrocarbon emitted. All of the coefficients are significant at the 95% level. The MPR-SIC method identifies that gastemp is associated with an increase in the variability of the quantity of hydrocarbon emitted from the tank. The change in BIC from dropping the variable from the dispersion parameter is 17.62 units, which, interestingly, is far greater than the impact of dropping tanktemp from the location component (BIC increase of 5.6 units); this demonstrates the fact that modelling only the location — as is most often done in practice — can miss important features of the process under study.
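In an MPR model of the kind fitted here, the mean and the log-variance each get their own linear predictor. The sketch below illustrates maximum likelihood estimation for one common formulation of such a heteroscedastic normal model (mean X·β, log-variance Z·α) on synthetic data, using a generic quasi-Newton optimiser rather than the authors' own algorithm; the details may differ from the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def mpr_negloglik(theta, X, Z, y):
    # Normal MPR model: mean = X @ beta, log-variance = Z @ alpha
    p = X.shape[1]
    beta, alpha = theta[:p], theta[p:]
    mu = X @ beta
    log_s2 = Z @ alpha
    return 0.5 * np.sum(np.log(2 * np.pi) + log_s2
                        + (y - mu) ** 2 / np.exp(log_s2))

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])    # location design
Z = X                                   # dispersion design (same covariate)
sigma = np.exp(0.5 * (-1.0 + 0.8 * x))  # true log-variance = -1 + 0.8 x
y = 2.0 + 1.5 * x + sigma * rng.normal(size=n)

theta0 = np.zeros(X.shape[1] + Z.shape[1])
fit = minimize(mpr_negloglik, theta0, args=(X, Z, y), method="BFGS")
beta_hat, alpha_hat = fit.x[:2], fit.x[2:]
# estimates should be close to beta = (2.0, 1.5) and alpha = (-1.0, 0.8)
print(np.round(beta_hat, 2), np.round(alpha_hat, 2))
```

A covariate such as gastemp can therefore enter both linear predictors, which is how the method detects that it affects the variability of the emissions as well as their level.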

4.4 Boston House Price Data

Data from a cross-sectional study of 506 communities in the Boston area carried out in 1970 (Harrison Jr and Rubinfeld, 1978) are available in Wooldridge (2015). The association between median house prices in a particular community with various community characteristics is examined. There are eight explanatory variables: average number of rooms per house (rooms), percentage of the population that are “lower status” (lowstat), average student-teacher ratio of schools in the community (stratio), log(property tax per $1000) (lproptax), log(weighted distances to five employment centres in the Boston region) (ldist), crimes committed per capita (crime), log(annual average nitrogen oxide concentration (pphm)) (lnox) and index of accessibility to radial highways (radial). The log(median house price ($)) is the dependent variable. DiCiccio et al. (2019) applied a weighted least squares approach to these data (which accounts for heterogeneity but does not model the dispersion), where they only considered the rooms, stratio, ldist and lnox variables.

Table 6 shows that all eight covariates are included in the location component across the MPR-SIC, SPR-SIC and ALASSO-IC methods. All the coefficients are significant at the 95% level, and the signs of the estimated location coefficients agree across the methods. Only two covariates (rooms and radial) have a positive effect on house prices. Houses with a greater number of rooms and access to radial highways are generally desirable, which results in increased house prices. The remaining variables may be considered undesirable, thus reducing median house prices. In particular, the percentage of the population in the community that are "lower status" and the student-teacher ratio of schools in the community both have a sizable impact on the BIC when they are removed from the location parameter. The MPR-SIC method selects three covariates in the dispersion parameter (lowstat, ldist and radial). Of these, radial is associated with increased house prices (as previously discussed) but also increased variability: it is generally positive to have access to the highway, but there may be exceptions, for example, badly positioned houses with road noise or reduced privacy (features not captured in the data). In contrast, the distance to employment centres (ldist) is associated with reduced house prices and reduced variability, so being farther from work appears to be more consistently negative. We note that dropping ldist from the dispersion leads to a greater BIC increase than dropping three of the eight variables from the location component, again demonstrating the importance of modelling the dispersion.

Prostate Cancer Data
             MPR-SIC (location)     MPR-SIC (dispersion)    SPR-SIC          ALASSO-IC
             Est. (SE)      ΔBIC    Est. (SE)       ΔBIC    Loc.    Disp.    Loc.    Disp.
intercept    -1.26 (0.53)            3.15 (1.36)            -0.78   -0.73    -0.27   -0.68
lcavol        0.47 (0.06)  40.39                             0.53             0.54
lweight       0.82 (0.14)  19.79    -1.17 (0.38)    4.78     0.66             0.52
svi           0.58 (0.22)   1.64     1.07 (0.38)    4.09     0.67             0.53
  • Variables not selected by any method: age, lbph, lcp, gleason, pgg45.

Sniffer Data
             MPR-SIC (location)     MPR-SIC (dispersion)    SPR-SIC          ALASSO-IC
             Est. (SE)      ΔBIC    Est. (SE)       ΔBIC    Loc.    Disp.    Loc.    Disp.
intercept     0.76 (0.85)           -1.35 (0.64)             0.45    2.01     0.21    2.03
gaspres       5.19 (0.51)  68.41                            10.84             9.79
gastemp       0.23 (0.03)  56.85     0.06 (0.01)   17.62     0.15             0.19
tanktemp     -0.09 (0.03)   5.60                                             -0.07
tankpres                                                    -5.73            -4.08

Boston House Price Data
             MPR-SIC (location)     MPR-SIC (dispersion)    SPR-SIC          ALASSO-IC
             Est. (SE)      ΔBIC    Est. (SE)       ΔBIC    Loc.    Disp.    Loc.    Disp.
intercept    11.16 (0.28)           -3.53 (0.30)            13.26   -3.30    13.18   -3.28
rooms         0.24 (0.01) 178.27                             0.10             0.10
lowstat      -0.02 (0.00)  95.03     0.03 (0.01)    5.55    -0.03            -0.03
stratio      -0.03 (0.00)  51.66                            -0.04            -0.04
lproptax     -0.20 (0.03)  39.81                            -0.26            -0.25
ldist        -0.16 (0.02)  38.50    -0.92 (0.16)   26.80    -0.28            -0.27
crime        -0.01 (0.00)  25.87                            -0.01            -0.01
lnox         -0.39 (0.08)  19.18                            -0.62            -0.60
radial        0.01 (0.00)  16.74     0.05 (0.01)   20.54     0.01             0.01

Table 6: Real data analyses results: estimation metrics. Coefficient estimates (standard errors in parentheses) and the increase in BIC (ΔBIC) when the variable is removed; for SPR-SIC and ALASSO-IC, the dispersion column contains only the constant intercept, and blank cells indicate unselected variables.

4.5 Prediction Coverage Probabilities

Table 7 contains the out-of-sample PCPs, overall and split by category of variability, calculated using 10-fold cross-validation for the prostate cancer, sniffer, and Boston house price data. Considering the PCPs from an overall point of view, the MPR-SIC, SPR-SIC and ALASSO-IC methods perform similarly across all three data analyses. As expected from the simulation, coverage improves with sample size, as can be seen by comparing the smaller prostate cancer and sniffer datasets with the larger Boston house price dataset. The overall pattern is that the SPR-SIC and ALASSO-IC methods tend to produce wider PIs for the low and medium variability categories and narrower PIs for the high category, although they achieve reasonable coverage for the low variability cases in the two smaller datasets (prostate cancer and sniffer data). The MPR-SIC approach does the opposite, but is somewhat more balanced: it tends towards good coverage on the larger Boston house price data, whereas the other two methods continue to produce overly wide intervals for the low and medium categories and overly narrow intervals for the high category.
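For a normal model, the PCP is simply the proportion of held-out responses falling inside the interval μ̂ ± 1.96 σ̂. The following sketch (synthetic data with a known mean, and an illustrative `prediction_coverage` helper) shows why a constant-variance model under-covers the high-variability cases:

```python
import numpy as np

def prediction_coverage(y, mu_hat, sigma_hat, z=1.96):
    # z = 1.96 gives a nominal 95% normal prediction interval
    inside = np.abs(np.asarray(y) - np.asarray(mu_hat)) <= z * np.asarray(sigma_hat)
    return inside.mean()

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=2000)
sigma = np.exp(0.8 * x)                 # true sd grows with x
y = 1.0 + x + sigma * rng.normal(size=2000)

mu = 1.0 + x                             # assume the mean is known
const = np.full_like(x, sigma.mean())    # constant-sd ("single parameter") interval
high = x > 0.5                           # the "high variability" cases

print(prediction_coverage(y[high], mu[high], const[high]))  # well below 0.95
print(prediction_coverage(y[high], mu[high], sigma[high]))  # close to 0.95
```

The constant-sd interval is too narrow exactly where the true variability is largest, mirroring the under-coverage of the high category by SPR-SIC and ALASSO-IC in Table 7.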

MPR-SIC SPR-SIC ALASSO-IC
(a) (b) (c) (a) (b) (c) (a) (b) (c)
Low 0.88 0.86 0.91 0.95 0.94 0.99 0.95 0.96 0.99
Medium 0.89 0.90 0.94 0.89 0.99 0.98 0.90 0.99 0.98
High 0.96 0.95 0.92 0.79 0.87 0.80 0.79 0.87 0.80
Overall 0.89 0.90 0.93 0.86 0.92 0.93 0.87 0.93 0.93
  • (a) Prostate cancer data, low: , medium: , high: .
    (b) Sniffer data, low: , medium: , high: .
    (c) Boston house price data, low: , medium: , high: .

Table 7: Real data analyses results: out-of-sample prediction coverage probabilities.

5 Discussion

Our proposed variable selection procedure uses a smooth approximation to the ℓ0 norm to facilitate smooth information criterion optimization, and is extended for application in the developing area of multi-parameter regression. This enables straightforward selection of the more complex covariate effects afforded by modelling multiple distributional parameters simultaneously. Because the objective function is smooth, this is achieved using standard gradient-based optimization procedures, i.e., Newton-Raphson. Moreover, because the objective function is an information criterion, the approach circumvents the need for penalty tuning parameter optimization; in the BIC case, for example, the penalty multiplier is fixed at log(n). Such tuning can be computationally intensive in LASSO-type problems, especially in the context of MPR modelling, due to the additional parameters to be estimated and the fact that, in theory, there may be a separate tuning parameter for each distributional parameter. We provide a publicly available R package, smoothic, implementing our proposed methods (O'Neill and Burke, 2021).
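For intuition, one standard smooth ℓ0 approximation replaces the count of nonzero coefficients with a differentiable surrogate; the exact form used by smoothic may differ, so the sketch below is illustrative only:

```python
import numpy as np

def phi(beta, eps=1e-2):
    # Differentiable surrogate for 1{beta != 0}: approximately 0 at
    # beta = 0 and approximately 1 once |beta| >> eps, so sum(phi(beta))
    # behaves like the number of nonzero coefficients.
    beta = np.asarray(beta, dtype=float)
    return beta ** 2 / (beta ** 2 + eps ** 2)

def smooth_bic(neg_loglik, beta, n, eps=1e-2):
    # Smooth BIC-type criterion: the fixed log(n) multiplier plays the
    # role of the penalty tuning parameter, so nothing needs to be tuned.
    return 2.0 * neg_loglik + np.log(n) * np.sum(phi(beta, eps))

print(phi([0.0, 1e-4, 0.5]))  # ≈ [0, ~0, ~1]
```

Because both the likelihood and the surrogate penalty are differentiable, the whole criterion can be minimised directly with Newton-Raphson, which is the computational advantage emphasised above.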

Through extensive simulation studies, we have demonstrated that the procedure has very favourable performance in terms of variable selection, parameter inference, and out-of-sample prediction; this is true in both single and multi-parameter settings. Results from the real data analyses illustrate the effectiveness of our procedure, and in particular, the advantage of modelling the dispersion is clear from the fact that we have found dispersion effects that are stronger (in BIC terms) than location effects.

The methods proposed in this article are not restricted to the normal distribution. The techniques can be extended to non-normal models, for example, the generalized linear model family. Moreover, we anticipate that the methods will also extend to the setting of robust statistical modelling, which is particularly important given the ever-growing presence of complex datasets (Fan et al., 2014; Avella Medina and Ronchetti, 2015; Maronna et al., 2019). Such an extension would be capable of dealing with heteroscedasticity and outliers, while also carrying out variable selection and parameter estimation simultaneously; this will be a focus of our future work.

Acknowledgements

This work was carried out within the Confirm Smart Manufacturing Research Centre (https://confirm.ie/) funded by Science Foundation Ireland (grant number: 16/RC/3918). We would also like to thank Prof. James Gleeson for his support.

Appendix A Simulation Results: Single Parameter Setting

This appendix contains simulation results for data simulated using the normal SPR model.

Location coefficients: 0 1 0.5 0.5 1 0.5 1 0 0 0 0 0 0
Dispersion coefficients: 0 0 0 0 0 0 0 0 0 0 0 0 0
  • Binary covariates indicated in bold.

Table 8: True parameter values.
MPR-SIC SPR-SIC ALASSO-IC
n: 100 500 1000 100 500 1000 100 500 1000
Overall 0.91 0.94 0.95 0.93 0.94 0.95 0.94 0.95 0.95
  • Out-of-sample coverage is calculated for a sample 20% the size of the original data.

Table 9: Out-of-sample prediction coverage probabilities.
MPR-SIC SPR-SIC ALASSO-IC
n C(6) IC(0) PT MSE C(6) IC(0) PT MSE C(6) IC(0) PT MSE
100 5.64 0.01 0.68 0.10 5.73 0.01 0.75 0.08 5.54 0.00 0.66 0.09
500 5.92 0.00 0.92 0.02 5.92 0.00 0.92 0.02 5.91 0.00 0.93 0.02
1000 5.94 0.00 0.95 0.01 5.95 0.00 0.95 0.01 5.94 0.00 0.95 0.01
n C(12) IC(0) PT MSE C(12) IC(0) PT MSE C(12) IC(0) PT MSE
100 11.31 0.00 0.52 0.18 12.00 0.00 1.00 0.03 12.00 0.00 1.00 0.02
500 11.84 0.00 0.84 0.01 12.00 0.00 1.00 0.00 12.00 0.00 1.00 0.00
1000 11.88 0.00 0.89 0.00 12.00 0.00 1.00 0.00 12.00 0.00 1.00 0.00
  • C, average correct zeros; IC, average incorrect zeros; PT, the probability of choosing the true model;
    MSE, the average mean squared error.

Table 10: Model selection metrics.
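The C, IC, PT and MSE columns can be computed from the replicated estimates as sketched below; this assumes MSE is the squared estimation error averaged over replications (the paper's exact definition may differ), and `selection_metrics` is an illustrative helper:

```python
import numpy as np

def selection_metrics(beta_hats, beta_true, tol=1e-8):
    # beta_hats: (replications, p) array of estimated coefficient vectors
    B = np.atleast_2d(np.asarray(beta_hats, dtype=float))
    beta_true = np.asarray(beta_true, dtype=float)
    true_zero = np.abs(beta_true) < tol
    est_zero = np.abs(B) < tol
    C = est_zero[:, true_zero].sum(axis=1).mean()       # avg. correct zeros
    IC = est_zero[:, ~true_zero].sum(axis=1).mean()     # avg. incorrect zeros
    PT = np.mean((est_zero == true_zero).all(axis=1))   # P(true model chosen)
    MSE = np.mean(np.sum((B - beta_true) ** 2, axis=1))
    return C, IC, PT, MSE

# Two replications against beta_true = (1, 0, 0):
C, IC, PT, MSE = selection_metrics([[0.9, 0.0, 0.0],
                                    [1.1, 0.2, 0.0]],
                                   [1.0, 0.0, 0.0])
print(C, IC, PT)  # 1.5 0.0 0.5
```

Here the first replication recovers the true sparsity pattern exactly while the second keeps one noise variable, so on average 1.5 of the 2 true zeros are found and the true model is chosen half the time.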
MPR-SIC
True  Est. SE SEE CP (n = 100)  Est. SE SEE CP (n = 500)  Est. SE SEE CP (n = 1000)
1.00 1.00 0.11 0.10 0.91 1.00 0.05 0.04 0.95 1.00 0.03 0.03 0.94
0.50 0.50 0.12 0.09 0.91 0.50 0.04 0.04 0.95 0.50 0.03 0.03 0.95
0.50 0.50 0.11 0.10 0.91 0.50 0.04 0.04 0.96 0.50 0.03 0.03 0.94
1.00 1.00 0.11 0.09 0.91 1.00 0.05 0.04 0.94 1.00 0.03 0.03 0.95
0.50 0.49 0.11 0.09 0.90 0.50 0.05 0.04 0.94 0.50 0.03 0.03 0.94
1.00 1.00 0.11 0.09 0.90 1.00 0.05 0.04 0.94 1.00 0.03 0.03 0.95
SPR-SIC
True  Est. SE SEE CP (n = 100)  Est. SE SEE CP (n = 500)  Est. SE SEE CP (n = 1000)
1.00 1.00 0.10 0.10 0.93 1.00 0.05 0.04 0.95 1.00 0.03 0.03 0.94
0.50 0.50 0.11 0.10 0.94 0.50 0.04 0.04 0.95 0.50 0.03 0.03 0.95
0.50 0.50 0.10 0.10 0.94 0.50 0.04 0.04 0.96 0.50 0.03 0.03 0.94
1.00 1.00 0.10 0.10 0.95 1.00 0.05 0.04 0.94 1.00 0.03 0.03 0.95
0.50 0.49 0.10 0.10 0.93 0.50 0.05 0.04 0.94 0.50 0.03 0.03 0.94
1.00 1.00 0.10 0.10 0.93 1.00 0.05 0.04 0.95 1.00 0.03 0.03 0.95
ALASSO-IC
True  Est. SE SEE CP (n = 100)  Est. SE SEE CP (n = 500)  Est. SE SEE CP (n = 1000)
1.00 0.98 0.11 0.10 0.94 1.00 0.05 0.04 0.95 1.00 0.03 0.03 0.94
0.50 0.46 0.12 0.10 0.91 0.49 0.04 0.04 0.94 0.49 0.03 0.03 0.94
0.50 0.46 0.11 0.10 0.90 0.49 0.05 0.04 0.94 0.49 0.03 0.03 0.94
1.00 0.98 0.10 0.10 0.94 0.99 0.05 0.04 0.94 1.00 0.03 0.03 0.95
0.50 0.45 0.11 0.10 0.90 0.49 0.05 0.04 0.93 0.49 0.03 0.03 0.94
1.00 0.98 0.11 0.10 0.93 1.00 0.05 0.04 0.94 1.00 0.03 0.03 0.95
  • SE, standard deviation of estimates over 1000 replications; SEE, average of estimated standard errors
    over 1000 replications; CP, the empirical coverage probability of a nominal 95% confidence interval.

Table 11: Estimation and inference metrics for MPR-SIC, SPR-SIC and ALASSO-IC methods.

Appendix B Simulation Results: Multi-Parameter Setting

Additional simulation results for data simulated using the normal MPR model are presented here.

SPR-SIC
True  Est. SE SEE CP (n = 100)  Est. SE SEE CP (n = 500)  Est. SE SEE CP (n = 1000)
1.00 0.98 0.33 0.21 0.88 1.01 0.12 0.11 0.93 1.00 0.09 0.08 0.92
0.50 0.41 0.40 0.12 0.48 0.49 0.19 0.10 0.82 0.50 0.12 0.08 0.83
0.50 0.37 0.36 0.12 0.51 0.49 0.15 0.10 0.89 0.50 0.09 0.08 0.92
1.00 0.98 0.31 0.21 0.88 1.00 0.11 0.11 0.95 1.00 0.08 0.08 0.94
0.50 0.37 0.36 0.12 0.52 0.49 0.14 0.10 0.93 0.50 0.08 0.08 0.95
1.00 0.98 0.31 0.21 0.89 1.00 0.12 0.11 0.94 0.99 0.08 0.08 0.95
ALASSO-IC
True  Est. SE SEE CP (n = 100)  Est. SE SEE CP (n = 500)  Est. SE SEE CP (n = 1000)
1.00 0.87 0.32 0.22 0.87 0.98 0.13 0.11 0.92 0.98 0.09 0.08 0.90
0.50 0.35 0.34 0.14 0.59 0.44 0.18 0.10 0.78 0.46 0.12 0.08 0.79
0.50 0.32 0.30 0.14 0.61 0.44 0.15 0.11 0.83 0.47 0.10 0.08 0.88
1.00 0.87 0.32 0.22 0.85 0.97 0.12 0.11 0.93 0.98 0.08 0.08 0.93
0.50 0.31 0.29 0.14 0.60 0.44 0.14 0.11 0.88 0.47 0.08 0.08 0.92
1.00 0.87 0.32 0.22 0.86 0.97 0.12 0.11 0.91 0.98 0.08 0.08 0.93
  • SE, standard deviation of estimates over 1000 replications; SEE, average of estimated standard errors
    over 1000 replications; CP, the empirical coverage probability of a nominal 95% confidence interval.

Table 12: Estimation and inference metrics for SPR-SIC and ALASSO-IC methods.

Appendix C Real Data Analyses: Additional Results

This appendix shows additional results for the three real data analyses.

SPR-SIC ALASSO-IC
intercept -0.78 (0.61) -0.73 (0.14) -0.27 (0.63) -0.68 (0.15)
lcavol 0.53 (0.07) 36.71 0.54 (0.08) 36.71
lweight 0.66 (0.17) 9.21 0.52 (0.18) 9.21
svi 0.67 (0.20) 5.64 0.53 (0.21) 5.64
  • Variables not selected by any method: age, lbph, lcp, gleason, pgg45.

Prostate Cancer Data
SPR-SIC ALASSO-IC
intercept 0.45 (1.01) 2.01 (0.13) 0.21 (1.04) 2.03 (0.13)
gaspres 10.84 (1.51) 38.43 9.79 (1.63) 28.45
gastemp 0.15 (0.04) 12.97 0.19 (0.04) 15.37
tanktemp -0.07 (0.05) -2.00
tankpres -5.73 (1.23) 15.33 -4.08 (1.58) 1.71
Sniffer Data
SPR-SIC ALASSO-IC
intercept 13.26 (0.36) -3.30 (0.06) 13.18 (0.36) -3.28 (0.06)
rooms 0.10 (0.02) 29.60 0.10 (0.02) 29.45
lowstat -0.03 (0.00) 193.18 -0.03 (0.00) 193.02
stratio -0.04 (0.00) 55.96 -0.04 (0.00) 55.81
lproptax -0.26 (0.05) 24.78 -0.25 (0.05) 24.62
ldist -0.28 (0.03) 62.36 -0.27 (0.03) 62.20
crime -0.01 (0.00) 76.71 -0.01 (0.00) 76.55
lnox -0.62 (0.09) 34.87 -0.60 (0.10) 34.72
radial 0.01 (0.00) 24.54 0.01 (0.00) 24.38
Boston House Price Data
Table 13: Estimation metrics for SPR-SIC and ALASSO-IC methods.

References

  • Akaike (1974) Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
  • Avella Medina and Ronchetti (2015) Avella Medina, M. and Ronchetti, E. (2015). Robust statistics: a selective overview and new directions. Wiley Interdisciplinary Reviews: Computational Statistics, 7(6):372–393.
  • Bedrick (2000) Bedrick, E. J. (2000). Checking for lack of fit in linear models with parametric variance functions. Technometrics, 42(3):226–236.
  • Burke et al. (2020) Burke, K., Jones, M., and Noufaily, A. (2020). A flexible parametric modelling framework for survival analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(2):429–457.
  • Burke and MacKenzie (2017) Burke, K. and MacKenzie, G. (2017). Multi-parameter regression survival modeling: An alternative to proportional hazards. Biometrics, 73(2):678–686.
  • Cox and Reid (1987) Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society: Series B (Methodological), 49(1):1–18.
  • DiCiccio et al. (2019) DiCiccio, C. J., Romano, J. P., and Wolf, M. (2019). Improving weighted least squares inference. Econometrics and Statistics, 10:96–119.
  • Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.
  • Fan et al. (2014) Fan, J., Fan, Y., and Barut, E. (2014). Adaptive robust variable selection. Annals of Statistics, 42(1):324.
  • Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
  • Fan and Li (2002) Fan, J. and Li, R. (2002). Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics, pages 74–99.
  • Fan and Lv (2010) Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101.
  • Friedman et al. (2007) Friedman, J., Hastie, T., Höfling, H., Tibshirani, R., et al. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.
  • Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.
  • Groll et al. (2019) Groll, A., Hambuckers, J., Kneib, T., and Umlauf, N. (2019). Lasso-type penalization in the framework of generalized additive models for location, scale and shape. Computational Statistics & Data Analysis, 140:59–73.
  • Harrison Jr and Rubinfeld (1978) Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102.
  • Harvey (1976) Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica: Journal of the Econometric Society, pages 461–465.
  • Hunter and Li (2005) Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Annals of Statistics, 33(4):1617.
  • Lloyd-Jones et al. (2018) Lloyd-Jones, L. R., Nguyen, H. D., and McLachlan, G. J. (2018). A globally convergent algorithm for lasso-penalized mixture of linear regression models. Computational Statistics & Data Analysis, 119:19–38.
  • Maronna et al. (2019) Maronna, R. A., Martin, R. D., Yohai, V. J., and Salibián-Barrera, M. (2019). Robust statistics: theory and methods (with R). John Wiley & Sons.
  • Mayr et al. (2012) Mayr, A., Fenske, N., Hofner, B., Kneib, T., and Schmid, M. (2012). Generalized additive models for location, scale and shape for high dimensional data—a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):403–427.
  • Oelker and Tutz (2017) Oelker, M.-R. and Tutz, G. (2017). A uniform framework for the combination of penalties in generalized structured models. Advances in Data Analysis and Classification, 11(1):97–120.
  • O’Neill and Burke (2021) O’Neill, M. and Burke, K. (2021). smoothic: Variable Selection Using a Smooth Information Criterion. R package.
  • Reid et al. (2016) Reid, S., Tibshirani, R., and Friedman, J. (2016). A study of error variance estimation in lasso regression. Statistica Sinica, pages 35–67.
  • Rigby and Stasinopoulos (2005) Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507–554.
  • Rutemiller and Bowers (1968) Rutemiller, H. C. and Bowers, D. A. (1968). Estimation in a heteroscedastic regression model. Journal of the American Statistical Association, 63(322):552–557.
  • Schwarz et al. (1978) Schwarz, G. et al. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
  • Shao (1997) Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, pages 221–242.
  • Stamey et al. (1989) Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of Urology, 141(5):1076–1083.
  • Stasinopoulos et al. (2007) Stasinopoulos, D. M., Rigby, R. A., et al. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7):1–46.
  • Stasinopoulos et al. (2018) Stasinopoulos, M. D., Rigby, R. A., and Bastiani, F. D. (2018). GAMLSS: a distributional regression approach. Statistical Modelling, 18(3-4):248–273.
  • Su (2015) Su, X. (2015). Variable selection via subtle uprooting. Journal of Computational and Graphical Statistics, 24(4):1092–1113.
  • Su et al. (2018) Su, X., Fan, J., Levine, R. A., Nunn, M. E., and Tsai, C.-L. (2018). Sparse estimation of generalized linear models (glm) via approximated information criteria. Statistica Sinica, 28(3):1561–1581.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
  • Tibshirani (1997) Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine, 16(4):385–395.
  • Wang and Leng (2007) Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102(479):1039–1048.
  • Wang et al. (2009) Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671–683.
  • Weisberg (2013) Weisberg, S. (2013). Applied linear regression. John Wiley & Sons.
  • Wooldridge (2015) Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Cengage learning.
  • Yang (2005) Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937–950.
  • Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
  • Zou and Hastie (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.