1 Introduction
Multiparameter regression (MPR) modelling refers to the approach whereby covariates are allowed to enter the model through multiple distributional parameters simultaneously. This is in contrast to the standard approaches where covariates enter through a single parameter (e.g., a location parameter) while holding the remaining parameter(s) (e.g., a scale or shape parameters) constant. Multiparameter regression approaches have been considered in many areas of applied statistics. One of the earliest examples appears in econometrics literature where Park (1966) proposed a loglinear model for the scale parameter in the normal linear model when the assumption of homoscedasticity is violated and describes a two stage process for the estimation of the parameters. Harvey (1976)
develops a maximum likelihood estimation procedure for these so called heteroscedastic regression models.
Stirling (1985) and Aitkin (1987) illustrate the procedure using examples and GLIM (Generalized Linear Interactive Modelling) macros for the implementation of these models. Other examples of multiparameter regression models include an extension of generalized linear models to allow for the joint modelling of the mean and dispersion (Smyth, 1989; Nelder and Lee, 1991; Lee and Nelder, 1998), the generalized additive models for location, scale and shape (GAMLSS) which goes beyond the exponential family to general parameters (not necessarily location and scale) for a variety of distributions (Rigby and Stasinopoulos, 2005). Taylor and Verbyla (2004) model the location and scale parameters of the tdistribution jointly to rectify the dependence of the scale parameter on the location.In this article, we consider MPR modelling in the setting of survival analysis. Examples of the MPR modelling framework include Anderson (1991) who extended the Weibull accelerated failure time model such that both the location and dispersion parameters depend on covariates. A set of models known as the threshold regression models, model survival time through an underlying diffusion process (e.g., Weiner process) whose drift and initial state parameters both depend on covariates (Lee and Whitmore, 2006; Aalen et al., 2008). More recently, Burke and MacKenzie (2017) explored the general use of parameteric hazard MPR models, and demonstrated the favourable interpretation and flexibility afforded by jointly modelling the scale and shape parameters of a Weibull distribution.
The variable selection problem is omnipresent in most, if not all, statistical applications. This problem arises due to the uncertainty associated with the selection of a subset of explanatory variables to model some variable of interest (Thompson, 1978; George, 2000). With the growing availability of data, this problem has received a significant amount of attention in recent years. Traditionally, variable selection has been performed using methods such as backward elimination, forward selection, and stepwise regression; Miller (2002) provides a comprehensive discussion of these methods. Due to their inherent discreteness (i.e., covariates are either “in” or “out”), these methods can be unstable (Breiman, 1996). Furthermore, they are known to be computationally intensive with a power explosion of the number of possible submodels () to be considered for predictors. Modern approaches have been focused on penalized regression whereby parameter estimation and (continuous) model selection are both carried simultaneously. Methods such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), and adaptive lasso (ALASSO) (Zou, 2006) are used to simultaneously select variables and estimate their regression coefficients.
Standard regression approaches only have one regression component, and, therefore, variable selection literature mainly focuses on this, i.e., selection of covariates in a location parameter (cf. Fan and Lv (2010)). Beyond this standard setting, MPR models require variable selection in multiple distributional parameters. However, little work has been done in this area. Wu and Li (2012) considered penalized likelihood for variable selection in the case of an inverse Gaussian joint mean and dispersion model, while Wu et al. (2013) and Wu (2014)
applied this procedure to joint locationscale skewnormal and skewtnormal models. None of the aforementioned papers consider censored data, and, furthermore, these papers only consider the case of a common tuning parameter for the two regression components. Therefore, the main objective of this article is to develop penalized variable selection in MPR models more generally. Using the Weibull MPR model as an example, we investigate the need for a separate tuning parameter for each regression component. We select tuning parameters based on the BIC function, and, because this is multimodal, we propose the use of a differential evolution “global” optimization procedure.
The remainder of the paper is structured as follows. In Section 2, the Weibull multiparameter regression model, the penalty functions and the penalized likelihood estimation procedure are introduced. Section 3 describes the model fitting process and the algorithm used in the selection of the tuning parameter(s). Simulation studies to investigate the characteristics of the algorithm and evaluate its performance in both variable selection and parameter estimation are given in Section 4. The proposed methodologies are then illustrated on a real data example, a lung cancer dataset, in Section 5. A discussion of the proposed methods and some concluding remarks are given in the last section, Section 6.
2 Model Formulation
2.1 Weibull MPR Model
Although the variable selection methods we consider in this article can be applied to any parametric MPR model, it is helpful to focus on a specific example. We therefore consider the Weibull MPR model as the Weibull distribution is one of the most popular parametric survival distributions. The Weibull hazard function for survival time corresponding to the th individual is given by
for , where is the scale parameter and is the shape parameter. The Weibull MPR model is obtained by letting both distributional parameters depend on covariates as follows:
where and
are scale and shape covariate vectors which may or may not have covariates in common,
and are the corresponding regression coefficients, and the log link is used to ensure positivity of the parameters.Parameter estimation within the unpenalized MPR model can be carried out in a standard fashion using maximum likelihood. First, let be the observed survival time for the th individual. Then the associated loglikelihood function given by
(2.1) 
where is the full parameter vector, is the realization of , and is the censoring indicator which takes the value 0 for censored survival times and 1 for uncensored survival times. Beyond the Weibull case we consider here, the likelihood function is where is the cumulative hazard function.
2.2 Penalized Likelihood
Penalized MPR estimation can be developed on the basis of maximising a penalized loglikelihood given by
(2.2) 
where is the unpenalized likelihood, is a vector of coefficientspecific tuning parameters, and and are scale and shape penalty functions which we assume have the same functional form (but differ with respect to the tuning parameter). As is standard practice, we assume that the intercepts are not penalized, and, therefore, define (rather than, for example, assuming the intercepts are zero as other authors do); we also assume that the covariates are standardized. Although we have defined quite generally, we will in fact impose constraints on this vector (beyond fixing ) by considering the following possibilities (for ):

single penalty,

single adaptive penalty,
where and are predefined weights,

separate nonadaptive penalties,

separate adaptive penalties,
Note that the only “adaptive” penalty considered for the purpose of this article is the ALASSO penalty. (i) and (ii) are standard approaches where a single penalty, , applies to the whole vector of parameters. This is reasonable in more standard setting where there is only a vector. However, in this particular MPR setting, we have two separate distributional parameters, which exist on different scales. For this reason we investigate methods (iii) and (iv) which apply different penalties to the two regression vectors via and .
For the purpose of this article, we consider the most commonly used penalties, namely: the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996),
which, although popular, is known to select too many variables (Radchenko and James, 2008); the nonconvex smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001),
where , and the adaptive LASSO (ALASSO) (Zou, 2006),
where, typically, and is a unpenalized estimate of . These so called adaptive weights are used to apply different penalties to different regression coefficients such that a larger amount of shrinkage is applied to the unimportant variables. Note that, here, we use to denote a generic regression coefficient, and is the corresponding tuning parameter. The latter two penalties (i.e., SCAD and ALASSO) are known to possess the oracle property, i.e., the procedure asymptotically identifies the right subset model and estimates the coefficients and covariance matrix as though the true model were known in advance (Fan and Li, 2001).
3 Penalized Estimation Procedure
3.1 Model Fitting
We define
(3.3) 
where is given by (2.2). The corresponding score functions are given by
(3.4) 
where is an matrix whose th row is , is an matrix whose th row is ; and are vectors of length such that and ; and are vectors of lengths and respectively, such that, for , and .
Note however, the presence of the absolute value function renders the penalty functions nondifferentiable at zero. Various algorithms have been developed to overcome this issue including quadratic programming (Tibshirani, 1996), least angle regression (LARS) (Efron et al., 2004), coordinate descent (Friedman et al., 2007) and the local quadratic approximation (Fan and Li, 2001). In this paper, we take a different approach, and use an extension of the absolute value function given by
where . This yields a differentiable penalty so that standard gradientbased optimization algorithms can be applied straightforwardly and transparently. Thus, (which is an approximation of the signum function) and . Smaller values of bring the approximate penalty closer to the original penalty, but also closer to the penalty being nondifferentiable; we have found that fixing generally works well. As we use smooth functions, and in place of , (3.4) is then smooth in the parameters and can therefore be solved using the NetwonRaphson algorithm.
We denote by the matrix of second derivatives of , i.e., . Then,
where is the usual observed information matrix of the unpenalized likelihood; and appear due to the penalties, and are diagonal matrices of dimension and , respectively, such that, for , and ; and , , and are diagonal matrices whose th diagonal elements are given by , , and respectively. Thus, following Ha et al. (2014), the resulting system of NewtonRaphson equations, which are iteratively solved for , can be written compactly as
(3.5) 
where the various elements superscripted by depend on , but this dependence is suppressed for notational convenience; we use unpenalized estimates as the initial values in this iterative procedure, i.e., . Having obtained the penalized estimates, , the covariance can be estimated using the sandwich formula
(3.6) 
(Fan and Li, 2001, 2002; Ha et al., 2014; Park and Ha, 2019). This formula is known to have good accuracy when the sample size is moderate (Fan and Li, 2001, 2002), and its performance in our MPR setting is investigated in Section 4 through simulation studies.
3.2 Tuning Parameter Selection
The selection of “optimal” tuning parameter(s) is typically done through the use of datadriven criteria such as generalized crossvalidation (GCV), Akaike information criterion (AIC) or Bayesian information criterion (BIC). GCV and the AIC are known to be asymptotically “lossefficient” and “selection inconsistent” variable selection criteria (Shao, 1997; Yang, ; Wang et al., 2009). Wang et al. (2007) provide a formal proof that the shrinkage or tuning parameter selected using GCV may not be able to identify the true model consistently for the SCAD estimator in linear models and partially linear models. Instead, they suggest using the BIC and prove its model selection consistency property. A similar conclusion has been reached by Wang and Leng (2007) for the ALASSO. Hence, due to its widely reported superior empirical performance in variable selection, we use a BICtype criterion to determine the values of the tuning parameter(s), where
(3.7) 
is the unpenalized likelihood function defined in (2.1), is the sample size and
is the effective degrees of freedom
(Ha et al., 2007); we define(3.8) 
Note that, as described in Section 2.2, will either be onedimensional (when a common penalty is applied to and ) or twodimensional (when separate penalties are applied).
The simplest method to solve this optimization problem is grid search. While it is straightforward to implement, grid search is known to suffer from the curse of dimensionality, i.e., the number of grid points grows exponentially with the dimension. Furthermore, if the grid is too coarse, the minimum may be overlooked. This is especially true in the case of a multimodal function, such as what we are trying to optimize (see Figure
3). To overcome the issues associated with the grid search algorithm, we consider “global” optimization algorithms. In an empirical comparison of various algorithms for continuous global optimization, Mullen (2014) found “DEoptim” (implemented in R) (Mullen et al., 2011) to be among the best. The function implements a differential evolution algorithm, an example of an evolutionary strategy developed by Storn and Price (1997) (see Mullen et al. (2011) for a detailed overview of the underlying algorithm).3.3 Variable Selection Algorithm
The variable selection algorithm described above is summarized in the following bullet points.

Initialization. Set where is the vector of unpenalized estimates, i.e., those which minimize (defined in (2.1)).

Optimization.

Outer. Minimize with respect to using DEoptim, yielding as defined in (3.8). Convergence occurs when is below a prespecified threshold. (Here, is the best value found at step of the DEoptim algorithm.)

Inner. For a given value of , maximize by iteratively resolving the system of equations given in (3.5) starting from the initial value, ; this yields . Convergence occurs when is below a prespecified threshold. (Here is the infinity norm.)


Output. The estimates
are returned from the above procedure, and the corresponding standard errors are calculated by evaluating (
3.6) at .
4 Simulation Studies
4.1 Setup
The performance of the proposed variable selection methods is evaluated through simulation studies. The failure time is simulated from a Weibull MPR model with
where is a vector of correlated variables generated from an AR(1) process with a correlation coefficient . Each variable is marginally standard normal and the correlation between any two consecutive variables and is given by
. The corresponding censored times were generated from an exponential distribution. This setup was chosen so as to yield realistic survival data, where the true model is sparse and correlation in the covariates exists. Three different sample sizes (
= 100, 500 and 1000) and two censoring proportions () are considered. For each scenario, we considered the LASSO, SCAD, and ALASSO penalties with both a single tuning parameter or two tuning parameters (i.e., one for each of the two regression components). Each simulation was repeated 200 times.4.2 Simulation Results
The variable selection and estimation procedures described in Sections 2 and 3 are applied to the simulated data and the results are summarized and discussed here. A number of metrics are used to evaluate the performance of the variable selection procedures, namely: the average number of true zero coefficients correctly set to zero (C), the average number of true nonzero coefficients incorrectly
set to zero (IC), and the probability of choosing the true model (PT); for the oracle model, C = 7 and IC = 0. As a measure of prediction accuracy, we also consider the mean squared error (MSE), given by
and , where , the simulated sample covariance matrix of the covariates, is computed for each simulation replicate (Zhang and Lu, 2007; Tibshirani, 1997). These metrics, averaged over simulation replicates for the scenarios with 25% censoring, are reported in Table 1. (The results for 50% censoring, shown in the Appendix, are similar.)As the sample size increases, we see an improvement across all the four metrics, for both the shape and the scale parameters and across all penalties. However, it is evident that the LASSO penalty does not set enough covariates equal to zero (i.e., it selects an overly complex model) irrespective of whether there is one tuning parameter or two. SCAD performs better but overselects in the shape component, , when there is only one tuning parameter; this is improved by having two separate tuning parameters. The best performance comes from the ALASSO penalty which, for the largest sample size, selects the true scale and shape covariates more than 90% of the time. Interestingly, the ALASSO performs well even with a single tuning parameter (but it does improve a little with two tuning parameters). Computation times for each of the penalties are given in Table 2 where we see that SCAD is considerably slower than LASSO and ALASSO, while the times for these latter two are comparable. Furthermore, the computation times for the cases with two tuning parameters are 2  3 times longer than those with one tuning parameter.
Besides variable selection, we also consider parameter inference in terms of estimation bias, accuracy of the estimated standard error (SEE) computed using the sandwich formula, (3.6
), and the empirical coverage probability (CP) of a nominal 95% confidence interval. The results for the ALASSO penalty (for the 25% censoring level) are presented in Table
3. Results for the ALASSO penalty show that the SEE is accurate for moderate sample sizes, but may underestimate the standard error (SE) for smaller samples. In the samples and , the smaller parameters are overshrunk, i.e., they are biased downwards. For this reason the CPs do not perform well. However, this is not the case for . In the case of , the CPs are close to the nominal level (although perhaps a little low for the shape coefficients). An improvement can be seen in the case of two tuning parameters. Similar tables for the other penalties (and 50% censoring) can be found in the Appendix, and, to summarize these additional results: LASSO overshrinks all parameters and the standard errors are underestimated in all cases, whereas the results for SCAD are better, but not as good as the ALASSO; in particular, the coverage for SCAD confidence intervals is much lower than the nominal level even for . As expected, when the censoring rate is increased from 25% to 50%, the variation (i.e., SE, SEE and MSE) of estimates is increased overall across all the three penalties.One Tuning Parameter  Two Tuning Parameters  
= 25%  LASSO  SCAD  ALASSO  LASSO  SCAD  ALASSO  
100  1.5  6.1  1.9  4.6  11.7  4.7  
500  2.7  6.2  3.2  7.3  19.9  9.0  
1000  4.3  11.5  5.8  13.1  37.5  17.1 
5 Lung Cancer Data
To illustrate the penalized variable selection methods on real data, a lung cancer dataset is considered. This dataset was collected as part of a PhD thesis by Wilkinson (1995) (see also Burke and MacKenzie (2017)). This dataset contains all individuals, of all ages, diagnosed with lung cancer in Northern Ireland during the one year period 1 October 1991 to 30 September 1992. Only cases of primary lung cancer were included. The date of diagnosis was taken to be the time origin for an individual and the end point was the earlier of the occurrence of death or the study end date, which was on 30 May 1993. Individuals who were still alive on the study end date were taken to have censored survival times. Individuals who died from another cause or who dropped out of the study were also censored. The final dataset included 855 patients, of which there were 673 deaths and 182 censored times. Besides the survival time and the censoring indicator, a number of other variables were recorded for each of the patients enrolled in the study (reference categories are listed first): age group (, , , , ), sex (female, male), treatment group (palliative, surgery, chemotherapy, radiotherapy, chemotherapy and radiotherapy), WHO status (normal activity, light work, unable to work, walking, bed/chair bound), cancer cell type (squamous cell, small cell, adenocarcinoma, other), serum sodium level (, , missing), serum albumen level (, , missing), metastases (no, yes, unknown), smoking status (nonsmoker, current smoker, exsmoker, missing).
5.1 Adequacy of Weibull
Before considering covariates and variable selection, we first carry out an initial check that a baseline Weibull distribution is appropriate for the lung cancer data. The cumulative hazard function for the Weibull model is given by , and, hence, . Therefore, given an estimate , a plot of against should produce a straight line. This standard Weibull model check is shown in Figure 4, and, despite a slight deficiency for very small survival times, it appears that the Weibull model is reasonable.
5.2 Variable Selection Results
The variable selection results for the different penalties are summarized in Table 4. In line with the results of the simulation study, the LASSO penalty selects the most complex model and the ALASSO penalty selects the least complex. Both ALASSO penalties (one and two tuning parameter cases) are in agreement on the nonimportance of sex and smoking status, and although age group is selected in the scale in the case with one tuning parameter, it is not significant. Interestingly, the two tuning parameter ALASSO selects the same set of covariates as identified by Burke and MacKenzie (2017) using a BIC stepwise procedure (albeit they additionally selected treatment in the shape). We also see that, in the two tuning parameters cases, the scale tuning parameter is smaller that of the one tuning parameter case, while the shape tuning parameter is larger. This suggests that the single penalty overpenalizes the scale coefficients and underpenalizes the shape; this is also evident from the scale and shape degrees of freedom. Interestingly, the one tuning parameter ALASSO converges in less than half the time of the two tuning parameter ALASSO, and achieves similar results. We expect this based on our simulation studies, and also expect the results of the two tuning parameter case to be marginally better (albeit it takes longer to converge).
Table 5 displays the estimated coefficients for both ALASSO penalties along with the unpenalized coefficients (we focus the ALASSO due to its superior performance in our simulation studies, but similar tables for LASSO and SCAD can be found in the Appendix). From Burke and MacKenzie (2017), note that the scale coefficients characterize the overall scale of the hazard (a positive value indicates an increase relative to the reference category), while the shape coefficients characterize its time evolution (a positive value indicates a hazard which increases over time relative to the reference category). We clearly see the similarity of the coefficient values for both the one and two tuning parameter ALASSO penalties, and, furthermore, that the selected variables are broadly in line with those which are statistically significant in the unpenalized model. Focussing on the results of the two tuning parameter case we find that: all treatments (apart from chemotherapy) have a negative scale coefficient suggesting that treatment reduces hazard (relative to palliative care); however, worse WHO status, small cancer cell type, presence of metasteses, and reduced sodium and albumen levels increase the hazard; lastly, sex, age group, and smoking status have no significant effect on the hazard. Since no variable appears in the shape component (i.e., all shape coefficients are set to zero), the selected model is a proportional hazards model, and exponentiating the scale coefficients yields the hazard ratios, e.g., the surgery hazard ratio is so that the risk of death is approximately 37.5% that of a patient receiving palliative care.
One Tuning Parameter  Two Tuning Parameters  
LASSO  SCAD  ALASSO  LASSO  SCAD  ALASSO  
Treatment  ,  ,  ,  ,  ,  
Age group    
WHO status  ,  ,  ,  ,  
Sex            
Smoking status      
Cell type  ,  ,  ,  
Metastases  ,  ,  
Sodium  ,  ,  
Albumen  ,  ,  ,  ,  
Tuning parameter(s)  0.026  0.041  0.015  0.014  0.024  0.004 
0.080  0.074  0.045  
Degrees of freedom  32.5  27.1  15.5  25.6  24.6  15.2 
Scale degrees of freedom  14.2  12.0  13.2  18.3  17.1  14.2 
Shape degrees of freedom  18.3  15.0  2.4  7.4  7.6  1.0 
Computation time (in minutes)  27.0  41.9  28.8  58.3  100.6  86.9 

= “selected in scale”, = “selected in shape”, and those which are nonsignificant (at the level) are shown in gray.
6 Discussion
The MPR approach results in flexible models which extend standard models, but the presence of multiple regression components means that variable selection is necessarily more challenging than in standard settings where there is only a single regression component. In this article, we have proposed a penalized variable selection procedure for the simultaneous selection of significant variables in the shape and scale parameters of a Weibull MPR model in the survival analysis setting. The performance of these methods was illustrated using simulation studies and a real data example. While we have considered the Weibull model example in this article, the proposed variable selection procedures can be applied easily to other MPR models.
Given that we model different distributional parameters (a scale and a shape parameter), there is no reason to assume that variable selection can be achieved with a single penalty applied to both regression components; hence, we also investigated the need for a separate tuning parameter for each regression component. We have found that the ALASSO performs very favourably in terms of identifying the true subset of covariates and coverage of calculated confidence intervals. This is true even with a single tuning parameter, however the results are improved when there are two tuning parameters (albeit this is more computationally intensive). On the other hand, SCAD does not perform well in the MPR setting, selecting an overly complex model and with poor confidence interval coverage for shape parameters.
Acknowledgements
The first author is funded by the the Irish Research Council. The second author was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science & ICT (No. NRF2017R1E1A1A03070747)
References
 Aalen et al. (2008) Aalen, O., Borgan, O., and Gjessing, H. (2008). Survival and event history analysis: a process point of view. Springer Science & Business Media.

Aitkin (1987)
Aitkin, M. (1987).
Modelling variance heterogeneity in normal regression using GLIM.
J. R. Statist. Soc. C., 36:332–339.  Anderson (1991) Anderson, K. M. (1991). A nonproportional hazards Weibull accelerated failure time regression model. Biometrics, 47:281–288.
 Breiman (1996) Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist., 24:2350–2383.
 Burke and MacKenzie (2017) Burke, K. and MacKenzie, G. (2017). Multiparameter regression survival modeling: An alternative to proportional hazards. Biometrics, 73:678–686.
 Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann. Statist., 32:407–499.
 Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Ass., 96:1348–1360.
 Fan and Li (2002) Fan, J. and Li, R. (2002). Variable selection for Cox’s proportional hazards model and frailty model. Ann. Statist., 30:74–99.
 Fan and Lv (2010) Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica, 20:101–148.
 Friedman et al. (2007) Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Statist., 1:302–332.
 George (2000) George, E. I. (2000). The variable selection problem. J. Am. Statist. Ass., 95:1304–1308.
 Ha et al. (2007) Ha, I. D., Lee, Y., and MacKenzie, G. (2007). Model selection for multicomponent frailty models. Statist. Med., 26:4790–4807.
 Ha et al. (2014) Ha, I. D., Pan, J., Oh, S., and Lee, Y. (2014). Variable selection in general frailty models using penalized hlikelihood. J. Comp. & Graph. Statist., 23:1044–1060.
 Harvey (1976) Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica, 44:461–465.
 Lee and Whitmore (2006) Lee, M.L. T. and Whitmore, G. A. (2006). Threshold regression for survival analysis: Modeling event times by a stochastic process reaching a boundary. Statist. Sci., 21:501–513.
 Lee and Nelder (1998) Lee, Y. and Nelder, J. A. (1998). Generalized linear models for the analysis of qualityimprovement experiments. Canadian J. Statist., 26:95–105.
 Miller (2002) Miller, A. (2002). Subset selection in regression. Chapman and Hall/CRC.
 Mullen (2014) Mullen, K. M. (2014). Continuous global optimization in R. J. Statist. Software, 60:1–45.
 Mullen et al. (2011) Mullen, K. M., Ardia, D., Gil, D. L., Windover, D., and Cline, J. (2011). DEoptim: An R package for global optimization by differential evolution. J. Statist. Software, 40:1–26.
 Nelder and Lee (1991) Nelder, J. A. and Lee, Y. (1991). Generalized linear models for the analysis of Taguchitype experiments. Appl. Stoch. Mod. & Data Analysis, 7:107–120.
 Park and Ha (2019) Park, E. and Ha, I. D. (2019). Penalized variable selection for accelerated failure time models with random effects. Statist. Med., 38:878–892.
 Park (1966) Park, R. E. (1966). Estimation with heteroscedastic error terms. Econometrica, 34:888.
 Radchenko and James (2008) Radchenko, P. and James, G. M. (2008). Variable inclusion and shrinkage algorithms. J. Am. Statist. Ass., 103:1304–1315.
 Rigby and Stasinopoulos (2005) Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. J. R. Statist. Soc. C., 54:507–554.
 Shao (1997) Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica, 7:221–242.
 Smyth (1989) Smyth, G. K. (1989). Generalized linear models with varying dispersion. J. R. Statist. Soc. B., 51:47–60.
 Stirling (1985) Stirling, W. D. (1985). Heteroscedastic models and an application to block designs. J. R. Statist. Soc. C., 34:33–41.
 Storn and Price (1997) Storn, R. and Price, K. (1997). Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim., 11:341–359.
 Taylor and Verbyla (2004) Taylor, J. and Verbyla, A. (2004). Joint modelling of location and scale parameters of the t distribution. Statist. Mod., 4:91–112.
 Thompson (1978) Thompson, M. L. (1978). Selection of variables in multiple regression: Part I. A review and evaluation. Int. Statist. Review, 46:1–19.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B., 58:267–288.
 Tibshirani (1997) Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statist. Med., 16:385–395.
 Wang and Leng (2007) Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. J. Am. Statist. Ass., 102:1039–1048.
 Wang et al. (2009) Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. J. R. Statist. Soc. B., 71:671–683.
 Wang et al. (2007) Wang, H., Li, R., and Tsai, C.L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94:553–568.
 Wilkinson (1995) Wilkinson, P. (1995). Lung cancer in Northern Ireland 1991 – 1992. PhD thesis, Queen’s University Belfast.

Wu and Li (2012)
Wu, L. and Li, H. (2012).
Variable selection for joint mean and dispersion models of the inverse Gaussian distribution.
Metrika, 75:795–808. 
Wu (2014)
Wu, L.C. (2014).
Variable selection in joint location and scale models of the skewtnormal distribution.
Comm. Statist. Sim. & Comp., 43:615–630.  Wu et al. (2013) Wu, L.C., Zhang, Z.Z., and Xu, D.K. (2013). Variable selection in joint location and scale models of the skewnormal distribution. J. Statist. Comp. & Sim., 83:1266–1278.
 (40) Yang, Y. Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation. Biometrika, 92.
 Zhang and Lu (2007) Zhang, H. H. and Lu, W. (2007). Adaptive lasso for Cox’s proportional hazards model. Biometrika, 94:691–703.
 Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. J. Am. Statist. Ass., 101:1418–1429.
 Zou and Hastie (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. B., 67:301–320.
7 Appendix
This Appendix contains (a) additional simulation results (Tables 6  11), and (b) coefficients and standard errors for various fitted (penalized) models (Tables 12 and 13).
Comments
There are no comments yet.