1 Introduction
Enhancements in data collection technologies have highlighted the importance of modern variable selection techniques. Traditional methods, such as best subset selection, are suboptimal and computationally expensive when the number of variables is large (Fan and Lv, 2010). Modern approaches use penalization to perform simultaneous model selection and estimation. A popular method is the LASSO (least absolute shrinkage and selection operator) (Tibshirani, 1996), which applies an $\ell_1$ penalty but leads to biased estimates. In contrast, the adaptive LASSO (Zou, 2006) uses adaptive weights, which reduce the bias present in the LASSO estimates. These methods were developed primarily in the context of normal linear regression and have been extended to generalized linear models (GLMs) (Tibshirani, 1996; Friedman et al., 2010) and Cox's proportional hazards model for survival data (Tibshirani, 1997). In these classical models, covariates enter through the location parameter (or the hazard scale in the Cox model case). One of the goals of this paper is to extend these methods to a more flexible setting with multiple distributional parameters.
Originally, methods such as quadratic programming were used to solve these nondifferentiable LASSO-type problems, but Efron et al. (2004) and Friedman et al. (2007), respectively, proposed the least angle regression (LARS) and coordinate descent algorithms, with the latter proving particularly fast for problems of this type. These are somewhat "nonstandard" estimation procedures in the context of classical statistical estimation, where nondifferentiable objective functions are relatively uncommon. As an alternative, perturbing the penalty function slightly to render it differentiable (Hunter and Li, 2005; Lloyd-Jones et al., 2018) enables standard optimization methods to be used. Oelker and Tutz (2017) outline a series of approximations of different penalties, which allows for penalized smooth functions. These differentiable penalties can be easily implemented and solved using standard gradient-based optimization procedures, e.g., Newton-Raphson. The tuning parameter that controls the strength of the penalty is typically obtained by minimizing the cross-validation error or an information criterion (IC), such as the Akaike IC (AIC) (Akaike, 1974) or Bayesian IC (BIC) (Schwarz, 1978). This is a two-step estimation process, which tends to be computationally intensive as it involves fitting an array of different models and selecting the best one.
Su (2015) and Su et al. (2018) present an estimation procedure, titled "MIC" (minimum approximated information criterion), that is not based on the $\ell_1$ norm. They exploit the close connection between model selection criteria and penalization (Fan and Lv, 2010) and introduce an approximated information criterion in order to avoid the classic two-step estimation process. At its core, the MIC utilizes an approximation of the $\ell_0$ norm with a continuous unit dent function. The $\ell_0$ norm is discrete in nature, and it is preferable to have a penalty function with a level of smoothness for optimization purposes. Su (2015) describes a "subtle uprooting" method for variable selection, which involves using a smooth surrogate function to approximate cardinality. This is followed by a second technical step to enhance sparsity, after which the final problem becomes nondifferentiable. This approach is extended to GLMs in Su et al. (2018). Fixing the tuning parameter at $\lambda = 2$ for the AIC or $\lambda = \log n$ for the BIC is computationally advantageous, as it avoids the tuning parameter selection problem. It is not required to compute the whole regularization path of solutions, nor is it necessary to choose the best tuning parameter using cross-validation, as is typically done.
We propose a more straightforward method of approximating the IC function using a smooth approximation of the $\ell_0$ norm, which can be optimized directly. Instead of performing the reparameterization step outlined in Su (2015), which renders the problem nondifferentiable, we achieve sparsity in a different way. Our approach squeezes the coefficient values to zero by optimizing a sequence of objective functions that get successively closer to the nondifferentiable one. Consequently, our proposed "smooth IC" (SIC) function can be optimized directly using standard gradient-based optimization techniques. Additionally, we extend this new SIC variable selection procedure for use in the developing area of "multiparameter regression" (MPR). Our proposed methods are implemented in our publicly available R package "smoothic" (O'Neill and Burke, 2021). To date, penalized estimation has been applied primarily in the context of classical regression models, where covariates enter the model through a single parameter (e.g., a location parameter). Other distributional parameters, such as a dispersion parameter, are typically held constant. This "single parameter regression" (SPR) does not take into account the possible impact of covariates on the other distributional parameters. MPR is a more flexible approach in which multiple parameters are regressed simultaneously on covariates (this is sometimes referred to as "distributional regression" (Stasinopoulos et al., 2018)). For example, covariates can enter the model through the location and dispersion parameters, or the scale and shape parameters of the hazard function in the survival context (see Burke and MacKenzie (2017), Burke et al. (2020) and references therein), or indeed through various distributional parameters as in generalized additive models for location, scale and shape (GAMLSS) (Rigby and Stasinopoulos, 2005). Mayr et al. (2012)
address the problem of variable selection by utilizing classical gradient boosting techniques to fit GAMLSS models. More recently,
Groll et al. (2019) suggest implementing a LASSO-type penalization in the GAMLSS framework. This regularization approach to GAMLSS is highly flexible, but it has the added complexity of separate tuning parameters for each regression component. Groll et al. state that carrying out the computationally demanding grid search for the optimal tuning parameters is a drawback of their method. In our proposed multiparameter regression with smooth IC (MPR-SIC) procedure, this issue is circumvented as the values of both tuning parameters are known in advance. The model formulation, including the introduction of the "smooth $\ell_0$ norm", the estimation procedure and the optimization algorithm, is outlined in Section 2. In Section 3, the performance of our proposed methods is evaluated, in terms of both variable selection and parameter estimation, through extensive simulation studies. We consider three real data analyses to demonstrate our proposed methods in Section 4. Finally, we close with some concluding remarks in Section 5.
2 Model Formulation
2.1 Preliminaries
The classic normal linear regression model is a single parameter problem that assumes constant error variance. The model is
(2.1) $y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),$
for $i = 1, \ldots, n$, where $y_i$ is the response value and $x_i = (1, x_{i1}, \ldots, x_{ip})^\top$ is a vector of covariates for the $i$th individual over the $p$ predictor variables. Here, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ is the vector of regression coefficients for the location parameter, and the error variance $\sigma^2$ is common to all individuals under the homogeneity assumption. For the multiparameter regression (MPR) approach, the single parameter model in (2.1) is extended to include heterogeneity of error variance:
(2.2) $y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_i^2), \qquad \log \sigma_i^2 = x_i^\top \alpha,$
where the log-linear form ensures that $\sigma_i^2$ remains positive. The vector of regression coefficients for the dispersion parameter is $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)^\top$. There may be different (possibly overlapping) sets of covariates impacting the location and dispersion, and although we use $x_i$ for both, a given $\beta_j$ or $\alpha_j$ coefficient may be set to zero, which removes the covariate from that model component. Because we apply penalized variable selection, the regression coefficients need to be on a similar scale, and therefore we assume that the predictors are scaled to have unit variance. The log-likelihood function for the MPR normal model is
(2.3) $\ell(\theta) = -\frac{1}{2} \sum_{i=1}^{n} \left[ \log(2\pi) + x_i^\top \alpha + e^{-x_i^\top \alpha} (y_i - x_i^\top \beta)^2 \right],$
where $\theta = (\beta^\top, \alpha^\top)^\top$. Model selection criteria, such as the AIC and BIC, have a penalized functional form similar to regularization, and in the multiparameter framework they can be formulated as
(2.4) $\mathrm{IC} = -2\ell(\theta) + \lambda \left( \|\tilde{\beta}\|_0 + \|\tilde{\alpha}\|_0 + 2 \right),$
where $\lambda$ is a tuning parameter, and $\tilde{\beta} = (\beta_1, \ldots, \beta_p)^\top$ and $\tilde{\alpha} = (\alpha_1, \ldots, \alpha_p)^\top$, i.e., the coefficient vectors with the intercepts omitted; the addition of two in the penalty takes into account the estimation of the intercept terms $\beta_0$ and $\alpha_0$. Fixing the tuning parameter at $\lambda = 2$ or $\lambda = \log n$ results in the AIC and BIC respectively. The $\ell_0$ norm, $\|\tilde{\beta}\|_0 = \sum_{j=1}^{p} \mathbb{1}(\beta_j \neq 0)$, indicates the cardinality, or number of nonzero elements, of $\tilde{\beta}$. This is not truly a norm since $\|c\tilde{\beta}\|_0 = \|\tilde{\beta}\|_0$ for any $c \neq 0$, violating the homogeneity property. The AIC is reported to be asymptotically "selection inconsistent" but "loss-efficient" as a variable selection criterion (Shao, 1997; Yang, 2005; Wang et al., 2009). As a result of its consistency property and superior empirical performance in variable selection, we employ a BIC-type criterion (Wang and Leng, 2007) where $\lambda = \log n$. Using the multiparameter normal likelihood in (2.3) and arranging (2.4) as a penalized likelihood results in
(2.5) $-2\ell(\theta) + \log(n) \left( \|\tilde{\beta}\|_0 + \|\tilde{\alpha}\|_0 + 2 \right).$
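As a concrete illustration of the criterion in (2.4), the IC can be computed directly from a fitted model by counting nonzero coefficients. The following Python sketch is our own illustrative code (the function name is hypothetical, not from the smoothic package):

```python
import numpy as np

def info_criterion(loglik, beta_tilde, alpha_tilde, n, lam=None):
    """Multiparameter IC of (2.4): -2*loglik + lam*(||beta||_0 + ||alpha||_0 + 2).

    beta_tilde and alpha_tilde exclude the intercepts; the "+ 2" accounts
    for the two unpenalized intercept terms. lam = 2 gives an AIC-type
    criterion and lam = log(n) (the default) a BIC-type one.
    """
    if lam is None:
        lam = np.log(n)
    k = np.count_nonzero(beta_tilde) + np.count_nonzero(alpha_tilde) + 2
    return -2.0 * loglik + lam * k
```

The `lam=None` default corresponds to the BIC-type choice $\lambda = \log n$ adopted in this paper.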
2.2 Smooth $\ell_0$ Norm
Due to the nondifferentiability of the $\ell_0$ norm, we propose a smooth function to approximate it:
(2.6) $\phi_\epsilon(\beta_j) = \frac{\beta_j^2}{\beta_j^2 + \epsilon^2}.$
This is differentiable with respect to $\beta_j$ for $\epsilon > 0$, and $\phi_\epsilon(\beta_j) \to \|\beta_j\|_0$ as $\epsilon \to 0$. Figure 1 demonstrates how $\phi_\epsilon$ gets closer to the $\ell_0$ norm as $\epsilon$ decreases. The smallest value of $\epsilon$ shown approximates the $\ell_0$ norm very closely, but it is also near the discontinuity at $\beta_j = 0$, making it unstable. In practice, we have found that fixing the smoothing parameter generally works well, but telescoping through a decreasing sequence of $\epsilon$ values (outlined in Section 2.4) is recommended for optimal performance. Note that the first and second derivatives (for $\epsilon > 0$) have a simple analytic form and therefore can be used within the gradient-based optimization procedure of Section 2.3:
(2.7) $\phi_\epsilon'(\beta_j) = \frac{2 \beta_j \epsilon^2}{(\beta_j^2 + \epsilon^2)^2}, \qquad \phi_\epsilon''(\beta_j) = \frac{2 \epsilon^2 (\epsilon^2 - 3\beta_j^2)}{(\beta_j^2 + \epsilon^2)^3}.$
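The approximation (2.6) and its derivatives (2.7) are straightforward to code and to check numerically. The sketch below (Python, with hypothetical function names of our own) verifies the first derivative against a finite difference and shows the approximation approaching the 0/1 indicator as the smoothing parameter decreases:

```python
import numpy as np

def phi(beta, eps):
    """Smooth l0 approximation of (2.6): beta^2 / (beta^2 + eps^2)."""
    return beta**2 / (beta**2 + eps**2)

def phi_grad(beta, eps):
    """First derivative in (2.7)."""
    return 2.0 * beta * eps**2 / (beta**2 + eps**2) ** 2

def phi_hess(beta, eps):
    """Second derivative in (2.7)."""
    return 2.0 * eps**2 * (eps**2 - 3.0 * beta**2) / (beta**2 + eps**2) ** 3

# phi(0, eps) is exactly 0, and phi(b, eps) -> 1 for b != 0 as eps -> 0,
# mimicking the indicator 1(b != 0) that underlies the l0 norm.
for eps in (1.0, 0.1, 0.001):
    print(eps, phi(0.5, eps))
```

This kind of finite-difference check is a cheap safeguard when the analytic derivatives are embedded in a Newton-type routine.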
Defining $\|\tilde{\beta}\|_{0,\epsilon} = \sum_{j=1}^{p} \phi_\epsilon(\beta_j)$ as the "smooth $\ell_0$ norm", and substituting the $\ell_0$ norm in (2.5) with $\|\tilde{\beta}\|_{0,\epsilon}$ and $\|\tilde{\alpha}\|_{0,\epsilon}$, results in our proposed approach of MPR with smooth IC (MPR-SIC):
(2.8) $\mathrm{SIC}(\theta) = -2\ell(\theta) + \log(n) \left( \|\tilde{\beta}\|_{0,\epsilon} + \|\tilde{\alpha}\|_{0,\epsilon} + 2 \right).$
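Putting the likelihood (2.3) and the smooth penalty (2.6) together gives a compact form of the SIC objective (2.8). The following Python sketch is our own illustrative code (not the smoothic implementation); it assumes that X carries an intercept column and that theta stacks $(\beta, \alpha)$:

```python
import numpy as np

def sic(theta, X, y, eps):
    """Smooth IC of (2.8) for the normal MPR model (BIC-type, lam = log n)."""
    n, q = X.shape
    beta, alpha = theta[:q], theta[q:]
    log_var = X @ alpha                       # log sigma_i^2 = x_i' alpha
    resid = y - X @ beta
    # Normal MPR log-likelihood (2.3)
    ll = -0.5 * np.sum(np.log(2.0 * np.pi) + log_var
                       + np.exp(-log_var) * resid**2)
    # Smooth l0 penalty over the non-intercept coefficients, plus 2 intercepts
    smooth_l0 = lambda b: np.sum(b**2 / (b**2 + eps**2))
    pen = smooth_l0(beta[1:]) + smooth_l0(alpha[1:]) + 2.0
    return -2.0 * ll + np.log(n) * pen
```

With $\epsilon$ small, the penalty term approaches the exact BIC penalty of (2.5) evaluated at the same coefficient vector.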
2.3 Estimation Procedure
We define the penalized estimates as $\hat{\theta} = \operatorname{arg\,min}_{\theta} \, \mathrm{SIC}(\theta)$, where $\mathrm{SIC}(\theta)$ is given by (2.8). The first derivatives with respect to the parameters are
(2.9) $\frac{\partial\, \mathrm{SIC}}{\partial \beta} = -2 X^\top z_1 + \log(n)\, \nu_\beta, \qquad \frac{\partial\, \mathrm{SIC}}{\partial \alpha} = -2 X^\top z_2 + \log(n)\, \nu_\alpha,$
where $X$ is an $n \times (p+1)$ matrix whose $i$th row is $x_i^\top$; $z_1$ and $z_2$ are vectors of length $n$ such that $z_{1i} = e^{-x_i^\top \alpha}(y_i - x_i^\top \beta)$ and $z_{2i} = \frac{1}{2}\left[e^{-x_i^\top \alpha}(y_i - x_i^\top \beta)^2 - 1\right]$; and $\nu_\beta$ and $\nu_\alpha$ are vectors whose $(j+1)$th elements are given by $\phi_\epsilon'(\beta_j)$ and $\phi_\epsilon'(\alpha_j)$ respectively, except that their first elements are zero due to the fact that the intercepts are not penalized.
The matrix of second derivatives of the SIC objective, i.e., $\partial^2\, \mathrm{SIC} / \partial \theta\, \partial \theta^\top$, is given by
$\frac{\partial^2\, \mathrm{SIC}}{\partial \theta\, \partial \theta^\top} = 2 I(\theta) + \log(n) \begin{pmatrix} \Sigma_\beta & 0 \\ 0 & \Sigma_\alpha \end{pmatrix},$
where $I(\theta) = \begin{pmatrix} X^\top W_1 X & X^\top W_3 X \\ X^\top W_3 X & X^\top W_2 X \end{pmatrix}$ is the observed information matrix of the unpenalized likelihood; $\Sigma_\beta$ and $\Sigma_\alpha$ are diagonal matrices that appear due to the penalties, whose $(j+1)$th diagonal elements are given by $\phi_\epsilon''(\beta_j)$ and $\phi_\epsilon''(\alpha_j)$ respectively, except that their first diagonal elements are zero due to the fact that the intercepts are not penalized; and $W_1$, $W_2$ and $W_3$ are diagonal weight matrices whose $i$th diagonal elements are given by $e^{-x_i^\top \alpha}$, $\frac{1}{2} e^{-x_i^\top \alpha}(y_i - x_i^\top \beta)^2$ and $e^{-x_i^\top \alpha}(y_i - x_i^\top \beta)$ respectively. We employ the "RS" algorithm (Rigby and Stasinopoulos, 2005), which does not use cross derivatives. This algorithm is motivated by the fact that in many classical models, including location and scale models, the parameters are information orthogonal, as discussed in Cox and Reid (1987). However, Stasinopoulos et al. (2007) report that the RS algorithm works well even when the parameters are not information orthogonal.
The resulting system of Newton-Raphson equations can be expressed compactly as
(2.10) $\theta^{(m+1)} = \theta^{(m)} - \left( \frac{\partial^2\, \mathrm{SIC}}{\partial \theta\, \partial \theta^\top} \right)^{-1} \frac{\partial\, \mathrm{SIC}}{\partial \theta} \bigg|_{\theta = \theta^{(m)}}.$
They are iteratively solved for $m = 0, 1, 2, \ldots$ until convergence, where the elements superscripted by $(m)$ depend on $\theta^{(m)}$, but this is excluded for notational convenience. We use the classical ordinary least squares estimates as initial values for the location parameter, i.e., $\beta^{(0)} = (X^\top X)^{-1} X^\top y$, where $y = (y_1, \ldots, y_n)^\top$ is the response vector. We fix the starting value for the intercept of the dispersion term at $\alpha_0^{(0)} = \log \hat{\sigma}^2$, where $\hat{\sigma}^2$ is the classical residual variance estimator. The remaining elements of the dispersion parameter vector are set to zero (Rutemiller and Bowers, 1968; Harvey, 1976), which gives $\alpha^{(0)} = (\log \hat{\sigma}^2, 0, \ldots, 0)^\top$. The standard errors of the estimates can be directly acquired by estimating the covariance of the penalized estimates for the true nonzero parameters using the sandwich formula,
(2.11) $\widehat{\mathrm{cov}}(\hat{\theta}) = \left( \frac{\partial^2\, \mathrm{SIC}}{\partial \theta\, \partial \theta^\top} \right)^{-1} \widehat{\mathrm{cov}}\!\left( \frac{\partial\, \mathrm{SIC}}{\partial \theta} \right) \left( \frac{\partial^2\, \mathrm{SIC}}{\partial \theta\, \partial \theta^\top} \right)^{-1} \Bigg|_{\theta = \hat{\theta}},$
which has been shown to be accurate for moderate sample sizes (Fan and Li, 2001, 2002).
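Whatever the exact penalized quantities involved, the sandwich estimator has the generic "bread and meat" form $H^{-1} V H^{-1}$. A small illustrative helper (our own code; here $V$ is taken as the empirical outer-product sum of per-observation score contributions):

```python
import numpy as np

def sandwich_cov(hessian, scores):
    """Sandwich covariance estimate: H^{-1} V H^{-1}.

    hessian : (q, q) second-derivative matrix of the objective at the estimate.
    scores  : (n, q) per-observation first-derivative contributions; their
              outer-product sum estimates the variance of the total score.
    """
    V = scores.T @ scores
    bread = np.linalg.inv(hessian)
    return bread @ V @ bread
```

When the model is correctly specified, $V$ and $H$ are close (up to scaling) and the sandwich collapses towards the usual inverse-information covariance.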
2.4 $\epsilon$-Telescoping
Although smaller values of $\epsilon$ lead to a better approximation of the $\ell_0$ norm (see Figure 1), and hence of the IC being optimized, we have found that the procedure becomes less numerically stable. On the other hand, larger values of $\epsilon$ lead to a more stable optimization procedure, but one that does not yield coefficients close to zero. Therefore, we propose a method that "telescopes" through a decreasing sequence of $\epsilon$ values and makes use of "warm starts", whereby the solution to the previous optimization problem is used as the initial point for the current nearby problem. The method can obtain final estimates of the true zero coefficients that are extremely close to zero, to the extent that they can be treated as zero for practical purposes; in this paper, we treat estimates below a small threshold in absolute value as being zero. We have found that a decreasing sequence of $\epsilon$ values performs well, and the performance of the method is not highly impacted by the particular choice of sequence. The initial $\epsilon$ must be large enough to introduce smoothness and give stable estimates, while the smallest $\epsilon$ more closely approximates the $\ell_0$ norm to induce pseudosparsity (where we say "pseudo" since the algorithm produces coefficients that can be made arbitrarily close to zero while not being exactly zero). In addition to this, we propose implementing an exponentially decaying sequence of the form $\epsilon_t = \epsilon_1 e^{-\gamma (t-1)}$, where $\epsilon_1$ is the starting value, $\gamma$ is the rate of decay and $t$ is the step number; the decay parameter $\gamma$ is determined by the chosen endpoints and the number of steps. This is advantageous as the optimization begins with large increments from $\epsilon_1$, which provides rapid improvements and estimates that are initially close to the unpenalized values. The smaller increments towards the final $\epsilon$ value allow for smaller refinements, especially with regard to squeezing some coefficients close to zero.
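The $\epsilon$-telescope with warm starts can be sketched as follows (Python; the endpoints and number of steps below are illustrative choices of ours, not prescriptions from this paper):

```python
import numpy as np

def eps_sequence(eps_start, eps_end, n_steps):
    """Exponentially decaying telescope eps_t = eps_start * exp(-gamma*(t-1)),
    with the decay rate gamma chosen so that the final step equals eps_end."""
    gamma = np.log(eps_start / eps_end) / (n_steps - 1)
    return eps_start * np.exp(-gamma * np.arange(n_steps))

def telescope(fit_one, theta0, eps_seq):
    """Warm starts: each inner optimization is initialized at the previous
    solution, so successive (nearby) problems are cheap to solve."""
    theta = theta0
    for eps in eps_seq:
        theta = fit_one(theta, eps)  # e.g., Newton-Raphson on the SIC
    return theta

seq = eps_sequence(10.0, 1e-5, 100)  # illustrative endpoints and length
```

Here `fit_one` stands in for a single penalized fit at a fixed $\epsilon$; only the chaining of solutions is shown.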
Table 1 presents an example of the coefficient estimates for a true zero and a true nonzero coefficient for some simulated data. The shrinkage effect due to decreasing $\epsilon$ values is apparent. The value of the true zero coefficient drastically reduces in magnitude through each step, and the final estimate at the smallest $\epsilon$ is extremely close to zero. As a result, it can be treated as a zero coefficient without any issues, and, indeed, could be shrunk further if desired by reducing $\epsilon$. The estimate for the true nonzero coefficient does not vary greatly over the telescoping steps.
Figure 2 provides a visualization of the telescoping effect in terms of the objective function. This is a slice through the objective function, which is plotted as a function of the coefficient value. Different curves are plotted corresponding to the $\epsilon$ values in the telescope sequence. In the case of the true zero coefficient, it is clear that as $\epsilon$ decreases, the curves become narrower and therefore there is less uncertainty around the estimate. Additionally, it is evident that the minimum of the curve shifts towards zero as $\epsilon$ decreases. For the true nonzero coefficient, the curves for the different $\epsilon$ values are almost identical, i.e., the telescoping has little impact on the shape of the objective function in this case.
True zero estimate  True nonzero estimate  (rows correspond to successive telescoping steps with decreasing $\epsilon$)
0.0007631062  0.9998
0.0000078397  0.9998
0.0000000815  0.9998
0.0000000008  0.9998
2.5 Algorithm
The proposed MPR-SIC variable selection method is summarized in Algorithm 1.
3 Simulation Studies
3.1 Setup
We have undertaken a simulation study to investigate the performance of our proposed MPR-SIC method. We simulate data from the normal MPR model of Section 2, where $X$ is the matrix of 12 covariates, some of which are Bernoulli(0.5) and the remainder standard normal; the regression coefficients are provided in Table 2. Three different sample sizes ($n = 100$, 500 and 1000) are considered. Results of a simulation study carried out in a single parameter setting (not discussed here) are given in Appendix A.
$\beta$:  0  1  0.5  0.5  1  0.5  1  0  0  0  0  0  0  
$\alpha$:  0  0.5  1  0.5  1  0  0  0.5  1  0  0  0  0 

Binary covariates indicated in bold.
The true values of 0, 0.5 and 1 correspond to no effect, a weak effect and a strong effect, respectively. There are several combinations where covariates enter through both distributional parameters, a single distributional parameter, or not at all. There are four pure noise covariates, in that both the $\beta$ and $\alpha$ coefficients are zero.
For each scenario, we apply our proposed methods, MPR-SIC and SPR-SIC (i.e., the SIC implemented in the single-parameter regression framework). To compare performance, we apply the adaptive LASSO (ALASSO) method from the "glmnet" package, which pertains to a linear regression model, where we select the value of the tuning parameter by minimizing the BIC (ALASSO-IC). Note that only cross-validation minimization is available in the glmnet package, and, therefore, we compute the BIC by evaluating the normal likelihood at the ALASSO estimates and with the variance estimator
(3.12) $\hat{\sigma}^2 = \frac{1}{n - \hat{s}} \sum_{i=1}^{n} (y_i - x_i^\top \hat{\beta})^2,$
as suggested in Reid et al. (2016), where $\hat{s}$ is the number of nonzero elements in $\hat{\beta}$.
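The degrees-of-freedom-corrected variance estimator of Reid et al. (2016) in (3.12) is simple to compute; a minimal sketch (the helper name is our own):

```python
import numpy as np

def reid_variance(y, X, beta_hat):
    """RSS / (n - s), where s counts the nonzero elements of beta_hat."""
    s = np.count_nonzero(beta_hat)
    rss = np.sum((y - X @ beta_hat) ** 2)
    return rss / (len(y) - s)
```

Dividing by $n - \hat{s}$ rather than $n$ mirrors the usual degrees-of-freedom correction, with the selected support size playing the role of the model dimension.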
3.2 Simulation Results
Metrics including the average number of true zero coefficients correctly set to zero (C) and the average number of true nonzero coefficients incorrectly set to zero (IC) are investigated to analyse the performance of the method. The probability of choosing the true model (PT) is examined by looking at the proportion of times the true model is selected. The mean squared error (MSE) is computed for each simulation replicate in order to assess in-sample prediction accuracy (Tibshirani, 1997). These metrics, averaged over the simulation replicates, are presented in Table 3. The C values are close to six (the true value) and improve as the sample size increases. The IC values are zero in all cases apart from the $\alpha$ parameters at $n = 100$. This is due to the three smaller-valued $\alpha$ parameters being set to zero incorrectly. This behaviour is also conveyed by the probability of choosing the true model (PT). The PT values are low for $n = 100$, which is due to both the $\beta$ and $\alpha$ components sometimes having a zero coefficient not set to zero, and, in the case of $\alpha$, sometimes having a nonzero coefficient set to zero. The sample size of $n = 100$ is a challenging scenario as the MPR-SIC method is fitting a total of $2(p+1)$ parameters, which in this case is 26 parameters for a relatively small sample size. Taking this into account, we suggest that the performance of the method in this setting is reasonable. The PT values are high for $n = 500$ and 1000. Both the SPR-SIC and ALASSO-IC models underperform in the location component compared to the MPR-SIC, even though the location and dispersion parameters are orthogonal in the normal model. Nonetheless, both models do relatively well in the case of $\beta$, with the SPR-SIC model doing better than the ALASSO-IC. Also, their MSE values are higher than those of the MPR-SIC approach.
The estimation and inferential performance of our proposed MPR-SIC method is investigated in Table 4. The equivalent results (not discussed here) for the SPR-SIC and ALASSO-IC methods are deferred to Appendix B. The average estimate over simulation replicates is shown along with the true standard error (SE), which is the standard deviation of the estimates over simulation replicates, and the average estimated standard error (SEE) over simulation replicates, where the SEE in a given replicate is computed using (2.11); also shown is the empirical coverage probability (CP) for a nominal 95% confidence interval. We can see that, in all cases, the estimated parameter is close to the true value, albeit there is some bias in the larger $\alpha$ coefficients at $n = 100$. The standard errors for both the $\beta$ and $\alpha$ parameters are underestimated at $n = 100$, leading to CPs below 0.95. However, at the larger sample sizes the standard errors are well estimated and the coverage is very close to the desired 0.95 level. Out-of-sample prediction coverage probabilities (PCPs) for the methods are calculated for a sample 20% of the size of the original data and are shown in Table 5. This is calculated as the proportion of times the true response value lies in a nominal 95% prediction interval (PI) in each replicate. The average is then taken over the 1000 replicates. The 95% PIs are calculated as $x_i^\top \hat{\beta} \pm 1.96 \exp(x_i^\top \hat{\alpha} / 2)$. Note that for the SPR-SIC and ALASSO-IC, the dispersion parameter is held constant across observations, and so $\hat{\alpha}$ is a vector of zeros except for its first element, which is the intercept. The methods appear to perform similarly when examining the overall PCP, although the MPR-SIC is somewhat poorer than the others in the $n = 100$ case. However, the observations can be categorized by their variability, and, therefore, we split them into groups with low, medium and high variability, where the thresholds are the tertiles computed numerically from the true underlying distribution of $\sigma_i^2$. Doing so reveals that none of the methods perform particularly well at $n = 100$ when viewed in terms of these three levels of variability. For $n = 500$ and 1000, the coverage for the MPR-SIC remains at approximately 95%, which is the desired nominal value. In contrast, for the SPR-SIC and ALASSO-IC, the PIs are too wide for the low and medium variability cases (leading to near 100% coverage) and too narrow for the high variability cases (leading to approximately 85% coverage). This is unsurprising since these methods assume a constant variance, and, therefore, cannot adapt to heterogeneity in the data. This can also be visualized in Figure 3, which shows the coverage for ten variability categories; again we see that the MPR-SIC method lies close to 95%, whereas the other methods are either too high or too low.
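The out-of-sample coverage computation can be sketched as follows (Python; this assumes the 95% PI form $x_i^\top \hat{\beta} \pm 1.96\, \hat{\sigma}_i$ with $\hat{\sigma}_i = \exp(x_i^\top \hat{\alpha}/2)$, and the function name is our own):

```python
import numpy as np

def prediction_coverage(y, X, beta_hat, alpha_hat, z=1.96):
    """Proportion of responses inside the nominal 95% prediction interval.

    For single-parameter fits, alpha_hat is all zeros except the intercept,
    giving a constant interval width across observations.
    """
    mu = X @ beta_hat
    sd = np.exp(X @ alpha_hat / 2.0)   # alpha models log sigma_i^2
    return float(np.mean(np.abs(y - mu) <= z * sd))

# Simulated check: with the true parameters, coverage should be near 0.95.
rng = np.random.default_rng(1)
n = 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, alpha = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
y = rng.normal(X @ beta, np.exp(X @ alpha / 2.0))
cov = prediction_coverage(y, X, beta, alpha)
```

Replacing the true parameters by estimates from a held-out fit gives the PCP values reported in the tables.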
MPR-SIC  SPR-SIC  ALASSO-IC  
n  C(6)  IC(0)  PT  MSE  C(6)  IC(0)  PT  MSE  C(6)  IC(0)  PT  MSE  
$\beta$ coefficients:  
100  5.28  0.00  0.53  0.06  5.64  1.46  0.15  0.87  5.36  1.27  0.14  0.82  
500  5.89  0.00  0.90  0.00  5.84  0.15  0.76  0.15  5.66  0.10  0.67  0.16  
1000  5.93  0.00  0.94  0.00  5.90  0.01  0.90  0.06  5.80  0.01  0.83  0.07  
$\alpha$ coefficients:  
100  5.50  0.64  0.31  0.59  6.00  6.00  0.00  6.24  6.00  6.00  0.00  6.46  
500  5.91  0.00  0.91  0.04  6.00  6.00  0.00  6.88  6.00  6.00  0.00  6.93  
1000  5.94  0.00  0.94  0.02  6.00  6.00  0.00  6.91  6.00  6.00  0.00  6.93 

C, average correct zeros; IC, average incorrect zeros; PT, the probability of choosing the true model;
MSE, the average mean squared error.
MPR-SIC (column blocks correspond to $n$ = 100, 500 and 1000; the first six rows are $\beta$ coefficients and the last six are $\alpha$ coefficients)  
True  Est  SE  SEE  CP  Est  SE  SEE  CP  Est  SE  SEE  CP  
1.0  1.00  0.07  0.05  0.80  1.00  0.02  0.02  0.95  1.00  0.01  0.01  0.95  
0.5  0.50  0.08  0.05  0.82  0.50  0.02  0.02  0.95  0.50  0.01  0.01  0.95  
0.5  0.50  0.07  0.05  0.80  0.50  0.02  0.02  0.94  0.50  0.01  0.01  0.94  
1.0  1.00  0.09  0.06  0.82  1.00  0.03  0.03  0.92  1.00  0.02  0.02  0.95  
0.5  0.50  0.07  0.04  0.82  0.50  0.02  0.02  0.94  0.50  0.01  0.01  0.93  
1.0  1.00  0.07  0.05  0.80  1.00  0.02  0.02  0.94  1.00  0.01  0.01  0.94  
0.5  0.50  0.31  0.13  0.69  0.51  0.07  0.07  0.95  0.50  0.05  0.05  0.95  
1.0  1.12  0.24  0.17  0.82  1.02  0.07  0.07  0.94  1.01  0.05  0.05  0.94  
0.5  0.49  0.31  0.13  0.67  0.50  0.07  0.07  0.94  0.50  0.04  0.05  0.96  
1.0  1.08  0.21  0.16  0.86  1.01  0.07  0.06  0.94  1.00  0.04  0.05  0.95  
0.5  0.49  0.29  0.13  0.71  0.51  0.07  0.06  0.94  0.50  0.05  0.05  0.94  
1.0  1.12  0.23  0.17  0.81  1.02  0.07  0.07  0.93  1.01  0.04  0.05  0.95 

SE, standard deviation of estimates over 1000 replications; SEE, average of estimated standard errors
over 1000 replications; CP, the empirical coverage probability of a nominal 95% confidence interval.
MPR-SIC  SPR-SIC  ALASSO-IC  
100  500  1000  100  500  1000  100  500  1000  
Low  0.78  0.93  0.94  1.00  1.00  1.00  1.00  1.00  1.00 
Medium  0.89  0.95  0.95  0.98  1.00  1.00  0.98  1.00  1.00 
High  0.94  0.95  0.95  0.79  0.84  0.85  0.81  0.85  0.85 
Overall  0.86  0.94  0.95  0.93  0.95  0.95  0.93  0.95  0.95 

Variability categorized as low, medium and high using tertile thresholds of the true distribution of $\sigma_i^2$.
Out-of-sample coverage is calculated for a sample 20% of the size of the original data.
4 Real Data Analyses
4.1 Overview
We consider three real data analyses to illustrate our proposed MPR-SIC method, which is implemented using the $\epsilon$-telescope (Algorithm 1). For each dataset, the resulting MPR-SIC, SPR-SIC and ALASSO-IC estimates and associated standard errors (in brackets) are presented in Table 6 (note that for the SPR-SIC and ALASSO-IC methods, $\hat{\alpha}$ is a vector of zeros except for its first element, which contains the intercept). Additionally, the change in BIC, $\Delta$BIC, from dropping a variable from the location or dispersion component of the model is computed; this is a measure of the effect of the variable. For the sake of brevity, the standard errors and $\Delta$BIC are only discussed for the MPR-SIC method as this is the main focus of this paper. The corresponding results for the SPR-SIC and ALASSO-IC methods are given in Appendix C. Table 7 provides the out-of-sample PCPs, overall and split by category of variability, calculated using 10-fold cross-validation.
4.2 Prostate Cancer Data
We examine the prostate cancer data, which come from a study by Stamey et al. (1989) and which appear in Tibshirani (1996) and Zou and Hastie (2005). The correlation between the level of prostate-specific antigen (PSA) and various clinical measures in 97 men who were about to receive a radical prostatectomy is examined. The predictors consist of eight clinical measures: log(cancer volume) (lcavol), log(prostate weight (g)) (lweight), presence of seminal vesicle invasion (SVI) (svi), age of the patient (age), log(amount of benign prostatic hyperplasia) (lbph), log(capsular penetration (cm)) (lcp), Gleason score (gleason) and percentage of Gleason scores four or five (pgg45). The logarithm of PSA (ng/mL) is the response variable. The presence of SVI (svi) is a binary variable (1 = yes, 0 = no) and gleason is a discrete numerical variable with four values. The Gleason score relates to prostate cancer grades, and the pgg45 predictor provides information on the history of the patient: the percentage of Gleason scores of four or five received before the final Gleason score in gleason. PSA is a protein that is produced by normal and malignant prostate cells, and is useful as a preoperative marker, as prostate cancer causes PSA to be discharged into the blood. Table 6 shows that all the methods select the same variables in the location parameter (lcavol, lweight and svi). These selected variables have positive location coefficients (across all methods), indicating that an increase in the log(cancer volume), log(prostate weight) and the presence of SVI are associated with an increase in the response, log(PSA), and therefore may be indicative of prostate cancer. The MPR-SIC method also identifies lweight and svi as important variables for the dispersion parameter. In the case of log(prostate weight), although it has a positive coefficient in the location parameter, it has a negative coefficient in the dispersion parameter. Therefore, higher values of log(prostate weight) are associated with higher values of log(PSA) and also lower variability. The svi variable has a positive dispersion coefficient, i.e., the presence of SVI is associated with more variable log(PSA) values. However, removal of svi from the location component only increases the BIC by 1.64 units, suggesting that svi is less important in terms of the overall level of log(PSA); in contrast, its removal from the dispersion component increases the BIC by 4.09 units.
4.3 Sniffer Data
When gasoline is pumped into a tank, hydrocarbon vapours are forced out and emitted into the atmosphere. This is a source of air pollution, and in order to reduce it, devices that capture the vapour are installed. Testing these vapour recovery systems involves a "sniffer" device that measures the amount of vapour recovered. A method of estimating the total amount released is required to estimate the efficiency of the system. A laboratory experiment was carried out to discover factors that impact the amount of hydrocarbon vapour released when gasoline is pumped into a tank. Four factors are varied: vapour pressure (psi) of the dispensed gasoline (gaspres), temperature (°F) of the dispensed gasoline (gastemp), initial tank temperature (°F) (tanktemp) and initial vapour pressure (psi) in the tank (tankpres). The quantity of emitted hydrocarbon (g) is the response variable. There are 125 runs in the data. These data have previously been considered by Weisberg (2013), who noted that the dispersion may depend on the predictors but did not apply a heteroscedastic model, and by Bedrick (2000), who used a model with all four predictors in the location along with gastemp and gaspres in the dispersion.
The MPR-SIC and SPR-SIC methods each select a different combination of variables for the location parameter, while the ALASSO-IC does not eliminate any variables (Table 6). In terms of the location, the MPR-SIC method selects tanktemp but not tankpres, whereas the SPR-SIC does the opposite. For all three methods, gaspres and gastemp have positive location coefficients, indicating that an increase in these variables is related to an increase in the quantity of hydrocarbon emitted. All of the coefficients are significant at the 95% level. The MPR-SIC method identifies that gastemp is associated with an increase in the variability of the quantity of hydrocarbon emitted from the tank. The change in BIC from dropping this variable from the dispersion parameter is 17.62 units, which, interestingly, is far greater than the impact of dropping tanktemp from the location component (a BIC increase of 5.6 units); this demonstrates that modelling only the location, as is most often done in practice, can miss important features of the process under study.
4.4 Boston House Price Data
Data from a crosssectional study of 506 communities in the Boston area carried out in 1970 (Harrison Jr and Rubinfeld, 1978) are available in Wooldridge (2015). The association between median house prices in a particular community with various community characteristics is examined. There are eight explanatory variables: average number of rooms per house (rooms), percentage of the population that are “lower status” (lowstat), average studentteacher ratio of schools in the community (stratio), log(property tax per $1000) (lproptax), log(weighted distances to five employment centres in the Boston region) (ldist), crimes committed per capita (crime), log(annual average nitrogen oxide concentration (pphm)) (lnox) and index of accessibility to radial highways (radial). The log(median house price ($)) is the dependent variable. DiCiccio et al. (2019) applied a weighted least squares approach to these data (which accounts for heterogeneity but does not model the dispersion), where they only considered the rooms, stratio, ldist and lnox variables.
Table 6 shows that all eight covariates are included in the location component across the MPR-SIC, SPR-SIC and ALASSO-IC methods. All the coefficients are significant at the 95% level, and the signs of the estimated location coefficients agree across the methods. Only two covariates (rooms and radial) have a positive effect on house prices. Houses with a greater number of rooms and access to radial highways are generally desirable, which results in increased house prices. The remaining variables may be considered undesirable, thus reducing the median house prices. In particular, the percentage of the population in the community that are "lower status" and the student-teacher ratio of schools in the community both have a sizable impact on the BIC when they are removed from the location parameter. The MPR-SIC method selects three covariates in the dispersion parameter (lowstat, ldist and radial). Of these, radial is associated with increased house prices (as previously discussed) but also increased variability, i.e., it is generally positive to have access to the highway, but there may be exceptions leading to increased variability, which, for example, could be due to badly positioned houses where there is road noise or reduced privacy (features not captured in the data). In contrast, the distance to employment centres (ldist) is associated with reduced house prices and reduced variability, so that being farther from work appears to be more consistently negative. We note that dropping ldist from the dispersion leads to a greater change in BIC than dropping three of the eight variables from the location component, again demonstrating the importance of modelling the dispersion.



4.5 Prediction Coverage Probabilities
Table 7 contains the out-of-sample PCPs, overall and split by variability category, calculated using 10-fold cross-validation for the prostate cancer, sniffer, and Boston house price data. Viewed overall, the MPR-SIC, SPR-SIC and ALASSO-IC methods perform similarly across all three data analyses. As expected from the simulation studies, coverage improves with sample size, as seen by comparing the smaller prostate cancer and sniffer datasets with the larger Boston house price dataset. The general pattern is that the SPR-SIC and ALASSO-IC methods produce overly wide PIs for the low- and medium-variability categories and overly narrow PIs for the high-variability category, although their coverage is adequate for the low-variability cases in the two smaller datasets (prostate cancer and sniffer data). The MPR-SIC approach behaves in the opposite way, but is more balanced: it achieves good coverage on the larger Boston house price data, whereas the other two methods continue to produce overly wide intervals for the low and medium categories and overly narrow intervals for the high category.
         MPR-SIC             SPR-SIC             ALASSO-IC
         (a)   (b)   (c)     (a)   (b)   (c)     (a)   (b)   (c)
Low      0.88  0.86  0.91    0.95  0.94  0.99    0.95  0.96  0.99
Medium   0.89  0.90  0.94    0.89  0.99  0.98    0.90  0.99  0.98
High     0.96  0.95  0.92    0.79  0.87  0.80    0.79  0.87  0.80
Overall  0.89  0.90  0.93    0.86  0.92  0.93    0.87  0.93  0.93

(a) Prostate cancer data, low: , medium: , high: .
(b) Sniffer data, low: , medium: , high: .
(c) Boston house price data, low: , medium: , high: .
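To make the construction of these PCPs concrete, the following is a minimal sketch of an out-of-sample PCP computed by k-fold cross-validation for a homoscedastic normal linear model, which is a simplified stand-in for the fitted models above; the data are simulated and the function name is our own.

```python
import numpy as np
from scipy.stats import norm

def pcp_ols(X, y, level=0.95, k=10, seed=1):
    """Out-of-sample prediction coverage probability: the fraction of
    held-out responses falling inside the prediction interval of an OLS
    model (constant variance) fitted to the remaining folds."""
    n = len(y)
    z = norm.ppf(0.5 + level / 2)  # ~1.96 for a 95% interval
    idx = np.random.default_rng(seed).permutation(n)
    hits = 0
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # residual standard deviation with degrees-of-freedom correction
        sigma = np.std(y[train] - X[train] @ beta, ddof=X.shape[1])
        hits += np.sum(np.abs(y[test] - X[test] @ beta) <= z * sigma)
    return hits / n

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 0.0]) + rng.normal(size=n)
pcp = pcp_ols(X, y)  # should sit close to the nominal 0.95
```

In the heteroscedastic (MPR) case, the constant sigma above would be replaced by an observation-specific standard deviation from the fitted dispersion model, which is why the MPR-SIC intervals adapt across variability categories.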
5 Discussion
Our proposed variable selection procedure uses a smooth approximation of the ℓ0 norm to facilitate smooth information criterion optimization, and is extended for application in the developing area of multi-parameter regression. This enables straightforward selection of the more complex covariate effects afforded by modelling multiple distributional parameters simultaneously. Because the objective function is smooth, this is achieved using standard gradient-based optimization procedures, e.g., Newton-Raphson. Moreover, because the objective function is an information criterion, the approach circumvents the need for penalty tuning parameter optimization; for example, the penalty parameter is fixed at log(n) in the BIC case. Such tuning can be computationally intensive in LASSO-type problems, especially in the context of MPR modelling, due to the additional parameters to be estimated and the fact that, in theory, there may be a separate tuning parameter for each distributional parameter. We provide a publicly available R package, smoothic, implementing our proposed methods (O'Neill and Burke, 2021).
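The idea of optimizing a smooth information criterion directly can be illustrated in a toy version (not the smoothic implementation): replace the ℓ0 "norm", which counts nonzero coefficients, with a smooth surrogate such as φ_ε(β) = β²/(β² + ε²), weight it by log(n) as in the BIC, and hand the resulting differentiable objective to a standard gradient-based optimizer. The particular surrogate, the fixed ε, and the unit-variance Gaussian likelihood are simplifying assumptions made here for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_bic(beta, X, y, eps=0.1):
    """RSS (-2 x Gaussian log-likelihood up to constants, unit variance)
    plus a BIC-style penalty using the smooth l0 surrogate
    phi_eps(b) = b^2 / (b^2 + eps^2), which tends to 1{b != 0} as eps -> 0."""
    n = len(y)
    resid = y - X @ beta
    penalty = np.sum(beta[1:] ** 2 / (beta[1:] ** 2 + eps ** 2))  # intercept unpenalized
    return resid @ resid + np.log(n) * penalty

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])
beta_true = np.array([1.0, 2.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

fit = minimize(smooth_bic, np.full(7, 0.1), args=(X, y), method="BFGS")
beta_hat = fit.x  # truly-zero coefficients are shrunk towards zero
```

No grid of tuning parameters is searched: the single optimization above plays the role of fitting a path of penalized models and then selecting among them by BIC.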
Through extensive simulation studies, we have demonstrated that the procedure performs very favourably in terms of variable selection, parameter inference, and out-of-sample prediction; this is true in both single- and multi-parameter settings. Results from the real data analyses illustrate the effectiveness of our procedure; in particular, the advantage of modelling the dispersion is clear from the fact that we have found dispersion effects that are stronger (in BIC terms) than some location effects.
The methods proposed in this article are not restricted to the normal distribution. They can be extended to non-normal models, for example, the generalized linear model family. Moreover, we anticipate that the methods will also extend to the setting of robust statistical modelling, which is particularly important given the ever-growing presence of complex datasets (Fan et al., 2014; Avella Medina and Ronchetti, 2015; Maronna et al., 2019). Such an extension would be capable of dealing with heteroscedasticity and outliers while also carrying out variable selection and parameter estimation simultaneously; this will be a focus of our future work.
Acknowledgements
This work was carried out within the Confirm Smart Manufacturing Research Centre (https://confirm.ie/) funded by Science Foundation Ireland (grant number: 16/RC/3918). We would also like to thank Prof. James Gleeson for his support.
Appendix A Simulation Results: Single-Parameter Setting
This appendix contains simulation results for data simulated using the normal SPR model.
True coefficient values:

Location (β):    0  1  0.5  0.5  1  0.5  1  0  0  0  0  0  0
Dispersion (α):  0  0  0    0    0  0    0  0  0  0  0  0  0

Binary covariates indicated in bold.
         MPR-SIC             SPR-SIC             ALASSO-IC
n        100   500   1000    100   500   1000    100   500   1000
Overall  0.91  0.94  0.95    0.93  0.94  0.95    0.94  0.95  0.95

Out-of-sample coverage is calculated for a sample 20% the size of the original data.
       MPR-SIC                      SPR-SIC                      ALASSO-IC
n      C(6)   IC(0)  PT    MSE     C(6)   IC(0)  PT    MSE     C(6)   IC(0)  PT    MSE
100    5.64   0.01   0.68  0.10    5.73   0.01   0.75  0.08    5.54   0.00   0.66  0.09
500    5.92   0.00   0.92  0.02    5.92   0.00   0.92  0.02    5.91   0.00   0.93  0.02
1000   5.94   0.00   0.95  0.01    5.95   0.00   0.95  0.01    5.94   0.00   0.95  0.01

n      C(12)  IC(0)  PT    MSE     C(12)  IC(0)  PT    MSE     C(12)  IC(0)  PT    MSE
100    11.31  0.00   0.52  0.18    12.00  0.00   1.00  0.03    12.00  0.00   1.00  0.02
500    11.84  0.00   0.84  0.01    12.00  0.00   1.00  0.00    12.00  0.00   1.00  0.00
1000   11.88  0.00   0.89  0.00    12.00  0.00   1.00  0.00    12.00  0.00   1.00  0.00

C, average correct zeros; IC, average incorrect zeros; PT, the probability of choosing the true model;
MSE, the average mean squared error.
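The selection measures C, IC, and PT can be computed from a matrix of estimated coefficients across replications as follows; the helper `selection_metrics` is our own name and is a sketch of the definitions rather than code from the study.

```python
import numpy as np

def selection_metrics(estimates, beta_true, tol=1e-8):
    """C:  average number of truly-zero coefficients estimated as zero;
    IC: average number of truly-nonzero coefficients estimated as zero;
    PT: proportion of replications recovering the exact true support."""
    est_zero = np.abs(estimates) < tol   # shape (replications, p)
    true_zero = np.abs(beta_true) < tol
    C = est_zero[:, true_zero].sum(axis=1).mean()
    IC = est_zero[:, ~true_zero].sum(axis=1).mean()
    PT = np.mean(np.all(est_zero == true_zero, axis=1))
    return C, IC, PT

beta_true = np.array([1.0, 0.5, 0.0, 0.0])
ests = np.array([[1.1, 0.4, 0.0, 0.0],   # exact support recovered
                 [0.9, 0.6, 0.2, 0.0],   # one truly-zero coefficient kept
                 [1.0, 0.0, 0.0, 0.0]])  # one truly-nonzero coefficient zeroed
C, IC, PT = selection_metrics(ests, beta_true)  # 5/3, 1/3, 1/3
```

C is reported against its ideal value in the table headers (e.g., C(6) means six true zeros), so a method approaches oracle behaviour as C reaches that target while IC stays at zero.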
MPR-SIC
             n = 100                   n = 500                   n = 1000
True   Est   SE    SEE   CP     Est   SE    SEE   CP     Est   SE    SEE   CP
1.00   1.00  0.11  0.10  0.91   1.00  0.05  0.04  0.95   1.00  0.03  0.03  0.94
0.50   0.50  0.12  0.09  0.91   0.50  0.04  0.04  0.95   0.50  0.03  0.03  0.95
0.50   0.50  0.11  0.10  0.91   0.50  0.04  0.04  0.96   0.50  0.03  0.03  0.94
1.00   1.00  0.11  0.09  0.91   1.00  0.05  0.04  0.94   1.00  0.03  0.03  0.95
0.50   0.49  0.11  0.09  0.90   0.50  0.05  0.04  0.94   0.50  0.03  0.03  0.94
1.00   1.00  0.11  0.09  0.90   1.00  0.05  0.04  0.94   1.00  0.03  0.03  0.95

SPR-SIC
             n = 100                   n = 500                   n = 1000
True   Est   SE    SEE   CP     Est   SE    SEE   CP     Est   SE    SEE   CP
1.00   1.00  0.10  0.10  0.93   1.00  0.05  0.04  0.95   1.00  0.03  0.03  0.94
0.50   0.50  0.11  0.10  0.94   0.50  0.04  0.04  0.95   0.50  0.03  0.03  0.95
0.50   0.50  0.10  0.10  0.94   0.50  0.04  0.04  0.96   0.50  0.03  0.03  0.94
1.00   1.00  0.10  0.10  0.95   1.00  0.05  0.04  0.94   1.00  0.03  0.03  0.95
0.50   0.49  0.10  0.10  0.93   0.50  0.05  0.04  0.94   0.50  0.03  0.03  0.94
1.00   1.00  0.10  0.10  0.93   1.00  0.05  0.04  0.95   1.00  0.03  0.03  0.95

ALASSO-IC
             n = 100                   n = 500                   n = 1000
True   Est   SE    SEE   CP     Est   SE    SEE   CP     Est   SE    SEE   CP
1.00   0.98  0.11  0.10  0.94   1.00  0.05  0.04  0.95   1.00  0.03  0.03  0.94
0.50   0.46  0.12  0.10  0.91   0.49  0.04  0.04  0.94   0.49  0.03  0.03  0.94
0.50   0.46  0.11  0.10  0.90   0.49  0.05  0.04  0.94   0.49  0.03  0.03  0.94
1.00   0.98  0.10  0.10  0.94   0.99  0.05  0.04  0.94   1.00  0.03  0.03  0.95
0.50   0.45  0.11  0.10  0.90   0.49  0.05  0.04  0.93   0.49  0.03  0.03  0.94
1.00   0.98  0.11  0.10  0.93   1.00  0.05  0.04  0.94   1.00  0.03  0.03  0.95

SE, standard deviation of estimates over 1000 replications; SEE, average of estimated standard errors
over 1000 replications; CP, the empirical coverage probability of a nominal 95% confidence interval.
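The empirical coverage probabilities (CP) in these tables are Wald-interval coverages across replications; a minimal sketch follows (the function name is ours and the data are simulated).

```python
import numpy as np
from scipy.stats import norm

def coverage_probability(estimates, std_errors, truth, level=0.95):
    """Empirical coverage: fraction of replications whose Wald interval
    estimate +/- z * SE contains the true coefficient value."""
    z = norm.ppf(0.5 + level / 2)
    lo = estimates - z * std_errors
    hi = estimates + z * std_errors
    return np.mean((lo <= truth) & (truth <= hi))

rng = np.random.default_rng(0)
# 1000 replications of an estimator with true sampling SD 0.1 around truth 0.5,
# using the correct standard error in each replication
est = 0.5 + 0.1 * rng.normal(size=1000)
cp = coverage_probability(est, np.full(1000, 0.1), 0.5)  # close to nominal 0.95
```

When SEE underestimates SE (as in the low-n columns above), the intervals are too narrow and CP falls below the nominal 95% level; CP recovers as the two quantities agree at larger n.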
Appendix B Simulation Results: Multi-Parameter Setting
Additional simulation results for data simulated using the normal MPR model are presented here.
SPR-SIC
             n = 100                   n = 500                   n = 1000
True   Est   SE    SEE   CP     Est   SE    SEE   CP     Est   SE    SEE   CP
1.00   0.98  0.33  0.21  0.88   1.01  0.12  0.11  0.93   1.00  0.09  0.08  0.92
0.50   0.41  0.40  0.12  0.48   0.49  0.19  0.10  0.82   0.50  0.12  0.08  0.83
0.50   0.37  0.36  0.12  0.51   0.49  0.15  0.10  0.89   0.50  0.09  0.08  0.92
1.00   0.98  0.31  0.21  0.88   1.00  0.11  0.11  0.95   1.00  0.08  0.08  0.94
0.50   0.37  0.36  0.12  0.52   0.49  0.14  0.10  0.93   0.50  0.08  0.08  0.95
1.00   0.98  0.31  0.21  0.89   1.00  0.12  0.11  0.94   0.99  0.08  0.08  0.95

ALASSO-IC
             n = 100                   n = 500                   n = 1000
True   Est   SE    SEE   CP     Est   SE    SEE   CP     Est   SE    SEE   CP
1.00   0.87  0.32  0.22  0.87   0.98  0.13  0.11  0.92   0.98  0.09  0.08  0.90
0.50   0.35  0.34  0.14  0.59   0.44  0.18  0.10  0.78   0.46  0.12  0.08  0.79
0.50   0.32  0.30  0.14  0.61   0.44  0.15  0.11  0.83   0.47  0.10  0.08  0.88
1.00   0.87  0.32  0.22  0.85   0.97  0.12  0.11  0.93   0.98  0.08  0.08  0.93
0.50   0.31  0.29  0.14  0.60   0.44  0.14  0.11  0.88   0.47  0.08  0.08  0.92
1.00   0.87  0.32  0.22  0.86   0.97  0.12  0.11  0.91   0.98  0.08  0.08  0.93

SE, standard deviation of estimates over 1000 replications; SEE, average of estimated standard errors
over 1000 replications; CP, the empirical coverage probability of a nominal 95% confidence interval.
Appendix C Real Data Analyses: Additional Results
This appendix shows additional results for the three real data analyses.



References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
Avella Medina, M. and Ronchetti, E. (2015). Robust statistics: a selective overview and new directions. Wiley Interdisciplinary Reviews: Computational Statistics, 7(6):372–393.
Bedrick, E. J. (2000). Checking for lack of fit in linear models with parametric variance functions. Technometrics, 42(3):226–236.
Burke, K., Jones, M., and Noufaily, A. (2020). A flexible parametric modelling framework for survival analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(2):429–457.
Burke, K. and MacKenzie, G. (2017). Multi-parameter regression survival modeling: An alternative to proportional hazards. Biometrics, 73(2):678–686.
Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference. Journal of the Royal Statistical Society: Series B (Methodological), 49(1):1–18.
DiCiccio, C. J., Romano, J. P., and Wolf, M. (2019). Improving weighted least squares inference. Econometrics and Statistics, 10:96–119.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.
Fan, J., Fan, Y., and Barut, E. (2014). Adaptive robust variable selection. The Annals of Statistics, 42(1):324.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
Fan, J. and Li, R. (2002). Variable selection for Cox's proportional hazards model and frailty model. The Annals of Statistics, pages 74–99.
Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101.
Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.
Groll, A., Hambuckers, J., Kneib, T., and Umlauf, N. (2019). Lasso-type penalization in the framework of generalized additive models for location, scale and shape. Computational Statistics & Data Analysis, 140:59–73.
Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102.
Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica: Journal of the Econometric Society, pages 461–465.
Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. The Annals of Statistics, 33(4):1617.
Lloyd-Jones, L. R., Nguyen, H. D., and McLachlan, G. J. (2018). A globally convergent algorithm for lasso-penalized mixture of linear regression models. Computational Statistics & Data Analysis, 119:19–38.
Maronna, R. A., Martin, R. D., Yohai, V. J., and Salibián-Barrera, M. (2019). Robust Statistics: Theory and Methods (with R). John Wiley & Sons.
Mayr, A., Fenske, N., Hofner, B., Kneib, T., and Schmid, M. (2012). Generalized additive models for location, scale and shape for high dimensional data—a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):403–427.
Oelker, M.-R. and Tutz, G. (2017). A uniform framework for the combination of penalties in generalized structured models. Advances in Data Analysis and Classification, 11(1):97–120.
O'Neill, M. and Burke, K. (2021). smoothic: Variable Selection Using a Smooth Information Criterion. R package.
Reid, S., Tibshirani, R., and Friedman, J. (2016). A study of error variance estimation in lasso regression. Statistica Sinica, pages 35–67.
Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3):507–554.
Rutemiller, H. C. and Bowers, D. A. (1968). Estimation in a heteroscedastic regression model. Journal of the American Statistical Association, 63(322):552–557.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, pages 221–242.
Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of Urology, 141(5):1076–1083.
Stasinopoulos, D. M. and Rigby, R. A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7):1–46.
Stasinopoulos, M. D., Rigby, R. A., and Bastiani, F. D. (2018). GAMLSS: a distributional regression approach. Statistical Modelling, 18(3-4):248–273.
Su, X. (2015). Variable selection via subtle uprooting. Journal of Computational and Graphical Statistics, 24(4):1092–1113.
Su, X., Fan, J., Levine, R. A., Nunn, M. E., and Tsai, C.-L. (2018). Sparse estimation of generalized linear models (GLM) via approximated information criteria. Statistica Sinica, 28(3):1561–1581.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pages 267–288.
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine, 16(4):385–395.
Wang, H. and Leng, C. (2007). Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association, 102(479):1039–1048.
Wang, H., Li, B., and Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671–683.
Weisberg, S. (2013). Applied Linear Regression. John Wiley & Sons.
Wooldridge, J. M. (2015). Introductory Econometrics: A Modern Approach. Cengage Learning.
Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937–950.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.