1 Introduction
Regression models are important tools for studying the relationships between response variables and predictor variables. Often, there are many predictor variables available for building a regression model, but some of these variables may be inactive in the sense that they have no impact on the response. For the purposes of statistical inference and scientific discovery, it is necessary that we identify the true model containing only and all active variables. To set up notation, consider a full regression model with $p$ predictor variables $x_1, x_2, \dots, x_p$,

$$y \sim f(y \mid X, \beta), \quad (1)$$

where $y$ is an $n \times 1$ vector of independent observations of the response variable, $X$ is the $n \times (p+1)$ design matrix whose first column is a column of ones for the intercept, and $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is the unknown vector of regression parameters. The marginal distributions of the elements of $y$ are assumed to be of the same type, usually in the exponential family of distributions, and known except for the values of their parameters. There may be other parameters besides $\beta$, but they are not of interest in the context of model selection. With the above notation, a variable $x_j$ is said to be active if its parameter $\beta_j \neq 0$ and inactive if $\beta_j = 0$. An important case of (1) is the generalized linear model, which may be written as $E(y) = g^{-1}(X\beta)$, where $\eta = X\beta$ is the vector of linear predictors and $g^{-1}$ is the inverse of the link function. Let $S$ be the collection of the $2^p$ subsets of the $p$ variables in the full model (1), where each $s \in S$ represents a subset. We will call each $s$ a model, as it defines a reduced model,

$$y \sim f(y \mid X_s, \beta_s),$$
where $X_s$ is the design matrix containing only the intercept column and the variables in $s$, and $\beta_s$ is the parameter vector for the variables in $s$. Throughout this paper, we are only concerned with the classical low-dimensional setting where $p$ is fixed and $n > p + 1$, and we assume that the estimation problem for each $s \in S$ is well-posed in that the maximum likelihood estimator for $\beta_s$, denoted by $\hat{\beta}_s$, exists and is unique. We also assume that the true model containing only and all active variables is in $S$. We denote the true model by $s^*$, and our objective is to identify it from the models in $S$.

There is a large body of literature on model selection. Ding, Tarokh and Yang (2018a) and Kadane and Lazar (2004) gave comprehensive reviews of the related literature. Wit, van den Heuvel and Romeijn (2012) provided a stimulating discussion of the principal ideas and philosophical debates concerning model selection. Here, we only briefly review two commonly used model selection criteria, the Akaike Information Criterion (AIC) of Akaike (1974) and the Bayesian Information Criterion (BIC) of Schwarz (1978), which are part of the motivation behind the present work. The AIC approach does not assume that the underlying mechanism (true model) that generated the data is in the set of models under consideration. It selects the model from the set that minimizes the Kullback–Leibler divergence between the fitted and the true model. Denote by
$l(s)$ the maximum loglikelihood of model $s$ and by $k_s$ the number of predictor variables in $s$. In the present context of selecting a regression model from $S$, the AIC of model $s$ is given by

$$\mathrm{AIC}(s) = -2\,l(s) + 2 k_s. \quad (2)$$

The model with the smallest AIC value is the model that minimizes the Kullback–Leibler divergence in an asymptotic sense. The AIC (2) is a penalized measure of fit of $s$, with the fit measured by its loglikelihood $l(s)$ and the penalty term $2k_s$ proportional to its size $k_s$. For small sample situations, corrected penalty terms have been proposed by several authors, including Hurvich and Tsai (1989) and Broersen (2000). The BIC approach tackles the model selection problem from a Bayesian perspective by assuming that the parameter vector of the model follows a prior distribution. It selects the model with the largest marginal likelihood, which is asymptotically equivalent to selecting the model with the minimum BIC, where
$$\mathrm{BIC}(s) = -2\,l(s) + k_s \log(n). \quad (3)$$
The BIC (3) is also a penalized measure of fit, where the fit is again measured by the loglikelihood and the penalty term is proportional to the size of the model. Unlike the AIC, which is not consistent even when the true model is in the set under consideration, the BIC is consistent under certain conditions (Rao and Wu, 1989). Apart from the AIC, BIC and their variants, there are other criteria based on the penalized loglikelihood, such as the Hannan–Quinn Information Criterion (Hannan and Quinn, 1979) and the Bridge Criterion (Ding, Tarokh and Yang, 2018b). To summarize, using the penalized loglikelihood for model selection is a popular way of balancing the fit and the size of the selected model, and it is one of the key ideas in model selection. The fact that methods with completely different motivations have resulted in criteria with similar penalized loglikelihood forms such as (2) and (3) shows the inherent importance of the loglikelihood as a measure of fit for model selection.
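As a quick illustration of (2) and (3), the sketch below computes both criteria from a model's maximized loglikelihood. The loglikelihood values are hypothetical, and Python is used purely for illustration (the simulations later in the paper use R).

```python
import math

def aic(loglik, k):
    # AIC(s) = -2 * maximum loglikelihood + 2 * (number of predictors in s)
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    # BIC(s) = -2 * maximum loglikelihood + log(n) * (number of predictors in s)
    return -2.0 * loglik + math.log(n) * k

# Hypothetical fits: the larger model fits better (higher loglikelihood),
# but after the penalty both criteria prefer the smaller model here.
n = 50
small = (aic(-120.0, 2), bic(-120.0, 2, n))
large = (aic(-118.5, 5), bic(-118.5, 5, n))
```

Since $\log(n) > 2$ once $n \ge 8$, the BIC penalizes model size more heavily than the AIC, which is one reason it tends to select smaller models.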
In this paper, we study the use of the closely related loglikelihood ratio as a measure of fit for model selection. The maximum loglikelihood ratio of a model $s$ is

$$LR(s) = 2\,[\,l(\hat{\beta}) - l(s)\,], \quad (4)$$

where $\hat{\beta}$ is the maximum likelihood estimator and $l(\hat{\beta})$ is the maximum loglikelihood of the full model with all $p$ variables. The $LR(s)$ provides a relative measure of fit of model $s$ with respect to the full model. It has an important advantage over the loglikelihood in that its value may be directly used to evaluate the plausibility of model $s$, because the asymptotic null distribution of $LR(s)$ is known to be a $\chi^2$ distribution, whereas the value of the loglikelihood of a model alone does not carry information about the plausibility of the model. To use $LR(s)$ for model selection, instead of penalizing it with a penalty term proportional to model size, we take advantage of the null distribution of $LR(s)$ to look for a set of plausible models using the likelihood ratio test. Then, from this set, we select the smallest model. This amounts to giving the fit (as represented by the loglikelihood ratio) a higher priority and minimizing the size subject to a lower bound on the fit. We refer to this approach as the constrained minimum method for model selection. Tsao (2021) first studied this method for selecting Gaussian linear models under an approximated likelihood ratio test with a decreasing significance level, and showed that the method is consistent for Gaussian linear models. The present paper uses the exact likelihood ratio test with a fixed significance level and generalizes the method to all regression models. For large $n$ and small $\alpha$, we show that there is a high probability that the smallest model in the set of plausible models is the true model. This provides an asymptotic justification for using the constrained minimum method for model selection. Further, since the ultimate justification of a model selection method is its accuracy in finite sample applications, we also provide empirical evidence for the excellent finite sample accuracy of the constrained minimum method.
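In code, the maximum loglikelihood ratio (4) is just twice the gap between two maximized loglikelihoods. The sketch below uses hypothetical values, and the chi-squared critical value is hardcoded as an assumed cutoff (it is the 0.9 quantile for 10 degrees of freedom).

```python
def loglik_ratio(loglik_full, loglik_s):
    # LR(s) = 2 * [max loglikelihood of full model - max loglikelihood of model s]
    return 2.0 * (loglik_full - loglik_s)

# Hypothetical maximized loglikelihoods for the full model and a submodel s:
lr = loglik_ratio(-235.1, -242.8)

# Model s is deemed plausible if LR(s) falls below the chi-squared quantile;
# 15.987 is the 0.9 quantile with 10 degrees of freedom (an assumed setting).
plausible = lr <= 15.987
```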
The rest of this paper is organized as follows. In Section 2, we present the constrained minimum method based on the likelihood ratio test for selecting regression models. In Section 3, we compare this method to the AIC and BIC in terms of selection accuracy in a simulation study with examples of linear, logistic and Poisson regression models. We also discuss the choice of the significance level for the underlying likelihood ratio test. In Section 4, we apply the constrained minimum method to perform model selection for logistic regression for a South African heart disease dataset. We conclude with a few remarks in Section 5.
2 The constrained minimum criterion
Denote by $\beta^*$ the true value of the regression parameter vector for the full model. Here, $\beta^*$ is a $(p+1) \times 1$ vector, and its elements corresponding to inactive variables are all zero. For simplicity, we make the following three assumptions for all regression models under consideration. The first assumption is that the maximum likelihood estimator $\hat{\beta}$ for $\beta^*$ based on the full model is consistent, that is,

$$\|\hat{\beta} - \beta^*\| \xrightarrow{\;p\;} 0 \quad \text{as } n \to \infty. \quad (5)$$
The second assumption is that the null distribution of the loglikelihood ratio $2\,[\,l(\hat{\beta}) - l(\beta^*)\,]$, where $l(\beta)$ denotes the loglikelihood of the full model at $\beta$, converges to the usual $\chi^2_{p+1}$ distribution when the sample size goes to infinity, that is,

$$2\,[\,l(\hat{\beta}) - l(\beta^*)\,] \xrightarrow{\;d\;} \chi^2_{p+1}. \quad (6)$$
By (6), for any fixed $\alpha \in (0, 1)$, a $100(1-\alpha)\%$ asymptotic confidence region for $\beta^*$ is

$$C(\alpha) = \left\{\beta : 2\,[\,l(\hat{\beta}) - l(\beta)\,] \le \chi^2_{p+1}(1-\alpha)\right\}, \quad (7)$$

where $\chi^2_{p+1}(1-\alpha)$ denotes the $(1-\alpha)$th quantile of the $\chi^2_{p+1}$ distribution. The centre of this $(p+1)$-dimensional confidence region is $\hat{\beta}$, as $\hat{\beta}$ attains the smallest value of $2\,[\,l(\hat{\beta}) - l(\beta)\,]$, namely zero. The third assumption is that the size of the confidence region goes to zero as $n$ goes to infinity, in the sense that

$$\sup_{\beta \in C(\alpha)} \|\beta - \hat{\beta}\| \xrightarrow{\;p\;} 0. \quad (8)$$
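The quantile $\chi^2_{p+1}(1-\alpha)$ in (7) is available in any statistics package (e.g. `qchisq` in R). As a dependency-free sketch, the Wilson–Hilferty approximation below uses only the Python standard library; the approximation is an addition for illustration, not part of the paper's method.

```python
from statistics import NormalDist

def chi2_quantile(q, df):
    """Approximate the q-th quantile of the chi-squared distribution with
    df degrees of freedom via the Wilson-Hilferty cube approximation."""
    z = NormalDist().inv_cdf(q)  # standard normal quantile
    c = 1.0 - 2.0 / (9.0 * df) + z * (2.0 / (9.0 * df)) ** 0.5
    return df * c ** 3

# With df = p + 1 = 10, the 0.1, 0.5 and 0.9 quantiles come out close to
# the exact values 4.865, 9.342 and 15.987 used later in the paper.
```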
For commonly used regression models, regularity conditions for the consistency and asymptotic normality of the maximum likelihood estimator of the full model are available in the literature. It may be verified that assumptions (5), (6) and (8) hold under these conditions. For linear models, a commonly used set of two such regularity conditions is

$$\frac{1}{n} X^T X \to \Sigma \quad \text{as } n \to \infty,$$

where $x_i^T$ is the $i$th row of $X$ and $\Sigma$ is a positive definite matrix, and

$$\max_{1 \le i \le n} x_i^T (X^T X)^{-1} x_i \to 0 \quad \text{as } n \to \infty.$$
For generalized linear models, such regularity conditions are discussed by several authors including Haberman (1977), Gourieroux and Monfort (1981) and Fahrmeir and Kaufmann (1985).
The confidence region $C(\alpha)$ contains the collection of $\beta$ vectors not rejected by the likelihood ratio test for $\beta^*$ at the given $\alpha$ level. As such, it represents the set of plausible $\beta$ vectors at the $\alpha$ level. To extend the notion of plausibility from a vector to a model $s$, we first find a vector to represent model $s$. The maximum likelihood estimator $\hat{\beta}_s$ for $\beta_s$ is a vector of dimension $k_s + 1$, which is less than $p + 1$ when $s$ is not the full model. It is usually a continuous random vector, so with probability one none of its elements is zero. We augment the dimension of $\hat{\beta}_s$ to $p + 1$ by adding zeros as its elements to represent the variables not in $s$. For example, if $x_1$ is not in $s$, then the second element of the augmented $\hat{\beta}_s$ (which corresponds to $x_1$) is a zero. For simplicity, we still use the same notation $\hat{\beta}_s$, but it is now a $(p+1) \times 1$ vector representing $s$, and its nonzero elements correspond to the intercept and the variables in $s$. We say that model $s$ is plausible at the $\alpha$ level if $\hat{\beta}_s$ is in the confidence region $C(\alpha)$. Alternatively, we may also say that $s$ is plausible if $LR(s)$ is less than $\chi^2_{p+1}(1-\alpha)$. Note that although we need the augmented $(p+1)$-dimensional version of $\hat{\beta}_s$ to define the plausibility of its corresponding model $s$, when computing the maximum loglikelihood ratio $LR(s)$ of this model, $\hat{\beta}_s$ may be either the augmented $(p+1)$-dimensional version or the original $(k_s+1)$-dimensional version, as they both give the same value of $LR(s)$. In numerical computations of $LR(s)$, we use the $(k_s+1)$-dimensional version as it appears in (4), which is more convenient. Using the $\ell_0$ norm $\|\cdot\|_0$, which counts the number of nonzero elements in a vector, we define the constrained minimum criterion (CMC) based on the likelihood ratio test as the criterion that chooses the model represented by the solution of the following constrained optimization problem:
$$\min_{s \in S} \|\hat{\beta}_s\|_0 \quad \text{subject to} \quad \hat{\beta}_s \in C(\alpha). \quad (9)$$
We call the solution vector of this optimization problem the CMC solution, denoted by $\tilde{\beta}$, and its corresponding model the CMC selection, denoted by $\tilde{s}$. When there are multiple solution vectors, we choose the one with the highest likelihood as the CMC solution.
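Operationally, (9) can be solved by enumerating the candidate models, keeping those whose loglikelihood ratio passes the test, and returning the smallest survivor. The sketch below uses hypothetical model labels and loglikelihood ratios; ties in size are broken by the smaller loglikelihood ratio, i.e. the higher likelihood, as described above.

```python
def cmc_select(models, cutoff):
    """Solve (9) by enumeration.  `models` maps a label to a pair
    (number_of_predictors, loglik_ratio); `cutoff` is the chi-squared
    quantile defining the confidence region C(alpha)."""
    plausible = {m: v for m, v in models.items() if v[1] <= cutoff}
    if not plausible:
        return None  # no model passes the likelihood ratio test
    # Smallest model first; break ties by the smaller loglikelihood ratio.
    return min(plausible, key=lambda m: (plausible[m][0], plausible[m][1]))

# Hypothetical candidates: label -> (size, LR)
cands = {"x1": (1, 12.0), "x1+x2": (2, 3.1), "x1+x3": (2, 2.5), "x1+x2+x3": (3, 0.0)}
```

With cutoff 4.0 the plausible set is {x1+x2, x1+x3, x1+x2+x3} and the selection is x1+x3; with cutoff 1.0 only the full model is plausible.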
Denote by $\hat{\beta}_{s^*}$ the maximum likelihood estimator for the unknown true model $s^*$. The nonzero elements of this (augmented) $\hat{\beta}_{s^*}$ are those corresponding to active variables, and the zero elements are those corresponding to inactive variables. The following theorem gives the asymptotic properties of the CMC solution and selection.
Theorem 2.1

Suppose assumptions (5), (6) and (8) hold. Then, the CMC solution $\tilde{\beta}$ is consistent,

$$\|\tilde{\beta} - \beta^*\| \xrightarrow{\;p\;} 0 \quad \text{as } n \to \infty, \quad (10)$$

and the CMC selection $\tilde{s}$ satisfies the asymptotic lower bound

$$\lim_{n \to \infty} P(\tilde{s} = s^*) \ge 1 - \alpha. \quad (11)$$
The asymptotic lower bound (11) shows that when the sample size is large, we may choose a small $\alpha$ so that there is a high probability that the CMC selection is the true model. Numerical results show that the lower bound is rather loose for many $\alpha$ values when $n$ is large, in that the observed probability of the event $\{\tilde{s} = s^*\}$ is usually much larger than $1 - \alpha$. Also, when $n$ is not large, small $\alpha$ levels are not appropriate. We will discuss the selection of the $\alpha$ level with numerical examples in the next section. We now prove the theorem.
Proof of Theorem 2.1. By (5), we have $\|\hat{\beta} - \beta^*\| \xrightarrow{\;p\;} 0$. Since $\tilde{\beta} \in C(\alpha)$, by (8) we also have $\|\tilde{\beta} - \hat{\beta}\| \xrightarrow{\;p\;} 0$. It follows from these and the triangle inequality that

$$\|\tilde{\beta} - \beta^*\| \le \|\tilde{\beta} - \hat{\beta}\| + \|\hat{\beta} - \beta^*\| \xrightarrow{\;p\;} 0, \quad (12)$$

which implies the consistency of the CMC solution (10).
To prove the asymptotic lower bound in (11), note that

$$\{\hat{\beta}_{s^*} = \tilde{\beta}\} \subseteq \{\tilde{s} = s^*\} \quad (13)$$

as events, so it suffices to show that $P(\hat{\beta}_{s^*} = \tilde{\beta})$ has the asymptotic lower bound in (11). To this end, we first identify the elements of the vectors in $\{\hat{\beta}_s : s \in S\}$ that may not be zero when $n$ is large. Define the event

$A_n$ = {elements of the $\hat{\beta}_s$ in $C(\alpha)$ corresponding to nonzero elements of $\beta^*$ are also nonzero}.

Similar to (12), by the triangle inequality and (8), we have $\|\hat{\beta}_s - \beta^*\| \xrightarrow{\;p\;} 0$ uniformly for all $\hat{\beta}_s \in C(\alpha)$. It follows that the individual elements of these $\hat{\beta}_s$ converge in probability to the corresponding elements of $\beta^*$ uniformly, so $P(A_n) \to 1$ as the sample size goes to infinity. When event $A_n$ occurs, among the set of vectors $\{\hat{\beta}_s : s \in S\}$ only those for models containing all active variables can be in $C(\alpha)$, so the $\hat{\beta}_{s^*}$ of the true model is the smallest (in the $\|\cdot\|_0$ norm) member of $\{\hat{\beta}_s : s \in S\}$ that may possibly be in $C(\alpha)$. It follows that $A_n \cap \{\hat{\beta}_{s^*} \in C(\alpha)\}$ implies $\{\hat{\beta}_{s^*} = \tilde{\beta}\}$, so

$$\lim_{n \to \infty} P(\hat{\beta}_{s^*} = \tilde{\beta}) \ge \lim_{n \to \infty} P(A_n \cap \{\hat{\beta}_{s^*} \in C(\alpha)\}) = \lim_{n \to \infty} P(\hat{\beta}_{s^*} \in C(\alpha)) \quad (14)$$

as $n$ goes to infinity, where the last equality holds because $P(A_n) \to 1$. Also, the event $\{\beta^* \in C(\alpha)\}$ implies $\{\hat{\beta}_{s^*} \in C(\alpha)\}$ because $\hat{\beta}_{s^*}$ is the maximum likelihood estimator for model $s^*$, which has a higher likelihood and thus a smaller loglikelihood ratio than $\beta^*$; that is, $2\,[\,l(\hat{\beta}) - l(s^*)\,] \le 2\,[\,l(\hat{\beta}) - l(\beta^*)\,]$, and thus $\{\beta^* \in C(\alpha)\}$ implies $\{\hat{\beta}_{s^*} \in C(\alpha)\}$. This and (6) imply that
$$\lim_{n \to \infty} P(\hat{\beta}_{s^*} \in C(\alpha)) \ge \lim_{n \to \infty} P(\beta^* \in C(\alpha)) = 1 - \alpha \quad (15)$$

as $n$ goes to infinity. Equations (13), (14) and (15) then imply (11). □
The above proof follows the same steps as the proof in Tsao (2021) of the consistency of the CMC for Gaussian linear models. However, the CMC in Tsao (2021) is based on an approximated likelihood ratio statistic whose finite sample distribution is known. The $\alpha$ level for that CMC is not fixed; it goes to zero as $n$ goes to infinity. The proof of consistency depends on the finite sample distribution, which is only available for Gaussian linear models. In the present paper, the finite sample distribution of the likelihood ratio statistic is unavailable. We have only the asymptotic distribution (6), which leads to the weaker result (11) instead of consistency. Nevertheless, this does not seem to affect the accuracy of the present version of the CMC based on the likelihood ratio test, as numerical results show that it is as accurate as the consistent version for Gaussian linear models (see the numerical examples in the next section). Further, the present version can be applied to all types of regression models, not just Gaussian linear models.
3 Simulation study
We now compare the CMC based on the likelihood ratio test (9) with the AIC and BIC in terms of the false active rate (FAR) and the false inactive rate (FIR) through numerical examples. We also discuss the selection of the $\alpha$ level for the CMC. Here, the FAR is the number of inactive variables appearing in the selected model divided by the total number of inactive variables in the full model, and the FIR is the number of active variables not in the selected model divided by the total number of active variables in the full model. A model selection criterion is accurate when the FIR and FAR of its selected model are both low. To compute the examples, we use the R package ‘bestglm’ by McLeod, Xu and Lai (2020), which performs the best subset selection for generalized linear models. For the best subset selection of Gaussian linear models, ‘bestglm’ uses the ‘leaps and bounds’ algorithm by Furnival and Wilson (1974), which can handle situations with 40 or fewer predictor variables. For the best subset selection of logistic regression models and Poisson regression models, it uses a complete enumeration method by Morgan and Tatar (1972) and has a limit of 15 on the number of predictor variables allowed in the full model. In our simulation examples, we set the number of predictor variables below these limits, to a maximum of 30 for linear models and 10 for the two generalized linear models, to avoid long simulation times.
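The two error rates can be computed directly from the selected variable set; the sketch below uses hypothetical index sets for the active and inactive variables.

```python
def far_fir(selected, active, inactive):
    """FAR: fraction of inactive variables wrongly included in the selection.
    FIR: fraction of active variables wrongly excluded from it."""
    far = len(selected & inactive) / len(inactive)
    fir = len(active - selected) / len(active)
    return fir, far

# Hypothetical: variables 1-5 active, 6-10 inactive; the selection drops
# one active variable (5) and picks up two inactive ones (7 and 8).
fir, far = far_fir({1, 2, 3, 4, 7, 8}, {1, 2, 3, 4, 5}, {6, 7, 8, 9, 10})
```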
3.1 Linear model examples
The linear model used for comparison is

$$y = X\beta + \varepsilon, \quad (16)$$

where $X$ is the $n \times p$ matrix of predictor variables, $\beta = (\beta_1, \beta_2, \dots, \beta_p)^T$ with $\beta_j = 1$ for $j \le q$ and $\beta_j = 0$ for $j > q$, and $\varepsilon$ is an $n \times 1$ vector of independent standard normal errors, so only the first $q$ variables are active. Elements of all $p$ predictor variables are independent random numbers generated from the standard normal distribution. The performance of the CMC depends on the $\alpha$ level. To find the appropriate levels for different sample sizes, we consider three levels, $\alpha = 0.1$, 0.5 and 0.9.

Table 1 contains simulated values of the (FIR, FAR) pairs for five model selection criteria, AIC, BIC, CMC$_{0.9}$, CMC$_{0.5}$ and CMC$_{0.1}$, at 12 different combinations of $n$, $p$ and $q$. The subscript in CMC$_\alpha$ indicates the $\alpha$ level used. Each (FIR, FAR) pair in the table is based on 1000 simulation runs. For each run, we first generate a $(y, X)$ pair, and then perform the best subset selection using $(y, X)$ with the five criteria to find their chosen models and compute their (FIR, FAR) values based on the chosen models. After 1000 runs, we obtain 1000 (FIR, FAR) values for each criterion, and Table 1 contains the average of these 1000 values. We make the following comments based on the results in Table 1.
(n, p, q)  AIC  BIC  CMC$_{0.9}$  CMC$_{0.5}$  CMC$_{0.1}$  

(20, 10, 5)  (0.04, 0.34)  (0.05, 0.24)  (0.05, 0.25)  (0.09, 0.13)  (0.21, 0.06) 
(30, 10, 5)  (0.00, 0.25)  (0.01, 0.12)  (0.01, 0.16)  (0.02, 0.05)  (0.09, 0.01) 
(40, 10, 5)  (0.00, 0.24)  (0.00, 0.09)  (0.00, 0.13)  (0.00, 0.04)  (0.03, 0.01) 
(50, 10, 5)  (0.00, 0.22)  (0.00, 0.08)  (0.00, 0.12)  (0.00, 0.03)  (0.01, 0.00) 
(40, 20, 10)  (0.00, 0.32)  (0.00, 0.15)  (0.01, 0.12)  (0.02, 0.05)  (0.06, 0.02) 
(60, 20, 10)  (0.00, 0.25)  (0.00, 0.09)  (0.00, 0.08)  (0.00, 0.02)  (0.01, 0.00) 
(80, 20, 10)  (0.00, 0.21)  (0.00, 0.06)  (0.00, 0.05)  (0.00, 0.01)  (0.00, 0.00) 
(100, 20, 10)  (0.00, 0.20)  (0.00, 0.05)  (0.00, 0.05)  (0.00, 0.01)  (0.00, 0.00) 
(60, 30, 15)  (0.00, 0.31)  (0.00, 0.12)  (0.00, 0.08)  (0.00, 0.03)  (0.02, 0.01) 
(90, 30, 15)  (0.00, 0.23)  (0.00, 0.07)  (0.00, 0.04)  (0.01, 0.01)  (0.01, 0.00) 
(120, 30, 15)  (0.00, 0.21)  (0.00, 0.05)  (0.00, 0.03)  (0.00, 0.00)  (0.00, 0.00) 
(150, 30, 15)  (0.00, 0.20)  (0.00, 0.04)  (0.00, 0.02)  (0.00, 0.00)  (0.00, 0.00) 

The AIC and BIC have low FIR, but the AIC has a high FAR of more than 20% even when the sample size $n$ is five times as large as the dimension $p$. If we treat false active and false inactive as equally serious errors and rank the five criteria by the overall error rate, defined as the sum of the FIR and FAR, then the AIC has the highest overall error rate regardless of the dimension and sample size. The BIC is consistent, and we see that its overall error rate goes down towards zero as the sample size increases.

The performance of the CMC$_{0.9}$ is similar to that of the BIC, with comparable FIR and FAR. For small and moderate sample sizes, the CMC$_{0.5}$ has in general the smallest overall error rate among the five criteria. For large sample sizes, the CMC$_{0.1}$ has the smallest overall error rate, but the CMC$_{0.5}$ is a close second. Because of these, we recommend the 0.5 level as the default level for the CMC. The CMC$_{0.5}$ results at this recommended default level are highlighted in bold font in Table 1, and they are substantially more accurate than those of the AIC and BIC. When the sample size $n$ is very large relative to the dimension $p$, we may use the 0.1 level.

Although the three assumptions in the previous section were insufficient for proving the consistency of the present version of the CMC based on the likelihood ratio test, Table 1 shows that when $n$ is large and $\alpha$ is small, the CMC overall error rates are zero or very close to zero. This suggests that for Gaussian linear models, the present version of the CMC is also consistent if we let the $\alpha$ level go to zero at a certain speed as the sample size increases. Further, comparing the CMC$_{0.1}$ results with the BIC results, we see that the CMC$_{0.1}$ selection appears to converge to the true model faster than the BIC selection, as the BIC error rates never reach zero even at the largest sample sizes in Table 1.
Model (16) was also used to evaluate the consistent CMC for Gaussian linear models in Table 1 of Tsao (2021). The CMC results in Table 1 of that paper differ from the CMC results in Table 1 here, especially in the small sample cases. These differences are due to the fact that two different tests were used in the formulations of the CMC. The tests are asymptotically equivalent, so for large sample sizes the CMC results in both tables are very similar. In the examples reported here, we set $q = p/2$ so that there is an equal number of active and inactive variables, which makes the use of FIR+FAR as a measure of the overall error the most meaningful. For simplicity, we also set the parameters of all active variables to 1. We have tried other $q$ values and parameter values, and obtained similar observations concerning the relative performance of the five criteria.
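The data-generating step of model (16) can be sketched as follows (standard library Python for illustration; the paper's simulations use R's 'bestglm'). The intercept is omitted and standard normal errors are assumed in this sketch.

```python
import random

def simulate_linear(n, p, q, seed=1):
    """One (y, X) pair from model (16): an n x p standard normal design,
    the first q coefficients equal to 1, the rest 0, standard normal errors."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    beta = [1.0] * q + [0.0] * (p - q)
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0) for row in X]
    return y, X

# One replicate at the smallest setting of Table 1: (n, p, q) = (20, 10, 5).
y, X = simulate_linear(20, 10, 5)
```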
3.2 Logistic regression examples
Let $y_1, y_2, \dots, y_N$ be independent observations of the response variable, where $y_i \sim \mathrm{Binomial}(m, \pi_i)$, and let $X$ be the corresponding $N \times p$ matrix of predictor variables. The logistic regression model is given by

$$\log\!\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i^T \beta, \quad (17)$$

or alternatively,

$$\pi_i = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)},$$

where $x_i^T$ is the $i$th row of $X$ and $\beta = (\beta_1, \beta_2, \dots, \beta_p)^T$. As in the linear model examples, we set $\beta_j = 1$ for $j \le q$ and $\beta_j = 0$ for $j > q$ with $q = p/2$, so that only the first half of the variables are active, and elements of all predictor variables are independent random numbers generated from the standard normal distribution. The sample size here depends on both $N$ and $m$, so we used different combinations of $N$ and $m$ in the simulation. Table 2 contains the (FIR, FAR) values of the AIC, BIC, CMC$_{0.9}$, CMC$_{0.5}$ and CMC$_{0.1}$ for 16 combinations of $(N, m, p, q)$, where each (FIR, FAR) is the average of 1000 simulated pairs. We make the following observations based on Table 2:
(N, m, p, q)  AIC  BIC  CMC$_{0.9}$  CMC$_{0.5}$  CMC$_{0.1}$  

(20, 5, 6, 3)  (0.06, 0.20)  (0.10, 0.11)  (0.06, 0.20)  (0.14, 0.07)  (0.30, 0.03) 
(30, 5, 6, 3)  (0.01, 0.17)  (0.02, 0.08)  (0.01, 0.17)  (0.03, 0.05)  (0.12, 0.01) 
(40, 5, 6, 3)  (0.00, 0.17)  (0.00, 0.07)  (0.00, 0.16)  (0.01, 0.04)  (0.03, 0.00) 
(50, 5, 6, 3)  (0.00, 0.16)  (0.00, 0.06)  (0.00, 0.16)  (0.00, 0.04)  (0.01, 0.00) 
(20, 10, 6, 3)  (0.01, 0.17)  (0.01, 0.10)  (0.01, 0.17)  (0.02, 0.05)  (0.08, 0.01) 
(30, 10, 6, 3)  (0.00, 0.16)  (0.00, 0.07)  (0.00, 0.16)  (0.00, 0.04)  (0.01, 0.00) 
(40, 10, 6, 3)  (0.00, 0.16)  (0.00, 0.06)  (0.00, 0.16)  (0.00, 0.04)  (0.00, 0.00) 
(50, 10, 6, 3)  (0.00, 0.16)  (0.00, 0.05)  (0.00, 0.16)  (0.00, 0.03)  (0.00, 0.00) 
(20, 5, 10, 5)  (0.17, 0.26)  (0.22, 0.17)  (0.19, 0.20)  (0.30, 0.11)  (0.42, 0.06) 
(30, 5, 10, 5)  (0.04, 0.19)  (0.07, 0.10)  (0.06, 0.13)  (0.12, 0.05)  (0.24, 0.02) 
(40, 5, 10, 5)  (0.00, 0.18)  (0.02, 0.07)  (0.01, 0.10)  (0.04, 0.03)  (0.13, 0.01) 
(50, 5, 10, 5)  (0.00, 0.16)  (0.00, 0.05)  (0.00, 0.08)  (0.01, 0.02)  (0.07, 0.00) 
(20, 10, 10, 5)  (0.06, 0.19)  (0.08, 0.11)  (0.08, 0.13)  (0.15, 0.06)  (0.26, 0.04) 
(30, 10, 10, 5)  (0.00, 0.17)  (0.01, 0.07)  (0.00, 0.10)  (0.01, 0.02)  (0.06, 0.01) 
(40, 10, 10, 5)  (0.00, 0.15)  (0.00, 0.06)  (0.00, 0.07)  (0.00, 0.02)  (0.02, 0.00) 
(50, 10, 10, 5)  (0.00, 0.14)  (0.00, 0.05)  (0.00, 0.07)  (0.00, 0.02)  (0.00, 0.00) 

The AIC has the lowest FIR but the highest FAR for all combinations of $(N, m, p, q)$. Due to its high FAR, its overall error rate is in general the highest among the five criteria. The BIC has a much lower FAR than the AIC. It is consistent, but its FAR converges to zero slowly as $N$ and $m$ increase, and it is still about 5% even when $N$ and $m$ are at their highest values of 50 and 10, respectively.

We noted that for Gaussian linear models, the performance of the CMC$_{0.9}$ is similar to that of the BIC, with low FIR. However, for logistic regression models, the CMC$_{0.9}$ behaves more like the AIC, with similar FIR and FAR, especially in the cases where $(p, q) = (6, 3)$. In the cases where $(p, q) = (10, 5)$, it has a smaller FAR than the AIC.

For Gaussian linear models, we recommended the 0.5 level as the default level for the CMC. For logistic regression model selection, both $N$ and $m$ affect the accuracy of the CMC. Interestingly, however, through exploring a wide range of $(N, m, p, q)$ combinations, we found that the CMC$_{0.5}$ again has a stable performance and is usually the most or the second most accurate criterion among the five. We thus also recommend the $\alpha = 0.5$ level as the default level for logistic regression model selection. When the total sample size is much larger than $p$, the CMC$_{0.1}$ may be used instead. Table 2 has such cases, for example $(N, m, p, q) = (50, 10, 10, 5)$, where $N \times m = 500$ is 50 times as large as $p$; for such cases the CMC$_{0.1}$ reached zero error rates, suggesting that the CMC is also consistent for selecting logistic regression models.
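The data-generating step behind Table 2 can be sketched as below, assuming (as the $(N, m)$ pairs suggest) binomial responses with $m$ trials per observation; the binomial draw is done by summing Bernoulli indicators. Standard library Python is used for illustration.

```python
import math
import random

def simulate_logistic(N, m, p, q, seed=1):
    """N binomial observations under the logistic model (17):
    y_i ~ Binomial(m, pi_i) with logit(pi_i) = x_i' beta, where the
    first q coefficients are 1 and the rest are 0 (assumed design)."""
    rng = random.Random(seed)
    beta = [1.0] * q + [0.0] * (p - q)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(N)]
    y = []
    for row in X:
        eta = sum(b * x for b, x in zip(beta, row))
        pi = 1.0 / (1.0 + math.exp(-eta))                   # inverse logit
        y.append(sum(rng.random() < pi for _ in range(m)))  # Binomial(m, pi)
    return y, X

# One replicate at the smallest setting of Table 2: (N, m, p, q) = (20, 5, 6, 3).
y, X = simulate_logistic(20, 5, 6, 3)
```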
(n, p, q)  AIC  BIC  CMC$_{0.9}$  CMC$_{0.5}$  CMC$_{0.1}$  

(20, 6, 3)  (0.06, 0.19)  (0.09, 0.10)  (0.05, 0.20)  (0.12, 0.06)  (0.28, 0.03) 
(30, 6, 3)  (0.01, 0.16)  (0.02, 0.07)  (0.01, 0.16)  (0.02, 0.03)  (0.08, 0.01) 
(40, 6, 3)  (0.00, 0.16)  (0.01, 0.06)  (0.00, 0.17)  (0.01, 0.03)  (0.03, 0.00) 
(50, 6, 3)  (0.00, 0.15)  (0.00, 0.05)  (0.00, 0.16)  (0.00, 0.03)  (0.01, 0.00) 
(100, 6, 3)  (0.00, 0.15)  (0.00, 0.03)  (0.00, 0.15)  (0.00, 0.02)  (0.00, 0.00) 
(20, 10, 5)  (0.13, 0.20)  (0.17, 0.14)  (0.16, 0.15)  (0.26, 0.08)  (0.39, 0.06) 
(30, 10, 5)  (0.01, 0.16)  (0.03, 0.07)  (0.02, 0.09)  (0.06, 0.03)  (0.15, 0.02) 
(40, 10, 5)  (0.00, 0.16)  (0.00, 0.06)  (0.00, 0.08)  (0.01, 0.02)  (0.05, 0.00) 
(50, 10, 5)  (0.00, 0.16)  (0.00, 0.05)  (0.00, 0.08)  (0.00, 0.01)  (0.01, 0.00) 
(100, 10, 5)  (0.00, 0.16)  (0.00, 0.03)  (0.00, 0.07)  (0.00, 0.01)  (0.00, 0.00) 
3.3 Poisson regression examples
Let $y_1, y_2, \dots, y_n$ be independent observations of the response variable, where $y_i \sim \mathrm{Poisson}(\mu_i)$. The Poisson regression model with log link is

$$\log(\mu_i) = x_i^T \beta, \quad (18)$$

where $x_i^T$ is the $i$th row of the $n \times p$ matrix of predictor variables and $\beta = (\beta_1, \beta_2, \dots, \beta_p)^T$ is the vector of regression parameters. We again set $q = p/2$, so that only the first $q$ variables are active. Table 3 contains the simulated (FIR, FAR) results for the five criteria. As in the case of the logistic regression models in Table 2, the performance of the CMC$_{0.9}$ is similar to that of the AIC, which has the lowest FIR but the highest FAR. The BIC has a slightly higher FIR than the AIC and the CMC$_{0.9}$, but a lower FAR. On the relative performance of the three CMC criteria, the CMC$_{0.9}$ has a lower overall error when the sample size is small. For moderate sample sizes, the CMC$_{0.5}$ is usually the most accurate. For large sample sizes, the CMC$_{0.1}$ is usually the most accurate, but the CMC$_{0.5}$ is a close second. Based on these findings, and for simplicity, we again recommend the 0.5 level as the default level. For small sample sizes, the 0.9 level may be used. For very large sample sizes, the 0.1 level may be used.
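For the Poisson model (18), the loglikelihood needed by the selection criteria has a simple closed form; the sketch below evaluates it at an arbitrary coefficient vector (in practice $\hat{\beta}_s$ would come from a fitting routine such as R's `glm`).

```python
import math

def poisson_loglik(y, X, beta):
    """Loglikelihood of Poisson regression with log link, log(mu_i) = x_i' beta:
    the sum of y_i * eta_i - exp(eta_i) - log(y_i!) over observations."""
    ll = 0.0
    for yi, row in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, row))
        ll += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return ll

# Tiny check at beta = 0 (so mu_i = 1): counts y = (1, 2) give
# loglikelihood -2 - log(2).
ll = poisson_loglik([1, 2], [[0.0], [0.0]], [0.0])
```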
To summarize the simulation study, the recommendations on the $\alpha$ level made in this section are based on the objective of minimizing the overall error. For fixed $n$ and $p$, the FIR of the CMC decreases and the FAR increases as $\alpha$ increases. This gives users of the CMC control over the balance between these two rates through the choice of the $\alpha$ level. If a low FAR is the priority instead of a lower overall error, one can set $\alpha$ to 0.1 regardless of the sample size and dimension $p$. If a low FIR is the priority, one can set it to 0.9. We have only considered three $\alpha$ levels here. Other levels may also be used. For example, in Table 1 for the linear model (16), the lowest level used is $\alpha = 0.1$. When $n$ is much larger than $p$, even smaller levels such as 0.05 may be used (we tried the CMC$_{0.05}$ and obtained zero error rates). Finally, we note that the predictor variables in the above examples have low correlations, as they are independently generated. When there are strongly correlated predictor variables, simulation results (not included here) show that the CMC$_{0.9}$ may be more accurate than the CMC$_{0.5}$ and CMC$_{0.1}$ for small and moderate sample sizes. Nevertheless, the CMC$_{0.5}$ is still often the most or second most accurate, and is often substantially more accurate than the AIC and BIC. Because of these, we recommend the 0.5 level as the default regardless of the type of regression model, the sample size and the correlation situation of the predictor variables. This makes the application of the CMC straightforward, as a user does not have to spend time deciding on which level to use. However, to optimize the CMC, one may consider a different level depending on the sample size and correlation situation.
Variable  Estimate  Std. Error  z value  p value 

(Intercept)  -6.1507208650  1.308260018  -4.70145138  2.583188e-06 
sbp  0.0065040171  0.005730398  1.13500273  2.563742e-01 
tob  0.0793764457  0.026602843  2.98375801  2.847319e-03 
ldl  0.1739238981  0.059661738  2.91516648  3.554989e-03 
adi  0.0185865682  0.029289409  0.63458325  5.257003e-01 
fhd  0.9253704194  0.227894010  4.06052980  4.896149e-05 
typ  0.0395950250  0.012320227  3.21382267  1.309805e-03 
obe  -0.0629098693  0.044247743  -1.42176449  1.550946e-01 
alc  0.0001216624  0.004483218  0.02713729  9.783502e-01 
age  0.0452253496  0.012129752  3.72846442  1.926501e-04 
4 South African heart disease data analysis
We now apply the CMC to perform model selection for logistic regression for a dataset from a heart disease study conducted by Rousseauw et al. (1983). The dataset can be found in various publicly available sources such as the R package ‘bestglm’ by McLeod, Xu and Lai (2020) and the online resource for the book Elements of Statistical Learning
by Hastie, Tibshirani and Friedman (2009). The response variable in the dataset is the coronary heart disease status (chd), a binary variable recording the presence (chd=1) or absence (chd=0) of coronary heart disease for a sample of 462 males from a heart disease high-risk region of the Western Cape, South Africa. There are 9 predictor variables: systolic blood pressure (sbp), tobacco use (tob), low density lipoprotein cholesterol (ldl), adiposity (adi), family history of heart disease (fhd), type-A behavior (typ), obesity (obe), alcohol consumption (alc), and age at onset (age). Fitting the full logistic regression model to chd using all 9 predictor variables yields the output in Table 4. Five variables have small $p$ values; in ascending order of their $p$ values, these 5 variables are fhd, age, typ, tob and ldl.

sbp  tob  ldl  adi  fhd  typ  obe  alc  age  AIC  BIC  LogLR 

0  0  0  0  0  0  0  0  0  596.1084  596.1084  123.96 
0  0  0  0  0  0  0  0  1  527.5623  531.6979  53.422 
0  0  0  0  1  0  0  0  1  510.6582  518.9293  34.518 
0  1  0  0  1  0  0  0  1  501.3854  513.7921  23.245 
0  1  0  0  1  1  0  0  1  492.7143  509.2566  12.574 
0  1  1  0  1  1  0  0  1  485.6856  506.3634  3.5455 
0  1  1  0  1  1  1  0  1  485.9799  510.7933  1.8398 
1  1  1  0  1  1  1  0  1  486.5490  515.4979  0.4089 
1  1  1  1  1  1  1  0  1  488.1408  521.2253  0.0001 
1  1  1  1  1  1  1  1  1  490.1400  527.3601  0.0000 
Using ‘bestglm’, we obtain the 10 models with the highest likelihood among models with the same number of predictor variables. These 10 models, their AIC values, BIC values and maximum loglikelihood ratio values (LogLR) are shown in Table 5. The model with the smallest AIC value is the five-variable model containing the 5 variables with the smallest $p$ values, fhd+age+typ+tob+ldl. The model with the smallest BIC value is also this five-variable model. Since there are $p = 9$ variables in the full model, the degrees of freedom of the $\chi^2$ distribution for calibrating the loglikelihood ratio is $p + 1 = 10$. The quantiles defining the confidence regions (7) associated with the CMC$_{0.9}$, CMC$_{0.5}$ and CMC$_{0.1}$ are, respectively, 4.865, 9.341 and 15.987. From the LogLR column in Table 5, we see that the models with loglikelihood ratios below 4.865 (i.e. in $C(0.9)$) are the last 5 models, with 5 to 9 variables, so the CMC$_{0.9}$ chooses the smallest model in this set, which is the model with 5 variables chosen by the AIC and BIC. Similarly, the CMC$_{0.5}$ also chooses the same five-variable model. On the other hand, the models with loglikelihood ratios below 15.987 are the last 6 models, with 4 to 9 variables, so the CMC$_{0.1}$ chooses the smallest model in this set of 6 models, which is the four-variable model consisting of the 4 variables with the smallest $p$ values, fhd+age+typ+tob. Although this model is different from the common choice of the other four criteria, it is worth considering because for this dataset the sample size is $n = 462$, which is about 50 times larger than the number of variables $p = 9$, and the CMC$_{0.1}$ has been very accurate in our simulation study when the sample size is this large.

McLeod and Xu (2020) analysed this dataset and obtained the above five-variable model and four-variable model, respectively, under two different BIC criteria discussed in that paper. In Chapter 4 of their book, Hastie, Tibshirani and Friedman (2009) also analysed this dataset. They obtained a different four-variable model containing fhd+age+tob+ldl using a backward selection method. Different model selection criteria may lead to different selections. The CMC criteria at different $\alpha$ levels are no exception, but the CMC provides a simple and unified framework in which to view the different selections through their loglikelihood ratios and associated $\alpha$ levels.
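The selections above can be replayed directly from the LogLR column of Table 5 (values copied from the table, ordered by model size from 0 to 9 predictors): the CMC picks the smallest model whose loglikelihood ratio falls below the quantile.

```python
# LogLR column of Table 5, by model size 0, 1, ..., 9:
loglr = [123.96, 53.422, 34.518, 23.245, 12.574, 3.5455, 1.8398, 0.4089, 0.0001, 0.0]

def cmc_size(loglr_by_size, cutoff):
    """Smallest model size whose loglikelihood ratio is within the cutoff."""
    for size, lr in enumerate(loglr_by_size):
        if lr <= cutoff:
            return size
    return None

# Chi-squared quantiles with 10 degrees of freedom, as in the text:
# cutoffs 4.865 and 9.341 both give the 5-variable model; 15.987 gives 4.
```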
5 Concluding remarks
The CMC based on the loglikelihood ratio provides a family of criteria, indexed by the $\alpha$ level, for selecting regression models. It makes effective use of the null distribution (6) of the likelihood ratio for model selection. For general applications, we recommend the CMC$_{0.5}$, as it has shown excellent accuracy in our simulation study, outperforming other criteria including the AIC and BIC in most cases. With the parameter $\alpha$, it is easy for the CMC to adapt to special situations. There have been various efforts to find finite sample adjustments for the AIC and BIC in order to improve their performance; see, for example, Hurvich and Tsai (1989), Broersen (2000) and Sclove (1987). The CMC does not need such adjustments. When the sample size is small or when there are strongly correlated predictor variables, we simply use the CMC with a large $\alpha$ level, say $\alpha = 0.9$, to handle such special situations.
The CMC as defined in (9) is for best subset selection, as the minimization is taken over the set of all possible models, whereas the AIC and BIC may be applied to select a model from a subset . In situations where the models of interest form such a subset , we simply replace the in (9) with so that the CMC can still select a model from . It is possible that is empty, and in this case the CMC does not have a solution. This is a useful warning, as it tells us that among the models in , none is acceptable at level by the likelihood ratio test. As such, we should either refrain from selecting a model from , or consider lowering the level to enlarge the set of plausible models so that is not empty. In contrast, the AIC and BIC give no such warning and would select a model from even if it contains no plausible models.
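The restricted selection and its built-in warning can be sketched as follows. This is a hypothetical illustration under the same placeholder setup as before: the candidate models are keyed by name, the subset plays the role of the restricted collection, and a return value of `None` corresponds to the no-solution warning described above.

```python
# Hypothetical sketch: CMC restricted to a subset of candidate models.
# models: dict mapping model name -> (num_variables, log_lik_ratio).
# subset: names of the models forming the restricted collection.

def cmc_select_subset(models, subset, quantile):
    """Return the name of the smallest plausible model in the subset,
    or None when no model in the subset passes the likelihood ratio
    test at the chosen level (the CMC's warning case)."""
    plausible = {name: models[name] for name in subset
                 if models[name][1] <= quantile}
    if not plausible:
        return None  # warning: no model in the subset is acceptable
    return min(plausible, key=lambda name: plausible[name][0])

models = {"m4": (4, 10.2), "m5": (5, 3.1), "m6": (6, 2.0)}
print(cmc_select_subset(models, ["m5", "m6"], 4.865))  # -> m5
print(cmc_select_subset(models, ["m4"], 4.865))        # -> None
```

Unlike the AIC and BIC, which always return some model from the subset, the `None` case surfaces the fact that every restricted candidate was rejected by the test.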
We have used the likelihood ratio test to define the set of plausible models. The score test and Wald test are asymptotically equivalent to the likelihood ratio test, and in principle they may also be used to define the set of plausible models for constructing the CMC. However, they are computationally more complicated than the likelihood ratio test. Further, one of the key arguments used in establishing the lower bound (11) for the likelihood ratio test based CMC is that the event implies . This argument would be invalid if other tests were used, which would make the theoretical investigation of the CMC selection more difficult. Nevertheless, we plan to study the Wald test based CMC to determine whether it has theoretical advantages over the likelihood ratio test based CMC. In particular, letting be the Wald test induced confidence region for , we hope to find a sequence of such that and
uniformly for all as . If such a sequence of can be found, then we can show that the CMC defined by is consistent, so it may achieve better large-sample accuracy than CMC. The Wald test induced confidence region has an analytic expression, which should be helpful in the search for such a sequence. The score test and likelihood ratio test induced confidence regions do not have this advantage.
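As a point of reference for the analytic expression mentioned above, the Wald confidence region for the parameter vector is the familiar ellipsoid. The following is a standard form written in generic notation rather than the paper's own, with $\hat{I}$ denoting the Fisher information evaluated at the maximum likelihood estimator:

```latex
% Standard form of the Wald confidence region (generic notation)
C_W(\alpha) = \left\{ \beta :
  \bigl(\hat{\beta} - \beta\bigr)^{\top} \hat{I}\bigl(\hat{\beta}\bigr)
  \bigl(\hat{\beta} - \beta\bigr) \le \chi^2_{p,\,1-\alpha} \right\}
```

where $\chi^2_{p,\,1-\alpha}$ is the $(1-\alpha)$ quantile of the chi-square distribution with $p$ degrees of freedom. The explicit quadratic form is what makes the Wald region analytically tractable; the likelihood ratio and score regions are defined only implicitly through the log-likelihood.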
References
 [1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
 [2] Broersen, P. M. T. (2000). Finite sample criteria for autoregressive order selection. IEEE Transactions on Signal Processing, 48, 3550–3558.
 [3] Ding, J., Tarokh, V. and Yang, Y. (2018a). Model selection techniques: an overview. IEEE Signal Processing Magazine, 35, 16–34.
 [4] Ding, J., Tarokh, V. and Yang, Y. (2018b). Bridging AIC and BIC: a new criterion for autoregression. IEEE Transactions on Information Theory, 64, 4024–4043.
 [5] Fahrmeir, L. and Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Annals of Statistics, 13, 342–368.
 [6] Furnival, G. M. and Wilson, R. W., Jr. (1974). Regressions by leaps and bounds. Technometrics, 16, 499–511.
 [7] Gourieroux, C. and Monfort, A. (1981). Asymptotic properties of the maximum likelihood estimator in dichotomous logit models. Journal of Econometrics, 17, 83–97.
 [8] Haberman, S. J. (1977). Maximum likelihood estimates in exponential response models. Annals of Statistics, 5, 815–841.
 [9] Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B, 41, 190–195.
 [10] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer, New York.
 [11] Hurvich, C. M. and Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
 [12] Kadane, J. B. and Lazar, N. A. (2004). Methods and criteria for model selection. Journal of the American Statistical Association, 99, 279–290.
 [13] McLeod, A. I., Xu, C. and Lai, Y. (2020). Package ‘bestglm’. An R package available at https://cran.r-project.org.
 [14] McLeod, A. I. and Xu, C. (2020). ‘bestglm: Best Subset GLM’. Vignette for the R package ‘bestglm’ available at http://www2.uaem.mx/r-mirror.
 [15] Morgan, J. A. and Tatar, J. F. (1972). Calculation of the residual sum of squares for all possible regressions. Technometrics, 14, 317–325.
 [16] Rao, C. R. and Wu, Y. H. (1989). A strongly consistent procedure for model selection in a regression problem. Biometrika, 76, 369–374.
 [17] Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J., Jooste, P. and Ferreira, J. (1983). Coronary risk factor screening in three rural communities. South African Medical Journal, 64, 430–436.
 [18] Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
 [19] Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333–343.
 [20] Tsao, M. (2021). A constrained minimum method for model selection. Stat, e387. Available at https://onlinelibrary.wiley.com/toc/20491573/0/ja or https://onlinelibrary.wiley.com/toc/20491573/current.
 [21] Wit, E., van den Heuvel, E. and Romeijn, J. (2012). ‘All models are wrong…’: an introduction to model uncertainty. Statistica Neerlandica, 3, 217–236.