When there are several plausible models to choose from but no definite scientific rationale to dictate which one should be used, a model selection method has been used traditionally to determine a ‘correct’ model for data analysis. Commonly used model selection methods, such as Akaike information criterion (AIC), Bayesian information criterion (BIC), stepwise regression, best subset selection, penalised regression, etc., are data driven and different methods may use different criteria; cf., e.g., Hastie2009 and reference therein. Once a model is chosen, further analysis proceeds as if the model selected is the true one. This practice does not account for the uncertainty introduced in the process due to model selection, and can often lead to faulty inference as discussed in Madigan1994; Draper1995; Buckland1997, among others. To provide a solution to the problem, model averaging methods have been introduced to incorporate model uncertainty during analysis; cf., e.g., Claeskens2008. Instead of deciding which model is the ‘correct’ one, a model averaging method uses a set of plausible candidate models and final measures of inference are derived from a combination of all models. The candidate models are combined using some data-dependent weights to reflect the degree to which each candidate model is trusted.
Our research on model averaging is motivated in part by a real life example on a prostate cancer study where the relationship between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy was investigated. The variables included in the study are log cancer volume, log prostate weight, age, log of the amount of benign prostatic hyperplasia, seminal vesicle invasion, log of capsular penetration, Gleason score, and percent of Gleason scores 4 or 5. In analysis of such data, a common theme is that different model selection methods may choose different models as the ‘true’ one. For example, AIC and BIC, two commonly used model selection criteria, may pick two different models, as the criteria for selection is different. Such situations would certainly raise many questions in practice. For instance, if the estimator is selected by using a model selection criteria, how would we address the possibility that the selection is a wrong model? Also, if different model selection methods give us different results, we might wonder how trustworthy the model selection procedures are. Instead of choosing one model using a model selection scheme, we can use an average of estimators from different models. The model averaging estimator then can provide us with an estimate of any parameter involved in the study and can be used for providing confidence bounds. The model averaging estimator can be used for prediction purposes as well.
Hjort2003 provided a formal theoretical treatment of frequentist model averaging approaches, which provided in-depth understanding of model averaging approaches and was well cited. However, the development had an assumption that any extra parameters not included in the narrowest model will shrink to zero at a rate. It essentially requires that the all candidate models are within a neighborhood of the true model. Although this assumption avoids a technical difficulty of handling biased estimators, in reality we do not know the true model and thus excluding from consideration those models that are beyond neighborhood of the true model appears to be very restrictive in practice. In this paper, we remove this restrictive assumption in Hjort2003 and develop frequentist model averaging approaches under a much more general framework. Our model averaging scheme allows us to use all the potential candidate models available, even the ones that produce biased estimates.
The development is motivated by the familiar bias-variance trade-off. If we use an overly simple model, the parameter estimates will often be biased, but it can also possibly have less variance, because there are fewer parameters to estimate. Similarly, if a bigger model is used, the parameter estimates often have low or no bias but increased variance. It is possible that biased estimators may end up having lower mean squared error (MSE) than the bigger model or even true model, and visa verse. In our development, we study the delicate balance between bias and variance in all possible models and utilize the knowledge to develop new frequentist model averaging approaches.
A key element of a model averaging method is selection of weights that help us build a combined model averaging estimator. The weights proposed in our development are based on the aforementioned bias-variance trade-off, anchoring on the mean squared error (MSE) of the overall model averaging estimator. The weighing scheme is similar to but not the same as that discussed in Liang2011
, in which the authors only focused on Gaussian linear regression models. Specifically, a consistent estimate of the mean squared error of the model averaging estimator is proposed, and the weights are chosen such that the MSE estimate is minimized. Using these weights, we show that model averaging performs better or no worse than several existing and commonly used model selection or model averaging methods. In particular, the weights chosen often display good optimality properties, for example, the parameter estimates converging to the true parameter values as sample sizegoes to infinity. Thus it can be shown that in most of the cases weights that are chosen to combine the candidate models highlight the contribution of the true model. However, for a finite sample size with , biased estimators may end up having lower mean squared error than that from the true model and the model averaging estimator may be based on biased candidate models.
A model averaging estimator incorporates model uncertainty into the analysis by combining a set of competing candidate models rather than choosing just one. It also provides an insurance against selecting a poor model thus improving the risk in estimation. In Hjort2006 and Claeskens2008, variable selection methods for the Cox proportional hazards regression model were discussed along with the choice of weights. In Hansen2007 a new set of weights was derived using Mallow’s criterion. In Liang2011
, the authors proposed an unbiased estimator of the risk and a set of optimal weights was chosen by minimizing the trace of the unbiased estimator. Further details about model selection and averaging can also be found inLien2005; Karagrigoriou2009; wan.zhang.ea:2010; zhang.wan.zhou:2012; Wei2012. The model averaging method has also been used in many areas of applications, e.g., Danilov2004a; Danilov2004 for forecasting stock market data, Pesaran2009 for risk of using false models in portfolio management, magnus.wan.ea:2011 for analysis of the Hong Kong housing market, and Posada2004 for a study of phylogenetics in biology. Our development in this article extends the existing theoretical frequentist development to a general framework so it can incorporate biased models under a general setting. Model averaging has been also discussed in the Bayesian framework; see, e.g. Raftery1998 and Hoeting1999a
. In a Bayesian approach, a weighted average of the posterior distributions under every available candidate model was used for estimation and prediction purposes. The weights were determined by posterior model probabilities. Model averaging in a frequentist setup, as inHjort2003 and also ours, precludes the need to specify any prior distributions, thus removing any possible oversight due to faulty choice of priors. The question in a frequentist setting is how to obtain the weights by a data-driven approach.
The rest of the article is organized as follows. In section 2, we propose a general framework that covers the framework of Hjort2003 as a special case and study asymptotic properties of model averaging estimators. We also derive a consistent estimator for the mean squared error of the model averaging estimator and use it to facilitate our choice of data-driven weights in section 2.4. The development is illustrated in generalized linear models and particularly in linear and logistic model setups. In section 4, simulation studies are carried out to examine the performance of the proposed estimator and to compare its performance with existing methods.
2 General Framework
2.1 Notations and Set up
Consider independent data points sampled from a distribution having density of the form , where is the unknown parameter of interest. Here the parameter can be written as , where , , are the parameters that are certainly included in every candidate model and is the remaining set of parameters that may or may not be included in the candidate models. We assume that and are given. As a model averaging method, instead of choosing one particular candidate model as the “correct” model, we consider a set of candidate models, say , in which each candidate model contains the common parameters and a unique that includes of components of the parameter , .
The choice of can vary depending on the problem that one is trying to solve. For example, the candidate model set can contain all possible combinations of . Or, one can choose a subset of the possible models as . In Hansen2007, a set of nested models has been used as candidate models, with . In Hjort2003 includes candidate models that are within a neighborhood of the true model. Our development encompasses both setups as there are no restrictions on , and can include any number of candidate models between and . Similar setup was used in Liang2011 where is also unrestricted, but the development there was done in the standard linear regression framework.
Let the parameters in the true model be given by . Let be the number of components of that are present in the true model. Define as the collection of the candidate models that contain the true model, thus every model in contain each and every one of the components of . Define , so contains candidate models for which at least one of those components are not present. Clearly .
In Hjort2003, a common parameter is also present in all the candidate models that is similar to ours. But the treatment of is different. In particular, the model containing just is called a narrow model and the true model is chosen of the form . Here, parameter determines how far a candidate model can vary from the narrow model and is a given value of for which any extended model reduces down to the narrow model. Thus, this choice of true model essentially requires that the all candidate models are within a neighborhood of the true model. Any model that is beyond neighborhood of the true model is excluded from the analysis. In this paper, we remove this rather restrictive constraint. Indeed, we assume the parameter for the true model is , where may or may not have any of the components, and the candidate model set can be a subset or contain all possible combinations of . Thus in our model setup there are no restrictions on the choice of true model or on the set of candidate models as in Hjort2003. Furthermore, we can treat the setup considered in Hjort2003 as a special case of ours by restricting so that all the candidate models will have a bias of order or less.
Note that, every candidate model includes a unique that may or may not include all components. Thus the numbers of parameters from different candidate models may be different. For ease of presentation and following Hansen2007, we introduce an augmentation scheme to bring all of them to the same length. We first illustrate the idea using the regression example considered by Hansen2007:
is the vector of responses,is the design matrix with full column rank and the candidate models are nested models. We further assume the first columns of are always included in the candidate models; the special case with goes back to the setup of Hansen2007. It follows that the candidate model includes the first columns of , . Denote by the estimated regression parameters corresponding to the candidate model. Then the vector can be augmented to a vector , by adding ’s. The augmented estimator for the candidate model is given by
cf., e.g., Hansen2007 which adopted this augmentation on a set of nested candidate models.
More generally, let be the parameter for the model in . Assume the length of is , where depends on . Define the log-likelihood for the observation in the model as . The maximum likelihood estimate (MLE) of using the model is , where . Write the score function of the model as . As in the example above, the vector for the model can be augmented to a vector , where is a fixed value used for augmentation to hold spaces. The augmented maximum likelihood estimator is given by . The fixed value augmentation does not affect the parameter, and only appends the length of the parameter. In the linear model example above the values . Some examples of can be found in MitraThesis. Similarly, can be augmented to a vector for a certain fixed set of without altering the true model. Thus, without loss of generality and from now on, we assume is a vector in the sense that some of the elements may be the augmented to fill the space.
For the model , let us define as the solution of the equation , where is the score function of the model having parameters. Define, as before, as the augmented version of . Since the score function is Fisher consistent, under usual regularity conditions. But this may not be close to .
Let be a general function that is order partially differentiable and is the parameter of interest. Then, the model averaging estimator of is defined as
where the weights , and . In the remainder of this section, we derive the asymptotic properties of the model averaging estimator (2.2) for any given set of weights .
2.2 Main Results
We assume the usual regularity conditions under which the familiar likelihood asymptotic arguments apply; cf., the conditions listed in the Appendix. See also Lehmann1998; Lehmann1999; van00 for more details.
Let be the first order derivative of the function . Define = and assume it is invertible. We also assume
for any , where is the indicator function. We have the following theorem. Its proof is given in the Appendix.
Let be the -augmented MLE as defined in (2.1) for the model in . Let for be model weights so that . Assume condition holds. The asymptotic distribution of the model averaging estimator for is given as,
where the variance is given by
The condition (A1) implies that the contribution of to the total variance, for each model in the set and for each is asymptotically negligible, and it is satisfied in a wide array of cases. We provide a set of sufficient conditions under which it is satisfied, and we also provide such examples in the cases of linear and generalized linear models in Section 2.4. See further discussions of the condition in Section 2.4.
In our general framework, there is no guarantee that , neither does asymptotically, particularly when . So is not necessarily , even asymptotically. But we can view it as a measurement of the bias by the model. Thus, with the second term on the left hand side of (2.3) serving as a bias correction term, Theorem 1 states that the model averaging estimator still retains the usual form of asymptotic normality after the bias correction. In the theorem, the weights are fixed. In practice, we often estimate the weights using data. In this case, we need that the estimated weight for the th model converges to as goes to infinity. By Slutsky’s lemma, the result in Theorem 1 still holds. A further study of data dependent weights is in Section 2.4.
All the candidate models have in common. We can use Theorem 1 to construct asymptotic convergence results for the common parameter . If we consider a function from to extract the parameter, then by a direct application of Theorem 1 we can derive the asymptotic distribution of as given below in Corollary 1.
Let be the common parameter for all candidate models in . Let , , . Then under the same setup as in Theorem 1
where the variance is given by .
2.3 Connection to Hjort2003’s development
The development of Hjort2003 required that all candidate models are within a neighborhood of the true model. We broaden this framework in our development. In particular, we show in this subsection that the results described in Hjort2003 can be obtained as a special case of our result.
We start with a description of the misspecified model setup used in Hjort2003. Let be a independent and identically distributed (i.i.d.) sample from density of maximum parameters. The parameter of interest is , where . The model that includes just parameters, say , is defined as the narrow model, while any extended model reduces to the narrow model for ; here the vector is fixed and known. For the model with unknown parameters , the MLE of is written as , where refers to the elements that are not contained in . Thus in this setup, if a parameter is not included in the candidate model, we set , the element of . The true model is assumed to be
where signify the deviation of the model in directions . So . Let us write . We will also write , which is the estimand of interest. Under this model setup, Hjort2003 derived asymptotic normality result for the model averaging estimator . To describe their result, let us first define
Let and . Denote by and the appropriately subsetted vectors obtained from and , with the subset indices corresponding to that of in model , respectively. Also, define for all . Hjort2003 showed that,
Here, is the projection matrix that projects any vector to with indices as given by .
The following corollary states that the result in (2.7) can be directly obtained from Theorem 1 and thus Theorem 1 covers the special setting (2.6) of Hjort2003. A proof of the corollary can be found in the Appendix.
2.4 Selection of Weights in Frequentist Model Averaging
Model averaging acknowledges the uncertainty caused by model selection and tackles the problem by weighting all models under consideration. To make it effective, it is desirable that the weights can reflect the impact of each candidate model, which can be achieved by properly assigning a weight to each candidate model. If model is more likely to impact or is more plausible than the model , its associated weight should be no smaller than for the model . In our development, we propose to measure the strength of a model by its mean squared error, based on which we obtain a set of data-adaptive weights by minimizing the mean squared error of the combined model averaging estimator. A similar scheme was developed in Liang2011, where the authors minimized an unbiased estimator of mean squared error to obtain their optimal weights. However, their work was done for the linear models. As in Liang2011, we assume that the true model is included in the set of candidate models in the development of our weighing scheme.
Recall Theorem 1, the asymptotic mean squared error (AMSE) of is,
for any given set of weights. However, this quantity depends on the unknown parameter , so we instead consider its estimate . Assume that we estimate consistently and the estimate is, say, . Then, in (2.8) can be consistently estimated by . We propose to obtain a set of data adaptive weights by minimizing :
The numerical performance of the proposed averaging estimators will be evaluated in Section 4. In the next section, we illustrate the procedure in the linear and logistic models in details.
3 Model Averaging and Weight Selection in Regression Models
We now discuss the model averaging estimator described in Section 2 for generalized linear models (GLM). Specifically, let where is a given link function connecting the mean and the linear predictor . We consider a set of models. Suppose we want to estimate a function and, as defined in (2.2), the final model averaging estimator is given by . Since the set up for Theorem 1
is for a general parametric model, the same asymptotic convergence results hold for GLM models. In particular we verify condition (A1) and discuss the data-driven weight choices below in two special cases: linear regression and logistic models.
3.1 Prediction in Linear Regression Models
We first derive the model averaging estimator in the linear regression framework:
where is a non-random design matrix of full column rank; i.e., , and .
Let be the set of candidate models. Here denotes a particular set of features having cardinality . Define as the design matrix of the candidate model with the features in . We consider zero-augmentation of the parameter set for all . Let be the augmented version of with the missing columns replaced by the vector. In our analysis, all the candidate models contain the intercept term corresponding to . With the rest of the components, we can construct candidate models, all of which are included in our analysis.
Let us fix a . Define so that consists of those components of indexed by . Consider the particular choice of the function so that for , . Clearly the . For the following discussion, we are interested in the model averaging estimator of = , which is given by with and . In the simulations, we will use generated from the known covariate distribution, while for the real data, we split the whole data set into a training set and a test set and will be set to be the covariate in the test set.
For the candidate model with , the score function is given by and is given by ; note that this follows from the definition immediately preceding condition (A1). Thus our Hessian matrix satisfies the condition as it does not depend on . Similarly we note that regarding condition (A1),
where and are fixed constants, and is the th column of the matrix . Note that . The condition (A1) is satisfied if, for any arbitrary ,
Moreover, if for some fixed constant , the condition is further reduced to,
It is appropriate to note that we can have a bound of as
denotes the smallest singular value of matrixB. Now by an application of Cauchy-Schwarz inequality,
Thus it follows that for finite, as goes to infinity, the right hand side of (3.1) goes to zero and thus Condition (A1) is satisfied.
The MLE of in the model is given by . Let be such that ; being the score function of the model, solving which we find that,
As discussed in Section 2.1, the entire set of candidate models can be divided into two categories. The category contains the ones that are biased and is denoted by and the second category contains ones that are not and is denoted by . So, for we have , whereas for we have . Therefore the bias term of model averaging estimator can be written as,
Since the weights assigned to the models are unknown, we propose an estimate of the mean squared error (MSE) and minimize the MSE to obtain weights that would be assigned to the candidate models. From Theorem 1, the asymptotic mean squared error (MSE) of is given by
Since does not depend on , we focus on estimating , which equals It follows that
Define the estimates of and as and , respectively. Then are consistent estimates of under mild conditions. We therefore propose to estimate by
We obtain the weights for model averaging estimator such that in (3.3) is minimized.
3.2 Estimation in Logistic Regression Framework
In this section we study the proposed model averaging estimation method under logistic regression models. Letbe
independent copies of a dichotomous response variabletaking values 0/1. Let
be a set of features. The logit model is given by,
where are the set of unknown parameters of interest. The log-likelihood for the logistic regression can be written as,
As before, let be the set of candidate models. Here denotes a particular set of features having cardinality . Define as the design matrix of the candidate model with the features in . Denote by the th column of the matrix , thus . Let be the parameter vector with components corresponding to the index set . We consider zero-augmentation of the parameter set for all as was done for the linear regression models.
Again, we consider estimation of a function of the form given by
Let the unknown true parameter in our model be . Then calculated component wise. To estimate the parameter , we consider the model averaging estimator given by
where is the 0-augmented version of the MLE of for the model. The score function for the model is given by