In many data modeling scenarios, many plausible models are available to fit to the data, each of which may result in drastically different predictions and conclusions. Being able to select the right model for inference is a crucial task. As our main example, we consider model selection for a normal linear model:
where is an
dimensional response vector,is an dimensional design matrix and is a dimensional vector of regression parameters. Here the candidate models to be selected could refer to the sets of significant variables. In a Bayesian setting, we have a natural probabilistic evaluation of models through posterior model probabilities. Depending on the objectives of the data analysis, we may be interested in assessing the belief on which is the “best” model or obtaining predictions with minimum error.
Existing procedures to accomplish the aforementioned goals, however, will perform poorly under the presence of outliers and contaminations. In addition, Markov chain Monte Carlo (MCMC) algorithms for these methods do not scale to big data situations. The goal of this paper is to investigate a “divide-and-conquer” method that integrates with existing Bayesian model selection techniques, in a way that is robust to outliers and, moreover, allows us to perform Bayesian model selection in parallel.
Our “divide-and-conquer” strategy is based on the ideas for robust inference using the notion of the geometric median minsker2015geometric , especially the median posterior in the Bayesian context wang2014median ; minsker2014scalable
. Previous work in this area has focused on the performance in parametric inference. Our contribution in this paper is to demonstrate the effectiveness of these ideas in selecting the correct class of models on top of the parameters. In particular, we show that the model aggregated across different subsets (the “divide”) has improved concentration to the true model class compared to the one using the full data set. This concentration is in terms of the posterior model probabilities to the point mass assigned to the true model. The result also holds jointly with the concentration of the parameter estimates, and under the presence of outliers and hence demonstrates robustness. We carry out extensive numerical studies on simulation data and a real data example to demonstrate the performance of our proposed approach.
2 Bayesian Model Selection
In Bayesian model selection, we define the prior model probability for each of the model () under consideration. For model , we additionally have parameters with prior , which leads to a likelihood . Thus, the posterior model probability for model , , is proportional to
However, as noted in barbieri2004optimal , choosing the model with the highest posterior model probability is not always the best option nor should one neglect the risk of model uncertainty. Instead of resorting to a single model for predicted values (or some quantity of interest in general), hoeting1999bayesian
proposes to average over the model uncertainty with Bayesian model averaging (BMA) to obtain a posterior mean and variance ofat a covariate level :
We will focus on BMA in our theoretical developments in this paper. Our numerical experiments, however, will show that our divide-and-conquer strategy is also effective in applying on other model selection methods.
The first alternative to BMA is the median probability model, which can be shown to be optimal if we must choose one model for prediction barbieri2004optimal . In this approach, we define the posterior inclusion probability of each predictor () as the sum of posterior model probabilities of the models that include predictor , namely . The median probability model is the model that includes the predictors if .
Second, using the maximum value of the likelihood for each model , where is the maximum likelihood estimate of , we can perform penalized model selection through the Akaike information criterion (AIC) akaike1974new or the Bayesian information criterion (BIC) schwarz1978estimating by selecting the model with the lowest information criterion:
The final model selection technique we will consider is stochastic variable selection through the spike and slab model george1993variable , which allows for variable shrinkage under high-dimensional models. For the purposes of this paper, we will use the rescaled spike and slab model ishwaran2005spike . To perform posterior inference in this model, we first define where
is the unbiased estimate ofunder the full model and let be some small number. The model is defined to be the following mixture model:
3 Divide-and-Conquer and Robust Bayesian Model Selection
In our robust model selection strategy, we divide observations into subsets of roughly equal sample size. Then inference, model selection and prediction is performed for the linear model independently across subsets using the existing Bayesian model selection procedures, which are then combined to form a final model or a combined prediction value.
Given linear model (1
), we first define the following priors on a normal likelihood with response variableand -dimensional predictor . The observations are divided into subsets with observations within each subset. One has,
To compensate for the data division, we raise the likelihood of the divided data to the -th power and adjust the normalizing constant accordingly so that the likelihood for is:
The intuition and motivation for raising the subset likelihood to -th power is to adjust the potentially inflated variance of the subset posterior distribution. Exploiting conjugacy, we obtain the full conditionals for data subset :
Let , then integrating out the parameters gives us the following marginal distribution :
For distributed AIC and BIC model evaluation, we raise the likelihood term of the AIC and BIC formula to the power of :
In applying our procedure with the spike and slab prior, we derived the full Gibbs sampler for our procedure. For posterior inference in the spike and slab model, let , we can perform Gibbs sampling by drawing from the following posteriors:
Once inference is built on each subset, the key step is to aggregate the subset models (or estimates) together into a final model (or estimate). To aggregate our results, we collect the number of subset models or estimates and find the geometric median between these elements. The geometric median for a set of elements valued on a Hilbert space , is defined as
For instance, in the case of aggregating the posterior model probabilities across subsets of data, the geometric median operates on the space of posterior distributions and the geometric median posterior model probability, , is defined as:
where is the posterior model probabilities for subset , and denotes the space of distributions on support points. The metric here can be taken as the Euclidean metric, or an integral probability metric (IPM) defined as for some class of functions sriperumbudur2010hilbert ; sriperumbudur2012empirical .
For the model selection techniques discussed earlier (AIC, BIC, and the median model selection), we can choose a final model in two ways: One, we can select the best model locally on each subset, use it for prediction, and then aggregate the results (estimate combination). Or two, we can take the median of the model selection criteria and choose that particular model on each subset and then aggregate the results to get a final model (model combination).
However, in Bayesian model averaging and spike and slab modeling we do not choose a final model. We can still perform model or estimate combination by aggregating the posterior model probabilities. We consider both model and estimate combinations in our experiments and show that they yield similar results in our experimental settings.
4 Improved Concentration and Robustness
In this section we provide theoretical justification on the robustness in the divide-and-conquer strategy. In particular, we focus on BMA. Additionally, we show that the aggregated model class from our strategy concentrates faster, in terms of posterior model probabilities, to the correct class compared to using the whole data set at once. This concentration result can be joint with parameter estimation, and also applies in a way that exhibits robustness against outliers. Note that we do not raise the subset likelihood to -th power in our current theoretical analysis, but the results can be generalized by imposing slightly stronger entropy conditions on the model.
Let be the domain of , our set of model indices and parameters. Let be the true data generating parameter, and let be a generic data point. Let be the true conditional density of given , and be the true density of the covariates . We denote . Let be the distribution defined by and is the true distribution . For convenience, we denote where is the expectation under . We denote as the true probability measure taken on the data of size and . Lastly, we denote as the -packing number of a set of probability measures under the metric , which is the maximal number of points in such that the distance between any pair is at least . We implicitly assume here that is separable. The following Theorem 1 follows from a modification of Theorem 2.1 in ghosal2000 :
Assume that there is a sequence such that and as , a constant , and a set so that
where and is the Hellinger distance. Then we have
for any and sufficiently large such that and , where is a universal constant.
The proof of Theorem 1 is in the Appendix. As noted by ghosal2000 , the important assumptions are Assumptions 1 and 3. Essentially, Assumption 1 constrains the size of the parameter domain to be not too big, whereas Assumption 3 ensures sufficient mass of the prior on a neighborhood of the true parameter. The concentration result (4) states that the posterior distribution of is close to the true with high probability, where the closeness is measured in terms of the Hellinger distance between the likelihoods. Note that the RHS of (4) consists of three terms. The dominant term is the power-law decay in . The other two exponential decay terms result from technical arguments in the existence of tests that sufficiently distinguish between distributions birge1983approximation ; le2012asymptotic .
Next we describe the concentration behavior of BMA. We focus on the situations where all the candidate models are non-nested, i.e. only one model contains distributions that are arbitrarily close to the truth. Without loss of generality, we let be the true model.
Theorem 2 (BMA of Non-Nested Models)
For any given ,
for sufficiently large .
For any given ,
for sufficiently large , where is the Euclidean distance, and is the point mass on .
For any ,
for sufficiently large .
Proof of Theorem 2.
Proof of 1. Consider large enough and fix a sufficiently large . We have
|where denotes the posterior expectation|
|and denotes the posterior distribution given model|
by the condition that for any and any eventually. Hence
since and is an optimizer of the optimization
By redefining , we get (6).
since for all gives the optimizer of the optimization
Note that is a convex function in for and is equal to 0 at . Thus for , where is the slope of the line between and . Hence, for , we have
Combining with (14), we have
By redefining , we get (7). ∎
Note that the assumption for any and sufficiently small is a manifestation of the non-nested model situation, asserting that only one model is “correct”. Result 1
is a concentration on the posterior probability of picking the correct model to be close to 1.
Result 2 translates this in terms of the Euclidean distance between the model posterior probability and the point mass on the correct model. Result 3 is an alternative using the Hellinger distance. Note that the concentration bound for Hellinger distance (7) is inferior to that for Euclidean distance (6) for small since instead of shows up in the RHS of (7). This is because in our proof, the function that appears in (14) has derivative which is at , and thus no linearization is available when is close to 0.
where is the collection of all such that contains the true model. In (6) and (7), the use of is replaced by an existence of some probability vector (dependent on ) supported on the indices in . In other words, one now allows comparing with an arbitrary allocation of probability masses to all true models in the concentration bound. These modifications can be seen by following the arguments in the proof of Theorem 2. Specifically, (9) would be modified as
giving the claimed modification for (5). Then, following (12), we could find a probability vector to make all terms vanish except one, which is in turn bounded by . This gives the claimed modifications for (6) and (7).
The following result states how a divide-and-conquer strategy can improve the concentration rate of the posterior model probabilities towards the correct model:
Theorem 3 (Concentration Improvement)
Suppose the assumptions in Theorem 2 hold. Let , and . For sufficiently large , letting be constants such that and , we have:
, the geometric median under of , satisfies
where , and .
Let be the number of model classes, then:
Suppose in addition that, for any such that for , we have
where , with being a characteristic kernel defined on the space and is the corresponding reproducing kernel Hilbert space (RKHS), and and are constants. Moreover, assume that there is a universal constant such that for all , and we choose such that . Then
where is defined as , is a sufficiently large constant, is the geometric median of under the -norm, and is the delta measure at the true parameter.
The significance of Theorem 3 is the improvement of the concentration from power-law decay in Theorem 2 to exponential decay, as the number of subsets grows. Such type of results is known in the case of parameter estimation (e.g., wang2014median ; minsker2014scalable ). Theorem 3 generalizes to the case of model selection. Results 1 and 2 describe the exponential concentration for the model posteriors to the correct model, while Result 3 states the joint concentration in both the model posterior and the parameter posterior given the correct model, when one adopts a second layer of divide-and-conquer on the parameter posterior conditional on each individual candidate model. Result 3 in particular combines with the parameter concentration result in minsker2014scalable .
Note that we have taken a hybrid viewpoint here that we assume a “correct” model and parameters in a frequentist sense. Under this view, a posterior probability more concentrated towards the truth is more desirable. This constitutes our main claim that the divide-and-conquer strategy is attractive. This view has been used in existing work like wang2014median ; minsker2014scalable .
Finally, the following theorem highlights that the concentration improvement still holds even if the data are contaminated to a certain extent:
Theorem 4 (Robustness to Outliers)
Theorem 4 stipulates that when a small number of subsets are contaminated by arbitrary nature, the geometric median approach still retains the same exponential concentration.
The proofs of both theorems rely on a key theorem on geometric median in minsker2015geometric , restated in the Appendix. We focus on Theorem 3, as the proof for Theorem 4 is a straightforward modification in light of Theorem 5.
Proof of 2. Note that
To see this, let . We have
where ’s satisfy . Since is the optimizer of the optimization