# Robust and Parallel Bayesian Model Selection

Effective and accurate model selection is an important problem in modern data analysis. One of the major challenges is the computational burden required to handle large data sets that cannot be stored or processed on one machine. Another challenge one may encounter is the presence of outliers and contaminations that damage the inference quality. In this paper, we extend the recently studied "divide and conquer" strategy in Bayesian parametric inference to the model selection context, in which we divide the observations of the full data set into roughly equal subsets and perform inference and model selection independently on each subset. After local subset inference, we aggregate the posterior model probabilities or other model/variable selection criteria to obtain a final model, by using the notion of geometric median. We show how this approach leads to improved concentration in finding the "correct" model and also parameters, and how it is robust to outliers and data contamination.

## Authors

• 10 publications
• 39 publications
• 33 publications
12/29/2019

### Robust Variable Selection Criteria for the Penalized Regression

We propose a robust variable selection procedure using a divergence base...
10/24/2014

### Median Selection Subset Aggregation for Parallel Inference

For massive data sets, efficient computation commonly relies on distribu...
12/04/2017

### Episodic memory for continual model learning

Both the human brain and artificial learning agents operating in real-wo...
07/24/2020

### Robust and Reproducible Model Selection Using Bagged Posteriors

Bayesian model selection is premised on the assumption that the data are...
10/12/2018

### The good, the bad, and the ugly: Bayesian model selection produces spurious posterior probabilities for phylogenetic trees

The Bayesian method is noted to produce spuriously high posterior probab...
07/29/2020

### Robust variable selection for model-based learning in presence of adulteration

The problem of identifying the most discriminating features when perform...
06/21/2021

### A generalized EMS algorithm for model selection with incomplete data

Recently, a so-called E-MS algorithm was developed for model selection i...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In many data modeling scenarios, many plausible models are available to fit to the data, each of which may result in drastically different predictions and conclusions. Being able to select the right model for inference is a crucial task. As our main example, we consider model selection for a normal linear model:

 Y =Xβ+ϵ,ϵ∼N(0,σ2I), (1)

where is an

dimensional response vector,

is an dimensional design matrix and is a dimensional vector of regression parameters. Here the candidate models to be selected could refer to the sets of significant variables. In a Bayesian setting, we have a natural probabilistic evaluation of models through posterior model probabilities. Depending on the objectives of the data analysis, we may be interested in assessing the belief on which is the “best” model or obtaining predictions with minimum error.

Existing procedures to accomplish the aforementioned goals, however, will perform poorly under the presence of outliers and contaminations. In addition, Markov chain Monte Carlo (MCMC) algorithms for these methods do not scale to big data situations. The goal of this paper is to investigate a “divide-and-conquer” method that integrates with existing Bayesian model selection techniques, in a way that is robust to outliers and, moreover, allows us to perform Bayesian model selection in parallel.

Our “divide-and-conquer” strategy is based on the ideas for robust inference using the notion of the geometric median minsker2015geometric , especially the median posterior in the Bayesian context wang2014median ; minsker2014scalable

. Previous work in this area has focused on the performance in parametric inference. Our contribution in this paper is to demonstrate the effectiveness of these ideas in selecting the correct class of models on top of the parameters. In particular, we show that the model aggregated across different subsets (the “divide”) has improved concentration to the true model class compared to the one using the full data set. This concentration is in terms of the posterior model probabilities to the point mass assigned to the true model. The result also holds jointly with the concentration of the parameter estimates, and under the presence of outliers and hence demonstrates robustness. We carry out extensive numerical studies on simulation data and a real data example to demonstrate the performance of our proposed approach.

## 2 Bayesian Model Selection

In Bayesian model selection, we define the prior model probability for each of the model () under consideration. For model , we additionally have parameters with prior , which leads to a likelihood . Thus, the posterior model probability for model , , is proportional to

 Pr(Mk)∫Pr(Y|βk,σ2k,Mk)Pr(βk,σ2k|Mk)dβkdσ2k.

However, as noted in barbieri2004optimal , choosing the model with the highest posterior model probability is not always the best option nor should one neglect the risk of model uncertainty. Instead of resorting to a single model for predicted values (or some quantity of interest in general), hoeting1999bayesian

proposes to average over the model uncertainty with Bayesian model averaging (BMA) to obtain a posterior mean and variance of

at a covariate level :

 E[~Y|~X,Y]=K∑k=1 E[~Y|~X,Y,Mk]Pr(Mk|~X,Y), Var(~Y|~X,Y)=K∑k=1 Pr(Mk|X,Y)(Var(~Y|~X,Y,Mk)+ E[~Y|~X,Y,Mk]2)−E[~Y|X,Y]2.

We will focus on BMA in our theoretical developments in this paper. Our numerical experiments, however, will show that our divide-and-conquer strategy is also effective in applying on other model selection methods.

The first alternative to BMA is the median probability model, which can be shown to be optimal if we must choose one model for prediction barbieri2004optimal . In this approach, we define the posterior inclusion probability of each predictor () as the sum of posterior model probabilities of the models that include predictor , namely . The median probability model is the model that includes the predictors if .

Second, using the maximum value of the likelihood for each model , where is the maximum likelihood estimate of , we can perform penalized model selection through the Akaike information criterion (AIC) akaike1974new or the Bayesian information criterion (BIC) schwarz1978estimating by selecting the model with the lowest information criterion:

 AIC=−2logPr(Y|^βk,^σ2k,Mk)+2(D+1), BIC=−2logPr(Y|^βk,^σ2k,Mk)+(D+1)logN.

The final model selection technique we will consider is stochastic variable selection through the spike and slab model george1993variable , which allows for variable shrinkage under high-dimensional models. For the purposes of this paper, we will use the rescaled spike and slab model ishwaran2005spike . To perform posterior inference in this model, we first define where

is the unbiased estimate of

under the full model and let be some small number. The model is defined to be the following mixture model:

 Y′∼N(Xβ,Nσ2I),βSSd∼N(0,Jdτ2d),σ−2ss∼Gamma(a,b),Jd∼(1−w)δJd(ν0)+wδJd(1),τ−2d∼Gamma(aτ,bτ),w∼Uniform(0,1).

## 3 Divide-and-Conquer and Robust Bayesian Model Selection

In our robust model selection strategy, we divide observations into subsets of roughly equal sample size. Then inference, model selection and prediction is performed for the linear model independently across subsets using the existing Bayesian model selection procedures, which are then combined to form a final model or a combined prediction value.

Given linear model (1

), we first define the following priors on a normal likelihood with response variable

and -dimensional predictor . The observations are divided into subsets with observations within each subset. One has,

 Pr(σ−2(j)) =Gamma(a,b), Pr(β(j)|σ2(j)) =N(β0,σ2(j)Σ0).

To compensate for the data division, we raise the likelihood of the divided data to the -th power and adjust the normalizing constant accordingly so that the likelihood for is:

 ⎛⎝R2πσ2(j)⎞⎠N/2exp{−R2σ2(Y(j)−X(j)β(j))T(Y(j)−X(j)β(j))}.

The intuition and motivation for raising the subset likelihood to -th power is to adjust the potentially inflated variance of the subset posterior distribution. Exploiting conjugacy, we obtain the full conditionals for data subset :

 Pr(β(j)|−) =N(μβ,σ2Σβ), μβ =Σβ(β0Σ−10+RXT(j)Y(j)), Σβ =(Σ−10+RXT(j)X(j))−1, Pr(σ−2(j)|−) =Gamma(a′,b′), a′ =a+N+D2, b′ =b+R2ϵTϵ+12(β(j)−β0)TΣ−10(β(j)−β0), ϵ =(Y(j)−X(j)β(j)).

Let , then integrating out the parameters gives us the following marginal distribution :

 (R2π)N2baΓ(a+N2)|ΣX|−12/Γ(a)(b+R2(Y(j)−X(j)β0)TΣ−1X(Y(j)−X(j)β0))a+N2.

For distributed AIC and BIC model evaluation, we raise the likelihood term of the AIC and BIC formula to the power of :

 AICR=−2RlogPr(Y(j)|^βk,^σ2k,Mk)+2(D+1), BICR=−2RlogPr(Y(j)|^βk,^σ2k,Mk)+(D+1)logN.

In applying our procedure with the spike and slab prior, we derived the full Gibbs sampler for our procedure. For posterior inference in the spike and slab model, let , we can perform Gibbs sampling by drawing from the following posteriors:

 Pr(βSS(j)|−) =N(μβSS,ΣβSS), ΣβSS =⎛⎝Δ−1+RNσ−2SS(j)XT(j)X(j)⎞⎠−1, μβSS =ΣβSS⎛⎝RNσ−2SS(j)XT(j)Y(j)⎞⎠, Pr(σ−2SS(j)|−) =Gamma(a′SS,b′SS), a′SS =a+N2, b′SS Pr(Jd|−)∝wd1δJd(ν0)+wd2δJd(1),wd1=(1−w)ν−1/20exp⎧⎨⎩−β2SS(j)d2ν0τ2d⎫⎬⎭,wd2=wexp⎧⎨⎩−β2SS(j)d2τ2d⎫⎬⎭, Pr(τ−2d|−) =Gamma⎛⎝aτ+12,bτ+β2SS(j)d2Jd⎞⎠, Pr(w|−)

Once inference is built on each subset, the key step is to aggregate the subset models (or estimates) together into a final model (or estimate). To aggregate our results, we collect the number of subset models or estimates and find the geometric median between these elements. The geometric median for a set of elements valued on a Hilbert space , is defined as

 x∗=medg(x1,…,xR)=argminy∈HR∑j=1∥y−xj∥, (2)

where is the norm associated with the inner product in minsker2014scalable . The solution can generally be effectively approximated using the Weiszfeld algorithm wes37 .

For instance, in the case of aggregating the posterior model probabilities across subsets of data, the geometric median operates on the space of posterior distributions and the geometric median posterior model probability, , is defined as:

 argminP∈ΠKR∑j=1∣∣∣∣P−Pr(Mk|X(j),Y(j))∣∣∣∣, (3)

where is the posterior model probabilities for subset , and denotes the space of distributions on support points. The metric here can be taken as the Euclidean metric, or an integral probability metric (IPM) defined as for some class of functions sriperumbudur2010hilbert ; sriperumbudur2012empirical .

For the model selection techniques discussed earlier (AIC, BIC, and the median model selection), we can choose a final model in two ways: One, we can select the best model locally on each subset, use it for prediction, and then aggregate the results (estimate combination). Or two, we can take the median of the model selection criteria and choose that particular model on each subset and then aggregate the results to get a final model (model combination).

However, in Bayesian model averaging and spike and slab modeling we do not choose a final model. We can still perform model or estimate combination by aggregating the posterior model probabilities. We consider both model and estimate combinations in our experiments and show that they yield similar results in our experimental settings.

## 4 Improved Concentration and Robustness

In this section we provide theoretical justification on the robustness in the divide-and-conquer strategy. In particular, we focus on BMA. Additionally, we show that the aggregated model class from our strategy concentrates faster, in terms of posterior model probabilities, to the correct class compared to using the whole data set at once. This concentration result can be joint with parameter estimation, and also applies in a way that exhibits robustness against outliers. Note that we do not raise the subset likelihood to -th power in our current theoretical analysis, but the results can be generalized by imposing slightly stronger entropy conditions on the model.

Let be the domain of , our set of model indices and parameters. Let be the true data generating parameter, and let be a generic data point. Let be the true conditional density of given , and be the true density of the covariates . We denote . Let be the distribution defined by and is the true distribution . For convenience, we denote where is the expectation under . We denote as the true probability measure taken on the data of size and . Lastly, we denote as the -packing number of a set of probability measures under the metric , which is the maximal number of points in such that the distance between any pair is at least . We implicitly assume here that is separable. The following Theorem 1 follows from a modification of Theorem 2.1 in ghosal2000 :

###### Theorem 1

Assume that there is a sequence such that and as , a constant , and a set so that

where and is the Hellinger distance. Then we have

 PN0(Pr(θ:dH(Pθ,P0)>Tε2N|X,Y)>δ)≤1C2Nε2Nδ+2e−LNε2Nδ+2e−2Nε2Nδ, (4)

for any and sufficiently large such that and , where is a universal constant.

The proof of Theorem 1 is in the Appendix. As noted by ghosal2000 , the important assumptions are Assumptions 1 and 3. Essentially, Assumption 1 constrains the size of the parameter domain to be not too big, whereas Assumption 3 ensures sufficient mass of the prior on a neighborhood of the true parameter. The concentration result (4) states that the posterior distribution of is close to the true with high probability, where the closeness is measured in terms of the Hellinger distance between the likelihoods. Note that the RHS of (4) consists of three terms. The dominant term is the power-law decay in . The other two exponential decay terms result from technical arguments in the existence of tests that sufficiently distinguish between distributions birge1983approximation ; le2012asymptotic .

Next we describe the concentration behavior of BMA. We focus on the situations where all the candidate models are non-nested, i.e. only one model contains distributions that are arbitrarily close to the truth. Without loss of generality, we let be the true model.

###### Theorem 2 (BMA of Non-Nested Models)

Suppose the assumptions in Theorem 1 hold. Also assume that, for sufficiently small , for any . Let be the same universal constant arising in Theorem 1. We have

1. For any given ,

 PN0(Pr(M1|X,Y)<1−δ)≤1C2Nε2Nδ+2e−LNε2Nδ+2e−2Nε2Nδ, (5)

for sufficiently large .

2. For any given ,

 PN0(dE(Pr(Mk|X,Y),e1)>δ)≤√2C2Nε2Nδ+2√2e−LNε2Nδ+2√2e−2Nε2Nδ, (6)

for sufficiently large , where is the Euclidean distance, and is the point mass on .

3. For any ,

 PN0(dH(Pr(Mk|X,Y),e1)>δ)≤(√2−1)2+1√2C2Nε2Nδ2+((√2−1)2+1)e−LNε2Nδ2+((√2−1)2+1)e−2Nε2Nδ2, (7)

for sufficiently large .

Proof of Theorem 2.

Proof of 1. Consider large enough and fix a sufficiently large . We have

 Pr(θ:d(Pθ,P0)≤Tε2N|X,Y) = EPr[Pr(θ:d(Pθ,P0)≤Tε2N|Mk,X,Y)|X,Y], where EPr[⋅|X,Y] denotes the posterior expectation and Pr(⋅|Mk,X,Y) denotes the posterior distribution given model Mk = Pr(M1|X,Y)Pr(θ:d(Pθ,P0)≤Tε2N|M1,X,Y), (9)

by the condition that for any and any eventually. Hence

 Pr(θ:d(Pθ,P0)≤Tε2N|X,Y)≥1−δ, (10)

implies

 Pr(M1|X,Y)≥1−δ. (11)

The result then follows from Theorem 1, which implies that (10) occurs with probability at least

 1−(1C2Nε2Nδ+2e−LNε2Nδ+2e−2Nε2Nδ),

Proof of 2. Note that (11) implies

 dE(Pr(Mk|X,Y),e1)=√(1−Pr(M1|X,Y))2+∑k≠1Pr(Mk|X,Y)2≤√2δ, (12)

since and is an optimizer of the optimization

 maxK∑i=2x2i\ \ subject to\ \ K∑i=2xi≤δ.

Hence (5) and (12) together imply

 PN0(dE(Pr(Mk|X,Y),e1)≥√2δ)≤1C2Nε2Nδ+2e−LNε2Nδ+2e−2Nε2Nδ.

By redefining , we get (6).

Proof of 3. Note that (11) implies

 dH(Pr(Mk|X,Y),e1) = ⎷12⎛⎝(√1−Pr(M1|X,Y))2+∑k≠1Pr(Mk|X,Y)⎞⎠ ≤√12((1−√1−δ)2+δ), (13)

since for all gives the optimizer of the optimization

 max∑i≠0√xi\ \ subject to\ \ ∑i≠0xi≤δ.

Hence (5) and (13) together imply

 PN0(dH(Pr(Mk|X,Y),e1)>√12((1−√1−δ)2+δ))≤1C2Nε2Nδ+2e−LNε2Nδ+2e−2Nε2Nδ. (14)

Note that is a convex function in for and is equal to 0 at . Thus for , where is the slope of the line between and . Hence, for , we have

 √12((1−√1−δ)2+δ)≤√((√2−1)2+1)δ2.

Combining with (14), we have

 PN0(dH(Pr(Mk|X,Y),e1)>√((√2−1)2+1)δ2)≤1C2Nε2Nδ+2e−LNε2Nδ+2e−2Nε2Nδ. (15)

By redefining , we get (7). ∎

Note that the assumption for any and sufficiently small is a manifestation of the non-nested model situation, asserting that only one model is “correct”. Result 1

is a concentration on the posterior probability of picking the correct model to be close to 1.

Result 2 translates this in terms of the Euclidean distance between the model posterior probability and the point mass on the correct model. Result 3 is an alternative using the Hellinger distance. Note that the concentration bound for Hellinger distance (7) is inferior to that for Euclidean distance (6) for small since instead of shows up in the RHS of (7). This is because in our proof, the function that appears in (14) has derivative which is at , and thus no linearization is available when is close to 0.

Theorem 2 can be modified to handle the case where multiple models contain the truth. In particular, the expression inside the probability in (5) becomes

 ∑r∈MPr(Mr|X,Y)<1−δ,

where is the collection of all such that contains the true model. In (6) and (7), the use of is replaced by an existence of some probability vector (dependent on ) supported on the indices in . In other words, one now allows comparing with an arbitrary allocation of probability masses to all true models in the concentration bound. These modifications can be seen by following the arguments in the proof of Theorem 2. Specifically, (9) would be modified as

 ∑r∈MPr(Mr|X,Y)Pr(θ:d(Pθ,P0)≤Tε2N|Mr,X,Y).

Then (10) would imply a modified version of (11), namely

 ∑r∈MPr(Mr|X,Y)≥1−δ,

giving the claimed modification for (5). Then, following (12), we could find a probability vector to make all terms vanish except one, which is in turn bounded by . This gives the claimed modifications for (6) and (7).

The following result states how a divide-and-conquer strategy can improve the concentration rate of the posterior model probabilities towards the correct model:

###### Theorem 3 (Concentration Improvement)

Suppose the assumptions in Theorem 2 hold. Let , and . For sufficiently large , letting be constants such that and , we have:

1. , the geometric median under of , satisfies

 PN0(dE(Pr∗(Mk|X,Y),e1)>Cαδ)≤(e(1−ν)ψ(α−ν1−ν,q))−R, (16)

where , and .

2. Let be the number of model classes, then:

 PN0(Pr∗(M1|X,Y)<1−Cαδ√K−1K)≤(e(1−ν)ψ(α−ν1−ν,q))−R. (17)
3. Suppose in addition that, for any such that for , we have

 dH(Pθ1,Pθ2)≥~Cρk(θ1,θ2)γ, (18)

where , with being a characteristic kernel defined on the space and is the corresponding reproducing kernel Hilbert space (RKHS), and and are constants. Moreover, assume that there is a universal constant such that for all , and we choose such that . Then

where is defined as , is a sufficiently large constant, is the geometric median of under the -norm, and is the delta measure at the true parameter.

The significance of Theorem 3 is the improvement of the concentration from power-law decay in Theorem 2 to exponential decay, as the number of subsets grows. Such type of results is known in the case of parameter estimation (e.g., wang2014median ; minsker2014scalable ). Theorem 3 generalizes to the case of model selection. Results 1 and 2 describe the exponential concentration for the model posteriors to the correct model, while Result 3 states the joint concentration in both the model posterior and the parameter posterior given the correct model, when one adopts a second layer of divide-and-conquer on the parameter posterior conditional on each individual candidate model. Result 3 in particular combines with the parameter concentration result in minsker2014scalable .

Note that we have taken a hybrid viewpoint here that we assume a “correct” model and parameters in a frequentist sense. Under this view, a posterior probability more concentrated towards the truth is more desirable. This constitutes our main claim that the divide-and-conquer strategy is attractive. This view has been used in existing work like wang2014median ; minsker2014scalable .

Finally, the following theorem highlights that the concentration improvement still holds even if the data are contaminated to a certain extent:

###### Theorem 4 (Robustness to Outliers)

Using the notation in Theorem 3, but assume instead that, for where ,

 Ps0(dE(Pr(Mk|X(j),Y(j)),e1)>δ)≤√2C2sε2sδ+2√2e−Lsε2sδ+2√2e−2sε2sδ,

the conclusion of Theorem 3 still holds.

Theorem 4 stipulates that when a small number of subsets are contaminated by arbitrary nature, the geometric median approach still retains the same exponential concentration.

Proofs of Theorems 3 and 4.

The proofs of both theorems rely on a key theorem on geometric median in minsker2015geometric , restated in the Appendix. We focus on Theorem 3, as the proof for Theorem 4 is a straightforward modification in light of Theorem 5.

Proof of 1. Immediate by noting that

 Ps0(dE(Pr(Mk|X(j),Y(j)),e1)>δ)≤q,

for all , and applying Theorem 5.

Proof of 2. Note that

 dE(Pr∗(Mk|X,Y),e1)≥(1−Pr∗(M1|X,Y))√KK−1. (19)

To see this, let . We have

 dE(Pr∗(Mk|X,Y),e1)= ⎷(1−a)2+K∑i=2x2i,

where ’s satisfy . Since is the optimizer of the optimization

 minK∑i=2x2i\ \ subject to\ \ K∑i=2xi=1−a,

we get