1 Introduction
As datasets grow in complexity and size, Bayesian inference becomes increasingly difficult. The posterior is often intractable, so we resort to simulation methods for inference via Markov chain Monte Carlo (MCMC), which is inherently serial and often too computationally expensive in datasets with a large number of data points (Bardenet et al., 2017). MCMC further struggles with multimodal posteriors which arise in many settings including mixture models (Jasra et al., 2005) or nonconvex priors (Seeger et al., 2007), as the MCMC sampler can become trapped in local modes (Rudoy & Wolfe, 2006). Current methods to sample from multimodal posteriors with MCMC include parallel tempering (Neal, 1996) and adaptive MCMC (Pompe et al., 2018), but the associated computational cost is high. Posterior approximation with variational Bayes (VB) (Blei et al., 2017) is a faster alternative, but it is generally difficult to quantify the quality of the approximation, and is thus problematic if accurate uncertainty quantification is desired (Giordano et al., 2015).
A further methodological issue facing Bayesian inference is the fact that all models are false. The increasing scale of datasets exacerbates the effects of model misspecification (Walker, 2013), as the true sampling distribution is meaningfully different from the parametric family of distributions of the model. There is rarely formal acknowledgement of model misspecification which can lead to inconsistencies (Watson & Holmes, 2016; Grünwald & Van Ommen, 2017).
Bayesian nonparametric learning (NPL) introduced by Lyddon et al. (2018)
allows for the use of statistical models without assuming the model is true. NPL uses a nonparametric prior centred on a parametric model, and returns a nonparametric posterior over the parameter of interest. The method focuses on correcting for model misspecification and for posterior approximation such as from Variational Bayes (VB) by placing a mixture of Dirichlet Processes
(Antoniak, 1974) prior on the sampling distribution. In addition to the acknowledgement of model misspecification, the method admits an embarrassingly parallel Monte Carlo sampling scheme consisting of randomized maximizations. However, in most cases this method requires sampling the Bayesian posterior, which is computationally expensive for complex models.1.1 Our contribution
In this work, we propose a simplified variant of NPL that utilises a Dirichlet Process (DP) prior on
instead of a mixture of Dirichlet Processes (MDP) prior. This allows us to perform inference directly and detaches the nonparametric prior from the prior of the model parameter of interest. Instead of centering on a Bayesian posterior, we center the DP on a sampling distribution which encapsulates our prior beliefs. This simpler choice of prior also has desirable theoretical properties and is highly scalable as we no longer need to sample from the Bayesian posterior. Our method can handle a variety of statistical models through the choice of the loss functions, and can be applied to a wide range of machine learning settings as we will demonstrate in Section
3. Our method implies a natural noninformative prior, which may be relevant when the number of data points is substantially larger than the number of parameters.The posterior bootstrap sampling scheme was introduced by Lyddon et al. (2018) under the NPL framework, and we inherit its computational strengths such as parallelism and exact inference under a Bayesian nonparametric model. Independent samples from the nonparametric posterior are obtained through the optimization of randomized objective functions, and we obtain the weighted likelihood bootstrap (Newton & Raftery, 1994) as a special case. Furthermore, sampling from multimodal posteriors now involves a nonconvex optimization at each bootstrap sample that we solve through local search and random restart. We demonstrate that our method recovers posterior multimodality on a Gaussian Mixture Model (GMM) problem. We further show that our method is computationally much faster than conventional Bayesian inference with MCMC, and has superior predictive performance on real sparse classification problems. Finally, we utilize the computational speed of NPL to carry out a Bayesian sparsitypathanalysis for variable selection on a genetic dataset.
2 Bayesian Nonparametric Learning
Assume that we have observed , where is a sequence of i.i.d. observables and is the unknown sampling distribution. We may be interested in a parameter
, which indexes a family of probability densities
. Conventional Bayesian updating of the prior to the posterior via Bayes’ theorem formally assumes that
belongs to the model , which is questionable in the presence of complex and large datasets. This assumption is not necessary for NPL. We derive the foundations of NPL by treating parameters as functionals of , with model fitting as a special case.2.1 The parameter of interest
We define our parameter of interest as satisfying an estimating equation:
(1) 
where is a loss function, and its form can be used to target statistics of interest. For example, setting returns the median and returns the mean.
The loss function of particular interest is , where is the density of some parametric model. The value of
minimises the KullbackLeibler divergence
, which is the parameter of interest in conventional Bayesian analysis (Walker, 2013; Bissiri et al., 2016). We have not assumed that contains , and in this case does not have any particular generative meaning as it is simply the parameter that satisfies (1).2.2 The Dirichlet process prior
As the sampling distribution is unknown, we place a DP prior on :
(2) 
where is our prior centering measure, and is the strength of our belief.
The base measure
We encode our prior knowledge about the sampling distribution in the measure . If we believe a particular model to be accurate, and have prior beliefs about encoded in , a sensible choice for the density of is . Alternatively, we could directly specify
as a density that accurately represents our beliefs without the burden of defining a joint distribution on
. In the presence of historical data , a suitable choice for is the empirical distribution of the historical data, i.e. where is the Dirac measure. This is in a similar fashion to power priors (Ibrahim & Chen, 2000). Further intuition is provided in Section A.1 of the Supplementary.It should be noted that we cannot directly include a prior on the parameter of interest , only implicitly through . Our prior is selected independently of the model of interest, and this is appropriate under a misspecified model setting since we do not believe there to be a true . As all parameters of interest are defined as a functional of as in (1), any informative prior on is thus informative of .
The concentration
The size of measures the concentration of the DP about
, and a large value corresponds to a smaller variance in a functional of the DP. We see in (
3) that the NPL posterior centering measure is a weighted sum of the prior and the empirical distribution , with the weights proportional to and respectively. We can thus interpret as the effective sample size from the prior . One method of selecting is through simulation of the prior distribution of via (1) and tuning its variance. Alternatively, we can select through the a priori variance of the mean functional (see Section A.2 of the Supplementary). The special case of corresponds to the Bayesian bootstrap (Rubin, 1981), which in our case corresponds to a natural way to define an noninformative prior about (see Gelman et al. (2013) for a review on noninformative priors). For , it may be suitable to set as the prior should have little influence and the Bayesian bootstrap is more computationally efficient.2.3 The NPL posterior
From the conjugacy of the DP, the posterior of is:
(3)  
Our NPL posterior is thus:
(4) 
where ; the delta arises as is a deterministic functional of as in (1). Properties of the NPL posterior follow from properties of the DP. For example, draws of are almost surely discrete, so (1) simplifies to:
(5) 
where and from the stickbreaking construction (Sethuraman, 1994). Formally, the GEM distribution is defined:
(6) 
We preserve the theoretical advantages from Lyddon et al. (2018) due to the symmetries in the limits of the DP and the MDP for and .
Consistency
Under regularity conditions, the NPL posterior is consistent at as defined in (1), from the properties of the DP (see van der Vaart (1998); Ghosal (2010); Ghosal & van der Vaart (2017) for details). Interestingly, this is true regardless of the choice of and its support. This is not the case in conventional Bayesian inference through Bayes’ rule where the support of the prior must contain for posterior consistency. This is particularly reassuring in our misspecified model setting, as inferences about are robust to choices of .
Asymptotic Dominance
The NPL posterior predictive
for asymptotically dominates the conventional Bayesian posterior predictive up to under regularity conditions, i.e.:(7) 
for all distributions , where is a nonnegative and possibly positive realvalued functional. This states that compared to the Bayesian posterior predictive, the NPL posterior predictive is closer in expected KL divergence to the true up to . The proof for the MDP case is given in Theorem 1 of Lyddon et al. (2018), and the above follows from the equivalence of the MDP and the DP for .
2.4 Sampling from the NPL posterior
In almost all cases, is not tractable, but lends itself to a parallelizable Monte Carlo sampling scheme. It may be more intuitive to think of sampling from the posterior DP, then calculating (1) to generate the sample from :
where is the number of posterior bootstrap samples. One advantage of this sampling scheme is that it is embarrassingly parallel as each of the samples can be drawn independently. We can thus take advantage of increasingly available multicore computing, unlike in conventional Bayesian inference as MCMC is inherently sequential.
2.4.1 The posterior bootstrap
Sampling from the DP exactly requires infinite computation time if is continuous, but approximate samples can be generated by truncation of the sum in (5). For example, we could truncate the stickbreaking and set the remaining weights to 0. Alternatively, we could approximate with the finite Dirichlet distribution for large . For further details, see Ishwaran & Zarepour (2002). We opt for the latter suggestion as Dirichlet weights can be generated efficiently, which leads to a simpler variant of the posterior bootstrap algorithm:
For , we simply draw , which is no longer an approximation and is equivalent to the Bayesian bootstrap. For , the sampling scheme is asymptotically exact for , but this is computationally infeasible. We could fix to a moderate value, or select it adaptively via adaptive NPL, where we use the stickbreaking construction until the remaining probability is less than .
2.5 Tackling multimodal posteriors with initialization
Multimodal posteriors can arise in Bayesian inference if the likelihood function is nonlogconcave like in GMMs (Jin et al., 2016; Stephens, 1999), or if the prior is nonlogconcave which can arise when selecting sparse priors (Seeger et al., 2007; Park & Casella, 2008; Lee et al., 2010). Unlike the method by Lyddon et al. (2018) with the MDP, our NPL posterior with the DP is now decoupled from the Bayesian posterior. There is thus no reliance on an accurate representation of the Bayesian posterior with potential multimodality, which MCMC and VB can often struggle to capture. If our loss function in (1) is nonconvex (e.g. of a GMM), our NPL posterior may also be multimodal. This now presents an optimization issue: solving (1) requires nonconvex optimization. In general, optimizing nonconvex objectives is difficult (see Jain & Kar (2017)), but under smoothness assumption of the loss, we can apply convex optimization methods to find local minima.
2.5.1 Random restart for multiple modes
Random restart (see G. E. Boender & H. G. Rinnooy Kan (1987)) can be utilized with convex optimization methods to generate a list of potential global minima then selecting the one with the lowest objective. This involves random initializations of for each local optimization, and it was shown by Hu et al. (1994) that the uniform measure for has good properties for convergence. If the number of modes is finite, then the global minimum will be achieved asymptotically in the limit of the . The probability of obtaining the correct global minimum for finite is related to the size of its basin of attraction. Random restart NPL (RRNPL) is shown in Algorithm 3:
This is particularly suited to NPL with nonconvex loss functions for the following reasons. Firstly, random restart can utilize efficient convex optimization techniques such as quasiNewton methods, and the restarts can be easily implemented in parallel which is coherent with our parallelizable sampling scheme. Secondly, we can compromise between accuracy and computational cost by selecting , as computational cost scales linearly with (though we can parallelize). The repercussions of an insufficiently large are not severe: our NPL posterior will incorrectly allocate more density to local modes/saddles but all modes will likely still be present for a sufficiently large . This is demonstrated in Section E.2.2 of the Supplementary. Finally, the uniform initialization can sample from nonidentifiable posteriors with symmetric modes (like in GMMs) as their basins of attraction are selected with equal probability.
Practically, uniform initialization may not be possible if the support of the parameter is infinite, e.g. the variance . In this case, we can simply pick another (e.g. Gamma for a positive parameter), or sample uniformly from a truncated support. For adaptively setting , we can utilize stopping rules as discussed in Section B of the Supplementary.
2.5.2 Fixed initialization for local modes
We may be interested in targeting local modes of the posterior when we value interpretability of posterior quantities over exact posterior representation. For example in component mixture models, there will be symmetrical modes (or sets of modes), and labelswitching occurs if the sampler travels between these (Jasra et al., 2005) which impedes useful inference in terms of clustering.
We can target one NPL posterior mode through a fixed initialization scheme by taking advantage of the fact that local optimization methods like expectationmaximization (EM) or gradient ascent are hillclimbers. We initialize each maximization step with the same
, causing the sampler to stay within the basin of attraction of the local posterior mode with high probability. In highdimensional models, it may be difficult to visualize modes of the posterior, so we can utilize VB’s modeselection to select , assuming the Bayesian and NPL posterior modes are close. Meanfield VB also tends to underestimate posterior variance (Blei et al., 2017), so we are able to obtain accurate local uncertainty quantification of the mode through this scheme. Fixed initialization NPL (FINPL) is shown in Algorithm 4:2.6 LossNPL
As we cannot define priors on directly, we can instead penalize undesirable properties in the loss:
(8) 
For example, obtains the Bayesian NPLLasso, or we can set if we have some prior preference. We recommend if we desire roughly the same prior regularization as in Bayesian inference, where is the size of the training set. The reasoning is outlined in Section D of the Supplementary. We could also tune through desired predictive performance or properties of . Note that we are no longer encoding prior beliefs, and are instead expressing an alternative parameter of interest that minimizes the expectation of (8).
2.7 Related Work
We build on the work of Lyddon et al. (2018) which specifies a mixture of Dirichlet Process prior on , and recovers conventional Bayesian inference in the limit of . Although the foundations of nonparametric learning are unchanged, our NPL posterior is decoupled from the Bayesian model, offering flexibility in prior measure selection, computational scalability and full multimodal exploration.
NPL unsurprisingly overlaps with other nonparametric approaches to inference. We recover the Bayesian bootstrap (Rubin, 1981) if we set , and further setting gives the weighted likelihood bootstrap (Newton & Raftery, 1994), as discussed in Lyddon et al. (2018). Setting the loss to (8) and also returns the fixed prior weighted Bayesian bootstrap (Newton et al., 2018). However, these methods were posited as approximations to the true Bayesian posterior, and the Bayesian bootstrap/weighted likelihood bootstrap are unable to incorporate prior information. The NPL posterior on the other hand is exact and distinct to the conventional Bayesian posterior with theoretical advantages, and we are able to incorporate prior information either through or .
Treating parameters as functionals of the sampling distribution is akin to empirical likelihood methods (Owen, 1988), in which parameters are defined through estimating equations of the form . The definition of a parameter of interest through the loss is also present in general Bayesian updating introduced by Bissiri et al. (2016), where a coherent posterior over a parameter of interest is obtained without the need to specify a joint generative model. Their target parameter is equivalent to (1), and their methodology is built on a notion of coherency.
3 Examples
We now demonstrate our method on some examples, and compare it to conventional Bayesian inference with the NoUTurn Sampler (NUTS) by Homan & Gelman (2014), and Automatic Differentiation Variational Inference (ADVI) by Kucukelbir et al. (2017) in Stan (Carpenter et al., 2017)
. We select these as baselines as the methods are well established and optimized. All NPL examples are run on 4 Azure F72s_v2 (72 vCPUs) virtual machines, implemented in Python. The NUTS and ADVI examples cannot be implemented in an embarrassingly parallel manner, so they are run on a single Azure F72s_v2. We avoid running multiple MCMC chains in parallel as the models are multimodal which may impede mixing, and combining unmixed chains is unprincipled. For tabulated results, each run was repeated 30 times with different seeds, and we report the mean with 1 standard error. We emphasize again that our NPL posterior is distinct to the conventional Bayesian posterior, so we are comparing the two inference schemes and their associated sampling methods.
3.1 Gaussian mixture model
We demonstrate the ability of RRNPL to accurately sample from a multimodal posterior in a component, dimensional diagonal GMM toy problem, which NUTS and ADVI fail to do. It should be noted that in addition to the symmetrical modes present from labelswitching, further multimodality is present due to the nonlogconcavity of the likelihood. We further show how FINPL can be used to sample from a local mode in a clustering example with real data whilst providing accurate local uncertainty quantification which ADVI is unable to do. Our conventional Bayesian model for , and is:
(9)  
The posterior is multimodal, and we use ADVI and NUTS for inference. For NPL, we are interested in model fitting, so our loss function is simply the negative loglikelihood:
(10) 
In the case of small , we may want to include a regularization term in the loss to avoid singularities of the likelihood. We select the DP prior separately for each example.
3.1.1 Toy example: Implementation and Results
We analyze toy data from a GMM with , and the following parameters:
We generate for model fitting and another heldout for model evaluation with different seeds for each of the 30 runs. For the Bayesian model we set , and for NPL we set as . We optimize each bootstrap maximization with a weighted EM algorithm (derived in Section E.2.1 of the Supplementary), and implement this in a modified GaussianMixture class from sklearn.mixture (Pedregosa et al., 2011). For RRNPL, we initialize , and for each restart. For FINPL we initialize with one of the posterior modes from RRNPL. We produce posterior samples for each method. We evaluate the predictive performance of each method on heldout test data with the mean log pointwise predictive density (LPPD) as suggested by Gelman et al. (2013), which is described in Section E.3.1 of the Supplementary. A larger value is equivalent to a better fit to the test data.
Figure 1 shows the posterior KDEs of () for 1 run of each method. RRNPL clearly recovers the multimodality of the NPL posterior, including the symmetry about due to the nonidentifiability of the GMM posterior. NUTS and ADVI remain trapped in one local mode of the Bayesian posterior as expected. Even if we carried out random initialization of NUTS/ADVI over multiple runs, each run would only pick out one mode, and there is no general method to combine the posteriors. ADVI also clearly underestimates the marginal posterior uncertainty. FINPL remains in a single mode, showing that we can fix labelswitching through this initialization. However, the FINPL mode is not identical to a truncated version of the RRNPL mode, as posterior mass is not reallocated symmetrically from the other modes. We see in Tables 1, 2 that RRNPL has similar mean LPPD on toy test data compared to NUTS, and is twice as fast as NUTS.
3.1.2 MNIST: Implementation and Results
We now demonstrate FINPL on clustering handwritten digits from MNIST (LeCun & Cortes, 2010), which consists of pixel images. In this example and . We normalize all pixel values such that they lie in the interval , and set . We believe a priori that many pixels are close to 0, so for ease we elicit a tight normal centering measure for the DP:
(11) 
NUTS is prone to the labelswitching problem and is too computationally intensive as ADVI already requires 5 hours, so we only compare FINPL to ADVI. We set for ADVI, and for FINPL with . We carry out a single run of ADVI to select a local mode, and set of FINPL to the ADVIselected mode. We then carry out 30 repeats of FINPL with this initialization, and compare to the original ADVI run. We see in Figure 2 that we obtain larger posterior variances in FINPL, as ADVI likely underestimates the posterior variances due to the meanfield approximation. Notice the modes are not exactly aligned as the NPL and Bayesian posterior are distinct, and furthermore ADVI is approximate. We conjecture that ADVI does not set components exactly to 0 due to the strong Dirichlet prior. We see in Tables 1, 2 that FINPL is predictively better and runs around 300 times faster than ADVI.
RRNPL  FINPL  NUTS  ADVI  

Toy  1.909 0.040  1.911 0.040  1.908 0.039  1.912 0.041 
MNIST  /  2173.56 9.9  /  1188.2 
RRNPL  FINPL  NUTS  ADVI  

Toy  37.2s 4.5s  5.5 2.2s  1m20s 16s  0.8s 0.1s 
MNIST  /  57.9s 1.0s  /  5h6m 
3.2 Logistic regression with Automatic Relevance Determination priors
We now demonstrate the predictive performance and computational scalability of lossNPL in a Bayesian sparse logistic regression example on real datasets. To induce sparsity, we place automatic relevance determination (ARD) priors (MacKay, 1994)
on the coefficients with Gamma hyperpriors
(Gelman et al., 2008). The conventional Bayesian model for and is:(12)  
Marginally, the prior is the nonstandardized tdistribution with (degrees of freedom, location, squared scale):
(13) 
This posterior is intractable and potentially multimodal due to the nonlogconcavity of the prior, and we carry out conventional Bayesian inference via NUTS and ADVI. When applying lossNPL to regression, we assume , and place a DP prior on the joint distribution . We target the parameter which satisfies (1) with loss:
(14)  
which is the negative sum of the loglikelihood and logprior, with additional scaling parameter . Again our NPL posterior may be multimodal due to the nonconvexity of the loss, and so we utilize RRNPL. It should be noted that our target parameter is now different to conventional Bayesian inference, but our method achieves the common goal of variable selection under a Bayesian framework. For the DP prior, we elicit the centering measure
(15)  
The prior assumes are independent which is equivalent to assuming a priori. This is appropriate as we believe many components of to be close to 0. The prior on is its empirical distribution, which is in an empirical Bayes manner where the prior is estimated from the data.
3.2.1 Implementation and results
We analyze 3 binary classification datasets from the UCI ML repository (Dheeru & Karra Taniskidou, 2017): ‘Adult’ (Kohavi, 1996), ‘Polish companies bankruptcy 3rd year’, (Zikeba et al., 2016), and ‘Arcene’ (Guyon et al., 2005) with details in Table 3
. We handle categorical covariates with dummy variables, and normalize all covariates to have mean 0 and standard deviation 1. Missing real values were imputed with the mean, and data with missing categorical values were dropped. We carry out a random stratified traintest split for each of the 30 runs, with 8020 split for ‘Adult’, ‘Polish’ and 5050 split for ‘Arcene’ due to the smaller dataset. For both NPL and conventional Bayesian inference, the hyperparameters were set to
, which was selected by tuning the sparsity of the Bayesian posterior means to a desired value. For NPL, we set for ‘Adult’ and ‘Polish’ as is sufficiently large, and for ‘Arcene’ with as is only . We set for each dataset as explained in Section 2.6 for a fair comparison to the conventional Bayesian model. We initialize each optimization with , and select the number of restarts to for expediency. Optimization was carried out using the LBFGSB algorithm (Zhu et al., 1997) implemented in scipy.optimize (Jones et al., 2001–).We can see in Table 4 that lossNPL is predictively similar or better than NUTS and ADVI, and from Table 5 we see that the posterior mean is sparser for lossNPL as well. Finally, we see from Table 6 that the lossNPL runtimes for posterior samples are much faster than for NUTS, and comparable to VB. Further measures of predictive performance are provided in Section E.3.4 of the Supplementary.
Data Set  Type  Positive %  

Adult  Cat.  96  36177  9045  24.6 
Polish  Real  64  8402  2101  4.8 
Arcene  Real  10000  100  100  44.0 
Data Set  LossNPL  NUTS  ADVI 

Adult  0.3260.004  0.3260.004  0.327 0.004 
Polish  0.229 0.034  3.336 4.162  0.247 0.047 
Arcene  0.449 0.104  0.464 0.032  0.445 0.068 
Data Set  LossNPL  NUTS  ADVI  

Adult  0.1  17.6 2.8  16.1 2.7  12.1 3.1 
Polish  0.1  33.5 4.7  15.9 3.3  15.83.5 
Arcene  0.01  87.4 0.7  4.7 0.3  3.5 0.3 
Data Set  LossNPL  NUTS  ADVI 

Adult  2m24s 8s  2h36m 4m  26.9s 7.3s 
Polish  19.0s 4.0s  1h20m 21m  3.3s 0.8s 
Arcene  2m20s 2s  4h31m 53m  54.2s 3.3s 
3.3 Bayesian sparsitypathanalysis
We now utilize lossNPL to carry out Bayesian sparsitypathanalysis for logistic regression, which allows us to visualize how the responsibility of each covariate changes with the sparsity penalty as discussed by Lee et al. (2012). We use the same ARD prior as Section 3.2 with the same initialization scheme, set , and elicit a noninformative DP prior with . We found empirically that the results for larger values of are similar and so the approximation with is sufficient. We fix and vary the value of to favour solutions of different sparsity. This varies the squared scale of the Studentt prior with fixed degrees of freedom, where a smaller corresponds to a heavier sparsity penalty and thus more components are set to 0.
3.3.1 Implementation and Results
We analyze the genotype/pseudophenotype dataset with as described by Lee et al. (2012), containing patient covariates which exhibit strong blocklike correlations as shown in Figure 3. We normalize the covariates to have mean 0 and standard deviation 1. The pseudophenotype data is generated by , where has 5 randomly selected nonzero components out of , with the rest set to 0. Each nonzero component is sampled from , and the exact values of are provided in Section E.4.1 of the Supplementary. We set and vary for , and generate 4000 posterior samples for each setting.
The posterior medians of the nonzero components of with
central credible interval are shown in Figure
4 for a range of values. Both the posterior median and central credible intervals are estimated through the appropriate order statistics of the posterior samples (Gelman et al., 2013). We can see that , and have early predictive power as their credible intervals remain large despite a significant sparsity penalty (small ), whilst the other two coefficients are masked. A plot of the absolute medians for all components is included in Section E.4.2 of the Supplementary. For and , the median is close to 0 but the credible interval is large which is due to the multimodality of the marginal posterior. This multimodality is also responsible for the jitter in the median around for in Figure 4 ; the true median likely lies between the two separated modes but the finite posterior sample size causes the sample median to jump between the two. A posterior marginal KDE plot of changing with is shown in Figure 5, allowing us to visualize how the importance of the covariate changes with the sparsity penalty. We observe the bimodality in the marginal posterior for as expected from the above discussion.LossNPL required 5 minutes 24 seconds to generate all posterior samples. The computational speed of NPL enables fast Bayesian analysis of large datasets with different hyperparameter settings, allowing for Bayesian variable selection analysis.
4 Discussion
We have introduced a variant of Bayesian nonparametric learning (NPL) with a Dirichlet process (DP) prior on the sampling distribution , which leads to highly scalable exact inference under model misspecification, detached from the conventional Bayesian posterior. This method admits a sampling scheme for multimodal posteriors that allows for full mode exploration, which involves a nonconvex optimization that we solve through random restart. We demonstrated that NPL can perform predictively better than conventional Bayesian inference, while providing exact uncertainty quantification.
For future work, the small sample performance of NPL could be further explored and compared to conventional Bayesian inference; we currently recommend NPL for moderate to large values of . The scaling of the number of repeats with increasing dimension for full mode exploration would also be a future avenue of research.
Acknowledgements
We would like to thank Ho Chung Leon Law for helpful feedback and Anthony Lee for providing the genotype dataset. EF is funded by The Alan Turing Institute Doctoral Studentship, under the EPSRC grant EP/N510129/1. SL is funded by the EPSRC OxWaSP CDT, through EP/L016710/1. CH is supported by The Alan Turing Institute, the HDRUK, the Li Ka Shing Foundation, and the MRC.
References
 Antoniak (1974) Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974. ISSN 00905364. doi: 10.1214/aos/1176342871.
 Bardenet et al. (2017) Bardenet, R., Doucet, A., and Holmes, C. On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res., 18(1):1515–1557, 2017. ISSN 15324435.
 Betrò & Schoen (1987) Betrò, B. and Schoen, F. Sequential stopping rules for the multistart algorithm in global optimisation. Mathematical Programming, 38(3):271–286, 1987. ISSN 14364646. doi: 10.1007/BF02592015.
 Bissiri et al. (2016) Bissiri, P. G., Holmes, C. C., and Walker, S. G. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016. ISSN 13697412. doi: 10.1111/rssb.12158.
 Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. doi: 10.1080/01621459.2017.1285773.
 Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32, 2017. ISSN 15487660. doi: 10.18637/jss.v076.i01.
 Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Dick et al. (2014) Dick, T., Wong, E., and Dann, C. How many random restarts are enough? Technical report, 2014.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2011. ISSN 15324435.
 G. E. Boender & H. G. Rinnooy Kan (1987) G. E. Boender, C. and H. G. Rinnooy Kan, A. Bayesian stopping rules for multistart global optimization methods. Mathematical Programming, 37:59–80, 1987. doi: 10.1007/BF02591684.
 Gelman et al. (2008) Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.S. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–1383, 2008. doi: 10.1214/08AOAS191.
 Gelman et al. (2013) Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian data analysis, third edition. Chapman & Hall/CRC Texts in Statistical Science, chapter 1,2. Taylor & Francis, 2013. ISBN 9781439840955.
 Ghosal (2010) Ghosal, S. The Dirichlet process, related priors and posterior asymptotics, pp. 35–79. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2010. doi: 10.1017/CBO9780511802478.003.
 Ghosal & van der Vaart (2017) Ghosal, S. and van der Vaart, A. Fundamentals of nonparametric bayesian inference. Cambridge Series in Statistical and Probabilistic Mathematics, chapter 4. Cambridge University Press, 2017. doi: 10.1017/9781139029834.
 Giordano et al. (2015) Giordano, R., Broderick, T., and Jordan, M. Linear response methods for accurate covariance estimates from mean field variational bayes. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 1, NIPS’15, pp. 1441–1449, Cambridge, MA, USA, 2015. MIT Press.
 Grünwald & Van Ommen (2017) Grünwald, P. and Van Ommen, T. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017. doi: 10.1214/17BA1085.

Guyon et al. (2005)
Guyon, I., Gunn, S., BenHur, A., and Dror, G.
Result analysis of the NIPS 2003 feature selection challenge.
In Saul, L. K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17, pp. 545–552. MIT Press, 2005.  Homan & Gelman (2014) Homan, M. D. and Gelman, A. The NoUturn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014. ISSN 15324435.
 Hu et al. (1994) Hu, X., Shonkwiler, R., and Spruill, M. C. Random restarts in global optimization. Technical report, 1994.
 Ibrahim & Chen (2000) Ibrahim, J. G. and Chen, M.H. Power prior distributions for regression models. Statist. Sci., 15(1):46–60, 2000. doi: 10.1214/ss/1009212673.
 Ishwaran & Zarepour (2002) Ishwaran, H. and Zarepour, M. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30:269 – 283, 2002. doi: 10.2307/3315951.
 Jain & Kar (2017) Jain, P. and Kar, P. Nonconvex optimization for machine learning. Foundations and Trends® in Machine Learning, 10:142–336, 2017. doi: 10.1561/2200000058.
 Jasra et al. (2005) Jasra, A., Holmes, C. C., and Stephens, D. A. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67, 2005. ISSN 08834237. doi: 10.1214/088342305000000016.
 Jin et al. (2016) Jin, C., Zhang, Y., Balakrishnan, S., Wainwright, M. J., and Jordan, M. I. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp. 4123–4131, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
 Jones et al. (2001–) Jones, E., Oliphant, T., Peterson, P., et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Kohavi (1996)
Kohavi, R.
Scaling up the accuracy of naiveBayes classifiers: a decisiontree hybrid.
In Proceedings of the second internal conference on knowledge on knowledge discovery and data mining, pp. 202–207. AAAI Press, 1996.  Kucukelbir et al. (2017) Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. Automatic Differentiation Variational Inference. J. Mach. Learn. Res., 18(1):430–474, 2017. ISSN 15324435.
 LeCun & Cortes (2010) LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Lee et al. (2010) Lee, A., Caron, F., Doucet, A., and Holmes, C. A hierarchical Bayesian framework for constructing sparsityinducing priors. arXiv eprints, art. arXiv:1009.1914, 2010.
 Lee et al. (2012) Lee, A., Caron, F., Doucet, A., and Holmes, C. Bayesian sparsitypathanalysis of genetic association signal using generalized t priors. Statistical applications in genetics and molecular biology, 11 2, 2012.
 Lyddon et al. (2018) Lyddon, S., Holmes, C., and Walker, S. General Bayesian Updating and the LossLikelihood Bootstrap. arXiv eprints, art. arXiv:1709.07616, 2018.
 Lyddon et al. (2018) Lyddon, S., Walker, S., and Holmes, C. C. Nonparametric learning from Bayesian models with randomized objective functions. In Advances in Neural Information Processing Systems 31, pp. 2075–2085. Curran Associates, Inc., 2018.
 MacKay (1994) MacKay, D. J. C. Bayesian nonlinear modeling for the prediction competition. ASHRAE Trans, 100, 1994.
 Neal (1996) Neal, R. M. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366, 1996. ISSN 09603174. doi: 10.1007/BF00143556.
 Newton & Raftery (1994) Newton, M. and Raftery, A. Approximate bayesian inference by the weighted likelihood bootstrap. Journal of the Royal Statistical Society Series BMethodological, 56:3 – 48, 1994. doi: 10.1111/j.25176161.1994.tb01956.x.
 Newton et al. (2018) Newton, M., Polson, N. G., and Xu, J. Weighted Bayesian bootstrap for scalable Bayes. arXiv eprints, art. arXiv:1803.04559, 2018.

Owen (1988)
Owen, A. B.
Empirical likelihood ratio confidence intervals for a single functional.
Biometrika, 75(2):237–249, 1988. doi: 10.1093/biomet/75.2.237.  Park & Casella (2008) Park, T. and Casella, G. The Bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008. ISSN 01621459. doi: 10.1198/016214508000000337.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Pompe et al. (2018) Pompe, E., Holmes, C., and Łatuszyński, K. A framework for adaptive MCMC targeting multimodal distributions. arXiv eprints, art. arXiv:1812.02609, 2018.
 Rubin (1981) Rubin, D. B. The Bayesian bootstrap. The Annals of Statistics, 9(1):130–134, 1981. ISSN 00905364. doi: 10.1214/aos/1176345338.
 Rudoy & Wolfe (2006) Rudoy, D. and Wolfe, P. J. Monte carlo methods for multimodal distributions. In 2006 Fortieth Asilomar Conference on Signals, Systems and Computers, pp. 2019–2023, 2006. doi: 10.1109/ACSSC.2006.355120.
 Seeger et al. (2007) Seeger, M., Gerwinn, S., and Bethge, M. Bayesian inference for sparse generalized linear models. In Proceedings of the 18th European Conference on Machine Learning, ECML ’07, pp. 298–309, Berlin, Heidelberg, 2007. SpringerVerlag. ISBN 9783540749578. doi: 10.1007/9783540749585˙29.
 Sethuraman (1994) Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650, 1994. ISSN 10170405, 19968507.
 Stephens (1999) Stephens, M. Dealing with multimodal posteriors and nonidentifiability in mixture models. Journal of the Royal Statistical Society, Series B, 1999.
 van der Vaart (1998) van der Vaart, A. W. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics, chapter 5. Cambridge University Press, 1998. ISBN 0521496039.
 Walker (2013) Walker, S. G. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 2013. ISSN 03783758. doi: 10.1016/j.jspi.2013.05.013.
 Watson & Holmes (2016) Watson, J. and Holmes, C. Approximate models and robust decisions. Statist. Sci., 31(4):465–489, 11 2016. doi: 10.1214/16STS592.
 Yamato (1984) Yamato, H. Characteristic functions of means of distributions chosen from a Dirichlet process. The Annals of Probability, 12(1):262–267, 1984. ISSN 00911798. doi: 10.1214/aop/1176993389.
 Zhu et al. (1997) Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: LBFGSB: Fortran subroutines for largescale boundconstrained optimization. ACM Trans. Math. Softw., 23(4):550–560, 1997. ISSN 00983500. doi: 10.1145/279232.279236.
 Zikeba et al. (2016) Zikeba, M., Tomczak, S. K., and Tomczak, J. M. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications, 2016.
Appendix A Eliciting the prior DP
a.1 Intuition of the prior
The parameter of interest when model fitting (Walker, 2013) is:
(A.1)  
The prior on is:
(A.2) 
The effects of the implicit prior on due to when modelfitting can be seen in the limit of under regularity conditions:
(A.3) 
In the limit, the prior collapses on one of the points that minimizes the KL divergence between the prior centering density and the model. Intuitively, the prior regularizes towards (A.3), and acts as weighting between and .
a.2 Selecting through the mean functional
We can tune through the a priori variance of the mean functional:
(A.4)  
If , then the a prior variance of (A.4) follows from the properties of the DP:
(A.5) 
and so we can elicit from a priori knowledge of .
Appendix B Stopping rules for adaptively selecting
Although not explored in our paper, we can utilize heuristic stopping rules for adaptively selecting
for full mode exploration when sampling from the NPL posterior. A simple example is to stop the repeats if there have been no improvements in the optimized function value for the last repeats, where is the parameter of the stopping rule. More complex methods involve estimating the missing probability mass due to local minima not being observed, and thresholding based on that. See Betrò & Schoen (1987); Dick et al. (2014) for a comparison of some methods.Appendix C Stochastic Subsampling
For very large , we can utilize stochastic gradient methods by subsampling to optimize the weighted loss. The full weighted loss and gradient are defined as:
(C.1)  
If we subsample a minibatch , we can then calculate the minibatch gradient:
(C.2) 
The minibatch gradient is unbiased:
(C.3)  
Setting
allows use to use stochastic gradient descent (SGD) and its variants which improves scalability. Furthermore, extensions to SGD such as ADAGRAD
(Duchi et al., 2011) and ADAM (Kingma & Ba, 2014) help with escaping saddle points, which can potentially reduce the number of required for RRNPL to obtain full mode exploration.Appendix D Selecting in lossNPL
For lossNPL we can set the loss function to:
(D.1) 
In this case, we recommend the scaling parameter to be if we want roughly the same prior regularization of as in traditional Bayesian inference. This can be seen when we look at the expected of for (i.e. ):
(D.2) 
We obtain the same weighting as in Bayesian inference between the loglikelihood and logprior for .
Appendix E Examples
e.1 Toy example: Normal location model
We now empirically demonstrate the small sample performance of NPL and the role of the prior concentration in a toy normal location model problem. Suppose the model of interest is with known . Our parameter of interest is defined:
(E.1)  
If we set the derivative of the objective to 0, we obtain:
(E.2) 
If we believe our parametric model to be accurate, we can place a prior on . The centering measure on our DP is thus:
(E.3) 
When , our NPL prior is approximately normal (Yamato, 1984) from the properties of the DP:
(E.4) 
e.1.1 Implementation and results
We sample the observables and set our parametric prior variance to . We simulate the NPL posterior in Figure E.1 for various values of and , and compare it to the tractable traditional Bayesian posterior with the same model . For the NPL posterior bootstrap sampler, we generate samples and truncate the DP at .
We see from Figure E.1 that the NPL prior is approximately normal (), with same mean and variance due to the choice of . For large , the NPL posterior and Bayesian posterior are similar, due to the first order correctness of the weighted likelihood bootstrap (Newton & Raftery, 1994). For smaller values of
, the NPL posterior is nonnormal, as our prior is not a conjugate prior on
. For , the sample observed is close to so the posterior uncertainty is small despite only observing one sample; this suggests that NPL may be better suited to moderate to large values of . Figure E.2 shows the effect on the NPL posterior of increasing prior strength for , which regularizes the posterior but also causes it to concentrate about .e.2 Gaussian mixture models
e.2.1 Optimization details
We derive the EM algorithm that maximizes the weighted likelihood of the diagonalcovariance GMM:
Taking an expectation over the posterior :
Taking the difference of the weighted likelihood with :
From Gibbs’ inequality:
As all :
So by maximizing w.r.t. , we cannot decrease the weighted loglikelihood. As a reminder, the loglikelihood for each datapoint is:
(E.5) 
At the expectation step, we calculate:
(E.6) 
The maximization step is then:
(E.7)  
e.2.2 Toy Example
We see the posterior KDE plots for , and in Figures E.5, E.5, E.5, and for increasing in Figures E.8, E.8, E.8. For RRNPL, we observe multimodality in addition to symmetry about the diagonal due to labelswitching. Smaller values of exhibit an overrepresentation of local modes/saddles, and the posterior accuracy increases for larger . We also show the runtimes for different for RRNPL in Table E.1, and we see that the runtime increases roughly linearly with .
Toy Sep  4.9 0.6  8.0 1.1  19.0 2.1  37.2s 4.5s 

e.2.3 Another Toy Example
We carry out the same experiment in another toy GMM example with the parameters:
The means are closer together, and so NUTS is prone to labelswitching. We can see in Figures E.11, E.11, E.11 that NUTS exhibits labelswitching, and that the Bayesian and NPL posterior are distinct. We obtain full mode exploration with RRNPL, and do not observe labelswitching with FINPL. We utilize a Gaussian kernel KDE for visualization, but in reality the posterior of will have 0 density outside the triangular simplex.
e.3 Logistic regression with Automatic Relevance Determination priors
e.3.1 Predictive performance
For all the following measures of predictive performance on heldout test data, we can use a Monte Carlo estimate of the predictive distribution of a test data point :
(E.8)  
where is the NPL or Bayesian posterior, is the number of posterior samples, and is the training set. We evaluate the mean log pointwise predictive probability (LPPD) (Gelman et al., 2013) of heldout test data as a measure of predictive performance:
(E.9) 
Below, we additionally include the mean squared error (MSE) here on heldout test data:
(E.10) 
Finally, we also report the percentage accuracy:
P.a.  (E.11)  
where is the indicator function.
e.3.2 Sparsity measure
For the sparsity results, we simply calculate the posterior mean , where as above. We then report the percentage of components of that have absolute value less than .
e.3.3 Optimization details
LBFGSB (Zhu et al., 1997) is a quasiNewton method which requires the gradient, which for the marginal Studentt distribution is defined for as:
(E.12)  
e.3.4 Additional results
We can see in Tables E.2, E.3 that lossNPL performs equally or better than NUTS and ADVI predictively in MSE and classification accuracy as well as LPPD. A posterior marginal density plot for in the ‘Adult’ dataset is shown in Figure E.12 for reference.
Data Set  LossNPL  NUTS  ADVI 

Adult  0.104 0.001  0.104 0.001  0.105 0.002 
Polish  0.056 0.011  0.524 0.236  0.058 0.011 
Arcene  0.1340.026  0.152 0.014  0.143 0.020 
Data Set  LossNPL  NUTS  ADVI 

Adult  84.92 0.29  84.92 0.30  84.84 0.30 
Polish  93.65 1.68  37.27 
Comments
There are no comments yet.