As datasets grow in complexity and size, Bayesian inference becomes increasingly difficult. The posterior is often intractable, so we resort to simulation methods for inference via Markov chain Monte Carlo (MCMC), which is inherently serial and often too computationally expensive in datasets with a large number of data points (Bardenet et al., 2017). MCMC further struggles with multimodal posteriors which arise in many settings including mixture models (Jasra et al., 2005) or non-convex priors (Seeger et al., 2007), as the MCMC sampler can become trapped in local modes (Rudoy & Wolfe, 2006). Current methods to sample from multimodal posteriors with MCMC include parallel tempering (Neal, 1996) and adaptive MCMC (Pompe et al., 2018), but the associated computational cost is high. Posterior approximation with variational Bayes (VB) (Blei et al., 2017) is a faster alternative, but it is generally difficult to quantify the quality of the approximation, and is thus problematic if accurate uncertainty quantification is desired (Giordano et al., 2015).
A further methodological issue facing Bayesian inference is the fact that all models are false. The increasing scale of datasets exacerbates the effects of model misspecification (Walker, 2013), as the true sampling distribution is meaningfully different from the parametric family of distributions of the model. There is rarely formal acknowledgement of model misspecification which can lead to inconsistencies (Watson & Holmes, 2016; Grünwald & Van Ommen, 2017).
Bayesian nonparametric learning (NPL) introduced by Lyddon et al. (2018)
allows for the use of statistical models without assuming the model is true. NPL uses a nonparametric prior centred on a parametric model, and returns a nonparametric posterior over the parameter of interest. The method focuses on correcting for model misspecification and for posterior approximation such as from Variational Bayes (VB) by placing a mixture of Dirichlet Processes(Antoniak, 1974) prior on the sampling distribution. In addition to the acknowledgement of model misspecification, the method admits an embarrassingly parallel Monte Carlo sampling scheme consisting of randomized maximizations. However, in most cases this method requires sampling the Bayesian posterior, which is computationally expensive for complex models.
1.1 Our contribution
In this work, we propose a simplified variant of NPL that utilises a Dirichlet Process (DP) prior on
instead of a mixture of Dirichlet Processes (MDP) prior. This allows us to perform inference directly and detaches the nonparametric prior from the prior of the model parameter of interest. Instead of centering on a Bayesian posterior, we center the DP on a sampling distribution which encapsulates our prior beliefs. This simpler choice of prior also has desirable theoretical properties and is highly scalable as we no longer need to sample from the Bayesian posterior. Our method can handle a variety of statistical models through the choice of the loss functions, and can be applied to a wide range of machine learning settings as we will demonstrate in Section3. Our method implies a natural noninformative prior, which may be relevant when the number of data points is substantially larger than the number of parameters.
The posterior bootstrap sampling scheme was introduced by Lyddon et al. (2018) under the NPL framework, and we inherit its computational strengths such as parallelism and exact inference under a Bayesian nonparametric model. Independent samples from the nonparametric posterior are obtained through the optimization of randomized objective functions, and we obtain the weighted likelihood bootstrap (Newton & Raftery, 1994) as a special case. Furthermore, sampling from multimodal posteriors now involves a non-convex optimization at each bootstrap sample that we solve through local search and random restart. We demonstrate that our method recovers posterior multimodality on a Gaussian Mixture Model (GMM) problem. We further show that our method is computationally much faster than conventional Bayesian inference with MCMC, and has superior predictive performance on real sparse classification problems. Finally, we utilize the computational speed of NPL to carry out a Bayesian sparsity-path-analysis for variable selection on a genetic dataset.
2 Bayesian Nonparametric Learning
Assume that we have observed , where is a sequence of i.i.d. observables and is the unknown sampling distribution. We may be interested in a parameter
, which indexes a family of probability densities
. Conventional Bayesian updating of the prior to the posterior via Bayes’ theorem formally assumes thatbelongs to the model , which is questionable in the presence of complex and large datasets. This assumption is not necessary for NPL. We derive the foundations of NPL by treating parameters as functionals of , with model fitting as a special case.
2.1 The parameter of interest
We define our parameter of interest as satisfying an estimating equation:
where is a loss function, and its form can be used to target statistics of interest. For example, setting returns the median and returns the mean.
The loss function of particular interest is , where is the density of some parametric model. The value of
minimises the Kullback-Leibler divergence, which is the parameter of interest in conventional Bayesian analysis (Walker, 2013; Bissiri et al., 2016). We have not assumed that contains , and in this case does not have any particular generative meaning as it is simply the parameter that satisfies (1).
2.2 The Dirichlet process prior
As the sampling distribution is unknown, we place a DP prior on :
where is our prior centering measure, and is the strength of our belief.
The base measure
We encode our prior knowledge about the sampling distribution in the measure . If we believe a particular model to be accurate, and have prior beliefs about encoded in , a sensible choice for the density of is . Alternatively, we could directly specify
as a density that accurately represents our beliefs without the burden of defining a joint distribution on. In the presence of historical data , a suitable choice for is the empirical distribution of the historical data, i.e. where is the Dirac measure. This is in a similar fashion to power priors (Ibrahim & Chen, 2000). Further intuition is provided in Section A.1 of the Supplementary.
It should be noted that we cannot directly include a prior on the parameter of interest , only implicitly through . Our prior is selected independently of the model of interest, and this is appropriate under a misspecified model setting since we do not believe there to be a true . As all parameters of interest are defined as a functional of as in (1), any informative prior on is thus informative of .
The size of measures the concentration of the DP about
, and a large value corresponds to a smaller variance in a functional of the DP. We see in (3) that the NPL posterior centering measure is a weighted sum of the prior and the empirical distribution , with the weights proportional to and respectively. We can thus interpret as the effective sample size from the prior . One method of selecting is through simulation of the prior distribution of via (1) and tuning its variance. Alternatively, we can select through the a priori variance of the mean functional (see Section A.2 of the Supplementary). The special case of corresponds to the Bayesian bootstrap (Rubin, 1981), which in our case corresponds to a natural way to define an noninformative prior about (see Gelman et al. (2013) for a review on noninformative priors). For , it may be suitable to set as the prior should have little influence and the Bayesian bootstrap is more computationally efficient.
2.3 The NPL posterior
From the conjugacy of the DP, the posterior of is:
Our NPL posterior is thus:
where ; the delta arises as is a deterministic functional of as in (1). Properties of the NPL posterior follow from properties of the DP. For example, draws of are almost surely discrete, so (1) simplifies to:
where and from the stick-breaking construction (Sethuraman, 1994). Formally, the GEM distribution is defined:
We preserve the theoretical advantages from Lyddon et al. (2018) due to the symmetries in the limits of the DP and the MDP for and .
Under regularity conditions, the NPL posterior is consistent at as defined in (1), from the properties of the DP (see van der Vaart (1998); Ghosal (2010); Ghosal & van der Vaart (2017) for details). Interestingly, this is true regardless of the choice of and its support. This is not the case in conventional Bayesian inference through Bayes’ rule where the support of the prior must contain for posterior consistency. This is particularly reassuring in our misspecified model setting, as inferences about are robust to choices of .
The NPL posterior predictivefor asymptotically dominates the conventional Bayesian posterior predictive up to under regularity conditions, i.e.:
for all distributions , where is a non-negative and possibly positive real-valued functional. This states that compared to the Bayesian posterior predictive, the NPL posterior predictive is closer in expected KL divergence to the true up to . The proof for the MDP case is given in Theorem 1 of Lyddon et al. (2018), and the above follows from the equivalence of the MDP and the DP for .
2.4 Sampling from the NPL posterior
In almost all cases, is not tractable, but lends itself to a parallelizable Monte Carlo sampling scheme. It may be more intuitive to think of sampling from the posterior DP, then calculating (1) to generate the sample from :
where is the number of posterior bootstrap samples. One advantage of this sampling scheme is that it is embarrassingly parallel as each of the samples can be drawn independently. We can thus take advantage of increasingly available multi-core computing, unlike in conventional Bayesian inference as MCMC is inherently sequential.
2.4.1 The posterior bootstrap
Sampling from the DP exactly requires infinite computation time if is continuous, but approximate samples can be generated by truncation of the sum in (5). For example, we could truncate the stick-breaking and set the remaining weights to 0. Alternatively, we could approximate with the finite Dirichlet distribution for large . For further details, see Ishwaran & Zarepour (2002). We opt for the latter suggestion as Dirichlet weights can be generated efficiently, which leads to a simpler variant of the posterior bootstrap algorithm:
For , we simply draw , which is no longer an approximation and is equivalent to the Bayesian bootstrap. For , the sampling scheme is asymptotically exact for , but this is computationally infeasible. We could fix to a moderate value, or select it adaptively via adaptive NPL, where we use the stick-breaking construction until the remaining probability is less than .
2.5 Tackling multimodal posteriors with initialization
Multimodal posteriors can arise in Bayesian inference if the likelihood function is non-log-concave like in GMMs (Jin et al., 2016; Stephens, 1999), or if the prior is non-log-concave which can arise when selecting sparse priors (Seeger et al., 2007; Park & Casella, 2008; Lee et al., 2010). Unlike the method by Lyddon et al. (2018) with the MDP, our NPL posterior with the DP is now decoupled from the Bayesian posterior. There is thus no reliance on an accurate representation of the Bayesian posterior with potential multimodality, which MCMC and VB can often struggle to capture. If our loss function in (1) is non-convex (e.g. of a GMM), our NPL posterior may also be multimodal. This now presents an optimization issue: solving (1) requires non-convex optimization. In general, optimizing non-convex objectives is difficult (see Jain & Kar (2017)), but under smoothness assumption of the loss, we can apply convex optimization methods to find local minima.
2.5.1 Random restart for multiple modes
Random restart (see G. E. Boender & H. G. Rinnooy Kan (1987)) can be utilized with convex optimization methods to generate a list of potential global minima then selecting the one with the lowest objective. This involves random initializations of for each local optimization, and it was shown by Hu et al. (1994) that the uniform measure for has good properties for convergence. If the number of modes is finite, then the global minimum will be achieved asymptotically in the limit of the . The probability of obtaining the correct global minimum for finite is related to the size of its basin of attraction. Random restart NPL (RR-NPL) is shown in Algorithm 3:
This is particularly suited to NPL with non-convex loss functions for the following reasons. Firstly, random restart can utilize efficient convex optimization techniques such as quasi-Newton methods, and the restarts can be easily implemented in parallel which is coherent with our parallelizable sampling scheme. Secondly, we can compromise between accuracy and computational cost by selecting , as computational cost scales linearly with (though we can parallelize). The repercussions of an insufficiently large are not severe: our NPL posterior will incorrectly allocate more density to local modes/saddles but all modes will likely still be present for a sufficiently large . This is demonstrated in Section E.2.2 of the Supplementary. Finally, the uniform initialization can sample from nonidentifiable posteriors with symmetric modes (like in GMMs) as their basins of attraction are selected with equal probability.
Practically, uniform initialization may not be possible if the support of the parameter is infinite, e.g. the variance . In this case, we can simply pick another (e.g. Gamma for a positive parameter), or sample uniformly from a truncated support. For adaptively setting , we can utilize stopping rules as discussed in Section B of the Supplementary.
2.5.2 Fixed initialization for local modes
We may be interested in targeting local modes of the posterior when we value interpretability of posterior quantities over exact posterior representation. For example in component mixture models, there will be symmetrical modes (or sets of modes), and label-switching occurs if the sampler travels between these (Jasra et al., 2005) which impedes useful inference in terms of clustering.
We can target one NPL posterior mode through a fixed initialization scheme by taking advantage of the fact that local optimization methods like expectation-maximization (EM) or gradient ascent are hill-climbers. We initialize each maximization step with the same, causing the sampler to stay within the basin of attraction of the local posterior mode with high probability. In high-dimensional models, it may be difficult to visualize modes of the posterior, so we can utilize VB’s mode-selection to select , assuming the Bayesian and NPL posterior modes are close. Mean-field VB also tends to underestimate posterior variance (Blei et al., 2017), so we are able to obtain accurate local uncertainty quantification of the mode through this scheme. Fixed initialization NPL (FI-NPL) is shown in Algorithm 4:
As we cannot define priors on directly, we can instead penalize undesirable properties in the loss:
For example, obtains the Bayesian NPL-Lasso, or we can set if we have some prior preference. We recommend if we desire roughly the same prior regularization as in Bayesian inference, where is the size of the training set. The reasoning is outlined in Section D of the Supplementary. We could also tune through desired predictive performance or properties of . Note that we are no longer encoding prior beliefs, and are instead expressing an alternative parameter of interest that minimizes the expectation of (8).
2.7 Related Work
We build on the work of Lyddon et al. (2018) which specifies a mixture of Dirichlet Process prior on , and recovers conventional Bayesian inference in the limit of . Although the foundations of nonparametric learning are unchanged, our NPL posterior is decoupled from the Bayesian model, offering flexibility in prior measure selection, computational scalability and full multimodal exploration.
NPL unsurprisingly overlaps with other nonparametric approaches to inference. We recover the Bayesian bootstrap (Rubin, 1981) if we set , and further setting gives the weighted likelihood bootstrap (Newton & Raftery, 1994), as discussed in Lyddon et al. (2018). Setting the loss to (8) and also returns the fixed prior weighted Bayesian bootstrap (Newton et al., 2018). However, these methods were posited as approximations to the true Bayesian posterior, and the Bayesian bootstrap/weighted likelihood bootstrap are unable to incorporate prior information. The NPL posterior on the other hand is exact and distinct to the conventional Bayesian posterior with theoretical advantages, and we are able to incorporate prior information either through or .
Treating parameters as functionals of the sampling distribution is akin to empirical likelihood methods (Owen, 1988), in which parameters are defined through estimating equations of the form . The definition of a parameter of interest through the loss is also present in general Bayesian updating introduced by Bissiri et al. (2016), where a coherent posterior over a parameter of interest is obtained without the need to specify a joint generative model. Their target parameter is equivalent to (1), and their methodology is built on a notion of coherency.
We now demonstrate our method on some examples, and compare it to conventional Bayesian inference with the No-U-Turn Sampler (NUTS) by Homan & Gelman (2014), and Automatic Differentiation Variational Inference (ADVI) by Kucukelbir et al. (2017) in Stan (Carpenter et al., 2017)
. We select these as baselines as the methods are well established and optimized. All NPL examples are run on 4 Azure F72s_v2 (72 vCPUs) virtual machines, implemented in Python. The NUTS and ADVI examples cannot be implemented in an embarrassingly parallel manner, so they are run on a single Azure F72s_v2. We avoid running multiple MCMC chains in parallel as the models are multimodal which may impede mixing, and combining unmixed chains is unprincipled. For tabulated results, each run was repeated 30 times with different seeds, and we report the mean with 1 standard error. We emphasize again that our NPL posterior is distinct to the conventional Bayesian posterior, so we are comparing the two inference schemes and their associated sampling methods.
3.1 Gaussian mixture model
We demonstrate the ability of RR-NPL to accurately sample from a multimodal posterior in a -component, -dimensional diagonal GMM toy problem, which NUTS and ADVI fail to do. It should be noted that in addition to the symmetrical modes present from label-switching, further multimodality is present due to the non-log-concavity of the likelihood. We further show how FI-NPL can be used to sample from a local mode in a clustering example with real data whilst providing accurate local uncertainty quantification which ADVI is unable to do. Our conventional Bayesian model for , and is:
The posterior is multimodal, and we use ADVI and NUTS for inference. For NPL, we are interested in model fitting, so our loss function is simply the negative log-likelihood:
In the case of small , we may want to include a regularization term in the loss to avoid singularities of the likelihood. We select the DP prior separately for each example.
3.1.1 Toy example: Implementation and Results
We analyze toy data from a GMM with , and the following parameters:
We generate for model fitting and another held-out for model evaluation with different seeds for each of the 30 runs. For the Bayesian model we set , and for NPL we set as . We optimize each bootstrap maximization with a weighted EM algorithm (derived in Section E.2.1 of the Supplementary), and implement this in a modified GaussianMixture class from sklearn.mixture (Pedregosa et al., 2011). For RR-NPL, we initialize , and for each restart. For FI-NPL we initialize with one of the posterior modes from RR-NPL. We produce posterior samples for each method. We evaluate the predictive performance of each method on held-out test data with the mean log pointwise predictive density (LPPD) as suggested by Gelman et al. (2013), which is described in Section E.3.1 of the Supplementary. A larger value is equivalent to a better fit to the test data.
Figure 1 shows the posterior KDEs of () for 1 run of each method. RR-NPL clearly recovers the multi-modality of the NPL posterior, including the symmetry about due to the nonidentifiability of the GMM posterior. NUTS and ADVI remain trapped in one local mode of the Bayesian posterior as expected. Even if we carried out random initialization of NUTS/ADVI over multiple runs, each run would only pick out one mode, and there is no general method to combine the posteriors. ADVI also clearly underestimates the marginal posterior uncertainty. FI-NPL remains in a single mode, showing that we can fix label-switching through this initialization. However, the FI-NPL mode is not identical to a truncated version of the RR-NPL mode, as posterior mass is not reallocated symmetrically from the other modes. We see in Tables 1, 2 that RR-NPL has similar mean LPPD on toy test data compared to NUTS, and is twice as fast as NUTS.
3.1.2 MNIST: Implementation and Results
We now demonstrate FI-NPL on clustering handwritten digits from MNIST (LeCun & Cortes, 2010), which consists of pixel images. In this example and . We normalize all pixel values such that they lie in the interval , and set . We believe a priori that many pixels are close to 0, so for ease we elicit a tight normal centering measure for the DP:
NUTS is prone to the label-switching problem and is too computationally intensive as ADVI already requires 5 hours, so we only compare FI-NPL to ADVI. We set for ADVI, and for FI-NPL with . We carry out a single run of ADVI to select a local mode, and set of FI-NPL to the ADVI-selected mode. We then carry out 30 repeats of FI-NPL with this initialization, and compare to the original ADVI run. We see in Figure 2 that we obtain larger posterior variances in FI-NPL, as ADVI likely underestimates the posterior variances due to the mean-field approximation. Notice the modes are not exactly aligned as the NPL and Bayesian posterior are distinct, and furthermore ADVI is approximate. We conjecture that ADVI does not set components exactly to 0 due to the strong Dirichlet prior. We see in Tables 1, 2 that FI-NPL is predictively better and runs around 300 times faster than ADVI.
|Toy||-1.909 0.040||-1.911 0.040||-1.908 0.039||-1.912 0.041|
|Toy||37.2s 4.5s||5.5 2.2s||1m20s 16s||0.8s 0.1s|
3.2 Logistic regression with Automatic Relevance Determination priors
We now demonstrate the predictive performance and computational scalability of loss-NPL in a Bayesian sparse logistic regression example on real datasets. To induce sparsity, we place automatic relevance determination (ARD) priors (MacKay, 1994)
on the coefficients with Gamma hyperpriors(Gelman et al., 2008). The conventional Bayesian model for and is:
Marginally, the prior is the non-standardized t-distribution with (degrees of freedom, location, squared scale):
This posterior is intractable and potentially multimodal due to the non-log-concavity of the prior, and we carry out conventional Bayesian inference via NUTS and ADVI. When applying loss-NPL to regression, we assume , and place a DP prior on the joint distribution . We target the parameter which satisfies (1) with loss:
which is the negative sum of the log-likelihood and log-prior, with additional scaling parameter . Again our NPL posterior may be multimodal due to the non-convexity of the loss, and so we utilize RR-NPL. It should be noted that our target parameter is now different to conventional Bayesian inference, but our method achieves the common goal of variable selection under a Bayesian framework. For the DP prior, we elicit the centering measure
The prior assumes are independent which is equivalent to assuming a priori. This is appropriate as we believe many components of to be close to 0. The prior on is its empirical distribution, which is in an empirical Bayes manner where the prior is estimated from the data.
3.2.1 Implementation and results
We analyze 3 binary classification datasets from the UCI ML repository (Dheeru & Karra Taniskidou, 2017): ‘Adult’ (Kohavi, 1996), ‘Polish companies bankruptcy 3rd year’, (Zikeba et al., 2016), and ‘Arcene’ (Guyon et al., 2005) with details in Table 3
. We handle categorical covariates with dummy variables, and normalize all covariates to have mean 0 and standard deviation 1. Missing real values were imputed with the mean, and data with missing categorical values were dropped. We carry out a random stratified train-test split for each of the 30 runs, with 80-20 split for ‘Adult’, ‘Polish’ and 50-50 split for ‘Arcene’ due to the smaller dataset. For both NPL and conventional Bayesian inference, the hyperparameters were set to, which was selected by tuning the sparsity of the Bayesian posterior means to a desired value. For NPL, we set for ‘Adult’ and ‘Polish’ as is sufficiently large, and for ‘Arcene’ with as is only . We set for each dataset as explained in Section 2.6 for a fair comparison to the conventional Bayesian model. We initialize each optimization with , and select the number of restarts to for expediency. Optimization was carried out using the L-BFGS-B algorithm (Zhu et al., 1997) implemented in scipy.optimize (Jones et al., 2001–).
We can see in Table 4 that loss-NPL is predictively similar or better than NUTS and ADVI, and from Table 5 we see that the posterior mean is sparser for loss-NPL as well. Finally, we see from Table 6 that the loss-NPL run-times for posterior samples are much faster than for NUTS, and comparable to VB. Further measures of predictive performance are provided in Section E.3.4 of the Supplementary.
|Data Set||Type||Positive %|
|Polish||-0.229 0.034||-3.336 4.162||-0.247 0.047|
|Arcene||-0.449 0.104||-0.464 0.032||-0.445 0.068|
|Adult||0.1||17.6 2.8||16.1 2.7||12.1 3.1|
|Polish||0.1||33.5 4.7||15.9 3.3||15.83.5|
|Arcene||0.01||87.4 0.7||4.7 0.3||3.5 0.3|
|Adult||2m24s 8s||2h36m 4m||26.9s 7.3s|
|Polish||19.0s 4.0s||1h20m 21m||3.3s 0.8s|
|Arcene||2m20s 2s||4h31m 53m||54.2s 3.3s|
3.3 Bayesian sparsity-path-analysis
We now utilize loss-NPL to carry out Bayesian sparsity-path-analysis for logistic regression, which allows us to visualize how the responsibility of each covariate changes with the sparsity penalty as discussed by Lee et al. (2012). We use the same ARD prior as Section 3.2 with the same initialization scheme, set , and elicit a noninformative DP prior with . We found empirically that the results for larger values of are similar and so the approximation with is sufficient. We fix and vary the value of to favour solutions of different sparsity. This varies the squared scale of the Student-t prior with fixed degrees of freedom, where a smaller corresponds to a heavier sparsity penalty and thus more components are set to 0.
3.3.1 Implementation and Results
We analyze the genotype/pseudo-phenotype dataset with as described by Lee et al. (2012), containing patient covariates which exhibit strong block-like correlations as shown in Figure 3. We normalize the covariates to have mean 0 and standard deviation 1. The pseudo-phenotype data is generated by , where has 5 randomly selected non-zero components out of , with the rest set to 0. Each non-zero component is sampled from , and the exact values of are provided in Section E.4.1 of the Supplementary. We set and vary for , and generate 4000 posterior samples for each setting.
The posterior medians of the non-zero components of with
central credible interval are shown in Figure4 for a range of values. Both the posterior median and central credible intervals are estimated through the appropriate order statistics of the posterior samples (Gelman et al., 2013). We can see that , and have early predictive power as their credible intervals remain large despite a significant sparsity penalty (small ), whilst the other two coefficients are masked. A plot of the absolute medians for all components is included in Section E.4.2 of the Supplementary. For and , the median is close to 0 but the credible interval is large which is due to the multimodality of the marginal posterior. This multimodality is also responsible for the jitter in the median around for in Figure 4 ; the true median likely lies between the two separated modes but the finite posterior sample size causes the sample median to jump between the two. A posterior marginal KDE plot of changing with is shown in Figure 5, allowing us to visualize how the importance of the covariate changes with the sparsity penalty. We observe the bimodality in the marginal posterior for as expected from the above discussion.
Loss-NPL required 5 minutes 24 seconds to generate all posterior samples. The computational speed of NPL enables fast Bayesian analysis of large datasets with different hyperparameter settings, allowing for Bayesian variable selection analysis.
We have introduced a variant of Bayesian nonparametric learning (NPL) with a Dirichlet process (DP) prior on the sampling distribution , which leads to highly scalable exact inference under model misspecification, detached from the conventional Bayesian posterior. This method admits a sampling scheme for multimodal posteriors that allows for full mode exploration, which involves a non-convex optimization that we solve through random restart. We demonstrated that NPL can perform predictively better than conventional Bayesian inference, while providing exact uncertainty quantification.
For future work, the small sample performance of NPL could be further explored and compared to conventional Bayesian inference; we currently recommend NPL for moderate to large values of . The scaling of the number of repeats with increasing dimension for full mode exploration would also be a future avenue of research.
We would like to thank Ho Chung Leon Law for helpful feedback and Anthony Lee for providing the genotype dataset. EF is funded by The Alan Turing Institute Doctoral Studentship, under the EPSRC grant EP/N510129/1. SL is funded by the EPSRC OxWaSP CDT, through EP/L016710/1. CH is supported by The Alan Turing Institute, the HDR-UK, the Li Ka Shing Foundation, and the MRC.
- Antoniak (1974) Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974. ISSN 0090-5364. doi: 10.1214/aos/1176342871.
- Bardenet et al. (2017) Bardenet, R., Doucet, A., and Holmes, C. On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res., 18(1):1515–1557, 2017. ISSN 1532-4435.
- Betrò & Schoen (1987) Betrò, B. and Schoen, F. Sequential stopping rules for the multistart algorithm in global optimisation. Mathematical Programming, 38(3):271–286, 1987. ISSN 1436-4646. doi: 10.1007/BF02592015.
- Bissiri et al. (2016) Bissiri, P. G., Holmes, C. C., and Walker, S. G. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016. ISSN 13697412. doi: 10.1111/rssb.12158.
- Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. doi: 10.1080/01621459.2017.1285773.
- Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32, 2017. ISSN 1548-7660. doi: 10.18637/jss.v076.i01.
- Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Dick et al. (2014) Dick, T., Wong, E., and Dann, C. How many random restarts are enough? Technical report, 2014.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2011. ISSN 1532-4435.
- G. E. Boender & H. G. Rinnooy Kan (1987) G. E. Boender, C. and H. G. Rinnooy Kan, A. Bayesian stopping rules for multistart global optimization methods. Mathematical Programming, 37:59–80, 1987. doi: 10.1007/BF02591684.
- Gelman et al. (2008) Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–1383, 2008. doi: 10.1214/08-AOAS191.
- Gelman et al. (2013) Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian data analysis, third edition. Chapman & Hall/CRC Texts in Statistical Science, chapter 1,2. Taylor & Francis, 2013. ISBN 9781439840955.
- Ghosal (2010) Ghosal, S. The Dirichlet process, related priors and posterior asymptotics, pp. 35–79. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2010. doi: 10.1017/CBO9780511802478.003.
- Ghosal & van der Vaart (2017) Ghosal, S. and van der Vaart, A. Fundamentals of nonparametric bayesian inference. Cambridge Series in Statistical and Probabilistic Mathematics, chapter 4. Cambridge University Press, 2017. doi: 10.1017/9781139029834.
- Giordano et al. (2015) Giordano, R., Broderick, T., and Jordan, M. Linear response methods for accurate covariance estimates from mean field variational bayes. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 1441–1449, Cambridge, MA, USA, 2015. MIT Press.
- Grünwald & Van Ommen (2017) Grünwald, P. and Van Ommen, T. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017. doi: 10.1214/17-BA1085.
Guyon et al. (2005)
Guyon, I., Gunn, S., Ben-Hur, A., and Dror, G.
Result analysis of the NIPS 2003 feature selection challenge.In Saul, L. K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17, pp. 545–552. MIT Press, 2005.
- Homan & Gelman (2014) Homan, M. D. and Gelman, A. The No-U-turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014. ISSN 1532-4435.
- Hu et al. (1994) Hu, X., Shonkwiler, R., and Spruill, M. C. Random restarts in global optimization. Technical report, 1994.
- Ibrahim & Chen (2000) Ibrahim, J. G. and Chen, M.-H. Power prior distributions for regression models. Statist. Sci., 15(1):46–60, 2000. doi: 10.1214/ss/1009212673.
- Ishwaran & Zarepour (2002) Ishwaran, H. and Zarepour, M. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30:269 – 283, 2002. doi: 10.2307/3315951.
- Jain & Kar (2017) Jain, P. and Kar, P. Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10:142–336, 2017. doi: 10.1561/2200000058.
- Jasra et al. (2005) Jasra, A., Holmes, C. C., and Stephens, D. A. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20(1):50–67, 2005. ISSN 0883-4237. doi: 10.1214/088342305000000016.
- Jin et al. (2016) Jin, C., Zhang, Y., Balakrishnan, S., Wainwright, M. J., and Jordan, M. I. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp. 4123–4131, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9.
- Jones et al. (2001–) Jones, E., Oliphant, T., Peterson, P., et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Kohavi (1996) Kohavi, R. In Proceedings of the second internal conference on knowledge on knowledge discovery and data mining, pp. 202–207. AAAI Press, 1996.
- Kucukelbir et al. (2017) Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. Automatic Differentiation Variational Inference. J. Mach. Learn. Res., 18(1):430–474, 2017. ISSN 1532-4435.
- LeCun & Cortes (2010) LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
- Lee et al. (2010) Lee, A., Caron, F., Doucet, A., and Holmes, C. A hierarchical Bayesian framework for constructing sparsity-inducing priors. arXiv e-prints, art. arXiv:1009.1914, 2010.
- Lee et al. (2012) Lee, A., Caron, F., Doucet, A., and Holmes, C. Bayesian sparsity-path-analysis of genetic association signal using generalized t priors. Statistical applications in genetics and molecular biology, 11 2, 2012.
- Lyddon et al. (2018) Lyddon, S., Holmes, C., and Walker, S. General Bayesian Updating and the Loss-Likelihood Bootstrap. arXiv e-prints, art. arXiv:1709.07616, 2018.
- Lyddon et al. (2018) Lyddon, S., Walker, S., and Holmes, C. C. Nonparametric learning from Bayesian models with randomized objective functions. In Advances in Neural Information Processing Systems 31, pp. 2075–2085. Curran Associates, Inc., 2018.
- MacKay (1994) MacKay, D. J. C. Bayesian nonlinear modeling for the prediction competition. ASHRAE Trans, 100, 1994.
- Neal (1996) Neal, R. M. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366, 1996. ISSN 0960-3174. doi: 10.1007/BF00143556.
- Newton & Raftery (1994) Newton, M. and Raftery, A. Approximate bayesian inference by the weighted likelihood bootstrap. Journal of the Royal Statistical Society Series B-Methodological, 56:3 – 48, 1994. doi: 10.1111/j.2517-6161.1994.tb01956.x.
- Newton et al. (2018) Newton, M., Polson, N. G., and Xu, J. Weighted Bayesian bootstrap for scalable Bayes. arXiv e-prints, art. arXiv:1803.04559, 2018.
Owen, A. B.
Empirical likelihood ratio confidence intervals for a single functional.Biometrika, 75(2):237–249, 1988. doi: 10.1093/biomet/75.2.237.
- Park & Casella (2008) Park, T. and Casella, G. The Bayesian lasso. Journal of the American Statistical Association, 103:681–686, 2008. ISSN 0162-1459. doi: 10.1198/016214508000000337.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Pompe et al. (2018) Pompe, E., Holmes, C., and Łatuszyński, K. A framework for adaptive MCMC targeting multimodal distributions. arXiv e-prints, art. arXiv:1812.02609, 2018.
- Rubin (1981) Rubin, D. B. The Bayesian bootstrap. The Annals of Statistics, 9(1):130–134, 1981. ISSN 0090-5364. doi: 10.1214/aos/1176345338.
- Rudoy & Wolfe (2006) Rudoy, D. and Wolfe, P. J. Monte carlo methods for multi-modal distributions. In 2006 Fortieth Asilomar Conference on Signals, Systems and Computers, pp. 2019–2023, 2006. doi: 10.1109/ACSSC.2006.355120.
- Seeger et al. (2007) Seeger, M., Gerwinn, S., and Bethge, M. Bayesian inference for sparse generalized linear models. In Proceedings of the 18th European Conference on Machine Learning, ECML ’07, pp. 298–309, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-74957-8. doi: 10.1007/978-3-540-74958-5˙29.
- Sethuraman (1994) Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650, 1994. ISSN 10170405, 19968507.
- Stephens (1999) Stephens, M. Dealing with multimodal posteriors and non-identifiability in mixture models. Journal of the Royal Statistical Society, Series B, 1999.
- van der Vaart (1998) van der Vaart, A. W. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics, chapter 5. Cambridge University Press, 1998. ISBN 0-521-49603-9.
- Walker (2013) Walker, S. G. Bayesian inference with misspecified models. Journal of Statistical Planning and Inference, 2013. ISSN 03783758. doi: 10.1016/j.jspi.2013.05.013.
- Watson & Holmes (2016) Watson, J. and Holmes, C. Approximate models and robust decisions. Statist. Sci., 31(4):465–489, 11 2016. doi: 10.1214/16-STS592.
- Yamato (1984) Yamato, H. Characteristic functions of means of distributions chosen from a Dirichlet process. The Annals of Probability, 12(1):262–267, 1984. ISSN 0091-1798. doi: 10.1214/aop/1176993389.
- Zhu et al. (1997) Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550–560, 1997. ISSN 0098-3500. doi: 10.1145/279232.279236.
- Zikeba et al. (2016) Zikeba, M., Tomczak, S. K., and Tomczak, J. M. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications, 2016.
Appendix A Eliciting the prior DP
a.1 Intuition of the prior
The parameter of interest when model fitting (Walker, 2013) is:
The prior on is:
The effects of the implicit prior on due to when model-fitting can be seen in the limit of under regularity conditions:
In the limit, the prior collapses on one of the points that minimizes the KL divergence between the prior centering density and the model. Intuitively, the prior regularizes towards (A.3), and acts as weighting between and .
a.2 Selecting through the mean functional
We can tune through the a priori variance of the mean functional:
If , then the a prior variance of (A.4) follows from the properties of the DP:
and so we can elicit from a priori knowledge of .
Appendix B Stopping rules for adaptively selecting
Although not explored in our paper, we can utilize heuristic stopping rules for adaptively selectingfor full mode exploration when sampling from the NPL posterior. A simple example is to stop the repeats if there have been no improvements in the optimized function value for the last repeats, where is the parameter of the stopping rule. More complex methods involve estimating the missing probability mass due to local minima not being observed, and thresholding based on that. See Betrò & Schoen (1987); Dick et al. (2014) for a comparison of some methods.
Appendix C Stochastic Subsampling
For very large , we can utilize stochastic gradient methods by subsampling to optimize the weighted loss. The full weighted loss and gradient are defined as:
If we subsample a mini-batch , we can then calculate the mini-batch gradient:
The mini-batch gradient is unbiased:
allows use to use stochastic gradient descent (SGD) and its variants which improves scalability. Furthermore, extensions to SGD such as ADAGRAD(Duchi et al., 2011) and ADAM (Kingma & Ba, 2014) help with escaping saddle points, which can potentially reduce the number of required for RR-NPL to obtain full mode exploration.
Appendix D Selecting in loss-NPL
For loss-NPL we can set the loss function to:
In this case, we recommend the scaling parameter to be if we want roughly the same prior regularization of as in traditional Bayesian inference. This can be seen when we look at the expected of for (i.e. ):
We obtain the same weighting as in Bayesian inference between the log-likelihood and log-prior for .
Appendix E Examples
e.1 Toy example: Normal location model
We now empirically demonstrate the small sample performance of NPL and the role of the prior concentration in a toy normal location model problem. Suppose the model of interest is with known . Our parameter of interest is defined:
If we set the derivative of the objective to 0, we obtain:
If we believe our parametric model to be accurate, we can place a prior on . The centering measure on our DP is thus:
When , our NPL prior is approximately normal (Yamato, 1984) from the properties of the DP:
e.1.1 Implementation and results
We sample the observables and set our parametric prior variance to . We simulate the NPL posterior in Figure E.1 for various values of and , and compare it to the tractable traditional Bayesian posterior with the same model . For the NPL posterior bootstrap sampler, we generate samples and truncate the DP at .
We see from Figure E.1 that the NPL prior is approximately normal (), with same mean and variance due to the choice of . For large , the NPL posterior and Bayesian posterior are similar, due to the first order correctness of the weighted likelihood bootstrap (Newton & Raftery, 1994). For smaller values of
, the NPL posterior is non-normal, as our prior is not a conjugate prior on. For , the sample observed is close to so the posterior uncertainty is small despite only observing one sample; this suggests that NPL may be better suited to moderate to large values of . Figure E.2 shows the effect on the NPL posterior of increasing prior strength for , which regularizes the posterior but also causes it to concentrate about .
e.2 Gaussian mixture models
e.2.1 Optimization details
We derive the EM algorithm that maximizes the weighted likelihood of the diagonal-covariance GMM:
Taking an expectation over the posterior :
Taking the difference of the weighted likelihood with :
From Gibbs’ inequality:
As all :
So by maximizing w.r.t. , we cannot decrease the weighted log-likelihood. As a reminder, the log-likelihood for each datapoint is:
At the expectation step, we calculate:
The maximization step is then:
e.2.2 Toy Example
We see the posterior KDE plots for , and in Figures E.5, E.5, E.5, and for increasing in Figures E.8, E.8, E.8. For RR-NPL, we observe multimodality in addition to symmetry about the diagonal due to label-switching. Smaller values of exhibit an over-representation of local modes/saddles, and the posterior accuracy increases for larger . We also show the run-times for different for RR-NPL in Table E.1, and we see that the run-time increases roughly linearly with .
|Toy Sep||4.9 0.6||8.0 1.1||19.0 2.1||37.2s 4.5s|
e.2.3 Another Toy Example
We carry out the same experiment in another toy GMM example with the parameters:
The means are closer together, and so NUTS is prone to label-switching. We can see in Figures E.11, E.11, E.11 that NUTS exhibits label-switching, and that the Bayesian and NPL posterior are distinct. We obtain full mode exploration with RR-NPL, and do not observe label-switching with FI-NPL. We utilize a Gaussian kernel KDE for visualization, but in reality the posterior of will have 0 density outside the triangular simplex.
e.3 Logistic regression with Automatic Relevance Determination priors
e.3.1 Predictive performance
For all the following measures of predictive performance on held-out test data, we can use a Monte Carlo estimate of the predictive distribution of a test data point :
where is the NPL or Bayesian posterior, is the number of posterior samples, and is the training set. We evaluate the mean log pointwise predictive probability (LPPD) (Gelman et al., 2013) of held-out test data as a measure of predictive performance:
Below, we additionally include the mean squared error (MSE) here on held-out test data:
Finally, we also report the percentage accuracy:
where is the indicator function.
e.3.2 Sparsity measure
For the sparsity results, we simply calculate the posterior mean , where as above. We then report the percentage of components of that have absolute value less than .
e.3.3 Optimization details
L-BFGS-B (Zhu et al., 1997) is a quasi-Newton method which requires the gradient, which for the marginal Student-t distribution is defined for as:
e.3.4 Additional results
We can see in Tables E.2, E.3 that loss-NPL performs equally or better than NUTS and ADVI predictively in MSE and classification accuracy as well as LPPD. A posterior marginal density plot for in the ‘Adult’ dataset is shown in Figure E.12 for reference.
|Adult||0.104 0.001||0.104 0.001||0.105 0.002|
|Polish||0.056 0.011||0.524 0.236||0.058 0.011|
|Arcene||0.1340.026||0.152 0.014||0.143 0.020|
|Adult||84.92 0.29||84.92 0.30||84.84 0.30|