Principles of Bayesian Inference using General Divergence Criteria

02/26/2018 ∙ by Jack Jewson, et al. ∙ 0

When it is acknowledged that all candidate parameterised statistical models are misspecified relative to the data generating process, the decision maker (DM) must currently concern themselves with inference for the parameter value minimising the KL-divergence between the model and the process (Walker, 2013). However, it has long been known that minimising the KL-divergence places a large weight on correctly capturing the tails of the sample distribution. As a result the DM is required to worry about the robustness of their model to tail misspecifications if they want to conduct principled inference. In this paper we alleviate these concerns for the DM. We advance recent methodological developments in general Bayesian updating (Bissiri, Holmes and Walker, 2016) to propose a statistically well principled Bayesian updating of beliefs targeting the minimisation of more general divergence criteria. We improve both the motivation and the statistical foundations of existing Bayesian minimum divergence estimation (Hooker and Vidyashankar, 2014; Ghosh and Basu, 2016), allowing the well principled Bayesian to target predictions from the model that are close to the genuine model in terms of some alternative divergence measure to the KL-divergence. Our principled formulation allows us to consider a broader range of divergences than have previously been considered. In fact we argue defining the divergence measure forms an important, subjective part of any statistical analysis, and aim to provide some decision theoretic rational for this selection. We illustrate how targeting alternative divergence measures can impact the conclusions of simple inference tasks, and discuss then how our methods might apply to more complicated, high dimensional models.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the -closed world, where a fitted model class is implicitly assumed to contain the sample distribution from which the data came, Bayesian updating is highly compelling. However, modern statisticians are increasingly acknowledging that their inference is taking place in the -open world (Bernardo  Smith, 2001). Within this framework we acknowledge that any class of models we choose is unlikely to capture the actual sampling distribution, and is then at best an approximate description of our own beliefs and the underlying real world process. In this situation it is no longer possible to learn about the parameter which generated the data and a statistical divergence measure must be specified between the fitted model and the genuine one in order to define the parameter targeted by the inference Walker (2013). Remarkably in the -open world standard Bayesian updating can be seen as a method which learns a model by minimising the predictive Kullback-Leibler (KL)-divergence from the model from which the data were sampled (Walker, 2013; Bissiri ., 2016). Therefore traditional Bayesian updating turns out to still be a well principled method for belief updating provided the decision maker (DM) concerns themselves with the parameter of the model that is closest to the data as measured by KL-divergence (Walker, 2013). Recall the KL-divergence of functioning model from data generating density is given by


It is well known that when the KL-divergence is large, which is almost inevitable in high dimensional problems, the KL-divergence minimising models can give a very poor approximation if our main interest is in getting the central part of the posterior model uncertainty well estimated. This is because the KL-divergence gives large consideration to correctly specifying the tails of the generating sample distribution - see Section 3.2.1. As a result, when the DM acknowledges that they are in the -open world, they have several options currently available to them:

  1. [leftmargin=*,labelsep=4.9mm]

  2. Proceed as though the model class does contain the true sample distribution and conduct a posteriori sensitivity analysis.

  3. Modify the model class in order to improve its robustness properties.

  4. Abandon the given parametric class and appeal to more data driven techniques.

To this we add:

  1. [resume,leftmargin=*,labelsep=4.9mm]

  2. Acknowledge that the model class is only approximate, but is the best available, and seek to infer the model parameters in a way that are most useful to the decision maker.

Method 1 is how Box (1980)

recommends approaching parametric model estimation and is the most popular approach amongst statisticians. Although it is acknowledged that the model is only approximate, Bayesian inference is applied as though the statistician believes the model to be correct. The results are then checked to examine how sensitive these are to the approximations made. Authors

Berger . (1994) provide a thorough review while Watson . (2016) consider this in a decision focused manner.

Method 2 corresponds to the classical robustness approach. Within this approach, one model within the parametric class may be substituted for a model providing heavier tails (Berger ., 1994). Alternatively different estimators, for example M-estimators Huber  Ronchetti (1981) Hampel . (2011), see Greco . (2008) for a Bayesian analogue, are used instead of those justified by the model class. Bayes linear methods Goldstein (1999)

, requiring many fewer belief statements than a full probability specification, form a special subclass of these techniques.

The third possibility is to abandon any parametric description of the probability space. Examples of this solution include, empirical likelihood methods (Owen, 1991); a decision focused general Bayesian update (Bissiri ., 2016)

; Bayesian non-parametric methods; or to appeal to statistical learning methods such as neural networks or support vector machines.

Though alternatives 2 and 3 can be shown to be very powerful in certain scenarios, it is our opinion that within the context described above that a fourth option holds considerable relative merit. In an applied statistical problem the model(s) represents the DM’s best guess of the sample distribution that could be applied. The model provides the only opportunity to input not only structural, but quantitative expert judgements about the domain, something that is often critical to a successful analysis - see Lazer . (2014) for example. The model provides an interpretable and transparent explanation about how different factors might be related to each other. This type of evidence is often essential when advising on decisions to policy makers. Often, simple assumptions play important roles in providing interpretability to the model and in particular preventing it from over fitting to any non-generic features contained in any one data set.

For the above reasons an unambiguous statement of a model, however simple, is in our opinion an essential element for much of applied statistics. In light of this, statistical methodology should also be sufficiently flexible to cope with the fact that the DM’s model is inevitably an approximation both of their beliefs and the real world process. Currently, Bayesian statistics sometimes struggles to do this well. We argue below that this is because it implicitly minimises the KL-divergence to the underlying process.

By fitting model parameter in a way that can be non-robust, the DM is having to combine their best guess belief model with something that will be robust to the parameter fitting. The DM must seek the best representation of their beliefs about a process in order to make future predictions. However under traditional Bayesian updating they must also give consideration to how robust these beliefs are. This seems an unfair task to ask of the DM. We therefore propose to decouple what the DM believes about the data generating process, from how the DM wishes the model to be fitted. This results in option 4 above.

This fourth option suggest the DM may actually want to explicitly target more robust divergences, a framework commonly known as Minimum Divergence Estimation (MDE), see Basu . (2011). Minimum divergence estimation is of course a well-developed field by frequentists, with Bayesian contributions coming more recently. However when the realistic assumption of being in the -open world is considered the currently proposed Bayesian minimum divergence posteriors fail to fully appreciate the principled justification and motivation required to produce a coherent updating of beliefs. A Bayesian cannot therefore make principled inference using currently proposed methods in the -open setting, except in a way that Miller  Dunson (2015) describe as “tend(ing) to be either limited in scope, computationally prohibitive, or lacking a clear justification”. In order to make principled inference it appears as though the DM must currently concern themselves with the KL-divergence. However in this paper we remove this reliance upon the KL-divergence by providing a justification for Bayesian updating minimising alternative divergences, both theoretically and ideologically. Our updating of beliefs does not produce an approximate or pseudo posterior, but uses general Bayesian updating (Bissiri ., 2016) to produce the coherent posterior beliefs of a decision maker who wishes to produce predictions from a model that provide an explanation of the data that is as good as possible in terms of some pre-specified divergence measure. By doing this the principled statistical practice of fitting model parameters to produce predictions is adhered to, but the parameter fitting is done so acknowledging the -open nature of the problem.

In Section 2 of this article we review the currently available Bayesian minimum divergence estimation techniques and introduce general Bayesian updating Bissiri . (2016). The contributions of this article are then as follows: in Section 3 we identify the inadequacies in the justification provided by the currently available Bayesian MDE methods and use general Bayesian updating to prove that the Bayesian can still do principled inference on the parameters of the model using alternative, more robust divergences to KL-divergence. This theoretical contribution then allows us to propose a wider variety of divergences that the Bayesian could wish to minimise than have currently been consider in the literature in Section 4. In this section we also consider decision theoretic reasons why targeting alternative divergences to the KL-divergence can be more desirable. Lastly in Section 5 we demonstrate the impact model misspecifications can have on a traditional Bayesian analysis for simple inference, regression and time series analysis, and that superior robustness can be obtained by minimising alternative divergences to the KL-divergence. In appendix A we also show that when the observed data is in fact generated from the model, these methods can be shown not to lose too much precision. In this paper we demonstrate that this advice is not simply based on expedience but has a foundation in a principled inferential methodology. In this purpose we have deliberately restricted ourselves to simple demonstrations designed to provide a transparent illustration of the impact that changing the divergence measure can have on inferential conclusions. However we discuss how robust methodology becomes more important as problems and models become more complex and high dimensional and thus encourage practitioners to experiment with this methodology in practice.

2 Related work

2.1 Bayesian theory and scoring functions

Authors Bernardo  Smith (2001) consider Bayesian inference through the lens of a Bayesian decision problem, where quoting a probability belief distribution for future uncertainties is the action to be taken. In this scenario, a scoring rule is used to define the loss of quoting a distribution and observing a realisation . They argue that the scoring rule associated with scoring probabilistic predictions should be proper and local. A proper scoring rule results in the DM’s expected loss being minimised when the DM quotes their true beliefs, and a local scoring rule is one where the score only depends on the quoted probability of the actual observed outcome and nothing else. The only proper, local scoring rule is the logarithmic scoring rule:


The expected additional loss when quoting distribution , with density , for future outcomes when the data is distributed according to distribution , with density , is then:


the Kullback-Leibler (KL) divergence between the two probability distributions in equation (

1). In fact authors Grünwald  Dawid (2004) define any divergence , associated with proper scoring rule as the extra penalty for believing was distributed according to when it was actually distributed according to .


2.2 The data generating process and the M-open world

Sometimes in this paper we use the phrase “the data generating process”. The data generating process is a widespread term in the literature and appears to suggest that ‘Nature’ is using a simulator to generate observations. While this may fit nicely with some theoretical contributions, it becomes difficult to argue for in reality. In this article we consider the data generating process to represent the DM’s true beliefs about the sample distribution of the observations. However in order to correctly specify these the DM must be able to take the time and infinite introspection to consider all of the information available to them/in the world in order to produce probability specifications in the finest of details. As is pointed out by Goldstein (1990) this requires many more probability specifications to be made at a much higher precision than any DM is ever likely to be able to manage within time constraints of the problem. As a result these genuine beliefs must be approximated. This defines the subjectivist interpretation of the -open world we adopt in this paper - the model used for the belief updating is only ever feasibly an approximation of the DM’s true beliefs about the sample distribution they might use if they had enough time to fully reflect. In the special case when the data is the result of a draw from a known probability model - a common initial step in validating a methodology - then this thoughtful analysis and ”a data generating process” obviously coincide. Henceforth we use the “the data generating process” in this sense to align our terminology as closely as possible with that in common usage.

2.3 General Bayesian updating

Following the above motivation, Bissiri . (2016) consider only conducting inference about the factors of the world/problem that matter to the DM. They produce what they refer to as a general Bayesian update - a coherent method to produce a posterior distribution over some quantity without relying on a full model for the observations - when considering the non-inferential Bayesian decision problem. Similarly to Bayes linear methods (Goldstein, 1999), they consider producing posterior beliefs which do not require conditioning. They consider the Bayes act associated with beliefs corresponding to the data generating distribution and quantity of interest as


Rather than elicit a model representing the DM’s beliefs over , Bissiri . (2016)

argue that given a prior, a loss function and some data

, an updating of beliefs about the value of must be possible in the absence of a model for the sampling distribution. They suggest that the posterior distribution resulting from such an updating of beliefs, be as ‘close’ as possible in terms of expected loss to both the prior and the observed data , used for the updating. In this development the KL-divergence was the only way to measure ‘closeness’ to the prior which resulted in an additive Bayesian update - one where the order the data arrives in does not affect that parameter updating and is consistent with the likelihood principle. In the absence of a likelihood, the best choice to measure the ‘closeness’ of the posterior to the data is to use the expected loss of the observed data under the proposed posterior, Bissiri . (2016) show that the posterior minimising the expected loss criteria is:


This posterior is not an approximation. Neither is it a pseudo-posterior, but rather a valid, coherent representation of subjective uncertainty in the minimizer of the decision in equation (5). Often , the cumulative loss of the current data set. Updating using the cumulative loss amounts to replacing integrating over the data generating distribution in equation (5), with empirical integration over data whose distribution is . They also mention the need to calibrate the loss-function used in the general Bayesian update against the prior, by setting the value of in equation (6

). While any probability density is defined so that it integrates to 1, any loss-function can be defined to be arbitrarily large or small. It is important that the weight the loss-function has in the updating process is calibrated against the prior such that the posterior distributions resulting from the general Bayesian update retain probabilistic meaning in the absence of a model. So it follows that the general Bayesian must be prepared to post odds according to their posterior similarly to any other Bayesian.

It is straightforward to see that if the logarithmic score is used in the general Bayesian update then the traditional Bayesian update is recovered:


This echoes the well-known result that the Bayesian predictive distribution finds the distribution that is closest to the data generating distribution in terms of KL-divergence. General Bayesian updating allows the goals of the statistical analysis to be coupled together with the parameter updating. This is not something previously possible under traditional Bayesian statistics. While Bissiri . (2016) provide the framework to update a probability belief distribution using a loss function, they only consider the logarithmic score in an inferential scenario.

2.4 M-closed Bayesian minimum divergence estimation

Minimum divergence estimation (MDE) is an approach to robust inference, largely overlapping with minimum scoring rule inference Dawid . (2016). MDE considers making inference about the parameters of a parametric model by minimising the divergence between and the data generating function of the observed data. While Dawid . (2016) uses proper scores, MDE traditionally minimises a member of the class of discrepancies or disparities. Disparity based methods first build a non-parametric density estimate

of the data generating distribution from the data (often a Kernel Density estimate). They then conduct inference by minimising the divergence between the model and

. The frequentist literature in this area is vast so in this article we choose to focus on the Bayesian contributions. These have come first from Hooker  Vidyashankar (2014) and more recently from Ghosh  Basu (2016) and Ghosh  Basu (2017).

Authors Hooker  Vidyashankar (2014) consider posterior densities of the form


where is the data generating density and is the squared Hellinger divergence between and


Posteriors of this form are justified following the asymptotic approximation of the divergence to the KL-divergence when the data comes from the model. This results from the fact that the divergence between two probability distributions must be uniquely minimised to 0 when the probability distributions are equal (Grünwald  Dawid, 2004). Therefore, when and as the distribution minimising any divergences between it and the data generating process will be .

Rather than considering a discrepancy, Ghosh  Basu (2016) considers minimising the proper Tsallis score associated with the density power divergence in order to produce a ‘pseudo-posterior’. The density power divergence between and is


This can be seen as a Bayesian analogue of the minimum proper score inference considered by Dawid . (2016). Authors Ghosh  Basu (2016) identify the fact that the score associated with the density power divergence does not require a density estimate, but can still be used to provide robust inference. They term the distribution produced by their estimation method a ‘pseudo-posterior’. However Ghosh  Basu (2017) demonstrate exponential convergence results illustrating that this posterior is asymptotically optimal in exactly the same exponential rate as the traditional Bayesian posterior when the model is correctly specified (Barron, 1988).

3 Extending Bayesian MDE for the M-open world

The current Bayesian minimum divergence technology is currently only able to produce approximate or pseudo posteriors. Therefore, Bayes rule is currently the only method explicitly available in the literature that DMs can use to produce a well principled updating of their beliefs about the parameters of a model. We address this issue in this section.

3.1 Why the current justification is not enough

The equivalence between the KL-divergence and the Hellinger-divergence of the model from the data generating process mentioned in Section 2.4 can be seen by taking Taylor expansions about the KL-divergence minimising parameter . We have that


Now following the same arguments as used in Hooker  Vidyashankar (2014), if is consistent for then equations (11) and (12) are equivalent in the limit as . However when is not the data generating parameter because the model class is misspecified, and if is still converging to as then equations (11) and (12) will be different. In this setting the current literature gives no foundational reasoning why updating using the Hellinger divergence constitutes a principled updating of beliefs.

The ‘pseudo-posterior’ of Ghosh  Basu (2016) is justified as a valid posterior by Ghosh  Basu (2017) as the traditional posterior originating from an alternative belief model. However, this alternative model will not even correspond to the DM’s approximate beliefs about the data generating process. There is therefore, a lack of formal justification for a DM to update beliefs using Bayes rule on this object.

3.2 Principled justification for KL in -open world

In contrast, (Walker, 2013; Bissiri ., 2016) has provided a principled justification for Bayes’ rule in the -open world. This requires the DM to think about the KL-divergence minimising parameter. When the model is correctly specified, the Bayesian learns about the parameter that generated the data. This, for any statistical divergence is equivalent to learning the parameter, , minimising . When the model is considered to be incorrect, there is no longer any formal relationship between the parameter and the data. The likelihood no longer represents the probability of the observed data conditioned on the parameter. Therefore in order for -open inference to be meaningful a divergence measure must be chosen and the parameter of interest can then defined as . Authors Walker (2013) then state that once the parameter of interest has been defined as the minimum of some divergence, it is then possible for a practitioner to define their prior beliefs about where the minimiser of this divergence may lie. Then the practitioner’s final task is to ensure that their Bayesian learning machine is learning about the same parameter with which they defined their prior belief.

Recalling the well-known result that Bayesian updating learns the parameters of the model which minimises the KL divergence of the model from the data generating density. This allows the DM to continuing conducting belief updating in a principled fashion using Bayes rule provided they are interested in, and specify prior beliefs about, the parameter minimising the KL divergence of the model from the data generating process.

3.2.1 Moving away from KL-divergence in the M-open world

Viewed conversely, Walker (2013)

identified that using Bayes rule to update beliefs using a misspecified model results in the DM being implicitly concerned about producing predictions that are closest to the data generating process in terms of KL-divergence. In many scenarios using the KL-divergence is the most expedient thing to do. This probably explains why the KL-divergence is so prolific for statistical applications. The logarithmic form of the KL-divergence makes it straightforward to calculate the KL-divergence between many exponential families and the link with Bayes rule allows for conjugate prior distributions to be considered.

However as is well known but often forgotten, someone whose probabilities are elicited using the logarithmic score, or the KL-divergence, will have to beware of approximating the probability of an event by 0. Since if they are wrong, they will incur an infinite loss. As a result, accurate probability specification of tail events will be important to the DM using the logarithmic-score, something which is generally speaking extremely important in pure inference problems (Bernardo  Smith, 2001).

Often in reality inference is being done in order to produce estimates of expected utilities as part of a larger decision making process. While the tail specification is important for inferential procedures and in situations where the losses are unbounded, for example in gambling and odds setting scenarios, many loss functions connected to real decisions may wish to place less impact on rarely occurring observations in the tails. For entirely reasonable practical reasons these rare events are precisely ones which the DM will find hard to accurately elicit O’Hagan . (2006), see Winkler  Murphy (1968) for a demonstration of this in the context of forecasting the probability of precipitation. Therefore, basing inference on such severe penalties can be unstable. For example a treatment regime is likely to put greater weight on correctly treating the majority of the population than worrying about outlying extreme cases. Authors Watson . (2016) quite rightly point out that loss functions for real problems are often bounded and any method that simulates parameters via MCMC makes this assumption. As will be demonstrated in Section 4.1, inferential procedures assuming an unbounded loss function such as the logarithmic score, provide no guarantees about performance in more general decision making scenarios.

Once the consequences of using Bayes rule to solve decision problem in the -open world is understood, we believe that DMs may reasonably desire alternative options for parameter updating, that are as well principled as Bayes rule, but places less importance on tail misspecification. In this new era of “Big data” it becomes increasingly likely that the model used for inference is misspecified, especially in the tails of the process - see Section 5.4. We believe many DMs would consider it undesirable for models that approximate the distribution of the majority of the data to be disregarded because they poorly fit a few outlying observations.

3.3 Principled minimum divergence estimation

We next take a foundational approach to theoretically justify an updating of beliefs targeting the parameter minimising any statistical divergence between the model and the data generating density. Consider the general inference problem of wanting to find the parameters of the parametric model . The model here can be considered as the DM’s best guess at the data generating process. We consider it important to continue to use a model even under the acknowledgement that it is misspecified for the reasons outlined in Section 1. Following Bernardo  Smith (2001) and Walker (2013) we consider fitting these parameters in a decision making scenario, minimising some divergence function


to the data generating process . The goal is therefore to solve the decision problem


Here the entropy term in the definition of the divergence is removed from the minimisation because it does not depend on . In this scenario Walker (2013) forces the Bayesian into concerns associated with non-robustness by shackling them to minimising the KL-divergence in order to use Bayes’ rule. However, equation (14) is identical to the setting consider by Bissiri . (2016) in equation (5), except that now the loss function depends on the parameter through a model so is a scoring rule. Using equation 6, general Bayesian updating therefore gives us the tools to come up with a coherent updating of beliefs that target the parameter minimising the general divergence as


where we have set the calibration weight . This will be discussed further in Section 3.3.2. In order to stay consistent with Walker (2013), we use the notation to indicate that the prior and posterior belief distributions correspond to beliefs about the parameter minimising divergence . Applying the general Bayesian update in an inferential scenario like this, provides a compromise between purely loss based general Bayesian inference and traditional Bayesian updating, which is conducted completely independently of the problem specific loss function. By continuing to fit a generative model, the Bayesian minimum divergence posterior somewhat separates the inference from the decision making. The inference is still concerned with estimating model parameters based on how well the model’s predictions reflect the data generating process. However, the criteria for closeness between predictions and reality takes decision making into account. As a result, the minimum divergence posteriors can be seen as producing inferences that will be suitable for a broad range of loss functions. The DM is no longer required to exactly define the loss function associated with their decision problem at the inference stage. They need only consider broadly how robust to tail misspecifications they want their inference to be in order to define the target divergence (see Section 4.4). Considering the realistic -open nature of the model class also justifies making this compromise. The general Bayesian update, assumes absolutely no information about the data generating process, while using Bayes rule traditionally111Prior to the interpretation provided by Walker (2013) explained in section 3.2 assumes the data generating density is known precisely. As we point out, it is actually more likely that the decision maker is able to express informative but not exact beliefs about the data generating process and therefore a half-way-house between these two is appropriate in reality.

Plugging in the corresponding loss functions gives general Bayesian posteriors minimising the Hellinger and density power divergence as


Equation (16) introduces to estimate the data generating density (see Section 4.5 for more on this). As a result, the general Bayesian updating is being conducted using an empirical loss function, say , which approximates the true loss function associated with minimising the Hellinger divergence between the model and the data generating process .

Equation (17) is exactly the distribution resulting from the robust parameter update of Ghosh  Basu (2016), while equation (16) is similar to the posterior produced by Hooker  Vidyashankar (2014) except the divergence function has been decomposed into its score and entropy term here. This demonstrates that the posteriors above are not pseudo posteriors - as Ghosh  Basu (2016) suggests - or approximations of posteriors - as Hooker  Vidyashankar (2014) suggests - they are the correct way for the DM to update their prior beliefs about the parameter minimising an alternative divergence to the KL-divergence.

Authors Basu . (2011) showed that is a consistent estimate of , while Ghosh  Basu (2016) use this to show that under certain regularity conditions

follows a Bernstein-von Mises (BVM) type result, but with convergence in probability, where the asymptotic variance of the posterior is the expected second derivative of the loss function

. Authors Hooker  Vidyashankar (2014) prove the corresponding asymptotic normality result for the posterior minimising the Hellinger divergence, but assume that the data generating process is contained within the model class, as this was required to justify their posterior in the first place. Consistency to the minimiser of any divergence provides consistency with the data generating process when it is contained within the model class, as by definition all divergences are minimised to 0 when the distributions are the same. The results of Ghosh  Basu (2017) demonstrate that not only is the posterior mean consistent, but the whole posterior is asymptotically optimal in the same sense as the traditional Bayesian posterior is when the model class contains the data generating process.

Another principled alternative to traditional Bayesian updating when it is difficult to fully specify a model for the data generating process is Bayes linear methods (Goldstein, 1999). These only require the subjective specifications of expectations and covariances for various quantities the DM is well informed about and interested in order to do inference. Alternatively by choosing a more robust divergence to use for the updating our method tackles the same problem by being robust to routine assumptions making a full probability specification a much less strenuous task.

3.3.1 The likelihood principle and Bayesian additivity

Authors Hooker  Vidyashankar (2014) identify that their posterior distribution minimising the Hellinger divergence no longer satisfies Bayesian additivity. That is to say that the posteriors would look different if the data were observed in one go or in two halves for example. This is because the posterior now depends on the estimate of the data generating density, which depends in a non-additive way on the data. This constitutes a departure from the likelihood principle underpinning traditional Bayesian statistics. The likelihood principle says that the likelihood is sufficient for the data. When the model is perfectly specified this is a sensible principle, the likelihood of the observed data under the correct model represents all the information in the data. However in the -open world the likelihood principle is no longer a reasonable requirement. When the model is only considered as an approximation of the data generating process, it is not unreasonable to suspect that the data contains more information than is represented in the likelihood of an incorrect model. However, Bayesian additivity (defined as coherence by Bissiri . (2016)) is a central principle of the general Bayesian update. Bissiri . (2016) chose the KL-divergence to measure the distance between the prior and the posterior as it was the only divergence measure that left the belief update additive. The general Bayesian posterior minimising the Hellinger divergence combines the loss function for each observation in an additive way, which is consistent with the additivity demanded by Bissiri . (2016), but a different density estimate is produced when the data is considered as a whole or in parts. This causes the empirical loss function used for the updating to be different when the data arrives in sections, as opposed to one go. If the data generating density were available then the exact loss function associated with minimising the Hellinger divergence could be calculated and the Bayesian update would be additive. However because an approximation of the loss function associated with the Hellinger divergence is used, additivity is sacrificed.

While the posterior of Ghosh  Basu (2016) is still additive, the likelihood is also not considered to be sufficient for the data, as the posterior also depends on the integral, . Using an alternative divergence to the KL-divergence in order to conduct Bayesian updating, requires that additional information to the ‘local’ information provided by the likelihood of the observed data is incorporated into the loss function. For the Hellinger-divergence this information comes from the data in the way of a density estimate, while for the density power-divergence this comes from the model through .

3.3.2 A note on calibration

The two posteriors in equations (16) and (17) have set the general Bayesian calibration weight to 1. Unlike probability distributions, loss functions can be of arbitrary size and therefore it is important that they are calibrated with the prior. However, this is not a problem here. The posteriors above both use well-defined models and the canonical form of well-defined divergence functions and as a result there is no arbitrariness in the size of the loss. This is further demonstrated by the fact that Bayes rule corresponds to using the canonical form of the KL-divergence and a probability model with weight . We do note that when the model is correct the posterior variance of these method will be comparatively bigger for finite data samples than that of the traditional Bayesian posterior. This is to be expected, Zellner (1988) showed that Bayes rule processes information optimally and therefore produces the most precise posterior distributions. The posterior is simply a subjective reflection of the DM’s uncertainty after seeing the data, and if they believe their model is incorrect and therefore target a more robust divergence, they are likely to have greater posterior uncertainty than if they naively believe their model to be correct and proceed accordingly.

4 Possible divergences to consider

The current formulation of Bayesian MDE methods has limited which divergences have been proposed for implementation. However demonstrating that principled inference can be made using alternative divergence measures than the KL-divergence, as we have in Section 3.3, allows us to consider the selection of the divergence used for updating to be a subjective judgement made by the DM, alongside the prior and model, to help tailor the inference to the specific problem. Authors Celeux . (2017) observed that while Gelman  Hennig (2015) advocate greater freedom for subjective judgements to impact statistical methodology, they fail to consider the possibility of subjective Bayesian parameter updating. In the -closed world Bayes rule is objectively the correct thing to do, but in the -open world this is no longer so clear. Very few problems seek answers that are connected with a specific dataset or model, they seek answers about the real world process underpinning these. Authors Goldstein . (2006), focusing on belief statements, demonstrate that subjective judgements help to generalise conclusions from the model and the data to the real world process. Carefully selecting an appropriate divergence measure can further help a statistical analysis to do this. We see this as part of the meta-modelling process where the DM can embed beliefs about their nature and extent of their modelling class explicitly.

Below we list some divergences that a practitioner may wish to consider. We do not claim that the list is exhaustive, but merely contains the divergences we have considered using. We qualitatively describe the features of the inferences produced by targeting each specific divergence, which will be demonstrated empirically in Section 5. Throughout we consider the Lebesgue measure to be the dominating measure.

4.1 Total Variation Divergence

The Total-Variation divergence between probability distributions and is given by


and the general Bayesian update targeting the minimisation of the TV-divergence can be produced by using loss function


in equation (15). It is worth noting however that the loss function in equation (19) is not monotonic in the predicted probability of each observation. This is discussed further in Appendix A. In fact when informed decision making is the goal of the statistical analysis, closeness in terms of TV-divergence ought to be the canonical criteria the DM demands. If then for any utility function bounded by 1, the expected utility under of making the optimal decision believing the data was distributed according to is at most worse than the expected utility gained from the optimal decision under (see JQ. Smith (2010) for example). Therefore if the predictive distribution is close to the data generating process in terms of TV-divergence then the consequences of the model misspecification in terms of the expected utility of the decisions made is small. However explicit expressions of the TV-divergence even between known families, are rarely available. This somewhat hinders an algebraic analysis of the divergence.

It is straightforward to see that and the KL-divergence form an upper bound on the TV-divergence through Pinsker’s inequality. However, the KL-divergence does not bound the TV-divergence below so there are situations where the TV-divergence is very small but the KL-divergence is very large. In this scenario, a predictive distribution whose associated optimal decisions achieve a close to optimal expected utility estimate (as the distribution is close in TV-divergence) will receive very little posterior mass. This is clearly undesirable in a decision making context.

Authors Hooker  Vidyashankar (2014) identify some drawbacks of having a bounded score function. The score function being upper bounded means that there is some limit to the score that can be incurred in the tails of the posterior distribution. The score incurred for one value of , will here be very similar to the score incurred by another value of past a given size. Therefore the tails of the posterior will be equivalent to the tails of the prior. As a result the DM is required to think more carefully about their prior distribution. Not only are improper priors prohibited, but more data is required to move away from a poorly specified prior. This can result in poor finite sample efficiency when the data generating process is within the chosen model class.

4.2 Hellinger Divergence

Authors Devroye  Gyorfi (1985) observed, that the Hellinger divergence can be used to bound the TV-divergence both above and below:




As a result, the Hellinger-divergence and the TV-divergence are geometrically equivalent. Thus, if one of them is small, the other is small and similarly if one of them is large the other also is. So if one distribution is close to another in terms of TV-divergence, then the two distributions will be close in terms of Hellinger-divergence as well. Authors Beran (1977) first noted that minimising the Hellinger-divergence gave a robust alternative to minimising the KL-divergence, while Hooker  Vidyashankar (2014) proposed a Bayesian alternative (equation (16)) that has been discussed at length above. While Hooker  Vidyashankar (2014) motivated their posterior through asymptotic approximations, identifying the geometric equivalence between the Hellinger-divergence and TV-divergence proposes further justification for a robust Bayesian updating of beliefs similar to that of Hooker  Vidyashankar (2014). Specifically, if being close in terms of TV-divergence is the ultimate robust goal, then being close in Hellinger-divergence will guarantee closeness in TV-divergence. Hellinger can therefore serve as a proxy for TV-divergence that retains some desirable properties of the KL-divergence: it is possible to compute the Hellinger divergence between many known families (J. Smith, 1995) and the score associated with the Hellinger divergence has a similar strictly convex shape, this is discussed further in Appendix A. A posterior targeting the Hellinger-divergence does suffer from the same drawbacks associated with having a bounded scoring function that are mentioned at the end of the previous section. Lastly along with TV-divergence, the Hellinger-divergence is also a metric.

4.3 -divergences

A much wider class of divergences are provided by the two parameter -divergence family proposed in Cichocki . (2011)


Reparametrising this such that and gives the S-divergence of Ghosh . (2017).

The general Bayesian posterior targeting the parameter minimising the -divergence of the model from the data generating density is


Letting represent the data generating density, and some predictions of parametrised by , Cichocki . (2011) provide intuition about how the parameters and impact the influence each observation has on the inference about . They consider influence to be how the observations impact the estimating equation of , this is a frequentists setting but the intuition is equally valid in the Bayesian setting.


The size of for an observation defines how outlying (large values) or inlying (small values) the observation is. Choosing

ensures outliers relative to the model have little influence on the inference, while adjusting the value of

such that down-weights the influence of unlikely observations from the model .

The hyperparameters

and control the trade-off between robustness and efficiency and selecting these hyperparameters is part of the subjective judgement associated with selecting that divergence. This being the case we feel that to ask for values of and to be specified by a DM is perhaps over ambitious, even given the interpretation ascribed to these parameters above. However, using the -divergence for inference can provide greater flexibility to the DM, which may in some cases be useful. Here to simplify the subjective judgements made we focus on two important, one parameter subfamilies within the -divergence family - the alpha-divergence where and the beta-divergence where .

4.3.1 alpha Divergence

The alpha-divergence, introduced by Csisz . (1967) and extensively studied by (Shun-ichi, 2012) is


where . There exists various reparametrisations of this: Amari notation uses with ; or Cressie-Read notation (Cressie  Read, 1984) introduces with .

We generally restrict attention to values of . corresponds to a KL-divergence limiting case while is 4 times the Hellinger divergence. We think of these as two extremes of efficiency and robustness within the alpha-divergence family with which a DM would want to conduct inference between. The parameter thus controls this trade-off.

The general Bayesian posterior targeting the minimisation of the alpha-divergence is


Note the power on in equation (26) is reduced by 1 from equation (25) as we consider empirical expectations in order to estimate the expected score associated with the alpha-divergence.

It was demonstrated in Sason  Verdú (2015) (Corollary 1) that for the alpha-divergence can be bounded above by TV-divergence :


Therefore, if the TV-divergence is small then the alpha-divergence will be small (provided ). So a predictive distribution that is close to the data generating density in terms of TV-divergence will receive high posterior mass under an update targeting the alpha-divergence. Once again, the general Bayesian posterior requires a density estimate to compute the empirical estimate of the score associated with the divergence and the same issues associated with having a bounded score function that were discussed in Section 4.1 apply here.

Considering the alpha divergence as a subfamily of the -divergence takes , and therefore . Therefore the influence of observations are weighted based on their ratio of only and not on the value of in isolation. corresponded to the KL-divergence and in order to obtained greater robustness is chosen such that in order to down-weight the influence larger ratios of , corresponding to outlying values of , have on the analysis. However down-weighting some of these ratios relative to the KL-divergence sacrifices efficiency. This demonstrates the efficiency and robustness trade-off associated with the alpha-divergence.

4.3.2 beta Divergence

The beta-divergence, also known as the density power-divergence Basu . (1998), is a member of the Bregman divergence family


Taking , to be the Tsallis score returns the beta-divergence


. This results in the density power divergence given in equation (10), parameterised by rather than here. Both Basu . (1998) and Dawid . (2016) noticed that inference can be made using the beta-divergence without requiring a density estimate. This was used in Ghosh  Basu (2016) to produce a robust posterior distribution that did not require an estimate of the data generating density, which has been extensively discussed in previous sections.

Considering the density power divergence as part of the -divergence family takes , which results in treating all values of equally in isolation, but taking results in which means observations that have low predicted probability under the model are down-weighted. Under the KL-divergence at , the influence of an observation is inversely related to its probability under the model, through the logarithmic score. Raising above 0 will decrease the influence of the smaller values of , robustifying the inference to tail specification. However this results in a decrease in efficiency relative to methods minimising the KL-divergence (Ghosh  Basu, 2017). As a result there may be some issues using the density power-divergence when high dimensional observations and models are considered. As the dimension increases the predicted probability of each (multivariate) observation shrinks towards 0. The density power divergence down-weights the influence of observations with small predicted probabilities, and as a result ( when we consider the density power divergence) needs to be selected very carefully in order to prevent the analysis from disregarding the majority of the data. This is a price that is paid in order to not require a data generating density estimate.

4.3.3 The S-Hellinger divergence

One further one-parameter special case of the -divergence (S-divergence) is the S-Hellinger divergence given by Ghosh . (2017)


This is generated from the S-divergence by taking . Taking recovers twice the squared Hellinger divergence and gives the squared divergence. Ghosh . (2017) observe that the S-Hellinger divergence is a proper distance metric. Translating these back into the notation of -divergences gives


Therefore gives and . As a result, the squared Hellinger divergence down-weights the influence of large ratios of with respect to smaller ones and also down-weights these ratios when is small. In fact, there is a trade-off here: moving closer to 1 increase away from 1, thus increasing the amount that ratios of are down-weighted for small . However as moves closer to 1, also draws closer to 1, which reduced the amount ratios of are down weighted with respect to smaller ones. Once again we consider trading these two off against each other to be too greater task for the DM to consider. We therefore do not consider the S-Hellinger divergence again here and instead for brevity consider only the choice between the one parameter alpha and density power divergences. In any case, bridging the gap between the Hellinger and the KL-divergence, as the alpha-divergence does, appears to make more sense in an inferential setting than bridging the gap between the Hellinger and squared divergence.

4.4 Comparison

The bullet points below summaries the reasons, introduced above, for which a DM might prefer a specific divergence. Once again we stress that we consider the choice of divergence to be a subjective and context specific judgement to be made by the DM similarly to their prior and model.

  • [leftmargin=*,labelsep=5.8mm]

  • KL-divergence: suitable if the tail behaviour of the model is of high importance or there is the belief that the tails of the model are exactly correctly specified. Minimising the KL-divergence also makes best use out of the data so may be suitable if the sample size is very small and the prior is not necessarily informative.

  • TV-divergence/Hellinger-divergence: suitable if the data set has many observations and robustness for a decision problem with a bounded loss function is of high importance. The Hellinger-divergence appears to be computationally easier to work with (see Appendix A). Both require accurate density estimation

  • alpha-divergence: suitable for when a trade-off between robustness and efficiency is considered. Requires access to a density estimation technique.

  • The density power divergence: suitable again for when a trade-off between robustness and efficiency is considered. Does not require a density estimate but does require to be set carefully so that the data is not completely disregarded by the updating process.

Unfortunately there is no free lunch when it comes to applying any of these methods to complex, high dimensions problems. Minimising the KL-divergence is the most computationally efficient. It provides the possibility of implementing conjugate prior distributions and avoiding the computational burden of producing a density estimate. However as the dimensions and complexity of the problem increases, robustness to model misspecification will become more and more important. Minimising the Hellinger-divergence, TV-divergence and alpha-divergence requires an estimate of the data generating density which is certainly not straightforward for high dimensional data (see the discussion in Section

4.5). However once this has been estimated then can just be selected based on how important tail misspecifications are, with a guarantee on some reasonable efficiency. Minimising the density power divergence has the computational advantage of not requiring an estimate of the data generating density. However when the dimension of the problem is high, much more thought needs to be put into specifying the parameter to ensure that the updating does not completely disregard the data. The technology for doing this effectively is still in its infancy. Without the data generating density, the density power divergence requires information on the shape of the data from another form in order to fit efficiently and robustly.

4.5 Density estimation

As has been mentioned before, for the TV, Hellinger and divergence, it is not possible to exactly calculate the loss function associated with any value of and because the data generating density will not be available. In this case, a density estimate of is required to produce an empirical loss function. The Bayesian can consider the density estimate as providing additional information to the likelihood from the data (see Section 3.3.1’s discussion on the likelihood principle), and can thus consider their general Bayesian posterior inferences to be made conditional upon the density estimate as well as the data. The general Bayesian update is a valid update for any loss function, and therefore conditioning on the density estimate as well as the data still provides a valid posterior. However, how well this empirical loss function approximates the exact loss function associated with each divergence ought to be of interest. The exact loss function is of course the loss function the DM would prefer to use having made the subjective judgement to minimise that divergence. If the density estimate is consistent to the data generating process, then provided the sample size is large the density estimate will converge to the data generating density, and the empirical loss function will then correctly approximate the loss function associated with that divergence. It is this fact that ensures the consistency of the posterior estimates of the minimum Hellinger posterior Hooker  Vidyashankar (2014).

Authors Hooker  Vidyashankar (2014) use a fixed width kernel density estimate (FKDE) to estimate the underlying data generating density and in our examples in Section 5

we adopt this practice using a Radial Basis Function (RBF) kernel for simplicity and convenience. However we note that

Silverman (1986) identifies practical drawbacks of FKDEs, including their inability to correctly capture the tails of the data generating process, whilst not over smoothing the centre, as well as the number of data points required to fit these accurately in high dimensions. In addition to this Tamura  Boos (1986) observe that the variance of the FKDE when using a density kernel in high dimensions lead to asymptotic bias in the estimate that is larger than . Alternatives include using a kernel with better mean-squared error properties (Epanechnikov, 1969; Rosenblatt ., 1976), variable width adaptive KDEs (Abramson, 1982), which Hwang . (1994) show to be promising in high dimensions, piecewise-constant (alternatively tree based) density estimation (Ram  Gray, 2011; Lu ., 2013) which are also promising in high dimensions, or a fully Bayesian Gaussian process as is recommended in Li  Dunson (2016).

5 Illustrations

In this section we aim to illustrate some of the qualitative features associated with conducting inference targeting the minimisation of the different divergences identified in Section 4. Throughout these experiments stan (Carpenter ., 2016) is used to produce fast and efficient samples from the general Bayesian posteriors of interest.

5.1 -open robustness

5.1.1 Simple inference

The experiments below demonstrate the robustness of the general Bayesian update targeting KL-divergence (red), Hellinger-divergence (blue), TV-divergence (pink), alpha-divergence (green) and power-divergence (orange) (in future these may be referred to as KL-Bayes, Hell-Bayes, TV-Bayes, alpha-Bayes and power-Bayes respectively). For illustrative purposes we have fixed for the alpha-divergence and for the power-divergence. Figure 1

plots the posterior predictive originating from fitting a normal model

to two simulated data sets and one ‘real’ data set. For the first data set data points were simulated from a normal -contamination distribution


for the second

were simulated from a Student’s t-distribution with degrees of freedom 4 and the real data set, tracks1, was taken as the first variable from the ‘Geographical Original of Music Data Set’

222downloaded from containing

data points, where a KDE of the data appeared to be approximately normally distributed. Prior distributions

and were used for all examples.

Figure 1: Top: Posterior predictive distributions (smoothed from a sample) arising from Bayesian minimum divergence estimation fitting a normal distribution to an -contaminated normal (left), a t-distribution (middle) and the tracks1 dataset (right) using the KL-Bayes (red), Hell-Bayes (blue), TV-Bayes (pink), alpha-Bayes (green) and power-Bayes (orange). Bottom: Log-density plots (smoothed from a sample) (left) and Normal QQ plots (middle) for the -contaminated normal dataset. Lower right plots the posterior predictive distributions (smoothed from a sample) from alternative models using the KL-Bayes, Normal (red), t (blue), logNormal (green) and gamma (orange).

It is easy to see from the plots in figure 1 that minimising the KL-divergence, requires that the density of the outlying contamination centred at to be captured correctly. This is at the expense of capturing the density of the remaining 99% of the data, this is especially clear in the log-density plot. The TV-Bayes, alpha-Bayes and power-Bayes appear to correctly capture the distribution for 99% of the data. The boundedness of the TV-divergence and the alpha-divergence means that the contamination is not allowed to unduly affect the analysis, while the power-divergence is able to down-weight the influence of the contaminated points as they are ascribed low predicted probability under the model for 99% of the data. The Hell-Bayes appears to fit too small a variance even for 99% of the data. This is due to the small sample efficiency problems which will be discussed in Appendix A.

The Student’s t-distribution has consistently heavier tails than the normal distribution. Thus the top right hand plot of figure 1 more clearly demonstrates the importance placed on tail misspecification by each method. The KL-divergence fits the largest variance to correctly capture these tails for the most extreme tail observations, the alpha-divergence is able to fit a slight smaller variance because of its bounded nature. However the variance of the alpha-Bayes predictive is still larger than the other methods produce because of the greater convexity of the scoring function depicted in figure 4. Lastly Hell-Bayes, TV-Bayes and power-Bayes place the least weight on tail misspecification and are therefore able to fit a smaller variance and produce a predictive more closely resembling the data generating process for the majority of the data.

The importance given to tail misspecification is also evident in the ‘tracks’ example. Here there is an ordering from TV, power, Hellinger, alpha and KL divergence on both bias towards the right tail and on the size of the predictive variance, in response to the slight positive skew of the KDE of the data. Here it is clear to see that the power, TV, Hellinger and power divergence produce much better fits of the majority of the data than the other methods do. Lastly, the bottom right plot demonstrates how several possible alternative models to the Gaussian perform on the ‘tracks1’ data set

333the data set was transformed by adding

to every value in order to make the data strictly positive so the gamma and log-Normal distributions could be applied

, when updating using the KL-divergence. This shows that a Gaussian distribution was actually the best fit for the bulk of the data, and the poor fit achieved is down to the importance placed on tail misspecification by the estimating procedure rather than the model selected.

5.2 Regression under heteroscedasticity

In addition to the simple inference examples above we consider how changing the divergence can affect inference in a regression example. From the previous examples we can see that when the tails of the model are misspecified the KL-divergence minimising predictive distribution inflates the variance of the fitted model to ensure no observations are predicted with low probability. Further, placing large weight on tail observations, which occur with low probability, creates large variance across repeat sampling. This is exactly why parameter estimates in linear regression under heteroscedasticity errors have a large variance.

Bayesian minimum divergence inference places less weight on tail observations: it is thus able to produce inferences with a smaller predictive variance and a smaller variance across repeat sampling. While repeat sampling and estimation variance are not problems in the Bayesian paradigm these results do show that traditional Bayesian inference can be somewhat imprecise when the tails are misspecified which is clearly undesirable when conditioning on observed data.

In order to demonstrate this we simulated data points with repeats from the following heteroscedastic linear model


For the experiments we simulated and each were held constant across experiments. Figure 2 plots one realisation from this heteroscedastic linear model.

Figure 2: One of the repeat data sets simulated from the heteroscedastic linear model with p=1.

We then conducted general Bayesian updating using the five divergences mentioned above with priors and . The mean squared errors (MSE) for the posterior means of the parameters , the MSE for the predictive means on a test set of size 100 simulated from the model without error and the predictive mean variances are presented in Table 1 below. In order to apply the Hellinger, TV and alpha divergences to a regression problem an estimate of the conditional density of the response given the covariates is required. We follow authors Hooker  Vidyashankar (2014) and implement conditional KDEs to approximate the true data generating density. For simplicity, the familiar two-stage bandwidth estimation process of Hansen (2004) was used to find the optimal bandwidth parameters.

MSE KL Hell TV alpha power
Test Test Test Test Test
p=1 0.03 3.25 0.03 3.42 0.04 4.51 0.03 2.57 0.02 1.92
p=5 0.09 7.92 0.06 5.61 0.05 5.12 0.04 4.32 0.04 3.56
p=10 0.15 14.70 0.09 9.63 0.10 10.43 0.08 8.64 0.08 8.40
p=15 0.24 22.93 0.14 13.45 0.16 15.47 0.12 11.65 0.12 11.38
p=20 0.29 25.88 0.19 17.71 0.18 16.46 0.16 14.79 0.16 14.89

KL Hell TV alpha power
p=1 2.34 0.78 0.62 1.19 0.96
p=5 2.36 0.51 0.47 0.95 0.98
p=10 2.34 0.47 0.56 0.89 1.13
p=15 2.41 0.49 0.94 0.83 1.19
p=20 2.38 0.50 1.19 0.82 1.27
Table 1: Top: Table of posterior mean MSE values for parameters and MSE for a test set simulated without error, across repeats from datasets of size with the dimension of under the Bayesian minimum divergence technology. Bottom: Table of posterior mean values for the variance of the linear model to be interpreted in terms of the precision of the predictive distribution minimising the respective divergence criteria.

The bottom of table one demonstrates that the alternative divergences appear to be learning a smaller predictive variance than the KL-Bayes does under heteroscedastic errors. Under divergences alternative to the KL-divergence this is no longer an estimate of the variance of the responses in the data set, given the covariates. This should rather be interpreted predictively as the variance of the predictive distribution which is closest to the data generating distribution in terms of that alternative divergence. Therefore fitting a smaller variances demonstrates that the other divergences are placing more importance on fitting the majority of the data rather than just the outliers.

The top of table 1 illustrates the impact fitting a large variance has on the parameter estimates of the mean function. Placing less influence on outliers allows all of the alternative divergences to produce more precise estimates of the parameters of the underlying linear relationship. This then leads to better performance when predicting the test set. Clearly the errors for all of the methods will increase as increases as the same amount of data is used to estimate more parameters. However it is clear that the errors under the KL-divergence are rising more rapidly. By being less sensitive to the error distribution the alternative divergences are better able to capture the true underlying process.

5.3 Time series analysis

In order to further demonstrate how inflating the variance by targeting the KL-divergence under misspecification can damage inference, we consider a less trivial time series example. We simulate from an auto-regressive process of order L (AR(L)), and then consider additive independent generalised auto-regressive conditionally heteroscedastic of order (1,1) (GARCH(1,1)) errors, with


where , , . GARCH processes are used to model non-stationary, chaotic time series where the variance of the process depends on the magnitude and sign of the previous observations of the process. Eliciting a GARCH process from a DM is a difficult task. It is far from obvious how this GARCH process behaves as a function of its parameters and selecting a lag length for the AR process as well as two lag lengths for the GARCH Process increases the complexity of the model selection problem. Therefore it seems conceivable that a DM could want to fit a simple AR process to noisy time series data in order to investigate the underlying process. One situation where this may be desirable is in financial time series applications where large amounts of data can arrive at a very high frequency.

In order to investigate how the minimum divergence methods perform in this scenario we simulated 3 data sets with , and fitted and AR(L) process to these, where was chosen to match the underlying AR process. The 3 data sets were given by

  1. an AR(3) with

  2. an AR(1) with with GARCH(1,1) errors

  3. an AR(1) with with GARCH(1,1) errors

The plots in figure 3 demonstrate the one-step ahead posterior predictive performance of the minimum divergence posteriors on a test set , simulated from the underlying AR process. Under misspecification we show only the inference under the Hell-Bayes to avoid cluttering the plots, the other minimum divergence posteriors perform similarly. We use the same priors as the regression example and once again conditional density estimates were used for the Hellinger, TV and alpha divergences.

Figure 3: Top: One step ahead posterior predictions arising from Bayesian minimum divergence estimation fitting to correct AR model to an AR(3) (left), an AR(1) with GARCH(1,1) errors, , (middle) and a AR(1) with GARCH(1,1) errors, , (right) using the KL-Bayes (red), Hell-Bayes (blue), TV-Bayes (pink), alpha-Bayes (green) and power-Bayes (orange). Bottom: the difference in one step ahead posterior squared prediction squared errors between the KL-Bayes and the Hell-Bayes.

The left hand plots demonstrate that when the model is correctly specified the minimum divergence posteriors produce similar time series inference to the KL-divergence. This is most easily seen in the lower plot which shows the difference in the squared prediction errors between the KL-Bayes and the Hell-Bayes is mostly around 0 and table 2 which demonstrates the root mean squared error across the test set is the same under both methods. The middle plots demonstrate how the KL-Bayes and the Hell-Bayes perform under ‘extreme’ volatility in the error distribution. The tails of the model being very poorly specified causes the KL-Bayes posterior to fit a huge variance with the average posterior predictive variance across the test data set being slightly above 26. Fitting such a high variance makes the inference on the lag parameters insensitive to the data. As a result the underlying trend in the data is completely missed. In contrast the Hell-Bayes posterior predictive distributions have a much smaller variance of around 7. This allows the inference on the lag parameters to be much more sensitive to the underlying AR process. Clearly the Hell-Bayes is unable to exactly fit the truth as the model is misspecified. However it does a much better job of capturing the broad features of the underlying dependence between time points. This is clearly demonstrated by the considerably lower root mean squared predictive error showed in table 2 and by the error differences plot being mostly large and positive. The right hand plots demonstrates how the KL-Bayes and Hell-Bayes perform when the error distribution is less volatile. When the volatility is smaller the KL-Bayes predictive variance is also smaller. Therefore the inference is more sensitive to the underlying trend in the data than in the previous example. However the true dependence in the data is still under estimated relative to the Hell-Bayes, this is again demonstrated in table 2 and the error difference plot. Once again this shows that the way in which the KL-Bayes deals with misspecification, increasing the predictive variance, can mask some of the underlying trends in the data which can be discovered by other methods. Table 2 plots the root mean squared errors (RMSE) for the KL-Bayes and Hell-Bayes on the test set to quantify their correspondence to reality.

RMSE Correctly Specified High Volatility Low Volatility
KL 0.49 1.87 1.21
Hell 0.48 1.07 0.94
Table 2: Root mean squared errors (RMSE) for the KL-Bayes and Hell-Bayes posterior mean predictions for 100 test data points from the underlying AR model.

5.4 Application to high dimensions

The examples is this paper only demonstrate the performance of these minimum divergence techniques for relatively small dimensional problems so as to clearly demonstrate the effect that tail misspecifications can have. However, it is as the dimension and complexity of the problem increases that these methods become more and more important, this can be put down to two aspects. The first of these is that outliers or highly influential contaminant data-points become hard to identify in high dimensions. In our examples it is clear from looking at the KDE or histogram of the data that there are going to be outlying observations, but in many dimensions visualising the data in this way is not possible. In addition to this automatic methods for outlier detection struggle in high dimensions

(Filzmoser ., 2008).

The second factor in requiring robust inference in high dimensions, is that not only are outliers harder to identify, they are more likely to occur. The occurrence of outliers could be reinterpreted as indicating that the DM’s belief model is misspecified in the tails. The fact that these occur should be unsurprising. We have already discussed specifying beliefs about tail behaviour requires thinking about very small probabilities which is known to be difficult to do Winkler  Murphy (1968); O’Hagan . (2006), and often routine assumptions (for example Gaussianity) may be applied. As the dimension of the space increase the tails of the distribution account for a greater proportion of the overall density, increasing the chance of seeing observations that differ from the practitioners beliefs.

6 Discussion

This article uses general Bayesian updating Bissiri . (2016) in order to theoretically justify a Bayesian update that targets the parameters of a model that minimise a statistical divergence to the data generating process that is not the KL-divergence. When the -open world is considered, moving away from targeting the minimisation of the KL-divergence can provide an important tool in order to robustify a statistical analysis. The desire for robustness ought to only increase as increasingly bigger models are built to approximate more complex real world processes. This paper provides the statistical practitioner with the principled justification to select the divergence they use for their analysis in a subjective manner allowing them the potential to make more useful predictions from their best approximate belief model.

As it stands however more work is required to advise on the selection of the divergence used for the updating. Ways to more clearly articulate the impact of choosing a certain divergence to a DM needs further research along with ways to help guide the choice of any hyperparameters associated with the divergence. Further experimentation with complex real world data sets is also required to analyse how this robustness-efficiency trade-off associated with the selected divergence manifests itself in practice.

Another issue not addressed in this paper is how to tailor computational algorithms to the inferences we describe here. There exists a vast literature on optimising MCMC algorithms to sample from traditional Bayesian posteriors and in order to fully take advantage of the subjectivity this paper allows a statistician, a whole new class of computational algorithms tailored to different divergences may be required. In addition several of the divergences mentioned in this paper require a density estimate of the underlying process and further research into effectively doing this for complex high dimensional datasets can only improve the performances of these methods for real world problems.


We are indebted to 3 anonymous referees in helping us revise the structure and examples in this article. The authors would also like to thank Jeremias Knoblauch for his helpful discussions when revising the article. The first author acknowledges his research fellowship through the OxWaSP doctoral programme, funded by EPSRC.


  • Abramson (1982) abramson1982bandwidthAbramson, IS.  1982. On bandwidth variation in kernel estimates-a square root law On bandwidth variation in kernel estimates-a square root law. The annals of Statistics1217–1223.
  • Barron (1988) barron1988exponentialBarron, AR.  1988.

    The exponential convergence of posterior probabilities with implications for Bayes estimators of density functions The exponential convergence of posterior probabilities with implications for bayes estimators of density functions.

    Department of Statistics, University of Illinois.
  • Basu . (1998) basu1998robustBasu, A., Harris, IR., Hjort, NL.  Jones, M.  1998. Robust and efficient estimation by minimising a density power divergence Robust and efficient estimation by minimising a density power divergence. Biometrika853549–559.
  • Basu . (2011) basu2011statisticalBasu, A., Shioya, H.  Park, C.  2011. Statistical inference: the minimum distance approach Statistical inference: the minimum distance approach. CRC Press.
  • Beran (1977) beran1977minimumBeran, R.  1977. Minimum Hellinger distance estimates for parametric models Minimum hellinger distance estimates for parametric models. The Annals of Statistics445–463.
  • Berger . (1994) berger1994overviewBerger, JO., Moreno, E., Pericchi, LR., Bayarri, MJ., Bernardo, JM., Cano, JA.others  1994. An overview of robust Bayesian analysis An overview of robust bayesian analysis. Test315–124.
  • Bernardo  Smith (2001) bernardo2001bayesianBernardo, JM.  Smith, AF.  2001. Bayesian theory. Bayesian theory. IOP Publishing.
  • Bissiri . (2016) bissiri2016generalBissiri, P., Holmes, C.  Walker, SG.  2016. A general framework for updating belief distributions A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
  • Box (1980) box1980samplingBox, GE.  1980. Sampling and Bayes’ inference in scientific modelling and robustness Sampling and bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General)383–430.
  • Carpenter . (2016) carpenter2016stanCarpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M.Riddell, A.  2016. Stan: A probabilistic programming language Stan: A probabilistic programming language. Journal of Statistical Software20.
  • Celeux . (2017) celeux2017someCeleux, G., Jewson, J., Josse, J., Marin, JM.  Robert, CP.  2017. Some discussions on the Read Paper” Beyond subjective and objective in statistics” by A. Gelman and C. Hennig Some discussions on the read paper” beyond subjective and objective in statistics” by a. gelman and c. hennig. arXiv preprint arXiv:1705.03727.
  • Cichocki . (2011) cichocki2011generalizedCichocki, A., Cruces, S.  Amari, Si.  2011. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy131134–170.
  • Cressie  Read (1984) cressie1984multinomialCressie, N.  Read, TR.  1984. Multinomial goodness-of-fit tests Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society. Series B (Methodological)440–464.
  • Csisz . (1967) csisz1967informationCsisz, I. .  1967. Information-type measures of difference of probability distributions and indirect observations Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar.2299–318.
  • Dawid . (2016) dawid2016minimumDawid, AP., Musio, M.  Ventura, L.  2016. Minimum scoring rule inference Minimum scoring rule inference. Scandinavian Journal of Statistics431123–138.
  • Devroye  Gyorfi (1985) devroye1985nonparametricDevroye, L.  Gyorfi, L.  1985. Nonparametric density estimation: the L1 view Nonparametric density estimation: the l1 view ( 119). John Wiley & Sons Incorporated.
  • Epanechnikov (1969) epanechnikov1969nonEpanechnikov, VA.  1969. Non-parametric estimation of a multivariate probability density Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications141153–158.
  • Filzmoser . (2008) filzmoser2008outlierFilzmoser, P., Maronna, R.  Werner, M.  2008. Outlier identification in high dimensions Outlier identification in high dimensions. Computational Statistics & Data Analysis5231694–1711.
  • Gelman  Hennig (2015) gelman2015beyondGelman, A.  Hennig, C.  2015. Beyond subjective and objective in statistics Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society).
  • Ghosh  Basu (2016) ghosh2016robustGhosh, A.  Basu, A.  2016. Robust Bayes estimation using the density power divergence Robust bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics682413–437.
  • Ghosh  Basu (2017) ghosh2017generalGhosh, A.  Basu, A.  2017. General Robust Bayes Pseudo-Posterior: Exponential Convergence results with Applications General robust bayes pseudo-posterior: Exponential convergence results with applications. arXiv preprint arXiv:1708.09692.
  • Ghosh . (2017) ghosh2017generalizedGhosh, A., Harris, IR., Maji, A., Basu, A., Pardo, L. .  2017. A generalized divergence for statistical inference A generalized divergence for statistical inference. Bernoulli234A2746–2783.
  • Goldstein (1990) goldstein1990influenceGoldstein, M.  1990. Influence and belief adjustment Influence and belief adjustment. Influence Diagrams, Belief Nets and Decision Analysis143–174.
  • Goldstein (1999) goldstein1999bayesGoldstein, M.  1999. Bayes linear analysis Bayes linear analysis. Wiley StatsRef: Statistics Reference Online.
  • Goldstein . (2006) goldstein2006subjectiveGoldstein, M. .  2006. Subjective Bayesian analysis: principles and practice Subjective bayesian analysis: principles and practice. Bayesian Analysis13403–420.
  • Greco . (2008) greco2008robustGreco, L., Racugno, W.  Ventura, L.  2008. Robust likelihood functions in Bayesian inference Robust likelihood functions in bayesian inference. Journal of Statistical Planning and Inference13851258–1270.
  • Grünwald  Dawid (2004) grunwald2004gameGrünwald, PD.  Dawid, AP.  2004. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. Annals of Statistics1367–1433.
  • Hampel . (2011) hampel2011robustHampel, FR., Ronchetti, EM., Rousseeuw, PJ.  Stahel, WA.  2011. Robust statistics: the approach based on influence functions Robust statistics: the approach based on influence functions ( 114). John Wiley & Sons.
  • Hansen (2004) hansen2004nonparametricHansen, BE.  2004. Nonparametric conditional density estimation Nonparametric conditional density estimation. Unpublished manuscript.
  • Hooker  Vidyashankar (2014) hooker2014bayesianHooker, G.  Vidyashankar, AN.  2014. Bayesian model robustness via disparities Bayesian model robustness via disparities. Test233556–584.
  • Huber  Ronchetti (1981) huber1981robustHuber, PJ.  Ronchetti, E.  1981. Robust statistics, Series in probability and mathematical statistics. Robust statistics, series in probability and mathematical statistics. John Wiley, New York.
  • Hwang . (1994) hwang1994nonparametricHwang, JN., Lay, SR.  Lippman, A.  1994. Nonparametric multivariate density estimation: a comparative study Nonparametric multivariate density estimation: a comparative study. IEEE Transactions on Signal Processing42102795–2810.
  • Lazer . (2014) lazer2014parableLazer, D., Kennedy, R., King, G.  Vespignani, A.  2014. The parable of Google Flu: traps in big data analysis The parable of google flu: traps in big data analysis. Science34361761203–1205.
  • Li  Dunson (2016) li2016frameworkLi, M.  Dunson, DB.  2016. A framework for probabilistic inferences from imperfect models A framework for probabilistic inferences from imperfect models. arXiv preprint arXiv:1611.01241.
  • Lu . (2013) lu2013multivariateLu, L., Jiang, H.  Wong, WH.  2013. Multivariate density estimation by bayesian sequential partitioning Multivariate density estimation by bayesian sequential partitioning. Journal of the American Statistical Association1085041402–1410.
  • Miller  Dunson (2015) miller2015robustMiller, JW.  Dunson, DB.  2015. Robust Bayesian inference via coarsening Robust bayesian inference via coarsening. arXiv preprint arXiv:1506.06101.
  • O’Hagan . (2006) o2006uncertainO’Hagan, A., Buck, CE., Daneshkhah, A., Eiser, JR., Garthwaite, PH., Jenkinson, DJ.Rakow, T.  2006. Uncertain judgements: eliciting experts’ probabilities Uncertain judgements: eliciting experts’ probabilities. John Wiley & Sons.
  • Owen (1991) owen1991empiricalOwen, A.  1991. Empirical likelihood for linear models Empirical likelihood for linear models. The Annals of Statistics1725–1747.
  • Ram  Gray (2011) ram2011densityRam, P.  Gray, AG.  2011. Density estimation trees Density estimation trees. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining Proceedings of the 17th acm sigkdd international conference on knowledge discovery and data mining ( 627–635).
  • Rosenblatt . (1976) rosenblatt1976maximalRosenblatt, M. .  1976. On the maximal deviation of -dimensional density estimates On the maximal deviation of -dimensional density estimates. The Annals of Probability461009–1015.
  • Sason  Verdú (2015) sason2015boundsSason, I.  Verdú, S.  2015. Bounds among f-divergences Bounds among f-divergences. submitted to the IEEE Trans. on Information Theory.
  • Shun-ichi (2012) shun2012differentialShun-ichi, A.  2012. Differential-geometrical methods in statistics Differential-geometrical methods in statistics ( 28). Springer Science & Business Media.
  • Silverman (1986) silverman1986densitySilverman, BW.  1986. Density estimation for statistics and data analysis Density estimation for statistics and data analysis ( 26). CRC press.
  • J. Smith (1995) smith1995bayesianSmith, J.  199510. Bayesian Approximations and the Hellinger metric. Bayesian approximations and the hellinger metric. Unpublished manuscript
  • JQ. Smith (2010) smith2010bayesianSmith, JQ.  2010. Bayesian decision analysis: principles and practice Bayesian decision analysis: principles and practice. Cambridge University Press.
  • Tamura  Boos (1986) tamura1986minimumTamura, RN.  Boos, DD.  1986. Minimum Hellinger distance estimation for multivariate location and covariance Minimum hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association81393223–229.
  • Walker (2013) walker2013bayesianWalker, SG.  2013. Bayesian inference with misspecified models Bayesian inference with misspecified models. Journal of Statistical Planning and Inference143101621–1633.
  • Watson . (2016) watson2016approximateWatson, J., Holmes, C. .  2016. Approximate models and robust decisions Approximate models and robust decisions. Statistical Science314465–489.
  • Winkler  Murphy (1968) winkler1968evaluationWinkler, RL.  Murphy, AH.  1968. Evaluation of subjective precipitation probability forecasts Evaluation of subjective precipitation probability forecasts. Proceedings of the first national conference on statistical meteorology Proceedings of the first national conference on statistical meteorology ( 148–157).
  • Zellner (1988) zellner1988optimalZellner, A.  1988. Optimal information processing and Bayes’s theorem Optimal information processing and bayes’s theorem. The American Statistician424278–280.

Appendix A M-closed efficiency

The examples in Section 5, demonstrate how minimising an alternative divergence to the KL-divergence can lead to superior robustness to tail misspecifications. We next illustrate that by placing less importance on tail misspecifications in order to gain improved robustness, the DM must trade-off a decrease in efficiency in the case when the model class does in fact contain the data generating process. Minimising the KL-divergence uses Bayes rule and thus conditions on the data coming from the model. As a result minimising the KL-divergence is bound to perform the best when the model class contains the data generating distribution. Here we illustrate this but also demonstrate the reassuring fact that the MDE methods perform comparatively even in this situation.

In order to examine the frequentist efficiency, the observed mean squared error (MSE) , over repeats of a simulated experiment are examined. The MSE’s are examined on two examples and when fitting the model with prior and for . The prior for is centred on the data while the prior for is not. Table 3 plots the observed MSEs over replications444The stan output from a number of these experiments associated with the TV divergence suggested that the sample from the posterior was not sufficient to use for posterior inference, for the purpose of this preliminary analysis these experiments were disregarded here, but this suggests there exists computational issues associated with the TV-Bayes..

MSE KL Hell TV alpha power
n=50 1.89 4.43 1.50 16.88 2.68 6.77 2.03 4.29 0.76 93.80
n=100 0.64 1.29 0.57 5.19 1.20 1.83 0.67 1.25 0.44 33.80
n=200 0.35 0.47 0.32 1.50 0.48 0.50 0.35 0.47 0.31 7.83
n=500 0.15 0.16 0.14 0.33 0.28 0.31 0.15 0.16 0.20 1.42
n=50 1.11 1.17 1.51 1.28 3.47 6.09 1.13 1.13 8.23 67.94
n=100 0.60 0.61 0.98 0.87 1.30 1.61 0.65 0.64 1.99 10.64
n=200 0.23 0.23 0.57 0.52 0.42 0.44 0.30 0.29 0.46 1.09
n=500 0.10 0.10 0.26 0.26 0.29 0.27 0.12 0.13 0.18 0.21
Table 3: Table of posterior mean MSE values for , from data sets of size under the Bayesian minimum divergence technology.

Table 3 demonstrates several interesting points about these different methods. Firstly it appears as though the KL-Bayes, Hell-Bayes and alpha-Bayes perform very similarly when , while the power-Bayes and TV-Bayes performs marginally worse. As we should expect, for the KL-Bayes appears to be optimal for both priors with the -Bayes appearing to perform the next best. Table 3 also shows the differing impacts the prior has on the different Bayesian updates. The KL-Bayes is least affected by the prior with the -Bayes and the TV-Bayes the next least. Authors Hooker  Vidyashankar (2014) recognised that the boundedness of the Hell-Bayes score function compared with the KL-Bayes score function, may cause problems such as this. The -Bayes score function is larger than the Hell-Bayes score function so our findings would be consistent with this. The power-Bayes score function with is smaller than the Hell-Bayes score function when . This explains why the power-Bayes is impacted more by the prior, and provides a likely explanation of why the power-Bayes appears to perform so well when the prior is specified around the data. For small values of , which occur when the variance of is 100, the absolute size of the TV-Bayes score function is generally larger than the Hell-Bayes score, again explaining why the TV-Bayes performs better than the Hell-Bayes under a poorly specified prior.

Examining whether the estimates of were generally above or below the data generating parameter demonstrates how these different methods learn, see table 4. For all sample sizes the KL-Bayes appears to over or under estimate the true variance in equal proportions over the 50 repeats, while the Hell-Bayes and alpha-Bayes, to a lesser extent, generally under predict the true variance.

MSE KL Hell TV alpha power
n=50 -12 -32 -10 -50 -5 -37 -12 -32 -8 -50
n=100 -16 -40 -14 -50 -15 -22 -18 -40 -12 -50
n=200 -2 -24 -8 -46 1 -10 -2 -26 -2 -50
n=500 -6 -16 -6 -42 1 3 -10 -18 -8 -48
n=50 4 6 -26 -10 25 27 -8 -6 46 50
n=100 0 0 -36 -32 19 12 -14 -14 38 48
n=200 -2 -2 -40 -40 23 22 -24 -24 28 40
n=500 -2 -2 -36 -40 27 21 -28 -28 14 20
Table 4: Table of sums of posterior mean over (positive) or under (negative) estimation for , from data sets of size under the Bayesian minimum divergence technology

Figure 4 compares the scoring functions used in the general Bayesian update targeting the KL, Hellinger and alpha divergences. As increases, the upper bound on the score associated with the alpha-divergence increases and the scoring functions become more convex, tending towards the logarithmic score when . In contrast, the score associated with the Hellinger-divergence is closer to being linear. We believe that this is responsible for causing the smaller variance estimates when minimising the Hellinger-divergence and alpha-divergence when . The closer the score is to being linear the more similar the penalty is for over and under predicting the probability of an observation when compared with reality. The bounded nature of the scores prevents the penalty for under predicting being too high, and therefore more posterior predictive mass is placed near the MAP of the posterior predictive distribution. In contrast, a score function with greater convexity will penalise under prediction compared with the data generating density to a greater extent spreading the posterior mass out further. However, one must keep in mind that it is the sever nature of the penalty for under prediction incurred by the KL-divergence that renders it non-robust. On the other hand the TV-Bayes and the power-Bayes generally over predict the variance. For the power-Bayes this appears to be because the second term of the power divergence score is


for a multivariate Gaussian model, and this is decreased when the variances are made large. As was established in Section 4.3, the power-Bayes with decreases the influence observations with small likelihood under the model have on the inference. In the example above the data has a variance of and therefore the likelihood of each observation is very small. As a result the data is then excessively down-weighted by the power-Bayes when . When the sample size is small, a small increase in the variance may only decrease the observed likelihood term of the power-divergence score by a fraction and may increase the term in equation (38) by more, resulting in a variance estimate that is too large. The score function for the TV-Bayes is the only score function considered here that is not monotonic in the predicted probability of each observation. While under the other scores, the score for each individual observation can be increased by predicting that observation with greater probability, this is not the case for the TV-Bayes. We believe this causes the TV-Bayes to produces variance estimates that are too large. We note that all of the scores remain proper for large samples and that these are just observations about how the different methods perform for small samples sizes when the model is correct.

Figure 4: Score comparisons when quoting probability for event with probability . Red = KL, blue = Hellinger (, then divide by ), green = alpha ( = solid, = dashed, = dotted)