1 Introduction
In the closed world, where a fitted model class is implicitly assumed to contain the sample distribution from which the data came, Bayesian updating is highly compelling. However, modern statisticians are increasingly acknowledging that their inference is taking place in the open world (Bernardo Smith, 2001). Within this framework we acknowledge that any class of models we choose is unlikely to capture the actual sampling distribution, and is then at best an approximate description of our own beliefs and the underlying real world process. In this situation it is no longer possible to learn about the parameter which generated the data and a statistical divergence measure must be specified between the fitted model and the genuine one in order to define the parameter targeted by the inference Walker (2013). Remarkably in the open world standard Bayesian updating can be seen as a method which learns a model by minimising the predictive KullbackLeibler (KL)divergence from the model from which the data were sampled (Walker, 2013; Bissiri ., 2016). Therefore traditional Bayesian updating turns out to still be a well principled method for belief updating provided the decision maker (DM) concerns themselves with the parameter of the model that is closest to the data as measured by KLdivergence (Walker, 2013). Recall the KLdivergence of functioning model from data generating density is given by
(1) 
It is well known that when the KLdivergence is large, which is almost inevitable in high dimensional problems, the KLdivergence minimising models can give a very poor approximation if our main interest is in getting the central part of the posterior model uncertainty well estimated. This is because the KLdivergence gives large consideration to correctly specifying the tails of the generating sample distribution  see Section 3.2.1. As a result, when the DM acknowledges that they are in the open world, they have several options currently available to them:

[leftmargin=*,labelsep=4.9mm]

Proceed as though the model class does contain the true sample distribution and conduct a posteriori sensitivity analysis.

Modify the model class in order to improve its robustness properties.

Abandon the given parametric class and appeal to more data driven techniques.
To this we add:

[resume,leftmargin=*,labelsep=4.9mm]

Acknowledge that the model class is only approximate, but is the best available, and seek to infer the model parameters in a way that are most useful to the decision maker.
Method 1 is how Box (1980)
recommends approaching parametric model estimation and is the most popular approach amongst statisticians. Although it is acknowledged that the model is only approximate, Bayesian inference is applied as though the statistician believes the model to be correct. The results are then checked to examine how sensitive these are to the approximations made. Authors
Berger . (1994) provide a thorough review while Watson . (2016) consider this in a decision focused manner.Method 2 corresponds to the classical robustness approach. Within this approach, one model within the parametric class may be substituted for a model providing heavier tails (Berger ., 1994). Alternatively different estimators, for example Mestimators Huber Ronchetti (1981) Hampel . (2011), see Greco . (2008) for a Bayesian analogue, are used instead of those justified by the model class. Bayes linear methods Goldstein (1999)
, requiring many fewer belief statements than a full probability specification, form a special subclass of these techniques.
The third possibility is to abandon any parametric description of the probability space. Examples of this solution include, empirical likelihood methods (Owen, 1991); a decision focused general Bayesian update (Bissiri ., 2016)
; Bayesian nonparametric methods; or to appeal to statistical learning methods such as neural networks or support vector machines.
Though alternatives 2 and 3 can be shown to be very powerful in certain scenarios, it is our opinion that within the context described above that a fourth option holds considerable relative merit. In an applied statistical problem the model(s) represents the DM’s best guess of the sample distribution that could be applied. The model provides the only opportunity to input not only structural, but quantitative expert judgements about the domain, something that is often critical to a successful analysis  see Lazer . (2014) for example. The model provides an interpretable and transparent explanation about how different factors might be related to each other. This type of evidence is often essential when advising on decisions to policy makers. Often, simple assumptions play important roles in providing interpretability to the model and in particular preventing it from over fitting to any nongeneric features contained in any one data set.
For the above reasons an unambiguous statement of a model, however simple, is in our opinion an essential element for much of applied statistics. In light of this, statistical methodology should also be sufficiently flexible to cope with the fact that the DM’s model is inevitably an approximation both of their beliefs and the real world process. Currently, Bayesian statistics sometimes struggles to do this well. We argue below that this is because it implicitly minimises the KLdivergence to the underlying process.
By fitting model parameter in a way that can be nonrobust, the DM is having to combine their best guess belief model with something that will be robust to the parameter fitting. The DM must seek the best representation of their beliefs about a process in order to make future predictions. However under traditional Bayesian updating they must also give consideration to how robust these beliefs are. This seems an unfair task to ask of the DM. We therefore propose to decouple what the DM believes about the data generating process, from how the DM wishes the model to be fitted. This results in option 4 above.
This fourth option suggest the DM may actually want to explicitly target more robust divergences, a framework commonly known as Minimum Divergence Estimation (MDE), see Basu . (2011). Minimum divergence estimation is of course a welldeveloped field by frequentists, with Bayesian contributions coming more recently. However when the realistic assumption of being in the open world is considered the currently proposed Bayesian minimum divergence posteriors fail to fully appreciate the principled justification and motivation required to produce a coherent updating of beliefs. A Bayesian cannot therefore make principled inference using currently proposed methods in the open setting, except in a way that Miller Dunson (2015) describe as “tend(ing) to be either limited in scope, computationally prohibitive, or lacking a clear justification”. In order to make principled inference it appears as though the DM must currently concern themselves with the KLdivergence. However in this paper we remove this reliance upon the KLdivergence by providing a justification for Bayesian updating minimising alternative divergences, both theoretically and ideologically. Our updating of beliefs does not produce an approximate or pseudo posterior, but uses general Bayesian updating (Bissiri ., 2016) to produce the coherent posterior beliefs of a decision maker who wishes to produce predictions from a model that provide an explanation of the data that is as good as possible in terms of some prespecified divergence measure. By doing this the principled statistical practice of fitting model parameters to produce predictions is adhered to, but the parameter fitting is done so acknowledging the open nature of the problem.
In Section 2 of this article we review the currently available Bayesian minimum divergence estimation techniques and introduce general Bayesian updating Bissiri . (2016). The contributions of this article are then as follows: in Section 3 we identify the inadequacies in the justification provided by the currently available Bayesian MDE methods and use general Bayesian updating to prove that the Bayesian can still do principled inference on the parameters of the model using alternative, more robust divergences to KLdivergence. This theoretical contribution then allows us to propose a wider variety of divergences that the Bayesian could wish to minimise than have currently been consider in the literature in Section 4. In this section we also consider decision theoretic reasons why targeting alternative divergences to the KLdivergence can be more desirable. Lastly in Section 5 we demonstrate the impact model misspecifications can have on a traditional Bayesian analysis for simple inference, regression and time series analysis, and that superior robustness can be obtained by minimising alternative divergences to the KLdivergence. In appendix A we also show that when the observed data is in fact generated from the model, these methods can be shown not to lose too much precision. In this paper we demonstrate that this advice is not simply based on expedience but has a foundation in a principled inferential methodology. In this purpose we have deliberately restricted ourselves to simple demonstrations designed to provide a transparent illustration of the impact that changing the divergence measure can have on inferential conclusions. However we discuss how robust methodology becomes more important as problems and models become more complex and high dimensional and thus encourage practitioners to experiment with this methodology in practice.
2 Related work
2.1 Bayesian theory and scoring functions
Authors Bernardo Smith (2001) consider Bayesian inference through the lens of a Bayesian decision problem, where quoting a probability belief distribution for future uncertainties is the action to be taken. In this scenario, a scoring rule is used to define the loss of quoting a distribution and observing a realisation . They argue that the scoring rule associated with scoring probabilistic predictions should be proper and local. A proper scoring rule results in the DM’s expected loss being minimised when the DM quotes their true beliefs, and a local scoring rule is one where the score only depends on the quoted probability of the actual observed outcome and nothing else. The only proper, local scoring rule is the logarithmic scoring rule:
(2) 
The expected additional loss when quoting distribution , with density , for future outcomes when the data is distributed according to distribution , with density , is then:
(3) 
the KullbackLeibler (KL) divergence between the two probability distributions in equation (
1). In fact authors Grünwald Dawid (2004) define any divergence , associated with proper scoring rule as the extra penalty for believing was distributed according to when it was actually distributed according to .(4) 
2.2 The data generating process and the Mopen world
Sometimes in this paper we use the phrase “the data generating process”. The data generating process is a widespread term in the literature and appears to suggest that ‘Nature’ is using a simulator to generate observations. While this may fit nicely with some theoretical contributions, it becomes difficult to argue for in reality. In this article we consider the data generating process to represent the DM’s true beliefs about the sample distribution of the observations. However in order to correctly specify these the DM must be able to take the time and infinite introspection to consider all of the information available to them/in the world in order to produce probability specifications in the finest of details. As is pointed out by Goldstein (1990) this requires many more probability specifications to be made at a much higher precision than any DM is ever likely to be able to manage within time constraints of the problem. As a result these genuine beliefs must be approximated. This defines the subjectivist interpretation of the open world we adopt in this paper  the model used for the belief updating is only ever feasibly an approximation of the DM’s true beliefs about the sample distribution they might use if they had enough time to fully reflect. In the special case when the data is the result of a draw from a known probability model  a common initial step in validating a methodology  then this thoughtful analysis and ”a data generating process” obviously coincide. Henceforth we use the “the data generating process” in this sense to align our terminology as closely as possible with that in common usage.
2.3 General Bayesian updating
Following the above motivation, Bissiri . (2016) consider only conducting inference about the factors of the world/problem that matter to the DM. They produce what they refer to as a general Bayesian update  a coherent method to produce a posterior distribution over some quantity without relying on a full model for the observations  when considering the noninferential Bayesian decision problem. Similarly to Bayes linear methods (Goldstein, 1999), they consider producing posterior beliefs which do not require conditioning. They consider the Bayes act associated with beliefs corresponding to the data generating distribution and quantity of interest as
(5) 
Rather than elicit a model representing the DM’s beliefs over , Bissiri . (2016)
argue that given a prior, a loss function and some data
, an updating of beliefs about the value of must be possible in the absence of a model for the sampling distribution. They suggest that the posterior distribution resulting from such an updating of beliefs, be as ‘close’ as possible in terms of expected loss to both the prior and the observed data , used for the updating. In this development the KLdivergence was the only way to measure ‘closeness’ to the prior which resulted in an additive Bayesian update  one where the order the data arrives in does not affect that parameter updating and is consistent with the likelihood principle. In the absence of a likelihood, the best choice to measure the ‘closeness’ of the posterior to the data is to use the expected loss of the observed data under the proposed posterior, Bissiri . (2016) show that the posterior minimising the expected loss criteria is:(6) 
This posterior is not an approximation. Neither is it a pseudoposterior, but rather a valid, coherent representation of subjective uncertainty in the minimizer of the decision in equation (5). Often , the cumulative loss of the current data set. Updating using the cumulative loss amounts to replacing integrating over the data generating distribution in equation (5), with empirical integration over data whose distribution is . They also mention the need to calibrate the lossfunction used in the general Bayesian update against the prior, by setting the value of in equation (6
). While any probability density is defined so that it integrates to 1, any lossfunction can be defined to be arbitrarily large or small. It is important that the weight the lossfunction has in the updating process is calibrated against the prior such that the posterior distributions resulting from the general Bayesian update retain probabilistic meaning in the absence of a model. So it follows that the general Bayesian must be prepared to post odds according to their posterior similarly to any other Bayesian.
It is straightforward to see that if the logarithmic score is used in the general Bayesian update then the traditional Bayesian update is recovered:
(7) 
This echoes the wellknown result that the Bayesian predictive distribution finds the distribution that is closest to the data generating distribution in terms of KLdivergence. General Bayesian updating allows the goals of the statistical analysis to be coupled together with the parameter updating. This is not something previously possible under traditional Bayesian statistics. While Bissiri . (2016) provide the framework to update a probability belief distribution using a loss function, they only consider the logarithmic score in an inferential scenario.
2.4 Mclosed Bayesian minimum divergence estimation
Minimum divergence estimation (MDE) is an approach to robust inference, largely overlapping with minimum scoring rule inference Dawid . (2016). MDE considers making inference about the parameters of a parametric model by minimising the divergence between and the data generating function of the observed data. While Dawid . (2016) uses proper scores, MDE traditionally minimises a member of the class of discrepancies or disparities. Disparity based methods first build a nonparametric density estimate
of the data generating distribution from the data (often a Kernel Density estimate). They then conduct inference by minimising the divergence between the model and
. The frequentist literature in this area is vast so in this article we choose to focus on the Bayesian contributions. These have come first from Hooker Vidyashankar (2014) and more recently from Ghosh Basu (2016) and Ghosh Basu (2017).Authors Hooker Vidyashankar (2014) consider posterior densities of the form
(8) 
where is the data generating density and is the squared Hellinger divergence between and
(9) 
Posteriors of this form are justified following the asymptotic approximation of the divergence to the KLdivergence when the data comes from the model. This results from the fact that the divergence between two probability distributions must be uniquely minimised to 0 when the probability distributions are equal (Grünwald Dawid, 2004). Therefore, when and as the distribution minimising any divergences between it and the data generating process will be .
Rather than considering a discrepancy, Ghosh Basu (2016) considers minimising the proper Tsallis score associated with the density power divergence in order to produce a ‘pseudoposterior’. The density power divergence between and is
(10) 
This can be seen as a Bayesian analogue of the minimum proper score inference considered by Dawid . (2016). Authors Ghosh Basu (2016) identify the fact that the score associated with the density power divergence does not require a density estimate, but can still be used to provide robust inference. They term the distribution produced by their estimation method a ‘pseudoposterior’. However Ghosh Basu (2017) demonstrate exponential convergence results illustrating that this posterior is asymptotically optimal in exactly the same exponential rate as the traditional Bayesian posterior when the model is correctly specified (Barron, 1988).
3 Extending Bayesian MDE for the Mopen world
The current Bayesian minimum divergence technology is currently only able to produce approximate or pseudo posteriors. Therefore, Bayes rule is currently the only method explicitly available in the literature that DMs can use to produce a well principled updating of their beliefs about the parameters of a model. We address this issue in this section.
3.1 Why the current justification is not enough
The equivalence between the KLdivergence and the Hellingerdivergence of the model from the data generating process mentioned in Section 2.4 can be seen by taking Taylor expansions about the KLdivergence minimising parameter . We have that
(11)  
(12) 
Now following the same arguments as used in Hooker Vidyashankar (2014), if is consistent for then equations (11) and (12) are equivalent in the limit as . However when is not the data generating parameter because the model class is misspecified, and if is still converging to as then equations (11) and (12) will be different. In this setting the current literature gives no foundational reasoning why updating using the Hellinger divergence constitutes a principled updating of beliefs.
The ‘pseudoposterior’ of Ghosh Basu (2016) is justified as a valid posterior by Ghosh Basu (2017) as the traditional posterior originating from an alternative belief model. However, this alternative model will not even correspond to the DM’s approximate beliefs about the data generating process. There is therefore, a lack of formal justification for a DM to update beliefs using Bayes rule on this object.
3.2 Principled justification for KL in open world
In contrast, (Walker, 2013; Bissiri ., 2016) has provided a principled justification for Bayes’ rule in the open world. This requires the DM to think about the KLdivergence minimising parameter. When the model is correctly specified, the Bayesian learns about the parameter that generated the data. This, for any statistical divergence is equivalent to learning the parameter, , minimising . When the model is considered to be incorrect, there is no longer any formal relationship between the parameter and the data. The likelihood no longer represents the probability of the observed data conditioned on the parameter. Therefore in order for open inference to be meaningful a divergence measure must be chosen and the parameter of interest can then defined as . Authors Walker (2013) then state that once the parameter of interest has been defined as the minimum of some divergence, it is then possible for a practitioner to define their prior beliefs about where the minimiser of this divergence may lie. Then the practitioner’s final task is to ensure that their Bayesian learning machine is learning about the same parameter with which they defined their prior belief.
Recalling the wellknown result that Bayesian updating learns the parameters of the model which minimises the KL divergence of the model from the data generating density. This allows the DM to continuing conducting belief updating in a principled fashion using Bayes rule provided they are interested in, and specify prior beliefs about, the parameter minimising the KL divergence of the model from the data generating process.
3.2.1 Moving away from KLdivergence in the Mopen world
Viewed conversely, Walker (2013)
identified that using Bayes rule to update beliefs using a misspecified model results in the DM being implicitly concerned about producing predictions that are closest to the data generating process in terms of KLdivergence. In many scenarios using the KLdivergence is the most expedient thing to do. This probably explains why the KLdivergence is so prolific for statistical applications. The logarithmic form of the KLdivergence makes it straightforward to calculate the KLdivergence between many exponential families and the link with Bayes rule allows for conjugate prior distributions to be considered.
However as is well known but often forgotten, someone whose probabilities are elicited using the logarithmic score, or the KLdivergence, will have to beware of approximating the probability of an event by 0. Since if they are wrong, they will incur an infinite loss. As a result, accurate probability specification of tail events will be important to the DM using the logarithmicscore, something which is generally speaking extremely important in pure inference problems (Bernardo Smith, 2001).
Often in reality inference is being done in order to produce estimates of expected utilities as part of a larger decision making process. While the tail specification is important for inferential procedures and in situations where the losses are unbounded, for example in gambling and odds setting scenarios, many loss functions connected to real decisions may wish to place less impact on rarely occurring observations in the tails. For entirely reasonable practical reasons these rare events are precisely ones which the DM will find hard to accurately elicit O’Hagan . (2006), see Winkler Murphy (1968) for a demonstration of this in the context of forecasting the probability of precipitation. Therefore, basing inference on such severe penalties can be unstable. For example a treatment regime is likely to put greater weight on correctly treating the majority of the population than worrying about outlying extreme cases. Authors Watson . (2016) quite rightly point out that loss functions for real problems are often bounded and any method that simulates parameters via MCMC makes this assumption. As will be demonstrated in Section 4.1, inferential procedures assuming an unbounded loss function such as the logarithmic score, provide no guarantees about performance in more general decision making scenarios.
Once the consequences of using Bayes rule to solve decision problem in the open world is understood, we believe that DMs may reasonably desire alternative options for parameter updating, that are as well principled as Bayes rule, but places less importance on tail misspecification. In this new era of “Big data” it becomes increasingly likely that the model used for inference is misspecified, especially in the tails of the process  see Section 5.4. We believe many DMs would consider it undesirable for models that approximate the distribution of the majority of the data to be disregarded because they poorly fit a few outlying observations.
3.3 Principled minimum divergence estimation
We next take a foundational approach to theoretically justify an updating of beliefs targeting the parameter minimising any statistical divergence between the model and the data generating density. Consider the general inference problem of wanting to find the parameters of the parametric model . The model here can be considered as the DM’s best guess at the data generating process. We consider it important to continue to use a model even under the acknowledgement that it is misspecified for the reasons outlined in Section 1. Following Bernardo Smith (2001) and Walker (2013) we consider fitting these parameters in a decision making scenario, minimising some divergence function
(13) 
to the data generating process . The goal is therefore to solve the decision problem
(14) 
Here the entropy term in the definition of the divergence is removed from the minimisation because it does not depend on . In this scenario Walker (2013) forces the Bayesian into concerns associated with nonrobustness by shackling them to minimising the KLdivergence in order to use Bayes’ rule. However, equation (14) is identical to the setting consider by Bissiri . (2016) in equation (5), except that now the loss function depends on the parameter through a model so is a scoring rule. Using equation 6, general Bayesian updating therefore gives us the tools to come up with a coherent updating of beliefs that target the parameter minimising the general divergence as
(15) 
where we have set the calibration weight . This will be discussed further in Section 3.3.2. In order to stay consistent with Walker (2013), we use the notation to indicate that the prior and posterior belief distributions correspond to beliefs about the parameter minimising divergence . Applying the general Bayesian update in an inferential scenario like this, provides a compromise between purely loss based general Bayesian inference and traditional Bayesian updating, which is conducted completely independently of the problem specific loss function. By continuing to fit a generative model, the Bayesian minimum divergence posterior somewhat separates the inference from the decision making. The inference is still concerned with estimating model parameters based on how well the model’s predictions reflect the data generating process. However, the criteria for closeness between predictions and reality takes decision making into account. As a result, the minimum divergence posteriors can be seen as producing inferences that will be suitable for a broad range of loss functions. The DM is no longer required to exactly define the loss function associated with their decision problem at the inference stage. They need only consider broadly how robust to tail misspecifications they want their inference to be in order to define the target divergence (see Section 4.4). Considering the realistic open nature of the model class also justifies making this compromise. The general Bayesian update, assumes absolutely no information about the data generating process, while using Bayes rule traditionally^{1}^{1}1Prior to the interpretation provided by Walker (2013) explained in section 3.2 assumes the data generating density is known precisely. As we point out, it is actually more likely that the decision maker is able to express informative but not exact beliefs about the data generating process and therefore a halfwayhouse between these two is appropriate in reality.
Plugging in the corresponding loss functions gives general Bayesian posteriors minimising the Hellinger and density power divergence as
(16)  
(17) 
Equation (16) introduces to estimate the data generating density (see Section 4.5 for more on this). As a result, the general Bayesian updating is being conducted using an empirical loss function, say , which approximates the true loss function associated with minimising the Hellinger divergence between the model and the data generating process .
Equation (17) is exactly the distribution resulting from the robust parameter update of Ghosh Basu (2016), while equation (16) is similar to the posterior produced by Hooker Vidyashankar (2014) except the divergence function has been decomposed into its score and entropy term here. This demonstrates that the posteriors above are not pseudo posteriors  as Ghosh Basu (2016) suggests  or approximations of posteriors  as Hooker Vidyashankar (2014) suggests  they are the correct way for the DM to update their prior beliefs about the parameter minimising an alternative divergence to the KLdivergence.
Authors Basu . (2011) showed that is a consistent estimate of , while Ghosh Basu (2016) use this to show that under certain regularity conditions
follows a Bernsteinvon Mises (BVM) type result, but with convergence in probability, where the asymptotic variance of the posterior is the expected second derivative of the loss function
. Authors Hooker Vidyashankar (2014) prove the corresponding asymptotic normality result for the posterior minimising the Hellinger divergence, but assume that the data generating process is contained within the model class, as this was required to justify their posterior in the first place. Consistency to the minimiser of any divergence provides consistency with the data generating process when it is contained within the model class, as by definition all divergences are minimised to 0 when the distributions are the same. The results of Ghosh Basu (2017) demonstrate that not only is the posterior mean consistent, but the whole posterior is asymptotically optimal in the same sense as the traditional Bayesian posterior is when the model class contains the data generating process.Another principled alternative to traditional Bayesian updating when it is difficult to fully specify a model for the data generating process is Bayes linear methods (Goldstein, 1999). These only require the subjective specifications of expectations and covariances for various quantities the DM is well informed about and interested in order to do inference. Alternatively by choosing a more robust divergence to use for the updating our method tackles the same problem by being robust to routine assumptions making a full probability specification a much less strenuous task.
3.3.1 The likelihood principle and Bayesian additivity
Authors Hooker Vidyashankar (2014) identify that their posterior distribution minimising the Hellinger divergence no longer satisfies Bayesian additivity. That is to say that the posteriors would look different if the data were observed in one go or in two halves for example. This is because the posterior now depends on the estimate of the data generating density, which depends in a nonadditive way on the data. This constitutes a departure from the likelihood principle underpinning traditional Bayesian statistics. The likelihood principle says that the likelihood is sufficient for the data. When the model is perfectly specified this is a sensible principle, the likelihood of the observed data under the correct model represents all the information in the data. However in the open world the likelihood principle is no longer a reasonable requirement. When the model is only considered as an approximation of the data generating process, it is not unreasonable to suspect that the data contains more information than is represented in the likelihood of an incorrect model. However, Bayesian additivity (defined as coherence by Bissiri . (2016)) is a central principle of the general Bayesian update. Bissiri . (2016) chose the KLdivergence to measure the distance between the prior and the posterior as it was the only divergence measure that left the belief update additive. The general Bayesian posterior minimising the Hellinger divergence combines the loss function for each observation in an additive way, which is consistent with the additivity demanded by Bissiri . (2016), but a different density estimate is produced when the data is considered as a whole or in parts. This causes the empirical loss function used for the updating to be different when the data arrives in sections, as opposed to one go. If the data generating density were available then the exact loss function associated with minimising the Hellinger divergence could be calculated and the Bayesian update would be additive. However because an approximation of the loss function associated with the Hellinger divergence is used, additivity is sacrificed.
While the posterior of Ghosh Basu (2016) is still additive, the likelihood is also not considered to be sufficient for the data, as the posterior also depends on the integral, . Using an alternative divergence to the KLdivergence in order to conduct Bayesian updating, requires that additional information to the ‘local’ information provided by the likelihood of the observed data is incorporated into the loss function. For the Hellingerdivergence this information comes from the data in the way of a density estimate, while for the density powerdivergence this comes from the model through .
3.3.2 A note on calibration
The two posteriors in equations (16) and (17) have set the general Bayesian calibration weight to 1. Unlike probability distributions, loss functions can be of arbitrary size and therefore it is important that they are calibrated with the prior. However, this is not a problem here. The posteriors above both use welldefined models and the canonical form of welldefined divergence functions and as a result there is no arbitrariness in the size of the loss. This is further demonstrated by the fact that Bayes rule corresponds to using the canonical form of the KLdivergence and a probability model with weight . We do note that when the model is correct the posterior variance of these method will be comparatively bigger for finite data samples than that of the traditional Bayesian posterior. This is to be expected, Zellner (1988) showed that Bayes rule processes information optimally and therefore produces the most precise posterior distributions. The posterior is simply a subjective reflection of the DM’s uncertainty after seeing the data, and if they believe their model is incorrect and therefore target a more robust divergence, they are likely to have greater posterior uncertainty than if they naively believe their model to be correct and proceed accordingly.
4 Possible divergences to consider
The current formulation of Bayesian MDE methods has limited which divergences have been proposed for implementation. However demonstrating that principled inference can be made using alternative divergence measures than the KLdivergence, as we have in Section 3.3, allows us to consider the selection of the divergence used for updating to be a subjective judgement made by the DM, alongside the prior and model, to help tailor the inference to the specific problem. Authors Celeux . (2017) observed that while Gelman Hennig (2015) advocate greater freedom for subjective judgements to impact statistical methodology, they fail to consider the possibility of subjective Bayesian parameter updating. In the closed world Bayes rule is objectively the correct thing to do, but in the open world this is no longer so clear. Very few problems seek answers that are connected with a specific dataset or model, they seek answers about the real world process underpinning these. Authors Goldstein . (2006), focusing on belief statements, demonstrate that subjective judgements help to generalise conclusions from the model and the data to the real world process. Carefully selecting an appropriate divergence measure can further help a statistical analysis to do this. We see this as part of the metamodelling process where the DM can embed beliefs about their nature and extent of their modelling class explicitly.
Below we list some divergences that a practitioner may wish to consider. We do not claim that the list is exhaustive, but merely contains the divergences we have considered using. We qualitatively describe the features of the inferences produced by targeting each specific divergence, which will be demonstrated empirically in Section 5. Throughout we consider the Lebesgue measure to be the dominating measure.
4.1 Total Variation Divergence
The TotalVariation divergence between probability distributions and is given by
(18) 
and the general Bayesian update targeting the minimisation of the TVdivergence can be produced by using loss function
(19) 
in equation (15). It is worth noting however that the loss function in equation (19) is not monotonic in the predicted probability of each observation. This is discussed further in Appendix A. In fact when informed decision making is the goal of the statistical analysis, closeness in terms of TVdivergence ought to be the canonical criteria the DM demands. If then for any utility function bounded by 1, the expected utility under of making the optimal decision believing the data was distributed according to is at most worse than the expected utility gained from the optimal decision under (see JQ. Smith (2010) for example). Therefore if the predictive distribution is close to the data generating process in terms of TVdivergence then the consequences of the model misspecification in terms of the expected utility of the decisions made is small. However explicit expressions of the TVdivergence even between known families, are rarely available. This somewhat hinders an algebraic analysis of the divergence.
It is straightforward to see that and the KLdivergence form an upper bound on the TVdivergence through Pinsker’s inequality. However, the KLdivergence does not bound the TVdivergence below so there are situations where the TVdivergence is very small but the KLdivergence is very large. In this scenario, a predictive distribution whose associated optimal decisions achieve a close to optimal expected utility estimate (as the distribution is close in TVdivergence) will receive very little posterior mass. This is clearly undesirable in a decision making context.
Authors Hooker Vidyashankar (2014) identify some drawbacks of having a bounded score function. The score function being upper bounded means that there is some limit to the score that can be incurred in the tails of the posterior distribution. The score incurred for one value of , will here be very similar to the score incurred by another value of past a given size. Therefore the tails of the posterior will be equivalent to the tails of the prior. As a result the DM is required to think more carefully about their prior distribution. Not only are improper priors prohibited, but more data is required to move away from a poorly specified prior. This can result in poor finite sample efficiency when the data generating process is within the chosen model class.
4.2 Hellinger Divergence
Authors Devroye Gyorfi (1985) observed, that the Hellinger divergence can be used to bound the TVdivergence both above and below:
(20) 
where
(21) 
As a result, the Hellingerdivergence and the TVdivergence are geometrically equivalent. Thus, if one of them is small, the other is small and similarly if one of them is large the other also is. So if one distribution is close to another in terms of TVdivergence, then the two distributions will be close in terms of Hellingerdivergence as well. Authors Beran (1977) first noted that minimising the Hellingerdivergence gave a robust alternative to minimising the KLdivergence, while Hooker Vidyashankar (2014) proposed a Bayesian alternative (equation (16)) that has been discussed at length above. While Hooker Vidyashankar (2014) motivated their posterior through asymptotic approximations, identifying the geometric equivalence between the Hellingerdivergence and TVdivergence proposes further justification for a robust Bayesian updating of beliefs similar to that of Hooker Vidyashankar (2014). Specifically, if being close in terms of TVdivergence is the ultimate robust goal, then being close in Hellingerdivergence will guarantee closeness in TVdivergence. Hellinger can therefore serve as a proxy for TVdivergence that retains some desirable properties of the KLdivergence: it is possible to compute the Hellinger divergence between many known families (J. Smith, 1995) and the score associated with the Hellinger divergence has a similar strictly convex shape, this is discussed further in Appendix A. A posterior targeting the Hellingerdivergence does suffer from the same drawbacks associated with having a bounded scoring function that are mentioned at the end of the previous section. Lastly along with TVdivergence, the Hellingerdivergence is also a metric.
4.3 divergences
A much wider class of divergences are provided by the two parameter divergence family proposed in Cichocki . (2011)
(22) 
Reparametrising this such that and gives the Sdivergence of Ghosh . (2017).
The general Bayesian posterior targeting the parameter minimising the divergence of the model from the data generating density is
(23) 
Letting represent the data generating density, and some predictions of parametrised by , Cichocki . (2011) provide intuition about how the parameters and impact the influence each observation has on the inference about . They consider influence to be how the observations impact the estimating equation of , this is a frequentists setting but the intuition is equally valid in the Bayesian setting.
(24) 
The size of for an observation defines how outlying (large values) or inlying (small values) the observation is. Choosing
ensures outliers relative to the model have little influence on the inference, while adjusting the value of
such that downweights the influence of unlikely observations from the model .The hyperparameters
and control the tradeoff between robustness and efficiency and selecting these hyperparameters is part of the subjective judgement associated with selecting that divergence. This being the case we feel that to ask for values of and to be specified by a DM is perhaps over ambitious, even given the interpretation ascribed to these parameters above. However, using the divergence for inference can provide greater flexibility to the DM, which may in some cases be useful. Here to simplify the subjective judgements made we focus on two important, one parameter subfamilies within the divergence family  the alphadivergence where and the betadivergence where .4.3.1 alpha Divergence
The alphadivergence, introduced by Csisz . (1967) and extensively studied by (Shunichi, 2012) is
(25) 
where . There exists various reparametrisations of this: Amari notation uses with ; or CressieRead notation (Cressie Read, 1984) introduces with .
We generally restrict attention to values of . corresponds to a KLdivergence limiting case while is 4 times the Hellinger divergence. We think of these as two extremes of efficiency and robustness within the alphadivergence family with which a DM would want to conduct inference between. The parameter thus controls this tradeoff.
The general Bayesian posterior targeting the minimisation of the alphadivergence is
(26) 
Note the power on in equation (26) is reduced by 1 from equation (25) as we consider empirical expectations in order to estimate the expected score associated with the alphadivergence.
It was demonstrated in Sason Verdú (2015) (Corollary 1) that for the alphadivergence can be bounded above by TVdivergence :
(27) 
Therefore, if the TVdivergence is small then the alphadivergence will be small (provided ). So a predictive distribution that is close to the data generating density in terms of TVdivergence will receive high posterior mass under an update targeting the alphadivergence. Once again, the general Bayesian posterior requires a density estimate to compute the empirical estimate of the score associated with the divergence and the same issues associated with having a bounded score function that were discussed in Section 4.1 apply here.
Considering the alpha divergence as a subfamily of the divergence takes , and therefore . Therefore the influence of observations are weighted based on their ratio of only and not on the value of in isolation. corresponded to the KLdivergence and in order to obtained greater robustness is chosen such that in order to downweight the influence larger ratios of , corresponding to outlying values of , have on the analysis. However downweighting some of these ratios relative to the KLdivergence sacrifices efficiency. This demonstrates the efficiency and robustness tradeoff associated with the alphadivergence.
4.3.2 beta Divergence
The betadivergence, also known as the density powerdivergence Basu . (1998), is a member of the Bregman divergence family
(28) 
Taking , to be the Tsallis score returns the betadivergence
(29) 
. This results in the density power divergence given in equation (10), parameterised by rather than here. Both Basu . (1998) and Dawid . (2016) noticed that inference can be made using the betadivergence without requiring a density estimate. This was used in Ghosh Basu (2016) to produce a robust posterior distribution that did not require an estimate of the data generating density, which has been extensively discussed in previous sections.
Considering the density power divergence as part of the divergence family takes , which results in treating all values of equally in isolation, but taking results in which means observations that have low predicted probability under the model are downweighted. Under the KLdivergence at , the influence of an observation is inversely related to its probability under the model, through the logarithmic score. Raising above 0 will decrease the influence of the smaller values of , robustifying the inference to tail specification. However this results in a decrease in efficiency relative to methods minimising the KLdivergence (Ghosh Basu, 2017). As a result there may be some issues using the density powerdivergence when high dimensional observations and models are considered. As the dimension increases the predicted probability of each (multivariate) observation shrinks towards 0. The density power divergence downweights the influence of observations with small predicted probabilities, and as a result ( when we consider the density power divergence) needs to be selected very carefully in order to prevent the analysis from disregarding the majority of the data. This is a price that is paid in order to not require a data generating density estimate.
4.3.3 The SHellinger divergence
One further oneparameter special case of the divergence (Sdivergence) is the SHellinger divergence given by Ghosh . (2017)
(30) 
This is generated from the Sdivergence by taking . Taking recovers twice the squared Hellinger divergence and gives the squared divergence. Ghosh . (2017) observe that the SHellinger divergence is a proper distance metric. Translating these back into the notation of divergences gives
(31) 
Therefore gives and . As a result, the squared Hellinger divergence downweights the influence of large ratios of with respect to smaller ones and also downweights these ratios when is small. In fact, there is a tradeoff here: moving closer to 1 increase away from 1, thus increasing the amount that ratios of are downweighted for small . However as moves closer to 1, also draws closer to 1, which reduced the amount ratios of are down weighted with respect to smaller ones. Once again we consider trading these two off against each other to be too greater task for the DM to consider. We therefore do not consider the SHellinger divergence again here and instead for brevity consider only the choice between the one parameter alpha and density power divergences. In any case, bridging the gap between the Hellinger and the KLdivergence, as the alphadivergence does, appears to make more sense in an inferential setting than bridging the gap between the Hellinger and squared divergence.
4.4 Comparison
The bullet points below summaries the reasons, introduced above, for which a DM might prefer a specific divergence. Once again we stress that we consider the choice of divergence to be a subjective and context specific judgement to be made by the DM similarly to their prior and model.

[leftmargin=*,labelsep=5.8mm]

KLdivergence: suitable if the tail behaviour of the model is of high importance or there is the belief that the tails of the model are exactly correctly specified. Minimising the KLdivergence also makes best use out of the data so may be suitable if the sample size is very small and the prior is not necessarily informative.

TVdivergence/Hellingerdivergence: suitable if the data set has many observations and robustness for a decision problem with a bounded loss function is of high importance. The Hellingerdivergence appears to be computationally easier to work with (see Appendix A). Both require accurate density estimation

alphadivergence: suitable for when a tradeoff between robustness and efficiency is considered. Requires access to a density estimation technique.

The density power divergence: suitable again for when a tradeoff between robustness and efficiency is considered. Does not require a density estimate but does require to be set carefully so that the data is not completely disregarded by the updating process.
Unfortunately there is no free lunch when it comes to applying any of these methods to complex, high dimensions problems. Minimising the KLdivergence is the most computationally efficient. It provides the possibility of implementing conjugate prior distributions and avoiding the computational burden of producing a density estimate. However as the dimensions and complexity of the problem increases, robustness to model misspecification will become more and more important. Minimising the Hellingerdivergence, TVdivergence and alphadivergence requires an estimate of the data generating density which is certainly not straightforward for high dimensional data (see the discussion in Section
4.5). However once this has been estimated then can just be selected based on how important tail misspecifications are, with a guarantee on some reasonable efficiency. Minimising the density power divergence has the computational advantage of not requiring an estimate of the data generating density. However when the dimension of the problem is high, much more thought needs to be put into specifying the parameter to ensure that the updating does not completely disregard the data. The technology for doing this effectively is still in its infancy. Without the data generating density, the density power divergence requires information on the shape of the data from another form in order to fit efficiently and robustly.4.5 Density estimation
As has been mentioned before, for the TV, Hellinger and divergence, it is not possible to exactly calculate the loss function associated with any value of and because the data generating density will not be available. In this case, a density estimate of is required to produce an empirical loss function. The Bayesian can consider the density estimate as providing additional information to the likelihood from the data (see Section 3.3.1’s discussion on the likelihood principle), and can thus consider their general Bayesian posterior inferences to be made conditional upon the density estimate as well as the data. The general Bayesian update is a valid update for any loss function, and therefore conditioning on the density estimate as well as the data still provides a valid posterior. However, how well this empirical loss function approximates the exact loss function associated with each divergence ought to be of interest. The exact loss function is of course the loss function the DM would prefer to use having made the subjective judgement to minimise that divergence. If the density estimate is consistent to the data generating process, then provided the sample size is large the density estimate will converge to the data generating density, and the empirical loss function will then correctly approximate the loss function associated with that divergence. It is this fact that ensures the consistency of the posterior estimates of the minimum Hellinger posterior Hooker Vidyashankar (2014).
Authors Hooker Vidyashankar (2014) use a fixed width kernel density estimate (FKDE) to estimate the underlying data generating density and in our examples in Section 5
we adopt this practice using a Radial Basis Function (RBF) kernel for simplicity and convenience. However we note that
Silverman (1986) identifies practical drawbacks of FKDEs, including their inability to correctly capture the tails of the data generating process, whilst not over smoothing the centre, as well as the number of data points required to fit these accurately in high dimensions. In addition to this Tamura Boos (1986) observe that the variance of the FKDE when using a density kernel in high dimensions lead to asymptotic bias in the estimate that is larger than . Alternatives include using a kernel with better meansquared error properties (Epanechnikov, 1969; Rosenblatt ., 1976), variable width adaptive KDEs (Abramson, 1982), which Hwang . (1994) show to be promising in high dimensions, piecewiseconstant (alternatively tree based) density estimation (Ram Gray, 2011; Lu ., 2013) which are also promising in high dimensions, or a fully Bayesian Gaussian process as is recommended in Li Dunson (2016).5 Illustrations
In this section we aim to illustrate some of the qualitative features associated with conducting inference targeting the minimisation of the different divergences identified in Section 4. Throughout these experiments stan (Carpenter ., 2016) is used to produce fast and efficient samples from the general Bayesian posteriors of interest.
5.1 open robustness
5.1.1 Simple inference
The experiments below demonstrate the robustness of the general Bayesian update targeting KLdivergence (red), Hellingerdivergence (blue), TVdivergence (pink), alphadivergence (green) and powerdivergence (orange) (in future these may be referred to as KLBayes, HellBayes, TVBayes, alphaBayes and powerBayes respectively). For illustrative purposes we have fixed for the alphadivergence and for the powerdivergence. Figure 1
plots the posterior predictive originating from fitting a normal model
to two simulated data sets and one ‘real’ data set. For the first data set data points were simulated from a normal contamination distribution(32) 
for the second
were simulated from a Student’s tdistribution with degrees of freedom 4 and the real data set, tracks1, was taken as the first variable from the ‘Geographical Original of Music Data Set’
^{2}^{2}2downloaded from https://archive.ics.uci.edu/ml/datasets/Geographical+Original+of+Music containingdata points, where a KDE of the data appeared to be approximately normally distributed. Prior distributions
and were used for all examples.It is easy to see from the plots in figure 1 that minimising the KLdivergence, requires that the density of the outlying contamination centred at to be captured correctly. This is at the expense of capturing the density of the remaining 99% of the data, this is especially clear in the logdensity plot. The TVBayes, alphaBayes and powerBayes appear to correctly capture the distribution for 99% of the data. The boundedness of the TVdivergence and the alphadivergence means that the contamination is not allowed to unduly affect the analysis, while the powerdivergence is able to downweight the influence of the contaminated points as they are ascribed low predicted probability under the model for 99% of the data. The HellBayes appears to fit too small a variance even for 99% of the data. This is due to the small sample efficiency problems which will be discussed in Appendix A.
The Student’s tdistribution has consistently heavier tails than the normal distribution. Thus the top right hand plot of figure 1 more clearly demonstrates the importance placed on tail misspecification by each method. The KLdivergence fits the largest variance to correctly capture these tails for the most extreme tail observations, the alphadivergence is able to fit a slight smaller variance because of its bounded nature. However the variance of the alphaBayes predictive is still larger than the other methods produce because of the greater convexity of the scoring function depicted in figure 4. Lastly HellBayes, TVBayes and powerBayes place the least weight on tail misspecification and are therefore able to fit a smaller variance and produce a predictive more closely resembling the data generating process for the majority of the data.
The importance given to tail misspecification is also evident in the ‘tracks’ example. Here there is an ordering from TV, power, Hellinger, alpha and KL divergence on both bias towards the right tail and on the size of the predictive variance, in response to the slight positive skew of the KDE of the data. Here it is clear to see that the power, TV, Hellinger and power divergence produce much better fits of the majority of the data than the other methods do. Lastly, the bottom right plot demonstrates how several possible alternative models to the Gaussian perform on the ‘tracks1’ data set
^{3}^{3}3the data set was transformed by addingto every value in order to make the data strictly positive so the gamma and logNormal distributions could be applied
, when updating using the KLdivergence. This shows that a Gaussian distribution was actually the best fit for the bulk of the data, and the poor fit achieved is down to the importance placed on tail misspecification by the estimating procedure rather than the model selected.
5.2 Regression under heteroscedasticity
In addition to the simple inference examples above we consider how changing the divergence can affect inference in a regression example. From the previous examples we can see that when the tails of the model are misspecified the KLdivergence minimising predictive distribution inflates the variance of the fitted model to ensure no observations are predicted with low probability. Further, placing large weight on tail observations, which occur with low probability, creates large variance across repeat sampling. This is exactly why parameter estimates in linear regression under heteroscedasticity errors have a large variance.
Bayesian minimum divergence inference places less weight on tail observations: it is thus able to produce inferences with a smaller predictive variance and a smaller variance across repeat sampling. While repeat sampling and estimation variance are not problems in the Bayesian paradigm these results do show that traditional Bayesian inference can be somewhat imprecise when the tails are misspecified which is clearly undesirable when conditioning on observed data.
In order to demonstrate this we simulated data points with repeats from the following heteroscedastic linear model
(33) 
For the experiments we simulated and each were held constant across experiments. Figure 2 plots one realisation from this heteroscedastic linear model.
We then conducted general Bayesian updating using the five divergences mentioned above with priors and . The mean squared errors (MSE) for the posterior means of the parameters , the MSE for the predictive means on a test set of size 100 simulated from the model without error and the predictive mean variances are presented in Table 1 below. In order to apply the Hellinger, TV and alpha divergences to a regression problem an estimate of the conditional density of the response given the covariates is required. We follow authors Hooker Vidyashankar (2014) and implement conditional KDEs to approximate the true data generating density. For simplicity, the familiar twostage bandwidth estimation process of Hansen (2004) was used to find the optimal bandwidth parameters.
MSE  KL  Hell  TV  alpha  power  

Test  Test  Test  Test  Test  
p=1  0.03  3.25  0.03  3.42  0.04  4.51  0.03  2.57  0.02  1.92 
p=5  0.09  7.92  0.06  5.61  0.05  5.12  0.04  4.32  0.04  3.56 
p=10  0.15  14.70  0.09  9.63  0.10  10.43  0.08  8.64  0.08  8.40 
p=15  0.24  22.93  0.14  13.45  0.16  15.47  0.12  11.65  0.12  11.38 
p=20  0.29  25.88  0.19  17.71  0.18  16.46  0.16  14.79  0.16  14.89 
KL  Hell  TV  alpha  power  

p=1  2.34  0.78  0.62  1.19  0.96 
p=5  2.36  0.51  0.47  0.95  0.98 
p=10  2.34  0.47  0.56  0.89  1.13 
p=15  2.41  0.49  0.94  0.83  1.19 
p=20  2.38  0.50  1.19  0.82  1.27 
The bottom of table one demonstrates that the alternative divergences appear to be learning a smaller predictive variance than the KLBayes does under heteroscedastic errors. Under divergences alternative to the KLdivergence this is no longer an estimate of the variance of the responses in the data set, given the covariates. This should rather be interpreted predictively as the variance of the predictive distribution which is closest to the data generating distribution in terms of that alternative divergence. Therefore fitting a smaller variances demonstrates that the other divergences are placing more importance on fitting the majority of the data rather than just the outliers.
The top of table 1 illustrates the impact fitting a large variance has on the parameter estimates of the mean function. Placing less influence on outliers allows all of the alternative divergences to produce more precise estimates of the parameters of the underlying linear relationship. This then leads to better performance when predicting the test set. Clearly the errors for all of the methods will increase as increases as the same amount of data is used to estimate more parameters. However it is clear that the errors under the KLdivergence are rising more rapidly. By being less sensitive to the error distribution the alternative divergences are better able to capture the true underlying process.
5.3 Time series analysis
In order to further demonstrate how inflating the variance by targeting the KLdivergence under misspecification can damage inference, we consider a less trivial time series example. We simulate from an autoregressive process of order L (AR(L)), and then consider additive independent generalised autoregressive conditionally heteroscedastic of order (1,1) (GARCH(1,1)) errors, with
(34)  
(35)  
(36)  
(37) 
where , , . GARCH processes are used to model nonstationary, chaotic time series where the variance of the process depends on the magnitude and sign of the previous observations of the process. Eliciting a GARCH process from a DM is a difficult task. It is far from obvious how this GARCH process behaves as a function of its parameters and selecting a lag length for the AR process as well as two lag lengths for the GARCH Process increases the complexity of the model selection problem. Therefore it seems conceivable that a DM could want to fit a simple AR process to noisy time series data in order to investigate the underlying process. One situation where this may be desirable is in financial time series applications where large amounts of data can arrive at a very high frequency.
In order to investigate how the minimum divergence methods perform in this scenario we simulated 3 data sets with , and fitted and AR(L) process to these, where was chosen to match the underlying AR process. The 3 data sets were given by

an AR(3) with

an AR(1) with with GARCH(1,1) errors

an AR(1) with with GARCH(1,1) errors
The plots in figure 3 demonstrate the onestep ahead posterior predictive performance of the minimum divergence posteriors on a test set , simulated from the underlying AR process. Under misspecification we show only the inference under the HellBayes to avoid cluttering the plots, the other minimum divergence posteriors perform similarly. We use the same priors as the regression example and once again conditional density estimates were used for the Hellinger, TV and alpha divergences.
The left hand plots demonstrate that when the model is correctly specified the minimum divergence posteriors produce similar time series inference to the KLdivergence. This is most easily seen in the lower plot which shows the difference in the squared prediction errors between the KLBayes and the HellBayes is mostly around 0 and table 2 which demonstrates the root mean squared error across the test set is the same under both methods. The middle plots demonstrate how the KLBayes and the HellBayes perform under ‘extreme’ volatility in the error distribution. The tails of the model being very poorly specified causes the KLBayes posterior to fit a huge variance with the average posterior predictive variance across the test data set being slightly above 26. Fitting such a high variance makes the inference on the lag parameters insensitive to the data. As a result the underlying trend in the data is completely missed. In contrast the HellBayes posterior predictive distributions have a much smaller variance of around 7. This allows the inference on the lag parameters to be much more sensitive to the underlying AR process. Clearly the HellBayes is unable to exactly fit the truth as the model is misspecified. However it does a much better job of capturing the broad features of the underlying dependence between time points. This is clearly demonstrated by the considerably lower root mean squared predictive error showed in table 2 and by the error differences plot being mostly large and positive. The right hand plots demonstrates how the KLBayes and HellBayes perform when the error distribution is less volatile. When the volatility is smaller the KLBayes predictive variance is also smaller. Therefore the inference is more sensitive to the underlying trend in the data than in the previous example. However the true dependence in the data is still under estimated relative to the HellBayes, this is again demonstrated in table 2 and the error difference plot. Once again this shows that the way in which the KLBayes deals with misspecification, increasing the predictive variance, can mask some of the underlying trends in the data which can be discovered by other methods. Table 2 plots the root mean squared errors (RMSE) for the KLBayes and HellBayes on the test set to quantify their correspondence to reality.
RMSE  Correctly Specified  High Volatility  Low Volatility 

KL  0.49  1.87  1.21 
Hell  0.48  1.07  0.94 
5.4 Application to high dimensions
The examples is this paper only demonstrate the performance of these minimum divergence techniques for relatively small dimensional problems so as to clearly demonstrate the effect that tail misspecifications can have. However, it is as the dimension and complexity of the problem increases that these methods become more and more important, this can be put down to two aspects. The first of these is that outliers or highly influential contaminant datapoints become hard to identify in high dimensions. In our examples it is clear from looking at the KDE or histogram of the data that there are going to be outlying observations, but in many dimensions visualising the data in this way is not possible. In addition to this automatic methods for outlier detection struggle in high dimensions
(Filzmoser ., 2008).The second factor in requiring robust inference in high dimensions, is that not only are outliers harder to identify, they are more likely to occur. The occurrence of outliers could be reinterpreted as indicating that the DM’s belief model is misspecified in the tails. The fact that these occur should be unsurprising. We have already discussed specifying beliefs about tail behaviour requires thinking about very small probabilities which is known to be difficult to do Winkler Murphy (1968); O’Hagan . (2006), and often routine assumptions (for example Gaussianity) may be applied. As the dimension of the space increase the tails of the distribution account for a greater proportion of the overall density, increasing the chance of seeing observations that differ from the practitioners beliefs.
6 Discussion
This article uses general Bayesian updating Bissiri . (2016) in order to theoretically justify a Bayesian update that targets the parameters of a model that minimise a statistical divergence to the data generating process that is not the KLdivergence. When the open world is considered, moving away from targeting the minimisation of the KLdivergence can provide an important tool in order to robustify a statistical analysis. The desire for robustness ought to only increase as increasingly bigger models are built to approximate more complex real world processes. This paper provides the statistical practitioner with the principled justification to select the divergence they use for their analysis in a subjective manner allowing them the potential to make more useful predictions from their best approximate belief model.
As it stands however more work is required to advise on the selection of the divergence used for the updating. Ways to more clearly articulate the impact of choosing a certain divergence to a DM needs further research along with ways to help guide the choice of any hyperparameters associated with the divergence. Further experimentation with complex real world data sets is also required to analyse how this robustnessefficiency tradeoff associated with the selected divergence manifests itself in practice.
Another issue not addressed in this paper is how to tailor computational algorithms to the inferences we describe here. There exists a vast literature on optimising MCMC algorithms to sample from traditional Bayesian posteriors and in order to fully take advantage of the subjectivity this paper allows a statistician, a whole new class of computational algorithms tailored to different divergences may be required. In addition several of the divergences mentioned in this paper require a density estimate of the underlying process and further research into effectively doing this for complex high dimensional datasets can only improve the performances of these methods for real world problems.
Acknowledgments
We are indebted to 3 anonymous referees in helping us revise the structure and examples in this article. The authors would also like to thank Jeremias Knoblauch for his helpful discussions when revising the article. The first author acknowledges his research fellowship through the OxWaSP doctoral programme, funded by EPSRC.
References
 Abramson (1982) abramson1982bandwidthAbramson, IS. 1982. On bandwidth variation in kernel estimatesa square root law On bandwidth variation in kernel estimatesa square root law. The annals of Statistics1217–1223.

Barron (1988)
barron1988exponentialBarron, AR.
1988.
The exponential convergence of posterior probabilities with implications for Bayes estimators of density functions The exponential convergence of posterior probabilities with implications for bayes estimators of density functions.
Department of Statistics, University of Illinois.  Basu . (1998) basu1998robustBasu, A., Harris, IR., Hjort, NL. Jones, M. 1998. Robust and efficient estimation by minimising a density power divergence Robust and efficient estimation by minimising a density power divergence. Biometrika853549–559.
 Basu . (2011) basu2011statisticalBasu, A., Shioya, H. Park, C. 2011. Statistical inference: the minimum distance approach Statistical inference: the minimum distance approach. CRC Press.
 Beran (1977) beran1977minimumBeran, R. 1977. Minimum Hellinger distance estimates for parametric models Minimum hellinger distance estimates for parametric models. The Annals of Statistics445–463.
 Berger . (1994) berger1994overviewBerger, JO., Moreno, E., Pericchi, LR., Bayarri, MJ., Bernardo, JM., Cano, JA.others 1994. An overview of robust Bayesian analysis An overview of robust bayesian analysis. Test315–124.
 Bernardo Smith (2001) bernardo2001bayesianBernardo, JM. Smith, AF. 2001. Bayesian theory. Bayesian theory. IOP Publishing.
 Bissiri . (2016) bissiri2016generalBissiri, P., Holmes, C. Walker, SG. 2016. A general framework for updating belief distributions A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
 Box (1980) box1980samplingBox, GE. 1980. Sampling and Bayes’ inference in scientific modelling and robustness Sampling and bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General)383–430.
 Carpenter . (2016) carpenter2016stanCarpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M.Riddell, A. 2016. Stan: A probabilistic programming language Stan: A probabilistic programming language. Journal of Statistical Software20.
 Celeux . (2017) celeux2017someCeleux, G., Jewson, J., Josse, J., Marin, JM. Robert, CP. 2017. Some discussions on the Read Paper” Beyond subjective and objective in statistics” by A. Gelman and C. Hennig Some discussions on the read paper” beyond subjective and objective in statistics” by a. gelman and c. hennig. arXiv preprint arXiv:1705.03727.
 Cichocki . (2011) cichocki2011generalizedCichocki, A., Cruces, S. Amari, Si. 2011. Generalized alphabeta divergences and their application to robust nonnegative matrix factorization Generalized alphabeta divergences and their application to robust nonnegative matrix factorization. Entropy131134–170.
 Cressie Read (1984) cressie1984multinomialCressie, N. Read, TR. 1984. Multinomial goodnessoffit tests Multinomial goodnessoffit tests. Journal of the Royal Statistical Society. Series B (Methodological)440–464.
 Csisz . (1967) csisz1967informationCsisz, I. . 1967. Informationtype measures of difference of probability distributions and indirect observations Informationtype measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar.2299–318.
 Dawid . (2016) dawid2016minimumDawid, AP., Musio, M. Ventura, L. 2016. Minimum scoring rule inference Minimum scoring rule inference. Scandinavian Journal of Statistics431123–138.
 Devroye Gyorfi (1985) devroye1985nonparametricDevroye, L. Gyorfi, L. 1985. Nonparametric density estimation: the L1 view Nonparametric density estimation: the l1 view ( 119). John Wiley & Sons Incorporated.
 Epanechnikov (1969) epanechnikov1969nonEpanechnikov, VA. 1969. Nonparametric estimation of a multivariate probability density Nonparametric estimation of a multivariate probability density. Theory of Probability & Its Applications141153–158.
 Filzmoser . (2008) filzmoser2008outlierFilzmoser, P., Maronna, R. Werner, M. 2008. Outlier identification in high dimensions Outlier identification in high dimensions. Computational Statistics & Data Analysis5231694–1711.
 Gelman Hennig (2015) gelman2015beyondGelman, A. Hennig, C. 2015. Beyond subjective and objective in statistics Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society).
 Ghosh Basu (2016) ghosh2016robustGhosh, A. Basu, A. 2016. Robust Bayes estimation using the density power divergence Robust bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics682413–437.
 Ghosh Basu (2017) ghosh2017generalGhosh, A. Basu, A. 2017. General Robust Bayes PseudoPosterior: Exponential Convergence results with Applications General robust bayes pseudoposterior: Exponential convergence results with applications. arXiv preprint arXiv:1708.09692.
 Ghosh . (2017) ghosh2017generalizedGhosh, A., Harris, IR., Maji, A., Basu, A., Pardo, L. . 2017. A generalized divergence for statistical inference A generalized divergence for statistical inference. Bernoulli234A2746–2783.
 Goldstein (1990) goldstein1990influenceGoldstein, M. 1990. Influence and belief adjustment Influence and belief adjustment. Influence Diagrams, Belief Nets and Decision Analysis143–174.
 Goldstein (1999) goldstein1999bayesGoldstein, M. 1999. Bayes linear analysis Bayes linear analysis. Wiley StatsRef: Statistics Reference Online.
 Goldstein . (2006) goldstein2006subjectiveGoldstein, M. . 2006. Subjective Bayesian analysis: principles and practice Subjective bayesian analysis: principles and practice. Bayesian Analysis13403–420.
 Greco . (2008) greco2008robustGreco, L., Racugno, W. Ventura, L. 2008. Robust likelihood functions in Bayesian inference Robust likelihood functions in bayesian inference. Journal of Statistical Planning and Inference13851258–1270.
 Grünwald Dawid (2004) grunwald2004gameGrünwald, PD. Dawid, AP. 2004. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. Annals of Statistics1367–1433.
 Hampel . (2011) hampel2011robustHampel, FR., Ronchetti, EM., Rousseeuw, PJ. Stahel, WA. 2011. Robust statistics: the approach based on influence functions Robust statistics: the approach based on influence functions ( 114). John Wiley & Sons.
 Hansen (2004) hansen2004nonparametricHansen, BE. 2004. Nonparametric conditional density estimation Nonparametric conditional density estimation. Unpublished manuscript.
 Hooker Vidyashankar (2014) hooker2014bayesianHooker, G. Vidyashankar, AN. 2014. Bayesian model robustness via disparities Bayesian model robustness via disparities. Test233556–584.
 Huber Ronchetti (1981) huber1981robustHuber, PJ. Ronchetti, E. 1981. Robust statistics, Series in probability and mathematical statistics. Robust statistics, series in probability and mathematical statistics. John Wiley, New York.
 Hwang . (1994) hwang1994nonparametricHwang, JN., Lay, SR. Lippman, A. 1994. Nonparametric multivariate density estimation: a comparative study Nonparametric multivariate density estimation: a comparative study. IEEE Transactions on Signal Processing42102795–2810.
 Lazer . (2014) lazer2014parableLazer, D., Kennedy, R., King, G. Vespignani, A. 2014. The parable of Google Flu: traps in big data analysis The parable of google flu: traps in big data analysis. Science34361761203–1205.
 Li Dunson (2016) li2016frameworkLi, M. Dunson, DB. 2016. A framework for probabilistic inferences from imperfect models A framework for probabilistic inferences from imperfect models. arXiv preprint arXiv:1611.01241.
 Lu . (2013) lu2013multivariateLu, L., Jiang, H. Wong, WH. 2013. Multivariate density estimation by bayesian sequential partitioning Multivariate density estimation by bayesian sequential partitioning. Journal of the American Statistical Association1085041402–1410.
 Miller Dunson (2015) miller2015robustMiller, JW. Dunson, DB. 2015. Robust Bayesian inference via coarsening Robust bayesian inference via coarsening. arXiv preprint arXiv:1506.06101.
 O’Hagan . (2006) o2006uncertainO’Hagan, A., Buck, CE., Daneshkhah, A., Eiser, JR., Garthwaite, PH., Jenkinson, DJ.Rakow, T. 2006. Uncertain judgements: eliciting experts’ probabilities Uncertain judgements: eliciting experts’ probabilities. John Wiley & Sons.
 Owen (1991) owen1991empiricalOwen, A. 1991. Empirical likelihood for linear models Empirical likelihood for linear models. The Annals of Statistics1725–1747.
 Ram Gray (2011) ram2011densityRam, P. Gray, AG. 2011. Density estimation trees Density estimation trees. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining Proceedings of the 17th acm sigkdd international conference on knowledge discovery and data mining ( 627–635).
 Rosenblatt . (1976) rosenblatt1976maximalRosenblatt, M. . 1976. On the maximal deviation of dimensional density estimates On the maximal deviation of dimensional density estimates. The Annals of Probability461009–1015.
 Sason Verdú (2015) sason2015boundsSason, I. Verdú, S. 2015. Bounds among fdivergences Bounds among fdivergences. submitted to the IEEE Trans. on Information Theory.
 Shunichi (2012) shun2012differentialShunichi, A. 2012. Differentialgeometrical methods in statistics Differentialgeometrical methods in statistics ( 28). Springer Science & Business Media.
 Silverman (1986) silverman1986densitySilverman, BW. 1986. Density estimation for statistics and data analysis Density estimation for statistics and data analysis ( 26). CRC press.
 J. Smith (1995) smith1995bayesianSmith, J. 199510. Bayesian Approximations and the Hellinger metric. Bayesian approximations and the hellinger metric. Unpublished manuscript
 JQ. Smith (2010) smith2010bayesianSmith, JQ. 2010. Bayesian decision analysis: principles and practice Bayesian decision analysis: principles and practice. Cambridge University Press.
 Tamura Boos (1986) tamura1986minimumTamura, RN. Boos, DD. 1986. Minimum Hellinger distance estimation for multivariate location and covariance Minimum hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association81393223–229.
 Walker (2013) walker2013bayesianWalker, SG. 2013. Bayesian inference with misspecified models Bayesian inference with misspecified models. Journal of Statistical Planning and Inference143101621–1633.
 Watson . (2016) watson2016approximateWatson, J., Holmes, C. . 2016. Approximate models and robust decisions Approximate models and robust decisions. Statistical Science314465–489.
 Winkler Murphy (1968) winkler1968evaluationWinkler, RL. Murphy, AH. 1968. Evaluation of subjective precipitation probability forecasts Evaluation of subjective precipitation probability forecasts. Proceedings of the first national conference on statistical meteorology Proceedings of the first national conference on statistical meteorology ( 148–157).
 Zellner (1988) zellner1988optimalZellner, A. 1988. Optimal information processing and Bayes’s theorem Optimal information processing and bayes’s theorem. The American Statistician424278–280.
Appendix A Mclosed efficiency
The examples in Section 5, demonstrate how minimising an alternative divergence to the KLdivergence can lead to superior robustness to tail misspecifications. We next illustrate that by placing less importance on tail misspecifications in order to gain improved robustness, the DM must tradeoff a decrease in efficiency in the case when the model class does in fact contain the data generating process. Minimising the KLdivergence uses Bayes rule and thus conditions on the data coming from the model. As a result minimising the KLdivergence is bound to perform the best when the model class contains the data generating distribution. Here we illustrate this but also demonstrate the reassuring fact that the MDE methods perform comparatively even in this situation.
In order to examine the frequentist efficiency, the observed mean squared error (MSE) , over repeats of a simulated experiment are examined. The MSE’s are examined on two examples and when fitting the model with prior and for . The prior for is centred on the data while the prior for is not. Table 3 plots the observed MSEs over replications^{4}^{4}4The stan output from a number of these experiments associated with the TV divergence suggested that the sample from the posterior was not sufficient to use for posterior inference, for the purpose of this preliminary analysis these experiments were disregarded here, but this suggests there exists computational issues associated with the TVBayes..
MSE  KL  Hell  TV  alpha  power  

n=50  1.89  4.43  1.50  16.88  2.68  6.77  2.03  4.29  0.76  93.80  
n=100  0.64  1.29  0.57  5.19  1.20  1.83  0.67  1.25  0.44  33.80  
n=200  0.35  0.47  0.32  1.50  0.48  0.50  0.35  0.47  0.31  7.83  
n=500  0.15  0.16  0.14  0.33  0.28  0.31  0.15  0.16  0.20  1.42  
n=50  1.11  1.17  1.51  1.28  3.47  6.09  1.13  1.13  8.23  67.94  
n=100  0.60  0.61  0.98  0.87  1.30  1.61  0.65  0.64  1.99  10.64  
n=200  0.23  0.23  0.57  0.52  0.42  0.44  0.30  0.29  0.46  1.09  
n=500  0.10  0.10  0.26  0.26  0.29  0.27  0.12  0.13  0.18  0.21 
Table 3 demonstrates several interesting points about these different methods. Firstly it appears as though the KLBayes, HellBayes and alphaBayes perform very similarly when , while the powerBayes and TVBayes performs marginally worse. As we should expect, for the KLBayes appears to be optimal for both priors with the Bayes appearing to perform the next best. Table 3 also shows the differing impacts the prior has on the different Bayesian updates. The KLBayes is least affected by the prior with the Bayes and the TVBayes the next least. Authors Hooker Vidyashankar (2014) recognised that the boundedness of the HellBayes score function compared with the KLBayes score function, may cause problems such as this. The Bayes score function is larger than the HellBayes score function so our findings would be consistent with this. The powerBayes score function with is smaller than the HellBayes score function when . This explains why the powerBayes is impacted more by the prior, and provides a likely explanation of why the powerBayes appears to perform so well when the prior is specified around the data. For small values of , which occur when the variance of is 100, the absolute size of the TVBayes score function is generally larger than the HellBayes score, again explaining why the TVBayes performs better than the HellBayes under a poorly specified prior.
Examining whether the estimates of were generally above or below the data generating parameter demonstrates how these different methods learn, see table 4. For all sample sizes the KLBayes appears to over or under estimate the true variance in equal proportions over the 50 repeats, while the HellBayes and alphaBayes, to a lesser extent, generally under predict the true variance.
MSE  KL  Hell  TV  alpha  power  

n=50  12  32  10  50  5  37  12  32  8  50  
n=100  16  40  14  50  15  22  18  40  12  50  
n=200  2  24  8  46  1  10  2  26  2  50  
n=500  6  16  6  42  1  3  10  18  8  48  
n=50  4  6  26  10  25  27  8  6  46  50  
n=100  0  0  36  32  19  12  14  14  38  48  
n=200  2  2  40  40  23  22  24  24  28  40  
n=500  2  2  36  40  27  21  28  28  14  20 
Figure 4 compares the scoring functions used in the general Bayesian update targeting the KL, Hellinger and alpha divergences. As increases, the upper bound on the score associated with the alphadivergence increases and the scoring functions become more convex, tending towards the logarithmic score when . In contrast, the score associated with the Hellingerdivergence is closer to being linear. We believe that this is responsible for causing the smaller variance estimates when minimising the Hellingerdivergence and alphadivergence when . The closer the score is to being linear the more similar the penalty is for over and under predicting the probability of an observation when compared with reality. The bounded nature of the scores prevents the penalty for under predicting being too high, and therefore more posterior predictive mass is placed near the MAP of the posterior predictive distribution. In contrast, a score function with greater convexity will penalise under prediction compared with the data generating density to a greater extent spreading the posterior mass out further. However, one must keep in mind that it is the sever nature of the penalty for under prediction incurred by the KLdivergence that renders it nonrobust. On the other hand the TVBayes and the powerBayes generally over predict the variance. For the powerBayes this appears to be because the second term of the power divergence score is
(38) 
for a multivariate Gaussian model, and this is decreased when the variances are made large. As was established in Section 4.3, the powerBayes with decreases the influence observations with small likelihood under the model have on the inference. In the example above the data has a variance of and therefore the likelihood of each observation is very small. As a result the data is then excessively downweighted by the powerBayes when . When the sample size is small, a small increase in the variance may only decrease the observed likelihood term of the powerdivergence score by a fraction and may increase the term in equation (38) by more, resulting in a variance estimate that is too large. The score function for the TVBayes is the only score function considered here that is not monotonic in the predicted probability of each observation. While under the other scores, the score for each individual observation can be increased by predicting that observation with greater probability, this is not the case for the TVBayes. We believe this causes the TVBayes to produces variance estimates that are too large. We note that all of the scores remain proper for large samples and that these are just observations about how the different methods perform for small samples sizes when the model is correct.
Comments
There are no comments yet.