In the -closed world, where a fitted model class is implicitly assumed to contain the sample distribution from which the data came, Bayesian updating is highly compelling. However, modern statisticians are increasingly acknowledging that their inference is taking place in the -open world (Bernardo Smith, 2001). Within this framework we acknowledge that any class of models we choose is unlikely to capture the actual sampling distribution, and is then at best an approximate description of our own beliefs and the underlying real world process. In this situation it is no longer possible to learn about the parameter which generated the data and a statistical divergence measure must be specified between the fitted model and the genuine one in order to define the parameter targeted by the inference Walker (2013). Remarkably in the -open world standard Bayesian updating can be seen as a method which learns a model by minimising the predictive Kullback-Leibler (KL)-divergence from the model from which the data were sampled (Walker, 2013; Bissiri ., 2016). Therefore traditional Bayesian updating turns out to still be a well principled method for belief updating provided the decision maker (DM) concerns themselves with the parameter of the model that is closest to the data as measured by KL-divergence (Walker, 2013). Recall the KL-divergence of functioning model from data generating density is given by
It is well known that when the KL-divergence is large, which is almost inevitable in high dimensional problems, the KL-divergence minimising models can give a very poor approximation if our main interest is in getting the central part of the posterior model uncertainty well estimated. This is because the KL-divergence gives large consideration to correctly specifying the tails of the generating sample distribution - see Section 3.2.1. As a result, when the DM acknowledges that they are in the -open world, they have several options currently available to them:
Proceed as though the model class does contain the true sample distribution and conduct a posteriori sensitivity analysis.
Modify the model class in order to improve its robustness properties.
Abandon the given parametric class and appeal to more data driven techniques.
To this we add:
Acknowledge that the model class is only approximate, but is the best available, and seek to infer the model parameters in a way that are most useful to the decision maker.
Method 1 is how Box (1980)
recommends approaching parametric model estimation and is the most popular approach amongst statisticians. Although it is acknowledged that the model is only approximate, Bayesian inference is applied as though the statistician believes the model to be correct. The results are then checked to examine how sensitive these are to the approximations made. AuthorsBerger . (1994) provide a thorough review while Watson . (2016) consider this in a decision focused manner.
Method 2 corresponds to the classical robustness approach. Within this approach, one model within the parametric class may be substituted for a model providing heavier tails (Berger ., 1994). Alternatively different estimators, for example M-estimators Huber Ronchetti (1981) Hampel . (2011), see Greco . (2008) for a Bayesian analogue, are used instead of those justified by the model class. Bayes linear methods Goldstein (1999)
, requiring many fewer belief statements than a full probability specification, form a special subclass of these techniques.
The third possibility is to abandon any parametric description of the probability space. Examples of this solution include, empirical likelihood methods (Owen, 1991); a decision focused general Bayesian update (Bissiri ., 2016)
Though alternatives 2 and 3 can be shown to be very powerful in certain scenarios, it is our opinion that within the context described above that a fourth option holds considerable relative merit. In an applied statistical problem the model(s) represents the DM’s best guess of the sample distribution that could be applied. The model provides the only opportunity to input not only structural, but quantitative expert judgements about the domain, something that is often critical to a successful analysis - see Lazer . (2014) for example. The model provides an interpretable and transparent explanation about how different factors might be related to each other. This type of evidence is often essential when advising on decisions to policy makers. Often, simple assumptions play important roles in providing interpretability to the model and in particular preventing it from over fitting to any non-generic features contained in any one data set.
For the above reasons an unambiguous statement of a model, however simple, is in our opinion an essential element for much of applied statistics. In light of this, statistical methodology should also be sufficiently flexible to cope with the fact that the DM’s model is inevitably an approximation both of their beliefs and the real world process. Currently, Bayesian statistics sometimes struggles to do this well. We argue below that this is because it implicitly minimises the KL-divergence to the underlying process.
By fitting model parameter in a way that can be non-robust, the DM is having to combine their best guess belief model with something that will be robust to the parameter fitting. The DM must seek the best representation of their beliefs about a process in order to make future predictions. However under traditional Bayesian updating they must also give consideration to how robust these beliefs are. This seems an unfair task to ask of the DM. We therefore propose to decouple what the DM believes about the data generating process, from how the DM wishes the model to be fitted. This results in option 4 above.
This fourth option suggest the DM may actually want to explicitly target more robust divergences, a framework commonly known as Minimum Divergence Estimation (MDE), see Basu . (2011). Minimum divergence estimation is of course a well-developed field by frequentists, with Bayesian contributions coming more recently. However when the realistic assumption of being in the -open world is considered the currently proposed Bayesian minimum divergence posteriors fail to fully appreciate the principled justification and motivation required to produce a coherent updating of beliefs. A Bayesian cannot therefore make principled inference using currently proposed methods in the -open setting, except in a way that Miller Dunson (2015) describe as “tend(ing) to be either limited in scope, computationally prohibitive, or lacking a clear justification”. In order to make principled inference it appears as though the DM must currently concern themselves with the KL-divergence. However in this paper we remove this reliance upon the KL-divergence by providing a justification for Bayesian updating minimising alternative divergences, both theoretically and ideologically. Our updating of beliefs does not produce an approximate or pseudo posterior, but uses general Bayesian updating (Bissiri ., 2016) to produce the coherent posterior beliefs of a decision maker who wishes to produce predictions from a model that provide an explanation of the data that is as good as possible in terms of some pre-specified divergence measure. By doing this the principled statistical practice of fitting model parameters to produce predictions is adhered to, but the parameter fitting is done so acknowledging the -open nature of the problem.
In Section 2 of this article we review the currently available Bayesian minimum divergence estimation techniques and introduce general Bayesian updating Bissiri . (2016). The contributions of this article are then as follows: in Section 3 we identify the inadequacies in the justification provided by the currently available Bayesian MDE methods and use general Bayesian updating to prove that the Bayesian can still do principled inference on the parameters of the model using alternative, more robust divergences to KL-divergence. This theoretical contribution then allows us to propose a wider variety of divergences that the Bayesian could wish to minimise than have currently been consider in the literature in Section 4. In this section we also consider decision theoretic reasons why targeting alternative divergences to the KL-divergence can be more desirable. Lastly in Section 5 we demonstrate the impact model misspecifications can have on a traditional Bayesian analysis for simple inference, regression and time series analysis, and that superior robustness can be obtained by minimising alternative divergences to the KL-divergence. In appendix A we also show that when the observed data is in fact generated from the model, these methods can be shown not to lose too much precision. In this paper we demonstrate that this advice is not simply based on expedience but has a foundation in a principled inferential methodology. In this purpose we have deliberately restricted ourselves to simple demonstrations designed to provide a transparent illustration of the impact that changing the divergence measure can have on inferential conclusions. However we discuss how robust methodology becomes more important as problems and models become more complex and high dimensional and thus encourage practitioners to experiment with this methodology in practice.
2 Related work
2.1 Bayesian theory and scoring functions
Authors Bernardo Smith (2001) consider Bayesian inference through the lens of a Bayesian decision problem, where quoting a probability belief distribution for future uncertainties is the action to be taken. In this scenario, a scoring rule is used to define the loss of quoting a distribution and observing a realisation . They argue that the scoring rule associated with scoring probabilistic predictions should be proper and local. A proper scoring rule results in the DM’s expected loss being minimised when the DM quotes their true beliefs, and a local scoring rule is one where the score only depends on the quoted probability of the actual observed outcome and nothing else. The only proper, local scoring rule is the logarithmic scoring rule:
The expected additional loss when quoting distribution , with density , for future outcomes when the data is distributed according to distribution , with density , is then:
the Kullback-Leibler (KL) divergence between the two probability distributions in equation (1). In fact authors Grünwald Dawid (2004) define any divergence , associated with proper scoring rule as the extra penalty for believing was distributed according to when it was actually distributed according to .
2.2 The data generating process and the M-open world
Sometimes in this paper we use the phrase “the data generating process”. The data generating process is a widespread term in the literature and appears to suggest that ‘Nature’ is using a simulator to generate observations. While this may fit nicely with some theoretical contributions, it becomes difficult to argue for in reality. In this article we consider the data generating process to represent the DM’s true beliefs about the sample distribution of the observations. However in order to correctly specify these the DM must be able to take the time and infinite introspection to consider all of the information available to them/in the world in order to produce probability specifications in the finest of details. As is pointed out by Goldstein (1990) this requires many more probability specifications to be made at a much higher precision than any DM is ever likely to be able to manage within time constraints of the problem. As a result these genuine beliefs must be approximated. This defines the subjectivist interpretation of the -open world we adopt in this paper - the model used for the belief updating is only ever feasibly an approximation of the DM’s true beliefs about the sample distribution they might use if they had enough time to fully reflect. In the special case when the data is the result of a draw from a known probability model - a common initial step in validating a methodology - then this thoughtful analysis and ”a data generating process” obviously coincide. Henceforth we use the “the data generating process” in this sense to align our terminology as closely as possible with that in common usage.
2.3 General Bayesian updating
Following the above motivation, Bissiri . (2016) consider only conducting inference about the factors of the world/problem that matter to the DM. They produce what they refer to as a general Bayesian update - a coherent method to produce a posterior distribution over some quantity without relying on a full model for the observations - when considering the non-inferential Bayesian decision problem. Similarly to Bayes linear methods (Goldstein, 1999), they consider producing posterior beliefs which do not require conditioning. They consider the Bayes act associated with beliefs corresponding to the data generating distribution and quantity of interest as
Rather than elicit a model representing the DM’s beliefs over , Bissiri . (2016)
argue that given a prior, a loss function and some data, an updating of beliefs about the value of must be possible in the absence of a model for the sampling distribution. They suggest that the posterior distribution resulting from such an updating of beliefs, be as ‘close’ as possible in terms of expected loss to both the prior and the observed data , used for the updating. In this development the KL-divergence was the only way to measure ‘closeness’ to the prior which resulted in an additive Bayesian update - one where the order the data arrives in does not affect that parameter updating and is consistent with the likelihood principle. In the absence of a likelihood, the best choice to measure the ‘closeness’ of the posterior to the data is to use the expected loss of the observed data under the proposed posterior, Bissiri . (2016) show that the posterior minimising the expected loss criteria is:
This posterior is not an approximation. Neither is it a pseudo-posterior, but rather a valid, coherent representation of subjective uncertainty in the minimizer of the decision in equation (5). Often , the cumulative loss of the current data set. Updating using the cumulative loss amounts to replacing integrating over the data generating distribution in equation (5), with empirical integration over data whose distribution is . They also mention the need to calibrate the loss-function used in the general Bayesian update against the prior, by setting the value of in equation (6
). While any probability density is defined so that it integrates to 1, any loss-function can be defined to be arbitrarily large or small. It is important that the weight the loss-function has in the updating process is calibrated against the prior such that the posterior distributions resulting from the general Bayesian update retain probabilistic meaning in the absence of a model. So it follows that the general Bayesian must be prepared to post odds according to their posterior similarly to any other Bayesian.
It is straightforward to see that if the logarithmic score is used in the general Bayesian update then the traditional Bayesian update is recovered:
This echoes the well-known result that the Bayesian predictive distribution finds the distribution that is closest to the data generating distribution in terms of KL-divergence. General Bayesian updating allows the goals of the statistical analysis to be coupled together with the parameter updating. This is not something previously possible under traditional Bayesian statistics. While Bissiri . (2016) provide the framework to update a probability belief distribution using a loss function, they only consider the logarithmic score in an inferential scenario.
2.4 M-closed Bayesian minimum divergence estimation
Minimum divergence estimation (MDE) is an approach to robust inference, largely overlapping with minimum scoring rule inference Dawid . (2016). MDE considers making inference about the parameters of a parametric model by minimising the divergence between and the data generating function of the observed data. While Dawid . (2016) uses proper scores, MDE traditionally minimises a member of the class of discrepancies or disparities. Disparity based methods first build a non-parametric density estimate
of the data generating distribution from the data (often a Kernel Density estimate). They then conduct inference by minimising the divergence between the model and. The frequentist literature in this area is vast so in this article we choose to focus on the Bayesian contributions. These have come first from Hooker Vidyashankar (2014) and more recently from Ghosh Basu (2016) and Ghosh Basu (2017).
Authors Hooker Vidyashankar (2014) consider posterior densities of the form
where is the data generating density and is the squared Hellinger divergence between and
Posteriors of this form are justified following the asymptotic approximation of the divergence to the KL-divergence when the data comes from the model. This results from the fact that the divergence between two probability distributions must be uniquely minimised to 0 when the probability distributions are equal (Grünwald Dawid, 2004). Therefore, when and as the distribution minimising any divergences between it and the data generating process will be .
Rather than considering a discrepancy, Ghosh Basu (2016) considers minimising the proper Tsallis score associated with the density power divergence in order to produce a ‘pseudo-posterior’. The density power divergence between and is
This can be seen as a Bayesian analogue of the minimum proper score inference considered by Dawid . (2016). Authors Ghosh Basu (2016) identify the fact that the score associated with the density power divergence does not require a density estimate, but can still be used to provide robust inference. They term the distribution produced by their estimation method a ‘pseudo-posterior’. However Ghosh Basu (2017) demonstrate exponential convergence results illustrating that this posterior is asymptotically optimal in exactly the same exponential rate as the traditional Bayesian posterior when the model is correctly specified (Barron, 1988).
3 Extending Bayesian MDE for the M-open world
The current Bayesian minimum divergence technology is currently only able to produce approximate or pseudo posteriors. Therefore, Bayes rule is currently the only method explicitly available in the literature that DMs can use to produce a well principled updating of their beliefs about the parameters of a model. We address this issue in this section.
3.1 Why the current justification is not enough
The equivalence between the KL-divergence and the Hellinger-divergence of the model from the data generating process mentioned in Section 2.4 can be seen by taking Taylor expansions about the KL-divergence minimising parameter . We have that
Now following the same arguments as used in Hooker Vidyashankar (2014), if is consistent for then equations (11) and (12) are equivalent in the limit as . However when is not the data generating parameter because the model class is misspecified, and if is still converging to as then equations (11) and (12) will be different. In this setting the current literature gives no foundational reasoning why updating using the Hellinger divergence constitutes a principled updating of beliefs.
The ‘pseudo-posterior’ of Ghosh Basu (2016) is justified as a valid posterior by Ghosh Basu (2017) as the traditional posterior originating from an alternative belief model. However, this alternative model will not even correspond to the DM’s approximate beliefs about the data generating process. There is therefore, a lack of formal justification for a DM to update beliefs using Bayes rule on this object.
3.2 Principled justification for KL in -open world
In contrast, (Walker, 2013; Bissiri ., 2016) has provided a principled justification for Bayes’ rule in the -open world. This requires the DM to think about the KL-divergence minimising parameter. When the model is correctly specified, the Bayesian learns about the parameter that generated the data. This, for any statistical divergence is equivalent to learning the parameter, , minimising . When the model is considered to be incorrect, there is no longer any formal relationship between the parameter and the data. The likelihood no longer represents the probability of the observed data conditioned on the parameter. Therefore in order for -open inference to be meaningful a divergence measure must be chosen and the parameter of interest can then defined as . Authors Walker (2013) then state that once the parameter of interest has been defined as the minimum of some divergence, it is then possible for a practitioner to define their prior beliefs about where the minimiser of this divergence may lie. Then the practitioner’s final task is to ensure that their Bayesian learning machine is learning about the same parameter with which they defined their prior belief.
Recalling the well-known result that Bayesian updating learns the parameters of the model which minimises the KL divergence of the model from the data generating density. This allows the DM to continuing conducting belief updating in a principled fashion using Bayes rule provided they are interested in, and specify prior beliefs about, the parameter minimising the KL divergence of the model from the data generating process.
3.2.1 Moving away from KL-divergence in the M-open world
Viewed conversely, Walker (2013)
identified that using Bayes rule to update beliefs using a misspecified model results in the DM being implicitly concerned about producing predictions that are closest to the data generating process in terms of KL-divergence. In many scenarios using the KL-divergence is the most expedient thing to do. This probably explains why the KL-divergence is so prolific for statistical applications. The logarithmic form of the KL-divergence makes it straightforward to calculate the KL-divergence between many exponential families and the link with Bayes rule allows for conjugate prior distributions to be considered.
However as is well known but often forgotten, someone whose probabilities are elicited using the logarithmic score, or the KL-divergence, will have to beware of approximating the probability of an event by 0. Since if they are wrong, they will incur an infinite loss. As a result, accurate probability specification of tail events will be important to the DM using the logarithmic-score, something which is generally speaking extremely important in pure inference problems (Bernardo Smith, 2001).
Often in reality inference is being done in order to produce estimates of expected utilities as part of a larger decision making process. While the tail specification is important for inferential procedures and in situations where the losses are unbounded, for example in gambling and odds setting scenarios, many loss functions connected to real decisions may wish to place less impact on rarely occurring observations in the tails. For entirely reasonable practical reasons these rare events are precisely ones which the DM will find hard to accurately elicit O’Hagan . (2006), see Winkler Murphy (1968) for a demonstration of this in the context of forecasting the probability of precipitation. Therefore, basing inference on such severe penalties can be unstable. For example a treatment regime is likely to put greater weight on correctly treating the majority of the population than worrying about outlying extreme cases. Authors Watson . (2016) quite rightly point out that loss functions for real problems are often bounded and any method that simulates parameters via MCMC makes this assumption. As will be demonstrated in Section 4.1, inferential procedures assuming an unbounded loss function such as the logarithmic score, provide no guarantees about performance in more general decision making scenarios.
Once the consequences of using Bayes rule to solve decision problem in the -open world is understood, we believe that DMs may reasonably desire alternative options for parameter updating, that are as well principled as Bayes rule, but places less importance on tail misspecification. In this new era of “Big data” it becomes increasingly likely that the model used for inference is misspecified, especially in the tails of the process - see Section 5.4. We believe many DMs would consider it undesirable for models that approximate the distribution of the majority of the data to be disregarded because they poorly fit a few outlying observations.
3.3 Principled minimum divergence estimation
We next take a foundational approach to theoretically justify an updating of beliefs targeting the parameter minimising any statistical divergence between the model and the data generating density. Consider the general inference problem of wanting to find the parameters of the parametric model . The model here can be considered as the DM’s best guess at the data generating process. We consider it important to continue to use a model even under the acknowledgement that it is misspecified for the reasons outlined in Section 1. Following Bernardo Smith (2001) and Walker (2013) we consider fitting these parameters in a decision making scenario, minimising some divergence function
to the data generating process . The goal is therefore to solve the decision problem
Here the entropy term in the definition of the divergence is removed from the minimisation because it does not depend on . In this scenario Walker (2013) forces the Bayesian into concerns associated with non-robustness by shackling them to minimising the KL-divergence in order to use Bayes’ rule. However, equation (14) is identical to the setting consider by Bissiri . (2016) in equation (5), except that now the loss function depends on the parameter through a model so is a scoring rule. Using equation 6, general Bayesian updating therefore gives us the tools to come up with a coherent updating of beliefs that target the parameter minimising the general divergence as
where we have set the calibration weight . This will be discussed further in Section 3.3.2. In order to stay consistent with Walker (2013), we use the notation to indicate that the prior and posterior belief distributions correspond to beliefs about the parameter minimising divergence . Applying the general Bayesian update in an inferential scenario like this, provides a compromise between purely loss based general Bayesian inference and traditional Bayesian updating, which is conducted completely independently of the problem specific loss function. By continuing to fit a generative model, the Bayesian minimum divergence posterior somewhat separates the inference from the decision making. The inference is still concerned with estimating model parameters based on how well the model’s predictions reflect the data generating process. However, the criteria for closeness between predictions and reality takes decision making into account. As a result, the minimum divergence posteriors can be seen as producing inferences that will be suitable for a broad range of loss functions. The DM is no longer required to exactly define the loss function associated with their decision problem at the inference stage. They need only consider broadly how robust to tail misspecifications they want their inference to be in order to define the target divergence (see Section 4.4). Considering the realistic -open nature of the model class also justifies making this compromise. The general Bayesian update, assumes absolutely no information about the data generating process, while using Bayes rule traditionally111Prior to the interpretation provided by Walker (2013) explained in section 3.2 assumes the data generating density is known precisely. As we point out, it is actually more likely that the decision maker is able to express informative but not exact beliefs about the data generating process and therefore a half-way-house between these two is appropriate in reality.
Plugging in the corresponding loss functions gives general Bayesian posteriors minimising the Hellinger and density power divergence as
Equation (16) introduces to estimate the data generating density (see Section 4.5 for more on this). As a result, the general Bayesian updating is being conducted using an empirical loss function, say , which approximates the true loss function associated with minimising the Hellinger divergence between the model and the data generating process .
Equation (17) is exactly the distribution resulting from the robust parameter update of Ghosh Basu (2016), while equation (16) is similar to the posterior produced by Hooker Vidyashankar (2014) except the divergence function has been decomposed into its score and entropy term here. This demonstrates that the posteriors above are not pseudo posteriors - as Ghosh Basu (2016) suggests - or approximations of posteriors - as Hooker Vidyashankar (2014) suggests - they are the correct way for the DM to update their prior beliefs about the parameter minimising an alternative divergence to the KL-divergence.
follows a Bernstein-von Mises (BVM) type result, but with convergence in probability, where the asymptotic variance of the posterior is the expected second derivative of the loss function. Authors Hooker Vidyashankar (2014) prove the corresponding asymptotic normality result for the posterior minimising the Hellinger divergence, but assume that the data generating process is contained within the model class, as this was required to justify their posterior in the first place. Consistency to the minimiser of any divergence provides consistency with the data generating process when it is contained within the model class, as by definition all divergences are minimised to 0 when the distributions are the same. The results of Ghosh Basu (2017) demonstrate that not only is the posterior mean consistent, but the whole posterior is asymptotically optimal in the same sense as the traditional Bayesian posterior is when the model class contains the data generating process.
Another principled alternative to traditional Bayesian updating when it is difficult to fully specify a model for the data generating process is Bayes linear methods (Goldstein, 1999). These only require the subjective specifications of expectations and covariances for various quantities the DM is well informed about and interested in order to do inference. Alternatively by choosing a more robust divergence to use for the updating our method tackles the same problem by being robust to routine assumptions making a full probability specification a much less strenuous task.
3.3.1 The likelihood principle and Bayesian additivity
Authors Hooker Vidyashankar (2014) identify that their posterior distribution minimising the Hellinger divergence no longer satisfies Bayesian additivity. That is to say that the posteriors would look different if the data were observed in one go or in two halves for example. This is because the posterior now depends on the estimate of the data generating density, which depends in a non-additive way on the data. This constitutes a departure from the likelihood principle underpinning traditional Bayesian statistics. The likelihood principle says that the likelihood is sufficient for the data. When the model is perfectly specified this is a sensible principle, the likelihood of the observed data under the correct model represents all the information in the data. However in the -open world the likelihood principle is no longer a reasonable requirement. When the model is only considered as an approximation of the data generating process, it is not unreasonable to suspect that the data contains more information than is represented in the likelihood of an incorrect model. However, Bayesian additivity (defined as coherence by Bissiri . (2016)) is a central principle of the general Bayesian update. Bissiri . (2016) chose the KL-divergence to measure the distance between the prior and the posterior as it was the only divergence measure that left the belief update additive. The general Bayesian posterior minimising the Hellinger divergence combines the loss function for each observation in an additive way, which is consistent with the additivity demanded by Bissiri . (2016), but a different density estimate is produced when the data is considered as a whole or in parts. This causes the empirical loss function used for the updating to be different when the data arrives in sections, as opposed to one go. If the data generating density were available then the exact loss function associated with minimising the Hellinger divergence could be calculated and the Bayesian update would be additive. However because an approximation of the loss function associated with the Hellinger divergence is used, additivity is sacrificed.
While the posterior of Ghosh Basu (2016) is still additive, the likelihood is also not considered to be sufficient for the data, as the posterior also depends on the integral, . Using an alternative divergence to the KL-divergence in order to conduct Bayesian updating, requires that additional information to the ‘local’ information provided by the likelihood of the observed data is incorporated into the loss function. For the Hellinger-divergence this information comes from the data in the way of a density estimate, while for the density power-divergence this comes from the model through .
3.3.2 A note on calibration
The two posteriors in equations (16) and (17) have set the general Bayesian calibration weight to 1. Unlike probability distributions, loss functions can be of arbitrary size and therefore it is important that they are calibrated with the prior. However, this is not a problem here. The posteriors above both use well-defined models and the canonical form of well-defined divergence functions and as a result there is no arbitrariness in the size of the loss. This is further demonstrated by the fact that Bayes rule corresponds to using the canonical form of the KL-divergence and a probability model with weight . We do note that when the model is correct the posterior variance of these method will be comparatively bigger for finite data samples than that of the traditional Bayesian posterior. This is to be expected, Zellner (1988) showed that Bayes rule processes information optimally and therefore produces the most precise posterior distributions. The posterior is simply a subjective reflection of the DM’s uncertainty after seeing the data, and if they believe their model is incorrect and therefore target a more robust divergence, they are likely to have greater posterior uncertainty than if they naively believe their model to be correct and proceed accordingly.
4 Possible divergences to consider
The current formulation of Bayesian MDE methods has limited which divergences have been proposed for implementation. However demonstrating that principled inference can be made using alternative divergence measures than the KL-divergence, as we have in Section 3.3, allows us to consider the selection of the divergence used for updating to be a subjective judgement made by the DM, alongside the prior and model, to help tailor the inference to the specific problem. Authors Celeux . (2017) observed that while Gelman Hennig (2015) advocate greater freedom for subjective judgements to impact statistical methodology, they fail to consider the possibility of subjective Bayesian parameter updating. In the -closed world Bayes rule is objectively the correct thing to do, but in the -open world this is no longer so clear. Very few problems seek answers that are connected with a specific dataset or model, they seek answers about the real world process underpinning these. Authors Goldstein . (2006), focusing on belief statements, demonstrate that subjective judgements help to generalise conclusions from the model and the data to the real world process. Carefully selecting an appropriate divergence measure can further help a statistical analysis to do this. We see this as part of the meta-modelling process where the DM can embed beliefs about their nature and extent of their modelling class explicitly.
Below we list some divergences that a practitioner may wish to consider. We do not claim that the list is exhaustive, but merely contains the divergences we have considered using. We qualitatively describe the features of the inferences produced by targeting each specific divergence, which will be demonstrated empirically in Section 5. Throughout we consider the Lebesgue measure to be the dominating measure.
4.1 Total Variation Divergence
The Total-Variation divergence between probability distributions and is given by
and the general Bayesian update targeting the minimisation of the TV-divergence can be produced by using loss function
in equation (15). It is worth noting however that the loss function in equation (19) is not monotonic in the predicted probability of each observation. This is discussed further in Appendix A. In fact when informed decision making is the goal of the statistical analysis, closeness in terms of TV-divergence ought to be the canonical criteria the DM demands. If then for any utility function bounded by 1, the expected utility under of making the optimal decision believing the data was distributed according to is at most worse than the expected utility gained from the optimal decision under (see JQ. Smith (2010) for example). Therefore if the predictive distribution is close to the data generating process in terms of TV-divergence then the consequences of the model misspecification in terms of the expected utility of the decisions made is small. However explicit expressions of the TV-divergence even between known families, are rarely available. This somewhat hinders an algebraic analysis of the divergence.
It is straightforward to see that and the KL-divergence form an upper bound on the TV-divergence through Pinsker’s inequality. However, the KL-divergence does not bound the TV-divergence below so there are situations where the TV-divergence is very small but the KL-divergence is very large. In this scenario, a predictive distribution whose associated optimal decisions achieve a close to optimal expected utility estimate (as the distribution is close in TV-divergence) will receive very little posterior mass. This is clearly undesirable in a decision making context.
Authors Hooker Vidyashankar (2014) identify some drawbacks of having a bounded score function. The score function being upper bounded means that there is some limit to the score that can be incurred in the tails of the posterior distribution. The score incurred for one value of , will here be very similar to the score incurred by another value of past a given size. Therefore the tails of the posterior will be equivalent to the tails of the prior. As a result the DM is required to think more carefully about their prior distribution. Not only are improper priors prohibited, but more data is required to move away from a poorly specified prior. This can result in poor finite sample efficiency when the data generating process is within the chosen model class.
4.2 Hellinger Divergence
Authors Devroye Gyorfi (1985) observed, that the Hellinger divergence can be used to bound the TV-divergence both above and below:
As a result, the Hellinger-divergence and the TV-divergence are geometrically equivalent. Thus, if one of them is small, the other is small and similarly if one of them is large the other also is. So if one distribution is close to another in terms of TV-divergence, then the two distributions will be close in terms of Hellinger-divergence as well. Authors Beran (1977) first noted that minimising the Hellinger-divergence gave a robust alternative to minimising the KL-divergence, while Hooker Vidyashankar (2014) proposed a Bayesian alternative (equation (16)) that has been discussed at length above. While Hooker Vidyashankar (2014) motivated their posterior through asymptotic approximations, identifying the geometric equivalence between the Hellinger-divergence and TV-divergence proposes further justification for a robust Bayesian updating of beliefs similar to that of Hooker Vidyashankar (2014). Specifically, if being close in terms of TV-divergence is the ultimate robust goal, then being close in Hellinger-divergence will guarantee closeness in TV-divergence. Hellinger can therefore serve as a proxy for TV-divergence that retains some desirable properties of the KL-divergence: it is possible to compute the Hellinger divergence between many known families (J. Smith, 1995) and the score associated with the Hellinger divergence has a similar strictly convex shape, this is discussed further in Appendix A. A posterior targeting the Hellinger-divergence does suffer from the same drawbacks associated with having a bounded scoring function that are mentioned at the end of the previous section. Lastly along with TV-divergence, the Hellinger-divergence is also a metric.
A much wider class of divergences are provided by the two parameter -divergence family proposed in Cichocki . (2011)
Reparametrising this such that and gives the S-divergence of Ghosh . (2017).
The general Bayesian posterior targeting the parameter minimising the -divergence of the model from the data generating density is
Letting represent the data generating density, and some predictions of parametrised by , Cichocki . (2011) provide intuition about how the parameters and impact the influence each observation has on the inference about . They consider influence to be how the observations impact the estimating equation of , this is a frequentists setting but the intuition is equally valid in the Bayesian setting.
The size of for an observation defines how outlying (large values) or inlying (small values) the observation is. Choosing
ensures outliers relative to the model have little influence on the inference, while adjusting the value ofsuch that down-weights the influence of unlikely observations from the model .
The hyperparametersand control the trade-off between robustness and efficiency and selecting these hyperparameters is part of the subjective judgement associated with selecting that divergence. This being the case we feel that to ask for values of and to be specified by a DM is perhaps over ambitious, even given the interpretation ascribed to these parameters above. However, using the -divergence for inference can provide greater flexibility to the DM, which may in some cases be useful. Here to simplify the subjective judgements made we focus on two important, one parameter subfamilies within the -divergence family - the alpha-divergence where and the beta-divergence where .
4.3.1 alpha Divergence
where . There exists various reparametrisations of this: Amari notation uses with ; or Cressie-Read notation (Cressie Read, 1984) introduces with .
We generally restrict attention to values of . corresponds to a KL-divergence limiting case while is 4 times the Hellinger divergence. We think of these as two extremes of efficiency and robustness within the alpha-divergence family with which a DM would want to conduct inference between. The parameter thus controls this trade-off.
The general Bayesian posterior targeting the minimisation of the alpha-divergence is
It was demonstrated in Sason Verdú (2015) (Corollary 1) that for the alpha-divergence can be bounded above by TV-divergence :
Therefore, if the TV-divergence is small then the alpha-divergence will be small (provided ). So a predictive distribution that is close to the data generating density in terms of TV-divergence will receive high posterior mass under an update targeting the alpha-divergence. Once again, the general Bayesian posterior requires a density estimate to compute the empirical estimate of the score associated with the divergence and the same issues associated with having a bounded score function that were discussed in Section 4.1 apply here.
Considering the alpha divergence as a subfamily of the -divergence takes , and therefore . Therefore the influence of observations are weighted based on their ratio of only and not on the value of in isolation. corresponded to the KL-divergence and in order to obtained greater robustness is chosen such that in order to down-weight the influence larger ratios of , corresponding to outlying values of , have on the analysis. However down-weighting some of these ratios relative to the KL-divergence sacrifices efficiency. This demonstrates the efficiency and robustness trade-off associated with the alpha-divergence.
4.3.2 beta Divergence
The beta-divergence, also known as the density power-divergence Basu . (1998), is a member of the Bregman divergence family
Taking , to be the Tsallis score returns the beta-divergence
. This results in the density power divergence given in equation (10), parameterised by rather than here. Both Basu . (1998) and Dawid . (2016) noticed that inference can be made using the beta-divergence without requiring a density estimate. This was used in Ghosh Basu (2016) to produce a robust posterior distribution that did not require an estimate of the data generating density, which has been extensively discussed in previous sections.
Considering the density power divergence as part of the -divergence family takes , which results in treating all values of equally in isolation, but taking results in which means observations that have low predicted probability under the model are down-weighted. Under the KL-divergence at , the influence of an observation is inversely related to its probability under the model, through the logarithmic score. Raising above 0 will decrease the influence of the smaller values of , robustifying the inference to tail specification. However this results in a decrease in efficiency relative to methods minimising the KL-divergence (Ghosh Basu, 2017). As a result there may be some issues using the density power-divergence when high dimensional observations and models are considered. As the dimension increases the predicted probability of each (multivariate) observation shrinks towards 0. The density power divergence down-weights the influence of observations with small predicted probabilities, and as a result ( when we consider the density power divergence) needs to be selected very carefully in order to prevent the analysis from disregarding the majority of the data. This is a price that is paid in order to not require a data generating density estimate.
4.3.3 The S-Hellinger divergence
One further one-parameter special case of the -divergence (S-divergence) is the S-Hellinger divergence given by Ghosh . (2017)
This is generated from the S-divergence by taking . Taking recovers twice the squared Hellinger divergence and gives the squared divergence. Ghosh . (2017) observe that the S-Hellinger divergence is a proper distance metric. Translating these back into the notation of -divergences gives
Therefore gives and . As a result, the squared Hellinger divergence down-weights the influence of large ratios of with respect to smaller ones and also down-weights these ratios when is small. In fact, there is a trade-off here: moving closer to 1 increase away from 1, thus increasing the amount that ratios of are down-weighted for small . However as moves closer to 1, also draws closer to 1, which reduced the amount ratios of are down weighted with respect to smaller ones. Once again we consider trading these two off against each other to be too greater task for the DM to consider. We therefore do not consider the S-Hellinger divergence again here and instead for brevity consider only the choice between the one parameter alpha and density power divergences. In any case, bridging the gap between the Hellinger and the KL-divergence, as the alpha-divergence does, appears to make more sense in an inferential setting than bridging the gap between the Hellinger and squared divergence.
The bullet points below summaries the reasons, introduced above, for which a DM might prefer a specific divergence. Once again we stress that we consider the choice of divergence to be a subjective and context specific judgement to be made by the DM similarly to their prior and model.
KL-divergence: suitable if the tail behaviour of the model is of high importance or there is the belief that the tails of the model are exactly correctly specified. Minimising the KL-divergence also makes best use out of the data so may be suitable if the sample size is very small and the prior is not necessarily informative.
TV-divergence/Hellinger-divergence: suitable if the data set has many observations and robustness for a decision problem with a bounded loss function is of high importance. The Hellinger-divergence appears to be computationally easier to work with (see Appendix A). Both require accurate density estimation
alpha-divergence: suitable for when a trade-off between robustness and efficiency is considered. Requires access to a density estimation technique.
The density power divergence: suitable again for when a trade-off between robustness and efficiency is considered. Does not require a density estimate but does require to be set carefully so that the data is not completely disregarded by the updating process.
Unfortunately there is no free lunch when it comes to applying any of these methods to complex, high dimensions problems. Minimising the KL-divergence is the most computationally efficient. It provides the possibility of implementing conjugate prior distributions and avoiding the computational burden of producing a density estimate. However as the dimensions and complexity of the problem increases, robustness to model misspecification will become more and more important. Minimising the Hellinger-divergence, TV-divergence and alpha-divergence requires an estimate of the data generating density which is certainly not straightforward for high dimensional data (see the discussion in Section4.5). However once this has been estimated then can just be selected based on how important tail misspecifications are, with a guarantee on some reasonable efficiency. Minimising the density power divergence has the computational advantage of not requiring an estimate of the data generating density. However when the dimension of the problem is high, much more thought needs to be put into specifying the parameter to ensure that the updating does not completely disregard the data. The technology for doing this effectively is still in its infancy. Without the data generating density, the density power divergence requires information on the shape of the data from another form in order to fit efficiently and robustly.
4.5 Density estimation
As has been mentioned before, for the TV, Hellinger and divergence, it is not possible to exactly calculate the loss function associated with any value of and because the data generating density will not be available. In this case, a density estimate of is required to produce an empirical loss function. The Bayesian can consider the density estimate as providing additional information to the likelihood from the data (see Section 3.3.1’s discussion on the likelihood principle), and can thus consider their general Bayesian posterior inferences to be made conditional upon the density estimate as well as the data. The general Bayesian update is a valid update for any loss function, and therefore conditioning on the density estimate as well as the data still provides a valid posterior. However, how well this empirical loss function approximates the exact loss function associated with each divergence ought to be of interest. The exact loss function is of course the loss function the DM would prefer to use having made the subjective judgement to minimise that divergence. If the density estimate is consistent to the data generating process, then provided the sample size is large the density estimate will converge to the data generating density, and the empirical loss function will then correctly approximate the loss function associated with that divergence. It is this fact that ensures the consistency of the posterior estimates of the minimum Hellinger posterior Hooker Vidyashankar (2014).
we adopt this practice using a Radial Basis Function (RBF) kernel for simplicity and convenience. However we note thatSilverman (1986) identifies practical drawbacks of FKDEs, including their inability to correctly capture the tails of the data generating process, whilst not over smoothing the centre, as well as the number of data points required to fit these accurately in high dimensions. In addition to this Tamura Boos (1986) observe that the variance of the FKDE when using a density kernel in high dimensions lead to asymptotic bias in the estimate that is larger than . Alternatives include using a kernel with better mean-squared error properties (Epanechnikov, 1969; Rosenblatt ., 1976), variable width adaptive KDEs (Abramson, 1982), which Hwang . (1994) show to be promising in high dimensions, piecewise-constant (alternatively tree based) density estimation (Ram Gray, 2011; Lu ., 2013) which are also promising in high dimensions, or a fully Bayesian Gaussian process as is recommended in Li Dunson (2016).
In this section we aim to illustrate some of the qualitative features associated with conducting inference targeting the minimisation of the different divergences identified in Section 4. Throughout these experiments stan (Carpenter ., 2016) is used to produce fast and efficient samples from the general Bayesian posteriors of interest.
5.1 -open robustness
5.1.1 Simple inference
The experiments below demonstrate the robustness of the general Bayesian update targeting KL-divergence (red), Hellinger-divergence (blue), TV-divergence (pink), alpha-divergence (green) and power-divergence (orange) (in future these may be referred to as KL-Bayes, Hell-Bayes, TV-Bayes, alpha-Bayes and power-Bayes respectively). For illustrative purposes we have fixed for the alpha-divergence and for the power-divergence. Figure 1
plots the posterior predictive originating from fitting a normal modelto two simulated data sets and one ‘real’ data set. For the first data set data points were simulated from a normal -contamination distribution
for the second
were simulated from a Student’s t-distribution with degrees of freedom 4 and the real data set, tracks1, was taken as the first variable from the ‘Geographical Original of Music Data Set’222downloaded from https://archive.ics.uci.edu/ml/datasets/Geographical+Original+of+Music containing
data points, where a KDE of the data appeared to be approximately normally distributed. Prior distributionsand were used for all examples.
It is easy to see from the plots in figure 1 that minimising the KL-divergence, requires that the density of the outlying contamination centred at to be captured correctly. This is at the expense of capturing the density of the remaining 99% of the data, this is especially clear in the log-density plot. The TV-Bayes, alpha-Bayes and power-Bayes appear to correctly capture the distribution for 99% of the data. The boundedness of the TV-divergence and the alpha-divergence means that the contamination is not allowed to unduly affect the analysis, while the power-divergence is able to down-weight the influence of the contaminated points as they are ascribed low predicted probability under the model for 99% of the data. The Hell-Bayes appears to fit too small a variance even for 99% of the data. This is due to the small sample efficiency problems which will be discussed in Appendix A.
The Student’s t-distribution has consistently heavier tails than the normal distribution. Thus the top right hand plot of figure 1 more clearly demonstrates the importance placed on tail misspecification by each method. The KL-divergence fits the largest variance to correctly capture these tails for the most extreme tail observations, the alpha-divergence is able to fit a slight smaller variance because of its bounded nature. However the variance of the alpha-Bayes predictive is still larger than the other methods produce because of the greater convexity of the scoring function depicted in figure 4. Lastly Hell-Bayes, TV-Bayes and power-Bayes place the least weight on tail misspecification and are therefore able to fit a smaller variance and produce a predictive more closely resembling the data generating process for the majority of the data.
The importance given to tail misspecification is also evident in the ‘tracks’ example. Here there is an ordering from TV, power, Hellinger, alpha and KL divergence on both bias towards the right tail and on the size of the predictive variance, in response to the slight positive skew of the KDE of the data. Here it is clear to see that the power, TV, Hellinger and power divergence produce much better fits of the majority of the data than the other methods do. Lastly, the bottom right plot demonstrates how several possible alternative models to the Gaussian perform on the ‘tracks1’ data set333the data set was transformed by adding
to every value in order to make the data strictly positive so the gamma and log-Normal distributions could be applied
, when updating using the KL-divergence. This shows that a Gaussian distribution was actually the best fit for the bulk of the data, and the poor fit achieved is down to the importance placed on tail misspecification by the estimating procedure rather than the model selected.
5.2 Regression under heteroscedasticity
In addition to the simple inference examples above we consider how changing the divergence can affect inference in a regression example. From the previous examples we can see that when the tails of the model are misspecified the KL-divergence minimising predictive distribution inflates the variance of the fitted model to ensure no observations are predicted with low probability. Further, placing large weight on tail observations, which occur with low probability, creates large variance across repeat sampling. This is exactly why parameter estimates in linear regression under heteroscedasticity errors have a large variance.
Bayesian minimum divergence inference places less weight on tail observations: it is thus able to produce inferences with a smaller predictive variance and a smaller variance across repeat sampling. While repeat sampling and estimation variance are not problems in the Bayesian paradigm these results do show that traditional Bayesian inference can be somewhat imprecise when the tails are misspecified which is clearly undesirable when conditioning on observed data.
In order to demonstrate this we simulated data points with repeats from the following heteroscedastic linear model
For the experiments we simulated and each were held constant across experiments. Figure 2 plots one realisation from this heteroscedastic linear model.
We then conducted general Bayesian updating using the five divergences mentioned above with priors and . The mean squared errors (MSE) for the posterior means of the parameters , the MSE for the predictive means on a test set of size 100 simulated from the model without error and the predictive mean variances are presented in Table 1 below. In order to apply the Hellinger, TV and alpha divergences to a regression problem an estimate of the conditional density of the response given the covariates is required. We follow authors Hooker Vidyashankar (2014) and implement conditional KDEs to approximate the true data generating density. For simplicity, the familiar two-stage bandwidth estimation process of Hansen (2004) was used to find the optimal bandwidth parameters.
The bottom of table one demonstrates that the alternative divergences appear to be learning a smaller predictive variance than the KL-Bayes does under heteroscedastic errors. Under divergences alternative to the KL-divergence this is no longer an estimate of the variance of the responses in the data set, given the covariates. This should rather be interpreted predictively as the variance of the predictive distribution which is closest to the data generating distribution in terms of that alternative divergence. Therefore fitting a smaller variances demonstrates that the other divergences are placing more importance on fitting the majority of the data rather than just the outliers.
The top of table 1 illustrates the impact fitting a large variance has on the parameter estimates of the mean function. Placing less influence on outliers allows all of the alternative divergences to produce more precise estimates of the parameters of the underlying linear relationship. This then leads to better performance when predicting the test set. Clearly the errors for all of the methods will increase as increases as the same amount of data is used to estimate more parameters. However it is clear that the errors under the KL-divergence are rising more rapidly. By being less sensitive to the error distribution the alternative divergences are better able to capture the true underlying process.
5.3 Time series analysis
In order to further demonstrate how inflating the variance by targeting the KL-divergence under misspecification can damage inference, we consider a less trivial time series example. We simulate from an auto-regressive process of order L (AR(L)), and then consider additive independent generalised auto-regressive conditionally heteroscedastic of order (1,1) (GARCH(1,1)) errors, with
where , , . GARCH processes are used to model non-stationary, chaotic time series where the variance of the process depends on the magnitude and sign of the previous observations of the process. Eliciting a GARCH process from a DM is a difficult task. It is far from obvious how this GARCH process behaves as a function of its parameters and selecting a lag length for the AR process as well as two lag lengths for the GARCH Process increases the complexity of the model selection problem. Therefore it seems conceivable that a DM could want to fit a simple AR process to noisy time series data in order to investigate the underlying process. One situation where this may be desirable is in financial time series applications where large amounts of data can arrive at a very high frequency.
In order to investigate how the minimum divergence methods perform in this scenario we simulated 3 data sets with , and fitted and AR(L) process to these, where was chosen to match the underlying AR process. The 3 data sets were given by
an AR(3) with
an AR(1) with with GARCH(1,1) errors
an AR(1) with with GARCH(1,1) errors
The plots in figure 3 demonstrate the one-step ahead posterior predictive performance of the minimum divergence posteriors on a test set , simulated from the underlying AR process. Under misspecification we show only the inference under the Hell-Bayes to avoid cluttering the plots, the other minimum divergence posteriors perform similarly. We use the same priors as the regression example and once again conditional density estimates were used for the Hellinger, TV and alpha divergences.
The left hand plots demonstrate that when the model is correctly specified the minimum divergence posteriors produce similar time series inference to the KL-divergence. This is most easily seen in the lower plot which shows the difference in the squared prediction errors between the KL-Bayes and the Hell-Bayes is mostly around 0 and table 2 which demonstrates the root mean squared error across the test set is the same under both methods. The middle plots demonstrate how the KL-Bayes and the Hell-Bayes perform under ‘extreme’ volatility in the error distribution. The tails of the model being very poorly specified causes the KL-Bayes posterior to fit a huge variance with the average posterior predictive variance across the test data set being slightly above 26. Fitting such a high variance makes the inference on the lag parameters insensitive to the data. As a result the underlying trend in the data is completely missed. In contrast the Hell-Bayes posterior predictive distributions have a much smaller variance of around 7. This allows the inference on the lag parameters to be much more sensitive to the underlying AR process. Clearly the Hell-Bayes is unable to exactly fit the truth as the model is misspecified. However it does a much better job of capturing the broad features of the underlying dependence between time points. This is clearly demonstrated by the considerably lower root mean squared predictive error showed in table 2 and by the error differences plot being mostly large and positive. The right hand plots demonstrates how the KL-Bayes and Hell-Bayes perform when the error distribution is less volatile. When the volatility is smaller the KL-Bayes predictive variance is also smaller. Therefore the inference is more sensitive to the underlying trend in the data than in the previous example. However the true dependence in the data is still under estimated relative to the Hell-Bayes, this is again demonstrated in table 2 and the error difference plot. Once again this shows that the way in which the KL-Bayes deals with misspecification, increasing the predictive variance, can mask some of the underlying trends in the data which can be discovered by other methods. Table 2 plots the root mean squared errors (RMSE) for the KL-Bayes and Hell-Bayes on the test set to quantify their correspondence to reality.
|RMSE||Correctly Specified||High Volatility||Low Volatility|
5.4 Application to high dimensions
The examples is this paper only demonstrate the performance of these minimum divergence techniques for relatively small dimensional problems so as to clearly demonstrate the effect that tail misspecifications can have. However, it is as the dimension and complexity of the problem increases that these methods become more and more important, this can be put down to two aspects. The first of these is that outliers or highly influential contaminant data-points become hard to identify in high dimensions. In our examples it is clear from looking at the KDE or histogram of the data that there are going to be outlying observations, but in many dimensions visualising the data in this way is not possible. In addition to this automatic methods for outlier detection struggle in high dimensions(Filzmoser ., 2008).
The second factor in requiring robust inference in high dimensions, is that not only are outliers harder to identify, they are more likely to occur. The occurrence of outliers could be reinterpreted as indicating that the DM’s belief model is misspecified in the tails. The fact that these occur should be unsurprising. We have already discussed specifying beliefs about tail behaviour requires thinking about very small probabilities which is known to be difficult to do Winkler Murphy (1968); O’Hagan . (2006), and often routine assumptions (for example Gaussianity) may be applied. As the dimension of the space increase the tails of the distribution account for a greater proportion of the overall density, increasing the chance of seeing observations that differ from the practitioners beliefs.
This article uses general Bayesian updating Bissiri . (2016) in order to theoretically justify a Bayesian update that targets the parameters of a model that minimise a statistical divergence to the data generating process that is not the KL-divergence. When the -open world is considered, moving away from targeting the minimisation of the KL-divergence can provide an important tool in order to robustify a statistical analysis. The desire for robustness ought to only increase as increasingly bigger models are built to approximate more complex real world processes. This paper provides the statistical practitioner with the principled justification to select the divergence they use for their analysis in a subjective manner allowing them the potential to make more useful predictions from their best approximate belief model.
As it stands however more work is required to advise on the selection of the divergence used for the updating. Ways to more clearly articulate the impact of choosing a certain divergence to a DM needs further research along with ways to help guide the choice of any hyperparameters associated with the divergence. Further experimentation with complex real world data sets is also required to analyse how this robustness-efficiency trade-off associated with the selected divergence manifests itself in practice.
Another issue not addressed in this paper is how to tailor computational algorithms to the inferences we describe here. There exists a vast literature on optimising MCMC algorithms to sample from traditional Bayesian posteriors and in order to fully take advantage of the subjectivity this paper allows a statistician, a whole new class of computational algorithms tailored to different divergences may be required. In addition several of the divergences mentioned in this paper require a density estimate of the underlying process and further research into effectively doing this for complex high dimensional datasets can only improve the performances of these methods for real world problems.
We are indebted to 3 anonymous referees in helping us revise the structure and examples in this article. The authors would also like to thank Jeremias Knoblauch for his helpful discussions when revising the article. The first author acknowledges his research fellowship through the OxWaSP doctoral programme, funded by EPSRC.
- Abramson (1982) abramson1982bandwidthAbramson, IS. 1982. On bandwidth variation in kernel estimates-a square root law On bandwidth variation in kernel estimates-a square root law. The annals of Statistics1217–1223.
The exponential convergence of posterior probabilities with implications for Bayes estimators of density functions The exponential convergence of posterior probabilities with implications for bayes estimators of density functions.Department of Statistics, University of Illinois.
- Basu . (1998) basu1998robustBasu, A., Harris, IR., Hjort, NL. Jones, M. 1998. Robust and efficient estimation by minimising a density power divergence Robust and efficient estimation by minimising a density power divergence. Biometrika853549–559.
- Basu . (2011) basu2011statisticalBasu, A., Shioya, H. Park, C. 2011. Statistical inference: the minimum distance approach Statistical inference: the minimum distance approach. CRC Press.
- Beran (1977) beran1977minimumBeran, R. 1977. Minimum Hellinger distance estimates for parametric models Minimum hellinger distance estimates for parametric models. The Annals of Statistics445–463.
- Berger . (1994) berger1994overviewBerger, JO., Moreno, E., Pericchi, LR., Bayarri, MJ., Bernardo, JM., Cano, JA.others 1994. An overview of robust Bayesian analysis An overview of robust bayesian analysis. Test315–124.
- Bernardo Smith (2001) bernardo2001bayesianBernardo, JM. Smith, AF. 2001. Bayesian theory. Bayesian theory. IOP Publishing.
- Bissiri . (2016) bissiri2016generalBissiri, P., Holmes, C. Walker, SG. 2016. A general framework for updating belief distributions A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
- Box (1980) box1980samplingBox, GE. 1980. Sampling and Bayes’ inference in scientific modelling and robustness Sampling and bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General)383–430.
- Carpenter . (2016) carpenter2016stanCarpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M.Riddell, A. 2016. Stan: A probabilistic programming language Stan: A probabilistic programming language. Journal of Statistical Software20.
- Celeux . (2017) celeux2017someCeleux, G., Jewson, J., Josse, J., Marin, JM. Robert, CP. 2017. Some discussions on the Read Paper” Beyond subjective and objective in statistics” by A. Gelman and C. Hennig Some discussions on the read paper” beyond subjective and objective in statistics” by a. gelman and c. hennig. arXiv preprint arXiv:1705.03727.
- Cichocki . (2011) cichocki2011generalizedCichocki, A., Cruces, S. Amari, Si. 2011. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy131134–170.
- Cressie Read (1984) cressie1984multinomialCressie, N. Read, TR. 1984. Multinomial goodness-of-fit tests Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society. Series B (Methodological)440–464.
- Csisz . (1967) csisz1967informationCsisz, I. . 1967. Information-type measures of difference of probability distributions and indirect observations Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar.2299–318.
- Dawid . (2016) dawid2016minimumDawid, AP., Musio, M. Ventura, L. 2016. Minimum scoring rule inference Minimum scoring rule inference. Scandinavian Journal of Statistics431123–138.
- Devroye Gyorfi (1985) devroye1985nonparametricDevroye, L. Gyorfi, L. 1985. Nonparametric density estimation: the L1 view Nonparametric density estimation: the l1 view ( 119). John Wiley & Sons Incorporated.
- Epanechnikov (1969) epanechnikov1969nonEpanechnikov, VA. 1969. Non-parametric estimation of a multivariate probability density Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications141153–158.
- Filzmoser . (2008) filzmoser2008outlierFilzmoser, P., Maronna, R. Werner, M. 2008. Outlier identification in high dimensions Outlier identification in high dimensions. Computational Statistics & Data Analysis5231694–1711.
- Gelman Hennig (2015) gelman2015beyondGelman, A. Hennig, C. 2015. Beyond subjective and objective in statistics Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society).
- Ghosh Basu (2016) ghosh2016robustGhosh, A. Basu, A. 2016. Robust Bayes estimation using the density power divergence Robust bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics682413–437.
- Ghosh Basu (2017) ghosh2017generalGhosh, A. Basu, A. 2017. General Robust Bayes Pseudo-Posterior: Exponential Convergence results with Applications General robust bayes pseudo-posterior: Exponential convergence results with applications. arXiv preprint arXiv:1708.09692.
- Ghosh . (2017) ghosh2017generalizedGhosh, A., Harris, IR., Maji, A., Basu, A., Pardo, L. . 2017. A generalized divergence for statistical inference A generalized divergence for statistical inference. Bernoulli234A2746–2783.
- Goldstein (1990) goldstein1990influenceGoldstein, M. 1990. Influence and belief adjustment Influence and belief adjustment. Influence Diagrams, Belief Nets and Decision Analysis143–174.
- Goldstein (1999) goldstein1999bayesGoldstein, M. 1999. Bayes linear analysis Bayes linear analysis. Wiley StatsRef: Statistics Reference Online.
- Goldstein . (2006) goldstein2006subjectiveGoldstein, M. . 2006. Subjective Bayesian analysis: principles and practice Subjective bayesian analysis: principles and practice. Bayesian Analysis13403–420.
- Greco . (2008) greco2008robustGreco, L., Racugno, W. Ventura, L. 2008. Robust likelihood functions in Bayesian inference Robust likelihood functions in bayesian inference. Journal of Statistical Planning and Inference13851258–1270.
- Grünwald Dawid (2004) grunwald2004gameGrünwald, PD. Dawid, AP. 2004. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. Annals of Statistics1367–1433.
- Hampel . (2011) hampel2011robustHampel, FR., Ronchetti, EM., Rousseeuw, PJ. Stahel, WA. 2011. Robust statistics: the approach based on influence functions Robust statistics: the approach based on influence functions ( 114). John Wiley & Sons.
- Hansen (2004) hansen2004nonparametricHansen, BE. 2004. Nonparametric conditional density estimation Nonparametric conditional density estimation. Unpublished manuscript.
- Hooker Vidyashankar (2014) hooker2014bayesianHooker, G. Vidyashankar, AN. 2014. Bayesian model robustness via disparities Bayesian model robustness via disparities. Test233556–584.
- Huber Ronchetti (1981) huber1981robustHuber, PJ. Ronchetti, E. 1981. Robust statistics, Series in probability and mathematical statistics. Robust statistics, series in probability and mathematical statistics. John Wiley, New York.
- Hwang . (1994) hwang1994nonparametricHwang, JN., Lay, SR. Lippman, A. 1994. Nonparametric multivariate density estimation: a comparative study Nonparametric multivariate density estimation: a comparative study. IEEE Transactions on Signal Processing42102795–2810.
- Lazer . (2014) lazer2014parableLazer, D., Kennedy, R., King, G. Vespignani, A. 2014. The parable of Google Flu: traps in big data analysis The parable of google flu: traps in big data analysis. Science34361761203–1205.
- Li Dunson (2016) li2016frameworkLi, M. Dunson, DB. 2016. A framework for probabilistic inferences from imperfect models A framework for probabilistic inferences from imperfect models. arXiv preprint arXiv:1611.01241.
- Lu . (2013) lu2013multivariateLu, L., Jiang, H. Wong, WH. 2013. Multivariate density estimation by bayesian sequential partitioning Multivariate density estimation by bayesian sequential partitioning. Journal of the American Statistical Association1085041402–1410.
- Miller Dunson (2015) miller2015robustMiller, JW. Dunson, DB. 2015. Robust Bayesian inference via coarsening Robust bayesian inference via coarsening. arXiv preprint arXiv:1506.06101.
- O’Hagan . (2006) o2006uncertainO’Hagan, A., Buck, CE., Daneshkhah, A., Eiser, JR., Garthwaite, PH., Jenkinson, DJ.Rakow, T. 2006. Uncertain judgements: eliciting experts’ probabilities Uncertain judgements: eliciting experts’ probabilities. John Wiley & Sons.
- Owen (1991) owen1991empiricalOwen, A. 1991. Empirical likelihood for linear models Empirical likelihood for linear models. The Annals of Statistics1725–1747.
- Ram Gray (2011) ram2011densityRam, P. Gray, AG. 2011. Density estimation trees Density estimation trees. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining Proceedings of the 17th acm sigkdd international conference on knowledge discovery and data mining ( 627–635).
- Rosenblatt . (1976) rosenblatt1976maximalRosenblatt, M. . 1976. On the maximal deviation of -dimensional density estimates On the maximal deviation of -dimensional density estimates. The Annals of Probability461009–1015.
- Sason Verdú (2015) sason2015boundsSason, I. Verdú, S. 2015. Bounds among f-divergences Bounds among f-divergences. submitted to the IEEE Trans. on Information Theory.
- Shun-ichi (2012) shun2012differentialShun-ichi, A. 2012. Differential-geometrical methods in statistics Differential-geometrical methods in statistics ( 28). Springer Science & Business Media.
- Silverman (1986) silverman1986densitySilverman, BW. 1986. Density estimation for statistics and data analysis Density estimation for statistics and data analysis ( 26). CRC press.
- J. Smith (1995) smith1995bayesianSmith, J. 199510. Bayesian Approximations and the Hellinger metric. Bayesian approximations and the hellinger metric. Unpublished manuscript
- JQ. Smith (2010) smith2010bayesianSmith, JQ. 2010. Bayesian decision analysis: principles and practice Bayesian decision analysis: principles and practice. Cambridge University Press.
- Tamura Boos (1986) tamura1986minimumTamura, RN. Boos, DD. 1986. Minimum Hellinger distance estimation for multivariate location and covariance Minimum hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association81393223–229.
- Walker (2013) walker2013bayesianWalker, SG. 2013. Bayesian inference with misspecified models Bayesian inference with misspecified models. Journal of Statistical Planning and Inference143101621–1633.
- Watson . (2016) watson2016approximateWatson, J., Holmes, C. . 2016. Approximate models and robust decisions Approximate models and robust decisions. Statistical Science314465–489.
- Winkler Murphy (1968) winkler1968evaluationWinkler, RL. Murphy, AH. 1968. Evaluation of subjective precipitation probability forecasts Evaluation of subjective precipitation probability forecasts. Proceedings of the first national conference on statistical meteorology Proceedings of the first national conference on statistical meteorology ( 148–157).
- Zellner (1988) zellner1988optimalZellner, A. 1988. Optimal information processing and Bayes’s theorem Optimal information processing and bayes’s theorem. The American Statistician424278–280.
Appendix A M-closed efficiency
The examples in Section 5, demonstrate how minimising an alternative divergence to the KL-divergence can lead to superior robustness to tail misspecifications. We next illustrate that by placing less importance on tail misspecifications in order to gain improved robustness, the DM must trade-off a decrease in efficiency in the case when the model class does in fact contain the data generating process. Minimising the KL-divergence uses Bayes rule and thus conditions on the data coming from the model. As a result minimising the KL-divergence is bound to perform the best when the model class contains the data generating distribution. Here we illustrate this but also demonstrate the reassuring fact that the MDE methods perform comparatively even in this situation.
In order to examine the frequentist efficiency, the observed mean squared error (MSE) , over repeats of a simulated experiment are examined. The MSE’s are examined on two examples and when fitting the model with prior and for . The prior for is centred on the data while the prior for is not. Table 3 plots the observed MSEs over replications444The stan output from a number of these experiments associated with the TV divergence suggested that the sample from the posterior was not sufficient to use for posterior inference, for the purpose of this preliminary analysis these experiments were disregarded here, but this suggests there exists computational issues associated with the TV-Bayes..
Table 3 demonstrates several interesting points about these different methods. Firstly it appears as though the KL-Bayes, Hell-Bayes and alpha-Bayes perform very similarly when , while the power-Bayes and TV-Bayes performs marginally worse. As we should expect, for the KL-Bayes appears to be optimal for both priors with the -Bayes appearing to perform the next best. Table 3 also shows the differing impacts the prior has on the different Bayesian updates. The KL-Bayes is least affected by the prior with the -Bayes and the TV-Bayes the next least. Authors Hooker Vidyashankar (2014) recognised that the boundedness of the Hell-Bayes score function compared with the KL-Bayes score function, may cause problems such as this. The -Bayes score function is larger than the Hell-Bayes score function so our findings would be consistent with this. The power-Bayes score function with is smaller than the Hell-Bayes score function when . This explains why the power-Bayes is impacted more by the prior, and provides a likely explanation of why the power-Bayes appears to perform so well when the prior is specified around the data. For small values of , which occur when the variance of is 100, the absolute size of the TV-Bayes score function is generally larger than the Hell-Bayes score, again explaining why the TV-Bayes performs better than the Hell-Bayes under a poorly specified prior.
Examining whether the estimates of were generally above or below the data generating parameter demonstrates how these different methods learn, see table 4. For all sample sizes the KL-Bayes appears to over or under estimate the true variance in equal proportions over the 50 repeats, while the Hell-Bayes and alpha-Bayes, to a lesser extent, generally under predict the true variance.
Figure 4 compares the scoring functions used in the general Bayesian update targeting the KL, Hellinger and alpha divergences. As increases, the upper bound on the score associated with the alpha-divergence increases and the scoring functions become more convex, tending towards the logarithmic score when . In contrast, the score associated with the Hellinger-divergence is closer to being linear. We believe that this is responsible for causing the smaller variance estimates when minimising the Hellinger-divergence and alpha-divergence when . The closer the score is to being linear the more similar the penalty is for over and under predicting the probability of an observation when compared with reality. The bounded nature of the scores prevents the penalty for under predicting being too high, and therefore more posterior predictive mass is placed near the MAP of the posterior predictive distribution. In contrast, a score function with greater convexity will penalise under prediction compared with the data generating density to a greater extent spreading the posterior mass out further. However, one must keep in mind that it is the sever nature of the penalty for under prediction incurred by the KL-divergence that renders it non-robust. On the other hand the TV-Bayes and the power-Bayes generally over predict the variance. For the power-Bayes this appears to be because the second term of the power divergence score is
for a multivariate Gaussian model, and this is decreased when the variances are made large. As was established in Section 4.3, the power-Bayes with decreases the influence observations with small likelihood under the model have on the inference. In the example above the data has a variance of and therefore the likelihood of each observation is very small. As a result the data is then excessively down-weighted by the power-Bayes when . When the sample size is small, a small increase in the variance may only decrease the observed likelihood term of the power-divergence score by a fraction and may increase the term in equation (38) by more, resulting in a variance estimate that is too large. The score function for the TV-Bayes is the only score function considered here that is not monotonic in the predicted probability of each observation. While under the other scores, the score for each individual observation can be increased by predicting that observation with greater probability, this is not the case for the TV-Bayes. We believe this causes the TV-Bayes to produces variance estimates that are too large. We note that all of the scores remain proper for large samples and that these are just observations about how the different methods perform for small samples sizes when the model is correct.