# Detecting conflicting summary statistics in likelihood-free inference

Bayesian likelihood-free methods implement Bayesian inference using simulation of data from the model to substitute for intractable likelihood evaluations. Most likelihood-free inference methods replace the full data set with a summary statistic before performing Bayesian inference, and the choice of this statistic is often difficult. The summary statistic should be low-dimensional for computational reasons, while retaining as much information as possible about the parameter. Using a recent idea from the interpretable machine learning literature, we develop some regression-based diagnostic methods which are useful for detecting when different parts of a summary statistic vector contain conflicting information about the model parameters. Conflicts of this kind complicate summary statistic choice, and detecting them can be insightful about model deficiencies and guide model improvement. The diagnostic methods developed are based on regression approaches to likelihood-free inference, in which the regression model estimates the posterior density using summary statistics as features. Deletion and imputation of part of the summary statistic vector within the regression model can remove conflicts and approximate posterior distributions for summary statistic subsets. A larger than expected change in the estimated posterior density following deletion and imputation can indicate a conflict in which inferences of interest are affected. The usefulness of the new methods is demonstrated in a number of real examples.

## Authors

• 2 publications
• 3 publications
• 20 publications
• 9 publications
• ### Non-linear regression models for Approximate Bayesian Computation

Approximate Bayesian inference on the basis of summary statistics is wel...
09/24/2008 ∙ by M. G. B. Blum, et al. ∙ 0

• ### On deletion diagnostic statistic in regression

A brief review about choosing a normalizing or scaling matrix for deleti...
12/28/2020 ∙ by Myung Geun Kim, et al. ∙ 0

• ### A Comparison of Likelihood-Free Methods With and Without Summary Statistics

Likelihood-free methods are useful for parameter estimation of complex m...
03/03/2021 ∙ by Christopher Drovandi, et al. ∙ 0

• ### Towards constraining warm dark matter with stellar streams through neural simulation-based inference

A statistical analysis of the observed perturbations in the density of s...
11/30/2020 ∙ by Joeri Hermans, et al. ∙ 5

• ### On Model Selection with Summary Statistics

Recently, many authors have cast doubts on the validity of ABC model cho...
03/28/2018 ∙ by Erlis Ruli, et al. ∙ 0

• ### Robust and integrative Bayesian neural networks for likelihood-free parameter inference

State-of-the-art neural network-based methods for learning summary stati...
02/12/2021 ∙ by Fredrik Wrede, et al. ∙ 12

• ### Parameter inference and model comparison using theoretical predictions from noisy simulations

When inferring unknown parameters or comparing different models, data mu...
09/21/2018 ∙ by Niall Jeffrey, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Often scientific knowledge relevant to statistical model development comes in the form of a possible generative process for the data. Computation of the likelihood is sometimes intractable for models specified in this way, and then likelihood-free inference methods can be used if simulation of data from the model is easily done. A first step in most likelihood-free inference algorithms is to reduce the full data set to a low-dimensional summary statistic, which is used to define a distance for comparing observed and simulated data sets. The choice of this summary statistic must be done carefully, balancing computational and statistical considerations (Blum et al., 2013). Once summary statistics are chosen, Bayesian likelihood-free inference can proceed using a variety of techniques, such as approximate Bayesian computation (ABC) (Marin et al., 2012; Sisson et al., 2018a) synthetic likelihood (Wood, 2010; Price et al., 2018) or regression based approaches (Beaumont et al., 2002; Blum and François, 2010; Fan et al., 2013; Raynal et al., 2018).

Sometimes different components of the summary statistic vector bring conflicting information about the model parameter. This can arise due to model deficiencies that we wish to be aware of, and can complicate summary statistic choice. Furthermore, it can lead to computational difficulties when observed summary statistics are hard to match under the assumed model. We develop some diagnostics for detecting when this problem occurs, based on a recent idea in the interpretable machine learning literature. The suggested diagnostics are used in conjunction with regression-based approaches to likelihood-free inference.

The following example of inference on a Poisson mean, which is discussed in Sisson et al. (2018b), illustrates the motivation for our work. Here the likelihood is tractable, but the example is useful for pedagogical purposes. Suppose data are observed, and that we model as a Poisson random sample with mean . For Bayesian inference we use a prior density for . The gamma prior is conjugate, and the posterior density is , where is the sample mean of . Letting

denote the sample variance of

, both and are point estimates of , since for the Poisson model the mean and variance are equal to ; furthermore, is a minimal sufficient statistic.

Consider a standard rejection ABC method for this problem, which involves generating values from the prior, generating summary statistics under the model conditional on the generated values, and then keeping the values which lead to simulated summary statistics closest to those observed. See Sisson et al. (2018b) for further discussion of basic ABC methods. Consider first the summary statistic , and apply rejection ABC with prior samples and retaining of these in the ABC rejection step. This gives a very close approximation to the true posterior (Figure 1).

The computations were implemented using the default implementation of rejection ABC in the R package abc (Csilléry et al., 2012). Reduction of the full data to involves no loss of information, as is sufficient.

Since we observe values of and , these summaries bring conflicting information about , since they are both point estimates of and take quite different values. If we consider the summary statistic , and apply the same rejection ABC algorithm as before with prior samples and accepted samples, the estimated posterior density changes appreciably – see Figure 1. Although is still sufficient, and the corresponding posterior is the same as the posterior given , the ABC computational algorithm performs poorly because the observed summary statistic is hard to match with a reasonable fixed computational budget. Simultaneously observing and is unusual under the assumed model for any value of the model parameter. This results in accepting parameter samples having simulated summary statistic values far away from the observed value of . Also shown in Figure 1 is the posterior conditional on , which is even further from the true posterior than the posterior conditional on .

The conflicting information in the summary statistics and may indicate a model failure, motivating a more flexible model than the Poisson, such as negative binomial, in which the mean and variance are not required to be equal. Sisson et al. (2018b) considered this example partly as a demonstration of the difficulties which arise in greedy algorithms for summary statistic choice. An analyst following a policy of adding one summary statistic at a time, and stopping when the posterior density no longer changes appreciably, would incorrectly judge to be a better summary statistic than . Our example illustrates that if different parts of a summary statistic vector are conflicting, then detecting this can be insightful about model failures and can suggest model improvements. This is the motivation for the conflict detection diagnostics developed here.

The structure of the paper is as follows. In the next Section we describe regression approaches to likelihood-free inference. Section 3 then discusses a recent suggestion in the literature on interpretable machine learning, which has the goal of explaining how a flexible classifier makes the class prediction it does for a certain observation. This method is based on deletion of part of a feature vector, followed by imputation of what was deleted and assessment of how evidence for the observed class changes following the deletion and imputation process. In regression approaches to computing likelihood-free inferences, where regression models are considered with simulated parameters as the response values and simulated summary statistics as features, we show that a similar approach can be used to estimate the changes in a posterior distribution under sequential updating for summary statistic subsets. If the change in the posterior density is larger than expected when a subset of summary statistics is deleted, this may indicate that different parts of the summary statistic vector carry conflicting information about the parameter.

In Section 4, our diagnostics are formalized and placed in the broader framework of Bayesian model checking. We make connections with the literature on prior-data conflict, suggest a method of calibration, and also review related literature on model checking for likelihood-free methods. Section 5 considers three examples. First we revisit the pedagogical example above and show how our diagnostics apply in that simple case, before considering two further real examples. The first real example considers data on stereological extremes, where the model is not flexible enough to fit well in both the lower and upper tails. The second example concerns an ecological time series model. An analysis of these data using summary statistics derived from an auxiliary threshold autoregressive model is first considered, where the summary statistics can be thought of as capturing the different dynamic behaviour of the time series at high and low levels. Conflicts between summary statistic subsets for different regimes are insightful for understanding deficiencies in the original model. An analysis of these data without summary statistics is also considered, where the regression model is based on a convolutional neural network, and the feature vector consists of the whole time series. Here we explore temporally localized parts of the series not fitted well by the model. Section 6 gives some concluding discussion.

## 2 Regression approaches to likelihood-free inference

Consider a Bayesian inference problem with data

and a parametric model for

with density , with parameter . The prior density for is denoted , with corresponding posterior density , where denotes the observed value of . Suppose that computation of the likelihood is intractable, so that likelihood-free inference approaches are needed to approximate the posterior distribution. In this case, we can consider summary statistics , with , and writing for the sampling density of , we use likelihood-free inference methods to approximate .

The posterior density is the conditional density of given in the joint Bayesian model for specified by . Since regression is conditional density estimation, we can fit regression models to samples , from in order to estimate for any . The responses are , and the features are . The regression predictive density given , denoted , is an estimate of .

There are many flexible regression approaches that have been applied to computation of likelihood-free inferences, such as quantile regression forests

(Meinshausen, 2006; Raynal et al., 2018), copula methods (Li et al., 2017; Fan et al., 2013; Klein et al., 2019), neural approaches of various kinds (Papamakarios and Murray, 2016; Papamakarios et al., 2019) and orthogonal series methods (Izbicki and Lee, 2017; Izbicki et al., 2019) among others. In our later applications to the detection of conflicting information in summary statistics we will be concerned with the effects of conflict on inference for scalar functions of the parameter, and so only univariate regression models are needed for estimating one-dimensional marginal posterior distributions. To save notation we will use the notation both for the model parameter as well as a one-dimensional function of interest of it, where the meaning intended will be clear from the context.

In checking for summary statistic conflicts, we will partition as and we ask if the information about in conflicts with that in . We write . Our conflict diagnostic will compare an estimate of to an estimate of . If the change in the estimated posterior distribution is large, this may indicate that the summary statistics carrying conflicting information about the parameter . We must define what large means, and this is an issue of calibration of the diagnostics.

To overcome the computational difficulties of estimating the posterior distributions, we will first consider a likelihood-free regression method with features to compute . Then based on this same regression, the method of Zintgraf et al. (2017) described in Section 3 will be used to approximate by deletion and multiple imputation of . This can be done without refitting the regression. This last point is important, because sometimes fitting very flexible regression models can be computationally burdensome. For example, in the application of Section 5.2, we consider a training sample of observations, where we fit a quantile regression forest in which algorithm fitting parameters are tuned by cross-validation. Applying the method of Zintgraf et al. (2017) for approximating the posterior density given , using the fitted regression with features , is much less computationally demanding than refitting the regression with features to approximate the subset posterior density directly. Since we may wish to consider many different choices of and for diagnostic purposes, this is another reason to avoid the approach of refitting the regression model with different choices of the features.

Although the main idea of our method is to compare estimates of and , making this precise will involve the development of some further background. We describe the method of Zintgraf et al. (2017) next, and then in Section 4 we formalize our diagnostics and consider connections to the existing literature on Bayesian model checking.

## 3 Interpretable machine learning using imputation

We describe a method developed in Zintgraf et al. (2017)

for visualizing classifications made by a deep neural network. Their approach builds on earlier work of

Robnik-Šikonja and Kononenko (2008)

. The method deletes part of a feature vector, and then replaces the deleted part with an imputation based on the other features. The change in the observed class probability is then examined to measure of how important the deleted features were to the classification made. To avoid repetition, the discussion below will be concerned with a general regression model where the response variable can be continuous or discrete, although

Zintgraf et al. (2017) deals only with the case where the response is a discrete class label. It is the case of regression models with a continuous response that is relevant to our discussion of likelihood-free inference.

Suppose that , are response and feature vector pairs in some training set of observations. The feature vectors are -dimensional, written as . We assume the responses are scalar valued, although extension to multivariate responses is possible. Write and so that is an -vector and is an matrix. We consider Bayesian inference in this regression model with a prior density for , and likelihood . Note the different notation here to Section 2, where we considered a likelihood-free inference problem with data and parameter . Here and are the responses and features in the regression, and in likelihood-free regression the features will be simulated summary statistics, whereas the responses are simulated values of the parameters , hence the different notation. More precisely, in the likelihood-free problems of interest to us, and in the notation of Section 2, and , .

The posterior density of in the regression model is

 p(θ|Y,X) ∝p(θ)p(Y|θ,X).

For some feature vector , and corresponding value of interest for the response , we can compute the predictive density of given as

 p(y∗|x∗,Y,X) =∫p(y∗|θ,x∗)p(θ|Y,X)dθ,

where is the likelihood term for and it is assumed that is conditionally independent of given and .

Suppose that some subector of is not observed, say the th component . Assume that the features are modelled as independent and identically distributed from a density , where the parameters are disjoint from , and that and are independent in the prior. We write for with deleted. Then under the above assumptions,

 p(y∗|x∗−i,X,Y) =∫p(y∗|x∗,Y,X)p(x∗i|x∗−i,X)dx∗i, (1)

where

 p(x∗i|x∗−i,X) =∫p(x∗i|x∗−i,λ)p(λ|X)dλ.

Equation (1) says that to get we can average over the distribution of where is imputed from . In the discussion above only deletion of the th component of has been considered, but the generalization to removing arbitrary subsets of is immediate. In the likelihood-free problems discussed later, the expression (1) will give us a way of estimating the posterior distribution of based on summary statistic subsets using a regression fitted using the full set of summary statistics and a convenient imputation model. We will partition the summary statistic as , and then with and where is the observed value of , the formula (1) will give us a way of estimating the posterior distributon based on the regression predictive distribution using features .

In practice we might approximate as for some point estimate of . For example, we might assume a multivariate normal model for (in which case is the mean and covariance matrix of the normal model) and plug in the sample mean and covariance matrix as the point estimate. This is the approach considered in Zintgraf et al. (2017). Robnik-Šikonja and Kononenko (2008) assumed the features to be independent in their model. Zintgraf et al. (2017) note that more sophisticated imputation methods can be used. We can also approximate by for some point estimate of .

In the case of a classification problem where the response is a class label, we can follow Zintgraf et al. (2017) and Robnik-Šikonja and Kononenko (2008) and measure how much class probabilities have changed upon deletion and imputation of using an evidence measure, called weight of evidence:

 WE(y∗|x∗) =log2{p(y∗|x∗,Y,X)1−p(y∗|x∗,Y,X)}−log2{p(y∗|x∗−i,Y,X)1−p(y∗|x∗−i,Y,X)}. (2)

is positive (negative) if observing rather than imputing it from makes the class more (less) probable. Zintgraf et al. (2017) generally consider the observed class for . Measures of evidence suitable for the regression context are considered further in the next section.

Zintgraf et al. (2017) consider problems in image classification where the feature vector is a vector of intensities of pixels in an image. Because of the smoothness of images, if only one pixel is deleted, then it can be imputed very accurately from its neighbours. If this is the case, is not useful for measuring the importance of a pixel to the classification. Zintgraf et al. (2017) suggest to measure pixel importance by deleting all image patches of a certain size, and then average weight of evidence measures for patches that contain a certain pixel. They plot images of the averaged weight of evidence measure as a way of visualizing the importance of different parts of an image for a classification made by the network. They also consider deletion and imputation for latent quantities in deep neural network models.

An appropriate imputation method will vary according to the problem at hand, and there is no single method that will be appropriate for all problems. In our later examples for computing likelihood-free inferences we consider three different imputation methods. The first uses linear regression in a situation where the features are approximately linearly related, and we use the implementation in the

mice package in R (van Buuren and Groothuis-Oudshoorn, 2011)

. The second is based on random forests, as implemented in the

missRanger package (Mayer, 2019). Our final example considers time series data, and in the application where we consider the raw series as the feature vector, we use the methods in the imputeTS package (Moritz and Bartz-Beielstein, 2017). Further details are given in Section 5.

## 4 The diagnostic method

In the notation of Section 2, the method of Zintgraf et al. (2017) will allow us to estimate the posterior density , from a regression using the full set of summary statistics as the features. This is described in more detail in Section 4.2 below. We can then compare estimates of and . If the difference between these two posterior distributions is large, then it indicates that different values of the parameter are providing a good fit to the data in the likelihood terms and , indicating a possible model failure. The weight of evidence measure of Zintgraf et al. (2017) doesn’t apply directly as a way of numerically describing the difference in the posterior distributions and in the likelihood-free application. Here we use measures of discrepancy considered in the prior-data conflict checking literature (Nott et al., 2020) based on the concept of relative belief (Evans, 2015) instead of weight of evidence. We place our diagnostics within the framework of Bayesian model checking, and this suggests a method for calibrating the diagnostics. The discussion of prior-data conflict checking is relevant here, because the comparison of to can be thought of as a prior-to-posterior comparison in the situation where has already been observed, and then is the prior for further updating by the likelilhood term . A more detailed explanation of this is given below.

### 4.1 Prior-data conflict checking

Prior-data conflicts occur if there are values of the model parameter that receive likelihood support, but the prior does not put any weight on them (Evans and Moshonov, 2006; Presanis et al., 2013). In other words, prior-data conflict occurs when the prior puts all its mass out in the tails of the likelihood. Nott et al. (2020) consider some methods for checking for prior-data conflict based on prior-to-posterior divergences. In the present setting of a posterior density for summary statistics and with prior , the checks of Nott et al. (2020) would check for conflict between and using a prior-to-posterior -divergence (Rényi, 1961) taking the form

 Rα(S) (3)

where . These divergences are related to relative belief functions (Evans, 2015) measuring evidence in terms of the prior-to-posterior change. Consider a function of . We define the relative belief function for as

 RB(ψ|d) =p(ψ|d)p(ψ)=p(d|ψ)p(d), (4)

where is the posterior density for given , is the prior density, is the marginal likelihood given , and is the prior predictive density at . For a given value of , if , this means there is evidence in favour of , whereas if there is evidence against. See Evans (2015) for a deeper discussion of reasons why relative belief is an attractive measure of evidence, and for a systematic approach to inference based on it.

As , (3

) gives the prior-to-posterior Kullback-Leibler divergence (which is the posterior expectation of

), and letting gives the maximum value of the log relative belief, . So the statistic (3) is a measure of the overall size of a relative belief function, and a measure of how much our beliefs have changed from prior to posterior. Nott et al. (2020) suggest to compare the value of to the distribution of , where is the prior predictive density for ,

 p(S) =∫p(S|η)p(η)dη,

by using the tail probability

 pS =P(Rα(S)≥Rα(Sobs)). (5)

The tail probability (5) is the probability that the overall size of the relative belief function RB is large compared to what is expected for data following the prior predictive density . So a small value for the tail probability (5) suggest the prior and likelihood contain conflicting information, as the change from prior to posterior is larger than expected. See Evans and Moshonov (2006) and Nott et al. (2020) for further discussion of prior-data conflict checking, and why the check in Nott et al. (2020) satisfies certain logical requirements for such a check.

### 4.2 Definition of the diagnostics

In the present work, we have a summary statistic partitioned as , and it is conflict between and that interests us. This is related to prior-data conflict checking in the following way. Suppose that we have the posterior density . We can regard as a prior density to be updated by the likelihood term to get the posterior density . That is, . The density usefully reflects information in about , if we have checked both the model for and for prior-data conflict between and in conventional ways. If updating by the information in gives a posterior density that is surprisingly far from , this indicates that and bring conflicting information about . The appropriate reference distribution for the check is

 p(SB|SA) =∫p(SB|SA,η)p(η|SA)dη,

since we are conditioning on in the prior that reflects knowledge of before updating for .

Suppose that the regression model used for the likelihood-free inference gives an approximate posterior density for summary statistic value . The change in the approximate posterior when deleting from can be examined using a mulitple imputation procedure, using the method of Zintgraf et al. (2017). We produce imputations , of and writing and following (1) we approximate by

 ˜p(η|SA,obs) =1MM∑i=1˜p(η|S(i)).

The change in the approximate posterior density from to can be summarized by the maximum log relative belief

 R∞(SB,obs|SA,obs) =supηlog˜p(η|Sobs)˜p(η|SA,obs). (6)

Our discussion of the prior-data conflict checks of Nott et al. (2020) suggests that an appropriate reference distribution for calibrating a Bayesian model check using (6) would compute the tail probability

 p =P(R∞(SB|SA,obs)≥R∞(SB,% obs|SA,obs)), (7)

where . The imputations in the multiple imputation procedure approximate draws from , and we suggest drawing a fresh set of imputations , , separate from those used in computing , and to approximate (7) by

 ˜p =1M∗M∗∑i=1I(R∞(S∗B(i)|SA,obs)≥R∞(SB,obs|SA,obs)), (8)

where is the indicator function which is when event occurs and zero otherwise. Our suggested checks use the statistic (6), calibrated using the estimated tail probability (8).

### 4.3 The logic of Bayesian model checking

Evans and Moshonov (2006) discuss a strategy for model checking based on a decomposition of the joint Bayesian model , generalizing an earlier approach due to Box (1980). Different terms in the decomposition should be used for different purposes such as inference and checking of different model components. We do not discuss their approach in detail, but an important principle is that for checking model components we should check the sampling model first, and only afterwards check for prior-data conflict. The reason is that if the model for the data is inadequate, sound inferences cannot result no matter what the prior is.

In our analysis, the posterior distribution for is

 p(η|Sobs) ∝p(η)p(Sobs|η). (9)

Our diagnostics consider sequential Bayesian updating in two stages,

 p(η|SA,obs) ∝p(η)p(SA,obs|η), (10)

and

 p(η|Sobs) ∝p(η|SA,obs)p(SB,obs|SA,obs% ,η). (11)

In (10), we can check the sampling model and then check for conflict between and , and if these checks are passed, then usefully reflects information about in . We could then proceed to check the components in (11). Checking in (10) and then in (11) is not the same as checking in (9). The reason is that there can be values of providing a good fit to the data for and values of providing a good fit to the data in , but these need not be the same parameter values.

Formally, we could imagine an expansion of the original model from

 p(S|η) =p(SA|η)p(SB|SA,η), (12)

to

 p(S|η1,η2) =p(SA|η1)p(SB|SA,η2), (13)

so that the original parameter is now allowed to vary between the two likelihood terms, with the original model corresponding to . If a good fit to the data can only be obtained with , one would expect that our diagnostics would detect this, as the likelihood terms are peaked in different regions of the parameter space, and this will become evident in the sequential updating of the posterior for the different summary statistic subsets.

The inherent asymmetry in and in the decomposition (12) has consequences for interpretation of the diagnostics. To explain this, consider once more the toy Poisson model in the introduction. Consider first and . Since is sufficient, we have , and our diagnostics would not detect any conflict. However, conventional model checking of the likelihood term would detect that the model fits poorly in this case. On the other hand, if and , then the term depends on since is non-sufficient, and is very different to . In this case, our diagnostics do indeed reveal the conflict as we show later in Section 5.1. The idea of a conflict between summary statistics is being formalized here in terms of conflict between the likelihood terms and and this is not symmetric in and . There is nothing wrong with this, but the interpretation of the diagnostics needs to be properly understood.

### 4.4 Other work on model criticism for likelihood-free inference

We discuss now related work on Bayesian model criticism for likelihood-free inference, which is an active area of current research. Theoretical examinations of ABC methods under model misspecification have been given recently by Ridgway (2017) and Frazier et al. (2020b). The latter authors contrast the behaviour of rejection ABC and regression adjusted ABC methods (Beaumont et al., 2002; Blum and François, 2010; Li and Fearnhead, 2018) in the case of incompatible summary statistics, where the observed summary statistics can’t be matched by the model for any parameter. Regression adjustment methods can behave in undesirable ways under incompatibility, and the authors suggest a modified regression adjustment for this case. They develop misspecification diagnostics based on algorithm acceptance probabilities, and comparing rejection and regression-adjusted ABC inferences.

Ratmann et al. (2009) considered model criticism based on a reinterpretation of the algorithmic tolerance parameter in ABC. They consider a projection of data onto the ABC error, and use a certain predictive density for this error for model criticism. Ratmann et al. (2011) gives a succinct discussion of the fundamentals of the approach and computational aspects of implementation. In synthetic likelihood approaches to likelihood-free inference (Wood, 2010; Price et al., 2018) recent work of Frazier and Drovandi (2019) has considered some model expansions that are analogous to the method of Ratmann et al. (2011) for the synthetic likelihood. Their method makes Bayesian synthetic likelihood methods less sensitive to assumptions, and the posterior distribution of the expansion parameters can also be insightful about the nature of any misspecification. The work of Wilkinson (2013), although not focusing on model criticism, is also related to Ratmann et al. (2011). Wilkinson (2013) shows how the ABC tolerance can be thought of as relating to an additive “model error”, and that ABC algorithms give exact results under this interpretation. Thomas et al. (2020) discuss Bayesian optimization approaches to learning high-dimensional posterior distributions in the ABC setting in the context of the coherent loss-based Bayesian inference framework of Bissiri et al. (2016). Their work is focused on robustness to assumptions and computation in high-dimensions, rather than on methods for diagnosing model misspecification. Recent work of Frazier et al. (2020a) considers a robust ABC approach related to the model expansion synthetic likelihood method of Frazier and Drovandi (2019). This method can also be applied in conjunction with regression adjustment, and the regression adjustment approach can have better behaviour in the case of misspecification.

Also relevant to our work is some of the literature on summary statistic choice. In particuclar, Joyce and Marjoram (2008) have considered an approach to this problem which examines changes in an estimated posterior density upon addition of a new summary statistic. This is a useful method for a difficult problem, but their method is not used as a diagnostic for model misspecification, and in fact can sometimes perform poorly in exactly this situation.

In a sense, model criticism for Bayesian likelihood-free inference is conceptually no different to Bayesian model criticism with a tractable likelihood. However, issues of computation and summary statistic choice sometimes justify the use of different methods. For general discussions of principles of Bayesian model checking see Gelman et al. (1996), Bayarri and Castellanos (2007) and Evans (2015).

## 5 Examples

### 5.1 Poisson model

We return to the toy motivating example given in the introduction on inference for a Poisson mean, where data are observed. For these data the sample mean is and the sample variance is . In the Poisson model, and are both point estimates of the Poisson mean . The estimates bring conflicting information about the parameter, because simultaneously observing and is unlikely for any value of . Figure 1 in the introduction showed ABC estimates of the posterior density for different summary statistic choices, for the prior described in Section 1. Recall that the rejection ABC algorithm performs poorly for the summary statistic , because of the difficulty of matching the observed values, and this may indicate a poorly fitting model.

We apply the diagnostic method of Section 4 for this example, to show how it can alert the user to the conflicting information in the summaries. The summary statisitcs and are approximately linearly related to each other, since they both estimate , and so we use linear regression for imputation using the R package mice (van Buuren and Groothuis-Oudshoorn, 2011). The mice function from this package is used to generate multiple imputations with the Bayesian linear regression method and default settings for other tuning parameters. For the regression likelihood-free inference, we are using the quantile regression forests approach (Meinshausen, 2006) adapted to the ABC setting by Raynal et al. (2018) and available in the R package abcrf. For the random forests regression we use simulations from the prior, and we tune three parameters of the random forests algorithm (‘m.try’, ‘min.node.size’ and ‘sample.fraction’) using the tuneRanger package (Probst et al., 2019). Figure 2 shows the true posterior density, and some random forests estimates of it. There are three estimates obtained from the fitted random forests regression with the full set of features . The first is based on the observed summary statistic for . The second uses the same fitted regression but replaces the observed summary statistics with , where denotes an imputed value for from the observed . With imputation the estimate shown is an average over imputed values. The third uses where is an imputed value for based on the observed . Also shown are density estimates where the regression is fitted using only features , and where the regression is fitted only using features .

We make four observations. First, using the full summary statistic with quantile regression forests results in an accurate estimate of the true posterior density, in contrast to the ABC algorithm described in the introduction. The reason for this is the variable selection capability of the quantile regression forest approach, which results in the summary statistic being ignored as it is not informative given the minimal sufficient statistic . This is what is expected for our diagnostic, following the discussion of Section 4.3. Our second observation is that imputing from results in very little change to the estimated posterior density. Again this is because the random forest regression fit ignores , and so replacing its observed value with a very different imputation is unimportant. Our third observation is that imputing from does result in a posterior density with very different inferential implications. Finally, fitting the quantile regression forest with features and then deleting and imputing one of the features for the observed data gives a similar posterior density estimate to that obtained by fitting the quantile regression with only the feature that was not deleted. This is desirable, because if we examine deletion and imputation for many subsets of features it may be computationally intensive to refit regression models repeatedly, but the imputation process is less computationally burdensome.

Finally, Figure 3 illustrates our suggested calibration of for the choices , and , and . The vertical lines in the graphs are the observed statistic values, and the histograms shows reference distribution values for the statistics for imputations of from . The estimated tail probability (8) corresponds to the proportion of reference distribution values above the observed value. The change in the posterior density when is imputed from is surprisingly large (bottom panel).

What would we do in this and similar examples once a conflict is found? In this example it is natural to change the model so that the mean and variance need not be the same (using a negative binomial model rather than Poisson, for instance). Alternatively, if the problem is such that the current model is “good enough” in the sense that the fitted model reproducing aspects of the data we care about in the application at hand, we might change the summary statistics so that the conflict is removed while retaining the important information.

### 5.2 Stereological extremes

The next example was discussed in Bortot et al. (2007) and examines an ellpitical model for diameters of inclusions in a block of steel. The size of the largest inclusion is thought to be an important quantity related to the steel strength. The ellpitical model, which has an intractable likelihood, extends an earlier spherical model considered in Anderson and Coles (2002). We do not give full details of the model but refer to Bortot et al. (2007) for a more detailed description.

There are 3 parameters in the model. The first is the intensity of the homogeneous Poisson point process of the inclusion locations. There are also two parameters in a model for the inclusion diameters. This model is a generalized Pareto distribution with scale parameter and shape , for diameters larger than a threshold which is taken as m. Writing , the prior density for is uniform on . The observed data consists of inclusion diameters observed from a single planar slice. The summary statistic used consists of and the quantiles of diameters at levels . Here there may be some interest in whether the simple Pareto model can fit well for the observations in the lower and upper tail simultaneously. Quantile-based summary statistics for ABC were also considered in Erhardt and Sisson (2016). It is expected that the Pareto model will fit the upper quantiles well, but using only the most extreme quantiles might result in a loss of information. Capturing the behaviour of the diameters in the upper tail is most important for the application here, given the relationship between the largest inclusions and steel strength.

We apply our diagnostic method in the following way. First, we used quantile regression forests to estimate the marginal posterior densities for each parameter , , separately, and using the full set of summary statistics, which we denote by . We can write , where denotes the vector of lower quantiles at levels and denotes the vector of upper quantiles at levels . In fitting the random forests regressions we used simulations of parameters and summary statistics from the prior, and the random forest tuning parameters were chosen using the tuneRanger package in R. We apply our diagnostic by deleting and imputing and respectively to see if there is conflicting information in the lower and upper quantiles about the parameters. The imputation is done using the R package missRanger, which uses a random forests approach (Mayer, 2019). We use the missRanger function in this package with and 20 trees with other tuning parameters set at the default values. Figure 4 shows how the estimated marginal posterior densities change for the various parameters upon deletion and imputation of and .

We make two observations. First, when the upper quantiles are imputed from the lower ones, the estimated marginal posterior densities change substantially for and , the parameters in the inclusion model. This does suggest that the Pareto model is not able to simultaneously fit the lower and upper quantiles well. Given the importance in this application of capturing the behaviour of the largest diameters, we might remove the lower quantiles from the vector of summary statistics to obtain a model that is fit for purpose. Secondly, and unlike our first example, the results of deletion and imputation do not match very well with the estimated posterior densities obtained by refitting the regression for parameter subsets directly. However, the estimates based on deletion and imputation are sufficiently good for the diagnostic purpose of revealing the conflict, and the computational burden of imputation is much less than refitting, given that our regression training set contains observations, and that algorithm parameters need to be tuned by cross-validation.

Figure 5 shows our suggested calibration of for the choices , and , and . The vertical lines in the graphs are the observed statistic values, and the histograms shows reference distribution values for the statistics for imputations of from . When is imputed from (bottom row), the change in the estimated posterior density is surprising for and .

### 5.3 Ricker model with threshold-autoregressive auxiliary model

Here we consider a Ricker model (Ricker, 1954), a simple model for the dynamics of animal population sizes in ecology. Likelihood-free inference will be considered using summary statistics which are point estimates from an auxiliary model with tractable likelihood.

The auxiliary model is a two-component self-exiciting threshold autoregression (SETAR) model (Tong, 2011), in which the two auto-regressive components describe dynamic behaviour of the process at high and low levels respectively. Maximum likelihood estimates of the SETAR model parameters provide the summary statistics. The use of an auxiliary models with tractable likelihood is a common way to obtain summary statistics for likelihood-free inference - see Drovandi et al. (2015) for a unified discussion of such approaches. We focus on whether there is conflict between the summary statistics for the two autoregressive components in our auxiliary model. Our checks can be insightful about the ways in which the Ricker model fails to fit well at high and low levels.

Let , be a series of unobserved population sizes, and suppose we observe values , where is a sampling parameter. The series has some initial value and one-step conditional distributions are defined by

 Nt+1 =rNtexp(−Nt+et+1),

where is a growth parameter and is an independent environmental noise series. We write for the parameters, and the prior density for is uniform on the range . The prior was chosen to achieve a reasonable scale and variation based on prior predictive simulations. The Ricker model can exhibit chaotic behaviour when the environmental noise variance is small. In this case, it can be challenging to fit the model using methods which perform state estimation to approximate the likelihood. Wood (2010) and Fasiolo et al. (2016) discuss further the motivations for using likelihood-free methods for time series models with chaotic behaviour. Model misspecification could be another reason for conditioning only on data summaries that we care about – the model for the full data may be considered as a device for implicitly defining a model for summary statistics, and it may be that we only care about good model specification for certain summary statistics important in the application.

For a time series , , the SETAR model used to obtain summary statistics takes the form

 Xt ={a0+a1Xt−1+a2Xt−2+ϵt,ϵt∼N(0,ρ2)if Xt−1

Independence is assumed at different times for the noise sequence . The parameter is a threshold parameter, and the dynamics of the process switches between two autoregressive components of order 2 depending on whether the threshold is exceeded. To obtain our summary statistics, we fit this model to the observed data, and fix based on the observed data fit. With this fixed the SETAR model is then fitted to any simulated data series to obtain maximum likelihood estimates of the SETAR model parameters, which are the summary statistics denoted by . We write , where and are maximum likelihood estimates for the autoregressive component parameters for low and high levels respectively. The SETAR models are fitted using the TAR package (Zhang and Nieto, 2017) in R. In simulating data from the model there were some cases where there were no values of the series above the threshold . Since the number of such cases was small, we simply discarded these simulations.

The observed data are a series of blowfly counts described in Nicholson (1954). Nicholson’s experiments examined the effect of different feeding restrictions on blowfly population dynamics. We use the series for the adult food limitation condition shown in Figure 3 of Nicholson (1954). The Ricker model fits the blowfly data poorly, but the example is instructive for illustrating methodology for model checking. A much better model is described in Wood (2010), based on a related model considered in Gurney et al. (1980). In the model of Wood (2010), the population size is a sum of a recruitment and survival process, with the recruitment process related to previous population size at some time lag. This model is able to reproduce the strong periodicity and other complex features of the data, which the simple Ricker model does not do.

We compute ABC posterior estimates for each parameter using quantile regression forests, once again tuning algorithm parameters by cross-validation using the tuneRanger package. The quantile regression forests are fitted based on values from the prior with corresponding TAR summary statistics. We consider fitting the random forests regression using summaries , only and only. For the fit with summary statistics we consider the posterior density estimated using the observed , as well as values of where is imputed from , and where is imputed from . The imputation is done using the R package missRanger (Mayer, 2019) using the missRanger function with and 20 trees with other tuning parameters set at the default values. The results are shown in Figure 6.

We make two observations. First, the posterior density for is very different when only is used compared to either only or . This suggests that the dynamics at high and low levels of the series are inconsistent with a single fixed value of the growth parameter . Second, the posterior densities estimated using quantile regression forests with features only, are estimated quite well by the corresponding estimates using features and deletion and imputation. For the posterior densities estimated using only, they are also estimated quite well for , but not the other parameters, using features and deletion and imputation of .

Figure 7 shows our suggested calibration of for the choices , and , . The vertical lines in the graphs are the observed statistic values, and the histograms shows reference distribution values for the statistics for imputations of from . When is imputed from (bottom row), the change in the estimated posterior density for is surprising.

### 5.4 Ricker model without summary statistics: convolutional neural networks

We now consider using the whole time series as the feature vector in the Ricker model, and use a convolutional neural network as the regression method. Convolutional networks are not described in detail here; the reader is referred to Polson and Sokolov (2017) and Fan et al. (2019) for introductory discussions suitable for statisticians. Dinev and Gutmann (2018) considered the use of convolutional networks for automated summary statistic choice for likelihood-free inference for time series and showed that a single architecture can provide good performance in a range of problems. In fitting the convolutional network we use a squared error loss, and after obtaining point estimates for the network parameters and response variance we use a plug-in normal predictive distribution.

Writing, as before, for both the full parameter , as well as some scalar function of interest (where the meaning intended is clear from the context), we estimate the marginal posterior density for from samples , . We used below. The regression model is

 η(i) =fw(d(i))+ϵ(i),

for errors , where is a convolutional neural network with weights . Figure 8 shows the architecture used, which is based on the architecture shown in Figure 6 (a) of Dinev and Gutmann (2018).

Training of the network was done using the Keras API for Python, using the Adam optimizer with default settings. The model was trained for 50 epochs with a batch size of 64. After estimating

as , the estimated posterior density for data is .

We wish to develop some diagnostics useful for measuring the influence of the observation at time on inference. A method is developed here similar to that of Zintgraf et al. (2017) for measuring the influence of a pixel in image classification. Deletion of “windows” of the raw series containing can be considered (i.e. are the values of the series in a window of width , say, containing , and is the remainder of the series) and then we consider averages of the statistics over all windows containing . In this example we will consider windows of size , and window values are imputed from one neighbouring value to the left and right of the window. At the boundary points where only one neighbouring value is available we use this. The method is described more precisely in the Appendix, as well as our approach to the multiple imputation. When computing the maximum log relative belief statistic (6), we do not truncate the normal predictive densities for the regression model to the support of the parameter space, but we do compute the supremum over this support.

For the parameter , Figure 9 shows a plot of the raw series values with points with calibration probability for our diagnostic less than indicated by vertical lines (top panel), and a plot of the values themsleves indicated by coloured bands (bottom panel). Some of the smallest values in the series around four of the local minima are influential for the estimation of . Similar plots for the other two parameters did not show any points where the calibration probabilities are very small, except for one point near the boundary for ; however, in that case the failure of the imputation method to handle boundary cases well may be an issue. Compared to our previous analysis, the approach using convolutional networks examines misfit for temporally localized parts of the series. Poor fitting regions seem to particularly affect the environmental noise variance parameter that describes variation unexplained by the model dynamics.

## 6 Discussion

We have discussed a method for detecting the existence of conficting summary statistics using regression approaches to likelihood-free inference. The approach uses a recent idea from the interpretable machine learning literature, based on deletion and imputation of part of a feature vector to judge the importance of subsets of features for the prediction. In likelihood-free regression approximations, where the regression predictive distribution is the estimated posterior distribution, the method estimates the posterior distribution conditioning only on a subset of features, and compares this with the full posterior distribution. The deletion and imputation process is less computationally burdensome than fitting a flexible regression model for different feature subsets directly. The approach can be related to prior-data conflict checking methods examining the consistency of different sources of information, and in this interpretation the posterior distribution based on a subset of the features is considered as the prior, which is updated by the likelihood term for the remaining features. Based on this connection, a way to calibrate the diagnostics is suggested.

A concern expressed in the work of Zintgraf et al. (2017) was possible sensitivity of their procedure to the imputation method used. That is a concern in our applications also, although we used a number of different imputation methods in the examples and did not find this sensitivity to be a problem for reasonable choices. The diagnostics described are computed using regression approaches to likelihood-free inference, but they may be useful even if another likelihood-free inference approach is employed for inferential purposes, or in the case where the likelihood is tractable. We believe that insightful methods for model criticism are very important, since they can result in meaningful model expansions and refinements, and ultimately more appropriate inferences and decisions.

## Acknowledgements

Michael Evans was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

## Appendix

### Details of diagnostic for the example of Section 5.4

We describe how we implement our diagnostic for the example of Section 5.4. Roughly speaking, all windows of width are considered including a time , and then we average over where consists of the series values in and is the remaining values.

To make the method precise we need some further notation. Let denote a time series of length . For some subset of the times, , we write and . Let be a fixed time. Let denote the set of all windows of width containing of the form for some . For each suppose that is some time series of length , and write . Write for the value of where for all where is the observed series. Let denote the value of where and i.e. the observation for in are generated from the conditional prior predictive given the observed value for the remainder of the series. The draws are independent for different . Let

 Rt,k(dt,k.) =1nt,knt,k∑j=1R∞(dt,kj(Ct,kj)|dt,kj(−Ct,kj)).

We base our diagnostic on , calibrated by

 pt =P(Rt,k(dt,k,∗.)≥Rt,k(dt,kobs)), (14)

and estimate (14) by

 ˜pt =1M∗M∗∑i=1I(Rt,k(dt,k,i.)≥Rt,k(dt,kobs)), (15)

where , are approximations of draws of based on imputation, i.e. we have imputed from independently for each and .

### Details of the imputation method for the example of Section 5.4

Figure 10 illustrates the idea behind the window-based imputation we use in the example of Section 5.4. We need to impute values of the series for a window of size which has been deleted (indicated by the blue region in the figure). A larger window around the one of width is considered (red patch in the figure). To impute, we first obtain a mean imputation using the functions in the imputeTS package (Moritz and Bartz-Beielstein, 2017). In particular, for imputing a conditional mean we use the na_interpolation function in imputeTS

with spline interpolation and default settings for other tuning parameters. For multiple imputation, we add noise to the conditional mean by fitting a stationary Gaussian autoregressive model of order one to the observed series and then consider zero mean Gaussian noise, where the covariance matrix of the noise is the conditional covariance matrix of the autoregressive process in the blue region given the remaining observations in the red patch. Note that although the series values are counts, these counts are generally large and we treat them as continuous quantities in the imputation procedure.

## References

• C. W. Anderson and S. G. Coles (2002) The largest inclusions in a piece of steel. Extremes 5 (3), pp. 237–252. Cited by: §5.2.
• M. J. Bayarri and M. E. Castellanos (2007) Bayesian checking of the second levels of hierarchical models. Statistical Science 22, pp. 322–343. Cited by: §4.4.
• M. A. Beaumont, W. Zhang, and D. J. Balding (2002) Approximate Bayesian computation in population genetics. Genetics 162, pp. 2025–2035. Cited by: §1, §4.4.
• P. G. Bissiri, C. C. Holmes, and S. G. Walker (2016) A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 1103–1130. Cited by: §4.4.
• M. G. B. Blum, M. A. Nunes, D. Prangle, and S. A. Sisson (2013) A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science 28 (2), pp. 189–208. Cited by: §1.
• M. G. B. Blum and O. François (2010) Non-linear regression models for approximate Bayesian computation. Statistics and Computing 20, pp. 63–75. Cited by: §1, §4.4.
• P. Bortot, S. Coles, and S. Sisson (2007) Inference for stereological extremes. Journal of the American Statistical Association 102 (477), pp. 84–92. Cited by: §5.2.
• G. E. P. Box (1980) Sampling and Bayes’ inference in scientific modelling and robustness (with discussion). Journal of the Royal Statistical Society, Series A 143, pp. 383–430. Cited by: §4.3.
• K. Csilléry, O. Francçois, and M. G. B. Blum (2012) abc: an R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution 3 (3), pp. 475–479. Cited by: §1.
• T. Dinev and M.U. Gutmann (2018) Dynamic likelihood-free inference via ratio estimation (DIRE). arXiv:1810.09899. Cited by: Figure 8, §5.4, §5.4.
• C. C. Drovandi, A. N. Pettitt, and A. Lee (2015) Bayesian indirect inference using a parametric auxiliary model. Statistical Science 30 (1), pp. 72–95. Cited by: §5.3.
• R. Erhardt and S. A. Sisson (2016) Modelling extremes using approximate Bayesian computation. In Extreme Value Modelling and Risk Analysis, D. Dey and J. Yan (Eds.), pp. 281–306. Cited by: §5.2.
• M. Evans and H. Moshonov (2006) Checking for prior-data conflict. Bayesian Analysis 1, pp. 893–914. Cited by: §4.1, §4.1, §4.3.
• M. Evans (2015) Measuring statistical evidence using relative belief. Taylor & Francis. Cited by: §4.1, §4.4, §4.
• J. Fan, C. Ma, and Y. Zhong (2019)

A selective overview of deep learning

.
arXiv preprint arXiv:1904.05526. Cited by: §5.4.
• Y. Fan, D. J. Nott, and S. A. Sisson (2013) Approximate Bayesian computation via regression density estimation. Stat 2 (1), pp. 34–48. External Links: Document, ISSN 2049-1573, Link Cited by: §1, §2.
• M. Fasiolo, N. Pya, and S. N. Wood (2016) A comparison of inferential methods for highly nonlinear state space models in ecology and epidemiology. Statistical Science 31, pp. 96–118. Cited by: §5.3.
• D. T. Frazier, C. Drovandi, and R. Loaiza-Maya (2020a) Robust approximate Bayesian computation: an adjustment approach. arXiv plreprint arXiv:2008.04099. Cited by: §4.4.
• D. T. Frazier and C. Drovandi (2019) Robust approximate Bayesian inference with synthetic likelihood. arXiv preprint arXiv:1904.04551. Cited by: §4.4.
• D. T. Frazier, C. P. Robert, and J. Rousseau (2020b) Model misspecification in approximate Bayesian computation: consequences and diagnostics. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (2), pp. 421–444. Cited by: §4.4.
• A. Gelman, X.-L. Meng, and H. Stern (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica 6, pp. 733–807. Cited by: §4.4.
• W. Gurney, S. Blythe, and R. Nisbet (1980) Nicholson’s blowflies revisited. Nature 287, pp. 17–21. Cited by: §5.3.
• R. Izbicki, A. B. Lee, and T. Pospisil (2019)

ABC–CDE: toward approximate Bayesian computation with complex high-dimensional data and limited simulations

.
Journal of Computational and Graphical Statistics (To appear). Cited by: §2.
• R. Izbicki and A. B. Lee (2017) Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics 11, pp. 2800–2831. Cited by: §2.
• P. Joyce and P. Marjoram (2008) Approximately sufficient statistics and Bayesian computation. Statistical applications in genetics and molecular biology 7, pp. Article 26. External Links: Document Cited by: §4.4.
• N. Klein, D. J. Nott, and M. S. Smith (2019) Marginally-calibrated deep distributional regression. arXiv preprint arXiv:1908.09482. Cited by: §2.
• J. Li, D. J. Nott, Y. Fan, and S. A. Sisson (2017) Extending approximate Bayesian computation methods to high dimensions via Gaussian copula. Computational Statistics and Data Analysis 106, pp. 77–89. Cited by: §2.
• W. Li and P. Fearnhead (2018) Convergence of regression-adjusted approximate Bayesian computation. Biometrika 105 (2), pp. 301–318. Cited by: §4.4.
• J. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder (2012) Approximate Bayesian computational methods. Statistics and Computing 22 (6), pp. 1167–1180. Cited by: §1.
• M. Mayer (2019) MissRanger: fast imputation of missing values. Note: R package version 2.1.0 External Links: Link Cited by: §3, §5.2, §5.3.
• N. Meinshausen (2006) Quantile regression forests. J. Mach. Learn. Res. 7, pp. 983–999. Cited by: §2, §5.1.
• S. Moritz and T. Bartz-Beielstein (2017) imputeTS: Time Series Missing Value Imputation in R. The R Journal 9 (1), pp. 207–218. External Links:
• A. Nicholson (1954) An outline of the dynamics of animal populations.. Australian Journal of Zoology 2 (1), pp. 9–65. Cited by: §5.3.
• D. J. Nott, X. Wang, M. Evans, and B. Englert (2020) Checking for prior-data conflict using prior-to-posterior divergences. Statistical Science 35 (2), pp. 234–253. Cited by: §4.1, §4.1, §4.2, §4.
• G. Papamakarios and I. Murray (2016) Fast -free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1028–1036. Cited by: §2.
• G. Papamakarios, D. Sterratt, and I. Murray (2019) Sequential neural likelihood: fast likelihood-free inference with autoregressive flows. In Proceedings of Machine Learning R