## 1 Introduction

Approximate Bayesian computation (ABC) is now a mature algorithm for likelihood-free estimation. It has been successfully applied to a wide range of real-world problems for which more standard analytic tools were unsuitable due to the absence or complexity of the associated likelihood. It has also paved the way for a range of algorithmic extensions that take advantage of appealing ideas embedded in other approaches. Despite the usefulness of ABC, the method does have a number of drawbacks. The approach is simulation intensive, requires tuning of the tolerance threshold, discrepancy function and weighting function, and suffers from a curse of dimensionality of the summary statistic. The latter issue stems from the fact that ABC uses a non-parametric estimate of the likelihood function of a summary statistic

(Blum, 2010).In this chapter we review two alternative methods of approximating the intractable likelihood function for the model of interest, both of which aim to improve computational efficiency relative to ABC. The first is the synthetic likelihood (SL, originally developed by Wood (2010)), which uses a multivariate normal approximation to the summary statistic likelihood. This auxiliary likelihood can be maximised directly or incorporated in a Bayesian framework, which we refer to as BSL. The BSL approach requires substantially less tuning than ABC. Further, BSL scales more efficiently with an increase in the dimension of the summary statistic due to the parametric approximation of the summary statistic likelihood. However, the BSL approach remains simulation intensive. In another chapter, Fasiolo et al. (2016) apply BSL to dynamic ecological models and compare it with an alternative Bayesian method for state space models. In this chapter, we provide a more thorough review of SL both in the classical and Bayesian frameworks.

The second approach we consider uses an empirical likelihood (EL) within a Bayesian framework (, see Mengersen et al. (2013)

). This approach can in some cases avoid the need for model simulation completely and inherits the established theoretical and practical advantages of synthetic likelihood. This improvement in computational efficiency is at the expense of specification of constraints and making equivalence statements about parameters under the different models. Of note is that the latter enables, for the first time, model comparison using Bayes factors even if the priors are improper. In summary, in the Bayesian context, both of these approaches replace intractable likelihoods with alternative likelihoods that are more manageable computationally.

## 2 Synthetic Likelihood

The first approach to approximating the likelihood that is considered in this chapter is the use of a synthetic likelihood (SL), which was first introduced by Wood (2010)

. The key idea behind the SL is the assumption that the summary statistic conditional on a parameter value has a multivariate normal distribution with mean vector

and covariance matrix . That is, we assume thatwhere denotes the density of the multivariate normal distribution. Of course, in general for models with intractable likelihoods, the distribution of the summary statistic is unknown and thus and are generally unavailable. However, it is possible to estimate these quantities empirically via simulation. Consider generating independent and identically distributed (iid) summary statistic values, , from the model based on a particular value of , . Then the mean and covariance matrix can be estimated via

(1) |

where the superscript denotes transpose. The likelihood of the observed summary statistic, , is estimated via . We use the subscript on to denote the fact that the approximate likelihood will depend on the choice of . The larger the value of the better the mean and covariance parameters of the multivariate normal distribution can be approximated. However, larger values of need more computation to estimate the likelihood. It is likely that a suitable value of will be problem dependent, in particular, it may depend on the actual distribution of the summary statistic and also the dimension of the summary statistic. The value of must be large enough so that the empirical covariance matrix is positive definite.

Note that Wood (2010) described some extensions, such as using robust covariance matrix estimation to handle some non-normality in the summary statistics and robustifying the SL when the observed summary statistic falls in the tails of the summary statistic distribution (i.e. when a poor parameter value is proposed or when the model is mis-specified).

The SL may be incorporated into a classical or Bayesian framework, which are both described below. Then, attempts in the literature to accelerate the SL method are described. We finish the section with a real data example in cell biology.

### 2.1 Classical synthetic Likelihood

The approach adopted in Wood (2010) is to consider the following estimator

(2) |

which is the maximum SL estimator. We use the subscript to denote that the estimator will depend on the value of , with higher accuracy likely to be obtained for larger values of . We note that also because the likelihood is stochastic a different answer will be obtained for fixed if a different random seed is applied. Since the optimisation in (2) is stochastic, Wood (2010)

applied a Markov chain Monte Carlo (MCMC) to explore the space of

and select the value of that produced the highest value of the SL. Some recent applications of the SL method have appeared in Hartig et al. (2014), who used the FORMIND model for explaining complicated biological processes that occur in natural forests, and Brown et al. (2014), who considered models for the transmission dynamics of avian influenza viruses in different bird types.The synthetic likelihood approach has a strong connection with indirect inference, which is a classical method for obtaining point estimates of parameters of models with intractable likelihoods. In the simulated quasi-maximum likelihood (SQML) approach of Smith (1993), an auxiliary model with a tractable likelihood function, , where is the parameter of that model, is used. Define the function as the relationship between the parameter of the model of interest and the auxiliary model. This is often referred to as the binding function in the indirect inference literature. The SQML method aims to maximise the auxiliary likelihood rather than the intractable likelihood of the model of interest

Unfortunately the binding function is typically unavailable. However, it can be estimated by generating iid datasets, , from the model of interest (the generative model) conditional on a value of . Define the auxiliary parameter estimate based on the th simulated dataset as

Then we have

The binding function is defined as as . The SQML estimator then becomes

The synthetic likelihood falls within the SQML framework but where has been reduced to and the density of the multivariate normal distribution is used for .

### 2.2 Bayesian synthetic Likelihood

An intuitive approach to incorporating SL into a Bayesian framework involves combining the prior with the synthetic likelihood, which induces the following approximate posterior

where the subscript denotes that the approximate posterior depends on the choice of . Drovandi et al. (2015) consider a general framework called parametric Bayesian indirect likelihood (pBIL), where the likelihood of some auxiliary model with parameter , , is used to replace the intractable likelihood of the actual or generative model, . Since the binding function is generally not available in closed form, it can be estimated by simulation via drawing iid datasets from the generative model and fitting the auxiliary model to this simulated data (as in the SQML method mentioned previously), producing . Drovandi et al. (2015) demonstrate that the resulting approximate posterior depends on , since in general

is not an unbiased estimate of

even when is an unbiased estimate of . We note that when non-negative and unbiased likelihood estimates are used within Monte Carlo methods such as MCMC (Andrieu and Roberts, 2009) and sequential Monte Carlo (SMC, Chopin et al. (2013)) algorithms, the resulting target distribution is the posterior based on the originally intended likelihood function. Such approaches are referred to as pseudo-marginal or exact-approximate methods in the literature. BSL fits within the pBIL framework, but where the auxiliary model is applied at a summary statistic level rather than the full data level and that the auxiliary model is the multivariate normal distribution, so that the auxiliary parameter estimates have an analytic expression as shown in (1). Despite the fact that we use unbiased estimators for and (under the normality assumption) it is clear that is not an unbiased estimate of . Therefore the BSL posterior is inherently dependent on . However, under the assumption that the model is able to recover the observed statistic, Price et al. (2018) present extensive empirical evidence that the BSL posterior is remarkably insensitive to . Further, some empirical evidence demonstrates that BSL shows some robustness to the lack of multivariate normality.Price et al. (2018) developed a new BSL method that uses an exactly unbiased estimator of the normal likelihood, which is developed by Ghurye and Olkin (1969). Using the notation of Ghurye and Olkin (1969), let

and for a square matrix write if and otherwise, where is the determinant of and means that is positive definite. The result of Ghurye and Olkin (1969) shows that an exactly unbiased estimator of is (in the case where the summary statistics are normal and )

where

. It is interesting to note that this estimator is a mixture of a discrete and a continuous random variable (a realisation of the estimator can be identically 0 with positive probability). Thus, if this estimator is used within a Monte Carlo method, the target distribution is proportional to

regardless of the value of (under the multivariate normality assumption). Price et al. (2018) referred to this method as uBSL, where ‘u’ denotes unbiased.To sample from the BSL posteriors, an MCMC algorithm can be used, for example. We refer to this as MCMC BSL, which is shown in Algorithm 1. Given the insensitivity of the BSL posteriors to the value of , it is of interest to maximise the computational efficiency of the MCMC method. For large , the SL is estimated with high precision but the cost per iteration is high. Conversely, for small , the cost per iteration is low but the SL is estimated less precisely, which reduces the MCMC acceptance rate. Price et al. (2018) found empirically that the value of that leads to an estimated log SL (at a

with high BSL posterior support) with a standard deviation of roughly 2 produces efficient results. However,

Price et al. (2018) also found that there a wide variety of values that lead to similar efficiency. When the unbiased SL is used in place of the SL shown in Algorithm 1, the MCMC uBSL algorithm is obtained. In the examples of Price et al. (2018), MCMC BSL and MCMC uBSL have a similar efficiency. We also note that the MCMC BSL posteriors appear to exhibit very slow convergence when starting at a point with negligible posterior support. The reason for this is that the SL is estimated with a large variance when the observed statistic

lies in the tail of the actual SL. Thus additional research is required on more sophisticated methods for sampling from the BSL posteriors.The BSL method has been applied in the literature. Fasiolo et al. (2016) used BSL for posterior inference for state space models in ecology and epidemiology based on data reduction and compared it with particle Markov chain Monte Carlo (Andrieu et al., 2010). Hartig et al. (2014) implemented BSL for a forest simulation model.

BSL could be seen as a direct competitor with ABC as they are both simulation-based methods and differ only in the way the intractable summary statistic likelihood is approximated. Importantly, BSL does not require the user to select a discrepancy function, as one is naturally induced via the multivariate normal approximation. The simulated summary statistics in BSL are automatically scaled, whereas an appropriate weighting matrix to compare summary statistics in ABC must be done manually. As noted in Blum (2010) and Drovandi et al. (2015)

, ABC uses a non-parametric approximation of the summary statistic likelihood based on similar procedures used in kernel density estimation. From this point of view, the ABC approach may be more accurate when the summary statistic

is low dimensional, however the accuracy/efficiency trade-off is less clear when the summary statistic is high dimensional. Price et al. (2018) demonstrated on a toy example that BSL becomes increasingly more computationally efficient than ABC as the dimension of the summary statistic grows beyond 2. Furthermore, Price et al. (2018) demonstrated that BSL outperformed ABC in a cell biology application with 145 summary statistics.### 2.3 Accelerating synthetic likelihood

As with ABC, the SL method is very simulation intensive. There have been several attempts in the literature to accelerate the SL method by reducing the number of model simulations required. Meeds and Welling (2014) assumed that the summary statistics are independent and during their MCMC BSL algorithm fit a Gaussian process (GP) to each summary statistic output as a function of the model parameter . The Gaussian process (GP) is then used to predict the model output at proposed values of , provided that the prediction is accurate enough. If the GP prediction cannot be performed with sufficient accuracy, more model simulations are taken at that proposed and the GP is re-fit for each summary statistic. The independence assumption of the summary statistics is questionable, and may overstate the information contained in .

In contrast, Wilkinson (2014) used a GP to model the SL as a function of directly and use the GP to predict the SL at new values of . The GP is fit using a history matching approach (Craig et al., 1997). Once the final GP fit is obtained, an MCMC algorithm is used with the GP prediction used in place of the SL.

Moores et al. (2015)

considered accelerating Bayesian inference for the Potts model, which is a complex single parameter spatial model. Simulations are performed across a pre-defined grid with the mean and standard deviation of the summary statistic (which turns out to be sufficient in the case of the Potts model, as it belongs to the exponential family) estimated from these simulations. Non-parametric regressions are then fitted individually to the mean and standard deviation estimates in order to produce an estimate of the mappings

and across the space of , where is the standard deviation of the summary statistic. The regressions are then able to predict the mean and standard deviation of the summary statistic at values not contained in the grid. Further, the regression also smooths out the mappings, which are estimated using a finite value of . The estimated mapping is then used in a sequential Monte Carlo Bayesian algorithm.### 2.4 Example

Cell motility, cell proliferation and cell-to-cell adhesion play an important role in collective cell spreading, which is critical to many key biological processes, including skin cancer growth and wound healing (e.g. Cai et al. (2007); Treloar et al. (2013)). The main function of many medical treatments is to influence the biology underpinning these processes (Decaestecker et al., 2007). In order to measure the efficacy of such treatments, it is important that estimates of the parameters governing these cell spreading processes can be obtained along with some characterisation of their uncertainty. Agent-based computational models are frequently used to interpret these cell biology processes since they can produce discrete image-based and movie-based information which is ideally suited to collaborative investigations involving applied mathematicians and experimental cell biologists. Unfortunately, the likelihood functions for these models are computationally intractable, so standard statistical inferential methods for these models are not applicable.

To deal with the intractable likelihood, several papers have adopted an ABC approach to estimate the parameters (Johnston et al., 2014; Vo et al., 2015a, b). One difficulty with these cell biology applications is that the observed data are typically available as sequences of images and therefore it is not trivial to reduce the dimension of the summary statistic to a suitable level for ABC while simultaneously retaining relevant information contained in the images. For example, Johnston et al. (2014) considered data collected every 5 minutes for 12 hours but only analyse images at 3 time points. Vo et al. (2015a) reduced images initially down to a 15 dimensional summary statistic, but perform a further dimension reduction based on the approach of Fearnhead and Prangle (2012) to ensure there is one summary statistic per parameter.

Here we will re-analyse the data considered in Treloar et al. (2013) and Vo et al. (2015a). The data consist of images of spatially expanding human malignant melanoma cell populations. To initiate each experiment, either 20,000 or 30,000 cells are approximately evenly distributed within a circular barrier, located at the centre of the well. Subsequently, the barriers are lifted and population-scale images are recorded at either 24 hours or 48 hours, independently. Furthermore, there are two types of experiments conducted. The first uses a treatment in order to inhibit cells giving birth (cell proliferation) while the second does not use the treatment. Each combination of initial cell density, experimental elapsed time and treatment is repeated 3 times, for a total of 24 images. The reader is referred to Treloar et al. (2013) for more details on the experimental protocol. For simplicity, we consider here the 3 images related to using 20,000 initial cells, 24 hours elapsed experimental time and no cell proliferation inhibitor.

In order to summarise an image, Vo et al. (2015a) considered 6 sub-regions along a transect of each image. The position of the cells in these regions is mapped to a square lattice. The number of cells in each sub-region is counted, together with the number of isolated cells. A cell is identified as isolated if all of its nearest neighbours (north, south, east, west) are unoccupied. For each region, these summary statistics are then averaged over the three independent replicates. We refer to these 12 summary statistics as and , where and are the number of cells and the percentage of isolated cells (averaged over the 3 images) for region , respectively. Vo et al. (2015a) also estimated the radius of the entire cell colony using image analysis. Thus Vo et al. (2015a) included three additional summary statistics, , which are the estimated and ordered radii for the three images. For more details on how these summary statistics are obtained, the reader is referred to Vo et al. (2015a). This creates a total of 15 summary statistics, which is computationally challenging to deal with in ABC. As mentioned earlier, Vo et al. (2015a) found it beneficial to apply the technique of Fearnhead and Prangle (2012), which uses a regression to estimate the posterior means of the model parameters from the initial summaries, which are then used as summary statistics in a subsequent ABC analysis. Here we attempt to see whether or not BSL is able to accommodate the 15 summary statistics, and compare the results with the ABC approach of Vo et al. (2015a).

Treloar et al. (2013) and Vo et al. (2015a) considered a discretised time and space (two-dimensional lattice) stochastic model to explain the cell spreading process of melanoma cells. For more details on this model, the reader is referred to Treloar et al. (2013) and Vo et al. (2015a). The model contains three parameters: (probability that an isolated agent can move to a neighbouring lattice site in one time step), (probability that an agent will attempt to proliferate and deposit a daughter at a neighbouring lattice site within a time step) and (the strength of cell-to-cell adhesion, that is, cells sticking together). These model parameters can then be related to biologically relevant parameters such as cell diffusivity and the cell proliferation rate. Here we will report inferences in terms of the parameter .

Here we consider a simulated dataset with , and (same simulated data as analysed in Vo et al. (2015a)) and the real data. We ran BSL using Algorithm 1 with independent prior distributions on each parameter. We used a starting value and proposal distribution for the MCMC based on the results provided in Vo et al. (2015a), so we do not apply any burn-in. We also applied the uBSL algorithm. The BSL approaches were run with , 48, 80 and 112 (the independent simulations were farmed out across 16 processors of a computer node). To compare the efficiency of the different choices of we considered the effective sample size (ESS) for each parameter divided by the total number of model simulations performed multiplied by a large constant scalar to increase the magnitude of the numbers (we refer to this as the normalised ESS).

Marginal posterior distributions for the parameters obtained from BSL and uBSL for different values of are shown in Figures 1 and 2, respectively. It is evident that the posteriors are largely insensitive to , which is consistent with the empirical results obtained in Price et al. (2018). The normalised ESS values and MCMC acceptance rates for the BSL approaches are shown in Table 1 for different values of . The efficiency of BSL and uBSL appears to be similar. The optimal value of out of the trialled values appears to be 32 or 48. However, even is relatively efficient. For the increase in acceptance rate is relatively small given the extra amount of computation required per iteration.

We also applied the BSL approaches with similar settings to the real data. The posterior results are presented in Figures 3 and 4. Again we found the results are relatively insensitive to . Table 2 suggests that or are the most efficient choices for out of those trialled. However, it is again apparent that there are a wide variety of values that lead to similar efficiency.

acc. rate (%) | ESS | ESS | ESS | |
---|---|---|---|---|

32 | 17/17 | 96/114 | 86/113 | 115/126 |

48 | 27/32 | 95/103 | 93/92 | 110/115 |

80 | 35/37 | 82/76 | 74/67 | 106/89 |

112 | 38/40 | 61/65 | 61/58 | 68/70 |

acc. rate (%) | ESS | ESS | ESS | |
---|---|---|---|---|

32 | 8/9 | 46/51 | 38/45 | 41/43 |

48 | 17/18 | 76/71 | 56/63 | 70/54 |

80 | 27/28 | 66/64 | 66/60 | 68/58 |

112 | 32/33 | 58/60 | 51/54 | 54/43 |

The results, in comparison to those obtained in Vo et al. (2015a), are shown in Figure 5 for the simulated data and Figure 6 for the real data. From Figure 5 it can be seen that BSL approaches produced results similar to that of ABC for the simulated data. It appears that BSL is able to accommodate the 15 summary statistics directly without further dimension reduction. However, it is clear that the dimension reduction procedure of Vo et al. (2015a) performs well. From Figure 6 (real data) it is evident that ABC and the BSL approaches produce similar posterior distributions for and . For , there is a difference of roughly 0.01 between the posterior means of the BSL and ABC approaches and an increase in precision for BSL. This discrepancy for the real data not apparent in the simulated data requires further investigation. One potential source of error for BSL is the multivariate normal assumption. The estimated marginal distributions of the summary statistics (using ) when the parameter is is shown in Figure 7. All distributions seem quite stable but there is an indication of non-normality for some of the summary statistics. Given the results in Figure 5 and 6, it appears that BSL is showing at least some robustness to this lack of normality.

## 3 Further Reading

Ong et al. (2018) developed a stochastic optimisation algorithm to obtain a variational approximation of the BSL posterior. The authors utilise an unbiased estimator of the log of the multivariate normal density due to Ripley (1996, pp. 56). Ong et al. (2018) demonstrated that significant computational savings can be achieved relative to MCMC BSL, at the expense of resorting to a parametric approximation of the posterior. This work has been extended by Ong et al. (2017) to higher dimensional summary statistic and parameter spaces.

An et al. (2016) and Ong et al. (2017) considered shrinkage estimators of the covariance matrix of the model summary statistic in order to reduce the number of simulations required to estimate the synthetic likelihood.

Pham et al. (2014)

replaced the ratio of intractable summary statistic likelihoods of the Metropolis-Hastings ratio in an MCMC algorithm with the outcome of a classification algorithm. Datasets are simulated under the current parameter and proposed parameter with the former observations labelled as class 1 and the latter labelled as class 2. A classifier,

Pham et al. (2014)used random forests, is then applied. From the fitted classifier, the odds for the value of the observed summary statistic

is computed and used as a replacement to the ratio of intractable likelihoods. Pham et al. (2014) noted that BSL is a special case of this approach when classical quadratic discriminant analysis is adopted as the classifier.Everitt et al. (2017) suggested that the SL can be used to perform Bayesian model selection in doubly intractable models, which contain an intractable normalising constant that is a function of . Such models can occur in complex exponential family models such as exponential random graph models for social networks and the Potts model for image analysis. Everitt et al. (2017) developed computational algorithms in order to produce an SL approximation to the evidence for each model.

## 4 Bayesian empirical likelihood

ABC is a popular computational method of choice not only when there is no likelihood, but also when the likelihood is available but difficult or impossible to evaluate. Another popular idea is to replace the likelihood itself with an empirical alternative. This so-called empirical likelihood (EL) can be embedded within an ABC algorithm or provide an alternative to ABC. The approach is appealing even for less complex models if there is a concern that the model is poorly specified. For instance, if the likelihood is a mixture but is misspecified as a single normal distribution, the corresponding parameter estimates, intervals and inferences may exhibit unacceptably poor behaviour (Chen and Quin, 2003)

. In this case, normal approximation confidence intervals perform poorly in the area of interest, i.e. the lower tail, but intervals based on an EL are shown to perform as well as intervals based on a correctly specified mixture model.

EL has been shown to have good small sample performance compared with methods that rely on asymptotic normality. Moreover, it enables distribution-free tests without simulation, and provides confidence intervals and regions that have appealing theoretical and computational properties (Owen, 1988, 2001). Some of these features are discussed in more detail below.

Close parallels with the EL approach have been drawn with estimating equations (Qin and Lawless, 2001; Grendar and Judge, 2007), kernel smoothing in regression (Chen and Van Keilegom, 2009a, b; Haardle, 1990; Fan and Gijbels, 1996), maximum entropy (Rochet, 2012) and functional data analysis (Lian, 2012). We do not elaborate on these associations in this chapter, but refer the interested reader to the cited references.

EL approaches have been developed in both frequentist and Bayesian contexts. This section provides a brief overview of the method under both paradigms. We then focus on a particular algorithm, , proposed by Mengersen et al. (2013), which was first conceived as part of an ABC algorithm but was then developed independently of the ABC architecture.

### 4.1 Empirical Likelihood

Empirical likelihood (EL) has been a topic of active research and investigation for over a quarter of a century. Although similar ideas were established earlier (see, for example, the proposal of a “scale-load” method for survey sampling by Hartley and Rao (1968)), EL was formally introduced by Owen (1988) as a form of robust likelihood ratio test.

Following Owen (1988) and Owen (2001), assume that we have i.i.d. data from a distribution . An EL denoted is given by

The likelihood ratio and corresponding confidence region are given by, respectively,

where is the empirical distribution function and for some appropriate value of .

Given parameters of interest and an appropriate sufficient statistic for it, a profile likelihood and corresponding confidence region become, respectively,

If there are no ties, we let , , , and find that

Obvious adjustments are made to and if there are ties.

A fundamental result obtained by Owen (1988) is that if the mean of the distribution is finite and its covariance matrix is finite with rank , then as ,

This is the same as that obtained by Wilks’ Theorem for the parametric setup. Thus for a % confidence region, .

As a concrete example of EL, suppose that interest is in estimation of the mean, i.e., . Then , with confidence region and profile likelihood given by, respectively,

Thus under certain conditions, a -level EL confidence interval for is given by

where is the log EL function and is the upper quantile of the

distribution with one degree of freedom.

The above set-up can also be seen as an estimating equation problem, where the true value satisfies the estimating equation

with denoting a vector-valued (estimating) function. Hence we can take to indicate a vector mean, for , for the th quantile of if is continuous, for the median, and so on.

More generally, we have one or more constraints of the form , where the dimension of sets the number of constraints in unequivocally defining the parameters of interest . Then the EL is defined as

for , with constraints

Perhaps surprisingly, there are relatively few Bayesian formulations of EL in the published literature. An earlier Bayesian ABC approach using an approximation of the EL based on the pairwise score equation was proposed by Pauli and Adimara (2010). The authors focused on establishing the validity of the procedure, arguing that its asymptotic properties were preferred over the approximations employed by Pauli et al. (2011). See also Ruli et al. (2016). Owen (2001) (Ch. 9) noted some parallels between EL and the Bayesian bootstrap (Rubin, 1981), and Rochet (2012) has suggested a Bayesian approach to generalised empirical likelihood, and generalised method of moments, via a form of maximum entropy. Chaudhuri and Ghosh (2011) describe Bayesian EL approaches in a spatial modelling context, as discussed in more detail below.

More direct research into Bayesian EL comprise a Monte Carlo study (Lazar, 2003) and two probabilistic studies (Schennach, 2005; Grendar and Judge, 2007). In contrast to the reported Bayesian bootstrap-type approaches of Owen (2001), Schennach (2005) and Rasuga (2006), Grendar and Judge (2007, 2009)

proposed a Bayesian large deviations (law of large numbers) probabilistic interpretation and justification of EL. They showed that, in a parametric estimating equations setting, the EL method is an asymptotic instance of the Bayesian non-parametric maximum a-posteriori approach.

### 4.2 Features of EL

Since Owen’s paper in 1988, the properties of EL have been comprehensively investigated and reviewed (Hall and La Scala, 1990; Owen, 2001).

EL methods have been favourably contrasted with common alternatives for estimation of complex models. For example, a natural competitor is calibration, which proceeds by choosing, by some method, parameter values that match selected features of the observed data. This can be difficult for richly parametrised models with strong correlation structure. EL can be perceived as a more statistically formal method of calibration in that it uses moments for matching. Another common competitor, maximum likelihood, requires the definition, estimation and maximisation of a likelihood and can be both analytically and computationally demanding for complex models. In contrast, EL requires only summary (moment) statistics and can perform inference on an approximate likelihood, but inherits the properties of standard likelihood (Owen, 2001). These properties of standard likelihood are principally obtained by appeal to Wilks’ Theorem (Qin and Lawless, 2001).

As described above, likelihood ratio confidence regions can be constructed by EL that often do not require estimation of the variance (Chen and Van Keilegom, 2009a, b) and have the same order of magnitude error as their parametric counterparts. This also applies for more general regression contexts (Chen, 1993, 1994; Chen and Cui, 2006; Chen and Gao, 2007). The confidence regions constructed in this manner respect the boundaries of the support of the target parameter and are more natural in shape and orientation of the data since they contour a log-likelihood ratio. In particular, they are often superior to confidence regions based on asymptotic normality when the sample size is small. The confidence regions can be further improved by applying Bartlett’s correction, , where involves higher order moments of (DiCiccio et al., 1991).

A key assumption underlying standard EL is that the random variables are independent with a common distribution. An analogue, the weighted EL or exponentially tilted distribution accommodates data that are independent but not necessarily identically distributed. This approach was introduced by Schennach (2005) and has been taken up by a large number of authors (Owen, 2001; Kitamura, 2006; Glenn and Zhao, 2007). Chaudhuri and Ghosh (2011) contrast the two approaches as follows. They frame the EL as

where

for some specified function . They then consider standard EL as a form of constrained maximum of a nonparametric likelihood since for a given , equals the EL when , and the exponentially tilted likelihood as a form of maximum entropy such that . As these authors discussed, the exponentially tilted likelihood can also be seen as a profile likelihood for .

Moreover, Schennach (2005) shows that this reformulation of the maximisation problem of the EL allows for a probabilistic interpretation which justifies its use in a Bayesian setting. More precisely, the posterior distribution for a parameter of interest may be seen as

where represents a (potentially infinite-dimensional) nuisance parameter which absorbs all those aspects of the model not described by the parameter of interest . The information contained in the nuisance parameter may be discretised by a vector of parameter with . The nuisance parmater may then be given a prior which shares the Dirichlet prior’s property of providing posteriors which assign probability one to distributions supported by the sample. Schennach (2005) shows then this reformulation has a computationally convenient representation, for which the posterior of the parameter of interest may be obtained through

where are the weights obtained as solution of the maximization problem

under constraints , where “BETEL” stands for “Bayesian exponentially tilted likelihood”. This method may be called “Bayesian exponentially tilted EL”, because it uses the exponential tilting proposed in Efron (1981) and has a Bayesian interpretation. This version of the EL will be used in Section 4.7.

Glenn and Zhao (2007) examined the robustness properties of the estimates arising from the tilted distribution. For example, whereas the root mean squared error (RMSE) of the EL estimator for the mean increases as the non-iidness of the sample increases, the RMSE of the weighted EL estimator remains closer to its theoretical value. Other extensions to standard EL, such as the continuous updating estimator, have also been proposed (Hansen et al., 1996).

In a Bayesian framework, the standard and exponentially tilted likelihoods have been shown to be appropriate for Bayesian inference for a range of set-ups and under certain conditions on the prior, particularly for a prior with sufficiently large variance (Monahan and Boos, 1992; Lazar, 2003; Chaudhuri and Ghosh, 2011).

Notwithstanding these attractions, there are some drawbacks in applying EL. One substantive issue is the formulation of the estimating equations. The number of equations is one issue: there should be at least as many as the dimension of the parameter space, but any more than this (which may be available and desirable from the perspective of model description) has been argued to adversely affect inference (Qin and Lawless, 2001). However, it is suggested by Mengersen et al. (2013) that this concern may not apply in all circumstances, in a Bayesian setup; this is illustrated in the -and- example given below.

### 4.3 Estimation

The most common approach to estimation of the EL is through the method of Lagrange multipliers. In general terms, this method aims to maximise subject to a (multivariate) constraint . This is achieved by finding maximising such that . Then solves the constrained problem. Considering again the example of estimating , the aim is to maximise

under the constraints

We write

where and are the Lagrange multipliers. This can be solved to find a unique solution for .

There is a range of software for computing the EL, particularly targeted towards specific applications. A helpful repository and description of available code is on Art Owen’s website. A powerful library available in the R software is the package ‘emplik’ (Zhou and Yang, 2014). The underlying computational method is based on the Newton-Lagrange algorithm, whereby the Lagrangian function described above is solved by an application of Newton’s method, which iterativly uses a second order Taylor approximation of to find an optimal value satisfying .

For example, the package el.test in the emplik library conducts a simple EL ratio test that returns log likelihood ratio (-2LLR

, which has an approximate chi-squared distribution under the null hypothesis), the associated p-value, the final value of the Lagrange multiplier (

lambda), the gradient at the maximum (grad), the hessian matrix (hess), weights on the observations (wts) and the number of iterations performed (nits).The following code, provided in the emplik documentation, illustrates two tests on a two-dimensional set of data: (i) and (ii) .

# generate data x <- matrix(c(rnorm(50,mean=1), rnorm(50,mean=2)), ncol=2,nrow=50) y <- 2*x[,1]-x[,2] # test hypothesis (i) el.test(x, mu=c(1,2)) # test hypothesis (ii) el.test(y, mu=0)

In one realisation of this code, the results of the first test were returned after four iterations, with weights ranging between 0.75 and 1.51, and with -2LL=1.50 and a p-value of 0.47 under the assumption that LL is approximately chi-square under . The second null hypothesis returned a p-value of 0.22.

Examples of the use of the emplik library for survival analysis are given by Zhou (2015). Whereas el.test requires uncensored data, the packages developed by Zhou and embedded in the emplik

library enable estimation of hazard functions, cumulative distribution functions and confidence bands for various types of censored data under a range of survival models.

As an example, the package em.cen.EM can be used to test the hypothesis versus , where is a user supplied function. For instance, can be the test about the Kaplan-Meier mean and . The myeloma code in the Appendix illustrates this by testing in the myeloma dataset incorporated in the emplik library. The code also finds the upper and lower confidence limit of a Wilks confidence interval. The output of this analysis provides a value -2LL and a corresponding p-value.

Bayesian EL methods are typically analysed by solving the EL using a Lagrange or similar method, then generating observations from the posterior distribution of the parameters of interest by an MCMC method. A more detailed description of this approach is given in the context of spatial modelling in the next section. An alternative approach, , which employs the emplik library to obtain the required likelihood values, is also detailed in a subsequent section.

### 4.4 EL in practice

The EL approach has been shown to be applicable in a broad range of contexts (Qin and Lawless, 2001)

. For example, following its formulation for estimation in linear regression

(Owen, 1991), it has been extended to nonlinear, generalised, parameric, nonparametric and semiparametric models with and without missing data and censoring, time series models and varying-coefficient models; see the review of Chen and Van Keilegom (2009b) and the references therein. The approach has also been proposed for testing; see again Chen and Van Keilegom (2009b). Einmahl and McKeague (2003) have proposed omnibus tests based on EL for a wide range of hypothesis tests, including symmetry, exponentiality, independence and change of direction. Tests for stochastic ordering using EL have been proposed by El Barmi and McKeague (2013) and El Barmi and McKeague (2015).Chaudhuri and Ghosh (2011)

have proposed an EL approach for small area estimation and have suggested that the approach is also applicable to general random and mixed effects models. As the authors argue, EL overcomes the distributional assumptions of the more dominant parametric models as well as the linearity assumptions of the nonparametric models that have been proposed for this problem. In addition, EL avoids the need for resampling methods like jacknife and bootstrap to obtain mean squared error estimation. The authors’ methodology is developed using a multivariate-

prior for the parameter vector and both the regular and exponentially tilted formulations for the EL.A Bayesian EL approach for constructing intervals for the analysis of survey data has been explored by Rao and Wu (2010). This work builds on the EL approaches for complex survey analysis proposed by Chen and Sitter (1999) and Wu and Rao (2006). Rao and Wu (2010) provides a clear exposition of EL methods for sample surveys. The authors set up the problem as one in which denotes the number of units , in a finite population of size , that have the value , and denotes the number of units in the sample having this value . The sample data are then reduced to a set of so-called scale-loads . Assuming a negligible sampling fraction, the likelihood can be approximated by using the multinomial distribution with a log likelihood given by

with , and the MLE of

is the sample mean

The authors make the connection with the work of Chen and Sitter (1999) and argue that this ‘scale-load’ approach is “in the same spirit” as EL as described by Owen (1988).

As described above, survival analysis is another area that lends itself naturally to EL. The popular Kaplan-Meier curve is a nonparametric estimator of the survival function , where denotes the time to an event. It is conceptually straightforward to see that can be estimated as a maximum EL estimator. This field has been developed by a number of authors: see, for instance, Wang and Jing (2001) for a general exposition of the survival model, Murphy and van der Vaart (1997) for doubly censored data, Qin and Jing (1994) for Cox modelling using EL, and McKeague and Zhao (2002) for an EL approach to relative survival. The recent text by Zhou (2015) provides an excellent overview of the field as well as new models and computational algorithms, with associated R code to facilitate implementation. A simple illustration of an EL approach to survival analysis is provided in the next section.

Recent years have also seen an increase in popularity of EL for spatial modelling. Chaudhuri and Ghosh (2011) pioneered a Bayesian EL approach for small area estimation. Their model can accommodate continuous and discrete and area- and unit-level data, random and mixed effects, and the original and exponentially tilted empirical likelihoods.

A similar approach has also been proposed recently by Porter et al. (2015a) for this purpose. The so-called semiparametric hierarchical EL (SHEL) model can be applied to irregular lattices and irregularly spaced point-referenced data, and was shown to have improved mean squared prediction error compared with standard parametric analyses in a simulation study, a large community survey and a bird survey. In the SHEL model, EL is employed in an empirical data model, which is combined with a parametric process model that accounts for the spatial dependence through a rank-reduced intrinsic conditional autoregressive (ICAR) prior and, finally, with a model at the highest level of the hierarchy describing the parameter.

A companion paper by the same authors (Porter et al., 2015b)

extends this work to a multivariate context, with focus on the Fay-Herriot (FH) model which is a mainstay in small area estimation. The argument is made that this approach encompasses spatial correlation (via the FH model) but avoids the usual restrictive Gaussian distributional assumptions (via EL).

One of the fields in which EL has been prominent is economics and related fields. For example, Riscado (2012)

promote the use of EL as a natural framework for estimation of dynamic stochastic general equilibrium (DSGE) models for macroeconomic analysis, since these models represent complex economic systems as a constrained optimisation problem and can be described as a set of moment conditions. The authors favourably compare EL with calibration and ML approaches, since the model parameters have complex correlation structures that hinder calibration and are typically characterised by nonlinear systems of difference equations that have no closed form and hence hinder ML. The likelihood is thus often approximated and then estimated (and maximised) using methods such as the Kalman filter and sequential Monte Carlo. The authors interpret the EL approach as mapping from the set of moment conditions to the stochastic processes of the economic variables, and then performing estimation by inverting that mapping. As discussed above, the importance and very often the difficulty of defining a set of “good” moment conditions, or constraints, is highlighted in this setting.

### 4.5 The algorithm

A Bayesian EL algorithm was proposed by Mengersen et al. (2013). It was originally designed in the spirit of ABC, in that it avoids computation of the likelihood, but during its development the authors realised that simulation from the likelihood could also be avoided and replaced with importance sampling. Thus the so-called algorithm generates values from the prior distribution for and uses the values as (unnormalised) weights in an importance sampling (IS) framework. The basic sampler is given below. Of course, this IS algorithm will not be efficient if the posterior is very different to the prior. Later, we describe a more sophisticated algorithm based on adaptive multiple importance sampling (AMIS, Cornuet et al. (2012)).

#### 4.5.1 Example 1

As a concrete example, consider estimation of the population mean based on a sample of observations . In this case, two main decisions are required: the prior on and the constraints. The computation of the EL can be performed using the el.test package in the emplik library in R, as described earlier in this chapter. In this case, the unnormalised weight is taken to be equal to the value of the empirical likelihood, which is calculated from the value of -2LLR obtained from the el.test function.

Suppose that a sample of size 100 is drawn from an (unknown) distribution and the aim is to estimate the population mean . A prior is imposed on and a first-moment constraint is chosen, i.e., that the sample mean should equal the population mean. For the analysis, it is decided to run iterations, noting that in practice a smaller value of can be used but care must be taken to check that the weight has not concentrated too strongly on a small number of sampled values of . A resampling step can be included to mitigate this, although at a cost of introducing additional variance. In this case, the algorithm becomes:

Example R code for this algorithm is given below.

data = rnorm(100,10,1) M = 5000; theta.propose=w=rep(0,M) for (i in 1:M){ theta.propose[i] = runif(1,-10,30) el = el.test(data,mu=theta.propose[i]) w[i] = exp(-0.5*(el$’-2LLR’)) } theta=sample(theta.propose,M,prob=w,replace=TRUE) mean(theta); sd(theta); quantile(theta,probs=c(0.1,0.9)) hist(theta, main="",xlab="theta")

As noted above, the resampling step could be replaced with a weighted mean, standard deviation and quantiles. One realisation of this code provided the following estimates: ; s.d.; 95% CrI=. A histogram of the obtained sample of is given in Figure 8.

Mengersen et al. (2013) comment on the performance of this algorithm with different constraints, namely one, two and three central moments, , , . They observe that one and two constraints work well, but three constraints performed more poorly. This was seen to support the general suggestion by Owen (2001) that the number of constraints should be equal to the number of parameters. Interestingly, this was not seen to be the case for the -and- distributions, as described in the next example.

A possible measure of the efficiency of the algorithm is the effective sample size (ESS). The ESS is reportedly a measure of the ‘equivalent number of independent observations’ in a sample, that is, the value that equates the obtained variance of the estimator of interest with the equivalent variance assuming an independent sample. For weighted samples as in EL, the ESS can be estimated as

Kish (1965) argues that this substitution (of the EL for the exact likelihood) can be employed in any algorithm that samples from a posterior distribution. For example, it can be employed in composite likelihoods which are commonly used in areas such as population genetics where the likelihoods are known but complex, and hence computationally difficult. The ‘traditional’ composite likelihood approach decomposes the target distribution into several multivariate Student distributions. In the approach, the EL is used instead. The computation is achieved using AMSI, which can be parallelised on a multi-core computer. The method can also be tailored for some non-i.i.d. problems such as dynamic models with AR structure, although the challenge here is in selecting appropriate constraints; see Mengersen et al. (2013) for details.

### 4.6 Example 2

We illustrate the use of by expanding on the discussion by Mengersen et al. (2013) of quantile distributions. These distributions are appealing for ABC in general, and in particular: there is typically no closed form expression for the likelihood, so regular algorithms such as MCMC are not immediately applicable; and it is fast and easy to obtain simulations from a quantile function via an inversion algorithm.

There is a body of literature on using ABC for estimation of quantile distributions. Allingham et al. (2009) proposed an ABC-MCMC algorithm in which draws of the parameters of the quantile distribution are based on a Metropolis algorithm with a Gaussian proposal distribution, and are accepted based on the rule , where is the entire set of order statistics, is the Euclidean norm and

is heuristically chosen after inspection of a histogram of

obtained from a preliminary run using a very large value of . Peters and Sisson (2006) also developed an ABC-MCMC algorithm for complex quantile functions. A range of improvements in the MCMC algorithm, selecting low-dimensional summary statistics and methods of choosing have since been suggested (Prangle, 2011; McVinish, 2012). Sequential Monte Carlo approaches for multivariate extensions of quantile distributions have also been proposed (Drovandi and Pettitt, 2011).The -and- distribution is a popular example of a quantile distribution. This is a transformation of the standard normal distribution function, as follows:

where and is commonly set fixed at 0.8; see Rayner and MacGillivray (2002). Here denotes the th quantile from the -and- distribution and is the corresponding quantile of the standard normal distribution. Thus simulation from the -and- distribution requires only the generation of uniform() variates.

Figure 9 shows the estimated cdf of a standard normal distribution based on a -and- approximation, using the basic procedure described in Algorithm 2. The parameters of the -and- distribution corresponding to a distribution are . The analysis was based on observations, 100,000 iterations and 10,000 resampled parameter values. The percentiles were chosen arbitrarily to form the constraints for EL and all parameters were generated from a prior distribution.

The Bayes Factor R code available in the Appendix illustrates the ease with which Bayes Factors (BF) can be computed for -and- distributions using EL. The example assumes a true model (Model 1) with versus two alternatives, (Model 2) and (Model 3). Here, all models have zero mean () and unit variance (

) but differ in the degree of skewness, with Model 3 having no skewness (

) and hence representing a standard normal distribution. The cumulative distributions functions for these three models are depicted in Figure 9. Two sample sizes of and and five constraints are considered. The resultant boxplots shown in Figure 10 confirm that Model 1 is preferred over both of the alternative models, with a stronger log BF obtained for the larger sample size as anticipated.#### 4.6.1 Example 3

Mengersen et al. (2013) also describe a variation on the basic algorithm which employs AMIS in order to improve computational efficiency over plain importance sampling. The so-called -AMIS sampler employs multivariate Student distributions (3 degrees of freedom, mean , covariance matrix ) as importance sampling distributions, as described in the following algorithm. The output of this algorithm is a weighted sample of size .

### 4.7 Extensions of the algorithm

Since its introduction, the approach has been applied to a range of problems. For example, Cheng et al. (2014) cite the approach as the foundation for their proposed method for estimating the parameters of the extreme value model of Heffernan and Tawn (2004)

. Through a large simulation study, the method was found to provide good coverage of credible intervals, although one of the parameters needed more informative priors under some more challenging setups.

In a second example, Grazian and Liseo (2017) discuss the use of for copula estimation, whereby the marginal likelihood of the quantity of interest is approximated by the EL.

Copula models are an important tool in multivariate analysis: while a huge literature exist about methods of estimating univariate marginal distributions, the problem of estimating the dependence structure of a multivariate distribution is more complex. Copula models allow for separately working with the univariate marginals and the joint distribution. They are widely used in many applications, including actuarial sciences

(Frey and McNeil, 2001), epidemiology (Clayton, 1978), finance (Cherubini et al., 2004), hydrology (Salvadori and De Michele, 2007) among others.A copula model is a way of representing the joint distribution of a random vector . Given a -variate cumulative distribution function which depends on some parameter , it is possible to show (Sklar, 2010) that there always exists a -variate function , such that

where is the marginal distribution of , indexed by a parameter , and is a parameter characterizing the joint distribution.

In other terms, the copula is a distribution function with uniform margins on which takes value from the univariate (which may be of the same form or may differ in terms of the parameters or of the forms) in order to produce the -variate distribution . The resulting model is very flexible, because it may utilise different types of marginal distributions and dependence structures.

Many different types of copula functions have been proposed in the literature; see Joe (2015) for a review. An example is the Clayton copula, defined in the general -dimension case as

where is a one-dimensional parameter. The Clayton copula is characterized by lower-tail dependence (that approaches as ) and no upper-tail dependence. A representation of the Clayton copula (obtained through simulation) is available in Figure 11.

The frequentist standard method of estimating copula models is the “inference from the margins” (IFM) approach (Joe, 2015), i.e. a two-step procedure, where first the marginal distribution functions are separately estimated, either in a parametric or in a nonparametric way (depending on the information available on the marginals) and then the copula function is estimated. Bayesian alternatives have been explored, nevertheless they are still limited. The reader may refer to Smith (2011) for a review.

In some cases the interest of the analysis is in a function of interest of the copula and not in the complete dependence structure; this may be due to a weak information about the type of structure or to the need of a low-dimensional quantification of the dependence. Some typical quantities of interest are, for example, tail dependence indices, Spearman’s or Kendall’s . While tail dependence indices represent, in the bivariate case, the probability that a random variable exceeds a certain threshold given that another random variable has already exceeded that threshold (Großmaß, 2007), Spearman’s and Kendall’s are measures of rank correlation, which are both expressible in terms of the copula . For example, the Spearman’s in the bivariate case is defined as

(3) |

In this case, Grazian and Liseo (2017), in the same spirit of the IFM method, propose to first estimate the marginal distributions and then study the interest measure of multivariate dependence with an approximate Bayesian approach based on an estimation of the likelihood of via EL (the authors use its Bayesian modification described in Schennach (2005) and in Section 4.2). In this way, it is possible to avoid the complete definition of the dependence structure (usually difficult to be determined) and elicite the prior distribution only for the quantity of interest, in order to reduce the bias derived from wrong distributional assumptions. Moreover their Bayesian approach avoids the loss of information of the IFM method and may be proved to be consistent.

The (“Bayesian computation for copulas”) algorithm follows and its final output will then be a posterior sample drawn from an approximation of the posterior distribution of the quantity of interest (see Algorithm 5).

Given a data set and marginal posterior samples for

This approach presents several advantages with respect to classical approaches to copula estimation. First, it may be applied to a generic dimension , while in the literature there is a huge difference in terms of consistency results on the proposed estimators between the bivariate and the multivariate case. The authors have applied the algorithm to a maximum dimension equal to with no loss of precision and with a reasonable computational expense (it has to be noted that the algorithm may be easily parallelised in the first step of estimation of the marginals). Second, the method provides a quantification of the error of estimation, not easily available in the classical approach (see Schmid and Schmidt (2007) for the Spearman’s and Schmidt and Stadtmüller (2006) for the tail dependence indices). Third, it avoids the specification of the particular copula function which describes the dependence structure; this is particularly important in absence of information on it, since methods to select the copula function are not yet fully developed.

Since the interest is in small dimension parameter (often only one measure of dependence), the choice of the constraints should be easy; unfortunately, in practical applications these conditions might hold only asymptotically. This is the case, for example, of the Spearman’s : its sample counterpart is only an asymptotically unbiased estimator of so the moment condition is strictly valid only for large samples.

Grazian and Liseo (2017) also apply the method to a real data-set based on the study of the dependence among five Italian financial institutes, where the returns are supposed to marginally follow a model with Student’s innovations. They show how it is possible to obtain an approximated posterior distribution of the Spearman’s of the financial asset returns of these institutes with Algorithm 5.

As an application, consider the setting of Section 4.6, where five sets of observations are simulated from -and- identical but not independent quantile distributions with , , and . The dependence structure is described by a multivariate Clayton copula (McNeil and Nešlehová, 2009) with true unknown multivariate equal to . There are many ways to extend the bivariate Spearman’s defined in (3) to the multivariate case and they are not in general equivalent; nevertheless it is often of interest in many fields of application to describe the dependence with a low-dimensional quantity, for example in the multivariate analysis of financial asset returns where there is the need to express the amount of dependence in a portfolio by a single number. Here, the following is considered

(4) |

where is the upper Fréchet- Hoeffding bound, and . For a review of the definitions of the Spearman’s in the literature one may refer to Schmid and Schmidt (2007).

Uniform data have been generated from a multivariate Clayton copula with and then inverted in order to obtain data from the corresponding quantile distributions. Figure 11 shows the correlation between the first two sets of observations generated with this procedure.

Figure 12 described the approximation to the posterior distribution of , as defined in (4), obtained via Algorithm 5: it is possible to see that the posterior distribution is concentrated around the true value from which the data have been generated.

The R code used is available in the Appendix (“Copula code”).

As noted above, one of the key considerations in developing and implementing is the choice of constraints for the EL. This consideration is not particular to , but applies to all EL methods. However, the difference here is that the selected constraints must be also applicable to the ABC context. With this goal in mind, Ruli et al. (2016) advocate the use of scaled composite likelihood score functions as summary statistics in ABC. The scaling takes into account a measure of the relative amount of information provided by the different parameters. They argue that the corresponding ABC procedure is therefore invariant to reparametrisation and accommodates automatically the curvature of the posterior distribution. This approach is argued to be an improvement over that proposed by Pauli et al. (2011) and more ‘fully ABC’ than the

Comments

There are no comments yet.