 # Approximate Bayesian Computation for Finite Mixture Models

Finite mixture models are used in statistics and other disciplines, but inference for mixture models is challenging. The multimodality of the likelihood function and the so called label switching problem contribute to the challenge. We propose extensions of the Approximate Bayesian Computation Population Monte-Carlo (ABC-PMC) algorithm as an alternative framework for inference on finite mixture models. There are several decisions to make when implementing an ABC-PMC algorithm for finite mixture models, including the selection of the kernel used for moving the particles through the iterations, how to address the label switching problem and the choice of informative summary statistics. Examples are presented to demonstrate the performance of the proposed ABC-PMC algorithm for mixture modeling.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Mixture models have been used in statistics since the late s when Karl Pearson introduced them in an analysis of crab morphometry (Pearson, 1894a, b). Subsequently mixture modeling has grown in popularity in the statistical community as a powerful framework for modeling clustered data; the book by McLachlan and Peel (2004) provides a general overview of mixture modeling, and a more Bayesian perspective can be found in Marin et al. (2005) or Frühwirth-Schnatter (2006). In recent decades mixture models have become routinely applied in various applications (Lancaster and Imbens, 1996; Richardson and Green, 1997; Roeder and Wasserman, 1997; Henderson and Shimakura, 2003; Gianola et al., 2006; Lin et al., 2012; Wu et al., 2015). One reason for the general success of mixture models is due to the opportunity of specifying the number of possibly different component distributions, allowing for flexibility in describing complex systems (Marin et al., 2005).

The general definition for a finite mixture model with fixed integer components is:

 K∑i=1fi⋅pi(y∣θi), (1)

with mixture weights such that and

are the component distributions of the mixture, often parametrically specified with a vector of the unknown parameters,

.

Finite mixture models present computational and methodological challenges due, at least in part, to the complex and possibly multimodal likelihood function along with the invariance under permutation of the component indices. The EM algorithm (Dempster et al., 1977) provides a method for numerically retrieving the maximum likelihood estimates, although the possible multimodality of the likelihood function makes finding the global maximum challenging (Marin et al., 2005). Extensions of the EM algorithm have been proposed for improving its speed of convergence and avoiding local optima (Naim and Gildea, 2012; Miyahara et al., 2016).

Bayesian approaches for mixture modeling have increased in the last decades (Robert, 2007; Casella et al., 2002; Marin et al., 2005; Shabram et al., 2014; Mena and Walker, 2015)

. Bayesian inference for mixture models often relies on MCMC which can lead to the so called

label switching problem (Diebolt and Robert, 1994; Stephens, 2000; Jasra et al., 2007; Rodríguez and Walker, 2014), because the likelihood function is invariant to the re-labeling of the mixture components. Additionally the resulting posterior distribution is multimodal and asymmetric, which makes summarizing the posterior distribution using common statistics such as the posterior mean or the HPD interval unhelpful (Stephens, 2000; Stoneking, 2014; Mena and Walker, 2015).

A different framework for inference that can be considered for addressing the issues related to the use of mixture models is Approximate Bayesian Computation (ABC). ABC is often used in situations where the likelihood function is complex or not available, but simulation of data through a forward model is possible. With mixture models, though the likelihood function is available, working with is difficult. Though it has its own challenges, ABC can be successfully implemented for retrieving the posterior distributions of the parameters, providing an alternative to the MCMC algorithms.

Assuming is the possibly multidimensional inferential target, the basic ABC algorithm consists of the following four steps, outlined by Pritchard et al. (1999): (i) sample from the prior distribution, ; (ii) produce a generated sample of the data by using in the forward model (FM), ; (iii) compare the true data with the generated sample using a distance function, , letting where is some (possibly multi-dimensional) summary statistic of the data; (iv) if the distance, , is smaller than a fixed tolerance then is retained, otherwise it is discarded. Repeat until the desired particle sample size is achieved. In order to improve the computational efficiency of the basic ABC algorithm, there have been many extensions which include a sequential element through the tolerance parameter, (Sisson et al., 2007; Joyce and Marjoram, 2008; Beaumont et al., 2009; Csilléry et al., 2010; Moral et al., 2011; Marin et al., 2012; Ratmann et al., 2013). In this method we extended the Approximate Bayesian Computation Population Monte-Carlo algorithm (ABC-PMC) of Beaumont et al. (2009), which is introduced in the next Section. Interestingly, the original Population Monte-Carlo sampler (Cappé et al., 2004) was introduced for solving problems related to the multimodality of the likelihood function for situations in which mixture models were used.

In any ABC analysis the definition of both suitable summary statistics, , and a distance function, , for comparing the true data to the generated sample is needed and is crucial for getting useful inference results (Beaumont et al., 2002). The definition of summary statistics is necessary because ABC suffers of the curse of dimensionality and using the entire dataset is not computationally feasible (Blum, 2010a, b; Blum et al., 2013; Prangle, 2015). In the present work we used the Hellinger distance for comparing the true data to the generated sample . Details can be found in Section 2.4. While the proposed method is valid for , for illustration purposes, we will focus on the one dimensional case where .

The paper is organized as follow. In Section we present the ABC-PMC algorithm by Beaumont et al. (2009) along with the proposed extensions for finite mixture models. Section and Section are dedicated, respectively, to a simulation study and a popular real data example. Concluding remarks are presented in Section .

## 2 Methodology

Sequential versions of the basic ABC algorithm have been proposed to improve the computational efficiency of the algorithm. This is typically accomplished by considering a sequence of decreasing tolerances, such that , where is the total number of iterations. Then, rather than sampling from the prior in each time step after the initial step, improved proposals from the previous steps ABC posterior can be used. The ABC-PMC algorithm of Beaumont et al. (2009) is based on importance sampling ideas in order to improve the efficiency of the algorithm by constructing a series of intermediate proposal distributions using the sequential ABC posteriors. The first iteration of the ABC-PMC algorithm uses tolerance and draws proposals from the specified prior distributions; from the second iteration through the , proposals are picked from the previous iteration’s ABC posterior and then moved according to some perturbation kernel (e.g. Gaussian kernel). Since the proposals are drawn from the previous iteration’s ABC posterior and moved according to a kernel rather than proposed directly from the prior distributions, importance weights are used. The importance weight for particle in iteration , where is the desired particle sample size is:

 W(J)t∝π(θ(J)t)/N∑K=1W(K)t−1ϕ(θ(J)t−θ(K)t−1τθ,t−1),

where is a Gaussian perturbation kernel and

is twice the (weighted) sample variance of the particles for

at iteration , as recommended in Beaumont et al. (2009); other perturbation kernels can be used and the importance weights would be updated accordingly. Further details on the original ABC - PMC algorithm can be found in Beaumont et al. (2009), though there are several elements that require additional comments. In particular, an implementation of ABC-PMC requires the user to select the tolerance sequence , the perturbation kernels, and the summary statistic(s) and distance function.

First, the method in which we define the sequence is an adaptive approach based on the accepted distances from the previous time step. That is,

is set to a particular quantile of the accepted distances from time step

. Selecting the quantile at which to shrink the tolerance too high could result in a decrease in computational efficiency because more iterations are needed to shrink the tolerance to a small enough value; selecting a quantile that is too low can also contribute to computational inefficiencies because more draws from the proposal are needed within each iteration to find datasets that achieve the small tolerance.

One of the challenges in using ABC-PMC for mixture models is related to selecting an appropriate perturbation kernel for the mixture weights because the individual weights must be between 0 and 1, and the weights must sum to 1; the usual Gaussian perturbation kernel is not a viable option because this kernel can lead to proposed mixture weights that are not consistent with their support. An additional challenge in using ABC-PMC for mixture models is due to the label switching problem, and for reasons discussed below, this has to be addressed at the end of each iteration. Finally, the selection of an appropriate and informative summary statistic for multimodal data requires a summary that can capture the overall shape of the distribution of the data.

For these reasons the original version of the ABC-PMC cannot be used for mixture models. Our proposed extensions are discussed below, as well as the definition of the finite Gaussian mixture model framework, which is used to illustrate the performance of the proposed ABC-PMC algorithm. Algorithm

1 summarizes our proposed ABC-PMC algorithm for the special case of a Gaussian mixture model, and the details of the steps of the algorithm are discussed in the following subsections.

### 2.1 Finite Gaussian Mixture Models

A common choice for introduced in Equation (1

) is the Gaussian distribution. This particular class of models, called Gaussian Mixture Models (GMMs), is very flexible and widely used in various applications

(Gianola et al., 2006; Lin et al., 2012; Stoneking, 2014). Maintaining the notation of (1), a GMM is defined as:

 K∑i=1fi⋅ϕ(y∣θi), (2)

where

is the density function of the Normal distribution,

consists of the vector of unknown means and variances for each of the groups. Hence , and adding the mixture weights previously defined, . More specifically, lies in the unit simplex, inside the unit cube .

A common choice of the prior distribution for is the Dirichlet distribution of order

with hyperparameter

, where often , as proposed by Wasserman (2000). Another common choice has been proposed by Rousseau and Mengersen (2011), who defined the hyperparameter ; in this way the prior is marginally a Jeffrey prior distribution. The priors for the mean and the variance of the GMM can be defined as follows:

 μi∣σi∼ϕ(ξ,κ),σ−2i∼\text{Gamma}(α,β), (3)

with mean , variance , and shape parameter and rate parameter . There are several methods for selecting the hyperparameters, , such as the Empirical Bayes approach (Casella et al., 2004) and the “weakly informative principle” (Richardson and Green, 1997). Both of these options are considered in the simulation study so as to be consistent with the original authors of each example.

### 2.2 Perturbation kernel functions

One of the advantages of ABC-PMC over the basic ABC algorithm is that, starting from the second iteration, rather than drawing proposals from the prior distribution, proposed particles are drawn from the previous step’s ABC posterior according to their importance weights. Then, instead of using the actual proposed value that was drawn, it is perturbed according to some kernel. There are a number of possible kernel functions, , to perturb the proposed particles. Beaumont et al. (2009) suggest a Gaussian kernel centered on the selected element from the previous iteration and a variance equal to twice the empirical variance of the previous iteration’s ABC posterior. This is a reasonable choice if there are no constraints on the support of the parameter of interest. However, when constraints on the parameter support are present, such as for variances or mixture weights, a perturbation kernel should be selected so that it does not propose values outside the parameter’s support.

When moving the selected values for the mixture weights, not only is there the constraint that each mixture weight component must be in , but it is also required that , making the Gaussian kernel inappropriate.

In the first iteration of the proposed ABC-PMC algorithm, the mixture weights are directly sampled from the prior distribution, which is a ), where . For , proposals are drawn from the previous step particle system according to their importance weights. After randomly selecting a mixture weight, , we want to “jitter” or move it in manner that preserves some information coming from the selected particle, but not let it be an identical copy, leading to the resampled mixture weight . This is carried out using Algorithm 2.

From the steps outlined in Algorithm 2, we note that is the sum of two independent random variables, with and , so that the resampled mixture weight .

The parameter is a fixed real number with range that determines how much information to retain from . The choice of has an impact on both the allowed variability of the marginal ABC posterior distributions for the mixture weights and the efficiency of the entire procedure. In particular fixing a close to leads to a Dirichlet resampling in which the new set of mixture weights is close to the previous set (if , then ). On the other hand a choice for close to 0 implies that the information coming from is weakly considered (for a new set of particles is drawn directly from the prior distribution and no information about has been retained). We found to be a good choice for balancing efficiency and variability (i.e., it allows for some retention of information of the selected particle, but does not lead to nearly identical copies of it).

### 2.3 Algorithm for addressing the label switching problem

As noted earlier, a common problem arising in the Bayesian mixture models framework is label switching. When drawing a sample from a posterior (for both MCMC and ABC), the sampled values are not necessarily ordered according to their mixture component assignments because the likelihood is exchangeable. For example, suppose a particle is accepted for a component GMM. This particle was selected with a particular ordering of the components with , , and from the same mixture component, ; however, a new particle that is accepted will not necessarily follow that same ordering of the components. Somehow the particles have to be ordered in such a way that aligns different realizations of the components in order to eliminate the ambiguity.

Several approaches have been proposed to address the label switching problem and are known as relabelling algorithms. A first group of relabeling algorithms consists of imposing an artificial identifiability constraint in order to arbitrarily pick a parameter (e.g. the mixture weights) and sort all the parameters for each accepted particle according to that parameter’s order (Diebolt and Robert, 1994; Richardson and Green, 1997). However, the majority of the algorithms proposed for addressing the label switching are deterministic (e.g Stephen’s method (Stephens, 2000) and the pivotal reordering algorithm (Marin et al., 2005)). A third class of strategies, called probabilistic relabelling algorithms, uses a probabilistic approach for addressing the label switching problem (Sperrin et al., 2010). A detailed overview of current methods for addressing label switching is presented in Papastamoulis (2015). In Section 3.1, we provide an example that illustrates the problem with the artificial identifiability constraint approach. Instead, we propose a deterministic strategy for addressing the label switching by selecting a parameter that has at least two well-separated components.

Addressing the label switching problem is especially important for the proposed sequential ABC algorithm because each time step’s ABC posterior is used as the proposal in the subsequent step of the algorithm so the label switching has to be resolved before using it as a proposal distribution. Algorithm 3 outlines the proposed strategy, and is carried out at the end of each iteration. The key aspect of Algorithm 3 is to select the parameter that has at least two well-separated components. To determine this, each set of parameters (e.g. the means, the variances, the mixture weights), is arranged an increasing order. For example, for each particle , , would be ordered so that , with as the order statistic; let represent the smallest mean particle value. This is carried out for each set of parameters with analogous notation.

The next step is to determine which set of parameters has the best separated components values. We propose first shifting and scaling each set of parameters to be supported within the range

so that scaling issues are mitigated and the parameter set values are comparable. One option for this adjustment is to use some distribution function, such as a Normal distribution with a mean and standard deviation equal to the sample mean and the sample standard deviation of the considered parameter set (e.g. the sample mean for the

’s is , and the sample standard deviation for the ’s is ). The resulting -smallest standardized value is, for the mixture mean, . This is carried out for each set of parameters with analogous notation.

Then, for each component of each ordered and standardized particle a representative value, such as a mean, is computed (e.g. is the representative value of the component of the mean parameter). This is carried out for each set of parameters with analogous notation. The pairwise distances (pdist) between the representative values within each parameter set is determined. The parameter set that has the largest separation between any two of its representative values is selected for the overall ordering of the particle system.

Other methods for addressing the label switching problem were considered. For example, rather than sorting based on the parameter set with largest separation between any two of its representative values, we considered basing the sorting on the parameter set with the largest separation between its two closest representative values (i.e. the maximum of the minimum separations); however, this sorting did not perform well empirically. The issue seemed to be that parameter set with the largest separation between its two closest representative values may actually have all of its components relatively close; after multiple iterations, none of the components would separate out from the other components. This lead to iteration after iteration of components that remained a blend of components rather than separating out into pure components. Overall, from empirical experiments, the algorithm outlined in Algorithm 3 performed the best and thus is our recommendation. However, we emphasize that alterations to this procedure may be necessary for mixture models that have additional structure or correlations among the parameters of the component distributions or to deal with multi–dimensional mixture models.

### 2.4 Summary statistics

Comparing the entire simulated dataset, , with the entire set of observations,

, in an ABC procedure is not computationally feasible. For this reason the definition of a lower dimensional summary statistic is necessary. For mixture models, due to the multimodality of the data, common summaries such as means or higher order moments do not capture relevant aspects of the distribution. However, an estimate of the density of the data will better account for its key features. We suggest using kernel density estimates of the simulated data,

, and the observations, , to summarize the data, and then the Hellinger distance () to carry out the comparison. The Hellinger distance quantifies the similarity between two density functions, and , and is defined as:

 H(f,g)=(∫(√f(y)−√g(y))2dy)12. (4)

At each iteration of the proposed ABC-PMC procedure, a proposed is accepted if , where is the tolerance.

## 3 Simulation study

In this Section a simulation study is presented to evaluate the behavior of the proposed ABC-PMC algorithm presented in Section 2 using popular GMM examples. In particular we are interested in evaluating the success of the procedure with respect to the label switching problem and the reliability of the Hellinger distance as a summary statistic. To determine the number of iterations, a stopping rule was defined based the Hellinger distance between sequential ABC posteriors; once the sequential Hellinger distance dropped below a threshold of for each of the marginal ABC posteriors, the algorithm was stopped.

### 3.1 Mixture Model with equal and unequal group sizes

The first example is taken from Mena and Walker (2015), which considered a GMM obtained by simulating data coming from groups of equal size which was designed for evaluating the performance with respect to the label switching problem. The 40 observations were simulated as follows:

 Yi=1,…,20∼ϕ(−20,1),Yi=21,…,40∼ϕ(20,1). (5)

The variance is assumed known for both groups and hence the parameters of interest are the mixture weights, and (with ), and the means . The prior distributions defined to run the analysis are the same as those used by Mena and Walker (2015), where the values of the hyperparameters are ,

 μi∣σi∼ϕ(0,1/100).

The prior for the mixture weights is the Dirichlet distribution with hyperparameters ,

 f=(fi,f2)∼Dirichlet(1,1).

The number of particles was set to and the quantile used for shrinking the tolerance is . The algorithm was stopped after iterations, since further reduction of the tolerance did not lead to an improvement of the ABC posterior distributions (evaluated by calculating the Hellinger distance between the sequential ABC posterior distributions as noted in the introduction to this section).

Figure 1 displays the resulting ABC posteriors (“ABC Posterior good LS”, the “ABC Posterior bad LS” is discussed later for illustrating the label switching issue) and the corresponding MCMC posteriors, which are used as a benchmark to access the performance of the proposed method. The ABC posteriors closely match the MCMC posteriors. The summary of the results presented in Table 1 demonstrate that the ABC posterior distributions are a suitable approximation of the MCMC posteriors. The Hellinger distance between the marginal ABC and the MCMC posteriors is also displayed in the last column of Table 1; the Hellinger distance between the MCMC and the ABC posterior is for the mixture weights and for the means. Figure 1: Comparison between the ABC and the MCMC marginal posterior distributions for the two-component GMM example from Mena and Walker (2015). The final ABC posteriors obtained using the label switching (LS) procedure proposed in §2.3 are the solid black lines (ABC Posterior good LS), and the naive approach that sorts based on the mixture weights are the solid cyan lines (ABC Posterior bad LS). The number of particles for the ABC analysis and the number of elements kept from the MCMC analysis (after the burn-in) are equal to 5000.

As noted in Section 2.3, the label switching problem has to be carefully addressed when using the ABC-PMC algorithm. For each time step following the initial step, the previous step’s ABC posterior is used as the proposal rather than the prior distribution so the procedure for addressing the label switching proposed in Section 2.3 is used at the end of each time step. In order to illustrate the consequences of incorrectly addressing the label switching, we ran the proposed ABC algorithm on the example proposed by Mena and Walker (2015), except rather than using the method proposed in Section 2.3, the ordering of the particle system is carried out using the ordering of the mixture weights; the mixture weights are equal in this example making them a poor choice for attempting to separate out the mixture components. The resulting ABC posteriors are displayed in Figure 1(cyan lines). The means, , of the mixture components are shuffled and not close to the MCMC posterior, while the mixture weights are sorted such a way all the elements of are smaller than and all the elements of are larger than .

To complete this first example, Mena and Walker (2015) added a third component to the mixture in (5), simulating five additional observations from a standard Normal distribution, obtaining a three-component GMM with known variances. The ABC-PMC algorithm was run with the same specifications as the first part of the example, but required 25 time steps to achieve our stopping rule.

Figure 2 shows the MCMC and the ABC posteriors for the weights and the means of the mixture components. The behavior of the ABC posterior distributions is consistent with their MCMC benchmarks. A summary of the results presented in Table 2 shows that the posterior means (and the posterior standard deviations) for the ABC posterior distributions are consistent with the ones retrieved using MCMC. Finally, in the third column, the Hellinger distances between the ABC and MCMC posteriors are provided. Figure 2: ABC and the MCMC marginal posterior distributions for the three-component GMM example from Mena and Walker (2015). The number of particles for the ABC analysis and the number of elements kept from the MCMC analysis (after the burn-in) are equal to 5000.

### 3.2 Mixture Model with unequal group size

Even in those cases in which the definition of the mixture model does not lead to the label switching problem, a second category of issues related to the multimodality of the likelihood function is present. This behavior has been studied from both the frequentist and the Bayesian viewpoints. In particular, Marin et al. (2005) defined the following simple two-component mixture model to illustrate the multimodality issue:

 f⋅ϕ(μ1,1)+(1−f)⋅ϕ(μ2,1), (6)

where the weight is assumed known and different from (avoiding the label switching problem). According to the specifications by Marin et al. (2005), samples were drawn from (6), with . The bimodality of the likelihood function (Figure 3) makes the use of both the EM algorithm (Dempster et al., 1977) and the Gibbs Sampler (Diebolt and Robert, 1994) risky, because their success depends on the set of initial values selected for initiating the algorithms.

The Population Monte Carlo (PMC) sampler (Cappé et al., 2004; Marin et al., 2005) is used as a benchmark for the proposed ABC-PMC solution. Figure 3 displays the log likelihood function (note the two modes), and the final ABC posteriors with MCMC posteriors using good and bad starting values along with the posteriors using the PMC algorithm. Table 3 lists the means for the final ABC, MCMC, and PMC posteriors for and , along with the Hellinger distance between the final ABC-PMC posteriors and the PMC posterior. The ABC-PMC posteriors closely match the PMC posteriors. Figure 3: (top) The log-likelihood surface of the Gaussian mixture model proposed by Marin et al. (2005). There are two modes in the log-likelihood function, one close to the true value, (0,2.5), and a second local mode. (bottom) The marginal ABC, PMC, and MCMC posterior distributions; the displayed MCMC posteriors include the results for good initial starting values (MCMC Posterior (good initial choice)) and bad initial starting values (MCMC Posterior (bad initial choice)).

## 4 Application to Galaxy Data

The galaxy dataset was introduced to the statistical community in Roeder (1990), and has been commonly used for testing clustering methods. The data contain the recessional velocities of galaxies (km/sec) from six well separated sections of the Corona Borealis region. In the last twenty years this dataset has been studied in a number of papers (Richardson and Green 1997; Roeder and Wasserman 1997; Lau and Green 2007; Wang and Dunson 2011). The recessional velocities of the galaxies are typically considered realizations of independent and identically distributed Normal random variables, but there is discrepancy in their conclusions about the number of groups in the GMM; estimates vary from three components (Roeder and Wasserman, 1997) to five or six (Richardson and Green, 1997).

In this analysis, we focused on the model with three components (Roeder and Wasserman, 1997), in order to be consistent with Mena and Walker (2015) and Marin et al. (2005). For the hyperparameters, we used the Empirical Bayes approach suggested by Casella et al. (2004). Additionally, since each recessional velocity was also assigned a measurement error, the ABC forward model has been modified to take into account this information. In order to include the measurement errors in the forward model, each simulated recessional velocity is assigned one of the observed measurement errors. The simulated and observed recessional velocities were matched according to their ranks, and the measurement error of the observations were assigned to the simulated data according to this matching. Then, Gaussian noise was added to each simulated recessional velocity with a standard deviation equal to its assigned measurement error.

The posterior means for each component’s parameters are listed in Table 4. The third component was found to have a weight equal to , and a mean and variance equal to and , respectively. The main difference with the proposed ABC-PMC estimates and those from Marin et al. (2005), in which they also fixed the number of components to three, is with the estimated mean and variance of the third component; however, the proposed ABC-PMC estimates are consistent with those obtained by Mena and Walker (2015).

By using the additional information about the measurement errors, the proposed ABC-PMC algorithm can provide a more accurate evaluation of the dataset. Including measurement errors in the forward model affects the resulting ABC posterior as reported in Table 4 and the relative estimates plotted in orange in Figure 4. In particular, the variance of the estimated posterior means are reduced, which is a positive consequence of appropriately accounting for the extra uncertainty in the data. Figure 4: Histogram of the recessional velocity of 82 galaxies and the estimated three-component Gaussian mixture models for each study. The posterior means for the mixture weights, means and variances used are displayed in Table 4.

## 5 Concluding Remarks

The recent popularity of ABC is, at least in part, due to its capacity to handle complex models. Extensions of the basic algorithm have lead to improved efficiency of the sampling, such as the ABC-PMC algorithm of Beaumont et al. (2009). We propose an ABC-PMC algorithm that can successfully handle finite mixture models. Some of the challenges with inference for finite mixture models are due to the complexity of the likelihood function including its possible multimodality, and the exchangeability of the mixture component labels leading to the label switching problem. Fortunately, ABC can handle complicated likelihood functions, but the label switching problem must be addressed. We suggest a procedure for addressing the label switching problem within the proposed ABC algorithm that works well empirically. Some additional challenges with using ABC for mixture models include the selection of informative summary statistics, and defining a kernel for moving the mixture weights since they are constrained to be between 0 and 1 and must sum to 1. For the summary statistics, we propose using the Hellinger distance between kernel density estimates of the real and simulated observations; this allows the multimodality of the data to be accounted for and compared between the two sets of data. We propose a Dirichlet resampling algorithm for moving the mixture component weights that preserves some information from the sampled particle, but also improves the efficiency of the ABC-PMC procedure (by not having to draw from the Dirichlet prior at each time step).

The proposed ABC algorithm has been explored and tested empirically using popular examples from the literature. The resulting ABC posteriors were compared to the corresponding MCMC posteriors, and in all cases considered the proposed ABC and MCMC posteriors were very similar. We also considered the galaxy velocity data from the Corona Borealis Region (Roeder, 1990), which is commonly used in assessing the performance of mixture models. An advantage of ABC over other commonly used methods is that the forward model can be easily expanded to better represent the physical process that is being modeled. For the galaxy data, measurement errors are available in the original data set, but are not generally used when analyzing the data. We extend the proposed ABC-PMC forward model used in this example to include the measurement errors, which provides a more accurate assessment of the uncertainty in the data.

Though the presented examples focused on one–dimensional GMMs, the component distributions can be easily changed to other distributions in the ABC forward model. Overall, the proposed ABC-PMC algorithm performs well and is able to recover the benchmark MCMC posteriors suggesting that ABC is a viable approach for carrying out inference for finite mixture models.

## 6 Acknowledgements

We would like to thanks the Yale Center for Research Computing for the resources we used for producing this paper.

## References

• Beaumont et al. (2002) M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162:2025 – 2035, 2002.
• Beaumont et al. (2009) M. A. Beaumont, J.-M. Cornuet, J.-M. Marin, and C. P. Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983 – 990, 2009.
• Blum et al. (2013) M. Blum, M. Nunes, D. Prangle, and S. Sisson. A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2):189 – 208, 2013.
• Blum (2010a) M. G. Blum. Approximate Bayesian computation: A nonparametric perspective. Journal of American Statistical Association, 105(491):1178 – 1187, 2010a.
• Blum (2010b) M. G. Blum. Choosing the summary statistics and the acceptance rate in approximate bayesian computation. In COMPSTAT 2010: Proceedings in Computational Statistics, page 47?56, 2010b.
• Cappé et al. (2004) O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert. Population monte carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.
• Casella et al. (2002) G. Casella, K. L. Mengersen, C. P. Robert, and D. M. Titterington. Perfect samplers for mixtures of distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):777–790, 2002.
• Casella et al. (2004) G. Casella, C. P. Robert, and M. T. Wells. Mixture models, latent variables and partitioned importance sampling. Statistical Methodology, 1(1):1–18, 2004.
• Csilléry et al. (2010) K. Csilléry, M. G. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian computation (ABC) in practice. Trends in ecology & evolution, 25(7):410 – 418, 2010.
• Dempster et al. (1977) A. P. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the em algorithm. J. Roy. Statist. Soc. Ser. B, 1(39):1 – 38, 1977.
• Diebolt and Robert (1994) J. Diebolt and C. P. Robert. Estimation of finite mixture distributions through bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological), pages 363–375, 1994.
• Frühwirth-Schnatter (2006) S. Frühwirth-Schnatter. Finite mixture and Markov switching models. Springer Science & Business Media, 2006.
• Gianola et al. (2006) D. Gianola, B. Heringstad, and J. Odegaard. On the quantitative genetics of mixture characters. Genetics, 173(4):2247–2255, 2006.
• Henderson and Shimakura (2003) R. Henderson and S. Shimakura. A serially correlated gamma frailty model for longitudinal count data. Biometrika, 90(2):355–366, 2003.
• Jasra et al. (2007) A. Jasra, D. Stephens, and C. Holmes. On population-based simulation for static inference. Statistics and Computing, 17(3):263–279, 2007.
• Joyce and Marjoram (2008) P. Joyce and P. Marjoram. Approximately sufficient statistics and Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 7(1):1 – 16, 2008.
• Lancaster and Imbens (1996) T. Lancaster and G. Imbens. Case-control studies with contaminated controls. Journal of Econometrics, 71(1):145–160, 1996.
• Lau and Green (2007) J. W. Lau and P. J. Green. Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16(3):526–558, 2007.
• Lin et al. (2012) C.-Y. Lin, Y. Lo, and K. Q. Ye. Genotype copy number variations using gaussian mixture models: Theory and algorithms. Statistical applications in genetics and molecular biology, 11(5), 2012.
• Marin et al. (2005) J.-M. Marin, K. Mengersen, and C. P. Robert. Bayesian modelling and inference on mixtures of distributions. Handbook of statistics, 25:459–507, 2005.
• Marin et al. (2012) J.-M. Marin, P. Pudlo, C. P. Robert, and R. J. Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167 – 1180, 2012.
• McLachlan and Peel (2004) G. McLachlan and D. Peel. Finite mixture models. John Wiley & Sons, 2004.
• Mena and Walker (2015) R. H. Mena and S. G. Walker. On the bayesian mixture model and identifiability. Journal of Computational and Graphical Statistics, 24(4):1155–1169, 2015.
• Miyahara et al. (2016) H. Miyahara, K. Tsumura, and Y. Sughiyama. Relaxation of the em algorithm via quantum annealing for gaussian mixture models. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 4674–4679. IEEE, 2016.
• Moral et al. (2011) P. D. Moral, A. Doucet, and A. Jasra. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing, 22(5):1009–1020, 2011.
• Naim and Gildea (2012) I. Naim and D. Gildea. Convergence of the em algorithm for gaussian mixtures with unbalanced mixing coefficients. arXiv preprint arXiv:1206.6427, 2012.
• Papastamoulis (2015) P. Papastamoulis. label. switching: An r package for dealing with the label switching problem in mcmc outputs. arXiv preprint arXiv:1503.02271, 2015.
• Pearson (1894a) K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894a.
• Pearson (1894b) K. Pearson.

Mathematical contributions to the theory of evolution. ii. skew variation in homogeneous material.

Proceedings of the Royal Society of London, 57(340-346):257–260, 1894b.
• Prangle (2015) D. Prangle. Summary statistics in approximate bayesian computation. arXiv preprint arXiv:1512.05633, 2015.
• Pritchard et al. (1999) J. K. Pritchard, M. T. Seielstad, and A. Perez-Lezaun. Population growth of human Y chromosomes: A study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791 – 1798, 1999.
• Ratmann et al. (2013) O. Ratmann, A. Camacho, A. Meijer, and G. Donker. Statistical modelling of summary values leads to accurate approximate Bayesian computations. Unpublished, 2013.
• Richardson and Green (1997) S. Richardson and P. J. Green. On bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: series B (statistical methodology), 59(4):731–792, 1997.
• Robert (2007) C. Robert. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.
• Rodríguez and Walker (2014) C. E. Rodríguez and S. G. Walker. Label switching in bayesian mixture models: Deterministic relabeling strategies. Journal of Computational and Graphical Statistics, 23(1):25–45, 2014.
• Roeder (1990) K. Roeder. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85(411):617–624, 1990.
• Roeder and Wasserman (1997) K. Roeder and L. Wasserman. Practical bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92(439):894–902, 1997.
• Rousseau and Mengersen (2011) J. Rousseau and K. Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689–710, 2011.
• Shabram et al. (2014) M. Shabram, B.-O. Demory, J. Cisewski, E. B. Ford, and L. Rogers. The eccentricity distribution of short-period planet candidates detected by kepler in occultation. Preparing for submission., 2014.
• Sisson et al. (2007) S. A. Sisson, Y. Fan, and M. M. Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Science, 104(6):1760 – 1765, 2007.
• Sperrin et al. (2010) M. Sperrin, T. Jaki, and E. Wit. Probabilistic relabelling strategies for the label switching problem in bayesian mixture models. Statistics and Computing, 20(3):357–366, 2010.
• Stephens (2000) M. Stephens. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795–809, 2000.
• Stoneking (2014) C. J. Stoneking. Bayesian inference of gaussian mixture models with noninformative priors. arXiv preprint arXiv:1405.4895, 2014.
• Wang and Dunson (2011) L. Wang and D. B. Dunson. Fast bayesian inference in dirichlet process mixture models. Journal of Computational and Graphical Statistics, 20(1):196–216, 2011.
• Wasserman (2000) L. Wasserman. Asymptotic inference for mixture models using data dependent priors. Journal of the Royal Statistical Society, 12(62):159–180, 2000.
• Wu et al. (2015) G. Wu, S. H. Holan, C. H. Nilon, C. K. Wikle, et al. Bayesian binomial mixture models for estimating abundance in ecological monitoring studies. The Annals of Applied Statistics, 9(1):1–26, 2015.