1 Introduction
Consider a mixture model with density function given by

$$p_F(x) = \int k(x \mid u) \, dF(u), \qquad (1)$$

where $k(\cdot \mid u)$ is a known kernel density and $F$ is an unknown mixing distribution. The goal is estimation of $F$ based on independent and identically distributed data $X_1, \ldots, X_n$ from the mixture in (1), a classically challenging problem in statistics. If $F$ is a discrete distribution with a fixed and finite number of components, then (1) is a finite mixture model and estimation is relatively straightforward; indeed, maximum likelihood computation is feasible with the EM algorithm (Dempster et al. 1977) and the usual asymptotic theory is available (Redner and Walker 1984). The catch is that the number of mixture components can be difficult to specify. Therefore, there has also been a lot of work on finite mixtures with an unknown number of components (e.g., Miller and Harrison 2017; Rousseau and Mengersen 2011; Leroux 1992; James et al. 2001; Woo and Sriram 2006; Richardson and Green 1997; Roeder and Wasserman 1997).
Likelihood-based methods for estimating $F$ are available even without explicitly making the problem finite-dimensional. Indeed, the likelihood function for $F$ in the nonparametric case is

$$L(F) = \prod_{i=1}^n p_F(X_i), \qquad (2)$$

and it is known that the maximizer $\hat F$, the nonparametric maximum likelihood estimator (MLE), is discrete, with at most $n$ support points (Lindsay 1983, 1995). Discreteness simplifies computation, and fast algorithms are available, e.g., Wang (2007) and Koenker and Mizera (2014). However, if $F$ is believed to have a density $f$ with respect to, say, Lebesgue measure, then the discrete estimator may not be satisfactory. To remedy this, various smoothed versions of $\hat F$ have been proposed (e.g., Eggermont and LaRiccia 1995, 1997; Silverman et al. 1990; Green 1990; Laird and Louis 1991; Liu et al. 2009), but some of these are rather complicated and there seems to be no general consensus that one smoothing method is better than another.
For Bayesian mixture models, the Dirichlet process prior (Ferguson 1973) and variants of its stick-breaking representation (Sethuraman 1994) have become a mainstay, largely because of the plethora of powerful Markov chain Monte Carlo methods available for evaluating the corresponding posterior (e.g., Escobar and West 1995; MacEachern 1998; Dunson and Park 2007; Walker 2007; Kalli et al. 2011). The focus of these developments, however, has been the mixture density, with the mixing distribution serving merely as a modeling tool; but see Nguyen (2013). As with the nonparametric MLE, the inherent discreteness of stick-breaking priors, while advantageous for mixture density estimation and modeling latent structures, is problematic in the context of nonparametric Bayesian estimation of a mixing density.

A Bayesian-style recursive estimate of $f$, called predictive recursion, was proposed by Newton (2002) and studied theoretically by Martin and Ghosh (2008), Tokdar et al. (2009), and Martin and Tokdar (2009). The algorithm is fast and produces an estimator having a smooth density with respect to any specified dominating measure. However, its dependence on the (arbitrary) order in which the data are processed, which in particular means it is not a Bayes estimator, together with the fact that it cannot be characterized as the optimizer of any objective function, makes the predictive recursion estimator difficult to interpret.
In this paper, we investigate properties of a simple and fast iterative algorithm for estimating $f$, one that shares certain features with the MLE, a Bayesian approach, and predictive recursion. Versions of this algorithm have been presented in the literature before, and its convergence properties have been conjectured but not rigorously proved. Here we fill this gap by proving that the algorithm converges to the nonparametric MLE as the number of iterations approaches infinity. While the limit is a discrete distribution, it is interesting that, at every finite number of iterations, the algorithm returns a continuous density. This suggests that a smooth near-MLE of the mixing density can be obtained by stopping the algorithm before convergence is achieved, so we propose a data-driven stopping rule and demonstrate empirically the quality of the resulting nonparametric near-MLE of the mixing density compared to predictive recursion.
2 A simple and fast algorithm
2.1 Definition
As discussed above, our focus is on estimating a smooth mixing density $f$ associated with the mixing distribution $F$ in (1); throughout, we assume that $f$ is a density with respect to Lebesgue measure, though other choices of dominating measure could be handled similarly. Given a prior guess $f_0$ of $f$, if a data point $x$ is observed, then the Bayesian update of $f_0$ to $f_1$, say, is

$$f_1(u) = f_0(u) \, \frac{k(x \mid u)}{p_{f_0}(x)}, \qquad (3)$$

where $p_{f_0}(x) = \int k(x \mid u) \, f_0(u) \, du$. However, we can carry out this single-observation update for any of the observations $X_1, \ldots, X_n$ and, since the ordering of the observations is irrelevant, it is reasonable to take an average:

$$f_1(u) = f_0(u) \cdot \frac{1}{n} \sum_{i=1}^n \frac{k(X_i \mid u)}{p_{f_0}(X_i)}.$$
This same argument can be applied, with $f_0$ replaced by $f_1$, to get an updated estimate $f_2$, and so on. This suggests the following iterative algorithm for an estimator of $f$:

$$f_{t+1}(u) = f_t(u) \cdot \frac{1}{n} \sum_{i=1}^n \frac{k(X_i \mid u)}{p_{f_t}(X_i)}, \quad t \geq 0, \qquad (4)$$

where $p_g(x) = \int k(x \mid u) \, g(u) \, du$ for a general density $g$. Algorithms similar to (4) for certain applications or models have appeared in the literature; see, e.g., Vardi et al. (1985), Laird and Louis (1991), and Vardi and Lee (1993). But despite the hints in these papers about more general versions, it seems that the algorithm (4) has not been studied thoroughly and at the level of generality considered here.
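To make the iteration concrete, here is a minimal numerical sketch of (4) on a fixed grid, assuming a standard normal kernel and simulated data; the grid, sample, and function names are illustrative choices, not part of the formal development. Working with weights $w_j \approx f(u_j)\,du$ turns each pass into a couple of matrix-vector computations:

```python
import numpy as np

# Sketch of the iteration in (4) on a fixed grid (illustrative only):
# w[j] approximates f(u_j) * du, and K[i, j] = k(X_i | u_j).
def iterate_mixing(w, K, steps):
    for _ in range(steps):
        p = K @ w                               # p_f(X_i), mixture density at each datum
        w = w * (K / p[:, None]).mean(axis=0)   # average posterior weight, as in (4)
    return w

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
grid = np.linspace(-6, 6, 241)
du = grid[1] - grid[0]
K = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2 * np.pi)
w = iterate_mixing(np.full(grid.size, 1.0 / grid.size), K, steps=100)
f_hat = w / du                                  # smooth estimate of the mixing density
```

Note that each pass preserves total mass exactly: averaging $k(X_i \mid u)/p_f(X_i)$ against the current $f$ integrates to one for every $i$, so no renormalization is needed.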
Aside from this Bayesian-motivated formulation, there are a number of ways to think about this algorithm and understand what it is trying to do. First, note the similarities with the predictive recursion algorithm of Newton (2002), which updates the current guess by taking a weighted average of it and the single-observation Bayes update (3) based on the current guess as the prior. Those computations proceed along the data sequence $X_1, \ldots, X_n$ and, therefore, the result depends on the arbitrary order of that sequence. The proposed algorithm (4) can, therefore, be viewed as an order-invariant version of predictive recursion, one that can also be refined ad infinitum, by taking $t \to \infty$, if desired.
Second, suppose that $f_t$ converges, in some sense, to a limit $f_\infty$. Then that limit, in particular, its corresponding mixture $p_{f_\infty}$, must satisfy

$$\frac{1}{n} \sum_{i=1}^n \frac{k(X_i \mid u)}{p_{f_\infty}(X_i)} = 1 \quad \text{for all } u \text{ in the support of } f_\infty.$$

This condition boils down to one involving the directional derivative of the log-likelihood function $\ell(F) = \log L(F)$, for $L$ as in (2), which, according to Theorem 19 in Lindsay (1995), characterizes the nonparametric MLE. Therefore, we can expect that, if the limit exists, then it must be the nonparametric MLE, and details supporting this claim are presented in Section 2.3. However, for all finite $t$, the continuity of the initial guess persists, so stopping at some finite $t$ yields a sort of "smoothed" nonparametric MLE; see Section 3.
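This stationarity property is easy to probe numerically. The sketch below (normal kernel, simulated data, grid discretization, all illustrative assumptions) runs the iteration for many passes and inspects the surface $D(u) = n^{-1}\sum_i k(X_i \mid u)/p_f(X_i)$, which, per Lindsay's gradient characterization, should sit at or below 1 everywhere and near 1 wherever the limit puts mass:

```python
import numpy as np

# After many passes of (4), D(u) = mean_i k(X_i | u) / p_f(X_i) should
# hover near 1 on the (near-)support of the limit and sit below 1 elsewhere.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1.5, 1, 80), rng.normal(1.5, 1, 80)])
grid = np.linspace(-5, 5, 201)
K = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2 * np.pi)
w = np.full(grid.size, 1.0 / grid.size)   # flat initial mixing weights on the grid
for _ in range(3000):
    p = K @ w
    w = w * (K / p[:, None]).mean(axis=0)
D = (K / (K @ w)[:, None]).mean(axis=0)   # directional-derivative surface over the grid
```

Grid points where $D(u) > 1$ are still gaining mass under (4), so, at (approximate) convergence, $D$ cannot exceed 1 by more than a numerical tolerance.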
2.2 Illustration
Example 1.2 in Böhning (2000) presents data on the number of illness spells for 602 preschool children in Thailand over a two-week period; see, also, Wang (2007, Table 1). Böhning argues that a Poisson model provides a poor fit, so he proposes a nonparametric Poisson mixture model instead. Wang produces the nonparametric MLE for the mixing distribution displayed in the first five panels of Figure 1 (vertical gray lines). These panels also show the mixing density estimates from (4) for five different stopping times, based on a uniform initial guess. It is interesting that the estimate very quickly forgets the shape of the uniform initial guess, that at 500 iterations it has mostly identified the support locations of the nonparametric MLE, but that it takes many more iterations before it starts to clearly concentrate around those locations. Most interesting, however, is that the log-likelihood trajectory is nondecreasing and very quickly jumps up to near $\ell(\hat F)$, the maximum possible value; in fact, the relative difference between the log-likelihood at an early stopping time and the maximum is about 0.003, close to satisfying any reasonable convergence criterion. So, at least in the sense of the likelihood value, the early-stopped estimate is virtually as good as the nonparametric MLE, and can be computed very quickly and simply; plus, it is also a continuous density.
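The qualitative behavior just described is easy to reproduce. The sketch below uses a Poisson kernel with simulated counts (a stand-in for the Thailand data, which we do not reproduce here) and records the log-likelihood along the iterations; the trajectory is nondecreasing, with most of the gain coming in the first handful of passes:

```python
import numpy as np
from math import lgamma

# Poisson-kernel version of (4) on a grid of candidate means; the counts
# are simulated (not Bohning's data), and loglik tracks the trajectory.
rng = np.random.default_rng(1)
x = rng.poisson(rng.choice([1.0, 5.0], size=300))
grid = np.linspace(0.05, 12.0, 240)
logpmf = (x[:, None] * np.log(grid[None, :]) - grid[None, :]
          - np.array([lgamma(v + 1.0) for v in x])[:, None])
K = np.exp(logpmf)                        # K[i, j] = Poisson pmf of x_i at mean u_j
w = np.full(grid.size, 1.0 / grid.size)   # flat initial mixing weights
loglik = []
for _ in range(500):
    p = K @ w
    loglik.append(np.sum(np.log(p)))
    w = w * (K / p[:, None]).mean(axis=0)
```

Plotting `loglik` against the iteration count gives a curve of the same shape as the trajectory described above: a steep initial rise followed by a long, nearly flat tail.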
2.3 Properties
This section seeks to give a rigorous demonstration of two key results suggested in Sections 2.1–2.2 above. First is the conjecture coming from Panel (f) in Figure 1, namely, that the likelihood function is nondecreasing as a function of $t$ in (4), which implies that the iterations never move downward on the likelihood surface.
Proposition 1.
For $f_t$ defined as in (4), the likelihood sequence $t \mapsto L(f_t)$ is nondecreasing.
Proof.
Write the iterates in (4) as

$$f_{t+1}(u) = \int q_t(u \mid x) \, d\mathbb{P}_n(x), \qquad q_t(u \mid x) = \frac{k(x \mid u) \, f_t(u)}{p_{f_t}(x)},$$

where $\mathbb{P}_n$ is the empirical distribution of $X_1, \ldots, X_n$. Then the log-likelihood satisfies

$$n^{-1}\{\ell(f_{t+1}) - \ell(f_t)\} = \int \log \frac{p_{f_{t+1}}(x)}{p_{f_t}(x)} \, d\mathbb{P}_n(x) = \int \log \int \frac{f_{t+1}(u)}{f_t(u)} \, q_t(u \mid x) \, du \, d\mathbb{P}_n(x).$$

Applying Jensen's inequality, the right-hand side is lower-bounded by

$$\int\!\!\int \log \frac{f_{t+1}(u)}{f_t(u)} \, q_t(u \mid x) \, du \, d\mathbb{P}_n(x) = \int f_{t+1}(u) \log \frac{f_{t+1}(u)}{f_t(u)} \, du,$$

which is the Kullback–Leibler divergence of $f_t$ from $f_{t+1}$. Since the latter is nonnegative, it follows that $\ell(f_{t+1}) \geq \ell(f_t)$, as was to be shown. ∎

We have opted here for a direct proof of nondecreasing likelihood, though this could also be verified by confirming that (4) is an EM algorithm with parameter space $\mathbb{F}$, the set of all probability measures on the mixing parameter space $\mathbb{U}$ having strictly positive densities. While a characterization of (4) as an EM algorithm may not be new—e.g., Laird and Louis (1991) give a similar characterization of a special case of (4)—there appear to be no general results on whether the sequence $f_t$ actually converges to the nonparametric MLE, $\hat F$. Of course, not moving downward on the likelihood surface suggests that the algorithm converges to $\hat F$, but a proof requires some care. Of critical importance is that, since $\mathbb{F}$ is convex, the general results in Csiszár and Tusnády (1984, Sec. 5A) imply that

$$\ell(f_t) \to \sup_{f \in \mathbb{F}} \ell(f) = \ell(\hat F) \quad \text{as } t \to \infty, \qquad (5)$$

a necessary condition for $f_t$ to converge to $\hat F$.
Proposition 2.
Suppose, for each $x$, that $u \mapsto k(x \mid u)$ is a continuous and strictly positive map on $\mathbb{U}$. Also, assume that for every $\varepsilon > 0$ there exists a compact $K_\varepsilon \subset \mathbb{U}$ such that $k(X_i \mid u) \leq \varepsilon$ for each $i$ and all $u \notin K_\varepsilon$. If the nonparametric MLE $\hat F$ exists and is unique, then $F_t \to \hat F$ weakly as $t \to \infty$, where $F_t$ denotes the distribution with density $f_t$.
Proof.
Since a continuous function on a compact set is bounded, the map $u \mapsto k(X_i \mid u)$ is bounded on any compact subset of $\mathbb{U}$ for each $i$; combined with the assumed tail condition, this gives $k(X_i \mid u) \leq M$ for some $M < \infty$, hence $p_{f_t}(X_i) \leq M$. Since $\ell(f_t) \geq \ell(f_1)$ by (5), we have $p_{f_t}(X_i) \geq c$ for some constant $c > 0$, for each $i$ and $t$. For $\varepsilon \in (0, c)$, let $K_\varepsilon$ be a compact set such that $k(X_i \mid u) \leq \varepsilon$ for every $i$ and $u \notin K_\varepsilon$; then (4) gives $F_{t+1}(K_\varepsilon^c) \leq (\varepsilon/c) \, F_t(K_\varepsilon^c)$. Then, there exists an integer $T$ such that $F_t(K_\varepsilon^c) \leq \varepsilon$ for every $t \geq T$, and the finitely many $F_1, \ldots, F_{T-1}$ are individually tight. It follows that $(F_t)$ is a uniformly tight sequence of probability measures.

By Prokhorov's theorem, every subsequence of $(F_t)$ has a further subsequence that converges weakly to a probability measure $F^\star$ in $\overline{\mathbb{F}}$, the set of all probability measures on $\mathbb{U}$. Since $k(X_i \mid \cdot)$ is continuous and bounded on $\mathbb{U}$, the map $F \mapsto \ell(F)$ is also continuous with respect to the topology of weak convergence on $\overline{\mathbb{F}}$. Therefore,

$$\ell(F^\star) = \lim_{t \to \infty} \ell(f_t) = \ell(\hat F).$$

It follows from (5) that $F^\star$ is a nonparametric MLE. We assumed that $\hat F$ is the unique nonparametric MLE, so every subsequence of $(F_t)$ has a further subsequence converging weakly to $\hat F$, which implies $F_t \to \hat F$ weakly. ∎
While convergence of (4) to the nonparametric MLE is important for explaining and justifying the method itself, the EM algorithm's convergence to the maximizer is known to be relatively slow (e.g., Jamshidian and Jennrich 1997; Meng and van Dyk 1997). In fact, Figure 1 reveals that we need more than 5000 iterations of (4) in order to reach the MLE. However, since we are interested in estimating a density, we prefer to stop the algorithm short of convergence to the discrete $\hat F$. So Propositions 1–2, along with the speed at which the iterations approach the maximum, as demonstrated in Panel (f) of Figure 1, suggest that $f_T$, for a suitable stopping time $T$, is a near-maximum likelihood estimator (nMLE) of the mixing density.
3 A smooth nonparametric nearMLE
3.1 Proposal
As described above, for the purpose of maintaining smoothness of the mixing density estimator, we have reason to stop the algorithm at a relatively small iteration count $T$. But we want to ensure that the likelihood at $f_T$ is sufficiently large so that $f_T$ can be understood as a near-MLE, or nMLE. Towards this, we propose the following stopping rule:

$$T = \min\Bigl\{ t \geq 1 : \frac{\tilde\ell - \ell(f_t)}{|\tilde\ell|} \leq \epsilon \Bigr\}, \qquad (6)$$

where $\epsilon > 0$ is a small constant to be specified, and $\tilde\ell$ is the log-likelihood evaluated at some external estimate of the mixture density $p_F$. One option is for $\tilde\ell$ to be the log-likelihood evaluated at the actual nonparametric MLE, but a reasonable alternative is to take $\tilde\ell$ based on a simple kernel estimate of $p_F$. In the experiments that follow, we take $\tilde\ell$ based on the default kernel estimator in the density function in R.
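For illustration, the stopping rule in (6) can be sketched as follows, assuming a normal kernel, simulated data, and a plain Gaussian kernel density estimate with the normal-reference bandwidth standing in for R's density; the tolerance value here is an arbitrary choice:

```python
import numpy as np

# Sketch of stopping rule (6): iterate (4) until the log-likelihood is
# within a relative tolerance eps of a benchmark from a simple Gaussian KDE.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
n = x.size

h = 1.06 * x.std() * n ** (-0.2)                # normal-reference bandwidth
Kx = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
ell_tilde = np.sum(np.log(Kx.mean(axis=1)))     # benchmark log-likelihood

grid = np.linspace(-6, 6, 201)
K = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2 * np.pi)
w = np.full(grid.size, 1.0 / grid.size)
eps, T = 0.02, None                             # eps is an arbitrary choice here
for t in range(1, 1001):
    p = K @ w
    w = w * (K / p[:, None]).mean(axis=0)
    ell = np.sum(np.log(K @ w))
    if (ell_tilde - ell) / abs(ell_tilde) <= eps:
        T = t
        break
```

At termination, `w / (grid[1] - grid[0])` is the smooth near-MLE of the mixing density on the grid.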
3.2 Numerical results
Here we carry out a number of simulations to demonstrate the performance of our proposed nonparametric nMLE of the mixing density. In each case, we take the data to be an iid sample of size from the mixture density . Three different kernels will be considered:

- Kernel 1. ;
- Kernel 2. ;
- Kernel 3. .

We will also consider three different mixing densities, supported (roughly) on :

- Mixing 1. ;
- Mixing 2. ;
- Mixing 3. .
For each kernel and mixing density pair, 100 datasets are generated, and mixing density estimates are computed using the proposed nMLE based on stopping rule (6) above; all are based on the same initial guess. Figure 2 summarizes the results of these simulations, plotting the 100 mixing density estimates for the individual datasets, the truth, the pointwise average of those 100 estimates, and the pointwise one-standard-deviation range around the average. In all cases, although there is variability around the truth in each individual run, as is expected, our proposed estimator is quite accurate overall. Also, the stopping time $T$ defined by (6) was less than 5 in all of these runs. For comparison, we also produced the mixing density estimator based on the predictive recursion algorithm described briefly in Section 1. Figure 3 shows the predictive recursion to nMLE relative error. Since these values all tend to be greater than one for all kernel and mixing density pairs, we conclude that the nMLE tends to be more accurate than predictive recursion.

Acknowledgments
This work is partially supported by the National Science Foundation under grants DMS–1612891 and DMS–1737929.
References
 Böhning (2000) Böhning, D. (2000). Computer-assisted Analysis of Mixtures and Applications: Meta-analysis, Disease Mapping, and Others. Chapman and Hall–CRC, Boca Raton.
 Csiszár and Tusnády (1984) Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statist. Decisions, (suppl. 1):205–237. Recent results in estimation theory and related topics.
 Dempster et al. (1977) Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B, 39(1):1–38.
 Dunson and Park (2007) Dunson, D. B. and Park, J.-H. (2007). Kernel stick-breaking processes. Biometrika, 95:307–323.
 Eggermont and LaRiccia (1995) Eggermont, P. P. B. and LaRiccia, V. N. (1995). Maximum smoothed likelihood density estimation for inverse problems. Ann. Statist., 23(1):199–220.
 Eggermont and LaRiccia (1997) Eggermont, P. P. B. and LaRiccia, V. N. (1997). Nonlinearly smoothed EM density estimation with automated smoothing parameter selection for nonparametric deconvolution problems. J. Amer. Statist. Assoc., 92(440):1451–1458.
 Escobar and West (1995) Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc., 90(430):577–588.
 Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., 1:209–230.
 Green (1990) Green, P. J. (1990). On use of the EM algorithm for penalized likelihood estimation. J. Roy. Statist. Soc. Ser. B, 52(3):443–452.
 James et al. (2001) James, L. F., Priebe, C. E., and Marchette, D. J. (2001). Consistent estimation of mixture complexity. Ann. Statist., 29(5):1281–1296.
 Jamshidian and Jennrich (1997) Jamshidian, M. and Jennrich, R. I. (1997). Acceleration of the EM algorithm by using quasi-Newton methods. J. Roy. Statist. Soc. Ser. B, 59(3):569–587.
 Kalli et al. (2011) Kalli, M., Griffin, J. E., and Walker, S. G. (2011). Slice sampling mixture models. Stat. Comput., 21(1):93–105.
 Koenker and Mizera (2014) Koenker, R. and Mizera, I. (2014). Convex Optimization, Shape Constraints, Compound Decisions, and Empirical Bayes Rules. J. Amer. Statist. Assoc., 109(506):674–685.
 Laird and Louis (1991) Laird, N. M. and Louis, T. A. (1991). Smoothing the nonparametric estimate of a prior distribution by roughening: a computational study. Comput. Statist. Data Anal., 12(1):27–37.
 Leroux (1992) Leroux, B. G. (1992). Consistent estimation of a mixing distribution. Ann. Statist., 20(3):1350–1360.
 Lindsay (1983) Lindsay, B. G. (1983). The geometry of mixture likelihoods: a general theory. Ann. Statist., 11(1):86–94.
 Lindsay (1995) Lindsay, B. G. (1995). Mixture Models: Theory, Geometry and Applications. IMS, Hayward, CA.
 Liu et al. (2009) Liu, L., Levine, M., and Zhu, Y. (2009). A functional EM algorithm for mixing density estimation via nonparametric penalized likelihood maximization. J. Comput. Graph. Statist., 18(2):481–504.
 MacEachern (1998) MacEachern, S. N. (1998). Computational methods for mixture of Dirichlet process models. In Dey, D., Müller, P., and Sinha, D., editors, Practical Nonparametric and Semiparametric Bayesian Statistics, volume 133 of Lecture Notes in Statist., pages 23–43. Springer, New York.
 Martin and Ghosh (2008) Martin, R. and Ghosh, J. K. (2008). Stochastic approximation and Newton’s estimate of a mixing distribution. Statist. Sci., 23(3):365–382.
 Martin and Tokdar (2009) Martin, R. and Tokdar, S. T. (2009). Asymptotic properties of predictive recursion: robustness and rate of convergence. Electron. J. Stat., 3:1455–1472.
 Meng and van Dyk (1997) Meng, X.L. and van Dyk, D. (1997). The EM algorithm—an old folksong sung to a fast new tune. J. Roy. Statist. Soc. Ser. B, 59(3):511–567. With discussion and a reply by the authors.
 Miller and Harrison (2017) Miller, J. and Harrison, M. (2017). Mixture models with a prior on the number of components. J. Amer. Statist. Assoc., to appear. arXiv:1502.06241
 Newton (2002) Newton, M. A. (2002). On a nonparametric recursive estimator of the mixing distribution. Sankhyā Ser. A, 64(2):306–322.
 Nguyen (2013) Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist., 41(1):370–400.
 Redner and Walker (1984) Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev., 26(2):195–239.
 Richardson and Green (1997) Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B, 59(4):731–792.
 Roeder and Wasserman (1997) Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixtures of normals. J. Amer. Statist. Assoc., 92(439):894–902.
 Rousseau and Mengersen (2011) Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. Roy. Statist. Soc. Ser. B, 73(5):689–710.
 Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica, 4(2):639–650.
 Silverman et al. (1990) Silverman, B. W., Jones, M. C., Nychka, D. W., and Wilson, J. D. (1990). A smoothed EM approach to indirect estimation problems, with particular reference to stereology and emission tomography. J. Roy. Statist. Soc. Ser. B, 52(2):271–324.
 Tokdar et al. (2009) Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009). Consistency of a recursive estimate of mixing distributions. Ann. Statist., 37(5A):2502–2522.
 Vardi and Lee (1993) Vardi, Y. and Lee, D. (1993). From image deblurring to optimal investments: maximum likelihood solutions for positive linear inverse problems. J. Roy. Statist. Soc. Ser. B, 55(3):569–612. With discussion.
 Vardi et al. (1985) Vardi, Y., Shepp, L. A., and Kaufman, L. (1985). A statistical model for positron emission tomography. J. Amer. Statist. Assoc., 80(389):8–37. With discussion.
 Walker (2007) Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Simulation Comput., 36(13):45–54.
 Wang (2007) Wang, Y. (2007). On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution. J. Roy. Statist. Soc. Ser. B, 69(2):185–198.
 Woo and Sriram (2006) Woo, M.J. and Sriram, T. N. (2006). Robust estimation of mixture complexity. J. Amer. Statist. Assoc., 101(476):1475–1486.