# Fast nonparametric near-maximum likelihood estimation of a mixing density

Mixture models are regularly used in density estimation applications, but the problem of estimating the mixing distribution remains a challenge. Nonparametric maximum likelihood produce estimates of the mixing distribution that are discrete, and these may be hard to interpret when the true mixing distribution is believed to have a smooth density. In this paper, we investigate an algorithm that produces a sequence of smooth estimates that has been conjectured to converge to the nonparametric maximum likelihood estimator. Here we give a rigorous proof of this conjecture, and propose a new data-driven stopping rule that produces smooth near-maximum likelihood estimates of the mixing density, and simulations demonstrate the quality empirical performance of this estimator.

## Authors

• 5 publications
• 32 publications
• 14 publications
08/28/2021

### Nonparametric estimation of the incubation time distribution

We discuss nonparametric estimators of the distribution of the incubatio...
10/12/2021

### Data Harmonization Via Regularized Nonparametric Mixing Distribution Estimation

Data harmonization is the process by which an equivalence is developed b...
12/13/2020

### Process monitoring based on orthogonal locality preserving projection with maximum likelihood estimation

By integrating two powerful methods of density reduction and intrinsic d...
01/21/2021

### Boosting in Univariate Nonparametric Maximum Likelihood Estimation

Nonparametric maximum likelihood estimation is intended to infer the unk...
10/20/2020

### Estimating a mixing distribution on the sphere using predictive recursion

Mixture models are commonly used when data show signs of heterogeneity a...
12/05/2018

### On nonparametric estimation of a mixing density via the predictive recursion algorithm

Nonparametric estimation of a mixing density based on observations from ...
04/23/2018

### Randomized Mixture Models for Probability Density Approximation and Estimation

Randomized neural networks (NNs) are an interesting alternative to conve...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider a mixture model with density function given by

 f(y)=∫k(y∣x)P(dx) (1)

where is a known kernel density and is an unknown mixing distribution. The goal is estimation of based on independent and identically distributed data from the mixture in (1), a classically challenging problem in statistics. If is a discrete distribution with fixed and finite number of components, then (1) is a finite mixture model and is relatively straightforward; indeed, maximum likelihood computation is feasible with the EM algorithm (Dempster et al. 1977) and the usual asymptotic theory is available (Redner and Walker 1984). The catch is that the number of mixture components can be difficult to specify. Therefore, there has also been a lot of work on finite mixtures with an unknown number of components (e.g., Miller and Harrison 2017; Rousseau and Mengerson 2011; Leroux 1992; James et al. 2001; Woo and Sriram 2006; Richardson and Green 1997; Roeder and Wasserman 1997).

Likelihood-based methods for estimating are available even without explicitly making the problem finite-dimensional. Indeed, the likelihood function for in the nonparametric case is

 L(P)=n∏i=1∫k(Yi∣x)P(dx), (2)

and it is known that the maximizer , the nonparametric maximum likelihood estimator (MLE), is discrete, with at most components (Lindsay 1983, 1995). Discreteness simplifies computation, and fast algorithms are available, e.g., Wang (2007) and Koenker and Mizera (2014). However, if is believed to have a density with respect to, say, the Lebesgue measure, then the discrete estimator may not be satisfactory. To remedy this, various smoothed versions of have been proposed (e.g., Eggermont and LaRiccia 1995, 1997; Silverman et al. 1990; Green 1990; Laird and Louis 1991; Liu et al. 2009), but some of these are rather complicated and there seems to be no general consensus that one smoothing method is any better than another.

For Bayesian mixture models, the Dirichlet process prior (Ferguson 1973) and variants of its stick-breaking representation (Sethuraman 1994)

have become a mainstay, largely because of the plethora of powerful Markov chain Monte Carlo methods available for evaluating the corresponding posterior

(e.g., Escobar and West 1995; MacEachern 1998; Dunson and Park 2007; Walker 2007; Kalli et al. 2011). The focus of these developments, however, has been the mixture density, with the mixing distribution serving merely as a modeling tool; but see Nguyen (2013). As with the nonparametric MLE, the inherent discreteness of stick-breaking priors, while advantageous for mixture density estimation and modeling latent structures, is problematic in the context of nonparametric Bayesian estimation of a mixing density.

A Bayesian-style recursive estimate for , called predictive recursion, was proposed by Newton (2002) and studied theoretically by Martin and Ghosh (2008), Tokdar et al. (2009), and Martin and Tokdar (2009). The algorithm is fast and provides an estimator having a smooth density with respect to any specified dominating measure. However, its dependence on the (arbitrary) order in which the data is processed, hence it is not a Bayesian estimator, along with its inability to be characterized as an optimizer of any objective function, makes the predictive recursion estimator difficult to interpret.

In this paper, we investigate properties of a simple and fast iterative algorithm for estimating , one that shares certain features with the MLE, a Bayesian approach, as well as predictive recursion. Versions of this algorithm have been presented in the literature before, and its convergence properties have been conjectured but not rigorously proved. Here we fill this gap by providing a proof that algorithm converges to the nonparametric MLE as the number of iterations approaches infinity. While the limit is a discrete distribution, it is interesting that at every finite number of iterations, the algorithm returns a continuous density. This suggests that a smooth near-MLE of the density can be obtained by stopping the algorithm before convergence is achieved, and we propose an data-driven stopping rule and demonstrate empirically the quality performance of this nonparametric near-MLE of the mixing density compared to predictive recursion.

## 2 A simple and fast algorithm

### 2.1 Definition

As discussed above, our focus is on estimating a smooth mixing density associated with the mixing distribution in (1); throughout, we assume that is a density with respect to Lebesgue measure, though other choices of dominating measure could be handled similarly. Given a prior guess of , if a data point is observed, then the Bayesian update of to , say, is

 p1,i(x)=k(Yi∣x)p0(x)f0(Yi), (3)

where . However, we can carry out this single-observation update for any and, since observations ordering is irrelevant, it is reasonable to take an average:

 p1(x)=1nn∑i=1p1,i(x)=1nn∑i=1k(Yi∣x)p0(x)f0(Yi).

This same argument can be applied, with replaced by , to get an updated estimate , and so on. This suggests the following iterative algorithm for an estimator of :

 pt+1(x)=1nn∑i=1k(Yi∣x)pt(x)ft(Yi),t≥0, (4)

where for general . Algorithms similar to (4) for certain applications or models have appeared in the literature; see, e.g., Vardi et al. (1985), Laird and Louis (1991), and Vardi and Lee (1993). But despite the hints in these papers about more general versions, it seems that the algorithm (4) has not been studied thoroughly and in the level of generality considered here.

Aside from this Bayesian-motivated formulation, there are number of ways to think about this algorithm and understand what it is trying to do. First, note the similarities with the predictive recursion algorithm of Newton (2002) which updates by taking a weighted average of the current guess and the single-observation Bayes update (3) based on the current guess as the prior. These computations proceed along the sequence and, therefore, the result depends on the arbitrary order of the data sequence. The proposed algorithm (4) can, therefore, be viewed as an order-invariant version of predictive recursion that can also be refined ad infinitum, by taking , if desired.

Second, suppose that converges, in some sense, to a limit . Then that limit, in particular, its corresponding mixture , must satisfy

 1nn∑i=1k(Yi∣x)f∞(Yi)=1∀x.

This condition boils down to one involving the directional derivative of the log-likelihood function , for as in (2), which, according to Theorem 19 in Lindsay (1995), characterizes the nonparametric MLE. Therefore, we can expect that, if the limit exists, then it must be the nonparametric MLE, and details supporting this claim are presented in Section 2.3. However, for all finite , the continuity of the initial guess persists, so stopping at some makes a sort of “smoothed” nonparametric MLE; see Section 3.

### 2.2 Illustration

Example 1.2 in Böhning (2000) presents data on the number of illness spells for 602 pre-school children in Thailand over a two-week period; see, also, Wang (2007, Table 1). Böhning argues that a Poisson model provides poor fit, so he proposes a nonparametric Poisson mixture model instead. Wang produces the nonparametric MLE for the mixing distribution displayed in the first five panels Figure 1 (vertical gray lines). These panels also show the mixing density estimates from (4) for five different stopping times, namely, , based on . It is interesting that the estimate very quickly forgets the shape of the uniform initial guess, that at 500 iterations it has mostly identified the locations of the nonparametric MLE, but that it takes many more iterations before it starts to clearly concentrate around those locations. Most interesting, however, is that the log-likelihood trajectory is non-decreasing and very quickly jumps up to near , the maximum possible value; in fact, the relative difference between and the maximum is about 0.003, close to satisfying any reasonable convergence criterion. So, at least in the sense of the likelihood value, is virtually as good as , and can be computed very quickly and simply; plus, is also a continuous density.

### 2.3 Properties

This section seeks to give a rigorous demonstration of two key results suggested in Sections 2.12.2 above. First is the conjecture coming from Panel (f) in Figure 1, namely, that the likelihood function is non-decreasing as a function of in (4), which implies that the iterations are not moving down on the likelihood surface.

###### Proposition 1.

For defined as (4), the likelihood is non-decreasing.

###### Proof.

Write the iterates in (4) as

 pt+1(x)=∫k(y∣x)pt(x)ft(y)ˆF(dy),

where is the empirical distribution of . Then the log-likelihood satisfies

 ℓ(pt+1)−ℓ(pt) =∫logft+1(y)ft(y)ˆF(dy) =∫[log∫∫k(y∣x)k(y′∣x)pt(x)dxft(y)ft(y′)ˆF(dy′)]ˆF(dy).

Applying Jensen’s inequality, the right-hand side is lower-bounded by

 ∫{pt(x)∫k(y∣x)ft(y)ˆF(dy)}log{∫k(y′∣x)ft(y′)ˆF(dy′)}dx,

which is the Kullback–Leibler divergence of

from . Since the latter is non-negative, it follows that , as was to be shown. ∎

We have opted here for a direct proof of non-decreasing likelihood, though this could also be verified by confirming that (4) is an EM algorithm with the parameter space

, the set of all probability measures

on having strictly positive densities. While a characterization of (4) as an EM algorithm may not be new—e.g., Laird and Louis (1991) give a similar characterization of a special case of (4)—there appears to be no general results on whether the sequence actually converges to the nonparametric MLE, . Of course, not moving downwards on the likelihood surface suggests that the algorithm converges to , but a proof requires some care. Of critical importance is that, since is convex, the general results in Csiszár and Tusnády (1984, Sec. 5A) imply that

 L(Pt)→L(^P),as t→∞, (5)

a necessary condition for to converge to .

###### Proposition 2.

Suppose for each that is a continuous and strictly positive map on . Also, assume that for every there exists a compact such that for each . If there exists the unique nonparametric MLE , then weakly as .

###### Proof.

Since a continuous function on a compact set is bounded, the map is bounded on for each . Since by (5), we have for each . For , let be a compact set such that for every . Then, there exists an integer such that for every and . It follows that is a uniformly tight sequence of probability measures.

By Prokhorov’s theorem, every subsequence has a further subsequence that converges weakly to a probability measure in , the set of all probability measures on . Since is continuous and bounded on , is also continuous with respect to the topology of weak convergence on . Therefore,

 L(Pts(r))→L(P⋆),r→∞.

It follows from (5) that is a nonparametric MLE. We assumed that is the unique nonparametric MLE, so every subsequence of has a further subsequence converging weakly to , which implies weakly. ∎

While convergence of (4) to the nonparametric MLE is important to explain and justify the method itself, the EM’s convergence to the maximizer is known to be relatively slow (e.g., Jamshidian and Jennrich 1997; Meng and van Dyk 1997). In fact, Figure 1 reveals that we need more than 5000 iterations of (4) in order to reach the MLE. However, since we are interested in estimating a density, we prefer to stop the algorithm short of convergence to . So Propositions 12, along with the speed at which the iterations get close to the maximum, as demonstrated in Panel (f) of Figure 1, suggest that , for suitable , is a near-maximum likelihood estimator (nMLE) of the mixing density.

## 3 A smooth nonparametric near-MLE

### 3.1 Proposal

As described above, for the purpose of maintaining smoothness of the mixing density estimator , we have reason to stop the algorithm at a relatively small value . But we want to ensure that the likelihood at is sufficiently large and, hence, can be understood as a near-MLE, or nMLE. Towards this, we propose the following stopping rule:

 stop at T such that ~ℓext−ℓ(pT)<δ|~ℓext|, (6)

where is a small constant to be specified, and is the log-likelihood evaluated at some external estimate of the mixture density . One option is for to be the likelihood evaluated at the actual nonparametric MLE, but a reasonable alternative is to take based on a simple kernel estimate of . In our experiments that follow, we take and based on the default kernel estimator in the density function in R.

### 3.2 Numerical results

Here we carry out a number of simulations to demonstrate the performance of our proposed nonparametric nMLE of the mixing density. In each case, we take to be an iid sample of size from the mixture density . Three different kernels will be considered:

Kernel 1.

;

Kernel 2.

;

Kernel 3.

.

We will also consider three different mixing densities, supported (roughly) on :

Mixing 1.

;

Mixing 2.

;

Mixing 3.

.

For each kernel and mixing density pair, 100 datasets are generated, and mixing density estimates using the proposed nMLE based on stopping rule (6) above; all are based on a starting initial guess. Figure 2

summarizes the results of these simulations, plotting the 100 mixing density estimates for the individual datasets, the truth, the point-wise average of those 100 estimate, and the point-wise one-standard deviation range around the average. In all cases, although there is variability around the truth in each individual case, as is expected, our proposed estimator is quite accurate overall. Also, the

defined by (6) was less than 5 in all of these runs. For comparison, we also produced the mixing density estimator based on the predictive recursion algorithm described briefly in Section 1. Figure 3 shows the predictive recursion to nMLE relative error. Since the values all tend to be greater than one for all kernel and mixing density pairs, we conclude that nMLE tends to be more accurate than predictive recursion.

## Acknowledgments

This work is partially supported by the National Science Foundation under grants DMS–1612891 and DMS–1737929

## References

• Böhning (2000) Böhning, D. (2000). Computer-assisted Analysis of Mixtures and Applications: Meta-analysis, Disease Mapping, and Others. Chapman and Hall–CRC, Boca Raton.
• Csiszár and Tusnády (1984) Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statist. Decisions, (suppl. 1):205–237. Recent results in estimation theory and related topics.
• Dempster et al. (1977) Dempster, A., Laird, N., and Rubin, D. (1977). Maximum-likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B, 39(1):1–38.
• Dunson and Park (2007) Dunson, D. B. and Park, J.-H. (2007). Kernel stick-breaking processes. Biometrika, 95:307–323.
• Eggermont and LaRiccia (1995) Eggermont, P. P. B. and LaRiccia, V. N. (1995). Maximum smoothed likelihood density estimation for inverse problems. Ann. Statist., 23(1):199–220.
• Eggermont and LaRiccia (1997) Eggermont, P. P. B. and LaRiccia, V. N. (1997). Nonlinearly smoothed EM density estimation with automated smoothing parameter selection for nonparametric deconvolution problems. J. Amer. Statist. Assoc., 92(440):1451–1458.
• Escobar and West (1995) Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc., 90(430):577–588.
• Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., 1:209–230.
• Green (1990) Green, P. J. (1990). On use of the EM algorithm for penalized likelihood estimation. J. Roy. Statist. Soc. Ser. B, 52(3):443–452.
• James et al. (2001) James, L. F., Priebe, C. E., and Marchette, D. J. (2001). Consistent estimation of mixture complexity. Ann. Statist., 29(5):1281–1296.
• Jamshidian and Jennrich (1997) Jamshidian, M. and Jennrich, R. I. (1997). Acceleration of the EM algorithm by using quasi-Newton methods. J. Roy. Statist. Soc. Ser. B, 59(3):569–587.
• Kalli et al. (2011) Kalli, M., Griffin, J. E., and Walker, S. G. (2011). Slice sampling mixture models. Stat. Comput., 21(1):93–105.
• Koenker and Mizera (2014) Koenker, R. and Mizera, I. (2014). Convex Optimization, Shape Constraints, Compound Decisions, and Empirical Bayes Rules. J. Amer. Statist. Assoc., 109(506):674–685.
• Laird and Louis (1991) Laird, N. M. and Louis, T. A. (1991). Smoothing the nonparametric estimate of a prior distribution by roughening: a computational study. Comput. Statist. Data Anal., 12(1):27–37.
• Leroux (1992) Leroux, B. G. (1992). Consistent estimation of a mixing distribution. Ann. Statist., 20(3):1350–1360.
• Lindsay (1983) Lindsay, B. G. (1983). The geometry of mixture likelihoods: a general theory. Ann. Statist., 11(1):86–94.
• Lindsay (1995) Lindsay, B. G. (1995). Mixture Models: Theory, Geometry and Applications. IMS, Haywood, CA.
• Liu et al. (2009) Liu, L., Levine, M., and Zhu, Y. (2009). A functional EM algorithm for mixing density estimation via nonparametric penalized likelihood maximization. J. Comput. Graph. Statist., 18(2):481–504.
• MacEachern (1998) MacEachern, S. N. (1998). Computational methods for mixture of Dirichlet process models. In Dey, D., Müller, P., and Sinha, D., editors,

Practical Nonparametric and Semiparametric Bayesian Statistics

, volume 133 of Lecture Notes in Statist., pages 23–43. Springer, New York.
• Martin and Ghosh (2008) Martin, R. and Ghosh, J. K. (2008). Stochastic approximation and Newton’s estimate of a mixing distribution. Statist. Sci., 23(3):365–382.
• Martin and Tokdar (2009) Martin, R. and Tokdar, S. T. (2009). Asymptotic properties of predictive recursion: robustness and rate of convergence. Electron. J. Stat., 3:1455–1472.
• Meng and van Dyk (1997) Meng, X.-L. and van Dyk, D. (1997). The EM algorithm—an old folk-song sung to a fast new tune. J. Roy. Statist. Soc. Ser. B, 59(3):511–567. With discussion and a reply by the authors.
• Miller and Harrison (2017) Miller, J. and Harrison, M. (2017). Mixture models with a prior on the number of components. J. Amer. Statist. Assoc., to appear. arXiv:1502.06241
• Newton (2002) Newton, M. A. (2002). On a nonparametric recursive estimator of the mixing distribution. Sankhyā Ser. A, 64(2):306–322.
• Nguyen (2013) Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist., 41(1):370–400.
• Redner and Walker (1984) Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev., 26(2):195–239.
• Richardson and Green (1997) Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B, 59(4):731–792.
• Roeder and Wasserman (1997) Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixtures of normals. J. Amer. Statist. Assoc., 92(439):894–902.
• Rousseau and Mengerson (2011) Rousseau, J. and Mengerson, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. Roy. Statist. Soc. Ser. B, 73(5):689–710.
• Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica, 4(2):639–650.
• Silverman et al. (1990) Silverman, B. W., Jones, M. C., Nychka, D. W., and Wilson, J. D. (1990). A smoothed EM approach to indirect estimation problems, with particular reference to stereology and emission tomography. J. Roy. Statist. Soc. Ser. B, 52(2):271–324.
• Tokdar et al. (2009) Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009). Consistency of a recursive estimate of mixing distributions. Ann. Statist., 37(5A):2502–2522.
• Vardi and Lee (1993) Vardi, Y. and Lee, D. (1993). From image deblurring to optimal investments: maximum likelihood solutions for positive linear inverse problems. J. Roy. Statist. Soc. Ser. B, 55(3):569–612. With discussion.
• Vardi et al. (1985) Vardi, Y., Shepp, L. A., and Kaufman, L. (1985). A statistical model for positron emission tomography. J. Amer. Statist. Assoc., 80(389):8–37. With discussion.
• Walker (2007) Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Simulation Comput., 36(1-3):45–54.
• Wang (2007) Wang, Y. (2007). On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution. J. R. Stat. Soc. Ser. B, 69(2):185–198.
• Woo and Sriram (2006) Woo, M.-J. and Sriram, T. N. (2006). Robust estimation of mixture complexity. J. Amer. Statist. Assoc., 101(476):1475–1486.