## 1 Introduction

Mixture models are widely used in statistics and machine learning, often for density estimation and clustering. Here we will be considering a general version of the mixture model, where the mixture density is given by

(1) |

where is a known kernel, i.e., where is a density for each , and is the unknown mixing distribution on (the Borel -algebra of) . An advantage to this general form is its flexibility: depending on the kernel, the mixture density can take virtually any shape (e.g., DasGupta 2008, p. 572), making such mixtures a powerful modeling tool for robust, nonparametric density estimation. Here we will assume that we have independent and identically distributed observations from a density —which may or may not have the form (1)—and our goal is to fit the above mixture model, estimate the mixing distribution , and, in turn, estimate the density .

An alternative perspective on the mixture model formulation considers a hierarchical formulation, where the first layer has iid

-valued random variables,

, from , and then the second layer hasThe idea is that the ’s are latent/unobservable variables and the ’s are the observable data. It is easy to check that, marginally, the ’s are iid with density as in (1). The classical deconvolution problem (e.g., Stefanski and Carroll 1990; Fan 1991) is a special case where is such that the second layer above could be described as “.” This hierarchical formulation sheds light on the difficulties of the problem we are considering; that is, our goal is to estimate the distribution of the latent variables based only on the corrupted observations .

For fitting the general mixture model (1), a number of different strategies are available in the literature. A natural approach is to use the nonparametric maximum likelihood estimator (MLE) of (Lindsay 1995; Eggermont and LaRiccia 1995) and the corresponding plug-in estimate of the mixture density . An interesting feature of the nonparametric MLE of is that it is almost surely a discrete distribution (e.g., Lindsay 1995). Another approach is to assume discreteness of with a fixed number of components and the component parameters are estimated via EM (Dempster et al. 1977; McLachlan and Peel 2000; Teel et al. 2015). Bayesian approaches have also been explored in this context; either by having a prior on like in Van Dyk and Meng (2001), or a prior on the number of components of like in Richardson and Green (1997).

An alternative to the likelihood-based frameworks mentioned above, Newton et al. (1998) proposed a recursive algorithm for nonparametric estimation of , originally designed to serve as an approximation of the posterior mean under the Dirichlet process mixture formulation; see, also, Newton and Zhang (1999). The so-called predictive recursion (PR) algorithm estimates the mixing distribution recursively, starting with an initial guess and applies a simple update , for each , resulting in an estimate of and a corresponding estimate of . Advantages of the PR estimator include its speed and—compared to likelihood-based methods whose estimates of are effectively discrete—its ability to estimate a mixing distribution that has a smooth density with respect to any user-specified dominating measure. Further details about the PR algorithm and its properties are discussed in Section 2.

Not being likelihood-based makes the theoretical justification of the PR estimator of not straightforward. It was not until Newton (2002) that a first theoretical convergence analysis of PR was presented, establishing the asymptotic consistency of the PR estimator as . Unfortunately, there was a gap in Newton’s proof, later filled by Ghosh and Tokdar (2006). These first results, along with those in Martin and Ghosh (2008), focus primarily on the case where is a known finite set. Tokdar et al. (2009) extended the consistency results to the case of compact , which was extended further by Martin and Tokdar (2009) who covered the case of model misspecification, where the true density need not have exactly the form (1), and bounded the rate of convergence.

However, even the latter results are based on conditions that can be too restrictive in applications. For example, Williamson (1956) showed that monotone densities are characterized as mixtures of the form (1) where and is the uniform kernel, , with being the indicator function of a set . But for this particular kernel, it is not possible to check the sufficient conditions required in, e.g., Theorem 4.5 of Martin and Tokdar (2009). Similar issues would arise in other mixture model applications. Motivated by this deficiency in the state of the art, the focus of the present paper is to establish new asymptotic consistency properties for the PR estimator under weaker and more easily verified conditions.

Following a brief review of the existing theory for PR in Section 2, we establish convergence properties of the PR estimator—both the mixture density and the mixing distribution—under weaker conditions in Section 3. We then apply these new results in Section 4 to our motivating example, namely, monotone density estimation via mixtures of uniform kernels. There we first give a characterization of the best mixing distribution and mixture density within a special class of uniform mixtures. This characterization suggests a particular formulation of the PR algorithm and we use the general results presented in Section 3 to prove that PR consistently estimates this best mixture. Our choice to focus on a special class of uniform mixtures generally introduces some model misspecification bias, but we show that this bias is a vanishing function of two user-specified parameters. Therefore, the bias has no practical impact on PR’s performance, as our numerical examples confirm. Finally, some concluding remarks are given in Section 5. Technical details and proofs are presented in the Appendix.

## 2 Background on PR

As mentioned briefly above, PR is a stochastic algorithm designed for fast, nonparametric estimation of mixing distributions. The algorithm’s inputs include the kernel , an initial guess of the mixing distribution, supported on , a rule for defining a sequence of weights , and a sequence of data points . Then the recursive updates first presented in Newton et al. (1998) define the PR algorithm:

(2) |

After data points have been observed, the mixing distribution estimator is , and the corresponding mixture density estimator is defined according to (1). To understand the motivation behind PR, observe that the PR update is just a weighted average of and the posterior for with prior and kernel likelihood . The weights, , need to be decreasing in but not too quickly; this will be made more precise below. Some recent and novel applications of PR can be found in Scott et al. (2015), Tansey et al. (2018), and Woody et al. (2021).

If the user has a specific dominating measure on in mind, then he/she can incorporate that information into the algorithm. That this, the updates in (2) can be expressed in terms of the density or Radon–Nikodym derivative as

where is the initial guess. Therefore, PR can be used to estimate a mixing density, compared to the nonparametric MLE, , which is almost surely discrete. Moreover, when the densities are evaluated on a fixed grid in , and the normalizing constant in the denominator is evaluated using quadrature, computation of the PR estimate, , is fast and simple—done in

operations—compared to the nonparametric MLE or a Bayesian estimate based on Markov chain Monte Carlo (MCMC).

The above algorithm is described for the case when data points are arriving one at a time, but, of course, the same procedure can be carried out when the data comes in a batch. When data are both batched and iid, as we consider here, one might be troubled by the fact that depends on the order in which the data are processed. In particular, while there are some potential advantages to PR’s order-dependence (see Dixit and Martin (2019)), it implies that is not a function of a minimal sufficient statistic. To overcome this, Newton (2002) suggested that one could evaluate the estimator separately on several random permutations of the data sequence and then take averages over permutations. This can be seen as a Monte Carlo estimate of the Rao–Blackwellized estimator, the average over all permutations. It has been shown empirically (e.g., Martin and Tokdar 2012) that it only takes a few random permutations to remove the order-dependence, so, with the inherent computational efficiency of PR, the permutation-averaged version is still much faster than, say, MCMC.

Not being Bayesian or maximum likelihood estimators, it is not immediately obvious that the PR estimates, and , would have any desirable statistical properties. It has, however, been shown that, under certain conditions, both and are consistent estimators. Before stating these sufficient conditions for consistency, we need to describe what the PR estimates are estimating in general.

Suppose the true density of the iid data is . Of course, there is generally no way to know if can be expressed as a mixture model of the form (1) for a particular kernel, . When the mixture model is incorrectly specified, there is no “” for the PR estimator to converge to, and we cannot expect to be a consistent estimator of . Instead, there may be a mixture density, , that is “closest” to , and that and would converge to and

, respectively. Proximity here is measured in terms of the Kullback–Leibler divergence,

More precisely, let

denote (a possibly proper subset of) the collection of probability distributions

on , and define the corresponding set of mixtures of the form (1) for a given kernel ,where is the closure of with respect to the weak topology, i.e., plus all possible limits of weakly convergent sequences in . To avoid vaccuous cases, we will assume that is finite for at least one . This is not a trivial assumption, however; see Section 4. In this case, the “best approximation” of in is the Kullback–Leibler minimizer, , that satisfies

(3) |

A relevant question is whether such a minimizer exists and if it is unique. Assuming that is finite for at least one and given that it is a convex function, we can expect that a minimizer exists and is unique. Existence of a corresponding to is guaranteed by assuming certain conditions on and ; see Conditions A1 and A2 in Martin and Tokdar (2009) and, more generally, Liese and Vajda (1987, Ch. 8). However, uniqueness of requires identifiability of the mixture model (1) in .

In Tokdar et al. (2009), consistency of the PR estimators was established in the case where the mixture model was correctly specified, i.e., when , so that there exists a true . That is, under certain conditions, they showed almost surely and that weakly almost surely. Martin and Tokdar (2009) extended these consistency results to the case where the mixture model is not necessarily correctly specified, i.e., where possibly . This extension is a practically important one, as it provides a theoretical basis for the PR-based marginal likelihood estimation framework developed in Martin and Tokdar (2011) and later applied in, e.g., Martin and Han (2016), Dixit and Martin (2020). Under conditions slightly stronger than those given in Tokdar et al. (2009) for the correctly specified case, they showed that and weakly, both almost surely. This implies, for example, that the PR estimates do the best they could, asymptotically, relative to the specified model. It turns out, however, that the sufficient conditions stated in Martin and Tokdar (2009), very similar to those in Tokdar et al. (2009), are rather restrictive. The most problematic of those assumptions is the following:

(4) |

For nice kernels like for a fixed , if is compact and has Gaussian-like tails, then (4) can be satisfied. However, if is heavier-tailed, then (4) could easily fail. More concerning is if we are considering a not-so-nice kernel, such as uniform: , for and ; this is the natural kernel in the case where is monotone non-increasing on . In this case, the -dependent support implies that the ratio in the above display is infinite on an open interval and, hence, (4) obviously fails. The difficulty in verifying condition (4) in several practical applications is what motivated our investigation into potentially weaker sufficient conditions and, in turn, the present paper.

## 3 New consistency results

### 3.1 Conditions

The goal is to develop a new set of sufficient conditions for PR consistency that are weak enough that they can be checked in the applications we mentioned above, in particular, the case of uniform kernels for monotone density estimation. First we make clear the setup/conditions, and then we present the main results.

###### Condition 1.

The PR algorithm’s weights satisfy , for .

###### Condition 2.

The mixing distribution support, , is compact.

###### Condition 3.

The kernel, the initial guess , with corresponding , and the true satisfy the following integrability property:

(5) |

In the previous literature on this topic, and also in the literature on stochastic approximation more generally, the weights/step sizes are assumed to satisfy

Of course, the specific weights in Condition 1—which are of the same form as the weights used in Hahn et al. (2018)—satisfy these conditions, but others do to. The reason we adopt this specific choice is that it allows us to replace (4) with the weaker bound (5) discussed more below. And since the choice of weights is entirely in the hands of the user, while the choice of kernel may be determined by the context of the problem and is a choice made by “Nature” and hidden from the user, it is best to sacrifice on generality in directions the user can control.

Condition 2 assumes that the mixing distribution support is compact, but this is not much of a restriction in practice, since it can be taken as large as the user pleases. Compactness of is not strictly needed for the results presented below, but (a) some more complicated notion of compactness is needed, as we briefly discuss in the paragraph leading up to Corollary 2, and (b) Condition 3 might be difficult to check without being compact. For these reasons, we opt for the simpler albeit slightly more restrictive compactness condition listed above.

Finally, the most complicated assumption is in Condition 4, about integrability. To understand this better, it may help to re-express the integrand as

First, if the PR prior guess is not too tightly concentrated, then the mixture would be heavier-tailed than any individual kernel . In that case, the first ratio in the above display would be bounded, or at least would not increasing too rapidly. Second, we cannot expect PR, or any mixture model-based method for that matter, to be able to do a good job of estimating if a mixture with a relatively diffuse mixing distribution cannot adequately cover the support of . So the heart of Condition 3 is an assumption that the posited mixture model can adequately cover the support of , in the sense that the second ratio in the above display is not blowing up too rapidly. Finally, if the two ratios are well controlled, then the integral with respect to should be bounded uniformly in . We shall see below, in Section 4, that (5) can be checked for uniform kernels while the condition (4) in Martin and Tokdar (2009) cannot.

### 3.2 Main results

Our goal in this section is to show that the PR estimator, , of is consistent in the sense that converges almost surely to , the minimum Kullback–Leibler divergence over the posited mixture model class . In the special case where , this implies consistency in the usual sense: almost surely. In either case, it says that the PR estimator, , is close to the best possible mixture approximation of , at least asymptotically. We will also show how consistency of the mixing distribution estimator can be established from consistency of the mixture, but this will require further explanation; see below.

###### Theorem 1.

###### Proof.

See Appendix A.1. ∎

Here we give a very rough sketch of the proof strategy. Start by writing , and let denote the -algebra generated by the observations , for . We show in the proof that

where

(6) |

and is a “remainder” term defined in the appendix. It follows from Jensen’s inequality that , with equality if and only if , the Kullback–Leibler minimizer. If we could ignore the remainder term, then would be a non-negative supermartingale and, therefore, would converge almost surely to some . Of course, the remainder term cannot be ignored, so we will use the “almost supermartingale” results in Robbins and Siegmund (1971) to accommodate this. Moreover, to show that is 0 almost surely, we will use some new and useful properties of the function in (6) which were overlooked in the analysis presented in Martin and Tokdar (2009).

When the mixture model is correctly specified, so that , it follows from Theorem 1 and the familiar properties of Kullback–Leibler divergence that almost surely in Hellinger or total variation distance, i.e., that and both go to 0 almost surely. In the general case where the mixture model is misspecified, Theorem 1 still strongly suggests that , but some effort is required to connect the Kullback–Leibler difference to a distance between and . Towards this, define the Hellinger contrast , which is given by

This is just a weighted version of the ordinary Hellinger distance—with weight function —so it is a proper metric. Clearly, if the mixture model is correctly specified, so that , then is exactly the Hellinger distance. See Patilea (2001) and Kleijn and van der Vaart (2006) for further details on the Hellinger contrast. The following result establishes that almost surely, which implies that the limit of satisfies almost everywhere with respect to the measure with Lebesgue density . Under some additional conditions, namely, that is suitably close to , the PR estimator is shown to converge to in total variation distance, which implies the limit is equal to almost everywhere with respect to Lebesgue measure.

###### Corollary 1.

Under the conditions of Theorem 1, almost surely. Moreover, if , then almost surely in total variation.

###### Proof.

See the proof of Corollary 4.10 in Martin and Tokdar (2009). ∎

Finally, what can be said about the convergence of the mixing distribution estimator, ? Again, Theorem 1 strongly suggests that is converging to in some sense, but we cannot make that leap immediately. In particular, without additional assumptions, there is no guarantee that is unique or even that converges at all. For this, we will need identifiability of the mixture model (1) and tightness of . Under Condition 2, as we assume here, tightness of follows from Prokhorov’s theorem. If compactness of is not a feasible assumption, then one can instead verify the more general sufficient condition, namely, Condition A6 in Martin and Tokdar (2009), for tightness of .

We will also reqjuire the following fairly abstract condition on the kernel density , written in terms of a general sequence of mixing distributions on :

(7) |

In words, (7) states that the kernel is such that weak convergence of mixing distributions implies almost everywhere pointwise convergence of mixture densities. This holds immediately if is bounded and continuous for almost all , as was assumed in Martin and Tokdar (2009) and elsewhere. However, in some examples, like in Section 4 below, strict continuity of the kernel fails, but condition (7) can be verified.

###### Corollary 2.

###### Proof.

Since is tight, there exists a subsequence such that weakly, for some . By (7), we have pointwise convergence of the mixture densities, i.e., for almost all , and then in total variation distance thanks to Scheffé’s theorem. But Corollary 1 already gives us almost surely in total variation distance on the full/original sequence. Therefore, it must be that almost surely and, by identifiability, that . Since any such convergent subsequence of would have the same almost weak limit, , it must be that itself converges weakly almost surely to , as claimed. ∎

The boundedness assumption on , as in Corollary 1, is needed simply to convert convergence of to in the Hellinger contrast to convergence in total variation. Identifiability of the mixture model in is non-trivial. Additively-closed one-parameter families of distributions were proved to be identifiable in Teicher (1961)

. Identifiability of finite mixtures of gamma and of Gaussian distributions was proved in

Teicher (1963). Scale mixtures of uniform distributions, like we discuss in Section

4 below, were shown to be identifiable in Williamson (1956). More generally, identifiability of mixture models needs to be checked on a case-by-case basis.## 4 Application: Monotone density estimation

### 4.1 Background

Any monotone non-increasing density can be written as a scale mixture of uniforms (Williamson 1956), i.e., for any monotone density defined on , there exists a mixing distribution , supported on , such that,

(8) |

where is the uniform kernel density. Therefore, the problem of estimating a monotone density can, at least in principle, be solved through the use of mixture density estimation methods, such as the PR algorithm.

Let be iid from a monotone non-increasing density . One approach to estimating is to calculate the nonparametric MLE, also known as the Grenander estimator (Grenander 1956), which is the left derivative of the least concave majorant of the empirical distribution function. It is known that Grenander’s is a consistent estimator of , with consistency results obtained in Rao (1969) and Groeneboom (1985). However, as shown in, e.g., Woodroofe and Sun (1993), the Grenander estimator tends to over-estimate near the origin and, in particular, is inconsistent at the origin. The same authors proposed a penalized likelihood estimator that penalizes the Grenander estimator at the origin and is also consistent overall.

Another approach is Bayesian, whereby a prior distribution on is imposed by using the mixture characterization in (8) along with a suitable prior on the mixing distribution . A natural choice is a Dirichlet process prior on , leading to a Dirichlet process mixture of uniforms model for the density ; see Bornkamp and Ickstadt (2009). Although this approach seems straightforward, obtaining asymptotic consistency results for the posterior distribution is made difficult by the uniform kernel’s varying support. In particular, if the support for the mixing distribution is not suitably chosen, then the Kullback–Leibler divergence of a posited mixture model from the true density would be infinite, which creates problems for verifying the so-called “Kullback–Leibler property” (Schwartz 1965; Wu and Ghosal 2008) in the classical Bayesian consistency theory. Some strategies have been suggested in, e.g., Salomond (2014), who showed that the Bayesian posterior distribution under the Dirichlet process mixture prior has a near optimal concentration rate in total variation. More recently, Martin (2019) proposed the use of an empirical, or data-driven prior for which the prior support conditions required for asymptotic consistency are automatically satisfied, and showed that the corresponding empirical Bayes posterior distribution concentrates around the true monotone density at nearly optimal minimax rate. But the fully Bayesian solutions are computationally non-trivial and somewhat time consuming; moreover, the estimates tend to be relatively rough. The PR algorithm, which is computationally fast and tends to produce smooth estimates, is a natural alternative to the aforementioned likelihood-based methods.

### 4.2 PR for uniform mixtures

Suppose that the true density is any monotone density supported on . We know that can be written as a mixture in (8), so there exists a mixing distribution , which is also supported on . This point is relevant because of the following unique feature of uniform mixtures: if is a mixture model as in (8) with supported on , then for all and, hence, if , then . Therefore, the upper bound of being creates some serious challenges. For practical implementation of the PR algorithm, and for the theory as discussed above, a compact mixing distribution support is needed. This calls for a different approach.

For a fixed , define a new target, , which is simply restricted and renormalized to . That is, if denotes the distribution function corresponding to the density , then

Alternatively, can be viewed as the conditional density of , given ; see below. The point of this adjustment is that has a known and bounded support, so a mixture model with mixing distribution supported on (a large subset of) can be fit with the PR algorithm to efficiently and accurately estimate this new target . Note that can be made arbitrarily close to by choosing sufficiently large (see below), so this modification has no practical consequences.

For technical and practical reasons, we cannot use the PR algorithm when the support of the mixing distribution contains , so we introduce a new lower bound , which can be arbitrarily small. Then the proposed mixture model to be fit by PR is

(9) |

While both above and the adjusted target are supported on , the model in (9) is still slightly misspecified through the introduction of the lower bound of the mixing distribution support. In particular, note that is constant for . But the fact that can be taken arbitrarily small means that there are no practical consequences to this misspecification. It does complicate the convergence analysis, but, fortunately, the theory presented in Section 3 above is general enough to handle this.

Given that the mixture model (9) is slightly misspecified, it is important to know what we can expect the PR algorithm to do. Theorem 1 states that, roughly, the PR estimator will converge to the Kullback–Leibler minimizer . Since the supports of and the model densities in (9) are the same, we avoid the “” problem so minimizing the Kullback–Leibler divergence is well-defined. To understand the bias coming from model misspecification, it will be important to understand what looks like. Incidentally, Williamson (1956) established that uniform mixtures are identifiable, so there is a unique mixing distribution, , supported on , at which the Kullback–Leibler divergence is attained. The following lemma gives the details.

###### Lemma 1.

For the targeted monotone density supported on , if the proposed mixture model is as in (9), then the unique minimizer, , of the Kullback–Leibler divergence is given by

(10) |

where is the Dirac point-mass at , is restricted to , and the coefficients are given by

with the distribution function corresponding to . Then the best approximation of under model (9) is , given by

(11) |

###### Proof.

See Appendix A.2. ∎

The characterization result in Lemma 1 is intuitive. There is a true that characterizes the true monotone mixture density , both generally supported on . Our proposed model, however, effectively restricts the mixing distribution’s support to , so it makes sense that the best approximation would agree with on and then suitably allocate the remaining mass to the endpoints and .

From Section 2, recall that the implementation of the PR algorithm begins with an initial guess , and that this effectively determines the dominating measure with respect to which has a density. PR’s ability to choose the underlying dominating measure comes in handy in cases like this where we know that the target mixing distribution, , has an “unusual” dominating measure. From Lemma 1, we know that the best mixing distribution for fitting mixture model (9) to puts point masses at the endpoints, and , of , and has a density with respect to Lebesgue measure on the interior of . So, naturally, we can initialize the PR algorithm with a starting guess that has a density with respect to the dominating measure , where denotes Lebesgue measure on . Specifically, our proposal is to initialize the PR algorithm at

where and are positive with sum strictly less than 1, and has a density with respect to Lebesgue measure, e.g., could just be a uniform distribution on . Then the estimate, , after the iteration will have the same form

and the corresponding mixture density estimate, , is obtained as usual by integrating the kernel with respect to the mixing distribution .

### 4.3 Theoretical results

Now that we know what the PR algorithm ought to converge to, we are ready to state our main result of this section. First, a word about the notation/terminology that follows. In our previous results, when we wrote “almost surely,” this was referring to the law that corresponds to iid sampling from . In the results below, is the target, so we will write “-almost surely” to be clear that it is with respect to the law corresponding to iid sampling from . Recall that is the conditional density of , given , so this modified law can be interpreted as iid sampling from , but throwing away any data points that exceed . Again, since can be taken arbitrarily large, there are no practical consequences of this restriction. In fact, a bound on the bias induced by both the - and -restrictions is given in Proposition 1 below.

###### Theorem 2.

###### Proof.

See Appendix A.3. ∎

Our choice to restrict the mixing distribution’s support to introduces some bias. That is, the limit of the sequence of PR estimators is the Kullback–Leibler minimizer over all mixtures supported on , but it is different from , which is different from . Intuitively, if and , then the bias ought to be negligible. The next result confirms this intuition by bounding the bias as a function of .

###### Proposition 1.

###### Proof.

See Appendix A.4. ∎

To make the bound in (12) more concrete, we consider a specific case. A common choice in the literature (e.g., Salomond 2014; Martin 2019) is to assume has tails that vanish exponentially fast, so that , for all large and some positive constants and ; the case corresponds to having a bounded support. From this, and standard asymptotic bounds on the incomplete gamma function, it follows that , for large . Furthermore, if, e.g., has a bounded density at 0, then we have . Combining these two, we arrive at the following, more explicit bound on the bias as a function of :

Clearly, by taking small and even just moderately large, the overall bias as a result of restricting to can be made negligibly small.

As a final technical detail in this section, we consider the problem of estimating , the density at its mode, the origin. This is an interesting and challenging problem, with a variety of applications; see, e.g., Vardi (1989). In particular, Woodroofe and Sun (1993) highlight examples such as time between breakdowns of a system and distribution of galaxies that require the estimation of this modal . The PR algorithm gives an obvious estimator of , in particular, . The following result gives a theoretical basis for using this estimate and simulations in Section 4.4 show that the proposed estimate at 0 performs well when compared to existing methods.

###### Proposition 2.

Under the assumptions of Theorem 2, -almost surely. Furthermore, the bias between and is bounded ,i.e,

###### Proof.

See Appendix A.5. ∎

### 4.4 Numerical illustrations

In this section we compare different methods for monotone density estimation to our PR-based method. The four methods we consider are the Grenander estimate, a Bayesian approach using a Dirichlet process, Bayesian approach using an empirical prior, and the method based on optimization of the penalized likelihood. The Grenander estimate is based on the nonparametric MLE and can be calculated easily using the R package fdrtool (Klaus and Strimmer 2015). Settings for the Dirichlet process mixture and the empirical Bayes were based on those suggested in Martin (2019) and computed using the R codes he provided on his website.^{2}^{2}2https://www4.stat.ncsu.edu/~rmartin/ The penalized likelihood maximization was based on Woodroofe and Sun (1993) and we used one of the values recommended by those authors for their penalization parameter, i.e., . For PR, we take the mixing distribution support to be , with and . The initial guess is taken to be uniform on . To reduce the dependence of the PR estimator on the data order, we average the PR estimates over 25 random permutations of the data. For the comparisons below, we consider both real and simulated data sets.

First, we consider data coming from a study of suicide risks reported in Silverman (1986), which consists of lengths of psychiatric treatment for patients used as control. As per the detailed study of suicide risks in Copas and Fryer (1980), there is a higher risk for suicide in the early stages of treatment, so modeling these data with a monotone density is appropriate. Figure 1 shows a comparison of the four monotone density estimation methods discussed above with PR over a histogram of the data. PR gives a smooth estimate of the monotone density in a very short amount of time, much faster than the Bayes and empirical Bayes estimates that require Markov chain Monte Carlo. The take-away message is that, PR’s misspecification bias—due to the choice of and —can be easily controlled and that it gives a quality estimate compared to the other four methods. In fact, the PR estimate in this case is smoother than the other four methods, a desirable feature in applied data analysis. The simulations below will give a clearer picture of how PR performs compared to the other four methods.

Second, we consider two true monotone densities , namely, the standard exponential and the half standard normal. We carry out the simulation study over sample sizes of . For each , we generate 200 data sets of size and produce the five different estimates on each data set. As our metric of comparison, we use the total variation (or ) distance between the true density and the estimate. Additionally since inconsistency of the Grenander estimate at the origin is a well-known complication we also look at the ratio for each method. Boxplots summarizing both the distance and the at-the-origin ratio for the two simulations are shown in Figures 2 and 3. Consider the boxplots summarizing the distance. As the sample size increases, the boxplots for all five methods shrink towards 0, as expected. Notably, performance of PR is better than the Grenander estimator over all sample sizes. It is also faster and with slightly better performance when compared to the two Bayesian estimates, and is comparable to the penalized likelihood estimate. For estimating the density at 0, we compare PR with only the state-of-art estimates, namely the one based on penalizing the nonparametric MLE near 0 and the DP mixture. Even though PR is not tailored specifically for estimation at 0, as the penalized likelihood estimator is, its performance is competitive with the other methods.

## 5 Conclusion

Estimation of mixing distributions in mixture models is a challenging problem, one for which there are very few satisfactory methods available. To our knowledge, the PR algorithm is the one general method available that is both fast and capable of nonparametrically estimating a mixing distribution having a density with respect to any user-specified dominating measure. Despite the simple and fast implementation of the PR algorithm, and the strong empirical performance observed in numerous applications, its theoretical analysis and justification is non-trivial because of the recursive structure. Previous work has established consistency of the PR estimates under relatively strong conditions. Most concerning is that there are known examples, such as monotone density estimation using uniform mixtures, for which the sufficient conditions in previous work do not hold. The main focus of the present paper was to weaken those overly-strong conditions in order to broaden the range of problems in which PR can be applied. In particular, the new sufficient conditions can be checked for mixtures of uniform kernels, which puts PR in a position to solve the non-trivial problem of monotone density estimation on .

There are a number of possible extensions and/or open problems that could be considered. First, from a practical or methodological point of view, there is a natural extension of the motivating monotone density estimation application. That is, what can be done if the location of the mode itself is unknown? This is a non-trivial problem and has been investigated by a number of researchers, including Liu and Ghosh (2020). In the PR framework, the natural approach would be to treat the mode as an unknown, non-mixing parameter contained in the kernel, and apply the PR marginal likelihood strategy in Martin and Tokdar (2011) to estimate both the mode and the mode-specific mixing distribution. How this proposal compares to existing methods remains to be investigated.

Second, from a theoretical point of view, it is undesirable to work with a fixed and compact mixing distribution support . A natural extension would be to introduce a type of sieve, to allow the support to depend on the sample size, i.e., . The use of a -dependent support , however, is difficult and awkward in the context of PR. First, unlike usual likelihood-based methods that assume all the data to be available at once, PR is technically meant to be used for recursive estimation with online data. In that case, having a sample size dependent support is unnatural since the sample size is not set in advance. But even if we ignore PR’s recursive structure and treat it as being applied to batch data, the analysis is based on martingales that do implicitly treat the data points one by one in a sequence, so having any -specific components in the algorithm itself is awkward. Beyond awkwardness, there is a specific technical obstacle. Much of the analysis depends on properties of the functional defined in (6). This functional depends on and so, if is made to depend on , then we end up with a sequence, , of functionals that are applied to the PR sequence of estimates, , so new techniques would be needed in order to analyze a sequence of random variables like .

## Acknowledgments

This work was supported by the U.S. National Science Foundation, grant DMS–1737929.

## Appendix A Proofs

### a.1 Proof of Theorem 1

We start by reviewing some details from the analysis in Martin and Tokdar (2009). From the recursive form of the PR estimate of the mixing distribution, and the linearity of the mixture model, clearly a similar recursive form holds for the mixture. That is,

where,

For later, define the function as

By Taylor’s theorem, we can write

where the remainder term satisfies . This remainder bound will be important later.

Let . Then from that recursive form of the mixture density updates above, and this Taylor approximation, it can be shown that

Next, let denote the -algebra generated by data , for . Now take conditional expectation of the above display, given , to get

(13) |

where,

If we let , then the same relationship as in (13) holds, i.e.,

(14) |

Surprisingly, this form is quite convenient—it is an almost supermartingale like those studied by Robbins and Siegmund (1971). Below we restate (a simple version of) Robbins and Siegmund’s main theorem for the reader’s convenience.

###### Robbins–Siegmund Theorem.

Consider non-negative random variables , where is adapted to a filtration . If

(15) |

and , almost surely, then converges and almost surely.

The equation in (14) satisfies the criteria in (15), where and . We need to check that is finite almost surely, which amounts to getting a suitable upper bound on and its conditional expectation. Here is where our analysis starts to differ from that in Martin and Tokdar (2009).

The most complicated part of the definition of is its dependence on the Taylor approximation remainder described above. Recalling that upper bound, we have

But since and are density functions, their ratio is non-negative, so

Therefore, , a constant, so

Since we only need to get an upper bound up to a multiplicative constant, we will ignore that constant lumped inside of “” in what follows; we will also ignore the leading “” since the bound will ultimately get multiplies by , which itself is summable by assumption. From this bound, plug in the definition of to get

where the second inequality is by Cauchy–Schwartz. Next, we focus on one of the terms in the denominator, say, . From that recursive form for the mixture density updates, we immediately see that

Plug in this lower bound for both terms in the denominator of the bound for to get

Now take conditional expectation with respect to and interchange the order of integration (which is allowed since the integrand is non-negative) to get

By Condition 3, we have that the expression inside curly braces above is bounded, uniformly in , by a constant. Therefore,

Next we used the assumed form of the weight sequence, in Condition 1, to bound the above product. In general, we have

Using the standard bound, , and the fact that the ’s are decreasing, we have

According to Condition 1, , the summation in the above expression is of the order , which implies

Comments

There are no comments yet.