 # Permutation-based uncertainty quantification about a mixing distribution

Nonparametric estimation of a mixing distribution based on data coming from a mixture model is a challenging problem. Beyond estimation, there is interest in uncertainty quantification, e.g., confidence intervals for features of the mixing distribution. This paper focuses on estimation via the predictive recursion algorithm, and here we take advantage of this estimator's seemingly undesirable dependence on the data ordering to obtain a permutation-based approximation of the sampling distribution which can be used to quantify uncertainty. Theoretical and numerical results confirm that the proposed method leads to valid confidence intervals, at least approximately.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

At a high-level, statistical analysis aims to separate signal from noise, and one of the more challenging problems is deconvolution or, more generally, estimation of a mixing distribution based on samples from the mixture. Suppose that data are independent and identically distributed from a density with respect to Lebesgue measure on , which we model as a mixture

 f(y)=∫Xk(y∣x)p(x)μ(dx),y∈Y, (1)

where is a known kernel, i.e., is a density on for each , and is a unknown mixing density with respect to a known -finite measure on . Alternatively, one can view this model hierarchically by assuming that are independent and identically distributed from , and , given , are independently distributed from , . So if we think of as “signals” with distribution , and the version corrupted by noise, then our goal—inference about based on data from model (1)—can be viewed as separation of signal from noise.

Often, is assumed to be discrete with finitely many points in . For these so-called finite mixture models, standard modes of inference can be applied. For example, when the number of support points of is known, there is a likelihood function with relatively simple form that can be optimized, often using the EM algorithm (Dempster et al 1977; Teel et al. 2015), to produce the corresponding maximum likelihood estimator, to which the classical asymptotic distribution theory applies (e.g., Redner and Walker 1984); for a comprehensive treatment, see McLachlan and Peel (2000). Similarly, with this same likelihood and a corresponding prior distribution for , one can apply an EM-like data-augmentation strategy (e.g., van Dyk and Meng 2001) to carry out a Bayesian analysis. In the more realistic scenario where the number of components in the finite mixture model is unknown, those methods described above can be modified by introducing a penalty term or a prior distribution on the number of mixture components, as in Leroux (1992) and Richard and Green (1997).

For the case considered here, where is a smooth mixing density, a number of methods for nonparametric estimation have appeared in the literature. When is a location-shift kernel, so that the mixture density is just a convolution, estimation of is referred to as deconvolution, a case that has been studied in Fan (1991), Stefanski and Carroll (1990), and Zhang (1990). For general mixtures, there are a variety of different approaches. Maximizing the likelihood will almost surely produce a discrete estimate of the mixing distribution (Lindsay 1995), which is not a satisfactory estimate of a smooth mixing density. Various approaches aim to smooth the discrete nonparametric maximum likelihood estimator, either directly (e.g., Eggermont and LaRiccia 1995) or by introducing a smoothness penalty (e.g., Liu et all 2009). Chae et al (2018) investigate an iterative algorithm that generates a sequence of smooth mixing density estimates that converge to the nonparametric maximum likelihood estimator. Another interesting and related method, which is the focus of the present paper, is that based on the predictive recursion algorithm first described in Newton et al. (1998), Newton and Zhang (1999), and Newton (2002), with extensions and theoretical properties developed in Ghosh and Tokdar (2006), Martin and Ghosh (2008), Tokdar et al. (2009), and Martin and Tokdar (2009, 2011, 2012); for a recent review of these developments, see Martin (2018).

Beyond estimation, a goal is to quantify uncertainty about the mixing density and, for this, the literature is scarce. The work that has been done is as listed below. Bissanz et al (2007) discuss asymptotic and bootstrap confidence bands for deconvolution problems, building on ideas first presented in Bickel and Rosenblatt (1973). Link and Sauer (1995) discuss empirical Bayes estimation of an empirical mixing distribution with emphasis on construction of interval estimates, using the nonparametric maximum likelihood estimator. Fortini and Petrone (2019)

have developed asymptotically approximate credible intervals for the cumulative distribution function based on a quasi-Bayesian interpretation of the predictive recursion algorithm.

A seemingly undesirable feature of the predictive recursion estimator is that it depends on the order in which the data is processed. In particular, this means that the estimator is not a function of the sufficient statistic—which, in this setting, is the empirical distribution—and, hence, the estimator cannot be Bayesian. In previous literature on predictive recursion, the focus has been on reducing its dependence on the order. For example, Newton (2002) suggested elimination of the order-dependence by averaging the estimators over a number of randomly chosen permutations; see Tokdar et al. (2009) for details. The idea in the present paper is to leverage predictive recursion’s order-dependence for the purpose of uncertainty quantification. Specifically, we propose to generate multiple copies of the predictive recursion estimator by permuting the data sequence, and then use this permutation-based distribution as an approximation of the estimator’s sampling distribution. After a review of the predictive recursion estimator and its properties in Section 2, we show in Section 3

that, for a given feature of the mixing distribution, the estimator’s sampling distribution variance can be approximately unbiasedly estimated by the corresponding permutation-based distribution variance. Being able to accurately estimate the spread of the relevant sampling distribution immediately suggests using the same permutation distribution quantiles as an approximate confidence interval for that mixing distribution feature. Numerical results presented in Section

4 reveal that this permutation-based approach gives approximately valid confidence intervals in finite samples across a range of mixture models and for various features of the mixing distribution, including the density function at a point.

## 2 Predictive recursion

Recall that we observe independent data from the mixture in (1), and the goal is to estimate the mixing density . The following predictive recursion algorithm returns a computationally efficient nonparametric estimator of .

###### Predictive Recursion Algorithm.

Start with an initial estimate of the mixing density and a sequence of weights . Using the observations from the mixture model, in that order, compute

 pi(x)=(1−wi)pi−1(x)+wik(Yi∣x)pi−1(x)fi−1(Yi),i=1,…,n, (2)

where is the mixture corresponding to . Finally, return and as the estimates of and , respectively.

An interesting observation is that, if is a smooth density with respect to the measure , then the output, , of the predictive recursion algorithm will also be a smooth density. Compare this to the nonparametric MLE which is almost surely discrete, regardless of the smoothness of in (1). Therefore, there is no need for post hoc smoothing of the predictive recursion estimator. And the ability to specify the dominating measure in the predictive recursion algorithm proved to be a useful property in the multiple testing application considered in Martin and Tokdar (2012).

Aside from estimating the mixing density itself, one can readily estimate various features of the mixing distribution. That is, if is a suitable function, then can be estimated by . For example, we can estimate the mixing distribution function at a point by taking to be the indicator function corresponding to .

Asymptotic convergence properties of the predictive recursion estimator were investigated in Tokdar et al. (2009) and Martin and Tokdar (2009). To summarize, under suitable tail conditions on the kernel, if is identifiable from the mixture model (1) and if the weights in the predictive recursion algorithm satisfy

 ∞∑i=1wi=∞and∞∑i=1w2i<∞,

then converges to almost surely in the weak topology, that is, if is a bounded and continuous function, then almost surely.

## 3 Leveraging order-dependence

It is clear from (2) that depends on the order in which the data are processed. However, since the data are assumed to be independent and identically distributed, the ordering should be irrelevant. To alleviate predictive recursion’s seemingly undesirable order-dependence, Newton (2002) and others have suggested to average the final estimate, , over a number of randomly chosen permutations of the data sequence. Tokdar et al. (2009) describe this as a sort of Rao–Blackwellization, replacing by a Monte Carlo approximation of its conditional expectation given the order statistics. Here, instead of trying to remove predictive recursion’s order-dependence, we propose to leverage it for the purpose of uncertainty quantification.

Let denote the permutation group on integers , i.e., the set of all bijections from to itself. If is the predictive recursion estimator based on data , in its given order, write for the corresponding estimator based on the data , permuted according to . If is a suitable function, write and as the estimators of based on the original and permuted data, respectively.

Our proposal is to approximate the sampling distribution of , as a function of sampled from the mixture in (1), by the distribution of , for fixed , as a function of . The justification for our claim that this provides an accurate approximation, at least asymptotically, is the following simple identity,

 VYn(Ψn)=VYn,S(ΨSn), (3)

where is the variance with respect to the distribution of from the mixture model (1), and

is the variance with respect to the joint distribution of

and , assumed to be independent. The identity (3) holds because, when data are independent and identically distributed, the extra layer of permutations changes nothing. In other words, if we imagine enumerating all possible realizations of to evaluate the left-hand side of (3), then we would get no new realizations if we also enumerated permutations for evaluating the right-hand side.

The practical value of the identity (3) is that the right-hand side suggests a familiar total-variance decomposition:

 VYn,S(ΨSn)=EYn{VYn,S(ΨSn∣Yn)}+VYn{EYn,S(ΨSn∣Yn)}. (4)

By the asymptotic consistency property of predictive recursion, discussed in Section 2, if is bounded and continuous, then and, hence, are consistent estimates of , so its variance, the second term on the right-hand side of (4), should be near 0 when is large. Therefore,

 VYn,S(ΨSn)≈EYn{VYn,S(ΨSn∣Yn)}, (5)

in other words, is an approximately unbiased estimator of . The key point, of course, is that the left-hand side of (3), namely, , the variance of the sampling distribution of , is relevant for uncertainty quantification, but it is not readily available. However, the quantity on the right-hand side of (3), namely, , can be readily computed by repeatedly sampling , permuting the data according to , and re-evaluating the predictive recursion estimator. This provides us with a relatively simple and fast approach to construct a data-dependent distribution for or from which valid uncertainty quantification can be achieved. And beyond variance estimation, to get an approximate % confidence interval for , we can easily extract the and quantiles of based on repeated sampling of with data fixed.

## 4 Numerical results

Here we carry out an empirical investigation into the performance of our proposed permutation-based approach to uncertainty quantification. We begin with pointwise evaluation of the mixing distribution function. The specific scenario we consider here is one where the true mixing density in (1) is a gamma with shape 2 and rate 1, and the kernel is also a gamma with shape and rate 20; this is Example 3-3 below. We generate samples of size from this mixture model and consider estimating the mixing distribution function, in particular, at the fixed points . For predictive recursion, we take initial guess and weight sequence suggested by Martin and Tokdar (2009).

Before presenting the results, we have to address a relevant and practically important question, namely, how many permutations? In our examples, we are using 200 randomly generated permutations on which to evaluate the predictive recursion estimator. Of course, it does not hurt to do more than 200, and this is still computationally feasible thanks to the algorithm’s efficiency; however, at least in the examples we tried, there were no substantial differences in the results based on more than 200 random permutations.

A plot of distribution function estimates for this gamma mixture example, based on 200 random permutations, is shown in Figure 1. Note that the span of these estimates over permutations hugs the true distribution relatively closely across the entire range of . The vertical bars at correspond to the central 95% interval of the sampling distribution of the predictive recursion estimator, based on repeated sampling from the gamma mixture. The goal of uncertainty quantification is to match this interval as closely as possible, so it is notable that, as predicted by the arguments leading up to (5), our permutation-based intervals are comparable to this “gold-standard” across various . Moreover, the permutation distribution takes only about 10 seconds in R running on an ordinary laptop computer. Fortini and Petrone (2019) present asymptotically approximate credible intervals for the same cumulative distribution function based on the predictive recursion algorithm. But their analysis is based on a dependent and non-stationary model for —one that makes the predictive recursion estimator “quasi-Bayes”—and, since their model and perspective on uncertainty quantification is different from ours, a direct comparison is not appropriate. Figure 1: Plot of the predictive recursion estimates of the distribution function for each of 200 random permutations (gray), along with the true distribution function (black); dashed line corresponds to the predictive recursion estimate averaged over permutations. Vertical bars correspond to the central 95% interval of the sampling distribution of the predictive recursion estimator.

Next we consider pointwise estimation of the mixing density. This is not strictly covered by the theoretical arguments discussed above because the functional for a fixed cannot be expressed as for a bounded and continuous . However, intuition and prior experience suggests that the predictive recursion density estimate ought to satisfy a pointwise consistency property, i.e., for fixed ; see, also, Section 5. Therefore, we can apply the same total-variance decomposition as in (4) and reason that

 VYn{pn(x)}≈EYn[VYn,S{pSn(x)∣Yn}]

and, moreover, that suitable quantiles from the permutation distribution can be used in the obvious way to construct approximate confidence intervals for .

Like in Chae et al (2018), we consider nine examples corresponding to different combinations of the following three kernels and mixing densities.

Kernel 1.

;

Kernel 2.

;

Kernel 3.

;

Mixing Density 1.

;

Mixing Density 2.

;

Mixing Density 3.

.

In what follows, Example - will refer to the case with kernel and mixing density , where and .

As a first visualization, we simulate observations from each of the above mixture models. For each data set, we run the predictive recursion algorithm for 200 randomly sampled permutations. Each panel in Figure 2 shows these 200 density estimates, , the true density, . As we expect, the cluster of permutation-based densities hugs the true mixing density rather closely throughout the range, with more variability in regions where the true density has more curvature. Also displayed in these panels is a central 95% interval from the sampling distribution of , at , based on 500 samples of size from the mixture model. As suggested by (5), the spread of the permutation distribution matches that of the sampling distribution relatively accurately at all three values and across all 9 of the examples.

To assess our claim that the permutation-based uncertainty quantification is approximately valid, we repeat the above experiment 500 times, extract the nominal 95% confidence interval for

based on the permutation distribution and check its coverage probability. Table

1 shows the estimated coverage probabilities for the all 9 examples, at each of the three values, and for two different sample sizes. Note, first, that the coverage probability increases from to . Major departures from the targeted 95% level are at values around which has considerable curvature; in regions where is smoother, the coverage probability estimate tends to be closer to the 95% level. The overall message is that the permutation-based distribution gives approximately valid uncertainty quantification across a variety of mixing densities and kernels.

## 5 Conclusion

This paper describes a simple permutation-based approach to uncertainty quantification about a mixing distribution by leveraging the built-in dependence of the predictive recursion estimator on the data ordering. The development and numerical results presented here suggest that the uncertainty quantification achieved by this approach, e.g., a confidence interval for the mixing distribution or density function, are valid in the sense that the frequentist coverage probability is approximately equal to the interval’s nominal level.

We will end with two concluding remarks. First, as noted in Section 4, we are currently lacking results on the pointwise consistency of the predictive recursion estimator of the mixing density estimate. At present, we have only results on almost sure convergence of mixing measure estimator to the truth in the weak topology. One idea is to prove the pointwise convergence directly using the structure of the predictive recursion estimator. Another idea is to check the available sufficient conditions, namely, equicontinuity (e.g., Boos 1985), to convert weak convergence of measures into uniform convergence of densities. Unfortunately, we were unable to push through either of these approaches, but this does not shake our confidence in the convergence conjecture.

Second, the idea of leveraging data ordering dependence employed herein is not specific to the predictive recursion estimator. That is, when data are independent and identically distributed or, more generally, exchangeable, and the density estimator in consideration depends on the data ordering, then the total-variance argument in (5) could be applied and approximately valid uncertainty quantification could be achieved. A natural question is: are there any density estimators that depend on the data ordering? Interestingly, while off-the-shelf estimators tend to be permutation invariant, one can consider Cesáro averages of these order-independent estimators, which retain the estimator’s good asymptotic properties while simultaneously creating order-dependence that can be leveraged using the techniques presented here for valid uncertainty quantification.

## Acknowledgments

This work is partially supported by the U.S. National Science Foundation, under grants DMS–1737929 and DMS–1811802.

## References

• Bissanz et al (2007) Bissantz N., Dümbgen L., Holzman H., Munk A. (2007) Non‐parametric confidence bands in deconvolution density estimation. Statist. Method., 69(3):483–506.
• Bickel and Rosenblatt (1973) Bickel P. J., Rosenblatt M. (1973) On some global measures of the deviations of density function estimates. Ann. Statist., 1(6):1071–1095.
• Boos (1985) Boos D. (1985) A converse to Scheffe’s theorem. Ann. Statist., 13(1):423–427.
• Chae et al (2018) Chae, M., Martin R., and Walker S. Convergence of an iterative algorithm to the nonparametric MLE of a mixing distribution. Statist. Probab. Lett., 140:142–146.
• Dempster et al (1977) Demster, A., Laird, N., and Rubin, D. Maximum-likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B, 39(1):1–38.
• Eggermont and LaRiccia (1995) Eggermont, P. and LaRiccia, V. (1995) Maximum smoothed likelihood density estimation for inverse problems. Ann. Statist., 23(1):199–220.
• Fan (1991) Fan, J. (1991) On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist., 19(3):1257–-1272.
• Fortini and Petrone (2019) Fortini S. and Petrone S. (2019) Quasi-Bayes properties of a recursive procedure for mixtures, Unpublished mauscript, arXiv:1902.10708v1
• Ghosh and Tokdar (2006) Ghosh, J. K. and Tokdar, S. T. (2006). Convergence and consistency of Newton’s algorithm for estimating mixing distribution. In Fan, J. and Koul, H., editors, Frontiers in Statistics, pages 429–443. Imp. Coll. Press, London.
• Henna (2005) Henna, J. (2005) Estimation of the number of components of finite mixtures of multivariate distributions. Ann. Instit. Statist. Math., 57(4):655–664.
• Leroux (1992) Leroux, B. G. (1992) Consistent estimation of a mixing distribution. Ann. Statist., 20(3):1350–1360
• Lindsay (1995) Lindsay, B. G. (1995) Mixture Models: Theory, Geometry and Applications IMS, Haywood, CA.
• Link and Sauer (1995) Link W. and Sauer J. (1995) Estimation and confidence intervals for empirical mixing distributions. Biometrics 15(3), 810–821.
• Liu et all (2009) Liu, L., Levine, M. and Zhu, Y. (2009) A functional EM algorithm for mixing density estimation via nonparametric penalized likelihood maximization. J. Comput. Graph. Statist., 18(2):481–504.
• Martin (2018) Martin, R. (2018). On nonparametric estimation of a mixing density via the predictive recursion algorithm. Unpublished manuscript, arXiv:1812.02149.
• Martin and Ghosh (2008) Martin, R. and Ghosh, J. K. (2008). Stochastic approximation and Newton’s estimate of a mixing distribution. Statist. Sci., 23(3):365–382.
• Martin and Tokdar (2009) Martin, R. and Tokdar, S. T. (2009). Asymptotic properties of predictive recursion: robustness and rate of convergence. Electron. J. Stat., 3:1455–1472.
• Martin and Tokdar (2011) Martin, R. and Tokdar, S. T. (2011). Semiparametric inference in mixture models with predictive recursion marginal likelihood. Biometrika, 98(3):567–582.
• Martin and Tokdar (2012) Martin, R. and Tokdar, S. T. (2012). A nonparametric empirical Bayes framework for large-scale multiple testing. Biostatistics, 13(3):427–439.
• McLachlan and Peel (2000) McLachlan, G. and Peel, D. (2000) Finite Mixture Models. Wiley Series in Probability and Statistics.
• Newton (2002) Newton, M. A. (2002). On a nonparametric recursive estimator of the mixing distribution. Sankhyā Ser. A, 64(2):306–322.
• Newton et al. (1998) Newton, M. A., Quintana, F. A., and Zhang, Y. (1998). Nonparametric Bayes methods using predictive updating. In Dey, D., Müller, P., and Sinha, D., editors,

Practical Nonparametric and Semiparametric Bayesian Statistics

, volume 133 of Lecture Notes in Statist., pages 45–61. Springer, New York.
• Newton and Zhang (1999) Newton, M. A. and Zhang, Y. (1999). A recursive algorithm for nonparametric analysis with missing data. Biometrika, 86(1):15–26.
• Redner and Walker (1984) Redner, R. and Walker, H. (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2):195–239
• Richard and Green (1997) Richardson, S. and Green, P. (1997) On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B. 59(4):731–792
• Stefanski and Carroll (1990) Stefanski, L. and Carroll, R.J. (1990)

Deconvoluting kernel density estimators.

Statistics, 21(2):169–184
• Teel et al. (2015) Teel, C., Park, T., and Sampson A. (2015) EM estimation for finite mixture models with known mixture component size. Comm. Statist.-Sim. and Comp., 44(6):1545–1556
• Tokdar et al. (2009) Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009). Consistency of a recursive estimate of mixing distributions. Ann. Statist., 37(5A):2502–2522.
• van Dyk and Meng (2001) van Dyk, D.  A. and Meng X.-L. (2001) The art of data augmentation. J. Comput. Graph. Statist,, 10(1):1–111.
• Zhang (1990) Zhang, C. H (1990) Fourier methods for estimating mixing densities and distributions. Ann, Statist.,, 18(2):806–831