denote a suitable space of continuous probability densities (henceforth merely called densities) on. A mixture on with components is a pair , where ; we write , and note that cannot have any components equal to zero. The corresponding mixture density is . The Jensen-Shannon divergence of is HarremoesBriet
where and is the entropy of .
is the mutual information between the random variablesand . Since mutual information is always nonnegative, the same is true of . The concavity of gives the same result, i.e. . If and
then is easy to show that .
We say that a density is unimodal if is either empty or contractible (within itself, or equivalently, has the homotopy type of a point) for all : for , this simply means that any nonempty sets are intervals. We call a mixture unimodal iff each of the component densities is unimodal. The unimodal category is the least integer such that there exists a unimodal mixture with components satisfying : in this event we write . The unimodal category is a topological invariant that generalizes the geometric category and is related to the Lusternik-Schnirelmann category BaryshnikovGhrist ; Ghrist .
The preceding constructions naturally lead us to consider the unimodal Jensen-Shannon divergence
as a simultaneous measure of both the topological and information-theoretical complexity of , and
as an information-theoretically optimal topological mixture estimate (TME). The natural questions are if such an estimate exists (is the supremum attained?), is unique, and if so, how to perform TME in practice. In this paper we address these questions for the case , and we demonstrate the utility of TME in examples (see Figures 4 and 5).
Ii Related work
While density estimation enables various clustering techniques LiRayLindsay ; AzzaliniMenardi , mixture estimation is altogether more powerful than clustering: e.g., it is possible to have mixture components that significantly and meaningfully overlap. For example, a cluster with a bimodal density will usually be considered as arising from two unimodal mixture components that are of interest in their own right. In this light and in view of its totally nonparametric nature, our approach can be seen as particularly powerful, particularly when coupled with topological density estimation and deblurring/reblurring (see §VII and §VIII).
Still, even for clustering (even in one dimension, where an optimal solution to -means can be computed efficiently WangSong ; NielsenNock ; GronlundEtAl ), determining the number of clusters in data Mirkin is as much an art as a science. All of the techniques we are aware of either require some ad hoc determination to be made, require auxiliary information (e.g., TibshiraniWaltherHastie ) or are parametric in at least a limited sense (e.g., SugarJames ). While a parametric approach allows likelihoods and thus various information criteria BurnhamAnderson or their ilk to be computed for automatically determining the number of clusters, this comes at the cost of a very strong modeling assumption, and the criteria values themselves are difficult to compare meaningfully MelnykovMaitra .
These shortcomings–including determining the number of mixture components–carry over to the more difficult problem of mixture estimation. McLachlanPeel ; McLachlanRathnayake ; MelnykovMaitra As an example, an ad hoc and empirically derived unimodal mixture estimation technique that requires one of a few common functional forms for the mixture components has been recently employed in MintsHekkerLinLeeHsieh ; BassoEtAl : though these generalize to the multivariate case quite effectively (see, e.g., LeeMcLachlan ), the EM-type algorithms are generically vulnerable to becoming trapped in local minima without good initial parameter values, and they require some model selection criterion to determine the number of mixture components (though the parameter learning and model selection steps can be integrated as in FigueiredoJain ). A Bayesian nonparametric mixture model that incorporates many–but not arbitrary–unimodal distributions is considered in RodriguezWalker . Principled work has been done on estimating mixtures of log-concave distributions Walther ; ChanEtAl describes how densities of discrete unimodal mixtures can be estimated. However, actually estimating generic unimodal mixtures themselves appears to be unaddressed in the literature, even in one dimension. Indeed, even estimating individual modes and their associated uncertainties or significances has only been addressed recently GenoveseEtAl ; Mukhopadhyay .
Given , the “sweep” algorithm of BaryshnikovGhrist yields . We will show how to repeatedly perturb to obtain (3) using two key lemmas. The first lemma, that is convex under perturbations of that preserve , is the subject of §IV. The second lemma, a characterization of perturbations of two components of a piecewise affine and continuous (or piecewise constant) mixture that preserve the predicate (i.e., that preserve unimodality and the mixture density), is the subject of §V. Together, these results entail that greedy unimodality- and density-preserving local perturbations of pairs of mixture components converge to (3). The proof in §VI is scarcely more involved than the preceding statement once the lemmas are in hand. Next, in §VII we review the related technique of topological density estimation (TDE) before showing in §VIII how blurring and deblurring mixture estimates can usefully couple TDE and TME. Finally, in §IX we produce examples of TME in action.
The following lemma shows that is convex as we gradually shift part of one mixture component to another.
Lemma. Let . The function defined by
Proof. The statement as well as the proof of the lemma can be simplified with a bit of extra notation. Define
Now , so
Furthermore, if we write , , , and , then
It is well known that is a concave functional: from this it follows that
V Preserving unimodality
Suppose that is a unimodal mixture on with . We would like to determine how we can perturb two components of this mixture so that the result is still unimodal and yields the same density. In the event that the mixture is piecewise affine and continuous (or piecewise constant) the space of permissible perturbations can be characterized by the following
Lemma. For , let be such that there are integers satisfying and
(That is, is a nonnegative, nontrivial unimodal sequence.) Then for and ,
Similarly, for ,
Proof (sketch). We first sketch ((8),). Nonegativity follows from . Unimodality follows from a series of trivial checks for the cases , , , and : the remaining cases , , and follow from symmetry. For example, in the case , we only need to show that .
A sketch of ((8),) amounts to using the same cases and symmetry argument to perform equally trivial checks. For example, in the case , we have .
The proof of (9) is mostly similar to that of (8): the key difference here is that any argument adjacent to or at a point where the maximum is attained can have its value increased arbitrarily without affecting unimodality (or nonnegativity).
The example in figure 1 is probably more illuminating than filling in the details of the proof sketch above.
Theorem. Let and be piecewise constant (or affine) over each . Then there is an efficient algorithm to compute (3).
Proof. First, compute via the “sweep” algorithm of BaryshnikovGhrist . Then, for each and pair of mixture components, compute the largest possible local perturbation of the mixture according to the lemma in §V. By the lemma in §IV, choosing a resulting maximal value of and repeating the perturbation-evaluation procedure gives the desired result in iterations. This result is unique by convexity.
Vii Topological density estimation
The obvious situation of practical interest for TME is that a density has been obtained from a preliminary estimation process involving some sample data. There is a natural approach to this preliminary estimation process called topological density estimation (TDE) Huntsman that naturally dovetails with (and as it happens, inspired) TME.
We recall the basic idea here. Given a kernel and sample data for , and for each proposed bandwidth
, we compute the kernel density estimateSilverman ; Chen
where . Next, we compute
and estimate the unimodal category of the PDF that is sampled from via
where denotes an appropriate measure (nominally counting measure or the pushforward of Lebesgue measure under the transformation ).
That is, (12) gives the most prevalent (and usually in practice, also topologically persistent Ghrist ; Oudot ) value of the unimodal category, i.e., this is a topologically robust estimate of the number of components required to produce the PDF that is sampled from as a mixture. While any element of is a bandwidth consistent with the estimate (12), we typically make the more detailed nominal specification
TDE turns out to be very computationally efficient relative to the traditional technique of cross-validation (CV). On highly multimodal densities, TDE is competitive or at least reasonably performing relative to CV and other nonparametric density estimation approaches with respect to traditional statistical evaluation criteria. Moreover, TDE outperforms other approaches when qualitative criteria such as the number of local maxima and the unimodal category itself are considered (see Figures 2 and 3). In practice, such qualitative criteria are generally of paramount importance. For example, precisely estimating the shape of a density is generally less important than determining if it has two or more clearly separable modes.
As an illustration, consider , and the family of distributions
for and , and where here is the standard Gaussian density: see Figure 2. Exhaustive details relating to the evaluation of TDE on this family and other densities are presented in the software package and test suite BAE : here, we content ourselves with the performance data for (14) shown in Figure 3.
TDE has the very useful feature (shared by essentially no high-performing density estimation technique other than CV) that it requires no free parameters or assumptions. Indeed, TDE can be used to evaluate its own suitability: for unimodal distributions, it is clearly not an ideal choice–but it is good at detecting this situation in the first place. Furthermore, TDE is very efficient computationally.
In situtations of practical interest, it is tempting to couple TDE and TME in the obvious way: i.e., perform them sequentially and indepdently. This yields a completely nonparametric estimate of a mixture from sample data alone. However, there is a better way to couple these techniques, as we shall see in the sequel.
Viii Blurring and deblurring
Recall that a log-concave function is unimodal, and moreover that a function is log-concave iff its convolutions with unimodal functions are identically unimodal. BertinCuculescuTheodorescu ; Ibragimov ; KeilsonGerber This observation naturally leads to the following question: if , how good of an approximation to the distribution must a log-concave density be in order to have ? In particular, suppose that is a Gaussian density: what bandwidth must it have? An answer to this question of how much blurring a minimal unimodal mixture model can sustain defines a topological scale (viz., the persistence of the unimodal category under blurring) that we proceed to illustrate in light of TDE.
In this paragraph we assume that is the standard Gaussian density, so that and . Write for the bandwidth obtained via TDE, whether via the nominal specification (13) or any other: by construction we have that . Now if , then . In order to have , it must be that , i.e.,
In particular, we have the weaker inequality involving a purely topological scale:
The preceding considerations generalize straightforwardly if we define , where once again is a generic kernel. This generalizes (11) so long as we associate sample data with a uniform average of distributions. Under reasonable conditions, we can write , and it is easy to see that the analogous bound is
Of course, (17) merely restates the triviality that the blurred mixture ceases to be minimal precisely when the number of mixture components exceeds the unimodal category of the mixture density. Meanwhile, the special case furnished by TDE with the standard Gaussian kernel affords sufficient structure for a slightly less trivial statement.
The considerations of §VIII.1 suggest how to couple TDE and TME in a much more effective way than performing them sequentially and independently. The idea is to use a Gaussian kernel and instead of (13), pick the bandwidth
and then perform TME; finally, convolve the results with where
This preserves the result of TDE while giving a smoother, less artificial, and more practically useful mixture estimate than the information theoretically optimal result.
Of course, a similar tactic can be performed directly on a density by considering its Fourier deconvolution , where
denotes the Fourier transform andis as in (17): however, any a priori justification for such a tactic is necessarily context-dependent in general, and our experience suggests that its implementation would be delicate and/or prone to aliasing. Nevertheless, this would be particularly desirable in the context of heavy-tailed distributions, where kernel density estimation requires much larger sample sizes in order to achieve acceptable results. In this context it would also be worth considering the use of a symmetric stable density ZolotarevUchaikin ; Nolan (e.g., a Cauchy density) as a kernel with the aim of recapturing the essence of (19).
We present two phenomenologically illustrative examples. First, in Figure 4 we consider the waiting times between eruptions of the Old Faithful geyser from the data set in Hardle . Then, in Figure 5 we consider the Sloan Digitial Sky Survey color indices accessed from the VizieR database OchsenbeinBauerMarcout at http://cdsarc.u-strasbg.fr/viz-bin/Cat?J/ApJ/700/523 and discussed in AnEtAl .
As suggested, several phenomena are readily apparent from these examples. First, mixtures obtained via the sweep algorithm are manifestly parity-dependent, i.e., the direction of sweeping matters; second, mixtures obtained via TME alone exhibit artificial anti-overlapping behavior; third, deblurring followed by reblurring preserves unimodality, the overall density, a topologically persistent invariant (viz., the unimodal category) and the spirit of information-theoretical optimality while producing an obviously better behaved mixture; fourth and finally, the various techniques involved here can significantly shift classification/decision boundaries based on the dominance of various mixture components.
While the data in Figure 4 is at least qualitatively approximated by a two-component Gaussian mixture, it is clear that a three-component Gaussian mixture cannot capture the highly oscillatory behavior of the density in Figure 5. Indeed, this example illustrates how such oscillatory behavior can actually arise from a unimodal mixture with many fewer components than might naively appear to be required.
Figure 6 shows that there are strong and independent grounds to conclude that the color index data of Figure 5 is produced by a unimodal mixture of three components, with componentwise modes as suggested by TME, and furthermore that CV undersmooths this data. For each density estimate shown, each componentwise extremum of the corresponding mixture (3) is virtually identical to one of the local extrema of the density estimate itself: this is a consequence of the anti-overlapping tendency described above.
In particular, the fourth panel of Figure 6 illustrates that it is possible and potentially advantageous to use TME as an alternative mode identification technique in the LPMode algorithm of Mukhopadhyay . Furthermore, while we have not implemented a reliable Fourier deconvolution/reblurring algorithm of the sort hinted at in §VIII.2, the fifth and sixth panels of Figure 6 suggest that this is not particularly important for the narrow task of mode finding/bump hunting.
The author thanks Adelchi Azzalini, Robert Ghrist, Subhadeep Mukhopadhyay, and John Nolan for their helpful and patient discussions. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA or AFRL.
- (1) An, D. et al. “Galactic globular and open clusters in the Sloan Digital Sky Survey. II. Test of theoretical stellar isochrones.” Astrophys. J. 700, 523 (2009).
- (2) Azzalini, A. and Menardi, G. “Clustering via nonparametric density estimation: the R package pdfCluster”. J. Stat. Soft. 57 (2014).
- (3) BAE Systems. MATLAB implementation of TDE, available at https://git.io/vyC4l (2017).
- (4) Baryshnikov, Y. and Ghrist, R. “Unimodal category and topological statistics.” NOLTA (2011).
Basso, R. M. et al.
“Robust mixture modeling based on scale mixtures of skew-normal distributions.”Comp. Stat. Data Anal. 54, 2926 (2010).
- (6) Bertin, E. M. J., Cuculescu, I., and Theodorescu, R. Unimodality of Probability Measures. Kluwer (1997).
- (7) Burnham, K. P. and Anderson, D. R. Model Selection and Multimodal Inference: A Practical Information-Theoretic Approach. 2nd ed. Springer (2003).
- (8) Chan, S.-O., et al. “Learning mixtures of structured distributions over discrete domains.” SODA (2013).
- (9) Chen, Y.-C. “A tutorial on kernel density estimation and recent advances.” arXiv:1704.03924 (2017).
Figueiredo, M. A. T. and Jain, A. K. “Unsupervised learning of finite mixture models.”IEEE Trans. Pattern Anal. Mach. Intell. 24, 381 (2002).
- (11) Genovese, C. R. et al. “Non-parametric inference for density modes.” J. Royal Stat. Soc. B 78, 99 (2016).
- (12) Ghrist, R. Elementary Applied Topology. CreateSpace (2014).
- (13) Grønlund, A., et al. “Fast exact -means, -medians, and Bregman divergence clustering in 1D.” arXiv:1701.07204 (2017).
- (14) Härdle, W. Smoothing Techniques with Implementation in S. Springer (1991).
- (15) Briët, J. and Harremoës, P. “Properties of classical and quantum Jensen-Shannon divergence.” Phys. Rev. A 79, 052311 (2009).
- (16) Huntsman, S. “Topological density estimation.” NOLTA (2017).
- (17) Ibragimov, I. A. “On the composition of unimodal distributions.” Th. Prob. Appl. 1, 255 (1956).
- (18) Keilson, J. and Gerber, H. “Some results for discrete unimodality.” J. Am. Stat. Ass. 66, 386 (1971).
- (19) Lee, S. X. and McLachlan, G. J. “Finite mixtures of canonical fundamental skew- distributions.” Stat. Comput. 26, 573 (2016).
Li, J., Ray, S., and Lindsay, B. G. “A nonparametric statistical approach to clustering via mode identification.”
J. Machine Learning Res.8, 1687 (2007).
- (21) Lin, T. I., Lee, J. C., and Hsieh, W. J. “Robust mixture modeling using the skew distribution.” Stat. Comput. 17, 81 (2007).
- (22) McLachlan, G. and Peel, D. Finite Mixture Models. Wiley (2000).
- (23) McLachlan, G. J. and Rathnayake, S. “On the number of components in a Gaussian mixture model.” WIREs Data Mining Knowl. Disc. 4, 341 (2014).
- (24) Melnykov, V. and Maitra, R. “Finite mixture models and model-based clustering.” Stat. Surv. 4, 80 (2010).
- (25) Mints, A. and Hekker, S. “A Unified tool to estimate Distances, Ages, and Masses (UniDAM) from spectrophotometric data.” Astro. Astrophys. 604, A108 (2017).
- (26) Mirkin, B. “Choosing the number of clusters.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, 252 (2011).
- (27) Mukhopadhyay, S. “Large-scale mode identification and data-driven sciences.” Electronic J. Stat. 11, 215 (2017).
- (28) Nielsen, F. and Nock, R. “Optimal interval clustering: application to Bregman clustering and statistical mixture learning.” IEEE Sig. Proc. Lett. 21, 1289 (2014).
- (29) Nolan, J. P. Stable Distributions: Models for Heavy Tailed Data. Birkhaüser (2018).
- (30) Ochsenbein, F., Bauer, P. and Marcout, J. “The VizieR database of astronomical catalogues.” Astro. Astrophys. Supp. Ser. 143, 23 (2000).
- (31) Oudot, S. Y. Persistence Theory: From Quiver Representations to Data Analysis. AMS (2015).
- (32) Rodríguez, C. E. and Walker, S. G. “Univariate Bayesian nonparametric mixture modeling with unimodal kernels.” Stat. Comp. 24, 35 (2014).
- (33) Silverman, B.W. Density Estimation for Statistics and Data Analysis. Chapman & Hall (1986).
- (34) Sugar, C. A. and James, G. M. “Finding the number of clusters in a dataset: an information-theoretic approach.” J. Am. Stat. Ass. 98, 750 (2003).
- (35) Tibshirani, R., Walther, G., and Hastie, T. “Estimating the number of clusters in a data set via the gap statistic.” J. Roy. Stat. Soc. B 63, 411 (2001).
- (36) Walther, G. “Inference and modeling with log-concave distributions.” Stat. Sci. 24, 319 (2009).
- (37) Wang, H. and Song, M. “Ckmeans.1d.dp: optimal -means clustering in one dimension by dynamic programming.” R J. 3, 29 (2011).
- (38) Zolotarev, V. M. and Uchaikin, V. V. Chance and Stability: Stable Distributions and Their Applications. De Gruyter (1999).