M-decomposability, elliptical unimodal densities, and applications to clustering and kernel density estimation

02/12/2008
by Nicholas Chia, et al.

Chia and Nakano (2009) introduced the concept of M-decomposability of probability densities in one dimension. In this paper, we generalize M-decomposability to any dimension. We prove that all elliptical unimodal densities are M-undecomposable. We also derive an inequality showing that it is better to represent an M-decomposable density via a mixture of unimodal densities. Finally, we demonstrate the application of M-decomposability to clustering and kernel density estimation, using real and simulated data. Our results show that M-decomposability can be used as a non-parametric criterion to locate modes in probability densities.


1. Introduction

In a recent paper, Chia and Nakano (2009) conceptualized M-decomposability and developed the theory in one dimension. The main results are summarized in the following paragraph.

M-decomposability is defined as follows. Let $f$ be a probability density defined in one dimension. There exist countless ways to express $f$ as a weighted mixture of two probability densities, in the form of

$$f(x) = \pi\, g(x) + (1 - \pi)\, h(x), \qquad 0 < \pi < 1.$$

If it is possible to find any combination of $g$, $h$ and $\pi$ which satisfies

$$\sigma_g + \sigma_h \;<\; \sigma_f,$$

then the original density $f$ is said to be M-decomposable. Otherwise, $f$ is M-undecomposable. Intuitively, multimodal densities with widely separated peaks are likely to be M-decomposable. Conversely, unimodal densities are probably M-undecomposable. The authors proved that all one-dimensional symmetric unimodal densities with finite second moments are M-undecomposable. In other words, if $f$ is symmetric unimodal and has finite second moments, then for any weighted mixture density components $g$ and $h$ of $f$, one must have

$$\sigma_g + \sigma_h \;\ge\; \sigma_f. \tag{1}$$

Eq (1) applies to a wide range of densities, including the Gaussian, Laplace, logistic and many others. The authors also showed the possibility of using M-decomposability to perform cluster analysis and mode finding in one dimension. Incidentally, the "M" in M-decomposability may mean either "multimodal" or "mixture".
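As a quick numerical illustration of Eq (1) (our own sketch, not code from the paper), one can split a standard Gaussian sample at arbitrary cut points into its two conditional components and confirm that their standard deviations always sum to at least that of the original:

```python
# Check Eq (1) empirically: split a sample from f = N(0, 1) at a cut point
# into two conditional densities g and h, and compare sigma_g + sigma_h
# with sigma_f.  The theorem guarantees the sum never falls below sigma_f.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1_000_000)        # sample from f = N(0, 1)

for cut in [-1.0, 0.0, 0.5, 2.0]:
    g, h = x[x <= cut], x[x > cut]         # two mixture components of f
    print(cut, g.std() + h.std() >= x.std())   # True every time, per Eq (1)
```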

In this paper, we further contribute to M-decomposability in both theoretical and applied respects. On the theoretical front, we generalize the concept of M-decomposability to any $d$-dimensional space. First, we derive a theorem (Theorem 2.4) that is the $d$-dimensional equivalent of Eq (1). We prove that all elliptical unimodal densities with finite second moments are M-undecomposable. These densities include the multivariate Gaussian, Laplace, logistic and many others. Following that, we derive another theorem (Theorem 2.5), which determines whether a given density is better approximated via a mixture of Gaussian densities instead of a single Gaussian density.

One example of an application of M-decomposability is cluster analysis. For decades, cluster analysis has been a popular research subject, from both the theoretical and the algorithmic point of view. Cluster analysis is likely to remain a widely researched topic, given the many different approaches that cater to varying applications. The survey paper by Berkhin (2002) provides an overview of available clustering techniques and methodologies. There are two main classes of cluster analysis methodologies: parametric and non-parametric. For parametric cluster analysis, one needs prior knowledge of, or assumptions on, the analytical structure of the underlying clusters. The whole dataset is modeled as a mixture of parametrized densities, and the problem reduces to parameter estimation. In McLachlan and Peel (2000), parametric cluster analysis via the Expectation-Maximization (EM) algorithm is described in detail. Other parametric methods include the Bayesian particle filter approach detailed in Fearnhead (2004), and the reversible jump Markov chain Monte Carlo (MCMC) approach of Richardson and Green (1997). For parametric cluster analysis, the most popular approach is to model the clusters as Gaussian densities.

As for non-parametric cluster analysis, a popular tool is the $k$-means algorithm. The $k$-means algorithm is optimal for locating similar-sized spherical clusters within a dataset, provided the number of clusters is known beforehand. With elliptical clusters, or clusters of varying sizes, the $k$-means approach yields results that are meaningless. The $k$-means algorithm assigns samples to clusters based on distance (Euclidean or its variations) to the centres of the clusters. Other distance-based non-parametric clustering algorithms include nearest-neighbour clustering. Distance-based clustering algorithms generally share the same drawbacks, such as sensitivity to scaling, to elliptical clusters and to clusters of varying sizes. If the number of clusters is not known beforehand, neither the $k$-means algorithm nor the nearest-neighbour algorithm estimates the number of clusters automatically. For the $k$-means algorithm, the unknown number of clusters has to be evaluated via Akaike's information criterion (AIC), proposed by Akaike (1974), or another suitable model selection criterion.

Our approach to cluster analysis via M-decomposability is non-parametric and is based on volume instead of distance. Being non-parametric, it requires no prior knowledge of the analytical structure of the underlying clusters. The only assumption required is that the clusters are approximately elliptical and unimodal. As a result, the limitation of clustering via M-decomposability is that it will probably not perform ideally for irregularly shaped clusters that deviate from elliptical unimodal densities. However, if the clusters are approximately elliptical and unimodal, then our clustering methodology works well and allows the unknown number of clusters to be recovered automatically. Furthermore, as clustering via M-decomposability is based on volume instead of distance, cluster allocation is invariant to scaling.

For existing alternative methodologies to clustering, there has been recent development on Rousseeuw's minimum volume ellipsoids (MVE); see Rousseeuw and Leroy (1987) and Rousseeuw and van Zomeren (1990). The MVE approach was originally developed as a robust method to estimate mean vectors and covariance matrices of multivariate data in the presence of outliers. MVE is computationally intensive and the optimal solution is often difficult to achieve, prompting many research papers on the algorithmic aspects of the problem. Some authors, for example Shioda and Tunçel (2005), outlined a heuristic for clustering via MVE by minimizing the sum of the volumes of the clusters. Our methodology of clustering via M-decomposability has some similarities with clustering via the MVE approach, in that both measure "volume" in a certain sense. Central to the M-decomposability concept is the "pseudo-volume", which we define as the square root of the determinant of the covariance matrix. Compared to MVE, the pseudo-volume is computationally cheap and straightforward. On top of that, we also provide theoretical justification in Theorem 2.5 for minimizing the sum of pseudo-volumes of clusters.

Another possible area of application of M-decomposability is density estimation. In density estimation, data generated from some unknown density are given, and the task is to estimate and recover the unknown density. One popular non-parametric approach is kernel density estimation, treated in Silverman (1986), Scott (1992), Härdle et al (2004), as well as Wand and Jones (1995). The difficulty in kernel density estimation is the derivation of the optimal kernel bandwidth: if the bandwidth is underestimated, the kernel density estimate becomes unduly spiky; if it is overestimated, the estimate becomes oversmoothed. For multimodal densities, it is not possible to find a single kernel bandwidth that provides a satisfactory density estimate everywhere. Using M-decomposability, we demonstrate that there is a simple and logical way to circumvent this problem by representing the underlying density as a mixture of unimodal densities where necessary.
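The following sketch (ours; the paper does not prescribe this code) illustrates the bandwidth dilemma and the mixture-based fix on a bimodal sample. The component labels are assumed known here; in practice they would come from an M-decomposition of the sample.

```python
# A single-bandwidth KDE vs. a mixture of per-component KDEs on a bimodal
# sample whose modes have very different spreads.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
narrow = rng.normal(0.0, 0.2, 500)        # sharp mode
wide = rng.normal(6.0, 2.0, 500)          # diffuse mode
sample = np.concatenate([narrow, wide])

grid = np.linspace(-2.0, 14.0, 400)
kde_single = gaussian_kde(sample)(grid)   # one global bandwidth: over- or
                                          # under-smooths one of the modes
# Mixture of unimodal KDEs: each component gets its own bandwidth, and the
# pieces are recombined with their empirical mixture weights.
w = len(narrow) / len(sample)
kde_mixture = w * gaussian_kde(narrow)(grid) + (1 - w) * gaussian_kde(wide)(grid)
```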

This paper develops both the theoretical and the applied aspects of M-decomposability, and should therefore be of interest to theoretical statisticians and practitioners alike. Section 2 is devoted to the theoretical development of M-decomposability in $d$-dimensional space. Readers who are interested only in applications may note just the results of Theorems 2.4 and 2.5 and skip the rest of Section 2 without disrupting the flow of the paper.

2. M-Decomposability in $d$-Dimensional Space

2.1. Extensions from One Dimension

In Chia and Nakano (2009), M-decomposability involves only the standard deviations of probability densities. This is because in one dimension, the standard deviation is a natural measure of the scatter of a given density: the standard deviation of any density in one dimension has the same order as distance, or "length", computed from the mean. In higher dimensions, a corresponding measure of scatter is the square root of the determinant of the covariance matrix of the density, which has the same order as $d$-dimensional "hypervolume". Henceforth, we shall call this measure the pseudo-volume of a density. We denote the covariance matrix of a density $f$ by $\Sigma_f$, so the pseudo-volume of $f$ is given by $\sqrt{\det \Sigma_f}$. In one dimension, the pseudo-volume reduces to the standard deviation.
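In code, the sample pseudo-volume is a one-liner; the helper below is our own numpy sketch, not part of the paper:

```python
# Pseudo-volume of a sample: sqrt of the determinant of its covariance matrix.
import numpy as np

def pseudo_volume(X):
    """X is an (n_samples, d) array; for d = 1 this reduces to the sample
    standard deviation."""
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    return float(np.sqrt(np.linalg.det(cov)))
```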

In Chia and Nakano (2009), the authors limited the number of mixture components to two in their development of M-decomposability. In this paper, we show that it is possible to relax this limitation and generalize the number of mixture components to $n$, where $n \ge 2$. Let $f$ be a probability density function defined on $\mathbb{R}^d$, the $d$-dimensional real space. One can always express $f$ as a weighted mixture of $n$ densities as follows:

$$f(x) \;=\; \sum_{i=1}^{n} \pi_i\, g_i(x), \tag{2}$$

where $\pi_i > 0$ and $\sum_{i=1}^{n} \pi_i = 1$. Henceforth, we call any set of densities $\{g_1, \ldots, g_n\}$ which satisfies Eq (2) a set of mixture components of $f$.

We extend the definition of M-decomposability to $d$-dimensional space as follows. [M-Decomposability] For a given probability density function $f$, if there exists a set of mixture components $\{g_1, \ldots, g_n\}$ such that

$$\sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;<\; \sqrt{\det \Sigma_f},$$

then $f$ is defined to be M-decomposable. Otherwise, $f$ is M-undecomposable. If, for any set of mixture components $\{g_1, \ldots, g_n\}$,

$$\sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;>\; \sqrt{\det \Sigma_f},$$

then $f$ is strictly M-undecomposable. Our new definition of M-decomposability reduces to that presented in Chia and Nakano (2009) when $d = 1$ and $n = 2$. In any dimension, the definition of M-decomposability can be described compactly using pseudo-volumes.
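As an illustration (our sketch, not the authors' code), the definition can be tested directly on data: a candidate decomposition given by cluster labels certifies M-decomposability whenever the summed pseudo-volumes of the parts fall below that of the whole.

```python
# Empirical test of the definition above on a labelled candidate split.
import numpy as np

def pseudo_volume(X):
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    return float(np.sqrt(np.linalg.det(cov)))

def certifies_m_decomposable(X, labels):
    """True if the labelled partition witnesses M-decomposability of X."""
    parts = sum(pseudo_volume(X[labels == k]) for k in np.unique(labels))
    return parts < pseudo_volume(X)
```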

2.2. Elliptical Uniform Densities

The uniform density is trivially defined in one dimension, but in higher dimensions it may assume many different shapes. For example, one may think of the uniform hypercube or the uniform hypersphere. However, the subject of interest in our paper is the elliptical uniform density, which forms the fundamental building block of elliptical unimodal densities.

Ellipticity, uniformity and unimodality are three different qualities. The definitions of the first two are given immediately below, and the third will be given in Section 2.3.

[Elliptical and Spherical Densities] We say that $f$ is elliptical if there exist a vector $\mu \in \mathbb{R}^d$, a positive semidefinite symmetric matrix $\Sigma$ and a positive function $\psi$ on $[0, \infty)$ such that

$$f(x) \;\propto\; \psi\!\left((x - \mu)^{\mathsf T}\, \Sigma^{-1}\, (x - \mu)\right).$$

Furthermore, if $\Sigma = \sigma^2 I_d$, where $\sigma > 0$ and $I_d$ denotes the $d$-dimensional identity matrix, then $f$ becomes

$$f(x) \;\propto\; \psi\!\left(\|x - \mu\|^2 / \sigma^2\right),$$

and we say that $f$ is spherical. The mean and covariance matrix of the above-defined elliptical density $f$ are as follows: the mean is $\mu$ and the covariance matrix is proportional to $\Sigma$.

[Uniform Densities] We say that $f$ is elliptical uniform if there exist a vector $\mu \in \mathbb{R}^d$, a positive semidefinite symmetric matrix $\Sigma$, and a positive real number $c$ such that

$$f(x) \;=\; c\, \mathbf{1}\!\left\{(x - \mu)^{\mathsf T}\, \Sigma^{-1}\, (x - \mu) \le 1\right\},$$

where $\mathbf{1}\{\cdot\}$ denotes the indicator function. Furthermore, if $\Sigma = \sigma^2 I_d$, where $\sigma > 0$ and $I_d$ denotes the $d$-dimensional identity matrix, then $f$ becomes

$$f(x) \;=\; c\, \mathbf{1}\!\left\{\|x - \mu\| \le \sigma\right\},$$

and we say that $f$ is spherical uniform.

[Inequality on Elliptical Uniform Densities] All elliptical uniform densities defined on $\mathbb{R}^d$ are M-undecomposable in $\mathbb{R}^d$, and strictly M-undecomposable for $d \ge 2$.

The proof of Theorem 2.2 relies on the following lemma.

[Density with Minimum Pseudo-volume] Let $f$ be a probability density function defined on $\mathbb{R}^d$ such that $f(x) \le K$ for all $x$. Then

$$\sqrt{\det \Sigma_f} \;\ge\; \frac{1}{K\, V_d\, (d+2)^{d/2}},$$

where $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ denotes the volume of the unit ball in $\mathbb{R}^d$. Identity holds if and only if $f$ is elliptical uniform with $\max_x f(x) = K$.

When $d = 1$, we recover $\sigma_f \ge 1/(\sqrt{12}\, K)$, the result obtained in Chia and Nakano (2009). The proof of Lemma 2.2 has been relegated to Section 5.2 of the appendix to enhance the flow of the paper. We use the results of Lemma 2.2 to prove Theorem 2.2.

Proof of Theorem 2.2.

Let $f$ be an elliptical uniform density on $\mathbb{R}^d$. We need to prove that, for any set of mixture components $\{g_1, \ldots, g_n\}$ of $f$,

$$\sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;\ge\; \sqrt{\det \Sigma_f}.$$

Write $K = \max_x f(x)$. Since $f$ is elliptical uniform, the equality case of Lemma 2.2 gives

$$\sqrt{\det \Sigma_f} \;=\; \frac{1}{K\, V_d\, (d+2)^{d/2}}.$$

Rewriting the elliptical uniform density as mixture components, we have

$$f(x) \;=\; \sum_{i=1}^{n} \pi_i\, g_i(x)$$

for some $\{g_1, \ldots, g_n\}$ satisfying $\pi_i > 0$ and $\sum_{i=1}^{n} \pi_i = 1$. As a result, we have

$$g_i(x) \;\le\; \frac{f(x)}{\pi_i} \;\le\; \frac{K}{\pi_i}$$

for all $i$ and all $x$. Using Lemma 2.2, we have

$$\sqrt{\det \Sigma_{g_i}} \;\ge\; \frac{\pi_i}{K\, V_d\, (d+2)^{d/2}} \;=\; \pi_i \sqrt{\det \Sigma_f} \tag{3}$$

for all $i$, with equalities holding if and only if the density in question is elliptical uniform. Now, for $d \ge 2$, we can have some, but never all, of the $g_i$'s be elliptical uniform densities attaining equality in Eq (3). Therefore,

$$\sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;>\; \sum_{i=1}^{n} \pi_i \sqrt{\det \Sigma_f} \;=\; \sqrt{\det \Sigma_f}.$$

Identity may only hold when $d = 1$; refer to Chia and Nakano (2009). ∎

2.3. Elliptical Unimodal Densities

In one dimension, symmetry is trivial to visualize and to express mathematically. In higher dimensions, symmetry may be depicted via ellipticity. As such, elliptical unimodal densities play a key role in this paper. We provide a definition of elliptical unimodal densities below. Elliptical densities in general have been treated in detail by many researchers; see Fang et al (1990) and references therein. Unimodal densities have also been the subject of active research; for example, refer to Anderson (1955), Dharmadhikari and Joag-Dev (1987), as well as Ibragimov (1956).

[Elliptical Unimodal Densities] We say that $f$ is elliptical unimodal if there exist a vector $\mu \in \mathbb{R}^d$, a positive semidefinite symmetric matrix $\Sigma$ and a non-increasing positive function $\psi$ on $[0, \infty)$ such that

$$f(x) \;\propto\; \psi\!\left((x - \mu)^{\mathsf T}\, \Sigma^{-1}\, (x - \mu)\right).$$

Comparing with Definition 2.2, the only additional requirement in Definition 2.3 is that the positive function $\psi$ be non-increasing. According to Definition 2.3, elliptical unimodal densities are those whose cross-sections are elliptical, with mean $\mu$ and covariance matrix proportional to $\Sigma$. Definition 2.3 encompasses a large class of general densities, including the $d$-dimensional elliptical uniform, Gaussian, logistic, Laplace, von Mises, beta($\alpha$, $\alpha$) where $\alpha \ge 1$, Student-$t$, and many other densities.

Henceforth, we propose the following alternative representation of elliptical unimodal densities. [Representation of Elliptical Unimodal Densities] Let $f$ be an elliptical unimodal density with mean $\mu$ and covariance matrix $\Sigma_f$. Then, for every $\delta > 0$, it is possible to construct a density

$$f_\delta(x) \;=\; \sum_{j} w_j\, u_j(x)$$

such that $f_\delta$ approximates $f$ to arbitrary accuracy as $\delta \to 0$. Here, each $u_j$ is an elliptical uniform density such that

$$u_j(x) \;\propto\; \mathbf{1}\!\left\{(x - \mu)^{\mathsf T}\, \Sigma^{-1}\, (x - \mu) \le c_j\right\}, \tag{4}$$

and the $c_j$'s are strictly positive. Furthermore, each proportionality constant $w_j$ satisfies

$$w_j \;\propto\; \delta \times \text{(hypervolume of the cross-section of } u_j\text{)}.$$

From the above representation, each elliptical uniform component is weighted proportionally to the hypervolume of its cross-section. The original elliptical unimodal density is "sliced latitudinally" into elliptical uniforms of a prefixed constant "thickness" $\delta$. The proof of Theorem 2.3 has been relegated to Section 5.3 of the appendix.
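To make the slicing concrete, the following one-dimensional sketch (our own discretization choices, not the paper's) rebuilds a standard normal from uniform slices of thickness $\delta$ and checks that the total weight and the variance are recovered:

```python
# Slice a standard normal latitudinally into uniform densities of thickness
# delta; slice j sits at height y_j and covers the level set {x: f(x) >= y_j},
# i.e. the interval [-x_j, x_j].  Its weight is the area of that slab.
import numpy as np

delta = 1e-4
f_max = 1.0 / np.sqrt(2.0 * np.pi)                     # peak of N(0, 1)
heights = np.arange(delta / 2.0, f_max, delta)         # slab mid-heights y_j
half_widths = np.sqrt(-2.0 * np.log(heights * np.sqrt(2.0 * np.pi)))  # x_j
weights = 2.0 * half_widths * delta                    # slab areas w_j

print(weights.sum())                           # ~1.0: the slabs tile f
print((weights * half_widths**2 / 3.0).sum())  # ~1.0: Var of U(-x_j, x_j) is
                                               # x_j^2 / 3, so the mixture
                                               # variance matches Var N(0,1)
```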

2.4. A Theorem on Elliptical Unimodal Densities

[Inequality on Elliptical Unimodal Densities] Let $f$ be an elliptical unimodal density with finite second moments. Then, for any set of mixture components $\{g_1, \ldots, g_n\}$,

$$\sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;\ge\; \sqrt{\det \Sigma_f}.$$

Identity is possible only when $f$ is uniform in one dimension.

Proof.

Our task is to prove that for all mixture components $\{g_1, \ldots, g_n\}$ satisfying

$$f(x) \;=\; \sum_{i=1}^{n} \pi_i\, g_i(x), \tag{5}$$

where $\pi_i > 0$ and $\sum_{i=1}^{n} \pi_i = 1$, we must have

$$\sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;\ge\; \sqrt{\det \Sigma_f}. \tag{Claim 1}$$

Using Theorem 2.3, we can approximate $f$ to an arbitrary level of accuracy by rewriting $f$ as a finite mixture of elliptical uniform densities, each having uniform "thickness", as

$$f(x) \;=\; \sum_{j} w_j\, u_j(x). \tag{6}$$

The "thickness" of each elliptical uniform component is equal to $\delta$. Here, the $u_j$'s, as described in Eq (4), are elliptical uniform densities sharing the same mean and whose covariance matrices are multiples of one another. Each constant of proportionality, denoted by $w_j$, is proportional to the hypervolume of the corresponding elliptical uniform density $u_j$.

To provide a link between Eqs (5) and (6), we further rewrite $f$ as

$$f(x) \;=\; \sum_{i=1}^{n} \sum_{j} w_{ij}\, u_{ij}(x).$$

For each pair $(i, j)$ above, $w_{ij}\, u_{ij}$ is the "intersection" of the segments $\pi_i g_i$ and $w_j u_j$ with respect to the slicing of $f$. For all values of $j$, $w_j$ and $u_j$ can be expressed in terms of the pairs $(w_{ij}, u_{ij})$ as

$$w_j\, u_j(x) \;=\; \sum_{i=1}^{n} w_{ij}\, u_{ij}(x). \tag{7}$$

Here, depending on the mixture components $g_i$, it is possible for some of the $w_{ij}$'s to be $0$, as long as for all values of $j$ we have

$$\sum_{i=1}^{n} w_{ij} \;=\; w_j.$$

If $w_{ij} \neq 0$ for a pair $(i, j)$, then $u_{ij}$ is a density. From Eq (7), we can rewrite each elliptical uniform $u_j$ as

$$u_j(x) \;=\; \frac{1}{w_j} \sum_{i=1}^{n} w_{ij}\, u_{ij}(x).$$

Following the argument presented in Theorem 2.2, we have

$$\sum_{i=1}^{n} \frac{w_{ij}}{w_j} \sqrt{\det \Sigma_{u_{ij}}} \;\ge\; \sqrt{\det \Sigma_{u_j}},$$

with equality holding if and only if each $u_{ij}$ is elliptical uniform having the same "thickness" $\delta$. Similarly, rewriting each mixture component $g_i$ in terms of the $u_{ij}$'s, we obtain

$$g_i(x) \;=\; \frac{1}{\pi_i} \sum_{j} w_{ij}\, u_{ij}(x).$$

Next, we create new spherical unimodal densities $h_i$ corresponding to each $g_i$ to facilitate the lower bounding of $\sqrt{\det \Sigma_{g_i}}$. Define $h_i$ as follows:

$$h_i(x) \;=\; \frac{1}{\pi_i} \sum_{j} w_{ij}\, s_{ij}(x).$$

In the above, the $s_{ij}$'s are spherical uniforms whose means coincide and such that

$$\sqrt{\det \Sigma_{s_{ij}}} \;=\; \sqrt{\det \Sigma_{u_{ij}}}$$

for all $(i, j)$, hence yielding

$$\Sigma_{s_{ij}} \;=\; \left(\det \Sigma_{u_{ij}}\right)^{1/d} I_d.$$

Computing the determinant of the covariance matrix of $g_i$ and comparing it with that of $h_i$, we obtain a chain of inequalities which shows that $\det \Sigma_{g_i} \ge \det \Sigma_{h_i}$.

The first inequality holds as a result of

$$\det(A + B) \;\ge\; \det(A), \tag{8}$$

where $A$ and $B$ are both non-negative definite symmetric matrices. The second inequality holds because

$$\left(\det(A + B)\right)^{1/d} \;\ge\; \left(\det A\right)^{1/d} + \left(\det B\right)^{1/d}, \tag{9}$$

with identity holding if and only if $A$ and $B$ are proportional. The proofs of both Eqs (8) and (9) can be found in Cover and Thomas (1988). The third inequality holds as a direct result of Lemma 2.2, which bounds the pseudo-volume of each slice component from below by that of the corresponding elliptical uniform. The equality that follows the third inequality is again a result of Eq (9), as all the $\Sigma_{s_{ij}}$'s are proportional to the identity matrix. We have just shown that

$$\sqrt{\det \Sigma_{g_i}} \;\ge\; \sqrt{\det \Sigma_{h_i}}$$

for all $i$, i.e. the pseudo-volume of each $g_i$ is minimized when $g_i$ is spherical unimodal. Therefore, a sufficient condition for (Claim 1) is

$$\sum_{i=1}^{n} \sqrt{\det \Sigma_{h_i}} \;\ge\; \sqrt{\det \Sigma_f}. \tag{Claim 2}$$

Since $f$ is elliptical unimodal, it is possible to find a corresponding spherical unimodal density $h$ such that the hypervolumes are preserved, i.e. $\sqrt{\det \Sigma_h} = \sqrt{\det \Sigma_f}$. To prove (Claim 2), we only have to deal with the pseudo-volumes of spherical unimodal densities. We obtain $\sqrt{\det \Sigma_{h_i}}$ as follows:

$$\sqrt{\det \Sigma_{h_i}} \;=\; \left(\frac{1}{\pi_i} \sum_j w_{ij}\, \frac{r_{ij}^2}{d+2}\right)^{d/2},$$

where $r_{ij}$ denotes the radius of the spherical uniform $s_{ij}$. Here, we make use of the fact that the covariance of a $d$-dimensional spherical uniform density defined by

$$s(x) \;=\; c\, \mathbf{1}\{\|x - \mu\| \le r\}$$

is given as

$$\Sigma_s \;=\; \frac{r^2}{d+2}\, I_d,$$

where $I_d$ denotes the identity matrix in $d$-dimensional space; refer to Eq (18). Similarly,

$$\sqrt{\det \Sigma_h} \;=\; \left(\sum_j w_j\, \frac{r_j^2}{d+2}\right)^{d/2},$$

where $r_j$ is the radius of the spherical uniform corresponding to $u_j$.

Hence, proving (Claim 2) is equivalent to proving the corresponding inequality between the weighted sums of squared radii,

(Claim 3)

where $\sum_{i=1}^{n} w_{ij} = w_j$ for all $j$. To prove (Claim 3), we just have to invoke Lemma 2.4, given below, a total of $(n-1)$ times, adding up the summands on the RHS two at a time and maintaining the "$\ge$" sign. We are now left with the proof of Lemma 2.4 to complete the proof of Theorem 2.4. ∎

Let $\{a_j\}$ and $\{b_j\}$ be sequences of non-negative real numbers such that $a_j + b_j > 0$ for all $j$, and such that $\sum_j a_j$ and $\sum_j b_j$ are finite. Then the following inequality holds for any positive integers $p$ and $q$.

Equality holds if and only if the sequences $\{a_j\}$ and $\{b_j\}$ are linearly dependent.

Proof.

The proof is similar to that of Chia and Nakano (2009), with the only difference being in the dimension $d$. We proceed in the spirit of Hardy et al (1988), as well as Pólya and Szegő (1972). Set $A = \sum_j a_j$ and $B = \sum_j b_j$, and similarly for the corresponding sums of powers. Let $t \ge 0$, so that $a_j + t\, b_j \ge 0$ for all $j$. Furthermore, define the function $F$ as follows:

and set

It suffices to prove that $F''(t) \ge 0$ for $t \ge 0$. This is an immediate consequence of Jensen's inequality, as shown below.

Setting  , we have

Denoting by , this becomes

However, from the definition of , we must have

Therefore implies as required. Equality holds if and only if .

We shall begin from the definition of as follows:

Differentiating $F$ twice and rearranging, we have

The term is expressible as a square and is therefore greater than or equal to $0$. To evaluate the remaining term, we apply suitable substitutions, yielding

via Cauchy-Schwarz’s inequality. Therefore we must have

due to the non-negativity of the quantities involved. Hence Lemma 2.4, and consequently Theorem 2.4, is proved. ∎

As a result of Theorem 2.4, all elliptical unimodal densities with finite second moments are M-undecomposable. Conversely, any density which is M-decomposable cannot be elliptical unimodal. One can do better than that: in the next subsection, we further show that if a density is M-decomposable, then there exists a representation via a mixture of Gaussian densities which improves the estimation of the density.

2.5. Estimation of M-Decomposable Densities

[Inequality on M-Decomposable Densities] Let $f, g_1, \ldots, g_n$ be probability density functions defined on $\mathbb{R}^d$. Let $\{g_1, \ldots, g_n\}$ be a set of mixture components of $f$ such that

$$f(x) \;=\; \sum_{i=1}^{n} \pi_i\, g_i(x) \qquad \text{and} \qquad \sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;\le\; \sqrt{\det \Sigma_f},$$

where $\pi_i > 0$ and $\sum_{i=1}^{n} \pi_i = 1$. Then the following result applies:

$$KL\!\left(f \;\Big\|\; \sum_{i=1}^{n} \pi_i\, \Phi_{g_i}\right) \;\le\; KL\!\left(f \,\big\|\, \Phi_f\right).$$

Here, $KL(f_1 \| f_2)$ denotes the Kullback-Leibler divergence between densities $f_1$ and $f_2$, given as

$$KL(f_1 \,\|\, f_2) \;=\; \int f_1(x)\, \log \frac{f_1(x)}{f_2(x)}\, dx.$$

Furthermore, $\Phi_f$ denotes the Gaussian density whose mean and covariance matrix coincide with those of $f$, and the $\Phi_{g_i}$'s are similarly defined.

Proof.

We only need to prove that

$$\sum_{i=1}^{n} \pi_i\, KL(g_i \,\|\, \Phi_{g_i}) \;\le\; KL(f \,\|\, \Phi_f), \tag{Claim A}$$

since the joint convexity of the Kullback-Leibler divergence gives $KL(f \,\|\, \sum_i \pi_i \Phi_{g_i}) \le \sum_i \pi_i\, KL(g_i \,\|\, \Phi_{g_i})$.

Now, RHS of (Claim A)

$$= \int f \log f\, dx \;-\; \int f \log \Phi_f\, dx.$$

From definitions, the probability density function of $\Phi_f$ is given by

$$\Phi_f(x) \;=\; \frac{1}{(2\pi)^{d/2} \left(\det \Sigma_f\right)^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x - \mu_f)^{\mathsf T}\, \Sigma_f^{-1}\, (x - \mu_f)\right),$$

where $\mu_f$ and $\Sigma_f$ denote the mean and covariance matrix of $f$. We obtain

$$-\int f(x) \log \Phi_f(x)\, dx \;=\; \tfrac{1}{2} \log\!\left((2\pi)^d \det \Sigma_f\right) + \tfrac{d}{2}.$$

Hence, RHS of (Claim A)

$$= -h(f) + \tfrac{1}{2} \log\!\left((2\pi e)^d \det \Sigma_f\right),$$

where $h(\cdot)$ denotes differential entropy. Meanwhile,

$$\text{LHS of (Claim A)} \;=\; -\sum_{i=1}^{n} \pi_i\, h(g_i) + \tfrac{1}{2} \sum_{i=1}^{n} \pi_i \log\!\left((2\pi e)^d \det \Sigma_{g_i}\right) \;\le\; -h(f) + \tfrac{1}{2} \sum_{i=1}^{n} \pi_i \log\!\left((2\pi e)^d \det \Sigma_{g_i}\right),$$

using the concavity of differential entropy, $h(f) \ge \sum_i \pi_i\, h(g_i)$. To complete the proof of Theorem 2.5, it suffices to demonstrate that

$$\sum_{i=1}^{n} \pi_i \log \sqrt{\det \Sigma_{g_i}} \;\le\; \log \sqrt{\det \Sigma_f}. \tag{Claim B}$$

Using Jensen's inequality, we have

$$\sum_{i=1}^{n} \pi_i \log \sqrt{\det \Sigma_{g_i}} \;\le\; \log \sum_{i=1}^{n} \pi_i \sqrt{\det \Sigma_{g_i}} \;\le\; \log \sum_{i=1}^{n} \sqrt{\det \Sigma_{g_i}} \;\le\; \log \sqrt{\det \Sigma_f},$$

which completes the proof of Theorem 2.5. ∎

We summarize the result of Theorem 2.5 as follows. Let $f$ be any density in $d$-dimensional space. If $f$ is M-decomposable, then by definition one can find a set of mixture components $\{g_1, \ldots, g_n\}$ of $f$ such that the sum of the pseudo-volumes of the mixture components is less than the pseudo-volume of the original density $f$. From Theorem 2.4, $f$ cannot belong to the class of elliptical unimodal densities. One can do better than that: Theorem 2.5 shows that $f$ is better estimated via a weighted Gaussian mixture than via a single Gaussian density. The Gaussian components are created via moment matching of the mixture components of $f$. The improved goodness of fit of the resulting weighted Gaussian mixture estimate is guaranteed in the Kullback-Leibler sense. It should be noted that the analytical form of the original density does not need to be known. In the next section, we demonstrate the use of Theorems 2.4 and 2.5 in statistical applications, namely cluster analysis and kernel density estimation.
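The following Monte Carlo sketch (ours; the split point, weights and sample sizes are arbitrary illustrative choices) shows Theorem 2.5 in action on a bimodal density: the moment-matched Gaussian mixture attains a far smaller Kullback-Leibler divergence than the single moment-matched Gaussian.

```python
# KL(f || estimate) for (a) one moment-matched Gaussian and (b) a
# moment-matched two-component mixture, approximated by Monte Carlo over a
# sample from f.  The half-line split is a crude stand-in for an
# M-decomposition of the sample.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, w = 20_000, 0.5
x = np.where(rng.random(n) < w, rng.normal(-3, 1, n), rng.normal(3, 1, n))

def f_pdf(t):                                 # the true bimodal density
    return w * norm.pdf(t, -3, 1) + (1 - w) * norm.pdf(t, 3, 1)

single = norm(x.mean(), x.std())              # (a) moment-matched Gaussian

left, right = x[x < 0], x[x >= 0]             # (b) split, then moment-match
p = len(left) / n
def mixture_pdf(t):
    return p * norm.pdf(t, left.mean(), left.std()) \
        + (1 - p) * norm.pdf(t, right.mean(), right.std())

kl_single = np.mean(np.log(f_pdf(x) / single.pdf(x)))
kl_mixture = np.mean(np.log(f_pdf(x) / mixture_pdf(x)))
print(kl_single, kl_mixture)                  # the mixture's KL is far smaller
```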

3. Applications Using M-Decomposability

3.1. Clustering via M-Decomposability: The Power of Two

Figure 1. Original data drawn from a multimodal density: a mixture of five logistic densities.

Figure 2. Original data split into two mixture components, represented by two different symbols. The sum of pseudo-volumes of the mixture components is less than that of the original.

One straightforward application of M-decomposability is cluster analysis. Many existing clustering algorithms divide the dataset into clusters based on the following heuristic: the within-cluster variances are minimized while, at the same time, the between-cluster variance is maximized. Another variation of this heuristic is to determine cluster allocations such that a function of the volumes of the clusters is minimized. In particular, Shioda and Tunçel (2005) proposed dividing the dataset into clusters such that the total sum of the minimum volume ellipsoids (MVE) of the clusters is globally minimized. While the details of each algorithm may differ, the underlying ideas are conceptually similar. Theorem 2.5 provides theoretical justification for minimizing the sum of pseudo-volumes, and therefore supports all similar approaches of existing algorithms.

Intuitively, the rigorous way to implement cluster analysis via Theorem 2.5 is to divide the dataset into clusters such that the sum of the pseudo-volumes of all clusters is globally minimized. This approach is computationally infeasible for datasets of any reasonable size. To this end, we propose the following alternative approach, which captures the essence of Theorem 2.5 as far as possible. We devise a split-merge clustering strategy that splits and merges two clusters at a time, lowering the overall computational load. We show that with our approach the algorithm is able to overcome local minima. Consequently, it is possible to perform cluster analysis well even when several clusters are present.

Figure 3. Mixture component denoted by (+) in Fig 2 split into two further mixture components.

Figure 4. Mixture component denoted by (o) in Fig 2 split into two further mixture components.

Figure 5. Mixture component denoted by (+) in Fig 3 split into two further mixture components.

Figure 6. Mixture component denoted by (o) in Fig 4 split into two further mixture components.

From the given sample $X$, we are interested to know whether the original sample is M-decomposable. We check whether $X$ can be partitioned into two clusters such that the sum of the pseudo-volumes of the clusters is less than that of $X$. We denote by $\{X_1, X_2\}$ a partition of $X$ such that

$$X = X_1 \cup X_2, \qquad X_1 \cap X_2 = \emptyset,$$

with $\{X_1, X_2\}$ being a rearrangement of $X$. We further denote the sample covariance matrices of $X$, $X_1$ and $X_2$ as $\hat\Sigma$, $\hat\Sigma_1$ and $\hat\Sigma_2$. Our task is to find the optimal partition $\{X_1, X_2\}$ such that

$$\sqrt{\det \hat\Sigma_1} + \sqrt{\det \hat\Sigma_2}$$

is globally minimized, and to test this value against $\sqrt{\det \hat\Sigma}$. If

$$\frac{\sqrt{\det \hat\Sigma_1} + \sqrt{\det \hat\Sigma_2}}{\sqrt{\det \hat\Sigma}} \;<\; 1 + \epsilon, \tag{10}$$

where $\epsilon$ is a threshold value close to zero, then we can conclude that $X$ is likely to be M-decomposable. However, if Eq (10) is not satisfied, then $X$ is likely to be M-undecomposable. To robustify the "splitting process" against local minima traps, it is possible to set the RHS of Eq (10) to be greater than $1$. Furthermore, taking into consideration errors due to the finiteness of sample sizes and imperfections of splitting algorithms, and also accounting for limiting the number of mixture components to two, we recommend that the $\epsilon$ on the RHS of Eq (10) be a small positive value.
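In code, the splitting test reads as follows (our sketch; the default slack eps_s is a placeholder, not a value recommended by the paper):

```python
# The splitting test of Eq (10) on two candidate sub-samples X1, X2 of X.
import numpy as np

def pseudo_volume(X):
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    return float(np.sqrt(np.linalg.det(cov)))

def should_split(X, X1, X2, eps_s=0.05):
    ratio = (pseudo_volume(X1) + pseudo_volume(X2)) / pseudo_volume(X)
    return ratio < 1.0 + eps_s         # Eq (10) with RHS slightly above 1
```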

When one concludes that a particular cluster is probably M-undecomposable, it is possible to stop at one cluster. However, if $X$ is found to be M-decomposable into clusters $X_1$ and $X_2$, one may repeat the splitting process for $X_1$ and $X_2$. The process is then reiterated until all clusters are probably M-undecomposable. When that happens, the splitting process ends.

Our strategy also includes "merging" of clusters. At the point when all split clusters are probably M-undecomposable, we select two clusters at a time and perform the following test. Let $Y_1$ and $Y_2$ denote the two chosen clusters and let $Y$ be the union of the two clusters, i.e. $Y = Y_1 \cup Y_2$. We then check the sum of the pseudo-volumes of $Y_1$ and $Y_2$ and compare it against that of $Y$. If

$$\frac{\sqrt{\det \hat\Sigma_{Y_1}} + \sqrt{\det \hat\Sigma_{Y_2}}}{\sqrt{\det \hat\Sigma_{Y}}} \;>\; 1 - \epsilon', \tag{11}$$

we conclude that $Y_1$ and $Y_2$ should be merged to form the larger cluster $Y$. This process is repeated until there are no more mergeable clusters left. To prevent overclustering, we recommend that $\epsilon'$ be a small positive value.
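The corresponding merging test is symmetric (again our sketch, with a placeholder default for eps_m):

```python
# The merging test of Eq (11): clusters whose union is not appreciably
# decomposed by keeping them apart are merged.
import numpy as np

def pseudo_volume(X):
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    return float(np.sqrt(np.linalg.det(cov)))

def should_merge(Y1, Y2, eps_m=0.05):
    Y = np.vstack([Y1, Y2])            # union of the two candidate clusters
    ratio = (pseudo_volume(Y1) + pseudo_volume(Y2)) / pseudo_volume(Y)
    return ratio > 1.0 - eps_m         # splitting Y buys almost nothing: merge
```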

We have described a possible algorithm using M-decomposability to perform cluster analysis. The crucial point is to find a partition $\{X_1, X_2\}$ such that $\sqrt{\det \hat\Sigma_1} + \sqrt{\det \hat\Sigma_2}$ is minimized as far as possible. There are many possible approaches to this task. Finding the global minimum of the sum is computationally infeasible and may be NP-hard. Here, we propose a computationally simpler approach: at each splitting step, we simply fit a two-component Gaussian mixture to the original cluster $X$, and then run the EM algorithm to convergence to obtain the partition $\{X_1, X_2\}$, as sketched below. However, we emphasize that the EM algorithm itself is not critical, and it is possible to use other approaches to obtain a reasonable partition of $X$ at the splitting step. The main point here is the concept of clustering via M-decomposability.
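A minimal version of this splitting step, assuming scikit-learn's EM implementation is available (the paper does not prescribe any particular library), might read:

```python
# One concrete splitting step: fit a two-component Gaussian mixture by EM
# and partition the cluster by the fitted responsibilities.
import numpy as np
from sklearn.mixture import GaussianMixture

def em_split(X, seed=0):
    """Partition X (an (n_samples, d) array) into two candidate sub-clusters."""
    labels = GaussianMixture(n_components=2, random_state=seed).fit_predict(X)
    return X[labels == 0], X[labels == 1]
```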