Density-based Clustering with Best-scored Random Forest

06/24/2019 ∙ by Hanyuan Hang, et al.

The single-level density-based approach has long been acknowledged as a conceptually and mathematically convincing clustering method. In this paper, we propose an algorithm called "best-scored clustering forest" that can obtain the optimal level and determine the corresponding clusters. The terminology "best-scored" refers to selecting the random tree with the best empirical performance out of a certain number of purely random tree candidates. From the theoretical perspective, we first show that consistency of our proposed algorithm can be guaranteed. Moreover, under certain mild restrictions on the underlying density functions and target clusters, even fast convergence rates can be achieved. Last but not least, comparisons with other state-of-the-art clustering methods in numerical experiments demonstrate the accuracy of our algorithm on both synthetic data and several benchmark real data sets.


1 Introduction

Regarded as one of the most basic tools to investigate statistical properties of unsupervised data, clustering aims to group a set of objects in such a way that objects in the same cluster are more similar in some sense to each other than to those in other clusters. Typical applications range from the categorization of tissues in medical imaging to the grouping of internet search results. For instance, on PET scans, cluster analysis can distinguish between different types of tissue in a three-dimensional image for many different purposes (Filipovych et al., 2011), while in the intelligent grouping of files and websites, clustering algorithms create more relevant sets of search results (Marco and Navigli, 2013). These wide applications call for clustering algorithms that not only maintain desirable prediction accuracy but also enjoy high computational efficiency. In the literature, a wealth of algorithms have already been proposed, such as k-means (Macqueen, 1967), linkage methods (Ward, 1963; Sibson, 1973; Defays, 1977), the cluster tree (Stuetzle, 2003), DBSCAN (Ester et al., 1996), spectral clustering (Donath and Hoffman, 1973; Luxburg, 2007), and expectation-maximization for generative models (Dempster et al., 1977).

As is widely acknowledged, an open problem in cluster analysis is how to formulate a conceptually and mathematically convincing definition of clusters. In the literature, great efforts have been made to deal with this problem. Perhaps the first definition dates back to Hartigan (1975) and is known as single-level density-based clustering: assuming i.i.d. data generated by some unknown distribution with a continuous density, the clusters are defined to be the connected components of the upper level set of the density at some fixed level. Since then, various methods based on density estimators and the connected components of their level sets have been established (Cuevas and Fraiman, 1997; Maier et al., 2012; Rigollet, 2006; Rinaldo and Wasserman, 2010).

Note that the single-level approach mentioned above has a conceptual drawback: different choices of the level may lead to different (numbers of) clusters, and there is no general rule for choosing the level. In order to address this shortcoming, another type of clustering algorithm, namely hierarchical clustering, was proposed, in which the hierarchical tree structure of the connected components across different levels is estimated. Within this framework, instead of fixing a single level, the so-called cluster tree approach considers all levels and the corresponding connected components simultaneously. The advantage of the cluster tree approach lies in the fact that it focuses on the identification of the hierarchical tree structure of the connected components across levels. For this reason, there have already been many attempts in the literature to establish its theoretical foundations. For example, Hartigan (1981) proved the consistency of a hierarchical clustering method named single linkage only for the one-dimensional case; in higher dimensions the problem becomes more delicate, as single linkage is only fractionally consistent. To address this problem, Chaudhuri and Dasgupta (2010) proposed a modified single linkage algorithm which is shown to have finite-sample convergence rates as well as lower bounds on the sample complexity under certain assumptions on the underlying density. Furthermore, Kpotufe (2011) obtained similar theoretical results with an underlying k-NN density estimator and achieved experimental improvements by means of a simple pruning strategy that removes connected components arising artificially from finite-sample variability. However, the notion of recovery taken from Hartigan (1981) falls short in that it focuses only on the correct estimation of the cluster tree structure and not on the estimation of the clusters themselves; for more details we refer to Rinaldo and Wasserman (2010).

So far, the theoretical guarantees for hierarchical clustering algorithms, such as consistency and learning rates, are only valid for the cluster tree structure and are therefore far from satisfactory. As a result, in this paper we proceed with the study of single-level density-based clustering. Recently, various results for estimating the optimal level have been established in the literature. First of all, Steinwart (2011) and Steinwart (2015a) presented algorithms based on histogram density estimators that are able to asymptotically determine the optimal level and automatically yield a consistent estimator of the target clusters. However, these algorithms are of limited practical value since only the simplest possible density estimators are considered. Attempting to address this issue, Sriperumbudur and Steinwart (2012) proposed a modification of the popular DBSCAN clustering algorithm. Although consistency and optimal learning rates have been established for this DBSCAN-type construction, its main drawback is that it restricts consideration to moving-window density estimators for Hölder continuous densities. In addition, it is worth noticing that none of the algorithms mentioned above adapt well to the case where the underlying distribution possesses no split in the cluster tree. To tackle this problem, Steinwart et al. (2017) proposed an adaptive algorithm using kernel density estimators which, however, also performs well only for low-dimensional data.

In this paper, we mainly focus on clusters that are defined as the connected components of high-density regions and present an algorithm called best-scored clustering forest which not only guarantees consistency and attains fast convergence rates, but also enjoys satisfactory performance in various numerical experiments. The main contributions of this paper are twofold: (i) Concerning the theoretical analysis, we prove that with the help of the best-scored random forest density estimator, our proposed algorithm ensures consistency and achieves fast convergence rates under certain assumptions on the underlying density functions and target clusters. We mention that the convergence analysis is conducted within the framework established in Steinwart (2015a). To be more precise, under properly chosen hyperparameters of the best-scored random forest density estimator of Hang and Wen (2018), the consistency of the best-scored clustering forest can be ensured. Moreover, under some additional regularization conditions, even fast convergence rates can be achieved. (ii) When it comes to numerical experiments, we improve the original purely random splitting criterion by proposing an adaptive splitting method: at each step, we randomly select a sample point from the training data set, and the to-be-split node is the one into which this point falls. The idea behind this procedure is that when sample points are picked at random from the whole training data set, nodes containing more samples are more likely to be chosen, whereas nodes containing fewer samples are less likely to be selected. In this way, the probability of obtaining cells with evenly distributed sample sizes is much greater. Empirical experiments further show that the adaptive/recursive method enhances the efficiency of our algorithm since it actually increases the effective number of splits.
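
To make this node-selection rule concrete, the following minimal Python sketch (our illustration, not the authors' implementation) contrasts the purely random choice of the to-be-split node with the adaptive, sample-driven choice: a training point is drawn uniformly at random and the leaf containing it is the one that gets split, so heavily populated leaves are split more often.

```python
import numpy as np

def choose_node_purely_random(leaves, rng):
    # Original purely random criterion: every current leaf is equally likely.
    return rng.integers(len(leaves))

def choose_node_adaptive(leaves, X, rng):
    # Adaptive criterion: draw one training point uniformly at random and
    # return the index of the leaf it falls into, so leaves holding more
    # points are chosen with probability proportional to their sample count.
    x = X[rng.integers(len(X))]
    for j, (lower, upper) in enumerate(leaves):
        if np.all(x >= lower) and np.all(x <= upper):
            return j
    return 0  # fallback if x lies outside all leaves (should not happen)

# Toy usage: two axis-parallel boxes covering the unit square.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
leaves = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),
          (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
print(choose_node_purely_random(leaves, rng), choose_node_adaptive(leaves, X, rng))
```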

The rest of this paper is organized as follows: Section 2 introduces fundamental notations and definitions related to density level sets and the best-scored random forest density estimator. Section 3 is devoted to the exposition of the generic clustering algorithm architecture. We provide our main theoretical results on the consistency and learning rates of the proposed best-scored clustering forest in Section 4, where the main analysis aims to verify that our best-scored random forest provides a level set estimator with control over both its vertical and horizontal uncertainty. Some comments and discussions on the established theoretical results are also presented in that section. Numerical experiments comparing the best-scored clustering forest with other density-based clustering methods are given in Section 5. All proofs of Sections 3 and 4 can be found in Section 6. We conclude the paper with a brief discussion in the last section.

2 Preliminaries

In this section, we recall several basic concepts and notations related to clusters in the first subsection while in the second subsection we briefly recall the best-scored random forest density estimation proposed recently by Hang and Wen (2018).

2.1 Density Level Sets and Clusters

This subsection begins by introducing some basic notations and assumptions about density level sets and clusters. Throughout this paper, let be a compact and connected subset, be the Lebesgue measure with . Moreover, let be a probability measure that is absolutely continuous with respect to and possesses a bounded density with support . We denote the centered hypercube of with side length by , where , and the complement of is written as .

Given a set , we denote by its interior, its closure, its boundary, and its diameter. Furthermore, for a given , denotes the distance between and . Given another set , we denote by the symmetric difference between and . Moreover, stands for the indicator function of the set .

We say that a function f is α-Hölder continuous, for some α > 0, if there exists a constant c > 0 such that |f(x) − f(x′)| ≤ c‖x − x′‖^α for all x, x′ in its domain. Note that f is necessarily constant whenever α > 1.

Finally, throughout this paper, we use the notation a_n ≲ b_n to denote that there exists a positive constant c such that a_n ≤ c b_n for all n.

2.1.1 Density Level Sets

In order to find a notion of density level set which is topologically invariant against different choices of the density of the distribution , Steinwart (2011) proposes to define a density level set at level by

where stands for the support of , and the measure is defined by

where denotes the Borel -algebra of . According to this definition, the density level set is closed. If the density is assumed to be Hölder continuous, the above construction can be replaced by the usual level set without changing our results.

Figure 1: Topologically relevant changes on set of measure zero. Left: The thick solid lines indicate a set consisting of two connected components and . The density of is with being a suitable constant, then and are the two connected components of for all . Right: This is a similar situation. The straight horizontal thin line indicates a line of measure zero connecting the two components, and the dashed lines indicate cuts of measure zero. In this case, the density of is , then , , , and are the four connected components of for all .

Here, some important properties of the sets , are useful:

  1. Level Sets.

  2. Monotonicity. for all .

  3. Regularity. .

  4. Normality. , where and .

  5. Open Level Sets. .

2.1.2 Comparison of Partitions and Notions of Connectivity

Before introducing the definition of clusters, some notions related to the connected components of level sets are needed. First, we give a definition that compares different partitions.

Definition 2.1.

Let be nonempty sets with , and and be partitions of and , respectively. Then is said to be comparable to , if for all , there exists a such that . In this case, we write .

It can easily be deduced that is comparable to if no cell of is broken into pieces in . Let and be two partitions of ; then we say that is finer than if and only if . Moreover, as demonstrated in Steinwart (2015b), for two partitions and with , there exists a unique map such that for . We call the cell relating map (CRM) between and .

Now, we give further insight into two vital examples of comparable partitions coming from connected components. Recall that a set is topologically connected if, for every pair of relatively closed disjoint subsets of with , we have or . The maximal connected subsets of are called the connected components of . As is widely acknowledged, these components form a partition of , which we denote by . Furthermore, for a closed with , we have .

The next example describes another type of connectivity, namely τ-connectivity, which can be considered as a discrete version of path-connectivity. For the latter, let us fix a τ > 0 and a set A. Then two points x, x′ ∈ A are called τ-connected in A if there exist points x₁, …, xₙ ∈ A such that x₁ = x, xₙ = x′, and ‖xᵢ₊₁ − xᵢ‖ ≤ τ for all i = 1, …, n − 1. Clearly, being τ-connected defines an equivalence relation on A. The resulting partition can be written as , and we call its cells the τ-connected components of A. It can be verified that, for all and , we always have , see Lemma A.2.7 in Steinwart (2015b). In addition, if , then we have for all sufficiently small , see Section 2.2 in Steinwart (2015a).
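
Since τ-connectivity on a finite point set is ordinary graph connectivity with an edge between any two points at distance at most τ, its components can be computed by a breadth-first search. The following Python sketch is a hypothetical helper for illustration, not code from the paper.

```python
import numpy as np
from collections import deque

def tau_connected_components(points, tau):
    # Partition a finite point set into its tau-connected components: two
    # points lie in the same component iff they are linked by a chain of
    # points whose consecutive distances are at most tau.
    points = np.asarray(points, dtype=float)
    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        seed = unvisited.pop()
        component, queue = [seed], deque([seed])
        while queue:
            i = queue.popleft()
            close = [j for j in unvisited
                     if np.linalg.norm(points[j] - points[i]) <= tau]
            for j in close:
                unvisited.remove(j)
                component.append(j)
                queue.append(j)
        components.append(sorted(component))
    return components

# Toy usage: two well-separated groups on the real line.
pts = [[0.0], [0.1], [0.2], [5.0], [5.1]]
print(tau_connected_components(pts, tau=0.5))  # two components: [0, 1, 2] and [3, 4]
```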

2.1.3 Clusters

Based on the concept established in the preceding subsections we now recall the definition of clusters, see also Definition 2.5 in Steinwart (2015b).

Definition 2.2 (Clusters).

Let be a compact and connected set, and be a -absolutely continuous distribution. Then can be clustered between and , if is normal and for all , the following three conditions are satisfied:

  1. We have either or ;

  2. If we have , then ;

  3. If we have , then and .

Using the CRMs ; we then define the clusters of by

where and are the two topologically connected components of . Finally, we define

(2.1)
Figure 2: Definition of clusters. Left: A one-dimensional mixture of three Gaussians with the optimal level and a possible choice of . It is easily observed that the open intervals and are the two clusters of the distribution. We only have one connected component for the level , and the levels and are not considered in the above definition. Right: Here we have a similar situation for a mixture of three two-dimensional Gaussians drawn by contour lines. The thick solid lines indicate the levels and , while the thin solid lines show a level in . The dashed lines correspond to a level and a level . In this case, the clusters are the two connected components enclosed by the outer thick solid line.

To illustrate, the above definition ensures that the level sets below are connected, while there are exactly two components in the level sets for a certain range above . Note that any two level sets within this range are comparable. As a result, the topological structure between and can be determined by that of . In this manner, the connected components of can be numbered by the connected components of . This numbering procedure is clearly reflected in the definition of the clusters as well as in that of the function , which in essence measures the distance between the two connected components at level .

Since the quantification of the uncertainty of clusters is indispensable, we need to introduce, for , , the sets

(2.2)

In other words, can be regarded as adding a -tube to , while is treated as removing a -tube from . We wish to avoid cases where the density level sets have bridges or cusps that are too thin. To be more precise, recall that for a closed , the function is defined by

Particularly, for all , we have for all , and if , then . Consequently, according to Lemma A.4.3 in Steinwart (2015b), for all with and all , we have

whenever is contained in some compact and .

With the preceding preparations, we now come to the following definition, which excludes bridges and cusps that are too thin.

Definition 2.3.

Let be a compact and connected set, and be a -absolutely continuous distribution that is normal. Then we say that has thick level sets of order up to the level if there exist constants and such that, for all and , we have

In this case, we call the thickness function of .

Figure 3: Thick level sets. Left: The thick solid line presents a level set below or at the level and the thin solid lines indicate the two clusters and of . Because of the quadratic shape of around the thin bridge, the distribution has thickness of order . Right: In the same situation, the distribution has thick level sets of order . It is worth noting that a smaller leads to a significantly wider separation of and .

In order to describe the distribution we wish to cluster, we now make the following assumption based on all concepts introduced so far.

Assumption 2.1.

The distribution with bounded density can be clustered between and . Moreover, has thick level sets of order up to the level . The corresponding thickness function is denoted by and the function defined in (2.1) is abbreviated as .

In the case that all level sets are connected, we introduce the following assumption to investigate the behavior of the algorithm in situations in which cannot be clustered.

Assumption 2.2.

Let be a compact and connected set, and be a -absolutely continuous distribution that is normal. Assume that there exist constants , , and such that for all and , the following conditions hold:

  • .

  • If then .

  • If , then for all non-empty and all .

  • For each there exists a with .

2.2 Best-scored Random Forest Density Estimation

Since a density estimate is needed before the level sets can be analyzed, we dedicate this subsection to the methodology of building an appropriate density estimator. Different from the usual histogram density estimation (Steinwart, 2015a) and kernel density estimation (Steinwart et al., 2017), this paper adopts a novel random forest-based density estimation strategy, namely the best-scored random forest density estimation proposed recently by Hang and Wen (2018).

2.2.1 Purely Random Density Tree

Recall that each tree in the best-scored random forest is built on a purely random partition following the idea of Breiman (2000). To give a clear description of one possible construction procedure of this purely random partition, we introduce the random vector as in Hang and Wen (2018), which represents the building mechanism at the -th step. To be specific,

Figure 4: Possible construction procedures of three-split axis-parallel purely random partitions in a two-dimensional space. The first split divides the input domain, e.g.  into two cells and . Then, the to-be-split cell is chosen uniformly at random, say , and the partition becomes , , after the second random split. Finally, we once again choose one cell uniformly at random, say , and the third random split leads to a partition consisting of , , and .
  • denotes the to-be-split cell at the -th step chosen uniformly at random from all cells formed in the -th step;

  • stands for the dimension chosen to be split in the -th step, where each dimension has the same probability of being selected, that is, the chosen dimensions are i.i.d. multinomially distributed with equal probabilities;

  • is a proportional factor standing for the ratio between the length of the newly generated cell in the -th dimension after the -th split and the length of the being-cut cell in the -th dimension. We emphasize that these proportional factors are i.i.d. drawn from the uniform distribution.

In this manner, the above splitting procedure leads to a so-called partition variable, whose probability measure is denoted by , and any specific partition variable can be treated as a splitting criterion. Moreover, for notational clarity, we denote by the collection of non-overlapping cells formed after conducting splits on following . This can be further abbreviated as , which exactly represents a random partition of . Accordingly, we have , and for a given sample , the cell into which it falls is denoted by .
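
To make the splitting mechanism concrete, the following Python sketch (our simplification, assuming the input domain is an axis-parallel box such as the unit cube) performs a given number of purely random splits: at each step a cell is drawn uniformly at random, a coordinate is drawn uniformly at random, and the cell is cut at a uniformly drawn proportion of its side length. The adaptive criterion from the introduction would simply replace the uniform cell choice with the sample-driven one sketched there.

```python
import numpy as np

def purely_random_partition(lower, upper, n_splits, rng):
    # Perform n_splits purely random axis-parallel splits of the box
    # [lower, upper] in R^d; each cell is a (lower, upper) pair of arrays.
    cells = [(np.asarray(lower, dtype=float), np.asarray(upper, dtype=float))]
    for _ in range(n_splits):
        idx = rng.integers(len(cells))      # to-be-split cell, uniform over cells
        lo, up = cells.pop(idx)
        dim = rng.integers(len(lo))         # split dimension, uniform over coordinates
        prop = rng.uniform()                # proportional factor of the cut
        cut = lo[dim] + prop * (up[dim] - lo[dim])
        left_up, right_lo = up.copy(), lo.copy()
        left_up[dim], right_lo[dim] = cut, cut
        cells += [(lo, left_up), (right_lo, up)]
    return cells

# Toy usage: three splits of the unit square yield four cells.
rng = np.random.default_rng(1)
print(len(purely_random_partition([0, 0], [1, 1], n_splits=3, rng=rng)))  # 4
```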

In order to better characterize the purely random density tree, we give another expression of the random partition on , which is where represents one of the resulting cells of this partition. Based on this partition, we can build the random density tree with respect to probability measure on , denoted as , defined by

where unless otherwise stated, we assume that for all , the Lebesgue measure . In this regard, when taking , the density tree decision rule becomes

where . When taking to be the empirical measure , we obtain

and hence the density tree turns into

(2.3)
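
In code, the empirical density tree assigns to each cell the fraction of training points it contains divided by the cell's Lebesgue measure, which is the piecewise-constant estimator in (2.3). Below is a minimal sketch, reusing the purely_random_partition routine from the previous sketch and assuming every cell has positive volume.

```python
import numpy as np

def density_tree_estimate(x, cells, X):
    # Piecewise-constant density estimate: empirical mass of the cell
    # containing x divided by the cell's Lebesgue measure (its volume).
    x, X = np.asarray(x, dtype=float), np.asarray(X, dtype=float)
    for lo, up in cells:
        if np.all(x >= lo) and np.all(x <= up):
            mass = np.mean(np.all((X >= lo) & (X <= up), axis=1))
            volume = np.prod(up - lo)
            return mass / volume
    return 0.0  # x lies outside the partitioned domain

# Toy usage with the partition routine sketched above.
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 2))
cells = purely_random_partition([0, 0], [1, 1], n_splits=5, rng=rng)
print(density_tree_estimate([0.3, 0.7], cells, X))  # roughly 1 for uniform data
```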

2.2.2 Best-scored Random Density Trees and Forest

Considering the fact that the above partitions make no use of the sample information at all, the prediction results of their ensemble forest may not be accurate enough. In order to improve the prediction accuracy, we select one partition for tree construction out of a certain number of candidates, namely the one with the best density estimation performance according to a performance measure such as the average negative log-likelihood (ANLL) (Hang and Wen, 2018, Section 5.4). The resulting trees are then called best-scored random density trees.

Now, let , be the best-scored random density tree estimators generated by the splitting criteria , respectively, which are defined by

where is a random partition of . Then the best-scored random density forest can be formulated by

(2.4)

and its population version is denoted by .
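
The following schematic Python sketch assembles the pieces: each of the trees keeps the best of several purely random candidate partitions, scored here by average negative log-likelihood on a held-out validation split (our reading of the ANLL criterion; the exact selection protocol follows Hang and Wen (2018) and is not reproduced), and the forest density is the average of the selected trees. It reuses purely_random_partition and density_tree_estimate from the sketches above.

```python
import numpy as np

def anll(cells, X_train, X_val, eps=1e-12):
    # Average negative log-likelihood of the tree density on validation data.
    vals = [density_tree_estimate(x, cells, X_train) for x in X_val]
    return -np.mean(np.log(np.maximum(vals, eps)))

def best_scored_forest(X_train, X_val, lower, upper, n_trees,
                       n_candidates, n_splits, rng):
    # For each tree, keep the best of n_candidates purely random partitions
    # (lowest validation ANLL); the forest density averages the chosen trees.
    partitions = []
    for _ in range(n_trees):
        candidates = [purely_random_partition(lower, upper, n_splits, rng)
                      for _ in range(n_candidates)]
        partitions.append(min(candidates, key=lambda c: anll(c, X_train, X_val)))

    def forest_density(x):
        return np.mean([density_tree_estimate(x, cells, X_train)
                        for cells in partitions])
    return forest_density

# Toy usage on uniform data: the estimate should be close to 1 everywhere.
rng = np.random.default_rng(2)
X = rng.uniform(size=(600, 2))
forest = best_scored_forest(X[:400], X[400:], [0, 0], [1, 1],
                            n_trees=5, n_candidates=3, n_splits=8, rng=rng)
print(forest([0.5, 0.5]))
```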

3 A Generic Clustering Algorithm

In this section, we present a generic clustering algorithm in which the clusters are estimated with the help of a generic level set estimator that can later be specified by histogram, kernel, or random forest density estimators. To this end, let the optimal level and the resulting clusters , for the distribution be as in Definition 2.2, and let the constant be as in Assumption 2.2. The goal of this section is to investigate whether or can be estimated and whether , can be clustered.

Let us first recall some more notations introduced in Section 2. For a -absolutely continuous distribution , let the level , the level set , , and the function be as in Definition 2.2. Furthermore, for a fixed set , its -tubes and are defined by (2.2). Moreover, concerning the thick level sets, the constant and the function are introduced in Definition 2.3.

In what follows, let always be a decreasing family of sets such that

(3.1)

holds for all .

The following theorem relates the component structure of a family of level set estimators , which is a decreasing family of subsets of , to the component structure of certain sets ; for more details see, e.g., Steinwart (2015a).

Theorem 3.1.

Let Assumption 2.1 hold. Furthermore, for , let , , , and be as in (3.1). Then, for all and the corresponding CRMs , the following disjoint union holds:

From Theorem 3.1 we see that for suitable , , and , all -connected components of are either contained in , or vanish at level . Accordingly, carrying out these steps precisely, we obtain a generic clustering strategy shown in Algorithm 1.

Input: some , and a start level . A decreasing family of subsets of X.
repeat
       Identify the -connected components of satisfying
until ;
Identify the -connected components of satisfying
if  then
      return and the sets for .
else
      return and the set .
end if
Output: An estimator of or the corresponding clusters.
Algorithm 1 Estimate clusters with the help of a generic level set estimator
Figure 5: Illustration of Algorithm 1. Left: The density presented by solid line has two modes on the left and a flat part on the right. A plug-in approach based on a density estimator (thin solid line) with three modes is used to provide the level set estimator. The level set estimator satisfies (3.1). Only the left two components of do not vanish at . Therefore, the algorithm only finds one component. Right: We consider the same distribution at a higher level. In this case, both components of do not vanish at and thus the algorithm correctly identifies two connected components.
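
For orientation, here is a schematic Python sketch of Algorithm 1, evaluated on the data points and reusing tau_connected_components and the forest density from the earlier sketches. The level is raised from the start level in fixed steps until at least two sufficiently large τ-connected components appear, in which case the splitting level and the clusters are returned; otherwise the start level and a single cluster are returned. The step size, the minimum component size, and the stopping rule are simplified placeholders and do not reproduce the paper's precise conditions.

```python
import numpy as np

def estimate_clusters(X, density, rho_0, rho_max, step, tau, min_size):
    # Schematic version of Algorithm 1: raise the level until the data points
    # above the level split into at least two large tau-connected components.
    X = np.asarray(X, dtype=float)
    rho = rho_0
    while rho <= rho_max:
        idx = [i for i, x in enumerate(X) if density(x) >= rho]
        comps = tau_connected_components(X[idx], tau) if idx else []
        big = [[idx[j] for j in comp] for comp in comps if len(comp) >= min_size]
        if len(big) >= 2:
            return rho, big          # estimated splitting level and clusters
        rho += step
    # No split detected: return the start level and a single cluster.
    idx = [i for i, x in enumerate(X) if density(x) >= rho_0]
    return rho_0, [idx]

# Toy usage: two separated Gaussian blobs and the forest density from above.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.25, 0.05, size=(150, 2)),
               rng.normal(0.75, 0.05, size=(150, 2))])
dens = best_scored_forest(X[::2], X[1::2], [0, 0], [1, 1],
                          n_trees=5, n_candidates=3, n_splits=12, rng=rng)
level, clusters = estimate_clusters(X, dens, rho_0=0.5, rho_max=50.0,
                                    step=0.5, tau=0.1, min_size=20)
print(level, [len(c) for c in clusters])
```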

Under Assumptions 2.1 and 2.2, respectively, the following theorem bounds the returned level and components , , as well as the returned start level and the corresponding single cluster , which are the outputs of Algorithm 1.

Theorem 3.2.

(i) Let Assumption 2.1 hold. For , let , , , and satisfy (3.1) for all . Then, for any data set , the following statements hold for Algorithm 1:

  • The returned level satisfies both and

  • The returned sets , , can be ordered such that

    (3.2)

    Here, , , are ordered in the sense of .

(ii) Let Assumption 2.2 hold. Moreover, let , be fixed, , and satisfy (3.1) for all . If , then Algorithm 1 returns the start level and the corresponding single cluster such that

where .

The above analysis addresses the general case in which we assume that the underlying density has already been successfully estimated. Therefore, in the following, we delve into the component structure and other properties of the clustering algorithm when the density is estimated by the forest density estimator (2.4).

Note that one more piece of notation is necessary for a clear understanding. One way to define level set estimators with the help of the forest density estimator (2.4) is the simple plug-in approach, which is

However, these level set estimators are too complicated for computing the -connected components in Algorithm 1. Instead, we take level set estimators of the form

(3.3)

The following theorem shows that some kind of uncertainty control of the form (3.1) is valid for level set estimators of the form (3.3) induced by the forest density estimator (2.4).

Theorem 3.3.

Let be a -absolutely continuous distribution on and be the forest density estimator (2.4) with . For any , , that is, is one of the cells in the -th partition, there exists a constant such that . Then, for all and , there holds

(3.4)

Before we present the next theorem, recall that denotes half of the side length of the centered hypercube in and denotes the number of trees in the best-scored random forest.

Theorem 3.4.

Let be a -absolutely continuous distribution on . For , , , , we choose an satisfying

(3.5)

where is defined by

(3.6)

Furthermore, for and , we choose a with and assume that it satisfies and . Moreover, for each random density tree, we pick the number of splits satisfying

(3.7)

If we feed Algorithm 1 with parameters , , , and as in (3.3), then the following statements hold:
  (i) If satisfies Assumption 2.1 and there exists an satisfying

then with probability not less than , the following statements hold:

  • The returned level satisfies both and

  • The returned sets , , can be ordered such that

    Here, , , are ordered in the sense of .

(ii) If satisfies Assumption 2.2 and , then

holds with probability not less than for the returned level and the corresponding single cluster , where .

4 Main Results

In this section, we present the main theoretical results for our best-scored clustering forest, namely consistency as well as convergence rates for both the optimal level and the true clusters , , simultaneously, using the error bounds derived in Theorem 3.2 and Theorem 3.4, respectively. We also present some comments and discussions on the obtained theoretical results.

4.1 Consistency for Best-scored Clustering Forest

Theorem 4.1 (Consistency).

Let Assumption 2.1 hold. Furthermore, for a certain constant , assume that , , , and are strictly positive sequences converging to zero and satisfying, for sufficiently large , , . Moreover, let the number of splits satisfy

where and . If we feed Algorithm 1 with parameters , , as in (3.3), and , then the following statements hold:

  • If satisfies Assumption 2.1, then for all , the returned level satisfies

    Moreover, if , then for all , the returned sets , , satisfy

  • If satisfies Assumption 2.2 and , then for all , the returned level satisfies

    Moreover, if , then for all , the returned set satisfies

4.2 Convergence Rates for Best-scored Clustering Forest

In this subsection, we derive the convergence rates for both estimation problems, that is, for estimating the optimal level and the true clusters , , in our proposed algorithm separately.

4.2.1 Convergence Rates for Estimating the Optimal Level

In order to derive the convergence rates for estimating the optimal level , we need the following definition, which describes how well the clusters are separated above .

Definition 4.1.

Let Assumption 2.1 hold. The clusters of are said to have separation exponent if there exists a constant such that

holds for all . Moreover, the separation exponent is called exact if there exists another constant such that

holds for all .

The separation exponent describes how fast the connected components of approach each other for , and a distribution having separation exponent also has separation exponent for all . If the separation exponent is , then the clusters and do not touch each other. With Definition 4.1 above, we are able to establish error bounds for estimating the optimal level in the following theorem, whose proof is quite similar to that of Theorem 4.3 in Steinwart (2015a) and is hence omitted.

Theorem 4.2.

Let Assumption 2.1 hold, and assume that has a bounded -density whose clusters have separation exponent . For , , , , we choose an satisfying