Density-based Clustering with Best-scored Random Forest

The single-level density-based approach has long been acknowledged as a conceptually and mathematically convincing clustering method. In this paper, we propose an algorithm called "best-scored clustering forest" that can obtain the optimal level and determine the corresponding clusters. The term "best-scored" refers to selecting the random tree with the best empirical performance out of a certain number of purely random tree candidates. From the theoretical perspective, we first show that the consistency of our proposed algorithm can be guaranteed. Moreover, under certain mild restrictions on the underlying density functions and target clusters, even fast convergence rates can be achieved. Last but not least, comparisons with other state-of-the-art clustering methods in numerical experiments demonstrate the accuracy of our algorithm on both synthetic data and several benchmark real data sets.


1 Introduction

Regarded as one of the most basic tools to investigate statistical properties of unsupervised data, clustering aims to group a set of objects in such a way that objects in the same cluster are more similar, in some sense, to each other than to those in other clusters. Typical applications range from the categorization of tissues in medical imaging to the grouping of internet search results. For instance, on PET scans, cluster analysis can distinguish between different types of tissue in a three-dimensional image for many different purposes (Filipovych et al., 2011), while in the process of intelligent grouping of files and websites, clustering algorithms create a more relevant set of search results (Marco and Navigli, 2013). Because of these wide applications, there is a growing demand for clustering algorithms that not only maintain desirable prediction accuracy but also have high computational efficiency. In the literature, a wealth of algorithms have already been proposed, such as k-means (Macqueen, 1967), linkage (Ward, 1963; Sibson, 1973; Defays, 1977), cluster tree (Stuetzle, 2003), DBSCAN (Ester et al., 1996), spectral clustering (Donath and Hoffman, 1973; Luxburg, 2007), and expectation-maximization for generative models (Dempster et al., 1977).

As is widely acknowledged, an open problem in cluster analysis is how to give a conceptually and mathematically convincing definition of clusters. In the literature, great efforts have been made to deal with this problem. Perhaps the first definition dates back to Hartigan (1975), which is known as single-level density-based clustering: assuming i.i.d. data generated by some unknown distribution P that has a continuous density f, the clusters of P are then defined to be the connected components of the level set {f ≥ ρ} for some ρ ≥ 0. Since then, different methods based on a density estimator and the connected components of its level sets have been established (Cuevas and Fraiman, 1997; Maier et al., 2012; Rigollet, 2006; Rinaldo and Wasserman, 2010).
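To make the single-level idea concrete, the following is a minimal plug-in sketch (our own illustration, not the algorithm proposed in this paper): estimate a one-dimensional density with a Gaussian kernel estimator, keep the sample points whose estimated density exceeds ρ, and read off connected components by chaining points with small gaps. The bandwidth h, the chaining parameter tau, and the toy data are illustrative choices.

```python
import numpy as np

def kde(data, x, h):
    """Gaussian kernel density estimate of 1-D data, evaluated at points x."""
    u = (np.asarray(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def single_level_clusters(data, rho, h=0.3, tau=0.5):
    """Plug-in estimate of the connected components of {f >= rho}: keep the
    sample points whose estimated density exceeds rho, then chain points
    whose consecutive gaps are at most tau."""
    keep = np.sort(data[kde(data, data, h) >= rho])
    clusters, current = [], [keep[0]]
    for a, b in zip(keep[:-1], keep[1:]):
        if b - a <= tau:
            current.append(b)
        else:
            clusters.append(current)
            current = [b]
    clusters.append(current)
    return clusters

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])
clusters = single_level_clusters(data, rho=0.05)
print(len(clusters))  # two well-separated modes -> 2 clusters
```

Note how the answer depends on ρ: a much larger threshold could erase one mode entirely, which is precisely the conceptual drawback discussed next.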

Note that the single-level approach mentioned above has a conceptual drawback: different values of ρ may lead to different (numbers of) clusters, and there is also no general rule for choosing ρ. In order to address this shortcoming, another type of clustering algorithm, namely hierarchical clustering, was proposed, in which the hierarchical tree structure of the connected components for different levels ρ is estimated. Within this framework, instead of choosing some ρ, the so-called cluster tree approach considers all levels and the corresponding connected components simultaneously. The advantage of the cluster tree approach lies in the fact that it focuses on the identification of the hierarchical tree structure of the connected components for different levels. For this reason, many attempts have been made in the literature to establish its theoretical foundations. For example, Hartigan (1981) proved the consistency of a hierarchical clustering method named single linkage merely for the one-dimensional case; the problem becomes more delicate in the high-dimensional case, where single linkage is only fractionally consistent. To address this problem, Chaudhuri and Dasgupta (2010) proposed a modified single linkage algorithm which is shown to have finite-sample convergence rates as well as lower bounds on the sample complexity under certain assumptions on the density. Furthermore, Kpotufe (2011) obtained similar theoretical results with an underlying k-NN density estimator and achieved experimental improvement by means of a simple pruning strategy that removes connected components that artificially occur because of finite sample variability. However, the notion of recovery taken from Hartigan (1981) focuses only on the correct estimation of the cluster tree structure and not on the estimation of the clusters themselves; for more details we refer to Rinaldo and Wasserman (2010).

So far, theoretical guarantees such as consistency and learning rates for existing hierarchical clustering algorithms are only valid for the cluster tree structure and are therefore far from satisfactory. As a result, in this paper we proceed with the study of single-level density-based clustering. Recently, various results for estimating the optimal level have been established. First of all, Steinwart (2011) and Steinwart (2015a) presented algorithms based on histogram density estimators that are able to asymptotically determine the optimal level and automatically yield a consistent estimator for the target clusters. However, these algorithms are of limited practical value since only the simplest possible density estimators are considered. Attempting to address this issue, Sriperumbudur and Steinwart (2012) proposed a modification of the popular DBSCAN clustering algorithm. Although consistency and optimal learning rates have been established for this new DBSCAN-type construction, its main limitation is that it restricts consideration to moving-window density estimators for Hölder continuous densities. In addition, it is worth noticing that none of the algorithms mentioned above can be well adapted to the case where the underlying distribution possesses no split in the cluster tree. To tackle this problem, Steinwart et al. (2017) proposed an adaptive algorithm using kernel density estimators which, however, also only performs well for low-dimensional data.

In this paper, we mainly focus on clusters that are defined as the connected components of high density regions and present an algorithm called best-scored clustering forest, which can not only guarantee consistency and attain fast convergence rates, but also enjoys satisfactory performance in various numerical experiments. The main contributions of this paper are twofold: (i) Concerning the theoretical analysis, we prove that with the help of the best-scored random forest density estimator, our proposed algorithm can ensure consistency and achieve fast convergence rates under certain assumptions on the underlying density functions and target clusters. We mention that the convergence analysis is conducted within the framework established in Steinwart (2015a). To be more precise, under properly chosen hyperparameters of the best-scored random forest density estimator (Hang and Wen, 2018), the consistency of the best-scored clustering forest can be ensured. Moreover, under some additional regularization conditions, even fast convergence rates can be achieved. (ii) Concerning the numerical experiments, we improve the original purely random splitting criterion by proposing an adaptive splitting method: at each step, we randomly select a sample point from the training data set, and the to-be-split node is the one in which this point falls. The idea behind this procedure is that when randomly picking sample points from the whole training data set, nodes with more samples are more likely to be chosen, whereas nodes containing fewer samples are less likely to be selected. In this way, the probability of obtaining cells with evenly distributed sample sizes is much greater. Empirical experiments further show that this adaptive method enhances the efficiency of our algorithm since it actually increases the effective number of splits.
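To illustrate, here is a hypothetical one-dimensional sketch of this adaptive node selection; the function name, the 1-D setting, and the toy data are ours, not part of the paper's algorithm.

```python
import random

def adaptive_split(data, p, seed=0):
    """Partition [min(data), max(data)] with p splits, choosing the
    to-be-split cell adaptively: draw a training point at random and
    split the cell containing it, so cells holding more samples are
    split more often than under the purely uniform cell choice."""
    rng = random.Random(seed)
    cells = [(min(data), max(data))]
    for _ in range(p):
        x = rng.choice(data)                      # sample-driven cell choice
        i = next(j for j, (a, b) in enumerate(cells) if a <= x <= b)
        a, b = cells.pop(i)
        s = a + rng.random() * (b - a)            # uniform proportional factor
        cells.append((a, s))
        cells.append((s, b))
    return cells

cells = adaptive_split([0.1, 0.2, 0.25, 0.3, 0.9], p=6)
print(len(cells))  # p splits always yield p + 1 cells
```

With the purely uniform cell choice, a tiny cell holding no data is as likely to be split as a large, crowded one; the sample-driven choice above concentrates splits where the data actually lie.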

The rest of this paper is organized as follows: Section 2 introduces some fundamental notations and definitions related to density level sets and the best-scored random forest density estimator. Section 3 is dedicated to the exposition of the generic clustering algorithm architecture. We provide our main theoretical results and statements on the consistency and learning rates of the proposed best-scored clustering forest in Section 4, where the main analysis aims to verify that our best-scored random forest provides a level set estimator with control over both its vertical and horizontal uncertainty. Some comments and discussions on the established theoretical results will also be presented in this section. Numerical experiments comparing the best-scored clustering forest with other density-based clustering methods are given in Section 5. All the proofs of Section 3 and Section 4 can be found in Section 6. We conclude this paper with a brief discussion in the last section.

2 Preliminaries

In this section, we recall several basic concepts and notations related to clusters in the first subsection while in the second subsection we briefly recall the best-scored random forest density estimation proposed recently by Hang and Wen (2018).

2.1 Density Level Sets and Clusters

This subsection begins by introducing some basic notations and assumptions about density level sets and clusters. Throughout this paper, let X ⊂ Rd be a compact and connected subset, and μ be the Lebesgue measure with μ(X) > 0. Moreover, let P be a probability measure that is absolutely continuous with respect to μ and possesses a bounded density f with support X. We denote the centered hypercube of Rd with side length 2r by Br, where

 Br := {x = (x1, …, xd) ∈ Rd : xi ∈ [−r, r], i = 1, …, d},

and the complement of Br is written as Bcr := Rd ∖ Br.

Given a set A ⊂ Rd, we denote its interior by int(A), its closure by cl(A), its boundary by ∂A, and its diameter by diam(A). Furthermore, for a given x ∈ Rd, d(x, A) denotes the distance between x and A. Given another set B, A △ B denotes the symmetric difference between A and B. Moreover, 1A stands for the indicator function of the set A.

We say that a function f is α-Hölder continuous if there exists a constant c > 0 such that

 |f(x) − f(x′)| ≤ c ∥x − x′∥α2, α ∈ (0, 1].

Note that the restriction α ≤ 1 is natural, since any function satisfying such a condition with α > 1 is constant.

Finally, throughout this paper, we use the notation an ≲ bn to denote that there exists a positive constant c such that an ≤ c bn for all n.

2.1.1 Density Level Sets

In order to find a notion of density level set which is topologically invariant against different choices of the density of the distribution P, Steinwart (2011) proposes to define the density level set at level ρ by

 Mρ := supp μρ,

where supp μρ stands for the support of μρ, and the measure μρ is defined by

 μρ(A) := μ(A ∩ {f ≥ ρ}), A ∈ B(X),

where B(X) denotes the Borel σ-algebra of X. According to this definition, the density level set Mρ is closed. If the density f is assumed to be α-Hölder continuous, the above construction could be replaced by the usual level set {f ≥ ρ} without changing our results.

Here, some important properties of the sets Mρ, ρ ≥ 0, are useful:

1. Level Sets. Mρ ⊂ {f ≥ ρ}.

2. Monotonicity. Mρ′ ⊂ Mρ for all ρ ≤ ρ′.

3. Regularity. μ(Mρ △ {f ≥ ρ}) = 0.

4. Normality. Mρ = cl(int(Mρ)), where int(A) denotes the interior and cl(A) the closure of a set A.

5. Open Level Sets. μ({f > ρ} ∖ Mρ) = 0.

2.1.2 Comparison of Partitions and Notations of Connectivity

Before introducing the definition of clusters, some notions related to the connected components of level sets are needed. First of all, we give a definition that compares different partitions.

Definition 2.1.

Let A ⊂ B be nonempty sets, and P(A) and P(B) be partitions of A and B, respectively. Then P(A) is said to be comparable to P(B) if for all A′ ∈ P(A), there exists a B′ ∈ P(B) such that A′ ⊂ B′. In this case, we write P(A) ⊏ P(B).

It can be easily deduced that P(A) is comparable to P(B) if no cell A′ ∈ P(A) is broken into pieces in P(B). Let P1 and P2 be two partitions of A; then we call P1 finer than P2 if and only if P1 ⊏ P2. Moreover, as is demonstrated in Steinwart (2015b), for two partitions P(A) and P(B) with P(A) ⊏ P(B), there exists a unique map ζ : P(A) → P(B) such that A′ ⊂ ζ(A′) for all A′ ∈ P(A). We call ζ the cell relating map (CRM) between P(A) and P(B).

Now, we give further insight into two vital examples of comparable partitions coming from connected components. Recall that a set A ⊂ X is topologically connected if, for every pair A′, A″ of relatively closed disjoint subsets of A with A′ ∪ A″ = A, we have A′ = ∅ or A″ = ∅. The maximal connected subsets of A are called the connected components of A. As is widely acknowledged, these components form a partition of A, which we denote by C(A). Furthermore, for closed sets A ⊂ B, we have C(A) ⊏ C(B).

The next example describes another type of connectivity, namely τ-connectivity, which can be considered as a discrete version of path-connectivity. For the latter, let us fix a τ > 0 and A ⊂ X. Then x, x′ ∈ A are called τ-connected in A if there exist x1, …, xn ∈ A such that x1 = x, xn = x′, and ∥xi − xi+1∥2 ≤ τ for all i = 1, …, n − 1. Clearly, being τ-connected gives an equivalence relation on A. The resulting partition can be written as Cτ(A), and we call its cells the τ-connected components of A. It can be verified that, for all A ⊂ B and τ > 0, we always have Cτ(A) ⊏ Cτ(B); see Lemma A.2.7 in Steinwart (2015b). In addition, under mild conditions on A, we have Cτ(A) = C(A) for all sufficiently small τ > 0; see Section 2.2 in Steinwart (2015a).
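For a finite point set, the τ-connected components are straightforward to compute; the following sketch (our own, using a naive O(n²) union-find) groups a point set accordingly:

```python
import numpy as np

def tau_components(points, tau):
    """Partition a finite point set into its tau-connected components:
    x and x' are linked when some chain of points in the set has
    consecutive gaps of at most tau."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # find the root representative, with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= tau:
                parent[find(i)] = find(j)   # union the two groups

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

pts = np.array([[0.0, 0.0], [0.4, 0.0], [0.8, 0.0], [5.0, 5.0]])
print(len(tau_components(pts, tau=0.5)))  # chain of three + one isolated -> 2
```

Note that the first three points form one component even though the endpoints are 0.8 apart: only consecutive gaps need to stay within τ.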

2.1.3 Clusters

Based on the concepts established in the preceding subsections, we now recall the definition of clusters; see also Definition 2.5 in Steinwart (2015b).

Definition 2.2 (Clusters).

Let X ⊂ Rd be a compact and connected set, and P be a μ-absolutely continuous distribution. Then P can be clustered between ρ∗ ≥ 0 and ρ∗∗ > ρ∗, if P is normal and for all ρ ∈ [0, ρ∗∗], the following three conditions are satisfied:

1. We have either |C(Mρ)| = 1 or |C(Mρ)| = 2;

2. If we have ρ ∈ [0, ρ∗], then |C(Mρ)| = 1;

3. If we have ρ ∈ (ρ∗, ρ∗∗], then |C(Mρ)| = 2 and C(Mρ∗∗) ⊏ C(Mρ).

Using the CRMs ζρ : C(Mρ∗∗) → C(Mρ) for ρ ∈ (ρ∗, ρ∗∗], we then define the clusters of P by

 A∗i := ⋃ρ∈(ρ∗,ρ∗∗] ζρ(Ai), i ∈ {1, 2},

where A1 and A2 are the two topologically connected components of Mρ∗∗. Finally, we define

 τ∗(ε) := (1/3) d(ζρ∗+ε(A1), ζρ∗+ε(A2)), ε ∈ (0, ρ∗∗ − ρ∗]. (2.1)

To illustrate, the above definition ensures that the level sets below ρ∗ are connected, while there are exactly two components in the level sets for the range (ρ∗, ρ∗∗] above ρ∗. Moreover, any two level sets within this range are comparable. As a result, the topological structure between ρ∗ and ρ∗∗ is determined by that of Mρ∗∗. In this manner, the connected components of Mρ, ρ ∈ (ρ∗, ρ∗∗], can be numbered by the connected components of Mρ∗∗. This numbering procedure is clearly reflected in the definition of the clusters as well as in that of the function τ∗, which in essence measures the distance between the two connected components at level ρ∗ + ε.

Since the quantification of the uncertainty of clusters is indispensable, we need to introduce, for A ⊂ X and δ > 0, the sets

 A+δ :={x∈X:d(x,A)≤δ}, A−δ :=X∖(X∖A)+δ. (2.2)

In other words, A+δ can be thought of as adding a δ-tube to A, while A−δ corresponds to removing a δ-tube from A. We wish to exclude cases where the density level sets have bridges or cusps that are too thin. To be more precise, recall that for a closed A ⊂ X, the function ψ∗A is defined by

 ψ∗A(δ) := sup{d(x, A−δ) : x ∈ A}, δ > 0.

In particular, for all δ > 0 we have d(x, A−δ) ≤ ψ∗A(δ) for all x ∈ A, and if A−δ ≠ ∅, then A ⊂ (A−δ)+ψ∗A(δ). Consequently, according to Lemma A.4.3 in Steinwart (2015b), for all δ, τ > 0 with ψ∗A(δ) < τ, we have

 |Cτ(A−δ)| ≤ |C(A)|,

whenever A is contained in some compact set and A−δ ≠ ∅.
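On a discretized set, the two δ-tube operations of (2.2) become morphological dilation and erosion; the following sketch (our own illustration, with δ measured in grid steps) shows how removing a δ-tube destroys a bridge that is too thin:

```python
import numpy as np

def dilate(mask):
    """One-step 4-neighbour dilation of a boolean grid mask."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def add_tube(mask, k):
    """A^{+delta}: add a delta-tube, i.e. dilate k times."""
    for _ in range(k):
        mask = dilate(mask)
    return mask

def remove_tube(mask, k):
    """A^{-delta} = X \\ (X \\ A)^{+delta}: dilate the complement, then flip."""
    return ~add_tube(~mask, k)

# two blocks joined by a one-pixel bridge: the bridge is "too thin"
A = np.zeros((7, 15), dtype=bool)
A[1:6, 1:6] = True
A[1:6, 9:14] = True
A[3, 6:9] = True            # thin bridge
B = remove_tube(A, 1)
print(A[3, 7], B[3, 7])     # the bridge survives in A but not in A^{-delta}
```

After eroding by one step, the bridge pixel is gone while the interiors of both blocks remain, so A−δ has two components even though A has one.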

With the preceding preparations, we now come to the following definition excluding bridges and cusps which are too thin.

Definition 2.3.

Let be a compact and connected set, and be a -absolutely continuous distribution that is normal. Then we say that has thick level sets of order up to the level , if there exits constants and such that, for all and , we have

 ψ∗Mρ(δ)≤cthickδγ.

In this case, we call the thickness function of .

In order to describe the distribution we wish to cluster, we now make the following assumption based on all concepts introduced so far.

Assumption 2.1.

The distribution with bounded density is able to be clustered between and . Moreover, has thick level sets of order up to the level . The corresponding thickness function is denoted by and the function defined in (2.1) is abbreviated as .

In the case where all level sets are connected, we introduce the following assumption to investigate the behavior of the algorithm in situations in which P cannot be clustered.

Assumption 2.2.

Let be a compact and connected set, and be a -absolutely continuous distribution that is normal. Assume that there exist constants , , and such that for all and , the following conditions hold:

• .

• If then .

• If , then for all non-empty and all .

• For each there exists a with .

2.2 Best-scored Random Forest Density Estimation

Considering the fact that density estimation should come before the analysis of the level sets, we dedicate this subsection to the methodology of building an appropriate density estimator. Different from the usual histogram density estimation (Steinwart, 2015a) and kernel density estimation (Steinwart et al., 2017), this paper adopts a novel random forest-based density estimation strategy, namely the best-scored random forest density estimation proposed recently by Hang and Wen (2018).

2.2.1 Purely Random Density Tree

Recall that each tree in the best-scored random forest is built on a purely random partition following the idea of Breiman (2000). To give a clear description of one possible construction procedure of this purely random partition, we introduce, as in Hang and Wen (2018), a random vector that encodes the building mechanism at the i-th step. To be specific,

• its first component denotes the to-be-split cell at the i-th step, chosen uniformly at random from all cells formed in the (i − 1)-th step;

• its second component stands for the dimension chosen to be split in the i-th step, where each of the d dimensions has the same probability of being selected, that is, the split dimensions are i.i.d. multinomially distributed with equal probabilities;

• its third component is a proportional factor standing for the ratio between the length, in the chosen dimension, of the newly generated cell after the i-th split and the length of the being-cut cell in that dimension. We emphasize that these proportional factors are i.i.d. drawn from the uniform distribution on (0, 1).

In this manner, the above splitting procedure leads to a so-called partition variable Z, with the probability measure of Z denoted by PZ; any specific partition variable can be treated as a splitting criterion. Moreover, for the sake of notational clarity, we regard the collection of non-overlapping cells formed after conducting p splits on Br following Z as a random partition on Br. Accordingly, for a certain sample point x, the cell in which it falls is denoted by A(x).

In order to better characterize the purely random density tree, we give another expression of the random partition on Br, namely {A0, …, Ap}, where each Aj represents one of the resulting cells of this partition. Based on this partition, we can build the random density tree with respect to a probability measure Q on Rd, denoted by fQ,Z, defined by

 fQ,Z(x) := fQ,Z,p(x) := ∑pj=0 Q(Aj) 1Aj(x) / μ(Aj) + Q(Bcr) 1Bcr(x) / μ(Bcr),

where, unless otherwise stated, we assume that the Lebesgue measure μ(Aj) > 0 for all j = 0, …, p. In this regard, when taking Q = P, the density tree decision rule becomes

 fP,Z(x) = P(A(x)) / μ(A(x)) = (1 / μ(A(x))) ∫A(x) f(x′) dμ(x′), x ∈ Br,

where A(x) denotes the cell containing x. When taking Q to be the empirical measure Dn := (1/n) ∑ni=1 δxi, we obtain

 Dn(A(x)) = EDn 1A(x) = (1/n) ∑ni=1 δxi(A(x)) = (1/n) ∑ni=1 1A(x)(xi),

and hence the density tree turns into

 fD,Z(x) := fDn,Z(x) = Dn(A(x)) / μ(A(x)) = (1 / (n μ(A(x)))) ∑ni=1 1A(x)(xi). (2.3)
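The purely random partition and the empirical estimate (2.3) can be sketched as follows; the function names are ours, and the scheme is the purely random one, without the best-scored selection of Section 2.2.2.

```python
import random

def purely_random_partition(r, d, p, seed=1):
    """p purely random splits of the cube [-r, r]^d: at each step pick a
    cell uniformly at random, a split dimension uniformly at random, and
    a uniform cut point within that dimension of the chosen cell."""
    rng = random.Random(seed)
    cells = [[(-r, r)] * d]
    for _ in range(p):
        cell = cells.pop(rng.randrange(len(cells)))   # uniform cell choice
        k = rng.randrange(d)                          # uniform dimension
        a, b = cell[k]
        s = a + rng.random() * (b - a)                # uniform proportional factor
        left, right = list(cell), list(cell)
        left[k], right[k] = (a, s), (s, b)
        cells.extend([left, right])
    return cells

def tree_density(x, data, cells):
    """Estimate (2.3): empirical mass of the cell containing x over its volume."""
    n = len(data)
    for cell in cells:
        if all(a <= xi <= b for xi, (a, b) in zip(x, cell)):
            vol = 1.0
            for a, b in cell:
                vol *= b - a
            count = sum(
                all(a <= zi <= b for zi, (a, b) in zip(z, cell)) for z in data
            )
            return count / (n * vol)
    return 0.0  # x lies outside [-r, r]^d

cells = purely_random_partition(r=1.0, d=2, p=10)
print(len(cells))  # p splits always produce p + 1 cells
```

Note that the splits never look at the data, which is exactly the weakness the best-scored selection below is designed to compensate for.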

2.2.2 Best-scored Random Density Trees and Forest

Considering the fact that the above partitions make no use of the sample information, the prediction results of their ensemble forest may not be accurate enough. In order to improve the prediction accuracy, we select, out of a number of candidates, the partition with the best density estimation performance according to a certain performance measure such as ANLL (Hang and Wen, 2018, Section 5.4), and use it for tree construction. The resulting trees are then called the best-scored random density trees.

Now, let fD,Zt, t = 1, …, m, be the best-scored random density tree estimators generated by the splitting criteria Z1, …, Zm, respectively, defined by

 fD,Zt(x) := ∑pj=0 D(Atj) 1Atj(x) / μ(Atj) + D(Bcr) 1Bcr(x) / μ(Bcr),

where {At0, …, Atp} is a random partition of Br. Then the best-scored random density forest can be formulated as

 fD,ZE(x):=1mm∑t=1fD,Zt(x), (2.4)

and its population version is denoted by fP,ZE.
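A minimal one-dimensional sketch of the best-scored selection and the forest average (2.4), assuming histogram-style trees on [0, 1] and ANLL as the performance measure; all names and parameter choices are ours.

```python
import math
import random

def random_histogram(data, p, rng):
    """One purely random density tree on [0, 1]: p uniform random cuts."""
    cuts = sorted([0.0] + [rng.random() for _ in range(p)] + [1.0])

    def estimate(x):
        for a, b in zip(cuts[:-1], cuts[1:]):
            if a <= x <= b and b > a:
                return sum(a <= z <= b for z in data) / (len(data) * (b - a))
        return 0.0
    return estimate

def best_scored_tree(train, valid, p, k, rng):
    """Out of k purely random candidates, keep the tree with the smallest
    average negative log-likelihood (ANLL) on held-out validation data."""
    def anll(f):
        return sum(-math.log(max(f(x), 1e-12)) for x in valid) / len(valid)
    return min((random_histogram(train, p, rng) for _ in range(k)), key=anll)

def forest(train, valid, p, k, m, seed=0):
    """Best-scored random density forest (2.4): average of m best-scored trees."""
    rng = random.Random(seed)
    trees = [best_scored_tree(train, valid, p, k, rng) for _ in range(m)]
    return lambda x: sum(t(x) for t in trees) / m

rng = random.Random(1)
data = [rng.betavariate(2, 5) for _ in range(300)]
f = forest(data[:200], data[200:], p=8, k=5, m=10)
print(f(0.2) > f(0.9))  # far more mass near the Beta(2, 5) mode at 0.25
```

Averaging the m selected trees smooths the blocky single-tree estimates, while the per-tree selection step injects the sample information that the purely random splits ignore.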

3 A Generic Clustering Algorithm

In this section, we present a generic clustering algorithm in which the clusters are estimated with the help of a generic level set estimator that can later be specified by histogram, kernel, or random forest density estimators. To this end, let the optimal level ρ∗ and the resulting clusters A∗i, i = 1, 2, for distributions P be as in Definition 2.2, and the constants be as in Assumption 2.2. The goal of this section is to investigate whether ρ∗ can be estimated and whether the clusters A∗i, i = 1, 2, can be recovered.

Let us first recall some more notation introduced in Section 2. For a μ-absolutely continuous distribution P, let the level ρ∗, the level sets Mρ, ρ ≥ 0, and the function τ∗ be as in Definition 2.2. Furthermore, for a fixed set A ⊂ X, its δ-tubes A+δ and A−δ are defined by (2.2). Moreover, concerning the thick level sets, the constant cthick and the thickness function ψ are introduced in Definition 2.3.

In what follows, let (Lρ)ρ≥0 always be a decreasing family of sets such that

 M−δρ+ε⊂Lρ⊂M+δρ−ε (3.1)

holds for all ρ.

The following theorem relates the component structure of a family of level set estimators (Lρ), a decreasing family of subsets of X, to the component structure of the sets Mρ; for more details see e.g. Steinwart (2015a).

Theorem 3.1.

Let Assumption 2.1 hold. Furthermore, for ε > 0 and δ > 0, let τ and the family (Lρ) be as in (3.1). Then, for all ρ > 0 and the corresponding CRMs ζ, the following disjoint union holds:

 Cτ(Lρ)=ζ(Cτ(M−δρ+ε))∪{B′∈Cτ(Lρ):B′∩Lρ+2ε=∅}.

From Theorem 3.1 we see that for suitable ε, δ, and τ, every τ-connected component of Lρ is either contained in the image of a τ-connected component of M−δρ+ε under the CRM ζ, or vanishes at level ρ + 2ε. Accordingly, carrying out these steps precisely, we obtain the generic clustering strategy shown in Algorithm 1.
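The generic strategy can be sketched as follows (a hypothetical illustration; the toy density, names, and parameters are ours): scan increasing levels ρ and stop once the estimated level set splits into at least two τ-connected components.

```python
import numpy as np

def tau_components(points, tau):
    """1-D tau-connected components: chain sorted points while gaps <= tau."""
    pts = np.sort(points)
    comps, cur = [], [pts[0]]
    for a, b in zip(pts[:-1], pts[1:]):
        if b - a <= tau:
            cur.append(b)
        else:
            comps.append(cur)
            cur = [b]
    comps.append(cur)
    return comps

def generic_clustering(data, density, tau, eps, rho_max):
    """Scan increasing levels rho; return the first level at which the
    estimated level set {x in D : density(x) >= rho} splits into at
    least two tau-connected components, together with those components."""
    rho = 0.0
    while rho <= rho_max:
        level_set = data[density(data) >= rho]
        if len(level_set) == 0:
            break
        comps = tau_components(level_set, tau)
        if len(comps) >= 2:
            return rho, comps
        rho += eps
    return rho, None

data = np.array([-3.2, -3.0, -2.8, 0.0, 2.8, 3.0, 3.2])
density = lambda x: np.exp(-(x + 3) ** 2) + np.exp(-(x - 3) ** 2)
rho, comps = generic_clustering(data, density, tau=4.0, eps=0.001, rho_max=1.0)
print(len(comps))  # the low-density point at 0.0 drops out first -> 2
```

Raising the level removes the low-density bridge point before it touches the two modes, so the first split of the level set identifies the two clusters.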

Under Assumptions 2.1 and 2.2, the following theorem bounds the returned level ρout and the components Bi(D), i = 1, 2, as well as the start level ρ0 and the corresponding single cluster, respectively, which are the outputs returned by Algorithm 1.

Theorem 3.2.

(i) Let Assumption 2.1 hold. For ε > 0 and δ > 0, let τ and the family (Lρ) satisfy (3.1) for all ρ. Then, for any data set D, the following statements hold for Algorithm 1:

• The returned level ρout satisfies ρout > ρ∗ and

 τ−ψ(δ)<3τ∗(ρout−ρ∗+ε);
• The returned sets Bi(D), i = 1, 2, can be ordered such that

 2∑i=1μ(Bi(D)△A∗i)≤22∑i=1μ(A∗i∖(Aiρout+ε)−δ)+μ(M+δρout−ε∖{f>ρ∗}). (3.2)

Here, the sets Bi(D), i = 1, 2, are ordered suitably.

(ii) Let Assumption 2.2 hold. Moreover, let , be fixed, , and satisfy (3.1) for all . If , then Algorithm 1 returns the start level and the corresponding single cluster such that

 μ(Lρ0△ˆMρ∗)≤μ(M+δρ0−ε∖ˆMρ∗)+μ(ˆMρ∗∖M−δρ0+ε)

where .

The above analysis is mainly concerned with the general case, in which we assume that the underlying density has already been successfully estimated. Therefore, in the following, we delve into the component structure and other properties of the clustering algorithm when the density is estimated by the forest density estimator (2.4).

Note that one more piece of notation is necessary for clear understanding: one way to define level set estimators with the help of the forest density estimator (2.4) is the simple plug-in approach,

 LD,ρ := {x ∈ X : fD,ZE(x) ≥ ρ}.

However, these level set estimators are too complicated to compute the -connected components in Algorithm 1. Instead, we take level set estimators of the form

 LD,ρ:={x∈D:fD,ZE(x)≥ρ}+σ. (3.3)

The following theorem shows that some kind of uncertainty control of the form (3.1) is valid for level set estimators of the form (3.3) induced by the forest density estimator (2.4).

Theorem 3.3.

Let P be a μ-absolutely continuous distribution on X and fD,ZE be the forest density estimator (2.4). For any t, let Atj denote a cell of the t-th partition, and assume there exists a constant σ > 0 such that diam(Atj) ≤ σ for all such cells. Then, for all ε > 0 and ρ > 0, there holds

 M−2σρ+ε⊂LD,ρ⊂M+2σρ−ε. (3.4)

Before we present the next theorem, recall that r denotes half of the side length of the centered hypercube Br in Rd and m denotes the number of trees in the best-scored random forest.

Theorem 3.4.

Let be a -absolutely continuous distribution on . For , , , , we choose an satisfying

 ε≥√∥f∥∞Eς,p/n+Eς,p/(3n)+2/n, (3.5)

where is defined by

Furthermore, for and , we choose a with and assume this satisfying and . Moreover, for each random density tree, we pick the number of splits satisfying

 p>(2mdeς/δ)4d/cT. (3.7)

If we feed Algorithm 1 with parameters , , , and as in (3.3), then the following statements hold:
(i) If satisfies Assumption 2.1 and there exists an satisfying

 ε+inf{ε′∈(0,ρ∗∗−ρ∗]:τ∗(ε′)≥τ}≤ε∗≤(ρ∗∗−ρ∗)/9,

then with probability not less than , the following statements hold:

• The returned level ρD,out satisfies ρD,out > ρ∗ and

 τ−ψ(2σ)<3τ∗(ρD,out−ρ∗+ε);
• The returned sets Bi(D), i = 1, 2, can be ordered such that

 2∑i=1μ(Bi(D)△A∗i)≤22∑i=1μ(A∗i∖(AiρD,out+ε)−2σ)+μ(M+2σρD,out−ε∖{h>ρ∗}).

Here, the sets Bi(D), i = 1, 2, are ordered suitably.

(ii) If satisfies Assumption 2.2 and , then

 μ(Lρ0△ˆMρ∗)≤μ(M+2σρ0−ε∖ˆMρ∗)+μ(ˆMρ∗∖M−2σρ0+ε)

holds with probability not less than for the returned level and the corresponding single cluster , where .

4 Main Results

In this section, we present the main theoretical results for our best-scored clustering forest concerning consistency as well as convergence rates for both the optimal level ρ∗ and the true clusters A∗i, i = 1, 2, simultaneously, using the error bounds derived in Theorem 3.2 and Theorem 3.4, respectively. We also present some comments and discussions on the obtained theoretical results.

4.1 Consistency for Best-scored Clustering Forest

Theorem 4.1 (Consistency).

Let Assumption 2.1 hold. Furthermore, for certain constant , assume that , , , and are strictly positive sequences converging to zero satisfying for sufficiently large , , . Moreover, let the number of splits satisfy

 limn→∞np−2an(logn)−1ε2n =∞, limn→∞δnpcT/(4d)n =∞,

where and . If we feed Algorithm 1 with parameters , , as in (3.3), and , then the following statements hold:

• If P satisfies Assumption 2.1, then for all ϵ > 0, the returned level ρD,out satisfies

 limn→∞Pn({D∈Xn:0<ρD,out−ρ∗≤ϵ})=1.

Moreover, under the same conditions, for all ϵ > 0, the returned sets Bi(D), i = 1, 2, satisfy

 limn→∞Pn({D∈Xn:2∑i=1μ(Bi(D)△A∗i)≤ϵ})=1.
• If P satisfies Assumption 2.2 and the corresponding parameter conditions hold, then for all ϵ > 0, the returned level ρD,out satisfies

 limn→∞Pn({D∈Xn:0<ρD,out≤ϵ})=1.

Moreover, under the same conditions, for all ϵ > 0, the returned set LD,ρD,out satisfies

 limn→∞ Pn({D∈Xn: μ(LD,ρD,out△{f>0})≤ϵ})=1.

4.2 Convergence Rates for Best-scored Clustering Forest

In this subsection, we derive the convergence rates for both estimation problems, that is, for estimating the optimal level ρ∗ and for estimating the true clusters A∗i, i = 1, 2, by our proposed algorithm, treated separately.

4.2.1 Convergence Rates for Estimating the Optimal Level

In order to derive the convergence rates for estimating the optimal level ρ∗, we need the following assumption, which describes how well the clusters are separated above ρ∗.

Definition 4.1.

Let Assumption 2.1 hold. The clusters of P are said to have separation exponent κ if there exists a constant csep > 0 such that

 τ∗(ε) ≥ csep ε1/κ

holds for all ε ∈ (0, ρ∗∗ − ρ∗]. Moreover, the separation exponent κ is called exact if there exists another constant ¯csep > 0 such that

 τ∗(ε) ≤ ¯csep ε1/κ

holds for all ε ∈ (0, ρ∗∗ − ρ∗].

The separation exponent κ describes how fast the connected components of the level sets Mρ approach each other as ρ decreases to ρ∗, and a distribution having separation exponent κ also has separation exponent κ′ for all κ′ ≥ κ. If τ∗ is bounded away from zero, then the clusters A∗1 and A∗2 do not touch each other. With the above Definition 4.1, we are able to establish error bounds for estimating the optimal level ρ∗ in the following theorem, whose proof is quite similar to that of Theorem 4.3 in Steinwart (2015a) and hence is omitted.

Theorem 4.2.

Let Assumption 2.1 hold, and assume that P has a bounded μ-density f whose clusters have separation exponent κ. As in Theorem 3.4, we choose an ε satisfying

 ε≥√∥f∥∞Eς,p/n+Eς,