Statistical Properties of the Single Linkage Hierarchical Clustering Estimator

11/24/2015 ∙ by Dekang Zhu, et al. ∙ 0

Distance-based hierarchical clustering (HC) methods are widely used in unsupervised data analysis but few authors take account of uncertainty in the distance data. We incorporate a statistical model of the uncertainty through corruption or noise in the pairwise distances and investigate the problem of estimating the HC as unknown parameters from measurements. Specifically, we focus on single linkage hierarchical clustering (SLHC) and study its geometry. We prove that under fairly reasonable conditions on the probability distribution governing measurements, SLHC is equivalent to maximum partial profile likelihood estimation (MPPLE) with some of the information contained in the data ignored. At the same time, we show that direct evaluation of SLHC on maximum likelihood estimation (MLE) of pairwise distances yields a consistent estimator. Consequently, a full MLE is expected to perform better than SLHC in getting the correct HC results for the ground truth metric.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Hierarchical clustering (HC) is widely used in a range of applications, especially for situations where the hierarchy is physically meaningful, such as the study of the genetic history of biological populations (evolutionary trees) [1, 2] and microarray expression data which contributes to pharmacology, toxicogenomics and disease-subclass determination [3, 4]. Phylogeny of languages [5] and authorship attribution [6] are performed using HC methods. In addition, HC is used where “flat” clustering is required but the scale of the clustering remains to be determined. In wireless sensor network (WSN) routing protocol designs, HC methods have been introduced for grouping sensor nodes into clusters to achieve network scalability, energy efficiency and longer lifetimes [7, 8]. In the context of WSNs, increasingly being deployed in Internet of Things (IoT) scenarios, energy efficiency and better organization are achieved by using HC methods to form clusters of sensors, enabling the network to operate in a grouped way. Of particular importance, applications have been proposed in cyber-security, e.g. detection of distributed denial of service (DDoS) attacks [9, 10], detection of worms and viruses [11] and “poisoning” methods have been proposed to disrupt HC [12]. HC methods are applied in many other domains, including documents classification [13], image segmentation [14], network topology identification [15], and so on. Many clustering algorithms begin with a population being assigned some measure of difference (weight) between members of that population. Typically, but not always, the pre-processing of the data involves modelling the differences between members of the population by a metric rather than a general weight. That is, the differences are assumed to be symmetric and, more significantly, to satisfy the triangle inequality. As a result we will restrict our attention to clustering methods where the weights are actually metrics — distance-based clustering.

There are many approaches to clustering, and HC, of a finite population with a metric to quantify differences (finite metric space); for instance, single/complete linkage clustering and unweighted pair group method with arithmetic mean (UPGMA). “Flat” clustering methods are often unable to explicate the finer structure of clusters, and the most significant disadvantage for many of them is the dependence on the number of clusters, which is almost never more than a guess. Most flat clustering methods are optimizations with respect to some objective functions over all possible clusterings, and it is infeasible for a global search. Therefore, for implementation, they start from an initial random partition and refine it iteratively to get a “local” optimum. Inevitably, finding a good starting point is a key issue. On the contrary, HC methods return a hierarchy of partitions unveiling structure in the data and do not usually require a pre-specified number of clusters. In fact, HC methods are helpful in identifying the correct number of clusters. Furthermore, HC methods are deterministic. Disadvantages of flat clustering and advantages of HC are discussed in [16].

Meanwhile, attempts to justify clustering from an axiomatic standpoint have been made, notably one by Kleinberg [17] which resulted in an “impossibility theorem” stating a seemingly minimalistic and natural set of requirements of a “good” distance-based clustering algorithm that cannot be satisfied simultaneously. However, as argued in [18], HC overcomes this non-existence problem and gives an analogous theorem in which one obtains an existence and uniqueness theorem instead. Moreover, this perspective affords a rich mathematical theory with deep roots in geometry [19] and topology [20]. In addition to being the only known tractable, unsupervised, well-defined HC method, single linkage hierarchical clustering (SLHC) has particular mathematical properties that make it an attractive option. Only SLHC is stable under small perturbations of the weights, and it is the only one satisfying the following consistency/convergence property [18]: if the number of i.i.d. sample points goes to infinity, the result of applying SLHC to a data set converges almost surely in the Gromov-Hausdorff sense to an ultra-metric space recovering the multi-scale structure of the probability distribution support.

Despite the above suite of attractive characterizations, a focus on the so-called “chaining” phenomenon in SLHC contributes to a growing consensus that SLHC is largely unsuitable for “real world” applications [21]. It is important to observe, however, that the chaining phenomenon may not be as pertinent a feature of the resulting HC if one allows small perturbations of the weights [22]. Indeed, we argue that it is often the case that the measurement/assignment of difference carries some level of uncertainty. This uncertainty can be an inherent “noise” in the measurement process but in many applications might just be a technique for modelling the unknowns in the weight assignment process. A typical application might be an attempt to cluster on the basis of common features characterised by a real parameter (temperature, length, etc), where there is uncertainty in the actual value of the parameter. This work adopts and explores the point of view that the uncertainties inherent in the modelling process require us to view the weights it generates as measurements of an unknown “true” metric describing a correct model of the observed population and the differences between its members. The aim here is to estimate (in the statistical sense) the most likely hierarchical clustering of the data associated with that measurement.

Conventional approaches to statistical estimation of partitions and hierarchies view the objects to be clustered as random samples of certain distributions over a prescribed geometry (e.g. Gaussian mixture model estimation using expectation-maximization in Euclidean spaces), and clusters can then easily be described in terms of their most likely origin. Thus, these are really

distribution-based clustering methods  —  not distance-based ones. Our approach, proposed here for the first time, directly attributes uncertainty to the process of obtaining values for the pairwise distances rather than distort the data by mapping it into one’s “favorite space”. To the best of our knowledge, very little work has been done in this vein. Of note is [23], where similar ideas have been applied to the estimation of spanning trees in a communication network. However, pairwise dissimilarity measures characterizing the latent hierarchy are already given in their setting, and additional data is then applied to infer the correct tree. Thus, their problem may be seen as complementary to ours (see Section 2.2.3 below).

Our approach to the estimation of the single linkage hierarchy corresponding to a ground truth metric treats as a nuisance parameter to be eliminated. In situations of this kind, it is recommended [24] to maximize an integrated likelihood function derived from an appropriate so-called “conditional prior”. As is often the case, the full maximum likelihood estimator (MLE) is intractable and we have to resort to weaker versions of the optimization. In particular, we consider certain profile and partial likelihoods. We recall that profile likelihood eliminates the nuisance parameter through maximization rather than integration [24], the former being easier to implement. Partial likelihood is based on the idea that the problem parameters may be partitioned into “blocks”, one of which strongly depends on the parameter of interest, so that the likelihood for that block (given the parameter) is easier to compute [25].

This paper is organized as follows. Section 2 gives a review of preliminary notions useful in establishing the relationship between SLHC and the geometry of the metric cone in terms of spanning trees. This facilitates the discussion of statistical estimation of SLHC in the main section, Section 3. In Section 4 we prove the consistency of SLHC as an estimator. We conclude with simulations and a brief discussion in Section 5.

2 Preliminaries

We restrict our attention to distance-based clustering methods, for which it is assumed that a data set first undergoes initial processing to produce a weight (see below), which is then used as input to a clustering map. A distinction is made in the literature between flat clustering methods and hierarchical ones: a flat clustering method generates a partition of from a weight , while a hierarchical clustering method produces a hierarchy of partitions, also known as a dendrogram (see below).

2.1 Hierarchies and Dendrograms

2.1.1 Weights and Metrics

A weight111Often called a dissimilarity in this context on is a symmetric function satisfying for all , whose values , represent a “degree of dissimilarity” between data entries. We will say that a weight is strict, if whenever . Hereafter we will use the notation , with the pair representing an undirected edge of the complete graph with vertex set whenever . The set of edges of will be denoted by .

Metrics on traditionally form a preferred class of weights for clustering purposes. A weight on is called a (semi/pseudo)-metric if it satisfies the triangle inequality, . An ultra-metric is a weight satisfying an even stronger condition, the ultra-metric inequality, . The spaces of weights, metrics and ultra-metrics on will be denoted by and , respectively.

2.1.2 Partitions

Recall that a collection of pairwise-disjoint subsets of whose union is is called a partition of . The elements of a partition will be referred to as its clusters. We denote the set of partitions of by . Thus, formally, a distance-based flat clustering method on is merely a function .

For any and any partition , we will denote the cluster of containing by . Recall that a partition is said to refine a partition , if every cluster of is contained in a cluster of ; we denote this by . A hierarchy is a chain of partitions: a collection of partitions where every satisfy either or . The refinement relation has been shown by Kleinberg [17] to be fundamental in any principled discussion of distance-based clustering maps, because of its role as an obstruction to a variety of intuitive consistency requirements commonly seen as desirable for a clustering method.

2.1.3 Dendrograms

Following [18], we describe a dendrogram as a pair , where is a map satisfying the following: (1) there exists s.t. for all , (2) if then partition refines partition , and (3) for all there exists s.t. for

. Carlsson and Mémoli have shown that expanding the domain of allowed classifiers from

to dendrograms over resolves the consistency problems indicated in Kleinberg’s work in [17]. It will be useful to separate the metric information in a dendrogram (the grading by resolution) from the combinatorial information it conveys: a dendrogram may be uniquely represented by a pair , where denotes the structure of   —  the chain of partitions defined by (with the resolutions forgotten), ordered by refinement; and is the

height vector

, see Figure 1, whose coordinates, in order, indicate the minimum resolution at which each partition in the structure occurs in the dendrogram.

Ultra-metrics provide a convenient tool for encoding dendrograms — see [26, 18] and Figure 1: any dendrogram gives rise to an ultra-metric , and conversely, an ultra-metric encodes a dendrogram via:

(1)

This enables a formulation of the study of distance-based hierarchical clustering as the study of ‘coherent’ families of maps .222Where coherence is to be understood in suitable categorical terms, as suggested by Carlsson and Mémoli [27, 20].

Figure 1: A rooted tree with labelled leaves as a dendrogram (left) and as an ultra-metric on (right).

2.2 Single Linkage Hierarchical Clustering (SLHC)

2.2.1 Construction of SLHC

Recalling [18], one defines single-linkage hierarchical clustering of a weighted space to be the dendrogram for which if and only if contains points where , , and for . Following [28], SLHC is often implemented by constructing a minimum spanning tree (MST) in : the partition is obtained from any MST of as the set of connected components obtained from by removing all edges of of length exceeding . The corresponding ultra-metric, which we denote by , has equal to the maximum -length among all edges in separating from . It is well known that:

(2)

From this presentation, some properties emerge which characterize single linkage among all operators :

Proposition 2.1.

There exists one and only one operator with such that

(3)

for all , and this operator is SLHC over .

This simple characterization of the single linkage operator, encompassing very general notions of (a) stability; (b) information bottlenecking; and (c) consistency, appears to have gone unnoticed up till now (compare with the categorical characterizations by Carlsson and Mémoli [27]), and serve as our motivation for studying SLHC a target for statistical estimation in the present context.

Proofs of this Proposition and of the remaining technical results regarding the geometry of SLHC presented in this section are given in Appendix 6. To the best of our knowledge, these results are new.

2.2.2 Weighted Spanning Trees

In view of the preceding paragraph, it will be convenient to introduce the space of weighted spanning trees over . First, we denote the set of spanning trees in by , while identifying each spanning tree with its edge set. For a weight , the subset of its MSTs will be denoted by .

Define the space of weighted spanning trees to be the space of all pairs   —  henceforth denoted in this context  —  with and being referred to as a weight on . The total weight of is defined to be . Of course, a weight on naturally restricts to a weight on through when ; we denote the corresponding weighted spanning tree by , by abuse of notation.

The obvious identification(s) of the set of all weights on a fixed tree with yields a topology on with clopen333That is, both closed and open [29]. connected components, one for each , each homeomorphic to .

For a given spanning tree , note that every determines a unique edge path in joining with . We denote this path by . We then observe that the construction of the preceding paragraph makes use of the map sending every to the ultra-metric defined by:

(4)

The continuity of is fairly straightforward. Curiously, it is central to establishing the consistency of SLHC as an estimator (Theorem 4.2), so we include a proof of this fact in this paper for the sake of completeness (Lemma 4.1).

The following elementary result has surprisingly significant impact. We have not been able to locate it or its corollaries in the literature.

Lemma 2.2.

Let be real-valued weights on the edges of the complete graph . Suppose that implies for all . Then .

Corollary 2.3.

Let be a real-valued weight on the edges of the complete graph . If is a strictly increasing function, then is an MST of iff it is an MST of .

2.2.3 The geometry of fibers of SLHC

Estimating the parameter from a measurement of a “ground truth” metric while treating the remaining information regarding as a nuisance parameter requires detailed understanding of the point pre-images of the map , which we develop below (we shall, from now on, use bold symbols to denote the weight/metric/ultra-metric as multi-dimentional parameters in the context of the estimation problem discussed in this paper).

For any weight we will say that is generic, if no two values of coincide, that is: if for all with . It is a textbook result that a generic metric has exactly one minimum spanning tree [30]. This may be restated geometrically as follows:

Lemma 2.4.

For each , define to be the set of all satisfying . Then is the union of the pairwise interiorly-disjoint domains .

Lemma 2.5.

For , the set is a closed pointed convex cone.

Lemma 2.6.

For every we have:

(5)

Each is a finite-sided convex polytope and no two of these polytopes have an interior point in common.

A critical distinction between the case of general weights/dissimilarities and the metric case is that, while the are unbounded polytopes, the polytopes

(6)

are compact, due to the characterization

(7)

where is the tree metric on induced from the weights given by :

(8)

The triangle inequality forces on any metric satisfying . Note that coincides with on edges of , as .

The main implications of the resulting decomposition

(9)

for this work are significant: the single linkage operator is only smooth when restricted to one of the , hence any explicit integration over needs to be decomposed as a sum of integrals over the relevant . Unfortunately, the structure (e.g. set of vertices) of these polytopes is known to be extremely complex, as they share vertices with the set of extreme directions of the metric cone , whose enumeration is a long-standing open problem [31, 32]. This makes obtaining closed-form integration formulae over   —  and marginalizing over in particular  —  a very hard combinatorial problem. In addition, the number of MSTs for a given ultra-metric poses an additional hurdle on the way to solving optimization problems over .

3 Estimation of SLHC

3.1 Statistical model for hierarchical clustering

Distances between data points are assumed to be corrupted by noise from a suitable distribution. One naïve approach to estimating hierarchical clustering is to estimate the true distances between data points and feed them into the given HC algorithm. This is almost certainly not optimal in the highly nonlinear context of HC methods, none of which  —  at least among those satisfying requirement (a) of Proposition 2.1, which is quite natural  —  are one-to-one maps (for example, consider SLHC; is a piecewise-smooth map whose generic fibers have co-dimension in the -dimensional space , as seen from Equations (9) and (7)). Therefore, our statistical model is as follows:

  • The ultra-metric of a hierarchical clustering map is a fixed unknown parameter to be estimated from a measurement of a metric ; thus, is a nuisance parameter taking its values in the point pre-image .

  • The measurement only depends on through a specific probability distribution which we denote by .

  • A reasonable assumption for this noise model is that the measurements of the values , of are sampled independently from the same parametrized distribution , :

    (10)

These assumptions are justified, for example, in the context of WSNs as discussed in Section 1. The input data for clustering is provided in the form of independently measured distances between the sensors, obtained from Received Signal Strengths (RSSs), parametrized by the ground truth distances and corrupted by receiver noise as well as external effects such as multipath or shadowing. These kinds of distortions are often, and justifiably, modeled as noise [33, 34, 35].

Construction of the likelihood from and is actually a problem of eliminating the nuisance parameter . Hierarchical clustering maps are generally not one-to-one, and this is the case for . Because of the uncertainty of relative to , classical likelihood methods for eliminating nuisance parameters (such as profile likelihood) perform worse than the integrated likelihood proposed by Berger [24]. As pointed out there, going back to papers of [36, 37], and as we have observed in our problem, profile likelihood can give rise to “misleading behaviour”. Section 2 of the Berger et.al paper argues in favour of integrated likelihood on the “grounds of simplicity, generality, sensitivity, and precision”. The integrated likelihood is constructed by integrating over according to, in the terms of Berger et al., “the conditional prior density” of given :

(11)

In the current context the support of is restrict to .

3.2 Specializing to Single Linkage

In the case of SLHC, our estimation model may be further decomposed through treating the MST of (which is also an MST of the ultra-metric ) as a discrete nuisance parameter associated with . Generically speaking, has multiple MSTs, so any integration over the domain must decompose according to (9). Consequently, the weight (prior) decomposes as

(12)

which itself may be seen as an integrated likelihood with respect to the discrete parameter .

Substituting (12) into (11), we obtain the likelihood:

(13)

Figure 2 is a schematic for the statistical model for estimating SLHC.

Figure 2: A schematic of the statistical model for estimating SLHC.

3.2.1 “Least-Biased” Distributions and Restricting to

Below we argue that, in the absence of assumptions on the latent parameters , the weight distributions ought to be uniform over their supports. Hence:

(14)

Accordingly,

is uniformly distributed on

, and the should be assigned according to their relative volumes. Unfortunately, requires the computation of integrals over the polytopes (see Section 2.2.3 above). Existing techniques for these computations incur extremely high computational costs [38, 39, 40].

The above assumption is justified by the following maximum entropy argument, the details of which lie outside the scope of this paper and will be detailed elsewhere. It follows from (9) that the weight of is constant for for a fixed . Moreover Corollary 2.3 guarantees for all whenever is any strictly increasing function, so that the quantity

(15)

is a reasonable choice of energy on the space , viewed as a space of micro-states of the ‘particles’ in . In turn, one can show that gives rise to a piecewise-smooth maximum entropy parametric distribution

(16)

on , derived from the constraints

(17)

Clearly by (9), this distribution is constant on for all .

Crucially, the above considerations also speak in favor of our self-imposed restriction to ground truth weights satisfying the triangle inequality. Observe that for each , the level sets , , of the functional only depend on the weights induced on . This implies that, in , these level sets are unbounded. Consequently, since is a function of the energy alone, will not be normalizable under these circumstances. Restricting the possible values of to solves this problem because the triangle inequality guarantees the compactness of level sets. Indeed, by Lemma 2.4, any level set , , is a finite union, over all spanning trees , of the closed subsets whose elements are metrics with and with fixed; by the triangle inequalities for , no value of may exceed the total weight of , implying that the level set is a finite union of closed bounded sets.

3.2.2 Partial Profile Likelihood

The likelihood proposed in (13) incurs a prohibitive computational cost, if precise computation is required. Leaving computationally feasible approximation methods of the full likelihood to future work, here, instead, we propose eliminating the discrete nuisance parameter by forming the following profile likelihood:

(18)

For this notion of likelihood we have the following result:

Proposition 3.1.

Let be the map defined in Section 2.2.2. Then, the maximum likelihood estimate of derived from the profile likelihood of (18) is given by

(19)
Proof.

Equation (18) is the modified likelihood defined in [41], and the proposition follows from the invariance property of the maximum likelihood estimator proved there. ∎

Consider the two complementary projections of denoted and , with the variable split into and accordingly.

Proposition 3.2.

For any , splits as the product of two independent parts:

(20)

where:

(21)
Proof.

See Appendix 7. ∎

The product decomposition above raises the question of how much hierarchical information is conveyed only by the distance measurements along the edges of a chosen tree. One expects SLHC to correspond to the best estimate in this case. Indeed, the main result of this section characterizes SLHC as a maximum partial profile likelihood estimator (MPPLE) of hierarchical clustering for a significant class of measurement models. We define the partial profile likelihood [25] of given a single measurement to be:

(22)

The resulting characterization is as follows:

Theorem 3.3.

Given the measurement model , let where is the MLE of given one measurement . Denote , as the MPPLE, then:

  1. If is strictly increasing and is strictly decreasing, then .

  2. If, in addition, holds for all , then .

This theorem provides evidence that, subject to the above monotonicity requirements of the measurement model, the full maximum likelihood estimator of should display improved performance over that of , which is now characterized as the partial profile likelihood estimator obtained by discarding the measurements that do not correspond to edges on the MST of (save, of course, for order information which was used to obtain the MST). We now prove the theorem.

Proof.

From (21), (22) and since for , the log-likelihood is

(23)

It follows from Proposition 3.1 that , where is the MLE of weighted spanning tree from

(24)

Since the variables are independent, summation in (24) is maximized if and only if each of the summands is maximized. Thus we get the MLE weight of a given spanning tree : for all . Then the maximum likelihood MST is

(25)

Since the function is a strictly decreasing function of , Corollary 2.3 implies that is an MST of if and only if is also a maximum spanning tree of the weight , implying that the total weight, , along this tree is the maximum possible over all spanning trees. So far we have shown that the estimated MST coincides with an MST of the measurement, that is: .

In order for the dendrogram structures obtained from these estimated (weighted) trees through the map to coincide, it suffices that the estimated weights along these trees be ordered in the same way. Indeed, this is true because is a strictly increasing function.

Finally, to prove the second assertion, we observe that for we have . Since is an MST of , this implies that coincides with . ∎

3.2.3 Exponential Families

We now obtain a sufficient condition for Theorem 3.3 to apply in the case that the noise model, is a single parameter exponential family [42] ; that is, we assume

(26)

We assume too, as is customary, that is a function, so that the exponential family (26) can be transformed to canonical form:

(27)

where and are related by , and and, consequently, .

In the continuously differentiable case the MLE of given a measurement is a solution of the equation  (see e.g. [42]). The property yields that the MLE for satisfies , where is the MLE for , and satisfies

(28)

For , we obtain

(29)

using .

According to the hypotheses, is strictly increasing and , so that the structure of is the MPPLE of the dendrogram structure. If, in addition, holds for all , then coincides with the MPPLE of .

3.2.4 Example based on the log-normal distribution

We assume that each

follows a log-normal distribution

with identical variance parameter

. We model the relationship between the measurement and the true metric as follows: we require and for all ,

(30)

This is an exponential family with , , and . In this context,

(31)

Given a measurement , we have which is strictly increasing. It follows that

(32)

For this model, then, the MPPLE is just the SLHC. Note that both conditions of Theorem 3.3 are satisfied in this example irrespective of the value of . In other words, estimating the parameter is redundant for finding the MPPLE.

To illustrate this, we provide a simulation:

  • Step 1. Samples of weights for 5 points are generated from the uniform distribution over the space until one is found that satisfies the triangle inequality.

  • Step 2. For a specific value of , a sample of is generated according to (30).

  • Step 3. The values of , the MPPLE , and the true value of are calculated.

To quantify the error in this simulation, we use the distance (see [43])

(33)

The standard deviation

is decremented from to in steps of . At each value of , Steps 1-3 are repeated 10000 times, and incorrect dendrogram structures and mean errors are calculated. Plots of the proportions of incorrect dendrogram structures and mean errors are given in Figures 4 and 4 respectively. As is readily seen, the results for SLHC and MPPLE are identical.

Ideally in the simulation we would like to sample general metrics from metric cones consistent with our model description and theoretically analysis. However, randomly selecting a large number of samples from a metric cone is computationally unrealistic because of the problem of checking triangle inequalities to determine whether a given random weight is in fact a metric. It would be much easier, for instance, to select samples from a specific metric such as the Euclidean metric or the metric. We have chosen, in this paper, to perform our simulations with a small number of data points (5 actually) because 1), the simulation is only an illustration of our rigorously proved theoretical result, 2) calculation for a significantly larger number of points would be be computationally intensive as we have stated above; 3) a small number of data points enables easier comprehension of the results.

Figure 3: Incorrect structure identification ratio versus .
Figure 4: Mean error of the ultra-metric versus .

4 Consistency of SLHC

In this section we demonstrate the consistency of SLHC as an estimator, that is: we consider the consistency of

(34)

as an estimator of . Prior to the discussion, we require a technical lemma:

Lemma 4.1.

The map defined in Equation (4) is continuous with respect to the standard topology on induced from .

Proof.

First, since the connected components of are open, it suffices to verify the continuity of when restricted to the component corresponding to an arbitrarily chosen tree .

For a fixed , is homeomorphic to the subspace of non-negative vectors in , written as , with the standard topology. For each fixed , the -coordinate of , equal to , is the maximum of a finite collection of continuous real-valued functions of the vector . It follows that is continuous over , as desired. ∎

Consistency is a characteristic of the limiting performance of an estimator as the number of samples tends to infinity. Since SLHC does not process multiple samples at a time, a natural way to take account of information contained in a sequence of measurements of is as follows: for each edge obtain the MLE estimate of from the set to form the weight , and define

(35)
Theorem 4.2.

SLHC is a consistent estimator, that is: converges in probability to .

Proof.

Since we are interested in convergence in probability, we may omit from consideration the set of metrics that are not generic. In particular, has exactly one MST, denoted , by Lemma 2.4. Moreover, we have converging in probability to because MLE is consistent, resulting in all but finitely many of the being generic. Then, without loss of generality, each has exactly one (weighted) MST, denoted . We conclude that converges to in probability. Since is continuous (Lemma 4.1), we may apply the continuous mapping theorem [44] to obtain ; since Proposition 6.1(a) gives , we are done. ∎

A sample simulation: Simulations have been performed to illustrate the consistency property with the distribution model discussed in Section 3.2.4. In the simulation the steps were as follows:

  • Step 1. As Step 1 in Section 3.2.4;

  • Step 2. samples were randomly generated according to (30) for a specific , and was calculated;

  • Step 3. The ultra-metric was calculated and compared to the true value .

In these simulations, was incremented from to in powers of , and Steps 1-3 were repeated 10000 times. Again incorrect dendrogram structure identifications and mean errors were calculated. For four values of , curves were plotted of proportions of incorrect identifications and mean errors in Figures 6 and 6 respectively. In both figures we observe convergence to the true value as .

Figure 5: Incorrect structure identification ratio versus .
Figure 6: Mean error of the ultra-metric versus .

5 Conclusion

Our work here is an effort to address the issue of uncertainty of data surrounding the problem of hierarchical clustering. We restrict attention to the natural case where differences between data are given in terms of a metric but the information about that metric contains uncertainty modelled by a parametrized (by the true metric) family of distributions. The hierarchical clustering approach is described in terms of an ultra-metric associated with the metric. The problem then becomes on of statistical estimation where the parameter to be estimated is an ultra-metric.

Despite its evident shortcomings, SLHC is proposed as the most natural algorithm for clustering in this context, following the work of Carlsson and Memoli, though with a simplification of their treatment. We show also that SLHC is equivalent to maximum partial profile likelihood estimator under reasonable models (see Theorem 3.3) for the uncertainty. SLHC is shown, also, to be a consistent estimator for multiple measurements. The proofs of these results have required a study of the structure of the cone of weights giving rise to a specific minimum spanning tree.

Our work strongly suggests that the use of the full maximum likelihood estimator, if available and computable, will outperform SLHC in contexts where the uncertainty is high. One focus of our future work will be on achieving computationally feasible but good approximations to the MLE.

Acknowledgements

This work was supported by the National Science Fund of China under Grant 61025006 and 61422114; the US Air Force Office of Science Research under Grant MURI FA9550-10-1-0567 and FA9550-12-1-0418.

References

  • [1] Khanafiah D, Situngkir H. Visualizing the phylomemetic tree. Journal of Social Complexity. 2006;2(2):20–30.
  • [2]

    Blanchette G, O’Keefe R, Benuskova L. Inference of a phylogenetic tree: hierarchical clustering versus genetic algorithm. In: Ai 2012: Advances in artificial intelligence. Springer; 2012. p. 300–312.

  • [3] Butte A. The use and analysis of microarray data. Nature reviews drug discovery. 2002;1(12):951–960.
  • [4] Levenstien MA, Yang Y, Ott J. Statistical significance for hierarchical clustering in genetic association and microarray expression studies. BMC bioinformatics. 2003;4(1):62.
  • [5]

    Mahata P, Costa W, Cotta C, Moscato P. Hierarchical clustering, languages and cancer. In: Applications of evolutionary computing. Springer; 2006. p. 67–78.

  • [6] Segarra S, Eisen M, Ribeiro A. Authorship attribution using function words adjacency networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2013. p. 5563–5567.
  • [7] Lung CH, Zhou C, Yang Y. Applying hierarchical agglomerative clustering to wireless sensor networks. In: Proceedings of International Workshop on Theoretical and Algorithmic Aspects of Sensor and Ad-hoc Networks. Citeseer; 2007. p. 97–105.
  • [8] Lung CH, Zhou C. Using hierarchical agglomerative clustering in wireless sensor networks: An energy-efficient and flexible approach. Ad Hoc Networks. 2010;8(3):328–344.
  • [9]

    Karami A. Article: Data clustering for anomaly detection in content-centric networks. International Journal of Computer Applications. 2013 November;81(7):1–8; full text available.

  • [10] Du H, Yang SJ. Discovering collaborative cyber attack patterns using social network analysis. In: Salerno JJ, Yang SJ, Nau DS, Chai SK, editors. SBP; Lecture Notes in Computer Science; Vol. 6589. Springer; 2011. p. 129–136.
  • [11] Wang J, Miller D, Kesidis G. Efficient mining of the multidimensional traffic cluster hierarchy for digesting, visualization, and anomaly identification. IEEE Journal on Selected Areas in Communications. 2006 Oct;24(10):1929–1941.
  • [12] Biggio B, Rieck K, Ariu D, Wressnegger C, Corona I, Giacinto G, Roli F. Poisoning behavioral malware clustering. In: Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop. ACM; 2014. p. 27–36.
  • [13] Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: KDD Workshop on Text Mining; 2000.
  • [14]

    Martínez-Usó A, Pla F, García-Sevilla P. Unsupervised image segmentation using a hierarchical clustering selection process. In: Structural, syntactic, and statistical pattern recognition. Springer; 2006. p. 799–807.

  • [15]

    Castro R, Nowak R. Likelihood based hierarchical clustering and network topology identification. In: Energy minimization methods in computer vision and pattern recognition. Vol. 2683; Springer Berlin Heidelberg; 2003. p. 113–129; Available from:

    $http://dx.doi.org/10.1007/978-3-540-45063-4_8$.
  • [16] Manning CD, Raghavan P, Schütze H, et al. Introduction to information retrieval. Vol. 1. Cambridge university press Cambridge; 2008.
  • [17] Kleinberg J. An impossibility theorem for clustering. Advances in neural information processing systems. 2003;:463–470.
  • [18] Carlsson G, Mémoli F. Characterization, stability and convergence of hierarchical clustering methods. J Mach Learn Res. 2010 Aug;11:1425–1470; Available from: http://dl.acm.org/citation.cfm?id=1756006.1859898.
  • [19] Isbell JR. Six theorems about injective metric spaces. Comment Math Helv. 1964;39:65–76.
  • [20] Carlsson G, Mémoli F. Classifying clustering schemes. arXiv preprint arXiv:10115270. 2010;.
  • [21] Jain AK, Murty MN, Flynn PJ. Data clustering: A review. ACM Comput Surv. 1999 Sep;31(3):264–323; Available from: http://doi.acm.org/10.1145/331499.331504.
  • [22] Gama F, Segarra S, Ribeiro A. Overlapping clustering of network data using cut metrics. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2015.
  • [23] Castro RM, Member S, Coates MJ, Nowak RD. Likelihood based hierarchical clustering. IEEE Trans on Signal Processing. 2004;52:2308–2321.
  • [24] Berger JO, Liseo B, Wolpert RL, et al. Integrated likelihood methods for eliminating nuisance parameters. Statistical Science. 1999;14(1):1–28.
  • [25] Cox DR. Partial likelihood. Biometrika. 1975;62(2):269–276.
  • [26] N Jardine RS. Mathematical taxonomy. New York: John Wiley & Sons; 1971.
  • [27] Carlsson G, Mémoli F. Persistent clustering and a theorem of J. Kleinberg. arXiv preprint arXiv:08082241. 2008;.
  • [28]

    Gower JC, Ross GJS. Minimum spanning trees and single linkage cluster analysis. Applied Statistics. 1969;:54–64.

  • [29] Munkres JR. Topology: a first course. Englewood Cliffs, New Jersey. 1975;.
  • [30] West DB, et al. Introduction to graph theory. 2nd ed; Vol. 2. Prentice hall Upper Saddle River; 2001.
  • [31] Avis D. On the extreme rays of the metric cone. Canad J Math. 1980;32(1):126–144.
  • [32] Deza A, Fukuda K, Pasechnik D, Sato M. On the skeleton of the metric polytope. Springer; 2001.
  • [33] Mao G, Fidan B, Anderson BD. Wireless sensor network localization techniques. Computer networks. 2007;51(10):2529–2553.
  • [34] Cox DC, Murray RR, Norris A. 800-mhz attenuation measured in and around suburban houses. AT&T Bell Laboratories technical journal. 1984;63(6):921–954.
  • [35] Bernhardt R. Macroscopic diversity in frequency reuse radio systems. IEEE Journal on Selected Areas in Communications. 1987;5(5):862–870.
  • [36] Neyman J, Scott EL. Consistent estimates based on partially consistent observations. Econometrica: Journal of the Econometric Society. 1948;:1–32.
  • [37] Cruddas A, Reid N, Cox D. A time series illustration of approximate conditional likelihood. Biometrika. 1989;76(2):231–237.
  • [38] Barvinok AI. Computing the volume, counting integral points, and exponential sums. Discrete & Computational Geometry. 1993;10(1):123–141.
  • [39] Barvinok A, Hartigan J. Maximum entropy Gaussian approximations for the number of integer points and volumes of polytopes. Advances in Applied Mathematics. 2010;45(2):252–289.
  • [40] Lasserre JB, Zeron ES. A Laplace transform algorithm for the volume of a convex polytope. Journal of the ACM (JACM). 2001;48(6):1126–1140.
  • [41] Kay SM. Fundamentals of statistical signal processing: Estimation theory. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.; 1993.
  • [42] Letac G. Lectures on natural exponential families and their variance functions. 50; Conselho Nacional de Desenvolvimento Científico e Tecnológico, Instituto de Matemática Pura e Aplicada; 1992.
  • [43] Boorman SA, Olivier DC. Metrics on spaces of finite trees. Journal of Mathematical Psychology. 1973;10(1):26–59.
  • [44] Billingsley P. Convergence of probability measures. John Wiley & Sons; 2013.

6 Proofs of geometric and combinatorial results

6.1 Proof of Proposition 2.1

Proof.

Equation (2) guarantees that satisfies the above requirements, and it remains to verify the uniqueness claim. Suppose are as stated. Then (a) coincides with the fixpoint sets of and , and for every one has and since . Applying (b) and (c), we obtain:

(36)

By symmetry, for all as well. ∎

6.2 Combinatorics of Minimum Spanning Trees

6.2.1 Proof of Lemma 2.2

In the rest of this section we will make extensive use of Kruskal’s metric on :

(37)

Recall that forms a connected graph under the adjacency defined by if and only if . In fact, the set of minimum spanning trees of is a connected sub-graph for any weight . This observation underlies Kruskal’s algorithm for computing MSTs [30].

Proof.

Suppose there are trees in which are not MSTs of . Then we may pick and so that is as small as possible, yet positive. We proceed to derive a contradiction by applying Kruskal moves as follows.

First, find an edge with . Then contains a cycle of edges, . We must have for all edges : if not, replace any violating edge of with the edge to obtain a tree with either or with and   —  a contradiction in either case.

Thus, by the assumption about and , we have for all . Since cannot contain , we find an edge . Since , the tree obtained from by replacing with is a spanning tree of lower total weight under than   —  a contradiction again. ∎

6.2.2 Proof of Corollary 2.3

Proof.

Setting we observe that . Lemma 2.2 then provides us with the desired equality . ∎

6.3 Cones and the Geometry of SLHC

6.3.1 Proof of Lemma 2.5

Proof.

Lemma 2.5 is a corollary of Proposition 6.1. It suffices to take , and to show that as well. Take any and set . We have:

(38)

as desired. According to Proposition 6.1, . is closed, because it is defined by a finite number of closed conditions. ∎

6.3.2 Proof of Lemma 2.6

Proof.

The proof follows directly from part (c) of Proposition 6.1 and Lemma 2.4. Lemma 2.6 is also a corollary of Proposition 6.1. ∎

6.3.3 Point Pre-Images under SLHC

The preceding proofs depend on the following characterization of point pre-images of the map :

Proposition 6.1.

For all and