1 Introduction
The explosive growth of data sets in recent years is fueling a search for efficient and effective knowledge discovery tools. Cluster analysis
[5, 22, 20] is a nearly ubiquitous tool for unsupervised learning, aimed at discovering unknown groupings within a given set of unlabelled data points (objects, patterns) on the presumption that objects in the same group (cluster) are more similar to each other (intracluster homogeneity) than to objects in other groups (intercluster separability). Amongst the many alternative methods, this paper focuses on dissimilarity-based hierarchical clustering, represented by a tree that indexes a chain of successively more finely nested partitions of the dataset. We are motivated to explore this approach to knowledge discovery because clustering can be imposed on arbitrary data types, and hierarchy can relieve the need to commit a priori to a specific partition cardinality or granularity
[5]. However, we are generally interested in online or reactive problem settings, and in this regard hierarchical clustering suffers a number of long discussed weaknesses that become particularly acute when scaling up to large and, particularly, dynamically changing data sets [5, 20]. Construction of a clustering hierarchy (tree) generally requires at least quadratic time in the number of data points [32]. Moreover, whenever a data set is changed by insertion, deletion or update of a single data point, a clustering tree must generally be reconstructed in its entirety. This paper addresses the problem of anytime online reclustering.
1.1 Contributions of The Paper
We introduce a new homogeneity criterion applicable to an array of agglomerative (“bottom up") clustering methods through a test involving their “linkage function" — the mechanism by which dissimilarity at the level of individual data entries is “lifted" to the level of the clusters to which they are assigned. That criterion motivates a “homogenizing" local adjustment of the nesting relationship between proximal clusters in the hierarchy that increases the degree of similitude within them while increasing the dissimilarity between them. We show that iterated application of this local homogenizing operation transforms any initial cluster hierarchy through a succession of increasingly “better sorted" ones along a path in the abstract space of hierarchies that we prove, for a fixed data set and with respect to a family of linkages including the common single, average and complete cases, must converge in a finite number of steps. In particular, for the single linkage function, we prove convergence from any initial condition of any sequence of arbitrarily chosen local homogenizing reassignments to the generically unique (in the generic case, all pairwise distances of data points are distinct, and this guarantees that single linkage clustering yields a unique tree [15, 17]) globally homogeneous hierarchy that would emerge from application of the standard, one-step “batch" single linkage based agglomerative clustering procedure.
We present evidence to suggest that decentralized algorithms based upon this homogenizing transformation can scale effectively for anytime online hierarchical clustering of large and dynamically changing data sets. Each local homogenizing adjustment entails computation over a proper subset of the entire dataset — and, for some linkages, merely its sufficient statistics (e.g. mean, variance). In these circumstances, given the sufficient statistics of a dataset, such a restructuring decision at a node of a clustering hierarchy can be made in constant time (for further discussion see Section
4.2). Recursively defined (“anytime") algorithms such as this are naturally suited to time-varying data sets that arise when insertions, deletions or updates of a set of data points must be accommodated. Our particular local restructuring method can also cope with time-varying dissimilarity measures or cluster linkage functions, such as might result from the introduction of learning aimed at increasing clustering accuracy [40].
1.2 A Brief Summary of Related Literature
Two common approaches to remediating the limited scaling capacity and static nature of hierarchical clustering methods are data abstraction (summarization) and incremental clustering [5, 23].
Rather than improving the algorithmic complexity of a specific clustering method, data abstraction aims to scale down a large data set with minimum loss of information for efficient clustering. The large literature on data abstraction includes (but is not limited to) such widely used methods as: random sampling (e.g., CLARANS [31]); selection of representative points (e.g., CURE [19], data bubbles [9]); usage of cluster prototypes (e.g., Stream [18]) and sufficient statistics (e.g., BIRCH [41], scalable k-means [8], CluStream [3], data squashing [12]); grid-based quantization [5, 20]; and sparsification of the connectivity or distance matrix (e.g., CHAMELEON [24]).
In contrast, incremental approaches to hierarchical clustering generally target algorithmic improvements for efficient handling of large data sets by processing data in sequence, point by point. Typically, incremental clustering proceeds in two stages: first (i) locate a new data point in the currently available clustering hierarchy, and then (ii) perform a set of restructuring operations (cluster merging, splitting or creation), based on a heuristic criterion, to obtain a better clustering model. Unfortunately, this sequential process generally incurs unwelcome sensitivity to the order of presentation
[5, 23]. Independent of the efficiency and accuracy of our clustering method, the results we report here may be of interest to those seeking insight into the possible spread of outcomes across the combinatorial explosion of different paths through even a fixed data set. Among the many alternatives (e.g., the widely accepted COBWEB [14] or BIRCH [41] procedures), our anytime method most closely resembles the incremental clustering approach of [39], and relies on analogous structural criteria, using similar concepts (“homogeneity" and “monotonicity"). However, a major advantage afforded by our new homogeneity criterion (Definition 4) relative to that of [39] is that there is now no requirement for a minimum spanning tree over the dataset. Beyond ameliorating the computational burden, this relaxation extends the applicability of our method beyond single linkage to a subclass of linkages (Definition 5), a family of cluster distance functions that includes single, complete, average, minimax and Ward’s linkages [22]. Of course, recursive (“anytime") methods can be adapted to address the general setting of time-varying data processing. Beyond the specifics of the data insertion problem handled by incremental clustering methods, we aim for reactive algorithms suited to a range of dynamic settings, including data insertion, deletion, update or, perhaps, a processing-induced nonstationarity such as a time-varying dissimilarity measure or linkage function [1]. Hence, as described in the previous section, we propose a partially decentralized, recursive method: a local cluster restructuring operation yielding a discrete dynamical system in the abstract space of trees guaranteed to improve the hierarchy at each step (relative to a fixed dataset) and to terminate in a homogeneous cluster hierarchy from any (perhaps even random) initial such structure.
1.3 Organization of The Paper
Section 2 introduces notation and offers a brief summary of the essential background. Section 3 presents our homogeneity criterion and establishes some of its properties. Section 4 introduces a simple anytime hierarchical clustering method that seeks successively to “homogenize" local clusters according to this criterion. We analyze the termination and complexity properties of the method and then illustrate its algorithmic implications by applying it to the specific problem of incremental clustering. Section 5 presents an experimental evaluation of the anytime hierarchical clustering method using both synthetic and real datasets. We conclude with a brief discussion of future work in Section 6.
2 Background & Notation
2.1 Datasets, Patterns, and Statistics
We consider data points (patterns, observations) in $\mathbb{R}^d$ with a dissimilarity measure $\delta : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ (a symmetric, nonnegative and reflexive function, i.e. $\delta(x, y) = \delta(y, x)$, $\delta(x, y) \geq 0$ and $\delta(x, x) = 0$ for all $x, y \in \mathbb{R}^d$), where $d$ is the dimension of the space containing the dataset and $\mathbb{R}_{\geq 0}$ denotes the set of nonnegative real numbers. Note that $\delta$ need not necessarily be a metric (a dissimilarity is a metric if it additionally satisfies strong reflexivity and the triangle inequality, i.e. $\delta(x, y) = 0 \Leftrightarrow x = y$ and $\delta(x, z) \leq \delta(x, y) + \delta(y, z)$ for all $x, y, z \in \mathbb{R}^d$), and our results can be easily generalized to qualitative data as well, once some dissimilarity ordering has been defined.
Let $P = \{p_i\}_{i \in I} \subset \mathbb{R}^d$ be a set of data points bijectively labelled by a fixed finite index set $I$, say $I = \{1, \dots, N\}$, and let $P|_A = \{p_i\}_{i \in A}$ denote the partial set of observations associated with subset $A \subseteq I$, whose centroid and variance, respectively, are

(1) $c(A) = \frac{1}{|A|} \sum_{i \in A} p_i,$

(2) $\sigma^2(A) = \frac{1}{|A|} \sum_{i \in A} \left\| p_i - c(A) \right\|^2,$
where $|A|$ and $\| \cdot \|$ denote the cardinality of set $A$ and the standard Euclidean norm, respectively. Throughout the sequel the term “sufficient cluster statistics" denotes the cardinality, $|A|$, mean (1) and variance (2) of each cluster $A$ in a hierarchy [8].
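To make the role of these statistics concrete, the centroid (1) and variance (2) of a merged cluster can be recovered from the statistics of its parts, without revisiting the raw data. The sketch below is our own illustrative Python (names `cluster_stats` and `merge_stats` are ours, not the paper's):

```python
def cluster_stats(points):
    """Sufficient cluster statistics: cardinality, centroid (1), variance (2)."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    var = sum(sum((p[k] - mean[k]) ** 2 for k in range(dim)) for p in points) / n
    return n, mean, var

def merge_stats(s1, s2):
    """Combine the sufficient statistics of two disjoint clusters in O(1)."""
    (n1, m1, v1), (n2, m2, v2) = s1, s2
    n = n1 + n2
    mean = [(n1 * a + n2 * b) / n for a, b in zip(m1, m2)]
    def shift(m):  # squared distance from a part's centroid to the merged centroid
        return sum((a - b) ** 2 for a, b in zip(m, mean))
    var = (n1 * (v1 + shift(m1)) + n2 * (v2 + shift(m2))) / n
    return n, mean, var
```

The constant-time merge rule in `merge_stats` is what allows restructuring decisions to be made from sufficient statistics alone, a point revisited in Section 4.2.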
2.2 Hierarchies
A rooted semi-labelled tree $\tau$ over a fixed finite index set $I$, illustrated in Figure 1, is a directed acyclic graph $G_\tau = (V_\tau, E_\tau)$ whose leaves, vertices of degree one, are bijectively labelled by $I$, whose interior vertices all have out-degree at least two, and all of whose edges in $E_\tau$ are directed away from a vertex designated to be the root [7]. A rooted semi-labelled tree uniquely determines (and henceforth will be interchangeably used with) a cluster hierarchy [27]. By definition, all vertices of $\tau$ can be reached from the root through a directed path in $\tau$. The cluster $C(v)$ of a vertex $v \in V_\tau$ is defined to be the set of leaves reachable from $v$ by a directed path in $\tau$. Accordingly, the cluster set of $\tau$ is defined to be the set of all its vertex clusters,

(3) $\mathcal{C}(\tau) = \left\{ C(v) \mid v \in V_\tau \right\} \subseteq 2^I,$

where $2^I$ denotes the power set of $I$.
For every cluster $A \in \mathcal{C}(\tau)$ we recall the standard notions of the parent (cluster) $\mathrm{Pr}(A, \tau)$ and the list of children $\mathrm{Ch}(A, \tau)$ of $A$ in $\tau$, illustrated in Figure 1. For the trivial case we set $\mathrm{Pr}(I, \tau) = \emptyset$. Additionally, we find it useful to define the local complement (sibling) of cluster $A$ as $A^{-\tau} = \mathrm{Pr}(A, \tau) \setminus A$, not to be confused with the standard (global) complement $A^c = I \setminus A$. Further, a grandchild in $\tau$ is a cluster having a grandparent in $\tau$. We denote the set of all grandchildren in $\tau$ by $\mathcal{G}(\tau)$, the maximal subset of $\mathcal{C}(\tau)$ excluding the root $I$ and its children $\mathrm{Ch}(I, \tau)$,

(4) $\mathcal{G}(\tau) = \mathcal{C}(\tau) \setminus \left( \{I\} \cup \mathrm{Ch}(I, \tau) \right).$
A rooted tree with all interior vertices of out-degree two is said to be binary or, equivalently, nondegenerate, and all other trees are said to be degenerate. In this paper $\mathcal{BT}_I$ denotes the set of rooted nondegenerate trees over leaf set $I$. Note that the number of hierarchies in $\mathcal{BT}_I$ grows super-exponentially [7],

(5) $\left| \mathcal{BT}_I \right| = \frac{(2N-2)!}{2^{N-1} (N-1)!} = (2N-3)!!$

for $N = |I| \geq 2$, quickly precluding the possibility of exhaustive search for the “best" hierarchical clustering model in even modest problem settings.
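For a sense of scale, the count (5) is the double factorial $(2N-3)!!$ of rooted binary trees with $N$ labelled leaves. A small sketch (ours, assuming that standard count):

```python
def num_binary_hierarchies(n):
    """Number of rooted nondegenerate hierarchies over n labelled leaves,
    (2n - 3)!! = 3 * 5 * ... * (2n - 3), illustrating the growth in (5)."""
    if n < 2:
        return 1
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count
```

Already for ten data points there are over thirty million candidate hierarchies, which is why exhaustive search is ruled out.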
2.3 Nearest Neighbor Interchange (NNI) Moves
Different notions of the neighborhood of a nondegenerate hierarchy in $\mathcal{BT}_I$ can be imposed by recourse to different tree restructuring operations [13] (or moves). NNI moves are particularly important for our setting because of their close relation with cluster hierarchy homogeneity (Definition 4) and their role in the anytime procedure introduced in Section 4.
A convenient restatement of the standard definition of NNI walks [33, 28] for rooted trees, illustrated in Figure 2, is:
Definition 1
The Nearest Neighbor Interchange (NNI) move at a grandchild $A \in \mathcal{G}(\sigma)$ on a binary hierarchy $\sigma \in \mathcal{BT}_I$ swaps cluster $A$ with its parent’s sibling $\mathrm{Pr}(A, \sigma)^{-\sigma}$ to yield another binary hierarchy $\tau \in \mathcal{BT}_I$.
We say that $\sigma, \tau \in \mathcal{BT}_I$ are NNI-adjacent if and only if one can be obtained from the other by a single NNI move.
More precisely, $\tau$ is the result of performing the NNI move at grandchild $A$ on $\sigma$ if

(6) $\mathcal{C}(\tau) = \left( \mathcal{C}(\sigma) \setminus \left\{ A \cup A^{-\sigma} \right\} \right) \cup \left\{ A^{-\sigma} \cup \mathrm{Pr}(A, \sigma)^{-\sigma} \right\}.$
Throughout the sequel we will denote the map of $\mathcal{BT}_I$ into itself defined by the NNI move at a grandchild cluster $A$ of a tree $\sigma$ as $\mathrm{NNI}_A(\sigma)$.
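To make the move concrete, here is an illustrative sketch (ours) of an NNI move on hierarchies encoded as nested tuples, with leaves as labels; this encoding is our assumption for illustration, not the paper's representation:

```python
def nni(tree, g):
    """NNI move of Definition 1: swap grandchild g with its parent's sibling.
    `tree` is a nested pair (leaves are labels); returns a new tree."""
    if not isinstance(tree, tuple):
        return tree
    a, b = tree
    for parent, uncle in ((a, b), (b, a)):
        # g is a grandchild here iff it is a direct child of `parent`
        if isinstance(parent, tuple) and g in parent:
            sibling = parent[1] if parent[0] == g else parent[0]
            return ((sibling, uncle), g)  # g swapped with its parent's sibling
    # g lies deeper (or not at all) in this subtree: recurse
    return (nni(a, g), nni(b, g))
```

For example, on the hierarchy ((1, 2), 3), the NNI move at grandchild 1 swaps it with its parent's sibling 3, yielding ((2, 3), 1).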
A useful observation illustrating the structural difference between NNI-adjacent hierarchies is:
Lemma 1
[4]
An ordered pair of hierarchies $(\sigma, \tau) \in \mathcal{BT}_I \times \mathcal{BT}_I$
is NNI-adjacent if and only if there exists one and only one ordered triplet $(A, B, C)$ of common clusters of $\sigma$ and $\tau$ such that $A \cup B \in \mathcal{C}(\sigma)$ and $B \cup C \in \mathcal{C}(\tau)$. We call $(A, B, C)$ the “NNI-triplet" of $(\sigma, \tau)$.
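Since NNI-adjacent hierarchies share all but one cluster on each side, the triplet can be read off from the two cluster sets. A sketch (ours, again over the nested-tuple encoding we assume for illustration):

```python
def clusters(t):
    """Cluster set (3) of a nested-tuple hierarchy (leaves are labels)."""
    if not isinstance(t, tuple):
        return {frozenset([t])}
    left, right = clusters(t[0]), clusters(t[1])
    # the largest cluster on each side is that subtree's own (root) cluster
    return left | right | {max(left, key=len) | max(right, key=len)}

def nni_triplet(sigma, tau):
    """NNI-triplet (A, B, C) of an NNI-adjacent ordered pair, else None."""
    cs, ct = clusters(sigma), clusters(tau)
    only_s, only_t = cs - ct, ct - cs
    if len(only_s) != 1 or len(only_t) != 1:
        return None  # the pair is not NNI-adjacent
    ab, bc = next(iter(only_s)), next(iter(only_t))
    b = ab & bc
    return (ab - b, b, bc - b)
```

On the adjacent pair ((1, 2), 3) and ((2, 3), 1) this recovers A = {1}, B = {2}, C = {3}, matching the lemma.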
2.4 Hierarchical Agglomerative Clustering
Given a choice of linkage, $\ell$, Table 1 formalizes the associated Hierarchical Agglomerative Clustering method [22]. This method yields a sequence of nested partitions $\mathcal{P}_0, \mathcal{P}_1, \dots, \mathcal{P}_{N-1}$ of the dataset that can be represented by a tree with its root at the coarsest, single-cluster partition, its leaves at the most refined trivial partition (comprising all singleton sets), and a vertex for each subset appearing in any of the successively coarsened partitions as the mergers of Table 1 are imposed. Because only two clusters are merged at each step, the resulting sequence of nested partitions defines a binary tree $\tau \in \mathcal{BT}_I$, whose nodes represent the clusters

(7) $\mathcal{C}(\tau) = \bigcup_{k=0}^{N-1} \mathcal{P}_k,$

and whose edges represent the nesting relation, again as presented in Table 1. Hence, the set of grandchildren clusters (4) of $\tau$ is given by

(8) $\mathcal{G}(\tau) = \mathcal{C}(\tau) \setminus \left( \mathcal{P}_{N-1} \cup \mathcal{P}_{N-2} \right).$
From this discussion it is clear that Table 1 defines a relation from datasets to trees. Note that this relation is in general not a function, since there may well be more than one pair of clusters satisfying (9a) at any stage. It is, however, a multifunction: in other words, while agglomerative clustering of a dataset always yields some tree, that tree is not necessarily unique to that dataset.
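The procedure of Table 1 can be sketched in a deliberately naive form (ours; quadratic in space and cubic in time, purely for illustration): start from singletons and repeatedly merge the pair of clusters with minimum linkage value.

```python
def agglomerate(points, linkage):
    """Naive sketch of Table 1: repeatedly merge the closest pair of clusters."""
    clusters = [frozenset([i]) for i in range(len(points))]
    tree = set(clusters)  # the cluster set (3) of the resulting hierarchy
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], points))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        tree.add(merged)
    return tree

def single_linkage(A, B, points):
    """Single linkage (10a) for 1-D points with absolute difference."""
    return min(abs(points[a] - points[b]) for a in A for b in B)
```

For the 1-D dataset (0, 1, 5), single linkage first merges the two nearby points and only then absorbs the distant one, so the cluster {0, 1} appears in the hierarchy while {0, 2} does not.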
Table 1: Hierarchical Agglomerative Clustering. For any given set of data points $P$ and linkage function $\ell$ (the linkage between any partial observations and the empty set is always defined to be zero, i.e. $\ell(A, \emptyset, P) = 0$ for all $A \subseteq I$), the procedure initializes with the partition of $P$ into singletons and repeatedly merges a pair of clusters with the minimum linkage value (9a) until only a single cluster remains.
2.4.1 Linkages
A linkage, $\ell$, uses the dissimilarity of observations in the partial datasets, $P|_A$ and $P|_B$, to define a dissimilarity between the clusters, $A$ and $B$ [2]. Some common examples are

(10a) $\ell_s(A, B, P) = \min_{i \in A, j \in B} \delta(p_i, p_j),$

(10b) $\ell_c(A, B, P) = \max_{i \in A, j \in B} \delta(p_i, p_j),$

(10c) $\ell_a(A, B, P) = \frac{1}{|A| |B|} \sum_{i \in A} \sum_{j \in B} \delta(p_i, p_j),$

(10d) $\ell_m(A, B, P) = \min_{i \in A \cup B} \max_{j \in A \cup B} \delta(p_i, p_j),$

(10e) $\ell_w(A, B, P) = \frac{|A| |B|}{|A| + |B|} \left\| c(A) - c(B) \right\|^2,$

for single, complete, average, minimax and Ward’s linkages, respectively [22, 6], where $\delta$ and $\| \cdot \|$ denote a dissimilarity measure in $\mathbb{R}^d$ and the standard Euclidean norm, respectively.
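The first four linkages can be evaluated directly from a pairwise dissimilarity matrix. A sketch (ours) over a matrix `D` indexed by data labels:

```python
def single(A, B, D):
    """(10a): dissimilarity of the closest pair across the two clusters."""
    return min(D[i][j] for i in A for j in B)

def complete(A, B, D):
    """(10b): dissimilarity of the farthest pair across the two clusters."""
    return max(D[i][j] for i in A for j in B)

def average(A, B, D):
    """(10c): mean pairwise dissimilarity between the two clusters."""
    return sum(D[i][j] for i in A for j in B) / (len(A) * len(B))

def minimax(A, B, D):
    """(10d): smallest covering radius of the merged cluster A ∪ B."""
    merged = list(A) + list(B)
    return min(max(D[i][j] for j in merged) for i in merged)
```

Note that the minimax value is a property of the merged cluster: it is the radius of the best "center" point in A ∪ B.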
A common way of characterizing linkages is through their behaviour after merging a set of clusters. For any pairwise disjoint subsets $A_1, A_2, B$ of $I$ and dataset $P$, the linkage relation between the partial observations $P|_{A_1 \cup A_2}$ and $P|_B$ after merging $A_1$ and $A_2$ is generally described by the recurrence formula of Lance and Williams [25],

(11) $\ell(A_1 \cup A_2, B, P) = \alpha_1 \, \ell(A_1, B, P) + \alpha_2 \, \ell(A_2, B, P) + \beta \, \ell(A_1, A_2, P) + \gamma \left| \ell(A_1, B, P) - \ell(A_2, B, P) \right|,$
where $\ell$ is a linkage function and $\alpha_1, \alpha_2, \beta, \gamma \in \mathbb{R}$. Table 2 lists the coefficients of (11) for some common linkages in (10). Although the minimax linkage (10d) cannot be written in the form of the recurrence formula [6], like many of the other linkage functions above it satisfies

(12) $\ell(A_1 \cup A_2, B, P) \geq \min \left( \ell(A_1, B, P), \ell(A_2, B, P) \right),$

which is known as the strong reducibility property, defined in the following paragraph.
Table 2: Lance–Williams coefficients (11) for some common linkages.

Linkage    $\alpha_1$    $\alpha_2$    $\beta$    $\gamma$
Single    $\frac{1}{2}$    $\frac{1}{2}$    $0$    $-\frac{1}{2}$
Complete    $\frac{1}{2}$    $\frac{1}{2}$    $0$    $\frac{1}{2}$
Average    $\frac{|A_1|}{|A_1| + |A_2|}$    $\frac{|A_2|}{|A_1| + |A_2|}$    $0$    $0$
Ward    $\frac{|A_1| + |B|}{|A_1| + |A_2| + |B|}$    $\frac{|A_2| + |B|}{|A_1| + |A_2| + |B|}$    $\frac{-|B|}{|A_1| + |A_2| + |B|}$    $0$
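The recurrence (11) can be checked numerically: with the single linkage coefficients it reproduces the minimum of the two pairwise values, and with the complete linkage coefficients the maximum. A sketch (ours):

```python
def lance_williams(d_a1b, d_a2b, d_a1a2, alpha1, alpha2, beta, gamma):
    """Recurrence (11): linkage between A1 ∪ A2 and B from pairwise values."""
    return (alpha1 * d_a1b + alpha2 * d_a2b
            + beta * d_a1a2 + gamma * abs(d_a1b - d_a2b))
```

With $\alpha_1 = \alpha_2 = \tfrac{1}{2}$ and $\gamma = -\tfrac{1}{2}$ the update equals $\min(d_{A_1 B}, d_{A_2 B})$; flipping the sign of $\gamma$ gives the maximum; the average linkage row gives the size-weighted mean.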
2.4.2 Reducibility & Monotonicity
Definition 2 ([10, 29])
For a fixed finite index set $I$, a linkage function $\ell$ is said to be reducible if for any pairwise disjoint subsets $A_1, A_2, B$ of $I$ and set of data points $P$

(13a) $\ell(A_1, A_2, P) \leq \min \left( \ell(A_1, B, P), \ell(A_2, B, P) \right)$

implies

(13b) $\ell(A_1 \cup A_2, B, P) \geq \min \left( \ell(A_1, B, P), \ell(A_2, B, P) \right).$

Further, $\ell$ is said to be strongly reducible (although [16] refers to strong reducibility of linkages simply as the reducibility property, by definition strong reducibility is more restrictive than reducibility) if for any pairwise disjoint subsets $A_1, A_2, B$ of $I$ it satisfies

(14) $\ell(A_1 \cup A_2, B, P) \geq \min \left( \ell(A_1, B, P), \ell(A_2, B, P) \right).$
The well-known examples of linkages with the strong reducibility property are the single, complete, average and minimax linkages in (10) [29, 6]. Even though Ward’s linkage is not strongly reducible, it still has the reducibility property.
A property of clustering hierarchies (of great importance in the sequel) consequent upon the reducibility property of linkages is monotonicity:
Definition 3 ([22])
A nondegenerate hierarchy $\tau \in \mathcal{BT}_I$ associated with a set of data points $P$ is said to be monotone if all grandchildren, $A \in \mathcal{G}(\tau)$, are more similar to their siblings, $A^{-\tau}$, than their parents, $\mathrm{Pr}(A, \tau)$, are to theirs, i.e.

(15) $\ell \left( A, A^{-\tau}, P \right) \leq \ell \left( \mathrm{Pr}(A, \tau), \mathrm{Pr}(A, \tau)^{-\tau}, P \right).$
3 Homogeneity
We now introduce our new notion of homogeneity and explore its relationships to previously developed structural properties of trees.
Definition 4
(Homogeneity) A binary hierarchy $\tau \in \mathcal{BT}_I$ associated with a set of data points $P$ is locally homogeneous at grandchild cluster $A \in \mathcal{G}(\tau)$ if the siblings, $A$ and $A^{-\tau}$, are closer to each other than to their parent’s sibling, $\mathrm{Pr}(A, \tau)^{-\tau}$,

(16) $\ell \left( A, A^{-\tau}, P \right) \leq \ell \left( A, \mathrm{Pr}(A, \tau)^{-\tau}, P \right).$

A tree is homogeneous if it is locally homogeneous at each grandchild.
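The criterion is a purely local comparison, so it can be tested by one sweep over the tree. The sketch below (ours, over the nested-tuple encoding we assume for illustration) tests both children of every non-root interior vertex against their parent's sibling, which amounts to requiring local homogeneity at each grandchild:

```python
def _leaves(t):
    return [t] if not isinstance(t, tuple) else _leaves(t[0]) + _leaves(t[1])

def is_homogeneous(tree, linkage, D):
    """Check condition (16) at every grandchild of a nested-tuple hierarchy."""
    def check(node, uncle):
        if not isinstance(node, tuple):
            return True
        a, b = node
        if uncle is not None:  # a and b are grandchildren; uncle = parent's sibling
            la, lb, lu = _leaves(a), _leaves(b), _leaves(uncle)
            if linkage(la, lb, D) > min(linkage(la, lu, D), linkage(lb, lu, D)):
                return False
        # for each child's own children, the other child is the parent's sibling
        return check(a, b) and check(b, a)
    return check(tree, None)

def single_linkage(A, B, D):
    """Single linkage (10a) over a dissimilarity matrix D."""
    return min(D[i][j] for i in A for j in B)
```

On three points with labels 0 and 1 close and 2 far, the hierarchy ((0, 1), 2) is homogeneous under single linkage, while ((0, 2), 1) is not.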
A useful observation when we focus attention on reducible linkages is:
Proposition 2
If a tree $\tau \in \mathcal{BT}_I$, associated with a set of data points $P$, is homogeneous for a reducible linkage $\ell$, then it must be monotone as well.
The result directly follows from the homogeneity of $\tau$ and the reducibility of $\ell$.
For any grandchild cluster $A \in \mathcal{G}(\tau)$ and its parent $\mathrm{Pr}(A, \tau)$, using (13) and (16), one can verify the result as

(17) $\ell \left( A, A^{-\tau}, P \right) \leq \min \left( \ell(A, B, P), \ell \left( A^{-\tau}, B, P \right) \right) \leq \ell \left( A \cup A^{-\tau}, B, P \right) = \ell \left( \mathrm{Pr}(A, \tau), \mathrm{Pr}(A, \tau)^{-\tau}, P \right),$

where $B = \mathrm{Pr}(A, \tau)^{-\tau}$.
The converse of Proposition 2 only holds for single linkage:
Proposition 3
A clustering hierarchy $\tau \in \mathcal{BT}_I$ associated with a set of data points $P$ is monotone for single linkage (10a) if and only if it is homogeneous as well.
The sufficiency of homogeneity of a clustering tree for its monotonicity directly follows from Proposition 2.
The other direction of implication is evident from the definitions of monotonicity (Definition 3) and single linkage (10a): for any $A \in \mathcal{G}(\tau)$,

(18) $\ell_s \left( A, A^{-\tau}, P \right) \leq \ell_s \left( \mathrm{Pr}(A, \tau), B, P \right)$

(19) $= \min \left( \ell_s(A, B, P), \ell_s \left( A^{-\tau}, B, P \right) \right) \leq \ell_s(A, B, P),$

where $B = \mathrm{Pr}(A, \tau)^{-\tau}$, and the equality holds because the single linkage between a merged cluster and another cluster is the minimum of the individual single linkages.
A major significance of homogeneity is that it is a common characteristic feature of any clustering hierarchy resulting from agglomerative clustering using any strongly reducible linkage:
Proposition 4
If linkage $\ell$ is strongly reducible, then any nondegenerate hierarchy in the image of the relation defined by Table 1 (i.e. resulting from the procedure of Table 1 applied to some dataset $P$) is homogeneous.
Let $\mathcal{P}_0, \mathcal{P}_1, \dots, \mathcal{P}_{N-1}$ be a sequence of nested partitions of $I$, defining $\tau$ as in (7), resulting from agglomerative clustering of $P$. Further, for $0 \leq k \leq N-2$ let $(A_k, B_k)$ be a pair of clusters of $\mathcal{P}_k$ in (9a) with the minimum linkage value.
For any $k$ and any (grandchild) cluster $D \in \mathcal{P}_k \setminus \{A_k, B_k\}$, from (9a), we have

(20) $\ell(A_k, B_k, P) \leq \ell(A_k, D, P),$

(21) $\ell(A_k, B_k, P) \leq \ell(B_k, D, P).$

Now, observe that the parent’s sibling $C$ of $A_k$ and $B_k$ in $\tau$ can be written as the union of elements of a subset $\{D_1, \dots, D_m\}$ of $\mathcal{P}_k \setminus \{A_k, B_k\}$,

(22) $C = D_1 \cup D_2 \cup \dots \cup D_m.$

That is to say, the elements of $\{D_1, \dots, D_m\}$ are merged, in a way described by the sequence of nested partitions of $I$, such that their union finally yields $C$.
Hence, using the strong reducibility (14) of $\ell$ and (20), one can verify that

(23) $\ell(A_k, C, P) = \ell(A_k, D_1 \cup D_2 \cup \dots \cup D_m, P)$

(24) $\geq \min_{1 \leq j \leq m} \ell(A_k, D_j, P)$

(25) $\geq \ell(A_k, B_k, P),$

which, by symmetry, also holds for $B_k$,

(26) $\ell(B_k, C, P) \geq \ell(A_k, B_k, P).$

Thus, since every grandchild of $\tau$ arises as such a merged cluster (8), the result follows.
In particular, a critical observation for single linkage is:
Theorem 1
A clustering hierarchy $\tau \in \mathcal{BT}_I$ associated with a set of data points $P$ is homogeneous for single linkage (10a) if and only if it is a possible outcome of agglomerative single linkage clustering (Table 1) of $P$.
The sufficiency of being a single linkage clustering hierarchy for homogeneity is evident from Proposition 4.
To see the necessity of homogeneity, we will first prove that if $\tau$ is homogeneous, then for any $A \in \mathcal{C}(\tau) \setminus \{I\}$ and nonempty subset $D \subseteq A^c$ the following holds:

(27) $\ell_s(A, D, P) \geq \ell_s \left( A, A^{-\tau}, P \right).$

Observe that (27) states that the cost of merging either one of $A$ and $A^{-\tau}$ with any other cluster is greater than or equal to the cost of merging $A$ and $A^{-\tau}$. Then, by induction, we conclude that $\tau$ is a possible outcome of agglomerative single linkage clustering of $P$.
Let $\mathrm{Anc}(A, \tau)$ denote the set of ancestors of cluster $A$ of $\tau$, except the root $I$,

(28) $\mathrm{Anc}(A, \tau) = \left\{ B \in \mathcal{C}(\tau) \setminus \{I\} \mid A \subsetneq B \right\}.$

Using the definition of single linkage (10a) and the monotonicity of $\tau$ (which follows from its homogeneity by Proposition 3), one can verify that for any cluster $A$ and any ancestor $B \in \mathrm{Anc}(A, \tau)$

(29) $\ell_s \left( A, B^{-\tau}, P \right) \geq \ell_s \left( B, B^{-\tau}, P \right) \geq \ell_s \left( A, A^{-\tau}, P \right).$

Now observe that the global complement of $A$ can be written as

(30) $A^c = \bigcup_{B \in \{A\} \cup \mathrm{Anc}(A, \tau)} B^{-\tau}.$

As a result, combining (29) and (30) yields, for any nonempty $D \subseteq A^c$,

(31) $\ell_s(A, D, P) \geq \min_{B \in \{A\} \cup \mathrm{Anc}(A, \tau)} \ell_s \left( A, B^{-\tau}, P \right)$

(32) $\geq \min_{B \in \{A\} \cup \mathrm{Anc}(A, \tau)} \ell_s \left( B, B^{-\tau}, P \right)$

(33) $\geq \ell_s \left( A, A^{-\tau}, P \right),$

from which one can conclude (27) for single linkage $\ell_s$.
Finally, using a proof by induction over the merging order, the result of the theorem can be shown as follows:
(Base) The singleton clusters of $\tau$ are available at initialization as the finest partition.
(Induction) Otherwise, suppose that $A$ and $A^{-\tau}$ have already been constructed, since their children also satisfy (27) and, by monotonicity, the children of $A$ merge at no greater cost than $\ell_s(A, A^{-\tau}, P)$, as do the children of $A^{-\tau}$. Thus, since clusters $A$ and $A^{-\tau}$ satisfy (27), they can be directly merged when the merging cost, i.e. the value of the minimum cluster distance in (9a), reaches $\ell_s(A, A^{-\tau}, P)$.
4 Anytime Hierarchical Clustering
Given a choice of linkage, $\ell$, Table 3 presents the formal specification of our central contribution, the associated Anytime Hierarchical Clustering method. Once again, this method defines a new relation from datasets to hierarchies that is generally not a function but rather a multifunction (i.e. every dataset yields some hierarchy, but not necessarily a unique one).
Table 3: Anytime Hierarchical Clustering. For any given clustering hierarchy $\tau \in \mathcal{BT}_I$ associated with a set of data points $P$, and linkage function $\ell$, the procedure selects a grandchild at which local homogeneity (16) is violated, if any, and performs the corresponding NNI move.
Because the procedure defining the relation in Table 3 does not entail any obvious gradient-like greedy step, as do many previously proposed iterative clustering methods, demonstrating that it terminates requires some analysis, which we now present.
4.1 Proof of Convergence
For any nondegenerate hierarchy $\tau \in \mathcal{BT}_I$ associated with a set of data points $P$ and a linkage function $\ell$, we consider the sum of linkage values between sibling clusters as an objective function to assess the quality of clustering,

(36) $\mathcal{L}(\tau, P) = \sum_{\{A, A^{-\tau}\} \subseteq \mathcal{C}(\tau)} \ell \left( A, A^{-\tau}, P \right),$

where the sum ranges over the unordered sibling pairs of $\tau$ (equivalently, over its interior vertices).
Intuitively, one might expect that hierarchical agglomerative clustering methods yield clustering hierarchies minimizing (36). However, they are generally known to be stepwise optimal greedy methods [16], with the exception that single linkage clustering always returns a globally optimal clustering tree in the sense of (36), due to its close relation with a minimum spanning tree of the data set [17]. In contrast, for example, as witness to the general suboptimality of agglomerative clustering relative to (36), $\mathcal{L}(\tau, P)$ for Ward’s linkage (10e) is constant and equal to the sum of squared error of $P$ (see Appendix A), i.e. for any $\tau \in \mathcal{BT}_I$

(37) $\mathcal{L}(\tau, P) = \sum_{i \in I} \left\| p_i - c(I) \right\|^2,$

where $c(I)$ (1) denotes the centroid of $P$.
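One natural reading of the objective, summing the linkage between the two children of every interior vertex, can be sketched directly (ours, over the nested-tuple encoding we assume for illustration); note how a homogenizing restructuring lowers it under single linkage:

```python
def _leaves(t):
    return [t] if not isinstance(t, tuple) else _leaves(t[0]) + _leaves(t[1])

def single_linkage(A, B, D):
    """Single linkage (10a) over a dissimilarity matrix D."""
    return min(D[i][j] for i in A for j in B)

def objective(tree, linkage, D):
    """Sum of linkage values between the two children of every interior
    vertex of a nested-tuple hierarchy -- one reading of (36)."""
    if not isinstance(tree, tuple):
        return 0.0
    a, b = tree
    return (linkage(_leaves(a), _leaves(b), D)
            + objective(a, linkage, D) + objective(b, linkage, D))
```

With points 0 and 1 close and point 2 far, the poorly sorted hierarchy ((0, 2), 1) scores higher than the homogeneous hierarchy ((0, 1), 2) reached by one NNI move.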
Let $(\sigma, \tau)$ be a pair of NNI-adjacent (Definition 1) hierarchies in $\mathcal{BT}_I$ and $(A, B, C)$ be the NNI-triplet (Lemma 1) of common clusters of $\sigma$ and $\tau$. Recall that $A \cup B$ and $B \cup C$ are the only unshared clusters of $\sigma$ and $\tau$, respectively. Hence, one can write the change in the objective function (36) after the NNI transition from $\sigma$ to $\tau$ as

(38) $\mathcal{L}(\tau, P) - \mathcal{L}(\sigma, P) = \ell(B, C, P) + \ell(A, B \cup C, P) - \ell(A, B, P) - \ell(A \cup B, C, P).$
Here we find it useful to define a new class of linkages:
Definition 5
A linkage $\ell$ is NNI-reducible if for any set of data points $P$ and pairwise disjoint subsets $A, B, C$ of $I$,

(39a) $\ell(B, C, P) \leq \min \left( \ell(A, B, P), \ell(A, C, P) \right)$

implies

(39b) $\ell(B, C, P) + \ell(A, B \cup C, P) \leq \ell(A, B, P) + \ell(A \cup B, C, P).$
Using (10), (11) and Table 2, one can verify that the single, complete, minimax and Ward’s linkages are examples of NNI-reducible linkages. Note that a reducible linkage is not necessarily NNI-reducible; for instance, average linkage (10c) is reducible but not NNI-reducible.
We now proceed to investigate the termination of anytime hierarchical clustering for NNIreducible linkages:
Lemma 2
If $\ell$ is an NNI-reducible linkage and $\tau \in \mathcal{BT}_I$ is the result of one iteration of the anytime clustering procedure of Table 3 applied to $\sigma \in \mathcal{BT}_I$, then

(40) $\mathcal{L}(\tau, P) \leq \mathcal{L}(\sigma, P).$

If $\sigma$ is homogeneous, then $\tau = \sigma$, and so the result directly follows.
Otherwise, let $(A, B, C)$ be the NNI-triplet (Lemma 1) associated with $(\sigma, \tau)$. Recall that $A \cup B \in \mathcal{C}(\sigma)$ and $B \cup C \in \mathcal{C}(\tau)$. To put it another way, anytime hierarchical clustering performs an NNI move on $\sigma$ at grandchild $A$ towards $\tau$, and so

(41) $\ell(B, C, P) \leq \ell(A, B, P),$

(42) $\ell(B, C, P) \leq \ell(A, C, P).$

Therefore, since $\ell$ is NNI-reducible (Definition 5), the change in the objective function (38) is nonpositive,

(43) $\mathcal{L}(\tau, P) - \mathcal{L}(\sigma, P) = \ell(B, C, P) + \ell(A, B \cup C, P) - \ell(A, B, P) - \ell(A \cup B, C, P) \leq 0,$

which completes the proof.
Theorem 2
If $\ell$ is an NNI-reducible linkage, then iterated application of the Anytime Hierarchical Clustering procedure of Table 3, initiated from any hierarchy in $\mathcal{BT}_I$ for a fixed set of data points $P$, must terminate in finite time at a tree in $\mathcal{BT}_I$ that is homogeneous.
For a fixed finite index set $I$, the number of nondegenerate hierarchies in $\mathcal{BT}_I$ (5) is finite. Hence, for the proof of the theorem, it suffices to show that the anytime clustering procedure in Table 3 cannot yield any cycle in $\mathcal{BT}_I$.
Let $\tau_k$ denote the clustering hierarchy visited at the $k$-th iteration of the anytime clustering method, where $k \geq 0$. Since $\tau_k$ and $\tau_{k+1}$ are NNI-adjacent, let $(A, B, C)$ be the associated NNI-triplet (Lemma 1) of the pair, satisfying $A \cup B \in \mathcal{C}(\tau_k)$ and $B \cup C \in \mathcal{C}(\tau_{k+1})$. Further, recall from Lemma 2 that for any NNI-reducible linkage $\ell$, $\mathcal{L}(\tau_{k+1}, P) \leq \mathcal{L}(\tau_k, P)$.
If $\mathcal{L}(\tau_{k+1}, P) < \mathcal{L}(\tau_k, P)$, it is clear that the anytime clustering method never revisits any previously visited clustering hierarchy.
Otherwise, if $\mathcal{L}(\tau_{k+1}, P) = \mathcal{L}(\tau_k, P)$, we have

(44) $\ell(B, C, P) + \ell(A, B \cup C, P) = \ell(A, B, P) + \ell(A \cup B, C, P),$

(45) $\ell(B, C, P) < \ell(A, B, P),$

where the latter is due to the anytime clustering rule in Table 3. Hence, the construction cost of the (grand)parent increases after the NNI move,

(46) $\ell(A, B \cup C, P) > \ell(A \cup B, C, P).$
Now, let $\mathrm{lvl}(A, \tau)$ denote the level of cluster $A$ of $\tau$, which is equal to the number of ancestors of $A$ in $\tau$,

(47) $\mathrm{lvl}(A, \tau) = \left| \left\{ B \in \mathcal{C}(\tau) \mid A \subsetneq B \right\} \right|,$

and define $V(\tau)$ to be an ordered tuple of the sums of linkages of $\tau$ at each level,

(48) $V(\tau) = \left( v_1(\tau), v_2(\tau), \dots, v_{N-1}(\tau) \right),$

where a binary hierarchy over leaf set $I$ might have at most $N - 1$ levels, and

(49) $v_i(\tau) = \sum_{\substack{A \in \mathcal{C}(\tau) \setminus \{I\} \\ \mathrm{lvl}(A, \tau) = i}} \ell \left( A, A^{-\tau}, P \right).$

Note that if there is no cluster at level $i$ of $\tau$, then we set $v_i(\tau) = 0$.
Order tuples of real numbers lexicographically according to the standard order of the reals. Then, since the NNI transition from $\tau_k$ to $\tau_{k+1}$ might only change linkages between clusters below the (grand)parent cluster $A \cup B \cup C$, using (46), one can conclude that

(50) $V(\tau_{k+1}) \succ V(\tau_k)$ (lexicographically).

Thus, it is also impossible to revisit the same clustering hierarchy at the same value of the objective function, which completes the proof.
4.2 A Brief Discussion of Computational Properties
Complexity analysis of any recursive algorithm necessarily engages two logically independent questions: (i) how many iterations are required for convergence; and (ii) what computational cost is incurred by application of the recursive function at each step along the way? Accordingly, in this section we address this pair of questions in the context of the anytime hierarchical clustering algorithm of Table 3. Specifically we: (i) discuss (but defer to a subsequent paper a complete treatment of) the problem of determining bounds on the number of iterations of anytime clustering; and (ii) present explicit bounds on the computational cost of checking whether or not a cluster hierarchy violates local homogeneity at a given cluster (node) of the tree. Prior work on discriminative comparison of nondegenerate hierarchies [4] and the results of the experimental evaluation in Section 5 hint at a bound on the number of iterations (i) that grows modestly with the dataset cardinality, $N$. We leave a comprehensive, detailed study of the algorithmic complexity of anytime hierarchical clustering to a future discussion of specific implementations. However, we still find it useful to give a brief idea of the computational cost incurred by the determination of tree homogeneity with respect to a number of commonly used linkages.
A straightforward implementation of (ii), checking local homogeneity of a clustering hierarchy at a cluster with respect to any linkage function in (10), generally has quadratic, $O(N^2)$, time complexity in the dataset size, $N$, with the exception that local homogeneity of a clustering tree relative to Ward’s linkage can be computed in linear, $O(N)$, time.
Alternatively, following the CF (Clustering Feature) tree of BIRCH [41], a simple tree data structure can be used to store the sufficient statistics (cluster sizes, means and variances) of a clustering hierarchy associated with a dataset. Such a data structure can be constructed in time linear in the dataset size, using a post-order traversal of the clustering tree together with the recursion that combines the sizes, means and variances of each cluster’s children.
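Such an annotation pass can be sketched as follows (ours, inspired by BIRCH's CF-tree idea; the nested-tuple encoding and the name `cf_annotate` are our illustrative assumptions). Each cluster's statistics are computed once, in post order, from those of its two children:

```python
def cf_annotate(tree, points, out=None):
    """Post-order pass storing (size, mean, variance) for every cluster of a
    nested-tuple hierarchy whose leaves index into `points`."""
    out = {} if out is None else out
    if not isinstance(tree, tuple):
        key = frozenset([tree])
        out[key] = (1, [float(x) for x in points[tree]], 0.0)
        return key, out
    ka, _ = cf_annotate(tree[0], points, out)
    kb, _ = cf_annotate(tree[1], points, out)
    (n1, m1, v1), (n2, m2, v2) = out[ka], out[kb]
    n = n1 + n2
    mean = [(n1 * a + n2 * b) / n for a, b in zip(m1, m2)]
    def shift(m):  # squared distance from a child's mean to the merged mean
        return sum((a - b) ** 2 for a, b in zip(m, mean))
    var = (n1 * (v1 + shift(m1)) + n2 * (v2 + shift(m2))) / n
    key = ka | kb
    out[key] = (n, mean, var)
    return key, out
```

Since each of the $O(N)$ clusters is visited once and each merge of statistics takes constant time (in the fixed dimension $d$), the whole annotation is linear in the dataset size.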