Clustering is a primitive in data analysis which simultaneously serves to summarize data and elucidate its hidden structure. In its most common form a clustering problem consists of a pair $(M, k)$, where $M = (X, d)$ is a metric space and $k$ indicates the desired number of clusters. The goal of the problem is to find a partition of the points of $M$ into $k$ sets such that some objective is minimized. Because of the fundamental nature of such a primitive, clustering enjoys broad application in a variety of settings and an extensive body of work exists to explain, refine, and adapt its methodology [3, 8, 11, 13, 17, 18].
Having to decide the number of clusters in advance can be a source of difficulty in practice. When faced with this problem, one common approach is to use hierarchical clustering to produce a parameter-free summary of the input. That is, instead of producing a single partition of the input points, the goal is to find a rooted tree (called a dendrogram) where the leaves are the input points and the internal nodes of the tree indicate the distance at which their subtrees merge.
We aim to address the analogous question of how to avoid having to decide the number of clusters in advance in the case of dynamic data. Here, we adopt the temporal clustering framework of [9, 10]. In this framework, the input is a sequence of clustering problems, and the goal is to ensure that the solutions of successive instances remain close according to some objective. This differs from incremental [2, 7] and kinetic clustering [1, 4, 14, 16] in that there is no constraint that the clustering instances in the input must be incrementally related. Further, an optimal sequence of spatial clusterings is not automatically a low cost solution to the temporal clustering instance.
In this paper we present a natural adaptation of hierarchical clustering to the temporal setting. We study the problem of finding a temporally coherent sequence of hierarchical clusterings from a sequence of unlabeled point sets. Our goal is to produce a sequence of hierarchical clusterings (dendrograms) corresponding to each set of points in the input such that successive pairs of clusterings have similar dendrograms. We show that the corresponding optimization problem is NP-hard. However, a polynomial-time approximation algorithm exists when the metric spaces in the input are taken from a common ambient metric space. We explore the properties of this algorithm and find that it is unstable under perturbations of the metric. We then show how to restore stability with only a slight loss in the guarantee.
An idea used in this paper is that we may hierarchically cluster a metric space by finding a low-distortion embedding of it into an ultrametric. An ultrametric is a metric space which satisfies a stronger version of the triangle inequality. Formally, an ultrametric space is a metric space $(X, d)$ such that $d(x, y) \le \max\{d(x, z), d(z, y)\}$ for all $x, y, z \in X$.
Ultrametric spaces have interesting geometry. For instance, in an ultrametric all points contained in a ball of radius $r$ are centers of the ball. That is, for any $y \in B(x; r)$, we have $B(y; r) = B(x; r)$, where $B(x; r)$ denotes the ball of radius $r$ about a point $x$ in a metric space $M$. Further, given any pair of balls $B_1$, $B_2$ with non-empty intersection, one has $B_1 \subseteq B_2$ or $B_2 \subseteq B_1$. This simple fact implies that any ultrametric space has the structure of a tree where items in a common subtree are close. That is, an ultrametric induces a natural hierarchical clustering, commonly depicted as a dendrogram (see Figure 1).
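The strong triangle inequality and the "every point of a ball is a center" property can be checked directly on a small distance matrix. The following is a minimal sketch, assuming distances are given as a symmetric list-of-lists matrix and balls are closed; the function names `is_ultrametric` and `ball` are hypothetical and not part of the paper.

```python
def is_ultrametric(d, tol=1e-12):
    """True if d[x][y] <= max(d[x][z], d[z][y]) for every triple of indices."""
    n = len(d)
    return all(
        d[x][y] <= max(d[x][z], d[z][y]) + tol
        for x in range(n) for y in range(n) for z in range(n)
    )

def ball(d, center, r):
    """Closed ball of radius r about the point with index `center`."""
    return {y for y in range(len(d)) if d[center][y] <= r}

# A dendrogram on three leaves: 0 and 1 merge at height 1, and 2 joins at height 3.
u = [[0, 1, 3],
     [1, 0, 3],
     [3, 3, 0]]

# Points at positions 0, 1, 3 on a line: a metric, but not an ultrametric,
# since d(0, 2) = 3 exceeds max(d(0, 1), d(1, 2)) = 2.
m = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]

print(is_ultrametric(u), is_ultrametric(m))
print(ball(u, 0, 1) == ball(u, 1, 1))
```

In the ultrametric `u`, the balls of radius 1 about points 0 and 1 coincide, which is the nesting behavior that gives rise to the dendrogram.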
Similarity of dendrograms.
For dendrograms over sets of points with identical labelings there is a natural dissimilarity measure given by comparing the merge heights for any pair of corresponding points. Namely, $\max_{x, y} |u_1(x, y) - u_2(x, y)|$, where $u_1$ and $u_2$ give the merge heights for the respective pair of dendrograms.
One immediate obstacle to adopting this formalization is that our model does not require that the sets of points comprising the input have the same cardinality. For this reason, we take the point of view that two dendrograms are similar if there exists a correspondence between their leaves such that the merge heights of corresponding points are close. Formally, a correspondence between $X$ and $Y$ is a relation $C \subseteq X \times Y$ such that $\pi_X(C) = X$, $\pi_Y(C) = Y$. Here, $\pi_X$, $\pi_Y$ denote the canonical projections of $X \times Y$ to $X$ and $Y$ (respectively). Further we use the notation $\mathcal{C}(X, Y)$ to denote the set of correspondences between $X$, $Y$. Given a correspondence $C$ between two sets of points $X$, $Y$, we have the following dissimilarity measure which accounts for differences in the merge heights of a pair of dendrograms under $C$. This measure is called the distortion of $C$, or the merge distortion distance with respect to $C$, and is given by
$$\mathrm{dist}(C) = \max_{(x, y), (x', y') \in C} |u_X(x, x') - u_Y(y, y')|,$$
where $u_X$, $u_Y$ give the merge heights of the dendrograms on $X$, $Y$ (respectively).
Our goal, then, is not only to output a sequence of hierarchical clusterings corresponding to the point sets of the input, but also to produce an interstitial sequence of low-distortion correspondences linking successive pairs of dendrograms. We quantify the extent to which an ultrametric faithfully represents an input metric space under the $\ell_\infty$ norm. Specifically, let $M_1 = (X, d_1)$, $M_2 = (X, d_2)$ be a pair of finite pseudometric spaces on the same set of points. We define
$$\|M_1 - M_2\|_\infty = \max_{x, y \in X} |d_1(x, y) - d_2(x, y)|.$$
In other words, a pseudometric space $M_1$ is a good fit for $M_2$ (and vice versa) whenever $\|M_1 - M_2\|_\infty$ is small.
Let $(X, u)$ be a pseudometric space. If for any $x, y, z \in X$ it holds that $u(x, y) \le \max\{u(x, z), u(z, y)\}$, then we say that $u$ is a pseudo-ultrametric and $(X, u)$ is a pseudo-ultrametric space. We now formally define the general version of the problem.
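Both quantities above are simple maxima and can be computed by brute force. The sketch below assumes that the distortion of a correspondence is the largest absolute difference between the merge heights of corresponding pairs (the usual Gromov-Hausdorff-style form), and that points are identified with matrix indices. The names `linf_distance`, `is_correspondence`, and `distortion` are hypothetical.

```python
def linf_distance(d1, d2):
    """Sup-norm distance between two (pseudo)metrics on the same n points."""
    n = len(d1)
    return max(abs(d1[i][j] - d2[i][j]) for i in range(n) for j in range(n))

def is_correspondence(C, n_x, n_y):
    """True if both projections of the relation C are surjective."""
    return ({x for x, _ in C} == set(range(n_x))
            and {y for _, y in C} == set(range(n_y)))

def distortion(C, u_x, u_y):
    """Largest merge-height discrepancy over all pairs of corresponding points."""
    return max(abs(u_x[x][xp] - u_y[y][yp]) for (x, y) in C for (xp, yp) in C)

# Merge heights of a three-leaf and a two-leaf dendrogram.
u_x = [[0, 1, 3],
       [1, 0, 3],
       [3, 3, 0]]
u_y = [[0, 2.5],
       [2.5, 0]]

# Leaves 0 and 1 of the first dendrogram both correspond to leaf 0 of the second.
C = {(0, 0), (1, 0), (2, 1)}
print(is_correspondence(C, 3, 2), distortion(C, u_x, u_y))
```

Here the distortion is 1: leaves 0 and 1 merge at height 1 on one side but are matched to a single leaf on the other.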
Definition 1.1 (Temporal Hierarchical Clustering (Generalized Version)).
Let be a sequence of metric spaces, where for each , , and let . The goal of the Generalized Temporal Hierarchical Clustering problem is to find a sequence of pseudo-ultrametric spaces, and a sequence of correspondences , where for each , we have , and for any , with . Such a clustering is called a Generalized -Clustering of .
We show in Section 4 that the Generalized Temporal Hierarchical Clustering problem is NP-hard.
Absent the ambient metric space, the above notion of distortion would be sufficient to capture the intuitive idea that consecutive hierarchical clusterings should be close. However, it is easy to produce examples where symmetries in the input permit low-distortion correspondences which are manifestly non-local in the ambient space. Thus it makes sense to further require that any correspondence be local in the ambient metric. We say that a correspondence is -local provided that where is the distance in the ambient space.
We now formalize this version of the problem. Here, the input , consists of a sequence of unlabeled, finite, non-empty subsets of a metric space . We call such a sequence a temporal-sampling of of length , and refer to individual elements of the sequence ( for some ) as a level of (see [9, 10]). The size of is simply the sum of the number of points in each level of , that is . Let be a metric space. For any we use the notation to denote the restriction of to , that is, . We have the following definition:
Definition 1.2 (Temporal Hierarchical Clustering (Local Version)).
Let be a temporal-sampling over a metric space , and let . The goal of the Local Temporal Hierarchical Clustering problem is to find a sequence of pseudo-ultrametric spaces, , where for each , , and , together with a sequence of correspondences where for any , with . Such a clustering is called a Local -Clustering.
While the general version of the problem is NP-hard, the local version is trivial and can be computed in -time by computing a correspondence minimizing the Hausdorff distance for each pair of successive levels. We highlight this problem for expository purposes as well as a prelude to a labeled version of the problem.
This version of the problem is further of interest in that it can be used to approximate the general version such that the resulting distortion is bounded in terms of , and . We discuss this topic further in Section 4.
The versions of the problem above already have several drawbacks when it comes to making concrete cluster assignments. In particular, it is unclear how to coherently assign cluster labels to points given a correspondence. Moreover, we must account for the fact that the number of points can vary across levels. Taking the point of view that a good labeling is one in which labels in successive levels remain close, we opt to allow points to be given multiple labels. Doing so affords us additional bookkeeping to help ensure that labelings for nearby levels remain local, even across levels which require relatively few labels.
To this end, given a set , a -labeling of is a function such that is a partition of . Informally, we say two labelings are -contiguous if the copies of the same label in a pair of assignments are no farther than . We have the following definition:
Definition 1.3.
Given a pair of sets , of points from a metric space , and a pair -labelings , of , (respectively), we say that and are -contiguous in if
for all , ,
for all , .
See Figure 2 for an example.
Since points can be multi-labeled, we need a tie-breaking rule to determine which label applies. By convention we take the label of any set of points to be the smallest label among all labels of points in the set. Moreover, a good solution should never use more than labels on an input of size . We are now ready to define the main version of the problem.
Definition 1.4 (Temporal Hierarchical Clustering).
Let be a temporal-sampling of size over a metric space with distance, , and let . The goal of the Temporal Hierarchical Clustering problem is to find a sequence of pseudo-ultrametric spaces, , such that for any , , and a sequence of -labelings, , for , such that for any , , are -contiguous. Such a clustering is called a Labeled -Clustering.
In Section 2 we show how to find an optimal solution to the local version of the problem in -time. Then, in Section 3, we give an -time algorithm which converts any Local -Clustering into a Labeled -Clustering. This combined with Section 2 implies an optimal solution for the labeled version of the problem. In Section 4 we show that the general version is NP-hard, but observe that the local version provides an approximate solution in the special case where the inputs come from a common metric space. In Section 5 we show that the optimal algorithms are unstable with respect to perturbations of the metric, and how to ensure stability by changing the ultrametric construction. Last, Section 6 contains an experiment.
2 Local Version
In this section we present a straightforward solution to the local version of temporal hierarchical clustering in -time. We are not directly interested in the solution of this problem. Instead, this section serves as a prelude to solving the labeled version.
The algorithm is trivial. Let be a scheme for finding the -nearest ultrametric to a metric. For each set of points in the input we use to find an ultrametric. To compute correspondences between successive levels , , we add all pairs of points such that and are at a distance of at most the Hausdorff distance of , . Formally, the algorithm takes a temporal-sampling of a metric space as input and consists of the following steps:
Step 1: Fitting by ultrametrics. For each , find an ultrametric near to via a chosen scheme.
Step 2: Build correspondences. For each , compute
. Here, denotes the Hausdorff distance in the ambient metric space.
Step 3: Return .
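Step 2 of the procedure above can be sketched in a few lines. This is a minimal illustration, assuming the ambient space is the Euclidean plane and that levels are given as lists of coordinate tuples; the function names `hausdorff` and `local_correspondence` are hypothetical, and Step 1 (the ultrametric fit) is omitted since it depends on the chosen scheme.

```python
import math

def hausdorff(P, Q, dist=math.dist):
    """Hausdorff distance between two finite, non-empty point sets."""
    d_pq = max(min(dist(p, q) for q in Q) for p in P)
    d_qp = max(min(dist(p, q) for p in P) for q in Q)
    return max(d_pq, d_qp)

def local_correspondence(P, Q, dist=math.dist):
    """All index pairs (i, j) whose points lie within the Hausdorff distance.

    Returns the correspondence together with the Hausdorff distance used.
    """
    h = hausdorff(P, Q, dist)
    C = {(i, j)
         for i, p in enumerate(P)
         for j, q in enumerate(Q)
         if dist(p, q) <= h}
    return C, h

# Two successive levels with different numbers of points.
P = [(0, 0), (4, 0)]
Q = [(0, 1), (4, 1), (5, 1)]
C, h = local_correspondence(P, Q)
print(sorted(C), h)
```

Every point has a partner within the Hausdorff distance by definition, so both projections of the returned relation are surjective and it is a valid correspondence.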
Let denote the size of the temporal sampling. In this section we argue that the above algorithm returns an optimal solution in time, provided that it is equipped with a scheme for finding the -nearest ultrametric to an -point metric space in -time. The following theorem ensures that one exists.
Theorem 2.1 (Farach-Colton, Kannan, and Warnow).
Let be an -point metric space and let denote the set of ultrametrics on the points of . There exists an -time algorithm which finds .
We are now ready to prove the main theorem of this section.
Theorem 2.2.
Let be a temporal-sampling of size which admits a Local -Clustering. There exists an -time algorithm returning a Local -Clustering.
Proof.
Let denote the length of , and the ambient metric space. Run the algorithm of Section 2 where is the algorithm of Farach-Colton, Kannan, and Warnow . Let denote the pseudo-ultrametrics in the output. By Theorem 2.1, , as otherwise would imply that for some level , the algorithm of Theorem 2.1 fails to return an -nearest ultrametric to .
Let . We now argue that is smallest possible in the sense that admits a Local -Clustering, but does not admit a Local -Clustering for any , when . Let . First we show . Fix any Local -Clustering, and let be the associated sequence of -local correspondences. Fix some and some . Since is a correspondence, , and thus there exists such that . Since is -local it holds that , and we conclude . An analogous argument for implies . Thus, for , Now we argue that is feasible. Fix . Since it holds that for every point there exists such that . Construct a set . Analogously construct a set . The set is thus a -local correspondence between , . Thus, it follows that .
The preceding two paragraphs show that the result is a Local -Clustering. It only remains to show the algorithm runs in -time. Let for . Step 1 takes -time as finding the -nearest ultrametric for level can be done in -time by Theorem 2.1. Computing the inter-level Hausdorff distance and building the correspondence for level in Step 2 can both be done in -time, for a total of -time overall. ∎
3 Labeled Version
In this section we show how to convert a Local -Clustering into a Labeled -Clustering in -time by transforming a sequence of -local correspondences into a sequence of pairwise -contiguous labelings.
Drawing upon an idea in [9, 10], we employ minimum cost feasible flow to find a -contiguous labeling with few labels. Formally, we construct the flow instance as follows: Let be a temporal-sampling. Given the -local correspondences of a Local -Clustering, , the following construction transforms into a flow network, , such that corresponding points in successive levels are connected by a directed edge which points to the higher indexed level. Moreover, a source, , connects to each of the points in the first level, while the sink is the target of a directed edge from each point in . Formally, let . For , let such that if and only if . The vertices of consist of , , and the contents of . The edges of consist of the union of , , and . Specifically, we seek an integral flow with minimum flow value such that the in-flow of each vertex of is at least one.
The main idea is to view each correspondence as a bipartite graph. We concatenate the sequence of correspondences together by merging overlapping vertices. This allows us to interpret the sequence of correspondences as a graph. Our goal is then to decompose this graph into a path cover of small size, which we do by solving a flow instance. Since this graph only contains edges between points which are close, the resulting labeling will be contiguous. Formally, we perform the following steps:
Step 1: Constructing a flow instance. Given a sequence of -local correspondences of a Local -Clustering, construct the minimum flow instance as defined above.
Step 2: Solve the flow instance. Find a minimum cost integral flow in .
Step 3: Decompose the flow. Greedily extract unit flows from to construct a list of paths .
Step 4: Construct label functions. Build label functions by initializing each to the empty set. Next, for each , denote as the point sequence . Append label to .
Step 5: Output. Return the labelings .
In this section we show that the above algorithm finds an optimal solution in -time on temporal samplings of size . To this end we now argue that the above network flow instance is feasible.
Lemma 3.1.
Let be a temporal-sampling. Given the -local correspondences of a Local -Clustering, , the flow instance is feasible with value at most .
Proof.
For any , any point can be extended to a path from to , by iteratively extending the ends of the path via the correspondences. Construct a feasible flow by initializing to be zero everywhere. Greedily extend points receiving no flow to paths from to in the described manner, and increase the flow value of along the path by . It follows that remains integral and satisfies all lower bounds of . Since we flow at most unit of flow per point of , the value of is at most . ∎
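The greedy construction in this proof can be turned directly into labelings. The sketch below is only the feasibility argument, not the minimum flow computation of the algorithm: it assumes levels are given by their sizes and correspondences by sets of index pairs, and it extends each uncovered point to a path through all levels, assigning one new label per path. The name `greedy_labeling` is hypothetical.

```python
def greedy_labeling(sizes, corrs):
    """Cover every point by a first-to-last-level path; one label per path.

    sizes[i] is the number of points in level i, and corrs[i] is a set of
    index pairs (x, y) relating level i to level i + 1. Returns, per level,
    a list holding the set of labels assigned to each point.
    """
    t = len(sizes)
    # fwd[i][x]: some partner of x in level i + 1; bwd[i][y]: some partner of y in level i.
    fwd = [dict() for _ in range(t - 1)]
    bwd = [dict() for _ in range(t - 1)]
    for i, C in enumerate(corrs):
        for x, y in sorted(C):
            fwd[i].setdefault(x, y)
            bwd[i].setdefault(y, x)
    labels = [[set() for _ in range(n)] for n in sizes]
    next_label = 0
    for i in range(t):
        for p in range(sizes[i]):
            if labels[i][p]:
                continue  # already covered by an earlier path
            path = [None] * t
            path[i] = p
            for j in range(i - 1, -1, -1):   # extend back to the first level
                path[j] = bwd[j][path[j + 1]]
            for j in range(i + 1, t):        # extend forward to the last level
                path[j] = fwd[j - 1][path[j - 1]]
            for j, q in enumerate(path):
                labels[j][q].add(next_label)
            next_label += 1
    return labels

sizes = [2, 3]
corrs = [{(0, 0), (1, 1), (1, 2)}]
print(greedy_labeling(sizes, corrs))
```

Each path consumes one label and covers at least one previously uncovered point, so the number of labels never exceeds the total number of points. A minimum flow would generally use fewer labels than this greedy cover.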
The next theorem shows that the algorithm outputs an optimal clustering.
Theorem 3.2.
Let be a temporal-sampling of size . There exists an -time algorithm which is guaranteed to output a Labeled -Clustering of , for any , such that admits a Labeled -Clustering.
Proof.
Let be the length of . Run the algorithm of Section 2 on . Since admits a Labeled -Clustering, it also admits a Local -Clustering where for any , the -th correspondence is given by Thus, by Theorem 2.2, we are guaranteed a Local -Clustering in -time. Let be its -local correspondences, and run the above algorithm on it. By Lemma 3.1, the flow instance is feasible with value at most . Using an algorithm of Gabow & Tarjan , we can solve in -time, yielding an integral flow . Again in -time, we decompose into a collection of unit flows , for some , which we interpret as paths from to .
We now verify that the sequence of label functions output by the algorithm is indeed a -contiguous -labeling for some . For any , and any let denote the -th vertex in the -th path. Recall that for each , we assign each point the set of labels . Note that each label in is used at most once per level since for any , , is the only place where intersects . Also, since each intersects all levels , each label is used at least once per level. It follows that is a partition of . Finally, since the edges of correspond to points that are separated by at most in the ambient space, any two uses of the label for some occur within . Thus the corresponding sequence of -labelings is indeed pairwise -contiguous. ∎
4 Generalized Version
In this section we show that the generalized version of the problem is NP-hard. However, we argue that for the special case where the points of the input share a (known) common ambient metric, the algorithm of Section 2 gives an approximate solution. It remains an open question how to find an approximate solution in polynomial time when there is no ambient metric (or it is unknown).
Let be an instance of -coloring. We construct an instance of Generalized Temporal Hierarchical Clustering, , consisting of two levels. For the first level let be a set of three points, and let be a metric on such that distinct have . Denote the corresponding metric space . For the second level we construct a metric space , where , such that
Lemma 4.1.
If admits a -coloring requiring colors, then admits a Generalized -Clustering.
Proof.
Fix a -coloring of . We will exhibit a pair of pseudometric spaces and a -distortion correspondence between them. For the first space let be a uniform metric space where distinct points are at a distance of . Note that , since for any distinct , .
We will use the points of to denote the color class of . Fix be such that if and only if share the same color class. Let be the pseudometric space where for any , if and only if , and otherwise. We now bound by considering for an arbitrary pair . Since for any , only distinct can contribute to the distortion. Suppose , then and thus . Otherwise, , and while so that . Thus .
Last, let . We now verify that is a -distortion correspondence. To see that , note that since requires colors, and since every vertex belongs to a color class. Finally, to bound note that for any , either and (since ), or and . ∎
Lemma 4.2.
If does not admit a -coloring, then does not admit a Generalized -Clustering.
Proof.
Let . Fix a Generalized -Clustering of for some consisting of ultrametrics , , and a -distortion correspondence . We first argue that the points of are separated. Let , . If then . Thus , a contradiction.
Now fix a map , such that for any , such that . First we argue that is indeed a function by showing that for any , corresponds to exactly one point in . To see why observe that given any with it follows that . We now show how to use to construct a -coloring of . Since , for every , we have , as otherwise . Consider any pair of corresponding points . It must be the case that as otherwise . Color the graph by assigning each to a color class given by . Since for adjacent , we have , it follows that , and thus there is no edge between vertices of the same color. We have exhibited a -coloring of . ∎
Theorem 4.3.
The Generalized Temporal Hierarchical Clustering problem is NP-hard.
Proof. The result follows directly from Lemma 4.1 and Lemma 4.2.
The proof also implies that for the Generalized Temporal Hierarchical Clustering problem, for some fixed , approximating within any factor smaller than is NP-hard.
Approximation by local version.
We now show that any Local -Clustering is a Generalized -Clustering. That is, we can view the local version of the problem as an approximation to the general version in the special case that the points of the input come from the same metric space.
Let be a temporal-sampling. Any Local -Clustering of is a Generalized -Clustering of .
Suppose has length and ambient metric space . Fix a Local -Clustering of with ultrametrics , and correspondences, , induced by labelings of successive pairs of levels. Observe that
Since , it follows by definition of that for any , . Fix an arbitrary and let . By triangle inequality . Note that since , we have . Thus , are contained in -balls of , in (respectively). It follows that . We conclude that for any , , , and thus . ∎
5 Stability
In this section we show that the algorithm of Theorem 2.1 for finding an $\ell_\infty$-nearest ultrametric is unstable under perturbations of the metric and, consequently, so are our algorithms. Stability is naturally a desirable property: if small changes in the input can produce vastly different ultrametrics, then the temporal coherence of the output is lost. Furthermore, this is the case even if the cost of fitting each level to an ultrametric remains best possible. We resolve this issue in practice by instead finding the $\ell_\infty$-nearest subdominant ultrametric.
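Let $M = (X, d)$ be a metric space. We will consider $M$ to be a complete graph where the edges are weighted by distance, and use the notation $T$ to refer to a minimum spanning tree on $M$. Further, for any $x, y \in X$, let $P_T(x, y)$ denote the unique path in $T$ joining $x, y$. Let $\mathcal{U}(X)$ denote the set of ultrametrics on the points of $M$. Let
$$\mathcal{U}_{\le}(M) = \{(X, u) \in \mathcal{U}(X) : u(x, y) \le d(x, y) \text{ for all } x, y \in X\}.$$
In other words, $\mathcal{U}_{\le}(M)$ is the set of ultrametrics on the points of $M$ such that no distance is made larger than its counterpart in $M$. We say that an ultrametric in $\mathcal{U}_{\le}(M)$ is subdominant to $M$. Let $U^* = (X, u^*)$ be a metric space on the points of $M$ with distance function
$$u^*(x, y) = \max_{e \in P_T(x, y)} d(e).$$
The distance function $u^*$ is independent of the choice of minimum spanning tree, and easily verified to be ultrametric and subdominant to $M$. It can further be shown that $U^*$ is the unique $\ell_\infty$-closest subdominant ultrametric to $M$. That is,
$$U^* = \operatorname{arg\,min}_{U \in \mathcal{U}_{\le}(M)} \|M - U\|_\infty.$$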
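The subdominant ultrametric can be computed with a Kruskal-style pass over the sorted edges. This is a minimal sketch, assuming the standard characterization in which the distance between two points is the largest edge weight on the minimum spanning tree path joining them (equivalently, the single-linkage merge height). The function name `subdominant_ultrametric` is hypothetical.

```python
def subdominant_ultrametric(d):
    """Return u with u[x][y] = largest edge weight on the MST path from x to y.

    Edges are processed in increasing order of weight, as in Kruskal's
    algorithm. When an edge of weight w joins two components, w is the
    merge height of every pair of points split across them.
    """
    n = len(d)
    edges = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    members = {i: [i] for i in range(n)}  # representative -> points in its component
    rep = list(range(n))                  # point -> representative of its component
    u = [[0] * n for _ in range(n)]
    for w, i, j in edges:
        a, b = rep[i], rep[j]
        if a == b:
            continue  # edge lies inside a component, so it is not an MST edge
        for x in members[a]:
            for y in members[b]:
                u[x][y] = u[y][x] = w
        if len(members[a]) < len(members[b]):
            a, b = b, a  # merge the smaller component into the larger
        for y in members[b]:
            rep[y] = a
        members[a].extend(members.pop(b))
    return u

# Points at positions 0, 1, 3 on a line.
d = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
print(subdominant_ultrametric(d))
```

On this input the only distance that changes is the pair at distance 3, which is lowered to 2, the heaviest edge on the spanning tree path between them. No distance is ever increased, which is the subdominance property.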
We now show that the algorithms of Section 2 and Section 3 are unstable. To see why, we restate the algorithm of Theorem 2.1 in a slightly modified but equivalent form which helps to make our point:
Step 1: Compute a minimum spanning tree. Given a metric space $M = (X, d)$, consider a weighted complete graph on $X$ where the weight of any edge $\{x, y\}$ is $d(x, y)$. Find a minimum spanning tree of this graph, $T$.
Step 2: Compute cut-weights for each edge. Let . For each edge , compute and assign a priority to such that
Step 3: Assign distances. Edges are cut in order of descending priority. Any pair of vertices first separated by a cut at are assigned a distance of .
When an edge is cut, points first separated by the removal of that edge are assigned a distance which depends on its largest supported distance in . The issue is that small perturbations in the metric can change the path structure of so that an edge becomes responsible for linking a far pair of points. The only hope for stability is that the other term in the assigned distance, , changes enough to offset this effect. However, Lemma 5.1 shows that this term is stable, and thus is not large enough to compensate. It follows that the above procedure is unstable. See Figure 3 for a concrete example.
In contrast, the -nearest subdominant ultrametric is stable under metric perturbations. We now give a simple, direct proof of this fact for our setting. See  for extended discussion.
Lemma 5.1.
Let , be metric spaces on the same points such that , then .
Proof.
Let denote the points of . Fix a distance-weighted MST of , , and let , . For any pair of points let denote the set of all simple paths in (when is viewed as a complete graph). Let be the function that sends each path in to the value of its maximum weight edge. Observe that the maximum weight edge along is equal to , as otherwise it is possible to construct a spanning tree with cost strictly less than that of . Thus, . Now since , differ by an -perturbation, the values of individual edges of the paths (and therefore the values of the paths in under ) change by at most . Thus, ∎
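The stability argument can be checked numerically. The sketch below assumes, as the proof suggests, that the lemma bounds the sup-norm change in the subdominant ultrametric by the sup-norm size of the perturbation. It computes the subdominant ultrametric as a minimax-path closure (the minimum over paths of the maximum edge weight, which is the quantity the proof works with) and compares two nearby Euclidean point sets. The helper names, the random point sets, and the perturbation size are hypothetical choices for illustration.

```python
import math
import random

def minimax_closure(d):
    """u[i][j] = min over paths from i to j of the largest edge weight on the path."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u

def pairwise(points):
    """Euclidean distance matrix of a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]

def sup_gap(a, b):
    """Sup-norm difference between two matrices of the same shape."""
    n = len(a)
    return max(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))

random.seed(0)
P = [(random.random(), random.random()) for _ in range(8)]
# Nudge every point slightly so that both inputs are genuine metrics.
Q = [(x + random.uniform(-0.02, 0.02), y + random.uniform(-0.02, 0.02)) for x, y in P]

d1, d2 = pairwise(P), pairwise(Q)
perturbation = sup_gap(d1, d2)
change = sup_gap(minimax_closure(d1), minimax_closure(d2))
print(change <= perturbation)
```

The comparison holds for any pair of inputs, not only this seed, since each minimax value is attained by a single edge weight and every edge weight moves by at most the perturbation.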