1 Introduction
Hierarchical clustering (HC) is a widely used data analysis tool, ubiquitous in information retrieval, data mining, and machine learning (see a survey by
Berkhin (2006)). This clustering technique represents a given dataset as a binary tree; each leaf represents an individual data point and each internal node represents a cluster on the leaves of its descendants. HC has become the most popular method for gene expression data analysis Eisen et al. (1998), and also has been used in the analysis of social networks Leskovec et al. (2014); Mann et al. (2008), bioinformatics Diez et al. (2015), image and text classification Steinbach et al. (2000), and even in analysis of financial markets Tumminello et al. (2010). It is attractive because it provides richer information at all levels of granularity simultaneously, compared to more traditional flat clustering approaches like $k$-means or $k$-median.
Recently, Dasgupta (2016)
formulated HC as a combinatorial optimization problem, giving a principled way to compare the performance of different HC algorithms. This optimization viewpoint has since received a lot of attention
Roy and Pokutta (2016); Charikar and Chatziafratis (2017); Cohen-Addad et al. (2017); Moseley and Wang (2017); Cohen-Addad et al. (2018) that has led not only to the development of new algorithms but also to theoretical justifications for the observed success of popular HC algorithms (e.g. average-linkage).
However, in real applications of clustering, the user often has background knowledge about the data that may not be captured by the input to the clustering algorithm. There is a rich body of work on constrained (flat) clustering formulations that take into account such user input in the form of “cannot link” and “must link” constraints Wagstaff and Cardie (2000); Wagstaff et al. (2001); Bilenko et al. (2004); Rangapuram and Hein (2012). Very recently, “semi-supervised” versions of HC that incorporate additional constraints have been studied Vikram and Dasgupta (2016), where the natural form of such constraints is triplet (or “must link before”) constraints $ab|c$: these require that valid solutions contain a sub-cluster with $a, b$ together and $c$ previously separated from them. (Hierarchies on data imply that all data points are linked at the highest level and all are separated at the lowest level, hence “cannot link” and “must link” constraints are not directly meaningful. For a concrete example from taxonomy of species, a triplet constraint may look like ().) Such triplet constraints, as we formally show later, can encode more general structural constraints in the form of rooted subtrees. Surprisingly, such simple triplet constraints already pose significant challenges for bottom-up linkage methods (Figure 1).
Our work is motivated by applying the optimization lens to study the interaction of hierarchical clustering algorithms with structural constraints. Constraints can be fairly naturally incorporated into top-down (i.e. divisive) algorithms for hierarchical clustering; but can we establish guarantees on the quality of the solution they produce? Another issue is that incorporating constraints from multiple experts may lead to a conflicting set of constraints; can the optimization viewpoint of hierarchical clustering still help us obtain good solutions even in the presence of infeasible constraints? Finally, different objective functions for HC have been studied in the literature; do algorithms designed for these objectives behave similarly in the presence of constraints? To the best of our knowledge, this is the first work to propose a unified approach for constrained HC through the lens of optimization and to give provable approximation guarantees for a collection of fast and simple top-down algorithms that have been used for unconstrained HC in practice (e.g. community detection in social networks Mann et al. (2008)).
Background on Optimization View of HC.
Dasgupta (2016) introduced a natural optimization framework for HC. Given a weighted graph $G(V, E, w)$ and pairwise similarities $w_{ij} \ge 0$ between the $n$ data points, the goal is to find a hierarchical tree $T^*$ such that
$$T^* = \arg\min_{\text{all trees } T} \sum_{(i,j) \in E} w_{ij}\,|T_{ij}| \qquad (1)$$
where $T_{ij}$ is the subtree rooted at the lowest common ancestor of $i, j$ in $T$ and $|T_{ij}|$ is the number of leaves it contains. (Observe that in HC, all edges get cut eventually. Therefore it is better to postpone cutting “heavy” edges to when the clusters become small, i.e. as far down the tree as possible.) We denote (1) as similarity-HC. For applications where the geometry of the data is given by dissimilarities, again denoted by $w_{ij}$, Cohen-Addad et al. (2018) proposed an analogous approach, where the goal is to find a hierarchical tree $T^*$ such that
$$T^* = \arg\max_{\text{all trees } T} \sum_{(i,j) \in E} w_{ij}\,|T_{ij}| \qquad (2)$$
We denote (2) as dissimilarity-HC. A comprehensive list of desirable properties of the aforementioned objectives can be found in Dasgupta (2016); Cohen-Addad et al. (2018). In particular, if there is an underlying ground-truth hierarchical structure in the data, then $T^*$ can recover the ground truth. Also, both objectives are NP-hard to optimize, so the focus is on approximation algorithms.
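To make the objective concrete, here is a minimal Python sketch that evaluates $\sum_{(i,j)} w_{ij}\,|T_{ij}|$ for a given binary tree. The nested-tuple tree encoding, the `frozenset`-keyed weight dictionary, and the function names are assumptions made for illustration only; they are not taken from the paper.

```python
def leaves(tree):
    """Return the leaves of a nested-tuple binary tree; any non-tuple is a leaf."""
    if not isinstance(tree, tuple):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def hc_cost(tree, w):
    """Sum of w[{i, j}] * |T_ij| over all weighted pairs (hypothetical helper)."""
    if not isinstance(tree, tuple):
        return 0.0
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    size = len(left_leaves) + len(right_leaves)
    # Pairs split at this node have this node as their lowest common ancestor.
    cross = sum(w.get(frozenset((i, j)), 0.0)
                for i in left_leaves for j in right_leaves)
    return size * cross + hc_cost(left, w) + hc_cost(right, w)


# Toy similarities: a-b and c-d are heavy, b-c is light.
w = {frozenset(("a", "b")): 3.0,
     frozenset(("c", "d")): 2.0,
     frozenset(("b", "c")): 1.0}
good_tree = (("a", "b"), ("c", "d"))   # cuts the light edge first
bad_tree = (("a", "c"), ("b", "d"))    # cuts every edge at the root
print(hc_cost(good_tree, w), hc_cost(bad_tree, w))
```

For similarity-HC the smaller value is better, so the tree that postpones cutting the heavy edges wins; for dissimilarity-HC the same quantity is maximized instead.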
Our Results.
1) We design algorithms that take into account both the geometry of the data, in the form of similarities, and the structural constraints imposed by the users. Our algorithms emerge as the natural extensions of Dasgupta’s original recursive sparsest cut algorithm and the recursive balanced cut suggested in Charikar and Chatziafratis (2017). We generalize previous analyses to handle constraints and we prove an $O(k\alpha_n)$ approximation guarantee, where for $n$ data points $\alpha_n = O(\sqrt{\log n})$ is the best approximation factor for the sparsest cut and $k$ is the number of constraints, thus surprisingly matching the best approximation guarantee of the unconstrained HC problem for constantly many constraints.
2) In the case of infeasible constraints, we extend the similarity-HC optimization framework, and we measure the quality of a possible tree by a constraint-based regularized objective. The regularization naturally favors solutions with as few constraint violations as possible and as far down the tree as possible (similar to the motivation behind the similarity-HC objective). For this problem, we provide a top-down approximation algorithm by drawing an interesting connection to an instance of the hypergraph sparsest cut problem.
3) We then change gears and study the dissimilarity-HC objective. Surprisingly, we show that known top-down techniques do not cope well with constraints, drawing a contrast with the situation for similarity-HC. Specifically, the (locally) densest cut heuristic performs poorly even if there is only one triplet constraint, blowing up its approximation factor. Moreover, we improve upon the state of the art in Cohen-Addad et al. (2018), by showing that a simple randomized partitioning is a $2/3$ approximation algorithm. We also give a deterministic local-search algorithm with the same worst-case guarantee. Furthermore, we show that our randomized algorithm is robust under constraints, mainly because of its “exploration” behavior. In fact, besides the number of constraints, we propose an inherent notion of dependency measure among constraints to capture this behavior quantitatively. This helps us not only to explain why “non-exploring” algorithms may perform poorly, but also gives tight guarantees for our randomized algorithm.
Experimental Results.
We run experiments on the Zoo dataset (Lichman, 2013) to demonstrate our approach and the performance of our algorithms for a taxonomy application. We consider a setup where there is a ground-truth tree and extra information regarding this tree is provided for the algorithm in the form of triplet constraints. The upshot is that we believe specific variations of our algorithms can exploit this information. In this practical application, our algorithms improve the objective compared to the naive recursive sparsest cut proposed in Dasgupta (2016) that does not use this information. See Appendix B for more details on the setup and precise conclusions of our experiments.
Constrained HC workflow in Practice.
Throughout this paper, we develop different tools to handle user-defined structural constraints for hierarchical clustering. Here we describe a recipe on how to use our framework in practice.
(1) Preprocessing constraints to form triplets. User-defined structural constraints as rooted binary subtrees are convenient for the user and hence for the usability of our algorithm. The following proposition (whose proof is in the supplement) allows us to focus on studying HC with just triplet constraints.
Proposition 1.
Given constraints as a rooted binary subtree on $k$ data points ($k \ge 3$), there is a linear-time algorithm that returns an equivalent set of at most $k$ triplet constraints.
(2) Detecting feasibility. The next step is to see if the set of triplet constraints is consistent, i.e. whether there exists a HC satisfying all the constraints. For this, we use a simple linear time algorithm called BUILD Aho et al. (1981).
(3) Hard constraints vs. regularization. BUILD can create a hierarchical decomposition that satisfies triplet constraints, but ignores the geometry of the data, whereas our goal here is to consider both simultaneously. Moreover, in the case that the constraints are infeasible, we aim to output a clustering that minimizes the cost of violating constraints combined with the cost of the clustering itself.
Feasible instance: to output a feasible HC, we propose using Constrained Recursive Sparsest Cut (CRSC) or Constrained Recursive Balanced Cut (CRBC): two simple top-down algorithms which are natural adaptations of recursive sparsest cut (Mann et al., 2008; Dasgupta, 2016) or recursive balanced cut Charikar and Chatziafratis (2017) to respect constraints (Section 2).
Infeasible instance: in this case, we turn our attention to a regularized version of HC, where the cost of violating constraints is added to the tree cost. We then propose an adaptation of CRSC, namely Hypergraph Recursive Sparsest Cut (HRSC) for the regularized problem (Section 3).
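The feasibility check in step (2) can be sketched as follows. This is a plain, unoptimized rendering of the idea behind BUILD (connect $a$ and $b$ for every triplet $ab|c$ that lies entirely inside the current cluster, then recurse on connected components); it makes no attempt to match the running time quoted above, and the function name, the `(a, b, c)` triplet encoding, and the nested-tuple output are assumptions for illustration.

```python
def build(points, triplets):
    """Return a tree (nested tuples, possibly non-binary) that satisfies every
    triplet (a, b, c), read as ab|c, or None if the triplets are inconsistent."""
    points = list(points)
    if len(points) == 1:
        return points[0]

    # Union-find over the points of the current cluster.
    parent = {p: p for p in points}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    inside = set(points)
    # Only triplets whose three points all lie in this cluster still matter.
    active = [t for t in triplets if all(x in inside for x in t)]
    for a, b, _c in active:
        parent[find(a)] = find(b)

    components = {}
    for p in points:
        components.setdefault(find(p), []).append(p)
    if len(components) == 1:
        return None  # a and b can never be separated from c: infeasible

    children = []
    for members in components.values():
        subtree = build(members, active)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


feasible = build("abcd", [("a", "b", "c"), ("c", "d", "a")])
infeasible = build("abc", [("a", "b", "c"), ("b", "c", "a")])
print(feasible, infeasible)
```

A `None` result corresponds to the infeasible branch of the workflow (regularization), while a returned tree certifies that the hard-constraint branch applies. Note that the returned tree ignores the weights, which is exactly why it is used here only as a test and not as the final clustering.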
Real-world application example.
In phylogenetics, which is the study of the evolutionary history and relationships among species, an end user usually has access to whole-genome data of a group of organisms. There are established methods in phylogeny to infer similarity scores between pairs of data points, which give the user the similarity weights $w_{ij}$. Often the user also has access to rare structural footprints of a common ancestry tree (e.g. through gene rearrangement data, gene inversions/transpositions etc., see Patané et al. (2018)). These rare, yet informative, footprints play the role of the structural constraints. The user can follow our preprocessing step to get triplet constraints from the given rare footprints, and then use Aho’s BUILD algorithm to choose between the regularized and the hard version of the HC problem. The above illustrates how to use our workflow and why using our algorithms facilitates HC when expert domain knowledge is available.
Further related work.
Similar to Vikram and Dasgupta (2016), constraints in the form of triplet queries have been used in an (adaptive) active learning framework by Tamuz et al. (2011); Emamjomeh-Zadeh and Kempe (2018), showing that approximately $O(n \log n)$ triplet queries are enough to learn an underlying HC. Other forms of user interaction in order to improve the quality of the produced clusterings have been used in Balcan and Blum (2008); Awasthi et al. (2014), where they prove that interactive feedback in the form of cluster split/merge requests can lead to significant improvements. Robust algorithms for HC in the presence of noise were studied in Balcan et al. (2014), and a variety of sufficient conditions on the similarity function that would allow linkage-style methods to produce good clusters was explored in Balcan et al. (2008). In a different setting, the notion of triplets has been used as a measure of distance between hierarchical decomposition trees on the same data points Brodal et al. (2013). More technically distant analogs of how to use relations among triplets of points have recently been proposed in Kleindessner and von Luxburg (2017) for defining kernel functions corresponding to high-dimensional embeddings.
2 Constrained Sparsest (Balanced) Cut
Given an instance of constrained hierarchical clustering, our proposed CRSC algorithm uses a black-box approximation algorithm for the sparsest cut problem (the best-known approximation factor for this problem is $O(\sqrt{\log n})$, due to Arora et al. (2009)). Moreover, it also maintains the feasibility of the solution in a top-down approach by recursive partitioning of what we call the supergraph. Informally speaking, the supergraph is a simple data structure to track the progress of the algorithm and the resolved constraints.
More formally, for every constraint $ab|c$ we merge the nodes $a$ and $b$ into a supernode $\{a,b\}$ while maintaining the edges in $G$ (now connecting to their corresponding supernodes). Note that the supergraph may have parallel edges, but this can easily be handled by grouping edges together and replacing them with the sum of their weights. We repeatedly continue this merging procedure until there are no more constraints. Observe that any feasible solution needs to start splitting the original graph $G$ by using a cut that is also present in the supergraph. When cutting the supergraph, if a constraint $ab|c$ is resolved (that is, if $c$ gets separated from $a, b$), then we can safely unpack the supernode $\{a,b\}$ into two nodes again (unless there is another constraint $ab|c'$, in which case we should keep the supernode). By continuing and recursively finding approximate sparsest cuts on the supergraphs of the two resulting parts, we can find a feasible hierarchical decomposition of $G$ respecting all triplet constraints. Next, we show the approximation guarantees for our algorithm.
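The merging step can be sketched in a few lines of Python. This is only an illustration of the contraction described above, not the authors' implementation: the function name, the `(a, b, c)` encoding of a constraint $ab|c$, and the `frozenset`-keyed weights are assumptions.

```python
from collections import defaultdict


def build_supergraph(nodes, weights, triplets):
    """Contract a and b for every active constraint ab|c and sum parallel edges.

    weights maps frozenset({u, v}) to the edge weight. Returns the set of
    supernodes (frozensets of original nodes) and the supergraph weights.
    """
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Repeated merging: chains such as ab|c and bd|e end up in one supernode.
    for a, b, _c in triplets:
        parent[find(a)] = find(b)

    members = defaultdict(set)
    for v in nodes:
        members[find(v)].add(v)
    supernode = {v: frozenset(members[find(v)]) for v in nodes}

    super_weights = defaultdict(float)
    for pair, weight in weights.items():
        u, v = tuple(pair)
        if supernode[u] != supernode[v]:
            # Parallel edges between two supernodes are replaced by their sum.
            super_weights[frozenset((supernode[u], supernode[v]))] += weight
    return set(supernode.values()), dict(super_weights)


nodes = ["a", "b", "c", "d"]
weights = {frozenset(("a", "b")): 5.0, frozenset(("a", "c")): 1.0,
           frozenset(("b", "c")): 2.0, frozenset(("c", "d")): 4.0}
supernodes, super_weights = build_supergraph(nodes, weights, [("a", "b", "c")])
print(supernodes, super_weights)
```

Edges inside a supernode are dropped from the supergraph because no feasible cut at this level can separate their endpoints; they reappear once the supernode is unpacked and the supergraph is rebuilt from the constraints that are still active in the sub-cluster.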
Analysis of CRSC Algorithm.
The main result of this section is the following theorem:
Theorem 1.
Given a weighted graph $G(V, E, w)$ with $k$ triplet constraints $ab|c$ for $a, b, c \in V$, the CRSC algorithm outputs a HC respecting all triplet constraints and achieves an $O(k\alpha_n)$ approximation for the similarity-HC objective as in (1).
Notations and Definitions.
We slightly abuse notation by having OPT denote the optimum hierarchical decomposition or its optimum value as measured by (1). Similarly for CRSC. For $t \in [n]$, OPT($t$) denotes the maximal clusters in OPT of size at most $t$. Note that OPT($t$) induces a partitioning of $V$. We use OPT($t$) to denote edges cut by OPT($t$) (i.e. edges with endpoints in different clusters in OPT($t$)) or their total weight; the meaning will be clear from context. For convenience, we define OPT($0$) $= \sum_{(i,j) \in E} w_{ij}$. For a cluster $A$ created by CRSC, a constraint $ab|c$ is active if $a, b, c \in A$, otherwise $ab|c$ is resolved and can be discarded.
Overview of the Analysis.
There are three main ingredients. The first is to view a HC of $n$ data points as a collection of partitions, one for each level $t$, as in (Charikar and Chatziafratis, 2017). For a level $t$, the partition consists of maximal clusters of size at most $t$. The total cost incurred by OPT is then a combination of costs incurred at each level of this partition. This is useful for comparing our CRSC cost with OPT. The second idea is in handling constraints, and it is the main obstacle where previous analyses Charikar and Chatziafratis (2017); Cohen-Addad et al. (2018) break down: constraints inevitably limit the possible cuts that are feasible at any level, and since the sets of active constraints differ for CRSC and OPT (all constraints are active in the beginning of CRSC), a direct comparison between them is impossible. If we have no constraints, we can charge the cost of partitioning a cluster to lower levels of the OPT decomposition. However, when we have triplet constraints, the partition induced by the lower levels of OPT in a cluster will not be feasible in general (Figure 2). The natural way to overcome this obstacle is merging pieces of this partition so as to respect constraints and using higher levels of OPT, but it still may be impossible to compare CRSC with OPT if all pieces are merged. We overcome this difficulty by an indirect comparison between the CRSC cost and lower levels of OPT determined by $k_A$, where $k_A$ is the number of active constraints in the cluster $A$ being partitioned. Finally, after a cluster-by-cluster analysis bounding the CRSC cost for each cluster, we exploit disjointness of clusters of the same level in the CRSC partition, allowing us to combine their costs.
Proof of Theorem 1.
We start by borrowing the following facts from (Charikar and Chatziafratis, 2017), modified slightly for the purpose of our analysis (proofs are provided in the supplementary materials).
Fact 1 (Decomposition of OPT).
The total cost paid by OPT can be decomposed into costs of the different levels in the OPT partition, i.e. $\mathrm{OPT} = \sum_{t=0}^{n-1} \mathrm{OPT}(t)$.
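A small numerical check of this level-by-level decomposition may help. The sketch below applies it to an arbitrary fixed tree (the identity holds for any tree, not only the optimal one), reading the level-$t$ cost as the total weight of edges whose endpoints fall in different maximal clusters of size at most $t$. That reading, the nested-tuple tree encoding, and all function names are assumptions for illustration.

```python
def leaves(tree):
    """Leaves of a nested-tuple binary tree; any non-tuple is a leaf."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def hc_cost(tree, w):
    """Objective (1): sum over weighted pairs of w_ij * |T_ij|."""
    if not isinstance(tree, tuple):
        return 0.0
    left, right = leaves(tree[0]), leaves(tree[1])
    cross = sum(w.get(frozenset((i, j)), 0.0) for i in left for j in right)
    return (len(left) + len(right)) * cross + hc_cost(tree[0], w) + hc_cost(tree[1], w)


def clusters_at_level(tree, t):
    """Maximal clusters of size at most t (single leaves are never split)."""
    members = leaves(tree)
    if len(members) <= t or not isinstance(tree, tuple):
        return [set(members)]
    return clusters_at_level(tree[0], t) + clusters_at_level(tree[1], t)


def cut_weight_at_level(tree, w, t):
    """Total weight of edges whose endpoints lie in different level-t clusters."""
    cluster_of = {}
    for index, cluster in enumerate(clusters_at_level(tree, t)):
        for v in cluster:
            cluster_of[v] = index
    return sum(weight for pair, weight in w.items()
               if len({cluster_of[v] for v in pair}) == 2)


w = {frozenset(("a", "b")): 3.0,
     frozenset(("c", "d")): 2.0,
     frozenset(("b", "c")): 1.0}
tree = (("a", "b"), ("c", "d"))
n = len(leaves(tree))
level_costs = [cut_weight_at_level(tree, w, t) for t in range(n)]
print(level_costs, sum(level_costs), hc_cost(tree, w))
```

The reason the two numbers agree is that an edge $(i,j)$ is cut at level $t$ exactly when $t < |T_{ij}|$, so it is counted $|T_{ij}|$ times across the levels $t = 0, \dots, n-1$.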
Fact 2 (OPT at scaled levels).
Let be the number of constraints. Then, .
Given the above facts, we look at any cluster of size produced by the algorithm. Here is the main technical lemma that allows us to bound the cost of CRSC for partitioning .
Lemma 1.
Suppose CRSC partitions a cluster () in two clusters (w.l.o.g. ). Let the size and let , where denotes the number of active constraints for . Then: .
Proof.
The cost incurred by CRSC for partitioning is . Now consider . This induces a partitioning of into pieces , where by design , . Now, consider the cuts . Even though all cuts are allowed for OPT, for CRSC some of them are forbidden: for example, in Figure 2, the constraints render 4 out of the 6 cuts infeasible. But how many of them can become infeasible with active constraints? Since every constraint is involved in at most 2 cuts, we may have at most infeasible cuts. Let denote the index set of feasible cuts, i.e. if , the cut is feasible for CRSC. To cut , we use an approximation of sparsest cut, whose sparsity is upper bounded by any feasible cut:
where for the last inequality we used the standard fact that for and . We also have the following series of inequalities:
where the first inequality holds because we double count some (potentially all) edges of (these are the edges cut by that are also present in cluster , i.e. they have both endpoints in ) and the second inequality holds because and
Finally, we are ready to prove the lemma by combining the above inequalities ():
It is clear that we exploited the charging to lower levels of OPT, since otherwise if all pieces in were merged, the denominator with the ’s would become 0. The next lemma lets us combine the costs incurred by CRSC for different clusters (proof is in the supplementary materials)
Lemma 2 (Combining the costs of clusters in CRSC).
The total CRSC cost for partitioning all clusters into (with ) is bounded by:
Remark 1.
In the supplementary material, we prove how one can use balanced cut, i.e. finding a cut such that
(3) 
instead of sparsest cut, and using approximation algorithms for this problem achieves the same approximation factor as in Theorem 1, but with better running time.
Remark 2.
Optimality of the CRSC algorithm: Note that complexity-theoretic lower bounds for the unconstrained version of HC from Charikar and Chatziafratis (2017) also apply to our setting; more specifically, they show that no constant-factor approximation exists for HC assuming the Small-Set Expansion Hypothesis.
Theorem 2 (The divisive algorithm using balanced cut).
Given a weighted graph $G(V, E, w)$ with $k$ triplet constraints $ab|c$ for $a, b, c \in V$, the constrained recursive balanced cut algorithm CRBC (same as CRSC, but using balanced cut instead of sparsest cut) outputs a HC respecting all triplet constraints and achieves an $O(k\alpha_n)$ approximation for Dasgupta’s HC objective. Moreover, the running time is almost linear.
3 Constraints and Regularization
Previously, we assumed that constraints were feasible. However, in many practical applications, users/experts may disagree, hence our algorithm may receive conflicting constraints as input. Here we want to explore how to still output a satisfactory HC that is good in terms of objective (1) (similarity-HC) and also respects the constraints as much as possible. To this end, we propose a regularized version of Dasgupta’s objective, where the regularizer measures quantitatively the degree by which constraints get violated.
Informally, the idea is to penalize a constraint more if it is violated at top levels of the decomposition compared to lower levels. We also allow having different violation weights for different constraints (potentially depending on the expertise of the users providing the constraints). More concretely, inspired by Dasgupta’s original objective function, we consider the following optimization problem:
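$$\min_{T \in \mathcal{T}} \; \sum_{(i,j) \in E} w_{ij}\,|T_{ij}| \;+\; \lambda \sum_{ab|c \in \mathcal{K}} c_{ab|c}\,|T_{abc}|\,\mathbb{1}\{ab|c \text{ is violated in } T\} \qquad (4)$$
where $\mathcal{T}$ is the set of all possible binary HC trees for the given data points, $\mathcal{K}$ is the set of the triplet constraints, $|T_{abc}|$ is the size of the subtree rooted at the least common ancestor of $a, b, c$, and $c_{ab|c}$ is defined as the base cost of violating triplet constraint $ab|c$. Note that the regularization parameter $\lambda \ge 0$ allows us to interpolate between satisfying the constraints or respecting the geometry of the data.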
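As a rough illustration of how such a regularized objective could be evaluated for a candidate tree, the sketch below assumes the penalty of a violated constraint $ab|c$ is its base cost times the number of leaves under the least common ancestor of $a, b, c$, scaled by $\lambda$. That form follows the description above but is an assumption, as are the nested-tuple tree encoding and all function names.

```python
def leaves(tree):
    """Leaves of a nested-tuple binary tree; any non-tuple is a leaf."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def hc_cost(tree, w):
    """Similarity-HC cost: sum over weighted pairs of w_ij * |T_ij|."""
    if not isinstance(tree, tuple):
        return 0.0
    left, right = leaves(tree[0]), leaves(tree[1])
    cross = sum(w.get(frozenset((i, j)), 0.0) for i in left for j in right)
    return (len(left) + len(right)) * cross + hc_cost(tree[0], w) + hc_cost(tree[1], w)


def violation_size(tree, triplet):
    """Leaves under LCA(a, b, c) if the tree violates ab|c, else 0.

    Assumes a, b, c are three distinct leaves of the tree.
    """
    node = tree
    while True:
        left_leaves = set(leaves(node[0]))
        on_left = [x in left_leaves for x in triplet]
        if all(on_left):
            node = node[0]
        elif not any(on_left):
            node = node[1]
        else:
            break  # node is the least common ancestor of a, b, c
    a_left, b_left, _c_left = on_left
    violated = a_left != b_left  # a and b were separated at the first split
    return len(leaves(node)) if violated else 0


def regularized_cost(tree, w, constraint_costs, lam):
    """Tree cost plus lam times the size-weighted cost of violated constraints."""
    penalty = sum(cost * violation_size(tree, triplet)
                  for triplet, cost in constraint_costs.items())
    return hc_cost(tree, w) + lam * penalty


w = {frozenset(("a", "b")): 3.0,
     frozenset(("c", "d")): 2.0,
     frozenset(("b", "c")): 1.0}
tree = (("a", "b"), ("c", "d"))
print(regularized_cost(tree, w, {("a", "c", "b"): 2.0}, lam=0.5))
print(regularized_cost(tree, w, {("a", "b", "c"): 2.0}, lam=0.5))
```

In the first call the tree violates $ac|b$ at the root, so the penalty is scaled by the full tree size; in the second call the constraint $ab|c$ is satisfied and only the clustering cost remains.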
Hypergraph Recursive Sparsest Cut
In order to design approximation algorithms for the regularized objective, we draw an interesting connection to a different problem, which we call 3-Hypergraph Hierarchical Clustering (3HHC). An instance of this problem consists of a hypergraph with (normal) edges and hyperedges of size 3, together with similarity weights for the edges and similarity weights for the 3-hyperedges. (Each 3-hyperedge $\{a,b,c\}$ has 3 different weights corresponding to the 3 possible ways of partitioning $\{a,b,c\}$ in two parts: $\{a\}|\{b,c\}$, $\{b\}|\{a,c\}$ and $\{c\}|\{a,b\}$.) We now think of HC on this hypergraph, where for every binary tree $T$ we define the cost to be the natural extension of Dasgupta’s objective:
(5) 
where the weight charged for a 3-hyperedge $\{a,b,c\}$ is one of its three split weights, and the corresponding subtree is the one rooted at LCA($a,b$), LCA($a,c$) or LCA($b,c$), all depending on how $T$ cuts the 3-hyperedge $\{a,b,c\}$ (here LCA($i,j$) denotes the lowest common ancestor of $i, j$). The goal is to find a hierarchical clustering of this hypergraph, so as to minimize the cost (5) of the tree.
Reduction from Regularization to 3HHC.
Given an instance of HC with constraints (with their costs of violations) and a parameter $\lambda$, we create a hypergraph so that the total cost of any binary clustering tree in the 3HHC problem (5) corresponds to the regularized objective of the same tree as in (4). The hypergraph has exactly the same set of vertices, (normal) edges and (normal) edge weights as in the original instance of the HC problem. Moreover, for every constraint $ab|c$ (with cost $c_{ab|c}$) it has a hyperedge $\{a,b,c\}$, to which we assign three weights: $\lambda c_{ab|c}$ for each of the two splits that violate the constraint, $\{a\}|\{b,c\}$ and $\{b\}|\{a,c\}$, and $0$ for the split $\{c\}|\{a,b\}$. Therefore, we ensure that any divisive algorithm for the 3HHC problem avoids the cost $\lambda c_{ab|c}$ only if it chops $\{a,b,c\}$ into $\{a,b\}$ and $\{c\}$ at some level, which matches the regularized objective.
Reduction from 3HHC to Hypergraph Sparsest Cut.
A natural generalization of the sparsest cut problem for our hypergraphs, which we call Hyper Sparsest Cut (HSC), is the following problem:
where is the weight of the cut and is either equal to or , depending on how chops the hyperedge . Now, similar to Charikar and Chatziafratis (2017); Dasgupta (2016), we can recursively run a black-box approximation algorithm for HSC to solve 3HHC. The main result of this section is the following technical proposition, whose proof is analogous to that of Theorem 1 (provided in the supplementary materials).
Proposition 2.
Given the hypergraph with hyperedges, and given access to an algorithm which is approximation for HSC, the Recursive Hypergraph Sparsest Cut (RHSC) algorithm achieves an approximation.
Reduction from HSC back to Sparsest Cut.
We now show how to get an approximation oracle for our instance of the HSC problem by a general reduction to sparsest cut. Our reduction is simple: given a hypergraph and all the weights, create an instance of sparsest cut with the same vertices, (normal) edges and (normal) edge weights. Moreover, for every 3-hyperedge $\{a,b,c\}$ consider adding a triangle to the graph, i.e. three weighted edges connecting $a, b, c$, where, writing $w_{a|bc}$, $w_{b|ac}$, $w_{c|ab}$ for the three split weights of the hyperedge:
$$w'_{ab} = \tfrac{1}{2}\left(w_{a|bc} + w_{b|ac} - w_{c|ab}\right), \quad w'_{ac} = \tfrac{1}{2}\left(w_{a|bc} + w_{c|ab} - w_{b|ac}\right), \quad w'_{bc} = \tfrac{1}{2}\left(w_{b|ac} + w_{c|ab} - w_{a|bc}\right)$$
This construction can be seen in Figure 3. The important observation is that $w'_{ab} + w'_{ac} = w_{a|bc}$, $w'_{ab} + w'_{bc} = w_{b|ac}$ and $w'_{ac} + w'_{bc} = w_{c|ab}$, which are exactly the weights associated with the corresponding splits of the 3-hyperedge $\{a,b,c\}$. So, correctness of the reduction follows as the weight of each cut is preserved between the hypergraph and the graph after adding the triangles. (Since all weights in the final graph are nonnegative, standard techniques for Sparsest Cut can be used.) For a discussion on extending this gadget more generally, see the supplement.
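The triangle gadget can be checked numerically. In the sketch below the three split weights of a hyperedge are passed as `w_a` (for the split that isolates $a$), `w_b`, and `w_c`; the triangle edge weights are the unique solution of the three linear equations that force every split of $\{a,b,c\}$ to cut the same weight in the graph as in the hypergraph. The function names and the specific weights used for a constraint ($\lambda$ times the base cost for the two violating splits, zero for the satisfying one) are assumptions for illustration.

```python
def triangle_weights(w_a, w_b, w_c):
    """Edge weights (w_ab, w_ac, w_bc) of the triangle replacing a 3-hyperedge.

    w_a is the weight of the split {a}|{b,c}, and similarly for w_b and w_c.
    Solves w_ab + w_ac = w_a, w_ab + w_bc = w_b, w_ac + w_bc = w_c.
    """
    w_ab = (w_a + w_b - w_c) / 2
    w_ac = (w_a + w_c - w_b) / 2
    w_bc = (w_b + w_c - w_a) / 2
    return w_ab, w_ac, w_bc


def constraint_gadget(lam, base_cost):
    """Triangle for a constraint ab|c: isolating c is free, the other splits cost."""
    return triangle_weights(lam * base_cost, lam * base_cost, 0.0)


w_ab, w_ac, w_bc = triangle_weights(3.0, 4.0, 5.0)
print(w_ab, w_ac, w_bc)
print(constraint_gadget(2.0, 1.5))
```

For a single constraint the gadget degenerates to one edge between $a$ and $b$ of weight $\lambda$ times the base cost, which is nonnegative as required; general split weights give nonnegative edges only when they satisfy the triangle inequality.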
Remark 3.
Reduction to hypergraphs: we would like to emphasize the necessity of the hypergraph version in order for the reduction to work. One might think that just adding extra heavy edges would be sufficient, but there is a technical difficulty with this approach. Consider a triplet constraint $ab|c$; once $c$ is separated from $a$ and $b$ at some level, there is no extra tendency anymore to keep $a$ and $b$ together (i.e. only the similarity weight $w_{ab}$ should play a role after this point). This behavior cannot be captured by only adding heavy-weight edges. Instead, one needs to add a heavy edge between $a$ and $b$ that disappears once $c$ is separated, and this is exactly why we need the hyperedge gadget. One can replace the reduction with a one-shot proof, but we believe it will be less modular and less transparent.
4 Variations on a Theme
In this section we study dissimilarity-HC, and we look into the problem of designing approximation algorithms for both unconstrained and constrained hierarchical clustering. Cohen-Addad et al. (2017) show that average linkage is a $1/2$ approximation for this problem, and they propose a top-down approach based on locally densest cut achieving a $(2/3 - \varepsilon)$ approximation whose running time depends on $1/\varepsilon$; notably, when $\varepsilon$ gets small the running time blows up.
Here, we prove that the most natural randomized algorithm for this problem, i.e. recursive random cutting, is a $2/3$ approximation with expected running time $O(n \log n)$. We further derandomize this algorithm to get a simple deterministic local-search style $2/3$ approximation algorithm.
If we also have structural constraints for dissimilarity-HC, we show that the existing approaches fail. In fact, we show that approaches such as recursive densest cut lead to a poor approximation factor due to their lack of “exploration”. We then show that recursive random cutting is robust to adding user constraints, and indeed it preserves a constant approximation factor when there are, roughly speaking, constantly many user constraints.
Randomized approximation.
Consider the most natural randomized algorithm for hierarchical clustering, i.e. recursively partition each cluster into two, where each point in the current cluster independently flips an unbiased coin and based on the outcome, it is put in one of the two parts.
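The procedure just described fits in a few lines. The sketch below is one way to write it, under two assumptions that are ours rather than the paper's: the output is a nested-tuple binary tree, and a round in which all points land on the same side is simply redrawn (which has the same effect as continuing to flip coins until the cluster actually splits).

```python
import random


def recursive_random_cut(points, rng):
    """Recursively split a cluster by independent fair coin flips per point."""
    points = list(points)
    if len(points) == 1:
        return points[0]
    while True:
        left, right = [], []
        for p in points:
            # Each point independently flips an unbiased coin.
            (left if rng.random() < 0.5 else right).append(p)
        if left and right:
            break  # redraw rounds that do not split the cluster
    return (recursive_random_cut(left, rng), recursive_random_cut(right, rng))


def leaves(tree):
    """Leaves of a nested-tuple binary tree; any non-tuple is a leaf."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


tree = recursive_random_cut(range(8), random.Random(0))
print(tree)
```

The algorithm never looks at the weights, which is the “exploration” behavior referred to in the text: every feasible split has a chance of being tried.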
Theorem 3.
RecursiveRandomCutting is a $2/3$ approximation for maximizing the dissimilarity-HC objective.
Proof sketch.
An alternative view of Dasgupta’s objective is to divide the reward of the clustering tree between all possible triples $\{i, j, k\}$, where $(i,j) \in E$ and $k$ is another point (possibly equal to $i$ or $j$). Now, in any hierarchical clustering tree, if at the moment right before $i$ and $j$ become separated the vertex $k$ has still been in the same cluster as $\{i,j\}$, then this triple contributes $w_{ij}$ to the objective function. We claim that, for $k$ distinct from $i$ and $j$, this event happens with probability exactly $2/3$. To see this, consider an infinite independent sequence of coin flips for $i$, $j$, and $k$. Without loss of generality, condition on $i$’s sequence to be all heads. The aforementioned event happens only if $j$’s first tails in its sequence happens no later than $k$’s first tails in its sequence. This happens with probability $\sum_{t \ge 1} (1/2)^{t} (1/2)^{t-1} = 2/3$. Therefore, the algorithm gets a total reward of at least $\frac{2n}{3} \sum_{(i,j) \in E} w_{ij}$ in expectation. Moreover, the total reward of any hierarchical clustering is upper-bounded by $n \sum_{(i,j) \in E} w_{ij}$, which completes the proof of the $2/3$ approximation. ∎
Remark 4.
This algorithm runs in time $O(n \log n)$ in expectation, due to the fact that the binary clustering tree has expected depth $O(\log n)$ (see for example Cormen et al. (2009)) and at each level we only perform $O(n)$ operations.
We now derandomize the recursive random cutting algorithm using the method of conditional expectations. At every recursion, we go over the points in the current cluster one by one, and decide whether to put them in the “left” partition or “right” partition for the next recursion. Once we make a decision for a point, we fix that point and go to the next one. Roughly speaking, these local improvements can be done in polynomial time, which will result in a simple local-search style deterministic algorithm.
Theorem 4.
There is a deterministic local-search style $2/3$ approximation algorithm for maximizing the dissimilarity-HC objective that runs in polynomial time.
Maximizing the Objective with User Constraints
From a practical point of view, one can think of many settings in which the output of the hierarchical clustering algorithm should satisfy user-defined hard constraints. Now, combining the new perspective of maximizing Dasgupta’s objective with this practical consideration raises a natural question: which algorithms are robust to adding user constraints, in the sense that a simple variation of these algorithms still achieves a decent approximation factor?
Failure of “Nonexploring” Approaches.
Surprisingly enough, there are convincing reasons that adapting existing algorithms for maximizing Dasgupta’s objective (e.g. those proposed in Cohen-Addad et al. (2018)) to handle user constraints is either challenging or hopeless. First, bottom-up algorithms, e.g. average-linkage, fail to output a feasible outcome if they only consider each constraint separately and not all the constraints jointly (as we saw in Figure 1). Second, maybe more surprisingly, the natural extension of the (locally) RecursiveDensestCut algorithm proposed in Cohen-Addad et al. (2018) to handle user constraints performs poorly in the worst case, even when we have only one constraint. (While a locally densest cut can be found in polynomial time, densest cut is NP-hard, making our negative result stronger.) RecursiveDensestCut proceeds by repeatedly picking the cut that has maximum density, i.e. $\arg\max_{S \subseteq V} \frac{w(S, \bar{S})}{|S|\,|\bar{S}|}$, and making two clusters. To handle the user constraints, we run it recursively on the supergraph generated by the constraints, similar to the approach in Section 2. Note that once the algorithm resolves a triplet constraint, it also breaks its corresponding supernode.
Now consider the following example in Figure 4, in which there is just one triplet constraint $ab|c$. The weight should be thought of as large and as small. By choosing appropriate weights on the edges of the clique , we can fool the algorithm into cutting the dense parts in the clique, without ever resolving the constraint until it is too late. The algorithm gets a gain of whereas OPT gets by starting with the removal of the edge and then removing , thus enjoying a gain of .
Constrained Recursive Random Cutting.
The example in Figure 4, although a bit pathological, suggests that a meaningful algorithm for this problem should explore cutting low-weight edges that might lead to resolving constraints, maybe randomly, with the hope of unlocking rewarding edges that were hidden before this exploration.
Formally, our approach is to show that the natural extension of recursive random cutting for the constrained problem, i.e. running it on the supergraph generated by constraints and unpacking supernodes as we resolve the constraints (in a similar fashion to CRSC), achieves a constant factor approximation when the constraints have bounded dependency. In the remainder of this section, we define an appropriate notion of dependency between the constraints, under the name of dependency measure, and analyze the approximation factor of constrained recursive random cutting (ConstrainedRRC) based on this notion.
Suppose we are given an instance of hierarchical clustering with $k$ triplet constraints $a_i b_i | c_i$, where $i \in [k]$. For any triplet constraint $ab|c$, let us call the pair $\{a,b\}$ the base, and $c$ the key of the constraint. We first partition our constraints into equivalence classes $C_1, \dots, C_N$, where $N \le k$. For every $i, j \in [k]$, the constraints $a_i b_i | c_i$ and $a_j b_j | c_j$ belong to the same class if they share the same base (see Figure 5).
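Grouping constraints by their base is a one-pass operation. The following sketch illustrates it with constraints encoded as `(a, b, c)` tuples for $ab|c$; the encoding and the function name are assumptions for illustration.

```python
from collections import defaultdict


def group_by_base(triplets):
    """Partition constraints (a, b, c), read as ab|c, by their unordered base {a, b}."""
    classes = defaultdict(list)
    for a, b, c in triplets:
        # The base is unordered, so ab|c and ba|d fall into the same class.
        classes[frozenset((a, b))].append((a, b, c))
    return dict(classes)


constraints = [("a", "b", "c"), ("b", "a", "d"), ("c", "d", "a")]
classes = group_by_base(constraints)
print(classes)
```

Each key of the returned dictionary is a base and each value lists the constraints (and hence the keys $c$) that protect that base from being separated.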
Definition 1 (Dependency digraph).
The Dependency digraph is a directed graph with vertex set . For every , there is a directed edge if , such that , and either or (see Figure 6).
The dependency digraph captures how groups of constraints impact each other. Formally, the existence of the edge implies that all the constraints in should be resolved before one can separate the two endpoints of the (common) base edge of the constraints in .
Remark 5.
If the constraints are feasible, i.e. there exists a hierarchical clustering that can respect all the constraints, the dependency digraph is clearly acyclic.
Definition 2 (Layered dependency subgraph).
Given any class , the layered dependency subgraph of is the induced subgraph in the dependency digraph by all the classes that are reachable from . Moreover, the vertex set of this subgraph can be partitioned into layers , where is the maximum length of any directed path leaving and is a subset of classes where the length of the longest path from to each of them is exactly equal to (see Figure 7).
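As a sketch of Definition 2, the layers can be computed by a longest-path pass over the part of the digraph reachable from the chosen class. The input format is an assumption: the dependency digraph is passed in as a dictionary from each class to its successors (how its edges are derived from the constraints is not reproduced here), and the function name is hypothetical. A cycle is reported as an error, in line with Remark 5.

```python
def layered_dependency_subgraph(digraph, source):
    """Return the layers of the subgraph reachable from `source`.

    digraph: dict mapping a class to an iterable of its successors
             (assumed input format). Layer k holds the classes whose
             longest directed path from `source` has exactly k edges.
    """
    order, state = [], {}  # state: 1 = on the DFS stack, 2 = finished

    def visit(u):
        state[u] = 1
        for v in digraph.get(u, ()):
            if state.get(v) == 1:
                raise ValueError("cycle found: the constraints cannot be feasible")
            if v not in state:
                visit(v)
        state[u] = 2
        order.append(u)  # post-order; reversed, it is a topological order

    visit(source)
    longest = {u: 0 for u in order}
    for u in reversed(order):
        for v in digraph.get(u, ()):
            longest[v] = max(longest[v], longest[u] + 1)

    layers = [set() for _ in range(max(longest.values()) + 1)]
    for u, k in longest.items():
        layers[k].add(u)
    return layers


digraph = {"C1": ["C2", "C3", "C4"], "C2": ["C4"], "C5": ["C1"]}
layers = layered_dependency_subgraph(digraph, "C1")
print(layers)
```

In this toy digraph, C4 is reachable from C1 both directly and through C2, so it sits in layer 2 (the longest path), and C5 is excluded because it is not reachable from C1.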
We are now ready to define a crisp quantity for every dependency graph. This will later help us give a more meaningful and refined beyond-worst-case guarantee for the approximation factor of the ConstrainedRRC algorithm.
Definition 3 (Dependency measure).
Given any class , the dependency measure of is defined as
where are the layers of the dependency subgraph of , as in Definition 2. Moreover, the dependency measure of a set of constraints is defined as , where the maximum is taken over all the classes generated by .
Intuitively speaking, the notion of the dependency measure quantitatively expresses how “deeply” the base of a constraint is protected by the other constraints, i.e. how many constraints need to be resolved first before the base of a particular constraint is unpacked and the ConstrainedRRC algorithm can enjoy its weight. This intuition is formalized through the following theorem, whose proof is deferred to the supplementary materials.
Theorem 5.
The constrained recursive random cutting (ConstrainedRRC) algorithm is an approximation algorithm for maximizing the dissimilarity-HC objective given a set of feasible constraints , where
Corollary 1.
ConstrainedRRC is an approximation for maximizing the dissimilarity-HC objective, given feasible constraints of constant dependency measure.
5 Conclusion
We studied the problem of hierarchical clustering when we have structural constraints on the feasible hierarchies. We followed the optimization viewpoint that was recently developed in Dasgupta (2016); CohenAddad et al. (2018) and we analyzed two natural top-down algorithms giving provable approximation guarantees. In the case where the constraints are infeasible, we proposed and analyzed a regularized version of the HC objective by using the hypergraph version of the sparsest cut problem. Finally, we also explored a variation of Dasgupta’s objective and improved upon previous techniques, both in the unconstrained and in the constrained setting.
Acknowledgements
Vaggos Chatziafratis was partially supported by ONR grant N000141712562. Rad Niazadeh was supported by Stanford Motwani fellowship. Moses Charikar was supported by NSF grant CCF1617577 and a Simons Investigator Award. We would also like to thank Leo Keselman, Aditi Raghunathan and Yang Yuan for providing comments on an earlier draft of the paper. We also thank the anonymous reviewers for their helpful comments and suggestions.
References
 Aho et al. [1981] Alfred V. Aho, Yehoshua Sagiv, Thomas G. Szymanski, and Jeffrey D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405–421, 1981.
 Arora et al. [2009] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2):5, 2009.
 Awasthi et al. [2014] Pranjal Awasthi, Maria Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. In International Conference on Machine Learning, pages 550–558, 2014.
 Balcan and Blum [2008] Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In International Conference on Algorithmic Learning Theory, pages 316–328. Springer, 2008.

 Balcan et al. [2008] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 671–680. ACM, 2008.
 Balcan et al. [2014] Maria-Florina Balcan, Yingyu Liang, and Pramod Gupta. Robust hierarchical clustering. The Journal of Machine Learning Research, 15(1):3831–3871, 2014.
 Berkhin [2006] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. Springer, 2006.
 Bilenko et al. [2004] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning in semisupervised clustering. In Proceedings of the twentyfirst international conference on Machine learning, page 11. ACM, 2004.
 Brodal et al. [2013] Gerth Stølting Brodal, Rolf Fagerberg, Thomas Mailund, Christian NS Pedersen, and Andreas Sand. Efficient algorithms for computing the triplet and quartet distance between trees of arbitrary degree. In Proceedings of the twentyfourth annual ACMSIAM symposium on Discrete algorithms, pages 1814–1832. Society for Industrial and Applied Mathematics, 2013.
 Charikar and Chatziafratis [2017] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the TwentyEighth Annual ACMSIAM Symposium on Discrete Algorithms, pages 841–854. Society for Industrial and Applied Mathematics, 2017.
 CohenAddad et al. [2017] Vincent CohenAddad, Varun Kanade, and Frederik MallmannTrenn. Hierarchical clustering beyond the worstcase. In Advances in Neural Information Processing Systems, pages 6202–6210, 2017.
 CohenAddad et al. [2018] Vincent CohenAddad, Varun Kanade, Frederik MallmannTrenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the TwentyNinth Annual ACMSIAM Symposium on Discrete Algorithms, pages 378–397. SIAM, 2018.
 Cormen et al. [2009] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009. ISBN 0262033844, 9780262033848.
 Dasgupta [2016] Sanjoy Dasgupta. A cost function for similaritybased hierarchical clustering. In Proceedings of the fortyeighth annual ACM symposium on Theory of Computing, pages 118–127. ACM, 2016.
 Diez et al. [2015] Ibai Diez, Paolo Bonifazi, Iñaki Escudero, Beatriz Mateos, Miguel A Muñoz, Sebastiano Stramaglia, and Jesus M Cortes. A novel brain partition highlights the modular skeleton shared by structure and function. Scientific reports, 5:10532, 2015.
 Eisen et al. [1998] Michael B Eisen, Paul T Spellman, Patrick O Brown, and David Botstein. Cluster analysis and display of genomewide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.
 EmamjomehZadeh and Kempe [2018] Ehsan EmamjomehZadeh and David Kempe. Adaptive hierarchical clustering using ordinal queries. In Proceedings of the TwentyNinth Annual ACMSIAM Symposium on Discrete Algorithms, pages 415–429. SIAM, 2018.
 Kleindessner and von Luxburg [2017] Matthäus Kleindessner and Ulrike von Luxburg. Kernel functions based on triplet comparisons. In Advances in Neural Information Processing Systems, pages 6810–6820, 2017.
 Leskovec et al. [2014] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge university press, 2014.
 Lichman [2013] Moshe Lichman. Uci machine learning repository, zoo dataset, 2013. URL http://archive.ics.uci.edu/ml/datasets/zoo.
 Mann et al. [2008] Charles F Mann, David W Matula, and Eli V Olinick. The use of sparsest cuts to reveal the hierarchical community structure of social networks. Social Networks, 30(3):223–234, 2008.

 Moseley and Wang [2017] Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems, pages 3097–3106, 2017.
 Patané et al. [2018] José SL Patané, Joaquim Martins, and João C Setubal. Phylogenomics. In Comparative Genomics, pages 103–187. Springer, 2018.

 Rangapuram and Hein [2012] Syama Sundar Rangapuram and Matthias Hein. Constrained 1-spectral clustering. In Artificial Intelligence and Statistics, pages 1143–1151, 2012.
 Roy and Pokutta [2016] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Advances in Neural Information Processing Systems, pages 2316–2324, 2016.
 Steinbach et al. [2000] Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clustering techniques. In KDD workshop on text mining, volume 400, pages 525–526. Boston, 2000.
 Tamuz et al. [2011] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 673–680. Omnipress, 2011.
 Tumminello et al. [2010] Michele Tumminello, Fabrizio Lillo, and Rosario N Mantegna. Correlation, hierarchies, and networks in financial markets. Journal of Economic Behavior & Organization, 75(1):40–58, 2010.
 Vikram and Dasgupta [2016] Sharad Vikram and Sanjoy Dasgupta. Interactive bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081–2090, 2016.
 Wagstaff and Cardie [2000] Kiri Wagstaff and Claire Cardie. Clustering with instancelevel constraints. AAAI/IAAI, 1097:577–584, 2000.
 Wagstaff et al. [2001] Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl, et al. Constrained kmeans clustering with background knowledge. In ICML, volume 1, pages 577–584, 2001.
Appendix A Supplementary Materials
A.1 Missing proofs and discussion in Section 2
Proof of Proposition 1.
For nodes , let denote the parent of in the tree and ) denote the lowest common ancestor of . For a leaf node , we say that its label is , whereas for an internal node of , we say that its label is the label of any of its two children. As long as there are any two nodes that are siblings (i.e. ), we create a constraint where is the label of the second child of . We delete leaves from the tree and repeat until there are fewer than leaves left. To see why the above procedure will only create at most constraints, notice that every time a new constraint is created, we delete two nodes of the given tree . Since has leaves and is binary, it can have at most nodes in total. It follows that we create at most triplet constraints. For the equivalence between the constraints imposed by and the created triplet constraints, observe that all triplet constraints we create are explicitly imposed by the given tree (since we only create constraints for two leaves that are siblings) and that for any three datapoints with LCA()=LCA(), our set of triplet constraints will indeed imply , because LCA() appears further down the tree than LCA() and hence become siblings before or . ∎
Proof of Fact 1 from Charikar and Chatziafratis [2017].
We will measure the contribution of an edge to the RHS and to the LHS. Suppose that denotes the size of the minimal cluster in OPT that contains both and . Then the contribution of the edge to the LHS is by definition . On the other hand, . Hence the contribution to the RHS is also . ∎
Proof of Fact 2 from Charikar and Chatziafratis [2017].
We rewrite OPT using the fact that
at every level :
∎
Proof of Lemma 2.
By using the previous lemma we have:
Observe that is a decreasing function of , since as decreases, more and more edges are getting cut. Hence we can write:
To conclude with the proof of the first part all that remains to be shown is that:
To see why this is true consider the clusters with a contribution to the LHS. We have that , hence meaning that is a minimal cluster of size , i.e. if both ’s children are of size less than , then this cluster contributes such a term. The set of all such form a disjoint partition of because of the definition for minimality (in order for them to overlap in the hierarchical clustering, one of them needs to be ancestor of the other and this cannot happen because of minimality). Since for all such forms a disjoint partition of , the claim follows by summing up over all .
Note that so far our analysis handles clusters with size . However, for clusters with smaller size we can get away by using a crude bound for bounding the total cost and still not affecting the approximation guarantee that will be dominated by :
Theorem 6 (The divisive algorithm using balanced cut).
Given a weighted graph with triplet constraints for , the constrained recursive balanced cut algorithm (same as CRSC, but using balanced cut instead of sparsest cut) outputs a HC respecting all triplet constraints and achieves an approximation for the HC objective (1).
Proof.
It is not hard to show that one can use access to balanced cut rather than sparsest cut and achieve the same approximation factor by the recursive balanced cut algorithm.
We will follow the same notation as in the sparsest cut analysis and we will use some of the facts and inequalities we previously proved about OPT(t). Again, for a cluster of size , the important observation is that the partition (at the end, we will again choose ) induced inside the cluster by OPT can be separated into two groups, let’s say such that . In other words we can demonstrate a Balanced Cut with ratio for the cluster . Since we cut fewer edges when creating compared to the partitioning of OPT:
By the fact we used an approximation to balanced cut we can get the following inequality (similarly to Lemma 1):
Finally, we have to sum up over all the clusters (now in the summation we should write instead of just , since there is dependence in ) produced by the constrained recursive balanced cut algorithm for Hierarchical Clustering and we get that we can approximate the HC objective function up to . ∎
Remark 6.
Using balanced cut can be useful for two reasons. First, the runtime of sparsest and balanced cut on a graph with nodes and edges are . When run recursively, however, as in our case, taking recursive sparsest cuts might be worse off by a factor of (in case of unbalanced splits at every step) in the worst case. However, recursive balanced cut is still . Second, it is known that an approximation for the sparsest cut yields an approximation for balanced cut, but not the other way around. This gives more flexibility to the balanced cut algorithm, and there is a chance it can achieve a better approximation factor (although we don’t study it further in this paper).
A.2 Missing proofs in Section 3
Proof sketch of Proposition 2.
Here the main obstacle is similar to the one we handled when proving Theorem 1: for a given cluster created by the RHSC algorithm, different constraints are, in general, active compared to the OPT decomposition for this cluster . Note, of course, that OPT itself will not respect all constraints, but because we don’t know which constraints are active for OPT, we still need to use a charging argument to low levels of OPT. Observe that here we are allowed to cut an edge even if we had the constraint (incurring the corresponding cost ); however, we cannot possibly hope to charge this to the OPT solution, as OPT, for all we know, may have respected this constraint. In the analysis, we crucially use a merging procedure between subclusters of having active constraints between them, and this allows us to compare the cost of our RHSC with the cost of OPT. ∎
3-hyperedges to triangles for general weights .
Even though the general reduction presented in Section 3 (Figure 3) to transform a 3-hyperedge to a triangle is valid even for general instances of HSC with 3-hyperedges and arbitrary weights, the reduced sparsest cut problem may have negative weights, e.g. when . To the best of our knowledge, sparsest cut with negative weights has not been studied. Notice however that if the original weights satisfy the triangle inequality (or as a special case, if two of them are zero, which is usually the case when we have a triplet constraint), then we can actually solve (approximately) the HSC instance, as the sparsest cut instance will only have nonnegative weights. ∎
A.3 Missing proofs in Section 4
Proof of Theorem 3.
We start by looking at the objective value of any algorithm as the summation of contributions of different triples and to the objective, where and is some other point (possibly equal to or ).
where random variable
denotes the contribution of the edge and vertex to the objective value. The vertex is a leaf of if and only if, right before the time that and get separated, is still in the same cluster as and . Therefore,
We now show that . Given this, the expected objective value of the recursive random cutting algorithm will be at least . Moreover, the objective value of the optimal hierarchical clustering, i.e. the maximizer of Dasgupta’s objective, is no more than , and we conclude that recursive random cutting is a approximation. To see why , think of randomized cutting as flipping an independent unbiased coin for each vertex, and then deciding on which side of the cut this vertex belongs based on the outcome of its coin. Look at the sequences of coin flips of , and . Our goal is to find the probability of the event that, the first time that the and sequences are not matched, either ’s sequence and ’s sequence are still matched up to this point, or ’s sequence and ’s sequence are still matched up to this point. The probability of each of these events is equal to . To see this for the first event, suppose ’s sequence is all heads (). We then need the pair of coin flips of to be a sequence of ’s ending with a , and this happens with probability . The probability of the second event is calculated similarly. Now, these events are disjoint. Hence, the probability that is separated from no earlier than is exactly , as desired. ∎
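The coin-flip computation in the proof above can be written out as a short geometric series. The symbol names are assumptions introduced for this sketch: i and j are the endpoints of the edge, k is a third point distinct from both, A is the event that k's sequence still matches i's when i and j first differ, and B is the event that it still matches j's. As in the proof, i's sequence is taken to be all heads.

```latex
% Assumed names: i, j = endpoints of the edge, k = third point (k \ne i, j).
% In every round before i and j split, the pair (j, k) must flip (H, H),
% which has probability 1/4; the terminating round has probability 1/4 too.
\begin{align*}
\Pr[A] &= \sum_{t \ge 1} \Bigl(\tfrac{1}{4}\Bigr)^{t-1} \cdot \tfrac{1}{4}
        = \frac{1/4}{1 - 1/4} = \frac{1}{3}
        && \text{(terminating round: } (j,k) = (T,H)\text{)}, \\
\Pr[B] &= \sum_{t \ge 1} \Bigl(\tfrac{1}{4}\Bigr)^{t-1} \cdot \tfrac{1}{4}
        = \frac{1}{3}
        && \text{(terminating round: } (j,k) = (T,T)\text{)}, \\
\Pr[A \cup B] &= \Pr[A] + \Pr[B] = \frac{2}{3}
        && \text{(} A \text{ and } B \text{ are disjoint)}.
\end{align*}
```

When the third point coincides with one of the endpoints, the triple always contributes, so 2/3 is a lower bound on the per-triple probability.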
Proof of Theorem 4.
We derandomize the recursive random cutting algorithm using the method of conditional expectations. At every recursion, we go over the points in the current cluster one by one, and decide whether to put them in the “left” partition or “right” partition for the next recursion. Once we make a decision for a point, we fix that point and go to the next one. Now suppose for a cluster we have already fixed points , and now we want to make a decision for . The reward of assigning to left(right) partition is now defined as the expected value of recursive random cutting restricted to , when the points in are fixed (i.e. it is already decided which points in are going to the left partition and which ones are going to the right partition), goes to the left(right) partition and are randomly assigned to either the left or right. Note that these two rewards (or the difference of the two rewards) can be calculated exactly in polynomial time by considering all triples consisting of an edge and another vertex, and then calculating the probability that this triple contributes to the objective function (this is similar to the proof of Theorem 3, and we omit the details for brevity here). Because we know the randomized assignment of gives a approximation (Theorem 3), we conclude that assigning to the better of left or right partition for every vertex will remain to be at least a approximation. For running time, we have at most clusters to investigate. Moreover, a careful counting argument shows that the total number of operations required to calculate the differences of the rewards of assigning to left and right partitions for all vertices is at most . Hence, the running time is bounded by . ∎
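For reference, the randomized procedure that this proof derandomizes can be sketched as follows. This is only the randomized version, not the conditional-expectations variant; the function name, the `weights` dictionary keyed by unordered pairs, and the nested-tuple tree encoding are assumptions made for illustration. The objective is accumulated as the weight of each cut edge times the size of the cluster in which it is cut.

```python
import random


def recursive_random_cut(points, weights, rng):
    """Return (tree, objective) for one run of recursive random cutting.

    points:  list of hashable point labels.
    weights: dict mapping frozenset({u, v}) to a nonnegative dissimilarity
             (assumed input format; missing pairs count as weight 0).
    rng:     a random.Random instance supplying the coin flips.
    """
    points = list(points)
    if len(points) == 1:
        return points[0], 0.0
    # Flip a fair coin per point; re-flip if one side comes out empty.
    while True:
        left, right = [], []
        for p in points:
            (left if rng.random() < 0.5 else right).append(p)
        if left and right:
            break
    # Each edge cut here is separated inside a cluster of size len(points).
    cut_weight = sum(weights.get(frozenset((u, v)), 0.0)
                     for u in left for v in right)
    gain = cut_weight * len(points)
    left_tree, left_gain = recursive_random_cut(left, weights, rng)
    right_tree, right_gain = recursive_random_cut(right, weights, rng)
    return (left_tree, right_tree), gain + left_gain + right_gain


rng = random.Random(0)
pts = list(range(6))
w = {frozenset((u, v)): 1.0 for u in pts for v in pts if u < v}
tree, objective = recursive_random_cut(pts, w, rng)
print(tree, objective)
```

Any output tree has objective between twice the total weight (every edge is cut in a cluster of at least two points) and n times the total weight, which is the upper bound on the optimum used in the proof of Theorem 3.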
Proof sketch of Theorem 5.
Before starting to prove the theorem, we prove the following simple lemma.
Lemma 3.
There is no edge between any two classes in the same layer .
Proof of Lemma 3.
If such an edge exists, then there is a path of length from to a class in , a contradiction. ∎
Now, similar to the proof of Theorem 3, we consider every triple , where and is another point , but this time we only consider ’s that are not involved in any triplet constraint (there are at least such points). We claim with probability at least the supernode containing is still in the same cluster as supernodes containing and right before and gets separated. By summing over all such triples, we show that the algorithm gets a gain of at least , which proves the approximation as the optimal clustering has a reward bounded by .
To prove the claim, if is not the base of any triplet constraint then an argument similar to that in the proof of Theorem 3 shows the desired probability is exactly (with a slight adaptation, i.e. by looking at the coin sequences of supernodes containing and , which are going to be disjoint in this case at all iterations, and the coin sequence of ). Now suppose is the base of any constraint and suppose belongs to a class . Consider the layered dependency subgraph of as in Definition 2 and let the layers be . In order for to be in the same cluster as and when they get separated, a chain of independent events needs to happen. These events are defined inductively; for the first event, consider the coin sequence of , coin sequence of (the supernode containing all the bases of) constraints in and coin sequences of all the keys of constraints in (there are of them). Without loss of generality, suppose the coin sequence of (the supernode containing) is all heads. Now the event happens only if at the time flips its first tails all keys of have already flipped at least one tails. Conditioned on this event happening, all the constraints in will be resolved and remains in the same cluster as and . Now, remove from the dependency subgraph and repeat the same process to define the events in a similar fashion. For the event to happen, we need to look at number of i.i.d. symmetric geometric random variables, and calculate the probability that the first of them is no smaller than the rest. This event happens with a probability at least . Moreover the events are independent, as there is no edge between any two classes in for , and different classes have different keys. After these events, the final event that needs to happen is when all the constraints are unlocked, and needs to remain in the same cluster as and at the time they get separated. This event happens with probability . Multiplying all of these probabilities due to independence implies the desired approximation factor. ∎
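The step about i.i.d. symmetric geometric random variables can be checked numerically. The sketch below is an illustration under stated assumptions, not the bound used in the proof: it takes m + 1 variables, each equal to t with probability 2^(-t) (the round of a fair coin's first tails), and computes the probability that the first is no smaller than the other m. By symmetry this probability is at least 1/(m + 1), since some variable always attains the maximum. The name `prob_first_is_max` and the truncation length are choices made here.

```python
from fractions import Fraction


def prob_first_is_max(m, terms=200):
    """Partial sum of P[X_0 >= max(X_1, ..., X_m)] for i.i.d. geometric X.

    Assumes P[X = t] = 2**-t for t >= 1. Conditioning on X_0 = t, each of
    the other m variables is at most t with probability 1 - 2**-t.
    All terms are positive, so the truncated sum is a lower bound.
    """
    half = Fraction(1, 2)
    return sum(half**t * (1 - half**t) ** m for t in range(1, terms + 1))


for m in range(1, 5):
    p = prob_first_is_max(m)
    print(m, float(p), ">=", float(Fraction(1, m + 1)))
```

For m = 1 the value is 2/3, the same probability that appears in the proof of Theorem 3.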
Appendix B Experiments
The purpose of this section is to present the benefits of incorporating triplet constraints when performing Hierarchical Clustering. We will focus on real data using the Zoo dataset (Lichman [2013]) for a taxonomy application. We demonstrate that using our approach, the performance of simple recursive spectral clustering algorithms can be improved by approximately as measured by the Dasgupta’s Hierarchical Clustering cost function (1). More specifically:

The Zoo dataset: It contains 100 animals forming 7 different categories (e.g. mammals, amphibians etc.). The features of each animal are provided by a 16-dimensional vector containing information such as whether the animal has hair or feathers.

Evaluation method: Given the feature vectors, we can create a similarity matrix indexed by the labels of the animals. We choose the widely used cosine similarity to create this matrix.

Algorithms: We use a simple implementation of spectral clustering based on the second eigenvector of the normalized Laplacian of the similarity matrix. By applying the spectral clustering algorithm once, we can create two clusters; by applying it recursively we can create a complete hierarchical decomposition, which is ultimately the output of the HC algorithm.

Baseline comparison: Since triplet constraints are especially useful when there is noisy information (i.e. noisy features), we simulate this situation by hiding some of the features of our Zoo dataset. Specifically, when we want to find the target HC tree, we use the full 16-dimensional feature vectors, but for the comparison between the unconstrained and the constrained HC algorithms we use a noisy version of the feature vectors, which consists of only the first 10 coordinates of every vector.
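The unconstrained pipeline described above (cosine similarity, then recursive bisection by the second eigenvector of the normalized Laplacian) can be sketched with NumPy. This is a minimal sketch under assumptions, not the implementation used in the experiments: the function names are hypothetical, features are assumed nonnegative so that similarities are nonnegative, the split is taken by the sign of the eigenvector, and a degenerate split falls back to halving the cluster.

```python
import numpy as np


def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.where(norms > 0, norms, 1.0)
    return U @ U.T


def spectral_bisect(W):
    """Split indices 0..n-1 by the sign of the second eigenvector of the
    normalized Laplacian I - D^{-1/2} W D^{-1/2}."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)  # ignore self-similarity
    n = len(W)
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    d_inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    side = vecs[:, 1] >= 0
    if side.all() or not side.any():  # degenerate split: halve the cluster
        side = np.arange(n) < n // 2
    return np.flatnonzero(side), np.flatnonzero(~side)


def recursive_spectral_hc(W, items=None):
    """Build a binary tree (nested tuples) by recursive spectral bisection."""
    W = np.asarray(W, dtype=float)
    items = list(range(len(W))) if items is None else items
    if len(items) == 1:
        return items[0]
    a, b = spectral_bisect(W)
    return (recursive_spectral_hc(W[np.ix_(a, a)], [items[i] for i in a]),
            recursive_spectral_hc(W[np.ix_(b, b)], [items[i] for i in b]))


X = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
tree = recursive_spectral_hc(cosine_similarity(X))
print(tree)
```

On this toy input the first cut separates points 0 and 1 from points 2 and 3. A constrained variant would additionally have to reject cuts that violate the given triplet constraints, which this sketch does not attempt.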
In more detail, the first step in our experiments is to evaluate the cost of the target clustering . For this, we use the full feature vectors and perform repeated spectral clustering to get a hierarchical decomposition (without incorporating any constraints). We call this cost OPT.
The second step is to perform unconstrained HC with noisy information, i.e. to run the spectral clustering algorithm repeatedly on the 10-dimensional feature vectors (again without taking into account any triplet constraints). This outputs a hierarchical tree whose cost, in terms of Dasgupta’s HC cost, we call Unconstrained_Noisy_Cost. (Footnote 11: The costs of the trees are always evaluated using the actual similarities obtained from the full feature vectors.)
The final step is to choose some structural constraints that are valid in the target tree and again perform HC with noisy information. (Footnote 12: Here we chose triplet constraints that induce the same first cut as the target tree and required no constraints after that. This corresponds to a high-level separation of the animals, for example into “land” animals versus “water” animals.) We again use the 10-dimensional feature vectors, but the spectral clustering algorithm is allowed only cuts that do not violate any of the given structural constraints. Repeating until we get a full decomposition gives the final output, whose cost in terms of Dasgupta’s HC cost we call Constrained_Noisy_Cost.
The first main result of our experimental evaluation is that Constrained_Noisy_Cost is surprisingly close to OPT, even though the features used to obtain it were noisy. The second main result is that incorporating the structural constraints yields an improvement over the noisy unconstrained version of HC, whose cost is Unconstrained_Noisy_Cost. Now that we have presented the experimental setup, we describe our results and final observations in greater depth.
B.1 Experimental Results
We ran our experiments for 20, 50, 80 and 100 animals from the Zoo dataset and, to evaluate the improvement in terms of Dasgupta’s HC cost (1), we used the following formula:
The improvements obtained due to the constrained version are presented in Table 1.
Animals   OPT      Unconstrained_Noisy_Cost   Constrained_Noisy_Cost   Improvement (%)
20        1137     1286                       1142                     12.63
50        23088    25216                      23443                    7.68
80        89256    99211                      90419                    9.85
100       171290   190205                     173499                   9.75
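As a sanity check on Table 1, the sketch below recomputes the improvement column under an assumed reading of the improvement formula: the cost reduction from adding constraints, normalized by OPT and expressed in percent. This reading is an assumption, chosen because it reproduces the reported values for 50, 80 and 100 animals to two decimals; for 20 animals it gives 12.66 rather than the reported 12.63. The function name is hypothetical.

```python
def improvement_percent(opt, unconstrained, constrained):
    """Assumed formula: 100 * (Unconstrained - Constrained) / OPT."""
    return 100.0 * (unconstrained - constrained) / opt


# (animals, OPT, Unconstrained_Noisy_Cost, Constrained_Noisy_Cost, reported)
rows = [
    (20, 1137, 1286, 1142, 12.63),
    (50, 23088, 25216, 23443, 7.68),
    (80, 89256, 99211, 90419, 9.85),
    (100, 171290, 190205, 173499, 9.75),
]
for animals, opt, unc, con, reported in rows:
    print(animals, round(improvement_percent(opt, unc, con), 2), reported)
```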
Some observations regarding the structural constraints are the following:

When we add triplet constraints to the input as advice for the algorithm, it is crucial for the triplet constraints to actually be useful. “Easy” constraints that are readily implied by the similarity scores will have no extra use and will not lead to better solutions.

We also observed that having “nested” constraints can be really useful. Nested constraints can guide our algorithm to perform good cuts as they refer to a larger portion of the optimum tree (i.e. contiguous subtrees) rather than just different unrelated subtrees of it. The usefulness of the given constraints is correlated with the depth of the nested constraints and their accordance with the optimum tree based on Dasgupta’s objective.

Furthermore, since most of the objective cost comes from the large initial clusters, we focused on the partitions that created large clusters and imposed triplet constraints that ensured good cuts in the beginning. Actually in some cases, just the first 3 or 4 cuts are enough to guarantee that we get improvement.

Finally, we conclude that the number of given triplet constraints alone may not be a good metric for their usefulness. For example, a large number of constraints referring to wildly different parts of the target tree may end up being much less useful than a smaller number of constraints guiding towards a good first cut.