Ultrametric Fitting by Gradient Descent

by   Giovanni Chierchia, et al.

We study the problem of fitting an ultrametric distance to a dissimilarity graph in the context of hierarchical cluster analysis. Standard hierarchical clustering methods are specified procedurally, rather than in terms of the cost function to be optimized. We aim to overcome this limitation by presenting a general optimization framework for ultrametric fitting. Our approach consists of modeling the latter as a constrained optimization problem over the continuous space of ultrametrics. So doing, we can leverage the simple, yet effective, idea of replacing the ultrametric constraint with an equivalent min-max operation injected directly into the cost function. The proposed reformulation leads to an unconstrained optimization problem that can be efficiently solved by gradient descent methods. The flexibility of our framework allows us to investigate several cost functions, following the classic paradigm of combining a data fidelity term with a regularization. While we provide no theoretical guarantee to find the global optimum, the numerical results obtained over a number of synthetic and real datasets demonstrate the good performance of our approach with respect to state-of-the-art agglomerative algorithms. This makes us believe that the proposed framework sheds new light on the way to design a new generation of hierarchical clustering methods.



There are no comments yet.


page 8


A New Cost Function for Hierarchical Cluster Trees

Hierarchical clustering has been a popular method in various data analys...

A Unified Framework for Compositional Fitting of Active Appearance Models

Active Appearance Models (AAMs) are one of the most popular and well-est...

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Similarity-based Hierarchical Clustering (HC) is a classical unsupervise...

A Function Fitting Method

In this article we present a function fitting method, which is a convex ...

Stochastic Gradient Descent for Spectral Embedding with Implicit Orthogonality Constraint

In this paper, we propose a scalable algorithm for spectral embedding. T...

Multilayer Graph Clustering with Optimized Node Embedding

We are interested in multilayer graph clustering, which aims at dividing...

Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes

In artificial neural networks, learning from data is a computationally d...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Ultrametrics provide a natural way to describe a recursive partitioning of data into increasingly finer clusters, also known as hierarchical clustering Murtagh2012. Ultrametrics are intuitively represented by dendrograms, i.e., rooted trees whose leaves correspond to data points, and whose internal nodes represent the clusters of its descendant leaves. In topology, this corresponds to a metric space in which the usual triangle inequality is strengthened by the ultrametric inequality, so that every triple of points forms an isosceles triangle, with the two equal sides at least as long as the third side. The main question investigated in this article is: << How well can we construct an ultrametric to fit the given dissimilarity data? >> This is what we refer to as ultrametric fitting.

Ultrametric fitting can be traced back to the early work on numerical taxonomy Sneath1962 in the context of phylogenetics Felsenstein2003. Several well-known algorithms originated in this field, such as single linkage Gower1969, average linkage Jardine1968, and Ward method Ward1963

. Nowadays, there exists a large literature on ultrametric fitting, which can be roughly divided in four categories: agglomerative and divisive greedy heuristics

deAmorim2015; Ackerman2016; Dasgupta2016; kobren2017; CohenAddadNIPS2017; Chehreghani2018; Bonald2018

, integer linear programming

YarkonyNIPS2015; DISUMMA2015; RoyNIPS2016, continuous relaxations DeSoete1984; Ailon2011; Charikar2017; monath2017, and probabilistic formulations Hartigan1985; Neal2003; vikram2016. Our work belongs to the family of continuous relaxations.

The most popular methods for ultrametric fitting probably belong to the family of agglomerative heuristics. They follow a bottom-up approach, in which the given dissimilarity data are sequentially merged through some specific strategy. But since the latter is specified procedurally, it is usually hard to understand the objective function being optimized. In this regard, several recent works

Dasgupta2016; RoyNIPS2016; Charikar2017; CohenAddadNIPS2017; CohenAddad2018 underlined the importance to cast ultrametric fitting as an optimization problem with a well-defined cost function, so as to better understand how the ultrametric is built.

Recently, Dasgupta Dasgupta2016 introduced a cost function for evaluating an ultrametric, and proposed an heuristic to approximate its optimal solution. The factor of this approximation was later improved by several works, based on a linear programming relaxation RoyNIPS2016, a semidefinite programming relaxation Charikar2017, or a recursive -sparsest cut algorithm CohenAddad2018. Along similar lines, it was shown that average linkage provides a good approximation of the optimal solution to Dasgupta cost function Moseley2017; Charikar2019. Closer to our approach, a differentiable relaxation inspired by Dasgupta cost function was also proposed monath2017. Moreover, a regularization for Dasgupta cost function was formulated in the context of semi-supervised clustering vikram2016; Chatziafratis2018, based on triplet constraints provided by the user.

More generally, the problem of finding the closest ultrametric to dissimilarity data was extensively studied through linear programming relaxations Ailon2011 and integer linear programming DISUMMA2015. A special case of interest arises when the dissimilarities are specified by a planar graph, which is a natural occurrence in image segmentation MALIS2009; ArbelaezPAMI2011; ManinisPAMI2018; DEEPMALIS. By exploiting the planarity of the input graph, a tight linear programming relaxation can be derived from the minimum-weight multi-cut problem YarkonyNIPS2015. There exist many other continuous relaxations of discrete problems in the specific context of image segmentation Ishikawa2003; PockECCV2008; PockCVPR2009; PockECCV2009; MollenhoffCVPR2016; Foare2018, but they typically aim at a flat representation of data, rather than hierarchical.

Contribution. We propose a general optimization framework for ultrametric fitting based on gradient descent. Our approach consists of optimizing a cost function over the continuous space of ultrametrics, where the ultrametricity constraint is implicitly enforced by an equivalent min-max operation. We demonstrate the versatility of our approach by investigating several cost functions:

  1. the closest-ultrametric fidelity term, which expresses that the fitted ultrametric should be close to the given dissimilarity graph;

  2. the cluster-size regularization, which penalizes the presence of small clusters in the upper levels of the associated hierarchical clustering;

  3. the triplet

    regularization term for semi-supervised learning, which aims to minimize the intra-class distance and maximize the inter-class distance;

  4. the Dasgupta fidelity term, which is a continuous relaxation of Dasgupta cost function expressing that the fitted ultrametric should associate large dissimilarities to large clusters.

We devise efficient algorithms with automatic differentiation in mind, and we show that they scale up to millions of vertices on sparse graphs. Finally, we evaluate the proposed cost functions on synthetic and real datasets, and we show that they perform as good as Ward method and semi-supervised SVM.

2 Ultrametric fitting

(a) Ultrametric


(b) Dendrogram

(c) Graph
Figure 1: Ultrametric on given by the dissimilarity matrix (a), and represented by the dendrogram (b) and the graph (c). Two elements and merge at the altitude in the dendrogram. The distance between and is given by the lowest common ancestor (l.c.a.) of the leaves and . For example, the l.c.a. of and is the node at altitude , hence ; the l.c.a. of and is the node at altitude , hence . The graph (c) leads to the ultrametric (a) via the min-max distance (3). For example, all the paths from to contain an edge of weight which is maximal, and thus .

Central to this work is the notion of ultrametric, a special kind of metric that is equivalent to hierarchical clustering. Formally, an ultrametric is a metric on a space in which the triangle inequality is strengthen by the ultrametric inequality, defined as


The notion of ultrametric can also be defined on a connected (non-complete) graph with non-negative edge weights , where denotes the space of functions from to . In this case, the distance is only available between the pairs of vertices in , and the ultrametric constraint must be defined over the set of cycles of as follows:


Note that an ultrametric on can be extended to all the pairs of vertices in through the min-max distance on , which is defined as


where denotes the set of all paths between the vertices and of . This observation allows us to compactly represent ultrametrics as weight functions on sparse graphs , instead of more costly pairwise distances. Figure 1 shows an example of ultrametric and its possible representations.


The dendrogram associated to an ultrametric on is denoted by  CarlssonJMLR2010. It is a rooted tree whose leaves are the elements of . Each tree node is the set composed by all the leaves of the sub-tree rooted in . The altitude of a node , denoted by , is the maximal distance between any two elements of : i.e., . The size of a node , denoted by , is the number of leaves contained in . For any two leaves and , the lowest common ancestor of and , denoted , is the smallest node of containing both and .

1:Graph with edge weights
3:for  do
4:      gradient of evaluated at
5:      update of using
Algorithm 1 Solution to the ultrametric fitting problem defined in (4).

2.1 Optimization framework

Our goal is to find the ultrametric that "best" represents the given edge-weighted graph. We propose to formulate this task as a constrained optimization problem involving an appropriate cost function defined on the (continuous) space of distances , leading to


The ultrametricity constraint is highly nonconvex and cannot be efficiently tackled with standard optimization algorithms. We circumvent this issue by replacing the constraint with an equivalent operation injected directly into the cost function. The idea is that the ultrametricity constraint can be enforced implicitly through the operation that computes the subdominant ultrametric, defined as the largest ultrametric below the given dissimilarity function. One way to compute the subdominant ultrametric is through the min-max operator defined by


where is defined as in (3). Then, Problem (4) can be rewritten as


Since the min-max operator is sub-differentiable (see (15) in Section 3), the above problem can be optimized by gradient descent, as long as is sub-differentiable. This allows us to devise Algorithm 1.

Note that the mix-max operator already proved useful in image segmentation to define a structured loss function for end-to-end supervised learning

MALIS2009; DEEPMALIS. The goal was however to find a flat segmentation rather than a hierarchical one. To the best of our knowledge, we are the first to use the mix-max operator within an optimization framework for ultrametric fitting.

2.2 Cost functions

Closest ultrametric

A natural goal for ultrametric fitting is to find the closest ultrametric to the given dissimilarity graph. This task fits nicely into Problem (4) by setting the cost function to the sum of squared errors between the sought ultrametric and the edge weights of the given graph, namely


Although the exact minimization of this cost function is a NP-hard problem kvrivanek1988complexity, the proposed optimization framework allows us to compute an approximate solution. Figure 2 shows the ultrametric computed by Algorithm 1 with for an illustrative example of hierarchical clustering.

A common issue with the closest ultrametric is that small clusters might branch very high in the dendrogram. This is also true for average linkage and other agglomerative methods. Such kind of ultrametrics are undesirable, because they lead to partitions containing very small clusters at large scales, as clearly shown in Figures 1(b)-1(c). We now present two approaches to tackle this issue.

Cluster-size regularization

To fight against the presence of small clusters at large scales, we need to introduce a mechanism that pushes down the altitude of nodes where such incorrect merging occurs. This can be easily translated in our framework, as the altitude of a node corresponds to the ultrametric distance between its children. Specifically, we penalize the ultrametric distance proportionally to some non-negative coefficients that depend on the corresponding nodes in the dendrogram, yielding


Here above, the coefficients play an essential role: they must be small for the nodes that need to be pushed down, and large otherwise. We thus rank the nodes by the size of their smallest child, that is


where denotes the children of a node in the dendrogram associated to the ultrametric on . Figure 1(d) shows the ultrametric computed by Algorithm 1 with . The positive effect of this regularization can be appreciated by observing that small clusters are no longer branched very high in the dendrogram.

(a) Graph/Labels
(b) Average link
(c) Closest
(d) Closest+Size
(e) Closest+Triplet
(f) Dasgupta
Figure 2: Illustrative examples of hierarchical clustering. Top row: Ultrametrics fitted to the input graph; only the top-30 non-leaf nodes are shown in the dendrograms (all the others are contracted into leaves). Bottom row: Assignments obtained by thresholding the ultrametrics at three clusters. The detrimental effect of having "small clusters at large scales" can be observed in (b) and (c).

Triplet regularization

Triplet constraints vikram2016; Chatziafratis2018 provide an alternative way to penalize small clusters at large scales. Like in semi-supervised classification, we assume that the labels of some data points are known, and we build a set of triplets according to the classes they belong to:


These triplets provide valuable information on how to build the ultrametric. Intuitively, we need a mechanism that reduces the ultrametric distance within the classes, while increasing the ultrametric distance between different classes. This can be readily expressed in our framework with a regularization acting on the altitude of nodes containing the triplets, leading to


Here above, the constant represents the minimum prescribed distance between different classes. Figure 1(e) shows the ultrametric computed by Algorithm 1 with .

Dasgupta cost function

Dasgupta’s objective function Dasgupta2016 has recently gained traction in the seek of a theoretically grounded framework for hierarchical clustering RoyNIPS2016; CohenAddadNIPS2017; RoyJMLR2017; CohenAddad2018; Chatziafratis2018. However its minimization is known to be a NP-hard problem Dasgupta2016. The intuition behind this function is that large clusters should be associated to large dissimilarities. The idea it then to minimize, for each edge , the size of the dendrogram node associated to divided by the weight of , yielding


However, we cannot directly use (12) in our optimization framework, as the derivative of with respect to the underlying ultrametric is equal to 0 almost everywhere. To solve this issue, we propose a soft-cardinal measure of a node that is differentiable w.r.t. the associated ultrametric . Let be a node of the dendrogram , and let be a leaf of the sub-tree rooted in . We observe that the cardinal of is equal to the number of vertices such that the ultrametric distance between and is strictly lower than the altitude of , namely


where is the Heaviside function. By replacing

with a continuous approximation, such as a sigmoid function, we provide a soft-cardinal measure of a node of

that is differentiable with respect to the ultrametric . Figure 1(f) shows the ultrametric computed by Algorithm 1 with .

Note that a differentiable cost function inspired by Dasgupta cost function was proposed in monath2017. This function replaces the node size by a parametric probability measure which is optimized over a fixed tree. This is fundamentally different from our approach, where the proposed measure is a continuous relaxation of the node size, and it is directly optimized over the ultrametric distance.

(a) Single linkage clustering
(b) Jacobian
Figure 3: Each node of the single linkage clustering (in blue) of the graph (in grey) is canonically associated (green dashed arrows) to an edge of a minimum spanning tree of the graph (thick edges): this edge is the pass edge between the leaves of the two children of the node. Edges are numbered from 1 to (number of edges). The -th column of the Jacobian matrix of the

operator is equal to the indicator vector denoting the pass edge holding the maximal value of the min-max path between the two extremities of the

-th edge. The pass edge can be found efficiently in the single linkage clustering using the l.c.a. operation and the canonical link between the nodes and the m.s.t. edges. For example, the l.c.a. of the vertices and linked by the 4-th edge is the node , which is canonically associated to the 6-th edge (): we thus have .

3 Algorithms

In this section, we present a general approach to efficiently compute the various terms appearing in the cost functions introduced earlier. All the proposed algorithms rely on some properties of the single linkage (agglomerative) clustering, which is a dual representation of the subdominant ultrametric. We perform a detailed analysis of the algorithm used to compute the subdominant ultrametric; the other algorithms can be found in the supplemental material.

Single-linkage clustering can be computed similarly to a minimum spanning tree (m.s.t.) with Kruskal’s algorithm, by sequentially processing the edges of the graph in non decreasing order, and merging the clusters located at the extremities of the edge, when a m.s.t. edge is found. One consequence of this approach is that each node of the dendrogram representing the single linkage clustering is canonically associated to an edge of the m.s.t. (see Figure 2(a)), which is denoted by .

In the following, we assume that we are working with a sparse graph , where . The number of vertices (resp. edges) of is denoted by (resp. ). For the ease of writing, we denote the edge weights as vectors of . The dendrogram corresponding to the single-linkage clustering of the graph with edge weights is denoted by .

Subdominant ultrametric

To obtain an efficient and automatically differentiable algorithm for computing the subdominant ultrametric, we observe that the min-max distance between any two vertices is given by the weight of the pass edge between and . This is the edge holding the maximal value of the min-max path from to , and an arbitrary choice is made if several pass edges exist. Moreover, the pass edge between and corresponds to the l.c.a. of and in the single linkage clustering of (see Figure 2(a)). Equation (5) can be thus rewritten as


The single-linkage clustering can be computed in time with a variant of Kruskal’s minimum spanning tree algorithm Gower1969; NajmanISMM2013. Then, a fast algorithm allows us to compute the l.c.a. of two nodes in constant time , thanks to a linear time preprocessing of the tree Bender2000. The subdominant ultrametric can thus be computed in time with Algorithm 2.

1:Graph with edge weights
2: for each
3: withGower1969; NajmanISMM2013
4:preprocess l.c.a. on with Bender2000
5:for each edge  do
6:      with Bender2000
7:      see Figure 2(a)
Algorithm 2 Subdominant ultrametric operator defined in (5).

Note that Algorithm 2

can be interpreted as a special max pooling applied to the input tensor

, and can be thus automatically differentiated. Indeed, a sub-gradient of the min-max operator at a given edge is equal to 1 on the pass edge between and and 0 elsewhere. Then, the Jacobian of the min-max operator can be written as the matrix composed of the indicator column vectors giving the position of the pass edge associated to the extremities of each edge in :


where denotes the index of the pass edge between the two extremities of the -th edge, and is the column vector of equals to 1 in position , and 0 elsewhere (see Figure 2(b)).

Cluster-size regularization

The cost function in (8) can be implemented through the same strategy used in Algorithm 2, leading to a time complexity in . See supplemental material.

Triplet regularization

Thanks to Equation (14), the cost function in (11) can be written as


This can be implemented with a time complexity in . See supplemental material.

Dasgupta cost function

The soft cardinal of a node of the tree as defined in (13) raises two issues: the arbitrary choice of the reference vertex , and the quadratic time complexity of a naive implementation. One way to get rid of the arbitrary choice of is to use the two extremities of the edge canonically associated to . To efficiently compute (13), we can notice that, if and are the children of , then the pass edge between any element of and any element of is equal to edge associated to . This allows us to rewrite (13) as


where denotes the set of ancestors of , and is the child of that does not contain . The time complexity to evaluate (17) is dominated by the sum over the ancestors of which, in the worst case, is in the order of , leading also to worst case time complexity of to compute the soft cardinal of all the nodes. In practice, dendrograms are usually well balanced, and thus the number of ancestors of a node is in the order of , yielding an empirical complexity in .

4 Experiments

Figure 4: Test image, its gradient, and superpixel contour weights with 525, 1526, and 4434 edges.
(a) Mean square error
(b) Computation time
(c) Computation time per iteration
Figure 5: Validation and computation time. Figureas (a) and (b): comparison between the CUCP algorithm YarkonyNIPS2015 and the proposed gradient descent approach. For CUCP we tested different numbers of hierarchy levels (5, 10, 20 40) distributed evenly over the range of the input dissimilarity function. Figure (a) shows the final mean square error (normalized against CUCP 40) w.r.t. the number of edges in the tested graph. Figure (b) shows the run-time compared w.r.t. the number of edges in the tested graph (CUCP was capped at 200 seconds per instance). Figure (c) shows the computation time of the tested cost functions (one iteration of Algorithm 1) with respect to the number of edges in the graph.

Framework validation

As Problem (6) is non-convex, there is no guarantee that the gradient descent method will find the global optimum. To assess the performance of the proposed framework, we use the algorithm proposed in YarkonyNIPS2015, denoted by CUCP (Closest Ultrametric via Cutting Plane), as a baseline for the closest ultrametric problem defined in (7). Indeed, CUCP can provides an (almost) exact solution to the closest ultrametric problem for planar graphs based on a reformulation as a set of correlation clustering/multi-cuts problems with additional hierarchical constraints. However, CUCP requires to define a priori the set of levels which will compose the final hierarchy.

We generated a set of superpixels adjacency graphs of increasing scale from a high-resolution image (see Figure 4). The weight of the edge linking two superpixels is defined as the mean gradient value, obtained with DollarPAMI2015, along the frontier between the two superpixels. The results presented in Figure 5 shows that the proposed approach is able to provide solutions close to the optimal ones (Figure 4(a)) using only a fraction of the time needed by the combinatorial algorithm (Figure 4(b)), and without any assumption on the input graph. The complete experimental setup is in the supplemental material.

The computation time of some combinations of cost terms are presented in Figure 4(c). Note that, Algorithm 1 usually achieves convergence in about one hundred iterations (see supplemental material). Closest and Closest+Size can handle graphs with millions of edges. Dasgupta relaxation is computationally more demanding, which decreases the limit to a few hundred thousands of edges.

(a) Hierarchical clustering (unsupervised)
(b) Semi-supervised clustering
Figure 6: Performance on real datasets.

Hierarchical clustering

We evaluate the proposed optimization framework on five datasets downloaded from the LIBSVM webpage,222https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/ whose size ranges from to samples. For each dataset, we build a 5-nearest-neighbor graph, to which we add the edges of a minimum spanning tree to ensure the connectivity. Then, we perform hierarchical clustering on this graph, and we threshold the resulting ultrametric at the prescribed number of clusters. We divide our analysis in two sets of comparisons: hierarchical clustering (unsupervised), and semi-supervised clustering. To be consistent among the two types of comparisons, we use the classification accuracy as a measure of performance.

Figure 5(a) compares the performance of three hierarchical clustering methods. The baseline is "Ward" agglomerative method, applied to the pairwise distance matrix of each dataset. Average linkage and closest ultrametric are not reported, as their performance is consistently worst. The "Dasgupta" method refers to Algorithm 1 with and . The "Closest+Size" method refers to Algorithm 1 with the cost function and . In both cases, the regularization is only applied to the top-10 dendogram nodes (see supplemental material). The results show that the proposed approach is competitive with Ward method (one of the best agglomerative heuristics). On the datasets Digit1 and Heart, "Dasgupta" performs slightly worse than "Closest+Size": this is partly due to the fact that our relaxation of the Dasgupta cost function is sensible to data scaling.

Figure 5(b) compares the performance of two semi-supervised clustering methods, and an additional unsupervised method. The first baseline is "Spectral" clustering applied to the Gaussian kernel matrix of each dataset. The second baseline is "SVM

" classifier trained on the fraction of labeled samples, and tested on the remaining unlabeled samples. Between


of training samples are drawn from each dataset using a 10-fold scheme, and the cross-validated performance is reported in terms of mean and standard deviation. The "

Closest+Triplet" method refers to Algorithm 1 with , and

. The results show that the triplet regularization performs comparably to semi-supervised SVM, which in turn performs better than spectral clustering.

5 Conclusion

We have presented a general optimization framework for fitting ultrametrics to sparse edge-weighted graphs in the context of hierarchical clustering. We have demonstrated that our framework can accommodate various cost functions, thanks to efficient algorithms that we have carefully designed with automatic differentiation in mind. Experiments carried on simulated and real data allowed us to show that the proposed approach provides good approximate solutions to well-studied problems.

The theoretical analysis of our optimization framework is beyond the scope of this paper. Nonetheless, we believe that statistical physics modelling carleo2019 may be a promising direction for future work, based on the observation that ultrametricity is a physical property of spin-glasses Grossman1989; leuzzi1999; katzgraber2009; katzgraber2012; baviera2015; jagannath2017

. Other possible extensions include the end-to-end learning of neural networks for hierarchical clustering, possibly in the context of image segmentation.

6 Acknowledgment

Giovanni Chierchia was supported by the INS2I JCJC project under grant 2019OSCI.


7 Supplemental material

7.1 Average linkage approximates the closest ultrametric problem

To help understand how the closest ultrametric problem relates to average linkage, note that an ultrametric takes a finite set of non-negative values with . Hence, it can be represented as


where are functions from to defining a hierarchical partition. In this setting, the cost function boils down to


and, for a fixed hierarchical clustering , the optimal altitudes are given by


This is exactly the criterion used by average linkage to build a hierarchical clustering. We can thus argue that the latter provides an approximate solution to the closest ultrametric problem. As a matter of fact, average linkage and Algorithm 1 with produce structurally similar ultrametrics, as shown in Figure 9 for illustrative examples of hierarchical clustering.

7.2 Algorithms

This section describes in detail the algorithms proposed to compute the cost terms associated to cluster-size regularization, triplet regularization, and Dapgusta cost function.

Cluster-size regularization

This regularization penalizes small clusters at large scales (see Figure 6(a)), and the associated cost is computed by Algorithm 3. It proceeds by first computing the size of the smallest child of each node of the tree. The size of each node can be trivially computed recursively from the leaves to the root in linear time. Then, it identifies the pass node associated to each edge thanks to he fast l.c.a. algorithm, and it deduces the individual cost for this edge. Note that we assume the weights are constants, even though they depend on the variable being optimized. This allows us to simplify the gradient evaluation. Moreover, the algorithm presents an additional hyper-parameter for applying the regularization only to the top- dendrogram nodes. In the algorithm we denote by , the rank of a node according to the ordering given by their altitudes (from highest to lowest values). The root of the tree is thus ranked 1, the second highest node is ranked 2 and so on. Note that in practice, the single linkage algorithm (as any agglomerative clustering method) naturally orders nodes according to this ordering and no extra computation is required. We set in all our numerical experiments.

(a) The cluster-size regularization pushes down the nodes of the hierarchical clustering having at least one small child. This corresponds to reducing the distance between the elements in the cluster, effectively preventing small clusters to appear at high altitudes.
(b) The triplet regularization pushes down the lowest common ancestor between elements of the same class (i.e., it reduces the intra-class distance) and pushes up the lowest common ancestor between elements of different classes (i.e., it increases the inter-class distance).
Figure 7: Intuitive interpretation of the proposed regularization schemes.
1:Graph with ultrametric
2:Parameter to apply regularization only on the top- nodes
3:Output the cluster-size regularization value
4: withGower1969, NajmanISMM2013
5: cardinal of each node of
6:for each node of from the leaves to the root (excluded) do
8:     for each child of  do
10:preprocess l.c.a on with Bender2000
12:for each edge  do
13:      with Bender2000
14:     if  then
Algorithm 3 Cluster-size regularization (8)

Triplet regularization

This regularization for semi-supervised learning enforces triplet constraints (see Figure 6(b)), and the associated cost is computed by Algorithm 4, which is very similar to the subdominant ultrametric algorithm. For every triplet , it searches for and , the smallest clusters containing the pairs and , with the fast l.c.a. algorithm. It then introduces a penalization if the altitude of (i.e., the distance between and ) is not small enough compared to the altitude of (i.e., the distance between and ).

1:Graph with ultrametric
4:Output the triplet regularization value
5: withGower1969, NajmanISMM2013
6:preprocess l.c.a on with Bender2000
8:for each  do
9:      with Bender2000
10:      with Bender2000
Algorithm 4 Triplet regularization (16)

Dasgupta cost function

The difficulty in implementing the proposed relaxation of Dasgupta cost term lies in the soft-cardinal function defined in (17). The main function described in Algorithm 5 is similar to previously presented algorithm. The soft-cardinal function is computed by algorithm 6. As with Algorithm 3, the size of the nodes of the tree can be computed recursively from leaves to root in linear time. Note that, on line 9, the child of that does not contain can easily be determined by remembering the previous node of the "for each" loop: it is the sibling of the latter.

1:Graph with ultrametric
2:Output Dapgusta cost function value
3: Algorithm 6
4:preprocess l.c.a on with Bender2000
6:for each edge  do
7:      with Bender2000
Algorithm 5 Dasgupta cost function (12)
1:Graph with ultrametric
2:Single linkage clustering on
3:Output the soft-cardinal of non leaves node of
4: cardinal of each node of
5:preprocess l.c.a on with Bender2000
6:for each non leaf node of  do
10:     for each vertex of  do
11:         for each ancestor of  do
12:               child of the that does not contain
Algorithm 6 Soft-cardinal function (17)

7.3 Framework validation

All the tests in the comparison with the CUCP algorithm (Section 4) were conducted on a computer with an Intel I7 4 cores processor and 16 GB of memory. For the optimization, we use the AMSGrad variation j.2018on of the ADAM method with step-size . Our implementation of the proposed algorithms are all single threaded.

For each of the 52 test instances, the values obtained at each iteration of Algorithm 1 were normalized between 0 (lowest achieved cost for this instance) and 1 (highest cost reached for this instance). Figure 8 shows the mean-normalized convergence curve (with its standard deviation). We can see, that the convergence rate appears to be very good and smooth in practice. The convergence is usually reached within a bit more than a hundred iterations.

Figure 8: Mean-normalized convergence curve of the proposed approach and standard deviation.

7.4 Illustrative examples

Figure 9 shows more illustrative examples of hierarchical clustering. For each dataset, we build a 5-nearest-neighbor graph, to which we add the edges of a minimum spanning tree to ensure the connectivity. Then, we perform hierarchical clustering on this graph, and we threshold the resulting ultrametric at the prescribed number of clusters. The column "Closest" is the solution to Algorithm 1 with the cost function . The column "Closest+Size" is the solution to Algorithm 1 with the cost function and , where the regularization is only applied to the top-10 dendogram nodes. The column "Closest+Triplet" is the solution to Algorithm 1 with the cost function , , and . The column "Dasgupta" is the solution to Algorithm 1 with the cost function . For the optimization, we use the AMSGrad variation j.2018on of the ADAM method with step-size .

Figure 9: Illustrative examples of hierarchical clustering. Top rows: Ultrametrics fitted to the input graph (only the top-30 non-leaf nodes are shown in the dendrograms). Bottom rows: Assignments obtained by thresholding the ultrametrics at two, three, or four clusters.