HyperAid
This is the repository for the KDD 2022 paper "HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering".
The problem of fitting distances by tree metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is "more" tree-like, when evaluated in terms of Gromov's δ hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, ℓ_p norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics, while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is 125.94%.
The problem of fitting tree distances between data points is of great relevance in application areas concerned with extracting and processing hierarchical information (Saitou and Nei, 1987; Ailon and Charikar, 2005; Macaluso et al., 2021; Coenen et al., 2019; Hewitt and Manning, 2019). A significant body of work on the topic has focused on fitting phylogenetic trees based on evolutionary distances (Saitou and Nei, 1987) and using fitted tree models for historic linguistics and natural language processing (Coenen et al., 2019; Hewitt and Manning, 2019). Since trees are usually hard to accurately represent in low-dimensional Euclidean spaces (Linial et al., 1995), alternative methods have been suggested for representation learning in hyperbolic spaces (Nickel and Kiela, 2017). For these and related approaches, it has been observed that the quality of the embedding, and hence the representation error, depends on the so-called Gromov hyperbolicity δ (Gromov, 1987) of the data, which will be discussed in depth in the following section. For perfect trees, δ = 0, and the larger the value of δ, the less tree-like the data structure. When δ is sufficiently large, as is the case when dealing with noisy observations or observations with outliers, the embedding distortion may be significant (Abraham et al., 2007). This is a well-known fact that has limited the performance of methods such as Neighbor Joining (NJ) and TreeRep, which come with provable reconstruction guarantees only in the presence of small tree-metric perturbations/noise (Saitou and Nei, 1987; Sonthalia and Gilbert, 2020) (more precisely, the correct output tree topology is guaranteed for "nearly additive" distance matrices, i.e., matrices in which every distance differs from the true distance by not more than half of the shortest edge weight in the tree). Heuristic methods for tree-metric fitting, such as NINJA
(Wheeler, 2009) and T-REX (Boc et al., 2012), have similar issues, but address the problems of scaling and, in the latter case, the issue of negative tree-edge weights that may arise due to large hyperbolicity.

In a parallel development, the theoretical computer science community has explored the problem of fitting both general tree metrics and ultrametrics with specialized distortion objectives (Ailon and Charikar, 2005; Harb et al., 2005; Cohen-Addad et al., 2021). Ultrametrics are induced by rooted trees for which all the root-to-leaf distances are the same. Although ultrametrics do not have the versatility to model nonlinear evolutionary phenomena, they are of great relevance in data analysis, in particular in the context of "linkage" clustering algorithms (single, complete or average linkage) (Ailon and Charikar, 2005). Existing methods for ultrametric approximations scale well and provide approximation guarantees for their general graph problem counterparts. For both general tree metrics and ultrametrics, two objectives have received significant attention: the Dasgupta objective and derivatives thereof (Dasgupta, 2016; Charikar et al., 2019), and the ℓ_p objective (Ailon and Charikar, 2005; Harb et al., 2005). Dasgupta's objective was notably used for hierarchical clustering (HC) in hyperbolic spaces as a means to mitigate the hard combinatorial optimization questions associated with this clustering metric (Sahoo et al., 2020); see also (Macaluso et al., 2021) for other approaches to HC in hyperbolic spaces. Neither of the aforementioned approaches provides a solution to the problem of fitting tree metrics for data with moderate-to-high Gromov hyperbolicity, i.e., noisy or perturbed tree-metric data not covered by existing performance guarantee analyses. In addition, no previous study has compared the effectiveness of different objectives for the measure of fit for tree metrics on real-world datasets.

Our contributions are threefold.
We propose the first approach to data denoising (i.e., Gromov hyperbolicity reduction) in hyperbolic space as the first step to improving the quality of tree-metric fits.
We combine our preprocessing method, designed to fit both general trees and ultrametrics, with solutions for removing negative edge weights in the tree metrics that arise due to practical deviations from tree-like structures.
We demonstrate the influence of hyperbolic denoising on the quality of tree-metric fitting and hierarchical clustering for different objectives. In particular, we perform an ablation study that illustrates the advantages and disadvantages of different objectives (i.e., the Dasgupta objective (Dasgupta, 2016) and the ℓ_p objective (Harb et al., 2005)) when combined with denoising in hyperbolic spaces.
The paper is organized as follows. Related works are discussed in Section 2, while the relevant terminology and mathematical background material are presented in Section 3. Our tree-fitting and HC approach, and the accompanying analysis of different choices of objective functions, are described in Section 4. More details regarding the HyperAid method are provided in Section 5, with experimental results made available in Section 6. All proofs are relegated to the Supplement.
The Dasgupta HC objective. Determining the quality of trees produced by HC algorithms is greatly aided by an adequate choice of an objective function. Dasgupta (Dasgupta, 2016) proposed an objective and proved that an approximation can be obtained using specialized sparsest-cut approximation procedures (Dasgupta, 2016). HC through the lens of Dasgupta's objective and variants thereof has been investigated in (Charikar and Chatziafratis, 2017; Moseley and Wang, 2017; Ahmadian et al., 2019; Alon et al., 2020). For example, the authors of (Moseley and Wang, 2017) showed that the average linkage (UPGMA) algorithm produces a constant-factor approximation for a Dasgupta-type objective. It is important to point out that the Dasgupta objective depends only on the sizes of unweighted subtrees rooted at lowest common ancestors (LCAs) of two leaves, that the optimal trees are restricted to be binary, and that it does not produce edge lengths (weights) in the hierarchy. The latter two properties are undesirable for applications such as natural language processing, phylogeny and evolutionary biology in general (Waterman et al., 1977). As a result, other objectives have been considered, most notably metric tree-fitting objectives, which we show can resolve certain issues associated with the use of the Dasgupta objective.
Tree-metric learning and HC. In comparison to the Dasgupta objective, metric-based objective functions for HC have been mostly overlooked. One metric-based objective is the ℓ_p norm of the difference between the input metric and the resulting (fitted) tree metric (Waterman et al., 1977). For example, the authors of (Farach et al., 1995) showed that an optimal ultrametric can be found with respect to the ℓ_∞ norm loss for an arbitrary input metric in polynomial time. It is also known in the literature that finding optimal tree metrics with respect to the ℓ_p norm for finite p ≥ 1 is NP-hard, and that finding optimal ultrametrics for finite p ≥ 1 is also NP-hard. In fact, the ℓ_∞ problem for tree metrics and the ℓ_1 problem for tree metrics and ultrametrics are both APX-hard (Wareham, 1993; Agarwala et al., 1998; Ailon and Charikar, 2005). The work (Agarwala et al., 1998) presents a 3-approximation result for the closest tree metric under the ℓ_∞ norm and established an important connection between the problems of fitting an ultrametric and fitting a tree metric. The result shows that an α-approximation for the constrained ultrametric fitting problem yields a 3α-approximation solution for the tree-metric fitting problem. The majority of the known results for tree-metric fitting depend on this connection. The algorithm in (Ailon and Charikar, 2005) offers an approximation for both tree-metric fitting and ultrametric fitting under general ℓ_p norms. The work (Cohen-Addad et al., 2021) improves the approximation factor for the ℓ_1 norm. Unfortunately, despite having provable performance guarantees, both the methods from (Ailon and Charikar, 2005) and (Cohen-Addad et al., 2021) are not scalable due to the use of correlation clustering. As a result, they cannot be executed efficiently in practice even for trees that have 50 leaves, due to severe computational bottlenecks.
Gradient-based HC. The authors of (Chierchia and Perret, 2020) proposed an ultrametric fitting framework (Ufit) for HC based on gradient descent methods that minimize the ℓ_2 loss between the resulting ultrametric and the input metric. There, a subdifferentiable min-max operation is integrated into the cost optimization framework to guarantee ultrametricity. Since an ultrametric is a special case of a tree metric, Ufit can be used in place of general tree-metric fitting methods. The gHHC method (Monath et al., 2019) represents the first HC approach based on hyperbolic embeddings. The authors assume that the hyperbolic embeddings of leaves are given and describe how to learn the internal vertex embeddings by optimizing a specialized objective function. The HypHC approach (Chami et al., 2020) disposes of the requirement to have the leaf embeddings. It directly learns a hyperbolic embedding by optimizing a relaxation of Dasgupta's objective (Dasgupta, 2016) and reconstructs the discrete binary tree from the learned hyperbolic embeddings of leaves using the distance of their hyperbolic lowest common ancestors to the root. Compared to gHHC, our framework (like HypHC) does not require input hyperbolic embeddings of leaves. We learn the hyperbolic embeddings of leaves by directly minimizing the ℓ_p loss (rather than the Dasgupta objective) between the hyperbolic distances and the input metric of pairs of points, which substantially differs from HypHC and has the goal of preprocessing (denoising) the data for downstream tree-metric fitting. Furthermore, it will be shown in the subsequent exposition that metric-based losses are preferable to the Dasgupta objective for the new task of hyperbolic denoising and consequent HC.
Learning in hyperbolic spaces. Representation learning in hyperbolic spaces has received significant interest due to its effectiveness in capturing latent hierarchical structures (Krioukov et al., 2010; Nickel and Kiela, 2017; Papadopoulos et al., 2015; Tifrea et al., 2019)
. Several learning methods for Euclidean spaces have been generalized to hyperbolic spaces, including perceptrons and SVMs (Cho et al., 2019; Chien et al., 2021; Tabaghi et al., 2021), neural networks (Ganea et al., 2018; Shimizu et al., 2021) and graph neural networks (Chami et al., 2019; Liu et al., 2019). In the context of learning hyperbolic embeddings, the author of (Sarkar, 2011) proposed a combinatorial approach for embedding trees with low distortion in just two dimensions. The work (Sala et al., 2018) extended this idea to higher dimensions. Learning methods for hyperbolic embeddings via gradient-based techniques were discussed in (Nickel and Kiela, 2017, 2018). None of these methods currently provides quality clustering results for practical data whose distances do not "closely" match those induced by a tree metric. Furthermore, the main focus of (Nickel and Kiela, 2017, 2018) is to learn hyperbolic embeddings that represent the input graphs and trees as accurately as possible. In contrast, our goal is to learn hyperbolic metrics that effectively "denoise" parts of the input metric to better conform to tree structures, in order to improve the performance of arbitrary metric tree-fitting and HC downstream methods.

Notation and terminology. Let T = (V, E, w) be a tree with vertex set V, edge set E and nonnegative edge weights w. Let L ⊆ V denote the set of leaf vertices in T. The length of the shortest path between any two vertices u, v ∈ V is given by the metric d_T(u, v). The metric d_T is called a tree metric induced by the (weighted) tree T. In binary trees, there is only one internal (non-leaf) vertex of degree two, the root vertex r; all other internal vertices are of degree three. Since our edge weights are allowed to be zero, nonbinary trees can be obtained by adding edges of zero weight. The pairwise distances of leaves in this case remain unchanged. A vertex v is said to be a descendant of a vertex u if u belongs to the directed path from the root to v. Also, with this definition, the vertex u is an ancestor of the vertex v.
For any two vertices u, v ∈ V, we let u ∧ v denote their lowest common ancestor (LCA), i.e., the common ancestor that is the farthest from the root vertex. We define a clan in T as a subset of leaf vertices that can be separated from the rest of the tree by removing a single edge. Furthermore, we let 𝒯_N be the set of binary trees with N leaves. For any internal vertex v of T, we let T_v be the binary subtree of T rooted at v.
δ-hyperbolic metrics. Gromov introduced the notion of δ-hyperbolic metrics as a generalization of the type of metric obtained from manifolds with constant negative curvature (Gromov, 1987; Sonthalia and Gilbert, 2020), as described below.
Given a metric space (X, d), the Gromov product of x, y ∈ X with respect to a base point w ∈ X is

(x, y)_w = (1/2) ( d(x, w) + d(y, w) − d(x, y) ).  (1)
Note that the Gromov product measures how close w is to the geodesic connecting x and y (Sonthalia and Gilbert, 2020).
A metric d on a space X is a δ-hyperbolic metric, for some δ ≥ 0, if for all x, y, z, w ∈ X,

(x, y)_w ≥ min{ (x, z)_w, (y, z)_w } − δ.  (2)
Usually, when stating that d is δ-hyperbolic we mean that δ is the smallest possible value that satisfies condition (2). An ultrametric is a special case of a tree metric (Ailon and Charikar, 2005), and is formally defined as follows.
A metric d on a space X is called an ultrametric if it satisfies the strong triangle inequality

d(x, y) ≤ max{ d(x, z), d(z, y) }  for all x, y, z ∈ X.  (3)
Note that the generating tree of an ultrametric on X has a special structure: all elements of X are leaves of the tree and all leaves are at the same distance from the root (Ailon and Charikar, 2005). An illustration of an ultrametric can be found in Figure 2.
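For concreteness, the Gromov product (1) and the strong triangle inequality (3) can be evaluated directly from a distance matrix by brute force. The following Python sketch is purely illustrative (it is ours, not part of the HyperAid implementation) and assumes the metric is given as a dense matrix (list of lists):

```python
import itertools

def gromov_product(d, x, y, w):
    """Gromov product (x, y)_w of Eq. (1), for a distance matrix d."""
    return 0.5 * (d[x][w] + d[y][w] - d[x][y])

def is_ultrametric(d, tol=1e-9):
    """Check the strong triangle inequality (3) over all triples."""
    n = len(d)
    return all(d[x][y] <= max(d[x][z], d[z][y]) + tol
               for x, y, z in itertools.product(range(n), repeat=3))
```

On the path metric of three collinear points, the Gromov product of the endpoints with respect to the middle point is zero (the base point lies on the geodesic), and the metric is not an ultrametric.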
Hyperbolic geometry. A hyperbolic space is a non-Euclidean space with a constant negative curvature. Despite the existence of various equivalent models for hyperbolic spaces, Poincaré ball models have received the broadest attention in the machine learning and data mining communities. Although our hyperbolic denoising framework can be extended to work for any hyperbolic model, we choose to work with Poincaré ball models 𝔹^d_c = { x ∈ ℝ^d : c‖x‖² < 1 } with negative curvature −c, c > 0. For any two points x, y ∈ 𝔹^d_c, their hyperbolic distance equals
d_c(x, y) = (2/√c) tanh⁻¹( √c ‖ (−x) ⊕_c y ‖ ).  (4)
Furthermore, for a reference point x ∈ 𝔹^d_c, we denote its tangent space, the first-order linear approximation of 𝔹^d_c around x, by T_x 𝔹^d_c. Möbius addition and scalar multiplication, two basic operators on the Poincaré ball (Ungar, 2008), may be defined as follows. The Möbius sum of x, y ∈ 𝔹^d_c equals

x ⊕_c y = ( (1 + 2c⟨x, y⟩ + c‖y‖²) x + (1 − c‖x‖²) y ) / ( 1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖² ).  (5)
Unlike its vector-space counterpart, this addition is noncommutative and nonassociative. The Möbius version of multiplication of x ∈ 𝔹^d_c \ {0} by a scalar r ∈ ℝ is defined according to

r ⊗_c x = (1/√c) tanh( r tanh⁻¹( √c ‖x‖ ) ) x/‖x‖,  with r ⊗_c 0 = 0.  (6)
For more details, see (Vermeer, 2005; Ganea et al., 2018; Chien et al., 2021; Tabaghi et al., 2021). With the same operators, one can also define geodesics, the analogues of straight lines in Euclidean spaces and shortest paths in graphs, in 𝔹^d_c. The geodesic connecting two points x, y ∈ 𝔹^d_c is given by

γ_{x→y}(t) = x ⊕_c ( t ⊗_c ( (−x) ⊕_c y ) ),  t ∈ [0, 1].  (7)
Note that γ_{x→y}(0) = x, γ_{x→y}(1) = y, and d_c(x, γ_{x→y}(t)) = t · d_c(x, y). An illustration of the Poincaré model is given in Figure 3.
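For illustration, formulas (4)-(7) translate directly into a few lines of code. The NumPy sketch below is ours (not the HyperAid implementation) and uses unit curvature c = 1 by default:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball, Eq. (5)."""
    xy = np.dot(x, y)
    nx2, ny2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * ny2) * x + (1 - c * nx2) * y
    den = 1 + 2 * c * xy + c**2 * nx2 * ny2
    return num / den

def poincare_dist(x, y, c=1.0):
    """Hyperbolic distance, Eq. (4)."""
    diff = mobius_add(-x, y, c)
    return (2 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * np.linalg.norm(diff))

def mobius_scalar(r, x, c=1.0):
    """Mobius scalar multiplication, Eq. (6)."""
    nx = np.linalg.norm(x)
    if nx == 0:
        return x
    return np.tanh(r * np.arctanh(np.sqrt(c) * nx)) * x / (np.sqrt(c) * nx)

def geodesic(x, y, t, c=1.0):
    """Point at parameter t on the geodesic from x to y, Eq. (7)."""
    return mobius_add(x, mobius_scalar(t, mobius_add(-x, y, c), c), c)
```

A quick sanity check of the properties stated above: the geodesic starts at x, ends at y, and its midpoint lies at hyperbolic distance d_c(x, y)/2 from x.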
As previously explained, tree-metric fitting is closely related to hierarchical clustering, and both methods are expected to greatly benefit from data denoising. It is nevertheless unclear for which tree-metric fitting algorithms and objectives used in hierarchical clustering one sees the most significant benefits (if any) from denoising. To this end, we describe two objectives that are commonly used for the aforementioned tasks, in order to determine their strengths and weaknesses and potential performance benefits.
Let D = (d_ij) be a set of nonnegative dissimilarity scores between a set of N entities. In HC problems, the goal is to find a tree T that best represents a set of gradually finer hierarchical partitions of the entities. In this representation, the entities are placed at the leaves of the tree T, i.e., |L| = N. Each internal vertex v of T is representative of its clan, i.e., the leaves of T_v, or the hierarchical cluster at the level of v. We find the following particular HC problem definition useful in our analysis.
Problem 4.1. Let D = (d_ij) be a set of pairwise dissimilarities between N entities. The optimal hierarchical clusters are equivalent to a binary tree T* ∈ 𝒯_N that is a solution to the following problem:

T* = argmin_{T ∈ 𝒯_N} ℓ(D, D_T),

where D_T is a (known) function of pairwise dissimilarities of the leaves in T, and ℓ is a given loss function.
Dasgupta (Dasgupta, 2016) introduced the following cost function for HC:

c_D(T; D) = Σ_{i<j} d_ij · |L(T_{i∧j})|.  (8)
The optimal HC solution is a binary tree that maximizes (8). In the formalism of Problem 4.1, this approach assumes that the pairwise dissimilarity between leaf vertices i and j equals the number of leaves of the subtree rooted at their LCA, i.e., D_T(i, j) = |L(T_{i∧j})|. In words, two dissimilar leaves i and j should have an LCA that is close to the root. This ensures that the size of L(T_{i∧j}) is proportional to d_ij. It is also important to note that this method is sensitive to the absolute values of the input dissimilarities, since the loss function is (the negative of) the inner product between the measurements and the tree leaf dissimilarities, ⟨D, D_T⟩.
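To make the objective concrete, the toy sketch below (ours; it assumes a hypothetical children-map encoding of a rooted tree) evaluates the Dasgupta-type cost (8) by brute force, using the fact that the subtree rooted at the LCA of two leaves is the smallest subtree containing both:

```python
import itertools

def dasgupta_cost(children, root, d):
    """Dasgupta-type cost (8): sum over leaf pairs i < j of
    d[i][j] * |leaves of the subtree rooted at lca(i, j)|.
    Leaves are integers; internal nodes are any other labels."""
    leaves_below = {}

    def collect(v):
        if v not in children:                      # leaf
            leaves_below[v] = frozenset([v])
        else:
            s = frozenset()
            for ch in children[v]:
                s |= collect(ch)
            leaves_below[v] = s
        return leaves_below[v]

    collect(root)
    leaves = sorted(leaves_below[root])
    cost = 0.0
    for i, j in itertools.combinations(leaves, 2):
        # LCA subtree = smallest subtree containing both leaves
        size = min(len(leaves_below[v]) for v in leaves_below
                   if i in leaves_below[v] and j in leaves_below[v])
        cost += d[i][j] * size
    return cost
```

With dissimilarities as input, larger values of this cost are better, consistent with the maximization in (8).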
Instead of using the Dasgupta objective, one can choose to work with a leaf dissimilarity metric that is provided by the tree distance function, d_T. This way, one can learn a weighted tree that best represents the input dissimilarity measurements. This is formally described as follows.
Problem 4.2. Let D = (d_ij) be a set of pairwise dissimilarities of N entities. The optimal hierarchical clusters in the ℓ_p norm sense are equivalent to a tree T* that minimizes the objective

Σ_{i<j} | d_ij − d_T(i, j) |^p.  (9)
According to our general formalism, Problem 4.2 uses the natural tree-distance function to compute the leaf dissimilarities, i.e., D_T(i, j) = d_T(i, j), and the ℓ_p loss to measure the quality of the learned trees, i.e., ℓ(D, D_T) = ‖D − D_T‖_p. The objective function in Problem 4.2 aims to jointly find a topology (adjacency matrix) and edge weights for a tree that best approximates a set of dissimilarity measurements. This problem can be interpreted as approximating the measurements with an additive distance matrix (i.e., distances induced by trees). This is a significantly stronger constraint than the one induced by the requirement that two entities that are similar to one another should be placed (as leaves) as close as possible in the tree. We illustrate this distinction in Figure 4: the dissimilarity measurements are generated according to a tree in which two vertices can be arbitrarily close to each other. The optimal tree in the ℓ_p sense does not necessarily place these vertices close to each other, since the objective is to approximate all dissimilarity measurements with a set of additive distances. This is in contrast to Dasgupta's objective which, for a small enough dissimilarity, favors placing the two vertices closest to each other (see Figure 4). In Figure 4, we also provide another example to illustrate the distinction between optimal clusters according to the ℓ_p and Dasgupta criteria.
Remark 1. We can interpret Problem 4.2 as two independent problems: (1) learning the edge weights, and (2) learning the topology. A binary tree with N leaves has a total of 2N − 1 vertices and 2N − 2 edges. Let w be the vector of the edge weights. Any distance between pairs of vertices in T can be expressed as a linear combination of the edge weights. We can hence write D_T = A_T w for the vector D_T of all pairwise distances of leaves, where A_T is a binary matrix determined by the topology. Given a fixed topology (adjacency matrix of the tree T), we can compute the edge weights of T that minimize the objective function in Problem 4.2:

min_{w ≥ 0} ‖ D − A_T w ‖_p.  (10)
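For a fixed topology and p = 2, the inner minimization in (10) is an ordinary least-squares problem. The sketch below is ours and purely illustrative: it uses a hypothetical parent-map encoding of the rooted tree (one edge per non-root vertex) and enforces nonnegativity by clipping rather than by a proper constrained solver:

```python
import itertools
import numpy as np

def leaf_distance_design(parent, leaves):
    """Build A_T: rows index leaf pairs, columns index edges (child -> parent);
    an entry is 1 iff the edge lies on the path between the pair, so D_T = A_T w."""
    edges = list(parent)                       # one edge per non-root vertex
    idx = {e: k for k, e in enumerate(edges)}

    def path_to_root(v):
        out = []
        while v in parent:
            out.append(v)                      # vertex v stands for edge (v, parent[v])
            v = parent[v]
        return out

    rows = []
    for i, j in itertools.combinations(leaves, 2):
        row = np.zeros(len(edges))
        # symmetric difference of the two root paths = edges on the i..j path
        for v in set(path_to_root(i)) ^ set(path_to_root(j)):
            row[idx[v]] = 1.0
        rows.append(row)
    return np.array(rows)

def fit_edge_weights(parent, leaves, dvec):
    """Least-squares (p = 2) edge weights for a fixed topology, clipped to >= 0."""
    A = leaf_distance_design(parent, leaves)
    w, *_ = np.linalg.lstsq(A, dvec, rcond=None)
    return np.clip(w, 0.0, None)
```

When the input distances are exactly additive on the given topology, the fitted weights reproduce all pairwise leaf distances.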
There always exists at least one solution to Problem 4.2. This is a consequence of the following proposition.
Proposition. The objective in Problem 4.2 is a convex function of the edge weights of the tree T.
Remark 2. Let e_1 and e_2 be the two edges, with weights w_1 and w_2, incident to the root of a binary tree T. We can delete the root along with the edges e_1 and e_2, and place an edge of weight w_1 + w_2 between its children. This process trims the binary tree while preserving all pairwise distances between its leaf vertices. Hence, we can also work with trimmed nonbinary trees. Furthermore, in this formalism, the root vertex does not carry any hierarchy-related information about the entities (leaves), although it is important for "anchoring" the tree in HC. We discuss how to select a root vertex upon tree reconstruction in our subsequent exposition.
Here we discuss which objective function produces more "acceptable" trees and hierarchical clusters: ℓ_p or Dasgupta? In clustering problems for which we do not have the ground truth clusters, it is hard to quantify the quality of a solution. This is especially the case when using different objectives, as comparisons may then be meaningless. To address this issue, we performed an experiment in which ground-truth synthetic data is modeled in a way that reflects the properties of the objectives used in optimization. Then, the quality of each objective is measured with respect to its performance on both model-matched and model-unmatched data.
Let us start by generating a ground truth random (weighted) tree T with edge weights distributed uniformly and independently at random. Measurements are subsequently generated by computing the dissimilarities of leaves in T. According to the Dasgupta objective, the dissimilarity between leaves i and j is given by |L(T_{i∧j})|, whereas according to the ℓ_p objective, it is given by d_T(i, j) (all distances are computed with respect to the weighted tree T). We postulate that if the measurements are generated according to the Dasgupta notion of dissimilarity, then his objective should perform better in terms of recovering the original tree than the ℓ_p objective; and vice versa.
In this experiment, we generate random ground truth trees with i.i.d. edge weights, and produce measurements according to both objectives. Then, we generate a fixed set of random trees and find the tree that minimizes the cost function of each of the problems. For a randomly generated ground truth tree T, let D_Das be the measurements of the Dasgupta form and D_ℓp be the measurements of the tree-distance form. We let T_Das(D) be the binary tree with the minimum Dasgupta cost on dissimilarity measurements D, and define T_ℓp(D) similarly. The goal is to compare the deviations of T_Das(D_Das) and T_ℓp(D_Das) from the ground truth tree T; we ask the same question for T_Das(D_ℓp) and T_ℓp(D_ℓp).
In Figure 5, we show scatter plots of the results for each of the random trials, pertaining to trees with different numbers of leaves. If the measurements are generated according to the ℓ_p objective, then the ℓ_p objective recovers the underlying tree with higher accuracy compared to the Dasgupta objective, as desired. On the other hand, for dissimilarities generated according to Dasgupta's objective, the ℓ_p objective still manages to recover the underlying tree better than the Dasgupta objective for larger trees. This suggests that the ℓ_p objective could be more useful in recovering the underlying tree for more versatile types of dissimilarity measurements and larger datasets.
We describe next how to solve the constrained optimization Problem 4.2 in a hyperbolic space, and explain why our approach has the effect of data denoising for tree-metric learning. The HyperAid framework is depicted in Figure 2. In particular, the HyperAid encoder is described in Section 5.1, while the decoder is described in Section 5.2.
The first key idea is to exploit the fact that a tree metric is 0-hyperbolic. This can be deduced from the following theorem.
A graph is a tree if and only if it is connected, contains no cycles, and has a shortest path distance d satisfying the four-point condition

d(x, y) + d(z, w) ≤ max{ d(x, z) + d(y, w), d(x, w) + d(y, z) }  for all vertices x, y, z, w.  (11)
Note that through some simple algebra, one can rewrite condition (2) for δ-hyperbolicity as

d(x, y) + d(z, w) ≤ max{ d(x, z) + d(y, w), d(x, w) + d(y, z) } + 2δ  for all x, y, z, w ∈ X.  (12)
It is also known that any 0-hyperbolic metric space embeds isometrically into a metric tree (Dress, 1984).
The second key idea is that condition (2) with a value δ > 0 may be viewed as a relaxation of the tree-metric constraint. The parameter δ can be viewed as a slackness variable, while a δ-hyperbolic metric can be viewed as a "soft" tree metric. Hence, the smaller the value of δ, the closer the output metric is to a tree metric. Denoising therefore refers to the process of reducing δ.
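The slackness δ in (12) can be estimated by brute force over all quadruples: for a tree metric the largest of the three pairwise sums never exceeds the second largest, so δ = 0, while perturbing a single distance makes δ positive. The following sketch is ours, runs in O(N⁴) time and is intended only for small inputs:

```python
import itertools

def four_point_delta(d):
    """Smallest delta such that the four-point condition (12) holds for
    all quadruples of the distance matrix d."""
    n = len(d)
    delta = 0.0
    for x, y, z, w in itertools.combinations(range(n), 4):
        s = sorted([d[x][y] + d[z][w],
                    d[x][z] + d[y][w],
                    d[x][w] + d[y][z]])
        # largest sum may exceed the second largest by at most 2 * delta
        delta = max(delta, (s[2] - s[1]) / 2.0)
    return delta
```

On an additive (tree) metric this returns exactly zero; adding noise to one entry yields a strictly positive δ, which is the quantity HyperAid aims to reduce.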
Note that direct testing of condition (2) takes O(N⁴) time, which is prohibitively complex in practice. Checking the soft tree-metric constraint directly also takes O(N⁴) time, but to avoid this problem we exploit the connection between δ-hyperbolic metric spaces and negative curvature spaces. This is our third key idea, which allows us to efficiently optimize Problem 4.2 subject to the soft tree-metric constraint. More precisely, the hyperbolicity constant of a space of curvature −c decreases as c grows (see Appendix B). Hence, a distance metric induced by a hyperbolic space of negative curvature −c will satisfy the soft tree-metric constraint with a correspondingly small slackness δ. Therefore, using recent results pertaining to optimization in hyperbolic spaces (i.e., Riemannian SGD and Riemannian Adam (Bonnabel, 2013; Kochurov et al., 2020)), we can efficiently solve Problem 4.2 with a soft tree-metric constraint in a hyperbolic space. Combined with our previous argument, we conclude that finding the closest hyperbolic metric with small enough δ may be viewed as tree-metric denoising.
Based on this, one may want to let c → ∞. Unfortunately, optimization in hyperbolic spaces is known to have precision-quality tradeoff problems (Sala et al., 2018; Nickel and Kiela, 2018), and some recent works have attempted to address this issue (Yu and De Sa, 2019, 2021). In practice, one should choose c at a moderate scale, as we do in our experiments. The loss optimized by the encoder in HyperAid equals
L(X; D) = Σ_{i<j} | d_ij − d_c(x_i, x_j) |^p,  (13)

where X = {x_i} ⊂ 𝔹^d_c are the learned hyperbolic embeddings of the entities, d_c is the hyperbolic distance (4), and the d_ij are dissimilarities between entities.
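For illustration, the loss (13) is straightforward to evaluate for fixed embeddings; in HyperAid it would then be minimized with a Riemannian optimizer, which we omit here. The self-contained sketch below is ours (p = 2 and c = 1 by default) and computes (13) using the Poincaré distance (4):

```python
import itertools
import numpy as np

def poincare_dist(x, y, c=1.0):
    """Hyperbolic distance (4): d_c(x, y) = (2/sqrt(c)) atanh(sqrt(c) ||(-x) + y||_c)."""
    xy, nx2, ny2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * (-xy) + c * ny2) * (-x) + (1 - c * nx2) * y
    den = 1 + 2 * c * (-xy) + c**2 * nx2 * ny2
    diff = num / den                          # Mobius sum (-x) + y, Eq. (5)
    return (2 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * np.linalg.norm(diff))

def encoder_loss(X, d, c=1.0, p=2):
    """l_p loss (13) between dissimilarities d[i][j] and the hyperbolic
    distances of the embeddings X (one row per entity)."""
    return sum(abs(d[i][j] - poincare_dist(X[i], X[j], c)) ** p
               for i, j in itertools.combinations(range(len(X)), 2))
```

When the input dissimilarities exactly equal the hyperbolic distances of the embeddings, the loss vanishes; any mismatch contributes its p-th power.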
A comparison between HyperAid and HypHC is once more in place. The purposes of using hyperbolic optimization in HypHC and in our HyperAid are very different. HypHC aims to optimize the Dasgupta objective and introduces a continuous relaxation of this cost in hyperbolic space, based on continuous versions of lowest common ancestors (see Equation (9) in (Chami et al., 2020)). In contrast, we leverage hyperbolic geometry for learning soft tree metrics under the ℓ_p loss, which denoises the input metrics to better fit "tree-like" measurements. The resulting hyperbolic metric can then be further used with any downstream tree-metric learning and inference method, and therefore has a level of universality not matched by HypHC, whose goal is to perform HC. Our ablation study to follow shows that HypHC cannot offer performance improvements like our approach when combined with downstream tree-metric learners.
Tree-metric fitting methods produce trees based on arbitrary input metrics. In this context, the NJ (Saitou and Nei, 1987) algorithm is probably the most frequently used algorithm, especially in the context of reconstructing phylogenetic trees. TreeRep (Sonthalia and Gilbert, 2020), a divide-and-conquer method, was also recently proposed in the machine learning literature. In contrast to linkage-based methods that restrict their output to be ultrametrics, both NJ and TreeRep can output a general tree metric. When the input metric is a tree metric, TreeRep is guaranteed to recover it correctly in the absence of noise. It is also known that the local distortion of TreeRep is bounded in terms of δ when the input metric is not a tree metric but a δ-hyperbolic metric. NJ can recover the ground truth tree metric when the input metric is a "slightly" perturbed tree metric (Atteson, 1997). Note that methods that generate ultrametrics, such as linkage-based methods and Ufit, do not have similar accompanying theoretical results.

The key idea of HyperAid is that if we are able to reduce δ for the input metric, both NJ and TreeRep (as well as other methods) should offer better performance and avoid pitfalls such as negative tree edge weights. Since an ultrametric is a special case of a tree metric, we also expect that our denoising process will offer empirically better performance for ultrametrics as well, which is confirmed by the results presented in the Experiments Section.
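For reference, a bare-bones version of NJ fits in a few dozen lines. The sketch below is ours (a standard textbook formulation, not the optimized implementations evaluated in the paper): it repeatedly joins the pair minimizing the Q-criterion. Note that on noisy inputs the computed branch lengths can be negative, which is precisely the pitfall discussed above:

```python
import numpy as np

def neighbor_joining(D):
    """Plain Neighbor Joining. D: (n x n) symmetric numpy distance matrix.
    Returns weighted edges (u, v, length); leaves are 0..n-1 and internal
    vertices receive fresh integer labels n, n+1, ..."""
    n = D.shape[0]
    active = list(range(n))
    dist = {i: {j: float(D[i, j]) for j in active} for i in active}
    edges, nxt = [], n
    while len(active) > 2:
        m = len(active)
        r = {i: sum(dist[i][k] for k in active) for i in active}
        # pair minimizing the Q-criterion (m - 2) d(i, j) - r_i - r_j
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: (m - 2) * dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        li = 0.5 * dist[i][j] + (r[i] - r[j]) / (2 * (m - 2))
        lj = dist[i][j] - li                  # may be negative on noisy inputs
        u = nxt
        nxt += 1
        edges += [(u, i, li), (u, j, lj)]
        dist[u] = {u: 0.0}
        for k in active:
            if k not in (i, j):
                duk = 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
                dist[u][k] = dist[k][u] = duk
        active = [k for k in active if k not in (i, j)] + [u]
    a, b = active
    edges.append((a, b, dist[a][b]))
    return edges
```

On an exactly additive input matrix, the path distances in the returned tree reproduce the input distances, consistent with the exact-recovery guarantees cited above.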
Rooting the tree. The NJ and TreeRep methods produce unrooted trees, which may be seen as a drawback for HC applications. This property of the resulting tree is a consequence of the fact that NJ does not assume constant rates of change in the generative phenomena. The rooting issue may be resolved in practice through the use of outgroups, or through the following simple solution known as midpoint rooting. In midpoint rooting, the root is assumed to be the midpoint of the longest path between a pair of leaves; additional constraints may be used to ensure stability under limited leaf removal. Rooting may also be performed based on priors, as is the case for outgroup rooting (Hess and De Moraes Russo, 2007).
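Midpoint rooting as described above can be sketched as follows (ours, illustrative only): find the leaf pair at maximum tree distance and place the root halfway along the path between them.

```python
import itertools

def path_length_and_edges(adj, s, t):
    """DFS in a tree; returns (path length, list of edges) from s to t."""
    stack = [(s, None, 0.0, [])]
    while stack:
        v, prev, dist, path = stack.pop()
        if v == t:
            return dist, path
        for nb, w in adj[v]:
            if nb != prev:
                stack.append((nb, v, dist + w, path + [(v, nb, w)]))
    raise ValueError("disconnected input")

def midpoint_root(edges, leaves):
    """Return ((u, v), offset): the root lies on edge (u, v), at distance
    `offset` from u along the longest leaf-to-leaf path."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    best = max(itertools.combinations(leaves, 2),
               key=lambda p: path_length_and_edges(adj, *p)[0])
    total, path = path_length_and_edges(adj, *best)
    half, walked = total / 2.0, 0.0
    for u, v, w in path:
        if walked + w >= half:
            return (u, v), half - walked
        walked += w
    return (path[-1][0], path[-1][1]), path[-1][2]
```

The returned edge can then be subdivided by a new root vertex at the reported offset to obtain a rooted tree for HC.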
Complexity analysis. We focus on the time complexity of our encoder module. Note that for the exact computation of the (Dasgupta) loss, HypHC takes O(N³) time, since it sums over triplets, while our ℓ_p norm loss only takes O(N²) time. Nevertheless, as pointed out by the authors of HypHC, one can reduce the time complexity to O(N) by sampling only one triplet per leaf. A similar approach can be applied in our encoder, resulting in an O(N) complexity per step. Hence, the time complexity of our HyperAid encoder is significantly lower than that of HypHC, as we focus on a different objective function. With respect to tree-metric and ultrametric fitting methods, we still have low-order polynomial time complexities. For example, NJ runs in O(N³) time, while an advanced implementation of UPGMA requires O(N²) time (Murtagh, 1984).
To demonstrate the effectiveness of our HyperAid framework, we conduct extensive experiments on both synthetic and real-world datasets. We mainly focus on determining whether denoised data leads to decoded trees that have smaller losses compared to those generated by decoders that operate on the input metrics directly (Direct). Due to space limitations, we only examine one choice of ℓ_p norm loss, as this is the most common objective in practice; HyperAid easily generalizes to arbitrary p. An in-depth description of the experiments is deferred to the Supplement.
Tree-metric methods. We test both the NJ and TreeRep methods as decoders. Note that TreeRep is a randomized algorithm, so we repeat it a number of times and pick the tree with the lowest loss, as suggested in the original paper (Sonthalia and Gilbert, 2020). We also test the T-REX (Boc et al., 2012) version of NJ and Unweighted NJ (UNJ) (Gascuel et al., 1997), both of which outperform NJ due to their sophisticated postprocessing steps. T-REX is not an open-source toolkit and we can only use it by interfacing with its website manually, which limits the size of the trees that can be tested; this is the reason for reporting the performance of T-REX only on small-size real-world datasets.
Ultrametric methods. We also test whether our framework can help ultrametric fitting methods, including linkage-based methods (single, complete, average, weighted) and gradient-based ultrametric methods such as Ufit. For Ufit, we use the default hyperparameters provided by the authors (Chierchia and Perret, 2020), akin to (Chami et al., 2020).
Table 1: Direct vs. HyperAid ℓ_2 losses on synthetic noisy trees. Each group of three columns (Direct, HyperAid, Gain (%)) corresponds to one experimental setting; bold marks the smaller loss.

| Method | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) |
|---|---|---|---|---|---|---|---|---|---|
| NJ | 139.57 | **117.19** | 19.09 | 97.14 | **93.14** | 4.29 | 80.92 | **72.36** | 11.82 |
| TreeRep | 167.59 | **150.37** | 11.45 | 169.06 | **138.28** | 22.26 | 131.52 | **97.15** | 35.37 |
| single | 379.14 | **370.55** | 2.31 | 290.11 | **254.77** | 13.87 | 236.10 | **196.38** | 20.22 |
| complete | 376.24 | **355.54** | 5.82 | 293.06 | **279.55** | 4.83 | 281.92 | **231.41** | 21.82 |
| average | **148.68** | 149.83 | 0.77 | 108.05 | **106.64** | 1.32 | 85.22 | **83.63** | 1.89 |
| weighted | 162.35 | **150.86** | 7.61 | 132.96 | **130.85** | 1.61 | 111.55 | **104.41** | 6.84 |
| Ufit | 228.21 | **227.25** | 0.42 | 131.97 | **124.78** | 5.76 | 90.20 | **85.45** | 5.56 |

| Method | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) |
|---|---|---|---|---|---|---|---|---|---|
| NJ | 283.73 | **259.38** | 9.38 | 201.71 | **181.52** | 11.12 | 151.58 | **142.86** | 6.09 |
| TreeRep | 505.03 | **318.62** | 58.50 | 383.12 | **228.30** | 67.81 | 343.21 | **192.57** | 78.22 |
| single | 936.98 | **767.25** | 22.12 | 688.52 | **519.70** | 32.48 | 563.58 | **431.50** | 30.61 |
| complete | 826.73 | **698.71** | 18.32 | 738.98 | **583.67** | 26.60 | 619.42 | **454.95** | 36.14 |
| average | **323.58** | 326.26 | 0.82 | 236.29 | **234.92** | 0.58 | 182.29 | **178.40** | 2.18 |
| weighted | 340.93 | **340.30** | 0.18 | **301.30** | 316.85 | 4.90 | **262.61** | 273.47 | 3.97 |
| Ufit | 681.52 | **556.37** | 22.49 | 388.47 | **303.14** | 28.15 | 231.69 | **198.29** | 16.84 |

| Method | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) |
|---|---|---|---|---|---|---|---|---|---|
| NJ | 910.31 | **642.68** | 41.64 | 450.57 | **420.97** | 7.03 | 320.29 | **312.83** | 2.38 |
| TreeRep | 1484.36 | **1078.49** | 37.63 | 1125.81 | **751.72** | 49.76 | 925.65 | **551.10** | 67.96 |
| single | 2343.78 | **2277.91** | 2.89 | 1666.06 | **1397.08** | 19.25 | 1313.64 | **1066.32** | 23.19 |
| complete | 2186.13 | **2024.22** | 7.99 | 1786.13 | **1532.99** | 16.51 | 1565.70 | **1390.28** | 12.61 |
| average | 758.35 | **756.91** | 0.18 | **521.83** | 521.91 | 0.01 | 392.09 | **386.87** | 1.35 |
| weighted | 794.90 | **776.84** | 2.32 | **681.73** | 683.01 | 0.18 | 626.06 | **600.36** | 4.27 |
| Ufit | 2026.42 | **1974.59** | 2.62 | 1206.72 | **1008.93** | 19.60 | 774.92 | **641.07** | 20.88 |
The edge-noise rate controls the number of random edges added to a tree as a means of introducing perturbations. We generate a random binary tree and then add random edges between vertices, including both leaves and internal vertices. The input metric to our encoder is the collection of shortest-path distances between leaves. We generate such corrupted random binary trees for several choices of tree size and edge-noise rate. All results are averages over three independent runs with the same set of input metrics. Note that this edge-noise model results in graph-generated similarities which may not have full generality, but it is amenable to simple evaluations of HyperAid in terms of detecting spurious noisy edges and removing their effect from the resulting tree fits or hierarchies.
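A minimal sketch of this corruption model (our own simplified version with unit edge weights; the paper's generator may assign weights differently):

```python
import random
from collections import deque

def noisy_tree_metric(n_leaves, n_noise_edges, seed=0):
    """Build a random binary tree, add random 'noise' edges, and return the
    shortest-path metric between leaves (unit edge weights, for simplicity)."""
    rng = random.Random(seed)
    # Random binary tree: repeatedly turn a random leaf into an internal node
    # with two fresh children, until the desired number of leaves is reached.
    adj = {0: set()}
    leaves = [0]
    next_id = 1
    while len(leaves) < n_leaves:
        u = leaves.pop(rng.randrange(len(leaves)))
        for _ in range(2):
            adj[u].add(next_id)
            adj[next_id] = {u}
            leaves.append(next_id)
            next_id += 1
    # Noise: random extra edges between arbitrary pairs of vertices.
    nodes = list(adj)
    for _ in range(n_noise_edges):
        u, v = rng.sample(nodes, 2)
        adj[u].add(v)
        adj[v].add(u)
    # BFS from every leaf gives the shortest-path metric on the noisy graph.
    metric = {}
    for s in leaves:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for t in leaves:
            if t != s:
                metric[frozenset((s, t))] = dist[t]
    return metric, leaves
```

With zero noise edges the returned metric is exactly a tree metric; each added edge can shorten some leaf-to-leaf paths, which is precisely the perturbation the encoder is asked to undo.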
It is easy to see that our framework consistently improves the performance of tree-metric learning methods, with gains as large as 41.64% for NJ and 78.22% for TreeRep. Furthermore, one can see that the hyperbolic metric learned by HyperAid indeed results in smaller Gromov hyperbolicity values compared to the input. For ultrametric methods, our framework frequently leads to improvements, but not of the same order as those seen for general tree-metrics. We also tested the centroid, median and ward linkage methods on the raw input metric data, but given that these methods require the input metric to be Euclidean, we could not combine them with HyperAid. Nevertheless, their results are significantly worse than those offered by our framework and are thus not included in Table 1.
We examine the effectiveness of our HyperAid framework on five standard datasets from the UCI Machine Learning Repository (Dua and Graff, 2017): Zoo, Iris, Glass, Segmentation and Spambase. We use cosine similarities as the raw input metrics. The results are listed in Table 2.
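As an illustration of the raw-metric construction, pairwise cosine distances can be derived from feature vectors as 1 − cos(x_i, x_j); whether the paper uses this exact transform of the cosine similarities is an assumption on our part:

```python
import math

def cosine_distance_matrix(X):
    """Pairwise cosine distances 1 - cos(x_i, x_j) for rows of X.

    X is a list of feature vectors (lists of floats); rows must be nonzero.
    """
    def cos(a, b):
        num = sum(ai * bi for ai, bi in zip(a, b))
        na = math.sqrt(sum(ai * ai for ai in a))
        nb = math.sqrt(sum(bi * bi for bi in b))
        return num / (na * nb)
    n = len(X)
    return [[0.0 if i == j else 1.0 - cos(X[i], X[j]) for j in range(n)]
            for i in range(n)]
```

Orthogonal feature vectors are at distance 1, parallel ones at distance 0, so the resulting matrix behaves like a dissimilarity suitable as encoder input.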
One can see that the HyperAid framework again consistently and significantly improves the performance of all downstream methods (reaching gains in excess of 190%). Surprisingly, we find that average linkage, the best ultrametric fitting method, achieves smaller norm losses on the Spambase dataset when compared to tree-metric methods like NJ. We conjecture two possible explanations: 1) we have not explored more powerful tree-metric reconstruction methods like TREX (again, the reason for not being able to make full use of TREX is that the software requires interactive data loading and processing); 2) the optimal tree-metric is very close to an ultrametric. It is still worth pointing out that, despite the fact that the ultrametric fits have smaller loss than the tree-metric fits in this case, HyperAid consistently improved all tested algorithms on real-world datasets.
While minimizing the ℓ_2 loss in the encoding process of the HyperAid framework is an intuitively good choice, given that our final tree-metric fit is evaluated based on the ℓ_2 loss, it is still possible to choose other objective functions (such as the Dasgupta loss) for the encoder module. We therefore compare the performance of the hyperbolic metric learned using HypHC, centered on the Dasgupta objective, with the performance of our HyperAid encoder coupled with several possible decoding methods. For HypHC, we used the official implementation and hyperparameters provided by the authors (Chami et al., 2020). The results are listed in Table 3 in the Supplement. They reveal that minimizing the ℓ_2 loss during hyperbolic embedding indeed offers better results than those obtained using the Dasgupta loss when the ultimate performance criterion is the ℓ_2 loss. Note that the default decoding rule of HypHC produces unweighted trees and hence offers worse performance than the tested methods (its results are therefore omitted from the table).
One conceptually straightforward method for evaluating the quality of the hierarchical clustering is to use the ground truth labels of object classes available for realworld datasets such as Zoo and Iris.
The Zoo dataset comprises seven classes of the animal kingdom, including insects, cartilaginous and bony fish, amphibians, reptiles, birds and mammals. The hierarchical clusterings based on NJ, HyperAid+NJ and HypHC reveal that the first and last methods tend to misplace amphibians and bony fish, as well as insects and cartilaginous fish, by placing them in the same subtrees. The NJ method also shows a propensity to create “path-like trees” which do not adequately capture known phylogenetic information (e.g., for mammals in particular). On the other hand, HyperAid combined with NJ discriminates accurately between all seven classes, with the exception of two branches of amphibian species.
The Iris dataset includes three groups of iris flowers: Iris setosa, Iris versicolor and Iris virginica. All three methods perfectly identify the subtrees of Iris setosa species, but once again the NJ method produces a “path-like tree” for all three groups, which does not conform to nonlinear models of evolution. It is well-known that it is hard to distinguish Iris versicolor and Iris virginica flowers, as both have similar flower color and growth/bloom times. As a result, all three methods produce subtrees containing intertwined clusters of flowers from both groups.
Table 2: Direct vs. HyperAid ℓ_2 losses on real-world datasets. Each dataset cell lists Direct / HyperAid / Gain (%); bold marks the smaller loss.

| Method | Zoo | Iris | Glass | Segmentation | Spambase | Average Gain (%) |
|---|---|---|---|---|---|---|
| NJ | 11.77 / **8.99** / 30.94 | 43.85 / **18.17** / 141.22 | 60.96 / **30.76** / 98.15 | 856.47 / **292.78** / 192.53 | 1008.6 / **377.93** / 166.87 | 125.94 |
| TreeRep | 14.65 / **9.32** / 57.12 | 22.59 / **19.00** / 18.89 | 71.81 / **32.00** / 124.41 | 1357.28 / **1038.98** / 30.64 | 2279.31 / **927.02** / 145.87 | 75.38 |
| NJ(TREX) | 8.85 / **8.59** / 3.09 | 26.38 / **16.96** / 55.51 | 37.61 / **29.58** / 27.11 | NA | NA | 28.57 |
| UNJ(TREX) | 8.60 / **8.57** / 0.31 | 17.47 / **16.86** / 3.58 | 29.41 / **29.38** / 0.09 | NA | NA | 1.33 |
| single | 34.53 / **17.17** / 101.11 | 84.66 / **65.97** / 28.33 | 95.91 / **52.82** / 81.55 | 1174.95 / **924.76** / 27.06 | 1809.83 / **1109.75** / 63.08 | 60.23 |
| complete | 20.78 / **10.64** / 95.27 | 58.13 / **30.95** / 87.80 | 88.19 / **37.75** / 133.61 | 928.63 / **486.31** / 90.96 | 1248.15 / **633.70** / 96.96 | 100.92 |
| average | 10.04 / **9.17** / 9.47 | 23.38 / **22.99** / 1.68 | 31.94 / **30.79** / 3.73 | 339.41 / **300.07** / 13.11 | 365.78 / **350.68** / 4.31 | 6.46 |
| weighted | 12.08 / **11.05** / 9.33 | **27.20** / 29.07 / 6.42 | 34.75 / **34.16** / 1.74 | 458.17 / **337.75** / 35.66 | 385.37 / **382.13** / 0.84 | 8.22 |
| Ufit | 10.53 / **9.35** / 12.70 | 26.02 / **23.02** / 13.03 | 33.21 / **31.04** / 6.98 | 354.66 / **314.08** / 12.92 | 381.68 / **377.67** / 1.06 | 9.34 |
The objective in Problem 4.2 is a convex function of the edge weights of the tree T.
The objective in Problem 4.2 can be written as ‖Ww − d‖_p, where W lies in the set of admissible linear operators that map edge weights w to additive pairwise distances on a binary tree with n leaves, and d is the input metric. Let λ ∈ [0, 1]. For any two edge-weight vectors w_1, w_2, we have
‖W(λw_1 + (1 − λ)w_2) − d‖_p = ‖λ(Ww_1 − d) + (1 − λ)(Ww_2 − d)‖_p ≤ λ‖Ww_1 − d‖_p + (1 − λ)‖Ww_2 − d‖_p.
This result follows from the convexity of the ℓ_p norm for p ≥ 1. ∎
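The convexity claim can also be checked numerically. The sketch below draws a random dense matrix W as a stand-in for the (structured) tree operator and verifies the convexity inequality on random pairs of edge-weight vectors; all names and dimensions are illustrative:

```python
import random

def lp(vec, p=2):
    # l_p norm of a plain list of floats.
    return sum(abs(x) ** p for x in vec) ** (1.0 / p)

def objective(W, w, d, p=2):
    # ||W w - d||_p, with W a dense matrix standing in for the tree operator.
    Ww = [sum(Wij * wj for Wij, wj in zip(row, w)) for row in W]
    return lp([a - b for a, b in zip(Ww, d)], p)

def convexity_violations(trials=1000, p=2, seed=0):
    """Count violations of the convexity inequality on random inputs (expect 0)."""
    rng = random.Random(seed)
    m, k = 6, 4
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    d = [rng.random() for _ in range(m)]
    bad = 0
    for _ in range(trials):
        w1 = [rng.uniform(-1, 1) for _ in range(k)]
        w2 = [rng.uniform(-1, 1) for _ in range(k)]
        lam = rng.random()
        mix = [lam * a + (1 - lam) * b for a, b in zip(w1, w2)]
        lhs = objective(W, mix, d, p)
        rhs = lam * objective(W, w1, d, p) + (1 - lam) * objective(W, w2, d, p)
        if lhs > rhs + 1e-9:
            bad += 1
    return bad
```

Since the objective is the composition of an affine map with a norm, no violations occur for any p ≥ 1.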
We implemented our framework in Python. The encoder is developed in PyTorch as a modification of the official implementation of HypHC (https://github.com/HazyResearch/HypHC).
Training details. All hyperparameters for each dataset are described in our implementation; here, we briefly describe the most important ones. All hyperparameters are tuned based on the ℓ_2 loss of HyperAid+NJ. Note that since we are solving a search problem, this is a standard approach: for combinatorial search algorithms, one usually reports the best loss over different random seeds. HypHC also reports the best loss results over several random seeds (Chami et al., 2020).
When training hyperbolic embeddings in the Poincaré model, we follow the suggestions stated in (Nickel and Kiela, 2017). We initialize the embeddings within a small radius (1e−6) and use a small learning rate for a few burn-in epochs, then increase the learning rate by a burnin_factor. We also tune the negative curvature c for different datasets. As mentioned in the main text, although choosing a larger c makes the resulting hyperbolic metric more similar to a tree-metric (i.e., the hyperbolicity δ becomes smaller), it can make the optimization process harder and can potentially cause numerical issues. For synthetic datasets, we find that scaling the input metric by a factor scaling_factor during training (and scaling it back before decoding) helps improve the final results. We conjecture that this is due to numerical precision issues associated with hyperbolic embedding approaches, for which the resulting points lie close to the boundary of the Poincaré ball. We use the Riemannian Adam optimizer implemented by the authors of HypHC.
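The burn-in schedule described above can be sketched as a simple step rule; the names and default values below are illustrative, not the paper's tuned hyperparameters:

```python
def burnin_lr(epoch, base_lr=1e-3, burnin_epochs=10, burnin_factor=10.0):
    """Learning-rate schedule sketch: a reduced rate during the burn-in phase,
    after which the rate is raised by `burnin_factor` back to `base_lr`."""
    if epoch < burnin_epochs:
        return base_lr / burnin_factor
    return base_lr
```

In a training loop, this value would be assigned to the Riemannian optimizer's learning rate at the start of each epoch.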
For NJ and TreeRep, we use the implementation from the TreeRep paper (https://github.com/rsonthal/TreeRep). We manually interact with the TREX website (www.trex.uqam.ca) for NJ-TREX and UNJ-TREX. For linkage-based methods, we use the scipy library (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html). For Ufit, we use the official implementation (https://github.com/PerretB/ultrametric-fitting) along with the default hyperparameters.
Remarks. The NJ implementation, which is based on the PhyloNetworks library in Julia (http://crsl4.github.io/PhyloNetworks.jl/latest), is a naive version of the method. It is known from the literature that such versions of NJ can potentially produce negative edge weights (see, e.g., https://en.wikipedia.org/wiki/Neighbor_joining#Advantages_and_disadvantages). In the naive version of NJ, one simply replaces the negative edge weights by 0. A more sophisticated cleaning approach is implemented in TREX, which is the reason why the TREX version offers better performance. Note that TreeRep is also known to have the negative edge-weight problem; the naive correction (replacing negative weights with 0) was adopted by the authors (Sonthalia and Gilbert, 2020).
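The naive correction amounts to a one-line clamp (a sketch, using our own edge-list representation):

```python
def clamp_edge_weights(edges):
    """Naive NJ/TreeRep post-processing: replace negative edge weights with 0.

    `edges` is a list of (u, v, weight) triples; a new list is returned.
    """
    return [(u, v, max(0.0, w)) for u, v, w in edges]
```

More careful schemes, such as the one in TREX, redistribute the negative mass to neighboring edges instead of simply truncating it.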
Table 3: ℓ_2 losses of the HyperAid encoder vs. the HypHC (Dasgupta-objective) encoder, each combined with the listed decoders; bold marks the smaller loss.

| Method | Zoo: HyperAid | Zoo: HypHC | Iris: HyperAid | Iris: HypHC | Glass: HyperAid | Glass: HypHC |
|---|---|---|---|---|---|---|
| NJ | **11.77** | 31.55 | **18.18** | 70.43 | **30.76** | 87.75 |
| TreeRep | **14.65** | 28.51 | **19.00** | 64.61 | **32.00** | 80.31 |
| single | **34.53** | 45.01 | **65.97** | 85.40 | **52.83** | 108.44 |
| complete | **20.78** | 24.01 | **30.96** | 59.60 | **37.75** | 69.54 |
| average | **10.05** | 33.31 | **23.00** | 72.36 | **30.79** | 90.58 |
| weighted | **12.09** | 32.97 | **29.07** | 71.00 | **34.16** | 87.15 |
| Ufit | **10.54** | 33.07 | **23.03** | 73.19 | **31.04** | 90.34 |
Tree visualization for the Glass dataset. Clustering results obtained for the Glass dataset, comprising seven classes of material, are the hardest to interpret, as all three methods only accurately cluster tableware glass. HyperAid appears to better delineate float-processed building-window glass from non-float-processed building-window glass.