
HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering

The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is “more” tree-like, when evaluated in terms of Gromov's δ hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, ℓ_p norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is 125.94%.





Code Repositories


This is the repository for the KDD 2022 paper "HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering".


1. Introduction

The problem of fitting tree-distances between data points is of great relevance in application areas concerned with extracting and processing hierarchical information (Saitou and Nei, 1987; Ailon and Charikar, 2005; Macaluso et al., 2021; Coenen et al., 2019; Hewitt and Manning, 2019). A significant body of work on the topic has focused on fitting phylogenetic trees based on evolutionary distances (Saitou and Nei, 1987) and using fitted tree-models for historical linguistics and natural language processing (Coenen et al., 2019; Hewitt and Manning, 2019). Since trees are usually hard to accurately represent in low-dimensional Euclidean spaces (Linial et al., 1995), alternative methods have been suggested for representation learning in hyperbolic spaces (Nickel and Kiela, 2017). For these and related approaches, it has been observed that the quality of the embedding, and hence the representation error, depends on the so-called Gromov $\delta$-hyperbolicity (Gromov, 1987) of the data, which is discussed in depth in Section 3. For perfect trees, $\delta = 0$, and the larger the value of $\delta$, the less tree-like the data structure. When $\delta$ is sufficiently large, as is the case when dealing with noisy observations or observations with outliers, the embedding distortion may be significant (Abraham et al., 2007). This is a well-known fact that has limited the performance of methods such as Neighbor Joining (NJ) and TreeRep, which come with provable reconstruction guarantees only in the presence of small tree-metric perturbations/noise (Saitou and Nei, 1987; Sonthalia and Gilbert, 2020). More precisely, the correct output tree topology is guaranteed for “nearly additive” distance matrices, i.e., matrices in which every distance differs from the true distance by not more than half of the shortest edge weight in the tree. Heuristic methods for tree-metric fitting, such as NINJA (Wheeler, 2009) and T-REX (Boc et al., 2012), have similar issues, but address the problems of scaling and, in the latter case, the issue of negative tree-edge weights that may arise due to large $\delta$-hyperbolicity.

In a parallel development, the theoretical computer science community has explored the problem of fitting both general tree-metrics and ultrametrics with specialized distortion objectives (Ailon and Charikar, 2005; Harb et al., 2005; Cohen-Addad et al., 2021). Ultrametrics are induced by rooted trees for which all the root-to-leaf distances are the same. Although ultrametrics do not have the versatility to model nonlinear evolutionary phenomena, they are of great relevance in data analysis, in particular in the context of “linkage” clustering algorithms (single, complete or average linkage) (Ailon and Charikar, 2005). Existing methods for ultrametric approximations scale well and provide approximation guarantees for their general graph problem counterparts. For both general tree-metrics and ultrametrics, two objectives have received significant attention: the Dasgupta objective and derivatives thereof (Dasgupta, 2016; Charikar et al., 2019) and the $\ell_p$ objective (Ailon and Charikar, 2005; Harb et al., 2005). Dasgupta’s objective was notably used for hierarchical clustering (HC) in hyperbolic spaces as a means to mitigate the hard combinatorial optimization questions associated with this clustering metric (Sahoo et al., 2020); see also (Macaluso et al., 2021) for other approaches to HC in hyperbolic spaces. None of the aforementioned approaches provides a solution to the problem of fitting tree-metrics for data with moderate-to-high Gromov hyperbolicity, i.e., noisy or perturbed tree-metric data not covered by existing performance guarantee analyses. In addition, no previous study has compared the effectiveness of different objectives for the measure of fit for tree-metrics on real-world datasets.

Our contributions are three-fold.

  • We propose the first approach to data denoising (i.e., Gromov hyperbolicity reduction) in hyperbolic space as the first step to improving the quality of tree-metric fits.

  • We combine our preprocessing method designed to fit both general trees and ultrametrics with solutions for removing negative edge-weights in the tree-metrics that arise due to practical deviations from tree-like structures.

  • We demonstrate the influence of hyperbolic denoising on the quality of tree-metric fitting and hierarchical clustering for different objectives. In particular, we perform an ablation study that illustrates the advantages and disadvantages of different objectives (i.e., Dasgupta (Dasgupta, 2016) and $\ell_p$ (Harb et al., 2005)) when combined with denoising in hyperbolic spaces.

The paper is organized as follows. Related works are discussed in Section 2, while the relevant terminology and mathematical background material are presented in Section 3. Our tree-fitting and HC approach, together with an accompanying analysis of different choices of objective functions, is described in Section 4. More details regarding the HyperAid method are provided in Section 5, with experimental results made available in Section 6. All proofs are relegated to the Supplement.

2. Related works

The Dasgupta HC objective. Determining the quality of trees produced by HC algorithms is greatly aided by an adequate choice of an objective function. Dasgupta (Dasgupta, 2016) proposed an objective and proved that an $O(\alpha_N \log N)$-approximation can be obtained using specialized $\alpha_N$-approximation sparsest cut procedures, where $N$ is the number of data points. HC through the lens of Dasgupta’s objective and variants thereof has been investigated in (Charikar and Chatziafratis, 2017; Moseley and Wang, 2017; Ahmadian et al., 2019; Alon et al., 2020). For example, the authors of (Moseley and Wang, 2017) showed that the average linkage (UPGMA) algorithm produces a $1/3$-approximation for a Dasgupta-type objective. It is important to point out that the Dasgupta objective depends only on the sizes of unweighted subtrees rooted at lowest common ancestors (LCAs) of two leaves, that the optimal trees are restricted to be binary, and that it does not produce edge-lengths (weights) in the hierarchy. The latter two properties are undesirable for applications such as natural language processing, phylogeny and evolutionary biology in general (Waterman et al., 1977). As a result, other objectives have been considered, most notably metric-tree fitting objectives, which we show can resolve certain issues associated with the use of the Dasgupta objective.

Tree-metric learning and HC. In comparison to the Dasgupta objective, metric-based objective functions for HC have been mostly overlooked. One metric-based objective is the $\ell_p$ norm of the difference between the input metric and the resulting (fitted) tree-metric (Waterman et al., 1977). For example, the authors of (Farach et al., 1995) showed that an optimal ultrametric can be found with respect to the $\ell_\infty$ norm loss for an arbitrary input metric in polynomial time. It is also known that finding optimal tree-metrics with respect to the $\ell_p$ norm, for $p \in \{1, 2, \infty\}$, is NP-hard, and that finding optimal ultrametrics for $p \in \{1, 2\}$ is also NP-hard. In fact, the $\ell_\infty$ problem for tree-metrics and the $\ell_1$ problem for tree-metrics and ultrametrics are both APX-hard (Wareham, 1993; Agarwala et al., 1998; Ailon and Charikar, 2005). The work (Agarwala et al., 1998) presents a $3$-approximation result for the closest tree-metric under the $\ell_\infty$ norm and establishes an important connection between the problems of fitting an ultrametric and fitting a tree-metric. The result shows that an $\alpha$-approximation result for the constrained ultrametric fitting problem yields a $3\alpha$-approximation solution for the tree-metric fitting problem. The majority of the known results for tree-metric fitting depend on this connection. The algorithm in (Ailon and Charikar, 2005) offers an $O((\log N \log\log N)^{1/p})$ approximation for both tree-metric fitting and ultrametric fitting under general $\ell_p$ norms. The work (Cohen-Addad et al., 2021) improves the approximation factor to $O(1)$ for the $\ell_1$ norm. Unfortunately, despite having provable performance guarantees, neither the method from (Ailon and Charikar, 2005) nor the one from (Cohen-Addad et al., 2021) is scalable, due to the use of correlation clustering. As a result, they cannot be executed efficiently in practice even for trees that have 50 leaves, due to severe computational bottlenecks.

Gradient-based HC. The authors of (Chierchia and Perret, 2020) proposed an ultrametric fitting framework (Ufit) for HC based on gradient descent methods that minimize the $\ell_2$ loss between the resulting ultrametric and the input metric. There, a subdifferentiable min-max operation is integrated into the cost optimization framework to guarantee ultrametricity. Since an ultrametric is a special case of a tree-metric, Ufit can be used in place of general tree-metric fitting methods. The gHHC method (Monath et al., 2019) represents the first HC approach based on hyperbolic embeddings. The authors assume that the hyperbolic embeddings of leaves are given and describe how to learn the internal vertex embeddings by optimizing a specialized objective function. The HypHC approach (Chami et al., 2020) disposes of the requirement to have the leaf embeddings. It directly learns a hyperbolic embedding by optimizing a relaxation of Dasgupta’s objective (Dasgupta, 2016) and reconstructs the discrete binary tree from the learned hyperbolic embeddings of leaves using the distance of their hyperbolic lowest common ancestors to the root. Unlike gHHC, and like HypHC, our framework does not require input hyperbolic embeddings of leaves. We learn the hyperbolic embeddings of leaves by directly minimizing the $\ell_p$ loss (rather than the Dasgupta objective) between the hyperbolic distances and the input metric of pairs of points. This substantially differs from HypHC, and its goal is to preprocess (denoise) the data for downstream tree-metric fitting. Furthermore, it will be shown in the subsequent exposition that metric-based losses are preferable to the Dasgupta objective for the new task of hyperbolic denoising and consequent HC.

Learning in hyperbolic spaces. Representation learning in hyperbolic spaces has received significant interest due to its effectiveness in capturing latent hierarchical structures (Krioukov et al., 2010; Nickel and Kiela, 2017; Papadopoulos et al., 2015; Tifrea et al., 2019). Several learning methods for Euclidean spaces have been generalized to hyperbolic spaces, including perceptrons and SVMs (Cho et al., 2019; Chien et al., 2021; Tabaghi et al., 2021), neural networks (Ganea et al., 2018; Shimizu et al., 2021) and graph neural networks (Chami et al., 2019; Liu et al., 2019). In the context of learning hyperbolic embeddings, the author of (Sarkar, 2011) proposed a combinatorial approach for embedding trees with low distortion in just two dimensions. The work (Sala et al., 2018) extended this idea to higher dimensions. Learning methods for hyperbolic embeddings via gradient-based techniques were discussed in (Nickel and Kiela, 2017, 2018). None of these methods currently provides quality clustering results for practical data whose distances do not “closely” match those induced by a tree-metric. Furthermore, the main focus of (Nickel and Kiela, 2017, 2018) is to learn hyperbolic embeddings that represent the input graphs and trees as accurately as possible. In contrast, our goal is to learn hyperbolic metrics that effectively “denoise” parts of the input metric to better conform to tree structures, in order to improve the performance of arbitrary metric-tree fitting and HC downstream methods.

3. Preliminaries

Figure 2. Left: Illustration of the generative tree for an ultrametric and a tree-metric. Note that all the weighted distances of leaves to the root are equal for an ultrametric, while general tree-metrics do not have to satisfy this constraint. Right: Detailed block diagram of our HyperAid framework. Detailed explanations of the encoder and decoder modules are available in Section 5.

Notation and terminology. Let $T = (V, E, w)$ be a tree with vertex set $V$, edge set $E$ and nonnegative edge weights $w : E \to \mathbb{R}_{+}$. Let $L(T)$ denote the set of leaf vertices in $T$. The length of the shortest path between any two vertices $v_1, v_2 \in V$ is given by the metric $d_T : V \times V \to \mathbb{R}_{+}$. The metric $d_T$ is called a tree-metric induced by the (weighted) tree $T$. In binary trees, there is only one internal (nonleaf) vertex of degree two, the root vertex $v_r$; all other internal vertices are of degree three. Since our edge weights are allowed to be zero, nonbinary trees can be obtained by adding edges of zero weight. The pairwise distances of leaves in this case remain unchanged. A vertex $v_2$ is said to be a descendant of vertex $v_1$ if $v_1$ belongs to the directed path from the root $v_r$ to $v_2$. Also, with this definition, the vertex $v_1$ is an ancestor of the vertex $v_2$. For any two vertices $v_1, v_2 \in L(T)$, we let $v_1 \wedge v_2$ denote their lowest common ancestor (LCA), i.e., the common ancestor that is the farthest from the root vertex. We define a clan in $T$ as a subset of leaf vertices that can be separated from the rest of the tree by removing a single edge. Furthermore, we let $\mathcal{T}_N$ be the set of binary trees with $N$ leaves. For any internal vertex $v$ of $T$, we let $T(v)$ be the binary subtree of $T$ rooted at $v$.

$\delta$-hyperbolic metrics. Gromov introduced the notion of $\delta$-hyperbolic metrics as a generalization of the type of metric obtained from manifolds with constant negative curvature (Gromov, 1987; Sonthalia and Gilbert, 2020), as described below.

Definition 3.1.

Given a metric space $(V, d)$, the Gromov product of $x, y \in V$ with respect to a base point $r \in V$ is

$$(x, y)_r = \frac{1}{2}\big(d(r, x) + d(r, y) - d(x, y)\big). \tag{1}$$

Note that the Gromov product measures how close $r$ is to the geodesic connecting $x$ and $y$ (Sonthalia and Gilbert, 2020).

Definition 3.2.

A metric $d$ on a space $V$ is a $\delta$-hyperbolic metric for $\delta \geq 0$ if

$$(x, y)_r \geq \min\{(x, z)_r, (y, z)_r\} - \delta \quad \text{for all } x, y, z, r \in V. \tag{2}$$

Usually, when stating that $d$ is $\delta$-hyperbolic we mean that $\delta$ is the smallest possible value that satisfies the condition (2). An ultrametric is a special case of a tree-metric (Ailon and Charikar, 2005), and is formally defined as follows.

Definition 3.3.

A metric $d$ on a space $V$ is called an ultrametric if it satisfies the strong triangle inequality property

$$d(x, y) \leq \max\{d(x, z), d(y, z)\} \quad \text{for all } x, y, z \in V.$$

Note that the generating tree of an ultrametric on $V$ has a special structure: all elements of $V$ are leaves of the tree and all leaves are at the same distance from the root (Ailon and Charikar, 2005). An illustration of an ultrametric can be found in Figure 2.
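The three definitions above can be checked numerically on small distance matrices. The following is a minimal sketch, assuming the metric is given as a symmetric list-of-lists `d`; the function names `gromov_product`, `gromov_delta` and `is_ultrametric` are hypothetical helpers introduced here for illustration and are not part of the HyperAid code base. The brute-force search visits every quadruple of points, so it is only meant for tiny examples.

```python
from itertools import product


def gromov_product(d, x, y, r):
    """Gromov product (x, y)_r = (d(r,x) + d(r,y) - d(x,y)) / 2."""
    return 0.5 * (d[r][x] + d[r][y] - d[x][y])


def gromov_delta(d):
    """Smallest delta for which the Gromov-product condition holds (brute force)."""
    n = len(d)
    delta = 0.0
    for x, y, z, r in product(range(n), repeat=4):
        # Violation of (x,y)_r >= min{(x,z)_r, (y,z)_r} for this quadruple.
        gap = min(gromov_product(d, x, z, r),
                  gromov_product(d, y, z, r)) - gromov_product(d, x, y, r)
        delta = max(delta, gap)
    return delta


def is_ultrametric(d, tol=1e-12):
    """Check the strong triangle inequality d(x,y) <= max{d(x,z), d(y,z)}."""
    n = len(d)
    return all(d[x][y] <= max(d[x][z], d[y][z]) + tol
               for x, y, z in product(range(n), repeat=3))


# Illustrative inputs (assumed for this sketch):
# a star tree with leaf edge weights 1, 2, 3, 4, so that d(i, j) = w_i + w_j.
weights = [1.0, 2.0, 3.0, 4.0]
star = [[0.0 if i == j else weights[i] + weights[j] for j in range(4)]
        for i in range(4)]
# shortest-path metric of a 4-cycle with unit edges (not tree-like).
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
# a three-point ultrametric.
ultra = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]

print(gromov_delta(star), gromov_delta(cycle), is_ultrametric(ultra))
```

In this sketch the star metric and the ultrametric both have zero hyperbolicity, while the 4-cycle does not, which matches the reading of $\delta$ as a measure of deviation from a tree.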

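Hyperbolic geometry. A hyperbolic space is a non-Euclidean space with a constant negative curvature. Despite the existence of various equivalent models for hyperbolic spaces, Poincaré ball models have received the broadest attention in the machine learning and data mining communities. Although our hyperbolic denoising framework can be extended to work for any hyperbolic model, we choose to work with Poincaré ball models with negative curvature $-c$, $c > 0$: $\mathbb{B}_c^d = \{x \in \mathbb{R}^d : \sqrt{c}\,\|x\| < 1\}$. For any two points $x, y \in \mathbb{B}_c^d$, their hyperbolic distance equals

$$d_{\mathbb{B}_c^d}(x, y) = \frac{1}{\sqrt{c}} \cosh^{-1}\left(1 + \frac{2c\,\|x - y\|^2}{(1 - c\|x\|^2)(1 - c\|y\|^2)}\right).$$

Furthermore, for a reference point $p \in \mathbb{B}_c^d$, we denote its tangent space, the first order linear approximation of $\mathbb{B}_c^d$ around $p$, by $T_p\mathbb{B}_c^d$. Möbius addition and scalar multiplication, two basic operators on the Poincaré ball (Ungar, 2008), may be defined as follows. The Möbius sum of $x, y \in \mathbb{B}_c^d$ equals

$$x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}.$$

Unlike its vector-space counterpart, this addition is noncommutative and nonassociative. The Möbius version of multiplication of $x \in \mathbb{B}_c^d \setminus \{0\}$ by a scalar $r \in \mathbb{R}$ is defined according to

$$r \otimes_c x = \frac{1}{\sqrt{c}} \tanh\big(r \tanh^{-1}(\sqrt{c}\,\|x\|)\big) \frac{x}{\|x\|}, \qquad r \otimes_c 0 = 0.$$

For more details, see (Vermeer, 2005; Ganea et al., 2018; Chien et al., 2021; Tabaghi et al., 2021). With the same operators, one can also define geodesics, the analogues of straight lines in Euclidean spaces and shortest paths in graphs, in $\mathbb{B}_c^d$. The geodesic connecting two points $x, y \in \mathbb{B}_c^d$ is given by

$$\gamma_{x \to y}(t) = x \oplus_c \big(t \otimes_c ((-x) \oplus_c y)\big).$$

Note that $t \in [0, 1]$, $\gamma_{x \to y}(0) = x$ and $\gamma_{x \to y}(1) = y$. An illustration of the Poincaré model is given in Figure 3.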
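The Poincaré ball operators described above translate into a few lines of code. The sketch below is a plain-Python illustration under the standard definitions of the curvature $-c$ Poincaré ball; the function names (`mobius_add`, `mobius_scalar`, `poincare_dist`, `geodesic`) and the sample points are assumptions made for this example and do not come from the HyperAid repository.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def mobius_add(x, y, c=1.0):
    """Moebius sum of x and y in the Poincare ball of curvature -c."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * c * xy + c * c * xx * yy
    a = (1.0 + 2.0 * c * xy + c * yy) / denom
    b = (1.0 - c * xx) / denom
    return [a * xi + b * yi for xi, yi in zip(x, y)]


def mobius_scalar(r, x, c=1.0):
    """Moebius scalar multiplication of x by r (the origin maps to itself)."""
    n = norm(x)
    if n == 0.0:
        return list(x)
    scale = math.tanh(r * math.atanh(math.sqrt(c) * n)) / (math.sqrt(c) * n)
    return [scale * xi for xi in x]


def poincare_dist(x, y, c=1.0):
    """Hyperbolic distance between x and y (arccosh form)."""
    diff = [a - b for a, b in zip(x, y)]
    arg = 1.0 + 2.0 * c * dot(diff, diff) / ((1.0 - c * dot(x, x)) * (1.0 - c * dot(y, y)))
    return math.acosh(arg) / math.sqrt(c)


def geodesic(x, y, t, c=1.0):
    """Point at parameter t in [0, 1] on the geodesic from x to y."""
    neg_x = [-a for a in x]
    return mobius_add(x, mobius_scalar(t, mobius_add(neg_x, y, c), c), c)


# Illustrative points inside the ball for c = 1 and c = 2 (assumed values).
x, y = [0.1, 0.2], [-0.3, 0.4]
mid = geodesic(x, y, 0.5)
print(poincare_dist(x, y), poincare_dist(x, mid) + poincare_dist(mid, y))
```

The two printed numbers should agree up to floating-point error, since the midpoint lies on the geodesic between the endpoints.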

Figure 3. The two-dimensional Poincaré disk. (Right) A set of geodesics. (Left) Vectors in the tangent space at a reference point.

4. Hierarchical Clustering: A Tree Learning Problem

As previously explained, tree-metric fitting is closely related to hierarchical clustering, and both methods are expected to greatly benefit from data denoising. It is nevertheless unclear for which tree-metric fitting algorithms and objectives used in hierarchical clustering one sees the most significant benefits (if any) from denoising. To this end, we describe two objectives that are commonly used for the aforementioned tasks in order to determine their strengths and weaknesses and potential performance benefits.

Let $D = \{d_{i,j}\}_{i,j \in [N]}$ be a set of nonnegative dissimilarity scores between a set of $N$ entities. In HC problems, the goal is to find a tree $T$ that best represents a set of gradually finer hierarchical partitions of the entities. In this representation, the entities are placed at the leaves of the tree $T$, i.e., $|L(T)| = N$. Each internal vertex $v$ of $T$ is representative of its clan, i.e., the leaves of $T(v)$ or the hierarchical cluster at the level of $v$. We find the following particular HC problem definition useful in our analysis.

Problem 4.1.

Let $D = \{d_{i,j}\}_{i,j \in [N]}$ be a set of pairwise dissimilarities between $N$ entities. The optimal hierarchical clusters are equivalent to a binary tree $T^*$ that is a solution to the following problem:

$$T^* = \arg\min_{T \in \mathcal{T}_N} \ell\big(D, f_T\big),$$

where $f_T = \{f_T(i,j)\}_{i,j \in [N]}$ is a (known) function of pairwise dissimilarities of the leaves in $T$, and $\ell$ is a given loss function.

4.1. The Dasgupta objective

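Dasgupta (Dasgupta, 2016) introduced the following cost function for HC:

$$\mathrm{cost}(T) = \sum_{i,j \in [N]} d_{i,j}\,\big|L\big(T(i \wedge j)\big)\big|, \tag{8}$$

where $T(i \wedge j)$ is the subtree rooted at the LCA of the leaves $i$ and $j$, and $L(\cdot)$ denotes its set of leaves. The optimal HC solution is a binary tree that maximizes (8). In the formalism of Problem 4.1, this approach assumes that the pairwise dissimilarity between leaf vertices $i$ and $j$ equals the number of leaves of the tree rooted at their LCA, i.e., the leaf dissimilarity function is

$$f_T(i, j) = \big|L\big(T(i \wedge j)\big)\big|.$$

In words, two dissimilar leaves $i$ and $j$ should have an LCA that is close to the root. This ensures that the size of $T(i \wedge j)$ is proportional to $d_{i,j}$. It is also important to note that this method is sensitive to the absolute value of the input dissimilarities, since the loss function is the inner product between measurements and tree leaf dissimilarities, i.e., $\langle D, f_T \rangle = \sum_{i,j} d_{i,j}\, f_T(i,j)$.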
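To make the Dasgupta cost concrete, the sketch below evaluates it for a binary tree written as nested tuples whose leaves are integer indices. The representation, the function names `leaves` and `dasgupta_cost`, and the toy dissimilarity matrix are assumptions made for illustration; the sum is taken over unordered leaf pairs, which differs from a sum over ordered pairs only by a factor of two.

```python
def leaves(tree):
    """Leaf indices of a binary tree given as nested 2-tuples of ints."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def dasgupta_cost(tree, d):
    """Sum over unordered leaf pairs of d[i][j] * (number of leaves under their LCA)."""
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    size = len(left_leaves) + len(right_leaves)
    # Pairs split by this vertex have it as their lowest common ancestor.
    cross = sum(d[i][j] for i in left_leaves for j in right_leaves)
    return size * cross + dasgupta_cost(left, d) + dasgupta_cost(right, d)


# Toy dissimilarities (assumed): {0, 1} and {2, 3} are tight groups, far from each other.
d = [[0, 1, 10, 10],
     [1, 0, 10, 10],
     [10, 10, 0, 1],
     [10, 10, 1, 0]]

good = ((0, 1), (2, 3))   # dissimilar leaves are split at the root
bad = ((0, 2), (1, 3))    # dissimilar leaves are merged early
print(dasgupta_cost(good, d), dasgupta_cost(bad, d))
```

Because the inputs are dissimilarities, the tree with the larger cost is preferred; in this toy example that is the tree that separates the two tight groups at the root.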

4.2. The $\ell_p$ loss objective

Figure 4. Hierarchical clusters learned by different objectives. First panel: dissimilarity measurements are pairwise leaf distances on a weighted tree (the optimal tree in the $\ell_p$ sense). Second panel: the tree that optimizes Dasgupta’s objective for the measurements of the first panel. Third panel: dissimilarity measurements are pairwise distances of points on a line. Fourth panel: the Dasgupta-optimal tree for the measurements of the third panel. Fifth panel: two trees that both optimize the cost function.

Instead of using the Dasgupta objective, one can choose to work with a leaf dissimilarity metric that is provided by the tree distance function, $d_T$. This way, one can learn a weighted tree that best represents the input dissimilarity measurements. This is formally described as follows.

Problem 4.2.

Let $D = \{d_{i,j}\}_{i,j \in [N]}$ be a set of pairwise dissimilarities of $N$ entities. The optimal hierarchical clusters in the $\ell_p$ norm sense are equivalent to a tree $T^*$ that minimizes the objective

$$\mathrm{cost}(T) = \Big(\sum_{i,j \in [N]} \big|d_{i,j} - d_T(i,j)\big|^p\Big)^{1/p}.$$

According to our general formalism, Problem 4.2 uses the natural tree-distance function to compute the leaf dissimilarities, i.e., $f_T = d_T$, and the $\ell_p$ loss to measure the quality of the learned trees, i.e., $\ell(D, d_T) = \|D - d_T\|_p$. The objective function in Problem 4.2 aims to jointly find a topology (adjacency matrix) and edge weights for a tree that best approximates a set of dissimilarity measurements. This problem can be interpreted as approximating the measurements with an additive distance matrix (i.e., distances induced by trees). This is a significantly stronger constraint than the one induced by the requirement that two entities that are similar to one another should be placed (as leaves) as close as possible in the tree. We illustrate this distinction in Figure 4: the dissimilarity measurements are generated according to a weighted tree in which two of the leaves can be arbitrarily close to each other. The optimal tree in the $\ell_p$ sense does not place these vertices close to each other since the objective is to approximate all dissimilarity measurements with a set of additive distances. This is in contrast to Dasgupta’s objective which, for a small enough distance between the two leaves, favors placing them closest to each other. In Figure 4, we also provide another example to illustrate the distinction between optimal clusters according to the $\ell_p$ and Dasgupta’s criteria.

Remark 1. We can interpret Problem 4.2 as two independent problems: (1) learning edge-weights, and (2) learning the topology. A binary tree with $N$ leaves has a total of $2N-1$ vertices and $2N-2$ edges. Let $w \in \mathbb{R}_{+}^{2N-2}$ be the vector of the edge weights. Any distance between pairs of vertices in $T$ can be expressed as a linear combination of the edge weights. We can hence write $d_T = A_T w$ for the vector of all pairwise distances of leaves, so that $A_T \in \{0,1\}^{\binom{N}{2} \times (2N-2)}$. Given a fixed topology (adjacency matrix of the tree $T$), we can compute the edge weights of $T$ that minimize the objective function in Problem 4.2:

$$w^* = \arg\min_{w \geq 0} \|d - A_T w\|_p,$$

where $d$ is the vector of the measured dissimilarities. There always exists at least one solution to the optimization problem above. This is the result of the following proposition.

Proposition 4.1.

The objective in Problem 4.2 is a convex function of the edge weights of the tree $T$.
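The linear relation between edge weights and leaf distances in Remark 1 can be made explicit for a small tree. The sketch below is illustrative only: it assumes a rooted binary tree stored as a parent list, indexes each edge by its child vertex, and uses hypothetical names (`path_to_root`, `path_matrix`, `lp_cost`). It builds the 0/1 path matrix for a fixed topology, evaluates the $\ell_p$ cost of a weight vector, and spot-checks the convexity inequality at one midpoint (a numerical check, not a proof).

```python
from itertools import combinations


def path_to_root(v, parent):
    """Set of edges (indexed by their child vertex) on the path from v to the root."""
    edges = set()
    while parent[v] is not None:
        edges.add(v)
        v = parent[v]
    return edges


def path_matrix(leaf_ids, parent):
    """Rows: leaf pairs; columns: edges. Entry is 1 if the edge lies on the leaf-to-leaf path."""
    edge_ids = [v for v in range(len(parent)) if parent[v] is not None]
    rows = []
    for i, j in combinations(leaf_ids, 2):
        # Edges shared by both root paths cancel out of the leaf-to-leaf path.
        on_path = path_to_root(i, parent) ^ path_to_root(j, parent)
        rows.append([1.0 if e in on_path else 0.0 for e in edge_ids])
    return rows, edge_ids


def lp_cost(A, w, d, p=2):
    """l_p norm of (A w - d) for a fixed topology."""
    residuals = [sum(a * x for a, x in zip(row, w)) - di for row, di in zip(A, d)]
    return sum(abs(r) ** p for r in residuals) ** (1.0 / p)


# Toy topology (assumed): leaves 0, 1, 2; vertex 3 is the parent of 0 and 1; vertex 4 is the root.
parent = [3, 3, 4, 4, None]
A, edge_ids = path_matrix([0, 1, 2], parent)

w_true = [1.0, 2.0, 3.0, 0.5]
d = [sum(a * x for a, x in zip(row, w_true)) for row in A]   # exact tree distances

w_other = [0.0, 0.0, 0.0, 0.0]
w_mid = [(a + b) / 2.0 for a, b in zip(w_true, w_other)]
print(lp_cost(A, w_true, d), lp_cost(A, w_mid, d), lp_cost(A, w_other, d))
```

With three leaves the matrix has three rows and $2N - 2 = 4$ columns, in line with the vertex and edge counts stated in Remark 1.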

Remark 2. Let $e_1$ and $e_2$ be two edges, with weights $w_1$ and $w_2$, incident to the root of a binary tree $T$. We can delete the root along with the edges $e_1$ and $e_2$, and place an edge of weight $w_1 + w_2$ between its children. This process trims the binary tree while preserving all pairwise distances between its leaf vertices. Hence, we can also work with trimmed nonbinary trees. Furthermore, in this formalism, the root vertex does not carry any hierarchy-related information about the entities (leaves), although it is important for “anchoring” the tree in HC. We discuss how to select a root vertex upon tree reconstruction in our subsequent exposition.

4.3. Hierarchical objectives: $\ell_p$ vs. Dasgupta

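Figure 5. The reconstruction error for the four estimated trees (the $\ell_p$-optimal and the Dasgupta-optimal tree for each of the two types of measurements). The error compares pairwise leaf distances of trees, computed assuming unit length edges.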

Here we discuss which objective function produces more “acceptable” trees and hierarchical clusters, $\ell_p$ or Dasgupta. In clustering problems for which we do not have the ground truth clusters, it is hard to quantify the quality of the solution. This is especially the case when using different objectives, as in this case comparisons may be meaningless. To address this issue, we performed an experiment in which ground-truth synthetic data is modeled in a way that reflects the properties of the objectives used in optimization. Then, the quality of the objective is measured with respect to its performance on both model-matched and model-unmatched data.

Let us start by generating a ground truth random (weighted) tree $T$ with edge weights distributed uniformly and independently. Measurements are subsequently generated by computing the dissimilarities of leaves in $T$. According to the Dasgupta objective, the dissimilarity between two leaves $i, j \in L(T)$ is given by $|L(T(i \wedge j))|$, whereas according to the $\ell_p$ objective, it is given by $d_T(i,j)$ (all distances are computed with respect to the weighted tree $T$). We postulate that if the measurements are generated according to the Dasgupta notion of dissimilarities, then his objective should perform better in terms of recovering the original tree than the $\ell_p$ objective, and vice versa.

In this experiment, we generate random ground truth trees with i.i.d. edge weights and produce measurements according to both objectives. Then, we generate a fixed set of random trees and find the tree that minimizes the cost function of each of the problems. For a randomly generated tree $T$, this gives two sets of measurements: one of the Dasgupta form $|L(T(i \wedge j))|$ and one of the tree-distance form $d_T(i,j)$. For each objective and each set of measurements, we select the binary tree in the fixed set that optimizes the corresponding cost, which gives four estimated trees in total. The goal is to compare, for each type of measurement, how far the $\ell_p$-optimal tree and the Dasgupta-optimal tree deviate from the ground truth tree $T$.

In Figure 5, we show the scatter plot of the results for each of the random trials. If the measurements are generated according to the $\ell_p$ objective, then the $\ell_p$-optimal tree recovers the underlying tree with higher accuracy compared to the Dasgupta-optimal tree, as desired. On the other hand, for dissimilarities generated according to Dasgupta’s objective, the $\ell_p$-optimal tree still manages to recover the underlying tree better than the Dasgupta-optimal tree for larger trees. This suggests that the $\ell_p$ objective could be more useful in recovering the underlying tree for more versatile types of dissimilarity measurements and larger datasets.

5. Hyperbolic tree-denoising

We describe next how to solve the constrained optimization Problem 4.2 in a hyperbolic space, and explain why our approach has the effect of data denoising for tree-metric learning. The HyperAid framework is depicted in Figure 2. In particular, the HyperAid encoder is described in Section 5.1, while the decoder is described in Section 5.2.

5.1. A $\delta$-hyperbolic metric as a soft tree-metric

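The first key idea is to exploit the fact that a tree-metric is $0$-hyperbolic. This can be deduced from the following theorem.

Theorem 5.1 (Theorem 1 in (Buneman, 1974)).

A graph is a tree if and only if it is connected, contains no loops and has shortest path distance $d$ satisfying the four-point condition:

$$d(x, y) + d(z, t) \leq \max\{d(x, z) + d(y, t),\ d(x, t) + d(y, z)\} \quad \text{for all vertices } x, y, z, t.$$

Note that through some simple algebra, one can rewrite condition (2) for $\delta$-hyperbolicity as

$$d(x, y) + d(z, r) \leq \max\{d(x, z) + d(y, r),\ d(x, r) + d(y, z)\} + 2\delta \quad \text{for all } x, y, z, r.$$

It is also known that any $0$-hyperbolic metric space embeds isometrically into a metric-tree (Dress, 1984).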
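The "simple algebra" behind the rewriting of condition (2) can be spelled out in a few lines. The derivation below is a sketch that only uses the definition of the Gromov product; $d$ denotes the metric, $x, y, z, r$ are arbitrary points and $\delta$ is the hyperbolicity constant, as in Section 3.

```latex
\begin{align*}
(x,y)_r &\ge \min\{(x,z)_r,\ (y,z)_r\} - \delta \\
\iff\ d(r,x) + d(r,y) - d(x,y)
  &\ge \min\{d(r,x) + d(r,z) - d(x,z),\ d(r,y) + d(r,z) - d(y,z)\} - 2\delta \\
\iff\ d(x,y)
  &\le \max\{d(r,y) - d(r,z) + d(x,z),\ d(r,x) - d(r,z) + d(y,z)\} + 2\delta \\
\iff\ d(x,y) + d(z,r)
  &\le \max\{d(x,z) + d(y,r),\ d(x,r) + d(y,z)\} + 2\delta .
\end{align*}
```

The second line multiplies both sides by two, the third moves $d(x,y)$ and the minimum to opposite sides (turning $-\min$ into $\max$), and the last adds $d(r,z)$ to both sides. Setting $\delta = 0$ and renaming $r$ as $t$ recovers the four-point condition of Theorem 5.1.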

The second key idea is that condition (2) with a value of $\delta > 0$ may be viewed as a relaxation of the tree-metric constraint. The parameter $\delta$ can be viewed as the slackness variable, while a $\delta$-hyperbolic metric can be viewed as a “soft” tree-metric. Hence, the smaller the value of $\delta$, the closer the output metric to a tree-metric. Denoising therefore refers to the process of reducing $\delta$.

Note that direct testing of the condition (2) takes $O(N^4)$ time, which is prohibitively complex in practice. Checking the soft tree-metric constraint directly is similarly costly; to avoid this problem we exploit the connection between $\delta$-hyperbolic metric spaces and negative curvature spaces. This is our third key idea, which allows us to efficiently optimize Problem 4.2 subject to the soft tree-metric constraint. More precisely, the hyperbolicity constant of a hyperbolic space of curvature $-c$ scales as $1/\sqrt{c}$ (see Appendix B). Hence, a distance metric induced by a hyperbolic space of negative curvature $-c$ will satisfy the soft tree-metric constraint with slackness at most proportional to $1/\sqrt{c}$. Therefore, using recent results pertaining to optimization in hyperbolic spaces (e.g., Riemannian SGD and Riemannian Adam (Bonnabel, 2013; Kochurov et al., 2020)), we can efficiently solve Problem 4.2 with a soft tree-metric constraint in a hyperbolic space. Combined with our previous argument, we demonstrate that finding the closest $\delta$-hyperbolic metric with small enough $\delta$ may be viewed as tree-metric denoising.

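Since the slackness shrinks as the curvature magnitude $c$ grows, one may want to let $c \to \infty$. Unfortunately, optimization in hyperbolic spaces is known to have precision-quality tradeoff problems (Sala et al., 2018; Nickel and Kiela, 2018), and some recent works have attempted to address this issue (Yu and De Sa, 2019, 2021). In practice, one should choose $c$ at a moderate scale. The loss optimized by the encoder in HyperAid equals

$$\mathcal{L}(x_1, \ldots, x_N) = \Big(\sum_{i,j \in [N]} \big|d_{\mathbb{B}_c^d}(x_i, x_j) - d_{i,j}\big|^p\Big)^{1/p},$$

where $x_i \in \mathbb{B}_c^d$ for all $i \in [N]$, and $d_{i,j}$ are dissimilarities between $N$ entities.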
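As an illustration of the encoder loss, the sketch below evaluates an $\ell_p$ discrepancy between Poincaré distances of candidate embeddings and target dissimilarities. It is a plain-Python evaluation only and makes several assumptions: the function names `poincare_dist` and `encoder_loss` are hypothetical, the sum runs over unordered pairs, and no optimization is performed (the Riemannian optimizers mentioned in the text are not implemented here).

```python
import math
from itertools import combinations


def poincare_dist(x, y, c=1.0):
    """Hyperbolic distance in the Poincare ball of curvature -c (arccosh form)."""
    sq = lambda v: sum(a * a for a in v)
    diff = [a - b for a, b in zip(x, y)]
    arg = 1.0 + 2.0 * c * sq(diff) / ((1.0 - c * sq(x)) * (1.0 - c * sq(y)))
    return math.acosh(arg) / math.sqrt(c)


def encoder_loss(points, d, c=1.0, p=2):
    """l_p norm of (hyperbolic distance - target dissimilarity) over unordered pairs."""
    total = 0.0
    for i, j in combinations(range(len(points)), 2):
        total += abs(poincare_dist(points[i], points[j], c) - d[i][j]) ** p
    return total ** (1.0 / p)


# Toy embeddings (assumed) and targets that they fit exactly.
points = [[0.0, 0.0], [0.5, 0.0], [0.0, -0.5]]
target = [[poincare_dist(points[i], points[j]) if i < j else 0.0 for j in range(3)]
          for i in range(3)]
print(encoder_loss(points, target))        # perfect fit

noisy = [row[:] for row in target]
noisy[0][1] += 0.3                         # perturb one measurement
print(encoder_loss(points, noisy))
```

In a full pipeline the embeddings would be updated to reduce this loss, and the resulting pairwise hyperbolic distances would serve as the denoised metric passed to the decoder.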

A comparison between HyperAid and HypHC is once more in order. The purposes of hyperbolic optimization in HypHC and in HyperAid are very different. HypHC aims to optimize the Dasgupta objective and introduces a continuous relaxation of this cost in hyperbolic space, based on continuous versions of lowest common ancestors (see Equation (9) in (Chami et al., 2020)). In contrast, we leverage hyperbolic geometry for learning soft tree-metrics under the $\ell_p$ loss, which denoises the input metrics to better fit “tree-like” measurements. The resulting hyperbolic metric can then be further used with any downstream tree-metric learning and inference method, and therefore has a level of universality not matched by HypHC, whose goal is to perform HC. Our ablation study to follow shows that HypHC cannot offer performance improvements like our approach when combined with downstream tree-metric learners.

5.2. Decoding the tree-metric

Tree-metric fitting methods produce trees based on arbitrary input metrics. In this context, the NJ (Saitou and Nei, 1987) algorithm is probably the most frequently used algorithm, especially in the context of reconstructing phylogenetic trees. TreeRep (Sonthalia and Gilbert, 2020), a divide-and-conquer method, was also recently proposed in the machine learning literature. In contrast to linkage-based methods that restrict their output to be ultrametrics, both NJ and TreeRep can output a general tree-metric. When the input metric is a tree-metric, TreeRep is guaranteed to recover it correctly in the absence of noise. It is also known that the local distortion of TreeRep is bounded in terms of $\delta$ when the input metric is not a tree-metric but a $\delta$-hyperbolic metric. NJ can recover the ground truth tree-metric when the input metric is a “slightly” perturbed tree-metric (Atteson, 1997). Note that methods that generate ultrametrics, such as linkage-based methods and Ufit, do not have similar accompanying theoretical results.

The key idea of HyperAid is that if we are able to reduce $\delta$ for the input metric, then NJ and TreeRep (as well as other methods) should offer better performance and avoid pitfalls such as negative tree edge-weights. Since an ultrametric is a special case of a tree-metric, we also expect that our denoising process will offer empirically better performance for ultrametrics as well, which is confirmed by the results presented in the Experiments Section.

Rooting the tree. The NJ and TreeRep methods produce unrooted trees, which may be seen as a drawback for HC applications. This property of the resulting tree is a consequence of the fact that NJ does not assume constant rates of change in the generative phenomena. The rooting issue may be resolved in practice through the use of outgroups, or through the following simple solution known as mid-point rooting. In mid-point rooting, the root is assumed to be the midpoint of the longest path between a pair of leaves; additional constraints may be used to ensure stability under limited leaf-removal. Rooting may also be performed based on priors, as is the case for outgroup rooting (Hess and De Moraes Russo, 2007).
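Mid-point rooting as described above can be sketched as follows. This is an illustrative implementation rather than the one used in our experiments: the tree is assumed to be given as a weighted adjacency dictionary with nonnegative edge weights, and the helper names `farthest` and `midpoint_root` are hypothetical. It finds the longest leaf-to-leaf path with two traversals and then walks along that path until half of its length is covered.

```python
def farthest(adj, start):
    """Return (farthest vertex, its distance, parent map) for a tree traversal from start."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in adj[u].items():
            if v not in dist:
                dist[v] = dist[u] + w
                parent[v] = u
                stack.append(v)
    far = max(dist, key=dist.get)
    return far, dist[far], parent


def midpoint_root(adj):
    """Locate the midpoint of the longest path in a weighted tree.

    Returns (v, u, offset): the root lies on edge (v, u), at distance
    `offset` from v when moving towards u.
    """
    a, _, _ = farthest(adj, next(iter(adj)))   # one end of a longest path
    b, length, parent = farthest(adj, a)       # the other end, with parents pointing to a
    half = length / 2.0
    walked = 0.0
    v = b
    while parent[v] is not None:
        u = parent[v]
        w = adj[v][u]
        if walked + w >= half:
            return v, u, half - walked
        walked += w
        v = u
    return v, v, 0.0


# Unrooted tree with leaves A, B, C attached to the internal vertex X.
adj = {
    "A": {"X": 1.0},
    "B": {"X": 3.0},
    "C": {"X": 2.0},
    "X": {"A": 1.0, "B": 3.0, "C": 2.0},
}
print(midpoint_root(adj))
```

In this example the longest path joins B and C (length 5), so the root is placed on the edge between X and B, at distance 0.5 from X.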

Complexity analysis. We focus on the time complexity of our encoder module. Let n denote the number of leaves. Note that for the exact computation of the (Dasgupta) loss, HypHC takes O(n^3) time, since the loss involves all triplets of leaves, while our ℓ_p norm loss only takes O(n^2) time, since it involves all pairs of leaves. Nevertheless, as pointed out by the authors of HypHC, one can reduce the time complexity by sampling only one triplet per leaf. A similar sampling approach can be applied in our encoder, which lowers its time complexity accordingly. Hence, the time complexity of our HyperAid encoder is significantly lower than that of HypHC, as we focus on a different objective function. With respect to tree-metric and ultrametric fitting methods, we still have low-order polynomial time complexities. For example, NJ runs in O(n^3) time, while an advanced implementation of UPGMA requires O(n^2) time (Murtagh, 1984).

6. Experiments

To demonstrate the effectiveness of our HyperAid framework, we conduct extensive experiments on both synthetic and real-world datasets. We mainly focus on determining if denoised data leads to decoded trees that have smaller losses compared to those generated by decoders that operate on the input metrics directly (Direct). Due to space limitations, we only examine the ℓ_2 norm loss, as this is the most common choice of objective in practice; HyperAid easily generalizes to arbitrary p. An in-depth description of experiments is deferred to the Supplement.
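As a point of reference for the evaluation criterion, the following is a minimal sketch of an ℓ_p norm loss between an input metric and a decoded tree-metric. It assumes that the loss is taken over unordered pairs of leaves and that both metrics are given as full symmetric matrices; the function name `lp_loss` is illustrative and not part of our released code.

```python
def lp_loss(d_input, d_tree, p=2.0):
    """l_p norm of the difference between two metrics over unordered leaf pairs."""
    n = len(d_input)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(d_input[i][j] - d_tree[i][j]) ** p
    return total ** (1.0 / p)


# Toy example: the two metrics differ by 3 and 4 on two pairs and agree on the third.
d_input = [[0, 5, 6], [5, 0, 7], [6, 7, 0]]
d_tree = [[0, 2, 2], [2, 0, 7], [2, 7, 0]]
print(lp_loss(d_input, d_tree, p=2.0))
```

The same function with `p=1.0` returns the sum of absolute deviations, which illustrates how the evaluation extends beyond the ℓ_2 case.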

Tree-metric methods. We test both NJ and TreeRep methods as decoders. Note that TreeRep is a randomized algorithm, so we run it multiple times and pick the tree with the lowest loss, as suggested in the original paper (Sonthalia and Gilbert, 2020). We also test the T-REX (Boc et al., 2012) version of NJ and Unweighted NJ (UNJ) (Gascuel et al., 1997), both of which outperform NJ due to their sophisticated post-processing step. T-REX is not an open-source toolkit and we can only use it by interfacing with its website manually, which limits the size of the trees that can be tested; this is the reason for reporting the performance of T-REX only for small-size real-world datasets.

Ultrametric methods. We also test if our framework can help ultrametric fitting methods, including linkage-based methods (single, complete, average, weighted) and gradient-based ultrametric methods such as Ufit. For Ufit, we use the default hyperparameters provided by the authors (Chierchia and Perret, 2020), akin to (Chami et al., 2020).
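The distinction between ultrametrics and general tree-metrics can be checked directly on a distance matrix. The sketch below tests the strong triangle inequality d(x, z) ≤ max(d(x, y), d(y, z)), which characterizes ultrametrics; the function name `is_ultrametric` and the tolerance parameter are illustrative choices.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-9):
    """Check the strong triangle inequality for every ordered triple of points."""
    n = len(d)
    return all(
        d[x][z] <= max(d[x][y], d[y][z]) + tol
        for x, y, z in permutations(range(n), 3)
    )


# Distances read off a dendrogram: points 0 and 1 merge at height 1, then join point 2 at height 2.
dendrogram_metric = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]

# Tree-metric of a path 0 - 1 - 2 with edge weights 1 and 2: additive but not ultrametric.
path_metric = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]

print(is_ultrametric(dendrogram_metric), is_ultrametric(path_metric))
```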

6.1. Synthetic datasets

Figure 6. Visualization of the resulting trees for zoo (left) and iris (right). Vertex colors indicate the ground truth labels.
| Decoder | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) |
|---|---|---|---|---|---|---|---|---|---|
| NJ | 139.57 | 117.19 | 19.09 | 97.14 | 93.14 | 4.29 | 80.92 | 72.36 | 11.82 |
| TreeRep | 167.59 | 150.37 | 11.45 | 169.06 | 138.28 | 22.26 | 131.52 | 97.15 | 35.37 |
| single | 379.14 | 370.55 | 2.31 | 290.11 | 254.77 | 13.87 | 236.10 | 196.38 | 20.22 |
| complete | 376.24 | 355.54 | 5.82 | 293.06 | 279.55 | 4.83 | 281.92 | 231.41 | 21.82 |
| average | 148.68 | 149.83 | -0.77 | 108.05 | 106.64 | 1.32 | 85.22 | 83.63 | 1.89 |
| weighted | 162.35 | 150.86 | 7.61 | 132.96 | 130.85 | 1.61 | 111.55 | 104.41 | 6.84 |
| Ufit | 228.21 | 227.25 | 0.42 | 131.97 | 124.78 | 5.76 | 90.20 | 85.45 | 5.56 |

| Decoder | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) |
|---|---|---|---|---|---|---|---|---|---|
| NJ | 283.73 | 259.38 | 9.38 | 201.71 | 181.52 | 11.12 | 151.58 | 142.86 | 6.09 |
| TreeRep | 505.03 | 318.62 | 58.50 | 383.12 | 228.30 | 67.81 | 343.21 | 192.57 | 78.22 |
| single | 936.98 | 767.25 | 22.12 | 688.52 | 519.70 | 32.48 | 563.58 | 431.50 | 30.61 |
| complete | 826.73 | 698.71 | 18.32 | 738.98 | 583.67 | 26.60 | 619.42 | 454.95 | 36.14 |
| average | 323.58 | 326.26 | -0.82 | 236.29 | 234.92 | 0.58 | 182.29 | 178.40 | 2.18 |
| weighted | 340.93 | 340.30 | 0.18 | 301.30 | 316.85 | -4.90 | 262.61 | 273.47 | -3.97 |
| Ufit | 681.52 | 556.37 | 22.49 | 388.47 | 303.14 | 28.15 | 231.69 | 198.29 | 16.84 |

| Decoder | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) | Direct | HyperAid | Gain (%) |
|---|---|---|---|---|---|---|---|---|---|
| NJ | 910.31 | 642.68 | 41.64 | 450.57 | 420.97 | 7.03 | 320.29 | 312.83 | 2.38 |
| TreeRep | 1484.36 | 1078.49 | 37.63 | 1125.81 | 751.72 | 49.76 | 925.65 | 551.10 | 67.96 |
| single | 2343.78 | 2277.91 | 2.89 | 1666.06 | 1397.08 | 19.25 | 1313.64 | 1066.32 | 23.19 |
| complete | 2186.13 | 2024.22 | 7.99 | 1786.13 | 1532.99 | 16.51 | 1565.70 | 1390.28 | 12.61 |
| average | 758.35 | 756.91 | 0.18 | 521.83 | 521.91 | -0.01 | 392.09 | 386.87 | 1.35 |
| weighted | 794.90 | 776.84 | 2.32 | 681.73 | 683.01 | -0.18 | 626.06 | 600.36 | 4.27 |
| Ufit | 2026.42 | 1974.59 | 2.62 | 1206.72 | 1008.93 | 19.60 | 774.92 | 641.07 | 20.88 |

Table 1. The ℓ_2 norm loss of the resulting trees, averaged over three independent runs. Each group of three columns (Direct, HyperAid, Gain) corresponds to one synthetic dataset, characterized by its number of leaves, its edge-noise ratio, the hyperbolicity of the raw input metric and of our learned hyperbolic metric, and the ℓ_2 loss of the hyperbolic distances that we learned (EL). For each decoder, the smaller of the Direct and HyperAid losses is the better result. Gain is computed as Direct/HyperAid − 1. Note that the first two decoders produce tree-metrics while the remaining ones result in ultrametrics.

The edge-noise rate controls the number of random edges added to a tree as a means of introducing perturbations. We generate a random binary tree and then add random edges between vertices, including both leaves and internal vertices. The input metric to our encoder is the collection of shortest path distances between leaves. We generate such corrupted random binary trees for the nine settings reported in Table 1. All the results are averages over three independent runs with the same set of input metrics. Note that this edge-noise model results in graph-generated similarities which may not have full generality but are amenable to simple evaluations of HyperAid in terms of detecting spurious noisy edges and removing their effect from the resulting tree-fits or hierarchies.
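The edge-noise model can be illustrated with the following sketch. It is a simplified stand-in rather than our actual data generator: the random binary tree is built by repeatedly merging two random subtrees, all edges have unit weight, and the number of added edges is passed directly as the hypothetical parameter `n_noise_edges` instead of being derived from an edge-noise rate.

```python
import random
from collections import deque


def noisy_tree_metric(n_leaves, n_noise_edges, seed=0):
    """Shortest-path distances between the leaves of an edge-augmented random binary tree."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n_leaves)}   # leaves are vertices 0..n_leaves-1
    active = list(range(n_leaves))
    next_id = n_leaves
    # Build a random binary tree by merging two random active subtrees at a time.
    while len(active) > 1:
        a = active.pop(rng.randrange(len(active)))
        b = active.pop(rng.randrange(len(active)))
        adj[next_id] = {a, b}
        adj[a].add(next_id)
        adj[b].add(next_id)
        active.append(next_id)
        next_id += 1
    # Add random edges between distinct, not yet adjacent vertices (leaves or internal).
    # Assumes n_noise_edges is small compared to the number of available vertex pairs.
    nodes = list(adj)
    added = 0
    while added < n_noise_edges:
        u, v = rng.sample(nodes, 2)
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            added += 1

    def bfs(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    rows = []
    for i in range(n_leaves):
        dist = bfs(i)
        rows.append([dist[j] for j in range(n_leaves)])
    return rows


clean = noisy_tree_metric(8, 0, seed=1)   # pure tree-metric
noisy = noisy_tree_metric(8, 5, seed=1)   # same tree with five extra edges
```

Because the same seed produces the same underlying tree, the added edges can only shorten distances, which is the kind of perturbation the denoising step is meant to undo.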

It is easy to see that our framework consistently improves the performance of tree-metric learning methods, with gains as large as 41.64% for NJ and 78.22% for TreeRep (and 12.54% and 47.66% on average, respectively). Furthermore, one can see that the hyperbolic metric learned by HyperAid indeed results in smaller δ-hyperbolicity values compared to the input. For ultrametric methods, our framework frequently leads to improvements but not of the same order as those seen for general tree-metrics. We also test the centroid, median and ward linkage methods for raw input metric data, but given that these methods require the input metric to be Euclidean we could not combine them with HyperAid. Nevertheless, their results are significantly worse than those offered by our framework and thus not included in Table 1.
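The Gain statistic follows the definition given in the caption of Table 1 (Direct/HyperAid − 1). As a small worked example, the sketch below recomputes the maximum and average gains for the NJ decoder from the rounded losses listed in Table 1; the helper name `gain_percent` is illustrative.

```python
def gain_percent(direct_loss, hyperaid_loss):
    """Gain in percent, following the Table 1 caption: Direct / HyperAid - 1."""
    return (direct_loss / hyperaid_loss - 1.0) * 100.0


# (Direct, HyperAid) loss pairs for the NJ decoder, copied from Table 1.
nj_pairs = [
    (139.57, 117.19), (97.14, 93.14), (80.92, 72.36),
    (283.73, 259.38), (201.71, 181.52), (151.58, 142.86),
    (910.31, 642.68), (450.57, 420.97), (320.29, 312.83),
]

nj_gains = [gain_percent(direct, hyperaid) for direct, hyperaid in nj_pairs]
max_gain = max(nj_gains)
mean_gain = sum(nj_gains) / len(nj_gains)
print(round(max_gain, 2), round(mean_gain, 2))
```

Since the table entries are rounded, the recomputed values can differ from the tabulated gains in the last digit.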

6.2. Real-world datasets

We examine the effectiveness of our HyperAid framework on five standard datasets from the UCI Machine Learning Repository (Dua and Graff, 2017). We use cosine similarities as the raw input metrics. The results are listed in Table 2.

One can see that the HyperAid framework again consistently and significantly improves the performance of all downstream methods (reaching gains as large as 192.53%; see Table 2). Surprisingly, we find that average linkage, the best ultrametric fitting method, achieves smaller ℓ_2 norm losses on the Spambase dataset when compared to tree-metric methods like NJ. We conjecture two possible explanations: 1) We have not explored more powerful tree-metric reconstruction methods like T-REX. Again, the reason for not being able to make full use of T-REX is the fact that the software requires interactive data loading and processing; 2) The optimal tree-metric is very close to an ultrametric. It is still worth pointing out that despite the fact that the ultrametric fits have smaller loss than tree-metric fits, even in the former case HyperAid consistently improved all tested algorithms on real-world datasets.

6.3. Ablation study

While minimizing the ℓ_2 loss in the encoding process of the HyperAid framework is an intuitively good choice given that our final tree-metric fit is evaluated based on the ℓ_2 loss, it is still possible to choose other objective functions (such as the Dasgupta loss) for the encoder module. We therefore compare the performance of the hyperbolic metric learned using HypHC, centered on the Dasgupta objective, with the performance of our HyperAid encoder coupled with several possible decoding methods. For HypHC, we used the official implementation and hyperparameters provided by the authors (Chami et al., 2020). The results are listed in Table 3 in the Supplement. The results reveal that minimizing the ℓ_2 loss during hyperbolic embedding indeed offers better results than those obtained by using the Dasgupta loss when our ultimate performance criterion is the ℓ_2 loss. Note that the default decoding rule of HypHC produces unweighted trees and hence offers worse performance than the tested mode (and the results are therefore omitted from the table).

6.4. Tree visualization and analysis

One conceptually straightforward method for evaluating the quality of the hierarchical clustering is to use the ground truth labels of object classes available for real-world datasets such as Zoo and Iris.

The Zoo dataset comprises seven classes of the animal kingdom, including insects, cartilaginous and bony fish, amphibians, reptiles, birds and mammals. The hierarchical clusterings based on NJ, HyperAid+NJ and HypHC reveal that the first and last method tend to misplace amphibians and bony fish as well as insects and cartilaginous fish by placing them in the same subtrees. The NJ method also shows a propensity to create “path-like trees” which do not adequately capture known phylogenetic information (e.g., for mammals in particular). On the other hand, HyperAid combined with NJ discriminates accurately between all seven classes with the exception of two branches of amphibian species.

The Iris dataset includes three groups of iris flowers, Iris-setosa, Iris-versicolor and Iris-virginica. All three methods perfectly identify the subtrees of Iris-setosa species, but once again the NJ method produces a “path-like tree” for all three groups which does not conform to nonlinear models of evolution. It is well-known that it is hard to distinguish Iris versicolor and Iris virginica flowers as both have similar flower color and growth/bloom times. As a result, all three methods result in subtrees containing intertwined clusters of flowers from both groups.

| Decoder | Zoo Direct | Zoo HyperAid | Zoo Gain (%) | Iris Direct | Iris HyperAid | Iris Gain (%) | Glass Direct | Glass HyperAid | Glass Gain (%) | Segmentation Direct | Segmentation HyperAid | Segmentation Gain (%) | Spambase Direct | Spambase HyperAid | Spambase Gain (%) | Average Gain (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NJ | 11.77 | 8.99 | 30.94 | 43.85 | 18.17 | 141.22 | 60.96 | 30.76 | 98.15 | 856.47 | 292.78 | 192.53 | 1008.6 | 377.93 | 166.87 | 125.94 |
| TreeRep | 14.65 | 9.32 | 57.12 | 22.59 | 19.00 | 18.89 | 71.81 | 32.00 | 124.41 | 1357.28 | 1038.98 | 30.64 | 2279.31 | 927.02 | 145.87 | 75.38 |
| NJ(T-REX) | 8.85 | 8.59 | 3.09 | 26.38 | 16.96 | 55.51 | 37.61 | 29.58 | 27.11 | NA | NA | NA | NA | NA | NA | 28.57 |
| UNJ(T-REX) | 8.60 | 8.57 | 0.31 | 17.47 | 16.86 | 3.58 | 29.41 | 29.38 | 0.09 | NA | NA | NA | NA | NA | NA | 1.33 |
| single | 34.53 | 17.17 | 101.11 | 84.66 | 65.97 | 28.33 | 95.91 | 52.82 | 81.55 | 1174.95 | 924.76 | 27.06 | 1809.83 | 1109.75 | 63.08 | 60.23 |
| complete | 20.78 | 10.64 | 95.27 | 58.13 | 30.95 | 87.80 | 88.19 | 37.75 | 133.61 | 928.63 | 486.31 | 90.96 | 1248.15 | 633.70 | 96.96 | 100.92 |
| average | 10.04 | 9.17 | 9.47 | 23.38 | 22.99 | 1.68 | 31.94 | 30.79 | 3.73 | 339.41 | 300.07 | 13.11 | 365.78 | 350.68 | 4.31 | 6.46 |
| weighted | 12.08 | 11.05 | 9.33 | 27.20 | 29.07 | -6.42 | 34.75 | 34.16 | 1.74 | 458.17 | 337.75 | 35.66 | 385.37 | 382.13 | 0.84 | 8.22 |
| Ufit | 10.53 | 9.35 | 12.70 | 26.02 | 23.02 | 13.03 | 33.21 | 31.04 | 6.98 | 354.66 | 314.08 | 12.92 | 381.68 | 377.67 | 1.06 | 9.34 |

Table 2. The ℓ_2 norm loss of the resulting trees (see the caption of Table 1 for the format description). Where indicated, the hyperbolicity value is approximated by random sampling of one million quadruples of leaves. NA refers to “not applicable”. Note that the first four decoders produce tree-metrics while the remaining ones produce ultrametrics.
The work was supported by the NSF CIF program under grant number 19566384.


  • I. Abraham, M. Balakrishnan, F. Kuhn, D. Malkhi, V. Ramasubramanian, and K. Talwar (2007) Reconstructing approximate tree metrics. In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pp. 43–52. Cited by: §1.
  • R. Agarwala, V. Bafna, M. Farach, M. Paterson, and M. Thorup (1998) On the approximability of numerical taxonomy (fitting distances by tree metrics). SIAM Journal on Computing 28 (3), pp. 1073–1085. Cited by: §2.
  • S. Ahmadian, V. Chatziafratis, A. Epasto, E. Lee, M. Mahdian, K. Makarychev, and G. Yaroslavtsev (2019) Bisect and conquer: hierarchical clustering via max-uncut bisection. arXiv preprint arXiv:1912.06983. Cited by: §2.
  • N. Ailon and M. Charikar (2005) Fitting tree metrics: hierarchical clustering and phylogeny. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pp. 73–82. Cited by: §1, §1, §2, §3, §3.
  • N. Alon, Y. Azar, and D. Vainstein (2020) Hierarchical clustering: a 0.585 revenue approximation. In Conference on Learning Theory, pp. 153–162. Cited by: §2.
  • K. Atteson (1997) The performance of neighbor-joining algorithms of phylogeny reconstruction. In International Computing and Combinatorics Conference, pp. 101–110. Cited by: §5.2.
  • A. Boc, A. B. Diallo, and V. Makarenkov (2012) T-rex: a web server for inferring, validating and visualizing phylogenetic trees and networks. Nucleic acids research 40 (W1), pp. W573–W579. Cited by: §1, §6.
  • S. Bonnabel (2013) Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control 58 (9), pp. 2217–2229. Cited by: §5.1.
  • P. Buneman (1974) A note on the metric properties of trees. Journal of combinatorial theory, series B 17 (1), pp. 48–50. Cited by: Theorem 5.1.
  • I. Chami, A. Gu, V. Chatziafratis, and C. Ré (2020) From trees to continuous embeddings and back: hyperbolic hierarchical clustering. arXiv preprint arXiv:2010.00402. Cited by: §C.1, §2, §5.1, §6.3, §6.
  • I. Chami, Z. Ying, C. Ré, and J. Leskovec (2019) Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4868–4879. Cited by: §2.
  • M. Charikar, V. Chatziafratis, and R. Niazadeh (2019) Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2291–2304. Cited by: §1.
  • M. Charikar and V. Chatziafratis (2017) Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 841–854. Cited by: §2.
  • E. Chien, C. Pan, P. Tabaghi, and O. Milenkovic (2021) Highly scalable and provably accurate classification in poincaré balls. In 2021 IEEE International Conference on Data Mining (ICDM), pp. 61–70. Cited by: §2, §3.
  • G. Chierchia and B. Perret (2020) Ultrametric fitting by gradient descent. Journal of Statistical Mechanics: Theory and Experiment 2020 (12), pp. 124004. Cited by: §2, §6.
  • H. Cho, B. DeMeo, J. Peng, and B. Berger (2019) Large-margin classification in hyperbolic space. In International Conference on Artificial Intelligence and Statistics, pp. 1832–1840. Cited by: §2.
  • A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. Viégas, and M. Wattenberg (2019) Visualizing and measuring the geometry of bert. arXiv preprint arXiv:1906.02715. Cited by: §1.
  • V. Cohen-Addad, D. Das, E. Kipouridis, N. Parotsidis, and M. Thorup (2021) Fitting distances by tree metrics minimizing the total error within a constant factor. arXiv preprint arXiv:2110.02807. Cited by: §1, §2.
  • S. Dasgupta (2016) A cost function for similarity-based hierarchical clustering. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 118–127. Cited by: 3rd item, §1, §2, §2, §4.1.
  • A. W. Dress (1984) Trees, tight extensions of metric spaces, and the cohomological dimension of certain groups: a note on combinatorial properties of metric spaces. Advances in Mathematics 53 (3), pp. 321–402. Cited by: §5.1.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §6.2.
  • M. Farach, S. Kannan, and T. Warnow (1995) A robust model for finding optimal evolutionary trees. Algorithmica 13 (1), pp. 155–179. Cited by: §2.
  • O. Ganea, G. Bécigneul, and T. Hofmann (2018) Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pp. 5345–5355. Cited by: §2, §3.
  • O. Gascuel et al. (1997) Concerning the nj algorithm and its unweighted version, unj. Cited by: §6.
  • M. Gromov (1987) Hyperbolic groups. In Essays in group theory, pp. 75–263. Cited by: §1, §3.
  • B. Harb, S. Kannan, and A. McGregor (2005) Approximating the best-fit tree under l p norms. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pp. 123–133. Cited by: 3rd item, §1.
  • P. N. Hess and C. A. De Moraes Russo (2007) An empirical test of the midpoint rooting method. Biological Journal of the Linnean society 92 (4), pp. 669–674. Cited by: §5.2.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129–4138. Cited by: §1.
  • M. Kochurov, R. Karimov, and S. Kozlukov (2020) Geoopt: riemannian optimization in pytorch. arXiv preprint arXiv:2005.02819. Cited by: §5.1.
  • D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguná (2010) Hyperbolic geometry of complex networks. Physical Review E 82 (3). Cited by: §2.
  • N. Linial, E. London, and Y. Rabinovich (1995) The geometry of graphs and some of its algorithmic applications. Combinatorica 15 (2), pp. 215–245. Cited by: §1.
  • Q. Liu, M. Nickel, and D. Kiela (2019) Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems, pp. 8230–8241. Cited by: §2.
  • S. Macaluso, C. Greenberg, N. Monath, J. A. Lee, P. Flaherty, K. Cranmer, A. McGregor, and A. McCallum (2021) Cluster trellis: data structures & algorithms for exact inference in hierarchical clustering. In International Conference on Artificial Intelligence and Statistics, pp. 2467–2475. Cited by: §1, §1.
  • N. Monath, M. Zaheer, D. Silva, A. McCallum, and A. Ahmed (2019) Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 714–722. Cited by: §2.
  • B. Moseley and J. R. Wang (2017) Approximation bounds for hierarchical clustering: average linkage, bisecting k-means, and local search. In Proceedings of the 31st International Conference on Neural Information Processing Systems. Cited by: §2.
  • F. Murtagh (1984) Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics Quarterly 1 (2), pp. 101–113. Cited by: §5.2.
  • M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. Advances in neural information processing systems 30, pp. 6338–6347. Cited by: §C.1, §1, §2.
  • M. Nickel and D. Kiela (2018) Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International Conference on Machine Learning, pp. 3779–3788. Cited by: §2, §5.1.
  • F. Papadopoulos, R. Aldecoa, and D. Krioukov (2015) Network geometry inference using common neighbors. Physical Review E 92 (2). Cited by: §2.
  • R. Sahoo, I. Chami, and C. Ré (2020) Tree covers: an alternative to metric embeddings. In Differential Geometry for Machine Learning Workshop at NeurIPS, Cited by: §1.
  • N. Saitou and M. Nei (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees.. Molecular biology and evolution 4 (4), pp. 406–425. Cited by: §1, §5.2.
  • F. Sala, C. De Sa, A. Gu, and C. Re (2018) Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, Vol. 80, pp. 4460–4469. Cited by: §2, §5.1.
  • R. Sarkar (2011) Low distortion delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366. Cited by: §2.
  • R. Shimizu, Y. Mukuta, and T. Harada (2021) Hyperbolic neural networks++. In International Conference on Learning Representations. Cited by: §2.
  • R. Sonthalia and A. C. Gilbert (2020) Tree! i am no tree! i am a low dimensional hyperbolic embedding. arXiv preprint arXiv:2005.03847. Cited by: §C.2, §1, §3, §3, §5.2, §6.
  • P. Tabaghi, E. Chien, C. Pan, J. Peng, and O. Milenković (2021) Linear classifiers in product space forms. arXiv preprint arXiv:2102.10204. Cited by: §2, §3.
  • A. Tifrea, G. Becigneul, and O. Ganea (2019) Poincaré glove: hyperbolic word embeddings. In International Conference on Learning Representations. Cited by: §2.
  • A. A. Ungar (2008) Analytic hyperbolic geometry and albert einstein’s special theory of relativity. World Scientific. Cited by: §3.
  • J. Väisälä (2005) Gromov hyperbolic spaces. Expositiones Mathematicae 23 (3), pp. 187–231. Cited by: Appendix B.
  • J. Vermeer (2005) A geometric interpretation of ungar’s addition and of gyration in the hyperbolic plane. Topology and its Applications 152 (3), pp. 226–242. Cited by: §3.
  • H. Wareham (1993) On the complexity of inferring evolutionary trees. Technical Report Technical Report 9301. Cited by: §2.
  • M. S. Waterman, T. F. Smith, M. Singh, and W. A. Beyer (1977) Additive evolutionary trees. Journal of theoretical Biology 64 (2), pp. 199–213. Cited by: §2, §2.
  • T. J. Wheeler (2009) Large-scale neighbor-joining with ninja. In International Workshop on Algorithms in Bioinformatics, pp. 375–389. Cited by: §1.
  • T. Yu and C. M. De Sa (2019) Numerically accurate hyperbolic embeddings using tiling-based models. Advances in Neural Information Processing Systems 32. Cited by: §5.1.
  • T. Yu and C. M. De Sa (2021) Representing hyperbolic space accurately using multi-component floats. Advances in Neural Information Processing Systems 34. Cited by: §5.1.

Appendix A Proof of Proposition 4.1

Proposition A.1.

The objective in Problem 4.2 is a convex function of the edge weights of the tree.


The objective in Problem 4.2 can be written as

where is the set of admissible linear operators that map edge weights to additive pairwise distances on a binary tree with leaves. Let . For a , we have

This result follows from the convexity of the ℓ_p norm for p ≥ 1. ∎
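The convexity argument can be spelled out as follows. The notation is an assumption made for this sketch only: $d$ denotes the vector of input distances between leaf pairs, $A$ a fixed admissible linear operator mapping edge weights to additive pairwise distances, and $w_1, w_2$ two vectors of edge weights. For any $\lambda \in [0,1]$, the triangle inequality and the absolute homogeneity of the $\ell_p$ norm with $p \geq 1$ give the required bound.

```latex
\begin{align*}
\left\| d - A\bigl(\lambda w_1 + (1-\lambda) w_2\bigr) \right\|_p
  &= \left\| \lambda\,(d - A w_1) + (1-\lambda)\,(d - A w_2) \right\|_p \\
  &\leq \lambda \left\| d - A w_1 \right\|_p + (1-\lambda) \left\| d - A w_2 \right\|_p .
\end{align*}
```

The first line uses only the linearity of $A$ and the identity $d = \lambda d + (1-\lambda) d$; the second line is where the restriction $p \geq 1$ is needed, since for $p < 1$ the triangle inequality fails.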

Appendix B Proof of

Theorem B.1.



Let us define a one-to-one map , where . The Gromov hyperbolicity of the Poincaré ball is finite (Väisälä, 2005) and it scales linearly with the pairwise distances; see Definition 3.2. Since for all , the Gromov hyperbolicity of equals . ∎

Appendix C Additional experiment details

C.1. Detailed description of the encoder

We implemented our framework in Python. The encoder part is developed in PyTorch, and is a modification based on the official implementation of HypHC.

Training details. Note that all hyperparameters for each dataset are described in our implementation. Here, we briefly describe the most important ones. All the hyperparameters are tuned based on the loss of HyperAid+NJ. Note that since we are solving a search problem, this is a standard approach: for combinatorial search algorithms, one usually reports the best loss for different random seeds. The authors of HypHC also report the best loss results over several random seeds (Chami et al., 2020).

When training hyperbolic embeddings in the Poincaré model, we follow the suggestion stated in (Nickel and Kiela, 2017). We initialize the embeddings within a small radius 1e-6 and train with a small learning rate for a few (burnin) epochs, then increase the learning rate by a burnin_factor. We also tune the negative curvature c for different datasets. As mentioned in the main text, although choosing a larger c makes the resulting hyperbolic metric more similar to a tree-metric (i.e., δ becomes smaller as c grows), it can make the optimization process harder and can potentially cause numerical issues. For synthetic datasets, we find that scaling the input metric by a factor scaling_factor during training (and scaling it back before decoding) can help with improving the final results. We conjecture that this is due to the numerical precision issues associated with hyperbolic embedding approaches for which the resulting points are close to the boundary of the Poincaré ball. We use Riemannian Adam implemented by the authors of HypHC as our optimizer.
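The burn-in schedule and the metric rescaling described above can be sketched as follows. This is an illustration and not our training code: the names `burnin`, `burnin_factor` and `scaling_factor` are taken from the text, whereas the functions `learning_rate`, `scale_metric` and `unscale_metric` are hypothetical helpers, and "increasing the learning rate by a burnin_factor" is interpreted here as a multiplication.

```python
def learning_rate(epoch, base_lr, burnin, burnin_factor):
    """Small learning rate for the first `burnin` epochs, then base_lr * burnin_factor.

    Assumption: the increase after burn-in is multiplicative.
    """
    if epoch < burnin:
        return base_lr
    return base_lr * burnin_factor


def scale_metric(d, scaling_factor):
    """Multiply every entry of the input metric by scaling_factor before training."""
    return [[x * scaling_factor for x in row] for row in d]


def unscale_metric(d, scaling_factor):
    """Undo the scaling on the learned metric before passing it to a decoder."""
    return [[x / scaling_factor for x in row] for row in d]


# Example schedule: 3 burn-in epochs at 0.5, then a four times larger rate.
schedule = [learning_rate(e, base_lr=0.5, burnin=3, burnin_factor=4.0) for e in range(5)]

d = [[0.0, 1.5], [1.5, 0.0]]
roundtrip = unscale_metric(scale_metric(d, 2.0), 2.0)
```

Rescaling leaves the tree structure of the metric unchanged, so it only affects where the embedded points sit relative to the boundary of the ball during optimization.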

C.2. Detailed description of the decoder

For NJ and TreeRep, we use the implementation from the TreeRep paper. We manually interact with the T-REX web server for NJ-TREX and UNJ-TREX. For linkage-based methods, we use the scipy library. For Ufit, we use the official implementation along with default hyperparameters.

Remarks. The NJ implementation, which is based on the PhyloNetworks library in Julia, is a naive version of the method. It is known from the literature that such versions of NJ can potentially produce negative edge weights. In the naive version of NJ, one simply replaces the negative edge-weights by 0. On the other hand, a more sophisticated cleaning approach is implemented in T-REX, which is the reason why the T-REX version offers better performance. Note that TreeRep is also known to have the negative edge-weights problem, where the naive correction (replacing them with 0) was adopted by the authors (Sonthalia and Gilbert, 2020).
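The naive correction mentioned above amounts to clipping edge weights at zero. A minimal sketch, assuming the decoded tree is available as a dictionary from edges to weights (the representation and the function name `clip_negative_weights` are illustrative):

```python
def clip_negative_weights(edge_weights):
    """Replace every negative edge weight by 0 and leave the other weights unchanged."""
    return {edge: max(weight, 0.0) for edge, weight in edge_weights.items()}


# Toy decoded tree with one negative edge weight.
decoded = {("a", "x"): 1.2, ("b", "x"): -0.3, ("x", "c"): 0.7}
cleaned = clip_negative_weights(decoded)
print(cleaned)
```

Clipping changes the induced tree-metric on every path through the affected edge, which is one reason why more careful post-processing can yield lower losses than this naive rule.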

C.3. Additional results

Figure 7. Visualization of resulting trees for Glass. Vertex colors indicate the ground truth labels.
| Decoder | Zoo HyperAid | Zoo HypHC | Iris HyperAid | Iris HypHC | Glass HyperAid | Glass HypHC |
|---|---|---|---|---|---|---|
| NJ | 11.77 | 31.55 | 18.18 | 70.43 | 30.76 | 87.75 |
| TreeRep | 14.65 | 28.51 | 19.00 | 64.61 | 32.00 | 80.31 |
| single | 34.53 | 45.01 | 65.97 | 85.40 | 52.83 | 108.44 |
| complete | 20.78 | 24.01 | 30.96 | 59.60 | 37.75 | 69.54 |
| average | 10.05 | 33.31 | 23.00 | 72.36 | 30.79 | 90.58 |
| weighted | 12.09 | 32.97 | 29.07 | 71.00 | 34.16 | 87.15 |
| Ufit | 10.54 | 33.07 | 23.03 | 73.19 | 31.04 | 90.34 |

Table 3. The losses of the resulting trees generated as part of the ablation study pertaining to methods for generating hyperbolic embeddings.

Tree visualization for the Glass dataset. Clustering results obtained for the Glass dataset, which comprises seven classes of material, are the hardest to interpret, as all three methods only accurately cluster tableware glass. HyperAid appears to better separate float-processed building-windows glass from nonfloat-processed building-windows glass.