# Hyperbolic Manifold Regression

Geometric representation learning has recently shown great promise in several machine learning settings, ranging from relational learning to language processing and generative models. In this work, we consider the problem of performing manifold-valued regression onto a hyperbolic space as an intermediate component for a number of relevant machine learning applications. In particular, by formulating the problem of predicting nodes of a tree as a manifold regression task in hyperbolic space, we propose a novel perspective on two challenging tasks: 1) hierarchical classification via label embeddings and 2) taxonomy extension of hyperbolic representations. To address the regression problem we consider previous methods and propose two novel approaches that are computationally more advantageous: a parametric deep learning model that is informed by the geodesics of the target space and a non-parametric kernel method for which we also prove excess risk bounds. Our experiments show that the strategy of leveraging the hyperbolic geometry is promising. In particular, in the taxonomy expansion setting, we find that the hyperbolic-based estimators significantly outperform methods performing regression in the ambient Euclidean space.

## 1 Introduction

¹Istituto Italiano di Tecnologia, Via Morego, 30, Genoa 16163, Italy. ²Imperial College London, SW7 2BU London, United Kingdom. ³Massachusetts Institute of Technology, Cambridge, MA 02139, USA. ⁴Università degli Studi di Genova, Genova, Italy.

Representation learning is a key paradigm in machine learning and artificial intelligence. It has enabled important breakthroughs in computer vision (Krizhevsky et al., 2012; He et al., 2016), natural language processing (Mikolov et al., 2013; Bojanowski et al., 2016; Joulin et al., 2016), relational learning (Nickel et al., 2011; Perozzi et al., 2014), generative modeling (Kingma and Welling, 2013; Radford et al., 2015), and many other areas (Bengio et al., 2013; LeCun et al., 2015). Its objective is typically to infer latent feature representations of objects (e.g., images, words, entities, concepts) such that their similarity or distance in the representation space captures their semantic similarity. For this purpose, the geometry of the representation space has recently received increased attention (Wilson et al., 2014; Falorsi et al., 2018; Davidson et al., 2018; Xu and Durrett, 2018). Here, we focus on Riemannian representation spaces and in particular on hyperbolic geometry. Nickel and Kiela (2017) introduced Poincaré embeddings to infer hierarchical representations of symbolic data, which led to substantial gains in representational efficiency and generalization performance. Hyperbolic representations have since been extended to other manifolds (Nickel and Kiela, 2018; De Sa et al., 2018), word embeddings (Tifrea et al., 2018; Le et al., 2019), recommender systems (Chamberlain et al., 2019), and image embeddings (Khrulkov et al., 2019).

However, it is still an open problem how to efficiently integrate hyperbolic representations with standard machine learning methods, which often make a Euclidean or vector-space assumption. The work of Ganea et al. (2018) establishes some fundamental steps in this direction by proposing a generalization of fully connected neural network layers from Euclidean to hyperbolic space. However, most of the reported experiments map from hyperbolic to Euclidean space using recurrent models. In this paper we focus on the task of learning manifold-valued functions from Euclidean onto hyperbolic space, which allows us to leverage its hierarchical structure for supervised learning. For this purpose, we propose two novel approaches: a deep learning model trained with a geodesic-based loss to learn hyperbolic-valued functions, and a non-parametric kernel-based model for which we provide a theoretical analysis.

We illustrate the effectiveness of this strategy on two challenging tasks: hierarchical classification via label embeddings and taxonomy expansion by predicting concept embeddings from text. For standard classification tasks, label embeddings have shown great promise, as they allow supervised learning methods to scale to datasets with massive label spaces (Chollet, 2016; Veit et al., 2018). By embedding labels in hyperbolic space according to their natural hierarchical structure (e.g., the underlying WordNet taxonomy of ImageNet labels), we are able to combine the benefits of hierarchical classification with the scalability of label embeddings. Moreover, the continuous nature of hyperbolic space allows the model to invent new concepts by predicting their placement in a pre-embedded base taxonomy. We exploit this property for a novel task which we refer to as taxonomy expansion: given an embedded taxonomy, we infer the placement of unknown novel concepts by regressing their features onto the embedding. In contrast to hierarchical classification, the predicted embeddings are here full members of the taxonomy, i.e., they can themselves act as parents of other points. For both tasks, we show empirically that the proposed strategy often leads to more effective estimators than its Euclidean counterpart. These findings support the thesis of this work, namely that leveraging hyperbolic geometry can be advantageous in several machine learning settings. Additionally, we observe that the hyperbolic-based estimators introduced in this work achieve performance comparable to the previously proposed hyperbolic neural networks (Ganea et al., 2018). This suggests that, in practice, it is not necessary to work with hyperbolic layers as long as the training procedure exploits the geodesic as an error measure. This is advantageous from the computational perspective, since we found our proposed approaches generally much easier to train in practice.

The remainder of this paper is organized as follows: In Section 2 we briefly review hyperbolic embeddings and related concepts such as Riemannian optimization. In Section 3, we introduce our proposed methods and prove excess risk bounds for the kernel-based method. In Section 4 we evaluate our methods on the tasks of hierarchical classification and taxonomy expansion.

## 2 Hyperbolic Representations

Hyperbolic space is the unique, complete, simply connected Riemannian manifold with constant negative sectional curvature. There exist multiple equivalent models for hyperbolic space. To estimate the embeddings using stochastic optimization we will employ the Lorentz model due to its numerical advantages. For analysis, we will map embeddings into the Poincaré disk, which provides an intuitive visualization of hyperbolic embeddings. This can be done easily because the two models are isometric (Nickel and Kiela, 2018). We review both manifolds in the following.

#### Lorentz Model

Let $u, v \in \mathbb{R}^{n+1}$ and let $\langle u, v\rangle_L = -u_0 v_0 + \sum_{i=1}^{n} u_i v_i$ denote the Lorentzian scalar product. The Lorentz model of $n$-dimensional hyperbolic space is then defined as the Riemannian manifold $(\mathbb{H}^n, g_L)$, where

$$\mathbb{H}^n = \{u \in \mathbb{R}^{n+1} : \langle u, u\rangle_L = -1,\ u_0 > 0\} \tag{1}$$

denotes the upper sheet of a two-sheeted $n$-dimensional hyperboloid and $g_L$ is the associated metric tensor. Furthermore, the distance on $\mathbb{H}^n$ is defined as

$$d_L(u, v) = \operatorname{acosh}(-\langle u, v\rangle_L). \tag{2}$$

An advantage of the Lorentz model is that its exponential map has a simple, closed-form expression. As shown by Nickel and Kiela (2018), this allows us to perform Riemannian optimization efficiently and with increased numerical stability. In particular, let $u \in \mathbb{H}^n$ and let $z$ denote a point in the associated tangent space $T_u\mathbb{H}^n$. The exponential map is then defined as

$$\exp_u(z) = \cosh(\|z\|_L)\, u + \sinh(\|z\|_L)\, \frac{z}{\|z\|_L}, \qquad \|z\|_L = \sqrt{\langle z, z\rangle_L}. \tag{3}$$
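As a concrete illustration, the Lorentzian scalar product, the distance of Eq. 2, and the exponential map of Eq. 3 can be implemented in a few lines (a minimal numpy sketch; the function names are ours):

```python
import numpy as np

def lorentz_inner(u, v):
    """Lorentzian scalar product <u, v>_L = -u_0 v_0 + sum_i u_i v_i."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def lorentz_dist(u, v):
    """Geodesic distance on the hyperboloid, Eq. (2)."""
    return np.arccosh(np.clip(-lorentz_inner(u, v), 1.0, None))

def exp_map(u, z):
    """Exponential map at u applied to a tangent vector z, Eq. (3)."""
    n = np.sqrt(lorentz_inner(z, z))  # tangent vectors have positive Lorentz norm
    if n == 0:
        return u
    return np.cosh(n) * u + np.sinh(n) * z / n

# Base point of H^2 and a tangent vector at u (tangency: <u, z>_L = 0).
u = np.array([1.0, 0.0, 0.0])
z = np.array([0.0, 0.5, 0.0])
v = exp_map(u, z)
print(lorentz_inner(v, v))   # ≈ -1: the image lies on the hyperboloid
print(lorentz_dist(u, v))    # ≈ 0.5: the exp map is a radial isometry
```

Note how the exponential map sends a tangent vector of Lorentz norm 0.5 to a point at geodesic distance exactly 0.5 from the base point.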

#### Poincaré ball

The Poincaré ball model is the Riemannian manifold $(\mathbb{P}^n, g_p)$, where $\mathbb{P}^n = \{u \in \mathbb{R}^n : \|u\| < 1\}$ is the open $n$-dimensional unit ball and $g_p$ is the associated metric tensor. The distance function on $\mathbb{P}^n$ is defined as

$$d_p(u, v) = \operatorname{acosh}\left(1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right). \tag{4}$$

An advantage of the Poincaré ball is that it provides an intuitive model of hyperbolic space which is well suited for analysis and visualization of the embeddings. It can be seen from Eq. 4 that the distance within the Poincaré ball changes smoothly with respect to the norms of $u$ and $v$. This locality property of the distance is key for representing hierarchies efficiently (Hamann, 2018). For instance, by placing the root node of a tree at the origin of $\mathbb{P}^n$, it has relatively small distance to all other nodes, as its norm is zero. Leaf nodes, on the other hand, can be placed close to the boundary of the ball, as the distance between points grows quickly when their norms are close to one.
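The locality property described above is easy to verify numerically (a small sketch of Eq. 4; the example points are ours):

```python
import numpy as np

def poincare_dist(u, v):
    """Poincaré-ball distance, Eq. (4)."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / denom)

root = np.zeros(2)                 # origin: natural place for a root node
leaf_a = np.array([0.95, 0.0])     # two points near the boundary,
leaf_b = np.array([0.0, 0.95])     # as leaf nodes would be placed

# The origin is relatively close to everything, while the two
# near-boundary points are far apart from each other.
print(poincare_dist(root, leaf_a))
print(poincare_dist(leaf_a, leaf_b))
```

Running this shows that the leaf-to-leaf distance is substantially larger than the root-to-leaf distance, even though the Euclidean distances are comparable.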

#### Hyperbolic embeddings

We consider supervised datasets where class labels $c \in C$ can be organized according to a taxonomy or class hierarchy $\mathcal{T}$. Edges $(c_i, c_j)$ indicate that $c_i$ is-a $c_j$. To compute hyperbolic embeddings of all $c \in C$ that capture the hierarchical relationships of $\mathcal{T}$, we follow the works of Nickel and Kiela (2017, 2018) and infer the embedding from pairwise similarities. In particular, let $\gamma$ be the similarity function such that

$$\gamma(c_i, c_j) = \begin{cases} 1, & \text{if } c_i,\ c_j \text{ are adjacent in } \operatorname{clos}(\mathcal{T}) \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$

where $\operatorname{clos}(\mathcal{T})$ is the transitive closure of $\mathcal{T}$. Furthermore, let $N(i, j)$ denote the set of concepts that are less similar to $c_i$ than $c_j$ (including $c_j$) and let $\phi(i, j)$ denote the nearest neighbor of $c_i$ in the set $N(i, j)$. We then learn embeddings $\Theta = \{u_c\}_{c \in C}$ by optimizing

$$\min_\Theta\ -\sum_{i,j} \log \Pr(\phi(i, j) = j \mid \Theta) \tag{6}$$

with

$$\Pr(\phi(i, j) = j \mid \Theta) = \frac{e^{-d(u_i, u_j)}}{\sum_{k \in N(i, j)} e^{-d(u_i, u_k)}}. \tag{7}$$

Eq. 6 can be interpreted as a ranking loss that aims to extract the latent hierarchical structure from the pairwise similarities. For computational efficiency, we follow Jean et al. (2014) and randomly subsample $N(i, j)$ on large datasets. To infer the embeddings we then minimize Eq. 6 using Riemannian SGD (RSGD; Bonnabel, 2013). In RSGD, updates to the parameters are computed via

$$u_{t+1} = \exp_{u_t}\!\left(-\eta\, \operatorname{grad} \mathcal{L}(u_t)\right), \tag{8}$$

where $\operatorname{grad} \mathcal{L}$ denotes the Riemannian gradient, $\eta$ denotes the learning rate, and the loss is evaluated on a batch of uniformly sampled indexes.
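For illustration, a single RSGD update of the form of Eq. 8 on the Lorentz model can be sketched as follows (a minimal implementation of ours: the Euclidean gradient is rescaled by the inverse metric, projected onto the tangent space, and followed along the geodesic via the exponential map of Eq. 3):

```python
import numpy as np

def linner(u, v):
    """Lorentzian scalar product."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def rsgd_step(u, egrad, lr):
    """One Riemannian SGD update on the hyperboloid, Eq. (8)."""
    h = egrad.copy()
    h[0] = -h[0]                      # inverse metric: flip the time-like coordinate
    g = h + linner(u, h) * u          # project onto the tangent space at u
    n = np.sqrt(max(linner(g, g), 0.0))
    if n == 0.0:
        return u
    step = lr * n                     # Lorentz norm of the step -lr * g
    return np.cosh(step) * u + np.sinh(step) * (-g / n)

u = np.array([1.0, 0.0, 0.0])         # base point of H^2
new_u = rsgd_step(u, np.array([0.0, 1.0, 0.0]), lr=0.1)
print(linner(new_u, new_u))           # ≈ -1: the update stays on the manifold
```

The key property is that, unlike a plain Euclidean gradient step, the update remains exactly on the hyperboloid, which is what makes the optimization numerically stable.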

By computing hyperbolic embeddings of $\mathcal{T}$, we have thus recast the learning problem from a discrete tree to its embedding in a continuous manifold. This allows us to apply manifold regression techniques, as discussed in the following. The idea is depicted in Fig. 1.

## 3 Manifold Valued Prediction in Hyperbolic Space

We study the problem of learning a map taking values in hyperbolic space, often referred to as manifold regression (Steinke and Hein, 2009; Steinke et al., 2010). We assume for simplicity that the input space $X$ and output space $Y$ are compact. In particular, we assume a training dataset $(x_i, y_i)_{i=1}^n$ of points independently sampled from a joint distribution $\rho$ on $X \times Y$ and aim to find an estimator for the minimizer of the expected risk

$$\min_{f : X \to Y} \mathcal{E}(f), \qquad \mathcal{E}(f) = \int d_L(f(x), y)^2\, d\rho(x, y). \tag{9}$$

Here we consider the Lorentz model $\mathbb{H}^n$ as target space and the squared geodesic distance $d_L^2$ as loss function, but all results extend to the Poincaré ball. Eq. 9 is the natural generalization of standard vector-valued ridge regression (indeed, the geodesic distance of $\mathbb{R}^n$ is the Euclidean distance). We tackle this problem by proposing two novel approaches: one leveraging recent results on structured prediction and one using geodesic neural networks.

#### Structured Prediction

Rudi et al. (2018) proposed a new approach to address manifold regression problems. The authors adopted a perspective based on structured prediction and interpreted the target manifold as a “structured” output. While standard structured prediction studies settings where the output is a discrete (often finite) space (Bakir et al., 2007), this extension allowed the authors to design a kernel-based approach for structured prediction for which they provided a theoretical analysis under suitable assumptions on the output space. We formulate the corresponding Hyperbolic Structured Prediction (HSP) estimator when applying this strategy to our problem (namely $Y = \mathbb{H}^n$). In particular, we have the function $f_{\mathrm{hsp}}$ such that for any test point $x$

$$f_{\mathrm{hsp}}(x) = \operatorname*{argmin}_{y \in \mathbb{H}^n}\ \sum_{i=1}^n \alpha_i(x)\, d_L(y, y_i)^2, \tag{10}$$

where the weights $\alpha(x) = (\alpha_1(x), \dots, \alpha_n(x))$ are learned by solving a variant of kernel ridge regression: given a reproducing kernel $k$ on the input space, we obtain

$$\alpha(x) = (K + \lambda I)^{-1} v(x), \tag{11}$$

where $K$ is the empirical kernel matrix and $v(x)$ is the evaluation vector, with entries $K_{ij} = k(x_i, x_j)$ and $v(x)_i = k(x, x_i)$ respectively, for $i, j = 1, \dots, n$.

In line with most of the literature on structured prediction, the estimator in Eq. 10 requires solving an optimization problem at every test point. Hence, while this approach offers a significant advantage at training time (when learning the weights $\alpha$), it can lead to a more expensive operation at test time. To solve this problem in practice we resort to RSGD as defined in Eq. 8.
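The training-time computation of the weights in Eq. 11 is a standard kernel ridge system; the test-time barycenter of Eq. 10 is then minimized with RSGD. A minimal sketch of the weight computation (Gaussian kernel; variable names and parameter values are ours):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def hsp_weights(X_train, x, lam=1e-6, sigma=1.0):
    """alpha(x) = (K + lam I)^{-1} v(x), Eq. (11)."""
    K = gaussian_kernel(X_train, X_train, sigma)
    v = gaussian_kernel(X_train, x[None, :], sigma)[:, 0]
    return np.linalg.solve(K + lam * np.eye(len(X_train)), v)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
alpha = hsp_weights(X, X[4], lam=1e-9)
print(np.argmax(alpha))   # 4: at a training point the weights concentrate there
```

As expected, querying the estimator exactly at a training point $x_j$ (with vanishing regularization) yields weights concentrated on index $j$, so the barycenter of Eq. 10 recovers the corresponding training label $y_j$.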

Rudi et al. (2018) studied the generalization properties of estimators of the form of Eq. 10. The authors proved that, under suitable assumptions on the regularity of the output manifold, it is possible to bound the excess risk in terms of the number of training examples available. The following theorem specializes this result to the case of $Y = \mathbb{H}^n$. A key role is played by the Sobolev space $W^{s,2}$ of functions from $\mathbb{H}^n$ to $\mathbb{R}$, which generalizes the standard notion on Euclidean domains (see Hebey, 2000).

###### Theorem 1.

Let $(x_i, y_i)_{i=1}^n$ be sampled independently according to $\rho$ on $X \times Y$, with $X, Y$ compact sets. Let $f_{\mathrm{hsp}}$ be defined as in Eq. 10 with weights as in Eq. 11, learned with a reproducing kernel $k$ with reproducing kernel Hilbert space (RKHS) $\mathcal{F}$. If the learning problem is sufficiently regular with respect to $\mathcal{F}$ (in the sense discussed below), then for any $\tau > 0$

$$\mathcal{E}(f_{\mathrm{hsp}}) - \inf_f \mathcal{E}(f)\ \le\ \|d_L^2\|_{s,2}\ q\, \tau^2\, n^{-1/4} \tag{12}$$

holds with probability at least $1 - 8 e^{-\tau}$, where $q$ is a constant not depending on $n$ or $\tau$.

The result guarantees a learning rate of order $n^{-1/4}$. We now comment on the assumptions and constants appearing in Thm. 1. First, we point out that, albeit the regularity requirement can seem overly abstract, it reduces to a standard assumption in statistical learning theory. Informally, it corresponds to a regularity assumption on the conditional mean embedding of the distribution $\rho$ (see the work of Song et al. (2013) for more details), and can be interpreted as requiring the solution of Eq. 9 to belong to the hypothesis space associated with the kernel $k$. Second, we comment on the constant in Eq. 12 that depends on the geodesic distance. In particular, we note that by Thm. 2 of Rudi et al. (2018) the squared geodesic distance on any compact subset of $\mathbb{H}^n$ belongs to $W^{s,2}$ for any $s$, hence also for the value of $s$ required by Thm. 1.

###### Proof.

The proof of Thm. 1 is a specialization of Thms. 2 and 4 of Rudi et al. (2018). We recall a key assumption that is required to apply these results.

###### Assumption 1.

The output manifold is a complete $n$-dimensional smooth connected Riemannian manifold, without boundary, with Ricci curvature bounded below and positive injectivity radius.

The assumption above imposes basic regularity conditions on the output manifold. A first implication is the following.

###### Proposition 2 (Thm. 2 in Rudi et al. (2018)).

Let $M$ satisfy Assumption 1 and let $Y$ be a compact geodesically convex subset of $M$. Then, the squared geodesic distance is smooth on $Y \times Y$. Moreover, by the proof of the corresponding theorem in the appendix of Manifold Structured Prediction (Rudi et al., 2018), the squared geodesic distance belongs to $W^{s,2}$ for any $s$.

Leveraging standard results from Riemannian geometry, we can guarantee that the manifolds considered in this paper satisfy the above requirements. For simplicity, we restrict to $Y$ corresponding to a bounded ball in either $\mathbb{H}^n$ or $\mathbb{P}^n$. In particular,

• $\mathbb{H}^n$ has sectional curvature constantly equal to $-1$. Hence the Ricci curvature is bounded from below, since we are in a bounded ball in either $\mathbb{H}^n$ or $\mathbb{P}^n$.

• The injectivity radius is positive; see the Main Theorem of Martin (1989).

We are thus in the hypotheses of Prop. 2, from which we conclude the following.

###### Corollary 3.

For any $n$, the squared geodesic distance $d_L^2$ (respectively $d_p^2$) belongs to $W^{s,2}$ on any compact subspace of $\mathbb{H}^n$ (respectively $\mathbb{P}^n$).

This guarantees that we are in the hypotheses of the main theorem of Rudi et al. (2018), from which Thm. 1 follows. We note in particular that $d_L^2$ takes the role of the loss function in the original theorem, which needs to be a so-called “Structure Encoding Loss Function”; the latter is guaranteed by Cor. 3 above. ∎

#### Neural Network with Geodesic loss (NN-G)

As an alternative to the non-parametric model $f_{\mathrm{hsp}}$, we also consider a parametric method based on deep neural networks. An important challenge when dealing with manifold regression is how to design a suitable model for the estimator. While neural networks of the form $g_\theta : \mathbb{R}^d \to \mathbb{R}^k$ (parametrized by weights $\theta$) have proven to be powerful models for regression and feature representation (LeCun et al., 2015; Bengio et al., 2013; Xiao et al., 2016; Ngiam et al., 2011), it is unclear how to enforce the constraint that a candidate function take values on the manifold, since their canonical forms are designed to act between linear spaces. To address this limitation, we consider in the following the Poincaré ball model and develop a neural architecture mapping the Euclidean space into the open unit ball. In particular, let the element-wise hyperbolic tangent be defined as

$$h : \mathbb{R}^k \to \{x \in \mathbb{R}^k : \|x\|_\infty < 1\} \tag{13}$$

$$(x_1, \dots, x_k) \mapsto (\tanh x_1, \dots, \tanh x_k), \tag{14}$$

which maps the linear space onto the open $\infty$-norm unit ball. Moreover, we define a “squashing” function

$$s : \{x \in \mathbb{R}^k : \|x\|_\infty < 1\} \to \{x \in \mathbb{R}^k : \|x\|_2 < 1\} \tag{15}$$

$$s(x) = \begin{cases} x\, \dfrac{\|x\|_\infty}{\|x\|_2}, & \text{if } x \neq 0 \\ 0, & \text{if } x = 0, \end{cases} \tag{16}$$

where $0$ is the vector of all zeros. Since $\|x\|_\infty \le \|x\|_2$, this function is continuous and maps the open $\infty$-norm ball into the open Euclidean unit ball. And because both $h$ and $s$ are bijective continuous functions with continuous inverses, the composition $s \circ h$ is a homeomorphism from $\mathbb{R}^k$ onto the open unit ball, and therefore onto the Poincaré model manifold. By composing with a neural network feature extractor $g_\theta$ we obtain a deep model that jointly learns features in a linear space and maps them to the hyperbolic manifold:

$$f_{\mathrm{nng}} = s \circ h \circ g_\theta : \mathbb{R}^d \to \mathbb{P}^k. \tag{17}$$

Note that the homeomorphism $s \circ h$ is sub-differentiable. Therefore, learning the parameters $\theta$ of this model is akin to training a classical deep learning architecture with $s \circ h$ as the activation function of the output layer. The key difference lies in the loss used for training: analogously to the task addressed by HSP, we replace the standard mean-squared (Euclidean) loss with the squared geodesic distance between predictions and true labels.
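The output map of Eqs. 13–16 is simple to implement and verify (a minimal numpy sketch of ours; the example pre-activations are arbitrary):

```python
import numpy as np

def h(x):
    """Element-wise tanh: maps R^k onto the open infinity-norm unit ball, Eqs. (13-14)."""
    return np.tanh(x)

def s(x):
    """Squashing map onto the open Euclidean unit ball, Eqs. (15-16)."""
    n2 = np.linalg.norm(x)
    if n2 == 0:
        return x
    return x * np.linalg.norm(x, np.inf) / n2

def output_layer(z):
    """s ∘ h: maps arbitrary network outputs onto the Poincaré ball, Eq. (17)."""
    return s(h(z))

z = np.array([10.0, -3.0, 0.1])      # arbitrary (large) pre-activations
y = output_layer(z)
print(np.linalg.norm(y) < 1)         # True: the prediction lies inside the ball
```

Whatever values the feature extractor $g_\theta$ produces, the composed output always lands strictly inside the unit ball, so the geodesic loss of Eq. 4 is always well defined during training.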

#### Hyperbolic embeddings and manifold regression

In this work we propose to leverage hyperbolic geometry to address machine learning tasks where hierarchical structures play a central role. In particular, we combine label-embedding approaches with hyperbolic regression to perform hierarchical classification. We do this by following a two-step procedure: assuming a class hierarchy $\mathcal{T}$, we consider an augmented hierarchy where each example is added as a child of its associated class from the original $\mathcal{T}$. Then, we embed the augmented hierarchy into hyperbolic space using the procedure reviewed in Section 2. We compute similarity scores in the transitive closure of the augmented hierarchy using either a Gaussian kernel on the features, when both nodes have a corresponding feature representation available, or otherwise the original similarity $\gamma$. This allows us to incorporate information about feature similarities into the label embedding.

## 4 Experiments

We evaluate our proposed methods for hyperbolic manifold regression on the following experiments:

Hierarchical Classification via Label Embeddings.

For this task, the goal is to classify examples with a single label from a class hierarchy with tree structure. We begin by computing label embeddings of the class hierarchy via hyperbolic representations. We then learn to regress examples onto the label embeddings and classify them using the nearest label in the target space, i.e., denoting by $y_c$ the embedding of class $c$, we take

$$\hat{c} = \operatorname*{argmin}_{c \in C}\ d(f(x), y_c). \tag{18}$$
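The decision rule of Eq. 18 reduces to a nearest-neighbor search under the geodesic distance (a small sketch of ours; the label embeddings below are toy values):

```python
import numpy as np

def poincare_dist(u, v):
    """Poincaré-ball distance, Eq. (4)."""
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def classify(pred, label_embeddings):
    """Eq. (18): pick the class whose embedding is geodesically nearest."""
    d = [poincare_dist(pred, y_c) for y_c in label_embeddings]
    return int(np.argmin(d))

# Toy label embeddings in the Poincaré disk and a regressed test point.
labels = [np.array([0.0, 0.5]), np.array([0.5, 0.0]), np.array([-0.5, 0.0])]
print(classify(np.array([0.4, 0.1]), labels))  # 1: nearest to the second embedding
```

Note that the rule uses the geodesic, not the Euclidean distance, so points near the boundary of the ball are discriminated more finely than points near the origin.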

Taxonomy expansion. For this task, the goal is to expand an existing taxonomy based on feature information about new concepts. As in hierarchical classification, we first embed the existing taxonomy in hyperbolic space and then learn to regress onto the label embeddings. However, a key difference is that a new example can itself act as the parent of another class.

#### Models and training details

For hierarchical classification, we compare against standard baselines such as top-down classification with logistic regression (TD-LR) and hierarchical SVM (HSVM). Furthermore, since both tasks can be regarded as regression problems onto the Poincaré ball (which has a canonical embedding in $\mathbb{R}^n$), we also compare against kernel regularized least squares regression (KRLS) and a neural network with squared Euclidean loss (NN-E). In both cases, we constrain predictions to remain within the Poincaré ball via the projection

$$\operatorname{proj}(y) = \begin{cases} (1 - \varepsilon)\, y / \|y\|, & \text{if } \|y\| \ge 1 \\ y, & \text{otherwise,} \end{cases}$$

where $\varepsilon$ is a small constant ensuring numerical stability. These regression baselines allow us to evaluate the advantage of training manifold-valued models with the squared geodesic loss over standard methods that are agnostic of the underlying geometry.
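The projection is a one-liner (our sketch; the default value of $\varepsilon$ is an assumption, as the paper does not state it here):

```python
import numpy as np

def proj(y, eps=1e-5):
    """Rescale predictions falling outside the unit ball back strictly inside it.

    eps is an assumed small stability constant (the exact value used in the
    paper is not specified in this excerpt).
    """
    n = np.linalg.norm(y)
    return (1 - eps) * y / n if n >= 1 else y

print(np.linalg.norm(proj(np.array([3.0, 4.0]))))   # just below 1
print(proj(np.array([0.3, 0.4])))                   # unchanged: already inside
```

This keeps the Euclidean baselines well defined when their unconstrained predictions leave the Poincaré ball, where the distance of Eq. 4 would diverge.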

For kernel-based methods, we employ a Gaussian kernel, selecting the bandwidth and the regularization parameter via cross-validation over logarithmically spaced ranges. For HSP inference we use RSGD with a fixed batch size and a maximum number of iterations; we stop the minimization when the Euclidean norm of the gradient falls below a small threshold (in most cases, inference stops well before the iteration cap). The learning rate for RSGD is chosen via cross-validation. For the neural network models (NN-G, NN-E) we use the same architecture for $g_\theta$: each layer is fully connected,

$$z_\ell = \psi(W_\ell z_{\ell-1} + b_\ell),$$

where $\psi$ is a ReLU non-linearity and $W_\ell$, $b_\ell$ are the layer parameters, with dimensions matching those of the previous layer (with the exception of the first and last layers, which must fit the input and output dimensions). We use a depth of 5 layers, with intermediate dimensionalities chosen separately for taxonomy expansion and for hierarchical classification. We did not find significant performance improvements with deeper architectures. We train the deep models using mini-batch stochastic gradient descent with a learning-rate scheduler until the model reaches convergence on the training loss. For taxonomy expansion we also compare our algorithms with hyperbolic neural networks (HNN), as introduced by Ganea et al. (2018). This architecture is trained with Riemannian stochastic gradient descent until convergence and has the same structure and number of parameters as NN-G and NN-E. Because NN-G uses fully connected layers up to the homeomorphic transformation, it can be trained with traditional optimizers such as stochastic gradient descent or Adam (Kingma and Ba, 2014). In our experiments, we observe that this is an important advantage, as these models typically require one third of the training time compared to HNNs.

### 4.1 Hierarchical classification

For hierarchical classification, we are given a supervised training set whose class labels are organized in a tree $\mathcal{T}$. We first embed the augmented hierarchy as discussed in Section 3 and learn a regression function onto the embedded labels. For a test point $x$, we first map it onto the target manifold and then classify it according to Eq. 18. For evaluation, we use various benchmark datasets for hierarchical classification111https://sites.google.com/site/hrsvmproject/, as well as Newsgroups-20222http://qwone.com/~jason/20Newsgroups/, for which we manually extract TF-IDF features from the original documents. We compute an embedding for the augmented hierarchy of each dataset. To make sure we obtain a good embedding, we perform parameter tuning so as to attain a sufficiently high mAP. We then train HSP, NN-G and NN-E as described above and measure classification performance in terms of F1 and macro-F1 scores. As baselines, we also train Hierarchical SVM (HSVM) (Vateekul et al., 2012) and Top-Down Logistic Regression (TD-LR) (Naik and Rangwala, 2018).

Table 1 shows the results of our experiments. The hyperbolic structured predictor achieves results comparable to the state of the art on this task, although we did not explicitly optimize the embedding or the training loss for hierarchical classification. We also observe that while NN-G outperforms NN-E, both algorithms perform significantly worse on the Wipo and Diatoms datasets. Interestingly, these two datasets are significantly smaller than Newsgroup-20 and Imclef07a in terms of number of training points. This seems to suggest that the NN-G and NN-E models have a higher sample complexity.

### 4.2 Taxonomy expansion

For taxonomy expansion, we assume a setting similar to hierarchical classification. We are given a dataset where concepts are organized in a taxonomy and each concept has an additional feature representation. Again, we first embed the augmented hierarchy as discussed in Section 3 and split it into a train set and a test set. We vary the size of the test set, i.e., the number of unknown concepts. Whenever necessary, we also create a validation set from the training data for model selection. We then train all regression functions on the train set and predict embeddings for the test concepts. In contrast to hierarchical classification, the predicted points can themselves act as parents of other points, i.e., they are full members of the taxonomy. To assess the quality of the predictions we use mean average precision (mAP), as proposed by Nickel and Kiela (2017). We report mAP for the predicted points as well as for the points originally embedded by the Lorentz embedding (Orig). The experiment is repeated multiple times for each size of the test set, each time selecting a new training-test split. In our experiments, we consider the following datasets:

WordNet Mammals. For WordNet Mammals, the goal is to expand an existing taxonomy by predicting concept embeddings from text. For this purpose, we take the mammals subtree of WordNet and retrieve for each node its corresponding Wikipedia page. If a page is missing, we remove the corresponding node, and if a page has multiple candidates we disambiguate manually. We then take the transitive closure of the resulting hierarchy. Next, we pre-process the retrieved Wikipedia descriptions by removing all non-alphabetical characters, tokenizing words, and removing stopwords using NLTK (Loper and Bird, 2002). Finally, we associate to each concept, as feature representation, the TF-IDF vector of its Wikipedia description, computed using Scikit-learn (Pedregosa et al., 2011). We then embed the hierarchy following Section 2, obtaining an embedding with high mAP and low mean rank. This dataset is particularly difficult given the way the features were collected: Wikipedia pages have high variance in quality and amount of content; while some pages are detailed and rich in information, others barely contain a full sentence.

Synthetic datasets. To better control for noise in the feature representations, we also generate datasets based on synthetic random trees: a smaller tree and a larger tree (sizes measured after transitive closure). For each node we take as feature vector the corresponding row of the adjacency matrix of the transitive closure of the tree. We project these rows onto the first principal components of the adjacency matrix, with the number of components chosen separately for the small and the big tree. We then embed the nodes of the graph using both the tree structure and similarity scores computed from the vector features. The similarity is computed by a Gaussian kernel with bandwidth equal to the average tenth-nearest-neighbour distance of the dataset.

Results. We provide the results of our evaluation for different test-set sizes in Table 2. All hyperbolic-based methods can successfully predict the embeddings of unknown concepts when the test set is small. The performance degrades as the size of the test set increases, since it becomes harder to leverage the original structure of the graph. While all methods are affected by this trend, we note that the algorithms using the geodesic loss tend to perform better than those working in the linear space. This suggests that taking into account the local geometry of the embedding is indeed beneficial for estimating the relative position of novel points in the space.

We conclude by noting that all hyperbolic-based methods have comparable performance across the three settings. However, we point out that HSP and NN-G offer significant practical advantages over HNN: in all our experiments they were faster to train and, in general, more amenable to model design. In particular, since HSP is based on a kernel method, it has relatively few hyperparameters and requires only solving a linear system at training time. NN-G consists of a standard neural architecture with the homeomorphic activation function introduced in Section 3, trained with the geodesic loss. This allows one to leverage all packages currently available for training neural networks, significantly reducing both modeling and training times.

## 5 Conclusion

In this paper, we showed how to recast supervised problems with hierarchical structure as manifold-valued regression in the hyperbolic manifold. We then proposed two algorithms for learning manifold-valued functions mapping from Euclidean to hyperbolic space: a non-parametric kernel-based method, for which we also proved generalization bounds, and a parametric deep learning model that is informed by the geodesics of the output space. The latter makes it possible to leverage traditional neural network layers for regression on hyperbolic space without resorting to hyperbolic layers, thus requiring less training time. We evaluated both methods empirically on the task of hierarchical classification and showed that hyperbolic structured prediction achieves strong generalization performance. We also showed that hyperbolic manifold regression enables new applications in supervised learning: by exploiting the continuous representation of hierarchies in hyperbolic space, we were able to place unknown concepts in the embedding of a taxonomy using manifold regression. Moreover, by comparing to hyperbolic neural networks, we showed that for this application the key step is leveraging the geodesic of the manifold. In this work, we aimed at developing a foundation for regressing onto hyperbolic representations. In future work, we plan to exploit this framework in dedicated methods for hierarchical machine learning and to extend the applications to manifold product spaces.

## Acknowledgments

We thank Maximilian Nickel for his invaluable support and feedback throughout this project. Without him, this paper would have not been possible.

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. This work has been carried out at the Machine Learning Genoa (MaLGa) center, Università di Genova (IT). L. R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826.