1 Introduction
Many public datasets exhibit hierarchical structure. For instance, the conceptual relations in WordNet Miller (1995) form a hierarchy, and users in social networks such as Facebook or Twitter form hierarchies based on their occupations and organizations Gupte et al. (2011). Representing such hierarchical data in Euclidean space fails to capture their semantic or functional resemblance Alanis-Lobato et al. (2016); Nickel and Kiela (2017). Hyperbolic space, i.e., non-Euclidean space with constant negative curvature, has been leveraged to embed data with hierarchical structure with low distortion, owing to its exponential growth in volume with respect to radius Nickel and Kiela (2017); Sarkar (2011); Sala et al. (2018). For instance, hyperbolic space has been used for analyzing the hierarchical structure of single-cell data Klimovskaia et al. (2020), learning hierarchical word embeddings Nickel and Kiela (2017), and embedding complex networks Alanis-Lobato et al. (2016).
Recently, algorithms which operate directly on hyperbolic representations have been derived to further exploit their potential. For example, the hyperbolic perceptron Weber et al. (2020) performs the perceptron algorithm directly on hyperbolic representations, and the Hyperbolic Support Vector Machine Cho et al. (2019) performs large-margin classification in hyperbolic space. Hyperbolic Neural Networks (HNNs) Ganea et al. (2018) were proposed as an alternative to Euclidean neural networks (ENNs) to further exploit hyperbolic space for more complex problems. When HNNs are applied to image datasets, they employ a hybrid architecture Khrulkov et al. (2020), as shown in Figure 1: an ENN first extracts features from images, the Euclidean embeddings are then projected onto hyperbolic space as hyperbolic embeddings, and finally the hyperbolic embeddings are classified by a hyperbolic multinomial logistic regression Ganea et al. (2018).
While HNNs achieve improvements over ENNs on several datasets with explicit hierarchical structure Ganea et al. (2018), they perform poorly on standard image classification benchmarks, which severely limits their applicability. Although Khrulkov et al. (2020) show that several image datasets possess latent hierarchical structure, there are no experimental results showing that HNNs can capture such structure or match the performance of ENNs. Existing improvements to HNNs mainly focus on reducing the number of parameters Shimizu et al. (2020) or incorporating different types of neural network layers such as attention Gulcehre et al. (2018) or convolution Shimizu et al. (2020). Unfortunately, the reason behind the inferior performance of HNNs compared with ENNs on standard image datasets has been neither investigated nor understood.
Our key insight is that the inferiority of HNNs on standard image datasets is not an intrinsic limitation but stems from improper training procedures. We first conduct an empirical study showing that the hybrid nature of HNNs leads to a vanishing gradient problem during training with backpropagation. In particular, the training dynamics of HNNs push the hyperbolic embeddings to the boundary of the Poincaré ball Anderson (2006), which causes the gradients of the Euclidean parameters to vanish. Inspired by this analysis, we propose a simple yet effective remedy, called Feature Clipping, which constrains the norm of the hyperbolic embedding during training. With the proposed technique, HNNs are on par with ENNs on several standard image classification benchmarks including MNIST, CIFAR-10, CIFAR-100 and ImageNet. This shows that HNNs can not only outperform ENNs on hierarchical datasets but also achieve comparable performance on standard image datasets, which considerably broadens the application of HNNs to computer vision tasks. The improved HNNs are also more robust than ENNs and exhibit stronger out-of-distribution detection ability.
The contributions of the paper are as follows. (1) We conduct a detailed analysis to understand the underlying issues when applying HNNs to standard image datasets. (2) We propose a simple yet effective solution to the vanishing gradient problem in training HNNs by constraining the norm of the hyperbolic embedding. (3) We conduct extensive experiments on standard image datasets including MNIST, CIFAR-10, CIFAR-100 and ImageNet. The results show that by addressing the vanishing gradient problem, the performance of HNNs on standard datasets is greatly improved and matches that of ENNs. Meanwhile, the improved HNNs are also more robust to adversarial attacks and exhibit stronger out-of-distribution detection capability than their Euclidean counterparts.
2 Related Work
Supervised Learning In the seminal work on Hyperbolic Neural Networks (HNNs) Ganea et al. (2018), the authors proposed hyperbolic neural network layers including multinomial logistic regression (MLR), fully connected layers and recurrent neural networks, all of which operate directly on hyperbolic embeddings. The proposed HNNs outperform their Euclidean variants on textual entailment and noisy-prefix prediction tasks. Recently, Hyperbolic Neural Networks++ Shimizu et al. (2020) was proposed to reduce the number of parameters of HNNs; it also introduced hyperbolic convolutional layers. Hyperbolic attention networks Gulcehre et al. (2018) rewrite the operations in attention layers using gyrovector operations Ungar (2005), which leads to improvements on neural machine translation, learning on graphs and visual question answering. Hyperbolic graph neural networks Liu et al. (2019) extend the representational geometry of Graph Neural Networks (GNNs) Zhou et al. (2020) to hyperbolic space, and hyperbolic graph attention networks Zhang et al. (2019) further study GNNs with attention mechanisms in hyperbolic space. Recently, hyperbolic neural networks have also been used for tasks such as few-shot classification and person re-identification Khrulkov et al. (2020).
Unsupervised Learning Unsupervised learning methods based on variants of HNNs have also attracted much attention. Nagano et al. (2019) proposed a wrapped normal distribution on hyperbolic space to construct hyperbolic variational autoencoders (VAEs) Kingma and Welling (2013). In concurrent work, Mathieu et al. (2019) proposed Gaussian generalizations on hyperbolic space to construct Poincaré VAEs. Recent work Hsu et al. (2020) applied hyperbolic neural networks to unsupervised 3D segmentation of complex volumetric data.
Compared with the above methods, which focus on applying HNNs to data with natural tree structure, this paper extends the application of HNNs to standard image recognition datasets and improves their performance on these datasets to the level of their Euclidean counterparts, greatly enhancing the universality of HNNs.
3 Free Hyperbolic Neural Networks with Limited Radii
Our goal is to address the vanishing gradient problem when training HNNs. We propose an efficient solution to this problem; the improved HNNs are on par with ENNs on standard recognition datasets and show better performance in terms of few-shot learning, adversarial robustness and out-of-distribution detection. First, we review the basics of HNNs. Then, we analyze the vanishing gradient problem in training HNNs. Finally, we present the proposed method and show its effectiveness in addressing the issue.
Riemannian Geometry An $n$-dimensional topological manifold $\mathcal{M}$ is a topological space that is locally Euclidean of dimension $n$: every point has a neighborhood that is homeomorphic to an open subset of $\mathbb{R}^n$. A smooth manifold is a topological manifold with an additional smooth structure, which is a maximal smooth atlas. A Riemannian manifold $(\mathcal{M}, g)$ is a real smooth manifold with a Riemannian metric $g$. The Riemannian metric is a smoothly varying inner product defined on the tangent spaces of $\mathcal{M}$. For $x \in \mathcal{M}$ and any two vectors $u, v$ in the tangent space $T_x\mathcal{M}$, the inner product is defined as $\langle u, v \rangle_x = g_x(u, v)$. With this inner product, for $u \in T_x\mathcal{M}$, the norm is defined as $\|u\|_x = \sqrt{g_x(u, u)}$. A geodesic is a curve of unit speed that locally minimizes the distance between two points on the manifold. Given $x \in \mathcal{M}$, $v \in T_x\mathcal{M}$ and a geodesic $\gamma$ of length $\|v\|_x$ such that $\gamma(0) = x$ and $\gamma'(0) = v$, the exponential map $\exp_x : T_x\mathcal{M} \to \mathcal{M}$ satisfies $\exp_x(v) = \gamma(1)$, and the inverse exponential map $\exp_x^{-1} : \mathcal{M} \to T_x\mathcal{M}$ satisfies $\exp_x^{-1}(\exp_x(v)) = v$. For more details please refer to Carmo (1992); Lee (2018).
Poincaré Ball Model for Hyperbolic Space A hyperbolic space is a Riemannian manifold with constant negative curvature. There are several isometric models for hyperbolic space; one of the most commonly used is the Poincaré ball model Nickel and Kiela (2017); Ganea et al. (2018). The $n$-dimensional Poincaré ball model of constant negative curvature $-c$ ($c > 0$) is defined as $(\mathbb{D}^n_c, g^{\mathbb{D}}_x)$, where $\mathbb{D}^n_c = \{x \in \mathbb{R}^n : c\|x\|^2 < 1\}$ and $g^{\mathbb{D}}_x = (\lambda^c_x)^2 g^E$ is the Riemannian metric tensor. Here $\lambda^c_x = \frac{2}{1 - c\|x\|^2}$ is the conformal factor and $g^E = I_n$ is the Euclidean metric tensor. The conformal factor induces the inner product $\langle u, v \rangle_x = (\lambda^c_x)^2 \langle u, v \rangle$ and the norm $\|u\|_x = \lambda^c_x \|u\|$ for all $x \in \mathbb{D}^n_c$. The exponential map of the Poincaré ball model can be written analytically with the operations of gyrovector space, which are introduced next.
Gyrovector Space A gyrovector space Ungar (2005, 2008) is an algebraic structure that provides an analytic way to operate in hyperbolic space. Each point in hyperbolic space is endowed with vector-like properties similar to a point in Euclidean space.
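To make the conformal-factor notation concrete, the following is a minimal NumPy sketch of the induced inner product and norm on the Poincaré ball; the function names are ours, not from the paper, and curvature $c = 1$ is assumed by default.

```python
import numpy as np

def conformal_factor(x, c=1.0):
    """lambda_x^c = 2 / (1 - c * ||x||^2) at a point x inside the Poincare ball."""
    return 2.0 / (1.0 - c * np.dot(x, x))

def poincare_inner(x, u, v, c=1.0):
    """Riemannian inner product <u, v>_x = (lambda_x^c)^2 <u, v>."""
    return conformal_factor(x, c) ** 2 * np.dot(u, v)

def poincare_norm(x, u, c=1.0):
    """Riemannian norm ||u||_x = lambda_x^c * ||u||."""
    return conformal_factor(x, c) * np.linalg.norm(u)
```

At the origin the conformal factor equals 2, so tangent vectors are measured at four times their Euclidean squared length; the factor blows up as $x$ approaches the boundary, which is the geometric root of the gradient analysis in Section 3.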
The basic operation in gyrovector space is Möbius addition $\oplus_c$. With Möbius addition, vector addition of two points in the Poincaré ball model is defined as

$$x \oplus_c y = \frac{(1 + 2c\langle x, y \rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y \rangle + c^2\|x\|^2\|y\|^2} \qquad (1)$$

for all $x, y \in \mathbb{D}^n_c$. In particular, as $c \to 0$, $\oplus_c$ converges to the standard $+$ in Euclidean space. Similarly, operations such as scalar multiplication, subtraction, the exponential map and the inverse exponential map in the Poincaré ball model can be defined with the operations of gyrovector space. These operations form the basis for constructing hyperbolic neural network layers as shown in Ganea et al. (2018). For more details, please refer to Appendix A.1.
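Equation 1 can be sketched directly in NumPy (a minimal illustration, not the paper's implementation):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition (Eq. 1) on the Poincare ball of curvature -c."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den
```

As $c \to 0$ the formula reduces to ordinary vector addition, and the result always stays inside the ball; unlike Euclidean addition, however, Möbius addition is neither commutative nor associative.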
Hyperbolic Neural Networks In Ganea et al. (2018), the authors derived hyperbolic neural network layers based on the algebra of gyrovector space. When applied to image datasets Khrulkov et al. (2020), hyperbolic neural networks consist of a Euclidean sub-network and a hyperbolic classifier, as shown in Figure 1. The Euclidean sub-network $f_{\theta_e}$ converts an input $x$, such as an image, into a representation $e = f_{\theta_e}(x) \in \mathbb{R}^n$ in Euclidean space. $e$ is then projected onto hyperbolic space via the exponential map as $h = \exp^c_0(e)$. The hyperbolic classifier $g_{\theta_h}$ performs classification based on $h$ with the standard cross-entropy loss $\mathcal{L}$.
Let the parameters of the Euclidean sub-network be $\theta_e$ and the parameters of the hyperbolic classifier be $\theta_h$. Given the loss function $\mathcal{L}$, the optimization problem can be formalized as

$$\min_{\theta_e, \theta_h} \; \mathcal{L}\big(g_{\theta_h}(\exp^c_0(f_{\theta_e}(x))),\, y\big), \qquad (2)$$

where the outer function is the hyperbolic classifier $g_{\theta_h}$ and the inner function is the Euclidean sub-network $f_{\theta_e}$. As shown in Ganea et al. (2018), the exponential map at the origin is defined as

$$\exp^c_0(v) = \tanh(\sqrt{c}\,\|v\|)\,\frac{v}{\sqrt{c}\,\|v\|}. \qquad (3)$$
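The exponential map of Equation 3 admits a direct NumPy sketch (our illustration, assuming curvature $c = 1$ by default); note that it maps any Euclidean feature into the open unit ball:

```python
import numpy as np

def exp_map_zero(v, c=1.0):
    """Exponential map at the origin (Eq. 3): tanh(sqrt(c)*||v||) * v / (sqrt(c)*||v||)."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)
```

For small inputs the map is close to the identity, while large Euclidean features saturate near the boundary of the ball; this saturation is exactly what the gradient analysis below exploits.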
The construction of the hyperbolic classifier relies on the following definition of Poincaré hyperplanes.
Definition 3.1 (Poincaré hyperplanes Ganea et al. (2018))
For $p \in \mathbb{D}^n_c$ and $a \in T_p\mathbb{D}^n_c \setminus \{0\}$, the Poincaré hyperplane is defined as

$$\tilde{H}^c_{a,p} = \{x \in \mathbb{D}^n_c : \langle -p \oplus_c x,\, a \rangle = 0\}, \qquad (4)$$

where $a$ is the normal vector and $p$ defines the bias of the Poincaré hyperplane.
As shown in Ganea et al. (2018), in hyperbolic space the probability that a given $h \in \mathbb{D}^n_c$ is classified as class $k$ is

$$p(y = k \mid h) \propto \exp\big(\lambda^c_{p_k}\,\|a_k\|\; d_c(h, \tilde{H}^c_{a_k, p_k})\big), \qquad (5)$$

where $d_c(h, \tilde{H}^c_{a_k, p_k}) = \frac{1}{\sqrt{c}} \sinh^{-1}\!\Big(\frac{2\sqrt{c}\,\langle -p_k \oplus_c h,\, a_k \rangle}{(1 - c\|{-p_k} \oplus_c h\|^2)\,\|a_k\|}\Big)$ is the signed distance of the embedding $h$ to the Poincaré hyperplane of class $k$. In the hyperbolic classifier, the parameters $\theta_h$ are the vectors $a_k$ and $p_k$ for each class $k$.
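The class score of Equation 5 can be sketched as follows (a minimal NumPy illustration with our function names; Möbius addition is re-defined so the snippet is self-contained):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball (Eq. 1)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def signed_dist_to_hyperplane(h, a, p, c=1.0):
    """Signed distance from embedding h to the Poincare hyperplane with normal a and bias p."""
    z = mobius_add(-p, h, c)
    arg = 2 * np.sqrt(c) * np.dot(z, a) / ((1 - c * np.dot(z, z)) * np.linalg.norm(a))
    return np.arcsinh(arg) / np.sqrt(c)

def mlr_logit(h, a, p, c=1.0):
    """Unnormalized class score of Eq. 5: lambda_p^c * ||a|| * d_c(h, H_{a,p})."""
    lam_p = 2.0 / (1.0 - c * np.dot(p, p))
    return lam_p * np.linalg.norm(a) * signed_dist_to_hyperplane(h, a, p, c)
```

The score is zero on the hyperplane itself and grows with the (signed) hyperbolic distance from it, which is why training pushes correctly classified embeddings far from their class hyperplanes.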
Training Hyperbolic Neural Networks with Backpropagation The standard backpropagation algorithm Rumelhart et al. (1986) is used for training HNNs Ganea et al. (2018); Khrulkov et al. (2020). During backpropagation, the gradient of the Euclidean parameters is computed as

$$\nabla_{\theta_e} \mathcal{L} = \Big(\frac{\partial h}{\partial \theta_e}\Big)^{\!\top} \nabla^R_h \mathcal{L}, \qquad (6)$$

where $h = \exp^c_0(f_{\theta_e}(x))$ is the hyperbolic embedding of the input $x$, $\frac{\partial h}{\partial \theta_e}$ is the Jacobian matrix and $\nabla^R_h \mathcal{L}$ is the gradient of the loss function with respect to the hyperbolic embedding $h$. It is noteworthy that since $h$ is an embedding in hyperbolic space, $\nabla^R_h \mathcal{L}$ is the Riemannian gradient Bonnabel (2013) and

$$\nabla^R_h \mathcal{L} = (g^{\mathbb{D}}_h)^{-1}\, \nabla_h \mathcal{L} = \frac{(1 - c\|h\|^2)^2}{4}\, \nabla_h \mathcal{L}, \qquad (7)$$

where $\nabla_h \mathcal{L}$ is the Euclidean gradient and $(g^{\mathbb{D}}_h)^{-1}$ is the inverse of the Riemannian metric tensor.
Vanishing Gradient Problem At initialization, the hyperbolic embeddings of all inputs are located near the center of the Poincaré ball. From Equation 5, we can see that in order to maximize the probability of the correct prediction, we need to increase the distance of the hyperbolic embedding to the corresponding Poincaré hyperplane, i.e., $d_c(h, \tilde{H}^c_{a_k, p_k})$. The training dynamics of HNNs thus push the hyperbolic embeddings to the boundary of the Poincaré ball, where $c\|h\|^2$ approaches one. The inverse of the Riemannian metric tensor, $\frac{(1 - c\|h\|^2)^2}{4}$, then approaches zero, which causes $\nabla^R_h \mathcal{L}$ to be small. From Equation 6, it is easy to see that if $\nabla^R_h \mathcal{L}$ is small, then $\nabla_{\theta_e} \mathcal{L}$ is small and the optimization makes no progress on $\theta_e$.
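The vanishing factor is easy to check numerically; the sketch below (our illustration, with $c = 1$) evaluates the scale of Equation 7 as the embedding drifts toward the boundary:

```python
import numpy as np

def riemannian_grad_scale(h, c=1.0):
    """Factor (1 - c*||h||^2)^2 / 4 converting the Euclidean gradient into the
    Riemannian gradient (Eq. 7); it vanishes as h approaches the boundary."""
    return (1.0 - c * np.dot(h, h)) ** 2 / 4.0

# Gradient scale as the embedding norm grows toward 1.
scales = [riemannian_grad_scale(np.array([r, 0.0])) for r in (0.0, 0.9, 0.99, 0.999)]
```

At the origin the factor is 0.25, but at norm 0.999 it has already dropped below $10^{-5}$, so the Euclidean sub-network receives essentially no gradient signal.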
To gain further understanding, we conduct an experiment demonstrating the vanishing gradient problem during HNN training. We train a LeNet-like convolutional neural network LeCun et al. (1998) with a hyperbolic classifier on the MNIST data. We use a two-dimensional Poincaré ball for visualization. In Figure 2, we show the trajectories of the hyperbolic embeddings of six randomly sampled inputs during training. The arrows indicate the movement of each embedding after one gradient update step. It can be observed that at initialization all the hyperbolic embeddings are close to the center of the Poincaré ball. During training, the hyperbolic embeddings gradually move to the boundary of the ball. The magnitude of the gradient diminishes during training as the training loss decays. However, at the end of training, even when the training loss slightly increases, the gradient vanishes because the hyperbolic embeddings approach the boundary of the ball.
The vanishing gradient problem Hochreiter (1998); Pennington et al. (2017, 2018); Hanin (2018) is one of the difficulties in training deep neural networks with backpropagation. It occurs when the magnitude of the gradient is too small for the optimization to make progress. For standard Euclidean neural networks, the vanishing gradient problem can be alleviated by architecture design Hochreiter and Schmidhuber (1997); He et al. (2016), proper weight initialization Mishkin and Matas (2015) and carefully chosen activation functions Xu et al. (2015a). However, the vanishing gradient problem in training HNNs has not been explored in the existing literature.
The Effect of a Gradient Update of the Euclidean Parameters on the Hyperbolic Embedding We derive the effect of a single gradient update of the Euclidean parameters on the hyperbolic embedding; for more details please refer to Appendix A.2. For the Euclidean sub-network $f_{\theta_e}$, consider the first-order Taylor expansion with a single gradient update,
$$f_{\theta_e - \eta \nabla_{\theta_e}\mathcal{L}}(x) \approx f_{\theta_e}(x) - \eta\, \frac{\partial f_{\theta_e}(x)}{\partial \theta_e}\, \nabla_{\theta_e}\mathcal{L}, \qquad (8)$$

where $\eta$ is the learning rate. The Jacobian of the exponential map can be computed as

$$\frac{\partial \exp^c_0(v)}{\partial v} = \frac{\tanh(\sqrt{c}\,\|v\|)}{\sqrt{c}\,\|v\|}\, I_n + \Big(\mathrm{sech}^2(\sqrt{c}\,\|v\|) - \frac{\tanh(\sqrt{c}\,\|v\|)}{\sqrt{c}\,\|v\|}\Big) \frac{v v^{\top}}{\|v\|^2}. \qquad (9)$$

Let $h'$ be the projected point in hyperbolic space after the update, i.e.,

$$h' = \exp^c_0\big(f_{\theta_e - \eta \nabla_{\theta_e}\mathcal{L}}(x)\big). \qquad (10)$$

By applying the first-order Taylor expansion to the exponential map and following standard derivations, we find that

$$h' \approx h - \eta\, \frac{(1 - c\|h\|^2)^2}{4}\, A\, \nabla_h \mathcal{L}, \qquad (11)$$

where $A = \frac{\partial h}{\partial \theta_e} \big(\frac{\partial h}{\partial \theta_e}\big)^{\!\top}$ is positive semi-definite. Thus once $c\|h\|^2$ approaches one, the hyperbolic embedding stagnates no matter how large the training loss is.
Feature Clipping for Training Hyperbolic Neural Networks There are several possible solutions to the vanishing gradient problem in training HNNs. One tentative solution is to replace all the Euclidean layers with hyperbolic layers; however, it is not clear how to directly map the original input images onto hyperbolic space. Another solution is to use normalized gradient descent Hazan et al. (2015) for optimizing the Euclidean parameters to reduce the effect of the gradient magnitude. However, we observed that this introduces instability during training and makes it harder to tune the learning rate for the Euclidean parameters.
We address the vanishing gradient problem by first reformulating the optimization problem in Equation 2 with a regularization term which controls the magnitude of the hyperbolic embeddings,

$$\min_{\theta_e, \theta_h} \; \mathcal{L}\big(g_{\theta_h}(h),\, y\big) + \beta\, \|h\|^2, \qquad (12)$$

where $h = \exp^c_0(f_{\theta_e}(x))$ and $\beta$ is a hyperparameter. By minimizing the training loss, the hyperbolic embeddings tend to move to the boundary of the Poincaré ball, which causes the vanishing gradient problem. The additional regularization term prevents the hyperbolic embeddings from approaching the boundary.
While the soft constraint introduced in Equation 12 is effective, it adds complexity to the optimization process. We instead employ the following hard constraint, which rescales the Euclidean embedding before the exponential map whenever its norm exceeds a given threshold,

$$e_{\text{clipped}} = \min\Big\{1,\, \frac{r}{\|e\|}\Big\} \cdot e, \qquad (13)$$

where $e = f_{\theta_e}(x)$ and $r$ is a hyperparameter. Using the relation between the hyperbolic and Euclidean distance from the origin,

$$d_c(\mathbf{0}, h) = \frac{2}{\sqrt{c}} \tanh^{-1}\!\big(\sqrt{c}\,\|h\|\big), \qquad (14)$$

where $h \in \mathbb{D}^n_c$ and $-c$ is the curvature, $r$ can be converted into the effective radius of the Poincaré ball: since $\|h\| = \tanh(\sqrt{c}\,\|e\|)/\sqrt{c}$ after the exponential map, clipping $\|e\|$ at $r$ limits the hyperbolic distance of the embedding from the origin to $2r$.
The proposed Feature Clipping imposes a hard constraint on the maximum norm of the hyperbolic embedding, which prevents the inverse of the Riemannian metric tensor from approaching zero. Therefore, there is always a gradient signal for optimizing the hyperbolic embedding. Although decreasing the norm of the hyperbolic embedding shrinks the effective radius of the embedding space, our experiments show that this does no harm to accuracy while alleviating the vanishing gradient problem.
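The clipping rule of Equation 13 and the radius conversion of Equation 14 can be sketched together as follows (a minimal NumPy illustration with our function names, assuming curvature $c = 1$):

```python
import numpy as np

def clip_feature(e, r=1.0):
    """Feature Clipping (Eq. 13): rescale e whenever its norm exceeds r."""
    norm = np.linalg.norm(e)
    if norm <= r:
        return e
    return (r / norm) * e

def exp_map_zero(v, c=1.0):
    """Exponential map at the origin (Eq. 3)."""
    sqrt_c, n = np.sqrt(c), np.linalg.norm(v)
    return np.tanh(sqrt_c * n) * v / (sqrt_c * n) if n > 0 else v

def dist_from_origin(h, c=1.0):
    """Hyperbolic distance from the origin (Eq. 14)."""
    return 2.0 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(h))

e = np.array([3.0, 4.0])                  # Euclidean feature with norm 5
h = exp_map_zero(clip_feature(e, r=1.0))  # clipped, then mapped into the ball
```

Features inside the threshold pass through unchanged, while large features are rescaled to norm $r$, so every embedding stays within hyperbolic distance $2r$ of the origin and the factor in Equation 7 stays bounded away from zero.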
4 Experimental Settings and Evaluation Protocol
We conduct extensive experiments on standard recognition datasets to show the effectiveness of the proposed feature clipping for HNNs. The results show that HNNs with feature clipping are on par with ENNs on standard recognition datasets while demonstrating better performance in terms of few-shot classification, adversarial robustness and out-of-distribution detection.
Datasets We conduct experiments on commonly used image classification datasets: MNIST LeCun (1998), CIFAR-10 Krizhevsky et al. (2009), CIFAR-100 Krizhevsky et al. (2009) and ImageNet Deng et al. (2009). The details of these datasets can be found in Appendix A.3. To the best of our knowledge, this paper is the first attempt to extensively evaluate hyperbolic neural networks on standard image classification datasets for supervised classification.
Baselines and Networks We compare the performance of HNNs trained with and without the proposed feature clipping method Ganea et al. (2018); Khrulkov et al. (2020) and their Euclidean counterparts. For MNIST, we use a LeNet-like convolutional neural network LeCun et al. (1998) which has two convolutional layers with max pooling layers in between and three fully connected layers. For CIFAR-10 and CIFAR-100, we use WideResNet Zagoruyko and Komodakis (2016). For ImageNet, we use a standard ResNet-18 He et al. (2016).
Training Setups For training ENNs, we use SGD with momentum as the optimizer. For training HNNs, the Euclidean parameters are trained using SGD, and the hyperbolic parameters are optimized using stochastic Riemannian gradient descent Bonnabel (2013), as in previous work. For MNIST, we train the network for 10 epochs with a learning rate of 0.1 and a batch size of 64. For CIFAR-10 and CIFAR-100, we train the network for 100 epochs with an initial learning rate of 0.1 and a cosine learning rate schedule Loshchilov and Hutter (2016); the batch size is 128. For ImageNet, we train the network for 100 epochs with an initial learning rate of 0.1 which decays by a factor of 10 every 30 epochs; the batch size is 256. We find that hyperbolic neural networks are robust to the choice of the clipping threshold $r$, thus we fix $r = 1.0$ in all experiments. For more discussion of the effect of $r$, please see Appendix A.4. The experiments on MNIST, CIFAR-10 and CIFAR-100 are repeated 5 times and we report both the average accuracy and the standard deviation. All experiments are run on 8 NVIDIA TITAN RTX GPUs.
Results on the Standard Benchmarks In Table 1, we show the results of the different networks on the considered benchmarks. On MNIST, the accuracy of the improved hyperbolic neural networks with feature clipping is about 5% higher than the baseline hyperbolic neural networks and matches the performance of the Euclidean neural networks. On CIFAR-10, CIFAR-100 and ImageNet, the improved hyperbolic neural networks achieve 6%, 3% and 3% improvements, respectively, over the baseline hyperbolic neural networks. The results show that hyperbolic neural networks can perform well even on datasets which lack explicit hierarchical structure.
In Figure 3 we show the Poincaré hyperplanes of all the classes and the hyperbolic embeddings of 1000 sampled test images extracted by the baseline HNN and the HNN with feature clipping. Note that the Poincaré hyperplanes consist of arcs of Euclidean circles that are orthogonal to the boundary of the ball. We also color the points in the ball based on the classification results. It can be observed that by regularizing the magnitude of the hyperbolic embedding, all the embeddings lie in a restricted region of the Poincaré ball and the network learns more regular and discriminative features in hyperbolic space.
Table 1: Classification accuracy on standard image classification benchmarks.

Methods | MNIST | CIFAR-10 | CIFAR-100 | ImageNet
--- | --- | --- | --- | ---
Euclidean He et al. (2016) | 99.12 ± 0.34% | 94.81 ± 0.42% | 76.24 ± 0.35% | 69.82%
Hyperbolic Ganea et al. (2018); Khrulkov et al. (2020) | 94.42 ± 0.29% | 88.82 ± 0.51% | 72.26 ± 0.41% | 65.74%
Hyperbolic w/ Feature Clipping | 99.08 ± 0.31% | 94.76 ± 0.44% | 75.88 ± 0.38% | 68.45%
Improvement | +4.66% | +5.94% | +3.62% | +2.79%
Table 2: Few-shot classification accuracy on the CUB dataset on 1-shot 5-way and 5-shot 5-way tasks. All accuracies are reported with 95% confidence intervals.

Methods | Embedding Net | 1-Shot 5-Way | 5-Shot 5-Way
--- | --- | --- | ---
ProtoNet Snell et al. (2017) | 4 Conv | 51.31 ± 0.91% | 70.77 ± 0.69%
Hyperbolic ProtoNet Khrulkov et al. (2020) | 4 Conv | 61.18 ± 0.24% | 79.51 ± 0.16%
Hyperbolic ProtoNet w/ Feature Clipping | 4 Conv | 64.66 ± 0.24% | 81.76 ± 0.15%
Table 3: Few-shot classification accuracy on the miniImageNet dataset on 1-shot 5-way and 5-shot 5-way tasks. All accuracies are reported with 95% confidence intervals.

Methods | Embedding Net | 1-Shot 5-Way | 5-Shot 5-Way
--- | --- | --- | ---
ProtoNet Snell et al. (2017) | 4 Conv | 49.42 ± 0.78% | 68.20 ± 0.66%
Hyperbolic ProtoNet Khrulkov et al. (2020) | 4 Conv | 51.88 ± 0.20% | 72.63 ± 0.16%
Hyperbolic ProtoNet w/ Feature Clipping | 4 Conv | 53.01 ± 0.22% | 72.66 ± 0.15%
Few-shot Learning We show that the proposed feature clipping also improves the performance of Hyperbolic ProtoNet Khrulkov et al. (2020) for few-shot learning. Different from the standard ProtoNet Snell et al. (2017), which computes the prototype of each class in Euclidean space, Hyperbolic ProtoNet computes the class prototypes in hyperbolic space using hyperbolic averaging. Hyperbolic geometry has been shown to learn more accurate embeddings than Euclidean geometry for few-shot learning Khrulkov et al. (2020).
We follow the experimental settings in Khrulkov et al. (2020) and conduct experiments on the CUB dataset Welinder et al. (2010) and the miniImageNet dataset Russakovsky et al. (2015). We consider 1-shot 5-way and 5-shot 5-way tasks as in Khrulkov et al. (2020). The evaluation is repeated 10,000 times and we report the average performance with 95% confidence intervals. Table 2 and Table 3 show that the proposed feature clipping further improves the accuracy of Hyperbolic ProtoNet for few-shot classification by as much as 3%.
Adversarial Robustness We show that HNNs are more robust than ENNs to adversarial attacks including FGSM Goodfellow et al. (2014) and PGD Madry et al. (2017). We train the networks regularly, without adversarial training, with the setups described in Section 4. For attacking networks trained on MNIST using FGSM, we consider a range of perturbation budgets. For attacking networks trained on MNIST using PGD, we likewise vary the perturbation budget; the number of steps is 40. For attacking networks trained on CIFAR-10 using PGD, the number of steps is 7. From Figure 4 we can see that in all cases hyperbolic neural networks are more robust than ENNs to adversarial attacks. More results and discussions can be found in Appendix A.5.
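This is not the authors' evaluation code, but the FGSM perturbation used in these experiments can be sketched in a few lines of NumPy, assuming the gradient of the loss with respect to the input is available:

```python
import numpy as np

def fgsm(x, grad_x, eps):
    """FGSM attack: take one eps-sized step along the sign of the input gradient,
    then clip back to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)
```

The perturbation is bounded by eps in the infinity norm, which is why robustness is reported as accuracy under a sweep of perturbation budgets; PGD iterates similar steps with re-projection onto the eps-ball.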
Out-of-distribution Detection We conduct experiments to show that HNNs have stronger out-of-distribution (OOD) detection capability than ENNs. Out-of-distribution detection aims at determining whether or not a given input is from the same distribution as the training data. We follow the experimental settings in Liu et al. (2020). The in-distribution datasets are CIFAR-10 and CIFAR-100. The out-of-distribution datasets are iSUN Xu et al. (2015b), Places365 Zhou et al. (2017), Texture Cimpoi et al. (2014), SVHN Netzer et al. (2011), LSUN-Crop Yu et al. (2015) and LSUN-Resize Yu et al. (2015). We use the same networks and training setups as described in Section 4 for training models on CIFAR-10 and CIFAR-100. For detecting out-of-distribution data, we use both the softmax score and the energy score as described in Liu et al. (2020). For metrics, we consider FPR95, AUROC and AUPR Liu et al. (2020). In Table 4 and Table 5 we show the results of using the softmax score on CIFAR-10 and CIFAR-100, respectively. We can see that HNNs and ENNs achieve similar AUPR; however, HNNs achieve much better performance in terms of FPR95 and AUROC. In particular, HNNs reduce FPR95 by 5.82% and 9.55% on CIFAR-10 and CIFAR-100, respectively. For results using the energy score, please see Appendix A.6.

Table 4: Out-of-distribution detection results using the softmax score; the in-distribution dataset is CIFAR-10.

OOD Dataset | ENN FPR95 ↓ | ENN AUROC ↑ | ENN AUPR ↑ | HNN FPR95 ↓ | HNN AUROC ↑ | HNN AUPR ↑
--- | --- | --- | --- | --- | --- | ---
iSUN | 46.30 ± 0.78 | 91.50 ± 0.16 | 98.16 ± 0.05 | 45.28 ± 0.65 | 91.61 ± 0.21 | 98.09 ± 0.06
Places365 | 51.09 ± 0.92 | 87.56 ± 0.37 | 96.76 ± 0.15 | 54.77 ± 0.76 | 86.82 ± 0.41 | 96.17 ± 0.20
Texture | 65.04 ± 0.91 | 82.80 ± 0.35 | 94.59 ± 0.20 | 47.12 ± 0.62 | 89.91 ± 0.20 | 97.39 ± 0.09
SVHN | 71.66 ± 0.84 | 86.58 ± 0.21 | 97.06 ± 0.06 | 49.89 ± 1.03 | 91.34 ± 0.22 | 98.13 ± 0.06
LSUN-Crop | 22.22 ± 0.78 | 96.05 ± 0.10 | 99.16 ± 0.03 | 23.87 ± 0.73 | 95.65 ± 0.22 | 98.98 ± 0.07
LSUN-Resize | 41.06 ± 1.07 | 92.67 ± 0.16 | 98.42 ± 0.04 | 41.49 ± 1.24 | 92.97 ± 0.24 | 98.46 ± 0.07
Mean | 49.56 | 89.53 | 97.36 | 43.74 | 91.38 | 97.87
Table 5: Out-of-distribution detection results using the softmax score; the in-distribution dataset is CIFAR-100.

OOD Dataset | ENN FPR95 ↓ | ENN AUROC ↑ | ENN AUPR ↑ | HNN FPR95 ↓ | HNN AUROC ↑ | HNN AUPR ↑
--- | --- | --- | --- | --- | --- | ---
iSUN | 74.07 ± 0.87 | 82.51 ± 0.39 | 95.83 ± 0.11 | 68.37 ± 0.90 | 81.31 ± 0.43 | 94.96 ± 0.20
Places365 | 81.01 ± 1.07 | 76.90 ± 0.45 | 94.02 ± 0.15 | 79.66 ± 0.69 | 76.94 ± 0.28 | 93.91 ± 0.18
Texture | 83.67 ± 0.68 | 77.52 ± 0.32 | 94.47 ± 0.10 | 64.91 ± 0.80 | 83.26 ± 0.25 | 95.77 ± 0.08
SVHN | 84.56 ± 0.78 | 84.32 ± 0.22 | 96.69 ± 0.07 | 53.11 ± 1.04 | 89.53 ± 0.26 | 97.71 ± 0.07
LSUN-Crop | 43.46 ± 0.79 | 93.09 ± 0.23 | 98.58 ± 0.05 | 51.08 ± 1.17 | 87.21 ± 0.39 | 96.83 ± 0.13
LSUN-Resize | 71.50 ± 0.73 | 82.12 ± 0.40 | 95.69 ± 0.13 | 63.86 ± 1.10 | 82.36 ± 0.42 | 95.16 ± 0.13
Mean | 73.05 | 82.74 | 95.88 | 63.50 | 83.43 | 95.72
5 Conclusion
We address an important issue in training HNNs which was ignored in previous literature. We identify the vanishing gradient problem when training hyperbolic neural networks and propose a simple yet effective solution which requires no modification of the optimizer or architecture. We conduct extensive experiments on commonly used image benchmarks including MNIST, CIFAR-10, CIFAR-100 and ImageNet. Hyperbolic neural networks with feature clipping show significant improvement over baseline HNNs and match the performance of ENNs. The proposed method also improves the performance of hyperbolic neural networks for few-shot learning. Further studies reveal that hyperbolic neural networks are more robust to adversarial attacks and have stronger out-of-distribution detection capability.
6 Discussion on Societal Impacts and Limitations
This paper addresses an important problem in geometric deep learning. Our contribution has both theoretical and practical sides. The proposed method can advance the progress of hyperbolic neural networks, and we expect to see more applications of hyperbolic neural networks in computer vision. The proposed method can also help hyperbolic neural networks model long-tail data to address fairness issues. We believe the proposed method does not raise any ethical concerns.
In terms of limitations, there are still several unexplored aspects of HNNs. Future research could study and improve the applications of HNNs for computer vision tasks such as semantic segmentation and object detection.
References
[1] Alanis-Lobato et al. (2016) Efficient embedding of complex networks to hyperbolic space via their Laplacian. Scientific Reports 6(1), pp. 1–10.
[2] Anderson (2006) Hyperbolic Geometry. Springer Science & Business Media.
[3] Bonnabel (2013) Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control 58(9), pp. 2217–2229.
[4] Carmo (1992) Riemannian Geometry. Birkhäuser.
[5] Cho et al. (2019) Large-margin classification in hyperbolic space. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1832–1840.
[6] Cimpoi et al. (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613.
[7] Deng et al. (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[8] Ganea et al. (2018) Hyperbolic neural networks. arXiv preprint arXiv:1805.09112.
[9] Goodfellow et al. (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[10] Gulcehre et al. (2018) Hyperbolic attention networks. arXiv preprint arXiv:1805.09786.
[11] Gupte et al. (2011) Finding hierarchy in directed online social networks. In Proceedings of the 20th International Conference on World Wide Web, pp. 557–566.
[12] Hanin (2018) Which neural net architectures give rise to exploding and vanishing gradients? arXiv preprint arXiv:1801.03744.
[13] Hazan et al. (2015) Beyond convexity: stochastic quasi-convex optimization. arXiv preprint arXiv:1507.02030.
[14] He et al. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[15] Hochreiter and Schmidhuber (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
[16] Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02), pp. 107–116.
[17] Hsu et al. (2020) Learning hyperbolic representations for unsupervised 3D segmentation. arXiv preprint arXiv:2012.01644.
[18] Khrulkov et al. (2020) Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6418–6428.
[19] Kingma and Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[20] Klimovskaia et al. (2020) Poincaré maps for analyzing complex hierarchies in single-cell data. Nature Communications 11(1), pp. 1–9.
[21] Krizhevsky et al. (2009) Learning multiple layers of features from tiny images.
[22] LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
[23] LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[24] Lee (2018) Introduction to Riemannian Manifolds. Springer.
[25] Liu et al. (2019) Hyperbolic graph neural networks. arXiv preprint arXiv:1910.12892.
[26] Liu et al. (2020) Energy-based out-of-distribution detection. arXiv preprint arXiv:2010.03759.
[27] Loshchilov and Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
[28] Madry et al. (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
[29] Mathieu et al. (2019) Continuous hierarchical representations with Poincaré variational autoencoders. arXiv preprint arXiv:1901.06033.
[30] Miller (1995) WordNet: a lexical database for English. Communications of the ACM 38(11), pp. 39–41.
[31] Mishkin and Matas (2015) All you need is a good init. arXiv preprint arXiv:1511.06422.
[32] Nagano et al. (2019) A wrapped normal distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning, pp. 4693–4702.
[33] Netzer et al. (2011) Reading digits in natural images with unsupervised feature learning.
[34] Nickel and Kiela (2017) Poincaré embeddings for learning hierarchical representations. arXiv preprint arXiv:1705.08039.
[35] Pennington et al. (2018) The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pp. 1924–1932.
[36] Pennington et al. (2017) Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. arXiv preprint arXiv:1711.04735.
[37] Rumelhart et al. (1986) Learning representations by back-propagating errors. Nature 323(6088), pp. 533–536.
[38] Russakovsky et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
[39] Sala et al. (2018) Representation tradeoffs for hyperbolic embeddings. In International Conference on Machine Learning, pp. 4460–4469.
[40] Sarkar (2011) Low distortion Delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366.
[41] Shimizu et al. (2020) Hyperbolic neural networks++. arXiv preprint arXiv:2006.08210.
[42] Snell et al. (2017) Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.
[43] Ungar (2001) Hyperbolic trigonometry and its application in the Poincaré ball model of hyperbolic geometry. Computers & Mathematics with Applications 41(1–2), pp. 135–147.
[44] Ungar (2005) Analytic Hyperbolic Geometry: Mathematical Foundations and Applications. World Scientific.
[45] Ungar (2008) A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics 1(1), pp. 1–194.
 [46] (2020) Robust largemargin learning in hyperbolic space. arXiv preprint arXiv:2004.05465. Cited by: §1.
 [47] (2010) CaltechUCSD Birds 200. Technical report Technical Report CNSTR2010001, California Institute of Technology. Cited by: §4.
 [48] (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.
 [49] (2015) Turkergaze: crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755. Cited by: §4.
 [50] (2015) Lsun: construction of a largescale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.
 [51] (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.
 [52] (2019) Hyperbolic graph attention network. arXiv preprint arXiv:1912.03046. Cited by: §2.

[53]
(2017)
Places: a 10 million image database for scene recognition
. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1452–1464. Cited by: §4.  [54] (2020) Graph neural networks: a review of methods and applications. AI Open 1, pp. 57–81. Cited by: §2.
Appendix A Appendix
A.1 Gyrovector space
We give more details on gyrovector spaces; for a more systematic treatment, please refer to [44, 45, 43].
Gyrovector spaces provide a way to operate in hyperbolic space with vector algebra: gyrovector spaces are to hyperbolic geometry what standard vector spaces are to Euclidean geometry. The geometric objects in a gyrovector space are called gyrovectors, which are equivalence classes of directed gyrosegments. Just as vectors in Euclidean space are added according to the parallelogram law, gyrovectors are added according to the gyroparallelogram law. Technically, gyrovector spaces are gyrocommutative gyrogroups of gyrovectors that admit scalar multiplication.
We start by introducing gyrogroups, which give rise to gyrovector spaces.
Definition A.1 (Gyrogroups)
A groupoid $(G, \oplus)$ is a gyrogroup if it satisfies the following axioms:

There exists one element $0 \in G$ satisfying $0 \oplus a = a$ for all $a \in G$.

For each $a \in G$, there exists an element $\ominus a \in G$ which satisfies $\ominus a \oplus a = 0$.

For every $a, b, c \in G$, there exists a unique element $\operatorname{gyr}[a, b]c \in G$ such that $\oplus$ satisfies the left gyroassociative law $a \oplus (b \oplus c) = (a \oplus b) \oplus \operatorname{gyr}[a, b]c$.

The map $\operatorname{gyr}[a, b]: G \to G$ given by $c \mapsto \operatorname{gyr}[a, b]c$ is an automorphism of the groupoid $(G, \oplus)$: $\operatorname{gyr}[a, b] \in \operatorname{Aut}(G, \oplus)$. The automorphism $\operatorname{gyr}[a, b]$ of $G$ is called the gyroautomorphism of $G$ generated by $a, b \in G$.

The operation $\operatorname{gyr}: G \times G \to \operatorname{Aut}(G, \oplus)$ is called the gyrator of $G$. The gyroautomorphism $\operatorname{gyr}[a, b]$ generated by any $a, b \in G$ has the left loop property: $\operatorname{gyr}[a \oplus b, b] = \operatorname{gyr}[a, b]$.
In particular, the Möbius complex disk groupoid $(\mathbb{D}, \oplus_M)$ is a gyrocommutative gyrogroup, where $\mathbb{D} = \{z \in \mathbb{C} : |z| < 1\}$ and $\oplus_M$ is the Möbius addition. The same applies to the ball $\mathbb{B}^n_c$, which is defined as,
$\mathbb{B}^n_c = \{x \in \mathbb{R}^n : c\|x\|^2 < 1\}$.   (15)
Gyrocommutative gyrogroups which admit scalar multiplication become gyrovector spaces $(G, \oplus, \otimes)$. Möbius gyrogroups which admit scalar multiplication become Möbius gyrovector spaces $(\mathbb{B}^n_c, \oplus_c, \otimes_c)$.
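As a numerical sanity check (not from the paper), the gyrogroup axioms can be verified on the Möbius complex disk. The sketch below assumes the standard Möbius addition $a \oplus b = (a + b)/(1 + \bar{a}b)$ and the closed-form gyroautomorphism $\operatorname{gyr}[a, b]z = \frac{1 + a\bar{b}}{1 + \bar{a}b}z$ from [45]:

```python
def mobius_add(a, b):
    """Mobius addition on the complex open unit disk."""
    return (a + b) / (1 + a.conjugate() * b)

def gyr(a, b, z):
    """Gyroautomorphism gyr[a, b] of the Mobius gyrogroup (Ungar's closed form)."""
    return ((1 + a * b.conjugate()) / (1 + a.conjugate() * b)) * z

a, b, c = 0.3 + 0.1j, -0.2 + 0.4j, 0.5 - 0.3j  # arbitrary points inside the disk

# Left gyroassociative law: a + (b + c) == (a + b) + gyr[a, b]c.
lhs = mobius_add(a, mobius_add(b, c))
rhs = mobius_add(mobius_add(a, b), gyr(a, b, c))
assert abs(lhs - rhs) < 1e-12

# Left loop property: gyr[a + b, b] == gyr[a, b].
assert abs(gyr(mobius_add(a, b), b, c) - gyr(a, b, c)) < 1e-12
```

The check exploits the fact that in one complex dimension the gyration is just a rotation by a unit-modulus complex number, so both axioms reduce to exact scalar identities.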
Definition A.2 (Möbius Scalar Multiplication)
Let $(\mathbb{B}^n_c, \oplus_c)$ be a Möbius gyrogroup. The Möbius scalar multiplication is defined as,
$r \otimes_c x = \frac{1}{\sqrt{c}} \tanh\left(r \tanh^{-1}(\sqrt{c}\|x\|)\right) \frac{x}{\|x\|}$,   (16)
where $r \in \mathbb{R}$ and $x \in \mathbb{B}^n_c \setminus \{0\}$, and $r \otimes_c 0 = 0$.
Definition A.3 (Gyrolines)
Let $a, b$ be two distinct points in the gyrovector space $(\mathbb{B}^n_c, \oplus_c, \otimes_c)$. The gyroline in $\mathbb{B}^n_c$ which passes through $a$ and $b$ is the set of points
$L = \{a \oplus_c (\ominus a \oplus_c b) \otimes_c t\}$,   (17)
where $t \in \mathbb{R}$.
It can be proven that gyrolines in a Möbius gyrovector space coincide with the geodesics of the Poincaré ball model of hyperbolic geometry.
With the aid of operations in gyrovector spaces, we can express important properties of the Poincaré ball model in closed form.
Definition A.4 (Exponential Map and Logarithmic Map)
As shown in [8], the exponential map $\exp_x^c: T_x\mathbb{B}^n_c \to \mathbb{B}^n_c$ is defined as,
$\exp_x^c(v) = x \oplus_c \left(\tanh\left(\sqrt{c}\,\frac{\lambda_x^c \|v\|}{2}\right) \frac{v}{\sqrt{c}\|v\|}\right)$,   (18)
where $\lambda_x^c = 2/(1 - c\|x\|^2)$ is the conformal factor. The logarithmic map $\log_x^c: \mathbb{B}^n_c \to T_x\mathbb{B}^n_c$ is defined as,
$\log_x^c(y) = \frac{2}{\sqrt{c}\,\lambda_x^c} \tanh^{-1}\left(\sqrt{c}\|-x \oplus_c y\|\right) \frac{-x \oplus_c y}{\|-x \oplus_c y\|}$.   (19)
The distance between two points in the Poincaré ball can be defined as,
Definition A.5 (Poincaré Distance between Two Points)
$d_c(x, y) = \frac{2}{\sqrt{c}} \tanh^{-1}\left(\sqrt{c}\|-x \oplus_c y\|\right)$.   (20)
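The closed-form operations above translate directly into code. The following sketch (assuming curvature $c = 1$ by default and the Möbius addition of [8]) implements the maps at the origin and the distance, and checks that the logarithmic map inverts the exponential map:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball of curvature -c."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def expmap0(v, c=1.0):
    """Exponential map at the origin (Eq. 18 with x = 0)."""
    n = np.linalg.norm(v)
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def logmap0(y, c=1.0):
    """Logarithmic map at the origin (Eq. 19 with x = 0)."""
    n = np.linalg.norm(y)
    return np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)

def dist(x, y, c=1.0):
    """Poincare distance between two points (Eq. 20)."""
    d = np.linalg.norm(mobius_add(-x, y, c))
    return 2 / np.sqrt(c) * np.arctanh(np.sqrt(c) * d)

v = np.array([0.3, -0.4])
assert np.allclose(logmap0(expmap0(v)), v)        # log inverts exp at the origin
x, y = expmap0(v), expmap0(np.array([0.1, 0.2]))
assert np.isclose(dist(x, y), dist(y, x))         # the distance is symmetric
```

Note that the geodesic distance from the origin to $\exp_0^c(v)$ equals $2\|v\|$ for $c = 1$, since the conformal factor at the origin is $\lambda_0^c = 2$.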
A.2 The Effect of a Gradient Update of the Euclidean Parameters on the Hyperbolic Embedding
We derive the effect of a single gradient update of the Euclidean parameters on the hyperbolic embedding. Let $f(\cdot\,; \theta)$ denote the Euclidean subnetwork with parameters $\theta$, and let $z = f(x; \theta)$ be the Euclidean embedding of an input $x$. Consider the first-order Taylor expansion of the Euclidean network after a single gradient update $\theta' = \theta - \eta\nabla_\theta\mathcal{L}$ with learning rate $\eta$,
$f(x; \theta') \approx f(x; \theta) - \eta\,\frac{\partial f(x; \theta)}{\partial \theta}\nabla_\theta\mathcal{L}$.   (21)
Meanwhile, the exponential map of the Poincaré ball at the origin is,
$\exp_0^c(z) = \tanh(\sqrt{c}\|z\|)\frac{z}{\sqrt{c}\|z\|}$.   (22)
The gradient of the exponential map can be computed as,
$\frac{\partial \exp_0^c(z)}{\partial z} = \operatorname{sech}^2(\sqrt{c}\|z\|)\frac{zz^\top}{\|z\|^2} + \frac{\tanh(\sqrt{c}\|z\|)}{\sqrt{c}\|z\|}\left(I - \frac{zz^\top}{\|z\|^2}\right)$.   (23)
Let $h$ be the projected point in hyperbolic space, i.e.,
$h = \exp_0^c(z) = \exp_0^c(f(x; \theta))$.   (24)
Again we can apply the first-order Taylor expansion to the exponential map,
$\exp_0^c(z') \approx \exp_0^c(z) + \frac{\partial \exp_0^c(z)}{\partial z}(z' - z)$.   (25)
Denoting $\Delta z = z' - z = -\eta\,\frac{\partial f(x; \theta)}{\partial \theta}\nabla_\theta\mathcal{L}$, we have
$h' \approx h + \frac{\partial \exp_0^c(z)}{\partial z}\Delta z$.   (26)
Denoting $G = \frac{\partial \exp_0^c(z)}{\partial z}$,
$h' - h \approx G\Delta z = -\eta\, G\,\frac{\partial f(x; \theta)}{\partial \theta}\nabla_\theta\mathcal{L}$.   (27)
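The shrinking effect of the exponential map's Jacobian can be checked numerically. The sketch below (assuming $c = 1$) estimates $\partial \exp_0(z)/\partial z$ by central finite differences and shows that its operator norm decays as $\|z\|$ grows, so a fixed-size Euclidean update moves the hyperbolic embedding less and less:

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map of the Poincare ball at the origin."""
    n = np.linalg.norm(v)
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def jacobian_expmap0(z, eps=1e-6):
    """Central finite-difference estimate of the Jacobian of expmap0 at z."""
    d = len(z)
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (expmap0(z + e) - expmap0(z - e)) / (2 * eps)
    return J

# Spectral norm of the Jacobian at increasing embedding norms: the radial
# direction shrinks like sech^2(||z||) and the tangential one like
# tanh(||z||)/||z||, so both vanish for large ||z||.
norms = [np.linalg.norm(jacobian_expmap0(r * np.array([1.0, 0.0])), 2)
         for r in (0.5, 2.0, 5.0)]
assert norms[0] > norms[1] > norms[2]
```

This is exactly the vanishing-gradient mechanism the derivation identifies: once the Euclidean embedding has a large norm, the factor $G$ in $h' - h \approx G\Delta z$ attenuates any parameter update.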
A.3 Datasets
A.4 The effect of the hyperparameter r
We conduct ablation studies to show the effect of the hyperparameter r, which is the maximum norm of the Euclidean embedding. In Figure 5 we show how the test accuracy changes as we vary r on MNIST, CIFAR10 and CIFAR100. We repeat the experiments five times for each choice of r and report both the average accuracy and the standard deviation. On the one hand, it can be observed that a larger r leads to a drop in test accuracy. As we point out, this is caused by the vanishing gradient problem in training hyperbolic neural networks. On the other hand, a small r can also lead to a drop in test accuracy, especially on more complex tasks such as CIFAR10 and CIFAR100. A plausible reason is that a small r reduces the capacity of the embedding space, which is detrimental for learning discriminative features.
To conclude, there is a sweet spot for choosing r: neither too large (causing the vanishing gradient problem) nor too small (not enough capacity). The performance of hyperbolic neural networks is also robust to the choice of r as long as it is around the sweet spot.
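The clipping controlled by r can be sketched in a few lines; the function name is illustrative, and the operation simply rescales any Euclidean embedding whose norm exceeds r before the exponential map is applied:

```python
import numpy as np

def clip_feature(z, r=1.0):
    """Rescale the Euclidean embedding z so that its norm is at most r."""
    n = np.linalg.norm(z)
    return z if n <= r else (r / n) * z

z = np.array([3.0, 4.0])  # norm 5
assert np.isclose(np.linalg.norm(clip_feature(z, r=1.0)), 1.0)
# Embeddings already inside the ball of radius r are left untouched.
assert np.allclose(clip_feature(np.array([0.3, 0.4]), r=1.0), [0.3, 0.4])
```

Bounding the Euclidean norm by r keeps the hyperbolic embedding away from the boundary of the Poincaré ball, which is where the gradient of the exponential map vanishes.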
A.5 More results on adversarial robustness
Although we observe that with adversarial training, hyperbolic neural networks achieve robust accuracy similar to Euclidean neural networks, in a further study we consider training models with a small perturbation budget ε but attacking with a larger ε using FGSM on MNIST. In Table 7 we show the results of attacking the trained networks with ε = 0.1, 0.2 and 0.3. We can observe that under attacks with larger ε, such as 0.2 and 0.3, hyperbolic neural networks are more robust than Euclidean neural networks. A possible explanation is that the proposed feature clipping reduces the adversarial noise in the forward pass. One future direction is to systematically understand and analyze the reason behind the robustness of hyperbolic neural networks. In Figure 6, we show clean and adversarial images generated by FGSM for hyperbolic neural networks and Euclidean neural networks respectively, with the networks' predictions shown above the images. It can be observed that hyperbolic neural networks show more adversarial robustness compared with Euclidean neural networks.
Network \ Perturbation  ε = 0.1  ε = 0.2  ε = 0.3
Euclidean Network  94.51%  67.85%  42.18%
Hyperbolic Network  93.58%  74.97%  46.27%
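For reference, FGSM perturbs each input by ε in the direction of the sign of the input gradient. The sketch below applies it to a toy logistic model whose input gradient is available in closed form; the model, weights, and ε value are illustrative, not taken from the experiments above:

```python
import numpy as np

def fgsm(x, grad_x, eps):
    """Fast Gradient Sign Method: one step of size eps along sign(grad)."""
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)  # keep pixels in [0, 1]

# Toy logistic model: p = sigmoid(w . x).  For cross-entropy loss with label y,
# the gradient of the loss with respect to the input is (p - y) * w.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.8, 0.5])
y = 1.0
p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
grad_x = (p - y) * w

x_adv = fgsm(x, grad_x, eps=0.1)
assert np.all(np.abs(x_adv - x) <= 0.1 + 1e-12)  # perturbation bounded by eps
```

In the experiments above the same one-step attack is applied to the full network, with the gradient obtained by backpropagation instead of a closed-form expression.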
A.6 More results on out-of-distribution detection
In Tables 8 and 9 we show the results of using the energy score [26] on CIFAR10 and CIFAR100 for out-of-distribution detection. Similar to the case of using the softmax score, we can observe that on both datasets hyperbolic neural networks achieve similar performance on one metric and perform better on the other two metrics compared with Euclidean neural networks.
OOD Dataset \ Network  Euclidean Neural Network  Hyperbolic Neural Network

ISUN  34.19 ± 0.97  93.07 ± 0.24  98.42 ± 0.07  25.39 ± 0.32  95.48 ± 0.09  99.01 ± 0.04
Place365  43.34 ± 1.22  88.50 ± 0.48  96.76 ± 0.17  45.17 ± 1.19  89.61 ± 0.28  97.20 ± 0.14
Texture  58.51 ± 0.77  82.98 ± 0.20  94.55 ± 0.14  49.70 ± 0.94  90.66 ± 0.20  97.98 ± 0.04
SVHN  49.04 ± 1.05  91.57 ± 0.13  98.12 ± 0.05  57.33 ± 1.34  88.45 ± 0.20  97.44 ± 0.06
LSUN-Crop  9.48 ± 0.60  98.21 ± 0.07  99.63 ± 0.02  24.78 ± 0.73  95.06 ± 0.15  98.92 ± 0.05
LSUN-Resize  28.28 ± 0.66  94.31 ± 0.14  98.72 ± 0.04  22.52 ± 0.67  96.15 ± 0.09  99.18 ± 0.02
Mean  37.14  91.44  97.70  37.48  92.57  98.29
OOD Dataset \ Network  Euclidean Neural Network  Hyperbolic Neural Network

ISUN  74.49 ± 0.60  82.45 ± 0.33  95.84 ± 0.12  68.75 ± 0.93  81.33 ± 0.31  94.93 ± 0.16
Place365  81.20 ± 0.86  77.02 ± 0.34  94.13 ± 0.13  79.51 ± 0.69  77.23 ± 0.37  93.97 ± 0.17
Texture  83.19 ± 0.31  77.74 ± 0.35  94.54 ± 0.11  65.03 ± 0.52  83.38 ± 0.29  95.85 ± 0.10
SVHN  84.12 ± 0.59  84.41 ± 0.16  96.72 ± 0.04  55.44 ± 1.00  89.43 ± 0.25  97.69 ± 0.06
LSUN-Crop  43.80 ± 1.29  93.04 ± 0.22  98.56 ± 0.05  74.89 ± 0.73  84.98 ± 0.18  96.46 ± 0.08
LSUN-Resize  71.86 ± 0.69  81.86 ± 0.27  95.60 ± 0.09  64.35 ± 0.62  82.64 ± 0.36  95.27 ± 0.14
Mean  73.11  82.75  95.90  67.99  83.17  95.70
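The energy score of [26] used in these tables assigns $E(x) = -T \cdot \operatorname{logsumexp}(z/T)$ to the logits $z$, and confident in-distribution predictions tend to receive lower energy. A minimal sketch (with $T = 1$ and an illustrative threshold):

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Energy score of [26]: E(x) = -T * logsumexp(logits / T)."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()
    return -T * (m + np.log(np.sum(np.exp(z - m))))  # numerically stable logsumexp

# A confident (in-distribution-like) prediction gets lower energy than a flat one.
confident = energy_score([10.0, 0.0, 0.0])
flat = energy_score([0.0, 0.0, 0.0])
assert confident < flat

# Detection: flag inputs whose energy exceeds a threshold tau chosen on
# validation data (tau = -2.0 here is purely illustrative).
tau = -2.0
assert flat > tau and confident < tau
```

In the tables above the same score is computed from the logits of the Euclidean and hyperbolic classification heads respectively.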
A.7 Softmax with temperature scaling
We consider softmax with temperature scaling as an alternative for addressing the vanishing gradient problem in training hyperbolic neural networks. Softmax with temperature scaling introduces an additional temperature parameter T to adjust the logits before applying the softmax function. It can be formulated as,
$p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$,   (28)
where in hyperbolic neural networks $z$ is the output of the hyperbolic fully-connected layer and $K$ is the number of classes. If the temperature parameter T is smaller than 1, the magnitude (in the Euclidean sense) of the gradients received by the hyperbolic embedding is scaled up, which compensates for the vanishing gradient that occurs as the embedding approaches the boundary of the ball.
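Eq. (28) can be sketched directly; the logit values below are illustrative. With T < 1 the scaled logits grow in magnitude, which sharpens the output distribution:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits scaled by temperature T (Eq. 28)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
p_sharp = softmax_with_temperature(logits, T=0.5)
p_base = softmax_with_temperature(logits, T=1.0)

# A temperature below 1 scales the logits up and sharpens the distribution.
assert p_sharp.max() > p_base.max()
assert np.isclose(p_sharp.sum(), 1.0)
```

During training, the 1/T factor also rescales the gradients that flow back through the logits, which is the property exploited in this section.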
In Figure 7, we show the performance of training hyperbolic neural networks with temperature scaling compared with the proposed feature clipping. We consider feature dimensions of 2 and 64. Different temperature parameters are considered, and the experiments are repeated 10 times with different random seeds. We show both the average accuracy and the standard deviation. We can observe that softmax with a carefully tuned temperature parameter can approach the performance of the proposed feature clipping when the feature dimension is 2. However, when the feature dimension is 64, softmax with temperature scaling severely underperforms the proposed feature clipping. The results again confirm the effectiveness of the proposed approach.