Free Hyperbolic Neural Networks with Limited Radii

07/23/2021 · Yunhui Guo et al.

Non-Euclidean geometry with constant negative curvature, i.e., hyperbolic space, has attracted sustained attention in the machine learning community. Owing to its ability to embed hierarchical structures continuously with low distortion, hyperbolic space has been applied to learning data with tree-like structures. Hyperbolic Neural Networks (HNNs), which operate directly in hyperbolic space, have also been proposed recently to further exploit the potential of hyperbolic representations. While HNNs have achieved better performance than Euclidean neural networks (ENNs) on datasets with implicit hierarchical structure, they still perform poorly on standard classification benchmarks such as CIFAR and ImageNet. The conventional wisdom is that it is critical for the data to respect the hyperbolic geometry when applying HNNs. In this paper, we first conduct an empirical study showing that the inferior performance of HNNs on standard recognition datasets can be attributed to the notorious vanishing gradient problem. We further discover that this problem stems from the hybrid architecture of HNNs. Our analysis leads to a simple yet effective solution called Feature Clipping, which regularizes the hyperbolic embedding whenever its norm exceeds a given threshold. Our thorough experiments show that the proposed method can successfully avoid the vanishing gradient problem when training HNNs with backpropagation. The improved HNNs are able to achieve comparable performance with ENNs on standard image recognition datasets including MNIST, CIFAR10, CIFAR100 and ImageNet, while demonstrating greater adversarial robustness and stronger out-of-distribution detection capability.


1 Introduction

Many public datasets exhibit hierarchical structures. For instance, the conceptual relations in WordNet Miller (1995) form a hierarchy, and users in social networks such as Facebook or Twitter form hierarchies based on occupations and organizations Gupte et al. (2011). Representing such hierarchical data in Euclidean space cannot capture and reflect their semantic or functional resemblance Alanis-Lobato et al. (2016); Nickel and Kiela (2017). Hyperbolic space, i.e., non-Euclidean space with constant negative curvature, has been leveraged to embed data with hierarchical structures with low distortion, owing to its exponential growth in volume with respect to the radius Nickel and Kiela (2017); Sarkar (2011); Sala et al. (2018). For instance, hyperbolic space has been used for analyzing the hierarchical structure of single-cell data Klimovskaia et al. (2020), learning hierarchical word embeddings Nickel and Kiela (2017), and embedding complex networks Alanis-Lobato et al. (2016).

Recently, algorithms that operate directly on hyperbolic representations have been derived to further exploit their potential. For example, Weber et al. (2020) proposed a hyperbolic perceptron that runs the perceptron algorithm directly on hyperbolic representations. The Hyperbolic Support Vector Machine Cho et al. (2019) was proposed to perform large-margin classification in hyperbolic space. Hyperbolic Neural Networks (HNNs) Ganea et al. (2018) were proposed as an alternative to Euclidean neural networks (ENNs) to further exploit hyperbolic space and can be used for more complex problems. When HNNs are applied to image datasets, they employ a hybrid architecture Khrulkov et al. (2020) as shown in Figure 1: an ENN is first used to extract features from images, the Euclidean embeddings are then projected onto hyperbolic space as hyperbolic embeddings, and finally the hyperbolic embeddings are classified by a hyperbolic multiclass logistic regression Ganea et al. (2018).

While HNNs are able to achieve improvements over ENNs on several datasets with explicit hierarchical structure Ganea et al. (2018), they perform poorly on standard image classification benchmarks. This, undesirably, severely limits the applicability of HNNs. In Khrulkov et al. (2020), while the authors show that several image datasets possess latent hierarchical structure, there are no experimental results showing that HNNs can capture such structure or match the performance of ENNs. Existing improvements to HNNs mainly focus on reducing the number of parameters Shimizu et al. (2020) or incorporating different types of neural network layers such as attention Gulcehre et al. (2018) or convolution Shimizu et al. (2020). Unfortunately, the reason behind the inferior performance of HNNs compared with ENNs on standard image datasets has not been investigated or understood.

Our key insight is that the inferiority of HNNs on standard image datasets is not an intrinsic limitation but stems from an improper training procedure. We first conduct an empirical study showing that the hybrid nature of HNNs leads to a vanishing gradient problem during training with backpropagation. In particular, the training dynamics of HNNs push the hyperbolic embeddings to the boundary of the Poincaré ball Anderson (2006), which causes the gradients of the Euclidean parameters to vanish. Inspired by this analysis, we propose a simple yet effective remedy called Feature Clipping, which constrains the norm of the hyperbolic embedding during training. With the proposed technique, HNNs are on par with ENNs on several standard image classification benchmarks including MNIST, CIFAR10, CIFAR100 and ImageNet. This shows that HNNs can not only perform better than ENNs on hierarchical datasets but also achieve comparable performance to ENNs on standard image datasets, which broadens the application of HNNs to computer vision tasks. The improved HNNs are also more robust than ENNs and exhibit stronger out-of-distribution detection ability.

The contributions of the paper are as follows. (1) We conduct a detailed analysis to understand the underlying issues when applying HNNs to standard image datasets. (2) We propose a simple yet effective solution to the vanishing gradient problem when training HNNs by constraining the norm of the hyperbolic embedding. (3) We conduct extensive experiments on standard image datasets including MNIST, CIFAR10, CIFAR100 and ImageNet. The results show that by addressing the vanishing gradient problem, the performance of HNNs on standard datasets is greatly improved and matches that of ENNs. Meanwhile, the improved HNNs are also more robust to adversarial attacks and exhibit stronger out-of-distribution detection capability than their Euclidean counterparts.

Figure 1: Hyperbolic neural networks employ a hybrid architecture. The Euclidean neural network converts an input into a Euclidean embedding. The Euclidean embedding is then projected onto hyperbolic space via the exponential map Exp. Finally, the hyperbolic embeddings are classified with Poincaré hyperplanes.

2 Related Work

Supervised Learning In the seminal work on Hyperbolic Neural Networks (HNNs) Ganea et al. (2018), the authors proposed different hyperbolic neural network layers, including multinomial logistic regression (MLR), fully connected layers and recurrent neural networks, which operate directly on hyperbolic embeddings. The proposed HNNs outperform their Euclidean variants on textual entailment and noisy-prefix prediction tasks. Recently, Hyperbolic Neural Networks++ Shimizu et al. (2020) was proposed to reduce the number of parameters of HNNs and also introduced hyperbolic convolutional layers. In Gulcehre et al. (2018), the authors proposed hyperbolic attention networks by rewriting the operations in the attention layers using gyrovector operations Ungar (2005), which leads to improvements in neural machine translation, learning on graphs and visual question answering. Hyperbolic graph neural networks Liu et al. (2019) were proposed by extending the representational geometry of Graph Neural Networks (GNNs) Zhou et al. (2020) to hyperbolic space. Hyperbolic graph attention networks Zhang et al. (2019) further studied GNNs with attention mechanisms in hyperbolic space. Recently, hyperbolic neural networks have also been used for tasks such as few-shot classification and person re-identification Khrulkov et al. (2020).

Unsupervised Learning Unsupervised learning methods based on variants of HNNs have also attracted a lot of attention. In Nagano et al. (2019), the authors proposed a wrapped normal distribution on hyperbolic space to construct hyperbolic variational autoencoders (VAEs) Kingma and Welling (2013). In a concurrent work Mathieu et al. (2019), the authors proposed Gaussian generalizations on hyperbolic space to construct Poincaré VAEs. Recent work Hsu et al. (2020) applied hyperbolic neural networks to unsupervised 3D segmentation of complex volumetric data.

Compared with the above-mentioned methods, which focus on applying HNNs to data with natural tree structure, this paper extends the application of HNNs to standard image recognition datasets and improves their performance on these datasets to the level of their Euclidean counterparts, greatly enhancing the universality of HNNs.

3 Free Hyperbolic Neural Networks with Limited Radii

Our goal is to address the vanishing gradient problem when training HNNs. We propose an efficient solution, and the improved HNNs are on par with ENNs on standard recognition datasets while showing better performance in few-shot learning, adversarial robustness and out-of-distribution detection. We first review the basics of HNNs, then analyze the vanishing gradient problem in training HNNs, and finally present the proposed method and show its effectiveness in addressing this issue.

Riemannian Geometry An n-dimensional topological manifold M is a topological space that is locally Euclidean of dimension n: every point has a neighborhood that is homeomorphic to an open subset of ℝ^n. A smooth manifold is a topological manifold with an additional smooth structure, i.e., a maximal smooth atlas. A Riemannian manifold (M, g) is a real smooth manifold M with a Riemannian metric g. The Riemannian metric g is a smoothly varying inner product defined on the tangent spaces of M. For x ∈ M and any two vectors u, v in the tangent space T_x M, the inner product is defined as ⟨u, v⟩_x = g_x(u, v). With this inner product, the norm of u ∈ T_x M is defined as ‖u‖_x = √⟨u, u⟩_x. A geodesic is a curve of unit speed that locally minimizes the distance between two points on the manifold. Given x ∈ M, v ∈ T_x M and a geodesic γ of length ‖v‖_x such that γ(0) = x and γ'(0) = v, the exponential map Exp_x: T_x M → M satisfies Exp_x(v) = γ(1), and the inverse exponential map (logarithmic map) Log_x: M → T_x M satisfies Log_x(Exp_x(v)) = v. For more details please refer to Carmo (1992); Lee (2018).

Poincaré Ball Model for Hyperbolic Space A hyperbolic space is a Riemannian manifold with constant negative curvature. There are several isometric models of hyperbolic space; one of the most commonly used is the Poincaré ball model Nickel and Kiela (2017); Ganea et al. (2018). The n-dimensional Poincaré ball model of constant negative curvature −c is defined as (D_c^n, g_x^c), where D_c^n = {x ∈ ℝ^n : c‖x‖² < 1} and g_x^c = (λ_x^c)² g^E is the Riemannian metric tensor. Here λ_x^c = 2 / (1 − c‖x‖²) is the conformal factor and g^E = I_n is the Euclidean metric tensor. The conformal factor induces the inner product ⟨u, v⟩_x = (λ_x^c)² ⟨u, v⟩ and the norm ‖u‖_x = λ_x^c ‖u‖ for all u, v ∈ T_x D_c^n. The exponential map of the Poincaré ball model can be written analytically with the operations of gyrovector space, which will be introduced in Section 3.

Gyrovector Space A gyrovector space Ungar (2005, 2008) is an algebraic structure that provides an analytic way to operate in hyperbolic space. Each point in hyperbolic space is endowed with vector-like properties similar to points in Euclidean space.

The basic operation in a gyrovector space is Möbius addition ⊕_c. With Möbius addition, the addition of two points in the Poincaré ball model is defined as,

x ⊕_c y = ((1 + 2c⟨x, y⟩ + c‖y‖²) x + (1 − c‖x‖²) y) / (1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖²)    (1)

for all x, y ∈ D_c^n. In particular, as c → 0, ⊕_c converges to the standard + in Euclidean space. Similarly, we can define various operations such as scalar multiplication, subtraction, the exponential map and the inverse exponential map in the Poincaré ball model with the operations of gyrovector space. These operations form the basis for constructing hyperbolic neural network layers as shown in Ganea et al. (2018). For more details, please refer to Appendix A.1.
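
To make the gyrovector algebra concrete, the following is a minimal PyTorch sketch of Möbius addition on the curvature-c Poincaré ball; the function name, the batching convention and the small epsilon guard are our own illustrative choices rather than part of the original paper.

```python
import torch

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Mobius addition x (+)_c y on the Poincare ball of curvature -c (Equation 1).

    Computes ((1 + 2c<x,y> + c||y||^2) x + (1 - c||x||^2) y) divided by
    (1 + 2c<x,y> + c^2 ||x||^2 ||y||^2), applied along the last dimension.
    """
    xy = (x * y).sum(dim=-1, keepdim=True)        # <x, y>
    x2 = x.pow(2).sum(dim=-1, keepdim=True)       # ||x||^2
    y2 = y.pow(2).sum(dim=-1, keepdim=True)       # ||y||^2
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(eps)               # guard against division by ~0

# As c -> 0 the operation reduces to ordinary vector addition:
# mobius_add(x, y, c=1e-8) is approximately x + y.
```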

Hyperbolic Neural Networks In Ganea et al. (2018), the authors derived different hyperbolic neural network layers based on the algebra of gyrovector space. When applied to image datasets Khrulkov et al. (2020), hyperbolic neural networks consist of a Euclidean sub-network and a hyperbolic classifier, as shown in Figure 1. The Euclidean sub-network f_e converts an input x, such as an image, into a representation x^E = f_e(x; θ_e) in Euclidean space. x^E is then projected onto hyperbolic space via the exponential map as x^H = Exp_0^c(x^E). The hyperbolic classifier f_h performs classification based on x^H with the standard cross-entropy loss L.

Let the parameters of the Euclidean sub-network f_e be θ_e and the parameters of the hyperbolic classifier f_h be θ_h. Given the loss function L, the optimization problem can be formalized as,

min_{θ_e, θ_h} L(f_h(Exp_0^c(f_e(x; θ_e)); θ_h), y)    (2)

where the outer function is the hyperbolic classifier composed with the exponential map and the inner function is the Euclidean sub-network f_e. As shown in Ganea et al. (2018), the exponential map at the origin is defined as,

Exp_0^c(v) = tanh(√c ‖v‖) v / (√c ‖v‖)    (3)
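
As a companion to Equation 3, here is a small sketch of the exponential map at the origin, which is how the hybrid architecture projects Euclidean features onto the ball; the helper names in the trailing comment (a backbone and a hyperbolic MLR head) are hypothetical placeholders.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Exponential map at the origin: Exp_0^c(v) = tanh(sqrt(c) ||v||) * v / (sqrt(c) ||v||)."""
    sqrt_c = c ** 0.5
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * v_norm) * v / (sqrt_c * v_norm)

# A hybrid HNN forward pass (Figure 1) is then, schematically:
#   e = euclidean_backbone(images)   # Euclidean sub-network f_e (placeholder name)
#   z = expmap0(e, c=1.0)            # projection onto the Poincare ball
#   logits = hyperbolic_mlr(z)       # hyperbolic multiclass logistic regression (placeholder name)
```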

The construction of the hyperbolic classifier relies on the following definition of Poincaré hyperplanes.

Definition 3.1 (Poincaré hyperplanes Ganea et al. (2018))

For a ∈ T_p D_c^n \ {0} and p ∈ D_c^n, the Poincaré hyperplane is defined as,

H_{a,p}^c = {x ∈ D_c^n : ⟨−p ⊕_c x, a⟩ = 0}    (4)

where a is the normal vector and p defines the bias of the Poincaré hyperplane.

As shown in Ganea et al. (2018), in hyperbolic space the probability that a given x ∈ D_c^n is classified as class k is,

p(y = k | x) ∝ exp(sign(⟨−p_k ⊕_c x, a_k⟩) λ_{p_k}^c ‖a_k‖ d_c(x, H_{a_k,p_k}^c))    (5)

where d_c(x, H_{a_k,p_k}^c) is the distance of the embedding x to the Poincaré hyperplane of class k. In the hyperbolic classifier, the parameters θ_h are the vectors {a_k, p_k} for each class k.
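
A sketch of the resulting hyperbolic MLR scores is shown below; it computes, for each class, the conformal factor at p_k, the norm of a_k and the signed distance to the class hyperplane, following our reading of the formulation in Ganea et al. (2018). It reuses the `mobius_add` sketch above, and all names are illustrative.

```python
import torch

def hyperbolic_mlr_logits(x: torch.Tensor, a: torch.Tensor, p: torch.Tensor,
                          c: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Signed-distance logits for hyperbolic MLR (Equation 5).

    x: (B, d) points on the Poincare ball; a: (K, d) normal vectors; p: (K, d) offsets.
    Returns (B, K) scores proportional to lambda_{p_k}^c ||a_k|| times the signed
    distance of x to the hyperplane of class k.
    """
    sqrt_c = c ** 0.5
    z = mobius_add(-p.unsqueeze(0), x.unsqueeze(1), c)        # z_k = (-p_k) (+)_c x, shape (B, K, d)
    za = (z * a.unsqueeze(0)).sum(dim=-1)                     # <z_k, a_k>
    z2 = z.pow(2).sum(dim=-1)                                 # ||z_k||^2
    a_norm = a.norm(dim=-1).clamp_min(eps)                    # ||a_k||
    lam_p = 2.0 / (1.0 - c * p.pow(2).sum(dim=-1)).clamp_min(eps)   # conformal factor at p_k
    signed_dist = torch.asinh(2 * sqrt_c * za / ((1 - c * z2).clamp_min(eps) * a_norm)) / sqrt_c
    return lam_p.unsqueeze(0) * a_norm.unsqueeze(0) * signed_dist
```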

Training Hyperbolic Neural Networks with Backpropagation The standard backpropagation algorithm Rumelhart et al. (1986) is used for training HNNs Ganea et al. (2018); Khrulkov et al. (2020). During backpropagation, the gradient of the Euclidean parameters θ_e can be computed as,

∂L/∂θ_e = (∂z/∂θ_e)^⊤ ∇_z L    (6)

where z = Exp_0^c(f_e(x; θ_e)) is the hyperbolic embedding of the input x, ∂z/∂θ_e is the Jacobian matrix and ∇_z L is the gradient of the loss function with respect to the hyperbolic embedding z. It is noteworthy that since z is an embedding in hyperbolic space, ∇_z L is the Riemannian gradient Bonnabel (2013) and

∇_z L = (1/λ_z^c)² ∇_z^E L = ((1 − c‖z‖²)/2)² ∇_z^E L    (7)

where ∇_z^E L is the Euclidean gradient and (1/λ_z^c)² is the inverse of the Riemannian metric tensor.

Vanishing Gradient Problem At initialization, all the hyperbolic embeddings of the inputs are located near the center of the Poincaré ball. From Equation 5, we can see that in order to maximize the probability of the correct prediction, we need to increase the distance of the hyperbolic embedding to the corresponding Poincaré hyperplane, i.e., d_c(x, H_{a_k,p_k}^c). The training dynamics of HNNs thus push the hyperbolic embeddings to the boundary of the Poincaré ball, in which case c‖z‖² approaches one. The inverse of the Riemannian metric tensor ((1 − c‖z‖²)/2)² then approaches zero, which causes ∇_z L to be small. From Equation 6, it is easy to see that if ∇_z L is small, then ∂L/∂θ_e is small and the optimization makes no progress on θ_e.
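
The collapse of the gradient scale can be checked numerically; the snippet below, with illustrative norms of our own choosing, evaluates the inverse metric factor from Equation 7 for embeddings increasingly close to the boundary of the unit-curvature ball.

```python
import torch

c = 1.0
for norm in [0.1, 0.5, 0.9, 0.99, 0.999]:
    z = torch.zeros(2)
    z[0] = norm                                   # embedding at distance `norm` from the origin
    scale = ((1 - c * z.pow(2).sum()) / 2) ** 2   # (1 / lambda_z^c)^2, the inverse metric factor
    print(f"||z|| = {norm:.3f}  ->  gradient scale = {scale.item():.2e}")

# ||z|| = 0.999 gives a scale of about 1e-6: the Riemannian gradient, and hence the
# gradient reaching the Euclidean parameters through Equation 6, effectively vanishes.
```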

To obtain further understanding, we conduct an experiment to illustrate the vanishing gradient problem when training hyperbolic neural networks. We train a LeNet-like convolutional neural network LeCun et al. (1998) with a hyperbolic classifier on the MNIST data. We use a two-dimensional Poincaré ball for visualization. In Figure 2, we show the trajectories of the hyperbolic embeddings of six randomly sampled inputs during training. The arrows indicate the movement of each embedding after one gradient update step. It can be observed that at initialization all the hyperbolic embeddings are close to the center of the Poincaré ball. During training, the hyperbolic embeddings gradually move to the boundary of the ball. The magnitude of the gradient diminishes during training as the training loss decays. However, at the end of training, although the training loss slightly increases, the gradient vanishes because the hyperbolic embeddings approach the boundary of the ball.

The vanishing gradient problem Hochreiter (1998); Pennington et al. (2017, 2018); Hanin (2018) is one of the main difficulties in training deep neural networks with backpropagation. It occurs when the magnitude of the gradient is too small for the optimization to make progress. For standard Euclidean neural networks, the vanishing gradient problem can be alleviated by architecture design Hochreiter and Schmidhuber (1997); He et al. (2016), proper weight initialization Mishkin and Matas (2015) and carefully chosen activation functions Xu et al. (2015a). However, the vanishing gradient problem in training HNNs has not been explored in the existing literature.

Figure 2: Hyperbolic neural networks suffer from the vanishing gradient problem when trained with backpropagation. Left: the trajectories of the hyperbolic embeddings of six randomly sampled inputs during training in a 2-dimensional Poincaré ball. The arrows indicate the change of location of each embedding with each gradient update. The embeddings move to the boundary of the ball during optimization, which causes the vanishing gradient problem. Right: the gradient vanishes while the training loss goes up at the end of training.

The Effect of a Gradient Update of the Euclidean Parameters on the Hyperbolic Embedding We derive the effect of a single gradient update of the Euclidean parameters on the hyperbolic embedding; for more details please refer to Appendix A.2. For the Euclidean sub-network f_e(x; θ_e), consider the first-order Taylor expansion after a single gradient update,

f_e(x; θ_e − η ∂L/∂θ_e) ≈ f_e(x; θ_e) − η (∂f_e(x; θ_e)/∂θ_e) (∂L/∂θ_e)    (8)

where η is the learning rate. The Jacobian of the exponential map at the origin can be computed as,

∂Exp_0^c(v)/∂v = (tanh(√c‖v‖)/(√c‖v‖)) (I − vv^⊤/‖v‖²) + sech²(√c‖v‖) vv^⊤/‖v‖²    (9)

Let z^t be the projected point in hyperbolic space after the t-th update, i.e.,

z^t = Exp_0^c(f_e(x; θ_e^t))    (10)

By applying the first-order Taylor expansion to the exponential map and following standard derivations, we find that,

z^{t+1} ≈ z^t − η (∂Exp_0^c(v)/∂v)|_{v = f_e(x; θ_e^t)} (∂f_e(x; θ_e^t)/∂θ_e) (∂L/∂θ_e)    (11)

where the radial component of the Jacobian in Equation 9 equals sech²(√c‖v‖) = 1 − c‖z^t‖². Thus, once √c‖z^t‖ approaches one, the hyperbolic embedding stagnates no matter how large the training loss is.

Feature clipping for training hyperbolic neural networks There are several possible solutions to the vanishing gradient problem when training HNNs. One tentative solution is to replace all the Euclidean layers with hyperbolic layers; however, it is not clear how to directly map the original input images onto hyperbolic space. Another solution is to use normalized gradient descent Hazan et al. (2015) for optimizing the Euclidean parameters to reduce the effect of the gradient magnitude. However, we observed that this introduces instability during training and makes it harder to tune the learning rate for the Euclidean parameters.

We address the vanishing gradient problem by first reformulating the optimization problem in Equation 2 with a regularization term that penalizes the norm of the hyperbolic embedding z = Exp_0^c(f_e(x; θ_e)),

(12)

where γ is a hyperparameter controlling the strength of the regularization. By minimizing the training loss, the hyperbolic embeddings tend to move to the boundary of the Poincaré ball, which causes the vanishing gradient problem. The additional regularization term prevents the hyperbolic embeddings from approaching the boundary.

While the soft constraint introduced in Equation 12 is effective, it adds complexity to the optimization process. We instead employ the following hard constraint, which rescales the Euclidean embedding before the exponential map whenever its norm exceeds a given threshold,

x_C^E := min{1, r / ‖x^E‖} · x^E    (13)

where x^E = f_e(x; θ_e) and r is a hyperparameter. Using the relation between the hyperbolic distance and the Euclidean distance from the origin,

d_c(0, x^H) = (2/√c) tanh⁻¹(√c ‖x^H‖)    (14)

where x^H = Exp_0^c(x_C^E) and c is the curvature, r can be converted into the effective radius of the Poincaré ball.

The proposed Feature Clipping imposes a hard constraint on the maximum norm of the hyperbolic embedding to prevent the inverse of the Riemannian metric tensor from approaching zero. Therefore there is always a gradient signal for optimizing the hyperbolic embedding. Although decreasing the norm of the hyperbolic embedding shrinks the effective radius of the embedding space, in the experiments we found that it does no harm to accuracy while alleviating the vanishing gradient problem.
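
A minimal sketch of the proposed Feature Clipping as described in Equation 13 is given below; the function name is ours, and r = 1.0 mirrors the value used in the experiments. The usage comment assumes the `expmap0` sketch from Section 3 and a hypothetical Euclidean backbone.

```python
import torch

def clip_feature(x_e: torch.Tensor, r: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Rescale x_e so that ||x_e|| <= r, i.e. x_e <- min(1, r / ||x_e||) * x_e."""
    norm = x_e.norm(dim=-1, keepdim=True).clamp_min(eps)
    scale = torch.clamp(r / norm, max=1.0)
    return x_e * scale

# In the hybrid pipeline the clipping sits between the backbone and the exponential map:
#   e = euclidean_backbone(images)                  # Euclidean features (placeholder backbone)
#   z = expmap0(clip_feature(e, r=1.0), c=1.0)
# With c = 1 and r = 1, the hyperbolic embeddings stay within Euclidean norm
# tanh(1) ~= 0.76 of the ball, so the inverse metric factor in Equation 7 never collapses.
```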

4 Experimental Settings and Evaluation Protocol

We conduct extensive experiments on standard recognition datasets to show the effectiveness of the proposed feature clipping for HNNs. The results show that HNNs with feature clipping are on par with ENNs on standard recognition datasets while demonstrating better performance in terms of few-shot classification, adversarial robustness and out-of-distribution detection.

Figure 3: HNNs with feature clipping learn more discriminative features in hyperbolic space. The per-class accuracy in the center figure indicates that the baseline HNN learns a biased feature space, which hurts the performance of certain classes. Left: the Poincaré decision hyperplanes and the hyperbolic embeddings of sampled test images for the baseline HNN. Center: the per-class test accuracy of the baseline HNN and the HNN with feature clipping. Right: the Poincaré decision hyperplanes and the hyperbolic embeddings of sampled test images for the HNN with feature clipping.

Datasets We conduct experiments on commonly used image classification datasets: MNIST LeCun (1998), CIFAR10 Krizhevsky et al. (2009), CIFAR100 Krizhevsky et al. (2009) and ImageNet Deng et al. (2009). The details of these datasets can be found in Appendix A.3. To the best of our knowledge, this paper is the first attempt to extensively evaluate hyperbolic neural networks on standard image classification datasets for supervised classification.

Baselines and Networks We compare the performance of HNNs trained with and without the proposed feature clipping Ganea et al. (2018); Khrulkov et al. (2020) and their Euclidean counterparts. For MNIST, we use a LeNet-like convolutional neural network LeCun et al. (1998) which has two convolutional layers with max pooling layers in between and three fully connected layers. For CIFAR10 and CIFAR100, we use WideResNet Zagoruyko and Komodakis (2016). For ImageNet, we use a standard ResNet18 He et al. (2016).

Training Setups For training ENNs, we use SGD with momentum as the optimizer. For training HNNs, the Euclidean parameters are trained using SGD, and the hyperbolic parameters are optimized using stochastic Riemannian gradient descent Bonnabel (2013), following previous work. On MNIST, we train the network for 10 epochs with a learning rate of 0.1 and a batch size of 64. On CIFAR10 and CIFAR100, we train the network for 100 epochs with an initial learning rate of 0.1 and a cosine learning rate schedule Loshchilov and Hutter (2016), with a batch size of 128. On ImageNet, we train the network for 100 epochs with an initial learning rate of 0.1 that decays by a factor of 10 every 30 epochs, with a batch size of 256. We find that hyperbolic neural networks are robust to the choice of the hyperparameter r, so we fix r to 1.0 in all the experiments. For more discussion and results on the effect of r, please see Appendix A.4. The experiments on MNIST, CIFAR10 and CIFAR100 are repeated 5 times and we report both the average accuracy and the standard deviation. All the experiments are run on 8 NVIDIA TITAN RTX GPUs.
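
For concreteness, the following is a rough, self-contained sketch of the training loop implied by this setup, with a toy backbone, the hyperbolic MLR and clipping sketches from Section 3, and plain SGD standing in for Riemannian SGD on the hyperbolic parameters (a Riemannian optimizer such as the one provided by the geoopt library would be used in practice); all modules, sizes and the dummy batch are illustrative.

```python
import torch
import torch.nn as nn

class HyperbolicMLR(nn.Module):
    """Learnable (a_k, p_k) wrapper around the hyperbolic_mlr_logits sketch above."""
    def __init__(self, dim: int, n_classes: int, c: float = 1.0):
        super().__init__()
        self.c = c
        self.a = nn.Parameter(0.01 * torch.randn(n_classes, dim))   # tangent normal vectors a_k
        self.p = nn.Parameter(torch.zeros(n_classes, dim))          # hyperplane offsets p_k (start at origin)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return hyperbolic_mlr_logits(z, self.a, self.p, self.c)

backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64))      # toy Euclidean sub-network f_e
head = HyperbolicMLR(dim=64, n_classes=10)                          # hyperbolic classifier f_h

opt_euclidean = torch.optim.SGD(backbone.parameters(), lr=0.1, momentum=0.9)
# Strictly, p_k lives on the Poincare ball and should be updated with Riemannian SGD
# (Bonnabel, 2013); plain SGD is used here only to keep the sketch dependency-free.
opt_hyperbolic = torch.optim.SGD(head.parameters(), lr=0.1)

images = torch.randn(64, 1, 28, 28)                  # dummy MNIST-sized batch
labels = torch.randint(0, 10, (64,))
z = expmap0(clip_feature(backbone(images), r=1.0), c=1.0)   # clip, then project (earlier sketches)
loss = nn.functional.cross_entropy(head(z), labels)
opt_euclidean.zero_grad(); opt_hyperbolic.zero_grad()
loss.backward()
opt_euclidean.step(); opt_hyperbolic.step()
```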

Results on the standard benchmarks In Table 1, we show the results of the different networks on the considered benchmarks. On MNIST, the accuracy of the improved hyperbolic neural network with feature clipping is about 5% higher than that of the baseline hyperbolic neural network and matches the performance of the Euclidean neural network. On CIFAR10, CIFAR100 and ImageNet, the improved hyperbolic neural networks achieve 5.94%, 3.62% and 2.79% improvements over the baseline hyperbolic neural networks. The results show that hyperbolic neural networks can perform well even on datasets that lack explicit hierarchical structure.

In Figure 3 we show the Poincaré hyperplanes of all the classes and the hyperbolic embeddings of 1000 sampled test images extracted by the baseline HNN and the HNN with feature clipping. Note that the Poincaré hyperplanes consist of arcs of Euclidean circles that are orthogonal to the boundary of the ball. We also color the points in the ball based on the classification results. It can be observed that by regularizing the magnitude of the hyperbolic embedding, all the embeddings lie in a restricted region of the Poincaré ball and the network learns more regular and discriminative features in hyperbolic space.

Methods MNIST CIFAR10 CIFAR100 ImageNet
Euclidean He et al. (2016) 99.12 ± 0.34% 94.81 ± 0.42% 76.24 ± 0.35% 69.82%
Hyperbolic Ganea et al. (2018); Khrulkov et al. (2020) 94.42 ± 0.29% 88.82 ± 0.51% 72.26 ± 0.41% 65.74%
Hyperbolic w/ Feature Clipping 99.08 ± 0.31% 94.76 ± 0.44% 75.88 ± 0.38% 68.45%
Gain over vanilla HNN +4.66% +5.94% +3.62% +2.79%
Table 1: By incorporating the proposed method, the performance gap between Hyperbolic Neural Networks (HNNs) and Euclidean Neural Networks (ENNs) can be largely closed on all experimented benchmarks. Top-1 accuracies on standard image classification datasets are compared. Top-1 accuracy gains over the vanilla HNNs Ganea et al. (2018); Khrulkov et al. (2020) are shown in the last row.
Methods Embedding Net 1-Shot 5-Way 5-Shot 5-Way
ProtoNet Snell et al. (2017) 4 Conv 51.31 ± 0.91% 70.77 ± 0.69%
Hyperbolic ProtoNet Khrulkov et al. (2020) 4 Conv 61.18 ± 0.24% 79.51 ± 0.16%
Hyperbolic ProtoNet w/ Feature Clipping 4 Conv 64.66 ± 0.24% 81.76 ± 0.15%
Table 2: Hyperbolic embeddings provide a better alternative to Euclidean embeddings on few-shot learning tasks, and further improvements can be obtained through the proposed feature clipping method. Shown are few-shot classification results on the fine-grained CUB dataset for 1-shot 5-way and 5-shot 5-way tasks. All accuracies are reported with 95% confidence intervals.

Methods Embedding Net 1-Shot 5-Way 5-Shot 5-Way
ProtoNet Snell et al. (2017) 4 Conv 49.42 ± 0.78% 68.20 ± 0.66%
Hyperbolic ProtoNet Khrulkov et al. (2020) 4 Conv 51.88 ± 0.20% 72.63 ± 0.16%
Hyperbolic ProtoNet w/ Feature Clipping 4 Conv 53.01 ± 0.22% 72.66 ± 0.15%
Table 3: Few-shot classification results on miniImageNet on 1-shot 5-way and 5-shot 5-way tasks. All accuracies are reported with 95% confidence intervals.

Few-shot Learning We show that the proposed feature clipping can also improve the performance of Hyperbolic ProtoNet Khrulkov et al. (2020) for few-shot learning. Unlike the standard ProtoNet Snell et al. (2017), which computes the prototype of each class in Euclidean space, Hyperbolic ProtoNet computes the class prototypes in hyperbolic space using hyperbolic averaging, as sketched below. Hyperbolic geometry has been shown to learn more accurate embeddings than Euclidean geometry for few-shot learning Khrulkov et al. (2020).
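
To make hyperbolic averaging concrete, below is a sketch of prototype computation via the Einstein midpoint, the construction used by Khrulkov et al. (2020): points are mapped from the Poincaré ball to the Klein model, averaged with Lorentz-factor weights, and mapped back. The helper name and numerical guards are ours.

```python
import torch

def poincare_mean(x: torch.Tensor, c: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Einstein-midpoint average of points x (shape (N, d)) on the Poincare ball."""
    x2 = x.pow(2).sum(dim=-1, keepdim=True)
    k = 2 * x / (1 + c * x2)                                   # Poincare -> Klein coordinates
    gamma = 1.0 / torch.sqrt((1 - c * k.pow(2).sum(dim=-1, keepdim=True)).clamp_min(eps))  # Lorentz factors
    mean_k = (gamma * k).sum(dim=0) / gamma.sum(dim=0)         # weighted midpoint in the Klein model
    mk2 = mean_k.pow(2).sum(dim=-1, keepdim=True)
    return mean_k / (1 + torch.sqrt((1 - c * mk2).clamp_min(eps)))   # Klein -> Poincare

# Few-shot prototype for one class: prototype = poincare_mean(support_embeddings_of_that_class)
```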

We follow the experimental settings in Khrulkov et al. (2020) and conduct experiments on the CUB dataset Welinder et al. (2010) and the miniImageNet dataset Russakovsky et al. (2015). We consider 1-shot 5-way and 5-shot 5-way tasks as in Khrulkov et al. (2020). The evaluation is repeated 10,000 times and we report the average performance and the 95% confidence interval. Table 2 and Table 3 show that the proposed feature clipping further improves the accuracy of Hyperbolic ProtoNet for few-shot classification by as much as 3%.

Figure 4: Adversarial robustness of hyperbolic neural networks (HNNs) and Euclidean neural networks (ENNs) under different attack methods and perturbation budgets. Panels (a) and (b): MNIST; panel (c): CIFAR10.

Adversarial Robustness We show that HNNs are more robust than ENNs to adversarial attacks, including FGSM Goodfellow et al. (2014) and PGD Madry et al. (2017). We train the networks regularly, without adversarial training, using the setups described in Section 4. For attacking networks trained on MNIST, we consider a range of perturbation budgets for both FGSM and PGD; the number of PGD steps is 40. For attacking networks trained on CIFAR10 with PGD, we again sweep the perturbation budget; the number of PGD steps is 7. From Figure 4 we can see that across all cases hyperbolic neural networks are more robust than ENNs to adversarial attacks. More results and discussion can be found in Appendix A.5.
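
For reference, a minimal sketch of the FGSM attack of Goodfellow et al. (2014) used in this evaluation; `model` is any classifier mapping images to logits, and the default epsilon is illustrative.

```python
import torch

def fgsm_attack(model, images: torch.Tensor, labels: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """One-step FGSM: move each input by eps in the sign of the input gradient of the loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    adv = images + eps * grad.sign()
    return adv.clamp(0.0, 1.0).detach()   # keep pixels in the valid [0, 1] range
```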

Out-of-distribution Detection We conduct experiments to show that HNNs have stronger out-of-distribution detection capability than ENNs. Out-of-distribution detection aims at determining whether or not a given input comes from the same distribution as the training data. We follow the experimental settings in Liu et al. (2020). The in-distribution datasets are CIFAR10 and CIFAR100. The out-of-distribution datasets are iSUN Xu et al. (2015b), Places365 Zhou et al. (2017), Texture Cimpoi et al. (2014), SVHN Netzer et al. (2011), LSUN-Crop Yu et al. (2015) and LSUN-Resize Yu et al. (2015). We use the same networks and training setups as described in Section 4 for the models trained on CIFAR10 and CIFAR100. For detecting out-of-distribution data, we use both the softmax score and the energy score as described in Liu et al. (2020). For metrics, we consider FPR95, AUROC and AUPR Liu et al. (2020). In Table 4 and Table 5 we show the results of using the softmax score on CIFAR10 and CIFAR100, respectively. We can see that HNNs and ENNs achieve similar AUPR, but HNNs achieve much better performance in terms of FPR95 and AUROC. In particular, HNNs reduce FPR95 by 5.82% and 9.55% on CIFAR10 and CIFAR100, respectively. For results using the energy score, please see Appendix A.6.
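
A short sketch of the two detection scores, following Liu et al. (2020): the maximum softmax probability and the negative free energy computed from the logits. In both cases a higher score is treated as more in-distribution, and the thresholding is left to the evaluation code.

```python
import torch

def softmax_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability; higher means more likely in-distribution."""
    return torch.softmax(logits, dim=-1).max(dim=-1).values

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Negative free energy, -E(x) = T * logsumexp(logits / T); higher means more in-distribution."""
    return T * torch.logsumexp(logits / T, dim=-1)
```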

OOD Dataset Euclidean Neural Network (FPR95↓ / AUROC↑ / AUPR↑) Hyperbolic Neural Network (FPR95↓ / AUROC↑ / AUPR↑)
iSUN 46.30 ± 0.78 / 91.50 ± 0.16 / 98.16 ± 0.05 45.28 ± 0.65 / 91.61 ± 0.21 / 98.09 ± 0.06
Places365 51.09 ± 0.92 / 87.56 ± 0.37 / 96.76 ± 0.15 54.77 ± 0.76 / 86.82 ± 0.41 / 96.17 ± 0.20
Texture 65.04 ± 0.91 / 82.80 ± 0.35 / 94.59 ± 0.20 47.12 ± 0.62 / 89.91 ± 0.20 / 97.39 ± 0.09
SVHN 71.66 ± 0.84 / 86.58 ± 0.21 / 97.06 ± 0.06 49.89 ± 1.03 / 91.34 ± 0.22 / 98.13 ± 0.06
LSUN-Crop 22.22 ± 0.78 / 96.05 ± 0.10 / 99.16 ± 0.03 23.87 ± 0.73 / 95.65 ± 0.22 / 98.98 ± 0.07
LSUN-Resize 41.06 ± 1.07 / 92.67 ± 0.16 / 98.42 ± 0.04 41.49 ± 1.24 / 92.97 ± 0.24 / 98.46 ± 0.07
Mean 49.56 / 89.53 / 97.36 43.74 / 91.38 / 97.87
Table 4: The results of out-of-distribution detection on CIFAR10 with the softmax score.
OOD Dataset Euclidean Neural Network (FPR95↓ / AUROC↑ / AUPR↑) Hyperbolic Neural Network (FPR95↓ / AUROC↑ / AUPR↑)
iSUN 74.07 ± 0.87 / 82.51 ± 0.39 / 95.83 ± 0.11 68.37 ± 0.90 / 81.31 ± 0.43 / 94.96 ± 0.20
Places365 81.01 ± 1.07 / 76.90 ± 0.45 / 94.02 ± 0.15 79.66 ± 0.69 / 76.94 ± 0.28 / 93.91 ± 0.18
Texture 83.67 ± 0.68 / 77.52 ± 0.32 / 94.47 ± 0.10 64.91 ± 0.80 / 83.26 ± 0.25 / 95.77 ± 0.08
SVHN 84.56 ± 0.78 / 84.32 ± 0.22 / 96.69 ± 0.07 53.11 ± 1.04 / 89.53 ± 0.26 / 97.71 ± 0.07
LSUN-Crop 43.46 ± 0.79 / 93.09 ± 0.23 / 98.58 ± 0.05 51.08 ± 1.17 / 87.21 ± 0.39 / 96.83 ± 0.13
LSUN-Resize 71.50 ± 0.73 / 82.12 ± 0.40 / 95.69 ± 0.13 63.86 ± 1.10 / 82.36 ± 0.42 / 95.16 ± 0.13
Mean 73.05 / 82.74 / 95.88 63.50 / 83.43 / 95.72
Table 5: The results of out-of-distribution detection on CIFAR100 with the softmax score.

5 Conclusion

We address an important issue in training HNNs that has been overlooked in the previous literature. We identify the vanishing gradient problem when training hyperbolic neural networks and propose a simple yet effective solution that does not require modifying the optimizer or the architecture. We conduct extensive experiments on commonly used image benchmarks including MNIST, CIFAR10, CIFAR100 and ImageNet. Hyperbolic neural networks with feature clipping show significant improvements over baseline HNNs and match the performance of ENNs. The proposed method also improves the performance of hyperbolic neural networks for few-shot learning. Further studies reveal that hyperbolic neural networks are more robust to adversarial attacks and have stronger out-of-distribution detection capability.

6 Discussion on Societal Impacts and Limitations

This paper addresses an important problem in geometric deep learning. Our contribution has both theoretical and practical aspects. The proposed method can advance the progress of hyperbolic neural networks, and we expect to see more applications of hyperbolic neural networks in computer vision. The proposed method can also help hyperbolic neural networks model long-tailed data to address fairness issues. We believe the proposed method does not raise any ethical concerns.

In terms of limitations, there are still several unexplored aspects of HNNs. Future research could study and improve the applications of HNNs for computer vision tasks such as semantic segmentation and object detection.

References

  • [1] G. Alanis-Lobato, P. Mier, and M. A. Andrade-Navarro (2016) Efficient embedding of complex networks to hyperbolic space via their laplacian. Scientific reports 6 (1), pp. 1–10. Cited by: §1.
  • [2] J. W. Anderson (2006) Hyperbolic geometry. Springer Science & Business Media. Cited by: §1.
  • [3] S. Bonnabel (2013) Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control 58 (9), pp. 2217–2229. Cited by: §3, §4.
  • [4] M. P. d. Carmo (1992) Riemannian geometry. Birkhäuser. Cited by: §3.
  • [5] H. Cho, B. DeMeo, J. Peng, and B. Berger (2019) Large-margin classification in hyperbolic space. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1832–1840. Cited by: §1.
  • [6] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613. Cited by: §4.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.
  • [8] O. Ganea, G. Bécigneul, and T. Hofmann (2018) Hyperbolic neural networks. arXiv preprint arXiv:1805.09112. Cited by: Definition A.4, §1, §1, §2, Definition 3.1, §3, §3, §3, §3, §3, §3, Table 1, §4.
  • [9] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §4.
  • [10] C. Gulcehre, M. Denil, M. Malinowski, A. Razavi, R. Pascanu, K. M. Hermann, P. Battaglia, V. Bapst, D. Raposo, A. Santoro, et al. (2018) Hyperbolic attention networks. arXiv preprint arXiv:1805.09786. Cited by: §1, §2.
  • [11] M. Gupte, P. Shankar, J. Li, S. Muthukrishnan, and L. Iftode (2011) Finding hierarchy in directed online social networks. In Proceedings of the 20th international conference on World wide web, pp. 557–566. Cited by: §1.
  • [12] B. Hanin (2018) Which neural net architectures give rise to exploding and vanishing gradients?. arXiv preprint arXiv:1801.03744. Cited by: §3.
  • [13] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz (2015) Beyond convexity: stochastic quasi-convex optimization. arXiv preprint arXiv:1507.02030. Cited by: §3.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3, Table 1, §4.
  • [15] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.
  • [16] S. Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116. Cited by: §3.
  • [17] J. Hsu, J. Gu, G. Wu, W. Chiu, and S. Yeung (2020) Learning hyperbolic representations for unsupervised 3d segmentation. arXiv preprint arXiv:2012.01644. Cited by: §2.
  • [18] V. Khrulkov, L. Mirvakhabova, E. Ustinova, I. Oseledets, and V. Lempitsky (2020) Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6418–6428. Cited by: §1, §1, §2, §3, §3, Table 1, Table 2, Table 3, §4, §4, §4.
  • [19] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [20] A. Klimovskaia, D. Lopez-Paz, L. Bottou, and M. Nickel (2020) Poincaré maps for analyzing complex hierarchies in single-cell data. Nature communications 11 (1), pp. 1–9. Cited by: §1.
  • [21] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: Table 6, §4.
  • [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3, §4.
  • [23] Y. LeCun (1998) The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/. Cited by: Table 6, §4.
  • [24] J. M. Lee (2018) Introduction to riemannian manifolds. Springer. Cited by: §3.
  • [25] Q. Liu, M. Nickel, and D. Kiela (2019) Hyperbolic graph neural networks. arXiv preprint arXiv:1910.12892. Cited by: §2.
  • [26] W. Liu, X. Wang, J. D. Owens, and Y. Li (2020) Energy-based out-of-distribution detection. arXiv preprint arXiv:2010.03759. Cited by: §A.6, §4.
  • [27] I. Loshchilov and F. Hutter (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.
  • [28] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §4.
  • [29] E. Mathieu, C. L. Lan, C. J. Maddison, R. Tomioka, and Y. W. Teh (2019) Continuous hierarchical representations with Poincaré variational auto-encoders. arXiv preprint arXiv:1901.06033. Cited by: §2.
  • [30] G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
  • [31] D. Mishkin and J. Matas (2015) All you need is a good init. arXiv preprint arXiv:1511.06422. Cited by: §3.
  • [32] Y. Nagano, S. Yamaguchi, Y. Fujita, and M. Koyama (2019) A wrapped normal distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning, pp. 4693–4702. Cited by: §2.
  • [33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.
  • [34] M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. arXiv preprint arXiv:1705.08039. Cited by: §1, §3.
  • [35] J. Pennington, S. Schoenholz, and S. Ganguli (2018) The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pp. 1924–1932. Cited by: §3.
  • [36] J. Pennington, S. S. Schoenholz, and S. Ganguli (2017) Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. arXiv preprint arXiv:1711.04735. Cited by: §3.
  • [37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. nature 323 (6088), pp. 533–536. Cited by: §3.
  • [38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: Table 6, §4.
  • [39] F. Sala, C. De Sa, A. Gu, and C. Ré (2018) Representation tradeoffs for hyperbolic embeddings. In International conference on machine learning, pp. 4460–4469. Cited by: §1.
  • [40] R. Sarkar (2011) Low distortion delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pp. 355–366. Cited by: §1.
  • [41] R. Shimizu, Y. Mukuta, and T. Harada (2020) Hyperbolic neural networks++. arXiv preprint arXiv:2006.08210. Cited by: §1, §2.
  • [42] J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175. Cited by: Table 2, Table 3, §4.
  • [43] A. A. Ungar (2001) Hyperbolic trigonometry and its application in the poincaré ball model of hyperbolic geometry. Computers & Mathematics with Applications 41 (1-2), pp. 135–147. Cited by: §A.1.
  • [44] A. A. Ungar (2005) Analytic hyperbolic geometry: mathematical foundations and applications. World Scientific. Cited by: §A.1, §2, §3.
  • [45] A. A. Ungar (2008) A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics 1 (1), pp. 1–194. Cited by: §A.1, §3.
  • [46] M. Weber, M. Zaheer, A. S. Rawat, A. Menon, and S. Kumar (2020) Robust large-margin learning in hyperbolic space. arXiv preprint arXiv:2004.05465. Cited by: §1.
  • [47] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §4.
  • [48] B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.
  • [49] P. Xu, K. A. Ehinger, Y. Zhang, A. Finkelstein, S. R. Kulkarni, and J. Xiao (2015) Turkergaze: crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755. Cited by: §4.
  • [50] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.
  • [51] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.
  • [52] Y. Zhang, X. Wang, X. Jiang, C. Shi, and Y. Ye (2019) Hyperbolic graph attention network. arXiv preprint arXiv:1912.03046. Cited by: §2.
  • [53] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464. Cited by: §4.
  • [54] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2020) Graph neural networks: a review of methods and applications. AI Open 1, pp. 57–81. Cited by: §2.

Appendix A Appendix

A.1 Gyrovector space

We give more details on gyrovector space, for a more systematic treatment, please refer to [44, 45, 43].

Gyrovector space provides a way to operate in hyperbolic space with vector algebra. Gyrovector spaces are to hyperbolic geometry what vector spaces are to Euclidean geometry. The geometric objects in a gyrovector space are called gyrovectors, which are equivalence classes of directed gyrosegments. Similar to vectors in Euclidean space, which are added according to the parallelogram law, gyrovectors are added according to the gyroparallelogram law. Technically, gyrovector spaces are gyrocommutative gyrogroups of gyrovectors that admit scalar multiplication.

We start with the notion of gyrogroups, which give rise to gyrovector spaces.

Definition A.1 (Gyrogroups)

A groupoid (G, ⊕) is a gyrogroup if it satisfies the following axioms:

  1. There exists an element 0 ∈ G satisfying 0 ⊕ a = a for all a ∈ G.

  2. For each a ∈ G, there exists an element ⊖a ∈ G which satisfies ⊖a ⊕ a = 0.

  3. For every a, b, c ∈ G, there exists a unique element gyr[a, b]c ∈ G such that ⊕ satisfies the left gyroassociative law a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ gyr[a, b]c.

  4. The map gyr[a, b]: G → G given by c ↦ gyr[a, b]c is an automorphism of the groupoid (G, ⊕): gyr[a, b] ∈ Aut(G, ⊕). The automorphism gyr[a, b] of G is called the gyroautomorphism of G generated by a, b ∈ G.

  5. The operation gyr: G × G → Aut(G, ⊕) is called the gyrator of G. The gyroautomorphism gyr[a, b] generated by any a, b ∈ G has the left loop property: gyr[a, b] = gyr[a ⊕ b, b].

In particular, the Möbius complex disk groupoid (D, ⊕_M), where D = {z ∈ ℂ : |z| < 1} and ⊕_M is Möbius addition, is a gyrocommutative gyrogroup. The same applies to the s-ball, which is defined as,

V_s = {v ∈ V : ‖v‖ < s}    (15)

where V is a real inner product space.

Gyrocommutative gyrogroups that admit scalar multiplication become gyrovector spaces. Möbius gyrogroups that admit scalar multiplication become Möbius gyrovector spaces (V_s, ⊕, ⊗).

Definition A.2 (Möbius Scalar Multiplication)

Let (V_s, ⊕) be a Möbius gyrogroup. The Möbius scalar multiplication is defined as,

r ⊗ v = s tanh(r tanh⁻¹(‖v‖ / s)) v / ‖v‖    (16)

where r ∈ ℝ, v ∈ V_s, v ≠ 0, and r ⊗ 0 = 0.
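
A small sketch of Möbius scalar multiplication for the curvature-c ball (the s-ball form above with s = 1/√c); as before, the function name and numerical guards are our own choices.

```python
import torch

def mobius_scalar_mul(r: float, x: torch.Tensor, c: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """r (x)_c x = (1 / sqrt(c)) * tanh(r * artanh(sqrt(c) ||x||)) * x / ||x||."""
    sqrt_c = c ** 0.5
    x_norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    arg = (sqrt_c * x_norm).clamp(max=1 - eps)        # keep artanh argument inside its domain
    return torch.tanh(r * torch.atanh(arg)) * x / (sqrt_c * x_norm)
```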

Definition A.3 (Gyrolines)

Let a, b be two distinct points in the gyrovector space (V_s, ⊕, ⊗). The gyroline in V_s which passes through a and b is the set of points:

L = a ⊕ (⊖a ⊕ b) ⊗ t    (17)

where t ∈ ℝ.

It can be proven that gyrolines in a Möbius gyrovector space coincide with the geodesics of the Poincaré ball model of hyperbolic geometry.

With the aid of operations in gyrovector spaces, we can define important properties of the Poincaré ball model in closed-form expressions.

Definition A.4 (Exponential Map and Logarithmic Map)

As shown in [8], the exponential map Exp_x^c: T_x D_c^n → D_c^n is defined as,

Exp_x^c(v) = x ⊕_c (tanh(√c λ_x^c ‖v‖ / 2) v / (√c ‖v‖))    (18)

The logarithmic map Log_x^c: D_c^n → T_x D_c^n is defined as,

Log_x^c(y) = (2 / (√c λ_x^c)) tanh⁻¹(√c ‖−x ⊕_c y‖) (−x ⊕_c y) / ‖−x ⊕_c y‖    (19)

The distance between two points in the Poincaré ball is defined as follows.

Definition A.5 (Poincaré Distance between Two Points)

d_c(x, y) = (2 / √c) tanh⁻¹(√c ‖−x ⊕_c y‖)    (20)
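
The Poincaré distance of Definition A.5 composes Möbius addition with an inverse hyperbolic tangent; below is a sketch reusing the `mobius_add` helper defined earlier.

```python
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor, c: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    sqrt_c = c ** 0.5
    diff_norm = mobius_add(-x, y, c).norm(dim=-1)
    return 2.0 / sqrt_c * torch.atanh((sqrt_c * diff_norm).clamp(max=1 - eps))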

A.2 The Effect of a Gradient Update of the Euclidean Parameters on the Hyperbolic Embedding

We derive the effect of a single gradient update of the Euclidean parameters on the hyperbolic embedding. For the Euclidean sub-network f_e(x; θ_e), consider the first-order Taylor expansion of the Euclidean network after a single gradient update,

f_e(x; θ_e − η ∂L/∂θ_e) ≈ f_e(x; θ_e) − η (∂f_e(x; θ_e)/∂θ_e) (∂L/∂θ_e)    (21)

Meanwhile, the exponential map of the Poincaré ball at the origin is,

Exp_0^c(v) = tanh(√c ‖v‖) v / (√c ‖v‖)    (22)

The Jacobian of the exponential map with respect to its argument can be computed as,

∂Exp_0^c(v)/∂v = (tanh(√c‖v‖)/(√c‖v‖)) (I − vv^⊤/‖v‖²) + sech²(√c‖v‖) vv^⊤/‖v‖²    (23)

Let z^t be the projected point in hyperbolic space after the t-th update, i.e.,

z^t = Exp_0^c(f_e(x; θ_e^t))    (24)

Again, we can apply the first-order Taylor expansion to the exponential map,

(25)

Denote by , we have

(26)

Denote by ,

(27)

A.3 Datasets

Dataset MNIST [23] CIFAR10 [21] CIFAR100 [21] ImageNet [38]
# of Training Examples 60,000 50,000 50,000 1,281,167
# of Test Examples 10,000 10,000 10,000 50,000
Table 6: The statistics of the datasets.

A.4 The effect of the hyperparameter r

We conduct ablation studies to show the effect of the hyperparameter r, which is the maximum norm of the Euclidean embedding. In Figure 5 we show how the test accuracy changes as we vary r on MNIST, CIFAR10 and CIFAR100. We repeat the experiments for each choice of r five times and report both the average accuracy and the standard deviation. On the one hand, it can be observed that a larger r leads to a drop in test accuracy. As we point out, this is caused by the vanishing gradient problem in training hyperbolic neural networks. On the other hand, a small r can also lead to a drop in test accuracy, especially on more complex tasks such as CIFAR10 and CIFAR100. A plausible reason is that a small r reduces the capacity of the embedding space, which is detrimental for learning discriminative features.

To conclude, there is a sweet spot for choosing r: neither too large (causing the vanishing gradient problem) nor too small (providing insufficient capacity). The performance of hyperbolic neural networks is also robust to the choice of r as long as it is around the sweet spot.

Figure 5: The change of test accuracy as we vary the hyperparameter r. A large r leads to the vanishing gradient problem and a small r causes insufficient capacity; both lead to a drop in test accuracy.

A.5 More results on adversarial robustness

Although we observe that with adversarial training hyperbolic neural networks achieve robust accuracy similar to Euclidean neural networks, in a further study we consider training models with a small perturbation budget but attacking with a larger one using FGSM on MNIST. In Table 7 we show the results of adversarially training the networks with ε = 0.05 and attacking with ε = 0.1, 0.2 and 0.3. We observe that for attacks with larger ε, such as 0.2 and 0.3, hyperbolic neural networks are more robust than Euclidean neural networks. A possible explanation is that the proposed feature clipping reduces the adversarial noise in the forward pass. One future direction is to systematically understand and analyze the reason behind the robustness of hyperbolic neural networks. In Figure 6, we show clean and adversarial images generated by FGSM for hyperbolic neural networks and Euclidean neural networks, respectively. The predictions of the networks are shown above the images. It can be observed that hyperbolic neural networks show more adversarial robustness than Euclidean neural networks.

Network ε = 0.1 ε = 0.2 ε = 0.3
Euclidean Network 94.51% 67.85% 42.18%
Hyperbolic Network 93.58% 74.97% 46.27%
Table 7: Adversarial training with FGSM (ε = 0.05) on MNIST.
Figure 6: Hyperbolic neural networks show more adversarial robustness than Euclidean neural networks. We show the clean images, the corresponding adversarial images, and the network predictions for 10 randomly sampled inputs. In several cases, hyperbolic neural networks make correct predictions on the adversarial images while Euclidean neural networks make wrong predictions.

A.6 More results on out-of-distribution detection

In Tables 8 and 9 we show the results of using the energy score [26] on CIFAR10 and CIFAR100 for out-of-distribution detection. Similar to the case of the softmax score, we observe that on both datasets hyperbolic neural networks achieve similar performance to Euclidean neural networks on one metric and perform better on the other two.

OOD Dataset Euclidean Neural Network (FPR95↓ / AUROC↑ / AUPR↑) Hyperbolic Neural Network (FPR95↓ / AUROC↑ / AUPR↑)
iSUN 34.19 ± 0.97 / 93.07 ± 0.24 / 98.42 ± 0.07 25.39 ± 0.32 / 95.48 ± 0.09 / 99.01 ± 0.04
Places365 43.34 ± 1.22 / 88.50 ± 0.48 / 96.76 ± 0.17 45.17 ± 1.19 / 89.61 ± 0.28 / 97.20 ± 0.14
Texture 58.51 ± 0.77 / 82.98 ± 0.20 / 94.55 ± 0.14 49.70 ± 0.94 / 90.66 ± 0.20 / 97.98 ± 0.04
SVHN 49.04 ± 1.05 / 91.57 ± 0.13 / 98.12 ± 0.05 57.33 ± 1.34 / 88.45 ± 0.20 / 97.44 ± 0.06
LSUN-Crop 9.48 ± 0.60 / 98.21 ± 0.07 / 99.63 ± 0.02 24.78 ± 0.73 / 95.06 ± 0.15 / 98.92 ± 0.05
LSUN-Resize 28.28 ± 0.66 / 94.31 ± 0.14 / 98.72 ± 0.04 22.52 ± 0.67 / 96.15 ± 0.09 / 99.18 ± 0.02
Mean 37.14 / 91.44 / 97.70 37.48 / 92.57 / 98.29
Table 8: The results of out-of-distribution detection on CIFAR10 with the energy score.
OOD Dataset Euclidean Neural Network (FPR95↓ / AUROC↑ / AUPR↑) Hyperbolic Neural Network (FPR95↓ / AUROC↑ / AUPR↑)
iSUN 74.49 ± 0.60 / 82.45 ± 0.33 / 95.84 ± 0.12 68.75 ± 0.93 / 81.33 ± 0.31 / 94.93 ± 0.16
Places365 81.20 ± 0.86 / 77.02 ± 0.34 / 94.13 ± 0.13 79.51 ± 0.69 / 77.23 ± 0.37 / 93.97 ± 0.17
Texture 83.19 ± 0.31 / 77.74 ± 0.35 / 94.54 ± 0.11 65.03 ± 0.52 / 83.38 ± 0.29 / 95.85 ± 0.10
SVHN 84.12 ± 0.59 / 84.41 ± 0.16 / 96.72 ± 0.04 55.44 ± 1.00 / 89.43 ± 0.25 / 97.69 ± 0.06
LSUN-Crop 43.80 ± 1.29 / 93.04 ± 0.22 / 98.56 ± 0.05 74.89 ± 0.73 / 84.98 ± 0.18 / 96.46 ± 0.08
LSUN-Resize 71.86 ± 0.69 / 81.86 ± 0.27 / 95.60 ± 0.09 64.35 ± 0.62 / 82.64 ± 0.36 / 95.27 ± 0.14
Mean 73.11 / 82.75 / 95.90 67.99 / 83.17 / 95.70
Table 9: The results of out-of-distribution detection on CIFAR100 with the energy score.

A.7 Softmax with temperature scaling

We consider softmax with temperature scaling as an alternative for addressing the vanishing gradient problem in training hyperbolic neural networks. Softmax with temperature scaling introduces an additional temperature parameter T to rescale the logits before applying the softmax function. It can be formulated as,

p_k = exp(z_k / T) / Σ_{j=1}^{K} exp(z_j / T)    (28)

In hyperbolic neural networks, z_k is the output of the hyperbolic fully-connected layer for class k and K is the number of classes. If the temperature parameter T is smaller than 1, the logits are scaled up, which plays a role similar to increasing the magnitude (in the Euclidean sense) of the hyperbolic embedding and thus prevents the embedding itself from approaching the boundary of the ball.
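
For completeness, a sketch of the temperature-scaled softmax in Equation 28; `logits` plays the role of the hyperbolic classifier outputs and the default temperature is illustrative.

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    """p_k = exp(z_k / T) / sum_j exp(z_j / T); T < 1 sharpens and rescales the logits."""
    return torch.softmax(logits / T, dim=-1)
```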

In Figure 7, we compare training hyperbolic neural networks with temperature scaling against the proposed feature clipping. We consider feature dimensions of 2 and 64. Different temperature parameters are considered, and each experiment is repeated 10 times with different random seeds; we report both the average accuracy and the standard deviation. We observe that softmax with temperature scaling and a carefully tuned temperature can approach the performance of the proposed feature clipping when the feature dimension is 2. However, when the feature dimension is 64, softmax with temperature scaling severely underperforms the proposed feature clipping. The results again confirm the effectiveness of the proposed approach.

Figure 7: The change of test accuracy as we vary the temperature parameter T. The red horizontal line is the result of the hyperbolic neural network with the proposed feature clipping. Softmax with a carefully tuned temperature can approach the performance of the proposed feature clipping, but it is sensitive to the feature dimension and the temperature parameter. Left: the embedding dimension is 2. Right: the embedding dimension is 64.