hyperbolicimageembeddings
Supplementary code for the paper "Hyperbolic Image Embeddings".
Computer vision tasks such as image classification, image retrieval and few-shot learning are currently dominated by Euclidean and spherical embeddings, so that the final decisions about class membership or the degree of similarity are made using linear hyperplanes, Euclidean distances, or spherical geodesic distances (cosine similarity). In this work, we demonstrate that in many practical scenarios hyperbolic embeddings provide a better alternative.
High-dimensional embeddings are ubiquitous in modern computer vision. Many, perhaps most, modern computer vision systems learn non-linear mappings (in the form of deep convolutional networks) from the space of images or image fragments into high-dimensional spaces. The operations at the end of deep networks imply a certain type of geometry of the embedding spaces. For example, image classification networks [14, 17] use linear operators (matrix multiplication) to map embeddings in the penultimate layer to class logits. The class boundaries in the embedding space are thus piecewise-linear, and pairs of classes are separated by Euclidean hyperplanes. The embeddings learned by the model in the penultimate layer therefore live in Euclidean space. The same can be said about systems where Euclidean distances are used to perform image retrieval [22, 32, 43][23, 42] or one-shot learning [31].

Alternatively, some few-shot learning [39], face recognition [30] and person re-identification methods [38, 44] learn spherical embeddings, so that a sphere projection operator is applied at the end of the network that computes the embeddings. Cosine similarity (closely associated with sphere geodesic distance) is then used by such architectures to match images.
Euclidean spaces with their zero curvature and spherical spaces with their positive curvature have profound implications for the nature of the embeddings that existing computer vision systems can learn. In this work, we argue that hyperbolic spaces with negative curvature might often be more appropriate for learning embeddings of images. To this end, we add the recently proposed hyperbolic network layers [7] to the end of several computer vision networks, and present a number of experiments corresponding to image classification, image retrieval, one-shot and few-shot learning, and person re-identification. We show that in many cases, the use of hyperbolic geometry improves the performance over Euclidean or spherical embeddings.
Our work is inspired by the recent body of works that demonstrate the advantage of learning hyperbolic embeddings for language entities such as taxonomy entries [20], common words [36], phrases [5] and for other NLP tasks, such as neural machine translation [8]. Our results imply that hyperbolic spaces may be as valuable for improving the performance of computer vision systems.

The use of hyperbolic spaces in natural language processing is motivated by their natural ability to embed hierarchies (e.g., tree graphs) with low distortion [29]. Hierarchies are ubiquitous in natural language processing. First, there are natural hierarchies corresponding to, e.g., biological taxonomies and linguistic ontologies. Likewise, a more generic short phrase can have many plausible continuations and is therefore semantically related to a multitude of long phrases that are not necessarily closely related to each other (in the semantic sense). The innate suitability of hyperbolic spaces to embedding hierarchies [27, 29] explains the success of such spaces in natural language processing [20].

Here, we argue that similar hierarchical relations between images are common in computer vision tasks (Figure 2). One can observe the following example cases:
In image retrieval, an overview photograph is related to many images that correspond to close-ups of different distinct details. Likewise, for classification tasks in-the-wild, an image containing the representatives of multiple classes is related to images that contain representatives of the classes in isolation. Embedding a dataset that contains composite images into a continuous space is therefore similar to embedding a hierarchy.
In some tasks, more generic images may correspond to images that contain less information and are therefore more ambiguous. E.g., in face recognition, a blurry and/or low-resolution face image taken from afar can be related to many high-resolution images of faces that clearly belong to distinct people. Again, a natural embedding for image datasets that have widely varying image quality/ambiguity calls for retaining such hierarchical structure.
Many of the natural hierarchies investigated in natural language processing carry over to the visual domain. E.g., the visual concepts of different animal species have the same natural hierarchical groupings stemming from the similarity of their genotypes (most felines share visual similarity while being visually distinct from pinnipeds). Hyperbolic spaces thus remain a natural choice for embedding biological taxonomies irrespective of whether we deal with text documents or images.
In order to build deep learning models which operate on embeddings in hyperbolic spaces, we capitalize on recent developments [7], which construct the analogues of familiar layers (such as a feed-forward layer, or a multinomial regression layer) in hyperbolic spaces. We show that many standard architectures used for image classification, and in particular in the few-shot learning setting, can be easily modified to operate on hyperbolic embeddings, which in many cases also leads to their improvement.

Formally, $n$-dimensional hyperbolic space, denoted as $\mathbb{H}^n$, is defined as the homogeneous, simply connected $n$-dimensional Riemannian manifold of constant negative sectional curvature. The property of constant negative curvature makes it analogous to the ordinary Euclidean sphere (which has constant positive curvature); however, the geometrical properties of hyperbolic space are very different. It is known that hyperbolic space cannot be isometrically embedded into Euclidean space [13, 18], but there exist several well-studied models of hyperbolic geometry. In every model, a certain subset of Euclidean space is endowed with a hyperbolic metric; however, all these models are isomorphic to each other, and we may easily move from one to another depending on where the formulas of interest are simpler. We follow the majority of NLP works and use the Poincaré ball model (see Figure 3). Investigating alternative models that might provide better numerical stability remains future work (though already started in the NLP community [21, 28]). Here, we provide a very short summary of the model.
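The Poincaré ball model $(\mathbb{D}^n, g^{\mathbb{D}})$ is defined by the manifold $\mathbb{D}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ endowed with the Riemannian metric $g^{\mathbb{D}}(x) = \lambda_x^2 g^E$, where $\lambda_x = \frac{2}{1 - \|x\|^2}$ is the conformal factor and $g^E$ is the Euclidean metric tensor $g^E = \mathbf{I}^n$. In this model the geodesic distance between two points is given by the following expression:

$$d_{\mathbb{D}}(x, y) = \operatorname{arccosh}\left(1 + 2\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right). \tag{1}$$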
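As an illustration, the distance of Equation (1) can be evaluated directly. The sketch below assumes the standard Poincaré ball form $d(x, y) = \operatorname{arccosh}\left(1 + 2\|x - y\|^2 / ((1 - \|x\|^2)(1 - \|y\|^2))\right)$; the function name `poincare_distance` is hypothetical and not taken from the paper's code.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincaré ball.

    Assumes the standard arccosh form of Equation (1); x and y are
    sequences of floats with Euclidean norm strictly below 1.
    """
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    if sq_norm_x >= 1.0 or sq_norm_y >= 1.0:
        raise ValueError("both points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y)))


# The same Euclidean step costs more hyperbolic distance near the boundary.
print(poincare_distance([0.0, 0.0], [0.1, 0.0]))
print(poincare_distance([0.8, 0.0], [0.9, 0.0]))
```

Both pairs are 0.1 apart in the Euclidean sense, yet the second pair is farther apart hyperbolically, which reflects the expansion of volume towards the boundary.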
In order to define the hyperbolic average, we will make use of the Klein model of hyperbolic space. Similarly to the Poincaré model, it is defined on the set $\mathbb{K}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$, however, with a different metric, which is not relevant for further discussion. In Klein coordinates, the hyperbolic average (generalizing the usual Euclidean mean) takes its simplest form, and we present the necessary formulas in Section 4.
From the viewpoint of hyperbolic geometry, all points of the Poincaré ball are equivalent. The models that we consider below are, however, hybrid in the sense that most layers use Euclidean operators, such as standard generalized convolutions, while only the final layers operate within the hyperbolic geometry framework. The hybrid nature of our setups makes the origin a special point, since from the Euclidean viewpoint the local volumes in the Poincaré ball expand exponentially from the origin to the boundary. This leads to the useful tendency of the learned embeddings to place more generic/ambiguous objects closer to the origin, while moving more specific objects towards the boundary. The distance to the origin in our models therefore provides a natural estimate of uncertainty, which can be used in several ways, as we show below.
Hyperbolic embeddings in the natural language processing field have recently been very successful [20, 21]. They are motivated by the innate ability of hyperbolic spaces to embed hierarchies (e.g., tree graphs) with low distortion [28, 29]. The main result in this area states that any tree can be embedded into (two-dimensional) hyperbolic space with arbitrarily low distortion. Another direction of research, more relevant to the present work, is based on imposing hyperbolic structure on activations of neural networks [7, 8].
Natural language naturally gives rise to hierarchies, as, for instance, some words are more specific while others are more generic. We argue that in-the-wild computer vision datasets also have more or less specific images that carry a varying amount of information (e.g., middle images within triplets in Figure 2 are more generic/less specific).
The task of few-shot learning, which has recently attracted a lot of attention, is concerned with the overall ability of the model to generalize to data unseen during training. A body of papers devoted to few-shot classification that focuses on metric learning methods includes Siamese Networks [12], Matching Networks [39], Prototypical Networks [31], and Relation Networks [35]. In contrast, other models apply meta-learning to few-shot learning: e.g., MAML by [6], Meta-Learner LSTM by [24], SNAIL by [19]. While these methods employ either Euclidean or spherical geometries (like in [39]), there is no model extension to hyperbolic space.
The task of person re-identification is to match pedestrian images captured by possibly non-overlapping surveillance cameras. Papers [1, 9, 41] adopt pairwise models that accept pairs of images and output their similarity scores. The resulting similarity scores are used to classify the input pairs as being matching or non-matching. Another popular direction of work includes approaches that aim at learning a mapping of the pedestrian images to the Euclidean descriptor space. Several papers, e.g., [34, 44], use verification loss functions based on the Euclidean distance or cosine similarity. A number of methods utilize a simple classification approach for training [3, 33, 11, 45], and Euclidean distance is used at test time. Some methods use a combination of the aforementioned approaches: [4] adopts verification losses that are applied to the similarities predicted by a special subnetwork instead of the simple Euclidean similarities.

In this section, we recall recent work introducing hyperbolic neural networks [7]. Hyperbolic networks are extensions of conventional neural networks in the sense that they generalize typical neural network operations to those in hyperbolic space using the formalism of Möbius gyrovector spaces. In [7], the authors present the hyperbolic versions of feed-forward networks, multinomial logistic regression, and recurrent neural networks.
Further in this section, we briefly discuss the hyperbolic functions and layers crucial for the understanding of hyperbolic neural networks used in the remainder of the paper. Similarly to the paper [7], we use an additional hyperparameter $c$ corresponding to the radius of the Poincaré ball, which is then defined in the following manner: $\mathbb{D}^n_c = \{x \in \mathbb{R}^n : c\|x\|^2 < 1,\ c \geq 0\}$. The corresponding conformal factor is then modified as $\lambda^c_x = \frac{2}{1 - c\|x\|^2}$. In practice, the choice of $c$ allows one to balance between hyperbolic and Euclidean geometries, which is made precise by noting that with $c \to 0$ all the formulas discussed below take their usual Euclidean form.

For a pair $x, y \in \mathbb{D}^n_c$, the Möbius addition is defined as follows:

$$x \oplus_c y := \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}. \tag{2}$$
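The induced distance function is defined as

$$d_c(x, y) := \frac{2}{\sqrt{c}} \operatorname{arctanh}\left(\sqrt{c}\,\|{-x} \oplus_c y\|\right). \tag{3}$$

Note that with $c = 1$ one recovers the geodesic distance (1), while with $c \to 0$ we obtain the Euclidean distance $\lim_{c \to 0} d_c(x, y) = 2\|x - y\|$.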
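The two limiting cases can be checked numerically. The following sketch assumes the Möbius addition and induced distance in the form used in [7], namely $x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)x + (1 - c\|x\|^2)y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}$ and $d_c(x, y) = \frac{2}{\sqrt{c}}\operatorname{arctanh}(\sqrt{c}\|{-x} \oplus_c y\|)$; the function names are illustrative only.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def mobius_add(x, y, c):
    """Möbius addition of two points of the Poincaré ball with parameter c."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]


def hyperbolic_distance(x, y, c):
    """Induced distance d_c(x, y); requires c > 0."""
    diff = mobius_add([-a for a in x], y, c)
    sqrt_c = math.sqrt(c)
    return 2.0 / sqrt_c * math.atanh(sqrt_c * math.sqrt(dot(diff, diff)))


x, y = [0.1, 0.2], [-0.3, 0.4]
# c = 1: agrees with the arccosh form of the Poincaré distance.
print(hyperbolic_distance(x, y, c=1.0))
# Very small c: approaches twice the Euclidean distance.
print(hyperbolic_distance(x, y, c=1e-8), 2.0 * math.dist(x, y))
```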
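To perform operations in the hyperbolic space, one first needs to define a bijective map from $\mathbb{R}^n$ to $\mathbb{D}^n_c$ in order to map Euclidean vectors to the hyperbolic space, and vice versa. The so-called exponential map and its inverse, the logarithmic map, serve as such a bijection.

The exponential map $\exp^c_x$ is a function from $T_x\mathbb{D}^n_c \cong \mathbb{R}^n$ to $\mathbb{D}^n_c$, which is given by

$$\exp^c_x(v) := x \oplus_c \left(\tanh\left(\sqrt{c}\,\frac{\lambda^c_x \|v\|}{2}\right)\frac{v}{\sqrt{c}\,\|v\|}\right). \tag{4}$$

The inverse logarithmic map is defined as

$$\log^c_x(y) := \frac{2}{\sqrt{c}\,\lambda^c_x} \operatorname{arctanh}\left(\sqrt{c}\,\|{-x} \oplus_c y\|\right)\frac{-x \oplus_c y}{\|{-x} \oplus_c y\|}. \tag{5}$$

In practice, we use the maps $\exp^c_0$ and $\log^c_0$ for transition between the Euclidean and Poincaré ball representations of a vector.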
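At the origin the two maps simplify considerably, since the conformal factor there equals 2 and Möbius addition with zero is the identity. The sketch below assumes the resulting closed forms $\exp^c_0(v) = \tanh(\sqrt{c}\|v\|)\,v / (\sqrt{c}\|v\|)$ and $\log^c_0(y) = \operatorname{arctanh}(\sqrt{c}\|y\|)\,y / (\sqrt{c}\|y\|)$; the function names are hypothetical.

```python
import math


def exp_map_zero(v, c):
    """Map a Euclidean (tangent) vector v to the Poincaré ball of parameter c."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [0.0] * len(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * a for a in v]


def log_map_zero(y, c):
    """Inverse of exp_map_zero: map a ball point y back to Euclidean space."""
    norm = math.sqrt(sum(a * a for a in y))
    if norm == 0.0:
        return [0.0] * len(y)
    sqrt_c = math.sqrt(c)
    scale = math.atanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * a for a in y]


v = [3.0, -4.0]
point = exp_map_zero(v, c=0.5)
print(point)                      # lies inside the ball of radius 1 / sqrt(0.5)
print(log_map_zero(point, c=0.5)) # recovers v up to rounding
```

The round trip is only reliable while $\tanh$ has not saturated in floating point, which is one motivation for clipping embeddings away from the boundary.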
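Assume we have a standard (Euclidean) linear layer $x \to Mx + b$. In order to generalize it, one needs to define the Möbius matrix by vector product:

$$M^{\otimes_c}(x) := \frac{1}{\sqrt{c}} \tanh\left(\frac{\|Mx\|}{\|x\|}\operatorname{arctanh}\left(\sqrt{c}\,\|x\|\right)\right)\frac{Mx}{\|Mx\|} \tag{6}$$

if $Mx \neq 0$, and $M^{\otimes_c}(x) := 0$ otherwise. Finally, for a bias vector $b \in \mathbb{D}^n_c$ the operation underlying the hyperbolic linear layer is then given by $M^{\otimes_c}(x) \oplus_c b$.

In several architectures (e.g., in siamese networks), one needs to concatenate two vectors; this operation is obvious in Euclidean space. However, the straightforward concatenation of two vectors from hyperbolic space does not necessarily remain in hyperbolic space. Thus, we have to use a generalized version of the concatenation operation, which is defined in the following manner. For $x \in \mathbb{D}^{n_1}_c$, $y \in \mathbb{D}^{n_2}_c$, we define the mapping $\mathrm{Concat} : \mathbb{D}^{n_1}_c \times \mathbb{D}^{n_2}_c \to \mathbb{D}^{n_3}_c$ as follows:

$$\mathrm{Concat}(x, y) = M_1^{\otimes_c} x \oplus_c M_2^{\otimes_c} y, \tag{7}$$

where $M_1$ and $M_2$ are trainable matrices of sizes $n_3 \times n_1$ and $n_3 \times n_2$ correspondingly. The motivation for this definition is simple: usually, the Euclidean concatenation layer is followed by a linear map, which when written explicitly takes the (Euclidean) form of Equation (7).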
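The linear layer and the generalized concatenation can be sketched together, since both reduce to a Möbius matrix-vector product followed by a Möbius addition. The code assumes the definitions from [7], $M^{\otimes_c}(x) = \frac{1}{\sqrt{c}}\tanh\left(\frac{\|Mx\|}{\|x\|}\operatorname{arctanh}(\sqrt{c}\|x\|)\right)\frac{Mx}{\|Mx\|}$ and $\mathrm{Concat}(x, y) = M_1^{\otimes_c}x \oplus_c M_2^{\otimes_c}y$; matrices are plain lists of rows, and the concrete numbers are made up for illustration.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def mobius_add(x, y, c):
    """Möbius addition in the Poincaré ball with parameter c."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]


def mobius_matvec(M, x, c):
    """Möbius matrix-vector product; returns the zero vector when Mx = 0."""
    Mx = [dot(row, x) for row in M]
    Mx_norm, x_norm = norm(Mx), norm(x)
    if Mx_norm == 0.0:
        return [0.0] * len(M)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(Mx_norm / x_norm * math.atanh(sqrt_c * x_norm)) / (sqrt_c * Mx_norm)
    return [scale * v for v in Mx]


def hyperbolic_linear(M, b, x, c):
    """Hyperbolic analogue of x -> Mx + b, with the bias b inside the ball."""
    return mobius_add(mobius_matvec(M, x, c), b, c)


def hyperbolic_concat(M1, M2, x, y, c):
    """Generalized concatenation of two ball points into one ball point."""
    return mobius_add(mobius_matvec(M1, x, c), mobius_matvec(M2, y, c), c)


c = 1.0
x, y = [0.3, -0.2], [0.1, 0.4, -0.5]
M1 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [2.0, 1.0]]   # 4 x 2
M2 = [[0.2, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, -1.0]]  # 4 x 3
z = hyperbolic_concat(M1, M2, x, y, c)
print(z, norm(z))  # a 4-dimensional point whose norm stays below 1 / sqrt(c)
```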
Another important operation common in image processing is averaging of feature vectors, used, e.g., in prototypical networks for few-shot learning [31]. In the Euclidean setting this operation takes the form $(x_1, \ldots, x_N) \to \frac{1}{N}\sum_i x_i$. The extension of this operation to hyperbolic spaces is called the Einstein midpoint and takes its simplest form in Klein coordinates:

$$\mathrm{HypAve}(x_1, \ldots, x_N) = \sum_{i=1}^{N} \gamma_i x_i \Big/ \sum_{i=1}^{N} \gamma_i, \tag{8}$$

where $\gamma_i = \frac{1}{\sqrt{1 - c\|x_i\|^2}}$ are the Lorentz factors. Recall from the discussion in Section 2 that the Klein model is supported on the same space as the Poincaré ball; however, the same point has different coordinate representations in these models. Let $x_{\mathbb{D}}$ and $x_{\mathbb{K}}$ denote the coordinates of the same point in the Poincaré and Klein models correspondingly. Then the following transition formulas hold.

$$x_{\mathbb{D}} = \frac{x_{\mathbb{K}}}{1 + \sqrt{1 - c\|x_{\mathbb{K}}\|^2}}, \tag{9}$$

$$x_{\mathbb{K}} = \frac{2 x_{\mathbb{D}}}{1 + c\|x_{\mathbb{D}}\|^2}. \tag{10}$$

Thus, given points in the Poincaré ball we can first map them to the Klein model, compute the average using Equation (8), and then move it back to the Poincaré model.
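The three-step procedure above (Poincaré to Klein, Einstein midpoint, Klein to Poincaré) can be sketched as follows. The code assumes the usual transition formulas $x_{\mathbb{K}} = 2x_{\mathbb{D}} / (1 + c\|x_{\mathbb{D}}\|^2)$ and $x_{\mathbb{D}} = x_{\mathbb{K}} / (1 + \sqrt{1 - c\|x_{\mathbb{K}}\|^2})$ and Lorentz factors $\gamma_i = 1 / \sqrt{1 - c\|x_i\|^2}$ computed in Klein coordinates; the function names are hypothetical.

```python
import math


def poincare_to_klein(x, c):
    """Change of coordinates from the Poincaré ball to the Klein model."""
    x2 = sum(a * a for a in x)
    return [2.0 * a / (1.0 + c * x2) for a in x]


def klein_to_poincare(x, c):
    """Change of coordinates from the Klein model to the Poincaré ball."""
    x2 = sum(a * a for a in x)
    return [a / (1.0 + math.sqrt(1.0 - c * x2)) for a in x]


def hyp_ave(points, c):
    """Einstein midpoint of Poincaré-ball points, returned in Poincaré coordinates."""
    klein = [poincare_to_klein(p, c) for p in points]
    # Lorentz factors are evaluated on the Klein coordinates of each point.
    gammas = [1.0 / math.sqrt(1.0 - c * sum(a * a for a in k)) for k in klein]
    total = sum(gammas)
    dim = len(points[0])
    midpoint = [sum(g * k[i] for g, k in zip(gammas, klein)) / total
                for i in range(dim)]
    return klein_to_poincare(midpoint, c)


support = [[0.5, 0.1], [0.6, -0.1], [0.4, 0.3]]
print(hyp_ave(support, c=1.0))
```

Points closer to the boundary receive larger Lorentz factors, so they pull the midpoint more strongly than they would in a plain Euclidean mean.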
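In our experiments, to perform multiclass classification, we take advantage of the generalization of multiclass logistic regression (MLR) to hyperbolic spaces. The idea of this generalization is based on the observation that in Euclidean space logits can be represented as the distances to certain hyperplanes, where each hyperplane can be specified with a point of origin and a normal vector. The same construction can be used in the Poincaré ball after a suitable analogue for hyperplanes is introduced. Given $p \in \mathbb{D}^n_c$ and $a \in T_p\mathbb{D}^n_c \setminus \{0\}$, such an analogue would be the union of all geodesics passing through $p$ and orthogonal to $a$.

The resulting formula for hyperbolic MLR for $K$ classes is written below; here $p_k \in \mathbb{D}^n_c$ and $a_k \in T_{p_k}\mathbb{D}^n_c \setminus \{0\}$ are learnable parameters.

$$p(y = k \mid x) \propto \exp\left(\frac{\lambda^c_{p_k}\|a_k\|}{\sqrt{c}} \operatorname{arcsinh}\left(\frac{2\sqrt{c}\,\langle -p_k \oplus_c x, a_k\rangle}{(1 - c\|{-p_k} \oplus_c x\|^2)\,\|a_k\|}\right)\right)$$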
For a more thorough discussion of hyperbolic neural networks, we refer the reader to the paper [7].
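For concreteness, the class logits can be computed as in the sketch below. It assumes the hyperbolic MLR form given in [7], where the logit of class $k$ is $\frac{\lambda^c_{p_k}\|a_k\|}{\sqrt{c}}\operatorname{arcsinh}\left(\frac{2\sqrt{c}\langle z_k, a_k\rangle}{(1 - c\|z_k\|^2)\|a_k\|}\right)$ with $z_k = -p_k \oplus_c x$ and $\lambda^c_{p_k} = 2 / (1 - c\|p_k\|^2)$. The function names and the toy parameters are illustrative, not taken from the paper's code.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def mobius_add(x, y, c):
    """Möbius addition in the Poincaré ball with parameter c."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]


def hyperbolic_mlr_logits(x, offsets, normals, c):
    """One logit per class; offsets are the p_k, normals are the a_k."""
    sqrt_c = math.sqrt(c)
    logits = []
    for p, a in zip(offsets, normals):
        z = mobius_add([-v for v in p], x, c)       # x translated so that p_k is the origin
        a_norm = math.sqrt(dot(a, a))
        conformal = 2.0 / (1.0 - c * dot(p, p))      # conformal factor at p_k
        arg = 2.0 * sqrt_c * dot(z, a) / ((1.0 - c * dot(z, z)) * a_norm)
        logits.append(conformal * a_norm / sqrt_c * math.asinh(arg))
    return logits


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = hyperbolic_mlr_logits([0.5, 0.0],
                               offsets=[[0.0, 0.0], [0.5, 0.0]],
                               normals=[[1.0, 0.0], [0.0, 1.0]],
                               c=1.0)
print(logits, softmax(logits))
```

A point lying exactly on a class hyperplane (here, $x = p_2$) receives a zero logit for that class, mirroring the Euclidean case where the logit is a signed distance to the decision hyperplane.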
While implementing most of the formulas described above is straightforward, we employ some tricks to make the training more stable.
To ensure numerical stability we perform clipping by norm after applying the exponential map, which constrains the norm to not exceed .
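A minimal sketch of such norm clipping is shown below. The safety margin `eps` and the function name `clip_to_ball` are assumptions made for this illustration; the idea is only to keep points strictly inside the ball of radius $1/\sqrt{c}$, where $\operatorname{arctanh}$ in the distance and logarithmic map remains finite.

```python
import math


def clip_to_ball(x, c, eps=1e-3):
    """Rescale x so that its norm does not exceed (1 - eps) / sqrt(c).

    eps is a hypothetical safety margin chosen for this sketch.
    Points that already satisfy the bound are returned unchanged.
    """
    max_norm = (1.0 - eps) / math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in x))
    if norm <= max_norm:
        return list(x)
    return [a * max_norm / norm for a in x]


print(clip_to_ball([3.0, 4.0], c=1.0))  # pulled back just inside the unit ball
print(clip_to_ball([0.1, 0.2], c=1.0))  # already inside, left as is
```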
Some of the parameters in the aforementioned layers are naturally elements of $\mathbb{D}^n_c$. While in principle it is possible to apply Riemannian optimization techniques to them (e.g., the previously proposed Riemannian Adam optimizer [2]), we did not observe any significant improvement. Instead, we parametrized them via ordinary Euclidean parameters, which were mapped to their hyperbolic counterparts with the exponential map, and used the standard Adam optimizer.
We found that the value of $c$ may affect the performance, especially for large dimensions of the Poincaré ball. In our experiments, the value showed good results for a broad range of dimensions.
We start with a toy experiment supporting our above-mentioned hypothesis that the distance to the center of the Poincaré ball indicates model uncertainty. To do so, we first train the MLR classifier in hyperbolic space on the MNIST dataset [16] and evaluate it on the Omniglot dataset [15]. We then investigate and compare the obtained distributions of distances to the origin of hyperbolic embeddings of the MNIST and Omniglot test sets.
In our further experiments, we concentrate on the few-shot classification and person re-identification tasks. The experiments on the Omniglot dataset serve as a starting point, and then we move towards more complex datasets. Afterwards, we consider two datasets, namely MiniImageNet [24] and Caltech-UCSD Birds-200-2011 (CUB) [40]. Here, for each dataset, we train four models: for one-shot five-way and five-shot five-way classification tasks, both in the Euclidean and hyperbolic spaces. Finally, we provide the re-identification results for two popular datasets: Market-1501 [46] and DukeMTMC-reID [25, 47]. Further in this section, we provide a thorough description of each experiment.
Our code is available on GitHub: https://github.com/KhrulkovV/hyperbolicimageembeddings.
In this subsection, we validate our hypothesis, which claims that if one trains a hyperbolic classifier, then the distance of the Poincaré ball embedding of an image to the origin can serve as a good measure of the model's confidence. We start by training a simple hyperbolic convolutional neural network (ConvNet) on the MNIST dataset, consisting of three convolutional layers of size with filters, followed by three linear layers with hidden size, except for the last linear layer, where hidden size corresponded to the embedding dimensionality and was varied. Parametric ReLU nonlinearity was used between all the layers. The output of the last hidden layer was mapped to the Poincaré ball using the exponential map (4) and was followed by the hyperbolic MLR layer. After training the model to test accuracy, we evaluate it on the Omniglot dataset (by resizing images to and normalizing them to have the same background color as MNIST). We then evaluate the hyperbolic distance to the origin of embeddings produced by the network on both datasets. The closest Euclidean analogue to this approach would be comparing distributions of the maximum class probability predicted by the network. For the same range of dimensions we train ordinary Euclidean classifiers on MNIST, and compare these distributions for the same sets. Our findings are summarized in Figure 4 and Table 1. We observe that distances to the origin present a more statistically significant indicator of the dataset dissimilarity in cases.

We have visualized the learned MNIST and Omniglot embeddings in Figure 1. We observe that more 'unclear' images are located near the center, while the images that are easy to classify are located closer to the boundary.
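The confidence score discussed here is simply the hyperbolic distance from an embedding to the origin. Assuming the induced distance $d_c(0, x) = \frac{2}{\sqrt{c}}\operatorname{arctanh}(\sqrt{c}\|x\|)$, it can be sketched as below; the embeddings are made-up values chosen only to illustrate the ordering.

```python
import math


def distance_to_origin(x, c):
    """Hyperbolic distance from the origin to x in the Poincaré ball of parameter c."""
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in x))
    return 2.0 / sqrt_c * math.atanh(sqrt_c * norm)


# Hypothetical embeddings: one near the center, one near the boundary.
embeddings = {"ambiguous": [0.05, 0.02], "confident": [0.7, -0.6]}
scores = {name: distance_to_origin(e, c=1.0) for name, e in embeddings.items()}
print(scores)
```

Comparing the distributions of this score over two datasets then plays the role that the maximum class probability plays for a Euclidean classifier.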


We hypothesize that a certain class of problems, namely the few-shot classification task, can benefit greatly from hyperbolic embeddings.
The starting point for our analysis is the experiments on the Omniglot dataset for few-shot classification. This dataset consists of the images of characters sampled from different alphabets; each character is supported by examples. We test several few-shot learning algorithms to see how hyperbolic embeddings affect them. As a baseline approach, we chose a siamese net with the backbone consisting of convolutional blocks followed by two fully connected layers, producing embeddings of size . In order to produce the similarity score for two input images (which is then fed into a sigmoid, predicting whether these two images are of the same class), we consider three approaches. The first one is Euclidean, for which we follow [12], with . The other two use hyperbolic layers, for which we, as before, extend the backbone net with the exponential map. We try the following two approaches to compute the similarity between two hyperbolic embeddings. The first one is straightforward and we define similarity as . For the second one, we first concatenate the embeddings using the hyperbolic layer (7), mapping these two embeddings into one hyperbolic point of size . This point is then moved to Euclidean space via the logarithmic map (5), and mapped to a number with a fully connected layer of size . We test these approaches for different values of in the shot way setting and report the results in Table 2. We observe that in both cases we get an accuracy boost, but slightly smaller with the architecture. Value was used for these experiments.
In order to validate whether hyperbolic embeddings can further improve models performing at the state-of-the-art level, for the next architecture we choose the prototype network (ProtoNet) introduced in the paper [31], with four convolutional blocks in the backbone. Each convolutional block consists of convolutional layer followed by batch normalization, ReLU nonlinearity and max-pooling layer. The number of filters in the last convolutional layer corresponds to the value of the embedding dimension, for which we choose . The hyperbolic model differs from the baseline in the following aspects. First, the output of the last convolutional block is embedded into the Poincaré ball of dimension using the exponential map. In ProtoNet, one uses a so-called prototype representation of a class, which is defined as the mean of the embedded support set of the class. Generalizing this concept to hyperbolic space, we substitute the Euclidean mean operation by $\mathrm{HypAve}$, defined earlier in Equation (8). We selected and trained the models for various few-shot scenarios. The initial value of learning rate equals to and is multiplied by every epochs out of total epochs. Results are presented in Table 3. We can see that in some scenarios, in particular for one-shot learning, hyperbolic embeddings are more beneficial, while in other cases results are slightly worse.

Table 2
dim |
Euclidean similarity |
Hyperbolic |
Hyperbolic distance |

Table 3
 | ProtoNet | Hyperbolic ProtoNet
shot way |  |
shot way |  |
shot way |  |
shot way |  |

Table 4
Dataset | Model | 1-shot 5-way | 5-shot 5-way
MiniImageNet | MatchNet [39] |  |
MiniImageNet | ProtoNet |  |
MiniImageNet | RelationNet [35] |  |
MiniImageNet | Hyperbolic ProtoNet |  |
CUB | ProtoNet |  |
CUB | Hyperbolic ProtoNet |  |
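The prototype step of a hyperbolic ProtoNet can be illustrated with a small nearest-prototype classifier. This is a sketch under stated assumptions rather than the authors' implementation: prototypes are Einstein midpoints computed via Klein coordinates, queries are assigned to the class whose prototype is nearest in the induced distance $d_c$, and the support embeddings are made-up two-dimensional points standing in for the output of a backbone followed by the exponential map.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def mobius_add(x, y, c):
    """Möbius addition in the Poincaré ball with parameter c."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [((1.0 + 2.0 * c * xy + c * y2) * a + (1.0 - c * x2) * b) / denom
            for a, b in zip(x, y)]


def hyperbolic_distance(x, y, c):
    """Induced distance d_c between two ball points."""
    diff = mobius_add([-a for a in x], y, c)
    sqrt_c = math.sqrt(c)
    return 2.0 / sqrt_c * math.atanh(sqrt_c * math.sqrt(dot(diff, diff)))


def hyp_ave(points, c):
    """Einstein midpoint: Poincaré -> Klein, weighted mean, Klein -> Poincaré."""
    klein = [[2.0 * a / (1.0 + c * dot(p, p)) for a in p] for p in points]
    gammas = [1.0 / math.sqrt(1.0 - c * dot(k, k)) for k in klein]
    total = sum(gammas)
    mid = [sum(g * k[i] for g, k in zip(gammas, klein)) / total
           for i in range(len(points[0]))]
    return [a / (1.0 + math.sqrt(1.0 - c * dot(mid, mid))) for a in mid]


def classify(query, support, c):
    """Return the label whose hyperbolic prototype is closest to the query."""
    prototypes = {label: hyp_ave(points, c) for label, points in support.items()}
    return min(prototypes,
               key=lambda label: hyperbolic_distance(query, prototypes[label], c))


# Hypothetical 2-way 2-shot episode with already-embedded support points.
support = {"a": [[0.5, 0.1], [0.6, -0.1]], "b": [[-0.5, 0.2], [-0.4, -0.2]]}
print(classify([0.3, 0.0], support, c=1.0))
print(classify([-0.7, 0.1], support, c=1.0))
```

In training, the negative distances to the prototypes would typically be passed through a softmax to obtain class probabilities, as in the Euclidean ProtoNet.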
MiniImageNet dataset is the subset of ImageNet dataset [26], which contains of classes represented by examples per class. We use the following split provided in the paper [24]: training dataset consists of classes, validation dataset is represented by classes, and the remaining classes serve as a test dataset.
As a baseline model, we once again consider the ProtoNet model, mentioned earlier in the experiments on the Omniglot dataset in Subsection 5.2. In these experiments, we set the value of the embedding dimension to . We test the models on tasks for one-shot and five-shot classification; the number of query points in each batch always equals to . Similarly, for the hyperbolic version of ProtoNet, we have the following experimental setup. We put the value of Poincaré ball radius to , and consider the following learning rate decay scheme: the initial learning rate equals to and is further multiplied by every epochs (out of total epochs).
Table 4 reports the results obtained on the MiniImageNet dataset. For this dataset, the results of the other models are available for the same classification tasks (i.e., for one-shot and five-shot learning). Therefore, we can compare our results to those that were reported in the original papers. From these experimental results, we observe a slight gain in model accuracy.
The CUB dataset consists of images of bird species and was designed for fine-grained classification. We use the split introduced in [37]: classes out of were used for training, for validation and for testing. Also, following [37], we make the same preprocessing step by resizing each image to the size of .
Likewise, we use ProtoNet mentioned above with the following modifications. Here, we fix the embedding dimension to and use a slightly different setup for learning rate scheduler: the initial learning rate of value is multiplied by every epochs out of total epochs. Remaining architecture and parameters both in baseline and hyperbolic models are identical to those in the experiments on the MiniImageNet dataset.
Our findings on the experiments on the CUB dataset are summarized in Table 4. Interestingly, for this dataset, the hyperbolic version significantly outperforms its Euclidean counterpart.
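Table 5. Person re-identification results (bs: baseline, hyp: hyperbolic; r1: Rank-1, mAP: Mean Average Precision).

 |  | Market-1501 |  |  |  | DukeMTMC-reID |  |  |
 |  | bs |  | hyp |  | bs |  | hyp |
dim | lr schedule | r1 | mAP | r1 | mAP | r1 | mAP | r1 | mAP
32 | sch#1 | 71.4 | 49.7 | 69.8 | 45.9 | 56.1 | 35.6 | 56.5 | 34.9
32 | sch#2 | 68.0 | 43.4 | 75.9 | 51.9 | 57.2 | 35.7 | 62.2 | 39.1
64 | sch#1 | 80.3 | 60.3 | 83.1 | 60.1 | 69.9 | 48.5 | 70.8 | 48.6
64 | sch#2 | 80.5 | 57.8 | 84.4 | 62.7 | 68.3 | 45.5 | 70.7 | 48.6
128 | sch#1 | 86.0 | 67.3 | 87.8 | 68.4 | 74.1 | 53.3 | 76.5 | 55.4
128 | sch#2 | 86.5 | 68.5 | 86.4 | 66.2 | 71.5 | 51.5 | 74.0 | 52.2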
The DukeMTMC-reID dataset contains training images of identities, query images of identities and gallery images. Market-1501 contains training images of identities, queries of identities and gallery images respectively. We report Rank-1 of the Cumulative Matching Characteristic curve and Mean Average Precision for both datasets. We use the ResNet-50 [10] architecture with one fully connected embedding layer following the global average pooling. Three embedding dimensionalities are used in our experiments: 32, 64 and 128. For the baseline experiments, we add an additional classification linear layer, followed by the cross-entropy loss. For the hyperbolic version of the experiments, we map the descriptors to the Poincaré ball and apply multiclass logistic regression as described in Section 4. We found that in both cases the results are very sensitive to the learning rate schedules. We tried four schedules for learning dimensional descriptors for both baseline and hyperbolic versions. Two best performing schedules were applied for the and dimensional descriptors. In these experiments, we also found that smaller values of $c$ give better results. We finally set to . Therefore, based on the discussion in Section 4, our hyperbolic setting is quite close to Euclidean. The results are compiled in Table 5. We set starting learning rates to and for and correspondingly and multiply them by after each of the epochs and . The results are reported after the training epochs. As we can see in Table 5, the hyperbolic version generally performs better than the baseline, while the gap between the baseline and hyperbolic versions' results decreases for larger dimensionalities.
We have investigated the use of hyperbolic spaces for image embeddings. The models that we have considered use Euclidean operations in most layers, and use the exponential map to move from the Euclidean to hyperbolic spaces at the end of the network (akin to the normalization layers that are used to map from the Euclidean space to Euclidean spheres). The approach that we investigate here is thus compatible with existing backbone networks trained in Euclidean geometry.
At the same time, we have shown that across a number of tasks, in particular in few-shot image classification, learning hyperbolic embeddings can result in a substantial boost in accuracy. We speculate that the negative curvature of hyperbolic spaces allows for embeddings that better conform to the intrinsic geometry of at least some image manifolds with their hierarchical structure.
Future work may include several potential modifications of the approach. We have observed that the use of hyperbolic embeddings improves performance for some problems and datasets, while not helping others. A better understanding of when and why the use of hyperbolic geometry is justified is therefore needed. Also, we note that while all hyperbolic geometry models are equivalent in the continuous setting, the fixed-precision arithmetic used in real computers breaks this equivalence. In practice, we observed that care should be taken about numeric precision effects (following [7], we clip the embeddings to minimize numerical errors during learning). Using other models of hyperbolic geometry may result in more favourable floating point performance.
Conf. Computer Vision and Pattern Recognition, CVPR, pages 3908–3916, 2015.
Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.