1 Introduction
Building vector representations of various objects is one of the central tasks of machine learning. Word embeddings such as GloVe Pennington et al. (2014) and Word2Vec Mikolov et al. (2013) are widely used in natural language processing, and a similar Prod2Vec Grbovic et al. (2015) approach is used in recommendation systems. Many algorithms have been proposed for graph embeddings, e.g., Node2Vec Grover and Leskovec (2016) and DeepWalk Perozzi et al. (2014). Recommendation systems often construct embeddings of a bipartite graph that describes interactions between users and items. Such embeddings can be constructed via matrix factorization techniques such as ALS Hu et al. (2008).
For a long time, embeddings were considered exclusively in Euclidean space, but hyperbolic space was shown to be more suitable for graph and word representations due to the underlying hierarchical structure Nickel and Kiela (2017, 2018); Tifrea et al. (2018). Going beyond spaces of constant curvature, a recent study Gu et al. (2019) proposes product spaces, which combine several copies of Euclidean, spherical and hyperbolic spaces. While these spaces demonstrate promising results, the optimal signature (types of combined spaces and their dimensions) has to be chosen via brute force, which may not be acceptable in large-scale applications.
In this paper, we propose a more general metric space called overlaying space, together with an optimization algorithm that trains the signature simultaneously with the embedding and thus avoids brute-forcing. We provide an extensive empirical evaluation to see whether complex non-Euclidean spaces are useful in practice. For this purpose, we first consider the graph reconstruction task with both distortion loss and a more realistic ranking loss. We also apply the proposed methods to train embeddings via DSSM Huang et al. (2013) to compare the spaces in information retrieval and recommendation tasks. We conclude that the proposed overlaying space outperforms the competitors in the graph reconstruction task with distortion loss, i.e., when the aim is to embed data preserving the distances. On the other hand, when ranking losses are optimized and the dimensionality is sufficiently large, the best results are achieved with the dot-product similarity. Dot products are often overlooked in graph embedding literature since they cannot be converted to a metric. Our experiments show that despite this shortcoming, dot products provide good-quality embeddings. We try to explain this and discuss the advantages of the dot-product similarity compared to metric spaces.
2 Background and related work
2.1 Embeddings and loss functions
For a graph $G = (V, E)$, an embedding is a mapping $f: V \to U$, where $U$ is a metric space equipped with a distance $d_U: U \times U \to \mathbb{R}_{+}$. (Footnote 1: Any discrete metric space corresponds to a weighted graph, so graph terminology is not restrictive.) On the graph, one can consider the shortest-path distance $d_G: V \times V \to \mathbb{R}_{+}$. In the graph reconstruction task, a good embedding is expected to preserve the original graph distances: $d_U(f(v), f(u)) \approx d_G(v, u)$. The most commonly used evaluation metric is distortion, which averages the relative errors of distance reconstruction over all pairs of nodes:
$$D_{avg} = \frac{2}{|V|(|V|-1)} \sum_{\{v,u\} \subseteq V,\ v \neq u} \frac{|d_U(f(v), f(u)) - d_G(v,u)|}{d_G(v,u)}. \tag{1}$$
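As an illustration of the distortion measure described above (the mean relative error of distance reconstruction over all node pairs), the following minimal Python sketch computes it from two distance matrices. The function name `average_distortion` and the toy matrices are illustrative assumptions, not part of the original experiments.

```python
from itertools import combinations


def average_distortion(graph_dist, emb_dist):
    """Mean relative error |d_U - d_G| / d_G over all unordered node pairs.

    graph_dist: square matrix of shortest-path distances d_G (all positive
                off the diagonal, i.e. the graph is assumed connected).
    emb_dist:   square matrix of distances between the embedded nodes.
    """
    n = len(graph_dist)
    pairs = list(combinations(range(n), 2))
    total = 0.0
    for v, u in pairs:
        total += abs(emb_dist[v][u] - graph_dist[v][u]) / graph_dist[v][u]
    return total / len(pairs)


# Toy example: path graph 0 - 1 - 2 and a hypothetical embedding
# that underestimates the distance between the two end nodes.
d_graph = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
d_embed = [[0, 1.0, 1.5], [1.0, 0, 1.0], [1.5, 1.0, 0]]
print(average_distortion(d_graph, d_embed))
```

Only the pair (0, 2) contributes a non-zero relative error here, so the result is that error divided by the three pairs.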
While commonly used in graph reconstruction, distortion is not the best choice for many practical applications. For example, in recommendation tasks, one usually deals with a partially observed graph (some positive and negative element pairs), so a huge graph distance between nodes in the observed part does not necessarily mean that the nodes are not connected by a short path in the full graph. Also, often only the order of the nearest elements is important, while predicting distances to faraway objects is not critical. In such cases, it is more reasonable to consider a local ranking metric, e.g., the mean average precision (mAP), which measures the relative closeness of the relevant (adjacent) nodes compared to the others:
$$\mathrm{mAP} = \frac{1}{|V|} \sum_{v \in V} \frac{1}{|N_v|} \sum_{u \in N_v} \frac{|N_v \cap B_v(u)|}{|B_v(u)|}, \qquad B_v(u) = \{w \in V \setminus \{v\} : d_U(f(v), f(w)) \le d_U(f(v), f(u))\}, \tag{2}$$
where $N_v$ is the set of neighbors of $v$. (Footnote 2: For mAP, the relevance labels are assumed to be binary (unweighted graphs). If a graph is weighted, then we say that $N_v$ consists of the closest element to $v$, or of several closest elements if the distances to them are equal.)
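The ranking metric can be sketched in a few lines of Python. This sketch assumes the usual definition of mAP for graph embeddings: for each node, average over its neighbours the fraction of true neighbours among all nodes that are at least as close in the embedding. The function name, the dictionary-based adjacency format, and the choice to skip isolated nodes are assumptions of this example.

```python
def mean_average_precision(adjacency, emb_dist):
    """mAP of an embedding with respect to graph adjacency.

    adjacency: dict mapping node index -> set of neighbouring node indices.
    emb_dist:  square matrix of distances between embedded nodes.
    Isolated nodes are skipped (an assumption of this sketch).
    """
    nodes = sorted(adjacency)
    ap_sum = 0.0
    counted = 0
    for v in nodes:
        neighbours = adjacency[v]
        if not neighbours:
            continue
        precision_sum = 0.0
        for u in neighbours:
            # Nodes other than v that are at least as close to v as u is.
            ball = [w for w in nodes
                    if w != v and emb_dist[v][w] <= emb_dist[v][u]]
            hits = sum(1 for w in ball if w in neighbours)
            precision_sum += hits / len(ball)
        ap_sum += precision_sum / len(neighbours)
        counted += 1
    return ap_sum / counted


# Path graph 0 - 1 - 2.
path = {0: {1}, 1: {0, 2}, 2: {1}}
# Embedding on a line at positions 0, 1, 2: neighbours are always nearest.
good = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# Embedding at positions 0, 2, 1: the end nodes are placed too close together.
bad = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
print(mean_average_precision(path, good), mean_average_precision(path, bad))
```

Note that the metric depends only on the ordering of distances from each node, not on their absolute values, which is why it can disagree with distortion.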
Note that mAP cannot be directly optimized since it is not differentiable. In our experiments, we use the following probabilistic loss function as a proxy (Footnote 3: We have also experimented with other ways to convert distances to probabilities; see the supplementary materials for more details.):
$$L = -\sum_{(v,u) \in E} \log \frac{\exp(-d_U(f(v), f(u)))}{\sum_{w \in V} \exp(-d_U(f(v), f(w)))}. \tag{3}$$
Note that when substituting $d_U(x, y) = -x^T y$ (assuming that $U \subseteq \mathbb{R}^d$, so the dot product is defined), we get the standard word2vec loss function.
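A probabilistic proxy of this kind can be sketched as a softmax over negative distances. The exact form below, including the normalisation over all nodes, is an assumption of this example rather than a statement of the authors' implementation; the function name is likewise hypothetical.

```python
import math


def softmax_ranking_loss(edges, emb_dist):
    """Negative log-likelihood of observed edges under a softmax over -distance.

    edges:    iterable of (v, u) index pairs that are adjacent in the graph.
    emb_dist: square matrix of distances between embedded nodes.
    The normaliser runs over all nodes w (an assumption of this sketch).
    """
    n = len(emb_dist)
    loss = 0.0
    for v, u in edges:
        log_norm = math.log(sum(math.exp(-emb_dist[v][w]) for w in range(n)))
        # -log( exp(-d(v,u)) / sum_w exp(-d(v,w)) ) = d(v,u) + log_norm
        loss += emb_dist[v][u] + log_norm
    return loss


# Two nodes at distance 1 joined by a single edge.
toy_dist = [[0.0, 1.0], [1.0, 0.0]]
print(softmax_ranking_loss([(0, 1)], toy_dist))
```

Because the loss is a smooth function of the distances, it can be minimised with any gradient-based optimiser, unlike mAP itself.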
2.2 Spaces, distances and similarities
In the previous section, we assumed that $d_U$ is an arbitrary distance. In this section, we discuss particular choices often assumed in the literature. For many years, Euclidean space was the primary choice for structured data embeddings Goyal and Ferrara (2018). For two points $x, y \in \mathbb{R}^d$, the Euclidean distance is defined as $d_E(x, y) = \|x - y\|_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$.
Spherical spaces were also found to be suitable for some applications Liu et al. (2017); Qian et al. (2004); Wilson et al. (2014). Indeed, in practice, vector representations are often normalized, so cosine similarity between vectors is a natural way to measure their similarity. This naturally corresponds to a spherical space $S^d = \{x \in \mathbb{R}^{d+1} : \|x\|_2 = 1\}$ equipped with a spherical distance: $d_S(x, y) = \arccos(x^T y)$.
In recent years, hyperbolic spaces have also started to gain popularity. Hyperbolic embeddings have shown their superiority over Euclidean ones in a number of tasks, such as graph reconstruction and word embedding Nickel and Kiela (2017, 2018); Sala et al. (2018); Tifrea et al. (2018). To represent the points, early approaches were based on the Poincaré model of the hyperbolic space Nickel and Kiela (2017), but later it was shown that the hyperboloid (Lorentz) model may lead to more stable results Nickel and Kiela (2018). In this work, we also adopt the hyperboloid model: $H^d = \{x \in \mathbb{R}^{d+1} : \langle x, x \rangle_h = 1,\ x_1 > 0\}$, and the distance is defined as
$$d_H(x, y) = \operatorname{arccosh}(\langle x, y \rangle_h), \qquad \langle x, y \rangle_h = x_1 y_1 - \sum_{i=2}^{d+1} x_i y_i. \tag{4}$$
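The three basic distances discussed in this section can be sketched as follows. The hyperboloid sign convention used here (first coordinate positive in the inner product, points satisfying x_1^2 minus the sum of the remaining squares equal to 1) is an assumption of this example; the opposite sign convention is equally common. The clamping of arguments guards against floating-point rounding and is also an addition of this sketch.

```python
import math


def euclidean_distance(x, y):
    """Standard Euclidean distance between two coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def spherical_distance(x, y):
    """Angle between two unit vectors on the sphere."""
    dot = sum(a * b for a, b in zip(x, y))
    # Clamp to [-1, 1] so rounding errors do not leave the domain of acos.
    return math.acos(max(-1.0, min(1.0, dot)))


def hyperbolic_distance(x, y):
    """Distance on the hyperboloid x_1^2 - sum_{i>1} x_i^2 = 1, x_1 > 0."""
    inner = x[0] * y[0] - sum(a * b for a, b in zip(x[1:], y[1:]))
    # Clamp to >= 1 so rounding errors do not leave the domain of acosh.
    return math.acosh(max(1.0, inner))


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
print(spherical_distance((1.0, 0.0), (0.0, 1.0)))
print(hyperbolic_distance((1.0, 0.0), (math.cosh(1.0), math.sinh(1.0))))
```

The last call measures the distance between the hyperboloid "origin" and a point obtained by moving one unit along a geodesic, so the result should be close to 1.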
Going even further, a recent paper Gu et al. (2019) proposes more complex product spaces that combine several copies of Euclidean, spherical, and hyperbolic spaces. Namely, the overall dimension $d$ is split into $k$ parts (smaller dimensions): $d = \sum_{i=1}^{k} d_i$, $d_i > 0$. Each part is associated with a space $D_i \in \{E, S, H\}$ and a scale coefficient $w_i \in \mathbb{R}_{+}$. Varying scale coefficients corresponds to changing the curvature of a hyperbolic/spherical space, while in Euclidean space this coefficient is not used ($w_i = 1$). Then, the distance in the product space is defined as:
$$d_P(x, y) = \sqrt{\sum_{i=1}^{k} w_i\, d_{D_i}^2\big(x[b_{i-1}+1 : b_i],\ y[b_{i-1}+1 : b_i]\big)},$$
where $b_0 = 0$, $b_i = b_{i-1} + d_i$, and $x[s:e]$ is a vector $(x_s, \dots, x_e)$. If $k = 1$, we get a standard Euclidean, spherical or hyperbolic space. In Gu et al. (2019), it is proposed to simultaneously learn an embedding and scale coefficients $w_i$. However, choosing the optimal signature (how to split $d$ into $d_i$ and which types of spaces to choose) is a challenge. A heuristic proposed in Gu et al. (2019) allows one to guess the types of spaces if the $d_i$'s are given. In the setting examined there, this heuristic agrees well with the experiments on the three considered datasets. The generalizability of this idea to other datasets and configurations is unclear. In addition, it cannot be applied if a dataset is partially observed (e.g., there are several known positive and negative pairs), i.e., when graph distances cannot be computed. Hence, in practice it is more reliable to choose a signature via brute force, which can be inapplicable on large datasets.
Another way to measure objects' similarity, which is usually overlooked in embedding literature but is often used in practical applications, is via the dot product of vectors $x^T y$. In this paper, we stress that the dot-product similarity has some advantages over other spaces. In particular, it allows us to easily differentiate between more popular and less popular items (the vector norm can be considered as a measure of popularity). This feature is usually attributed to hyperbolic spaces, but it better agrees with the dot-product similarity. The main shortcoming of the dot product is that it does not correspond to a metric. However, it may be used to predict similarity or dissimilarity between objects, which is often sufficient in practice, and in some cases it is able to preserve the distances.
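The product-space distance described earlier in this section combines per-component distances over disjoint coordinate blocks. The sketch below assumes the combination is the square root of a weighted sum of squared component distances; the tuple layout `(start, end, weight, dist_fn)` and all function names are hypothetical choices made for this example.

```python
import math


def euclid(a, b):
    """Euclidean distance between two coordinate blocks."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def sphere(a, b):
    """Spherical distance between two unit-norm coordinate blocks."""
    dot = sum(p * q for p, q in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))


def product_distance(x, y, parts):
    """Distance in a product space.

    parts: list of (start, end, weight, dist_fn), where x[start:end] is the
           block of coordinates belonging to one component space.
    """
    total = 0.0
    for start, end, weight, dist_fn in parts:
        total += weight * dist_fn(x[start:end], y[start:end]) ** 2
    return math.sqrt(total)


# A 4-coordinate vector: first two coordinates Euclidean, last two on a circle.
x = [0.0, 0.0, 1.0, 0.0]
y = [3.0, 4.0, 0.0, 1.0]
signature = [(0, 2, 1.0, euclid), (2, 4, 1.0, sphere)]
print(product_distance(x, y, signature))
```

The `signature` list is exactly what has to be chosen in advance for a product space: changing the block boundaries or the component types requires retraining the embedding.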
2.3 Optimization
Gradient optimization in Euclidean space is straightforward, while for spherical or hyperbolic embeddings, we additionally have to ensure that points stay on the surface. In previous works, Riemannian SGD (RSGD) was used to solve this problem Bonnabel (2013). In short, it projects Euclidean gradients onto the tangent space at a point, and then uses a so-called exponential map to move the point along the surface according to the gradient projection. For product spaces, a generalization of the exponential map has been proposed Ficken (1939); Turaga and Srivastava (2016).
In Wilson and Leimeister (2018), the authors compare RSGD with the retraction technique (points are moved along the gradients in the ambient space and are projected onto the surface after each update). In their experiments, the retraction technique requires from 2% to 46% more iterations, depending on the learning rate. However, the exponential update step takes longer, hence the advantage of RSGD in terms of computation time depends on the specific implementation.
3 Overlaying spaces
In this section, we propose a new concept of overlaying spaces. This concept generalizes product spaces and also allows us to make the signature (types of combined spaces) trainable. Our main idea is to divide the embedding vector into several segments that, unlike in product spaces, may intersect, with each segment corresponding to its own space. Then, instead of brute-forcing a discrete signature, we optimize the weights of the signature elements.
Importantly, we allow the same coordinates of an embedding vector to define distances in spaces of different geometry. For this purpose, we need to map a vector $x \in \mathbb{R}^m$ (for any $m$) to a point in Euclidean, hyperbolic and spherical space. Let us denote this mapping by $f_D$, $D \in \{E, H, S\}$. Obviously, for Euclidean space, we may take $f_E(x) = x$. For hyperbolic and spherical spaces, we set
$$f_H(x) = \Big(\sqrt{1 + \textstyle\sum_{i=1}^{m} x_i^2},\ x_1, \dots, x_m\Big), \qquad f_S(x) = \frac{x}{\|x\|_2}. \tag{5}$$
Note that an $m$-dimensional vector is mapped into Euclidean and hyperbolic spaces of dimension $m$ and into a spherical space of dimension $m - 1$. While it is possible to parameterize points in $S^m$ by $m$-dimensional vectors, the most straightforward mapping usually used in practice is the one in (5). (Footnote 4: For instance, Gu et al. (2019) uses $(m+1)$-dimensional vectors for storing points in both $H^m$ and $S^m$.)
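The mappings from an unconstrained vector to each geometry can be sketched as below. This assumes the spherical map is plain normalisation (which loses one dimension) and the hyperbolic map prepends the coordinate that places the point on the hyperboloid x_1^2 minus the sum of the remaining squares equal to 1; the function names are hypothetical, and the zero vector is not handled by `to_sphere`.

```python
import math


def to_euclidean(x):
    """Identity map: the vector already is a point of Euclidean space."""
    return list(x)


def to_sphere(x):
    """Normalise to unit length; undefined for the zero vector."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def to_hyperboloid(x):
    """Prepend the coordinate that puts the point on the hyperboloid."""
    return [math.sqrt(1.0 + sum(v * v for v in x))] + list(x)


point = [3.0, 4.0]
print(to_sphere(point))
h = to_hyperboloid(point)
# The hyperboloid constraint should hold up to rounding.
print(h[0] ** 2 - sum(v * v for v in h[1:]))
```

Since all three maps are differentiable almost everywhere, gradients of a loss computed on the mapped points can be propagated back to the unconstrained vector.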
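Now we are ready to define an overlaying space. Consider two vectors $x, y \in \mathbb{R}^d$. Let $p_1, \dots, p_k$ denote some subsets of coordinates, i.e., $p_i \subseteq \{1, \dots, d\}$. We assume that together the subsets cover all coordinates, i.e., $\bigcup_{i=1}^{k} p_i = \{1, \dots, d\}$. By $x[p_i]$ we denote the subvector of $x$ induced by $p_i$. Let $D_1, \dots, D_k \in \{E, S, H\}$. We define $d_i(x, y) = d_{D_i}\big(f_{D_i}(x[p_i]),\ f_{D_i}(y[p_i])\big)$ and aggregate these distances with arbitrary positive weights $w_1, \dots, w_k$: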
(6)  
The obtained space equipped with distance , , or we call an overlaying space. It is defined by , , and . Note that it is sufficient to assume that spherical and hyperbolic spaces have curvatures and , respectively, since changing curvature is equivalent to changing scale which is captured by . The following statement follows from the definition above and from the fact that , , and are distances.
Statement 1
If and , then , , are distances on , i.e., they satisfy the metric axioms.
It is easy to see that overlaying spaces generalize product spaces. Indeed, if we assume that the coordinate subsets are pairwise disjoint, then an overlaying space reduces to a product space. However, allowing the subsets to intersect gives a larger expressive power for the same embedding dimension.
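To make the construction concrete, the following sketch computes an overlaying-space distance from possibly intersecting coordinate subsets. Two aggregation rules are shown, a weighted sum and the square root of a weighted sum of squares; both forms, as well as every function name and the `(indices, dist_fn, weight)` layout, are assumptions of this example and not taken from the original implementation.

```python
import math


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def dist_E(a, b):
    """Euclidean distance between raw subvectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dist_S(a, b):
    """Spherical distance after normalising both subvectors."""
    cos = _dot(a, b) / math.sqrt(_dot(a, a) * _dot(b, b))
    return math.acos(max(-1.0, min(1.0, cos)))


def dist_H(a, b):
    """Hyperbolic distance after lifting both subvectors to the hyperboloid."""
    a0 = math.sqrt(1.0 + _dot(a, a))
    b0 = math.sqrt(1.0 + _dot(b, b))
    return math.acosh(max(1.0, a0 * b0 - _dot(a, b)))


def overlaying_distance(x, y, components, aggregation="l1"):
    """components: list of (indices, dist_fn, weight); indices may overlap."""
    terms = []
    for indices, dist_fn, weight in components:
        xs = [x[i] for i in indices]
        ys = [y[i] for i in indices]
        terms.append((weight, dist_fn(xs, ys)))
    if aggregation == "l1":
        return sum(w * d for w, d in terms)
    if aggregation == "l2":
        return math.sqrt(sum(w * d * d for w, d in terms))
    raise ValueError("unknown aggregation: " + aggregation)


# The same two coordinates are read through two different geometries.
x = [1.0, 0.0]
y = [0.0, 1.0]
components = [([0, 1], dist_E, 1.0), ([0, 1], dist_S, 1.0)]
print(overlaying_distance(x, y, components, "l1"))
print(overlaying_distance(x, y, components, "l2"))
```

With pairwise disjoint index lists and the second aggregation, this reduces to a product-space distance; the overlap in `components` is what a product space cannot express.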
4 Optimization in overlaying spaces
4.1 Universal signature
Overlaying spaces defined in the previous section are flexible and allow capturing various geometries. However, similarly to product spaces, they need a signature (the coordinate subsets and the space types) to be chosen in advance. In this section, we show that a universal signature can be chosen, so no brute force is needed to choose the best signature for a particular dataset.
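Let $t \ge 0$ denote the depth (complexity) of the signature for a $d$-dimensional embedding. Each layer $l$, $0 \le l \le t$, of the signature consists of $2^l$ subsets of coordinates:
$$p^l_i = \Big\{ \Big\lfloor \tfrac{d\,(i-1)}{2^l} \Big\rfloor + 1, \dots, \Big\lfloor \tfrac{d\,i}{2^l} \Big\rfloor \Big\}, \qquad 1 \le i \le 2^l.$$
Each $p^l_i$ is associated with Euclidean, spherical and hyperbolic spaces simultaneously. The corresponding weights are denoted by $w^l_{i,E}$, $w^l_{i,S}$, $w^l_{i,H}$. Then, the distance is computed according to (6); see Figure 1 for an illustration of the procedure.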
Informally, we first consider the original vectors and compute Euclidean, spherical and hyperbolic distances between them. Then, we split the vectors into two halves and for each half we also compute all three distances, and so on. Finally, all the obtained distances are aggregated with the weight coefficients according to (6). Note that for a signature of depth $t$ we have $3(2^{t+1} - 1)$ different weights in general, but with the aggregation based on squared distances this value may be reduced to $5 \cdot 2^t - 2$: for the Euclidean space, the squared distances between subvectors at the upper layers can be split into terms corresponding to smaller subvectors, so we essentially need only the last layer with $2^t$ terms. Recall that in product spaces the weights correspond to curvatures of the combined spaces. In our case, they also play another important role: weights allow us to balance between different spaces. Indeed, for each subset of coordinates, we simultaneously compute the distance between the points assuming each of the combined spaces. Varying the weights, we can increase or decrease the contribution of a particular space to the distance. As a result, the universal signature allows us to learn the optimal combination of spaces, which does not have to be a product space since all weights can be nonzero.
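The layered splitting described above (whole vector, then halves, then quarters, and so on) can be sketched as follows. The use of 0-based index lists, the floor-based split points, and both function names are assumptions of this example. The weight counts follow from three geometries per subset, with the Euclidean terms collapsing to the last layer when squared distances are summed.

```python
def universal_signature(d, t):
    """Coordinate subsets of a depth-t universal signature for dimension d.

    Returns a list of layers; layer l holds 2**l contiguous index lists
    (0-based) that together cover range(d).
    """
    layers = []
    for l in range(t + 1):
        parts = 2 ** l
        layer = [list(range(d * i // parts, d * (i + 1) // parts))
                 for i in range(parts)]
        layers.append(layer)
    return layers


def weight_counts(t):
    """(number of weights in general, number after merging Euclidean terms)."""
    subsets = 2 ** (t + 1) - 1      # 1 + 2 + ... + 2**t subsets in total
    full = 3 * subsets              # Euclidean, spherical, hyperbolic each
    reduced = 2 * subsets + 2 ** t  # Euclidean kept only on the last layer
    return full, reduced


for layer in universal_signature(10, 1):
    print(layer)
print(weight_counts(1))
```

Each index list produced here would be paired with all three geometries, so the output of `universal_signature` can be expanded into the component list of an overlaying distance without any dataset-specific choice.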
4.2 Optimization
In this section, we describe how we embed into the overlaying space. Although Riemannian SGD (see Section 2.3) is a good solution from the theoretical point of view, in practice, due to errors in storing and processing real numbers, it may cause some problems. A point that we assume to lie on a surface (sphere or hyperboloid) usually does not lie on it numerically. Due to the accumulation of numerical errors, with each iteration of RSGD, the point may move away from the surface. Therefore, in practice, after each step, all embeddings are explicitly projected onto the surface, which may slow down the algorithm. Moreover, RSGD is not applicable if one needs to process the output of a neural network, which cannot be required to belong to a given surface (e.g., to have unit norm). As a result, before finding the hyperbolic distance between two outputs of a neural network in a Siamese Bromley et al. (1994) setup, one first needs to somehow map them to a hyperboloid.
Instead of RSGD, we store the embedding vectors in Euclidean space and calculate distances between them using the mappings (5) to the corresponding surfaces. Thus, we are able to evaluate the distances between the outputs of neural networks and also use conventional optimizers. To optimize embeddings, we first map Euclidean vectors into the corresponding spaces, calculate distances and the loss function, and then backpropagate through the projection functions. To improve the convergence, we use Adam Kingma and Ba (2014) instead of the standard SGD. Applying this to product spaces, we achieve results similar to the original paper Gu et al. (2019) (see Table 1 of the supplementary materials), where RSGD was used with learning-rate brute-forcing, a custom learning rate for curvature coefficients, and other tricks.
5 Experiments
5.1 Graph reconstruction
To compare with previous research, we start with the graph reconstruction task with distortion loss (1). The goal is to embed all nodes of a given graph into a $d$-dimensional space approximating the pairwise graph distances between the nodes. Similarly to Gu et al. (2019), we use the USCA312 dataset of distances between North American cities Burkardt (2011) (weighted complete graph), a graph of computer science Ph.D. advisor-advisee relationships Bonacich (2008), a power grid distribution network with backbone structure Watts (1998), and a dense social network from Facebook Leskovec and Mcauley (2012). We also created a new dataset, obtained by launching breadth-first search on the Wikipedia category graph, starting from the “Linear Algebra” category with search depth limited to 6. Further, we refer to this dataset as WLA6 (Footnote 5: The dataset is publicly available.), and we expect it to be well described by hyperbolic geometry due to its hierarchical structure.
Dataset  USCA312  CS PhDs  Power  Facebook  WLA6

Nodes  312  1025  4941  4039  3227 
Edges  48516 (weighted)  1043  6594  88234  3604 
We compare all spaces discussed in the paper: standard Euclidean, hyperbolic and spherical spaces (with trainable curvature); product spaces with all signatures from Gu et al. (2019); the proposed overlaying space; and also two dot-product-based distances. For the overlaying space, we take depth $t = 0$, which gives a weighted combination of Euclidean, hyperbolic and spherical distances, and depth $t = 1$, where one more layer is added (see Figure 1). For depth $t = 1$, we compare two aggregation types from (6). For the dot-product-based distances, we consider two functions of the dot product with a trainable parameter. While these functions are not distances (they do not satisfy the metric axioms), we add them to analyze whether they are still able to approximate graph distances. Similarly to Gu et al. (2019), we use the same fixed embedding dimension for all spaces. However, for a fair comparison, we fix the number of stored values for each embedding. In our case, this means that the dimension of a spherical space is smaller by 1, since for each spherical space we store one additional value (see (5)). (Footnote 6: In the supplementary materials we evaluate spherical spaces without this reduction to compare with Gu et al. (2019).) All models are trained to minimize distortion (1). The code of our experiments supplements the submission (Footnote 7: https://github.com/shevkunov/OverlayingSpacesandPracticalApplicabilityofComplexGeometries).
Signature  USCA312  CS PhDs  Power  Facebook  WLA6

0.00318  0.0475  0.0408  0.0487  0.0530  
0.01114  0.0443  0.0348  0.0483  0.0279  
0.00986  0.0524  0.0481  0.0597  0.0666  
0.00573  0.0345  0.0255  0.0372  0.0279  
0.00753  0.0543  0.0505  0.0633  0.0727  
0.00652  0.0346  0.0255  0.0336  0.0308  
0.00592  0.0344  0.0273  0.0439  0.0356  
0.00758  0.0761  0.0716  0.0990  0.1231  
0.00383  0.0395  0.0335  0.0577  0.0474  
0.04005  0.0412  0.0461  0.0236  0.0296  
0.08306  0.0424  0.0505  0.0192  0.0270  
0.00356  0.0368  0.0281  0.0458  0.0286  
0.00330  0.0300  0.0231  0.0371  0.0272  
0.00530  0.0328  0.0246  0.0324  0.0278 
The results are shown in Table 8. It can be clearly seen that the overlaying spaces outperform other metric spaces, and the best overlaying space (among those considered) is the one with aggregation and complexity . Interestingly, the performance of such an overlaying space is often better than that of the best product space. Recall that we also added two dot-product-based functions to the comparison. These functions are not proper distances, hence their performance is highly unstable for this task: for example, on the USCA312 dataset the obtained distortion is an order of magnitude worse than the best one. However, on some datasets (Facebook and WLA6) the performance is quite good, and on Facebook one of them performs much better than all other solutions. We conclude that for graph reconstruction with distortion loss the dot products are worth trying, but their performance is very unstable, in contrast to overlaying spaces, which show good and stable results on all datasets.
As discussed in Section 2.1, in many practical applications, only the order of the nearest neighbors matters. In this case, it is more reasonable to use mAP (2). In previous work Gu et al. (2019), mAP was also reported, but the models were trained to minimize distortion. In our experiments, we observed that distortion optimization weakly correlates with mAP optimization. Hence, we minimize the proxy loss defined in equation (3). The results are shown in Table 9, and the obtained values for mAP are indeed much better than the ones obtained with distortion optimization Gu et al. (2019), i.e., it is important to use an appropriate loss function. According to Table 9, among the metric spaces, the best results are achieved with the overlaying spaces (especially for aggregation with ). However, in contrast to distortion loss, ranking based on the dot product clearly outperforms all metric spaces. This result is important from a practical point of view: there is no need to use complex geometries if the goal is to preserve the local neighborhood.
Signature  USCA312  CS PhDs  Power  Facebook  WLA6

0.9290  0.9487  0.9380  0.7876  0.7199  
0.9173  0.9399  0.9385  0.7997  0.9617  
0.9271  0.9586  0.9481  0.7795  0.7200  
0.9247  0.9481  0.9415  0.8084  0.9682  
0.9178  0.9613  0.9517  0.7706  0.7109  
0.9274  0.9647  0.9524  0.8005  0.9770  
0.9364  0.9671  0.9508  0.7979  0.8597  
0.9311  0.9013  0.8101  0.7132  0.4957  
0.9343  0.9504  0.9397  0.7690  0.5876  
1  1  0.9983  0.8745  0.9990  
0.9522  0.9879  0.9728  0.8093  0.6759  
0.9522  0.9904  0.9762  0.8185  0.9598  
0.9522  0.9938  0.9907  0.8326  0.9694 
5.2 DSSM with custom distances
From a practical point of view, it is much more interesting to analyze whether an embedding is able to generalize to unseen examples. For instance, an embedding can be produced by a neural network based on objects’ characteristics, such as text descriptions or images. In this section, we analyze whether it is reasonable to use complex geometries in such a scenario. For this purpose, we trained a classic DSSM model Huang et al. (2013) (Footnote 10: We changed the dense layer sizes to achieve the required embedding length and used a more complex text tokenization with char bigrams, trigrams and words, instead of just char trigrams.) on a private Wikipedia search dataset consisting of 491044 pairs (search query, relevant page); examples are given in Table 4. All queries are divided into train, validation, and test sets, and for each signature the optimal iteration was selected on the validation set.
Query  Web site 

Kris Wallace  en.wikipedia.org/wiki/Chris_Wallace 
1980: Mitsubishi produces one million cars…  en.wikipedia.org/wiki/Mitsubishi_Motors 
code napoleon  en.wikipedia.org/wiki/Napoleonic_Code 
Table 5 compares all models for two embedding sizes: short embeddings of length 10 and “industrial-size” embeddings of length 256. For short embeddings, we see that a product space based on spherical geometry is indeed useful. However, in large dimensions, the best results are achieved with the standard dot product, which calls into question the utility of complex geometries in industrial applications.
Signature  Test mAP 

0.4459  
0.4047  
0.4687  
0.4492  
0.4720  
0.3109  
0.3681  
0.3877  
0.3264  
0.4194  
0.4562  
0.4498  
0.4456 
Signature  Test mAP 

0.717  
0.412 (Footnote 11: The gap between and may seem suspicious, but in Table 5 of Gu et al. (2019) a similar pattern is observed.)  
0.588  
0.547  
0.662  
0.501  
0.621  
0.701  
0.738  
0.677  
0.662 
5.3 A bipartite graph reconstruction
In Section 2.2 we already briefly discussed the advantages of dot products over metric spaces. Let us illustrate this intuition and show that dot products are indeed better suited to data in which a few objects are much more popular than the others. For this purpose, we perform graph reconstruction on a synthetic bipartite graph with two sets of size 20 and 700 and 5% edge probability (isolated nodes were removed and the remaining graph is connected). Clearly, in the obtained graph there are a few popular nodes and many nodes of small degree. Table 11 compares the performance of the best metric space with the dot-product performance. As we can see, this experiment confirms our assumption that such graphs are poorly embedded in metric spaces. In the supplementary materials, we show the results for all other metric spaces and also discuss why dot products are suitable for certain data structures and can outperform other spaces in practical applications.
mAP  distortion  

best metric space (type)  0.82 ()  0.082 ( 
0.86  0.079 
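A graph of the kind used in this experiment can be generated with a short sketch like the one below, which samples each left-right pair independently with the given probability and drops isolated nodes implicitly. The function name, the seed handling, and the node numbering are assumptions of this example, and the connectivity check mentioned in the text is omitted for brevity.

```python
import random


def random_bipartite_graph(n_left, n_right, p, seed=0):
    """Sample a bipartite graph with independent edge probability p.

    Left nodes are numbered 0..n_left-1, right nodes n_left..n_left+n_right-1.
    Returns (edges, degree); nodes without edges do not appear in degree,
    which corresponds to removing isolated nodes.
    """
    rng = random.Random(seed)
    edges = [(i, n_left + j)
             for i in range(n_left)
             for j in range(n_right)
             if rng.random() < p]
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return edges, degree


edges, degree = random_bipartite_graph(20, 700, 0.05, seed=0)
left = [deg for node, deg in degree.items() if node < 20]
right = [deg for node, deg in degree.items() if node >= 20]
print(len(left), sum(left) / len(left))
print(len(right), sum(right) / len(right))
```

With these sizes, the expected degree of a node in the small set is 35, while a node in the large set expects a single edge, which is the popularity imbalance the experiment relies on.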
6 Conclusion
In this paper, we proposed overlaying spaces that have better or comparable performance relative to the best product space in the graph reconstruction task, but do not require signature brute-forcing. Improvements are observed for both global distortion and local mAP loss functions. However, the conventional dot product outperforms all considered methods in the graph reconstruction task for mAP loss. In the DSSM setup with large embeddings, it also outperforms all methods. This clearly shows the limitations of hyperbolic, product, and overlaying spaces and the necessity of comparison with the dot product in addition to Euclidean and spherical distances when exploring different spaces for vector representations. On the other hand, custom spaces are useful in the DSSM setup for low-dimensional vector representations. This can be useful if there is a need to store very large embedding databases, for example in recommendation systems.
We note that some of our conclusions can potentially be biased towards the particular datasets considered and may not hold for datasets of a different nature. In particular, in the DSSM-based analysis we considered a particular web search dataset, and for other datasets the impact of using complex geometries can be different.
Overlaying spaces proposed in the current paper are metric spaces and can be used in methods based on distances between the elements. However, more complex operations, e.g., algebraic operations over elements in an overlaying space, are questionable. In this case, one may still use the proposed idea and search for the optimal product space signature through overlaying space training with additional regularization. This question has not been considered in this paper and is a subject of a separate study.
Finally, it is important to stress that while vector spaces and dot-product similarities are often used in practice, research papers usually compare new complex geometries with the Euclidean space. This may cause confusion and a false impression that complex geometries improve over widely used systems. Our results show that a comparison with the standard dot-product similarity is necessary for research articles of this kind.
References
 Bonacich (2008) Phillip Bonacich. 2008. Book Review: W. de Nooy, A. Mrvar, and V. Batagelj Exploratory Social Network Analysis With Pajek. (2004). Sociological Methods & Research  SOCIOL METHOD RES 36 (05 2008), 563–564. https://doi.org/10.1177/0049124107306674
 Bonnabel (2013) Silvere Bonnabel. 2013. Stochastic Gradient Descent on Riemannian Manifolds. IEEE Trans. Automat. Control 58 (2013), 2217–2229.
 Bromley et al. (1994) Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems. 737–744.
 Burkardt (2011) John Burkardt. 2011. Cities – City Distance Datasets. https://people.sc.fsu.edu/~jburkardt/datasets/cities/cities.html
 Ficken (1939) Frederick Arthur Ficken. 1939. The Riemannian and affine differential geometry of productspaces. (1939), 892–913.
 Goyal and Ferrara (2018) Palash Goyal and Emilio Ferrara. 2018. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems 151 (2018), 78–94. https://doi.org/10.1016/j.knosys.2018.03.022
 Grbovic et al. (2015) Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. Ecommerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 1809–1818.
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. CoRR abs/1607.00653 (2016). arXiv:1607.00653 http://arxiv.org/abs/1607.00653
 Gu et al. (2019) Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Ré. 2019. Learning mixed-curvature representations in product spaces. International Conference on Learning Representations (ICLR) (2019).
 Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In IEEE International Conference on Data Mining (ICDM 2008). 263–272. http://yifanhu.net/PUB/cf.pdf
 Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. ACM International Conference on Information and Knowledge Management (CIKM).
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (12 2014).
 Leskovec and Mcauley (2012) Jure Leskovec and Julian J. Mcauley. 2012. Learning to Discover Social Circles in Ego Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 539–547. http://papers.nips.cc/paper/4532learningtodiscoversocialcirclesinegonetworks.pdf

 Liu et al. (2017) Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. 2017. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 212–220.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). http://dblp.unitrier.de/db/journals/corr/corr1301.html#abs13013781
 Nickel and Kiela (2017) Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems. 6338–6347. http://papers.nips.cc/paper/7213poincareembeddingsforlearninghierarchicalrepresentations.pdf
 Nickel and Kiela (2018) Maximillian Nickel and Douwe Kiela. 2018. Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry. In International Conference on Machine Learning. 3776–3785. https://arxiv.org/abs/1806.03417
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D141162
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. CoRR abs/1403.6652 (2014). arXiv:1403.6652 http://arxiv.org/abs/1403.6652
 Qian et al. (2004) Gang Qian, Shamik Sural, Yuelong Gu, and Sakti Pramanik. 2004. Similarity between Euclidean and Cosine Angle Distance for Nearest Neighbor Queries. In Proceedings of the 2004 ACM Symposium on Applied Computing (SAC ’04). Association for Computing Machinery, New York, NY, USA, 1232–1237. https://doi.org/10.1145/967900.968151
 Sala et al. (2018) Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. 2018. Representation Tradeoffs for Hyperbolic Embeddings. In International Conference on Machine Learning. 4457–4466.
 Tifrea et al. (2018) Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. 2018. Poincaré GloVe: Hyperbolic Word Embeddings. arXiv preprint arXiv:1810.06546 (2018).
 Turaga and Srivastava (2016) Pavan K Turaga and Anuj Srivastava. 2016. Riemannian Computing in Computer Vision. Springer.
 Watts (1998) Duncan J. Watts and Steven H. Strogatz. 1998. Collective Dynamics of Small World Networks. Nature 393 (1998), 440–442. https://doi.org/10.1007/9783658217426_130
 Wilson and Leimeister (2018) Benjamin Wilson and Matthias Leimeister. 2018. Gradient descent in hyperbolic space. arXiv preprint arXiv:1805.08207 (2018).
 Wilson et al. (2014) Richard C Wilson, Edwin R Hancock, Elżbieta Pekalska, and Robert PW Duin. 2014. Spherical and hyperbolic embeddings of data. IEEE transactions on pattern analysis and machine intelligence 36, 11 (2014), 2255–2269.
Supplementary materials
A Experimental setup
A.1 Training details
All models discussed in Section 5.1 were trained for 2000 iterations. If more than one learning rate was used for a certain dataset (due to problems with the convergence of individual models), all the spaces were evaluated for all learning rates and the best result was reported for each space. For distortion, the learning rate was 0.1 for all datasets except USCA312 (Cities), where we used 0.1 and 0.01. For mAP, the learning rate 0.1 was used for all datasets except USCA312 and CS PhDs, where we used 0.01 and 0.05 for both datasets.
For the experiments in Section 5.2, we used 5000 iterations for short embeddings and 1000 for long ones (long embeddings converged faster). Hardnegative mining was not used for DSSM training. Instead, large batches of 4096 random training examples (almost 1% of the entire dataset) were used. During the learning process, only the training queries and documents were used. For evaluation, the nearest website was searched among all documents. The training part was 90% of the dataset, and the quality discrepancy between validation and test sets was quite small. Our code^{12}^{12}12https://github.com/shevkunov/OverlayingSpacesandPracticalApplicabilityofComplexGeometries supplements the submission.
A.2 WLA6 dataset details
As described in the main text, this dataset is obtained by running the breadth-first search algorithm on the category graph of the English-language Wikipedia (https://en.wikipedia.org/wiki/Special:CategoryTree), starting from the vertex (category) "Linear algebra" and limited to depth 6 (hence the name: Wikipedia Linear Algebra 6). Despite the name of that page, the category graph is not a tree now: for example, the category "Matrix theory" has the subcategory "Matrices", which in turn has the subcategory "Matrix theory". We provide this graph along with the texts (names) of the vertices (categories). The resulting graph is very close to being a tree, although there are some cycles. Predictably, hyperbolic space gives a significant gain for this graph, while using product spaces gives almost no additional advantage. The purpose of using this dataset is to check our conclusions on data other than those used in Gu et al. (2019) and to evaluate overlaying spaces on data where product spaces do not provide quality gains.
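The crawl described above can be sketched as a depth-limited breadth-first search. This is only a minimal illustration, not the crawler used to build WLA6: the `subcategories` mapping is a hypothetical in-memory stand-in for the Wikipedia category graph, and the toy data merely mirrors the "Matrix theory"/"Matrices" cycle.

```python
from collections import deque

def bfs_limited(subcategories, root, max_depth):
    """Collect vertices (with their depth) and edges reachable from root within max_depth steps."""
    depth = {root: 0}
    edges = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # vertices at the depth limit are kept but not expanded
        for child in subcategories.get(node, ()):
            edges.add((node, child))
            if child not in depth:  # the visited check keeps cycles from looping forever
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth, edges

# Hypothetical toy category graph containing one cycle.
toy = {
    "Linear algebra": ["Matrix theory", "Vectors"],
    "Matrix theory": ["Matrices"],
    "Matrices": ["Matrix theory", "Sparse matrices"],
}
depth, edges = bfs_limited(toy, "Linear algebra", max_depth=3)
```

Because already visited vertices are never re-queued, the cycle only contributes an extra edge, which is why the resulting graph stays close to a tree without being one.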
B Additional experimental results
B.1 Our implementation of product spaces vs. the original one
Table 7 compares our implementation with the results reported in Gu et al. (2019). Note that the two algorithms differ significantly, including in the number of iterations.
On all datasets except UCSA312, the optimal distortion values obtained with our algorithm are comparable to, and usually better than, the ones reported in Gu et al. (2019). On UCSA312, the obtained distortion is about an order of magnitude better, which may be caused by a proper choice of the learning rate (in our experiments on this dataset, this choice significantly affected the results). These results indicate that our solution is a good starting point for comparing different spaces and similarities.
For mAP, we optimize the proxy loss, in contrast to the canonical implementation, where both metrics were reported for models trained with distortion. As a result, our results are more stable: the spread of values across different spaces is much smaller. We also noticed that optimizing mAP directly leads to significant improvements.
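For reference, the two metrics compared in Table 7 can be sketched as follows. This is a hedged illustration that assumes the definitions commonly used after Gu et al. (2019): average distortion is the mean relative error of pairwise distances, and mAP measures how well the graph neighbors of each node are retrieved when all other nodes are ranked by embedding distance. The function names and the toy path graph are hypothetical.

```python
import numpy as np

def average_distortion(d_emb, d_graph):
    """Mean of |d_emb - d_graph| / d_graph over unordered pairs of distinct nodes."""
    iu = np.triu_indices_from(d_graph, k=1)
    return float(np.mean(np.abs(d_emb[iu] - d_graph[iu]) / d_graph[iu]))

def mean_average_precision(d_emb, adjacency):
    """mAP of neighbor retrieval by embedding distance (assumed definition)."""
    n = adjacency.shape[0]
    ap_values = []
    for a in range(n):
        neighbors = np.flatnonzero(adjacency[a])
        if neighbors.size == 0:
            continue  # isolated nodes are skipped
        others = np.delete(np.arange(n), a)
        precisions = []
        for b in neighbors:
            # Nodes that are at least as close to a as its neighbor b is.
            retrieved = others[d_emb[a, others] <= d_emb[a, b]]
            precisions.append(np.mean(adjacency[a, retrieved] > 0))
        ap_values.append(np.mean(precisions))
    return float(np.mean(ap_values))

# Toy path graph 0 - 1 - 2 with shortest-path distances.
d_graph = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
adjacency = (d_graph == 1.0).astype(int)

# A deliberately poor 1D embedding: nodes placed at positions 0, 2, 1.
positions = np.array([0.0, 2.0, 1.0])
d_emb = np.abs(positions[:, None] - positions[None, :])

distortion = average_distortion(d_emb, d_graph)
map_score = mean_average_precision(d_emb, adjacency)
```

Under these assumed definitions, a perfect embedding of the toy graph (positions 0, 1, 2) gives distortion 0 and mAP 1, while the poor embedding is penalized by both metrics.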
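Table 7: Results of the canonical implementation from Gu et al. (2019) (Canon.) and of our implementation (Our).

Distortion

| UCSA312 Canon. | UCSA312 Our | CS PhDs Canon. | CS PhDs Our | Power Canon. | Power Our | Canon. | Our |
|---|---|---|---|---|---|---|---|
| 0.0735 | 0.0032 | 0.0543 | 0.0475 | 0.0917 | 0.0408 | 0.0653 | 0.0487 |
| 0.0932 | 0.0111 | 0.0502 | 0.0443 | 0.0388 | 0.0348 | 0.0596 | 0.0483 |
| 0.0598 | 0.0095 | 0.0569 | 0.0503 | 0.0500 | 0.0450 | 0.0661 | 0.0540 |
| 0.0756 | 0.0057 | 0.0382 | 0.0345 | 0.0365 | 0.0255 | 0.0430 | 0.0372 |
| 0.0593 | 0.0079 | 0.0579 | 0.0492 | 0.0471 | 0.0433 | 0.0658 | 0.0511 |
| 0.0622 | 0.0068 | 0.0509 | 0.0337 | 0.0323 | 0.0249 | 0.0402 | 0.0318 |
| 0.0687 | 0.0059 | 0.0357 | 0.0344 | 0.0396 | 0.0273 | 0.0525 | 0.0439 |
| 0.0638 | 0.0072 | 0.0570 | 0.0460 | 0.0483 | 0.0418 | 0.0631 | 0.0489 |
| 0.0765 | 0.0044 | 0.0391 | 0.0345 | 0.0380 | 0.0299 | 0.0474 | 0.0406 |

mAP

| UCSA312 Canon. | UCSA312 Our | CS PhDs Canon. | CS PhDs Our | Power Canon. | Power Our | Canon. | Our |
|---|---|---|---|---|---|---|---|
|  | 0.9290 | 0.8691 | 0.9487 | 0.8860 | 0.9380 | 0.5801 | 0.7876 |
|  | 0.9173 | 0.9310 | 0.9399 | 0.8442 | 0.9385 | 0.7824 | 0.7997 |
|  | 0.9254 | 0.8329 | 0.9578 | 0.7952 | 0.9436 | 0.5562 | 0.7868 |
|  | 0.9247 | 0.9628 | 0.9481 | 0.8605 | 0.9415 | 0.7742 | 0.8084 |
|  | 0.9231 | 0.7940 | 0.9662 | 0.8059 | 0.9466 | 0.5728 | 0.7891 |
|  | 0.9316 | 0.9141 | 0.9654 | 0.8850 | 0.9467 | 0.7414 | 0.8087 |
|  | 0.9364 | 0.9694 | 0.9671 | 0.8739 | 0.9508 | 0.7519 | 0.7979 |
|  | 0.9281 | 0.8334 | 0.9714 | 0.8818 | 0.9521 | 0.5808 | 0.7915 |
|  | 0.9391 | 0.8672 | 0.9611 | 0.8152 | 0.9486 | 0.5951 | 0.7970 |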
B.2 Graph reconstruction
In Tables 2 and 3 of the main text, we reduced the dimensionality of spherical spaces since we fixed the number of stored values for each space. Here, in Tables 8 and 9, we present the extended results, where we fix the mathematical dimension of product spaces, similarly to Gu et al. (2019). Taking into account the statistical significance estimated over five restarts of the algorithm with different random initializations, the results are similar to the ones reported in the main text.
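Table 8: Graph reconstruction, distortion.

| Signature | UCSA312 | CS PhDs | Power |  | WLA6 |
|---|---|---|---|---|---|
|  | 0.00318 | 0.0475 | 0.0408 | 0.0487 | 0.0530 |
|  | 0.01114 | 0.0443 | 0.0348 | 0.0483 | 0.0279 |
|  | 0.00951 | 0.0503 | 0.0450 | 0.0540 | 0.0589 |
|  | 0.00573 | 0.0345 | 0.0255 | 0.0372 | 0.0279 |
|  | 0.00792 | 0.0492 | 0.0433 | 0.0511 | 0.0585 |
|  | 0.00681 | 0.0337 | 0.0249 | 0.0318 | 0.0296 |
|  | 0.00592 | 0.0344 | 0.0273 | 0.0439 | 0.0356 |
|  | 0.00720 | 0.0460 | 0.0418 | 0.0489 | 0.0549 |
|  | 0.00436 | 0.0345 | 0.0299 | 0.0406 | 0.0405 |
|  | 0.04005 | 0.0412 | 0.0461 | 0.0236 | 0.0296 |
|  | 0.08306 | 0.0424 | 0.0505 | 0.0192 | 0.0270 |
|  | 0.00356 | 0.0368 | 0.0281 | 0.0458 | 0.0286 |
|  | 0.00330 | 0.0300 | 0.0231 | 0.0371 | 0.0272 |
|  | 0.00530 | 0.0328 | 0.0246 | 0.0324 | 0.0278 |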
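Table 9: Graph reconstruction, mAP.

| Signature | UCSA312 | CS PhDs | Power |  | WLA6 |
|---|---|---|---|---|---|
|  | 0.9290 | 0.9487 | 0.9380 | 0.7876 | 0.7199 |
|  | 0.9173 | 0.9399 | 0.9385 | 0.7997 | 0.9617 |
|  | 0.9254 | 0.9578 | 0.9436 | 0.7868 | 0.7287 |
|  | 0.9247 | 0.9481 | 0.9415 | 0.8084 | 0.9682 |
|  | 0.9231 | 0.9662 | 0.9466 | 0.7891 | 0.7353 |
|  | 0.9316 | 0.9654 | 0.9467 | 0.8087 | 0.9779 |
|  | 0.9364 | 0.9671 | 0.9508 | 0.7979 | 0.8597 |
|  | 0.9281 | 0.9714 | 0.9521 | 0.7915 | 0.7346 |
|  | 0.9391 | 0.9611 | 0.9486 | 0.7970 | 0.6796 |
|  | 1 | 1 | 0.9983 | 0.8745 | 0.9990 |
|  | 0.9522 | 0.9879 | 0.9728 | 0.8093 | 0.6759 |
|  | 0.9522 | 0.9904 | 0.9762 | 0.8185 | 0.9598 |
|  | 0.9522 | 0.9938 | 0.9907 | 0.8326 | 0.9694 |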
B.3 Other ways to convert distances to probabilities
For the proxy loss, we additionally experimented with other ways of converting distances to probabilities. Let us write in the general form:
(7) 
where is a function that decreases with distance . We compare the following alternatives for :
where is a small constant.
Recall that one of these options was used in the main text, and it seems to be the most natural choice (note that it is the softmax over inverted distances). Table 10 compares the options and shows that the best results are indeed achieved with the option used in the main text.
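The general idea of turning distances into probabilities with a decreasing function can be sketched as below. This is an illustration under stated assumptions, not the exact set of options compared here: `f_exp` reads "softmax over inverted distances" as a softmax over negated distances, and `f_inverse` (with its small constant `eps`) is a hypothetical alternative included only to show how a different decreasing function plugs into the same normalization.

```python
import numpy as np

def distances_to_probabilities(distances, f):
    """Apply a decreasing function f to distances and normalize so the values sum to 1."""
    weights = f(np.asarray(distances, dtype=float))
    return weights / weights.sum()

def f_exp(d):
    # Assumption: softmax over negated distances.
    return np.exp(-d)

def f_inverse(d, eps=1e-3):
    # Hypothetical alternative; eps is a small constant that avoids division by zero.
    return 1.0 / (d + eps)

# Distances from one node to three candidate nodes.
d = [0.5, 1.0, 3.0]
p_exp = distances_to_probabilities(d, f_exp)
p_inv = distances_to_probabilities(d, f_inverse)
```

Both choices assign higher probability to closer candidates; they differ in how quickly the probability decays with distance, which affects the gradients seen during training.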
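Table 10: mAP for different ways of converting distances to probabilities (three columns per dataset, one per option).

| UCSA312 |  |  | CS PhD |  |  |
|---|---|---|---|---|---|
| 0.929 | 0.911 | 0.899 | 0.949 | 0.956 | 0.831 |
| 0.917 | 0.807 | 0.885 | 0.940 | 0.749 | 0.764 |
| 0.927 | 0.801 | 0.829 | 0.959 | 0.583 | 0.684 |
| 0.925 | 0.797 | 0.838 | 0.958 | 0.572 | 0.689 |
| 0.925 | 0.890 | 0.883 | 0.948 | 0.976 | 0.723 |
| 0.918 | 0.821 | 0.864 | 0.961 | 0.733 | 0.751 |
| 0.923 | 0.802 | 0.858 | 0.966 | 0.748 | 0.775 |
| 0.927 | 0.835 | 0.853 | 0.965 | 0.781 | 0.724 |
| 0.932 | 0.838 | 0.865 | 0.965 | 0.804 | 0.721 |
| 0.936 | 0.896 | 0.903 | 0.967 | 0.998 | 0.823 |
| 0.931 | 0.850 | 0.851 | 0.901 | 0.863 | 0.826 |
| 0.928 | 0.856 | 0.871 | 0.971 | 0.876 | 0.881 |
| 0.934 | 0.887 | 0.820 | 0.950 | 0.891 | 0.751 |
| 0.939 | 0.872 | 0.865 | 0.961 | 0.884 | 0.689 |
| 0.952 | 0.933 | 0.872 | 0.988 | 0.961 | 0.762 |
| 0.952 | 0.947 | 0.877 | 0.990 | 0.963 | 0.815 |
| 0.952 | 0.939 | 0.880 | 0.994 | 0.979 | 0.810 |
| 1 | 1 | 0.777 | 1 | 0.999 | 0.917 |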
B.4 Synthetic experiment with bipartite graph
In Table 11, we extend the results presented in Table 6 of the main text. We report distortion and mAP; the corresponding models were trained with the distortion and proxy losses, respectively, as in the experiments in Section 5.1. For each space, we tried learning rates 0.1, 0.05, 0.01, and 0.001 and selected the best result. We used 2000 iterations for distortion and 1000 for mAP. Figure 2 shows the graph visualization.
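Table 11: Synthetic bipartite graph, mAP and distortion.

|  | mAP | distortion |
|---|---|---|
|  | 0.777 | 0.094 |
|  | 0.794 | 0.095 |
|  | 0.689 | 0.100 |
|  | 0.799 | 0.090 |
|  | 0.522 | 0.107 |
|  | 0.787 | 0.094 |
|  | 0.761 | 0.086 |
|  | 0.334 | 0.148 |
|  | 0.482 | 0.098 |
|  | 0.824 | 0.094 |
|  | 0.803 | 0.082 |
|  | 0.814 | 0.092 |
| best metric space | 0.824 | 0.082 |
|  | 0.863 | 0.079 |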
C Discussion on advantages of dot products
In this section, we discuss the advantages of the dot product and give an intuition regarding particular structures that are better embedded using this similarity measure.
The most straightforward advantage of the dot product is that it allows us to easily differentiate between popular and unpopular items. This property is usually attributed to the hyperbolic space when it is compared with spherical and Euclidean ones. However, the concept of popularity can be expressed much more easily with the dot-product similarity. Popularity often affects the structure of real-world data, from social networks to recommendation systems. Assume that there are two items with similar properties/topic, but with different quality/popularity. Then, given a query with the same topic (the direction in the vector space), it is better to recommend the more popular item (the one with the larger vector norm). This scenario can be visualized with the following graph structure. Assume that we have an arbitrary graph with a standard core-periphery structure. Now we add two elements to this graph: the first element is not very popular and is connected only to several core elements of the graph; the second element is popular and is connected to all elements of the graph. Such a situation is easily modeled with the dot-product similarity: the vectors of the two new elements have the same direction (corresponding to the core elements of the graph) but different norms; as a result, they have different numbers of neighbors. (The dot-product similarity can be converted to a graph, e.g., in the following way: if the similarity between two nodes is higher than some threshold, then the two nodes are connected.) In other spaces, this situation is harder to model: the two new elements are connected to the same core elements of the graph, so they have to be close to each other and hence have similar neighborhoods.
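The popularity argument can be illustrated with a small numerical sketch. All vectors and the threshold below are hypothetical values chosen for illustration: core nodes have large norms, periphery nodes have small norms, and two new items share a direction but differ in norm. Edges are obtained by thresholding the dot product, as in the conversion described above.

```python
import numpy as np

# Hypothetical toy embedding: three core nodes followed by three periphery nodes in 2D.
core = np.array([[2.0, 0.2], [1.8, -0.1], [2.2, 0.0]])
periphery = np.array([[0.5, 0.3], [0.4, -0.4], [0.6, 0.1]])
nodes = np.vstack([core, periphery])

direction = np.array([1.0, 0.0])  # shared direction of the two new items
unpopular = 0.6 * direction       # small norm
popular = 3.0 * direction         # large norm

threshold = 1.0  # hypothetical: two nodes are connected if their dot product exceeds it
unpopular_neighbors = np.flatnonzero(nodes @ unpopular > threshold)
popular_neighbors = np.flatnonzero(nodes @ popular > threshold)
```

In this toy setting the low-norm item is connected only to the three core nodes, while the high-norm item is connected to every node, even though both vectors point in exactly the same direction.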
Let us also discuss why dot products are well suited for modeling structures similar to the bipartite graph used in our synthetic experiment. Assume that we have a small set of popular nodes and a large set of less popular nodes. On the less popular nodes we may have an arbitrary structure, but we want the popular nodes to be not connected to each other and connected to all less popular nodes. If the set of popular nodes is small enough (smaller than the dimension of the space), then we can easily get several popular items located far away from each other: we can take them to be codirectional with the basis vectors and give them large norms. Then, all their pairwise dot products are equal to 0. The less popular nodes can be chosen in the positive orthant of the space. They can easily be made connected to all popular nodes (if the norms of the popular nodes are large enough). This intuition led to our synthetic experiment, which demonstrated that the dot-product similarity indeed allows us to capture bipartite structures.
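The construction can be checked numerically with a short sketch. The dimension, set sizes, norms, coordinate range, and threshold below are hypothetical choices made only for illustration; edges are again defined by thresholding the dot product.

```python
import numpy as np

dim, n_popular, n_other = 4, 3, 6
rng = np.random.default_rng(0)

# Popular nodes: scaled basis vectors, so their pairwise dot products are exactly 0.
popular = 10.0 * np.eye(dim)[:n_popular]
# Less popular nodes: points in the positive orthant with moderate coordinates.
others = rng.uniform(0.2, 0.5, size=(n_other, dim))

threshold = 1.0  # hypothetical: two nodes are connected if their dot product exceeds it

pop_pop = popular @ popular.T      # similarities between popular nodes
pop_other = popular @ others.T     # similarities between the two sets
other_other = others @ others.T    # similarities between less popular nodes

popular_disconnected = np.all(pop_pop[~np.eye(n_popular, dtype=bool)] <= threshold)
fully_connected_across = np.all(pop_other > threshold)
```

With these values every popular node is connected to every less popular node and no two popular nodes are connected. The coordinates of the less popular nodes are small enough that they also stay disconnected from each other, which yields a bipartite graph; larger coordinates would allow additional structure within that set.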