Symmetric Positive Definite (SPD) matrices have been applied in many tasks in computer vision such as pedestrian detection Tosato et al. (2010); Tuzel et al. (2008), action Harandi et al. (2014); Li and Lu (2018); Nguyen et al. (2019) or face recognition Huang et al. (2014, 2015), object Ionescu et al. (2015); Yin et al. (2016) and image set classification Wang et al. (2018), visual tracking Wu et al. (2015), and medical imaging analysis Arsigny et al. (2006a); Pennec et al. (2006), among others. They have been used to capture statistical notions (Gaussian distributions Said et al. (2017), covariance Tuzel et al. (2006)), while respecting the Riemannian geometry of the underlying SPD manifold, which offers a convenient trade-off between structural richness and computational tractability Cruceru et al. (2020). Previous work has applied approximation methods that locally flatten the manifold by projecting it to its tangent space Carreira et al. (2012); Vemulapalli and Jacobs (2015), or by embedding the manifold into higher dimensional Hilbert spaces Ha Quang et al. (2014); Yin et al. (2016).
These methods face problems such as distortion of the geometrical structure of the manifold and other known concerns with regard to high-dimensional spaces Dong et al. (2017). To overcome these issues, several distances on SPD manifolds have been proposed, such as the Affine Invariant metric Pennec et al. (2006), the Stein metric Sra (2012), the Bures–Wasserstein metric Bhatia et al. (2019) or the Log-Euclidean metric Arsigny et al. (2006a, b), with their respective geometric properties. However, the representational power of SPD is not fully exploited in many cases Arsigny et al. (2006b); Pennec et al. (2006). At the same time, it is hard to translate operations into their non-Euclidean domain given the lack of closed-form expressions. There has been a growing need to generalize basic operations, such as addition, rotation, reflection or scalar multiplication, to their Riemannian geometric counterparts to leverage this structure in the context of Geometric Deep Learning Bronstein et al. (2017).
SPD manifolds have a rich geometry that contains both Euclidean and hyperbolic subspaces. Thus, embeddings into SPD manifolds are beneficial, since they can accommodate hierarchical structure in data sets in the hyperbolic subspaces while at the same time represent Euclidean features. This makes them more versatile than using only hyperbolic or Euclidean spaces, and in fact, their different submanifold geometries can be used to identify and disentangle such substructures in graphs.
In this work, we introduce the vector-valued distance function to exploit the full representation power of SPD (§3.1). While in Euclidean or hyperbolic space the relative position between two points is completely captured by their distance (and this is the only invariant), in SPD this invariant is a vector, encoding much more information than a single scalar. This vector reflects the higher expressivity of SPD due to its richer geometry encompassing Euclidean as well as hyperbolic spaces. We develop algorithms using the vector-valued distance and showcase two main advantages: its versatility to implement universal models, and its use in explaining and visualizing what the model has learned.
Furthermore, we bridge the gap between Euclidean and SPD geometry by developing gyrocalculus in SPD (§4), which yields closed-form expressions of arithmetic operations, such as addition, scalar multiplication and matrix scaling. This provides means to translate previously implemented ideas in different metric spaces to their analog notions in SPD. These arithmetic operations are also useful to adapt neural network architectures to SPD manifolds.
We showcase this on knowledge graph completion, item recommendation, and question answering. In the experiments, the proposed SPD models outperform their equivalents in Euclidean and hyperbolic space (§6). These results reflect the superior expressivity of SPD, and show the versatility of the approach and ease of integration with downstream tasks.
The vector-valued distance allows us to develop a new tool for the analysis of the structural properties of the learned representations. With this tool, we visualize high-dimensional SPD embeddings, providing better explainability on what the models learn (§6.4). We show that the knowledge graph models are capable of disentangling and clustering positive triples from negative ones.
2 Related Work
Symmetric positive definite matrices are not new in the Machine Learning literature. They have been used in a plethora of applications (Arsigny et al. (2006a); Dong et al. (2017); Harandi et al. (2014); Huang and Gool (2017); Huang et al. (2014, 2015); Ionescu et al. (2015); Li and Lu (2018); Nguyen et al. (2019); Pennec et al. (2006); Said et al. (2017); Tosato et al. (2010); Tuzel et al. (2006, 2008); Wang et al. (2018); Wu et al. (2015); Yin et al. (2016)), although not always respecting the intrinsic structure or the positive definiteness constraint Carreira et al. (2012); Feragen et al. (2015); Ha Quang et al. (2014); Vemulapalli and Jacobs (2015); Yin et al. (2016). The alternative has been to map manifold points onto a tangent space and employ Euclidean-based tools. Unfortunately, this mapping distorts the metric structure in regions far from the origin of the tangent space, affecting performance Jayasumana et al. (2013); Zhao et al. (2019).
Previous work has proposed alternatives to the basic neural building blocks that respect the geometry of the space. Examples include transformation layers Dong et al. (2017); Gao et al. (2019); Huang and Gool (2017), alternative convolutional layers based on SPDs Zhang et al. (2018a) and Riemannian means Chakraborty et al. (2020), SPD layers appended after the convolution Brooks et al. (2019a), recurrent models Chakraborty et al. (2018), projections onto Euclidean spaces Li et al. (2018); Mao et al. (2019), and batch normalization Brooks et al. (2019b). Our work follows this line, providing explicit formulas for translating Euclidean arithmetic notions into SPDs.
Our general view, using the vector-valued distance function, allows us to treat Riemannian and Finsler metrics on SPD in a unified framework. Finsler metrics have previously been applied in compressed sensing Donoho and Tsaig (2008), information geometry Shen (2006), for clustering categorical distributions Nielsen and Sun (2019), and in robotics Ratliff et al. (2020). With regard to optimization, matrix backpropagation techniques have been explored Absil et al. (2009); Boumal et al. (2014); Ionescu et al. (2015), with some of them accounting for different Riemannian geometries Brooks et al. (2019b); Huang and Gool (2017). Nonetheless, we opt for tangent space optimization Chami et al. (2019) by exploiting the explicit formulations of the exponential and logarithmic map.
3 The Space $\mathrm{SPD}_n$
The space $\mathrm{SPD}_n$ is a Riemannian manifold of non-positive curvature of $n(n+1)/2$ dimensions. Points in $\mathrm{SPD}_n$ are positive definite real symmetric $n \times n$ matrices, with the identity matrix $I$ being a natural basepoint. The tangent space to any point of $\mathrm{SPD}_n$ can be identified with the vector space $S_n$ of all real symmetric $n \times n$ matrices. $\mathrm{SPD}_n$ contains $n$-dimensional Euclidean subspaces, $(n-1)$-dimensional hyperbolic subspaces as well as products of hyperbolic planes (see Helgason (1978) for an in-depth introduction).
In Figure 1 we visualize the smallest nontrivial example. $\mathrm{SPD}_2$ identifies with the inside of a cone in $\mathbb{R}^3$, cut out by requiring both eigenvalues of the symmetric $2 \times 2$ matrix to be positive. It carries the product geometry of the hyperbolic plane times a line.
Exponential and logarithmic maps: The exponential map, $\exp: S_n \to \mathrm{SPD}_n$, is a homeomorphism which links the Euclidean geometry of the tangent space $S_n$ and the curved geometry of $\mathrm{SPD}_n$. Its inverse is the logarithm map, $\log: \mathrm{SPD}_n \to S_n$. This pair of functions gives diffeomorphisms that allow one to move freely between 'tangent space coordinates' and the original 'manifold coordinates'. We apply both maps based at $I$. The reason for this is that while mathematically any two points on $\mathrm{SPD}_n$ are equivalent, and we could obtain a different concrete expression for any other choice of basepoint $P$, the resulting formulas would be more complicated, and thus $I$ is the best choice from a computational point of view. We prove this in Appendix C.3.
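As a minimal sketch of the two maps based at the identity, the snippet below computes the matrix exponential of a symmetric matrix and the matrix logarithm of an SPD matrix through an eigendecomposition. The function names `exp_map` and `log_map` are illustrative and do not come from the paper or its code release; the eigendecomposition route is one possible way to evaluate these maps, not necessarily the one the authors use.

```python
import numpy as np

def exp_map(S):
    """Exponential map at the identity: symmetric matrix -> SPD matrix.

    Uses S = V diag(w) V^T, so exp(S) = V diag(exp(w)) V^T.
    """
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def log_map(P):
    """Logarithm map at the identity: SPD matrix -> symmetric matrix.

    Uses P = V diag(w) V^T with w > 0, so log(P) = V diag(log(w)) V^T.
    """
    w, V = np.linalg.eigh(P)
    return (V * np.log(w)) @ V.T

# Round trip between tangent-space and manifold coordinates.
S = np.array([[0.5, 0.2], [0.2, -0.3]])  # arbitrary symmetric matrix
P = exp_map(S)                            # a point on the SPD manifold
S_back = log_map(P)                       # recovers S up to rounding
```

Because both maps are evaluated at the identity, no transport to another basepoint is needed, which is the computational simplification the paragraph above refers to.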
Symmetries: The prototypical symmetries of $\mathrm{SPD}_n$ are parameterized by elements of $GL(n, \mathbb{R})$: any invertible matrix $M$ defines the symmetry $P \mapsto M P M^T$ acting on all points $P \in \mathrm{SPD}_n$. Thus many geometric transformations can be completed using standard optimized matrix algorithms as opposed to custom-built procedures. See Appendix C.2 for a brief review of these symmetries.
Among these, we may find $\mathrm{SPD}_n$-generalizations of familiar symmetries of Euclidean geometry. When $M$ is also an element of $\mathrm{SPD}_n$, the symmetry $P \mapsto M P M$ is a generalization of a Euclidean translation, fixing no points of $\mathrm{SPD}_n$. When $M$ is an orthogonal matrix, the symmetry $P \mapsto M P M^T$ is conjugation by $M$, and thus fixes the basepoint $I$. We think of elements fixing the basepoint as being $\mathrm{SPD}_n$-rotations or $\mathrm{SPD}_n$-reflections, when the matrix $M$ is a familiar rotation or reflection (see Figure 5).
The Euclidean symmetry of reflecting in a point also has a natural generalization to $\mathrm{SPD}_n$. Euclidean reflection in the origin is given by $p \mapsto -p$; and its $\mathrm{SPD}_n$-analog, reflection in the basepoint $I$, is matrix inversion $P \mapsto P^{-1}$. The general $\mathrm{SPD}_n$-reflection in a point $P$ is a conjugate of this by an $\mathrm{SPD}_n$-translation, given by $Q \mapsto P Q^{-1} P$.
3.1 Vector-valued Distance Function
In Euclidean or hyperbolic spaces, the relative position between two points is completely determined by their distance, which is given by a scalar. For the space $\mathrm{SPD}_n$, it is determined by a vector.
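The VVD vector: To assign this vector in SPD we introduce the vector-valued distance (VVD) function $d_{vv}: \mathrm{SPD}_n \times \mathrm{SPD}_n \to \mathbb{R}^n$. For two points $P, Q \in \mathrm{SPD}_n$, the VVD is defined as:
$$d_{vv}(P, Q) = \left(\log \lambda_1(P^{-1}Q), \ldots, \log \lambda_n(P^{-1}Q)\right) \tag{1}$$
where $\lambda_1(P^{-1}Q) \geq \ldots \geq \lambda_n(P^{-1}Q)$ are the eigenvalues of $P^{-1}Q$ sorted in descending order. This vector is an invariant of the relative position of two points up to isometry. This means that in $\mathrm{SPD}_n$, if the VVD between two points $P$ and $Q$ is the vector $v$, and the VVD between $P'$ and $Q'$ is also $v$, then there exists an isometry mapping $P$ to $P'$ and $Q$ to $Q'$. Thus, we can recover completely the relative position of two points in $\mathrm{SPD}_n$ from this vector. For example, the Riemannian metric is obtained by using the standard $\ell_2$ norm on the VVD vector. This is: $d^R(P, Q) = \lVert d_{vv}(P, Q) \rVert_2 = \sqrt{\sum_{i=1}^{n} \log^2 \lambda_i(P^{-1}Q)}$. See Kapovich et al. (2017) §2.6 and Appendix C.4 for a review of VVDs in symmetric spaces.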
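The definition can be sketched numerically as follows, assuming the VVD is the vector of logarithms of the eigenvalues of $P^{-1}Q$ in descending order. The helper names `vvd` and `riemannian_distance` are hypothetical. Rather than forming the non-symmetric product $P^{-1}Q$ directly, this sketch uses a Cholesky factor $P = LL^T$ and the symmetric matrix $L^{-1} Q L^{-T}$, which has the same eigenvalues; this is a numerical convenience chosen here, not a step taken from the paper.

```python
import numpy as np

def vvd(P, Q):
    """Vector-valued distance: log-eigenvalues of P^{-1} Q, descending."""
    L = np.linalg.cholesky(P)            # P = L L^T
    A = np.linalg.solve(L, Q)            # L^{-1} Q
    B = np.linalg.solve(L, A.T).T        # L^{-1} Q L^{-T}, similar to P^{-1} Q
    B = 0.5 * (B + B.T)                  # remove rounding asymmetry
    lam = np.linalg.eigvalsh(B)          # ascending, all positive
    return np.log(lam[::-1])             # descending order

def riemannian_distance(P, Q):
    """Riemannian distance as the l2 norm of the VVD."""
    return float(np.linalg.norm(vvd(P, Q)))

P = np.eye(2)
Q = np.diag([np.exp(1.0), np.exp(2.0)])
v = vvd(P, Q)                            # log-eigenvalues 2 and 1

# The VVD is unchanged by a symmetry P -> M P M^T with M invertible.
M = np.array([[2.0, 1.0], [0.0, 1.0]])
v_moved = vvd(M @ P @ M.T, M @ Q @ M.T)
```

The last two lines illustrate the invariance property stated above: moving both points by the same symmetry leaves the vector unchanged up to rounding.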
Finsler metrics: Any norm on $\mathbb{R}^n$ that is invariant under permutation of the entries induces a metric on $\mathrm{SPD}_n$. Moreover, $\mathrm{SPD}_n$ supports not only a Riemannian metric, but also Finsler metrics, a whole family of distances with the same symmetry group (group of isometries). These metrics are of special importance since distance minimizing geodesics are not necessarily unique in Finsler geometry. Two different paths can have the same minimal length. This is particularly valuable when embedding graphs in $\mathrm{SPD}_n$, since in graphs there are generally several shortest paths. We obtain the Finsler metrics $F_1$ or $F_\infty$ by taking the respective $\ell_1$ or $\ell_\infty$ norms of the VVD in $\mathbb{R}^n$ (see Figure 6). See Planche (1995) and Appendix C.6 for a review of the theory of Finsler metrics, Bhatia (2003) for the study of some Finsler metrics on $\mathrm{SPD}_n$, and López et al. (2021) for applications of Finsler metrics on symmetric spaces in representation learning.
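To make the role of the norm concrete, the sketch below turns a given VVD vector into a scalar distance under three choices of norm. It assumes that the two Finsler metrics mentioned above correspond to the $\ell_1$ and $\ell_\infty$ norms of the VVD; the function name `distance_from_vvd` and the metric labels are illustrative only.

```python
import numpy as np

def distance_from_vvd(v, metric="riem"):
    """Scalar distance obtained from a VVD vector v.

    metric: "riem" (l2 norm), "f1" (l1 norm) or "finf" (l-infinity norm).
    All three norms are invariant under permutation of the entries of v.
    """
    v = np.asarray(v, dtype=float)
    if metric == "riem":
        return float(np.sqrt(np.sum(v ** 2)))
    if metric == "f1":
        return float(np.sum(np.abs(v)))
    if metric == "finf":
        return float(np.max(np.abs(v)))
    raise ValueError(f"unknown metric: {metric}")

v = [2.0, -1.0]                      # an example VVD vector
d_riem = distance_from_vvd(v, "riem")
d_f1 = distance_from_vvd(v, "f1")
d_finf = distance_from_vvd(v, "finf")
```

Since the metric only enters through this final norm, the same embedding pipeline can be evaluated under any of the three distances by switching a single argument.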
Advantages of VVD: The proposed metric learning framework based on the VVD offers several advantages. First, a single model can be run with different metrics, according to the chosen norm. The VVD contains the full information of the Riemannian distance and of all invariant Finsler distances, hence we can easily recover the Riemannian metric and extend the approach to many other alternatives (in Appendix C.7, we detail how the VVD generalizes other metrics). Second, the VVD provides much more information than just the distance, and can be used to analyze the learned representation in $\mathrm{SPD}_n$, independent of the choice of the metric. Out of the VVD between two points, one can immediately read the regularity of the unique geodesics joining these two points. Geodesics in $\mathrm{SPD}_n$ have different regularity, which is related to the number of maximal Euclidean subspaces that contain the geodesic. The Riemannian or Finsler distances cannot distinguish the differences between these geodesics of different regularity, but the VVD function can. Third, the VVD function can be leveraged as a tool to visualize and analyze the learned high-dimensional representations (see §6.4).
To build an analog of many Euclidean operators in $\mathrm{SPD}_n$, we also require a translation of operations internal to Euclidean geometry, chief among these being the vector space operations of addition and scalar multiplication. By means of tools introduced in the pure mathematics literature (see Abe and Hatori (2015); Hatori (2017) for an abstract treatment of gyrocalculus, and Kim (2016) for the specific examples discussed here), we describe a gyro-vector space structure on $\mathrm{SPD}_n$, which provides geometrically meaningful extensions of these vector space operations, extending successful applications of this framework in geometric deep learning to hyperbolic space Ganea et al. (2018); López and Strube (2020); Shimizu et al. (2021). These operations provide a template for translation, where one may attempt to replace $+, -, \times$ in formulas familiar from Euclidean spaces with the analogous operations $\oplus, \ominus, \otimes$ on $\mathrm{SPD}_n$. While straightforward, such translation requires some care, as gyro-addition is neither commutative nor associative. See Appendix D for a review of the underlying general theory of gyrogroups and additional guidelines for accurate formula translation.
Addition and Subtraction: Given a fixed choice of basepoint $I$ and two points $P, Q \in \mathrm{SPD}_n$, we define the gyroaddition of $P$ and $Q$ to be the point $P \oplus Q$ which is the image of $Q$ under the isometry which translates $I$ to $P$ along the geodesic connecting them. This directly generalizes the gyroaddition of hyperbolic space exploited by Bachmann et al. (2020); Ganea et al. (2018); Shimizu et al. (2021), via the geometric interpretation of Vermeer (2005) (see Figure 10).
Fixing $I$, we may compute the value of $P \oplus Q$ for arbitrary $P, Q \in \mathrm{SPD}_n$ as the result of applying the $\mathrm{SPD}_n$-translation moving the basepoint to $P$, evaluated on $Q$. We see also that the additive inverse of a point with respect to this operation must then be given by its geodesic reflection in $I$:
$$P \oplus Q = \sqrt{P}\, Q \sqrt{P}, \qquad \ominus P = P^{-1} \tag{2}$$
As this operation encodes a symmetry of $\mathrm{SPD}_n$, it is possible to recast certain geometric statements purely in the gyrovector formalism. In particular, the vector-valued distance may be computed as the logarithm of the eigenvalues of $\ominus P \oplus Q$ (see Appendix C.5).
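The addition and inverse can be sketched as below, assuming the closed forms $P \oplus Q = \sqrt{P}\,Q\sqrt{P}$ and $\ominus P = P^{-1}$. These follow from the description above: congruence by $\sqrt{P}$ is the translation moving $I$ to $P$, and matrix inversion is the reflection in $I$. The names `spd_sqrt`, `gyro_add` and `gyro_inv` are illustrative.

```python
import numpy as np

def spd_sqrt(P):
    """Symmetric square root of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return (V * np.sqrt(w)) @ V.T

def gyro_add(P, Q):
    """Gyroaddition: image of Q under the translation moving I to P."""
    R = spd_sqrt(P)
    return R @ Q @ R

def gyro_inv(P):
    """Gyro-inverse: geodesic reflection of P in the basepoint I."""
    return np.linalg.inv(P)

P = np.array([[2.0, 0.0], [0.0, 1.0]])
Q = np.array([[2.0, 1.0], [1.0, 2.0]])
I = np.eye(2)

left = gyro_add(P, Q)
right = gyro_add(Q, P)               # differs from `left`: not commutative

# Eigenvalues of (-P) + Q coincide with those of P^{-1} Q, so their
# logarithms give the vector-valued distance.
ev_gyro = np.sort(np.linalg.eigvalsh(gyro_add(gyro_inv(P), Q)))
ev_direct = np.sort(np.linalg.eigvals(np.linalg.inv(P) @ Q).real)
```

The comparison of `left` and `right` shows concretely why formulas translated from Euclidean space must keep the order of the operands.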
Scalar Multiplication and Matrix Scaling: For a fixed basepoint $I$, we define the scalar multiplication of a point $P \in \mathrm{SPD}_n$ by a scalar $\alpha$ to be the point $\alpha \otimes P$ which lies at distance $|\alpha|\, d(I, P)$ from $I$ in the direction of $P$, where $d$ is the metric distance on $\mathrm{SPD}_n$. Geometrically, this is a transfer of the vector-space scalar multiplication on the tangent space to $\mathrm{SPD}_n$:
$$\alpha \otimes P = \exp(\alpha \log(P)) = P^{\alpha} \tag{3}$$
where $\exp, \log$ are the matrix exponential and logarithm. We further generalize the notion of scalar multiplication to allow for different relative expansion rates in different directions. For a fixed basepoint $I$ and a point $P \in \mathrm{SPD}_n$, we can replace the scalar $\alpha$ from Eq. 3 with an arbitrary real symmetric matrix $A \in S_n$. We define this matrix scaling by:
$$A \otimes P = \exp(A \odot \log(P)) \tag{4}$$
where $\odot$ denotes the Hadamard product. We denote the matrix scaling with $\otimes$, extending the previous usage: for any $\alpha$, we have $\alpha \otimes P = [\alpha] \otimes P$ where $[\alpha]$ is the matrix with every entry $\alpha$.
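A small sketch of both operations follows, assuming the forms $\alpha \otimes P = \exp(\alpha \log P)$ and $A \otimes P = \exp(A \odot \log P)$ with $A$ symmetric, as described above. The function names are illustrative, and the matrix exponential and logarithm are evaluated by eigendecomposition, which is one possible choice.

```python
import numpy as np

def sym_exp(S):
    """Matrix exponential of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def spd_log(P):
    """Matrix logarithm of an SPD matrix."""
    w, V = np.linalg.eigh(P)
    return (V * np.log(w)) @ V.T

def scalar_mul(alpha, P):
    """Scalar multiplication: scale log(P) in the tangent space, map back."""
    return sym_exp(alpha * spd_log(P))

def matrix_scale(A, P):
    """Matrix scaling: entrywise (Hadamard) product of A with log(P)."""
    return sym_exp(A * spd_log(P))   # `*` on arrays is the Hadamard product

P = np.array([[2.0, 1.0], [1.0, 2.0]])
A = np.array([[1.0, 0.3], [0.3, 2.0]])   # symmetric scaling factors

squared = scalar_mul(2.0, P)             # equals the matrix power P @ P
scaled = matrix_scale(A, P)              # still symmetric positive definite
same = matrix_scale(np.full((2, 2), 0.5), P)  # constant A reduces to scalar case
```

Keeping $A$ symmetric matters here: the Hadamard product of two symmetric matrices is symmetric, so the exponential lands back on the manifold.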
In this section we detail how we learn representations in $\mathrm{SPD}_n$, and implement different linear mappings so that they conform to the premises of each operator, yielding SPD neural layers.
Embeddings in $\mathrm{SPD}_n$ and $S_n$: We are interested in learning embeddings in $\mathrm{SPD}_n$. To do so we exploit the connection between $\mathrm{SPD}_n$ and its tangent space $S_n$ through the exponential and logarithmic maps. To learn a point $P \in \mathrm{SPD}_n$, we first model it as a symmetric matrix $U \in S_n$. We impose symmetry on $U$ by learning a triangular matrix $X$ with $n(n+1)/2$ parameters, such that $U = X + X^T$. To obtain the matrix $P$, we employ the exponential map: $P = \exp(U)$. Modeling points on the tangent space offers advantages for optimization, explained in §5. For the matrix scaling $A \otimes P$, we impose symmetry on the factor matrix $A$ in the same way that we learn the symmetric matrix $U$.
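The parameterization can be sketched as follows. The snippet assumes that the free parameters fill an upper-triangular matrix $X$ and that the symmetric matrix is formed as $U = X + X^T$ before applying the exponential map; the exact symmetrization is an assumption of this sketch, and `params_to_spd` is a hypothetical name.

```python
import numpy as np

def params_to_spd(theta, n):
    """Map n(n+1)/2 unconstrained parameters to an n x n SPD matrix.

    theta fills the upper triangle of X (row by row), U = X + X^T is
    symmetric, and P = exp(U) is symmetric positive definite.
    """
    X = np.zeros((n, n))
    X[np.triu_indices(n)] = theta    # raises if theta has the wrong length
    U = X + X.T
    w, V = np.linalg.eigh(U)
    return (V * np.exp(w)) @ V.T

n = 3
theta = np.arange(6, dtype=float) / 10.0   # 3 * 4 / 2 = 6 free parameters
P = params_to_spd(theta, n)
origin = params_to_spd(np.zeros(6), n)     # zero parameters give the identity
```

Because every real parameter vector yields a valid point, a standard Euclidean optimizer can update `theta` directly without any projection step.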
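Isometries: Rotations and Reflections: Rotations in $n$ dimensions are described as collections of pairwise orthogonal 2-dimensional rotations in planes (with a leftover 1-dimensional "axis of rotation" in odd dimensions). We utilize this observation to efficiently build elements of $O(n)$ out of two-dimensional rotations in coordinate planes. More precisely, for any angle $\theta$ and choice of sign $\pm$ we let $R^{\pm}(\theta)$ denote the 2-dimensional rotation ($+$) or reflection ($-$) as
$$R^{\pm}(\theta) = \begin{pmatrix} \cos\theta & \mp\sin\theta \\ \sin\theta & \pm\cos\theta \end{pmatrix}.$$
Then for any pair $i < j$ in $\{1, \ldots, n\}$, we denote by $R^{\pm}_{i,j}(\theta)$ the transformation which applies $R^{\pm}(\theta)$ to the $(i,j)$-plane of $\mathbb{R}^n$, and leaves all other coordinates fixed. Concretely, $R^{\pm}_{i,j}(\theta)$ is the identity matrix in which the entries at positions $(i,i)$, $(i,j)$, $(j,i)$ and $(j,j)$ are replaced with the corresponding values of $R^{\pm}(\theta)$. More generally, rotations and reflections are built by taking products of these basic transformations. Given an $n(n-1)/2$-dimensional vector of angles $\Theta = (\theta_{i,j})_{i<j}$ and a choice of sign, we define the rotation and reflection corresponding to $\Theta$ by:
$$\mathrm{Rot}(\Theta) = \prod_{i<j} R^{+}_{i,j}(\theta_{i,j}), \qquad \mathrm{Refl}(\Theta) = \prod_{i<j} R^{-}_{i,j}(\theta_{i,j}) \tag{5}$$
where $\mathrm{Rot}(\Theta), \mathrm{Refl}(\Theta)$ are the isometry matrices, and the vector of angles $\Theta$ can be regarded as a learnable parameter of the model. Finally, we denote the application of the transformation $M$ to the point $P \in \mathrm{SPD}_n$ by:
$$M \circledcirc P = M P M^{T} \tag{6}$$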
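The construction can be sketched as a product of plane transformations, one per coordinate pair. The sketch assumes one angle per pair $i < j$ (so $n(n-1)/2$ angles), the sign convention in which $+$ gives a plane rotation and $-$ a plane reflection, and application to a point by congruence $M P M^T$. The ordering of the factors in the product and all function names are choices made here for illustration.

```python
import numpy as np
from itertools import combinations

def plane_isometry(n, i, j, theta, sign=+1):
    """Identity matrix with a 2x2 rotation (sign=+1) or reflection
    (sign=-1) placed in the (i, j) coordinate plane."""
    c, s = np.cos(theta), np.sin(theta)
    M = np.eye(n)
    M[i, i], M[i, j] = c, -sign * s
    M[j, i], M[j, j] = s, sign * c
    return M

def build_isometry(thetas, n, sign=+1):
    """Product of plane isometries over all pairs i < j, one angle each."""
    pairs = list(combinations(range(n), 2))
    if len(thetas) != len(pairs):
        raise ValueError("expected n(n-1)/2 angles")
    M = np.eye(n)
    for theta, (i, j) in zip(thetas, pairs):
        M = M @ plane_isometry(n, i, j, theta, sign)
    return M

def apply_isometry(M, P):
    """Action of an orthogonal matrix M on an SPD matrix P."""
    return M @ P @ M.T

n = 3
thetas = np.array([0.3, -1.1, 2.0])      # the learnable angle vector
Rot = build_isometry(thetas, n, sign=+1)
Refl = build_isometry(thetas, n, sign=-1)
P = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 3.0]])
P_rot = apply_isometry(Rot, P)           # same eigenvalues as P
```

Since each factor is orthogonal, the product is orthogonal for any angle vector, so the angles can be trained as unconstrained parameters.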
Optimization: For the proposed rotations and reflections, the learnable weights are vectors of angles $\Theta$, which do not pose an optimization challenge. On the other hand, embeddings in $\mathrm{SPD}_n$ have to be optimized respecting the geometry of the manifold, but as already explained, we model them on the space of symmetric matrices $S_n$, and then we apply the exponential map. In this manner, we are able to perform tangent space optimization Chami et al. (2019) using standard Euclidean techniques, and circumvent the need for Riemannian optimization Bonnabel (2011); Bécigneul and Ganea (2019), which we found to be less numerically stable. Due to the geometry of $\mathrm{SPD}_n$ (see Appendix C.3), this is an exact procedure, which does not incur losses in representational power.
Complexity: The most frequently utilized operation when learning embeddings is the distance calculation, thus we analyze its complexity. In Appendix A.1 we detail the complexity of different operations. Calculating the distance between two points in $\mathrm{SPD}_n$ implies computing multiplications, inversions and diagonalizations of $n \times n$ matrices. We find that the cost of the distance computation with respect to the matrix dimensions is $\mathcal{O}(n^3)$. Although a matrix of rank $n$ implies $n(n+1)/2$ dimensions, so that a large value of $n$ is usually not required, the cost of many operations is polynomial instead of linear.
Towards neural network architectures: We employ the proposed mappings along with the gyro-vector operations as building blocks for SPD neural layers. This is showcased in the experiments presented below. Scalings, rotations and reflections can be seen as feature transformations. Moreover, gyro-addition allows us to define the equivalent of bias addition. Finally, although we do not employ non-linearities, our approach can be seamlessly integrated with the ReEig layer (adaptation of a ReLU layer for SPD) proposed in Huang and Gool (2017).
In this section we employ the transformations developed on SPD to build neural models for knowledge graph completion, item recommendation and question answering. Task-specific models in different geometries have been developed in the three cases, hence we consider them adequate benchmarks for representation learning (code available at https://github.com/fedelopez77/gyrospd).
6.1 Knowledge Graph Completion
Knowledge graphs (KG) represent heterogeneous knowledge in the shape of (head, relation, tail) triples, where head and tail are entities and relation represents a relationship between entities. KG exhibit an intricate and varying structure where entities can be connected by symmetric, anti-symmetric, or hierarchical relations. To capture these non-trivial patterns, more expressive modelling techniques become necessary Chami et al. (2020), thus we choose this application to showcase the capabilities of our transformations on SPD manifolds. Given an incomplete KG, the task is to predict which unknown links are valid.
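Problem formulation: Let $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$ be a knowledge graph where $\mathcal{E}$ is the set of entities, $\mathcal{R}$ is the set of relations and $\mathcal{T} \subset \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is the set of triples stored in the graph. The usual approach is to learn a scoring function $\phi: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$ that measures the likelihood of a triple to be true, with the goal of scoring all missing triples correctly. To do so, we propose to learn representations of entities as embeddings in $\mathrm{SPD}_n$, and relation-specific transformations in the manifold, such that the KG structure is preserved.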
Scaling model: We follow the base hyperbolic model MuRP Balazevic et al. (2019) and adapt it into $\mathrm{SPD}_n$ by means of the matrix scaling. Its scoring function has shown success in the task given that it combines multiplicative and additive components, which are fundamental to model different properties of KG relations Allen et al. (2021). We translate it into $\mathrm{SPD}_n$ as:
where are embeddings and are scalar biases for the head and tail entities respectively. and are matrices that depend on the relation. For , we experiment with the Riemannian and the Finsler One metric distances.
Isometric model: A possible alternative is to embed the relation-specific transformations as elements of the group $O(n)$ (i.e., rotations and reflections). This technique has proven effective in different metric spaces Chami et al. (2020); Yang et al. (2020). In this case, the relation-specific transformation is a rotation or reflection matrix as in Eq. 5, and the scoring function is defined as:
We employ two standard benchmarks, namely WN18RR Bordes et al. (2013); Dettmers et al. (2018) and FB15k-237 Bordes et al. (2013); Toutanova and Chen (2015). WN18RR is a subset of WordNet Miller (1992) containing lexical relationships between word senses. FB15k-237 is a subset of Freebase Bollacker et al. (2008), with entities and relationships.
Training: We follow the standard data augmentation protocol by adding inverse relations to the datasets Lacroix et al. (2018). We optimize the cross-entropy loss with uniform negative sampling defined in Equation 9, where is the set of training triples, and if is a factual triple or if is a negative sample. We employ the AdamW optimizer Loshchilov and Hutter (2019). We conduct a grid search with matrices of dimension where (this is the equivalent of degrees of freedom respectively) to select optimal dimensions, learning rate and weight decay, using the validation set. More details and set of hyperparameters in Appendix B.1.
Evaluation metrics: At test time, we rank the correct tail or head entity against all possible entities using the scoring function, and use inverse relations for head prediction Lacroix et al. (2018). Following previous work, we compute two ranking-based metrics: mean reciprocal rank (MRR), which measures the mean of inverse ranks assigned to correct entities, and hits at K (H@K), which measures the proportion of correct entities ranked among the top K predictions. We follow the standard evaluation protocol of filtering out all true triples in the KG during evaluation Bordes et al. (2013).
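The two metrics and the filtered ranking can be illustrated with a short sketch. The candidate names and scores below are made up for illustration, and `filtered_rank`, `mrr` and `hits_at_k` are hypothetical helpers rather than functions from the paper's code; ties are resolved optimistically here, which is one of several possible conventions.

```python
def filtered_rank(scores, target, known_true=()):
    """1-based rank of `target` among candidates, highest score first.

    Candidates in `known_true` (other correct answers already in the KG)
    are filtered out so they cannot push the target down the ranking.
    """
    skip = set(known_true) - {target}
    t = scores[target]
    better = sum(1 for c, s in scores.items()
                 if c != target and c not in skip and s > t)
    return 1 + better

def mrr(ranks):
    """Mean reciprocal rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy query: candidate "a" is another true answer, so it is filtered out.
scores = {"a": 0.9, "b": 0.8, "c": 0.7}
rank_c = filtered_rank(scores, "c", known_true=("a",))

ranks = [1, 2, 4]            # ranks of the correct entity for three queries
score_mrr = mrr(ranks)       # (1 + 1/2 + 1/4) / 3
score_h3 = hits_at_k(ranks, 3)
```

Without the filter, candidate "c" would be ranked third instead of second, which is the effect the filtered protocol is designed to remove.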
Baselines: We compare our models with their respective equivalents in different metric spaces, which are also state-of-the-art models for the task. For the scaling model, these are MuRE and MuRP Balazevic et al. (2019), which perform the scaling operation in Euclidean and hyperbolic space respectively. For the isometric models, we compare to RotC Sun et al. (2019), RotE and RotH Chami et al. (2020) (rotations in complex, Euclidean and hyperbolic space respectively), and RefE and RefH Chami et al. (2020) (reflections in Euclidean and hyperbolic space). Baseline results are taken from the original papers. We do not compare to previous work on SPD given that they lack the definition of an arithmetic operation in the space, thus a direct comparison is not possible.
Results: We report the performance for all analyzed models, segregated by operation, in Table 1. On both datasets, the scaling model outperforms its direct competitors MuRE and MuRP, and this is especially notable in HR@10 for WN18RR: for vs and respectively. SPD reflections are very effective on WN18RR as well. They outperform their Euclidean and hyperbolic counterparts RefE and RefH, in particular when equipped with the Finsler metric. Rotations on the SPD manifold, on the other hand, seem to be less effective. However, Euclidean and hyperbolic rotations require dimensions whereas the models are trained on matrices of rank (equivalent to dims). Moreover, the underperformance observed in some of the analyzed cases for rotations and reflections does not repeat in the following experiments (§6.2 & §6.3). Hence, we attribute it to overfitting in some particular setups. Although we tried different regularization methods, we regard a sub-optimal configuration rather than a geometric reason to be the cause for the underperformance.
Regarding the choice of a distance metric, the Finsler One metric is better suited with respect to HR@3 and HR@10 when using scalings and reflections on WN18RR. For the FB15k-237 dataset, SPD models operating with the Riemannian metric outperform their Finsler counterparts. This suggests that the Riemannian metric is capable of disentangling the large number of relationships in this dataset to a better extent.
In these experiments we have evaluated models applying equivalent operations and scoring functions in different geometries, thus they can be thought of as a direct comparison of the metric spaces. We observe that SPD models tie or outperform baselines in most instances. This showcases the improved representation capacity of the SPD manifold when compared to Euclidean and hyperbolic spaces. Moreover, it demonstrates the effectiveness of the proposed metrics and operations in this manifold.
6.2 Knowledge Graph Recommender Systems
Recommender systems (RS) model user preferences to provide personalized recommendations Zhang et al. (2019). KG embedding methods have been widely adopted into the recommendation problem as an effective tool to model side information and enhance the performance Zhang et al. (2016); Guo et al. (2020). For instance, one reason for recommending a movie to a particular user is that the user has already watched many movies from the same genre or director Ma et al. (2019). Given multiple relations between users, items, and heterogeneous entities, the goal is to predict the user’s next item purchase or preference.
Model: We model the recommendation problem as a link prediction task over users and items Li et al. (2014). In addition, we aim to incorporate side information between users, items and other entities. Hence we apply our KG embedding method from §6.1 as is, to embed this multi-relational graph. We evaluate the capabilities of the approach by only measuring the performance over user-item interactions.
Datasets: To investigate the recommendation problem endowed with added relationships, we employ the Amazon dataset McAuley and Leskovec (2013); Ni et al. (2019) (branches "Software", "Luxury & Beauty" and "Prime Pantry"), with users’ purchases of products, and the MindReader dataset Brams et al. (2020) of movie recommendations. Both datasets provide additional relationships between users, items and entities such as product brands, or movie directors and actors. To generate evaluation splits, the penultimate and last item the user has interacted with are withheld as dev and test sets respectively.
Training: In this setup we also augment the data by adding inverse relations and optimize the loss from Equation 9. We set the size of the matrices to dimensions (equivalent to free parameters). More details about relationships and set of hyperparameters in Appendix B.2.
Evaluation and metrics: We follow the standard procedure of evaluating against randomly selected samples the user has not interacted with He et al. (2017); López et al. (2021). To evaluate the recommendation performance we focus on the buys / likes relation. For each user we rank the items according to the scoring function . We adopt MRR and H@10, as ranking metrics for recommendations.
Results: In Table 2 we observe that the SPD models tie or outperform the baselines in both MRR and HR@10 across all analyzed datasets. Rotations in both the Riemannian and Finsler metrics are more effective in this task, achieving the best performance in 3 out of 4 cases, followed by the scaling models. Overall, this shows the capabilities of the systems to effectively represent user-item interactions enriched with relations between items and their attributes, thus learning to better model users’ preferences. Furthermore, it displays the versatility of the approach across diverse data domains.
6.3 Question Answering
We evaluate our approach on the task of Question Answering (QA). In this manner we also showcase the capabilities of our methods to train word embeddings.
Model: We adapt the model from HyperQA Tay et al. (2018) to . We model word embeddings , and represent question/answers as a summation of the embeddings of their corresponding tokens. We apply a feature transformation followed by a bias addition, as an equivalent of a neural linear layer. can be a scaling, rotation or reflection. Finally we compute a distance-based similarity function between the resulting question/answer representations as defined in Equation 10, where , and the transformation are parameters of the model.
Datasets: We analyze two popular benchmarks for QA: TrecQA Wang et al. (2007) (clean version) and WikiQA Yang et al. (2015), filtering out questions with multiple answers from the dev and test sets.
Training: We optimize the cross-entropy loss from Eq. 9, where we replace for and for each question we use wrong answers as negative samples. We set the size of the matrices to dimensions (equivalent to free parameters). The set of hyperparameters can be found in Appendix B.3.
Evaluation metrics: At test time, for each question we rank its candidate answers according to Eq. 10. We adopt MRR and H@1 as evaluation metrics.
Baselines: We compare against Euclidean and hyperbolic spaces of dimensions. For the Euclidean model we employ a linear layer as feature transformation. For the hyperbolic model, we operate on the tangent space and project the points into the Poincaré ball to compute distances.
Results: We present results in Table 3. In both datasets, we see that the word embeddings and transformations learned by the SPD models are able to place question and answer representations in the space such that they outperform Euclidean and hyperbolic baselines. Finsler metrics seem to be very effective in this scenario, improving the performance of their Riemannian counterparts in many cases. Overall, this suggests that embeddings in SPD manifolds learn meaningful representations that can be exploited in downstream tasks. Moreover, we showcase how to employ different operations as feature transformations and bias additions, replicating the behavior of linear layers in classical deep learning architectures that can be seamlessly integrated with different distance metrics.
[Figure caption, partially recovered: ... (bottom) for 5 (left), 50 (center) and 3000 (right) epochs for the model. The red dot corresponds to the relation addition R.]
One reason to embed data into Riemannian manifolds, such as SPD, is to use geometric properties of the manifold to analyze the structure of the data López et al. (2021). Visualizations in SPD are difficult due to their high dimensionality. As a solution we use the vector-valued distance function to develop a new tool to visualize and analyze structural properties of the learned representations.
We adopt the vector , as the barycenter of the space in where the VVD is contained. Then, we plot the norm of the VVD vector and its angle with respect to this barycenter. In Figure 17, we compute and plot the VVD corresponding to and R as defined in Eq. 7 for KG models trained on WN18RR. In early stages of the training, all points fall near the origin (left side of the plots). As training evolves, the model learns to separate true triples from corrupted ones (center part). When the training converges, the model is able to clearly disentangle and cluster positive and negative samples. We observe how the position of the validation triples (green points, not seen during training) directly correlates with the performance of each relation. Plots for more relations in Appendix B.1.
Riemannian geometry has gained attention due to its capacity to represent non-Euclidean data arising in several domains. In this work we introduce the vector-valued distance function, which allows the implementation of universal models (generalizing previous metrics on SPD), and can be leveraged to provide a geometric interpretation of what the models learn. Moreover, we bridge the gap between Euclidean and SPD geometry through the lens of gyrovector theory, providing means to transfer standard arithmetic operations from the Euclidean setting to their analogous notions in SPD. These tools enable practitioners to exploit the full representational power of SPD, and to profit from the enhanced expressivity of this manifold. We propose and evaluate SPD models on three tasks and eight datasets, which showcases the versatility of the approach and its ease of integration with downstream tasks. The results reflect the superior expressivity of SPD when compared to Euclidean or hyperbolic baselines.
This work is not without limitations. We consider the computational complexity of working with spaces of matrices to be the main drawback, since the cost of many operations is polynomial instead of linear. Nevertheless, a matrix of rank implies dimensions, so a large value is usually not required.
This work has been supported by the German Research Foundation (DFG) as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1, as well as under Germany’s Excellence Strategy EXC-2181/1 - 390900948 (the Heidelberg STRUCTURES Cluster of Excellence), and by the Klaus Tschira Foundation, Heidelberg, Germany.
- Generalized gyrovector spaces and a Mazur–Ulam theorem. Publicationes Mathematicae Debrecen 87, pp. 393–413. External Links: Cited by: footnote 1.
- Optimization algorithms on matrix manifolds. Princeton University Press. External Links: Cited by: §2.
- Learning heterogeneous knowledge base embeddings for explainable recommendation. Algorithms 11 (9). External Links: Cited by: §B.2.
- Interpreting knowledge graph relation representation from word embeddings. In International Conference on Learning Representations, External Links: Cited by: §6.1.
- Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J. Matrix Anal. Appl. 29 (1), pp. 328–347. External Links: Cited by: §C.7, §1, §1, §2.
- Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine 56 (2), pp. 411–421. External Links: Cited by: §C.7, §1.
- Constant curvature graph convolutional networks. In 37th International Conference on Machine Learning (ICML), Cited by: §4.
- TuckER: tensor factorization for knowledge graph completion. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5185–5194. External Links: Cited by: §B.1.
- Multi-relational Poincaré graph embeddings. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 4463–4473. External Links: Cited by: §6.1, §6.1, §6.2.
- Structure of manifolds of nonpositive curvature. I. Annals of Mathematics 122 (1), pp. 171–203. External Links: Cited by: §C.3.
- Riemannian adaptive optimization methods. In 7th International Conference on Learning Representations, ICLR, New Orleans, LA, USA. External Links: Cited by: §A.2, §5.
- On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae 37 (2), pp. 165–191. External Links: Cited by: §C.7, §1.
- On the exponential metric increasing property. Linear Algebra and its Applications 375, pp. 211–220. Cited by: §3.1.
- Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, New York, NY, USA, pp. 1247–1250. External Links: Cited by: §6.1.
- Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control 58. External Links: Cited by: §A.2, §5.
- Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26, pp. 2787–2795. External Links: Cited by: §6.1, §6.1, §6.2.
- Manopt, a MATLAB toolbox for optimization on manifolds. Journal of Machine Learning Research 15 (42), pp. 1455–1459. External Links: Cited by: §2.
- MindReader: recommendation over knowledge graph entities with explicit user ratings. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, New York, NY, USA, pp. 2975–2982. External Links: Cited by: §6.2.
- Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
- Exploring complex time-series representations for Riemannian machine learning of radar data. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3672–3676. External Links: Cited by: §2.
- Riemannian batch normalization for SPD neural networks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 15489–15500. External Links: Cited by: §2, §2.
- Semantic segmentation with second-order pooling. In Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid (Eds.), Berlin, Heidelberg, pp. 430–443. External Links: Cited by: §1, §2.
- ManifoldNet: a deep neural network for manifold-valued data with applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: Cited by: §2.
- A statistical recurrent model on the manifold of symmetric positive definite matrices. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. External Links: Cited by: §2.
- Low-dimensional hyperbolic knowledge graph embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6901–6914. External Links: Cited by: §6.1, §6.1, §6.1.
- Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems 32, pp. 4869–4880. External Links: Cited by: §A.2, §2, §5.
- Computationally tractable Riemannian manifolds for graph embeddings. In 37th International Conference on Machine Learning (ICML), Cited by: §1.
- Convolutional 2D knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 1811–1818. External Links: Cited by: §6.1.
- Deep manifold learning of symmetric positive definite matrices with application to face recognition. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pp. 4009–4015. Cited by: §1, §2, §2.
- Fast solution of ℓ1-norm minimization problems when the solution may be sparse. IEEE Trans. Information Theory 54 (11), pp. 4789–4812. Cited by: §2.
- Geodesic exponential kernels: when curvature and linearity conflict. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3032–3042. External Links: Cited by: §2.
- Hyperbolic neural networks. In Advances in Neural Information Processing Systems 31, pp. 5345–5355. External Links: Cited by: §4, §4.
- Learning a robust representation via a deep network on symmetric positive definite manifolds. Pattern Recognition 92, pp. 1–12. External Links: Cited by: §2.
- A survey on knowledge graph-based recommender systems. IEEE Transactions on Knowledge and Data Engineering, pp. 1–1. External Links: Cited by: §6.2.
- Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. External Links: Cited by: §1, §2.
- From manifold to manifold: geometry-aware dimensionality reduction for SPD matrices. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 17–32. Cited by: §1, §2.
- Examples and applications of generalized gyrovector spaces. Results in Mathematics 71, pp. 295–317. Cited by: footnote 1.
- Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, Republic and Canton of Geneva, CHE, pp. 173–182. External Links: Cited by: §6.2.
- Differential geometry, Lie groups, and symmetric spaces. Academic Press, New York. External Links: Cited by: §C.2, §3.
- A Riemannian network for SPD matrix learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pp. 2036–2042. Cited by: §2, §2, §2, §5.
- Learning Euclidean-to-Riemannian metric for point-to-set classification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1677–1684. External Links: Cited by: §1, §2.
- Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 720–729. Cited by: §1, §2.
- Matrix backpropagation for deep networks with structured layers. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2965–2973. External Links: Cited by: §1, §2, §2.
- Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
- Convex functions on symmetric spaces, side lengths of polygons and the stability inequalities for weighted configurations at infinity. J. Differential Geom. Cited by: §C.4, §C.6.
- Anosov subgroups: dynamical and geometric characterizations. European Journal of Mathematics 3 (3), pp. 808–898. External Links: Cited by: §C.4, §3.1.
- Gyrovector spaces on the open convex cone of positive definite matrices. Mathematics Interdisciplinary Research 1 (1), pp. 173–185. External Links: Cited by: footnote 1.
- Canonical tensor decomposition for knowledge base completion. In Proceedings of Machine Learning Research, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 2863–2872. External Links: Cited by: §6.1, §6.1.
- Recommendation algorithm based on link prediction and domain knowledge in retail transactions. Procedia Computer Science 31, pp. 875–881. Note: 2nd International Conference on Information Technology and Quantitative Management, ITQM 2014. External Links: Cited by: §6.2.
- Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 947–955. External Links: Cited by: §2.
- Locality preserving projection on SPD matrix Lie group: algorithm and analysis. Sci. China Inf. Sci. 61 (9), pp. 092104:1–092104:15. External Links: Cited by: §1, §2.
- Symmetric spaces for graph embeddings: a Finsler-Riemannian approach. In Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 7090–7101. External Links: Cited by: §3.1, §6.4.
- Augmenting the user-item graph with textual similarity models. External Links: Cited by: §6.2.
- A fully hyperbolic neural model for hierarchical multi-class classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 460–475. External Links: Cited by: §4.
- Decoupled weight decay regularization. In International Conference on Learning Representations, External Links: Cited by: §6.1.
- Jointly learning explainable rules for recommendation with knowledge graph. In The World Wide Web Conference, WWW ’19, New York, NY, USA, pp. 1210–1221. External Links: Cited by: §6.2.
- COSONet: compact second-order network for video face recognition. In Computer Vision – ACCV 2018, C. V. Jawahar, H. Li, G. Mori, and K. Schindler (Eds.), Cham, pp. 51–67. External Links: Cited by: §2.
- Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, New York, NY, USA, pp. 165–172. External Links: Cited by: §6.2.
- WordNet: a lexical database for English. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, External Links: Cited by: §6.1.
- A neural network based on SPD manifold learning for skeleton-based hand gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
- Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 188–197. External Links: Cited by: §6.2.
- Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6341–6350. External Links: Cited by: §B.1, §B.2, §B.3.
- Clustering in Hilbert’s projective geometry: the case studies of the probability simplex and the elliptope of correlation matrices. In Geometric Structures of Information, pp. 297–331. External Links: Cited by: §2.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Cited by: Appendix B.
- A Riemannian framework for tensor computing. Int. J. Comput. Vision 66 (1), pp. 41–66. External Links: Cited by: §C.7, §1, §1, §2.
- Géométrie de finsler sur les espaces symétriques. Ph.D. Thesis, Université de Genève, Geneve, Switzerland. Cited by: §C.6, §3.1.
- Generalized nonlinear and Finsler geometry for robotics. CoRR abs/2010.14745. External Links: Cited by: §2.
- Riemannian Gaussian distributions on the space of symmetric positive definite matrices. IEEE Transactions on Information Theory 63 (4), pp. 2153–2170. External Links: Cited by: §1, §2.
- Riemann-Finsler geometry with applications to information geometry. Chinese Annals of Mathematics, Series B 27, pp. 73–94. External Links: Cited by: §2.
- Hyperbolic neural networks++. In International Conference on Learning Representations, External Links: Cited by: §4, §4.
- Positive definite matrices and the S-divergence. Proceedings of the American Mathematical Society. Note: Published electronically: October 22, 2015 External Links: Cited by: §C.7.
- A new metric on the manifold of kernel matrices with application to matrix geometric means. In Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 25. External Links: Cited by: §C.7, §1.
- RotatE: knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations, External Links: Cited by: §6.1, §6.2.
- Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, New York, NY, USA, pp. 583–591. External Links: Cited by: §6.3.
- Multi-class classification on Riemannian manifolds for video surveillance. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part II, K. Daniilidis, P. Maragos, and N. Paragios (Eds.), Lecture Notes in Computer Science, Vol. 6312, pp. 378–391. External Links: Cited by: §1, §2.
- Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, Beijing, China, pp. 57–66. External Links: Cited by: §6.1.
- Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 2071–2080. Cited by: §B.1.
- Region covariance: a fast descriptor for detection and classification. In Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz (Eds.), Berlin, Heidelberg, pp. 589–600. Cited by: §1, §2.
- Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 30 (10), pp. 1713–1727. External Links: Cited by: §1, §2.
- Beyond pseudo-rotations in pseudo-Euclidean spaces. Mathematical Analysis and its Applications, Academic Press. Cited by: §D.2.
- A gyrovector space approach to hyperbolic geometry. Morgan & Claypool. Cited by: Appendix D.
- Gyrovector spaces and their differential geometry. Nonlinear Functional Analysis and Applications 10. Cited by: §D.1, §D.1, Appendix D.
- Riemannian metric learning for symmetric positive definite matrices. ArXiv abs/1501.02393. Cited by: §1, §2.
- A geometric interpretation of Ungar’s addition and of gyration in the hyperbolic plane. Topology and Its Applications 152 (3), pp. 226–242. External Links: Cited by: §4.
- What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 22–32. External Links: Cited by: §6.3.
- Discriminant analysis on Riemannian manifold of Gaussian distributions for face recognition with image sets. IEEE Transactions on Image Processing 27 (1), pp. 151–163. External Links: Cited by: §1, §2.
- Manifold kernel sparse representation of symmetric positive-definite matrices and its applications. IEEE Transactions on Image Processing 24 (11), pp. 3729–3741. External Links: Cited by: §1, §2.
- NagE: non-abelian group embedding for knowledge graphs. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1735–1742. External Links: Cited by: §6.1.
- WikiQA: a challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2013–2018. External Links: Cited by: §6.3.
- Kernel sparse subspace clustering on symmetric positive definite manifolds. In CVPR, pp. 5157–5164. External Links: Cited by: §1, §2.
- Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 353–362. External Links: Cited by: §6.2.
- Quaternion knowledge graph embeddings. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 2735–2745. External Links: Cited by: §B.1.
- Deep learning based recommender system: a survey and new perspectives. ACM Comput. Surv. 52 (1). External Links: Cited by: §6.2.
- Deep manifold-to-manifold transforming network. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 4098–4102. External Links: Cited by: §2.
- Learning over knowledge-base embeddings for recommendation. CoRR abs/1803.06540. External Links: Cited by: §B.2.
- Convex class model on symmetric positive definite manifolds. Image and Vision Computing 87, pp. 57–67. External Links: Cited by: §2.
Appendix A Implementation Details
A.1 Computational Complexity of Operations
In this section we discuss the theoretical computational complexity of the different operations involved in this work. We employ Big O notation (https://en.wikipedia.org/wiki/Big_O_notation). Since in all cases operations are not nested but applied sequentially, their costs add up to a polynomial expression. Thus, by the properties of the notation, we disregard the lower-order terms of the polynomial.
For matrices, the associated complexity of each operation is as follows (see https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations):
Addition and subtraction:
For SPD matrices, the associated complexity of each operation is as follows:
Exp/Log map: , due to diagonalizations.
Gyro-Addition: , due to matrix multiplications
Matrix Scaling: , due to exp and log maps.
Isometries: , due to matrix multiplications.
The full computation of the distance algorithm in involves matrix square root, inverses, multiplications, and diagonalizations. Since they are applied sequentially, without affecting the dimensionality of the matrices, we can take the highest value as the asymptotic cost of the algorithm, which is .
A.2 Tangent Space Optimization
Optimization in Riemannian manifolds normally requires Riemannian Stochastic Gradient Descent (RSGD) Bonnabel (2011) or other Riemannian techniques Bécigneul and Ganea (2019). We performed initial tests converting the Euclidean gradient into its Riemannian form, but found it to be less numerically stable and also slower than tangent space optimization Chami et al. (2019). With tangent space optimization, we can use standard Euclidean optimization techniques, and respect the geometry of the manifold. Note that tangent space optimization is an exact procedure, which does not incur losses in representational power. This is the case in specifically because of a completeness property given by the choice of as the basepoint: there is always a global bijection between the tangent space and the manifold.
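A minimal sketch of tangent space optimization, under stated assumptions: the free parameters are the entries of a symmetric matrix (a tangent vector at the identity), the matrix exponential maps them to an SPD matrix, and a plain Euclidean gradient step updates the parameters. The toy objective (squared affine-invariant distance to a fixed target), the step size, and the finite-difference gradient are illustrative choices made to keep the example self-contained; they stand in for the automatic differentiation a PyTorch implementation would use.

```python
import numpy as np

def sym_from_params(theta, n):
    """Fill a symmetric n x n matrix from its n(n+1)/2 free entries."""
    A = np.zeros((n, n))
    A[np.triu_indices(n)] = theta
    return A + np.triu(A, k=1).T

def sym_function(X, fn):
    """Apply fn to the eigenvalues of a symmetric matrix."""
    eigvals, K = np.linalg.eigh(X)
    return (K * fn(eigvals)) @ K.T

def squared_distance(P, Q):
    """Assumed affine-invariant squared distance between SPD matrices."""
    P_inv_sqrt = sym_function(P, lambda w: w ** -0.5)
    S = P_inv_sqrt @ Q @ P_inv_sqrt
    S = (S + S.T) / 2.0
    return np.sum(np.log(np.linalg.eigvalsh(S)) ** 2)

def loss(theta, target, n):
    P = sym_function(sym_from_params(theta, n), np.exp)  # exp(symmetric) is SPD
    return squared_distance(P, target)

def numerical_gradient(f, theta, h=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * h)
    return grad

n = 2
target = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.zeros(n * (n + 1) // 2)           # start at exp(0) = identity
initial_loss = loss(theta, target, n)
for _ in range(300):                         # ordinary Euclidean gradient descent
    theta = theta - 0.1 * numerical_gradient(lambda t: loss(t, target, n), theta)
final_loss = loss(theta, target, n)
P_final = sym_function(sym_from_params(theta, n), np.exp)
print(initial_loss, final_loss)
```

Every iterate is a valid SPD matrix by construction, so no retraction or projection step is needed; the optimizer never leaves the flat parameter space.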
Appendix B Experimental Details
All models and experiments were implemented in PyTorch Paszke et al. (2019) with distributed data parallelism, for high performance on clusters of CPUs/GPUs.
All experiments were run on Intel Cascade Lake CPUs, with Intel Xeon Gold 6230 microprocessors (20 Cores, 40 Threads, 2.1 GHz, 28MB Cache, 125W TDP). Although the code supports GPUs, we did not utilize them due to the higher availability of CPUs.
B.1 Knowledge Graph Completion
We train for epochs, with batch size of and negative samples, reducing the learning rate by a factor of if the model does not improve the performance on the dev set after epochs, and early stopping based on the MRR if the model does not improve after epochs. We use the burn-in strategy Nickel and Kiela (2017) training with a 10 times smaller learning rate for the first 10 epochs. We experiment with matrices of dimension where (this is the equivalent of degrees of freedom respectively), learning rates from and weight decays of .
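The burn-in rule described above can be sketched as a small schedule function. The base learning rate of 1e-3 and the zero-based epoch indexing are assumptions made for illustration only; the text fixes just the factor (10 times smaller) and the duration (the first 10 epochs).

```python
def burn_in_lr(base_lr, epoch, burn_in_epochs=10, burn_in_factor=0.1):
    """Learning rate for a zero-indexed epoch under a burn-in schedule."""
    if epoch < burn_in_epochs:
        return base_lr * burn_in_factor      # reduced rate during burn-in
    return base_lr                           # full rate afterwards

# Hypothetical base learning rate; epochs 0..9 use the reduced rate.
schedule = [burn_in_lr(1e-3, epoch) for epoch in range(12)]
print(schedule)
```

In a training loop this value would be written into the optimizer before each epoch, and any plateau-based reduction would then act on the base rate.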
Statistics for the datasets used in the knowledge graph experiments can be found in Table 4.
B.2 Knowledge Graph Recommender Systems
We train for epochs, with batch size from and negative samples, and early stopping based on the MRR if the model does not improve after epochs. We use the burn-in strategy Nickel and Kiela (2017), training with a 10 times smaller learning rate for the first 10 epochs. We report the average and standard deviation over 3 runs. We experiment with matrices of dimension (this is the equivalent of degrees of freedom), learning rates from and weight decay of . The same grid search is applied to the baselines.
On the Amazon dataset we adopt the 5-core split for the branches "Software", "Luxury & Beauty" and "Industrial & Scientific", which form a diverse dataset in size and domain. We add relationships used in previous work Zhang et al. (2018b); Ai et al. (2018). These are:
also_bought: users who bought item A also bought item B.
also_view: users who bought item A also viewed item B.
category: the item belongs to one or more categories.
brand: the item belongs to one brand.
On the MindReader dataset, we consider a user-item interaction when a user gave an explicit positive rating to the movie. The relationships added are:
directed_by: the movie was directed by this person.
produced_by: the movie was produced by this person/company.
from_decade: the movie was released in this decade.
followed_by: the movie was followed by this other movie.
has_genre: the movie belongs to this genre.
has_subject: the movie has this subject.
starring: the movie stars this person.
Statistics of the datasets with the added relationships can be seen in Table 6. For dev/test we only consider users with 3 or more interactions.
B.3 Question Answering
We train for epochs, with negative samples and early stopping based on the MRR if the model does not improve after epochs. We use the burn-in strategy Nickel and Kiela (2017), training with a 10 times smaller learning rate for the first 10 epochs. We report the average and standard deviation over 3 runs. We experiment with matrices of dimension (equivalent of degrees of freedom respectively), batch size from , learning rate from and weight decays of . The same grid search was applied to the baselines.
Statistics for the datasets used in the Question Answering experiments can be found in Table 7.
Appendix C Differential Geometry of
C.1 Orthogonal Diagonalization
Every real symmetric matrix may be orthogonally diagonalized: For every point we may find a positive diagonal matrix and an orthogonal matrix such that . This diagonalization has two practical consequences: it allows efficient computation of important operations, and provides another means of generalizing Euclidean notions to .
With respect to computation, if has orthogonal diagonalization , we may compute its square root and logarithm as and where and for . Similarly, if a tangent vector has orthogonal diagonalization (here not necessarily positive definite), the exponential map is computed as , where .
We verify this in the two lemmas below.
If and is any matrix, then .
As is orthogonal, Conjugation is an automorphism of the algebra of matrices, and so applying this to any partial sum of the exponential yields
Taking the limit of this equality as gives the claimed result. ∎
If is a diagonal matrix, then .
The multiplication of diagonal matrices coincides with the elementwise product of their diagonal entries. Again applying this to any partial sum of the exponential of gives
Taking the limit of this equality as gives the claimed result. ∎
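The diagonalization-based formulas of this subsection can be checked numerically. The sketch below uses `numpy.linalg.eigh` to obtain an orthogonal diagonalization, applies a scalar function to the eigenvalues, and compares the resulting exponential with a partial sum of the exponential series, mirroring the two lemmas. The helper names and the random test matrices are illustrative, not taken from the original implementation.

```python
import numpy as np

def spd_function(X, fn):
    """Apply fn to the eigenvalues of a symmetric matrix X = K D K^T."""
    eigvals, K = np.linalg.eigh(X)
    return (K * fn(eigvals)) @ K.T           # K @ diag(fn(eigvals)) @ K.T

def exp_partial_sum(U, terms=40):
    """Partial sum of the exponential series sum_k U^k / k!."""
    total = np.zeros_like(U)
    term = np.eye(U.shape[0])
    for k in range(terms):
        total = total + term
        term = term @ U / (k + 1)
    return total

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
P = A @ A.T + 3.0 * np.eye(3)                # a random SPD matrix
U = (A + A.T) / 2.0                          # a random symmetric (tangent) matrix

sqrt_P = spd_function(P, np.sqrt)
log_P = spd_function(P, np.log)
exp_U = spd_function(U, np.exp)

print(np.allclose(sqrt_P @ sqrt_P, P))                 # square root
print(np.allclose(spd_function(log_P, np.exp), P))     # exp undoes log
print(np.allclose(exp_U, exp_partial_sum(U)))          # agrees with the series
```

A single symmetric eigendecomposition therefore suffices for the square root, the logarithm and the exponential.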
C.2 Metric and Isometries
The Riemannian metric on is defined as follows: if are tangent vectors based at , their inner product is:
Note that at the basepoint, this is just the standard matrix inner product as are symmetric. We now verify that the action given by acting as is an action by isometries of this metric.
The action extends to tangent vectors based at without change in formula:
Let and be a tangent vector based at . Then by definition, is the derivative of some path of matrices in through . We compute the action of on by taking the derivative of its action on the path:
For every the transformation preserves the Riemannian metric on .
Let and choose an arbitrary point , and tangent vectors . We compute the pullback of the metric under the symmetry . Computing directly from the definition and the previous lemma,
where the penultimate equality uses that trace is invariant under conjugacy.
This provides a vivid geometric interpretation of the previously discussed orthogonal diagonalization operation on .
Given any , there exists a symmetry fixing which moves to a diagonal matrix.
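The invariance of the metric and the diagonalizing symmetry can be illustrated numerically. Because the formulas are not legible in this text, the sketch assumes the standard affine-invariant inner product tr(P^{-1} U P^{-1} V) and the congruence action X ↦ M X M^T on both points and tangent vectors; this is consistent with the remark that the metric reduces to the usual matrix inner product at the basepoint. All function names are illustrative.

```python
import numpy as np

def metric(P, U, V):
    """Assumed affine-invariant inner product tr(P^-1 U P^-1 V) at P."""
    P_inv = np.linalg.inv(P)
    return np.trace(P_inv @ U @ P_inv @ V)

def act(M, X):
    """Assumed action of an invertible M on points and tangent vectors."""
    return M @ X @ M.T

def random_symmetric(rng, n):
    A = rng.normal(size=(n, n))
    return (A + A.T) / 2.0

rng = np.random.default_rng(1)
n = 4
B = rng.normal(size=(n, n))
P = B @ B.T + n * np.eye(n)                  # base point in SPD
U, V = random_symmetric(rng, n), random_symmetric(rng, n)

# Unit upper-triangular times positive diagonal: determinant 3, so invertible.
upper = np.triu(rng.normal(size=(n, n)), k=1)
M = (np.eye(n) + upper) @ np.diag([1.0, 2.0, 0.5, 3.0])

before = metric(P, U, V)
after = metric(act(M, P), act(M, U), act(M, V))
print(np.isclose(before, after))             # the inner product is preserved

# An orthogonal K fixes the identity and moves P to a diagonal matrix.
eigvals, K = np.linalg.eigh(P)               # P = K diag(eigvals) K^T
moved = act(K.T, P)
print(np.allclose(act(K.T, np.eye(n)), np.eye(n)))
print(np.allclose(moved, np.diag(eigvals)))
```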
This subspace of diagonal matrices plays an essential role in working with . As we verify below, the intrinsic geometry of this subspace of diagonal matrices inherited from the Riemannian metric on is flat.
Let be the set of diagonal matrices, and define by . Then is an isometry from the Euclidean metric on to the metric on induced from .
We pull back the metric on by , and see that on this results in the standard Euclidean metric. Given a point with tangent vectors , we compute this as
From the definition of , we see that the pushforward of along is and similarly for . Thus we may compute directly and see the result is the standard dot product on .
This subspace is in fact a maximal flat for , the largest dimensional totally geodesic Euclidean submanifold embedded in . For more information on the general theory of symmetric spaces from which the notion of maximal flats arises, see Helgason (1978). For our purposes, it is only important to note the following fact.
The set of diagonal matrices in is an isometrically and totally geodesically embedded copy of Euclidean -space.
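The flatness of the diagonal subspace can be illustrated with a small numerical check. The sketch assumes that the embedding sends a vector x to the diagonal matrix with entries exp(x_i), and that the distance on the manifold is the affine-invariant one, ||log(P^{-1/2} Q P^{-1/2})||_F; neither formula is legible in this text, so both are labeled as assumptions in the code.

```python
import numpy as np

def spd_power(P, power):
    """Real power of an SPD matrix via orthogonal diagonalization."""
    eigvals, K = np.linalg.eigh(P)
    return (K * eigvals ** power) @ K.T

def distance(P, Q):
    """Assumed affine-invariant distance ||log(P^-1/2 Q P^-1/2)||_F."""
    P_inv_sqrt = spd_power(P, -0.5)
    S = P_inv_sqrt @ Q @ P_inv_sqrt
    S = (S + S.T) / 2.0                      # remove round-off asymmetry
    return np.sqrt(np.sum(np.log(np.linalg.eigvalsh(S)) ** 2))

def phi(x):
    """Assumed embedding of R^n as diagonal SPD matrices."""
    return np.diag(np.exp(x))

x = np.array([0.3, -1.2, 2.0])
y = np.array([1.5, 0.4, -0.7])
manifold_distance = distance(phi(x), phi(y))
euclidean_distance = np.linalg.norm(x - y)
print(np.isclose(manifold_distance, euclidean_distance))
```

Under these assumptions, distances between diagonal matrices coincide with Euclidean distances between their logarithmic coordinates.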
C.3 Exponential and Logarithmic Maps
The Riemannian exponential map gives a connection between the Euclidean geometry of the tangent space and the curved geometry of . It assigns the tangent vector to the point of reached by traveling along the geodesic starting from the basepoint in direction for distance .
As a consequence of non-positive curvature, is a diffeomorphism of onto , and so has an inverse: the Riemannian logarithm . See Ballmann et al. (1985) for a review of the general theory of manifolds of non-positive curvature. Together, this pair of functions allows one to freely move between 'tangent space coordinates' and the original 'manifold coordinates', which we exploit to transfer Euclidean optimization schemes to (see §5).
Secondly, the geometry of is so tightly tied to the algebra of matrices that the Riemannian exponential agrees exactly with the usual matrix exponential, and the Riemannian logarithm is the matrix logarithm (because of this, we do not distinguish the two notationally), as we verify in the proposition below. Both of these are readily computable via orthogonal diagonalization, as given in §C.1. This is in stark contrast to general Riemannian manifolds, where the exponential map may have no simple formula.
Let be the Riemannian exponential map based at , and be the matrix exponential. Then .
Let be a tangent vector to at the basepoint , and orthogonally diagonalize as for some , . As is tangent to the maximal flat of diagonal matrices, the geodesic segment must be a geodesic in which we know from Lemma 3 to be the coordinate-wise exponential of a straight line in . Precisely, this geodesic is , and so the original geodesic with initial tangent is by Lemma 1. Specializing to , this gives the claim:
This easily transfers to an understanding of the Riemannian exponential at an arbitrary point , if we identify the tangent space at with the symmetric matrices as well.
The exponential based at an arbitrary point is given by
Given and tangent vector identified with the set of symmetric matrices, note that is a symmetry of taking to and to . Using the fact that we understand the Riemannian exponential at the basepoint, we see It only remains to translate the result back to , giving the claimed formula. ∎
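The exponential at an arbitrary basepoint can be sketched by conjugating with the square root of the basepoint. The formula used here, P^{1/2} exp(P^{-1/2} V P^{-1/2}) P^{1/2}, is the standard one for the affine-invariant metric and is an assumption insofar as the equation is not legible in this text; `log_at` is its standard inverse, included only to check the round trip.

```python
import numpy as np

def sym_function(X, fn):
    """Apply fn to the eigenvalues of a symmetric matrix."""
    eigvals, K = np.linalg.eigh(X)
    return (K * fn(eigvals)) @ K.T

def exp_at(P, V):
    """Assumed formula P^1/2 exp(P^-1/2 V P^-1/2) P^1/2."""
    s = sym_function(P, np.sqrt)
    s_inv = sym_function(P, lambda w: 1.0 / np.sqrt(w))
    W = s_inv @ V @ s_inv
    Q = s @ sym_function((W + W.T) / 2.0, np.exp) @ s
    return (Q + Q.T) / 2.0

def log_at(P, Q):
    """Assumed inverse P^1/2 log(P^-1/2 Q P^-1/2) P^1/2."""
    s = sym_function(P, np.sqrt)
    s_inv = sym_function(P, lambda w: 1.0 / np.sqrt(w))
    S = s_inv @ Q @ s_inv
    return s @ sym_function((S + S.T) / 2.0, np.log) @ s

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
P = A @ A.T + 3.0 * np.eye(3)                # basepoint
V = (A + A.T) / 2.0                          # symmetric tangent vector at P

Q = exp_at(P, V)
P_inv = np.linalg.inv(P)
tangent_norm = np.sqrt(np.trace(P_inv @ V @ P_inv @ V))
s_inv = sym_function(P, lambda w: 1.0 / np.sqrt(w))
geodesic_length = np.linalg.norm(sym_function(s_inv @ Q @ s_inv, np.log))

print(np.all(np.linalg.eigvalsh(Q) > 0))     # the result is positive definite
print(np.allclose(log_at(P, Q), V))          # log inverts exp
print(np.isclose(geodesic_length, tangent_norm))
```

The last check reflects that the geodesic from P to exp_at(P, V) has length equal to the norm of V in the metric at P.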
Let be the Riemannian logarithm map based at , and be the matrix logarithm (note that while the matrix logarithm is multivalued in general, it is uniquely defined on ). Then .
Defined as the inverse of , the Riemannian logarithm must satisfy
Let and orthogonally diagonalize as . Applying the Riemannian exponential, we see . Recalling from Lemma 3 the relation between isometries of and their application on tangent vectors, we see that we may rewrite the left hand side as . Appropriately cancelling the factors of we arrive at the relationship
That is, restricted to the diagonal matrices, the Riemannian logarithm is an inverse of the matrix exponential, so Riemannian log equals matrix log. Re-absorbing the original factors of shows the same to be true for any positive definite symmetric matrix; thus . ∎
As for the exponential, conjugating by a symmetry moving to an arbitrary point , we may describe the Riemannian logarithm at any point of .
The logarithm based at an arbitrary point