Ultrahyperbolic Representation Learning
In machine learning, data is usually represented in a (flat) Euclidean space where distances between points are along straight lines. Researchers have recently considered more exotic (non-Euclidean) Riemannian manifolds such as hyperbolic space which is well suited for tree-like data. In this paper, we propose a representation living on a pseudo-Riemannian manifold with constant nonzero curvature. It is a generalization of hyperbolic and spherical geometries where the nondegenerate metric tensor is not positive definite. We provide the necessary learning tools in this geometry and extend gradient method optimization techniques. More specifically, we provide closed-form expressions for distances via geodesics and define a descent direction that guarantees the minimization of the objective problem. Our novel framework is applied to graph representations.
In most machine learning applications, data representations lie on a smooth manifold lee2013introduction and the training procedure is optimized with an iterative algorithm such as line search or trust region methods nocedal2006numerical. In most cases, the smooth manifold is Riemannian: it is equipped with a positive definite metric on each tangent space (i.e. every non-vanishing tangent vector has positive squared norm). Due to the positive definiteness of the metric, the negative of the (Riemannian) gradient is a descent direction that can be exploited to iteratively minimize some objective function absil2009optimization.
The choice of metric on the Riemannian manifold determines how relations between points are quantified. The most common Riemannian manifold is the flat Euclidean space, which has constant zero curvature and in which the distances between points are measured along straight lines. An intuitive example of a non-Euclidean Riemannian manifold is the spherical model (i.e. representations lie on a hypersphere) that has constant positive curvature and is used for instance in face recognition Tapaswi_2019_ICCV ; wang2017normface. On the hypersphere, geodesic distances are a function of angles. Similarly, Riemannian spaces of constant negative curvature are called hyperbolic petersen2006riemannian. Such spaces were shown by Gromov to be well suited to represent tree-like structures gromov1987hyperbolic. The machine learning community has adopted these spaces to learn tree-like graphs chen2013hyperbolicity and hierarchical data structures pmlr-v97-law19a ; nickel2017poincare ; nickel2018learning.
In this paper, we consider a class of non-Riemannian manifolds of constant nonzero curvature wolf1972spaces not previously considered in machine learning. These manifolds not only generalize the hyperbolic and spherical geometries mentioned above, but also contain hyperbolic and spherical submanifolds and can therefore describe relationships specific to those geometries. The difference is that we consider the larger class of pseudo-Riemannian manifolds where the considered nondegenerate metric tensor is not positive definite. Optimizing a cost function on our non-flat ultrahyperbolic space requires a descent direction method that follows a path along the curved manifold. We achieve this by employing tools from differential geometry such as geodesics and exponential maps. The theoretical contributions in this paper are two-fold: (1) explicit methods to calculate dissimilarities and (2) general optimization tools on pseudo-Riemannian manifolds of constant nonzero curvature.
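Notation: We denote points on a smooth manifold $\mathcal{M}$ lee2013introduction by boldface Roman characters $\mathbf{x} \in \mathcal{M}$. $T_{\mathbf{x}}\mathcal{M}$ is the tangent space of $\mathcal{M}$ at $\mathbf{x}$ and we write tangent vectors $\boldsymbol{\xi} \in T_{\mathbf{x}}\mathcal{M}$ in boldface Greek fonts. A vector dot product is a positive definite bilinear form denoted $\langle \cdot, \cdot \rangle$ and defined as $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y}$. The norm of $\mathbf{x}$ is $\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$. $\mathbb{R}^d$ denotes the $d$-dimensional Euclidean space and $\mathbb{R}^d_* = \mathbb{R}^d \setminus \{\mathbf{0}\}$ is the Euclidean space with the origin removed.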
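Pseudo-Riemannian manifolds: A smooth manifold $\mathcal{M}$ is pseudo-Riemannian (also called semi-Riemannian o1983semi) if it is equipped with a pseudo-Riemannian metric tensor (named “metric” for short in differential geometry). The pseudo-Riemannian metric $g_{\mathbf{x}} : T_{\mathbf{x}}\mathcal{M} \times T_{\mathbf{x}}\mathcal{M} \to \mathbb{R}$ at some point $\mathbf{x} \in \mathcal{M}$ is a non-degenerate smooth symmetric bilinear 2-form. Non-degeneracy means that if for a given $\boldsymbol{\xi} \in T_{\mathbf{x}}\mathcal{M}$ and for all $\boldsymbol{\zeta} \in T_{\mathbf{x}}\mathcal{M}$ we have $g_{\mathbf{x}}(\boldsymbol{\xi}, \boldsymbol{\zeta}) = 0$, then $\boldsymbol{\xi} = \mathbf{0}$. If the metric is also positive definite (i.e. $g_{\mathbf{x}}(\boldsymbol{\xi}, \boldsymbol{\xi}) > 0$ iff $\boldsymbol{\xi} \neq \mathbf{0}$), then it is Riemannian. Riemannian geometry is then a special case of pseudo-Riemannian geometry where the metric is positive definite. In general, this is not the case and non-Riemannian manifolds distinguish themselves by having non-vanishing tangent vectors $\boldsymbol{\xi} \neq \mathbf{0}$ that satisfy $g_{\mathbf{x}}(\boldsymbol{\xi}, \boldsymbol{\xi}) \leq 0$. For more details, we refer the reader to anciaux2011minimal ; o1983semi ; wolf1972spaces.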
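Pseudo-hyperboloids generalize spherical and hyperbolic geometries to the class of pseudo-Riemannian manifolds. First, we consider the flat Euclidean space $\mathbb{R}^{d+1}$ with $d = p + q$ where each vector is written $\mathbf{x} = (x_0, x_1, \dots, x_{p+q})^\top$. $\mathbb{R}^{d+1}$ becomes a pseudo-Euclidean space denoted $\mathbb{R}^{p,q+1}$ when equipped with the following scalar product:
$$\langle \mathbf{a}, \mathbf{b} \rangle_q = -\sum_{i=0}^{q} a_i b_i + \sum_{j=q+1}^{p+q} a_j b_j = \mathbf{a}^\top \mathbf{G} \mathbf{b}, \qquad (1)$$
where $\mathbf{G} = \mathbf{G}^{-1}$ is the diagonal matrix with the first $q+1$ diagonal elements equal to $-1$ and the remaining $p$ equal to $1$. Since $\mathbb{R}^{p,q+1}$ is a vector space, we can identify the tangent space to the space itself by means of the natural isomorphism $T_{\mathbf{x}}\mathbb{R}^{p,q+1} \approx \mathbb{R}^{p,q+1}$. Using the terminology of special relativity, $\mathbb{R}^{p,q+1}$ has $q+1$ time dimensions and $p$ space dimensions.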
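As a concrete illustration of the scalar product of Eq. (1), here is a minimal sketch in plain Python. The function name `scalar_product_q` and the list-based vector representation are choices made for this example only; they are not taken from the paper's implementation.

```python
def scalar_product_q(a, b, q):
    """Pseudo-Euclidean scalar product <a, b>_q = a^T G b.

    The first q + 1 coordinates are time-like (weight -1) and the
    remaining p coordinates are space-like (weight +1).
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


# q = 1: two time dimensions followed by one space dimension.
print(scalar_product_q([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], q=1))  # -4 - 10 + 18

# A nonzero vector can have zero squared norm (a "null" vector),
# which cannot happen with a positive definite dot product.
print(scalar_product_q([1.0, 0.0, 1.0], [1.0, 0.0, 1.0], q=1))
```

The second call shows why the scalar product does not induce a norm in the usual sense: the vector is nonzero, yet its time-like and space-like contributions cancel.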
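A pseudo-hyperboloid is the following submanifold of codimension one in $\mathbb{R}^{p,q+1}$:
$$\mathcal{Q}_{\beta}^{p,q} = \left\{ \mathbf{x} \in \mathbb{R}^{p,q+1} : \|\mathbf{x}\|_q^2 = \beta \right\}, \qquad (2)$$
where $\beta$ is a nonzero real value and $\|\mathbf{x}\|_q^2 = \langle \mathbf{x}, \mathbf{x} \rangle_q$ is the associated quadratic form of the scalar product at $\mathbf{x}$. It is equivalent to work with either $\mathcal{Q}_{\beta}^{p,q}$ or $\mathcal{Q}_{-\beta}^{q+1,p-1}$ since they are interchangeable via an anti-isometry o1983semi (see supp. material for details). For instance, the unit $q$-dimensional hypersphere $\mathcal{S}^q$ is anti-isometric to $\mathcal{Q}_{-1}^{0,q}$ which is then spherical.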
In the literature, the set $\mathcal{Q}_{\beta}^{p,q}$ is called a “pseudo-sphere” when $\beta > 0$ and a “pseudo-hyperboloid” when $\beta < 0$. In the following, we will restrict ourselves to the pseudo-hyperbolic case (i.e. $\beta < 0$). We can obtain the spherical and hyperbolic geometries by constraining all the elements of the space dimensions of a pseudo-hyperboloid to be zero or constraining all the elements of the time dimensions except one to be zero, respectively. Pseudo-hyperboloids are then more general.
The pseudo-hyperboloids that we consider in this paper are hard to visualize as they live in ambient spaces with dimension higher than 3. In Fig. 1, we show iso-surfaces of a projection of a 3-dimensional pseudo-hyperboloid (embedded in a 4-dimensional pseudo-Euclidean space) into $\mathbb{R}^3$ along its first time dimension.
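Tangent space: By using the isomorphism mentioned above, the tangent space of $\mathcal{Q}_{\beta}^{p,q}$ at $\mathbf{x}$ can be defined as $T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q} = \{ \boldsymbol{\xi} \in \mathbb{R}^{p,q+1} : \langle \mathbf{x}, \boldsymbol{\xi} \rangle_q = 0 \}$ for all $\mathbf{x} \in \mathcal{Q}_{\beta}^{p,q}$. The orthogonal projection of an arbitrary $(d+1)$-dimensional vector $\mathbf{z}$ onto $T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$ is:
$$\Pi_{\mathbf{x}}(\mathbf{z}) = \mathbf{z} - \frac{\langle \mathbf{z}, \mathbf{x} \rangle_q}{\langle \mathbf{x}, \mathbf{x} \rangle_q} \mathbf{x}. \qquad (3)$$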
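The projection onto the tangent space removes the component of a vector along $\mathbf{x}$, measured with the scalar product $\langle \cdot, \cdot \rangle_q$. The sketch below assumes the standard form $\Pi_{\mathbf{x}}(\mathbf{z}) = \mathbf{z} - \frac{\langle \mathbf{z}, \mathbf{x} \rangle_q}{\langle \mathbf{x}, \mathbf{x} \rangle_q} \mathbf{x}$; the helper names are hypothetical.

```python
def scalar_product_q(a, b, q):
    """<a, b>_q: the first q + 1 coordinates count negatively, the rest positively."""
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


def project_to_tangent(x, z, q):
    """Orthogonal projection of z onto the tangent space at x (w.r.t. <.,.>_q)."""
    coeff = scalar_product_q(z, x, q) / scalar_product_q(x, x, q)
    return [zi - coeff * xi for zi, xi in zip(z, x)]


# x satisfies <x, x>_q = -1 for q = 1, so it lies on a pseudo-hyperboloid with beta = -1.
x = [1.0, 0.0, 0.0]
z = [0.3, 0.5, -0.2]
xi = project_to_tangent(x, z, q=1)

# The projected vector is orthogonal to x in the pseudo-Euclidean sense.
print(xi, scalar_product_q(xi, x, q=1))
```

Projecting a second time leaves the vector unchanged, which is a quick sanity check that the map is indeed a projection.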
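This section introduces the differential geometry tools necessary to quantify dissimilarities/distances between points on $\mathcal{Q}_{\beta}^{p,q}$. Measuring dissimilarity is an important task in machine learning and has many applications (e.g. in metric learning xing2003distance). Before considering distances on $\mathcal{Q}_{\beta}^{p,q}$, we consider distances in the pseudo-Euclidean space $\mathbb{R}^{p,q+1}$ in which it is embedded. We recall that $\mathbb{R}^{p,q+1}$ is isomorphic to its tangent space. Tangent vectors are therefore naturally identified with points. Using the quadratic form defined in Eq. (1), the squared ambient distance between two points $\mathbf{a}, \mathbf{b} \in \mathcal{Q}_{\beta}^{p,q}$ is:
$$\|\mathbf{a} - \mathbf{b}\|_q^2 = \langle \mathbf{a} - \mathbf{b}, \mathbf{a} - \mathbf{b} \rangle_q = 2\beta - 2 \langle \mathbf{a}, \mathbf{b} \rangle_q. \qquad (4)$$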
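This distance is a good proxy for the geodesic distance $\mathsf{d}_\gamma(\cdot, \cdot)$, that we introduce below, if it preserves distance relations: $\mathsf{d}_\gamma(\mathbf{a}, \mathbf{b}) < \mathsf{d}_\gamma(\mathbf{c}, \mathbf{d})$ iff $\|\mathbf{a} - \mathbf{b}\|_q^2 < \|\mathbf{c} - \mathbf{d}\|_q^2$. This relation is satisfied for two special cases of pseudo-hyperboloids for which the geodesic distance is well known:
Hyperbolic ($q = 0$ and $\beta < 0$): $\mathsf{d}_\gamma(\mathbf{a}, \mathbf{b}) = \sqrt{|\beta|} \cosh^{-1}\left( \frac{\langle \mathbf{a}, \mathbf{b} \rangle_q}{\beta} \right)$, called Poincaré distance.
Spherical ($p = 0$): $\mathsf{d}_\gamma(\mathbf{a}, \mathbf{b}) = \sqrt{|\beta|} \cos^{-1}\left( \frac{\langle \mathbf{a}, \mathbf{b} \rangle_q}{\beta} \right)$, called spherical distance.
The ambient distance has been shown to be a good proxy in hyperbolic geometry pmlr-v97-law19a. Unfortunately, for the remaining ultrahyperbolic case (i.e. pseudo-hyperboloids that are neither hyperbolic nor spherical), the distance relations are not preserved: $\mathsf{d}_\gamma(\mathbf{a}, \mathbf{b}) < \mathsf{d}_\gamma(\mathbf{c}, \mathbf{d})$ does not imply $\|\mathbf{a} - \mathbf{b}\|_q^2 < \|\mathbf{c} - \mathbf{d}\|_q^2$, and vice versa. Therefore, we need to compute the geodesics explicitly for these cases. This section contains the first theoretical contribution of this paper, specifically closed-form expressions for the geodesic distance on ultrahyperbolic manifolds.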
Geodesics: Informally, a geodesic is a curve joining points on a manifold $\mathcal{M}$ that minimizes some “effort” depending on the metric. More precisely, let $I \subseteq \mathbb{R}$ be a (maximal) interval containing $0$. A geodesic $\gamma : I \to \mathcal{M}$ maps a real value $t \in I$ to a point on the manifold $\mathcal{M}$. It is a curve on $\mathcal{M}$ defined by its initial point $\gamma(0) = \mathbf{x} \in \mathcal{M}$ and initial tangent vector $\gamma'(0) = \boldsymbol{\xi} \in T_{\mathbf{x}}\mathcal{M}$ where $\gamma'(t)$ is the derivative of $\gamma$ at $t$. By analogy with physics, $t$ is considered as a time value. Intuitively, one can think of the curve as the trajectory over time of a ball being pushed from a point $\mathbf{x}$ at $t = 0$ with initial velocity $\boldsymbol{\xi}$ and constrained to roll on the manifold. We denote this curve explicitly by $\gamma_{\mathbf{x} \to \boldsymbol{\xi}}(t)$ unless the dependence is obvious from the context. For this curve to be a geodesic, its acceleration has to be zero: $\forall t \in I, \gamma''(t) = \mathbf{0}$. This condition is a second-order ODE that has a unique solution for a given set of initial conditions lindelof1894application. The interval $I$ is said to be maximal if it cannot be extended to a larger interval. In the case of $\mathcal{Q}_{\beta}^{p,q}$, we have $I = \mathbb{R}$ and $I$ is then maximal.
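Geodesic of $\mathcal{Q}_{\beta}^{p,q}$: As we show in the supp. material, the geodesics of $\mathcal{Q}_{\beta}^{p,q}$ are a combination of the hyperbolic, flat and spherical cases. The nature of the geodesic depends on the sign of $\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q$. For all $t \in \mathbb{R}$, the geodesic $\gamma_{\mathbf{x} \to \boldsymbol{\xi}}$ of $\mathcal{Q}_{\beta}^{p,q}$ with $\beta < 0$ is written:
$$\gamma_{\mathbf{x} \to \boldsymbol{\xi}}(t) = \begin{cases} \cosh\left( \frac{t \sqrt{|\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q|}}{\sqrt{|\beta|}} \right) \mathbf{x} + \frac{\sqrt{|\beta|}}{\sqrt{|\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q|}} \sinh\left( \frac{t \sqrt{|\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q|}}{\sqrt{|\beta|}} \right) \boldsymbol{\xi} & \text{if } \langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q > 0 \\ \mathbf{x} + t \boldsymbol{\xi} & \text{if } \langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q = 0 \\ \cos\left( \frac{t \sqrt{|\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q|}}{\sqrt{|\beta|}} \right) \mathbf{x} + \frac{\sqrt{|\beta|}}{\sqrt{|\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q|}} \sin\left( \frac{t \sqrt{|\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q|}}{\sqrt{|\beta|}} \right) \boldsymbol{\xi} & \text{if } \langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q < 0 \end{cases} \qquad (5)$$
We recall that $\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q = 0$ does not imply $\boldsymbol{\xi} = \mathbf{0}$. The geodesics are an essential ingredient to define a mapping known as the exponential map. See Fig. 2 (left) for a depiction of these three types of geodesics, and Fig. 2 (right) for a depiction of the other quantities introduced in this section.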
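Exponential map: Exponential maps are a way of collecting all of the geodesics of a pseudo-Riemannian manifold $\mathcal{M}$ into a unique differentiable mapping. Let $\mathcal{D}_{\mathbf{x}} \subseteq T_{\mathbf{x}}\mathcal{M}$ be the set of tangent vectors $\boldsymbol{\xi}$ such that $\gamma_{\mathbf{x} \to \boldsymbol{\xi}}$ is defined at least on the interval $[0, 1]$. This allows us to uniquely define the exponential map $\exp_{\mathbf{x}} : \mathcal{D}_{\mathbf{x}} \to \mathcal{M}$ such that $\exp_{\mathbf{x}}(\boldsymbol{\xi}) = \gamma_{\mathbf{x} \to \boldsymbol{\xi}}(1)$.
For pseudo-hyperboloids, the exponential map is complete, that is $\mathcal{D}_{\mathbf{x}} = T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$. Using Eq. (5) with $t = 1$, we obtain an exponential map of the entire tangent space to the manifold:
$$\exp_{\mathbf{x}}(\boldsymbol{\xi}) = \gamma_{\mathbf{x} \to \boldsymbol{\xi}}(1). \qquad (6)$$
We make the important observation that the image of the exponential map does not necessarily cover the entire manifold: not all points on a manifold are connected by a geodesic. This is the case for our pseudo-hyperboloids. Namely, for a given point $\mathbf{x} \in \mathcal{Q}_{\beta}^{p,q}$ there exist points $\mathbf{y} \in \mathcal{Q}_{\beta}^{p,q}$ that are not in the image of the exponential map (i.e. there does not exist a tangent vector $\boldsymbol{\xi}$ such that $\mathbf{y} = \exp_{\mathbf{x}}(\boldsymbol{\xi})$).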
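To make the three geodesic types concrete, the following sketch evaluates the exponential map for a space-like, a time-like and a null tangent vector. It assumes the usual closed form for $\beta < 0$ (a cosh/sinh combination when $\langle \boldsymbol{\xi}, \boldsymbol{\xi} \rangle_q > 0$, a straight line when it is zero, and a cos/sin combination when it is negative); function names are illustrative only.

```python
import math


def scalar_product_q(a, b, q):
    """<a, b>_q: the first q + 1 coordinates count negatively, the rest positively."""
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


def exp_map(x, xi, q, beta):
    """Exponential map at x applied to the tangent vector xi (assumes beta < 0)."""
    n2 = scalar_product_q(xi, xi, q)
    r = math.sqrt(abs(beta))
    if n2 > 0.0:                       # space-like direction: hyperbolic-type geodesic
        n = math.sqrt(n2)
        a, b = math.cosh(n / r), r * math.sinh(n / r) / n
    elif n2 < 0.0:                     # time-like direction: spherical-type geodesic
        n = math.sqrt(-n2)
        a, b = math.cos(n / r), r * math.sin(n / r) / n
    else:                              # null direction: straight line x + xi
        a, b = 1.0, 1.0
    return [a * xk + b * vk for xk, vk in zip(x, xi)]


q, beta = 1, -1.0
x = [1.0, 0.0, 0.0]                    # <x, x>_q = -1 = beta
tangent_vectors = {
    "space-like": [0.0, 0.0, 0.7],
    "time-like": [0.0, 0.7, 0.0],
    "null": [0.0, 1.0, 1.0],
}
for name, xi in tangent_vectors.items():
    y = exp_map(x, xi, q, beta)
    # Every image point should satisfy <y, y>_q = beta up to rounding.
    print(name, y, scalar_product_q(y, y, q))
```

In all three cases the image stays on the manifold, which is what makes the exponential map usable as an update rule later on.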
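Logarithm map: We provide a closed-form expression of the logarithm map for pseudo-hyperboloids. Let $\mathcal{U}_{\mathbf{x}} \subseteq \mathcal{Q}_{\beta}^{p,q}$ be some neighborhood of $\mathbf{x}$. The logarithm map $\log_{\mathbf{x}} : \mathcal{U}_{\mathbf{x}} \to T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$ is defined as the inverse of the exponential map on $\mathcal{U}_{\mathbf{x}}$ (i.e. $\log_{\mathbf{x}} = \exp_{\mathbf{x}}^{-1}$). We propose:
$$\log_{\mathbf{x}}(\mathbf{y}) = \begin{cases} \frac{\cosh^{-1}\left( \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \right)}{\sqrt{\left( \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \right)^2 - 1}} \left( \mathbf{y} - \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \mathbf{x} \right) & \text{if } \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{|\beta|} < -1 \\ \mathbf{y} - \mathbf{x} & \text{if } \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{|\beta|} = -1 \\ \frac{\cos^{-1}\left( \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \right)}{\sqrt{1 - \left( \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \right)^2}} \left( \mathbf{y} - \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \mathbf{x} \right) & \text{if } \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{|\beta|} \in (-1, 1) \end{cases} \qquad (7)$$
By substituting $\boldsymbol{\xi} = \log_{\mathbf{x}}(\mathbf{y})$ into Eq. (6), one can verify that our formulas are the inverse of the exponential map. The set $\mathcal{U}_{\mathbf{x}}$ is called a normal neighborhood of $\mathbf{x}$ since for all $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$, there exists a geodesic from $\mathbf{x}$ to $\mathbf{y}$ such that $\log_{\mathbf{x}}(\mathbf{y}) = \gamma'_{\mathbf{x} \to \boldsymbol{\xi}}(0)$. We show in the supp. material that the logarithm map is not defined if $\langle \mathbf{x}, \mathbf{y} \rangle_q \geq |\beta|$.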
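The logarithm map can be sketched by inverting each branch of the exponential map. The code below is one way to do this under the assumption $\beta < 0$: it uses the ratio $c = \langle \mathbf{x}, \mathbf{y} \rangle_q / \beta$, which equals the hyperbolic cosine (space-like case) or cosine (time-like case) of the normalized geodesic length. The tolerance `eps` used to detect the null case is an implementation choice for this example, not part of the paper.

```python
import math


def scalar_product_q(a, b, q):
    """<a, b>_q: the first q + 1 coordinates count negatively, the rest positively."""
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


def log_map(x, y, q, beta, eps=1e-12):
    """Tangent vector xi at x with exp_x(xi) = y, when it exists (assumes beta < 0)."""
    c = scalar_product_q(x, y, q) / beta
    if c > 1.0 + eps:                  # y is reached by a space-like geodesic
        scale = math.acosh(c) / math.sqrt(c * c - 1.0)
    elif c >= 1.0 - eps:               # null geodesic, or y equal to x
        scale = 1.0
    elif c > -1.0:                     # y is reached by a time-like geodesic
        scale = math.acos(c) / math.sqrt(1.0 - c * c)
    else:                              # corresponds to <x, y>_q >= |beta|
        raise ValueError("logarithm map is not defined for this pair of points")
    return [scale * (yk - c * xk) for xk, yk in zip(x, y)]


q, beta = 1, -1.0
x = [1.0, 0.0, 0.0]
# Points obtained by moving from x along known tangent vectors.
y_space = [math.cosh(0.7), 0.0, math.sinh(0.7)]   # from xi = (0, 0, 0.7)
y_time = [math.cos(0.7), math.sin(0.7), 0.0]      # from xi = (0, 0.7, 0)
y_null = [1.0, 1.0, 1.0]                          # from xi = (0, 1, 1)
for y in (y_space, y_time, y_null):
    print(log_map(x, y, q, beta))
```

The antipodal point $-\mathbf{x}$ is an example where the function raises, matching the statement that the logarithm map is undefined when $\langle \mathbf{x}, \mathbf{y} \rangle_q \geq |\beta|$.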
Proposed dissimilarity: We define our dissimilarity function based on the general notion of arc length and radius function on pseudo-Riemannian manifolds that we recall in the next paragraph (see details in Chapter 5 of o1983semi). This corresponds to the geodesic distance in the Riemannian case.
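Let $\mathcal{U}_{\mathbf{x}}$ be a normal neighborhood of $\mathbf{x} \in \mathcal{M}$ with $\mathcal{M}$ pseudo-Riemannian. The radius function $r_{\mathbf{x}} : \mathcal{U}_{\mathbf{x}} \to \mathbb{R}$ is defined as $r_{\mathbf{x}}(\mathbf{y}) = \sqrt{|g_{\mathbf{x}}(\log_{\mathbf{x}}(\mathbf{y}), \log_{\mathbf{x}}(\mathbf{y}))|}$ where $g_{\mathbf{x}}$ is the metric at $\mathbf{x}$. If $\sigma_{\mathbf{x} \to \boldsymbol{\xi}}$ is the radial geodesic from $\mathbf{x}$ to $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$ (i.e. $\boldsymbol{\xi} = \log_{\mathbf{x}}(\mathbf{y})$), then the arc length of $\sigma_{\mathbf{x} \to \boldsymbol{\xi}}$ equals $r_{\mathbf{x}}(\mathbf{y})$.
We then define the geodesic “distance” between $\mathbf{x} \in \mathcal{Q}_{\beta}^{p,q}$ and $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$ as the arc length of $\sigma_{\mathbf{x} \to \log_{\mathbf{x}}(\mathbf{y})}$:
$$\mathsf{d}_\gamma(\mathbf{x}, \mathbf{y}) = \sqrt{\left| \|\log_{\mathbf{x}}(\mathbf{y})\|_q^2 \right|} = \begin{cases} \sqrt{|\beta|} \cosh^{-1}\left( \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \right) & \text{if } \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{|\beta|} < -1 \\ 0 & \text{if } \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{|\beta|} = -1 \\ \sqrt{|\beta|} \cos^{-1}\left( \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{\beta} \right) & \text{if } \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{|\beta|} \in (-1, 1) \end{cases} \qquad (8)$$
It is important to note that our “distance” is not a distance metric. However, it satisfies the axioms of a symmetric premetric: (i) $\mathsf{d}_\gamma(\mathbf{x}, \mathbf{y}) = \mathsf{d}_\gamma(\mathbf{y}, \mathbf{x}) \geq 0$ and (ii) $\mathsf{d}_\gamma(\mathbf{x}, \mathbf{x}) = 0$. These conditions are sufficient to quantify the notion of nearness via a $\rho$-ball centered at $\mathbf{x}$: $\mathcal{B}_{\mathbf{x}}^{\rho} = \{ \mathbf{y} : \mathsf{d}_\gamma(\mathbf{x}, \mathbf{y}) < \rho \}$.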
In general, topological spaces provide a qualitative (not necessarily quantitative) way to detect “nearness” through the concept of a neighborhood at a point lee2010introduction. Something is true “near $\mathbf{x}$” if it is true in the neighborhood of $\mathbf{x}$ (e.g. in $\mathcal{B}_{\mathbf{x}}^{\rho}$). Our premetric is similar to metric learning methods 7780793 ; 8100113 ; xing2003distance that learn a Mahalanobis-like distance pseudo-metric parameterized by a positive semi-definite matrix. Pairs of distinct points can have zero “distance” if the matrix is not positive definite. However, unlike metric learning, we can have triplets ($\mathbf{x}, \mathbf{y}, \mathbf{z}$) that satisfy $\mathsf{d}_\gamma(\mathbf{x}, \mathbf{y}) = \mathsf{d}_\gamma(\mathbf{y}, \mathbf{z}) = 0$ but $\mathsf{d}_\gamma(\mathbf{x}, \mathbf{z}) > 0$ (e.g. in ).
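Since the logarithm map is not defined if $\langle \mathbf{x}, \mathbf{y} \rangle_q \geq |\beta|$, we propose to use the following continuous approximation defined on the whole manifold instead:
$$\mathsf{D}_\gamma(\mathbf{x}, \mathbf{y}) = \begin{cases} \mathsf{d}_\gamma(\mathbf{x}, \mathbf{y}) & \text{if } \langle \mathbf{x}, \mathbf{y} \rangle_q \leq 0 \\ \sqrt{|\beta|} \left( \frac{\pi}{2} + \frac{\langle \mathbf{x}, \mathbf{y} \rangle_q}{|\beta|} \right) & \text{otherwise} \end{cases} \qquad (9)$$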
To the best of our knowledge, the explicit formulation of the logarithm map for $\mathcal{Q}_{\beta}^{p,q}$ in Eq. (7) and its corresponding radius function in Eq. (8) to define a dissimilarity function are novel. We have also proposed a linear approximation to evaluate dissimilarity when the logarithm map is not defined, but other choices are possible. For instance, one might substitute a different expression when $\log_{\mathbf{x}}(\mathbf{y})$ is not defined. This interesting problem is left for future research.
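The dissimilarity discussed above depends on the two points only through $\langle \mathbf{x}, \mathbf{y} \rangle_q$. The sketch below assumes the following form for $\beta < 0$: $\sqrt{|\beta|} \cosh^{-1}(c)$ when $c \geq 1$ and $\sqrt{|\beta|} \cos^{-1}(c)$ when $0 \leq c < 1$, with $c = \langle \mathbf{x}, \mathbf{y} \rangle_q / \beta$, extended linearly by $\sqrt{|\beta|}(\pi/2 - c)$ when $c < 0$. The switch at $c = 0$ is an assumption of this example; it is the point where the linear branch matches the arccosine branch in both value and slope.

```python
import math


def scalar_product_q(a, b, q):
    """<a, b>_q: the first q + 1 coordinates count negatively, the rest positively."""
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


def dissimilarity(x, y, q, beta):
    """Geodesic-based dissimilarity with a linear extension (assumes beta < 0)."""
    r = math.sqrt(abs(beta))
    c = scalar_product_q(x, y, q) / beta
    if c >= 1.0:                       # space-like geodesic (or zero "distance")
        return r * math.acosh(c)
    if c >= 0.0:                       # time-like geodesic
        return r * math.acos(c)
    return r * (math.pi / 2.0 - c)     # linear continuation beyond c = 0


q, beta = 1, -1.0
x = [1.0, 0.0, 0.0]
print(dissimilarity(x, x, q, beta))                                    # identical points
print(dissimilarity(x, [math.cosh(0.7), 0.0, math.sinh(0.7)], q, beta))
print(dissimilarity(x, [math.cos(0.7), math.sin(0.7), 0.0], q, beta))
print(dissimilarity(x, [1.0, 1.0, 1.0], q, beta))                      # distinct, yet zero
print(dissimilarity(x, [-1.0, 0.0, 0.0], q, beta))                     # log map undefined
```

The fourth call illustrates the premetric behaviour: two distinct points joined by a null geodesic are at zero dissimilarity.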
In this section we present optimization frameworks to optimize any differentiable function $f$ defined on $\mathcal{Q}_{\beta}^{p,q}$. Our goal is to compute descent directions on the ultrahyperbolic space. We consider two approaches. In the first approach, we map our representation from Euclidean space to ultrahyperbolic space. This is similar to the approach taken by pmlr-v97-law19a in hyperbolic space. In the second approach, we optimize using gradients defined directly in pseudo-Riemannian tangent space. We propose a novel descent direction which guarantees the minimization of some cost function.
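Our first method maps Euclidean representations that lie in $\mathbb{R}^{d+1}$ to the pseudo-hyperboloid $\mathcal{Q}_{\beta}^{p,q}$, and the chain rule is exploited to perform standard gradient descent. To this end, we construct a differentiable mapping $\varphi : \mathbb{R}^{q+1}_* \times \mathbb{R}^p \to \mathcal{Q}_{\beta}^{p,q}$. The image of a point already on $\mathcal{Q}_{\beta}^{p,q}$ under the mapping $\varphi$ is itself: $\forall \mathbf{x} \in \mathcal{Q}_{\beta}^{p,q}, \varphi(\mathbf{x}) = \mathbf{x}$. Let $\mathcal{S}^q = \{ \mathbf{x} \in \mathbb{R}^{q+1} : \|\mathbf{x}\| = 1 \}$ denote the $q$-dimensional unit hypersphere. We first introduce the following diffeomorphisms:
For any $\beta < 0$, there is a diffeomorphism $\psi : \mathcal{Q}_{\beta}^{p,q} \to \mathcal{S}^q \times \mathbb{R}^p$. Let us note $\mathbf{x} = (\mathbf{t}; \mathbf{s}) \in \mathcal{Q}_{\beta}^{p,q}$ with $\mathbf{t} \in \mathbb{R}^{q+1}_*$ and $\mathbf{s} \in \mathbb{R}^p$, let us note $\mathbf{z} = (\mathbf{u}; \mathbf{v}) \in \mathcal{S}^q \times \mathbb{R}^p$ where $\mathbf{u} \in \mathcal{S}^q$ and $\mathbf{v} \in \mathbb{R}^p$. The mapping $\psi$ and its inverse $\psi^{-1}$ are formulated (see proofs in appendix):
$$\psi(\mathbf{x}) = \left( \frac{1}{\|\mathbf{t}\|} \mathbf{t} ; \frac{1}{\sqrt{|\beta|}} \mathbf{s} \right) \quad \text{and} \quad \psi^{-1}(\mathbf{z}) = \sqrt{|\beta|} \left( \sqrt{1 + \|\mathbf{v}\|^2} \, \mathbf{u} ; \mathbf{v} \right).$$
With these mappings, any vector $\mathbf{x} = (\mathbf{t}; \mathbf{s}) \in \mathbb{R}^{q+1}_* \times \mathbb{R}^p$ can be mapped to $\mathcal{Q}_{\beta}^{p,q}$ via $\varphi(\mathbf{x}) = \psi^{-1}\left( \frac{1}{\|\mathbf{t}\|} \mathbf{t} ; \frac{1}{\sqrt{|\beta|}} \mathbf{s} \right)$. $\varphi$ is differentiable everywhere except when $\mathbf{t} = \mathbf{0}$, which should never occur in practice. It can therefore be optimized using standard gradient methods.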
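One mapping with the stated properties (it sends any vector with nonzero time part onto the pseudo-hyperboloid and leaves points already on it unchanged) keeps the space coordinates and rescales the time coordinates so that the constraint $\langle \mathbf{x}, \mathbf{x} \rangle_q = \beta$ holds. The sketch below implements that construction; whether it coincides in every detail with the paper's mapping is an assumption, and the function name is hypothetical.

```python
import math


def scalar_product_q(a, b, q):
    """<a, b>_q: the first q + 1 coordinates count negatively, the rest positively."""
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


def to_pseudo_hyperboloid(x, q, beta):
    """Map x = (t; s), with t != 0, to a point y with <y, y>_q = beta (beta < 0)."""
    t, s = x[:q + 1], x[q + 1:]
    norm_t = math.sqrt(sum(v * v for v in t))
    if norm_t == 0.0:
        raise ValueError("the time part of x must be nonzero")
    r = math.sqrt(abs(beta))
    u = [v / norm_t for v in t]        # direction of the time part, on the unit sphere
    v = [w / r for w in s]             # rescaled space part
    scale = math.sqrt(1.0 + sum(w * w for w in v))
    return [r * scale * ui for ui in u] + [r * vi for vi in v]


q, beta = 1, -1.0
x = [3.0, 4.0, 1.0, 2.0]               # arbitrary Euclidean vector, time part (3, 4)
y = to_pseudo_hyperboloid(x, q, beta)
print(y, scalar_product_q(y, y, q))    # second value should be close to beta
print(to_pseudo_hyperboloid(y, q, beta))  # points on the manifold are left unchanged
```

Because the map is built from norms, divisions and square roots, an automatic differentiation library can backpropagate through it, which is what the first optimization approach relies on.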
We now introduce a novel method to optimize any differentiable function defined on the pseudo-hyperboloid. As we show below, the (negative of the) pseudo-Riemannian gradient is not a descent direction. We propose a simple and efficient way to calculate a descent direction.
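Pseudo-Riemannian gradient: Since $\mathbf{x} \in \mathcal{Q}_{\beta}^{p,q}$ also lies in the ambient Euclidean space $\mathbb{R}^{d+1}$, the function $f$ has a well defined Euclidean gradient $\nabla f(\mathbf{x})$. The gradient of $f$ in the pseudo-Euclidean ambient space $\mathbb{R}^{p,q+1}$ is $\mathbf{G}^{-1} \nabla f(\mathbf{x}) = \mathbf{G} \nabla f(\mathbf{x})$. Since $\mathcal{Q}_{\beta}^{p,q}$ is a submanifold of $\mathbb{R}^{p,q+1}$, the pseudo-Riemannian gradient $Df(\mathbf{x}) \in T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$ of $f$ on $\mathcal{Q}_{\beta}^{p,q}$ is the orthogonal projection of $\mathbf{G} \nabla f(\mathbf{x})$ onto $T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$ (see chapter 4 of o1983semi):
$$Df(\mathbf{x}) = \Pi_{\mathbf{x}}(\mathbf{G} \nabla f(\mathbf{x})) = \mathbf{G} \nabla f(\mathbf{x}) - \frac{\langle \nabla f(\mathbf{x}), \mathbf{x} \rangle}{\langle \mathbf{x}, \mathbf{x} \rangle_q} \mathbf{x}.$$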
This gradient forms the foundation of our descent method optimizer as will be shown in Eq. (13).
Iterative optimization: Our goal is to iteratively decrease the value of the function $f$ by following some descent direction. Since $\mathcal{Q}_{\beta}^{p,q}$ is not a vector space, we do not “follow the descent direction” by adding the descent direction multiplied by a step size, as this would result in a new point that does not necessarily lie on $\mathcal{Q}_{\beta}^{p,q}$. Instead, to remain on the manifold, we use our exponential map defined in Eq. (6). This is a standard way to optimize on Riemannian manifolds absil2009optimization. Given a step size $t > 0$, one step of descent along a tangent vector $\boldsymbol{\zeta} \in T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$ is given by:
$$\mathbf{y} = \exp_{\mathbf{x}}(t \boldsymbol{\zeta}) \in \mathcal{Q}_{\beta}^{p,q}.$$
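Descent direction: We now explain why the negative of the pseudo-Riemannian gradient is not always a descent direction. Our explanation extends Chapter 3 of nocedal2006numerical that gives the criteria for a vector to be a descent direction when the domain of $f$ is a Euclidean space. By using the properties of the exponential map and geodesics described in Section 3, we know that for all $t \in \mathbb{R}$ and all $\boldsymbol{\xi} \in T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$, we have the equalities $\exp_{\mathbf{x}}(t \boldsymbol{\xi}) = \gamma_{\mathbf{x} \to t\boldsymbol{\xi}}(1) = \gamma_{\mathbf{x} \to \boldsymbol{\xi}}(t)$, so we can equivalently fix $t$ to 1 and choose the scale of $\boldsymbol{\xi}$ appropriately. By exploiting Taylor’s first-order approximation, there exists some small enough tangent vector $\boldsymbol{\zeta} \neq \mathbf{0}$ (i.e. with $\exp_{\mathbf{x}}(\boldsymbol{\zeta})$ belonging to a convex neighborhood of $\mathbf{x}$ carmo1992riemannian ; gao2018semi) that satisfies the following conditions: $\gamma_{\mathbf{x} \to \boldsymbol{\zeta}}(0) = \mathbf{x}$, $\gamma'_{\mathbf{x} \to \boldsymbol{\zeta}}(0) = \boldsymbol{\zeta}$, $\gamma_{\mathbf{x} \to \boldsymbol{\zeta}}(1) = \mathbf{y} \in \mathcal{Q}_{\beta}^{p,q}$, and the function $f \circ \gamma_{\mathbf{x} \to \boldsymbol{\zeta}} : \mathbb{R} \to \mathbb{R}$ can be approximated at $t = 1$ by:
$$f(\mathbf{y}) = f \circ \gamma(1) \simeq f \circ \gamma(0) + (f \circ \gamma)'(0) = f(\mathbf{x}) + \langle Df(\mathbf{x}), \boldsymbol{\zeta} \rangle_q,$$
where we use the following properties: $\forall t, (f \circ \gamma)'(t) = \mathrm{d}f(\gamma'(t)) = g_{\gamma(t)}(Df(\gamma(t)), \gamma'(t))$ (see details in pages 11, 15 & 85 of o1983semi), $\mathrm{d}f$ is the differential of $f$ and $\gamma$ is a geodesic (we omit indices).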
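To be a descent direction at $\mathbf{x}$ (i.e. so that $f(\mathbf{y}) < f(\mathbf{x})$), the search direction $\boldsymbol{\zeta}$ has to satisfy $\langle Df(\mathbf{x}), \boldsymbol{\zeta} \rangle_q < 0$. However, choosing $\boldsymbol{\zeta} = -\eta Df(\mathbf{x})$, where $\eta > 0$ is a step size, might increase the function since the scalar product $\langle \cdot, \cdot \rangle_q$ is not positive definite. A simple solution would be to choose $\boldsymbol{\zeta} = \pm \eta Df(\mathbf{x})$ depending on the sign of $\langle Df(\mathbf{x}), Df(\mathbf{x}) \rangle_q$, but the issue where $\langle Df(\mathbf{x}), Df(\mathbf{x}) \rangle_q = 0$ even if $Df(\mathbf{x}) \neq \mathbf{0}$ would remain. The optimization algorithm might then be stuck on an isocontour of $f$. Moreover, the sign of the search direction might be hard to choose in the stochastic gradient setting.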
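Proposed solution: To ensure that $\boldsymbol{\zeta}$ is a descent direction, we propose a simple expression that satisfies $\langle Df(\mathbf{x}), \boldsymbol{\zeta} \rangle_q < 0$ if $Df(\mathbf{x}) \neq \mathbf{0}$ and $\langle Df(\mathbf{x}), \boldsymbol{\zeta} \rangle_q = 0$ otherwise. We propose to formulate $\boldsymbol{\zeta} = -\eta \Pi_{\mathbf{x}}(\mathbf{G} Df(\mathbf{x})) \in T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$. We then define the following tangent vector $\boldsymbol{\chi} = -\frac{1}{\eta} \boldsymbol{\zeta}$:
$$\boldsymbol{\chi} = \Pi_{\mathbf{x}}(\mathbf{G} Df(\mathbf{x})) = \mathbf{G} Df(\mathbf{x}) - \frac{\langle Df(\mathbf{x}), \mathbf{x} \rangle}{\langle \mathbf{x}, \mathbf{x} \rangle_q} \mathbf{x}.$$
The tangent vector $\boldsymbol{\zeta}$ is a descent direction because $\langle Df(\mathbf{x}), \boldsymbol{\zeta} \rangle_q = -\eta \langle Df(\mathbf{x}), \boldsymbol{\chi} \rangle_q$ is non-positive:
$$\langle Df(\mathbf{x}), \boldsymbol{\chi} \rangle_q = \langle Df(\mathbf{x}), \mathbf{G} Df(\mathbf{x}) \rangle_q = \|Df(\mathbf{x})\|^2 \geq 0,$$
where the first equality holds because $Df(\mathbf{x})$ is tangent at $\mathbf{x}$, i.e. $\langle Df(\mathbf{x}), \mathbf{x} \rangle_q = 0$.
We also have $\langle Df(\mathbf{x}), \boldsymbol{\chi} \rangle_q = \|Df(\mathbf{x})\|^2 = 0$ iff $Df(\mathbf{x}) = \mathbf{0}$ (i.e. $\mathbf{x}$ is a stationary point). It is worth noting that $Df(\mathbf{x}) = \mathbf{0}$ implies $\boldsymbol{\chi} = \Pi_{\mathbf{x}}(\mathbf{G} \mathbf{0}) = \mathbf{0}$. Moreover, $\boldsymbol{\chi} = \mathbf{0}$ implies that $\|Df(\mathbf{x})\|^2 = \langle Df(\mathbf{x}), \mathbf{0} \rangle_q = 0$. We then have $\boldsymbol{\chi} = \mathbf{0}$ iff $Df(\mathbf{x}) = \mathbf{0}$.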
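A small numerical check makes the argument above tangible. The sketch below assumes that the pseudo-Riemannian gradient is the tangent projection of $\mathbf{G}$ times the Euclidean gradient, and that the proposed vector is obtained by applying $\mathbf{G}$ and the projection once more. The example is chosen so that the pseudo-Riemannian gradient is a nonzero null vector, the situation in which its negative gives no first-order decrease. All names are illustrative.

```python
def scalar_product_q(a, b, q):
    """<a, b>_q: the first q + 1 coordinates count negatively, the rest positively."""
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


def apply_G(v, q):
    """Multiply by the diagonal matrix G (flip the sign of the time coordinates)."""
    return [(-vi if i <= q else vi) for i, vi in enumerate(v)]


def project_to_tangent(x, z, q):
    """Orthogonal projection of z onto the tangent space at x (w.r.t. <.,.>_q)."""
    coeff = scalar_product_q(z, x, q) / scalar_product_q(x, x, q)
    return [zi - coeff * xi for zi, xi in zip(z, x)]


def gradient_and_descent_vector(x, euclidean_grad, q):
    """Return (Df, chi): the pseudo-Riemannian gradient and the preconditioned vector."""
    df = project_to_tangent(x, apply_G(euclidean_grad, q), q)
    chi = project_to_tangent(x, apply_G(df, q), q)
    return df, chi


q = 1
x = [1.0, 0.0, 0.0]                      # <x, x>_q = -1
euclidean_grad = [0.5, -1.0, 1.0]        # made-up Euclidean gradient for illustration
df, chi = gradient_and_descent_vector(x, euclidean_grad, q)

print("Df          :", df)
print("<Df, Df>_q  :", scalar_product_q(df, df, q))    # zero although Df is nonzero
print("<Df, chi>_q :", scalar_product_q(df, chi, q))   # equals the squared Euclidean norm of Df
print("||Df||^2    :", sum(v * v for v in df))
```

Here stepping along the negative of the pseudo-Riemannian gradient would leave the first-order term at zero, whereas stepping along the negative of the preconditioned vector yields a strictly negative first-order term.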
Our proposed algorithm to the minimization problem $\min_{\mathbf{x} \in \mathcal{Q}_{\beta}^{p,q}} f(\mathbf{x})$ is illustrated in Algorithm 1. Following generic Riemannian optimization algorithms absil2009optimization, at each iteration, it first computes the descent direction $-\boldsymbol{\chi} \in T_{\mathbf{x}}\mathcal{Q}_{\beta}^{p,q}$, then decreases the function by applying the exponential map defined in Eq. (6). It is worth noting that our proposed descent method can be applied to any differentiable function $f$, not only to those that exploit the distance introduced in Section 3.
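The iteration described above can be sketched end to end on a toy problem. The objective (squared Euclidean distance to a fixed `target` vector), the step size `eta` and the number of iterations are placeholders chosen for this example; they are not the paper's experimental settings, and the Euclidean gradient is written by hand rather than obtained by automatic differentiation.

```python
import math


def scalar_product_q(a, b, q):
    return sum((-1.0 if i <= q else 1.0) * ai * bi
               for i, (ai, bi) in enumerate(zip(a, b)))


def apply_G(v, q):
    return [(-vi if i <= q else vi) for i, vi in enumerate(v)]


def project_to_tangent(x, z, q):
    coeff = scalar_product_q(z, x, q) / scalar_product_q(x, x, q)
    return [zi - coeff * xi for zi, xi in zip(z, x)]


def exp_map(x, xi, q, beta):
    n2 = scalar_product_q(xi, xi, q)
    r = math.sqrt(abs(beta))
    if n2 > 0.0:
        n = math.sqrt(n2)
        a, b = math.cosh(n / r), r * math.sinh(n / r) / n
    elif n2 < 0.0:
        n = math.sqrt(-n2)
        a, b = math.cos(n / r), r * math.sin(n / r) / n
    else:
        a, b = 1.0, 1.0
    return [a * xk + b * vk for xk, vk in zip(x, xi)]


def descent_step(x, euclidean_grad, q, beta, eta):
    """One iteration: gradient -> preconditioned direction -> exponential map."""
    df = project_to_tangent(x, apply_G(euclidean_grad, q), q)
    chi = project_to_tangent(x, apply_G(df, q), q)
    return exp_map(x, [-eta * c for c in chi], q, beta)


# Toy objective (an assumption of this example).
target = [1.2, 0.5, 0.9]


def f(x):
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def euclidean_grad_f(x):
    return [xi - ti for xi, ti in zip(x, target)]


q, beta, eta = 1, -1.0, 0.05
x = [1.0, 0.0, 0.0]
history = [f(x)]
for _ in range(20):
    x = descent_step(x, euclidean_grad_f(x), q, beta, eta)
    history.append(f(x))

print("objective before/after:", history[0], history[-1])
print("constraint <x, x>_q   :", scalar_product_q(x, x, q))
```

Each update costs a handful of scalar products and vector operations, so the per-iteration cost is linear in the ambient dimension.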
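Interestingly, our method can also be seen as a preconditioning technique nocedal2006numerical where the descent direction is obtained by preconditioning the pseudo-Riemannian gradient $Df(\mathbf{x})$ with the matrix $\mathbf{P}_{\mathbf{x}} = \mathbf{G} - \frac{1}{\langle \mathbf{x}, \mathbf{x} \rangle_q} \mathbf{x} \mathbf{x}^\top$. In other words, we have $\boldsymbol{\chi} = \mathbf{P}_{\mathbf{x}} Df(\mathbf{x})$.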
In the more general setting of pseudo-Riemannian manifolds, another preconditioning technique was proposed in gao2018semi. The method there requires performing a Gram-Schmidt process at each iteration to obtain an (ordered wolf1972spaces) orthonormal basis of the tangent space at $\mathbf{x}$ w.r.t. the induced norm of the manifold. However, the Gram-Schmidt process is unstable and has algorithmic cubic complexity in the dimensionality of the tangent space. On the other hand, our method is more stable and has algorithmic complexity linear in the dimensionality of the tangent space.
We now experimentally validate our proposed optimization methods and the effectiveness of our dissimilarity function. Our main experimental results can be summarized as follows:
Both optimizers introduced in Section 4 decrease some objective function at each iteration of our iterative descent method (until reaching a plateau). While both optimizers manage to learn high-dimensional representations that satisfy the problem-dependent training constraints, only the pseudo-Riemannian optimizer satisfies all the constraints in lower-dimensional spaces. This is because it exploits the underlying metric of the manifold.
Hyperbolic representations are popular in machine learning as they are well suited to represent hierarchical trees gromov1987hyperbolic ; nickel2017poincare ; nickel2018learning. On the other hand, hierarchical datasets whose graph contains cycles cannot be represented using trees. Therefore, we propose to represent such graphs using our ultrahyperbolic representations. Important examples are community graphs that contain leaders, such as Zachary’s karate club zachary1977information. Because our ultrahyperbolic representations are more flexible than hyperbolic representations and contain hyperbolic subparts, we believe that our representations are better suited for these non tree-like hierarchical structures.
Graph: Our ultrahyperbolic representations describe graph-structured datasets. Each dataset is an undirected weighted graph $(V, E)$ which has node-set $V = \{v_i\}_{i=1}^{n}$ and edge-set $E = \{e_k\}_{k=1}^{m}$. Each edge $e_k$ is weighted by an arbitrary capacity $c_k > 0$ that models the strength of the relationship between nodes. The higher the capacity $c_k$, the stronger the relationship between the nodes connected by $e_k$.
Learned representations: Our problem formulation is inspired by hyperbolic representation learning approaches nickel2017poincare ; nickel2018learning where the nodes of a tree (i.e. graph without cycles) are represented in hyperbolic space. The hierarchical structure of the tree is then reflected by the order of distances between its nodes. More precisely, a node representation is learned so that each node is closer to its descendants and ancestors in the tree (w.r.t. the hyperbolic distance) than to any other node. For example, in a hierarchy of words, ancestors and descendants are hypernyms and hyponyms, respectively.
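Our goal is to learn a set of $n$ points $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{Q}_{\beta}^{p,q}$ (embeddings) from a given graph of supervision. The presence of cycles in the graph makes it difficult to determine ancestors and descendants. For this reason, we introduce for each pair of nodes $(v_i, v_j) = e_k \in E$, the set of “weaker” pairs that have lower capacity: $\mathcal{W}(e_k) = \{ e_l : c_k > c_l \} \cup \{ (v_a, v_b) : (v_a, v_b) \notin E \}$. Our goal is to learn representations such that pairs with higher capacity have their representations closer to each other than weaker pairs. Following nickel2017poincare, we formulate our problem as:
$$\min_{\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{Q}_{\beta}^{p,q}} \sum_{(v_i, v_j) = e_k \in E} -\log \frac{\exp\left( -\mathsf{d}(\mathbf{x}_i, \mathbf{x}_j) / \tau \right)}{\sum_{(v_a, v_b) \in \mathcal{W}(e_k) \cup \{e_k\}} \exp\left( -\mathsf{d}(\mathbf{x}_a, \mathbf{x}_b) / \tau \right)}, \qquad (17)$$
where $\mathsf{d}$ is the chosen dissimilarity function (e.g. $\mathsf{D}_\gamma(\cdot, \cdot)$ defined in Eq. (9)) and $\tau > 0$ is a fixed temperature parameter. The formulation of Eq. (17) is classic in the metric learning literature Cao2020A ; law2018dimensionality ; 8683393 and corresponds to optimizing some order on the learned distances via a softmax function.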
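The softmax-style objective described above can be sketched independently of the geometry, given a precomputed matrix of pairwise dissimilarities. The sketch assumes that each edge competes against every pair with strictly lower capacity, and that unconnected pairs have capacity zero; this reading of the set of weaker pairs, the data layout and the function name are assumptions of this example.

```python
import math


def ranking_loss(dist, capacity, n, tau):
    """Sum over edges of -log softmax(-d / tau) against their weaker pairs.

    dist[a][b]  : dissimilarity between nodes a and b
    capacity    : dict mapping an edge (a, b), a < b, to its positive capacity;
                  pairs absent from the dict are treated as unconnected
    """
    all_pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
    loss = 0.0
    for edge, cap in capacity.items():
        weaker = [p for p in all_pairs if capacity.get(p, 0.0) < cap]
        logits = [-dist[a][b] / tau for (a, b) in [edge] + weaker]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        loss += log_norm - logits[0]
    return loss


# Three nodes, one edge (0, 1); the two other pairs are unconnected.
capacity = {(0, 1): 1.0}
good = [[0.0, 0.1, 2.0], [0.1, 0.0, 2.0], [2.0, 2.0, 0.0]]   # connected pair is closest
bad = [[0.0, 2.0, 0.1], [2.0, 0.0, 0.1], [0.1, 0.1, 0.0]]    # connected pair is farthest
print(ranking_loss(good, capacity, n=3, tau=1.0))
print(ranking_loss(bad, capacity, n=3, tau=1.0))
```

The loss is smaller when connected pairs are closer than their weaker competitors, which is the ordering the objective is meant to enforce; lowering `tau` sharpens that preference.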
We coded our approach in PyTorch NEURIPS2019_9015, which automatically calculates the Euclidean gradient $\nabla f(\mathbf{x})$. Initially, a random set of vectors is generated close to the positive pole with every coordinate perturbed uniformly with a random value in the interval where is chosen small enough so that . We set , and . Initial embeddings are generated as follows: .
Zachary’s karate club dataset zachary1977information is a social network graph of a karate club comprised of 34 nodes, each representing a member of the karate club. The club was split due to a conflict between instructor "Mr. Hi" (node 1) and administrator "John A" (node 34). The remaining members now have to decide whether to join the new club created by Mr. Hi or not. In zachary1977information, Zachary defines a matrix of relative strengths of the friendships in the karate club that depends on various criteria. We note that the matrix is not symmetric and has 7 different pairs of nodes whose two directed entries differ. Since our dissimilarity function is symmetric, we consider the symmetric matrix $\mathbf{S}$ instead. The value of $\mathbf{S}_{ij}$ is the capacity/weight assigned to the edge joining $v_i$ and $v_j$, and there is no edge between $v_i$ and $v_j$ if $\mathbf{S}_{ij} = 0$. Fig. 3 (left) illustrates the 34 nodes of the dataset; an edge joining the nodes $v_i$ and $v_j$ is drawn iff $\mathbf{S}_{ij} \neq 0$. The level of a node in the hierarchy corresponds approximately to its height in the figure.
Optimizers: We validate that our optimizers introduced in Section 4 decrease the cost function. First, consider the simple unweighted case where every edge weight is 1. For each edge $e_k$, $\mathcal{W}(e_k)$ is then the set of pairs of nodes that are not connected. In other words, Eq. (17) learns node representations that have the property that every connected pair of nodes has smaller distance than non-connected pairs. We use this condition as a stopping criterion of our algorithm.
Fig. 3 (right) illustrates the loss values of Eq. (17) as a function of the number of iterations with the Euclidean gradient descent (Section 4.1) and our pseudo-Riemannian optimizer (introduced in Section 4.2). In each test, we vary the number of time dimensions $q + 1$ while the ambient space is of fixed dimensionality . We omit the case $q = 0$ since it corresponds to the (hyperbolic) Riemannian case already considered in pmlr-v97-law19a ; nickel2018learning. Both optimizers decrease the function and manage to satisfy all the expected distance relations. We note that when we use $-Df(\mathbf{x})$ instead of $-\boldsymbol{\chi}$ as a search direction, the algorithm does not converge. Moreover, our pseudo-Riemannian optimizer manages to learn representations that satisfy all the constraints for low-dimensional manifolds such as and , while the optimizer introduced in Section 4.1 does not. Consequently, we only use the pseudo-Riemannian optimizer in the following results.
Table 1 (row labels): Rank of the first leader; Rank of the second leader; top 5 Spearman’s ρ; top 10 Spearman’s ρ.
Hierarchy extraction: To quantitatively evaluate our approach we apply it to the problem of predicting the high level nodes in the hierarchy from the weighted matrix $\mathbf{S}$ as supervision. We consider the challenging low-dimensional setting where all the learned representations lie on a 4-dimensional manifold (i.e. $p + q = 4$). Hyperbolic distances are known to grow exponentially as we get further from the origin. Therefore, the sum of distances $s_i = \sum_{j} \mathsf{d}(\mathbf{x}_i, \mathbf{x}_j)$ of a node $v_i$ with all other nodes is a good indication of importance. Intuitively, high-level nodes will be closer to most nodes than low-level nodes. We then sort the scores $s_1, \dots, s_n$ and report the ranks of the two leaders $v_1$ or $v_{34}$ (in no particular order) in the first two rows of Table 1 averaged over 5 different initializations/runs. Leaders tend to have a smaller score with ultrahyperbolic distances than with Euclidean or hyperbolic distances. Instead of using $s_i$ for hyperbolic representations, the importance of a node can be evaluated by using the Euclidean norm of its embedding as proxy. This is because high-level nodes in a tree represented in hyperbolic space are usually closer to the origin than low-level nodes pmlr-v97-law19a ; nickel2017poincare ; nickel2018learning. Not surprisingly, this proxy leads to worse performance ( and ) as the relationships are not that of a tree. Since hierarchy levels are hard to compare for low-level nodes, we select the 10 (or 5) most influential members based on the score . The corresponding nodes are 34, 1, 33, 3, 2, 32, 24, 4, 9, 14 (in that order). Spearman’s rank correlation coefficient spearman2015proof between the selected scores and corresponding $s_i$ is reported in Table 1 and shows the relevance of our representations.
Due to lack of space, we also report in the supp. material similar experiments on a larger hierarchical dataset chechik2007eec that describes co-authorship from papers published at NIPS from 1988 to 2003.
We have introduced ultrahyperbolic representations. Our representations lie on a pseudo-Riemannian manifold with constant nonzero curvature which generalizes hyperbolic and spherical geometries and includes them as submanifolds. Any relationship described in those geometries can then be described with our representations that are more flexible. We have introduced new optimization tools and experimentally shown that our representations can extract hierarchies in graphs that contain cycles.
We introduce a novel way of representing relationships between data points by considering the geometry of non-Riemannian manifolds of constant nonzero curvature. The relationships between data points are described by a dissimilarity function that we introduce and that exploits the structure of the manifold. It is more flexible than the distance metrics of the hyperbolic and spherical geometries often used in machine learning and computer vision. Nonetheless, since the problems involving our representations are not straightforward to optimize, we propose novel optimization algorithms that can potentially benefit the machine learning, computer vision and natural language processing communities. Indeed, our method is application agnostic and could extend existing frameworks.
Our contribution is mainly theoretical but we have included one practical application. Similarly to hyperbolic representations that are popular for representing tree-like data, we have shown that our representations are well adapted to the more general case of hierarchical graphs with cycles. These graphs appear in many different fields of research such as medicine, molecular biology and the social sciences. For example, an ultrahyperbolic representation of proteins might assist in understanding their complicated folding mechanisms. Moreover, these representations could assist in analyzing features of social media such as discovering new trends and leading "connectors". The impact of community detection for commercial or political advertising is already known in social networking services. We foresee that our method will have many more graph-based practical applications.
We know of very few applications outside of general relativity that use pseudo-Riemannian geometry. We hope that our research will stimulate other applications in machine learning and related fields.
We thank Jonah Philion and Guojun Zhang for helpful feedback on early versions of this manuscript.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3909–3917, 2016.
Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
The supplementary material is structured as follows:
In Section B.3, we explain the anti-isometry between $\mathcal{Q}_{\beta}^{p,q}$ and $\mathcal{Q}_{-\beta}^{q+1,p-1}$.
In Section B.4, we study the curvature of $\mathcal{Q}_{\beta}^{p,q}$.
In Section B.6, we show that the hyperboloid is a Riemannian manifold (i.e. its metric tensor is positive definite) even if its metric matrix is not positive definite.
In Section C, we report additional experiments.
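Tangent space of pseudo-Riemannian submanifolds: In the paper, we exploit the fact that our considered manifold $\mathcal{M}$ (here $\mathcal{Q}_{\beta}^{p,q}$) is a pseudo-Riemannian submanifold of $\overline{\mathcal{M}}$ (here $\mathbb{R}^{p,q+1}$). For any vector space $\overline{\mathcal{M}}$, we have a natural isomorphism between $\overline{\mathcal{M}}$ and its tangent space.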
If is a pseudo-Riemannian submanifold of , we have the following direct sum decomposition:
where is the orthogonal complement of and called the normal space of at . It is a nondegenerate subspace of , and is the metric at . In the case of , is defined as:
where denotes isomorphism.
Geodesic of a submanifold: As mentioned in the main paper, a curve is a geodesic if its acceleration is zero. However, the acceleration depends on the choice of the affine connection while the velocity does not (i.e. the velocity does not depend on the Christoffel symbol whereas the acceleration does, and different connections produce different geodesics). Let us note (resp. ) the covariant derivative of along in (resp. in ). By using the induced connection (see page 98 of ) and the fact that is isometrically embedded in , the second order ODE about the zero acceleration of the geodesic is equivalent to (see page 103 of ):