Poincare Embeddings for Word Vector Representations
Neural embeddings have been used with great success in Natural Language Processing (NLP). They provide compact representations that encapsulate word similarity and attain state-of-the-art performance in a range of linguistic tasks. The success of neural embeddings has prompted significant amounts of research into applications in domains other than language. One such domain is graph-structured data, where embeddings of vertices can be learned that encapsulate vertex similarity and improve performance on tasks including edge prediction and vertex labelling. For both NLP and graph-based tasks, embeddings have been learned in high-dimensional Euclidean spaces. However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space. We present a new concept that exploits these recent insights and propose learning neural embeddings of graphs in hyperbolic space. We provide experimental evidence that embedding graphs in their natural geometry significantly improves performance on downstream tasks for several real-world public datasets.
Embedding (or vector space) methods find a lower-dimensional continuous space in which to represent high-dimensional complex data (Roweis2000; Belkin2001). The distance between objects in the lower-dimensional space gives a measure of their similarity. This is usually achieved by first postulating a low-dimensional vector space and then optimising an objective function of the vectors in that space. Vector space representations provide three principal benefits over sparse schemes: (1) they encapsulate similarity, (2) they are compact, and (3) they perform better as inputs to machine learning models (Salton1975). This is true of graph-structured data, where the native data format is the adjacency matrix, a typically large, sparse matrix of connection weights.
Neural embedding models are a flavour of embedding scheme where the vector space corresponds to a subset of the network weights, which are learned through backpropagation. Neural embedding models have been shown to improve performance in a large number of downstream tasks across multiple domains. These include word analogies (Mikolov2013; Mnih2013), machine translation (Sutskever2014), document comparison (Kusner2015), missing edge prediction (Grover), vertex attribution (Perozzi2014), product recommendations (Grbovic2015; Baeza-yates2015), customer value prediction (Kooti2017; Chamberlain2017) and item categorisation (Barkan2016). In all cases the embeddings are learned without labels (unsupervised) from a sequence of entities.
To the best of our knowledge, all previous work on neural embedding models either explicitly or implicitly (by using the Euclidean dot product) assumes that the vector space is Euclidean. Recent work from the field of complex networks has found that many interesting networks, such as the Internet (Boguna2010) or academic citations (Clough2015a; Clough2016), can be well described by a framework with an underlying non-Euclidean hyperbolic geometry. Hyperbolic geometry provides a continuous analogue of tree-like graphs, and even infinite trees have nearly isometric embeddings in hyperbolic space (Gromov). Additionally, the defining features of complex networks, such as power-law degree distributions, strong clustering and hierarchical community structure, emerge naturally when random graphs are embedded in hyperbolic space (Krioukov).
Skipgram is a shallow neural network with three layers: (1) an input projection layer that maps from a one-hot-encoded to a distributed representation, (2) a hidden layer, and (3) an output softmax layer. The network is necessarily simple for tractability as there are a very large number of output states (every word in a language). Skipgram is trained on a sequence of words that is decomposed into (input word, context word) pairs. The model employs two separate vector representations, one for the input words and another for the context words, with the input representation comprising the learned embedding. The word pairs are generated by taking a sequence of words and running a sliding window (the context) over them. As an example, the word sequence “chance favours the prepared mind” with a context window of size three would generate the following training data: {(chance, favours), (chance, the), (favours, chance), …}. Words are initially randomly allocated to vectors within the two vector spaces. Then, for each training pair, the vector representations of the observed input and context words are pushed towards each other and away from all other words (see Figure 2).
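The decomposition of a word sequence into training pairs can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the `radius` parameter (the number of neighbours on each side that count as context) are assumptions, and how the "window of size three" in the text maps onto `radius` is also an assumption. A radius of two reproduces the first three pairs listed above.

```python
def skipgram_pairs(words, radius):
    """Return (input word, context word) pairs for a sliding context window.

    `radius` is a hypothetical parameter: the number of neighbours on each
    side of the input word that are treated as its context.
    """
    pairs = []
    for i, centre in enumerate(words):
        lo = max(0, i - radius)
        hi = min(len(words), i + radius + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((centre, words[j]))
    return pairs


sentence = "chance favours the prepared mind".split()
pairs = skipgram_pairs(sentence, radius=2)
print(pairs[:3])
# [('chance', 'favours'), ('chance', 'the'), ('favours', 'chance')]
```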
The concept can be extended from words to network structured data using random walks to create sequences of vertices. The vertices are then treated exactly analogously to words in the NLP formulation. This was originally proposed as DeepWalk (Perozzi2014). Extensions varying the nature of the random walks have been explored in LINE (Tang2015) and Node2vec (Grover).
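A uniform random walk generator of the kind described above might look like the sketch below. The adjacency-list format, the function name and the toy graph are all illustrative assumptions. Each resulting walk plays the role of a sentence and each vertex the role of a word.

```python
import random


def random_walks(adjacency, walk_length, seed=0):
    """Generate one uniform random walk starting at every vertex.

    `adjacency` maps each vertex to a list of its neighbours. A walk takes
    `walk_length` steps, so it holds up to `walk_length + 1` vertices.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        walk = [start]
        for _ in range(walk_length):
            neighbours = adjacency[walk[-1]]
            if not neighbours:  # dead end: stop this walk early
                break
            walk.append(rng.choice(neighbours))
        walks.append(walk)
    return walks


# A toy undirected graph (hypothetical data, not one of the paper's datasets).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = random_walks(graph, walk_length=10)
```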
In this paper, we introduce the new concept of neural embeddings in hyperbolic space. We formulate backpropagation in hyperbolic space and show that using the natural geometry of complex networks improves performance in vertex classification tasks across multiple networks.
Hyperbolic geometry emerges from relaxing Euclid’s fifth postulate (the parallel postulate) of geometry. In hyperbolic space there is not just one, but an infinite number of parallel lines that pass through a single point. This is illustrated in Figure 1(b), where every line is parallel to the bold, blue line and all pass through the same point. Hyperbolic space is one of only three types of isotropic spaces that can be defined entirely by their curvature. The most familiar is Euclidean, which is flat, having zero curvature. Space with uniform positive curvature has an elliptic geometry (e.g. the surface of a sphere), and space with uniform negative curvature is called hyperbolic, which is analogous to a saddle-like surface. Unlike in Euclidean space, even infinite trees have nearly isometric embeddings in hyperbolic space. Hyperbolic space has therefore been used successfully to model complex networks with hierarchical structure, power-law degree distributions and high clustering (Krioukov).
One of the defining characteristics of hyperbolic space is that it is in some sense larger than the more familiar Euclidean space; the area of a circle or volume of a sphere grows exponentially with its radius, rather than polynomially. This suggests that low-dimensional hyperbolic spaces may provide effective representations of data in ways that low-dimensional Euclidean spaces cannot. However, this makes hyperbolic space hard to visualise, as even the 2D hyperbolic plane cannot be isometrically embedded into Euclidean space of any dimension (unlike elliptic geometry, where a 2-sphere can be embedded into 3D Euclidean space). For this reason there are many different ways of representing hyperbolic space, with each representation conserving some geometric properties, but distorting others. In the remainder of the paper we use the Poincaré disk model of hyperbolic space.
The Poincaré disk models two-dimensional hyperbolic space where the infinite plane is represented as a unit disk. We work with the two-dimensional disk, but it is easily generalised to the $d$-dimensional Poincaré ball, where hyperbolic space is represented as a unit $d$-ball.
In this model hyperbolic distances grow exponentially towards the edge of the disk. The circle’s boundary represents infinitely distant points as the infinite hyperbolic plane is squashed inside the finite disk. This property is illustrated in Figure 1(a), where each tile is of constant area in hyperbolic space, but the tiles rapidly shrink to zero area in Euclidean space. Although volumes and distances are warped, the Poincaré disk model is conformal, meaning that Euclidean and hyperbolic angles between lines are equal. Straight lines in hyperbolic space intersect the boundary of the disk orthogonally and appear either as diameters of the disk, or arcs of a circle. Figure 1(b) shows a collection of straight hyperbolic lines in the Poincaré disk. Just as in spherical geometry, the shortest path from one place to another is a straight line, but appears as a curve on a flat map. Similarly, these straight lines show the shortest path (in terms of distance in the underlying hyperbolic space) from one point on the disk to another, but they appear curved. This is because it is quicker to move close to the centre of the disk, where distances are shorter, than nearer the edge. In our proposed approach, we will exploit both the conformal property and the circular symmetry of the Poincaré disk.
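The warping of distances towards the boundary can be checked numerically. The sketch below uses the standard closed-form distance of the Poincaré disk with curvature -1, which is not stated in the text above and is brought in here purely for illustration. The function name and the sample points are assumptions.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points (x, y) inside the unit disk."""
    sq_norm_u = u[0] ** 2 + u[1] ** 2
    sq_norm_v = v[0] ** 2 + v[1] ** 2
    sq_diff = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Two pairs of points with the same Euclidean separation of 0.1.
near_centre = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_edge = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(near_centre, near_edge)
```

The pair close to the boundary is several times further apart in hyperbolic terms than the pair at the centre, even though both pairs look equally far apart on the disk.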
Overall, the geometric intuition motivating our approach is that vertices embedded near the middle of the disk can have more close neighbours than they could in Euclidean space, whilst vertices nearer the edge of the disk can still be very far from each other.
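The mathematics is considerably simplified if we exploit the symmetries of the model and describe points in the Poincaré disk using polar coordinates, $x = (r_e, \theta)$, with $r_e \in [0, 1)$ and $\theta \in [0, 2\pi)$. To define similarities and distances, we require an inner product. In the Poincaré disk, the inner product of two vectors $x = (r_x, \theta_x)$ and $y = (r_y, \theta_y)$ is given by
$$\langle x, y \rangle = \|x\| \, \|y\| \cos(\theta_x - \theta_y) = 4 \,\mathrm{arctanh}(r_x) \,\mathrm{arctanh}(r_y) \cos(\theta_x - \theta_y).$$
The distance of $x$ from the origin of the hyperbolic co-ordinate system is given by $r_h = 2 \,\mathrm{arctanh}(r_e)$ and the circumference of a circle of hyperbolic radius $R$ is $C = 2\pi \sinh R$.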
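These quantities can be sketched in a few lines of Python. The sketch assumes the Poincaré disk with curvature -1, in which a point at Euclidean radius r_e lies at hyperbolic distance 2 artanh(r_e) from the origin, and it takes the inner product to be the product of the two hyperbolic norms and the cosine of the angle between the points. The function names are illustrative.

```python
import math


def hyperbolic_radius(r_e):
    """Hyperbolic distance from the origin for a Euclidean radius r_e in [0, 1)."""
    return 2.0 * math.atanh(r_e)


def hyperbolic_inner_product(x, y):
    """Inner product of two points given as (r_e, theta) polar pairs."""
    (r_x, theta_x), (r_y, theta_y) = x, y
    return hyperbolic_radius(r_x) * hyperbolic_radius(r_y) * math.cos(theta_x - theta_y)


def circumference(radius):
    """Circumference of a circle of hyperbolic radius `radius`."""
    return 2.0 * math.pi * math.sinh(radius)


# Ratio of hyperbolic to Euclidean circumference for growing radii.
for radius in (1.0, 5.0, 10.0):
    print(radius, circumference(radius) / (2.0 * math.pi * radius))
```

The printed ratio grows roughly exponentially with the radius, which is the sense in which hyperbolic space is larger than Euclidean space.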
We adopt the original notation of (Mikolov2013) whereby the input vertex is $w_I$ and the output is $w_O$. Their corresponding vector representations are $v_{w_I}$ and $v'_{w_O}$, which are elements of the two vector spaces shown in Figure 3, $W$ and $W'$ respectively. Skipgram has a geometric interpretation, which we visualise in Figure 2 for vectors in $W'$. Updates to $v'_{w_j}$ are performed by simply adding (if $w_j$ is the observed output vertex) or subtracting (otherwise) an error-weighted portion of the input vector. Similar, though slightly more complicated, update rules apply to the vectors in $W$. Given this interpretation, it is natural to look for alternative geometries in which to perform these updates.
To embed a graph in hyperbolic space we replace Skipgram’s two Euclidean vector spaces ($W$ and $W'$ in Figure 3) with two Poincaré disks. We learn embeddings by optimising an objective function that predicts output/context vertices from an input vertex, but we replace the Euclidean dot products used in Skipgram with hyperbolic inner products. A softmax function is used for the conditional predictive distribution
$$p(w_O \mid w_I) = \frac{\exp\left(\langle v'_{w_O}, v_{w_I} \rangle\right)}{\sum_{i=1}^{V} \exp\left(\langle v'_{w_i}, v_{w_I} \rangle\right)} \tag{3}$$
where $v_{w_i}$ is the vector representation of the $i$-th vertex, primed indicates members of the output vector space (see Figure 3), $V$ is the number of vertices and $\langle \cdot, \cdot \rangle$ is the hyperbolic inner product. Directly optimising (3) is computationally demanding as the sum in the denominator extends over every vertex in the graph. Two commonly used techniques to make word2vec more efficient are (a) replacing the softmax with a hierarchical softmax (Mnih2008; Mikolov2013) and (b) negative sampling (Mnih2012; Mnih2013). We use negative sampling as it is faster.
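The cost of the full softmax is easy to see in a small sketch: every prediction needs one inner product per vertex in the graph. The code below assumes vectors stored as (r_h, theta) pairs, where r_h is the hyperbolic distance from the origin, so that the inner product reduces to r_u r_v cos(theta_u - theta_v). The function names and example vectors are illustrative.

```python
import math


def inner(u, v):
    """Hyperbolic inner product of two (r_h, theta) pairs."""
    return u[0] * v[0] * math.cos(u[1] - v[1])


def softmax_probabilities(v_in, outputs):
    """p(w_i | w_I) for every output vector, using a numerically stable softmax."""
    scores = [inner(v_out, v_in) for v_out in outputs]  # one score per vertex
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


v_in = (1.0, 0.0)
outputs = [(2.0, 0.0), (2.0, math.pi), (1.0, math.pi / 2)]
probs = softmax_probabilities(v_in, outputs)
print(probs)
```

The output vector that points in the same direction as the input receives the highest probability, and the one pointing in the opposite direction receives the lowest.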
Negative sampling is a form of Noise Contrastive Estimation (NCE) (Gutmann2012). NCE is an estimation technique that is based on the assumption that a good model should be able to separate signal from noise using only logistic regression.
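As we only care about generating good embeddings, the objective function does not need to produce a well-specified probability distribution. The negative log likelihood using negative sampling is
$$E = -\log \sigma\left(\langle v'_{w_O}, v_{w_I} \rangle\right) - \sum_{w_j \in W_{neg}} \log \sigma\left(-\langle v'_{w_j}, v_{w_I} \rangle\right) \tag{5}$$
where $v_{w_I}$, $v'_{w_O}$ are the vector representations of the input and output vertices, $u_j = \langle v'_{w_j}, v_{w_I} \rangle$, $W_{neg}$ is a set of samples drawn from the noise distribution, $K = |W_{neg}|$ is the number of samples and $\sigma(u) = 1 / (1 + e^{-u})$ is the sigmoid function. The first term represents the observed data and the second term the negative samples. To draw $W_{neg}$, we specify the noise distribution $P_n$ to be unigrams raised to the power $3/4$ as in (Mikolov2013).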
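A minimal sketch of the negative-sampling objective and the noise distribution follows. It assumes the 3/4 exponent used in (Mikolov2013), takes the scores u as already-computed inner products, and uses illustrative function names and toy walks.

```python
import math
import random
from collections import Counter


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noise_distribution(walks, power=0.75):
    """Unigram frequencies raised to `power` (assumed 3/4), then normalised."""
    counts = Counter(vertex for walk in walks for vertex in walk)
    weights = {vertex: count ** power for vertex, count in counts.items()}
    total = sum(weights.values())
    return {vertex: weight / total for vertex, weight in weights.items()}


def negative_sampling_loss(u_pos, u_negs):
    """-log sigma(u_pos) - sum_j log sigma(-u_neg_j)."""
    loss = -math.log(sigmoid(u_pos))  # observed (input, output) pair
    for u in u_negs:                  # sampled noise vertices
        loss -= math.log(sigmoid(-u))
    return loss


toy_walks = [[0, 1, 2], [1, 2, 1]]  # hypothetical walks
dist = noise_distribution(toy_walks)
rng = random.Random(0)
negatives = rng.choices(list(dist), weights=list(dist.values()), k=5)
```

The loss is small when the observed pair has a large positive score and the sampled pairs have large negative scores, which is the "separate signal from noise" behaviour that NCE asks for.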
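We learn the model using backpropagation. To perform backpropagation it is easiest to work in natural hyperbolic co-ordinates on the disk and map back to Euclidean co-ordinates only at the end. In natural co-ordinates $r \in (0, \infty)$, $\theta \in [0, 2\pi)$ and $\langle u, v \rangle = r_u r_v \cos(\theta_u - \theta_v)$. The major drawback of this co-ordinate system is that it introduces a singularity at the origin. To address the complexities that result from radii that are less than or equal to zero, we initialise all vectors to be in a patch of space that is small relative to its distance from the origin.
The gradient of the negative log-likelihood in (5) w.r.t. $u_j = \langle v'_{w_j}, v_{w_I} \rangle$ is given by
$$\frac{\partial E}{\partial u_j} = \begin{cases} \sigma(u_j) - 1, & \text{if } w_j = w_O \\ \sigma(u_j), & \text{if } w_j \in W_{neg} \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$
Taking the derivatives w.r.t. the components of vectors in $W'$ (in natural polar hyperbolic co-ordinates) yields
$$\frac{\partial E}{\partial r'_j} = \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial r'_j} = \frac{\partial E}{\partial u_j} \, r_I \cos(\theta_I - \theta'_j), \qquad \frac{\partial E}{\partial \theta'_j} = \frac{\partial E}{\partial u_j} \, r_I \, r'_j \sin(\theta_I - \theta'_j).$$
The Jacobian is then
$$\nabla E = \frac{\partial E}{\partial r} \hat{r} + \frac{1}{\sinh r} \frac{\partial E}{\partial \theta} \hat{\theta},$$
which leads to
$$r'_j \leftarrow r'_j - \eta \, \epsilon_j \, r_I \cos(\theta_I - \theta'_j), \qquad \theta'_j \leftarrow \theta'_j - \eta \, \epsilon_j \, \frac{r_I \, r'_j}{\sinh r'_j} \sin(\theta_I - \theta'_j), \qquad \text{for } w_j \in \{w_O\} \cup W_{neg},$$
where $\eta$ is the learning rate and $\epsilon_j = \partial E / \partial u_j$ is the prediction error defined in Equation (6). Calculating the derivatives w.r.t. the input embedding follows the same pattern, and we obtain
$$\frac{\partial E}{\partial r_I} = \sum_{j : w_j \in \{w_O\} \cup W_{neg}} \left(\sigma(u_j) - t_j\right) r'_j \cos(\theta_I - \theta'_j), \qquad \frac{\partial E}{\partial \theta_I} = -\sum_{j : w_j \in \{w_O\} \cup W_{neg}} \left(\sigma(u_j) - t_j\right) r_I \, r'_j \sin(\theta_I - \theta'_j).$$
The corresponding update equations are
$$r_I \leftarrow r_I - \eta \frac{\partial E}{\partial r_I}, \qquad \theta_I \leftarrow \theta_I - \eta \frac{1}{\sinh r_I} \frac{\partial E}{\partial \theta_I},$$
where $t_j$ is an indicator variable s.t. $t_j = 1$ if and only if $w_j = w_O$, and $t_j = 0$ otherwise. On completion of backpropagation, the vectors are mapped back to Euclidean co-ordinates on the Poincaré disk through $\theta_h \to \theta_e$ and $r_h \to r_e = \tanh(r_h / 2)$.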
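The partial derivatives for a single output vector can be sketched and sanity-checked as follows. This is an illustration under stated assumptions rather than the authors' implementation: vectors are (r_h, theta) pairs in natural polar co-ordinates, the score is u = r_in r_out cos(theta_in - theta_out), the per-pair loss is -log sigma(u) for the observed vertex and -log sigma(-u) for a negative sample, and the map back to the disk is r_e = tanh(r_h / 2). All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def output_gradients(v_in, v_out, target):
    """Partial derivatives of the loss w.r.t. (r, theta) of one output vector.

    `target` is 1 for the observed output vertex and 0 for a negative sample.
    """
    r_in, theta_in = v_in
    r_out, theta_out = v_out
    u = r_in * r_out * math.cos(theta_in - theta_out)
    error = sigmoid(u) - target  # prediction error
    d_r = error * r_in * math.cos(theta_in - theta_out)
    d_theta = error * r_in * r_out * math.sin(theta_in - theta_out)
    return d_r, d_theta


def to_euclidean(v):
    """Map (r_h, theta) in natural co-ordinates back onto the unit disk."""
    r_h, theta = v
    return math.tanh(r_h / 2.0), theta


v_in, v_out = (1.2, 0.4), (0.8, 1.1)
print(output_gradients(v_in, v_out, target=1))
print(to_euclidean(v_out))
```

Whatever the size of r_h, `to_euclidean` returns a radius strictly below one, so the mapped point always lies inside the Poincaré disk.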
Figure 4: The factions of the Zachary karate network are easily linearly separable when embedded in 2D hyperbolic space. This is not true when embedding in Euclidean space. Both embeddings were run for 5 epochs on the same random walks.
In this section, we assess the quality of hyperbolic embeddings and compare them to embeddings in Euclidean spaces on a number of public benchmark networks.
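Table 1: Statistics of the network datasets.
|name||vertices||edges||classes||largest class||labels|
|adjnoun||112||425||2||0.52||Part of Speech|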
We report results on five publicly available network datasets for the problem of vertex attribution.
Karate: Zachary’s karate club contains 34 vertices divided into two factions (Zachary1977).
Polbooks: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent co-purchasing of books by the same buyers.
Football: A network of American football games between Division IA colleges during regular season Fall 2000 (Girvan2002).
Adjnoun: Adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens (Newman2006).
Polblogs: A network of hyperlinks between weblogs on US politics, recorded in 2005 (Adamic2005).
Statistics for these datasets are recorded in Table 1.
Figure 5: Macro F1 score against the percentage of labelled vertices used for training. In all cases hyperbolic embeddings (blue) significantly outperform Euclidean DeepWalk embeddings (red). Error bars show standard error from the mean over ten repetitions. The legend used in subfigure (a) applies to all subfigures. A consistent trend across the datasets is that an embedding into a 2D hyperbolic space outperforms DeepWalk architectures with embeddings ranging from 2D to 128D.
To illustrate the utility of hyperbolic embeddings we compare embeddings in the Poincaré disk to two-dimensional DeepWalk embeddings for the 34-vertex karate network with two factions. The results are shown in Figure 4. Both embeddings were generated by running for five epochs on an intermediate dataset of 34 ten-step random walks, one originating at each vertex.
The figure clearly shows that the hyperbolic embedding is able to capture the community structure of the underlying network. When embedded in hyperbolic space, the two factions (black and white discs) of the underlying graph are linearly separable, while the DeepWalk embedding does not exhibit such an obvious structure.
We evaluate the success of neural embeddings in hyperbolic space by using the learned embeddings to predict held-out labels of vertices in networks. In our experiments, we compare our embedding to DeepWalk (Perozzi2014) embeddings of dimensions 2, 4, 8, 16, 32, 64 and 128. To generate embeddings we first create an intermediate dataset by taking a series of random walks over the networks. For each network we use a ten-step random walk originating at each vertex.
The embedding models are all trained using the same parameters and intermediate random walk dataset. For DeepWalk, we use the gensim (Rehurek2010) Python package, while our hyperbolic embeddings are written in custom TensorFlow. In both cases, we use five training epochs, a window size of five and do not prune any vertices.
The results of our experiments are shown in Figure 5. The graphs show macro F1 scores against the percentage of labelled data used to train a logistic regression classifier. Here we follow the method described in (Liu2006) for generating F1 scores when each test case can have multiple labels. The error bars show one standard error from the mean over ten repetitions. The blue lines show hyperbolic embeddings while the red lines depict DeepWalk embeddings at various dimensions. It is apparent that in all datasets hyperbolic embeddings significantly outperform DeepWalk.
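For reference, macro F1 averages the per-class F1 scores with equal weight, so small classes count as much as large ones. The sketch below covers only the simpler single-label case and is not the multi-label procedure of (Liu2006). The function name and the toy labels are illustrative.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
print(score)  # per-class F1 is 2/3 and 4/5, so the mean is 11/15
```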
We have introduced the concept of neural embeddings in hyperbolic space. To the best of our knowledge, all previous embedding models have assumed a flat Euclidean geometry. However, a flat geometry is not the natural geometry of all data structures. Power-law degree distributions, strong clustering and hierarchical community structure emerge naturally when random graphs are embedded in hyperbolic space. It is therefore logical to exploit the structure of hyperbolic space for useful embeddings of complex networks. We have demonstrated that, when applied to the task of classifying vertices of complex networks, hyperbolic space embeddings significantly outperform embeddings in Euclidean space.
Proceedings of the AAAI Conference on Artificial Intelligence, volume 6, pages 421–426, 2006.
Finding community structure in networks using the eigenvectors of matrices. Physical Review E - Statistical, Nonlinear, and Soft Matter Physics, 74(3):1–19, 2006.