Gaussian word embeddings
Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.
In recent years there has been a surge of interest in learning compact distributed representations or embeddings for many machine learning tasks, including collaborative filtering (Koren et al., 2009), relation extraction (Riedel et al., 2013), word semantics and language modeling (Bengio et al., 2006; Mnih & Hinton, 2008; Mikolov et al., 2013), and many others. In these approaches input objects (such as images, relations or words) are mapped to dense vectors having lower dimensionality than the cardinality of the inputs, with the goal that the geometry of this low-dimensional latent embedded space be smooth with respect to some measure of similarity in the target domain. That is, objects associated with similar targets should be mapped to nearby points in the embedded space.
While this approach has proven powerful, representing an object as a single point in space carries some important limitations. An embedded vector representing a point estimate does not naturally express uncertainty about the target concepts with which the input may be associated. Point vectors are typically compared by dot products, cosine distance or Euclidean distance, none of which provide for asymmetric comparisons between objects (as is necessary to represent inclusion or entailment). Relationships between points are normally measured by distances required to obey the triangle inequality.
This paper advocates moving beyond vector point representations to potential functions (Aizerman et al., 1964), or continuous densities in latent space. In particular we explore Gaussian function embeddings (currently with diagonal covariance), in which both means and variances are learned from data. Gaussians innately represent uncertainty, and provide a distance function per object. KL-divergence between Gaussian distributions is straightforward to calculate, naturally asymmetric, and has a geometric interpretation as an inclusion between families of ellipses.
There is a long line of previous work in mapping data cases to probability distributions, perhaps the most famous being radial basis functions (RBFs), used both in the kernel and neural network literature. We draw inspiration from this work to propose novel word embedding algorithms that embed words directly as Gaussian distributional potential functions in an infinite dimensional function space. This allows us to map word types not only to vectors but to soft regions in space, modeling uncertainty, inclusion, and entailment, as well as providing a rich geometry of the latent space.
After discussing related work and presenting our algorithms below, we explore properties of our algorithms with multiple qualitative and quantitative evaluations on several real and synthetic datasets. We show that concept containment and specificity match common intuition on examples concerning people, genres, foods, and others. We compare our embeddings to Skip-Gram on seven standard word similarity tasks, and evaluate the ability of our method to learn unsupervised lexical entailment. We also demonstrate that our training method supports new styles of supervised training that explicitly incorporate asymmetry into the objective.
This paper builds on a long line of work on both distributed and distributional semantic word vectors, including distributional semantics, neural language models, count-based language models, and, more broadly, the field of representation learning.
Related work in probabilistic matrix factorization (Mnih & Salakhutdinov, 2007) embeds rows and columns as Gaussians, and some forms of this do provide each row and column with its own variance (Salakhutdinov & Mnih, 2008). Given the parallels between embedding models and matrix factorization (Deerwester et al., 1990; Riedel et al., 2013; Levy & Goldberg, 2014), this is relevant to our approach. However, these Bayesian methods apply Bayes’ rule to observed data to infer the latent distributions, whereas our model works directly in the space of probability distributions and discriminatively trains them. This allows us to go beyond the Bayesian approach and use arbitrary (and even asymmetric) training criteria, and is more similar to methods that learn kernels (Lanckriet et al., 2004) or function-valued neural networks such as mixture density networks (Bishop, 1994).
Other work in multiplicative tensor factorization for word embeddings (Kiros et al., 2014) and metric learning (Xing et al., 2002) learns some combinations of representations, clusters, and a distance metric jointly; however, it does not effectively learn a distance function per item. Fitting Gaussian mixture models on embeddings has been done in order to apply Fisher kernels to entire documents (Clinchant & Perronnin, 2013b, a). Preliminary concurrent work from Kiyoshiyo et al. (2014) describes a significantly different model similar to Bayesian matrix factorization, using a probabilistic Gaussian graphical model to define a distribution over pairs of words, and they lack quantitative experiments or evaluation.
In linguistic semantics, work on the distributional inclusion hypothesis (Geffet & Dagan, 2005) uses traditional count-based vectors to define regions in vector space (Erk, 2009) such that subordinate concepts are included in these regions. In fact, one strength of our proposed work is that we extend these intuitively appealing ideas (as well as the ability to use a variety of asymmetric distances between vectors) to the dense, low-dimensional distributed vectors that are now gaining popularity.
Our goal is to map every word type $w$ in some dictionary $\mathcal{D}$ and context word type $c$ in a dictionary $\mathcal{C}$ to a Gaussian distribution over a latent embedding space, such that linguistic properties of the words are captured by properties of and relationships between the distributions. For precision, we call an element of the dictionary a word type, and a particular observed token in some context a word token. This is analogous to the class vs. instance distinction in object-oriented programming.
In unsupervised learning of word vectors, we observe a sequence of word tokens $\{t(w)_i\}$ for each type $w$, and their contexts (sets of nearby word tokens), $\{c(w)_i\}$. The goal is to map each word type $w$ and context word type $c$ to a vector, such that types that appear in similar contexts have similar vectors. When it is unambiguous, we also use the variables $w$ and $c$ to denote the vectors associated to that given word type or context word type.
An energy function (LeCun et al., 2006) is a function $E_\theta(x, y)$ that scores pairs of inputs $x$ and outputs $y$, parametrized by $\theta$. The goal of energy-based learning is to train the parameters of the energy function to score observed positive input-output pairs higher (or lower, depending on sign conventions) than negative pairs. This is accomplished by means of a loss function $L$ which defines which pairs are positive and negative according to some supervision, and provides gradients on the parameters given the predictions of the energy function.
In prediction-based (energy-based) word embedding models, the parameters $\theta$ correspond to our learned word representations, and the $x$ and $y$ input-output pairs correspond to word tokens and their contexts. These contexts can be either positive (observed) or negative (often randomly sampled). In the word2vec Skip-Gram (Mikolov et al., 2013) word embedding model, the energy function takes the form of a dot product between the vectors of an observed word $w$ and an observed context $c$, $w^\top c$. The loss function is a binary logistic regression classifier that treats the score of a word and its observed context as the score of a positive example, and the score of a word and a randomly sampled context as the score of a negative example.
Backpropagating (Rumelhart et al., 1986) this loss to the word vectors trains them to be predictive of their contexts, achieving the desired effect (words in similar contexts have similar vectors). In recent work, word2vec has been shown to be equivalent to factoring certain types of weighted pointwise mutual information matrices (Levy & Goldberg, 2014).
In our work, we use a slightly different loss function than Skip-Gram word2vec embeddings. Our energy functions take on a more limited range of values than do vector dot products, and their dynamic ranges depend in complex ways on the parameters. Therefore, we had difficulty using the word2vec loss that treats scores of positive and negative pairs as positive and negative examples to a binary classifier, since this relies on the ability to push up on the energy surface in an absolute, rather than relative, manner. To avoid the problem of absolute energies, we train with a ranking-based loss. We chose a max-margin ranking objective, similar to that used in Rank-SVM (Joachims, 2002) or Wsabie (Weston et al., 2011), which pushes scores of positive pairs above negatives by a margin $m$:
$$L_m(w, c_p, c_n) = \max\bigl(0,\; m - E(w, c_p) + E(w, c_n)\bigr),$$
where $c_p$ is a positive (observed) context and $c_n$ a negative (sampled) context for the word $w$.
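As a concrete illustration of the ranking objective described above, the following is a minimal NumPy sketch. It assumes the sign convention in which a higher energy means a more compatible pair; the function name `max_margin_loss` and the example energies are hypothetical and not taken from the original implementation.

```python
import numpy as np

def max_margin_loss(e_pos, e_neg, margin=1.0):
    """Hinge ranking loss: zero once e_pos exceeds e_neg by at least `margin`."""
    return np.maximum(0.0, margin - e_pos + e_neg)

# Illustrative energies only (higher = more compatible in this sketch).
loss_violated = max_margin_loss(e_pos=0.2, e_neg=0.5)   # positive pair ranked too low
loss_satisfied = max_margin_loss(e_pos=3.0, e_neg=0.5)  # margin already met, no gradient
```

Because only the difference of the two energies enters the loss, shifting every energy by a constant leaves it unchanged, which is the relative (rather than absolute) behaviour the text asks for.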
In this terminology, the contribution of our work is a pair of energy functions for training Gaussian distributions to represent word types.
Given a pre-trained set of word embeddings trained from contexts, there is a simple way to construct variances using the empirical variance of a word type’s set of context vectors.
For a word $w$ with $N$ word vector sets $\{c(w)_i\}$ representing the words found in its contexts, and window size $W$, the empirical variance is
$$\Sigma_w = \frac{1}{NW} \sum_{i=1}^{N} \sum_{j=1}^{W} \bigl(c(w)_{ij} - w\bigr)\bigl(c(w)_{ij} - w\bigr)^\top.$$
This is an estimator for the covariance of a distribution assuming that the mean is fixed at $w$. In practice, it is also necessary to add a small ridge term to the diagonal of the matrix to regularize and avoid numerical problems when inverting.
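The empirical estimator can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the context vectors are stored as an array of shape `(N, W, d)`, the normalization is by the total number of context vectors, and the names `empirical_covariance` and `ridge` are hypothetical. The data here is synthetic.

```python
import numpy as np

def empirical_covariance(word_vec, context_vecs, ridge=1e-3):
    """Covariance of context vectors around a mean fixed at `word_vec`.

    context_vecs has shape (N, W, d): N occurrences, W context vectors each.
    """
    d = word_vec.shape[0]
    diffs = context_vecs.reshape(-1, d) - word_vec      # (N*W, d) deviations from w
    cov = diffs.T @ diffs / diffs.shape[0]              # average outer product
    return cov + ridge * np.eye(d)                      # ridge keeps it invertible

rng = np.random.default_rng(0)
w = rng.normal(size=5)                                  # synthetic word vector
contexts = w + 0.1 * rng.normal(size=(200, 4, 5))       # synthetic context vectors
sigma_w = empirical_covariance(w, contexts)
```

The ridge value would normally be tuned; the default above is arbitrary.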
However, in Section 6.2 we note that the distributions learned by this empirical estimator do not possess properties that we would want from Gaussian distributional embeddings, such as unsupervised entailment represented as inclusion between ellipsoids. By discriminatively embedding our predictive vectors in the space of Gaussian distributions, we can improve this performance. Our models can learn certain forms of entailment during unsupervised training, as discussed in Section 6.2 and exemplified in Figure 1.
As discussed in Section 3, our architecture learns Gaussian distributional embeddings to predict words in context given the current word, and ranks these over negatively sampled words. We present two energy functions to train these embeddings.
While the dot product between two means of independent Gaussians is a perfectly valid measure of similarity (it is the expected dot product), it does not incorporate the covariances and would not enable us to gain any benefit from our probabilistic model.
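The most logical next choice for a symmetric similarity function would be to take the inner product between the distributions themselves. Recall that for two (well-behaved) functions $f, g : \mathbb{R}^n \to \mathbb{R}$, a standard choice of inner product is
$$\langle f, g \rangle = \int_{x \in \mathbb{R}^n} f(x)\, g(x)\, dx,$$
i.e. the continuous version of $\langle f, g \rangle = \sum_i f_i g_i = f^\top g$ for discrete vectors $f$ and $g$.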
This idea seems very natural, and indeed has appeared before – the idea of mapping data cases into probability distributions (often over their contexts), and comparing them via integrals has a history under the name of the expected likelihood or probability product kernel (Jebara et al., 2004).
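For Gaussians, the inner product is defined as
$$E(P_i, P_j) = \int_{x \in \mathbb{R}^n} \mathcal{N}(x; \mu_i, \Sigma_i)\, \mathcal{N}(x; \mu_j, \Sigma_j)\, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j).$$
The proof of this identity follows from simple calculus. This is a consequence of the broader fact that the Gaussian is a stable distribution, i.e. the convolution of two Gaussian densities (the density of the sum of two independent Gaussian random variables) is another Gaussian.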
Since we aim to discriminatively train the weights of the energy function, and it is always positive, we work not with this quantity directly, but with its logarithm. This has two motivations: firstly, we plan to use ranking loss, and ratios of densities and likelihoods are much more commonly worked with than differences – differences in probabilities are less interpretable than an odds ratio. Secondly, it is easier numerically, as otherwise the quantities can get exponentially small and harder to deal with.
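The logarithm of the energy (in $d$ dimensions) is
$$\log \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j) = -\tfrac{1}{2} \log\det(\Sigma_i + \Sigma_j) - \tfrac{1}{2} (\mu_i - \mu_j)^\top (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j) - \tfrac{d}{2} \log(2\pi).$$
Recalling that the gradient of the log determinant is $\frac{\partial}{\partial A} \log\det A = A^{-1}$, and the gradient $\frac{\partial}{\partial A} x^\top A^{-1} y = -A^{-\top} x y^\top A^{-\top}$ (Petersen, 2006), we can take the gradient of this energy function with respect to the means $\mu$ and covariances $\Sigma$:
$$\frac{\partial \log E(P_i, P_j)}{\partial \mu_i} = -\frac{\partial \log E(P_i, P_j)}{\partial \mu_j} = -\Delta_{ij},$$
$$\frac{\partial \log E(P_i, P_j)}{\partial \Sigma_i} = \frac{\partial \log E(P_i, P_j)}{\partial \Sigma_j} = \tfrac{1}{2}\bigl(\Delta_{ij} \Delta_{ij}^\top - (\Sigma_i + \Sigma_j)^{-1}\bigr),$$
where $\Delta_{ij} = (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j)$.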
For diagonal and spherical covariances, these matrix inverses are trivial to compute, and even in the full-matrix case can be solved very efficiently for the small dimensionality common in embedding models. If the matrices have a low-rank plus diagonal structure, they can be computed and stored even more efficiently using the matrix inversion lemma.
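For the diagonal case, the log expected-likelihood energy and its gradients reduce to elementwise operations. The sketch below is illustrative rather than the authors' code: covariances are stored as vectors of diagonal entries, the function names are hypothetical, and the closed-form gradients are compared against central finite differences as a sanity check.

```python
import numpy as np

def log_el_energy(mu_i, var_i, mu_j, var_j):
    """log N(0; mu_i - mu_j, Sigma_i + Sigma_j) for diagonal covariances."""
    s = var_i + var_j                       # diagonal of the summed covariance
    delta = mu_i - mu_j
    d = mu_i.shape[0]
    return (-0.5 * np.sum(np.log(s))        # -1/2 log det(Sigma_i + Sigma_j)
            - 0.5 * np.sum(delta ** 2 / s)  # -1/2 Mahalanobis term
            - 0.5 * d * np.log(2.0 * np.pi))

def log_el_grads(mu_i, var_i, mu_j, var_j):
    """Gradients w.r.t. mu_i and w.r.t. the diagonal variances (same for i and j)."""
    s = var_i + var_j
    big_delta = (mu_i - mu_j) / s           # (Sigma_i + Sigma_j)^{-1} (mu_i - mu_j)
    grad_mu_i = -big_delta                  # gradient w.r.t. mu_j is +big_delta
    grad_var = 0.5 * (big_delta ** 2 - 1.0 / s)
    return grad_mu_i, grad_var

# Finite-difference check on arbitrary illustrative values.
mu_i, var_i = np.array([0.3, -1.2]), np.array([0.5, 1.5])
mu_j, var_j = np.array([-0.4, 0.7]), np.array([1.0, 0.2])
g_mu, g_var = log_el_grads(mu_i, var_i, mu_j, var_j)

h = 1e-6
fd_mu = np.zeros(2)
fd_var = np.zeros(2)
for k in range(2):
    e = np.zeros(2)
    e[k] = h
    fd_mu[k] = (log_el_energy(mu_i + e, var_i, mu_j, var_j)
                - log_el_energy(mu_i - e, var_i, mu_j, var_j)) / (2 * h)
    fd_var[k] = (log_el_energy(mu_i, var_i + e, mu_j, var_j)
                 - log_el_energy(mu_i, var_i - e, mu_j, var_j)) / (2 * h)
```

Storing only the diagonal keeps each energy evaluation linear in the embedding dimension.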
This log-energy has an intuitive geometric interpretation as a similarity measure. Gaussians are measured as close to one another based on the distance between their means, as measured through the Mahalanobis distance defined by their joint inverse covariance. Recalling that $\log\det A$ is equivalent to the log-volume of the ellipse spanned by the principal components of $A$, we can interpret this other term of the energy as a regularizer that prevents us from decreasing the distance by only increasing joint variance. This combination pushes the means together while encouraging them to have more concentrated, sharply peaked distributions in order to have high energy.
Training vectors through KL-divergence to encode their context distributions, or even to incorporate more explicit directional supervision regarding entailment from a knowledge base or WordNet, is also a sensible objective choice. We optimize the following energy function (which has a similarly tractable closed form solution for Gaussians):
$$-E(P_i, P_j) = D_{KL}(\mathcal{N}_j \,\|\, \mathcal{N}_i) = \int_{x \in \mathbb{R}^n} \mathcal{N}(x; \mu_j, \Sigma_j) \log \frac{\mathcal{N}(x; \mu_j, \Sigma_j)}{\mathcal{N}(x; \mu_i, \Sigma_i)}\, dx = \tfrac{1}{2}\Bigl(\operatorname{tr}(\Sigma_i^{-1} \Sigma_j) + (\mu_i - \mu_j)^\top \Sigma_i^{-1} (\mu_i - \mu_j) - d - \log\frac{\det \Sigma_j}{\det \Sigma_i}\Bigr).$$
Note the leading negative sign (we define the negative energy), since KL is a divergence (a measure of dissimilarity) and not a similarity. KL divergence is a natural energy function for representing entailment between concepts: a low KL divergence from $x$ to $y$ indicates that we can encode $y$ easily as $x$, implying that $y$ entails $x$. This can be more intuitively visualized and interpreted as a soft form of inclusion between the level sets of ellipsoids generated by the two Gaussians: if there is a relatively high expected log-likelihood ratio (negative KL), then most of the mass of $y$ lies inside $x$.
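Just as in the previous case, we can compute the gradients for this energy function in closed form:
$$\frac{\partial E(P_i, P_j)}{\partial \mu_i} = -\frac{\partial E(P_i, P_j)}{\partial \mu_j} = -\Delta'_{ij},$$
$$\frac{\partial E(P_i, P_j)}{\partial \Sigma_i} = \tfrac{1}{2}\bigl(\Sigma_i^{-1} \Sigma_j \Sigma_i^{-1} + \Delta'_{ij} {\Delta'_{ij}}^\top - \Sigma_i^{-1}\bigr), \qquad \frac{\partial E(P_i, P_j)}{\partial \Sigma_j} = \tfrac{1}{2}\bigl(\Sigma_j^{-1} - \Sigma_i^{-1}\bigr),$$
where $\Delta'_{ij} = \Sigma_i^{-1} (\mu_i - \mu_j)$, using the fact that $\frac{\partial}{\partial A} \operatorname{tr}(X^\top A^{-1} Y) = -(A^{-1} Y X^\top A^{-1})^\top$ and $\frac{\partial}{\partial A} \operatorname{tr}(X^\top A Y) = X Y^\top$ (Petersen, 2006).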
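The asymmetry of the KL energy can be seen in a small NumPy sketch for diagonal Gaussians. It uses the standard closed form for the KL divergence between two Gaussians; the function names and the toy "broad" and "narrow" distributions are illustrative assumptions, and the argument order of `kl_energy` (negative divergence from the second argument to the first) is one reading of the sign convention described above.

```python
import numpy as np

def kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(N_p || N_q) for Gaussians with diagonal covariances."""
    d = mu_p.shape[0]
    return 0.5 * (np.sum(var_p / var_q)                    # tr(Sigma_q^{-1} Sigma_p)
                  + np.sum((mu_q - mu_p) ** 2 / var_q)     # Mahalanobis term under q
                  - d
                  + np.sum(np.log(var_q)) - np.sum(np.log(var_p)))

def kl_energy(mu_i, var_i, mu_j, var_j):
    """E(P_i, P_j) = -KL(N_j || N_i); higher means N_j sits more 'inside' N_i."""
    return -kl_diag(mu_j, var_j, mu_i, var_i)

# Toy example: a broad concept and a narrow one located inside it.
mu_broad, var_broad = np.zeros(2), np.full(2, 4.0)
mu_narrow, var_narrow = np.full(2, 0.5), np.full(2, 0.25)

narrow_in_broad = kl_diag(mu_narrow, var_narrow, mu_broad, var_broad)
broad_in_narrow = kl_diag(mu_broad, var_broad, mu_narrow, var_narrow)
```

In this toy setting the divergence from the narrow Gaussian to the broad one is much smaller than in the reverse direction, which is the soft inclusion behaviour used to model entailment.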
Another benefit of embedding objects as probability distributions is that we can look at the distribution of dot products between vectors drawn from two Gaussian representations. This distribution is not itself a one-dimensional Gaussian, but it has a finite mean and variance with a simple structure in the case where the two Gaussians are assumed independent (Brown & Rutemiller, 1977). For the distribution $P(z = x^\top y)$, we have
$$\mu_z = \mu_x^\top \mu_y, \qquad \Sigma_z = \mu_x^\top \Sigma_y \mu_x + \mu_y^\top \Sigma_x \mu_y + \operatorname{tr}(\Sigma_x \Sigma_y).$$
This means we can find e.g. a lower or upper bound on the dot products of random samples from these distributions that should hold some given percent of the time. Parametrizing this energy by some number $c$ of standard deviations, we can also get a range for the dot product as:
$$\mu_x^\top \mu_y \pm c \sqrt{\mu_x^\top \Sigma_y \mu_x + \mu_y^\top \Sigma_x \mu_y + \operatorname{tr}(\Sigma_x \Sigma_y)}.$$
We can choose $c$ in a principled way using an (incorrect) Gaussian approximation, or more general concentration bounds such as Chebyshev's inequality.
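The moments of the dot product can be checked numerically. The sketch below assumes independent diagonal Gaussians and uses the standard result for the product of independent random vectors, $\mathrm{Var}(x^\top y) = \mu_x^\top \Sigma_y \mu_x + \mu_y^\top \Sigma_x \mu_y + \operatorname{tr}(\Sigma_x \Sigma_y)$; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent diagonal Gaussians (illustrative values only).
mu_x, var_x = np.array([1.0, -0.5, 2.0]), np.array([0.3, 0.2, 0.5])
mu_y, var_y = np.array([0.5, 1.0, -1.0]), np.array([0.4, 0.1, 0.2])

# Closed-form mean and variance of z = x . y for independent x and y.
mean_z = mu_x @ mu_y
var_z = (mu_x ** 2) @ var_y + (mu_y ** 2) @ var_x + var_x @ var_y

# Monte Carlo samples of the dot product for comparison.
n = 200_000
x = mu_x + np.sqrt(var_x) * rng.standard_normal((n, 3))
y = mu_y + np.sqrt(var_y) * rng.standard_normal((n, 3))
z = np.sum(x * y, axis=1)

# Interval of c standard deviations around the mean, and its empirical coverage.
c = 2.0
lower, upper = mean_z - c * np.sqrt(var_z), mean_z + c * np.sqrt(var_z)
coverage = np.mean((z >= lower) & (z <= upper))
```

Chebyshev's inequality guarantees that at least $1 - 1/c^2$ of the mass falls inside the interval, so with $c = 2$ the coverage should be no lower than about 75%, without assuming that $z$ is Gaussian.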
To learn our model, we need to pick an energy function (EL or KL), a loss function (max-margin), and a set of positive and negative training pairs. As the landscape is highly nonconvex, it is also helpful to add some regularization.
We regularize the means and covariances differently, since they are different types of geometric objects. The means should not be allowed to grow too large, so we can add a simple hard constraint to the $\ell_2$ norm:
$$\|\mu_i\|_2 \le C, \quad \forall i.$$
However, the covariance matrices need to be kept positive definite as well as reasonably sized. This is achieved by adding a hard constraint that the eigenvalues $\lambda_i$ of the covariance matrices lie within the hypercube $[m, M]^d$ for constants $m$ and $M$:
$$mI \prec \Sigma_i \prec MI, \quad \forall i.$$
For diagonal covariances, this simply involves either applying the min or max function to each element of the diagonal to keep it within the hypercube, $\Sigma_{ii} \leftarrow \max(m, \min(M, \Sigma_{ii}))$.
Controlling the bottom eigenvalues of the covariance is especially important when training with expected likelihood, since the energy function includes a $\log\det$ term that can give very high scores to small covariances, dominating the rest of the energy.
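For diagonal covariances, both constraints can be enforced with a cheap projection after each update. The sketch below is one possible way to do this and is not necessarily how the original implementation enforces the constraints: the rescaling of over-long mean vectors, the function name `project_parameters`, and the bound values are all assumptions.

```python
import numpy as np

def project_parameters(mu, var, max_norm, var_min, var_max):
    """Project a mean onto the L2 ball and clip diagonal variances to [var_min, var_max]."""
    norm = np.linalg.norm(mu)
    if norm > max_norm:
        mu = mu * (max_norm / norm)          # rescale onto the ball's surface
    var = np.clip(var, var_min, var_max)     # elementwise max(m, min(M, .))
    return mu, var

# Illustrative values: a mean that is too long and variances outside the box.
mu_proj, var_proj = project_parameters(
    np.array([3.0, 4.0, 0.0]), np.array([1e-4, 0.5, 20.0]),
    max_norm=2.0, var_min=0.01, var_max=10.0)
```

The lower variance bound is the one that keeps the log-determinant term of the expected-likelihood energy from growing without limit.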
We optimize the parameters using AdaGrad (Duchi et al., 2011) and stochastic gradients in small minibatches containing 20 sentences worth of tokens and contexts.
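For readers unfamiliar with AdaGrad, the per-coordinate update can be sketched as follows. This is a generic illustration on a toy quadratic objective, not the training loop used in the experiments; the learning rate, the `eps` constant, and the function name `adagrad_step` are assumptions.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.5, eps=1e-8):
    """One AdaGrad descent step; returns the updated parameter and accumulator."""
    accum = accum + grad ** 2                          # running sum of squared gradients
    param = param - lr * grad / (np.sqrt(accum) + eps) # per-coordinate step size
    return param, accum

# Toy problem: minimize ||x - target||^2 starting from the origin.
target = np.array([1.0, -2.0])
x = np.zeros(2)
accum = np.zeros(2)
initial_dist = np.linalg.norm(x - target)
for _ in range(200):
    grad = 2.0 * (x - target)                          # gradient of the quadratic
    x, accum = adagrad_step(x, grad, accum)
final_dist = np.linalg.norm(x - target)
```

In an embedding model, the same update would be applied separately to the means and variances, followed by whatever projection keeps them inside their constraint sets.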
We evaluate the representation learning algorithms on several qualitative and quantitative tasks, including modeling asymmetric and linguistic relationships, uncertainty, and word similarity. All Gaussian experiments are conducted with 50-dimensional vectors, with diagonal variances except where noted otherwise. Unsupervised embeddings are learned on the concatenated ukWaC and WaCkypedia corpora (Baroni et al., 2009), consisting of about 3 billion tokens. This matches the experimental setup used by Baroni et al. (2012), aside from leaving out the small British National Corpus, which is not publicly available and contains only 100 million tokens. All word types that appear less than 100 times in the training set are dropped, leaving a vocabulary of approximately 280 thousand word types.
When training word2vec Skip-Gram embeddings for baselines, we follow the above training setup (50 dimensional embeddings), using our own implementation of word2vec to change as little as possible between the two models, only the loss function. We train both models with one pass over the data, using separate embeddings for the input and output contexts, 1 negative sample per positive example, and the same subsampling procedure as in the word2vec paper (Mikolov et al., 2013). The only other difference between the two training regimes is that we use a smaller regularization constraint when using the word2vec loss function, which improves performance vs. the diagonal Gaussian model which does better with “spikier” mean embeddings with larger norms (see the comment in Section 6.4). The original word2vec implementation uses no constraint, but we saw better performance when including it in our training setup.
In Figure 2, we examine some of the 100 nearest neighbors of several query words as we sort from largest to smallest variance, as measured by determinant of the covariance matrix, using diagonal Gaussian embeddings. Note that more specific words, such as joviality and electroclash have smaller variance, while polysemous words or those denoting broader concepts have larger variances, such as mix, mind, and graph. This is not merely an artifact of higher frequency words getting more variance – when sorting by those words whose rank by frequency and rank by variance are most dissimilar, we see that genres with names like chillout, avant, and shoegaze overindex their variance compared to how frequent they are, since they appear in different contexts. Similarly, common emotion words like sadness and sincerity have less variance than their frequency would predict, since they have fairly fixed meanings. Another emotion word, coldness, is an uncommon word with a large variance due to its polysemy.
|Query Word||Nearby Words, Descending Variance|
|rock||mix sound blue folk jazz rap avant hardcore chillout shoegaze powerpop|
|food||drink meal meat diet spice juice bacon soya gluten stevia|
|feeling||sense mind mood perception compassion sadness coldness sincerity perplexity diffidence joviality|
|algebra||theory graph equivalence finite predicate congruence topology quaternion symplectic homomorphism|
As can be seen qualitatively in Figure 1, our embeddings can learn some forms of unsupervised entailment directly from the source data. We evaluate quantitatively on the Entailment dataset of Baroni et al. (2012). Our setup is essentially the same as theirs but uses slightly less data, as mentioned in the beginning of this section. We evaluate with Average Precision and best F1 score. We include the best F1 score (by picking the optimal threshold at test) because this is used by Baroni et al. (2012), but we believe AP is better to demonstrate the correlation of various asymmetric and symmetric measures with the entailment data.
|Baroni et al. (2012)||E||balAPinc||75.1||–|
In Figure 3, we compare variances learned jointly during embedding training by using the expected likelihood objective, with empirical variances gathered from contexts on pre-trained word2vec-style embeddings. We compare both diagonal (D) and spherical (S) variances, using both cosine similarity between means, and KL divergence. Baseline asymmetric measurements, such as the difference between the sizes of the two embeddings, did worse than the cosine. We see that KL divergence between the entailed and entailing word does not give good performance for the empirical variances, but beats the count-based balAPinc measure when used with learned variances.
For the baseline empirical model to achieve reasonable performance when using KL divergence, we regularized the covariance matrices, as the unregularized matrices had very small entries. We regularized the empirical covariance by adding a small ridge to the diagonal, which was tuned to maximize performance, to give the largest possible advantage to the baseline model. Interestingly, the empirical variances do worse with KL than the symmetric cosine similarity when predicting entailment. This appears to be because the empirically learned variances are so small that the choice is between either leaving them small, making it very difficult to have one Gaussian located “inside” another Gaussian, or regularizing so much that their discriminative power is washed out. Additionally, when examining the empirical variances, we noted that common words like “such,” which receive very large variances in our learned model, have much smaller empirical variances relative to rarer words. A possible explanation is that the contrastive objective forces variances of commonly sampled words to spread out to avoid loss, while the empirical variance sees only “positive examples” and has no penalty for being close to many contexts at once.
While these results indicate that we can do as well or better at unsupervised entailment than previous distributional semantic measures, we would like to move beyond purely unsupervised learning. Although certain forms of entailment can be learned in an unsupervised manner from distributional data, many entailing relationships are not present in the training text in the form of lexical substitutions that reflect the is-a relationship. For example, one might see phrases such as “look at that bird,” “look at that eagle,” “look at that dog,” but rarely “look at that mammal.” One appealing aspect of our models versus count-based ones is that they can be directly discriminatively trained to embed hierarchies.
In Figure 4, we see the results of directly embedding simple tree hierarchies as Gaussians. We embed nodes as Gaussians with diagonal variances in two-dimensional space using gradient descent on the KL divergence between parents and children. We create a Gaussian for each node in the tree, and randomly initialize means. Negative contexts come from randomly sampled nodes that are neither ancestors nor descendents, while positive contexts come from ancestors or descendents using the appropriate directional KL divergence. Unlike our experiments with symmetric energy, we must use the same set of embeddings for nodes and contexts, or else the objective function will push the variances to be unboundedly large. Our training process captures the hierarchical relationships, although leaf-level siblings are not differentiated from each other by this objective function. This is because out of all the negative examples that a leaf node can receive, only one will push it away from its sibling node.
We evaluate the embeddings on seven different standard word similarity benchmarks (Rubenstein & Goodenough, 1965; Szumlanski et al., 2013; Hill et al., 2014; Miller & Charles, 1991; Bruni et al., 2014; Yang & Powers, 2006; Finkelstein et al., 2001). A comparison to all of the state of the art word-embedding numbers for different dimensionalities as in (Baroni et al., 2014) is out of the scope of this evaluation. However, we note that the overall performance of our 50-dimensional embeddings matches or beats reported numbers on these datasets for the 80-dimensional Skip-Gram vectors at wordvectors.org (Faruqui & Dyer, 2014), as well as our own Skip-Gram implementation. Note that the numbers are not directly comparable since we use a much older version of Wikipedia (circa 2009) in our WaCkypedia dataset, but this should not give us an edge.
While it is good to sanity-check that our embedding algorithms can achieve standard measures of distributional quality, these experiments also let us compare the different types of variances (spherical and diagonal). We also compare against Skip-Gram embeddings with 100 latent dimensions, since our diagonal variances have 50 extra parameters.
We see that the embeddings with spherical covariances have an overall slight edge over the embeddings with diagonal covariances in this case, in a reversal from the entailment experiments. This could be due to the diagonal variance matrices making the embeddings more axis-aligned, making it harder to learn all the similarities and reducing model capacity. To test this theory, we plotted the absolute values of components of spherical and diagonal variance mean vectors on a q-q plot and noted a significant off-diagonal shift, indicating that diagonal variance embedding mean vectors have “spikier” distributions of components, indicating more axis-alignment.
We also see that the distributions with diagonal variances benefit more from including the variance in the comparison (d) than the spherical variances. Generally, the data sets in which the cosine between distributions (d) outperforms cosine between means (m) are similar for both spherical and diagonal covariances. Using the cosine between distributions never helped when using empirical variances, so we do not include those numbers.
|Dataset||SG (50d)||SG (100d)||LG/50/m/S||LG/50/d/S||LG/50/m/D||LG/50/d/D|
In this work we introduced a method to embed word types into the space of Gaussian distributions, and learn the embeddings directly in that space. This allows us to represent words not as low-dimensional vectors, but as densities over a latent space, directly representing notions of uncertainty and enabling a richer geometry in the embedded space. We demonstrated the effectiveness of these embeddings on a linguistic task requiring asymmetric comparisons, as well as standard word similarity benchmarks, learning of synthetic hierarchies, and several qualitative examinations.
In future work, we hope to move beyond spherical or diagonal covariances and into combinations of low rank and diagonal matrices. Efficient updates and scalable learning are still possible due to the Sherman-Woodbury-Morrison formula. Additionally, going beyond diagonal covariances will enable us to keep our semantics from being axis-aligned, which will increase model capacity and expressivity. We also hope to move past stochastic gradient descent and warm starting and be able to learn the Gaussian representations robustly in one pass from scratch by using e.g. proximal or block coordinate descent methods. Improved optimization strategies will also be helpful on the highly nonconvex problem of training supervised hierarchies with KL divergence.
Representing words and concepts as different types of distributions (including other elliptic distributions such as the Student’s t) is an exciting direction – Gaussians concentrate their density on a thin spherical ellipsoidal shell, which can lead to counterintuitive behavior in high dimensions. Multimodal distributions represent another clear avenue for future work. Combining ideas from kernel methods and manifold learning with deep learning and linguistic representation learning is an exciting frontier.
In other domains, we want to extend the use of potential function representations to other tasks requiring embeddings, such as relational learning with the universal schema (Riedel et al., 2013). We hope to leverage the asymmetric measures, probabilistic interpretation, and flexible training criteria of our model to tackle tasks involving similarity-in-context, comparison of sentences and paragraphs, and more general common sense reasoning.
This work was supported in part by the Center for Intelligent Information Retrieval, in part by IARPA via DoI/NBC contract #D11PC20152, and in part by NSF grant #CNS-0958392. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pp. 880–887. ACM, 2008.