1 Introduction
Word embeddings, which map words into a low-dimensional vector space, play an important role in natural language processing (NLP). Many methods have been proposed to generate word embeddings (Mnih and Kavukcuoglu, 2013; Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014; Bojanowski et al., 2017; Devlin et al., 2019). With many different sets of word embeddings produced by different algorithms and corpora, it is interesting to investigate the relationships between them. Intrinsically, this would help us better understand word embeddings (Levy et al., 2015). Practically, knowing the relationship between different sets of word embeddings helps us build better word meta-embeddings (Yin and Schütze, 2016), reduce biases in word embeddings (Bolukbasi et al., 2016), pick better hyperparameters (Yin and Shen, 2018), and choose suitable algorithms in different scenarios (Kozlowski et al., 2019).
To study the relationship between different embedding spaces systematically, we propose RPD as a measure of the distance between different sets of embeddings. We derive statistical properties of RPD including its asymptotic upper bound and normality under the independence condition. We also provide a geometric interpretation of RPD. Furthermore, we show that RPD is strongly correlated with the performance of word embeddings measured by intrinsic metrics, such as comparing semantic similarity and evaluating analogies.
With the help of RPD, we study the relations among several popular embedding methods: GloVe (Pennington et al., 2014), SGNS (skip-gram with negative sampling; Mikolov et al., 2013), SVD factorization of the PMI matrix, and SVD factorization of the log-count (LC) matrix. Results show that these methods are statistically correlated, which suggests that there is a unified theory behind these methods.
Additionally, we analyze how the training process, i.e., hyperparameters (such as the number of negative samples) and random initialization, and the choice of training corpus influence word embeddings. Our findings include that different training corpora result in significantly different GloVe embeddings, and that the main difference between embedding spaces comes from the algorithms, although hyperparameters also have a certain influence. These findings not only provide some interesting insights into word embeddings but also fit nicely with our intuition, which further supports RPD as a suitable measure for quantifying the relationship between different sets of word embeddings.
2 Background
Before introducing RPD, we review the theory behind some static word embedding methods and discuss previous work investigating the relationship between embedding spaces.
2.1 Word Embedding Models
We consider the following four word embedding models: SGNS, GloVe, SVD-PMI, and SVD-LC. SGNS and GloVe are two widely used embedding methods, while SVD-PMI and SVD-LC are matrix factorization methods that are intrinsically related to SGNS and GloVe (Levy and Goldberg, 2014; Levy et al., 2015; Yin and Shen, 2018).
The embeddings of all the words form an embedding matrix $E \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of each word vector and $n$ is the size of the vocabulary.
SGNS maximizes a likelihood function for word–context pairs that occur in the dataset and minimizes it for randomly sampled unobserved pairs, i.e., negative samples (NS). We denote the method with $k$ negative samples as SGNS-$k$.
GloVe factorizes the log-count matrix shifted by the entire vocabulary's bias terms. The biases here are parameters learned stochastically with an objective weighted according to the frequency of words.
SVD factorizes a signal matrix $M$, which aims at reducing the dimension of the co-occurrence statistics. The resulting embedding is taken from the truncated factorization $M \approx U_d \Sigma_d V_d^{\top}$, where $d$ is the dimension of the word embeddings. We denote the method as SVD-PMI if the signal $M$ is the PMI matrix, and SVD-LC if the signal is the log-count matrix.
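As a concrete illustration, here is a minimal sketch of the SVD pipeline on a toy corpus. The use of positive PMI (clipping negative values to zero so unseen pairs stay finite) and the output form $E = U_d \Sigma_d^{1/2}$ are assumptions of this sketch, not details taken from the paper's experimental setup.

```python
import numpy as np

# Toy corpus with a window-1 co-occurrence count matrix C.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            C[idx[w], idx[corpus[j]]] += 1

# Positive PMI signal (an assumption of this sketch; unseen pairs get 0).
total = C.sum()
Pw, Pc = C.sum(axis=1) / total, C.sum(axis=0) / total
with np.errstate(divide="ignore"):
    pmi = np.log(C / total) - np.log(np.outer(Pw, Pc))
ppmi = np.maximum(pmi, 0.0)

# Truncated SVD; the embedding form U_d * Sigma_d^{1/2} is assumed here.
U, S, _ = np.linalg.svd(ppmi)
d = 2
E = U[:, :d] * np.sqrt(S[:d])
print(E.shape)  # one d-dimensional vector per vocabulary word
```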
Although the scope of this paper is standard word embeddings learned at the word level, RPD could be adapted to analyze embeddings learned from word pieces, for example fastText (Bojanowski et al., 2017), and contextualized embeddings (Peters et al., 2018; Devlin et al., 2019).
2.2 Relationship Between Embedding Spaces
Levy and Goldberg (2014) provide a good analysis of the connection between SGNS and SVD. They suggest that SGNS essentially factorizes the pointwise mutual information (PMI) matrix. However, their analysis is based on the assumption of no dimension constraint in SGNS, which does not hold in practice. Furthermore, their analysis is not suitable for methods besides SGNS and PMI models, since the theoretical derivation relies on the specific objective of SGNS.
Yin and Shen (2018) provide a way to select the best dimension of word embeddings for specific tasks by exploring the relations between embedding spaces of different dimensions. They introduce the Pairwise Inner Product (PIP) loss, a unitary-invariant metric for measuring the distance between word embeddings (Smith et al., 2017). Unitary-invariance states that two embedding spaces are equivalent if one can be obtained from the other by multiplying by a unitary matrix. However, the PIP loss is not suitable for numerical comparison across embedding spaces, since it has different energy for different embedding spaces.
3 Quantifying Distances between Embeddings
In this section, we give the definition of RPD and describe its properties, which make RPD a suitable and effective measure of the distance between embedding spaces. Note that two embedding spaces do not need to have the same vocabulary: RPD can be calculated on the intersection of their vocabularies.
3.1 RPD
For the following discussion, we always use the Frobenius norm as the norm of matrices.
Definition 1.
(RPD) The RPD between embedding matrices $E_1$ and $E_2$ is defined as follows:

$$\mathrm{RPD}(E_1, E_2) = \frac{\|\hat{E}_1\hat{E}_1^{\top} - \hat{E}_2\hat{E}_2^{\top}\|^2}{2\,\|\hat{E}_1\hat{E}_1^{\top}\|\,\|\hat{E}_2\hat{E}_2^{\top}\|},$$

where $\hat{E}_i$ comes from dividing each entry of $E_i$ by the standard deviation of the entries of $E_i$. For convenience, we write $E_i$ for $\hat{E}_i$ in the following discussion.

The numerator of RPD respects the unitary-invariant property of word embeddings, which means that a unitary transformation (i.e., rotation) preserves the relative geometry of an embedding space and leaves $EE^{\top}$ unchanged. The denominator is a normalization that allows us to regard the whole embedding matrix as an integrated part (i.e., RPD does not correlate with the number of words in the embedding spaces). This step makes comparisons across methods possible.
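A direct implementation can be sketched as follows, under the form reconstructed above (the factor of 2 in the denominator is part of that reconstruction). The example checks the unitary-invariance of the numerator and shows that two independent isotropic embedding spaces land near the value 1 discussed in Section 3.2.

```python
import numpy as np

def rpd(E1, E2):
    """RPD between two embedding matrices (rows = words of a shared
    vocabulary), following the definition reconstructed above: normalize
    each matrix by the standard deviation of its entries, then compare
    the pairwise-inner-product (Gram) matrices."""
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    G1, G2 = E1 @ E1.T, E2 @ E2.T
    num = np.linalg.norm(G1 - G2) ** 2            # squared Frobenius norm
    return num / (2 * np.linalg.norm(G1) * np.linalg.norm(G2))

rng = np.random.default_rng(0)
E = rng.standard_normal((2000, 50))
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))  # a random unitary map
F = rng.standard_normal((2000, 50))                 # an independent space

print(rpd(E, E @ Q))  # ~0: a rotation preserves the Gram matrix
print(rpd(E, F))      # close to 1: independent isotropic spaces
```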
3.2 Statistical Properties of RPD
We adopt the widely used isotropy assumption (Arora et al., 2016) that the ensemble of word vectors consists of i.i.d. draws generated by $v = s\,\hat{v}$, where $\hat{v}$ is drawn from a spherical Gaussian distribution and $s$ is a scalar random variable. In our case, we can assume that each entry of an embedding matrix $E$ comes from a standard normal distribution: $E_{ij} \sim N(0, 1)$. Note that the assumption may not always hold in practice, especially for other embeddings such as contextualized embeddings. However, under the isotropic conditions, the statistical properties derived below are intuitively and empirically plausible. Besides, those properties mainly serve to interpret the value of RPD on its own. Since RPD is, in many cases, used for comparison, we should be comfortable with the assumption.
Upper bound
We estimate the asymptotic upper bound of RPD. By expanding the numerator of RPD, we get (1).

$$\mathrm{RPD}(E_1, E_2) = \frac{\|E_1E_1^{\top}\|^2 + \|E_2E_2^{\top}\|^2}{2\,\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|} - \frac{\langle E_1E_1^{\top}, E_2E_2^{\top}\rangle}{\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|} \quad (1)$$

Applying the Cauchy–Schwarz inequality to the last term of (1) (the inner product of matrices $A$ and $B$ is defined as $\langle A, B\rangle = \mathrm{tr}(A^{\top}B)$), and noting that $\langle E_1E_1^{\top}, E_2E_2^{\top}\rangle = \|E_2^{\top}E_1\|^2 \geq 0$, we have the following estimation.

$$0 \leq \mathrm{RPD}(E_1, E_2) \leq \frac{\|E_1E_1^{\top}\|^2 + \|E_2E_2^{\top}\|^2}{2\,\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|} = \frac{1}{2}\left(\frac{\|E_1E_1^{\top}\|}{\|E_2E_2^{\top}\|} + \frac{\|E_2E_2^{\top}\|}{\|E_1E_1^{\top}\|}\right) \quad (2)$$
By the law of large numbers, we can prove that $\|EE^{\top}\|^2/n^2 \to d$, and hence $\|E_1E_1^{\top}\|/\|E_2E_2^{\top}\| \to 1$ (Appendix A). Then, we can tell from (2) that RPD is bounded by 1 when $n \to \infty$. In practice, the number of words is large enough to keep the maximum of RPD around 1, which means RPD is well-defined numerically.

Normality For $E_1$ and $E_2$ satisfying the assumption above, if $E_1$ is independent of $E_2$, we can show that RPD distributes normally, from both an empirical and a theoretical perspective. Theoretically, by applying the central limit theorem to the numerator and the law of large numbers to the denominator of RPD, we obtain the normality of RPD under the condition $n \to \infty$, $d \to \infty$, where $d/n$ remains constant (Appendix B). Empirically, we can use Monte Carlo simulation to show the normality and to estimate the mean and variance of RPD (Appendix C). With these properties, we can perform a hypothesis test (z-test) to evaluate the independence of two embedding spaces.
3.3 Geometric Interpretation of RPD
From equation (1), we can tell that the first term goes to 1 as $n \to \infty$ (Appendix A). So we only need to discuss the second term.
For the $i$-th row of $E_1E_1^{\top}$, we have the vector $u_i = (\langle w_i, w_1\rangle, \ldots, \langle w_i, w_n\rangle)$, where $w_j$ is word $j$'s vector in embedding $E_1$ and $n$ is the number of words. We can interpret $u_i$ as another representation of word $i$, projected onto the space spanned by the word vectors. For convenience, we denote $E_1E_1^{\top}$ with its $i$-th row as $u_i$, and likewise $E_2E_2^{\top}$ with its $i$-th row as $v_i$.

We can prove that $\mathrm{RPD}(E_1, E_2) \approx 1 - \sum_i \|u_i\|\,\|v_i\|\cos\theta_i \,/\, (\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|)$, where $\theta_i$ is the angle between $u_i$ (the $i$-th row vector of $E_1E_1^{\top}$) and $v_i$ (the $i$-th row vector of $E_2E_2^{\top}$) (Appendix D). Therefore, we can understand the value of RPD from the perspective of cosine similarity between vectors.
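The row-wise cosine view can be verified numerically: computing RPD directly and via the cosine decomposition of $\langle E_1E_1^{\top}, E_2E_2^{\top}\rangle$ gives the same value (the RPD form used here is the one reconstructed in Section 3.1).

```python
import numpy as np

rng = np.random.default_rng(1)
E1 = rng.standard_normal((500, 50))
E2 = rng.standard_normal((500, 50))
E1, E2 = E1 / E1.std(), E2 / E2.std()
A, B = E1 @ E1.T, E2 @ E2.T
nA, nB = np.linalg.norm(A), np.linalg.norm(B)

# Direct computation of the reconstructed RPD form.
rpd_val = np.linalg.norm(A - B) ** 2 / (2 * nA * nB)

# Row-wise view: <A, B> = sum_i ||u_i|| ||v_i|| cos(theta_i), where u_i, v_i
# are the i-th rows of the two Gram matrices and theta_i the angle between them.
u, v = np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)
cos = (A * B).sum(axis=1) / (u * v)
via_rows = (nA**2 + nB**2) / (2 * nA * nB) - (u * v * cos).sum() / (nA * nB)

print(abs(rpd_val - via_rows))  # identical up to floating-point error
```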
3.4 RPD and Performance
As Yin and Shen (2018) discuss, the usability of word embeddings, such as using them to solve analogy and relatedness tasks, is important to practitioners. By applying different sets of word embeddings to word similarity and word analogy tasks (Mikolov et al., 2013), we study the relationship between RPD and word embeddings' performance. Specifically, we set the word embeddings produced by SGNS-25 as a starting point and use another set of word embeddings, for example GloVe, as an end point. Then we get a two-dimensional point whose coordinates are their RPD and their absolute change in performance on the word similarity (https://aclweb.org/aclwiki/WordSimilarity353_Test_Collection_(State_of_the_art)) and analogy (https://aclweb.org/aclwiki/Google_analogy_test_set_(State_of_the_art)) tasks.
Plotting those points in Figure 1, we can tell that, within a certain range of RPD, a larger RPD between two sets of word embeddings means a bigger gap in their absolute performance. Intuitively, RPD is strongly related to cosine similarity, which is the measure used for word similarity. RPD also shares the property of PIP loss that a small distance leads to a small difference in relatedness and analogy tasks. We obtain similar results when the starting point is a different embedding space.
Note that this section serves to demonstrate that the performance variation of different embedding spaces (at least on word similarity and analogy tasks) is correlated with their RPD. While we are aware of the relevance of other downstream tasks, we do not explore further, since our focus lies in investigating the intrinsic geometric relation between embedding spaces.
4 Experiment
The following experiments apply RPD to explore some questions of interest and further demonstrate that RPD is suitable for investigating the relations between embedding spaces. We leave applying RPD to improve specific NLP tasks to future research. For example, RPD could be used to guide combining different embeddings, which could help us produce better meta-embeddings (Kiela et al., 2018).
4.1 Setup
If not explicitly stated, the experiments are performed on the Text8 corpus (Mahoney, 2011), a standard benchmark corpus used for various natural language tasks (Yin and Shen, 2018). For all methods we experiment with, we train 300-dimensional embeddings with a window size of 10 and normalize the embedding matrices by their standard deviation (the code can be found on Bitbucket: https://bitbucket.org/omerlevy/hyperwords). The default number of negative samples for SGNS is 15.
4.2 Different Algorithms Produce Different Embeddings
Dependence of SGNS and SVD
As discussed in the introduction, the relationship between embeddings trained with SGNS and those trained with SVD remains controversial (Arora et al., 2016; Mimno and Thompson, 2017). We use the results obtained in Section 3.2 to test their dependence. For example, if one believes that an embedding $E_1$ trained with SGNS and an embedding $E_2$ trained with SVD-PMI have no relationship, then the null hypothesis $H_0$ would be: $E_1$ and $E_2$ are independent.

Methods    GloVe    SVD-PMI    SVD-LC
SGNS-25    0.792    0.609      0.847
SGNS-15    0.773    0.594      0.837
SGNS-5     0.725    0.550      0.805
SGNS-1     0.719    0.511      0.799

Table 1: RPDs between embeddings produced by different methods.
In our case, we estimate the mean $\mu_0$ and standard deviation $\sigma_0$ of RPD under $H_0$ with Monte Carlo simulation on randomly initialized embeddings. Take RPD(SGNS-1, SVD-PMI) = 0.511 from Table 1 as an example: the statistic $z = (0.511 - \mu_0)/\sigma_0$ lies far in the tail of the null distribution, which means the p-value is below 0.01. Thus, we can confidently reject $H_0$. Notice that we can test any two sets of word embeddings with this method. It is not hard to see that no pair of word embeddings in Table 1 is independent, which suggests that there exists a unified theory behind these methods.
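A sketch of this z-test, with the null mean and standard deviation estimated by Monte Carlo simulation. The sizes $n$, $d$ and the number of trials are scaled far down for speed, so the null distribution here is only illustrative; in the paper's setting they would match the real embeddings, and the RPD form is the one reconstructed in Section 3.1.

```python
import numpy as np

def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    A, B = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(A - B) ** 2 / (2 * np.linalg.norm(A) * np.linalg.norm(B))

# Null distribution of RPD for independent isotropic embeddings,
# estimated by Monte Carlo (small n, d, and few trials for speed).
rng = np.random.default_rng(0)
n, d, trials = 400, 50, 30
null = [rpd(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
        for _ in range(trials)]
mu, sigma = np.mean(null), np.std(null, ddof=1)

observed = 0.511  # e.g. RPD(SGNS-1, SVD-PMI) from Table 1
z = (observed - mu) / sigma
print(mu, sigma, z)  # |z| >> 2.58: reject independence at the 1% level
```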
SGNS is Closest to SVD-PMI
With the help of RPD, it is also interesting to investigate the distances between embeddings produced by different methods. Here, we calculate the RPDs among SGNS (with 25, 15, 5, and 1 negative samples), GloVe, SVD-PMI, and SVD-LC; the results are shown in Table 1. For every number of negative samples, SGNS is closer to SVD-PMI than to GloVe or SVD-LC.
Hyperparameters Have Influence on Embeddings
From Table 1, an interesting phenomenon is that SGNS becomes closer to the other methods as the number of negative samples decreases, which suggests that negative sampling is one of the factors driving SGNS away from the matrix factorization methods.
With the RPDs between different sets of word embeddings, we can plot the embedding spaces in 2D by treating each space as a single point. We first fix the points for SVD-PMI and SVD-LC, then draw the other points according to their RPDs to these methods. Figure 2 helps us see intuitively how negative sampling affects the embedding. Increasing the number of negative samples pulls SGNS away from SVD-PMI. Combining Table 1 and Figure 2, we can tell that although hyperparameters influence the embeddings to some extent, the main difference comes from the algorithms.
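The 2-D layout can be produced with classical multidimensional scaling (MDS) on the matrix of pairwise RPDs. The coordinates below are made up for illustration (and chosen to be exactly 2-D embeddable so the recovery is exact); they are not the values behind Figure 2.

```python
import numpy as np

# Hypothetical 2-D coordinates for four embedding spaces, from which an
# illustrative pairwise "RPD" matrix is derived.
labels = ["SVD-PMI", "SVD-LC", "SGNS-1", "SGNS-25"]
true_pts = np.array([[0.0, 0.0], [0.4, 0.0], [0.1, 0.5], [0.2, 0.6]])
D = np.linalg.norm(true_pts[:, None] - true_pts[None, :], axis=-1)

# Classical MDS: double-center the squared distances and take the top-2
# eigenvectors; this recovers the layout up to rotation/reflection.
n = len(D)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, V = np.linalg.eigh(B)                          # eigenvalues ascending
pts = V[:, -2:] * np.sqrt(np.maximum(w[-2:], 0.0))

recon = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
for lab, p in zip(labels, pts):
    print(lab, p.round(3))
```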
4.3 Different Initializations Barely Influence Embeddings
Random initialization produces different embeddings under the same algorithm and hyperparameters. While those embeddings usually achieve similar performance on downstream tasks, people are still concerned about their effects. We investigate the influence of random initialization for GloVe and SGNS.
We train embeddings in the same setting multiple times and compute the average RPD between runs for each method. For SGNS, the average RPD across random initializations is 0.027; for GloVe, it is 0.059.

We can tell that different random initializations produce essentially the same embeddings. Neither SGNS nor GloVe shows a significant RPD across initializations, which suggests that random initialization has little influence on word embeddings' performance (Section 3.4). However, SGNS seems to be more stable in this setting.
4.4 Different Corpora Produce Different Embeddings
It is well known that different corpora produce different word embeddings. However, it is hard to tell how different they are and whether the difference influences downstream applications (Antoniak and Mimno, 2018). Knowing this would help researchers choose algorithms in specific scenarios, for example, evolving semantic discovery (Yao et al., 2018; Kozlowski et al., 2019). These works focus on the semantic evolution of words, but the corpora differ across time scales. Their methods use word embeddings to study semantic shift, which might be influenced by the word embeddings being trained on different corpora, thus yielding unreliable results. In this case, it would be important to choose an algorithm less prone to influence by differences in corpora.
We train word embeddings on each of Text8 (Wikipedia domain, 25,097 unique words), WMT14 news crawl (http://www.statmt.org/wmt14/; newswire domain, 24,359 unique words), and TED speech (https://workshop2016.iwslt.org/; speech domain, 7,389 unique words). We compute RPDs on the intersections of their vocabularies.
Corpus pair    SGNS    GloVe
Text8–WMT14    0.168   0.686
Text8–TED      0.119   0.758
WMT14–TED      0.175   0.716

Table 2: RPDs between embeddings trained on different corpora.
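A sketch of computing RPD across corpora with different vocabularies, as described above: the two embedding tables are restricted to the same rows (their shared words, in the same order) before RPD is applied. The toy vocabularies and random embeddings are stand-ins for the real trained models, and the RPD form is the one reconstructed in Section 3.1.

```python
import numpy as np

def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    A, B = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(A - B) ** 2 / (2 * np.linalg.norm(A) * np.linalg.norm(B))

# Two embedding tables over different vocabularies (toy stand-ins for,
# e.g., Text8 vs. WMT14 embeddings).
rng = np.random.default_rng(0)
vocab1 = {w: i for i, w in enumerate(["the", "cat", "dog", "mat"])}
vocab2 = {w: i for i, w in enumerate(["dog", "the", "rug", "cat"])}
E1 = rng.standard_normal((len(vocab1), 8))
E2 = rng.standard_normal((len(vocab2), 8))

# Align the rows on the shared vocabulary before computing RPD.
shared = sorted(set(vocab1) & set(vocab2))
E1s = E1[[vocab1[w] for w in shared]]
E2s = E2[[vocab2[w] for w in shared]]
val = rpd(E1s, E2s)
print(shared, val)
```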
From Table 2, we can tell that SGNS is consistently more stable than GloVe across domains. We suggest that this is because GloVe trains the embedding directly on the co-occurrence matrix, which is influenced more by the corpus.
5 Discussion
While our work investigates some interesting questions about word embeddings, there are many others that could be explored with the help of RPD. We discuss some of them as follows.
5.1 RPD and Cross-lingual Word Embeddings
Artetxe et al. (2018) provide a framework to obtain bilingual embeddings, the core step of which is an orthogonal transformation; other existing methods can be seen as its variations. The framework proposes to train monolingual embeddings separately and then map them into a shared embedding space with a linear transformation.
While a linear transformation is not guaranteed to align two embedding spaces from different languages, RPD could potentially indicate how much different language pairs benefit from mapping with an orthogonal transformation. Since RPD is unitary-invariant, we can calculate the RPD between embedding spaces of different language pairs: the smaller the RPD, the better the framework could align the two embedding spaces.
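The orthogonal-transformation step can be sketched with the classical Procrustes solution; the synthetic `X` and `Y` below stand in for source- and target-language embeddings aligned by a seed dictionary, so the specific sizes and noise level are assumptions of this sketch.

```python
import numpy as np

# Orthogonal Procrustes: the optimal orthogonal map from a source embedding
# space X to a target space Y, the core step of the framework discussed above.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))   # hidden rotation
Y = X @ Q + 0.01 * rng.standard_normal((200, 20))    # rotated plus small noise

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt            # argmin_W ||X W - Y||_F subject to W^T W = I

residual = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print(residual)  # small residual: this pair aligns well orthogonally
```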
5.2 RPD and Post-Processing Word Embeddings
Post-processing word embeddings can be useful in many ways. For example, Vulić et al. (2018) retrofit word embeddings with external linguistic resources, such as WordNet, to obtain better embeddings; Rothe and Schütze (2016) decompose the embedding space to get better performance in specialized domains; and Mu and Viswanath (2018) obtain stronger embeddings by eliminating the common mean vector and a few top dominating directions.
RPD could serve as a metric to evaluate how the embedding space changes intrinsically after postprocessing.
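For instance, a sketch of measuring such a change with RPD, using an all-but-the-top style post-processing (Mu and Viswanath, 2018); the injected anisotropy, the choice of removing 3 components, and the RPD form (reconstructed in Section 3.1) are assumptions of this sketch.

```python
import numpy as np

def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    A, B = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(A - B) ** 2 / (2 * np.linalg.norm(A) * np.linalg.norm(B))

rng = np.random.default_rng(0)
common = rng.standard_normal(50) * 0.5
E = rng.standard_normal((500, 50)) + common   # embeddings sharing a direction

Ec = E - E.mean(axis=0)                       # remove the common mean vector
_, _, Vt = np.linalg.svd(Ec, full_matrices=False)
top = Vt[:3]                                  # top 3 dominating directions
post = Ec - Ec @ top.T @ top                  # project them out

print(rpd(E, post))  # > 0: the post-processed space differs intrinsically
```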
5.3 RPD and Contextualized Word Embeddings
Contextualized embeddings are popular NLP techniques that significantly improve a wide range of NLP tasks (Bowman et al., 2015; Rajpurkar et al., 2018). To understand why contextualized embeddings are beneficial to those tasks, many works investigate the nature of the syntactic (Liu et al., 2019), semantic (Liu et al., 2019), and commonsense knowledge (Zhou et al., 2019) contained in such representations.
However, we still know little about the vector space of contextualized embeddings and their relationship with traditional word embeddings, which is important to further apply contextualized embeddings in various scenarios (Lin and Smith, 2019). RPD can potentially serve to help us better understand contextualized embeddings in future research.
6 Conclusion
In this paper, we propose RPD, a metric that quantifies the distance between embedding spaces (i.e., different sets of word embeddings). With the help of RPD and its properties, we verify some intuitions and answer some questions about word embeddings. Having justified RPD theoretically and empirically, we believe it offers a new perspective for understanding and comparing word embeddings.
Acknowledgments
I would like to thank Dr. Zi Yin, Dr. Vered Shwartz, Maarten Sap, and Jorge Balazs for their feedback that greatly improved the paper.
References
Maria Antoniak and David Mimno. 2018. Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics 6, pp. 107–119.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems 29, pp. 4349–4357.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
Douwe Kiela, Changhan Wang, and Kyunghyun Cho. 2018. Dynamic meta-embeddings for improved sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1466–1477.
Austin C. Kozlowski, Matt Taddy, and James A. Evans. 2019. The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review 84(5), pp. 905–949.
Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225.
Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Maryland, pp. 302–308.
Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pp. 2177–2185.
Lucy H. Lin and Noah A. Smith. 2019. Situating sentence embedders with nearest neighbor overlap. ArXiv abs/1909.10724.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1073–1094.
Matt Mahoney. 2011. Large text compression benchmark.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ArXiv abs/1301.3781.
David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2873–2878.
Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems 26, pp. 2265–2273.
Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 784–789.
Sascha Rothe and Hinrich Schütze. 2016. Word embedding calculus in meaningful ultradense subspaces. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 512–517.
Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. CoRR abs/1702.03859.
Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Post-specialisation: Retrofitting vectors of words unseen in lexical resources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 516–527.
Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pp. 673–681.
Wenpeng Yin and Hinrich Schütze. 2016. Learning word meta-embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1351–1360.
Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems 31, pp. 887–898.
Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan Huang. 2019. Evaluating commonsense in pre-trained language models. ArXiv abs/1911.11931.
Appendix A. The Limit of $\|EE^{\top}\|^2/n^2$

As discussed before, in our case we can assume the entries of $E$ are i.i.d. $E_{ik} \sim N(0,1)$, where $E_{ik}$ is the $k$-th entry of the word vector $w_i$ in $E$.

$$\frac{\|EE^{\top}\|^2}{n^2} = \frac{1}{n^2}\sum_{i \neq j}\langle w_i, w_j\rangle^2 + \frac{1}{n^2}\sum_{i}\|w_i\|^4 \quad (3)$$

By the assumption, we know that $\langle w_i, w_j\rangle$ is identically distributed for any $i \neq j$. By applying the law of large numbers, the term $\frac{1}{n^2}\sum_{i \neq j}\langle w_i, w_j\rangle^2$ goes to $\mathbb{E}[\langle w_1, w_2\rangle^2]$ as $n$ goes to $\infty$. The term $\frac{1}{n^2}\sum_{i}\|w_i\|^4$ goes to zero as $n$ goes to $\infty$. Then, we know that $\|EE^{\top}\|^2/n^2 \to \mathbb{E}[\langle w_1, w_2\rangle^2]$.

We only need to calculate $\mathbb{E}[\langle w_1, w_2\rangle^2]$.

$$\mathbb{E}[\langle w_1, w_2\rangle^2] = \mathbb{E}\Big[\Big(\sum_{k=1}^{d} E_{1k}E_{2k}\Big)^2\Big] = \sum_{k=1}^{d}\mathbb{E}[E_{1k}^2]\,\mathbb{E}[E_{2k}^2] \quad (4)$$

Simple calculation shows that $\mathbb{E}[E_{1k}E_{2k}] = 0$ and $\mathbb{E}[E_{1k}^2]\mathbb{E}[E_{2k}^2] = 1$, so the cross terms in (4) vanish. Then $\mathbb{E}[\langle w_1, w_2\rangle^2] = d$, where $d$ is the dimension of the word embedding. Thus, $\|EE^{\top}\|^2/n^2 \to d$, and consequently $\|E_1E_1^{\top}\|/\|E_2E_2^{\top}\| \to 1$.
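This limit is easy to check numerically under the i.i.d. standard normal assumption:

```python
import numpy as np

# For E with i.i.d. N(0,1) entries, ||E E^T||_F^2 / n^2 should approach d
# as the number of words n grows (the finite-n bias is about (d^2 + 2d)/n).
rng = np.random.default_rng(0)
d = 20
vals = []
for n in (200, 2000):
    E = rng.standard_normal((n, d))
    vals.append(np.linalg.norm(E @ E.T) ** 2 / n ** 2)
print(vals)  # both close to d = 20; the n = 2000 value is closer
```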
Appendix B. Normality of RPD

Let us review the form of RPD.

$$\mathrm{RPD}(E_1, E_2) = \frac{\|E_1E_1^{\top} - E_2E_2^{\top}\|^2}{2\,\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|} \quad (5)$$

As we discussed in A, $\|E_iE_i^{\top}\|^2/n^2 \to d$ as $n \to \infty$, so the denominator concentrates around $2n^2d$. We only have to prove that the numerator distributes normally. The key is how to apply the central limit theorem (CLT).

We denote $S$ as follows.

$$S = \|E_1E_1^{\top} - E_2E_2^{\top}\|^2 = \sum_{i,j}\big(\langle u_i, u_j\rangle - \langle v_i, v_j\rangle\big)^2 \quad (6)$$

Here $u_i$ and $v_i$ are the vectors of word $i$ in $E_1$ and $E_2$. Notice that the diagonal terms ($i = j$) do not contribute to the variance if we analyze the second moment of the numerator. So it is equivalent to prove that $\tilde{S} = \sum_{i \neq j}(\langle u_i, u_j\rangle - \langle v_i, v_j\rangle)^2$ distributes normally. We project $\tilde{S}$ onto $\hat{S} = \sum_i \mathbb{E}[\tilde{S} \mid u_i, v_i]$. Simple calculation shows that the projection error is asymptotically negligible; then, by the Hájek projection theorem, $\tilde{S}$ has the same limiting distribution as $\hat{S}$. It is not hard to see that each random variable in the sum $\hat{S}$ is independent of the others. This allows us to apply the CLT to $\hat{S}$ and obtain the asymptotic normality of the numerator. Thus, RPD distributes normally.
Appendix C. Monte Carlo Simulation

Here is how we perform the Monte Carlo simulation. We independently produce two matrices $E_1, E_2 \in \mathbb{R}^{n \times d}$ with each entry i.i.d. $N(0,1)$. Then we calculate $\mathrm{RPD}(E_1, E_2)$ and get the first RPD value. Repeating the process 5000 times, we get a vector of RPDs. The histogram of this vector exhibits a normal distribution, and we estimate the mean and variance of the distribution by calculating the sample mean and variance of the vector of RPDs.
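A scaled-down version of this simulation (smaller $n$ and $d$, and far fewer than 5000 repetitions, for speed), under the RPD form reconstructed in Section 3.1:

```python
import numpy as np

def rpd(E1, E2):
    E1, E2 = E1 / E1.std(), E2 / E2.std()
    A, B = E1 @ E1.T, E2 @ E2.T
    return np.linalg.norm(A - B) ** 2 / (2 * np.linalg.norm(A) * np.linalg.norm(B))

# Repeatedly draw two independent N(0,1) embedding matrices and record
# their RPD; the sample mean/std estimate the null distribution's moments.
rng = np.random.default_rng(0)
n, d = 300, 30
samples = np.array([rpd(rng.standard_normal((n, d)),
                        rng.standard_normal((n, d))) for _ in range(200)])
print(samples.mean(), samples.std(ddof=1))
# A histogram of `samples` is bell-shaped, with mean just below 1.
```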
Appendix D. Geometric Interpretation of RPD

Now we consider the general case, where $E_1$ and $E_2$ are embeddings with $n$ words. Then

$$\mathrm{RPD}(E_1, E_2) = \frac{\|E_1E_1^{\top}\|^2 + \|E_2E_2^{\top}\|^2}{2\,\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|} - \frac{\langle E_1E_1^{\top}, E_2E_2^{\top}\rangle}{\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|} \quad (7)$$

We denote the $i$-th row of $E_1E_1^{\top}$ as $u_i$, and the $i$-th row of $E_2E_2^{\top}$ as $v_i$, so that $\langle E_1E_1^{\top}, E_2E_2^{\top}\rangle = \sum_i \langle u_i, v_i\rangle = \sum_i \|u_i\|\,\|v_i\|\cos\theta_i$, where $\theta_i$ is the angle between $u_i$ and $v_i$.

It is not hard to see that the first term of (7) goes to 1 when $n$ is large enough (Appendix A). Then we get $\mathrm{RPD}(E_1, E_2) \approx 1 - \sum_i \|u_i\|\,\|v_i\|\cos\theta_i \,/\, (\|E_1E_1^{\top}\|\,\|E_2E_2^{\top}\|)$. Considering the isotropic assumption again, another observation is that $\cos\theta_i$ distributes approximately normally.