1 Introduction
The study of linguistic regularities in distributional word embeddings—that the difference vector calculated between the vectors
jump and jumped shows a similar direction to that of walk and walked, and so on—has been both stimulating and controversial. While a number of such regularities appear to hold, across several different kinds of embeddings, the standard 3CosAdd analogy test used to measure the presence of these regularities has come under fire for confounding analogical regularities with unrelated properties of semantic embeddings. Several papers have nonetheless proposed theoretical explanations for why linguistic regularities should hold in distributional word embeddings, and, particularly in light of these controversies, it is important to examine the soundness of their arguments. Allen and Hospedales (2019) develop such an explanation by linking the semantic definition of an analogy to paraphrases. In the sense of Gittens et al. (2017), paraphrases are sets of words which are semantically and distributionally closely equivalent to another word or set of words—for example, king may be paraphrased by {man, royal}. Allen and Hospedales argue that the standard analogy criterion, that king − man + woman = queen, is equivalent to a criterion whereby {king, woman} paraphrases {man, queen}. With this in mind, it becomes possible to rewrite the arithmetic analogy criterion in terms of vectors encoding the pointwise mutual information (PMI) between words and their contexts, and to decompose the error in the analogy equality into several components, including a paraphrase error
term measuring the degree to which the critical paraphrase holds. Making use of an assumption that the word2vec embedding is a linear transformation of the PMI matrix, they argue that results in terms of PMI apply to word vectors. Thus, under their explanation, a major part of success on an analogy
is due to the corresponding word sets, such as {king, woman} and {man, queen}, being close distributional paraphrases. We first review the literature on the analogy test itself, underlining known pitfalls which any explanation of linguistic regularities must navigate. We then show empirically that the relation between the PMI matrix and word2vec embeddings is to some degree linear, which may be enough to satisfy the assumption of Allen and Hospedales (2019). We further examine the proposed decomposition into error terms. We demonstrate that, empirically, these error terms tend to be undefined due to data sparseness, undermining their explanatory force. Most importantly, examining a number of analogies which pass the standard test, we show that the critical paraphrase error term is, contrary to the proposed explanation, very large. (Code is available at www.github.com/bootphon/paraphrases_do_not_explain_analogies.)
2 Related work
Early works proposing explanations of the analogical properties of word embeddings include Mikolov et al. (2013b) and Pennington et al. (2014). A geometrical explanation is proposed by Arora et al. (2016), but this explanation relies on very strong preconditions, notably, that the word vectors be distributed uniformly in space. Ethayarajh et al. (2019) also propose an explanation, providing a link between the PMI and the norm of word embeddings. However, as pointed out by Allen and Hospedales (2019), this explanation, too, rests on strong assumptions. Notably, the words involved in the analogy are required to be coplanar, a property that seems unlikely in light of the lack of parallelism we discuss in the next section.
3 Issues with the test
Issues have arisen with the standard way of measuring linguistic analogies. Levy and Goldberg (2014), Vylomova et al. (2016), Rogers et al. (2017), and Fournier et al. (2020) all demonstrate that the standard 3CosAdd measure conflates several very different properties of embeddings: it simultaneously measures not only the directional regularities suggested by typical illustrations of vectors in a parallelogram, but also the similarity of individual matched pairs such as king, man, as well as the global arrangement of vectors in semantic fields (such as king, queen, prince, … versus man, woman, child, … lying in distinct regions of the space). These issues undermine the construct validity of the standard analogy test, and this conflation of properties explains certain pathological behaviours of the test (Linzen, 2016; Rogers et al., 2017). In spite of these issues, Fournier et al. (2020) demonstrate, using alternative measures, that linguistic regularities are nevertheless coded as directional similarities. This parallelism is weak, with directions tending, in absolute terms, to be closer to orthogonal than to parallel, but it is present above the chance level set by unmatched word pairs.
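To make the measure under discussion concrete, the 3CosAdd test can be sketched as follows. This is an illustrative reimplementation, not the evaluation code of any of the papers cited; the exclusion of the query words from the argmax follows standard practice for the test.

```python
import numpy as np

def three_cos_add(emb, a, a_star, b):
    """3CosAdd: index of the vocabulary word maximizing cos(v, b - a + a*).

    emb is a (V, d) matrix of unit-normalized word vectors; a, a_star and b
    are row indices. The three query words are excluded from the argmax, as
    is standard for the analogy test.
    """
    target = emb[b] - emb[a] + emb[a_star]
    target = target / np.linalg.norm(target)
    scores = emb @ target  # cosine similarities, since rows are unit-norm
    scores[[a, a_star, b]] = -np.inf  # exclude the query words
    return int(np.argmax(scores))
```

On toy vectors arranged approximately in a parallelogram, the function recovers the expected fourth word of the analogy.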
Thus, before turning to Allen and Hospedales (2019), one of a number of theoretical attempts to explain performance on the 3CosAdd objective, we underscore that such demonstrations run the risk of explaining properties of the test which may be of secondary interest, or, conversely, of placing undue emphasis on the role of directional regularities, which have been shown to play only a small role in success on 3CosAdd.
4 Explaining analogies through paraphrases
For a word $w$ and a word $c$ which can appear in the context of $w$, the pointwise mutual information is defined as $\mathrm{PMI}(w, c) = \log \frac{p(w, c)}{p(w)\,p(c)}$. As shown by Levy and Goldberg (2014), skip-gram word2vec with negative sampling factorizes the PMI matrix: $WC^{\top} = \mathrm{PMI} - \log k$, with $W$ and $C$ the word and context embedding matrices of a word2vec model and $k$ the number of negative samples.
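As an illustrative sketch (not the implementation used in the experiments below), a PMI matrix can be estimated from a dense word-context co-occurrence count matrix as follows; unobserved pairs, whose PMI is $-\infty$, are set to 0, matching the substitution described in Section 5:

```python
import numpy as np

def pmi_matrix(cooc):
    """PMI(w, c) = log p(w, c) / (p(w) p(c)) from a (V, V) count matrix.

    Cells with zero counts have PMI -inf; they are replaced by 0 here,
    as in the empirical analysis below.
    """
    p_wc = cooc / cooc.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)  # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)  # context marginals
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc) - np.log(p_w) - np.log(p_c)
    pmi[np.isneginf(pmi)] = 0.0
    return pmi
```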
For two pairs of words $(a, a^{*})$ and $(b, b^{*})$ from the same semantic relation, the standard arithmetic analogy test criterion is that $w_{b^{*}} = w_{b} - w_{a} + w_{a^{*}}$. Writing $W = \{b, a^{*}\}$, $W^{*} = \{a, b^{*}\}$, and $p_{w}$ the PMI vector of $w$ (the row of the PMI matrix corresponding to $w$), Allen and Hospedales (2019) show that it is possible to rewrite the arithmetic analogy formula with PMI vectors, and to decompose the error in the equality into five terms as follows:
$p_{b^{*}} = p_{b} - p_{a} + p_{a^{*}} + \rho^{W,W^{*}} + \sigma^{W} - \sigma^{W^{*}} - \tau^{W} + \tau^{W^{*}}$   (1)
The error terms are vectors of length $V$ (the vocabulary size), with $\tau^{W}$ and $\tau^{W^{*}}$ constant vectors. For each context word $c_{j}$, the elements are defined as:
$\rho_{j}^{W,W^{*}} = \log \frac{p(c_{j} \mid W^{*})}{p(c_{j} \mid W)}, \qquad \sigma_{j}^{W} = \log \frac{p(W \mid c_{j})}{\prod_{w \in W} p(w \mid c_{j})}, \qquad \tau_{j}^{W} = \log \frac{p(W)}{\prod_{w \in W} p(w)}$   (2)
The authors claim that these terms can be embedded linearly into the word2vec embedding space by multiplying them by the Moore–Penrose pseudoinverse $C^{+}$ of the context matrix $C$: with $w$ the word2vec embedding of a word and $p_{w}$ its PMI vector, $w \approx C^{+} p_{w}$. Thus we get the final decomposition:
$w_{b^{*}} = w_{b} - w_{a} + w_{a^{*}} + C^{+}\left(\rho^{W,W^{*}} + \sigma^{W} - \sigma^{W^{*}} - \tau^{W} + \tau^{W^{*}}\right)$   (3)
The paraphrase error term $\rho^{W,W^{*}}$ is claimed to be small for successful analogies. Elaborating on the notation, $W$ is taken to paraphrase $W^{*}$ if, wherever all the words of $W$ appear together, we observe the same distribution of surrounding words as when all the words of $W^{*}$ appear together. The paraphrase error assesses the similarity of the distributions of words in the context of $W$ (all words in $W$ appearing together) versus $W^{*}$.
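In terms of estimated distributions, the element-wise paraphrase error is simply the log-ratio of the two conditional context distributions. A minimal sketch (ours, not the authors' code), assuming smoothed, strictly positive probability estimates:

```python
import numpy as np

def paraphrase_error(p_c_given_w_star_set, p_c_given_w_set):
    """rho_j = log p(c_j | W*) / p(c_j | W), for each context word c_j.

    Both inputs are length-V distributions over context words, estimated
    from positions where all words of W (respectively W*) occur together.
    A perfect paraphrase yields the zero vector; the L2 norm of this
    vector is the scalar paraphrase error used in the analysis below.
    """
    return np.log(p_c_given_w_star_set) - np.log(p_c_given_w_set)
```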
5 Linearity of the link between PMI and word2vec
Though it is true that there is a relation between the word2vec matrices and the PMI matrix, in practice the link is more complicated than a simple linear matrix factorization, due in part to the training tricks described in Mikolov et al. (2013a). The result of Allen and Hospedales (2019) requires that the mapping from PMI vectors to word2vec embeddings be “linear enough” for $C^{+} p_{w}$ to approximate the word2vec embedding $w$.
To assess this, we use the text8 corpus (a text dataset composed of 100 million characters from Wikipedia; Mahoney, 2006) both to train word2vec embeddings and to estimate a PMI matrix. We train a skip-gram architecture with negative sampling (a single negative sample), a negative sampling exponent equal to 1, no undersampling of common words, and a high dimension of 500; these parameters allow us to be as close as possible to a direct factorization of the PMI matrix. We replace infinite values in the PMI matrix by 0. In Figure 1(a), we show the distribution of the Pearson correlation coefficient (assessing the presence of a linear relation) between the word2vec embedding and the corresponding row of PMI for the ten thousand most frequent words in the corpus. As can be seen from the figure, the correlation tends to be between 0.5 and 0.8. As an example, in Figure 1(b), the word2vec embedding for king is plotted against the row of PMI corresponding to king. While the relation is not perfectly linear—many words have a correlation of around 0.55, far lower than that of king—the empirical relations shown here leave open the possibility that the relation may indeed be “sufficiently linear” for the assumption to be taken for granted. However, while linearity is necessary for the result of Allen and Hospedales (2019) to go through, it is not sufficient. In the next section, we assess the critical question of whether the paraphrase error is small enough to serve as an explanation for the success of linguistic analogies.
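The per-word correlations can be computed along the following lines. The exact quantity compared is an assumption on our part: we take the word2vec reconstruction of a PMI row (the dot products $w \cdot c_{j}$ over all contexts $j$) and correlate it with the corresponding empirical PMI row.

```python
import numpy as np

def rowwise_pearson(word_mat, ctx_mat, pmi):
    """Per-word Pearson correlation between the model's reconstruction of
    a PMI row (dot products w . c_j over all contexts j) and the
    empirical PMI row.

    word_mat: (V, d) word embeddings; ctx_mat: (V, d) context embeddings;
    pmi: (V, V) empirical PMI matrix. Returns a length-V vector of
    correlation coefficients.
    """
    recon = word_mat @ ctx_mat.T
    return np.array([np.corrcoef(recon[i], pmi[i])[0, 1]
                     for i in range(word_mat.shape[0])])
```

When the factorization is exact, every coefficient is 1; the spread below 1 measures how far the trained model is from a linear transformation of the PMI matrix.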
6 Empirical analysis of the error terms
We now seek to examine the proposed explanation by calculating the proposed error terms empirically. In practice, however, many of the terms are undefined, since they rely on co-occurrences unattested in practical corpora. The most extreme situation occurs when the two words of a paraphrase are never present in the same context window in the corpus. We found that only 16% of the paraphrase sets associated with the BATS analogy set (Gladkova et al., 2016)—for example, {king, woman}—were present together in the text8 corpus in a context window of length five. We refer to such paraphrase sets as “well-defined” with respect to the corpus. The problem of zero co-occurrence counts was anticipated by Allen and Hospedales (2019), who propose to restrict their analysis to the case where the context window is sufficiently large that all relevant terms are well defined. We stress that our word2vec vectors are also trained with a context window of five, and yield expected levels of performance on the BATS analogy test, despite having access to little training data on which to model co-occurrences such as {king, woman}, {queen, man}, and so on.
Table 1: Mean L2 norms of the error terms, by BATS category.

Category                     I01   I02   I05   I06   I07   I08   I09   I10   D02   D03   D05   D08   D10   E01   E02
Paraphrase error norm        177   153   111   127   126   124   138    97   102   122   130   110   107   124   176
Dependence errors sum norm  1006   938   867   903   957   883   952   908   856   893   514   585   699   749   848
All errors sum norm         1032   957   878   917   970   897   966   916   864   905   539   602   710   765   875

Category                     E03   E04   E05   E08   E09   E10   L02   L03   L04   L05   L06   L07   L08   L09   L10
Paraphrase error norm        162   176   155   229   179   190   197   189   209   206   133   169   185   175   432
Dependence errors sum norm   866   797   519   739   910   833   642   982   907  1103   921   995  1044  1017  1302
All errors sum norm          889   822   553   778   933   865   683  1007   939  1131   937  1016  1066  1040  1416
Table 2: Average and median rank of the true paraphrase pair among all candidate paraphrases, by BATS category (K = thousands).

Category      I01    I02    I05    I06    I07    I08    I09    I10     D02     D03     D05    D08    D10    E01    E02
Average rank  7762K  7589K  7759K  8744K  8160K  6454K  7028K  11889K  31952K  19558K  7857K  1506K  2556K  4394K  9507K
Median rank   1630K  2195K  3055K  3239K  2530K  4090K  3004K  4535K   6754K   3564K   3260K  1506K  2556K  2117K  1622K

Category      E03    E04    E05    E08    E09    E10     L02     L03    L04     L05    L06    L07    L08    L09    L10
Average rank  1305K  5611K  9192K  727K   8421K  11946K  52183K  1857K  12687K  6747K  2475K  7727K  4502K  4679K  16871K
Median rank   695K   1703K  1426K  854K   1908K  169K    52182K  1261K  2460K   1343K  2255K  2136K  1549K  1739K  785K
At a minimum, if the proposed explanation holds, the cases for which the error terms are empirically well-defined should show signs of the paraphrase error being relatively small. We now detail how we implemented the error terms in cases for which they were well-defined. We count co-occurrences in text8 for all triplets of words $(w_{1}, w_{2}, c)$, with $c$ at the center of the context window and $w_{1}$ and $w_{2}$ any paraphrase, both occurring anywhere within a context window of width five. We restrict analysis to the ten thousand most frequent word types for $w_{1}$ and $w_{2}$, yielding $10^{8}$ possible paraphrases. ($c$ is allowed to vary over all of the types included in the training for word2vec, of which there are 71290; thus, for each paraphrase, the error vectors have 71290 elements, one for each vocabulary word.) We use the relative frequencies as estimators of the joint probabilities, and marginalize to obtain the probabilities of the individual words and contexts. The error terms follow. Since this can still lead to ill-defined elements, we replace zero-valued probability estimates by a small constant $\epsilon$ (within reason, the value of $\epsilon$ is immaterial). We also replace any remaining undefined logarithm with 0.
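The dependence error terms of Equation (2) can be estimated from the smoothed relative frequencies along these lines; the function names and the $\epsilon$ substitution scheme are our own illustration of the procedure, not the exact implementation:

```python
import numpy as np

EPS = 1e-6  # smoothing constant; within reason, its exact value is immaterial

def safe_log(p, eps=EPS):
    """Log of a relative-frequency estimate, with zeros replaced by eps."""
    return np.log(np.where(p > 0, p, eps))

def dependence_errors(p_joint_c, p_w1_c, p_w2_c, p_joint, p_w1, p_w2):
    """sigma and tau for a two-word paraphrase set W = {w1, w2}.

    sigma_j = log p(w1, w2 | c_j) / (p(w1 | c_j) p(w2 | c_j))  (length-V vector)
    tau     = log p(w1, w2) / (p(w1) p(w2))                     (scalar)
    Both vanish when the words of W are statistically independent.
    """
    sigma = safe_log(p_joint_c) - safe_log(p_w1_c) - safe_log(p_w2_c)
    tau = safe_log(p_joint) - safe_log(p_w1) - safe_log(p_w2)
    return sigma, tau
```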
Table 1 shows the mean values of the L2 norms of the paraphrase error vectors across several categories of the BATS dataset. We compare them with the sum of the four dependence error terms (the dependence errors reflect statistical dependencies within $W$ and $W^{*}$ irrelevant to the analogy), as well as the sum of all five error terms (equal to the difference between the two sides of the analogy equality written in PMI vectors). The paraphrase error is indeed smaller than the other error terms. However, as we now show, the paraphrase error is not small enough to contribute substantially to the success of analogies. (We note also that the error values are relatively consistent between categories, while success on the analogy test differs greatly between categories.)
Take the norm of the paraphrase error vector as a measure of the divergence in PMI between two paraphrases. For an analogy $a : a^{*} :: b : b^{*}$ with associated paraphrases $W = \{b, a^{*}\}$ and $W^{*} = \{a, b^{*}\}$, we assess how many paraphrases are closer to $W$ than $W^{*}$ is, by calculating the rank of the norm of $\rho^{W,W^{*}}$ among all $\rho^{W,P}$, where $P$ spans all pairs of words constructible from the ten thousand most frequent words in the corpus. To do so, we define a Paraphrase Conditional Information (PCI) matrix. For a pair $P = (w_{1}, w_{2})$ and a context word $c_{j}$, we define $\mathrm{PCI}_{j,k(P)}$, the value at row $j$ and column $k(P)$, to be $\log p(c_{j} \mid w_{1}, w_{2})$, where $k(P)$ is a unique index associated with the tuple $(w_{1}, w_{2})$. We compute only the positive PCI, to obtain a sparse matrix. The difference between two PCI columns is a paraphrase error vector, and their Euclidean distance is the norm of the paraphrase error.
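The rank computation just described can be sketched as follows (our illustration, using a small dense PCI matrix; the actual matrix is sparse and has one column per candidate pair):

```python
import numpy as np

def paraphrase_rank(pci, query_col, true_col):
    """1-based rank of the true paraphrase among all candidate columns.

    pci[j, k] holds log p(c_j | pair k); the Euclidean distance between
    two columns is the norm of the corresponding paraphrase error
    vector. The rank counts the candidates strictly closer to the query
    column than the true pair, plus one.
    """
    dists = np.linalg.norm(pci - pci[:, [query_col]], axis=0)
    return int(np.sum(dists < dists[true_col])) + 1
```

A rank of 1 means the true pair is the closest candidate; the ranks reported in Table 2 are instead in the millions.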
We now compute, for each analogy, the distance between the PCI column of $W$ and every other column (paraphrase) of the PCI matrix, and calculate the rank of the true analogy pair $W^{*}$. Given that the analogy test generally succeeds in picking out $b^{*}$ as being the most similar to $b - a + a^{*}$ out of the entire vocabulary (modulo Linzen 2016), we would expect that, for successful analogies, the paraphrase error for the true analogy pair would be among the smallest, if small paraphrase error were the explanation for success. Table 2 displays the mean and median of this rank within each BATS category. The rank is extremely poor (in the millions): the paraphrase error in true analogies is far too high for it to be the explanation for their success. (Limiting the search to the paraphrases containing at least one of the words of $W^{*}$ still results in a very poor rank for $W^{*}$.)
7 Conclusion
Recent work has shown that, in spite of the standard analogy test’s confound with simple vector similarity, distributional word vectors genuinely do encode linguistic regularities as directional regularities above and beyond vector similarity (Fournier et al., 2020). Further research is warranted into the mechanisms by which distributional word embeddings come to show these regularities. However, the analysis of analogies as paraphrases does not hold up as an explanation of performance on the analogy test—nor would an explanation of performance on the 3CosAdd analogy test be a satisfying result, since the test is not a useful measure to begin with.
Acknowledgments
This work was funded in part by the European Research Council (ERC-2011-AdG-295810 BOOTPHON), the Agence Nationale pour la Recherche (ANR-17-EURE-0017 Frontcog, ANR-17-CE28-0009 GEOMPHON, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute, ANR-18-IDEX-0001 U de Paris, ANR-10-LABX-0083 EFL) and grants from CIFAR (Learning in Machines and Brains), Facebook AI Research (Research Grant), Google (Faculty Research Award), Microsoft Research (Azure Credits and Grant), and Amazon Web Services (AWS Research Credits).
References
Carl Allen and Timothy Hospedales. 2019. Analogies Explained: Towards Understanding Word Embeddings. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 223–231.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A Latent Variable Model Approach to PMI-based Word Embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399.
Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Towards Understanding Linear Word Analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3253–3262.
Louis Fournier, Emmanuel Dupoux, and Ewan Dunbar. 2020. Analogies minus analogy test: measuring regularities in word embeddings. In Proceedings of the 24th Conference on Computational Natural Language Learning, Online, pp. 365–375.
Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. 2017. Skip-Gram − Zipf + Uniform = Vector Additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 69–76.
Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-Based Detection of Morphological and Semantic Relations with Word Embeddings: What Works and What Doesn't. In Proceedings of the NAACL-HLT SRW, San Diego, California, pp. 47–54.
Omer Levy and Yoav Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Ann Arbor, Michigan, pp. 171–180.
Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pp. 2177–2185.
Tal Linzen. 2016. Issues in Evaluating Semantic Spaces Using Word Analogies. In Proceedings of the First Workshop on Evaluating Vector Space Representations for NLP.
Matt Mahoney. 2006. Large text compression benchmark: about the test data (web page).
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR).
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL-HLT 2013, Atlanta, Georgia, pp. 746–751.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
Anna Rogers, Aleksandr Drozd, and Bofang Li. 2017. The (Too Many) Problems of Analogical Reasoning with Word Vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pp. 135–148.
Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1671–1682.