Modern word embedding models McCann et al. (2017); Peters et al. (2018); Devlin et al. (2019) build vector representations of words in context, i.e. the same word will have different vectors when used in different contexts (sentences). Earlier models Mikolov et al. (2013b); Pennington et al. (2014) built so-called static embeddings: each word was represented by a single vector, regardless of the context in which it was used. Although static word embeddings are considered obsolete today, they have several advantages over contextualized ones. Firstly, static embeddings are trained much faster (a few hours instead of a few days) and do not require large computing resources (1 consumer-level GPU instead of 8–16 non-consumer GPUs). Secondly, they have been studied theoretically in a number of works Levy and Goldberg (2014b); Arora et al. (2016); Hashimoto et al. (2016); Gittens et al. (2017); Tian et al. (2017); Ethayarajh et al. (2019); Allen et al. (2019); Allen and Hospedales (2019); Assylbekov and Takhanov (2019); Zobnin and Elistratova (2019), whereas little has been done for contextualized embeddings Reif et al. (2019). Thirdly, static embeddings are still an integral part of deep neural network models that produce contextualized word vectors, because embedding lookup matrices are used at the input and output (softmax) layers of such models. Therefore, we consider it necessary to further study both static and contextualized embeddings.
It is noteworthy that, despite the abundance of both theoretical and empirical studies on static vectors, they are not fully understood, as this work shows. For instance, it is generally accepted that good quality word vectors are inextricably linked with a low-rank approximation of the pointwise mutual information (PMI) matrix, but we show that vectors of comparable quality can also be obtained from a low-rank approximation of a binarized PMI matrix, which is a rather strong coarsening of the original PMI matrix (Section 2). Thus, a binarized PMI matrix is a viable alternative to a standard PMI matrix when it comes to obtaining word vectors. At the same time, it is much easier to interpret the binarized PMI matrix as an adjacency matrix of a certain graph. Studying the properties of such a graph, we come to the conclusion that it is a so-called complex network, i.e. it has a strong clustering property and a scale-free degree distribution (Section 3). It is noteworthy that complex networks, in turn, are dual to hyperbolic spaces Krioukov et al. (2010), which were previously used to train word vectors Nickel and Kiela (2017); Tifrea et al. (2018) and have proven their suitability: in hyperbolic space, word vectors need lower dimensionality than in Euclidean space. Thus, to the best of our knowledge, this is the first work that simultaneously establishes a connection between good quality word vectors, a binarized PMI matrix, complex networks, and hyperbolic spaces. Figure 1 summarizes our work and serves as a guide for the reader.
We let $\mathbb{R}$ denote the real numbers. Bold-faced lowercase letters ($\mathbf{x}$) denote vectors, plain-faced lowercase letters ($x$) denote scalars, $\langle \mathbf{x}, \mathbf{y} \rangle$ is the Euclidean inner product, $\{a_i\}_{i \in I}$ is a set of elements indexed by $I$, $(a_{ij})$ is a matrix with the $ij$-th entry being $a_{ij}$. ‘i.i.d.’ stands for ‘independent and identically distributed’, ‘w.r.t.’ stands for ‘with respect to’. We use the sign $\propto$ to abbreviate ‘proportional to’, and the sign $\sim$ to abbreviate ‘distributed as’.
Assuming that words have already been converted into indices, let $\mathcal{W} := \{1, \ldots, n\}$ be a finite vocabulary of words. Following the setup of the widely used word2vec model Mikolov et al. (2013b), we use two vectors for each word $i$: (1) $\mathbf{w}_i \in \mathbb{R}^d$ when $i \in \mathcal{W}$ is a center word, (2) $\mathbf{c}_i \in \mathbb{R}^d$ when $i \in \mathcal{W}$ is a context word; and we assume that $d \ll n$. Word vectors are also known as word embeddings, while context vectors are also known as context embeddings.
In what follows we assume that our dataset consists of co-occurrence pairs $(i, j)$. We say that “the words $i$ and $j$ co-occur” when they co-occur in a fixed-size window of words. E.g., using a window of size 1 we can convert the text “the cat sat on the mat” into a set of pairs: (the, cat), (cat, the), (cat, sat), (sat, cat), (sat, on), (on, sat), (on, the), (the, on), (the, mat), (mat, the). The number of such pairs, i.e. the size of our dataset, is denoted by $N$. Let $\#(i, j)$ be the number of times the words $i$ and $j$ co-occur, then $N = \sum_{i \in \mathcal{W}} \sum_{j \in \mathcal{W}} \#(i, j)$.
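The windowing procedure described above can be sketched as follows. This is a minimal illustration rather than the extraction code used for the experiments; the function name `cooccurrence_pairs` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def cooccurrence_pairs(tokens, window=1):
    """Yield (center, context) pairs for every context word that lies within
    `window` positions to the left or right of each center word."""
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                yield center, tokens[ctx_pos]


tokens = "the cat sat on the mat".split()
pairs = list(cooccurrence_pairs(tokens, window=1))
counts = Counter(pairs)      # counts[(i, j)] plays the role of #(i, j)
N = sum(counts.values())     # dataset size: total number of pairs
print(N, pairs)
```

For the example sentence this yields the ten pairs listed in the text, each with count 1.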
2 BPMI and Word Vectors
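The well-known skip-gram with negative sampling (SGNS) word embedding model of Mikolov et al. (2013b) maximizes the following objective function

$$\mathcal{L} = \sum_{i \in \mathcal{W}} \sum_{j \in \mathcal{W}} \#(i, j) \left( \log \sigma(\langle \mathbf{w}_i, \mathbf{c}_j \rangle) + k \cdot \mathbb{E}_{j' \sim p} \left[ \log \sigma(-\langle \mathbf{w}_i, \mathbf{c}_{j'} \rangle) \right] \right), \tag{1}$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the logistic sigmoid function, $p$ is a smoothed unigram probability distribution for words (the authors of SGNS suggest $p(i) \propto \#(i)^{3/4}$), and $k$ is the number of negative samples to be drawn. Interestingly, training SGNS is approximately equivalent to finding a low-rank approximation of a shifted PMI matrix Levy and Goldberg (2014b) in the form $\mathrm{PMI}_{ij} - \log k \approx \langle \mathbf{w}_i, \mathbf{c}_j \rangle$, where the left-hand side is the $ij$-th element of the shifted PMI matrix, and the right-hand side is an element of a matrix with rank $\le d$ since $\mathbf{w}_i, \mathbf{c}_j \in \mathbb{R}^d$. This approximation was later re-derived by Arora et al. (2016); Assylbekov and Takhanov (2019); Allen et al. (2019) under different sets of assumptions. In this section we show that constrained optimization of a slightly modified SGNS objective (1) leads to a low-rank approximation of the binarized PMI (BPMI) matrix, defined as $\mathrm{BPMI} := (b_{ij})$, where $b_{ij} = 1$ if $\mathrm{PMI}_{ij} > 0$, and $b_{ij} = 0$ otherwise.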
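Proposition 1. Assuming $0 < \langle \mathbf{w}_i, \mathbf{c}_j \rangle < 1$, the following objective function

$$\mathcal{L} = \sum_{i \in \mathcal{W}} \sum_{j \in \mathcal{W}} \#(i, j) \left( \log \langle \mathbf{w}_i, \mathbf{c}_j \rangle + k \cdot \mathbb{E}_{j' \sim p} \left[ \log (1 - \langle \mathbf{w}_i, \mathbf{c}_{j'} \rangle) \right] \right) \tag{2}$$

reaches its optimum at $\langle \mathbf{w}_i, \mathbf{c}_j \rangle = \sigma(\mathrm{PMI}_{ij} - \log k) \approx H(\mathrm{PMI}_{ij} - \log k)$.

Proof. We begin by expanding the sum in (2) as

$$\mathcal{L} = \sum_{i} \sum_{j} \#(i, j) \log \langle \mathbf{w}_i, \mathbf{c}_j \rangle + \sum_{i} \#(i) \cdot k \cdot \mathbb{E}_{j' \sim p} \left[ \log (1 - \langle \mathbf{w}_i, \mathbf{c}_{j'} \rangle) \right] \tag{3}$$

$$= N \left( \sum_{i} \sum_{j} \hat{p}(i, j) \log \langle \mathbf{w}_i, \mathbf{c}_j \rangle + \sum_{i} \hat{p}(i) \cdot k \cdot \mathbb{E}_{j' \sim p} \left[ \log (1 - \langle \mathbf{w}_i, \mathbf{c}_{j'} \rangle) \right] \right), \tag{4}$$

where we used $\#(i) = \sum_{j} \#(i, j)$, and $\hat{p}(i, j)$, $\hat{p}(i)$ are empirical probability distributions defined by $\hat{p}(i, j) = \#(i, j) / N$, $\hat{p}(i) = \#(i) / N$. Next, we use the definition of an expected value:

$$\mathbb{E}_{j' \sim p} \left[ \log (1 - \langle \mathbf{w}_i, \mathbf{c}_{j'} \rangle) \right] = \sum_{j \in \mathcal{W}} p(j) \log (1 - \langle \mathbf{w}_i, \mathbf{c}_j \rangle). \tag{5}$$

Thus, we can rewrite the individual objective in (2), i.e. the contribution of a single pair $(i, j)$, as

$$\ell_{ij} = N \left[ \hat{p}(i, j) \log \langle \mathbf{w}_i, \mathbf{c}_j \rangle + k \, \hat{p}(i) \, p(j) \log (1 - \langle \mathbf{w}_i, \mathbf{c}_j \rangle) \right]. \tag{6}$$

Differentiating (6) w.r.t. $\langle \mathbf{w}_i, \mathbf{c}_j \rangle$ we get

$$\frac{\partial \ell_{ij}}{\partial \langle \mathbf{w}_i, \mathbf{c}_j \rangle} = N \left[ \frac{\hat{p}(i, j)}{\langle \mathbf{w}_i, \mathbf{c}_j \rangle} - \frac{k \, \hat{p}(i) \, p(j)}{1 - \langle \mathbf{w}_i, \mathbf{c}_j \rangle} \right].$$

Setting this derivative to zero gives

$$\frac{\langle \mathbf{w}_i, \mathbf{c}_j \rangle}{1 - \langle \mathbf{w}_i, \mathbf{c}_j \rangle} = \frac{\hat{p}(i, j)}{k \, \hat{p}(i) \, p(j)} \;\Leftrightarrow\; \mathrm{logit} \langle \mathbf{w}_i, \mathbf{c}_j \rangle = \mathrm{PMI}_{ij} - \log k \;\Leftrightarrow\; \langle \mathbf{w}_i, \mathbf{c}_j \rangle = \sigma(\mathrm{PMI}_{ij} - \log k), \tag{7}$$

where $\mathrm{PMI}_{ij} = \log \frac{\hat{p}(i, j)}{\hat{p}(i) \, p(j)}$ and $\mathrm{logit}(q) = \log \frac{q}{1 - q}$ is the logit function, which is the inverse of the logistic sigmoid function, i.e. $\sigma(\mathrm{logit}(q)) = q$. Since $\sigma(x)$ can be regarded as a smooth approximation of the Heaviside step function $H(x)$, which equals 1 for $x > 0$ and 0 otherwise, from (7) we have $\langle \mathbf{w}_i, \mathbf{c}_j \rangle \approx H(\mathrm{PMI}_{ij} - \log k)$, which for $k = 1$ is exactly $\mathrm{BPMI}_{ij}$. This concludes the proof. ∎

Remark. Since $\sigma(\mathrm{logit}(q)) = q$ and $\sigma(-\mathrm{logit}(q)) = 1 - q$, the objective (2) is the SGNS objective (1) that uses $\mathrm{logit} \langle \mathbf{w}_i, \mathbf{c}_j \rangle$ in the logistic regression instead of $\langle \mathbf{w}_i, \mathbf{c}_j \rangle$. Thus we will refer to the objective (2) as Logit SGNS.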
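The optimum stated above can be checked numerically. The sketch below assumes that the per-pair part of objective (2) has the form $\hat{p}(i,j) \log x + k \, \hat{p}(i) \, p(j) \log(1 - x)$ with $x = \langle \mathbf{w}_i, \mathbf{c}_j \rangle \in (0, 1)$; the probabilities and the value of $k$ are made up for illustration, and the helper names are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def pair_objective(x, p_ij, p_i, p_j, k):
    """Assumed per-pair objective: p_ij*log(x) + k*p_i*p_j*log(1 - x), 0 < x < 1."""
    return p_ij * math.log(x) + k * p_i * p_j * math.log(1.0 - x)


# Illustrative (made-up) probabilities and number of negative samples.
p_ij, p_i, p_j, k = 0.02, 0.1, 0.05, 2

# Closed form: sigmoid of the shifted PMI.
pmi = math.log(p_ij / (p_i * p_j))
x_closed = sigmoid(pmi - math.log(k))

# Brute-force maximization over a fine grid of the open interval (0, 1).
grid = [t / 100000 for t in range(1, 100000)]
x_grid = max(grid, key=lambda x: pair_objective(x, p_ij, p_i, p_j, k))

print(x_closed, x_grid)
```

With these numbers $\hat{p}(i,j) / (k \, \hat{p}(i) \, p(j)) = 2$, so both values are close to $2/3$.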
Direct Matrix Factorization
Optimization of the Logit SGNS objective (2) is not the only way to obtain a low-rank approximation of the BPMI matrix. A viable alternative is factorizing the BPMI matrix with the singular value decomposition (SVD): $\mathrm{BPMI} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$, with orthogonal $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times n}$ and diagonal $\boldsymbol{\Sigma} \in \mathbb{R}^{n \times n}$ ($n$ being the vocabulary size), and then zeroing out the $n - d$ smallest singular values, i.e.

$$\mathrm{BPMI} \approx \mathbf{U}_{1:n, 1:d} \, \boldsymbol{\Sigma}_{1:d, 1:d} \, \mathbf{V}^\top_{1:d, 1:n}, \tag{8}$$

where we use $\mathbf{A}_{a:b, \, p:q}$ to denote a submatrix located at the intersection of rows $a, \ldots, b$ and columns $p, \ldots, q$ of $\mathbf{A}$. By the Eckart-Young theorem Eckart and Young (1936), the right-hand side of (8) is the closest rank-$d$ matrix to the BPMI matrix in Frobenius norm. The word and context embedding matrices can be obtained from (8) by setting $\mathbf{W} := \mathbf{U}_{1:n, 1:d} \sqrt{\boldsymbol{\Sigma}_{1:d, 1:d}}$, and $\mathbf{C} := \sqrt{\boldsymbol{\Sigma}_{1:d, 1:d}} \, \mathbf{V}^\top_{1:d, 1:n}$. When this is done for a positive PMI matrix (PPMI), the resulting word embeddings are comparable in quality with those from the SGNS Levy and Goldberg (2014b). Although there are other methods of low-rank matrix approximation Kishore Kumar and Schneider (2017), a recent study Sorokina et al. (2019) shows that two such methods, rank revealing QR factorization Chan (1987) and non-negative matrix factorization Paatero and Tapper (1994), produce word embeddings of worse quality than the truncated SVD. Thus we will consider only the truncated SVD (8) as an alternative to optimizing the Logit SGNS objective (2).
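The truncated SVD step can be illustrated on a toy matrix. This is a minimal NumPy sketch, not the scikit-learn pipeline used in the experiments; the function name `svd_embeddings`, the symmetric square-root split of the singular values, and the 4×4 matrix are assumptions made for the example.

```python
import numpy as np


def svd_embeddings(M, d):
    """Rank-d truncated SVD of M. Returns (W, C) such that W @ C is the best
    rank-d approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    sqrt_s = np.sqrt(s[:d])
    W = U[:, :d] * sqrt_s             # word embeddings, one row per word
    C = sqrt_s[:, None] * Vt[:d, :]   # context embeddings, one column per context
    return W, C


# Toy symmetric 0/1 matrix standing in for a BPMI matrix (made-up values).
B = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

W, C = svd_embeddings(B, d=2)
approx = W @ C
err = np.linalg.norm(B - approx)      # Frobenius norm of the residual
print(W.shape, C.shape, err)
```

The squared residual equals the sum of the squared discarded singular values, which is what the Eckart-Young theorem guarantees to be minimal among rank-2 matrices.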
Empirical Evaluation of the BPMI-based Word Vectors
To evaluate the quality of word vectors resulting from the Logit SGNS objective and BPMI factorization, we use the well-known text8 corpus (http://mattmahoney.net/dc/textdata.html). We ignored words that appeared fewer than 5 times, resulting in a vocabulary of 71,290 tokens. The SGNS and Logit SGNS embeddings were trained using our custom implementation (https://github.com/zh3nis/BPMI). The PMI and BPMI matrices were extracted using the hyperwords tool of Levy et al. (2015) and the truncated SVD was performed using the scikit-learn library of Pedregosa et al. (2011).
| Method | WordSim | MEN | M. Turk | Rare Words | Google | MSR |
|---|---|---|---|---|---|---|
| PMI + SVD | .660 | .651 | .670 | .224 | .285 | .186 |
| BPMI + SVD | .618 | .540 | .669 | .146 | .129 | .102 |
Table 1: Evaluation of word embeddings on the analogy tasks (Google and MSR) and on the similarity tasks (the rest). For word similarities, the evaluation metric is Spearman’s correlation with the human ratings, while for word analogies it is the percentage of correct answers.
The trained embeddings were evaluated on several word similarity and word analogy tasks: WordSim Finkelstein et al. (2002), MEN Bruni et al. (2012), M.Turk Radinsky et al. (2011), Rare Words Luong et al. (2013), Google Mikolov et al. (2013a), and MSR Mikolov et al. (2013c). We used the Gensim tool of Řehůřek and Sojka (2010) for evaluation. We mention here a few key points: (1) Our goal is not to beat the state of the art, but to compare PMI-based embeddings (SGNS and PMI+SVD) with BPMI-based ones (Logit SGNS and BPMI+SVD). (2) For answering analogy questions ($a$ is to $b$ as $c$ is to $d$) we use the 3CosAdd method of Levy and Goldberg (2014a), and the evaluation metric for the analogy questions is the percentage of correct answers. The results of the evaluation are provided in Table 1. As we can see, the Logit SGNS embeddings in general underperform the SGNS ones, but not by a large margin. SVD is inferior to SGNS/Logit SGNS, especially in the analogy tasks. Overall, it is surprising that such aggressive compression as binarization still retains important information on word vectors.
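For reference, 3CosAdd answers "$a$ is to $b$ as $c$ is to ?" by returning the word $x$, other than $a$, $b$, $c$, that maximizes $\cos(x, b) - \cos(x, a) + \cos(x, c)$. The sketch below is a minimal NumPy illustration with made-up 3-dimensional vectors; it is not the Gensim implementation used in the evaluation, and the function name `three_cos_add` is hypothetical.

```python
import numpy as np


def three_cos_add(emb, vocab, a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing
    cos(x, b) - cos(x, a) + cos(x, c) over all words x except a, b, c."""
    idx = {w: i for i, w in enumerate(vocab)}
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-length rows
    target = unit[idx[b]] - unit[idx[a]] + unit[idx[c]]
    scores = unit @ target            # sum of the three cosine terms
    for w in (a, b, c):
        scores[idx[w]] = -np.inf      # exclude the query words themselves
    return vocab[int(np.argmax(scores))]


# Made-up toy embeddings, for illustration only.
vocab = ["man", "woman", "king", "queen", "apple"]
emb = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [1.0, 1.0, 1.0],
                [-1.0, 0.2, 0.1]])

answer = three_cos_add(emb, vocab, "man", "woman", "king")
print(answer)
```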
3 BPMI and Complex Networks
A remarkable feature of the BPMI matrix is that it can be considered as the adjacency matrix of a certain graph. As usual, by a graph $G$ we mean a set of vertices $V$ and a set of edges $E$ which consists of pairs $(i, j)$ with $i, j \in V$. It is convenient to represent the edges of a graph by its adjacency matrix $(e_{ij})$, in which $e_{ij} = 1$ if $(i, j) \in E$, and $e_{ij} = 0$ otherwise. The graph with $V := \mathcal{W}$ and $e_{ij} := \mathrm{BPMI}_{ij}$ will be referred to as the BPMI Graph. Since $\mathrm{PMI}_{ij} > 0$ is equivalent to $p(i, j) > p(i) p(j)$, only those word pairs are connected by an edge which co-occur more often than would be expected if $i$ and $j$ were independent.
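The construction of the BPMI Graph from co-occurrence counts can be sketched as follows. This is an illustration under simplifying assumptions: it uses unsmoothed empirical marginals (the hyperwords tool has its own settings), the function name `bpmi_from_counts` is hypothetical, and the count matrix is made up.

```python
import numpy as np


def bpmi_from_counts(counts):
    """Binarized PMI: entry (i, j) is 1 iff p(i, j) > p(i) * p(j)."""
    counts = np.asarray(counts, dtype=float)
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)   # marginal of center words
    p_j = p_ij.sum(axis=0, keepdims=True)   # marginal of context words
    # PMI > 0  <=>  p(i, j) > p(i) p(j); comparing probabilities avoids log(0).
    return (p_ij > p_i * p_j).astype(int)


# Made-up symmetric co-occurrence counts for a 4-word vocabulary.
counts = [[0, 5, 1, 0],
          [5, 0, 0, 1],
          [1, 0, 0, 5],
          [0, 1, 5, 0]]

A = bpmi_from_counts(counts)   # adjacency matrix of the toy BPMI Graph
print(A)
```

In this toy example every word has marginal probability 1/4, so only the pairs with count 5 (probability 5/24 > 1/16) become edges, while the pairs with count 1 (probability 1/24 < 1/16) do not.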
3.1 Spectrum of the BPMI Graph
First of all, we look at the spectral properties of the BPMI Graphs (we define the graph spectrum as the set of eigenvalues of its adjacency matrix). For this, we extract BPMI matrices from the text8 and enwik9 datasets using the hyperwords tool of Levy et al. (2015). We use the default settings for all hyperparameters, except the word frequency threshold and context window size. We ignored words that appeared fewer than 100 times in text8 and fewer than 200 times in enwik9, resulting in vocabularies of 11,815 and 24,294 words, respectively. We additionally experiment with the context window size 5, which by default is set to 2. The eigenvalues of the PMI matrices are calculated using the TensorFlow library Abadi et al. (2016), and the above-mentioned threshold of 200 for enwik9 was chosen to fit the GPU memory (12GB, NVIDIA Titan X Pascal). The eigenvalue distributions are provided in Figure 2. The distributions seem to be symmetric; however, their shapes are far from resembling the Wigner semicircle law, which is the limiting distribution for the eigenvalues of many random symmetric matrices with i.i.d. entries Wigner (1955, 1958). This means that the entries of a typical BPMI matrix are dependent, otherwise we would observe approximately semicircle distributions for its eigenvalues.
Interestingly, there is a striking similarity between the spectral distributions of the BPMI matrices and of the so-called complex networks which arise in physics and network science (Figure 2). In the context of network theory, a complex network is a graph with non-trivial topological features, i.e. features that do not occur in simple networks such as lattices or random graphs but often occur in graphs modelling real systems. The study of complex networks is a young and active area of scientific research inspired largely by the empirical study of real-world networks such as computer networks, technological networks, brain networks and social networks. Notice that the connection between human language structure and complex networks was observed previously by Cancho and Solé (2001). A thorough review of approaching human language with complex networks was given by Cong and Liu (2014). In the following subsection we will specify precisely what we mean by a complex network.
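For contrast, the spectrum of a graph whose edges really are i.i.d. can be computed in a few lines. The sketch below is for intuition only and is not the TensorFlow computation used for the experiments; the graph size, edge probability and random seed are arbitrary choices. It builds an Erdős–Rényi adjacency matrix and checks that, apart from one large outlying eigenvalue, the spectrum sits inside the support of the semicircle law.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 0.1

# Erdős–Rényi adjacency matrix: symmetric, zero diagonal, i.i.d. Bernoulli(p) edges.
upper = np.triu(rng.random((n, n)) < p, k=1)
A = (upper | upper.T).astype(float)

eigvals = np.linalg.eigvalsh(A)               # real eigenvalues in ascending order
radius = 2.0 * np.sqrt(n * p * (1.0 - p))     # semicircle support for the bulk
inside = np.mean(np.abs(eigvals) <= 1.1 * radius)

print(f"bulk radius ~ {radius:.1f}, largest eigenvalue = {eigvals[-1]:.1f}")
print(f"fraction of eigenvalues within 1.1 * radius: {inside:.3f}")
```

A histogram of `eigvals` (excluding the largest one) would show the semicircle shape that the BPMI spectra in Figure 2 do not follow.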
3.2 Clustering and Degree Distribution of the BPMI Graph
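We will use two statistical properties of a graph: degree distribution and clustering coefficient. The degree of a given vertex $i$ is the number of edges that connect it with other vertices, i.e. $\deg(i) = \sum_{j \in V} e_{ij}$. The clustering coefficient measures the average fraction of pairs of neighbors of a vertex that are also neighbors of each other. The precise definition is as follows.
Definition (Clustering Coefficient). Let us indicate by $\mathcal{N}_i = \{ j \in V : e_{ij} = 1 \}$ the set of nearest neighbors of a vertex $i$. By setting $l_i = \sum_{j, k \in \mathcal{N}_i, \, j < k} e_{jk}$ we define the local clustering coefficient as $C_i = l_i / \binom{\deg(i)}{2}$, and the clustering coefficient as the average over $V$: $C = \frac{1}{|V|} \sum_{i \in V} C_i$.
Let $\bar{k}$ be the average degree per vertex, i.e. $\bar{k} = \frac{1}{|V|} \sum_{i \in V} \deg(i)$. For random graphs, i.e. graphs with edges $e_{ij} \sim$ i.i.d. $\mathrm{Bernoulli}(p)$, it is well known Erdős and Rényi (1960) that $C \approx \bar{k} / |V|$ and $\deg(i) \sim \mathrm{Binomial}(|V| - 1, p)$. A complex network is a graph for which $C \gg \bar{k} / |V|$ and $p(\deg(i) = k) \propto 1 / k^{\gamma}$, where $\gamma$ is some constant.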
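The two statistics can be computed directly from an adjacency structure. The following pure-Python sketch uses a made-up four-vertex graph; the function names are hypothetical, and assigning a local clustering coefficient of 0 to vertices with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations


def degrees(adj):
    """Degree of every vertex; adj maps a vertex to the set of its neighbors."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0   # assumed convention for vertices with fewer than 2 neighbors
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def clustering_coefficient(adj):
    """Average of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

deg = degrees(adj)
C = clustering_coefficient(adj)
print(deg, C)
```

Here the local coefficients are 1, 1, 1/3 and 0, so the clustering coefficient is 7/12.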
We constructed BPMI Graphs from the text8 and enwik9 datasets using context windows of sizes 2 and 5 (as in Subsection 3.1), and computed their clustering coefficients (Table 2) as well as degree distributions (Figure 4). Owing to the large size of the BPMI Graph of enwik9, we partitioned it into batches of 10,000 words each and averaged the clustering coefficients over the batches. To check the validity of this approximation, we applied it to the BPMI Graph of text8 and obtained batch-averaged clustering coefficients of .148 and .232, which are close to the original values of .164 and .235 for window sizes of 2 and 5 respectively. As we see, the BPMI Graphs are complex networks, and this brings us to hyperbolic spaces.
4 Complex Networks and Hyperbolic Geometry
Hyperbolic geometry is a non-Euclidean geometry that studies spaces of constant negative curvature. It is, for example, associated with Minkowski space-time in the special theory of relativity. In network science, hyperbolic spaces have begun to attract attention because they are well suited for modeling hierarchical data. For example, a regular tree can be isometrically embedded into a Poincaré disk, which is a model of hyperbolic space (see Figure 4): all connected nodes are spaced equally far apart (i.e., all black line segments have identical hyperbolic length). However, to embed the same tree isometrically into Euclidean space one will need an exponential (in tree depth) number of dimensions. Intuitively, hyperbolic spaces can be thought of as continuous versions of trees or, vice versa, trees can be thought of as “discrete hyperbolic spaces”. Moreover, Krioukov et al. (2010) showed that complex networks (as defined in Subsection 3.2) and hyperbolic spaces are highly related to each other: (1) scale-free degree distributions and strong clustering in complex networks emerge naturally as simple reflections of the negative curvature and metric property of the underlying hyperbolic geometry; (2) conversely, if a network has some metric structure (strong clustering), and if the network degree distribution is scale-free, then the network has an effective hyperbolic geometry underneath. The curvature and metric of the hyperbolic space are related to the degree exponent $\gamma$ and the clustering coefficient $C$ of the complex network.
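To make the geometry of the Poincaré disk concrete, the sketch below evaluates the standard distance formula for the unit disk with curvature $-1$, $d(\mathbf{u}, \mathbf{v}) = \operatorname{arcosh}\left(1 + \frac{2 \|\mathbf{u} - \mathbf{v}\|^2}{(1 - \|\mathbf{u}\|^2)(1 - \|\mathbf{v}\|^2)}\right)$. The formula does not appear in the text above, and the function names and sample points are chosen for illustration only.

```python
import math


def squared_norm(x):
    return sum(t * t for t in x)


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit disk
    (Poincaré disk model with curvature -1)."""
    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * squared_norm(diff) / (
        (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    )
    return math.acosh(arg)


# The same Euclidean step of 0.1 is much longer near the boundary of the disk.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(near_origin, near_boundary)
```

Distances grow rapidly towards the boundary, which is what leaves room for the exponentially many nodes of a tree at increasing depth.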
It is amazing how the seemingly fragmented sections of scientific knowledge can be closely interconnected. In this paper, we have established a chain of connections between word embeddings and hyperbolic geometry, and the key link in this chain is the binarized PMI matrix. Claiming that hyperbolicity underlies word vectors is not novel Nickel and Kiela (2017); Tifrea et al. (2018). However, this note is the first attempt to justify the connection between hyperbolic geometry and the word embeddings using analytical and empirical methods.
References
- Abadi et al. (2016). TensorFlow: a system for large-scale machine learning. In Proceedings of OSDI, pp. 265–283.
- Allen et al. (2019). What the vec? Towards probabilistically grounded embeddings. In Advances in Neural Information Processing Systems, pp. 7465–7475.
- Allen and Hospedales (2019). Analogies explained: towards understanding word embeddings. In International Conference on Machine Learning, pp. 223–231.
- Arora et al. (2016). A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399.
- Assylbekov and Takhanov (2019). Context vectors are reflections of word vectors in half the dimensions. Journal of Artificial Intelligence Research 66, pp. 225–242.
- Bruni et al. (2012). Distributional semantics in technicolor. In Proceedings of ACL, pp. 136–145.
- Cancho and Solé (2001). The small world of human language. Proceedings of the Royal Society of London. Series B: Biological Sciences 268 (1482), pp. 2261–2265.
- Chan (1987). Rank revealing QR factorizations. Linear Algebra and its Applications 88, pp. 67–82.
- Cong and Liu (2014). Approaching human language with complex networks. Physics of Life Reviews 11 (4), pp. 598–618.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186.
- Eckart and Young (1936). The approximation of one matrix by another of lower rank. Psychometrika 1 (3), pp. 211–218.
- Erdős and Rényi (1960). On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5 (1), pp. 17–60.
- Ethayarajh et al. (2019). Towards understanding linear word analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3253–3262.
- Farkas et al. (2001). Spectra of “real-world” graphs: beyond the semicircle law. Physical Review E 64 (2), pp. 026704.
- Finkelstein et al. (2002). Placing search in context: the concept revisited. ACM Transactions on Information Systems 20 (1), pp. 116–131.
- Gittens et al. (2017). Skip-gram − Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 69–76.
- Goh et al. (2001). Spectra and eigenvectors of scale-free networks. Physical Review E 64 (5), pp. 051903.
- Hashimoto et al. (2016). Word embeddings as metric recovery in semantic spaces. Transactions of the Association for Computational Linguistics 4, pp. 273–286.
- Kishore Kumar and Schneider (2017). Literature survey on low rank approximation of matrices. Linear and Multilinear Algebra 65 (11), pp. 2212–2244.
- Krioukov et al. (2010). Hyperbolic geometry of complex networks. Physical Review E 82 (3), pp. 036106.
- Levy et al. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225.
- Levy and Goldberg (2014a). Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL, pp. 171–180.
- Levy and Goldberg (2014b). Neural word embedding as implicit matrix factorization. In Proceedings of NeurIPS, pp. 2177–2185.
- Luong et al. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL, pp. 104–113.
- McCann et al. (2017). Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305.
- Mikolov et al. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Mikolov et al. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Mikolov et al. (2013c). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751.
- Nickel and Kiela (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338–6347.
- Paatero and Tapper (1994). Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5 (2), pp. 111–126.
- Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of EMNLP, pp. 1532–1543.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237.
- Radinsky et al. (2011). A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, pp. 337–346.
- Řehůřek and Sojka (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50. http://is.muni.cz/publication/884893/en
- Reif et al. (2019). Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems, pp. 8592–8600.
- Sorokina et al. (2019). Low-rank approximation of matrices for PMI-based word embeddings. In Proceedings of CICLing.
- Tian et al. (2017). The mechanism of additive composition. Machine Learning 106 (7), pp. 1083–1130.
- Tifrea et al. (2018). Poincaré GloVe: hyperbolic word embeddings. arXiv preprint arXiv:1810.06546.
- Wigner (1955). Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, pp. 548–564.
- Wigner (1958). On the distribution of the roots of certain symmetric matrices. Annals of Mathematics, pp. 325–327.
- Zobnin and Elistratova (2019). Learning word embeddings without context vectors. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 244–249.