Invariance and identifiability issues for word embeddings

11/06/2019, by Rachel Carrington et al., The University of Nottingham

Word embeddings are commonly obtained as optimizers of a criterion function f of a text corpus, but assessed on word-task performance using a different evaluation function g of the test data. We contend that a possible source of disparity in performance on tasks is the incompatibility between classes of transformations that leave f and g invariant. In particular, word embeddings defined by f are not unique; they are defined only up to a class of transformations to which f is invariant, and this class is larger than the class to which g is invariant. One implication of this is that the apparent superiority of one word embedding over another, as measured by word task performance, may largely be a consequence of the arbitrary elements selected from the respective solution sets. We provide a formal treatment of the above identifiability issue, present some numerical examples, and discuss possible resolutions.


1 Introduction

Word embeddings map a text corpus, say $\mathcal{C}$, to a collection of vectors $w_1, \ldots, w_n$, where each $w_i \in \mathbb{R}^d$, for a prescribed embedding dimension $d$, represents one of the $n$ words in the corpus. Different word embedding models can be cast as the solution of an optimisation

\[ (\hat W, \hat C) = \arg\min_{W, C} f(X; W, C), \tag{1} \]

for a particular corpus representation $X$ and objective function $f$, where $W = (w_1, \ldots, w_n) \in \mathbb{R}^{d \times n}$ and the columns of $C$ are vectors in $\mathbb{R}^d$ representing contexts, typically not of main interest. The setup subsumes some popular embedding techniques such as Latent Semantic Analysis (LSA) (Deerwester et al., 1990), word2vec (Mikolov et al., 2013b, a), and GloVe (Pennington et al., 2014), wherein the matrices $W$ and $C$ appear in a suitably chosen $f$ only through their product $C^\top W$.

Once a word embedding $\hat W$ is constructed by solving (1), the embedding is evaluated on its performance in tasks, including identifying word similarity (given word $a$, identify words with similar meanings) and word analogy (for the statement "$a$ is to $b$ what $c$ is to $x$", given $a$, $b$ and $c$, identify $x$). Similarities or analogies can be computed from $\hat W$, then performance evaluated against a test data set $\mathcal{D}$ containing human-assigned judgements as

\[ \text{performance} = g(\hat W; \mathcal{D}), \tag{2} \]

for some function $g$. Constructing word embeddings is "unsupervised" with respect to the evaluation task in the sense that $\hat W$ is determined from (1) independently of the choice of $g$ and the data $\mathcal{D}$ in (2), although typically $f$ entails free parameters that may, consciously or not, be chosen to optimize (2) (Levy et al., 2015).
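To make (2) concrete, the following is a minimal sketch (in Python, with NumPy and SciPy) of an evaluation function $g$ based on cosine similarity and Spearman correlation against human-assigned similarity scores; the toy vocabulary, word pairs and scores are illustrative placeholders rather than any actual benchmark.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def g(W, test_pairs, human_scores, vocab):
    """Score embedding W (d x n, word vectors as columns) against human judgements.

    test_pairs   : list of (word_a, word_b) tuples
    human_scores : human-assigned similarity for each pair
    vocab        : dict mapping each word to its column index in W
    Returns the Spearman correlation between model and human similarities.
    """
    model_scores = [cosine(W[:, vocab[a]], W[:, vocab[b]]) for a, b in test_pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Toy usage with an arbitrary three-word vocabulary and a random embedding.
rng = np.random.default_rng(0)
vocab = {"cat": 0, "dog": 1, "car": 2}
W = rng.normal(size=(5, 3))                       # d = 5, n = 3
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [9.0, 2.0, 2.5]
print(g(W, pairs, human, vocab))
```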

Different word embedding models, identified as different $f$ in (1), are often compared based on performance in word tasks in the sense of $g$ in (2). But there are several reasons why comparing performance in this way is difficult. First: performance may be affected less by the structure of the model $f$, and more by the number of free parameters it entails and how well they have been tuned (Levy et al., 2015). Second: for many embeddings, solving (1) entails a Monte Carlo optimisation, so different runs with identical $f$ will result in different realisations of $\hat W$ and hence different values of $g$. Third, more subtle and often conflated with the first and second: for most embedding models $f$, (1) does not uniquely identify $\hat W$ (the embedding is said to be non-identifiable), and different solutions, each equally optimal with respect to (1), correspond to different values of $g$.

This raises the disconcerting question: can apparent differences in performance in word tasks, as evaluated with $g$, be substantially attributed to the arbitrary selection of a solution from the set of solutions of (1)? In this paper we explore the non-identifiability of $\hat W$, particularly with respect to the class of non-singular transformations to which $f$ is invariant but $g$ is not, and the consequences for constructing and evaluating word embeddings. Specifically, our contributions are as follows.

  1. For $g$ defined using inner products of embedded word vectors (e.g. cosine similarity) in $d$ dimensions, we characterise the subset of the set of non-singular transformations $GL(d)$ to which $g$ is not invariant.

  2. We study a widely used strategy for constructing word embeddings that involves multiplying a "base" embedding by a powered matrix of singular values, and show that this amounts to exploring a one-dimensional subset of the set of optimal solutions.

  3. We discuss resolutions to the non-identifiability, including (i) constraining the set of solutions of (1) to ensure compatibility with the invariances of $g$, and (ii) optimizing over the solutions of (1) with respect to $g$, in a supervised learning sense.

2 Non-identifiability of word embeddings

The issue of non-identifiability is most transparent in word embedding models explicitly involving matrix factorisation. LSA assumes $X$ is an $m \times n$ context-word matrix and seeks $W$ as

\[ (\hat W, \hat C) = \arg\min_{W, C} \| X - C^\top W \|_F^2, \tag{3} \]

where $\|\cdot\|_F$ is the Frobenius norm, and $C$ is a $d \times m$ matrix of contexts to be estimated. For any particular solution $(\hat W, \hat C)$ of (3), $(T \hat W, T^{-\top} \hat C)$ is also a solution, where $T$ is any invertible $d \times d$ matrix. The solution of (3) for $W$ is hence a set

\[ \{ T \hat W : T \in GL(d) \}, \tag{4} \]

where $GL(d)$ denotes the general linear group of $d \times d$ invertible matrices.

One way to find an element of the solution set (4) is by using the singular value decomposition (SVD) of $X$. The SVD decomposes $X$ as $X = U \Sigma V^\top$, where $U$ and $V$ have orthogonal columns and $\Sigma$ is a diagonal matrix with the singular values in decreasing order on the diagonal. Then a rank-$d$ matrix $M$ that minimizes $\|X - M\|_F$ is $M = U_d \Sigma_d V_d^\top$, where $U_d$ and $V_d$ are matrices containing the first $d$ columns of $U$ and $V$ respectively, and $\Sigma_d$ is the upper-left $d \times d$ part of $\Sigma$ (Eckart and Young, 1936). Hence a solution to (3) is obtained by taking

\[ \hat C = U_d^\top, \qquad \hat W = \Sigma_d V_d^\top, \tag{5} \]

called by Bullinaria and Levy (2012) the "simple SVD" solution. Bullinaria and Levy (2012) and Turney (2013) have investigated the word embedding $W_\alpha = \Sigma_d^\alpha V_d^\top$, which generalises $\hat W$ in (5) by introducing a tunable parameter $\alpha$, motivated by empirical evidence that values $\alpha \neq 1$ often lead to better performance on word tasks. Such an embedding is perfectly justified, however, as an alternative solution $W_\alpha = \Sigma_d^{\alpha - 1} \hat W$ to (3), for any $\alpha$. We can hence interpret the tuning parameter $\alpha$ as indexing different elements of the solution set (4), each optimal with respect to the embedding model $f$, with $\alpha$ free to be chosen so that the word-task performance $g$ is maximized.

Indeed, by choosing the particular solution $\hat W = \Sigma_d V_d^\top$ in (5), and setting $T_\alpha = \Sigma_d^{\alpha - 1}$, we see that tuning $\alpha$ amounts to optimising over the one-parameter subgroup $\{\Sigma_d^{\gamma} : \gamma \in \mathbb{R}\}$, a one-dimensional subset of the $d^2$-dimensional group $GL(d)$ to which $\hat W$ is non-identifiable. The motivation for restricting the optimisation to this particular subset is unclear, however. In fact, it is not clear that the choice of the matrix of singular values $\Sigma_d$ in the subgroup necessarily leads to better performance with $g$; Figure 2 in Section 4.2 demonstrates superior performance for alternative (but arbitrary) diagonal matrices for certain values of $\gamma$.
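The relationship between the $\alpha$-family and the solution set (4) is easy to verify numerically. The sketch below, under the assumption of an arbitrary toy count matrix, computes the "simple SVD" solution (5) and checks that $W_\alpha = \Sigma_d^\alpha V_d^\top$ equals $\Sigma_d^{\alpha-1}\hat W$, so that each $W_\alpha$ is an element of the solution set and yields the same rank-$d$ reconstruction of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 8, 12, 3                                 # contexts, words, embedding dimension
X = rng.poisson(2.0, size=(m, n)).astype(float)    # toy context-word count matrix

# Rank-d truncated SVD of X (the Eckart-Young best rank-d approximation).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
C_hat = U[:, :d].T                                 # contexts, d x m
W_hat = np.diag(s[:d]) @ Vt[:d, :]                 # "simple SVD" word embedding, d x n

# The alpha-family of embeddings, each equally optimal for the LSA criterion (3).
alpha = 0.5
W_alpha = np.diag(s[:d] ** alpha) @ Vt[:d, :]
T_alpha = np.diag(s[:d] ** (alpha - 1.0))          # an element of GL(d)

# W_alpha = T_alpha W_hat, i.e. it lies in the solution set {T W_hat : T in GL(d)} ...
assert np.allclose(W_alpha, T_alpha @ W_hat)

# ... and gives the same rank-d fit to X once C is transformed compatibly.
assert np.allclose(C_hat.T @ W_hat, (np.linalg.inv(T_alpha) @ C_hat).T @ W_alpha)
```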

Yin and Shen (2018) (see also references therein) recognise "unitary [equivalently orthogonal] invariance" of word embeddings, explaining that "two embeddings are essentially identical if one can be obtained from the other by performing a unitary [orthogonal] operation." Here "essentially identical" appears to mean with respect to the performance evaluation, our $g$ in this paper. We emphasise the distinction between this and the non-identifiability of $\hat W$, which refers to the invariance of $f$ to a (typically larger) class of transformations. The distinction was similarly made by Mu et al. (2019), who suggested modifying the embedding model $f$ such that the classes of invariant transformations of $f$ and $g$ match. We discuss their approach further below.

Remark 1.

The foregoing discussion focuses on the LSA embedding model, $f$ in (3), in which the optimal embedding arises clearly from a matrix factorisation with respect to the Frobenius norm, and the non-identifiability is transparent. But other embedding models, including word2vec and GloVe, are defined by different $f$ yet share the same property that $\hat W$ is non-identifiable, i.e. that the solution is defined only as the set (4). Levy et al. (2015) have shown that word2vec and GloVe both amount to solving implicit matrix factorisation problems, each with respect to a particular corpus representation and metric. To see this, and the consequent non-identifiability, it is sufficient to observe, as with the objective of LSA, that the objective functions of word2vec and GloVe involve the matrices $W$ and $C$ only through the product $C^\top W$.

3 Effect of non-identifiability of embeddings on $g$

The word embeddings are evaluated on tasks on the test data $\mathcal{D}$ using the function $g$, which typically is based on cosine similarity between elements of $\hat W$. Our focus will hence be on functions $g$ that depend on $\hat W$ only through the cosine similarity between its columns.

The set of invariances associated with such a $g$ consists of the group $\{cQ : c > 0, Q \in O(d)\}$, where $O(d)$ is the subset of orthogonal matrices in $GL(d)$. This set contains the set of scale transformations $\{cI_d : c > 0\}$. $O(d)$ relates to transformations that leave inner products, and hence angles, between word vectors invariant; the scale transformation $cI_d$ changes lengths but preserves the angle between $w_i$ and $w_j$.

Figure 1 (left) illustrates the incompatibility between invariances of $f$ and $g$. For embedding dimension $d = 2$, $w_1$ and $w_2$ are 2D embeddings of two words obtained from solving (1) with respect to the standard coordinate vectors $e_1, e_2$. For $Q \in O(2)$, with respect to the orthogonally transformed coordinates $Qe_1, Qe_2$, the vectors $Qw_1$ and $Qw_2$ are also viable solutions of (1). A $g$ that depends only on the angle between the embedded vectors has the same value for $(Qw_1, Qw_2)$ as for $(w_1, w_2)$. On the other hand, equally valid solutions $Tw_1$ and $Tw_2$ of (1), with respect to nonsingularly transformed coordinates $Te_1, Te_2$ for $T \in GL(2)$, lead to a different value of $g$, since the angle between $Tw_1$ and $Tw_2$ differs from that between $w_1$ and $w_2$ unless $T = cQ$ for some $Q \in O(2)$ and $c > 0$.

Thus with respect to the evaluation function $g$, each solution $cQ\hat W$ with $c > 0$ and $Q \in O(d)$ is equally good (or bad). However, since $O(d) \subsetneq GL(d)$, there still exist embeddings $T\hat W$ which solve (1) with $g(T\hat W; \mathcal{D}) \neq g(\hat W; \mathcal{D})$. Such $T$ are precisely those which characterise the incompatibility between invariances of $f$ and $g$. One such example is the set of $T$ given by the one-parameter subgroup $\{D^\gamma : \gamma \in \mathbb{R}\}$, where $D$ is a $d$-dimensional diagonal matrix with positive elements. This generalises the subgroup $\{\Sigma_d^\gamma : \gamma \in \mathbb{R}\}$ discussed in §2, which is the special case with $D = \Sigma_d$. Figure 1 (right) illustrates the solution set and one-dimensional subsets for different $D$ and particular solutions $\hat W$. The discussion above is summarised through the following proposition.

Figure 1: Left: For $d = 2$, orthogonally transformed coordinates (blue) $Qe_1, Qe_2$ with $Q \in O(2)$, and nonsingularly transformed coordinates (green) $Te_1, Te_2$ with $T \in GL(2)$, where $e_1, e_2$ (red) are the standard coordinates. Distances between two embedding vectors $w_1$ and $w_2$ are preserved in the coordinates $Qe_1, Qe_2$, but altered in the coordinates $Te_1, Te_2$. However, both transformed embeddings are valid solutions to (1). Right: Illustration of the solution set and one-dimensional subsets parameterised by $\gamma$ for two choices of diagonal matrix $D$ and two particular solutions $\hat W$.
Proposition 1.

Let $\hat W$ be a solution of (1). Then $g$ is not invariant to the non-singular transform $\hat W \mapsto T\hat W$ for any $T \in GL(d)$ unless $T = cQ$ for some $Q \in O(d)$ and $c > 0$.

The key message from Proposition 1 is: for $T$ outside the set $\{cQ : c > 0, Q \in O(d)\}$, comparison of the performances of embeddings $\hat W$ and $T\hat W$ using $g$ depends on the (arbitrary) choice of the orthogonal coordinates of $\mathbb{R}^d$. Note however that the choice of the orthogonal coordinates does not have any bearing on $f$, and hence $\hat W$ and $T\hat W$ are both solutions of (1). The first step towards addressing identifiability issues pertaining to $f$ and $g$ is to isolate and understand the structure of the set of transformations in $GL(d)$, which we denote $\mathcal{T}$, that leave $f$ invariant but not $g$.
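Proposition 1 can be checked directly: in the sketch below (random matrices, purely for illustration) a $g$ computed from cosine similarities is unchanged by $W \mapsto cQW$ with $Q$ orthogonal and $c > 0$, but changes under a generic $T \in GL(d)$.

```python
import numpy as np

def cosine_matrix(W):
    """Matrix of cosine similarities between the columns (word vectors) of W."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    return Wn.T @ Wn

rng = np.random.default_rng(2)
d, n = 4, 10
W = rng.normal(size=(d, n))                    # stand-in for a solution of (1)

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix
c = 3.7                                        # a positive scale factor
T = rng.normal(size=(d, d))                    # a generic (almost surely invertible) T

# Invariance under the transformations cQ in the invariance class of g ...
assert np.allclose(cosine_matrix(c * Q @ W), cosine_matrix(W))

# ... but not under a general nonsingular T: the discrepancy is typically far from zero.
print(np.max(np.abs(cosine_matrix(T @ W) - cosine_matrix(W))))
```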

3.1 Structure of the set $\mathcal{T}$

What is the dimension of the set $\mathcal{T}$? The dimension of $GL(d)$ is $d^2$ and that of $O(d)$ is $d(d-1)/2$. Since the scale group $\{cI_d : c > 0\}$ is one-dimensional, the dimension of $\mathcal{T}$ is $d^2 - d(d-1)/2 - 1 = d(d+1)/2 - 1$. Figure 1 (right) clarifies the implication of the result of Proposition 1: given a solution $\hat W$, tuning $\gamma$ explores only a one-dimensional subset (yellow) within the overall solution set (green).

A group-theoretic formalism is useful in precisely identifying $\mathcal{T}$. Since $O(d)$ is a subgroup of $GL(d)$, we are interested in those elements of $GL(d)$ that cannot be related by an orthogonal transformation. Such elements can be identified via the (right) cosets of $O(d)$ in $GL(d)$: equivalence classes $O(d)T = \{QT : Q \in O(d)\}$ for $T \in GL(d)$, known as orbits, under the equivalence relation $T_1 \sim T_2$ if there exists $Q \in O(d)$ such that $T_1 = Q T_2$. The set of orbits forms a partition of $GL(d)$: each nonsingular transformation $T$ is associated with its orbit $O(d)T$, elements of which are orthogonally equivalent.

From the definition of $\mathcal{T}$, we can represent $\mathcal{T}$ as $\mathcal{S} \setminus \{cI_d : c > 0\}$, where $\mathcal{S}$ represents what is left behind in $GL(d)$ once $O(d)$ has been 'removed', and $\setminus$ denotes the set difference.

Proposition 2.

The set $\mathcal{S}$ can be identified with the subgroup of upper triangular matrices within $GL(d)$ with positive diagonal entries.

Proof.

The proof is based on identifying a set that is in bijection with the orbits in $GL(d)$. Such a subset is known as a cross section of the cosets $\{O(d)T\}$, and intersects each orbit at a single point. Since $O(d)$ is a subgroup of $GL(d)$, no two members of a cross section belong to the same orbit. Thus $\mathcal{S}$ can be identified with any cross section of the orbits.

The map $\phi(T) = T^\top T$ is invariant to the action of $O(d)$ since $\phi(QT) = (QT)^\top QT = T^\top T = \phi(T)$. This implies that $\phi$ is constant within each orbit $O(d)T$. To show that $\phi$ is maximal invariant, we need to show that $\phi(T_1) = \phi(T_2)$ if and only if there is a $Q \in O(d)$ with $T_1 = Q T_2$. To see this, suppose that $T_1^\top T_1 = T_2^\top T_2$, and let $\{e_1, \ldots, e_d\}$ be a basis for $\mathbb{R}^d$. Let $u_i = T_1 e_i$ and $v_i = T_2 e_i$. Then $\langle u_i, u_j \rangle = \langle v_i, v_j \rangle$ for all $i, j$. There thus exists a linear isometry, say $Q$, such that $u_i = Q v_i$ for each $i$. This implies that $T_1 e_i = Q T_2 e_i$ for each $i$, and since $\{e_i\}$ is a basis for $\mathbb{R}^d$, $T_1 = Q T_2$ with $Q \in O(d)$. Thus the range of $\phi$ is in bijection with the orbits in $GL(d)$, and constitutes a cross section.

For any $T \in GL(d)$ consider its unique QR decomposition $T = QR$, where $Q \in O(d)$ and $R$ is upper triangular, the uniqueness made possible since $R$ is assumed to have positive diagonal elements. Clearly then $\phi(T) = R^\top R$, and its range can be identified with the set of upper triangular matrices with positive diagonal entries. ∎

Remark 2.

The result of Proposition 2 can be distilled to the existence of a unique QR decomposition of $T \in GL(d)$: $T = QR$, where $Q \in O(d)$ and $R$ is upper triangular with positive diagonal entries. There is no loss of generality in assuming that $R$ has positive entries along the diagonal, since this amounts to multiplying by another orthogonal matrix which changes signs accordingly. Thus the map $T \mapsto R$ uniquely identifies an element of $\mathcal{S}$.

The map $T \mapsto R$ is referred to as a maximal invariant function, and indexes the elements of $\mathcal{S}$, and hence $\mathcal{T}$. This offers verification of the fact that the dimension of $\mathcal{T}$ is $d(d+1)/2 - 1$, since it is one fewer than the dimension of the subgroup of upper triangular matrices with positive diagonal entries. Another way to arrive at the conclusion is to notice that any such upper triangular matrix $R$ can be represented as $R = (I + N)D$, where $I$ is the identity, $N$ is an upper triangular matrix with zeroes along the diagonal, and $D$ is a diagonal matrix with positive entries. The dimension of the set of $N$ is $d(d-1)/2$ and that of the set of $D$ is $d$, resulting in $d(d+1)/2$ as the dimension of the set of $R$.
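In practice the maximal invariant is obtained from a QR decomposition, as in the following sketch (random matrices, for illustration only): $T$ and $QT$ share the same upper triangular factor $R$ with positive diagonal, and the same value of $\phi(T) = T^\top T$.

```python
import numpy as np

def maximal_invariant(T):
    """Upper triangular R with positive diagonal from the QR decomposition of T."""
    Q, R = np.linalg.qr(T)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return np.diag(signs) @ R                  # flip row signs so that diag(R) > 0

rng = np.random.default_rng(3)
d = 4
T = rng.normal(size=(d, d))                    # a generic element of GL(d)
Q0, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a random orthogonal matrix

# T and Q0 T lie in the same O(d)-orbit, hence map to the same cross-section element ...
assert np.allclose(maximal_invariant(Q0 @ T), maximal_invariant(T))

# ... and phi(T) = T^T T is likewise constant on the orbit.
assert np.allclose((Q0 @ T).T @ (Q0 @ T), T.T @ T)
```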

4 Resolving the problem of non-identifiability

From the preceding discussion we gather that $\mathcal{T}$ comprises the transformations which preserve the set of solutions of (1) but do not leave $g$ invariant. We explore two resolutions: (i) imposing additional constraints on $W$ in (1) to identify solutions up to $O(d)$ (Theorem 1), and uniquely (Corollary 2); and (ii) considering $T$ as a free parameter. In (i) the identified solution is chosen in a way that is mathematically natural, but it need not necessarily be optimal with respect to $g$. In (ii), where $T$ is considered as a free parameter, it may be chosen to optimize performance in tasks, i.e., by optimising $g$ over $\mathcal{T}$.

4.1 Constraining the solution set

Redefine (1) as a constrained optimisation

\[ (\hat W, \hat C) = \arg\min_{W \in \mathcal{W},\, C} f(X; W, C), \tag{6} \]

over a subset $\mathcal{W}$ of possible values of $W$ which ensures that the only possible solutions are of the form $Q\hat W$, with $Q \in O(d)$, for any solution $\hat W$. The set of possible $C$ is unconstrained. From Proposition 2 and the QR decomposition of an element of $GL(d)$, this is tantamount to ensuring that $T\hat W$, for $T = QR \in GL(d)$, is a solution of (6) if and only if $R = I_d$, the identity matrix. The theorem below identifies such a set $\mathcal{W}$ for any solution of (1).

Theorem 1.

Let $\mathcal{W} = \{W \in \mathbb{R}^{d \times n} : WW^\top = I_d\}$. Then for any solution $\hat W$ to the constrained problem (6), any other solution of the form $T\hat W$ for $T \in GL(d)$ satisfies $g(T\hat W; \mathcal{D}) = g(\hat W; \mathcal{D})$ for a given test data set $\mathcal{D}$.

Proof.

Let $(\bar W, \bar C)$ be a solution to the unconstrained problem (1). The proof rests on the simultaneous diagonalisation of $\bar W \bar W^\top$ and $\bar C \bar C^\top$. Since $\bar W \bar W^\top$ is positive definite there exists a nonsingular $L$ such that $\bar W \bar W^\top = L L^\top$. Then $L^{-1} \bar C \bar C^\top L^{-\top}$ is symmetric, and there exists $Q \in O(d)$ such that $Q^\top L^{-1} \bar C \bar C^\top L^{-\top} Q = \Lambda$, where $\Lambda$ is diagonal. Setting $T = Q^\top L^{-1}$ results in $(T\bar W)(T\bar W)^\top = I_d$.

We thus arrive at the conclusion that there exists a $T \in GL(d)$ such that $T\bar W \in \mathcal{W}$, so the constrained problem (6) attains the unconstrained minimum. The elements of $\Lambda$ solve the generalised eigenvalue problem $\det(\bar C \bar C^\top - \lambda \bar W \bar W^\top) = 0$. Evidently, if $\hat W$ and $T\hat W$ are both solutions of (6) then $TT^\top = I_d$, so $T$ is orthogonal, and hence $g(T\hat W; \mathcal{D}) = g(\hat W; \mathcal{D})$. ∎
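As an illustration of the effect of such a constraint, the sketch below assumes, for concreteness, a row-orthonormality constraint $WW^\top = I_d$ of the kind discussed above: any element of the solution set can be mapped into the constrained set by the whitening $W \mapsto (WW^\top)^{-1/2}W$, and the constrained versions of $\hat W$ and $T\hat W$ are then related by an orthogonal matrix, so that a cosine-based $g$ scores them identically.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def constrain(W):
    """Map W to the element (W W^T)^{-1/2} W of its solution set, satisfying W W^T = I."""
    return inv_sqrt(W @ W.T) @ W

def cosine_matrix(W):
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    return Wn.T @ Wn

rng = np.random.default_rng(4)
d, n = 4, 12
W_hat = rng.normal(size=(d, n))            # stand-in solution of (1)
T = rng.normal(size=(d, d))                # an arbitrary element of GL(d)

W1 = constrain(W_hat)
W2 = constrain(T @ W_hat)                  # constrain a different element of the solution set

assert np.allclose(W1 @ W1.T, np.eye(d))   # both satisfy the constraint ...
assert np.allclose(W2 @ W2.T, np.eye(d))

M = W2 @ W1.T                              # ... and are related by an orthogonal matrix M,
assert np.allclose(M @ M.T, np.eye(d))     # so a cosine-based g agrees on W1 and W2.
assert np.allclose(cosine_matrix(W1), cosine_matrix(W2))
```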

An obvious but important corollary to the above theorem is that any two solutions of (6) are related through an orthogonal transformation (not necessarily unique).

Corollary 1.

For any solutions $\hat W_1$ and $\hat W_2$ of (6) there exists a $Q \in O(d)$ such that $\hat W_1 = Q \hat W_2$. In other words, $O(d)$ acts transitively on the set of solutions of (6).

Remark 3.

Optimisation over the constrained set results in a reduction of the invariance transformations of $f$ from $GL(d)$ to $O(d)$. This can be understood as taking $T\hat W$ for a fixed solution $\hat W$ and arbitrary $T \in GL(d)$, performing a Gram–Schmidt procedure to obtain $T = QR$ for a $Q \in O(d)$ and upper triangular $R$, and discarding $R$. Topologically then, the set of solutions of (6) is homotopy equivalent to the set of solutions of (1). This is because the inclusion $O(d) \hookrightarrow GL(d)$ is a homotopy equivalence, as it is well known that the Gram–Schmidt process $GL(d) \to O(d)$ is a (strong) deformation retraction.

A unique solution for $W$ can be identified by imposing additional constraints on $\mathcal{W}$ as follows.

Corollary 2.

Denote by $\mathcal{W}_0$ the set of all $(W, C)$ with $W \in \mathcal{W}$ which satisfy the following conditions: (i) the rows of $C$ are orthogonal; (ii) the diagonal elements of $CC^\top$ are arranged in descending order; (iii) the first non-zero element of each row of $C$ is positive. Then, any solution to the optimisation problem in (1) over the constrained set $\mathcal{W}_0$ is unique.

Proof.

We need to show that on the constrained space $\mathcal{W}_0$, the orthogonal $Q$ obtained by optimising (6) reduces to the identity.

On the set $\mathcal{W}$, from the proof of Theorem 1, we note that there exists a $Q \in O(d)$ such that the transformed $CC^\top$ equals a diagonal $\Lambda$ containing the eigenvalues of $\bar C \bar C^\top$ with respect to $\bar W \bar W^\top$, obtained from a solution of (1).

In addition to being orthogonal, condition (i) forces $Q$ to be a matrix with each column and row containing one non-zero element assuming values $\pm 1$. In other words, $Q$ is forced to be a monomial matrix with entries equal to $\pm 1$. This implies that the resulting diagonal matrix contains the same elements as $\Lambda$, but possibly in a different order. Condition (ii) then fixes a particular order, and condition (iii) ensures that each diagonal element of $Q$ is $+1$. We thus end up with $Q = I_d$. ∎

The idea to modify the optimisation so that the solution is unique up to transformations in $O(d)$, but not necessarily unique, is also used by Mu et al. (2019). Rather than place constraints on $W$, as above, they modified the objective $f$ to include Frobenius norm penalties on $W$ and $C$, which achieves the same outcome, although the relationship between the solutions of the penalised and unpenalised problems is not transparent.

4.1.1 Exploiting symmetry of $X$

If the corpus representation $X$ is a symmetric matrix, for example involving counts of word-word co-occurrences, then the columns of $C$ and the columns of $W$ both have the same interpretation as word embeddings. In such cases the symmetry motivates the imposition $W = C$. For example, in LSA (3) and its solution (5), this is achieved by taking $\hat W = \hat C = \Sigma_d^{1/2} V_d^\top$, since $U_d = V_d$ owing to the symmetry (assuming the leading eigenvalues are positive). This identifies a solution up to sign changes and permutations of the coordinates of the word vectors, transformations which are contained within $O(d)$ and hence are of no consequence to $g$.

In GloVe, Pennington et al. (2014) observe that when $X$ is symmetric, $W$ and $C$ are equivalent but differ in practice "as a result of their random initializations". It seems likely that different runs involve the optimisation routine converging to different elements of the solution set, and not in general to solutions with $W = C$. For a given run, Pennington et al. seek to treat solutions $\hat W$ and $\hat C$ symmetrically by taking the word embedding to be $(\hat W + \hat C)/2$, which is not itself in general optimal with respect to the GloVe objective function $f$ (although they report that using it over $\hat W$ typically confers a small performance advantage). A different approach is to take the embedding to be $T\hat W$, where $T$ is chosen so that $T\hat W = T^{-\top}\hat C$, which more directly identifies an element of the solution set for which $W = C$, and hence avoids taking the final embedding to be one that is non-optimal with respect to the criterion $f$. The same strategy is also appropriate for other word embedding models, e.g. word2vec.

4.2 Optimizing over $\mathcal{T}$

To what extent can we optimize word-task performance by choosing an appropriate element of the solution set (4)? The set of transformations $\mathcal{T}$ has dimension $d(d+1)/2 - 1$, typically much larger than the number of cases in $\mathcal{D}$, so care is needed to avoid overfitting. In particular, if the embeddings generated are to be regarded as a predictive model, then it is necessary to use cross-validation rather than just optimising the embeddings with respect to a particular test set. One approach is to restrict the dimension of the optimisation, for example as earlier by considering solutions $D^\gamma \hat W$ for a particular solution $\hat W$ and diagonal matrix $D$. A widely used approach corresponds to choosing $D = \Sigma_d$, the matrix containing the $d$ dominant singular values of $X$; Figure 2 shows how $g$ varies with $\gamma$ for this and some other choices of $D$ chosen quite arbitrarily (details in the caption). There is clearly substantial variability in $g$ with $\gamma$, but performance with $D = \Sigma_d$ is only on a par with the other arbitrary choices.
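A minimal version of this one-dimensional search, with a synthetic count matrix and placeholder "human" scores standing in for a real test set, is sketched below; it simply sweeps $\gamma$ and records the Spearman-based score of $D^\gamma \hat W$ for $D = \Sigma_d$ and for an arbitrary alternative diagonal $D$, mirroring the mechanics behind Figure 2.

```python
import numpy as np
from scipy.stats import spearmanr

def score(W, pairs, human):
    """Spearman correlation between cosine similarities of column pairs and human scores."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    model = [Wn[:, i] @ Wn[:, j] for i, j in pairs]
    rho, _ = spearmanr(model, human)
    return rho

rng = np.random.default_rng(5)
m, n, d = 30, 50, 10
X = rng.poisson(1.5, size=(m, n)).astype(float)      # synthetic context-word counts

U, s, Vt = np.linalg.svd(X, full_matrices=False)
W_hat = np.diag(s[:d]) @ Vt[:d, :]                    # base ("simple SVD") embedding

pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]
human = rng.uniform(0, 10, size=len(pairs))           # placeholder "human" judgements

d_sigma = s[:d]                                       # the usual choice: singular values
d_other = rng.uniform(0.5, 2.0, size=d)               # an arbitrary alternative diagonal

for gamma in np.linspace(-1.0, 1.0, 5):
    s1 = score(np.diag(d_sigma ** gamma) @ W_hat, pairs, human)
    s2 = score(np.diag(d_other ** gamma) @ W_hat, pairs, human)
    print(f"gamma={gamma:+.2f}   Sigma_d^gamma: {s1:.3f}   D^gamma: {s2:.3f}")
```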

Figure 3 shows the distribution of $g(T\hat W; \mathcal{D})$, where $\hat W$ is a GloVe embedding, $T$ is a random element of $\mathcal{T}$ which is either upper triangular or diagonal with its non-zero elements drawn independently at random, and $g$ measures the performance of the embeddings on two similarity test sets (more details in the caption). The histograms show substantial variance in the scores for different $T$. The score for the base embedding $\hat W$ is at the higher end of the distribution, though for some instances of random $T$ the performance of $T\hat W$ is superior. It is also noticeable that there is a much greater range of scores when $T$ is sampled from the set of diagonal matrices than when it is sampled from the set of upper triangular matrices. We hypothesize that this is because when $T$ is diagonal, there is a possibility of very small elements on the diagonal which will essentially wipe out whole rows of $\hat W$, which could have a significant impact on the results.

Table 1 shows scores that result from optimising $g(T\hat W; \mathcal{D})$ with respect to the elements $T$ of $\mathcal{T}$, using R's optim implementation of the Nelder-Mead method, where $\hat W$ are embeddings generated using GloVe and word2vec. The results show that there exists a transformed embedding that performs substantially better than the base embedding. In practice, in order to use this optimization method to generate embeddings, it would be necessary to use cross-validation, as embeddings which achieve optimal performance with respect to one test set may do less well on others. Our aim here is merely to point out that it is possible to improve the test scores by optimizing over elements of $\mathcal{T}$.
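The optimisation reported in Table 1 was carried out with R's optim; the following is an analogous, purely illustrative Python sketch that parameterises an upper triangular $T$ by its $d(d+1)/2$ free entries and maximises a Spearman-based score of $T\hat W$ with Nelder-Mead on synthetic data. As emphasised above, with real test sets such a search should be cross-validated rather than tuned to a single test set.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

rng = np.random.default_rng(6)
d, n = 5, 40
W_hat = rng.normal(size=(d, n))                    # stand-in base embedding
pairs = [(i, j) for i in range(12) for j in range(i + 1, 12)]
human = rng.uniform(0, 10, size=len(pairs))        # placeholder judgements
iu = np.triu_indices(d)                            # positions of the free entries of T

def score(W):
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    model = [Wn[:, i] @ Wn[:, j] for i, j in pairs]
    rho, _ = spearmanr(model, human)
    return rho

def neg_score(theta):
    T = np.zeros((d, d))
    T[iu] = theta                                  # upper triangular T built from theta
    return -score(T @ W_hat)

theta0 = np.eye(d)[iu]                             # start from T = I, i.e. the base embedding
res = minimize(neg_score, theta0, method="Nelder-Mead",
               options={"maxiter": 2000, "xatol": 1e-4, "fatol": 1e-4})

print("base score:     ", score(W_hat))
print("optimised score:", -res.fun)
```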

Figure 2: Plots showing word-task evaluation scores corresponding to the WordSim-353 task (Finkelstein et al., 2002) (located at http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/), which provides a set of word pairs with human-assigned similarity scores. The embeddings are evaluated by calculating the cosine similarities between the word pairs and using either Pearson or Spearman correlation (each invariant to orthogonal and scale transformations of the embedding) to score correspondence between embedding and human-assigned similarity values. The embedding is from model (3), with $X$ taken to be a document-term matrix computed from the Corpus of Historical American English (Davies, 2012), and the plotted lines show how performance varies with different elements $D^\gamma \hat W$ of the solution set, for $\gamma$ as indicated and for $D = \Sigma_d$ (red lines) together with three other arbitrary diagonal choices of $D$ (green, blue and purple). Performance for $D = \Sigma_d$, which is widely used, is not obviously superior to performance of the other completely arbitrary choices for $D$.

Figure 3: For the same type of task as in Fig. 2, histograms of Spearman correlation scores for embeddings $T\hat W$, where $\hat W$ is a GloVe embedding trained on the Wikipedia 2014 + Gigaword 5 corpus, evaluated on the WordSim-353 test set in (a) and (b), and on the SimLex-999 test set (Hill et al., 2015) in (c) and (d). $T$ is a random matrix, taken to be diagonal in (a) and (c) and upper triangular in (b) and (d), in each case with the non-zero elements drawn independently at random. The red line on each graph shows the score for the original embedding $\hat W$. Source of the GloVe embedding: https://nlp.stanford.edu/projects/glove/

Figure 4: Histograms showing the performance of word2vec embeddings $T\hat W$, where $\hat W$ was trained on the 100-billion-word Google News corpus (downloaded from https://code.google.com/archive/p/word2vec). As for Figure 3, the test set used is the WordSim-353 test set in (a) and (b), and SimLex-999 in (c) and (d), with the test score calculated using the Spearman correlation coefficient. In graphs (a) and (c) $T$ is sampled from the set of diagonal matrices, and in (b) and (d) it is taken to be upper triangular.
Test set      Embeddings                                              Spearman   Pearson
WordSim-353   GloVe vectors reported in (Pennington et al., 2014)     0.658      –
              GloVe embedding                                         0.601      0.603
              GloVe embedding with Equation (6) imposed               0.641      0.637
              GloVe embedding optimized over $\mathcal{T}$            0.679      0.760
              word2vec embedding                                      0.700      0.652
              word2vec embedding with Equation (6) imposed            0.645      0.588
              word2vec embedding optimized over $\mathcal{T}$         0.797      0.838
SimLex-999    GloVe embedding                                         0.371      0.389
              GloVe embedding with Equation (6) imposed               0.402      0.421
              GloVe embedding optimized over $\mathcal{T}$            0.560      0.582
              word2vec embedding                                      0.441      0.453
              word2vec embedding with Equation (6) imposed            0.475      0.480
              word2vec embedding optimized over $\mathcal{T}$         0.583      0.617

Table 1: Evaluation task scores corresponding to the WordSim-353 (Finkelstein et al., 2002) and SimLex-999 (Hill et al., 2015) test sets. The base GloVe embedding is as described in the caption of Figure 3; the word2vec embedding is as described in the caption of Figure 4.

In the first row we note for reference the performance reported in Pennington et al. (2014). The results indicate substantial scope for improving performance scores via an appropriate choice of $T$.

5 Conclusions

We summarise our conclusions as follows.

  1. Examining word embeddings (including LSA, word2vec and GloVe) through the relationship with low-rank matrix factorisations with respect to a criterion $f$ makes it clear that the solution is non-identifiable: for a particular solution $\hat W$, $T\hat W$ for any $T \in GL(d)$ is also a solution. Different elements of the $d^2$-dimensional solution set perform differently in evaluations, $g$, of word-task performance.

  2. An important implication is that the disparity in performance between word embeddings on tasks may be due to the particular elements selected from the respective solution sets. In word embeddings for which $f$ is optimized numerically with some randomness, for example in the initializations, the optimisation may converge to different elements of the solution set. An embedding chosen based on the best performance in $g$ over repeated runs of the optimisation can essentially be viewed as a Monte Carlo optimisation over the solution set.

  3. The evaluation function $g$ is usually invariant only to orthogonal ($O(d)$) and scale-type ($\{cI_d\}$) transformations. Thus for an embedding dimension $d$, the effective dimension of the solution set, after accounting for the orthogonal transformations and scaled versions of the identity, is $d(d+1)/2 - 1$. Conclusions from evaluations with large $d$ must hence be interpreted with some care, especially if the embedding is optimized with respect to the incompatible transformations, directly or indirectly, for example as in point 2 above.

  4. These considerations have a bearing on the interpretation of the performance of the popular embedding approach of taking $D^\gamma \hat W$, where $\gamma$ is a tuning parameter and $D$ is a diagonal matrix taken, for example, to contain the singular values of $X$. This amounts to providing a way to perform a search over a one-dimensional subset of the $d^2$-dimensional solution set. Our numerical results suggest there is nothing special about this particular choice of $D$ (or the corresponding one-dimensional subset being searched over), nor is there a clear rationale for restricting to a one-dimensional subset.

Acknowledgments

The authors gratefully acknowledge support for this work from grants NSF DMS 1613054 and NIH RO1 CA214955 (KB), a Bloomberg Data Science Research Grant (KB & SP), and an EPSRC PhD studentship (RC).

References

  • J. A. Bullinaria and J. P. Levy (2012) Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods 44(3), pp. 890–907.
  • M. Davies (2012) Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora 7(2), pp. 121–157.
  • S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), pp. 391–407.
  • C. Eckart and G. Young (1936) The approximation of one matrix by another of lower rank. Psychometrika 1(3), pp. 211–218.
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2002) Placing search in context: the concept revisited. ACM Transactions on Information Systems 20(1), pp. 116–131.
  • F. Hill, R. Reichart, and A. Korhonen (2015) SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4), pp. 665–695.
  • O. Levy, Y. Goldberg, and I. Dagan (2015) Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • C. Mu, G. Yang, and Z. Yan (2019) Revisiting skip-gram negative sampling model with rectification. arXiv preprint arXiv:1804.00306v2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543.
  • P. D. Turney (2013) Distributional semantics beyond words: supervised learning of analogy and paraphrase. Transactions of the Association for Computational Linguistics 1, pp. 353–366.
  • Z. Yin and Y. Shen (2018) On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pp. 887–898.