Semantics- and Syntax-related Subvectors in the Skip-gram Embeddings

12/23/2019 ∙ by Maxat Tezekbayev, et al.

We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word.


Introduction

Assuming that words have already been converted into indices, let $\mathcal{W}$ be a finite vocabulary of words. Following the setup of the widely used word2vec model [2], we consider two vectors per each word $w \in \mathcal{W}$:

  • $\mathbf{w} \in \mathbb{R}^d$ is an embedding of the word $w$ when $w$ is a center word,

  • $\mathbf{c}_w \in \mathbb{R}^d$ is an embedding of the word $w$ when $w$ is a context word.

We follow the assumptions of Assylbekov and Takhanov [1] on the nature of word vectors, context vectors, and text generation, i.e.

  1. A priori word vectors are i.i.d. draws from an isotropic multivariate Gaussian distribution: $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, where $\mathbf{I}$ is the $d \times d$ identity matrix.

  2. Context vectors are related to word vectors according to $\mathbf{c}_w = \mathbf{Q}\mathbf{w}$, $\forall w \in \mathcal{W}$, for some orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{d \times d}$.

  3. Given a word $w$, the probability of any word $v$ being in its context is given by

     $p(v \mid w) \propto p(v)\, e^{\langle \mathbf{w},\, \mathbf{c}_v \rangle}$,   (1)

     where $p(v)$ is the unigram probability for the word $v$ (a toy sketch of this generative model follows the list).
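To make these assumptions concrete, here is a minimal NumPy sketch that implements the generative model literally; the vocabulary size, dimensionality, prior scale, and unigram distribution below are arbitrary illustrative choices, not values used in the experiments later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 1000, 200, 1.0                   # vocabulary size, dimension, prior scale (illustrative)

# Assumption 1: word vectors are i.i.d. draws from an isotropic Gaussian N(0, sigma^2 I).
W = rng.normal(0.0, sigma, size=(n, d))        # row v = word vector of word v

# Assumption 2: context vectors are obtained from word vectors by a single orthogonal map Q.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix
C = W @ Q.T                                    # row v = context vector c_v = Q w_v

# Assumption 3: p(v | w) is proportional to p(v) * exp(<w, c_v>), with p(v) the unigram probability.
p_unigram = rng.dirichlet(np.ones(n))          # a stand-in unigram distribution

def context_distribution(w: int) -> np.ndarray:
    """Distribution over context words v for a given center word w (computed in log space)."""
    log_scores = np.log(p_unigram) + C @ W[w]  # log p(v) + <w, c_v>
    log_scores -= log_scores.max()             # stabilize before exponentiation
    probs = np.exp(log_scores)
    return probs / probs.sum()

print(context_distribution(0)[:5])             # probabilities of words 0..4 being in the context of word 0
```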

Figure 1: $\mathbf{x}$- and $\mathbf{y}$-embeddings

Hypothesis. Under the assumptions 1–3 above, Assylbekov and Takhanov [1] showed that each word’s vector $\mathbf{w}$ splits into two approximately equally-sized subvectors $\mathbf{x}_w$ and $\mathbf{y}_w$, and the model (1) for generating a word $v$ in the context of a word $w$ can be rewritten as

$p(v \mid w) \propto p(v)\, e^{\langle \mathbf{x}_w,\, \mathbf{x}_v \rangle - \langle \mathbf{y}_w,\, \mathbf{y}_v \rangle}$.

Interestingly, embeddings of the first type ($\mathbf{x}_w$ and $\mathbf{x}_v$) are responsible for pulling the word $v$ into the context of the word $w$, while embeddings of the second type ($\mathbf{y}_w$ and $\mathbf{y}_v$) are responsible for pushing the word $v$ away from the context of the word $w$. We hypothesize that the $\mathbf{x}$-embeddings are more related to semantics, whereas the $\mathbf{y}$-embeddings are more related to syntax. In what follows we provide a motivating example for this hypothesis and then empirically validate it through controlled experiments.
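A quick numerical check of this rewriting, assuming (as in the result of [1]) that the orthogonal matrix can be taken to be a reflection in half the dimensions, i.e. $\mathbf{Q} = \operatorname{diag}(\mathbf{I}, -\mathbf{I})$ up to a change of basis; the dimension below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 200                                   # embedding dimension (illustrative)
half = d // 2

# Reflection in half the dimensions: +1 on the first half of coordinates, -1 on the second half.
Q = np.diag(np.concatenate([np.ones(half), -np.ones(half)]))

w_center = rng.normal(size=d)             # word vector of the center word w
w_context = rng.normal(size=d)            # word vector of the context word v
c_context = Q @ w_context                 # context vector c_v = Q w_v (assumption 2)

# Split each word vector into x (first half) and y (second half).
x_w, y_w = w_center[:half], w_center[half:]
x_v, y_v = w_context[:half], w_context[half:]

# <w, c_v> equals <x_w, x_v> - <y_w, y_v>, the exponent of the rewritten model.
assert np.isclose(w_center @ c_context, x_w @ x_v - y_w @ y_v)
```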

Data     Embeddings   Size   WordSim   MEN    M. Turk   Rare Words   Google   MSR
text8    w (full)     200    .646      .650   .636      .063         .305     .319
         Only x       100    .703      .693   .673      .149         .348     .213
         Only y       100    .310      .102   .193      .019         .032     .128
enwik9   w (full)     200    .664      .697   .616      .216         .518     .423
         Only x       100    .714      .729   .652      .256         .545     .303
         Only y       100    .320      .188   .196      .091         .096     .251

Table 1: Evaluation of word vectors and subvectors on the analogy tasks (Google and MSR) and on the similarity tasks: WordSim (Finkelstein et al.), MEN (Bruni et al.), M. Turk (Radinsky et al.), and Rare Words (Luong, Socher, and Manning). For word similarities the evaluation metric is Spearman's correlation with the human ratings, while for word analogies it is the percentage of correct answers. Sizes are the dimensionalities of the vectors and subvectors.

Motivating Example

Consider the phrase

the dog barking at strangers

The word ‘barking’ appears in the context of the word ‘dog’, but the word vector $\mathbf{w}_{\text{barking}}$ is not the closest to the word vector $\mathbf{w}_{\text{dog}}$ (see Table 2). Instead, these vectors are split into $\mathbf{x}$- and $\mathbf{y}$-parts in such a way that the quantity $\langle \mathbf{x}_{\text{dog}}, \mathbf{x}_{\text{barking}} \rangle - \langle \mathbf{y}_{\text{dog}}, \mathbf{y}_{\text{barking}} \rangle$ is large enough. We can interpret this as follows: the word ‘barking’ is semantically close enough to the word ‘dog’ but is not the closest one, e.g. $\mathbf{x}_{\text{puppy}}$ is much closer to $\mathbf{x}_{\text{dog}}$ than $\mathbf{x}_{\text{barking}}$ is; on the other hand, the word ‘barking’ syntactically fits better next to the word ‘dog’ than ‘puppy’ does, i.e. $\langle \mathbf{y}_{\text{dog}}, \mathbf{y}_{\text{barking}} \rangle < \langle \mathbf{y}_{\text{dog}}, \mathbf{y}_{\text{puppy}} \rangle$.

Table 2: Dot products between vectors, for the words ‘puppy’ and ‘barking’.

This combination of semantic proximity (a large enough $\langle \mathbf{x}_{\text{dog}}, \mathbf{x}_{\text{barking}} \rangle$) and syntactic fit (a small $\langle \mathbf{y}_{\text{dog}}, \mathbf{y}_{\text{barking}} \rangle$) allows the word ‘barking’ to appear in the context of the word ‘dog’.
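The same intuition can be written out in code, using tiny hand-made 4-dimensional vectors chosen purely for illustration (they are not taken from trained embeddings); the first half of each vector plays the role of $\mathbf{x}$, the second half the role of $\mathbf{y}$.

```python
import numpy as np

# Toy vectors: first two coordinates = x (semantic) part, last two = y (syntactic) part.
emb = {
    "dog":     np.array([1.0, 0.8,  0.9, 0.1]),
    "puppy":   np.array([1.0, 0.7,  0.8, 0.2]),
    "barking": np.array([0.6, 0.5, -0.5, 0.3]),
}

def split(v):
    half = len(v) // 2
    return v[:half], v[half:]                                   # (x part, y part)

def semantic(w, v):
    return split(emb[w])[0] @ split(emb[v])[0]                  # <x_w, x_v>

def syntactic(w, v):
    return split(emb[w])[1] @ split(emb[v])[1]                  # <y_w, y_v>

def score(w, v):
    return semantic(w, v) - syntactic(w, v)                     # exponent of the rewritten model

print(semantic("dog", "puppy") > semantic("dog", "barking"))    # True: 'puppy' is semantically closer to 'dog'
print(syntactic("dog", "barking") < syntactic("dog", "puppy"))  # True: 'barking' fits better syntactically
print(score("dog", "barking"), score("dog", "puppy"))           # the total score still favors 'barking'
```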

Experiments

In this section we empirically verify our hypothesis. We train SGNS with tied weights [1] on two widely used datasets, text8 and enwik9 (http://mattmahoney.net/dc/textdata.html; the enwik9 data was processed with the Perl script wikifil.pl provided on the same webpage), which gives us word embeddings as well as their partitions into subvectors, $\mathbf{w} = [\mathbf{x}; \mathbf{y}]$.

The source code that reproduces our experiments is available at https://github.com/MaxatTezekbayev/Semantics–and-Syntax-related-Subvectors-in-the-Skip-gram-Embeddings.
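For reference, the splitting step itself is trivial once the embeddings are trained. The sketch below assumes the embeddings were saved in the standard word2vec text format (one word and its vector per line, optionally preceded by a header line); the file name is a placeholder, not a file from the repository above.

```python
import numpy as np

def load_and_split(path):
    """Load word2vec-format text embeddings and split each vector into its x and y halves."""
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) == 2:                    # "vocab_size dim" header line, if present
                continue
            words.append(parts[0])
            vectors.append([float(t) for t in parts[1:]])
    W = np.array(vectors)                          # shape (vocab, d); here d = 200
    half = W.shape[1] // 2
    return words, W, W[:, :half], W[:, half:]      # full vectors, x subvectors, y subvectors

words, W, X, Y = load_and_split("vectors_text8.txt")   # placeholder file name
```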

$\mathbf{x}$-Subvectors Are Related to Semantics

We evaluate the whole vectors $\mathbf{w}$, as well as the subvectors $\mathbf{x}$ and $\mathbf{y}$, on standard semantic tasks: word similarity and word analogy. We used the hyperwords tool of Levy, Goldberg, and Dagan (2015), and we refer the reader to their paper for the methodology of evaluation. The results of the evaluation are provided in Table 1. As one can see, the $\mathbf{x}$-subvectors outperform the whole $\mathbf{w}$-vectors in the similarity tasks and show competitive performance in the analogy tasks, whereas the $\mathbf{y}$-subvectors demonstrate poor performance in these tasks. This shows that the $\mathbf{x}$-subvectors carry more semantic information than the $\mathbf{y}$-subvectors.
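The word-similarity part of this evaluation boils down to a Spearman correlation between human ratings and cosine similarities of the embeddings. The sketch below is not the hyperwords implementation; it reuses the `words`, `W`, `X`, `Y` variables from the previous sketch and assumes a whitespace-separated `word1 word2 rating` file, both of which are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

word2row = {w: i for i, w in enumerate(words)}     # `words` from the previous sketch

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(pairs_path, E):
    """Spearman correlation between human ratings and cosine similarities under embeddings E."""
    human, model = [], []
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, rating = line.split()
            if w1 in word2row and w2 in word2row:
                human.append(float(rating))
                model.append(cosine(E[word2row[w1]], E[word2row[w2]]))
    return spearmanr(human, model).correlation

# Compare the full vectors with the x and y subvectors on, e.g., WordSim-353.
for name, E in [("full w", W), ("only x", X), ("only y", Y)]:
    print(name, evaluate_similarity("wordsim353.txt", E))      # placeholder file name
```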

$\mathbf{y}$-Subvectors Are Related to Syntax

We train a softmax regression by feeding in the embedding of the current word to predict the part-of-speech (POS) tag of the next word:

$\Pr(t_{i+1} \mid w_i) = \operatorname{softmax}(\mathbf{A}\, \mathbf{e}_{w_i} + \mathbf{b})$,

where $t_{i+1}$ is the POS tag of the word following $w_i$, $\mathbf{e}_{w_i}$ is the embedding being evaluated ($\mathbf{w}$, $\mathbf{x}$, or $\mathbf{y}$), and $\mathbf{A}$, $\mathbf{b}$ are trainable parameters.

We evaluate the whole vectors $\mathbf{w}$ and the subvectors $\mathbf{x}$ and $\mathbf{y}$ on tagging the Brown corpus with the Universal POS tags. The resulting accuracies are provided in Table 3.
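A sketch of such a probe, using NLTK's Brown corpus with the universal tagset and scikit-learn's multinomial logistic regression as the softmax regression; it follows the description above but is not claimed to be the exact training setup, and `word2row`/`E` are the lookup and embedding matrix from the previous sketches (E being whichever of W, X, Y is probed).

```python
import numpy as np
from nltk.corpus import brown                # requires nltk.download('brown') and nltk.download('universal_tagset')
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Build (embedding of the current word, POS tag of the next word) pairs.
features, labels = [], []
for sent in brown.tagged_sents(tagset="universal"):
    for (word, _), (_, next_tag) in zip(sent, sent[1:]):
        w = word.lower()
        if w in word2row:                    # word2row and E come from the previous sketches
            features.append(E[word2row[w]])
            labels.append(next_tag)

X_tr, X_te, y_tr, y_te = train_test_split(np.array(features), labels,
                                          test_size=0.2, random_state=0)

# Softmax regression: one linear layer followed by a softmax over the universal POS tags.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```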

Embeddings   Size   Trained on text8   Trained on enwik9
w (full)     200    .445               .453
Only x       100    .381               .384
Only y       100    .426               .451
Table 3: Accuracies on a simplified POS-tagging task.

We can see that the $\mathbf{y}$-subvectors are more suitable for POS-tagging than the $\mathbf{x}$-subvectors, which means that the $\mathbf{y}$-parts carry more syntactic information than the $\mathbf{x}$-parts.

Conclusion

Theoretical analysis of word embeddings gives us a better understanding of their properties. Moreover, theory may provide us with interesting hypotheses on the nature and structure of word embeddings, and such hypotheses can be verified empirically, as is done in this paper.

Acknowledgements

This work is supported by the Nazarbayev University Collaborative Research Program 091019CRP2109, and by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan, IRN AP05133700.

References

  • [1] Z. Assylbekov and R. Takhanov (2019) Context vectors are reflections of word vectors in half the dimensions. Journal of Artificial Intelligence Research 66, pp. 225–242.
  • [2] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, pp. 3111–3119.