 # Semantics- and Syntax-related Subvectors in the Skip-gram Embeddings

We show that the skip-gram embedding of any word can be decomposed into two subvectors which roughly correspond to semantic and syntactic roles of the word.


## Introduction

Assuming that words have already been converted into indices, let $\mathcal{V} = \{1, \ldots, |\mathcal{V}|\}$ be a finite vocabulary of words. Following the setup of the widely used word2vec model (Mikolov et al., 2013), we consider two vectors per each word $i \in \mathcal{V}$:

• $\mathbf{w}_i \in \mathbb{R}^d$ is an embedding of the word $i$ when $i$ is a center word,

• $\mathbf{c}_i \in \mathbb{R}^d$ is an embedding of the word $i$ when $i$ is a context word.

We follow the assumptions of Assylbekov and Takhanov (2019) on the nature of word vectors, context vectors, and text generation, i.e.

1. A priori word vectors $\mathbf{w}_1, \ldots, \mathbf{w}_{|\mathcal{V}|}$ are i.i.d. draws from an isotropic multivariate Gaussian distribution: $\mathbf{w}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{I}$ is the identity matrix.

2. Context vectors are related to word vectors according to $\mathbf{c}_i = \mathbf{Q}\mathbf{w}_i$, $i \in \mathcal{V}$, for some orthogonal matrix $\mathbf{Q}$.

3. Given a word $j \in \mathcal{V}$, the probability of any word $i \in \mathcal{V}$ being in its context is given by

$$p(i \mid j) \propto p_i \cdot e^{\mathbf{w}_j^\top \mathbf{c}_i} \qquad (1)$$

where $p_i$ is the unigram probability of the word $i$.
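As an illustration (not part of the paper), the generative model (1) can be simulated with toy vectors; the vocabulary size, dimension, and uniform unigram distribution below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 5, 4                       # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))       # word (center) vectors, rows w_i
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]  # a random orthogonal matrix
C = W @ Q.T                       # context vectors c_i = Q w_i
p_uni = np.full(V, 1.0 / V)       # toy unigram probabilities

def context_distribution(j):
    """p(i|j) proportional to p_i * exp(w_j . c_i), normalized over the vocabulary."""
    scores = p_uni * np.exp(W[j] @ C.T)
    return scores / scores.sum()

probs = context_distribution(2)
```

The orthogonality of $\mathbf{Q}$ is obtained here from a QR decomposition of a random matrix, one standard way to draw an orthogonal matrix.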

**Hypothesis.** Under assumptions 1–3 above, Assylbekov and Takhanov (2019) showed that each word's vector $\mathbf{w}_i$ splits into two approximately equally sized subvectors $\mathbf{x}_i$ and $\mathbf{y}_i$, and the model (1) for generating a word $i$ in the context of a word $j$ can be rewritten as

$$p(i \mid j) \approx p_i \cdot e^{\mathbf{x}_j^\top \mathbf{x}_i - \mathbf{y}_j^\top \mathbf{y}_i}.$$

Interestingly, embeddings of the first type ($\mathbf{x}_i$ and $\mathbf{x}_j$) are responsible for pulling the word $i$ into the context of the word $j$, while embeddings of the second type ($\mathbf{y}_i$ and $\mathbf{y}_j$) are responsible for pushing the word $i$ away from the context of the word $j$. We hypothesize that the $\mathbf{x}$-embeddings are more related to semantics, whereas the $\mathbf{y}$-embeddings are more related to syntax. In what follows we provide a motivating example for this hypothesis and then empirically validate it through controlled experiments.
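A minimal sketch of where the pull/push split comes from: if, for illustration, the orthogonal matrix $\mathbf{Q}$ is taken to be a reflection that flips the sign of the second half of the coordinates, then the inner product $\mathbf{w}_j^\top \mathbf{c}_i$ in model (1) decomposes exactly into the two terms:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                 # even embedding dimension
w_j = rng.normal(size=d)
w_i = rng.normal(size=d)

# Split each vector into two equally sized halves: w = [x; y].
x_j, y_j = w_j[: d // 2], w_j[d // 2 :]
x_i, y_i = w_i[: d // 2], w_i[d // 2 :]

# If the context vector is the reflection c_i = [x_i; -y_i],
# then w_j . c_i equals x_j . x_i - y_j . y_i exactly.
c_i = np.concatenate([x_i, -y_i])
lhs = w_j @ c_i
rhs = x_j @ x_i - y_j @ y_i
```

The positive term pulls $i$ toward the context of $j$ and the negative term pushes it away, matching the interpretation above.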

## Motivating Example

Consider a phrase

 the dog barking at strangers

The word ‘barking’ appears in the context of the word ‘dog’, but the word vector $\mathbf{w}_{\text{barking}}$ is not the closest to the word vector $\mathbf{w}_{\text{dog}}$ (see Table 2). Instead, these vectors are split

$$\mathbf{w}_{\text{dog}}^\top = [\mathbf{x}_{\text{dog}}^\top; \mathbf{y}_{\text{dog}}^\top], \qquad \mathbf{w}_{\text{barking}}^\top = [\mathbf{x}_{\text{barking}}^\top; \mathbf{y}_{\text{barking}}^\top]$$

in such a way that the quantity $\mathbf{x}_{\text{dog}}^\top \mathbf{x}_{\text{barking}} - \mathbf{y}_{\text{dog}}^\top \mathbf{y}_{\text{barking}}$ is large enough. We can interpret this as follows: the word ‘barking’ is semantically close enough to the word ‘dog’ but is not the closest one, e.g. $\mathbf{x}_{\text{puppy}}$ is much closer to $\mathbf{x}_{\text{dog}}$ than $\mathbf{x}_{\text{barking}}$ is; on the other hand, the word ‘barking’ syntactically fits being next to the word ‘dog’ better than ‘puppy’ does, i.e. $\mathbf{y}_{\text{dog}}^\top \mathbf{y}_{\text{barking}} < \mathbf{y}_{\text{dog}}^\top \mathbf{y}_{\text{puppy}}$.

This combination of semantic proximity ($\mathbf{x}_{\text{dog}}^\top \mathbf{x}_{\text{barking}}$) and syntactic fit ($-\mathbf{y}_{\text{dog}}^\top \mathbf{y}_{\text{barking}}$) allows the word ‘barking’ to appear in the context of the word ‘dog’.
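The made-up dot products below (purely illustrative numbers, not measured values from Table 2) sketch how a smaller syntactic penalty can outweigh lower semantic proximity under the score $\mathbf{x}^\top\mathbf{x} - \mathbf{y}^\top\mathbf{y}$:

```python
# Hypothetical subvector dot products with 'dog' (made-up numbers):
x_dot = {"puppy": 9.0, "barking": 6.0}   # semantic proximity  x_dog . x_word
y_dot = {"puppy": 7.0, "barking": 1.0}   # syntactic mismatch  y_dog . y_word

# Overall log-score x.x - y.y: 'barking' wins despite being semantically
# farther from 'dog', because its syntactic penalty is much smaller.
score = {w: x_dot[w] - y_dot[w] for w in x_dot}
```

Here `score["barking"]` = 5.0 exceeds `score["puppy"]` = 2.0, mirroring the interpretation of the example.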

## Experiments

In this section we empirically verify our hypothesis. We train SGNS with tied weights on two widely used datasets, text8 and enwik9 (http://mattmahoney.net/dc/textdata.html; the enwik9 data was processed with the Perl script wikifil.pl provided on the same webpage), which gives us word embeddings as well as their partitions:

$$\mathbf{w}_i^\top := [\mathbf{x}_i^\top; \mathbf{y}_i^\top].$$
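Given any trained embedding matrix, the partition itself is a simple slicing operation; the random matrix below is only a stand-in for actual SGNS output:

```python
import numpy as np

# Stand-in for a trained SGNS embedding matrix (rows are w_i).
rng = np.random.default_rng(2)
V, d = 100, 50
W = rng.normal(size=(V, d))

# Partition each embedding into its x- and y-subvectors (first/second half).
X, Y = W[:, : d // 2], W[:, d // 2 :]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity of two words under whole vectors vs. each subvector type.
sims = {"w": cosine(W[0], W[1]), "x": cosine(X[0], X[1]), "y": cosine(Y[0], Y[1])}
```

In the experiments below, the same cosine-similarity comparison is what distinguishes the semantic behavior of the $\mathbf{x}$-parts from the syntactic behavior of the $\mathbf{y}$-parts.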

The source code that reproduces our experiments is available at https://github.com/MaxatTezekbayev/Semantics–and-Syntax-related-Subvectors-in-the-Skip-gram-Embeddings.

### x-Subvectors Are Related to Semantics

We evaluate the whole vectors $\mathbf{w}_i$, as well as the subvectors $\mathbf{x}_i$ and $\mathbf{y}_i$, on standard semantic tasks: word similarity and word analogy. We used the hyperwords tool of Levy et al. (2015), and we refer the reader to their paper for the evaluation methodology. The results are provided in Table 1. As one can see, the $\mathbf{x}$-subvectors outperform the whole $\mathbf{w}$-vectors in the similarity tasks and show competitive performance in the analogy tasks, whereas the $\mathbf{y}$-subvectors perform poorly in both. This suggests that the $\mathbf{x}$-subvectors carry more semantic information than the $\mathbf{y}$-subvectors.

### y-Subvectors Are Related to Syntax

We train a softmax regression model that takes the embedding of the current word and predicts the part-of-speech (POS) tag of the next word:

$$\widehat{\mathrm{POS}}_{t+1} = \operatorname{softmax}(\mathbf{A}\mathbf{w}_t + \mathbf{b})$$
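A self-contained sketch of such a softmax-regression tagger, trained by plain SGD on synthetic data (all sizes, the learning rate, and the random token stream are arbitrary stand-ins, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(3)
V, d, n_tags, T = 50, 20, 12, 1000      # toy vocab, dim, #POS tags, corpus length

W = rng.normal(size=(V, d))             # stand-in for pretrained embeddings
words = rng.integers(0, V, size=T)      # token stream (word indices)
tags = rng.integers(0, n_tags, size=T)  # POS tag of each token

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Softmax regression: predict the POS tag of the NEXT word from w[t].
A = np.zeros((n_tags, d))
b = np.zeros(n_tags)
lr = 0.1

for epoch in range(3):
    for t in range(T - 1):
        w_t = W[words[t]]
        probs = softmax(A @ w_t + b)
        grad = probs.copy()
        grad[tags[t + 1]] -= 1.0        # cross-entropy gradient w.r.t. logits
        A -= lr * np.outer(grad, w_t)
        b -= lr * grad
```

To compare subvectors, the same loop would be run with `W[:, :d//2]` (the $\mathbf{x}$-parts) or `W[:, d//2:]` (the $\mathbf{y}$-parts) in place of `W`.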

We evaluate the whole vectors and the subvectors on tagging the Brown corpus with the Universal POS tags. The resulting accuracies are provided in Table 3.

We can see that the $\mathbf{y}$-subvectors are more suitable for POS tagging than the $\mathbf{x}$-subvectors, which means that the $\mathbf{y}$-parts carry more syntactic information than the $\mathbf{x}$-parts.

## Conclusion

Theoretical analysis of word embeddings gives us better understanding of their properties. Moreover, theory may provide us interesting hypotheses on the nature and structure of word embeddings, and such hypotheses can be verified empirically as is done in this paper.

## Acknowledgements

This work is supported by the Nazarbayev University Collaborative Research Program 091019CRP2109, and by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan, IRN AP05133700.

## References

•  Z. Assylbekov and R. Takhanov (2019) Context vectors are reflections of word vectors in half the dimensions. Journal of Artificial Intelligence Research 66, pp. 225–242.

•  O. Levy, Y. Goldberg, and I. Dagan (2015) Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225.

•  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, pp. 3111–3119.