An Analysis on the Learning Rules of the Skip-Gram Model

03/18/2020 · Canlin Zhang et al. · Florida State University

To improve the generalization of the representations for natural language processing tasks, words are commonly represented using vectors, where distances among the vectors are related to the similarity of the words. While word2vec, the state-of-the-art implementation of the skip-gram model, is widely used and improves the performance of many natural language processing tasks, its mechanism is not yet well understood. In this work, we derive the learning rules for the skip-gram model and establish their close relationship to competitive learning. In addition, we provide the global optimal solution constraints for the skip-gram model and validate them by experimental results.

I Introduction

In the last few years, performance on natural language processing tasks has improved significantly due to the application of deep learning architectures and better representations of words [1] [3] [13]. To improve generalization and reduce the complexity of language models, words are commonly represented using dense vectors, where similar vectors represent similar words [7] [8]. Among many such representations, word2vec (short for Word-to-Vector) is widely used due to its computational efficiency and its ability to capture interesting analogical relationships [6] [9]. In addition, systems built on word2vec representations often achieve significant performance improvements.

However, it is not well understood why word2vec exhibits these desirable properties. Although many researchers have tried to explain the effectiveness of word2vec, few works provide a rigorous mathematical analysis of the formulas of the skip-gram model. The main contribution of this work is three-fold: First, we examine the gradient formulas of the skip-gram model and then derive the underlying learning rules for both the input and output vectors. Second, we establish that word2vec leads to a competitive learning rule [17] for word representations. Third, given the training corpus, we provide the global optimal solution constraints on both the input and output vectors in the skip-gram model.

The paper is organized as follows: In Section II, we present the skip-gram model as well as word2vec for learning vector representations of words. The learning rules of the skip-gram model, as well as their connections to competitive learning, are derived in Section III. In Section IV, the global optimal solution constraints on the vectors of the skip-gram model are proved first; then, experimental results obtained on both a toy dataset and a big training corpus are provided to support our results. After that, the connections between the learning rules and the global optimal solution constraints are discussed. Finally, Section V concludes the paper with a brief summary and a discussion of our future work.

II The Skip-Gram Model for Vector Representations of Words

The skip-gram model implemented by word2vec [6] can be described as follows:

Suppose we are given a text training corpus $w_1, w_2, \dots, w_T$ of length $T$. Based on this corpus, we can build a dictionary $D$ containing $N$ words, $D = \{d_1, d_2, \dots, d_N\}$, where the words in dictionary $D$ are sorted in descending order of their frequencies in the training corpus.

Then, for each word $d_i$ in dictionary $D$, two vector representations are provided: the input vector $v_{d_i}$ and the output vector $v'_{d_i}$, both of which are initialized from a random normal distribution [4]. Based on these embedding vectors, the conditional probability $\hat P(d_n \mid d_i)$ between any two words $d_i$ and $d_n$ in $D$ is estimated using the softmax function:

$$\hat P(d_n \mid d_i)=\frac{\exp\!\big({v'_{d_n}}^{\top}v_{d_i}\big)}{\sum_{m=1}^{N}\exp\!\big({v'_{d_m}}^{\top}v_{d_i}\big)}, \tag{1}$$

where ${v'}^{\top}$ denotes the transpose of the vector $v'$, and ${v'_{d_n}}^{\top}v_{d_i}$ denotes the inner product between the two vectors $v'_{d_n}$ and $v_{d_i}$.

Therefore, the goal of the skip-gram model is to maximize the average log probability

$$J=\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c\le j\le c\\ j\ne 0}}\log\hat P(w_{t+j}\mid w_t), \tag{2}$$

where $c$ is the radius of the center-removed context window at $w_t$.
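To make the notation concrete, the following minimal sketch evaluates formula (1) and the objective (2) on a toy corpus. It is not the authors' implementation; the embedding dimension, the toy corpus, and the function names are illustrative assumptions.

```python
import numpy as np

def softmax_probs(V_in, V_out, i):
    """Estimated probabilities P_hat(. | d_i) of formula (1), given the current vectors."""
    scores = V_out @ V_in[i]              # inner products v'_{d_n}^T v_{d_i}, n = 1, ..., N
    scores -= scores.max()                # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

def average_log_prob(corpus, word2id, V_in, V_out, c=2):
    """Average log probability J of formula (2), summed over the whole corpus."""
    T, total = len(corpus), 0.0
    for t, w in enumerate(corpus):
        p_hat = softmax_probs(V_in, V_out, word2id[w])
        for j in range(max(0, t - c), min(T, t + c + 1)):
            if j != t:
                total += np.log(p_hat[word2id[corpus[j]]])
    return total / T

# toy usage: a 3-word vocabulary with 5-dimensional embeddings
rng = np.random.default_rng(0)
corpus = "a b c b a b".split()
word2id = {w: k for k, w in enumerate(sorted(set(corpus)))}
V_in = rng.normal(size=(len(word2id), 5))    # input vectors, randomly initialized
V_out = rng.normal(size=(len(word2id), 5))   # output vectors, randomly initialized
print(average_log_prob(corpus, word2id, V_in, V_out, c=2))
```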

Note that we use $\hat P(w_{t+j}\mid w_t)$ rather than $P(w_{t+j}\mid w_t)$ to represent the probability estimated by the vectors, which differs from the typical notation in papers on the skip-gram model. In the following sections, we will analyze both the estimated probability $\hat P$ given by the vectors and the ground-truth probability $P$ determined by the training corpus. Hence, we distinguish these two concepts from the start.

Since the calculation of formula (1) requires $N$ inner products and exponentials, researchers usually do not use it directly in practice. Instead, many simplified versions of formula (1) are applied [11] [12]. Mikolov et al. [6] came up with an efficient and effective method, negative sampling, to approximate $\hat P(w_{t+j}\mid w_t)$: instead of computing the normalizing sum over all $N$ words in the vocabulary, an approximation based on $k$ randomly chosen words drawn from a noise distribution $P_n(w)$ is used, where $k$ is around 2-5 for big datasets and 5-20 for small ones. The best-performing specific choice is the unigram distribution raised to the 3/4 power, i.e., $P_n(w)\propto U(w)^{3/4}$.

Then, in order to maximize the approximated probability, one only needs to maximize the inner product ${v'_{w_{t+j}}}^{\top}v_{w_t}$ and minimize the inner products ${v'_{w_i}}^{\top}v_{w_t}$ of the $k$ sampled words, which is why the words $w_1,\dots,w_k$ are called negative samples. Moreover, the exponential function is usually replaced by the sigmoid function $\sigma(x)=1/(1+e^{-x})$ in practice to avoid underflow. That is, Mikolov et al. aim at maximizing

$$\log\sigma\!\big({v'_{w_{t+j}}}^{\top}v_{w_t}\big)+\sum_{i=1}^{k}\mathbb{E}_{w_i\sim P_n(w)}\Big[\log\sigma\!\big(-{v'_{w_i}}^{\top}v_{w_t}\big)\Big] \tag{3}$$

for each pair $w_t$ and $w_{t+j}$ in formula (2).
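For concreteness, a minimal sketch of the negative-sampling objective (3) is given below. It is not the authors' code; the vocabulary size, embedding dimension, and function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_term(v_in, v_out_pos, V_out_neg):
    """One term of formula (3): log sigma(v'_pos . v) + sum_i log sigma(-v'_neg_i . v)."""
    positive = np.log(sigmoid(v_out_pos @ v_in))
    negative = np.log(sigmoid(-(V_out_neg @ v_in))).sum()
    return positive + negative

def sample_negatives(unigram_counts, k, rng):
    """Draw k negative samples from the unigram distribution raised to the 3/4 power."""
    probs = unigram_counts ** 0.75
    probs /= probs.sum()
    return rng.choice(len(unigram_counts), size=k, p=probs)

# illustrative usage with random vectors and random unigram counts
rng = np.random.default_rng(0)
vocab, dim = 1000, 50
V_in, V_out = rng.normal(size=(vocab, dim)), rng.normal(size=(vocab, dim))
counts = rng.integers(1, 100, size=vocab).astype(float)
neg_ids = sample_negatives(counts, k=5, rng=rng)
print(sgns_term(V_in[3], V_out[7], V_out[neg_ids]))   # word 3 as w_t, word 7 as w_{t+j}
```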

Formula (3), applied to word embeddings, provides one of the first efficient language-modeling objectives for neural network training. The skip-gram model implemented with negative sampling (SGNS) produces surprisingly meaningful results: models built on SGNS pre-trained word embeddings not only perform well on many NLP tasks, but also exhibit a series of interesting analogical relationships [10].

On the other hand, according to formula (3), it is not hard to see that the method of negative sampling itself "involves no magic": it only provides a simple and efficient way to approximate the conditional probability $\hat P(w_{t+j}\mid w_t)$. We claim that the effectiveness of the SGNS model lies in the skip-gram algorithm itself and in the use of dense vector embeddings of words instead of one-hot vectors. By optimizing formula (3), the skip-gram model makes the embedding vectors of words with similar contexts converge towards each other in the vector space. Hence, semantic information and analogical relationships are captured by the vector distribution. Detailed explanations are provided in the following two sections.

III The Learning Rules for the Skip-Gram Model

In this section, we provide a systematic understanding of the learning rules of the skip-gram model. We first derive the gradient formula for each input and output vector, based on which the connection between the skip-gram model and competitive learning is then addressed.

We rewrite the average log probability $J$ of formula (2) as:

$$J=\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c\le j\le c\\ j\ne 0}}\Big[{v'_{w_{t+j}}}^{\top}v_{w_t}-\log\sum_{n=1}^{N}\exp\!\big({v'_{d_n}}^{\top}v_{w_t}\big)\Big].$$

Given a fixed word $d_i$ in the dictionary, the gradient of each term with respect to its input vector $v_{d_i}$ (at a position $t$ with $w_t=d_i$) will be:

$$\frac{\partial}{\partial v_{d_i}}\log\hat P(w_{t+j}\mid d_i)=v'_{w_{t+j}}-\sum_{n=1}^{N}\hat P(d_n\mid d_i)\,v'_{d_n}.$$

Hence, the gradient formula for the entire input vector will be:

$$\nabla_{v_{d_i}}J=\frac{1}{T}\sum_{t:\,w_t=d_i}\;\sum_{\substack{-c\le j\le c\\ j\ne 0}}\Big[\big(1-\hat P(w_{t+j}\mid d_i)\big)\,v'_{w_{t+j}}-\sum_{d_n\ne w_{t+j}}\hat P(d_n\mid d_i)\,v'_{d_n}\Big],$$

where $w_t=d_i$ means that the word appearing at position $t$ in the training corpus is the $i$-th word $d_i$ of the dictionary.

Note that the training objective of the skip-gram model is to maximize the average log probability $J$. Yet in practice, researchers usually apply gradient descent to minimize $-J$ for implementation convenience. However, in order to provide a clear theoretical analysis, we directly apply gradient ascent with respect to $v_{d_i}$ to maximize $J$ in our learning rule. Hence, the learning rule for updating the input vector $v_{d_i}$ will be:

$$v_{d_i}\;\leftarrow\;v_{d_i}+\eta\,\frac{1}{T}\sum_{t:\,w_t=d_i}\;\sum_{\substack{-c\le j\le c\\ j\ne 0}}\Big[\big(1-\hat P(w_{t+j}\mid d_i)\big)\,v'_{w_{t+j}}-\sum_{d_n\ne w_{t+j}}\hat P(d_n\mid d_i)\,v'_{d_n}\Big], \tag{4}$$

where $\eta$ is the learning rate.

Intuitively speaking, adding a vector $u$ to a vector $v$ makes $v$ move towards the direction of $u$, or makes the angle between $v$ and $u$ smaller. By contrast, subtracting $u$ from $v$ (i.e., computing $v-u$) makes $v$ move away from $u$, or makes the angle between the two vectors larger.

By analyzing the terms in the large bracket of formula (4), one can see that the coefficient $1-\hat P(w_{t+j}\mid d_i)$ as well as each $\hat P(d_n\mid d_i)$ is always positive, since $0<\hat P(d_n\mid d_i)<1$ for any word $d_n$. Hence, intuitively speaking, the vector $\big(1-\hat P(w_{t+j}\mid d_i)\big)\,v'_{w_{t+j}}$ is added to $v_{d_i}$, while the vectors $\hat P(d_n\mid d_i)\,v'_{d_n}$ for all $d_n\ne w_{t+j}$ are subtracted from $v_{d_i}$. This means that gradient ascent makes the input vector $v_{d_i}$ move towards the output vector $v'_{w_{t+j}}$ of each word $w_{t+j}$ that appears in the context window of word $w_t=d_i$. Meanwhile, gradient ascent makes $v_{d_i}$ move away from all the output vectors other than $v'_{w_{t+j}}$.
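The pull/push behaviour described above can be seen in the following minimal sketch of a single gradient-ascent step on an input vector; it is an illustration under assumed toy dimensions, not the authors' implementation.

```python
import numpy as np

def p_hat_row(V_in, V_out, i):
    """Estimated probabilities P_hat(. | d_i) from the current vectors (formula (1))."""
    s = V_out @ V_in[i]
    e = np.exp(s - s.max())
    return e / e.sum()

def input_vector_step(V_in, V_out, i, context_id, lr=0.1):
    """One gradient-ascent step on the input vector v_{d_i} for one observed context word.

    Following the bracket of formula (4): the context word's output vector is added
    with weight (1 - P_hat(context | d_i)), while every other output vector is
    subtracted with weight P_hat(d_n | d_i) -- the competitive-learning behaviour.
    """
    p_hat = p_hat_row(V_in, V_out, i)
    winner = (1.0 - p_hat[context_id]) * V_out[context_id]
    losers = np.delete(np.arange(len(V_out)), context_id)
    pushed = (p_hat[losers, None] * V_out[losers]).sum(axis=0)
    V_in[i] += lr * (winner - pushed)

# illustrative usage with a 6-word vocabulary and 4-dimensional vectors
rng = np.random.default_rng(0)
V_in, V_out = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
print(p_hat_row(V_in, V_out, 0)[2])           # P_hat(d_2 | d_0) before the update
input_vector_step(V_in, V_out, i=0, context_id=2)
print(p_hat_row(V_in, V_out, 0)[2])           # typically larger after the update
```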

This process is a form of competitive learning: if a word appears in the context of the word $d_i$, it competes against all the other words to "pull" the input vector $v_{d_i}$ closer to its own output vector. But we need to point out a major difference between the gradient ascent of the skip-gram model and the typical winner-takes-all (WTA) algorithm of competitive learning: in the back-propagation of WTA, the gradients to all the loser neurons are zero, which means that "the winner takes all while the losers stand still" [17]. In the gradient ascent of the skip-gram model, however, the losers (all the words other than $w_{t+j}$) fare even worse: they have to "push" the input vector $v_{d_i}$ away from their own output vectors $v'_{d_n}$. That is, the competitive learning rule for updating an input vector in the skip-gram model is "the winner takes its winning while the losers lose even more".

The implementation of SGNS is also compatible with our analysis here: maximizing formula (3) amounts to maximizing the inner product ${v'_{w_{t+j}}}^{\top}v_{w_t}$ while minimizing the inner products ${v'_{w_i}}^{\top}v_{w_t}$ for the negative samples $w_i$. This means that the input vector $v_{w_t}$ of the word $w_t$ will be pulled closer towards the output vector $v'_{w_{t+j}}$ if word $w_{t+j}$ is in the context of word $w_t$; meanwhile, $v_{w_t}$ will be pushed away from the output vectors $v'_{w_i}$, where $w_1,\dots,w_k$ are the randomly chosen words (negative samples). That is, the negative samples play the role of simulating all the "loser" words: since computing $\hat P(d_n\mid w_t)$ for all the words other than $w_{t+j}$ is not feasible, SGNS randomly picks a few words to act as the "loser" words, so that the winner word $w_{t+j}$ is differentiated [15].

As a result, under the gradient ascent updates, the input vector $v_{d_i}$ gradually aligns with the output vectors of words that appear in the context of $d_i$, while differing from the output vectors of all the words that do not appear in the context of $d_i$. In this way, semantic information is captured by the distribution of the embedding vectors in the vector space [21] [22].

Similarly, for an output vector $v'_{d_i}$, the gradient of $J$ on each of its dimensions is given as:

$$\frac{\partial J}{\partial (v'_{d_i})_m}=\frac{1}{T}\sum_{t=1}^{T}\Big[\#\big(d_i,\mathrm{win}_c(t)\big)-2c\,\hat P(d_i\mid w_t)\Big]\,(v_{w_t})_m,$$

where $(v)_m$ denotes the $m$-th dimension of a vector $v$, and $\#\big(d_i,\mathrm{win}_c(t)\big)$ denotes the number of times the word $d_i$ appears in the radius-$c$, center-removed window $\mathrm{win}_c(t)$ at the word $w_t$. And similar to the gradient of the input vector, $\hat P(d_i\mid w_t)$ is the estimate of the conditional probability given by the current vector set $\{v_{d_n},v'_{d_n}\}_{n=1}^{N}$.

Then, the gradient formula for the entire output vector is:

$$\nabla_{v'_{d_i}}J=\frac{1}{T}\sum_{t=1}^{T}\Big[\#\big(d_i,\mathrm{win}_c(t)\big)-2c\,\hat P(d_i\mid w_t)\Big]\,v_{w_t}.$$

And therefore, the gradient ascent updating rule for the output vector will be:

$$v'_{d_i}\;\leftarrow\;v'_{d_i}+\eta\,\frac{1}{T}\sum_{t=1}^{T}\Big[\#\big(d_i,\mathrm{win}_c(t)\big)-2c\,\hat P(d_i\mid w_t)\Big]\,v_{w_t}. \tag{5}$$

Looking at formula (5), it is easy to see that the softmax definition makes $\hat P(d_i\mid w_t)$ small for most word pairs $(d_i,w_t)$, since the probability mass is spread over a large vocabulary. The window radius $c$ is usually small, which means that multiplying by $2c$ will not significantly enlarge $\hat P(d_i\mid w_t)$. As a result, the term $\#\big(d_i,\mathrm{win}_c(t)\big)-2c\,\hat P(d_i\mid w_t)$ will almost always be positive when the word $d_i$ appears in the context window of $w_t$, since in that case $\#\big(d_i,\mathrm{win}_c(t)\big)$ is at least one. But if $d_i$ is not in the context window of $w_t$, then $\#\big(d_i,\mathrm{win}_c(t)\big)$ is zero and hence the term is negative.

That is, if the word $w_t$ at position $t$ in the training corpus has $d_i$ in its context window, it will "pull" the output vector $v'_{d_i}$ of the word $d_i$ towards its own input vector $v_{w_t}$. However, if $d_i$ is not in the context window of $w_t$, then the word $w_t$ has to "push" $v'_{d_i}$ away from its own input vector $v_{w_t}$. Once again, this is a process of competitive learning: each word in the training corpus competes against the others to make the output vector $v'_{d_i}$ move towards its own input vector; otherwise the word at position $t$ is competed out, so that $v'_{d_i}$ moves away from $v_{w_t}$. Under this mechanism, those words with the word $d_i$ in their context window will win, and those without $d_i$ in their context window will lose. The competitive learning rule here is still "winners take their winnings while losers lose even more".
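A corresponding sketch for one update of an output vector, mirroring formula (5), is given below; the corpus of integer word ids, the dimensions, and the function names are again illustrative assumptions rather than the authors' code.

```python
import numpy as np

def output_vector_step(V_in, V_out, i, corpus_ids, c=2, lr=0.1):
    """One gradient-ascent step on the output vector v'_{d_i}, mirroring formula (5).

    For each corpus position t the coefficient is
    (# of times d_i occurs in the window at t) - |window| * P_hat(d_i | w_t),
    where |window| equals 2c away from the corpus boundary: positive when d_i is in
    the window (pull v'_{d_i} toward v_{w_t}), negative otherwise (push it away).
    """
    T = len(corpus_ids)
    grad = np.zeros_like(V_out[i])
    for t, w in enumerate(corpus_ids):
        window = [corpus_ids[j] for j in range(max(0, t - c), min(T, t + c + 1)) if j != t]
        scores = V_out @ V_in[w]
        p_hat = np.exp(scores - scores.max())
        p_hat /= p_hat.sum()
        grad += (window.count(i) - len(window) * p_hat[i]) * V_in[w]
    V_out[i] += lr * grad / T

# illustrative usage: a tiny corpus of integer word ids and 4-dimensional vectors
rng = np.random.default_rng(1)
corpus_ids = [0, 1, 2, 1, 0, 3, 2, 1]
V_in, V_out = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
output_vector_step(V_in, V_out, i=2, corpus_ids=corpus_ids)
print(V_out[2])
```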

However, analyzing formula (3) again, we can see that in each training step there is only one input vector $v_{w_t}$ involved. So, the role played by $w_t$ has to be regarded as "multiple": it is the "winner" with respect to the word $w_{t+j}$ that truly appears, but it is the "loser" with respect to all the negative-sample words $w_1,\dots,w_k$ [14]. Hence, we have to admit that the competitive learning rule for updating the output vectors of the skip-gram model is not fully reflected in SGNS. In other words, SGNS puts a bias on the input vectors, while theoretically the status of input and output vectors in the skip-gram model should be equivalent [18].

In summary, the gradient updating formulas of the input and output vectors in the skip-gram model are connected to competitive learning rules, and these rules are inherited by SGNS.

Based on the discussion in this section, we analyze the global optimal solution constraints of the skip-gram model in Section IV.

IV Optimal Solutions for the Skip-Gram Model

In this section, we first derive the global optimal solution constraints of the skip-gram model. Then, we use experimental results to support our analysis. Finally, we show how the gradient ascent formulas of the skip-gram model make the word embedding vectors converge to the global optimal solution.

Iv-a The Global Optimal Solution Constraints

While the gradient ascent formulas in the previous section give the learning rule for a single update step, it is desirable to know the properties of the global optimal solutions for quantitative analysis and reasoning: we care not only about the rules for updating the input and output vectors, but also about their final converged values.

To do this, we first reorder the terms in $J$ by putting together all the context words of each word. That is, we rewrite $J$ as:

$$J=\frac{1}{T}\sum_{i=1}^{N}\sum_{n=1}^{N}x_{in}\,\log\hat P(d_n\mid d_i).$$

Here, $x_{in}$ is the number of times word $d_n$ appears in the radius-$c$, center-removed window of word $d_i$ throughout the training corpus (the counting is overlapping: if a word at time step $t$ appears in two overlapping windows of two nearby words, then it is counted twice). Then, the global optimization problem of the skip-gram model can be defined as:

Given the length-$T$ word training corpus and its corresponding dictionary $D$, we want to find an input and output vector set $\{v_{d_i},v'_{d_i}\}_{i=1}^{N}$, with one pair of vectors for each word in the dictionary $D$, such that under the definition

$$\hat P(d_n\mid d_i)=\frac{\exp\!\big({v'_{d_n}}^{\top}v_{d_i}\big)}{\sum_{m=1}^{N}\exp\!\big({v'_{d_m}}^{\top}v_{d_i}\big)}$$

for any two words $d_i,d_n\in D$, the average log probability

$$J=\frac{1}{T}\sum_{i=1}^{N}\sum_{n=1}^{N}x_{in}\,\log\hat P(d_n\mid d_i) \tag{6}$$

is maximized.

Maximizing formula (6) directly is difficult. So, we seek to maximize each term in it. That is, given a fixed training corpus (and hence a fixed dictionary $D$ and fixed counts $x_{in}$ for $i,n=1,\dots,N$), we want to maximize $\sum_{n=1}^{N}x_{in}\log\hat P(d_n\mid d_i)$ for each word $d_i$.

By the definition of $\hat P$, we can see that no matter what the vector set is, $0<\hat P(d_n\mid d_i)<1$ always holds for any two words $d_i$, $d_n$; and $\sum_{n=1}^{N}\hat P(d_n\mid d_i)=1$ always holds for any word $d_i$.

Therefore, the problem of maximizing $\sum_{n=1}^{N}x_{in}\log\hat P(d_n\mid d_i)$ for each specific word $d_i$ is equivalent to a constrained optimization problem that can be solved directly by the method of Lagrange multipliers: given a set of non-negative integers $x_1,\dots,x_N$ (writing $x_n$ for $x_{in}$, $n=1,\dots,N$), we want to find non-negative real numbers $p_1,\dots,p_N$ with the constraint $\sum_{n=1}^{N}p_n=1$, such that the product

$$F(p_1,\dots,p_N)=\prod_{n=1}^{N}p_n^{\,x_n}$$

is maximized.

Now, define the Lagrange function to be:

$$\mathcal{L}(p_1,\dots,p_N,\lambda)=\prod_{n=1}^{N}p_n^{\,x_n}+\lambda\Big(\sum_{n=1}^{N}p_n-1\Big).$$

According to the method of Lagrange multipliers, if $(p_1^{*},\dots,p_N^{*})$ is a solution to the original constrained optimization problem, then there exists a $\lambda^{*}$ such that

$$\nabla_{p_1,\dots,p_N,\lambda}\,\mathcal{L}(p_1,\dots,p_N,\lambda)=0$$

at $(p_1^{*},\dots,p_N^{*},\lambda^{*})$.

Expanding this condition, we obtain a system of $N+1$ equations in the $N+1$ variables $p_1,\dots,p_N,\lambda$:

$$x_n\,p_n^{\,x_n-1}\prod_{m\ne n}p_m^{\,x_m}+\lambda=0,\quad n=1,\dots,N,\qquad\text{and}\qquad \sum_{n=1}^{N}p_n=1.$$

Taking the first two equations, we get

$$x_1\,p_1^{\,x_1-1}\prod_{m\ne 1}p_m^{\,x_m}=x_2\,p_2^{\,x_2-1}\prod_{m\ne 2}p_m^{\,x_m}\;(=-\lambda).$$

Dividing both sides by the common factors, we obtain $x_1 p_2 = x_2 p_1$, which indicates that $\frac{p_1}{x_1}=\frac{p_2}{x_2}$. That is, $p_1=\frac{x_1}{x_2}p_2$, which can be directly generalized to $\frac{p_n}{x_n}=\frac{p_m}{x_m}$ for any $n,m$. Therefore, we obtain that $p_n=\frac{x_n}{x_1}p_1$ for each $n$. Finally, substituting into the last equation $\sum_{n=1}^{N}p_n=1$, we get that

$$p_n^{*}=\frac{x_n}{\sum_{m=1}^{N}x_m},\qquad n=1,\dots,N.$$

Substituting each $x_n=x_{in}$ back, we can easily see that

$$p_n^{*}=\frac{x_{in}}{\sum_{m=1}^{N}x_{im}}=\frac{x_{in}}{2c\cdot\#(d_i)},$$

where $\#(d_i)$ is the number of times the word $d_i$ appears in the training corpus. Now, notice what the term $\frac{x_{in}}{2c\cdot\#(d_i)}$ means: the number of times word $d_n$ appears in the context window of the word $d_i$, divided by the total amount of context the word $d_i$ has. That is, it is the probability that the word $d_n$ appears in the radius-$c$ context window of the word $d_i$, which can be regarded as a ground-truth probability determined by the training corpus. We use $P(d_n\mid d_i)$ to represent this probability.
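The constrained optimum derived above can also be checked numerically. The short sketch below compares the claimed optimum $p_n = x_n/\sum_m x_m$ against random points on the probability simplex, using illustrative counts; it is only a sanity check, not part of the authors' experiments.

```python
import numpy as np

# Numerical sanity check: for fixed counts x_1,...,x_N, the probabilities
# p_n = x_n / sum_m x_m maximize sum_n x_n * log(p_n) over the probability simplex.
x = np.array([3.0, 0.0, 5.0, 1.0, 0.0, 2.0])    # illustrative co-occurrence counts x_{in}

def objective(p):
    mask = x > 0                                 # terms with x_n = 0 contribute nothing
    return float(np.sum(x[mask] * np.log(p[mask])))

p_star = x / x.sum()                             # the claimed optimum
rng = np.random.default_rng(0)
for _ in range(10000):                           # compare against random points on the simplex
    p = rng.dirichlet(np.ones_like(x))
    assert objective(p) <= objective(p_star) + 1e-9
print("p* =", p_star, " objective at p*:", objective(p_star))
```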

As a result, we have obtained the solution for maximizing $\sum_{n=1}^{N}x_{in}\log\hat P(d_n\mid d_i)$ for each word $d_i$: we want the vector set to satisfy

$$\hat P(d_n\mid d_i)=\frac{x_{in}}{2c\cdot\#(d_i)}=P(d_n\mid d_i)\quad\text{for }n=1,\dots,N.$$

This means that, in order to maximize $\sum_{n=1}^{N}x_{in}\log\hat P(d_n\mid d_i)$ for each word $d_i$, the input and output vectors should make the estimated probability coincide with the ground-truth probability for every word $d_n$ in the dictionary.

Hence, we can state our conclusion on the global optimal solution of the skip-gram model implemented by word2vec:

Given the length-$T$ word training corpus and its corresponding dictionary $D$, the average log probability $J$ of the skip-gram model is maximized when the input and output vector set makes the estimated probability equal to the ground-truth probability for any two words $d_i$, $d_n$ in the dictionary. That is,

$$\hat P(d_n\mid d_i)=P(d_n\mid d_i)=\frac{x_{in}}{2c\cdot\#(d_i)} \tag{7}$$

for any $d_i,d_n\in D$, where $x_{in}$ is the number of times the word $d_n$ appears in the radius-$c$, center-removed window of the word $d_i$ throughout the training corpus, and $\#(d_i)$ is the number of times the word $d_i$ appears in the training corpus.

However, note that formula (7) is only a constraint on the inner products between input and output vectors. It does not specify the exact positions of the vectors in the vector space, which is typical of many optimization problems [20] [16]. In fact, there are infinitely many vector sets satisfying formula (7): if a vector set satisfies formula (7), then any rotation of it also does. The specific vector set obtained after training depends on the initial condition of the vectors.

IV-B Experimental Results

In this subsection, we first provide the experimental results obtained on a toy training corpus, the words of a short song, to support our analysis of the global optimal solution constraints.

The song goes as follows: "Every person had a star, every star had a friend, and for every person carrying a star there was someone else who reflected it, and everyone carried this reflection like a secret confidante in the heart." Based on this toy corpus, we strictly implement formula (1) to compute $\hat P$ in the skip-gram model. We fix the window radius $c$ and then go over the song 500 times to maximize formula (2).

After that, taking the word "every" as an example, we look at both the ground-truth probability $P(d_n\mid\text{every})$ and the estimated one $\hat P(d_n\mid\text{every})$ for every word $d_n$ appearing in the corpus. The word "every" appears three times in the corpus, so $\#(\text{every})=3$ and the total amount of context it has is $2c\cdot\#(\text{every})$. Then, we simply count $x_{in}$ for each word $d_n$ in order to get $P(d_n\mid\text{every})$. After that, we read out all the trained vectors $v_{d_n}$, $v'_{d_n}$ for each word to compute $\hat P(d_n\mid\text{every})$ based on formula (1). The results are given in Table I:

Word  $P$  $\hat P$  Word  $P$  $\hat P$
star 0.1667 0.1718 friend 0.0000 0.0095
had 0.1667 0.1713 reflection 0.0000 0.0092
person 0.1667 0.1644 it 0.0000 0.0068
and 0.0833 0.0917 secret 0.0000 0.0062
a 0.0833 0.0893 was 0.0000 0.0061
for 0.0833 0.0886 like 0.0000 0.0052
carrying 0.0833 0.0865 everyone 0.0000 0.0046
every 0.0000 0.0137 confidante 0.0000 0.0046
this 0.0000 0.0109 heart 0.0000 0.0046
the 0.0000 0.0105 who 0.0000 0.0043
there 0.0000 0.0103 someone 0.0000 0.0043
else 0.0000 0.0100 carried 0.0000 0.0040
in 0.0000 0.0096 reflected 0.0000 0.0018
TABLE I: The ground-truth probability $P(d_n\mid\text{every})$ and the estimated probability $\hat P(d_n\mid\text{every})$ for the skip-gram model trained on the toy corpus.

We also plot $P(d_n\mid\text{every})$ and $\hat P(d_n\mid\text{every})$ with respect to the words, ordered as in Table I:

Fig. 1: The ground-truth and estimated probabilities based on the word “every”.

Based on the table and the graph, we can see that after training, the input and output vectors indeed converge to a state that closely satisfies the global optimal solution constraints.
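As a cross-check, the ground-truth column of Table I can be reproduced by simple counting. The sketch below is not the authors' released code; it assumes a window radius of $c=2$, with the lyrics lower-cased and punctuation removed.

```python
from collections import Counter

corpus = ("every person had a star every star had a friend and for every person "
          "carrying a star there was someone else who reflected it and everyone "
          "carried this reflection like a secret confidante in the heart").split()
c = 2                                   # assumed window radius (not stated explicitly above)
center = "every"

counts, occurrences = Counter(), 0
for t, w in enumerate(corpus):
    if w == center:
        occurrences += 1
        window = range(max(0, t - c), min(len(corpus), t + c + 1))
        counts.update(corpus[j] for j in window if j != t)

total_context = 2 * c * occurrences     # 2c * #(d_i), as in the definition of P(d_n | d_i)
for word, x in counts.most_common():
    print(word, round(x / total_context, 4))   # with c = 2 this reproduces the P column of Table I
```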

Then, we provide our experimental results obtained on a big training corpus. We use an optimized word2vec implementation provided online by the TensorFlow group [19]. Our training corpus is the Text8 dataset, which consists of text from English Wikipedia articles [5].

After the vector set is trained, we choose a specific word $d_i$ and compute $P(d_n\mid d_i)$ and $\hat P(d_n\mid d_i)$ for the 10,000 most frequent words $d_n$ in the dictionary. We use $\mathbf{P}_i$ and $\hat{\mathbf{P}}_i$ to represent the two value sets we obtain.

Then, regarding $\mathbf{P}_i$ and $\hat{\mathbf{P}}_i$ as two sample sets, we compute the correlation coefficient between them. That is,

$$r_i=\frac{\sum_{n}\big(P(d_n\mid d_i)-\bar P_i\big)\big(\hat P(d_n\mid d_i)-\bar{\hat P}_i\big)}{\sqrt{\sum_{n}\big(P(d_n\mid d_i)-\bar P_i\big)^{2}}\;\sqrt{\sum_{n}\big(\hat P(d_n\mid d_i)-\bar{\hat P}_i\big)^{2}}}, \tag{8}$$

where $\bar P_i$ and $\bar{\hat P}_i$ are the means of the samples in $\mathbf{P}_i$ and $\hat{\mathbf{P}}_i$, respectively.
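Formula (8) is the standard Pearson correlation coefficient; a minimal sketch of its computation (with synthetic data in place of the actual probability sets) is given below.

```python
import numpy as np

def correlation(p, p_hat):
    """Pearson correlation coefficient of formula (8) between two sample sets."""
    p, p_hat = np.asarray(p), np.asarray(p_hat)
    dp, dq = p - p.mean(), p_hat - p_hat.mean()
    return (dp * dq).sum() / np.sqrt((dp ** 2).sum() * (dq ** 2).sum())

# illustrative check against numpy's built-in implementation on synthetic data
rng = np.random.default_rng(0)
p = rng.random(10000)
p_hat = 0.5 * p + 0.5 * rng.random(10000)   # noisy, partially correlated samples
print(correlation(p, p_hat), np.corrcoef(p, p_hat)[0, 1])
```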

The reason for calculating the correlation in this way is that the difference between $P(d_n\mid d_i)$ and $\hat P(d_n\mid d_i)$ for each word pair appears chaotic in our experiments. We believe that the complexity of the large corpus and the stochastic ambiguity caused by negative sampling add noise on top of the mathematical regularities, which pushes the magnitudes of the individual probabilities out of their ideal ratio. Hence, statistical methods are required to capture the relationship between $P$ and $\hat P$. We fix the input word $d_i$ so that all the estimated probabilities share the same softmax denominator.

We choose 18 specific words as $d_i$, including 6 nouns, 6 verbs, and 6 adjectives. Their correlation coefficients are shown in Table II:

Word  $r_i$  Word  $r_i$  Word  $r_i$
water 0.3558 run 0.3433 smart 0.3327
man 0.3230 play 0.3125 pretty 0.4039
king 0.3169 eat 0.3879 beautiful 0.3074
car 0.3300 drink 0.3507 dark 0.3209
bird 0.2700 fly 0.2886 high 0.3859
war 0.3990 draw 0.2730 low 0.3707
TABLE II: The correlation coefficient $r_i$ between the ground-truth probabilities and the estimated probabilities for each chosen word $d_i$.

Since there are 10,000 words $d_n$ participating in the computation of $r_i$ for each $d_i$, a correlation coefficient around 0.3 to 0.4 is significant. That is, for a fixed word $d_i$, there exists a clear linear relationship between the ground-truth probability $P(d_n\mid d_i)$ and the estimated probability $\hat P(d_n\mid d_i)$. Hence, the linear correlation between $P$ and $\hat P$ can still be observed through the noise in $\hat P(d_n\mid d_i)$, which strongly supports formula (7) stated in subsection IV-A.

Based on the results from both the toy corpus and the big dataset, we can see that our global optimal solution constraints on the vectors in word2vec are correct.¹

¹ Our work is open-source. Researchers can find our code and datasets at https://github.com/canlinzhang/IJCNN-2019-paper.

IV-C Connections to the Gradient Ascent Formulas

Based on our results so far, researchers may ask whether the vectors would truly converge to the optimal solution under the gradient ascent formulas. The answer is yes and we shall show it in this subsection.

According to the gradient formula for the input vector in Section III, we can furthermore obtain that:

$$\nabla_{v_{d_i}}J=\frac{1}{T}\Big[\sum_{n=1}^{N}x_{in}\,v'_{d_n}\;-\;2c\cdot\#(d_i)\sum_{n=1}^{N}\hat P(d_n\mid d_i)\,v'_{d_n}\Big].$$

Setting $x_{in}$ and $\#(d_i)$ to have the same meaning as in subsection IV-A, we can get that:

$$\nabla_{v_{d_i}}J=\frac{2c\cdot\#(d_i)}{T}\sum_{n=1}^{N}\Big[\frac{x_{in}}{2c\cdot\#(d_i)}-\hat P(d_n\mid d_i)\Big]\,v'_{d_n}.$$

Note that $\frac{\#(d_i)}{T}$ is actually the ground-truth occurring probability of the word $d_i$ in the training corpus, which we denote as $P(d_i)$. Also, taking $P(d_n\mid d_i)=\frac{x_{in}}{2c\cdot\#(d_i)}$ as in subsection IV-A, we have that:

$$\nabla_{v_{d_i}}J=2c\,P(d_i)\sum_{n=1}^{N}\big[P(d_n\mid d_i)-\hat P(d_n\mid d_i)\big]\,v'_{d_n}.$$

Therefore, the updating rule of the input vector under this gradient ascent formula will be:

$$v_{d_i}\;\leftarrow\;v_{d_i}+\eta\cdot 2c\,P(d_i)\sum_{n=1}^{N}\big[P(d_n\mid d_i)-\hat P(d_n\mid d_i)\big]\,v'_{d_n}. \tag{9}$$

Intuitively speaking, this formula shows that in the skip-gram model, the wider the context window is, the faster all the input vectors change; and the more frequent a word is, the faster its input vector may change. However, the essential part of formula (9) is the summation $\sum_{n=1}^{N}\big[P(d_n\mid d_i)-\hat P(d_n\mid d_i)\big]\,v'_{d_n}$, in which lies the connection between the gradient ascent rule of the input vector and the global optimal solution constraints.

For any two fixed words $d_i$ and $d_n$, we have that