Incremental Skip-gram Model with Negative Sampling

04/13/2017
by Nobuhiro Kaji, et al.

This paper explores an incremental training strategy for the skip-gram model with negative sampling (SGNS) from both empirical and theoretical perspectives. Existing methods of neural word embeddings, including SGNS, are multi-pass algorithms and thus cannot perform incremental model update. To address this problem, we present a simple incremental extension of SGNS and provide a thorough theoretical analysis to demonstrate its validity. Empirical experiments demonstrated the correctness of the theoretical analysis as well as the practical usefulness of the incremental algorithm.


1 Introduction

Existing methods of neural word embeddings are typically designed to go through the entire training data multiple times. For example, negative sampling (Mikolov et al., 2013b) needs to pre-compute the noise distribution from the entire training data before performing Stochastic Gradient Descent (SGD), and thus needs to go through the training data at least twice. Similarly, hierarchical soft-max (Mikolov et al., 2013b) has to determine the tree structure, and GloVe (Pennington et al., 2014) has to count co-occurrence frequencies, before performing SGD.

The fact that those existing methods are multi-pass algorithms means that they cannot perform incremental model update when additional training data is provided. Instead, they have to re-train the model on the old and new training data from scratch.

However, re-training is obviously inefficient, since it has to process the entire training data received thus far whenever new training data is provided. This is especially problematic when the amount of new training data is much smaller than that of the old data. One such situation is when the embedding model is updated on a small amount of training data that includes newly emerged words, so that they can be added to the vocabulary immediately. Another is when the word embeddings are learned from ever-evolving data such as news articles and microblogs (Peng et al., 2017) and the embedding model is periodically updated on newly generated data (e.g., once a week or month).

This paper investigates an incremental training method for word embeddings, focusing on the skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013b) because of its popularity. We present a simple incremental extension of SGNS, referred to as incremental SGNS, and provide a thorough theoretical analysis to demonstrate its validity. Our analysis reveals that, under a mild assumption, the optimal solution of incremental SGNS agrees with that of the original SGNS when the training data size is infinitely large; see Section 4 for the formal statement. Additionally, we present techniques for implementing incremental SGNS efficiently.

Three experiments were conducted to assess the correctness of the theoretical analysis as well as the practical usefulness of incremental SGNS. The first experiment empirically investigates the validity of the theoretical analysis. The second compares the word embeddings learned by incremental SGNS and the original SGNS across five benchmark datasets, and demonstrates that they are of comparable quality. The last explores the training time of incremental SGNS, demonstrating that it saves substantial training time by avoiding expensive re-training when additional training data is provided.

2 SGNS Overview

As a preliminary, this section provides a brief overview of SGNS.

Given a word sequence w_1, w_2, …, w_n for training, the skip-gram model seeks to minimize the following objective to learn word embeddings:

L = - \sum_{i=1}^{n} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{i+j} \mid w_i),

where w_i is a target word and w_{i+j} is a context word within a window of size c. p(w_{i+j} \mid w_i) represents the probability that w_{i+j} appears within the neighborhood of w_i, and is defined as

p(w_{i+j} \mid w_i) = \frac{\exp(v'_{w_{i+j}} \cdot v_{w_i})}{\sum_{w \in W} \exp(v'_{w} \cdot v_{w_i})},   (1)

where v_w and v'_w are w's embeddings when it behaves as a target and as a context, respectively. W represents the vocabulary set.

Since it is too expensive to optimize the above objective, Mikolov et al. (2013b) proposed negative sampling to speed up skip-gram training. This approximates Eq. (1) using sigmoid functions and k randomly-sampled words, called negative samples. The resulting objective is given as

\log \sigma(v'_{w_{i+j}} \cdot v_{w_i}) + \sum_{m=1}^{k} \log \sigma(-v'_{u_m} \cdot v_{w_i}),

where u_1, …, u_k are the negative samples and \sigma(x) = \frac{1}{1 + \exp(-x)} is the sigmoid function. The negative samples are drawn from a smoothed unigram probability distribution referred to as the noise distribution: p_n(w) \propto f(w)^\alpha, where f(w) represents the frequency of a word w in the training data and \alpha is a smoothing parameter (0 < \alpha \le 1).

The objective is optimized by SGD. Given a target-context word pair (w_i, w_{i+j}) and k negative samples u_1, …, u_k drawn from the noise distribution, the gradient of the objective is computed, and a gradient step is performed to update v_{w_i}, v'_{w_{i+j}}, and v'_{u_1}, …, v'_{u_k}.
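To make the update concrete, the following is a minimal Python sketch of one such gradient step. The function and variable names (sgns_step, target_vec, ctx_vecs) and the fixed learning rate are illustrative assumptions, not the paper's implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target_vec, ctx_vecs, pos_id, neg_ids, lr=0.025):
    # One gradient step for a (target, context) pair with k negative samples.
    # target_vec: embedding of the target word (updated in place).
    # ctx_vecs:   matrix of context embeddings, one row per word (updated in place).
    # pos_id:     index of the observed context word (label 1).
    # neg_ids:    indices of the k negative samples (label 0).
    grad_target = np.zeros_like(target_vec)
    for wid, label in [(pos_id, 1.0)] + [(w, 0.0) for w in neg_ids]:
        score = sigmoid(np.dot(target_vec, ctx_vecs[wid]))
        g = lr * (label - score)          # derivative of the log-sigmoid terms
        grad_target += g * ctx_vecs[wid]
        ctx_vecs[wid] += g * target_vec   # update the context-side embedding
    target_vec += grad_target             # update the target-side embedding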

SGNS training needs to scan the entire training data multiple times because it has to pre-compute the noise distribution before performing SGD. This makes it difficult to perform incremental model update when additional training data is provided.

3 Incremental SGNS

This section explores incremental training of SGNS. The incremental training algorithm (Section 3.1), its efficient implementation (Section 3.2), and the computational complexity (Section 3.3) are discussed in turn.

3.1 Algorithm

Algorithm 1 presents incremental SGNS, which goes through the training data in a single pass to update word embeddings incrementally. Unlike the original SGNS, it does not pre-compute the noise distribution. Instead, it reads the training data word by word to incrementally update the word frequency distribution and the noise distribution while performing SGD. (In practice, Algorithm 1 buffers a short sequence of words, rather than a single word, at each step, since it requires access to the context words in line 7; this is not a practical problem because the window size is usually small and independent of the training data size.) Hereafter, the original SGNS (c.f., Section 2) is referred to as batch SGNS to emphasize that its noise distribution is computed in a batch fashion.

The learning rate for SGD is adjusted using AdaGrad (Duchi et al., 2011). Although a linear decay function has been widely used for training batch SGNS (Mikolov, 2013), adaptive methods such as AdaGrad are more suitable for incremental training, since the amount of training data is unknown in advance and can grow unboundedly.
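As a rough illustration of why an adaptive scheme fits the streaming setting, here is a minimal AdaGrad-style update in Python; the class and its default hyperparameters are assumptions for illustration, not the paper's configuration.

import numpy as np

class AdaGrad:
    # Per-dimension adaptive learning rates: dimensions that have accumulated
    # large gradients get smaller steps, with no need to know the total amount
    # of training data in advance.
    def __init__(self, dim, lr=0.1, eps=1e-8):
        self.lr, self.eps = lr, eps
        self.sq_grad = np.zeros(dim)      # running sum of squared gradients

    def step(self, param, grad):
        self.sq_grad += grad ** 2
        param -= self.lr * grad / (np.sqrt(self.sq_grad) + self.eps)
        return param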

It is straightforward to extend incremental SGNS to the mini-batch setting by reading a subset of the training data (a mini-batch), rather than a single word, at a time to update the noise distribution and perform SGD (Algorithm 2). Although this paper primarily focuses on incremental SGNS, the mini-batch algorithm is also important in practical terms because it is easier to multi-thread.

Alternatives to Algorithms 1 and 2 might be possible. Other approaches include computing the noise distribution separately on each subset of the training data, fixing the noise distribution after computing it from the first (possibly large) subset, and so on. We exclude such alternatives from our investigation because it appears difficult to provide them with a theoretical justification.

1:  f(w) ← 0 for all w ∈ W
2:  for i = 1, …, n do
3:      f(w_i) ← f(w_i) + 1
4:      p_n(w) ∝ f(w)^α for all w ∈ W
5:      for each context word w_{i+j} of w_i (−c ≤ j ≤ c, j ≠ 0) do
6:          draw k negative samples u_1, …, u_k from p_n
7:          use SGD to update v_{w_i}, v'_{w_{i+j}}, and v'_{u_1}, …, v'_{u_k}
8:      end for
9:  end for
Algorithm 1 Incremental SGNS
1:  for each subset (mini-batch) of the training data do
2:      update the noise distribution using the mini-batch
3:      perform SGD over the mini-batch
4:  end for
Algorithm 2 Mini-batch SGNS
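The following Python sketch captures the single-pass structure of Algorithm 1 in a deliberately naive form: the word counts and the smoothed noise weights are updated word by word, and negatives are drawn on the fly. The function name is made up, the embedding update is omitted, and the per-step recomputation of sampling weights is exactly what Section 3.2.2 replaces with the adaptive unigram table.

import random
from collections import defaultdict

def incremental_negative_sampling(stream, k=5, alpha=0.75):
    # Single pass over the word stream: update frequencies, then draw k
    # negatives from the current smoothed distribution f(w) ** alpha.
    # (Naive: recomputes the sampling weights at every step.)
    freq = defaultdict(int)
    for w in stream:
        freq[w] += 1                                   # incremental frequency count
        words = list(freq)
        weights = [freq[v] ** alpha for v in words]    # current noise distribution
        negatives = random.choices(words, weights=weights, k=k)
        yield w, negatives                             # the SGD step would use these

for target, negs in incremental_negative_sampling("to be or not to be".split()):
    print(target, negs)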

3.2 Efficient implementation

Although incremental SGNS is conceptually simple, its efficient implementation involves several non-trivial issues.

3.2.1 Dynamic vocabulary

One problem that arises when training incremental SGNS is how to maintain the vocabulary set. Since new words emerge endlessly in the training data, the vocabulary set can grow unboundedly and exhaust memory.

We address this problem by dynamically changing the vocabulary set. The Misra-Gries algorithm (Misra and Gries, 1982) is used to approximately keep track of the top-K frequent words during training, and those words are used as the dynamic vocabulary set. This method allows the maximum vocabulary size to be explicitly limited to K, while the vocabulary set itself can change dynamically.
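A compact version of the Misra-Gries counter is sketched below in Python; the function name and the toy stream are illustrative. It keeps at most K counters and guarantees that any word whose true frequency exceeds n/(K+1) is retained.

def misra_gries(stream, K):
    # Approximate top-K frequent items over a stream with at most K counters.
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < K:
            counters[x] = 1
        else:
            # No free counter: decrement all counters and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries("a b a c a b d a b".split(), K=2))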

Table 1: Example noise distribution for the vocabulary set {a, b, c} (left) and the corresponding unigram table (right).
1:  f(w) ← 0 for all w ∈ W
2:  z ← 0
3:  for i = 1, …, n do
4:      f(w_i) ← f(w_i) + 1
5:      F ← f(w_i)^α − (f(w_i) − 1)^α
6:      z ← z + F
7:      if the table T is not yet full (|T| < τ) then
8:          add F copies of w_i to T
9:      else
10:         for j = 1, …, τ do
11:             T[j] ← w_i with probability F/z
12:         end for
13:     end if
14: end for
Algorithm 3 Adaptive unigram table.

3.2.2 Adaptive unigram table

Another problem is how to generate negative samples efficiently. Since k negative samples have to be generated from the noise distribution for every target-context pair, the sampling speed has a significant effect on the overall training efficiency.

Let us first examine how negative samples are generated in batch SGNS. In a popular implementation (Mikolov, 2013), a word array (referred to as a unigram table) is constructed such that the number of occurrences of a word w in it is proportional to f(w)^α. See Table 1 for an example. Using the unigram table, negative samples can be generated efficiently by sampling the table elements uniformly at random: it takes only O(1) time to generate one negative sample.
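A minimal sketch of this construction in Python (the word frequencies and table size below are made up for illustration):

import random

def build_unigram_table(freq, table_size, alpha=0.75):
    # Fill an array so that the number of cells holding w is (roughly)
    # proportional to freq[w] ** alpha.
    weights = {w: f ** alpha for w, f in freq.items()}
    z = sum(weights.values())
    table = []
    for w, wt in weights.items():
        table.extend([w] * int(round(table_size * wt / z)))
    return table

table = build_unigram_table({"a": 9, "b": 3, "c": 1}, table_size=16)
negative = random.choice(table)   # one negative sample in O(1) time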

The above method assumes that the noise distribution is fixed, and thus cannot be used directly for incremental training. One simple solution is to reconstruct the unigram table whenever new training data is provided. However, such a method is not effective for incremental SGNS, because reconstructing the unigram table requires time proportional to its maximum size τ. (This overhead is amortized in mini-batch SGNS if the mini-batch size is sufficiently large; our discussion here is dedicated to performing the incremental training efficiently irrespective of the mini-batch size.)

We propose a reservoir-based algorithm for efficiently updating the unigram table (Vitter, 1985; Efraimidis, 2015) (Algorithm 3). The algorithm incrementally updates the unigram table T while limiting its maximum size to τ. While T is not yet full, it is easily confirmed that the expected number of occurrences of a word w in T is f(w)^α. Once T is full, since z is equal to the normalization factor of the noise distribution, it can be proven by induction that every table element is the word w with probability p_n(w). See (Vitter, 1985; Efraimidis, 2015) for reference.

Note on implementation

In line 8, F copies of w_i are added to T. When F is not an integer, the copies are generated so that their expected number becomes F: specifically, ⌈F⌉ copies are added to T with probability F − ⌊F⌋, and ⌊F⌋ copies are added otherwise.

The loop from line 10 to 12 becomes expensive if implemented straightforwardly, because the maximum table size τ is typically set large (e.g., 10^8 in word2vec (Mikolov, 2013)). For acceleration, instead of checking all τ elements of the unigram table, τF/z randomly chosen elements are substituted with w_i. Note that τF/z is the expected number of table elements substituted by the original algorithm. This approximation achieves a great speed-up because we usually have τF/z ≪ τ; in fact, it can be proven that the update takes O(1) time under a mild condition. See Appendix A for more discussion (the appendices are in the supplementary material).
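Putting the pieces together, the Python sketch below mirrors Algorithm 3 with the accelerated substitution step. The class name, the stochastic-rounding helper, and the capping of the not-yet-full table are implementation choices of this sketch rather than details taken from the paper.

import math
import random

class AdaptiveUnigramTable:
    # Reservoir-style unigram table updated one word at a time.
    def __init__(self, max_size, alpha=0.75):
        self.max_size = max_size
        self.alpha = alpha
        self.freq = {}
        self.table = []
        self.z = 0.0                      # running sum of f(w) ** alpha

    @staticmethod
    def _stochastic_round(x):
        # Round x up or down so that the expected value equals x.
        frac = x - math.floor(x)
        return math.floor(x) + (1 if random.random() < frac else 0)

    def add(self, w):
        f = self.freq.get(w, 0) + 1
        self.freq[w] = f
        delta = f ** self.alpha - (f - 1) ** self.alpha
        self.z += delta
        if len(self.table) < self.max_size:
            # Table not yet full: append ~delta copies of w.
            self.table.extend([w] * self._stochastic_round(delta))
            del self.table[self.max_size:]            # keep the size bounded
        else:
            # Table full: overwrite ~ max_size * delta / z random cells with w,
            # instead of visiting all max_size cells.
            n_sub = self._stochastic_round(self.max_size * delta / self.z)
            for _ in range(n_sub):
                self.table[random.randrange(self.max_size)] = w

    def sample(self):
        return random.choice(self.table)

sampler = AdaptiveUnigramTable(max_size=1000)
for word in "to be or not to be".split():
    sampler.add(word)
print(sampler.sample())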

3.3 Computational complexity

Both incremental and batch SGNS have the same space complexity, which is independent of the training data size. Both require space proportional to the vocabulary size to store the word embeddings and the word frequency counts, and space proportional to τ to store the unigram table.

The two algorithms also have the same time complexity: both require training time linear in the training data size. Although incremental SGNS requires extra time for updating the dynamic vocabulary and the adaptive unigram table, these costs are practically negligible, as will be demonstrated in Section 5.3.

4 Theoretical Analysis

Although the extension from batch to incremental SGNS is simple and intuitive, it is not readily clear whether incremental SGNS can learn word embeddings as well as the batch counterpart. To answer this question, in this section we examine incremental SGNS from a theoretical point of view.

The analysis begins by examining the difference between the objectives optimized by batch and incremental SGNS (Section 4.1). Then, probabilistic properties of this difference are investigated to demonstrate the relationship between batch and incremental SGNS (Sections 4.2 and 4.3). We briefly touch on mini-batch SGNS at the end of the section (Section 4.4).

4.1 Objective difference

As discussed in Section 2, batch SGNS optimizes the following objective:

where the model parameters (i.e., the word embeddings) are represented collectively (we treat words as integers), and the noise distribution is written in a different notation than in Section 2 to make its dependence on the whole training data explicit: it is defined in terms of the word frequencies in the first i words of the training data.

In contrast, incremental SGNS computes at each step the gradient of a per-example objective in which the noise distribution depends not on the whole training data but only on the words seen so far. Because this gradient can be seen as a sample approximation of the gradient of the corresponding full objective, incremental SGNS can be interpreted as optimizing that objective with SGD.

Since the expectation terms in the objectives can be rewritten as , the difference between the two objectives can be given as

where is the delta function.

4.2 Unsmoothed case

Let us begin by examining the objective difference in the unsmoothed case (α = 1).

The technical difficulty in analyzing the objective difference is that it depends on the word order in the training data. To address this difficulty, we assume that the words in the training data are generated from some stationary distribution. This assumption allows us to investigate the property of the difference from a probabilistic perspective. Regarding its validity, we note that this assumption is already made by the original SGNS: the probability that the target and context words co-occur is assumed to be independent of their position in the training data.

Below we introduce some definitions and notation in preparation for the analysis.

Definition 1.

For each word and each position in the training data, let the corresponding indicator random variable take the value 1 when the word at that position is the given word, and 0 otherwise.

Recall that we assume the words in the training data are generated from a stationary distribution. This assumption means that the expectation and (co)variance of the indicator variables do not depend on the position index; hereafter, they are therefore written without the index.

Definition 2.

Let be a random variable that represents when . It is given as .

4.2.1 Convergence of the first and second order moments of the objective difference

It can be shown that the first order moment of the objective difference has an analytical form.

Theorem 1.

The first order moment of the objective difference admits a closed form involving H_n, the n-th harmonic number.

Sketch of proof.

Notice that the first order moment can be rewritten as a sum of expectations over pairs of positions in the training data. Because each of these expectations has a closed form, plugging it into the sum proves the theorem. See Appendix B.1 for the complete proof. ∎

Theorem 1 readily gives the convergence property of the first order moment of the objective difference:

Theorem 2.

The first-order moment of the objective difference decreases in the order of O(log n / n), and thus converges to zero in the limit n → ∞.

Proof.

We have H_n ≤ 1 + ln n from the upper integral bound, and thus Theorem 1 gives the proof. ∎
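Spelled out, the integral bound used in this step is the standard estimate of the harmonic number (the symbol H_n follows Theorem 1):

H_n = \sum_{i=1}^{n} \frac{1}{i}
    \le 1 + \int_{1}^{n} \frac{dx}{x}
    = 1 + \ln n,
\qquad\text{and hence}\qquad
\frac{H_n}{n} \le \frac{1 + \ln n}{n} = O\!\left(\frac{\log n}{n}\right).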

A similar result to Theorem 2 can be obtained for the second order moment of the objective difference as well.

Theorem 3.

The second-order moment of the objective difference decreases in the order of O(log n / n), and thus converges to zero in the limit n → ∞.

Proof.

Omitted. See Appendix B.2. ∎

4.2.2 Main result

The above theorems reveal the relationship between the optimal solutions of the two objectives, as stated in the next lemma.

Lemma 4.

Let and be the optimal solutions of and , respectively: and . Then,

(2)
(3)
Proof.

The proof is made by the squeeze theorem. Let . The optimality of gives . Also, the optimality of gives

We thus have a two-sided bound. Since Theorem 2 implies that the right-hand side converges to zero as n → ∞, the squeeze theorem gives Eq. (2). Next, we have

(4)

Theorem 3 suggests that Eq. (4) converges to zero as n → ∞. Also, the non-negativity of the variance gives a lower bound of zero. Therefore, the squeeze theorem gives Eq. (3). ∎

We are now ready to provide the main result of the analysis. The next theorem shows the convergence of .

Theorem 5.

converges in probability to :

Proof.

Consider again the same quantity as in the proof of Lemma 4. Chebyshev's inequality gives, for any ε > 0, a bound in terms of its mean and variance.

Recall that Eq. (2) means that for any δ > 0, there exists an N such that the mean term is smaller than δ whenever n > N. Therefore, we have

Since ε and δ are arbitrary, the bound can be rewritten in the claimed form. Also, Eq. (3) implies that the variance term vanishes. This completes the proof. ∎

Informally, this theorem can be interpreted as suggesting that the optimal solutions of batch and incremental SGNS agree when the training data size is infinitely large.

4.3 Smoothed case

We next examine the smoothed case (0 < α < 1). In this case, the noise distribution can be represented by using the ones in the unsmoothed case:

where and corresponds to the noise distribution in the unsmoothed case.

Definition 3.

Let be a random variable that represents in the smoothed case. Then, it can be written by using :

where .

Because the smoothed quantity is no longer a linear combination of the indicator variables, it becomes difficult to derive proofs similar to those in the unsmoothed case. To address this difficulty, it is approximated by a first-order Taylor expansion around its expectation, which yields a linear expression in the indicator variables. Consequently, it can be shown that the first and second order moments have the order of O(log n / n) in the smoothed case as well. See Appendix C for the details.

4.4 Mini-batch SGNS

The same analysis result can also be obtained for mini-batch SGNS: Theorems 2 and 3 can be proven in the mini-batch case as well (see Appendix D for the proof), and the rest of the analysis remains the same.

5 Experiments

Three experiments were conducted to investigate the correctness of the theoretical analysis (Section 5.1) and the practical usefulness of incremental SGNS (Sections 5.2 and 5.3). Details of the experimental settings that do not fit into the paper are presented in Appendix E.

5.1 Validation of theorems

An empirical experiment was conducted to validate the results of the theoretical analysis. Since it is difficult to assess the main result in Section 4.2.2 directly, the theorems in Section 4.2.1, from which the main result is readily derived, were investigated instead. Specifically, the first and second order moments of the objective difference were computed on datasets of increasing sizes to empirically investigate the convergence property.

Datasets of various sizes were constructed from the English Gigaword corpus (Napoles et al., 2012) by randomly sampling sentences. The dataset size (in words) was varied over a wide range, and several different datasets were created for each size to compute the first and second order moments.

Figure 1 (top left) shows log-log plots of the first order moments of the objective difference computed on the different sized datasets in the unsmoothed case (α = 1). The crosses and circles represent the empirical values and the theoretical values obtained by Theorem 1, respectively. Figure 1 (top right) similarly illustrates the second order moments. Since Theorem 3 suggests that the second order moment decreases in the order of log n / n, a curve proportional to log n / n is also shown; it was fitted to the empirical data by minimizing the squared error.

The top left figure demonstrates that the empirical values of the first order moments fit the theoretical result very well, providing strong empirical evidence for the correctness of Theorem 1. In addition, the two figures show that the first and second order moments decrease almost in the order of log n / n, converging to zero as the data size increases. This result validates Theorems 2 and 3.

Figures 1 (bottom left) and (bottom right) show similar results in the smoothed case. Since we do not have theoretical estimates of the first order moment in this case, fitted curves proportional to log n / n are shown in both figures. From these, we can again observe that the first and second order moments decrease almost in the order of log n / n. This indicates the validity of the investigation in Section 4.3. The relatively larger deviations from the fitted curves, compared with the top right figure, are considered to be attributable to the first-order Taylor approximation.

Figure 1: Log-log plots of the first and second order moments of the objective difference on the different sized datasets in the unsmoothed case (top left and top right) and the smoothed case (bottom left and bottom right).

5.2 Quality of word embeddings

The next experiment investigates the quality of the word embeddings learned by incremental SGNS through comparison with the batch counterparts.

The Gigaword corpus was used for training. For comparison, both our own implementation of batch SGNS and word2vec (Mikolov et al., 2013c) were used (denoted as batch and w2v, respectively). The training configurations of the three methods were made as similar as possible, although they cannot be matched perfectly. For example, incremental SGNS (denoted as incremental) utilizes the dynamic vocabulary (c.f., Section 3.2.1), so its vocabulary is controlled by the maximum vocabulary size, whereas the vocabulary of w2v is determined by a frequency threshold. The maximum vocabulary size for incremental and the frequency threshold for w2v were set so as to yield vocabulary sets of comparable sizes.

The learned word embeddings were assessed on five benchmark datasets commonly used in the literature (Levy et al., 2015): WordSim353 (Agirre et al., 2009), MEN (Bruni et al., 2013), SimLex-999 (Hill et al., 2015), the MSR analogy dataset (Mikolov et al., 2013c), and the Google analogy dataset (Mikolov et al., 2013a). The former three are for a semantic similarity task, and the remaining two are for a word analogy task. Spearman's ρ and prediction accuracy were used as the evaluation measures for the two tasks, respectively.
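For the similarity task, the evaluation protocol amounts to correlating cosine similarities with the human ratings. A minimal Python sketch follows; the function name and data format are assumptions for illustration, not taken from the paper or the benchmark tools.

import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(emb, pairs_with_scores):
    # emb: dict word -> vector; pairs_with_scores: [(w1, w2, human_score), ...]
    # Pairs containing out-of-vocabulary words are skipped.
    cos, gold = [], []
    for w1, w2, score in pairs_with_scores:
        if w1 in emb and w2 in emb:
            v1, v2 = emb[w1], emb[w2]
            cos.append(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))
            gold.append(score)
    rho, _pvalue = spearmanr(cos, gold)
    return rho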

Figures 2 (a) and (b) present the results on the similarity datasets and the analogy datasets, respectively. We see that the three methods (incremental, batch, and w2v) perform equally well on all of the datasets. This indicates that incremental SGNS learns word embeddings of the same quality as the batch counterparts, while being able to perform incremental model updates. Although incremental performs slightly better than the batch methods on some datasets, the difference appears to be due to chance.

The figures also show the results of incremental SGNS when the maximum vocabulary size was reduced to 150k and 100k (incremental-150k and incremental-100k). We see that incremental-150k and incremental-100k perform comparably to incremental, although relatively large performance drops are observed on some datasets (MEN and MSR). This demonstrates that the Misra-Gries algorithm can effectively control the vocabulary size.

Figure 2: (a) Spearman's ρ on the word similarity datasets. (b) Accuracy on the analogy datasets. (c) Update time when new training data is provided.

5.3 Update time

The last experiment investigates how much time incremental SGNS can save by avoiding re-training when updating the word embeddings.

In this experiment, incremental was first trained on initial training data of a fixed size (measured in the number of sentences) and then updated on new training data to measure the update time. For comparison, batch and w2v were re-trained on the combination of the initial and new training data. The size of the initial data was fixed, while the size of the new data was varied.

Figure 2 (c) compares the update time of the three methods across various sizes of the new training data. We see that incremental significantly reduces the update time, achieving large speed-ups over both batch and w2v. This demonstrates the advantage of the incremental algorithm, as well as the time efficiency of the dynamic vocabulary and the adaptive unigram table. We note that batch is slower than w2v because it uses AdaGrad, which maintains different learning rates for different dimensions of the parameters, while w2v uses the same learning rate for all dimensions.

6 Related Work

Word representations based on distributional semantics have long been common (Turney and Pantel, 2010; Baroni and Lenci, 2010). Distributional methods typically begin by constructing a word-context matrix and then apply dimension reduction techniques such as SVD to obtain high-quality word meaning representations. Although some studies investigated incremental updating of the word-context matrix (Yin et al., 2015; Goyal and Daume III, 2011), they did not explore updating the reduced representations. On the other hand, neural word embeddings have recently gained much popularity as an alternative. However, most previous studies have not explored incremental training strategies (Mikolov et al., 2013a, b; Pennington et al., 2014).

Very recently, Peng et al. (2017) proposed an incremental learning method for hierarchical soft-max. Because hierarchical soft-max and negative sampling have different advantages (Peng et al., 2017), incremental SGNS and their method are complementary to each other. Also, their update method needs to scan not only the new but also the old training data, and thus is not an incremental algorithm in a strict sense. As a consequence, it potentially incurs the same time complexity as re-training. Another consequence is that their method has to retain the old training data and thus wastes space, while incremental SGNS can discard old training examples after processing them.

There are publicly available implementations for training SGNS, one of the most popular being word2vec (Mikolov, 2013). However, it does not support incremental training. Gensim (Řehůřek and Sojka, 2010) also offers SGNS training. Although Gensim allows incremental updating of SGNS models, it is done in an ad-hoc manner: the vocabulary set as well as the unigram table are fixed once trained, meaning that new words cannot be added. Also, no theoretical account of the validity of this updating method is provided.

7 Conclusion and Future Work

This paper proposed incremental SGNS and provided a thorough theoretical analysis to demonstrate its validity. We also conducted experiments to empirically demonstrate its effectiveness. Although incremental model updates are often required in practical machine learning applications, little attention has been paid to learning word embeddings incrementally. We consider that incremental SGNS successfully addresses this situation and serves as a useful tool for practitioners.

The success of this work suggests several research directions to be explored in the future. One possibility is to extend other embedding methods, such as GloVe (Pennington et al., 2014), to incremental algorithms. Such studies would further extend the potential of word embedding methods.

References

Appendix A Note on Adaptive Unigram Table

Algorithm 4 illustrates the efficient implementation of the adaptive unigram table (c.f., Section 3.2.2). In lines 8 and 10, the number of copies to add and the number of cells to substitute are not always integers, and therefore they are probabilistically converted into integers as explained in the paper.

The time complexity of Algorithm 4 is O(1) per update when α ≤ 1. In that case, the update in line 8 takes O(1) time since we always have F ≤ 1. Once the table is full, we have z ≥ τ and consequently τF/z ≤ 1, which means that the update in lines 10–13 also takes O(1) time.

Even if α > 1, the value of z becomes sufficiently large in practice, and thus the update remains efficient, as demonstrated in the experiments.

1:  f(w) ← 0 for all w ∈ W
2:  z ← 0
3:  for i = 1, …, n do
4:     f(w_i) ← f(w_i) + 1
5:     F ← f(w_i)^α − (f(w_i) − 1)^α
6:     z ← z + F
7:     if the table T is not yet full (|T| < τ) then
8:        add F copies of w_i to T
9:     else
10:        for m = 1, …, τF/z do
11:           j is randomly drawn from {1, …, τ}
12:           T[j] ← w_i
13:        end for
14:     end if
15:  end for
Algorithm 4 Adaptive unigram table.

Appendix B Complete Proofs

This appendix provides complete proofs of Theorems 1, 3, and 5.

B.1 Proof of Theorem 1

Proof.

The first order moment of can be rewritten as

Here, for any and such that , we have

Therefore, we have

B.2 Proof of Theorem 3

To prove Theorem 3, we begin by examining upper and lower bounds in the following lemma, and then make use of the bounds to evaluate the order of the second order moment of the objective difference.

Lemma 6.

For any and such that , we have

Proof.

We have

To prove the lemma, we rewrite the expression by splitting the set of into two subsets. Let be a set of such that , , and are independent from each other (i.e., , , and are all different), and let be its complementary set:

Then, is upper-bounded as

where the inequality holds because , , and are binary random variables and thus . Here, we have , because includes elements such that and also includes and elements such that and , respectively. And we consequently have . Therefore, the upper-bound can be rewritten as

Similarly, by making use of , the lower-bound can be derived:

Making use of the above lemma, we can prove Theorem 3.

Proof.

The upper bound of the second order moment is examined to prove the theorem. Making use of Jensen's inequality, we have