 # word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method

The word2vec software of Tomas Mikolov and colleagues (https://code.google.com/p/word2vec/) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations. This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.

## 1 The skip-gram model

The departure point of the paper is the skip-gram model. In this model we are given a corpus of words $w$ and their contexts $c$. We consider the conditional probabilities $p(c|w)$, and given a corpus $Text$, the goal is to set the parameters $\theta$ of $p(c|w;\theta)$ so as to maximize the corpus probability:

$$\arg\max_\theta \prod_{w \in Text} \left[ \prod_{c \in C(w)} p(c|w;\theta) \right] \tag{1}$$

In this equation, $C(w)$ is the set of contexts of word $w$. Alternatively:

$$\arg\max_\theta \prod_{(w,c) \in D} p(c|w;\theta) \tag{2}$$

Here $D$ is the set of all word and context pairs we extract from the text.

### 1.1 Parameterization of the skip-gram model

One approach for parameterizing the skip-gram model follows the neural-network language models literature, and models the conditional probability using soft-max:

$$p(c|w;\theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}} \tag{3}$$

where $v_c$ and $v_w \in \mathbb{R}^d$ are vector representations for $c$ and $w$ respectively, and $C$ is the set of all available contexts. The parameters $\theta$ are $v_{c_i}$, $v_{w_i}$ for $w \in V$, $c \in C$, $i \in 1,\dots,d$ (a total of $(|V| + |C|) \times d$ parameters, one $d$-dimensional vector per word and per context). We would like to set the parameters such that the product (2) is maximized.

(Throughout this note, we assume that the words and the contexts come from distinct vocabularies, so that, for example, the vector associated with the word dog will be different from the vector associated with the context dog. This assumption follows the literature, where it is not motivated. One motivation for making this assumption is the following: consider the case where both the word dog and the context dog share the same vector $v$. Words hardly appear in the contexts of themselves, and so the model should assign a low probability to $p(dog|dog)$, which entails assigning a low value to $v \cdot v$, which is impossible.)
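To make the soft-max parameterization concrete, here is a minimal sketch of equation (3) in plain Python. The vocabulary, the 2-dimensional vectors, and the helper name `softmax_prob` are all made up for illustration; none of them come from the word2vec code.

```python
import math

def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))

def softmax_prob(c, w, word_vecs, ctx_vecs):
    """p(c | w) as in equation (3): exp(v_c . v_w) normalized over all contexts."""
    v_w = word_vecs[w]
    scores = {ctx: dot(v, v_w) for ctx, v in ctx_vecs.items()}
    m = max(scores.values())  # subtract the max score for numerical stability
    denom = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[c] - m) / denom

# Hypothetical toy vectors; words and contexts use separate tables.
word_vecs = {"dog": [1.0, 0.5], "cat": [0.9, 0.6]}
ctx_vecs = {"barks": [1.2, 0.1], "meows": [0.2, 1.1], "the": [0.0, 0.0]}

probs = {c: softmax_prob(c, "dog", word_vecs, ctx_vecs) for c in ctx_vecs}
print(probs)
```

The probabilities over all contexts sum to one, and a context whose vector has a larger dot product with the word vector receives more probability mass.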

Now will be a good time to take the log and switch from product to sum:

$$\arg\max_\theta \sum_{(w,c) \in D} \log p(c|w) = \sum_{(w,c) \in D} \left( \log e^{v_c \cdot v_w} - \log \sum_{c'} e^{v_{c'} \cdot v_w} \right) \tag{4}$$
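As a sanity check on the switch from product to sum, the following sketch evaluates objective (4) on a toy set of pairs and compares it with the log of the product in (2). The vectors and the pair list `D` are hypothetical values chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_likelihood(pairs, word_vecs, ctx_vecs):
    """Objective (4): sum over (w, c) of v_c.v_w - log sum_{c'} exp(v_{c'}.v_w)."""
    total = 0.0
    for w, c in pairs:
        v_w = word_vecs[w]
        # The normalizer loops over every context vector for every pair.
        log_z = math.log(sum(math.exp(dot(v, v_w)) for v in ctx_vecs.values()))
        total += dot(ctx_vecs[c], v_w) - log_z  # log e^x is just x
    return total

# Hypothetical toy data.
word_vecs = {"dog": [1.0, 0.5], "cat": [0.9, 0.6]}
ctx_vecs = {"barks": [1.2, 0.1], "meows": [0.2, 1.1], "the": [0.0, 0.0]}
D = [("dog", "barks"), ("cat", "meows"), ("dog", "the")]

objective = log_likelihood(D, word_vecs, ctx_vecs)

# Cross-check: the log of the product in (2) should give the same number.
product = 1.0
for w, c in D:
    z = sum(math.exp(dot(v, word_vecs[w])) for v in ctx_vecs.values())
    product *= math.exp(dot(ctx_vecs[c], word_vecs[w])) / z

print(objective, math.log(product))
```

The inner sum over `ctx_vecs` is recomputed for each pair, so the cost of one evaluation grows with the number of pairs times the number of contexts.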

An assumption underlying the embedding process is the following:

Assumption

maximizing objective (4) will result in good embeddings $v_w \; \forall w \in V$, in the sense that similar words will have similar vectors.

It is not clear to us at this point why this assumption holds.

While objective (4) can be computed, it is computationally expensive to do so, because the term $p(c|w;\theta)$ is very expensive to compute due to the summation $\sum_{c' \in C} e^{v_{c'} \cdot v_w}$ over all the contexts $c'$ (there can be hundreds of thousands of them). One way of making the computation more tractable is to replace the softmax with a hierarchical softmax. We will not elaborate on this direction.

## 2 Negative Sampling

Mikolov et al. present the negative-sampling approach as a more efficient way of deriving word embeddings. While negative sampling is based on the skip-gram model, it is in fact optimizing a different objective. What follows is the derivation of the negative-sampling objective.

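Consider a pair $(w,c)$ of word and context. Did this pair come from the training data? Let's denote by $p(D=1|w,c)$ the probability that $(w,c)$ came from the corpus data. Correspondingly, $p(D=0|w,c) = 1 - p(D=1|w,c)$ will be the probability that $(w,c)$ did not come from the corpus data. As before, assume there are parameters $\theta$ controlling the distribution: $p(D=1|w,c;\theta)$. Our goal is now to find parameters to maximize the probabilities that all of the observations indeed came from the data:

$$
\begin{aligned}
& \arg\max_\theta \prod_{(w,c) \in D} p(D=1|w,c;\theta) \\
={} & \arg\max_\theta \log \prod_{(w,c) \in D} p(D=1|w,c;\theta) \\
={} & \arg\max_\theta \sum_{(w,c) \in D} \log p(D=1|w,c;\theta)
\end{aligned}
$$

The quantity $p(D=1|w,c;\theta)$ can be defined using softmax (in this binary case, the logistic sigmoid):

$$p(D=1|w,c;\theta) = \frac{1}{1+e^{-v_c \cdot v_w}}$$

Leading to the objective:

$$\arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1+e^{-v_c \cdot v_w}}$$

This objective has a trivial solution if we set $\theta$ such that $p(D=1|w,c;\theta) = 1$ for every pair $(w,c)$. This can be easily achieved by setting $\theta$ such that $v_c = v_w$ and $v_c \cdot v_w = K$ for all $v_c, v_w$, where $K$ is a large enough number (practically, we get a probability of 1 as soon as $K \approx 40$).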
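The trivial solution is easy to reproduce numerically. The sketch below assumes IEEE double-precision floats (as in CPython) and uses a made-up dimension `d = 4`: every word and context gets the same vector, scaled so that its dot product with itself is about 40.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# In double precision the sigmoid is exactly 1.0 for a dot product of 40,
# but still strictly below 1.0 for a dot product of 30.
for k in (1, 10, 30, 40):
    print(k, sigmoid(k))

# Degenerate parameters: one shared vector v with v . v close to K.
K, d = 40.0, 4
v = [math.sqrt(K / d)] * d
score = sum(x * x for x in v)

# With identical vectors, every observed pair contributes log(1.0) = 0,
# so the positive-only objective reaches its maximum of 0.
num_pairs = 1000  # hypothetical corpus size
objective = num_pairs * math.log(sigmoid(score))
print(objective)
```

Since the objective is a sum of log-probabilities it can never exceed 0, so this degenerate setting is a global maximum even though the resulting vectors carry no information.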

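We need a mechanism that prevents all the vectors from having the same value, by disallowing some $(w,c)$ combinations. One way to do so is to present the model with some $(w,c)$ pairs for which $p(D=1|w,c;\theta)$ must be low, i.e. pairs which are not in the data. This is achieved by generating the set $D'$ of random $(w,c)$ pairs, assuming they are all incorrect (the name "negative-sampling" stems from the set $D'$ of randomly sampled negative examples). The optimization objective now becomes:

$$
\begin{aligned}
& \arg\max_\theta \prod_{(w,c) \in D} p(D=1|w,c;\theta) \prod_{(w,c) \in D'} p(D=0|w,c;\theta) \\
={} & \arg\max_\theta \prod_{(w,c) \in D} p(D=1|w,c;\theta) \prod_{(w,c) \in D'} \bigl(1 - p(D=1|w,c;\theta)\bigr) \\
={} & \arg\max_\theta \sum_{(w,c) \in D} \log p(D=1|w,c;\theta) + \sum_{(w,c) \in D'} \log \bigl(1 - p(D=1|w,c;\theta)\bigr) \\
={} & \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1+e^{-v_c \cdot v_w}} + \sum_{(w,c) \in D'} \log \left(1 - \frac{1}{1+e^{-v_c \cdot v_w}}\right) \\
={} & \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1+e^{-v_c \cdot v_w}} + \sum_{(w,c) \in D'} \log \left(\frac{1}{1+e^{v_c \cdot v_w}}\right)
\end{aligned}
$$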
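The last step of the derivation relies on a small algebraic identity, spelled out here with $x = v_c \cdot v_w$:

$$
1 - \frac{1}{1+e^{-x}} = \frac{e^{-x}}{1+e^{-x}} = \frac{e^{-x} \cdot e^{x}}{(1+e^{-x}) \cdot e^{x}} = \frac{1}{1+e^{x}}
$$

In other words, the probability that a pair is not from the data is the same logistic function evaluated at $-x$.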

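If we let $\sigma(x) = \frac{1}{1+e^{-x}}$ we get:

$$
\begin{aligned}
& \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1+e^{-v_c \cdot v_w}} + \sum_{(w,c) \in D'} \log \left(\frac{1}{1+e^{v_c \cdot v_w}}\right) \\
={} & \arg\max_\theta \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)
\end{aligned}
$$

which is almost equation (4) in Mikolov et al.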
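A minimal sketch of this corpus-level objective in plain Python follows. The pair sets and the two parameter settings are hypothetical toy values, and `log_sigmoid` is a helper introduced here for numerical stability; this is an illustration of the formula, not the word2vec implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """Numerically stable log(sigma(x)) = -log(1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def neg_sampling_objective(D, D_neg, word_vecs, ctx_vecs):
    """Sum of log sigma(v_c.v_w) over D plus log sigma(-v_c.v_w) over D'."""
    pos = sum(log_sigmoid(dot(ctx_vecs[c], word_vecs[w])) for w, c in D)
    neg = sum(log_sigmoid(-dot(ctx_vecs[c], word_vecs[w])) for w, c in D_neg)
    return pos + neg

D = [("dog", "barks"), ("cat", "meows")]      # observed pairs (toy)
D_neg = [("dog", "meows"), ("cat", "barks")]  # assumed-incorrect pairs (toy)

# Setting A: vectors separate the two word/context groups.
words_a = {"dog": [2.0, 0.0], "cat": [0.0, 2.0]}
ctxs_a = {"barks": [2.0, 0.0], "meows": [0.0, 2.0]}

# Setting B: every vector is identical (the collapsed solution).
words_b = {"dog": [2.0, 0.0], "cat": [2.0, 0.0]}
ctxs_b = {"barks": [2.0, 0.0], "meows": [2.0, 0.0]}

separated = neg_sampling_objective(D, D_neg, words_a, ctxs_a)
collapsed = neg_sampling_objective(D, D_neg, words_b, ctxs_b)
print(separated, collapsed)
```

With negative pairs in the objective, the collapsed setting is penalized: its high dot products on the pairs in `D_neg` make the second sum strongly negative, so the separated setting scores higher.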

The difference from Mikolov et al. is that here we present the objective for the entire corpus $D \cup D'$, while they present it for one example $(w,c) \in D$ and $k$ examples $(w,c_j) \in D'$, following a particular way of constructing $D'$.

Specifically, with negative sampling of $k$, Mikolov et al.'s constructed $D'$ is $k$ times larger than $D$, and for each $(w,c) \in D$ we construct $k$ samples $(w,c_1),\dots,(w,c_k)$, where each $c_j$ is drawn according to its unigram distribution raised to the $3/4$ power. This is equivalent to drawing the samples $(w,c)$ in $D'$ from the distribution $(w,c) \sim p_{words}(w) \frac{p_{contexts}(c)^{3/4}}{Z}$, where $p_{words}(w)$ and $p_{contexts}(c)$ are the unigram distributions of words and contexts respectively, and $Z$ is a normalization constant. In the work of Mikolov et al. each context is a word (and all words appear as contexts), and so $p_{contexts}(x) = p_{words}(x) = \frac{count(x)}{|Text|}$.
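The construction of $D'$ can be sketched as follows. This is an illustrative stdlib-only version under the assumption that contexts are words; the function names `build_negative_sampler` and `make_negative_pairs` and the toy token list are hypothetical, and the sketch does not reproduce the sampling table used in the word2vec code.

```python
import random
from collections import Counter

def build_negative_sampler(tokens, power=0.75, seed=0):
    """Return the smoothed unigram distribution and a function that samples from it."""
    counts = Counter(tokens)
    vocab = sorted(counts)
    weights = [counts[w] ** power for w in vocab]  # count(c)^(3/4)
    z = sum(weights)                               # normalization constant Z
    probs = {w: wt / z for w, wt in zip(vocab, weights)}
    rng = random.Random(seed)

    def sample(k):
        return rng.choices(vocab, weights=weights, k=k)

    return probs, sample

def make_negative_pairs(D, sample, k):
    """For each observed (w, c), emit k pairs (w, c_j) with c_j drawn from the sampler."""
    return [(w, c_neg) for w, _ in D for c_neg in sample(k)]

# Toy corpus: "the" appears 16 times, "dog" once.
tokens = ["the"] * 16 + ["dog"]
probs, sample = build_negative_sampler(tokens)

D = [("dog", "the"), ("the", "dog")]
D_neg = make_negative_pairs(D, sample, k=5)
print(probs)
print(len(D_neg))
```

On this toy corpus the raw unigram probability of "dog" is 1/17, while the smoothed probability is 1/9: raising counts to the 3/4 power shifts some sampling mass from frequent to rare contexts.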

### 2.1 Remarks

• Unlike the Skip-gram model described above, the formulation in this section does not model $p(c|w)$ but instead models a quantity related to the joint distribution of $w$ and $c$.

• If we fix the word representations and learn only the context representations, or fix the context representations and learn only the word representations, the model reduces to logistic regression, and is convex. However, in this model the word and context representations are learned jointly, making the model non-convex.

## 3 Context definitions

This section lists some peculiarities of the contexts used in the word2vec software, as reflected in the code. Generally speaking, for a sentence of $n$ words $w_1,\dots,w_n$, the contexts of a word $w_i$ come from a window of size $k$ around the word: $C(w_i) = w_{i-k},\dots,w_{i-1},w_{i+1},\dots,w_{i+k}$, where $k$ is a parameter. However, there are two subtleties:

Dynamic window size

the window size that is being used is dynamic: the parameter $k$ denotes the maximal window size. For each word in the corpus, a window size $k'$ is sampled uniformly from $1,\dots,k$.
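The dynamic window can be sketched in a few lines. The function name `extract_pairs` and the example sentence are made up for illustration; the sketch follows the description above rather than the actual C code.

```python
import random

def extract_pairs(sentence, k, rng):
    """Return (word, context) pairs, sampling a window size k' in 1..k per word."""
    pairs = []
    for i, w in enumerate(sentence):
        k_prime = rng.randint(1, k)  # uniform over 1..k, both ends inclusive
        lo = max(0, i - k_prime)
        hi = min(len(sentence), i + k_prime + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((w, sentence[j]))
    return pairs

rng = random.Random(0)
sentence = "the quick brown fox jumps".split()
pairs = extract_pairs(sentence, k=3, rng=rng)
print(pairs)
```

Because $k'$ is uniform on $1,\dots,k$, a context at distance $d \le k$ from the word is included with probability $(k - d + 1)/k$, so nearby contexts are effectively weighted more heavily than distant ones.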

Effect of subsampling and rare-word pruning

word2vec has two additional parameters for discarding some of the input words: words appearing fewer than min-count times are not considered as either words or contexts, and in addition frequent words (as defined by the sample parameter) are down-sampled. Importantly, these words are removed from the text before generating the contexts. This has the effect of increasing the effective window size for certain words. According to Mikolov et al., sub-sampling of frequent words improves the quality of the resulting embedding on some benchmarks. The original motivation for sub-sampling was that frequent words are less informative. Here we see another explanation for its effectiveness: the effective window size grows, including context-words which are both content-full and linearly far away from the focus word, thus making the similarities more topical.
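The widening effect can be seen on a single sentence. In the sketch below the set `dropped` is a hypothetical stand-in for whatever tokens pruning and sub-sampling happen to remove; the actual selection in word2vec depends on counts and is randomized, which this example does not model.

```python
def contexts_at(sentence, i, k):
    """Contexts of sentence[i] within a fixed window of size k."""
    lo, hi = max(0, i - k), min(len(sentence), i + k + 1)
    return [sentence[j] for j in range(lo, hi) if j != i]

sentence = "the quick fox jumped over the lazy dog".split()

# Hypothetical outcome of min-count pruning / sub-sampling.
dropped = {"the", "over"}

# Removal happens BEFORE the contexts are generated.
filtered = [w for w in sentence if w not in dropped]

before = contexts_at(sentence, sentence.index("jumped"), 2)
after = contexts_at(filtered, filtered.index("jumped"), 2)
print(before)
print(after)
```

With the same window size of 2, the filtered sentence yields "lazy" and "dog" as contexts of "jumped", although "dog" is four positions away in the original text.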

## 4 Why does this produce good word representations?

Good question. We don’t really know.

The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity $v_w \cdot v_c$ for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other (note also that contexts sharing many words will also be similar to each other). This is, however, very hand-wavy.

Can we make this intuition more precise? We’d really like to see something more formal.