1 Introduction
Skip-gram [1] is a popular technique in natural language processing in which, in addition to contiguous sequences of words, a word may be replaced by a skip token. The model is used to overcome the data sparsity problem and provides an efficient method for learning high-quality vector representations for phrases.
Guthrie et al. further investigated the use of skip-grams by introducing k-skip-n-grams [2] and showed empirically that they can be more effective than increasing the size of the training corpus. In their paper, they also provided the following formula for calculating the number of k-skip-trigrams for a corpus of size $m$ (with $m > k + 2$):
$$\frac{(k+1)(k+2)}{6}\,(3m - 2k - 6)$$
The purpose of this paper is to derive the general formula for arbitrary n-gram length $n$, maximum number of skips $k$, and corpus size $m$.
2 Proof
The general formula can be derived from an algorithm for constructing the k-skip-n-grams. There are several recursive algorithms for constructing them, but the one that makes the counting easiest relies on the following intuition:
The number of k-skip-n-grams is equal to the number of n-grams with 0 skips, plus the number of n-grams with exactly 1 skip, plus the number of n-grams with exactly 2 skips, and so on up to the number of n-grams with exactly $k$ skips. So if the number of n-grams with exactly $s$ skips is $N_s$, then the total number of all k-skip-n-grams is $\sum_{s=0}^{k} N_s$.
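The decomposition by exact number of skips can be illustrated on a small example. The sketch below is not taken from the paper's repository; the function name `skipgrams_by_skips` and the five-word sentence are hypothetical choices for illustration, and it assumes $n \geq 1$. It groups every 2-skip-bigram of the sentence by how many words were skipped.

```python
from collections import defaultdict
from itertools import combinations


def skipgrams_by_skips(tokens, n, k):
    """Group all k-skip-n-grams of `tokens` by their exact number of skips.

    Hypothetical helper for illustration; assumes n >= 1.
    """
    groups = defaultdict(list)
    for idx in combinations(range(len(tokens)), n):
        # Number of skipped tokens between the first and last chosen word.
        skips = idx[-1] - idx[0] - (n - 1)
        if skips <= k:
            groups[skips].append(tuple(tokens[i] for i in idx))
    return dict(groups)


tokens = "insurgents killed in ongoing fighting".split()
groups = skipgrams_by_skips(tokens, n=2, k=2)
for skips in sorted(groups):
    print(skips, len(groups[skips]), groups[skips])
```

For this sentence the groups contain 4, 3, and 2 bigrams for 0, 1, and 2 skips respectively, so there are 9 distinct 2-skip-bigrams in total.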
To derive the formula for $N_s$, let's see how we can generate an n-gram with exactly $s$ skips. Generating n-grams with $s$ skips is equivalent to selecting a contiguous sequence of length $n + s$ and substituting $s$ of its elements with skips. It is important to realize that you can't substitute the first or the last element, as the resulting n-gram would be equivalent to

an n-gram with $s - 1$ skips if you substitute only one (first or last) element with a skip

an n-gram with $s - 2$ skips if you substitute both (first and last) elements with a skip
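So we need to choose $s$ substitutions from $n + s - 2$ positions, which can be done in $\binom{n+s-2}{s}$ different ways. Because we can generate $m - n - s + 1$ (which should be non-negative) different substrings of length $n + s$ from the corpus of size $m$, the total number of n-grams with exactly $s$ skips is
$$N_s = (m - n - s + 1)\binom{n+s-2}{s}$$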
Therefore the total formula for k-skip-n-grams is
$$\sum_{s=0}^{k} (m - n - s + 1)\binom{n+s-2}{s}$$
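The construction used in this argument can be checked numerically. The following is a minimal sketch, assuming $m$ denotes the corpus size, $n$ the n-gram length, and $s$ the exact number of skips (these names, and the function `count_exact_skips`, are chosen here for illustration). It slides a window of length $n + s$ over the corpus, picks the $s$ skipped positions among the interior of the window, and compares the resulting count with $(m - n - s + 1)\binom{n+s-2}{s}$.

```python
from itertools import combinations
from math import comb


def count_exact_skips(m, n, s):
    """Count n-grams with exactly s skips in a corpus of m tokens by construction."""
    window = n + s
    count = 0
    for start in range(m - window + 1):  # every window of length n + s
        # The first and last positions of the window must stay words,
        # so skips are chosen only among the interior positions.
        interior = range(start + 1, start + window - 1)
        for _ in combinations(interior, s):
            count += 1
    return count


# Compare with the product formula. The check starts at n = 2 because
# math.comb rejects the negative argument that n = 1, s = 0 would produce.
for m in range(1, 9):
    for n in range(2, 5):
        for s in range(4):
            expected = max(m - n - s + 1, 0) * comb(n + s - 2, s)
            assert count_exact_skips(m, n, s) == expected

# Summing over s = 0..k gives the number of k-skip-n-grams.
total = sum(count_exact_skips(8, 3, s) for s in range(3))
print(total)  # 6 + 10 + 12 = 28
```

Different windows always produce different n-grams because the window is fixed by the first and last selected word, so nothing is counted twice.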
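This expression can be simplified using the following identities:

$\sum_{s=0}^{k} \binom{n+s-2}{s} = \binom{n+k-1}{k}$, which can be proved by induction

$\sum_{s=0}^{k} s\binom{n+s-2}{s} = (n-1)\binom{n+k-1}{k-1}$, which can be proved by induction

$\binom{n+k-1}{k-1} = \frac{k}{n}\binom{n+k-1}{k}$, which can be proved from the definition of the binomial coefficient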
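So
$$\sum_{s=0}^{k} (m - n - s + 1)\binom{n+s-2}{s} = (m - n + 1)\binom{n+k-1}{k} - \frac{k(n-1)}{n}\binom{n+k-1}{k} = \frac{mn - n^2 + n - kn + k}{n}\binom{n+k-1}{k}$$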
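As a sanity check on the simplification, the closed form can be compared with the direct sum and, for $n = 3$, with the trigram formula of Guthrie et al. This is a sketch under the assumption that $m$ is the corpus size, $n$ the n-gram length, and $k$ the maximum number of skips, with $n \geq 2$ and $m \geq n + k - 1$; the function names are hypothetical and are not taken from the linked repository.

```python
from math import comb


def skipgram_sum(m, n, k):
    """Direct sum over the exact number of skips s = 0..k."""
    return sum((m - n - s + 1) * comb(n + s - 2, s) for s in range(k + 1))


def skipgram_closed_form(m, n, k):
    """Closed form (mn - n^2 + n - kn + k) / n * C(n + k - 1, k)."""
    numerator = (m * n - n * n + n - k * n + k) * comb(n + k - 1, k)
    # The numerator is always a multiple of n because the sum is an integer.
    return numerator // n


def guthrie_trigrams(m, k):
    """Trigram special case: (k + 1)(k + 2)(3m - 2k - 6) / 6."""
    return (k + 1) * (k + 2) * (3 * m - 2 * k - 6) // 6


for n in range(2, 6):
    for k in range(5):
        for m in range(n + k - 1, n + k + 10):
            assert skipgram_closed_form(m, n, k) == skipgram_sum(m, n, k)

for k in range(5):
    for m in range(k + 3, k + 15):
        assert skipgram_closed_form(m, 3, k) == guthrie_trigrams(m, k)

print(skipgram_closed_form(8, 3, 2))  # 28
```

Integer arithmetic is used throughout so that the comparison is exact rather than subject to floating-point rounding.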
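The formula is almost complete apart from a few corner cases. If $n = 0$ or $m < n$, we cannot select any n-grams and the result should be zero. Previously it was also mentioned that $m - n - s + 1 \geq 0$, which is the same as $s \leq m - n + 1$, so when $k > m - n + 1$ the formula should be evaluated with $k$ replaced by $m - n + 1$.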
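One way to fold these corner cases into a single function is sketched below. The handling shown (returning zero when $n \leq 0$ or $m < n$, and capping $k$ at $m - n + 1$) is an assumption about how the bounds apply rather than the code from the linked repository, and the names `count_kskipngrams` and `brute_force` are hypothetical. The result is compared with a brute-force enumeration over all ways of picking $n$ positions out of $m$.

```python
from itertools import combinations
from math import comb


def count_kskipngrams(m, n, k):
    """Number of k-skip-n-grams in a corpus of m tokens (assumes k >= 0)."""
    if n <= 0 or m < n:
        return 0  # no n-gram can be selected
    # Terms with s > m - n + 1 would be negative, so cap the number of skips.
    k = min(k, m - n + 1)
    return (m * n - n * n + n - k * n + k) * comb(n + k - 1, k) // n


def brute_force(m, n, k):
    """Enumerate position tuples whose span leaves at most k skipped tokens."""
    return sum(
        1
        for idx in combinations(range(m), n)
        if idx[-1] - idx[0] - (n - 1) <= k
    )


for m in range(9):
    for n in range(1, 5):
        for k in range(6):
            assert count_kskipngrams(m, n, k) == brute_force(m, n, k)

print(count_kskipngrams(5, 2, 2))   # 9
print(count_kskipngrams(4, 3, 10))  # 4, every 3-subset of 4 tokens
```

Once $k$ reaches $m - n$, every choice of $n$ positions qualifies, so the count stops growing at $\binom{m}{n}$.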
3 Additional materials
The code and verification for the formula are available at https://github.com/salvadordali/kskipngram
References

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
[2] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. A closer look at skip-gram modelling. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006.