Skip-gram modelling is a popular technique in natural language processing in which, in addition to contiguous sequences of words, we allow a word to be substituted with a skip token. The model is used to overcome the data sparsity problem and provides an efficient method for learning high-quality vector representations for phrases.
Guthrie et al. further investigated the use of skip-grams by introducing k-skip-n-grams and empirically showed that they can be more effective than increasing the size of the training corpus. In their paper, they also provided the following formula for calculating the number of k-skip-trigrams ($n = 3$) for a corpus of size $m$:
$$\frac{(k+1)(k+2)}{6}\,(3m - 2k - 6)$$
The purpose of this paper is to derive the general case formula for arbitrary $n$, $k$, and $m$.
The proof of the general formula follows from an algorithm for constructing the k-skip-n-grams. There are a few recursive algorithms to construct them, but the one that makes the counting easier relies on the following intuition:
The number of k-skip-n-grams is equal to the number of n-grams with 0 skips, plus the number with exactly 1 skip, plus the number with exactly 2 skips, and so on up to the number with exactly $k$ skips. So if the number of n-grams with exactly $s$ skips is $G_s$, then the total number of all k-skip-n-grams is $\sum_{s=0}^{k} G_s$.
To derive the formula for $G_s$, let’s see how we can generate an n-gram with exactly $s$ skips. Generating n-grams with $s$ skips is equivalent to selecting a contiguous sequence of length $n + s$ and substituting any $s$ of its elements with skips. It is important to realize that you can’t substitute the first or the last element, as the resulting n-gram would be equivalent to
- an n-gram with $s - 1$ skips if you substitute only one (first or last) element with a skip
- an n-gram with $s - 2$ skips if you substitute both (first and last) elements with a skip
So we need to choose $s$ substitutions from $n + s - 2$ positions, which can be done in $\binom{n+s-2}{s}$ different ways. Because we can generate $m - (n + s) + 1$ (which should be positive) different substrings of length $n + s$ from a corpus of size $m$, the total number of n-grams with exactly $s$ skips is
$$G_s = (m - n - s + 1)\binom{n+s-2}{s}$$
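The choice of skip positions can be made concrete with a short enumeration. The following is a minimal Python sketch, not the code from the paper's repository. The function name `skip_patterns` and the symbols `n` (n-gram order) and `s` (exact number of skips) are assumptions introduced here, and the sketch assumes `n >= 2`.

```python
from itertools import combinations
from math import comb


def skip_patterns(n, s):
    """Yield the kept positions inside a window of length n + s.

    The first and last window positions are always kept; exactly s of the
    n + s - 2 interior positions are replaced by skips.
    """
    length = n + s
    interior = range(1, length - 1)  # positions that may become skips
    for skipped in combinations(interior, s):
        skipped_set = set(skipped)
        yield tuple(i for i in range(length) if i not in skipped_set)


patterns = list(skip_patterns(3, 2))
print(patterns)                         # [(0, 3, 4), (0, 2, 4), (0, 1, 4)]
print(len(patterns) == comb(3 + 2 - 2, 2))  # True
```

Each tuple lists the window offsets of the words that remain, so a trigram with exactly two skips can take three shapes inside its window of five positions.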
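Therefore the total formula for k-skip-n-grams in a corpus of size $m$, summed over the number of skips $s$, is
$$\sum_{s=0}^{k} (m - n - s + 1)\binom{n+s-2}{s}$$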
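The sum over exact skip counts can be checked against a brute-force count on small inputs. This is a hedged sketch rather than the authors' verification code: `count_by_enumeration` and `count_by_sum` are hypothetical names, `m` denotes the corpus size, and the comparison assumes `n >= 2` and `m >= n + k`. It counts token positions, so repeated words in a real corpus are not merged.

```python
from itertools import combinations
from math import comb


def count_by_enumeration(m, n, k):
    """Count increasing index tuples of length n in range(m) with at most k skips."""
    total = 0
    for idx in combinations(range(m), n):
        skips = idx[-1] - idx[0] + 1 - n  # positions inside the span that were skipped
        if skips <= k:
            total += 1
    return total


def count_by_sum(m, n, k):
    """Sum, over s = 0..k, of (number of windows) * (number of skip patterns)."""
    return sum((m - n - s + 1) * comb(n + s - 2, s) for s in range(k + 1))


print(count_by_enumeration(5, 3, 2))  # 10
print(count_by_sum(5, 3, 2))          # 10
```

For a five-word corpus, every choice of three positions spans at most five words, so all 10 triples qualify as 2-skip-trigrams and both functions agree.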
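This expression can be simplified using the following identities:
- $\sum_{s=0}^{k}\binom{n+s-2}{s} = \binom{n+k-1}{k}$, which can be proved by induction
- $\sum_{s=0}^{k} s\binom{n+s-2}{s} = (n-1)\binom{n+k-1}{k-1}$, which can be proved by induction
- $\binom{n+k-1}{k-1} = \frac{k}{n}\binom{n+k-1}{k}$, which can be proved from the definition of the binomial coefficient
Applying them to the sum, for a corpus of size $m$, gives
$$\frac{n(m - n + 1) - k(n - 1)}{n}\binom{n+k-1}{k}$$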
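As a sanity check, the simplification can be written out step by step and specialized to trigrams. The derivation below is a sketch under our own notation ($m$ for the corpus size, $s$ for the exact number of skips), which is an assumption and may differ from the symbols originally used. For $n = 3$ the last line agrees with the count obtained by summing $(m - s - 2)(s + 1)$ over $s = 0, \dots, k$.

```latex
\begin{align*}
\sum_{s=0}^{k} (m-n-s+1)\binom{n+s-2}{s}
  &= (m-n+1)\sum_{s=0}^{k}\binom{n+s-2}{s}
     - \sum_{s=0}^{k} s\binom{n+s-2}{s} \\
  &= (m-n+1)\binom{n+k-1}{k} - (n-1)\binom{n+k-1}{k-1} \\
  &= \frac{n(m-n+1) - k(n-1)}{n}\binom{n+k-1}{k}, \\
\text{and for } n = 3:\quad
  \frac{3(m-2) - 2k}{3}\binom{k+2}{k}
  &= \frac{(k+1)(k+2)}{6}\,(3m - 2k - 6).
\end{align*}
```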
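The formula is almost complete, apart from a few corner cases. If $n = 0$, we do not select any n-grams and the result should be zero. Previously it was also mentioned that the number of substrings must be positive for every number of skips up to $k$, that is $m - n - k + 1 > 0$, which is the same as $m \ge n + k$.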
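One way to package the closed form together with its corner cases is sketched below. This is an illustration, not the implementation from the repository: the function name `k_skip_n_gram_count` is hypothetical, the closed form used is $\frac{n(m-n+1) - k(n-1)}{n}\binom{n+k-1}{k}$ with `m` as the corpus size, and the handling of `m < n + k` (clamping `k` to `m - n`, the largest number of skips that can fit) is our own assumption.

```python
from math import comb


def k_skip_n_gram_count(m: int, n: int, k: int) -> int:
    """Closed-form count of k-skip-n-grams over m token positions.

    Assumed conventions (not taken from the paper): n <= 0, k < 0, or m < n
    gives 0, and k is clamped to m - n so that the closed form stays valid.
    """
    if n <= 0 or k < 0 or m < n:
        return 0
    k = min(k, m - n)  # more than m - n skips cannot fit in the corpus
    numerator = (n * (m - n + 1) - k * (n - 1)) * comb(n + k - 1, k)
    return numerator // n  # the numerator is always a multiple of n


print(k_skip_n_gram_count(5, 2, 2))  # 9
print(k_skip_n_gram_count(5, 3, 2))  # 10
print(k_skip_n_gram_count(5, 0, 2))  # 0
```

Clamping `k` means that once every selection of `n` positions is allowed, the count stops growing and equals the number of ways to choose `n` positions out of `m`.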
3 Additional materials
The code and verification for the formula are available at https://github.com/salvador-dali/k-skip-n-gram
References
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
- David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. A Closer Look at Skip-gram Modelling. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006.