On the number of k-skip-n-grams

05/14/2019 ∙ by Dmytro Krasnoshtan, et al.

The paper proves that the number of k-skip-n-grams for a corpus of size $L$ is $\frac{Ln + n + k' - n^2 - nk'}{n} \cdot \binom{n-1+k'}{n-1}$, where $k' = \min(L - n + 1, k)$.

1 Introduction

Skip-gram [1] is a popular technique used in natural language processing in which, in addition to contiguous sequences of words, we allow a word to be replaced with a skip token. The model is used to overcome the data sparsity problem and provides an efficient method for learning high-quality vector representations for phrases.

Guthrie et al. further investigated the use of skip-grams by introducing k-skip-n-grams [2] and showed empirically that they can be more effective than increasing the size of the training corpus. In their paper, they also provided the following formula for calculating the number of k-skip-trigrams ($n = 3$) for a corpus of size $L$:

$$\frac{(k+1)(k+2)}{6}\,(3L - 2k - 6)$$

The purpose of this paper is to derive the general case formula for arbitrary $L$, $n$, and $k$.
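To make the object being counted concrete, here is a minimal Python sketch that enumerates k-skip-n-grams directly. It assumes the usual reading of the definition: n tokens taken in order, with at most k tokens skipped in total between the first and the last. The function name `skip_grams` and the sample sentence are illustrative and do not come from the paper or its repository.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Enumerate k-skip-n-grams: n tokens in order with at most k skipped tokens in total.

    Illustrative helper (an assumption of this sketch, not the paper's code).
    """
    grams = []
    for start in range(len(tokens)):
        # The last token may sit at most n - 1 + k positions after the first one.
        window = range(start + 1, min(start + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append(tuple(tokens[i] for i in (start,) + rest))
    return grams


tokens = "the quick brown fox".split()
print(len(skip_grams(tokens, 2, 1)))  # 5
```

The enumeration counts occurrences, not distinct n-grams, which is also what the formula in this paper counts.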

2 Proof

The general formula can be derived from the algorithm for constructing the k-skip-n-grams. There are a few recursive algorithms to construct them, but the one that makes the counting easier relies on the following intuition:

The number of k-skip-n-grams is equal to the number of n-grams with 0 skips, plus the number with exactly 1 skip, plus the number with exactly 2 skips, and so on up to the number with exactly $k$ skips. So if the number of n-grams with exactly $s$ skips is $N_s$, then the total number of all k-skip-n-grams is $\sum_{s=0}^{k} N_s$.

To derive the formula for the number of n-grams with exactly $s$ skips, let's see how we can generate such an n-gram. Generating n-grams with $s$ skips is equivalent to selecting a sequence of length $n + s$ and substituting $s$ of its elements with skips. It is important to realize that you can't substitute the first or the last element, as the resulting n-gram would be equivalent to

  • an n-gram with $s - 1$ skips if you substitute only one (first or last) element with a skip

  • an n-gram with $s - 2$ skips if you substitute both (first and last) elements with a skip

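So we need to choose $s$ substitutions from $n + s - 2$ positions, which can be done in $\binom{n+s-2}{s}$ different ways. Because we can generate $L - n - s + 1$ (which should be positive) different substrings of length $n + s$ from the corpus of size $L$, the total number of n-grams with exactly $s$ skips is

$$N_s = (L - n - s + 1)\binom{n+s-2}{s}.$$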
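The counting argument above can be checked by brute force on small inputs. The following sketch compares the product "number of windows of length n + s" times "ways to place s skips among the interior positions" against a direct enumeration of index tuples. Both function names are hypothetical, and the sketch assumes n ≥ 2 so that the binomial coefficient has non-negative arguments.

```python
from itertools import combinations
from math import comb


def exact_skips_brute(L, n, s):
    """Count n-tuples of increasing positions in a corpus of size L spanning exactly n + s positions."""
    return sum(
        1
        for idx in combinations(range(L), n)
        if idx[-1] - idx[0] + 1 == n + s
    )


def exact_skips_formula(L, n, s):
    """Windows of length n + s, times choices of s skipped interior positions (assumes n >= 2)."""
    windows = L - n - s + 1
    if windows <= 0:
        return 0  # no window of length n + s fits into the corpus
    return windows * comb(n + s - 2, s)


# Compare both counts on a small grid of parameters.
for L in range(0, 9):
    for n in range(2, 5):
        for s in range(0, 5):
            assert exact_skips_brute(L, n, s) == exact_skips_formula(L, n, s)
```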

Therefore the total formula for k-skip-n-grams is

$$\sum_{s=0}^{k} (L - n - s + 1)\binom{n+s-2}{s}.$$

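This expression can be simplified using the following identities:

  • $\sum_{s=0}^{k} \binom{n+s-2}{s} = \binom{n+k-1}{k}$ can be proved by induction

  • $\sum_{s=0}^{k} s\binom{n+s-2}{s} = (n-1)\binom{n+k-1}{k-1}$ can be proved by induction

  • $\binom{n+k-1}{k-1} = \frac{k}{n}\binom{n+k-1}{k}$ can be proved from the definition of binomial

So

$$\sum_{s=0}^{k} (L - n - s + 1)\binom{n+s-2}{s} = (L - n + 1)\binom{n+k-1}{k} - \frac{(n-1)k}{n}\binom{n+k-1}{k} = \frac{Ln + n + k - n^2 - nk}{n}\binom{n-1+k}{n-1}.$$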
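The simplification can be sanity-checked numerically. The sketch below assumes the three identities are the two hockey-stick-style sums $\sum_{s=0}^{k}\binom{n+s-2}{s} = \binom{n+k-1}{k}$ and $\sum_{s=0}^{k} s\binom{n+s-2}{s} = (n-1)\binom{n+k-1}{k-1}$, together with $n\binom{n+k-1}{k-1} = k\binom{n+k-1}{k}$; this is a reconstruction consistent with the closed form in the abstract, not a quotation. It also compares the uncapped sum with the uncapped closed form. Function names are illustrative, and n ≥ 2 is assumed.

```python
from math import comb


def sum_form(L, n, k):
    """Sum over s = 0..k of (L - n - s + 1) * C(n + s - 2, s), with no cap on k."""
    return sum((L - n - s + 1) * comb(n + s - 2, s) for s in range(k + 1))


def closed_form(L, n, k):
    """(Ln + n + k - n^2 - nk) / n * C(n - 1 + k, n - 1), with no cap on k."""
    numerator = (L * n + n + k - n * n - n * k) * comb(n - 1 + k, n - 1)
    assert numerator % n == 0  # the division is exact because the sum is an integer
    return numerator // n


for n in range(2, 7):
    for k in range(1, 8):
        # Assumed identity 1.
        assert sum(comb(n + s - 2, s) for s in range(k + 1)) == comb(n + k - 1, k)
        # Assumed identity 2.
        assert sum(s * comb(n + s - 2, s) for s in range(k + 1)) == (n - 1) * comb(n + k - 1, k - 1)
        # Assumed identity 3, written without fractions.
        assert n * comb(n + k - 1, k - 1) == k * comb(n + k - 1, k)
        for L in range(0, 30):
            assert sum_form(L, n, k) == closed_form(L, n, k)
```

Without the cap on k the identity is purely algebraic, so it holds even when some terms of the sum are negative; the corner cases are what make the count meaningful.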

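The formula is almost complete apart from a few corner cases. If $L < n$, we do not select any n-grams and the result should be zero. Previously it was also mentioned that $L - n - s + 1 > 0$, which is the same as $s < L - n + 1$. Capping $k$ at $k' = \min(L - n + 1, k)$ enforces this (the term with $s = L - n + 1$ contributes zero), which gives the final formula

$$\frac{Ln + n + k' - n^2 - nk'}{n}\binom{n-1+k'}{n-1}, \qquad k' = \min(L - n + 1, k).$$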
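Putting the corner cases together, a minimal sketch of the final formula could look as follows. It is written independently of the linked repository: the names `count_k_skip_n_grams` and `brute_force` are hypothetical, and $k$ is assumed to be a non-negative integer. The closed form is checked against a direct enumeration and against the trigram formula of Guthrie et al. [2], $\frac{(k+1)(k+2)}{6}(3L - 2k - 6)$, taken here to be valid for $L > k + 2$.

```python
from itertools import combinations
from math import comb


def count_k_skip_n_grams(L, n, k):
    """Closed-form number of k-skip-n-grams in a corpus of L tokens (k >= 0 assumed)."""
    if L < n:
        return 0  # the corpus is too short to contain any n-gram
    kp = min(L - n + 1, k)  # k' from the abstract
    return (L * n + n + kp - n * n - n * kp) * comb(n - 1 + kp, n - 1) // n


def brute_force(L, n, k):
    """Count n-tuples of increasing positions that skip at most k positions in total."""
    return sum(
        1
        for idx in combinations(range(L), n)
        if (idx[-1] - idx[0] + 1) - n <= k
    )


# Closed form against enumeration, including L < n and large k.
for L in range(0, 10):
    for n in range(1, 6):
        for k in range(0, 8):
            assert count_k_skip_n_grams(L, n, k) == brute_force(L, n, k)

# Special case n = 3 against the trigram formula, for L > k + 2.
for k in range(0, 8):
    for L in range(k + 3, 40):
        assert count_k_skip_n_grams(L, 3, k) == (k + 1) * (k + 2) * (3 * L - 2 * k - 6) // 6
```

When $k \ge L - n$, every choice of $n$ positions is allowed, and the formula reduces to $\binom{L}{n}$, which the enumeration confirms.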

3 Additional materials

The code and verification for the formula are available at https://github.com/salvador-dali/k-skip-n-gram

References

  • [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
  • [2] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. A Closer Look at Skip-gram Modelling. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006.