N-Grams

What is an N-Gram?

An N-Gram is a connected string of N items from a sample of text or speech. The items can range from whole words down to smaller units such as syllables. N-Grams are the basis of N-Gram models, which are instrumental in natural language processing as a way of predicting upcoming text or speech.
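As a concrete illustration, here is a minimal Python sketch of extracting word-level N-Grams from a sentence (the `ngrams` helper and the sample sentence are made up for illustration):

```python
def ngrams(items, n):
    """Return consecutive n-item windows (N-Grams) from a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the quick brown fox jumps".split()
print(ngrams(words, 2))  # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
print(ngrams(words, 3))  # trigrams: ('the', 'quick', 'brown'), ...
```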


N-Gram Models

As mentioned above, N-Gram models are used to aid in the prediction of speech and/or text. They rely on the stochastic properties of N-Grams and often incorporate elements of the Markov model. N-Gram models, and the algorithms that use them, benefit from their relative simplicity and scalability, allowing smaller experiments to scale up to larger values of N.
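In formula terms, the Markov-style assumption behind an N-Gram model is that the next item depends only on the previous N-1 items; a standard way to write this (notation not taken from the text above) is:

```latex
% Markov assumption: condition only on the previous N-1 items
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})

% Simple maximum-likelihood estimate from N-Gram counts C(\cdot) in the training data
P(w_i \mid w_{i-N+1}, \dots, w_{i-1}) = \frac{C(w_{i-N+1} \dots w_i)}{C(w_{i-N+1} \dots w_{i-1})}
```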

Skip-Gram

In language processing, a skip-gram variant is sometimes applied to the model. A skip-gram is a generalization of the N-Gram in which the items, typically words, are not required to be consecutive in the text being classified; they may leave gaps that are skipped over, hence the name skip-gram.
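A minimal sketch of generating k-skip-n-grams, assuming the common definition in which a total of up to k items may be skipped within each gram (the `skipgrams` helper and the example sentence are illustrative):

```python
from itertools import combinations

def skipgrams(items, n, k):
    """n-item skip-grams: in-order selections that skip at most k items in total."""
    grams = []
    for start in range(len(items) - n + 1):
        # the remaining n-1 items may come from the next n-1+k positions
        window = range(start + 1, min(start + n + k, len(items)))
        for rest in combinations(window, n - 1):
            grams.append((items[start],) + tuple(items[i] for i in rest))
    return grams

words = "the rain in Spain falls mainly".split()
print(skipgrams(words, 2, 1))  # 1-skip-2-grams, e.g. ('the', 'rain') and ('the', 'in')
```

Note that ordinary N-Grams (zero skips) are included as a special case.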

How does an N-Gram model work?

N-Gram models work by taking a sequence of items and predicting the items that follow. For example, imagine a string of letters used in DNA sequencing (e.g. GATC). An N-Gram model analyzes the sequence of letters and, using training data, creates a probability distribution over the likely upcoming values. Each possible value is assigned a probability (e.g. 0.0004), and all of the probabilities sum to 1.
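A minimal sketch of that idea, using made-up DNA strings as training data and a simple count-based estimate (the data and the `next_symbol_distribution` helper are assumptions for illustration):

```python
from collections import Counter

def next_symbol_distribution(training_sequences, context, n):
    """Estimate P(next symbol | previous n-1 symbols) from raw N-Gram counts."""
    counts = Counter()
    for seq in training_sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            if gram[:-1] == context:
                counts[gram[-1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # context never seen in the training data
    # normalize so the probabilities of all possible next symbols sum to 1
    return {symbol: count / total for symbol, count in counts.items()}

training = ["GATCGATTACAGATC", "ATCGGATCA"]         # toy training sequences
print(next_symbol_distribution(training, "AT", 3))  # {'C': 0.8, 'T': 0.2}
```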

Applications of N-Grams

The DNA example above is in fact a common use of N-Gram models. From prediction in DNA sequencing to better text prediction within neural networks, N-Gram models have a wide range of applications.

Language Identification

N-Gram models are used in natural language processing as a tool for modeling probable upcoming sequences of characters; character sequences of length three are known as trigrams or 3-grams. For example, the phrase "Good Afternoon" breaks down into the trigrams "Goo", "ood", "od ", "d A", etc. In machine translation models, however, N-Gram models are usually used in conjunction with Bayesian inference, leading to more accurate predictions.
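A minimal sketch of the character-trigram extraction that underlies such a model (the `char_trigrams` helper is illustrative):

```python
def char_trigrams(text):
    """Slide a 3-character window across the text to get overlapping trigrams."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(char_trigrams("Good Afternoon"))
# ['Goo', 'ood', 'od ', 'd A', ' Af', 'Aft', 'fte', 'ter', ...]
```

Comparing the trigram frequencies of an unknown text against per-language frequency profiles is one common way such models identify a language.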

Information Retrieval

The predictive power of N-Grams is also applicable to modern search algorithms. For example, when an N-Gram model is applied to a database of documents, a single query document can be used to retrieve a set of similar documents. Using reference documents as training data, N-Gram models help a search function surface relevant additional resources.
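A minimal sketch of this kind of retrieval, ranking documents by the overlap of their word bigrams with a query document (the corpus, helper names, and the use of Jaccard similarity are assumptions for illustration, not a description of any particular search engine):

```python
def bigram_set(text):
    """Lowercased word bigrams of a document, collected as a set for comparison."""
    words = text.lower().split()
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

def rank_similar(query_doc, corpus):
    """Rank corpus documents by Jaccard overlap of their bigram sets with the query."""
    query_grams = bigram_set(query_doc)
    scores = []
    for doc in corpus:
        doc_grams = bigram_set(doc)
        union = query_grams | doc_grams
        score = len(query_grams & doc_grams) / len(union) if union else 0.0
        scores.append((score, doc))
    return sorted(scores, reverse=True)

corpus = [
    "n-gram models predict upcoming words",
    "neural networks classify images",
    "language models predict upcoming words from context",
]
print(rank_similar("models predict upcoming words", corpus))
```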