What is a Lexicon?

In its simplest form, a lexicon is the vocabulary of a person, language, or branch of knowledge. It is a catalog of words, and it is typically used in conjunction with a grammar, the set of rules for using those words. Items within a lexicon are called lexemes, and groups of lexemes are called lemmas, which are often the unit used for describing the size of a lexicon. Additionally, a lexicon is divided into two main categories, open and closed. An open category consists of lexemes that are more semantic in nature, like nouns and verbs, and readily admits new members, whereas a closed category consists of syntactic lexemes, such as pronouns and determiners.
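As a rough illustration of the structure described above, the following sketch stores a toy lexicon, groups its items by lemma, and reports the lexicon's size in lemmas. The entries, the `OPEN_CATEGORIES` set, and the helper names are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical toy entries: (item, lemma, part of speech).
ENTRIES = [
    ("run", "run", "verb"),
    ("runs", "run", "verb"),
    ("ran", "run", "verb"),
    ("dog", "dog", "noun"),
    ("dogs", "dog", "noun"),
    ("the", "the", "determiner"),
    ("she", "she", "pronoun"),
]

# Parts of speech treated as open categories in this sketch (an assumption).
OPEN_CATEGORIES = {"noun", "verb", "adjective", "adverb"}


def group_by_lemma(entries):
    """Map each lemma to the list of items that belong to it."""
    groups = defaultdict(list)
    for item, lemma, _pos in entries:
        groups[lemma].append(item)
    return dict(groups)


def category_of(pos):
    """Return 'open' or 'closed' for a part-of-speech label."""
    return "open" if pos in OPEN_CATEGORIES else "closed"


lemmas = group_by_lemma(ENTRIES)
lexicon_size = len(lemmas)  # size measured in lemmas, not individual items

print(lexicon_size)
print(lemmas["run"])
print(category_of("noun"), category_of("pronoun"))
```

Counting lemmas rather than individual items keeps inflected variants such as "runs" and "ran" from inflating the reported size.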

Mechanisms of Lexicons

A lexicon defines a vocabulary of lexemes, but there are also many mechanisms by which lexemes are transformed. For example, a common mechanism is compounding, where two lexemes are combined into one (e.g. "note" and "book" become "notebook", "can not" becomes "cannot"). A related mechanism is contraction, where a phrase is shortened by dropping sounds ("could have" becomes "could've", "cannot" becomes "can't"). Other mechanisms include abbreviations ("et cetera" becomes "etc."), acronyms ("United States of America" becomes "USA"), and innovation, which is the planned creation of new roots, such as slang or branding.


Neologisms are new lexemes that, if used widely, become integrated into a lexicon. One specific type of neologism is the phono-semantic match. These lexemes preserve both the phonetic (sound) and semantic (meaning) properties of the word in the source language. An example of a phono-semantic neologism is the English rendering of the French "chaise longue", which is often pronounced "chase lounge". The English pronunciation borrows from the French pronunciation (phonetic), and the meaning is the same (semantic).

Lexicons and Machine Learning

Neural networks for information retrieval or document search use lexicons as their sample data. Each lexeme can be assigned an associated vector, where the substance of the lexeme is encoded as coordinates and its frequency within a database determines its magnitude (length). Using techniques like cosine similarity, which measures the angle between two vectors rather than their lengths, machine learning algorithms can quickly distinguish and compare documents based upon their lexical similarities and overlap of subject matter.
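To make the comparison concrete, here is a minimal sketch of cosine similarity between documents. It assumes the simplest possible representation, a raw term-frequency vector with whitespace tokenization, rather than the learned vectors a neural network would produce. The function names and sample documents are illustrative only.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector: one dimension per distinct token."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = term_vector("the cat sat on the mat")
doc_b = term_vector("the cat lay on the rug")
doc_c = term_vector("stocks fell sharply today")

print(cosine_similarity(doc_a, doc_b))  # shares several terms with doc_a
print(cosine_similarity(doc_a, doc_c))  # shares no terms with doc_a
```

Because the score depends only on the angle between the vectors, a long document and a short one on the same subject can still score as similar.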


Another application of lexicons and machine learning is sentiment analysis. In this application, words are assigned sentiment values learned from training data. Using classifiers like the naive Bayes classifier, which treats each feature as independent of the others given the class, machine learning models can process lexemes and assign them sentiment scores based on their individual characteristics. This method differs from the traditional, lexicon-based method, in which sentiment values are assigned to a set of lexemes by hand, because the bulk of the processing, once training is complete, is done without human intervention.
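The following sketch shows one way a naive Bayes sentiment classifier could be written from scratch. It assumes whitespace tokenization, add-one (Laplace) smoothing, and a tiny made-up training set. The class name, method names, and data are hypothetical and not drawn from any particular library.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def log_scores(self, text):
        """Return the unnormalized log-probability of each label."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            # Start from the log prior of the label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Each word contributes independently (the "naive" assumption).
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training examples, for illustration only.
TRAINING = [
    ("great film loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("terrible film hated it", "negative"),
    ("boring story awful acting", "negative"),
]

model = NaiveBayesSentiment()
model.train(TRAINING)
print(model.predict("great story"))
print(model.predict("awful boring film"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word never seen with a label from forcing that label's probability to zero.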