A Survey On Neural Word Embeddings

10/05/2021
by   Erhan Sezerer, et al.

Understanding human language has been a central sub-challenge on the way to intelligent machines. The study of meaning in natural language processing (NLP) relies on the distributional hypothesis, in which language elements get their meaning from the words that co-occur with them in context. The revolutionary idea of distributed representations for concepts is close to the workings of the human mind in that the meaning of a word is spread across several neurons, and a loss of activation only slightly affects the memory retrieval process. Neural word embeddings transformed the whole field of NLP by introducing substantial improvements in all NLP tasks. In this survey, we provide a comprehensive literature review on neural word embeddings. We give theoretical foundations and describe existing work through the interplay between word embeddings and language modelling. We provide broad coverage of neural word embeddings, including early word embeddings, embeddings targeting specific semantic relations, sense embeddings, morpheme embeddings, and finally, contextual representations. Lastly, we describe benchmark datasets used in evaluating word embeddings and downstream tasks, along with the performance results obtained with word embeddings.



1. Introduction

The recent decade has witnessed a transformation in natural language processing (NLP). This transformation can be attributed to neural language models, their success in representation learning, and the transfer of this knowledge into complex NLP tasks.

Before neural representation learning, representations of words or documents were computed using the vector space model (VSM) of semantics.

Turney and Pantel (2010) provide a comprehensive survey on the use of VSMs for semantics. In the VSM (Salton et al., 1975), frequencies of words in documents form a term-document matrix, and global co-occurrences of words in context lead to word-context matrices (Deerwester et al., 1990; Lund and Burgess, 1996; Lebret and Collobert, 2014). Although these count-based representations have proved helpful in addressing semantics, they are bag-of-words approaches and cannot capture syntactic and semantic features at the same time, which is required for performing well in NLP tasks.

Neural word embeddings are a by-product of neural language models. In neural language modeling, a neural network is constructed to predict the next word given the set of neighboring words in the sequence. As this prediction is iterated over a large corpus, the learned weights in the hidden layers serve as the neural embeddings of words.

Neural word embeddings have gone through an evolution. Early word embeddings had some shortcomings: although they can learn syntactic and semantic regularities, they are not good at capturing their mixture. Moreover, they provide just one representation that is shared among the different senses of a word. State-of-the-art contextual embeddings address these problems. They lead to significant performance improvements and find application throughout all NLP tasks and in many other fields (Wang et al., 2018; Kutuzov et al., 2018; Khattak et al., 2019).

In this article, we describe this transition by first providing the theoretical foundations. Then, preliminary realizations of these ideas in some seminal papers are explained. In the remaining part, generally accepted and efficiently computable early word embeddings are introduced. Afterward, extensions to early word embeddings are given with respect to criteria such as the use of a knowledge base, the inclusion of morphological features, and the targeting of specific semantic relations (synonymy, antonymy, hypernymy, hyponymy, etc.). Subsequently, separate sections are devoted to sense, morphological, and contextual embeddings. We also include the performance evaluation of word embeddings on benchmark datasets. Finally, we conclude the article with some historical reflections and future remarks. We also include a diagram showing the milestone papers and summarizing the flow of ideas in the field in Appendix A.

Multilingual information requirements and parallel/comparable corpora in different languages pave the way for cross-lingual representations of words in a joint embedding space. In this survey, we exclude those techniques that specialize in learning word representations in a multilingual setting. The reader can refer to Ruder et al. (2019) for a comprehensive overview of cross-lingual word embedding models.

2. Background

2.1. Distributional Hypothesis

Together with Wittgenstein (1953), Harris (1954) was one of the first authors to propose that languages have a distributional structure. He argues that language elements are dispersed into environments that are composed of an existing array of their co-occurrents. An element's distribution is the sum of all these environments. Harris' second contribution is relating an element's distribution to its meaning. He states that at least certain aspects of meaning are due to distributional relations. For instance, synonymy between two words can be defined as having almost identical environments except chiefly for glosses where they co-occur, e.g., oculist and eye-doctor. The author also suggests that a sentence starting with a pronoun should be considered part of the same context as the previous sentence where the referent of the pronoun is given, since its occurrence is not arbitrary, and that the fullest environmental unit for distributional investigation is the connected discourse formed by such sentences.

2.2. Distributional Representations

Hinton et al. (1986) utilize the idea of distributed representations for concepts. They propose patterns of hidden layer activations (which are only allowed to be 0 or 1) as the representation of meanings. They argue that the most important evidence for distributed representations is their degree of similarity to the weaknesses and strengths of the human mind. Unlike computer memory, the human brain is able to retrieve memories from partial information. Distributed representations conform to this notion better than local representations (i.e., the bag-of-words model, where each meaning is associated with a single computational unit), since the meaning of a word is distributed across several units and the loss of an activation only slightly affects the memory retrieval process. The remaining activations will still be able to retrieve the memory. Even if the occlusion of activations is strong enough to lead the system to an incorrect meaning, it will still result in a meaning close to that of the target word, such as recalling peach instead of apricot. The authors state that this phenomenon further reinforces the similarity to the human mind by resembling the deep dyslexia that occurs in adults with certain brain damage.

2.3. Language Modeling

Language modeling is the task of predicting the next word given a sequence of words. Formally, it is the prediction of the next word's probability distribution given a sequence of words (Equation 1).

P(w_t \mid w_1, w_2, \dots, w_{t-1})   (1)

In an alternative interpretation, a language model assigns a probability to a sequence of words. This probability can be formulated, via the chain rule, as the product of the conditional probabilities of each word given the preceding ones (Equation 2).

P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})   (2)

In traditional language modeling, the next word's probability is calculated based on the statistics of n-gram occurrences, where an n-gram is a sequence of n consecutive words. In n-gram language models (Chen and Goodman, 1996; Kneser and Ney, 1995), an n-gram's probability is computed from the preceding n-1 words instead of the full product of conditional probabilities, which simplifies the computation.

n-gram language models have some issues. As the length of n-grams increases, their occurrences become sparse. This sparsity causes zero probabilities or division by zero: the former is resolved by smoothing, and back-off is used to deal with the latter. Sparsity also yields coarse-grained values in the resultant probability distribution. Moreover, storing all n-gram statistics becomes a major problem as n increases. This curse of dimensionality is a bottleneck for n-gram language models.
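As a concrete illustration of the counting and smoothing involved, the following is a minimal sketch (toy corpus, add-one smoothing; not taken from the survey) of a bigram language model:

```python
from collections import Counter

# Toy corpus; in practice counts come from a large training corpus.
corpus = "the cat is walking in the bedroom . a dog was running in a room .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def bigram_prob(prev, word, alpha=1.0):
    """P(word | prev) with add-one (Laplace) smoothing to avoid zero probabilities."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

def sentence_prob(tokens):
    """Product of bigram conditional probabilities (Equation 2 with a one-word context)."""
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(bigram_prob("the", "cat"))   # seen bigram: relatively high probability
print(bigram_prob("the", "dog"))   # unseen bigram: small but non-zero thanks to smoothing
print(sentence_prob("the cat is running in a room".split()))
```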

2.4. Distributional Representations through Language Modeling

Elman (1990) was the first to implement the distributional model proposed by Hinton et al. (1986) in a language model. He proposes a specific recurrent neural network structure with memory, called the Elman network, to predict bits in temporal sequences. Memory is provided to the network through context units that are fully connected to the hidden units. He runs a simulation to predict bits in the XOR problem, where the input sequence consists of an input pair followed by the output bit. In the solution scheme, two hidden units are expected to represent the two main patterns in the XOR truth table: one hidden unit should have high activation for the 01 or 10 input pattern, and the other should recognize the 00 or 11 pattern. As an alternative problem, letter sequences that are generated partially at random and partially by a simple rule are learned by a recurrent neural network whose hidden unit activations are used to represent word meanings. The idea is that with such network structures, time can be modeled in an implicit way. In other words, the use of a recurrent neural network helps in learning the temporal structure of language.

Xu and Rudnicky (2000) create the first language model based on neural networks. Their proposed model consists of a single fully connected layer and uses one-hot vectors of words as inputs and outputs. They highlight computational cost as the major problem and, to tackle it, point out the necessity of update mechanisms that only update the weights associated with non-zero input values, which are few due to one-hot encoding.

Bengio et al. (2003) popularize the distributional representation idea by realizing it through a language model and lead to numerous other studies that are built on it. In their model architecture, they use a feed-forward network with a single hidden layer and optional direct connections from the input layer to the softmax layer (Figure 1).

Figure 1. Neural network architecture in Bengio et al. (2003). Taken from the original article.

In addition to the advantages discussed by the aforementioned earlier works, they argue that distributed representations also break the curse of dimensionality in traditional n-gram models (Chen and Goodman, 1996; Kneser and Ney, 1995), where the probability of each word depends on discrete n-grams whose number can exceed millions. A considerable number of such n-grams are highly unlikely to be observed in the training set, which results in sparsity problems in conditional probability calculations. A real-valued feature vector representation of words overcomes this problem by working with a smooth probability function. The conditional probability of seeing a word given a context is calculated by updating that word's row in the representation matrix shared across the whole vocabulary. The probability function is smooth in that the updates caused by similar contexts are alike.
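The forward pass of such a feed-forward neural language model can be sketched as follows, with toy dimensions and random weights standing in for trained parameters, and the optional direct input-to-output connections omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, h, n = 1000, 50, 100, 4           # vocab size, embedding dim, hidden dim, n-gram order
C = rng.normal(scale=0.1, size=(V, m))  # shared word feature matrix (the learned embeddings)
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))  # projection-to-hidden weights
U = rng.normal(scale=0.1, size=(V, h))  # hidden-to-output weights
b, d = np.zeros(V), np.zeros(h)

def next_word_distribution(context_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for an (n-1)-word context of vocabulary indices."""
    x = C[context_ids].reshape(-1)      # concatenate the context word embeddings
    a = np.tanh(d + H @ x)              # non-linear hidden layer
    scores = b + U @ a                  # one score per vocabulary word
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

p = next_word_distribution([12, 7, 301])
print(p.shape, p.sum())                 # (1000,) 1.0
```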

A second advantage of the model is the ability to capture context-based similarities. In n-gram models, the sentences "the cat is walking in the bedroom" and "a dog was running in a room" are considered dissimilar, since such models cannot look beyond a window of n-1 words and have no notion of similarity among word meanings. In the proposed model, on the other hand, increasing the probability of the sentence "the cat is walking in the bedroom" will also increase the probability of all the sentences below and help the model generalize better:

"a dog was running in a room"
"the cat is running in a room"
"a dog is walking in a bedroom"

3. Word Embeddings with Improved Language Models

Once Bengio et al. (2003) showed that neural language models are efficiently computable, newer language models along with better word embeddings were developed in quick succession. All of these models and their properties are summarized in Table 1.

Alexandrescu and Kirchhoff (2006) (FNLM) improve the model proposed by Bengio et al. (2003) by including word-shape features such as stems, affixes, capitalization, POS class, etc. at input.

Morin and Bengio (2005) focus on improving the performance of the earlier neural language models. Instead of using softmax and predicting the output word over the entire dictionary, they propose a hierarchical organization for vocabulary terms. A binary tree of words is created based on the IS-A relation of Wordnet hierarchy. Instead of directly predicting each word’s probability, prediction is performed as a binary decision over the constructed tree’s branches and leaves. This technique is an alternative to importance sampling to increase efficiency. Although the authors report exponential speed-up, the accuracy of the resultant word embeddings is a bit worse than the original method and importance sampling.
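To illustrate the hierarchical prediction idea, the toy sketch below (hypothetical tree and random parameters, not the WordNet-based construction of the paper) computes a word's probability as a product of binary decisions along its root-to-leaf path, requiring O(log |V|) sigmoid evaluations instead of a |V|-way softmax:

```python
import numpy as np

rng = np.random.default_rng(8)
d = 16
context = rng.normal(size=d)            # context representation produced by the language model

# Toy binary tree over a 4-word vocabulary: each word is reached by a sequence of
# (internal node id, branch direction) decisions; each internal node has its own vector.
paths = {
    "cat": [(0, +1), (1, +1)],
    "dog": [(0, +1), (1, -1)],
    "car": [(0, -1), (2, +1)],
    "bus": [(0, -1), (2, -1)],
}
node_vecs = {n: rng.normal(size=d) for n in (0, 1, 2)}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_prob(word):
    """P(word | context) as a product of binary branch probabilities along the tree path."""
    p = 1.0
    for node, direction in paths[word]:
        p *= sigmoid(direction * (node_vecs[node] @ context))
    return p

print(sum(word_prob(w) for w in paths))  # the leaf probabilities sum to 1
```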

Mnih and Hinton (2008) improve the hierarchical language model proposed by Morin and Bengio (2005) by constructing and using a word hierarchy derived from distributional representations of words rather than a hierarchy built out of WordNet. Thus, their approach is entirely unsupervised. They calculate feature vectors for words by training a hierarchical log-bilinear model (HLBL) and apply the EM algorithm on a mixture of two Gaussians to construct a data-driven binary tree for the words in the vocabulary. The authors also represent different senses of words as different leaves in the tree, which was proposed in Morin and Bengio (2005) but not implemented. Their model outperforms non-hierarchical neural models, the hierarchical neural language model based on the WordNet hierarchy, and the best n-gram models (Chen and Goodman, 1996; Kneser and Ney, 1995).

Mnih and Hinton (2007) propose three different language models that use distributed representations of words. In the Factored Restricted Boltzmann Machine (RBM) model, they put an additional hidden layer over the distributed representations of the preceding words and exploit interactions between this hidden layer and the next word's distributed representation. In the temporal RBM, they further add temporal connections among hidden layer units to capture longer dependencies in the previous set of words. Finally, in the log-bilinear model, called LBL, they use linear dependencies between the next word and the preceding set of words. They report that the log-bilinear model outscores the RBM models and also n-gram models (Chen and Goodman, 1996; Kneser and Ney, 1995).

Collobert and Weston (2008) and Collobert et al. (2011) (C&W) are among the precursors in using distributed representations in various NLP problems such as part-of-speech tagging, named entity recognition, chunking, and semantic role labeling. They propose a unified architecture for all of the problems where the words in the sentences are represented by word vectors trained from the Wikipedia Corpus in an unsupervised fashion. Although they use a feed forward architecture with a sliding window approach in word-level tasks, they utilize a convolutional neural network (CNN) architecture in semantic role labeling in order to incorporate the varying lengths of sentences, since in semantic role labeling, sliding window-based approaches don't work because target words may depend on some other far away words in a sentence. By using trained word vectors and neural network architecture, their proposed method can capture the meaning of words and succeed in various NLP tasks (almost) without using hand-crafted features. Their overall scheme is described as semi-supervised, being composed of unsupervised language modeling and other supervised tasks.

Mikolov et al. (2010) propose a recurrent neural network-based language model (RNNLM), from which word representations can be taken. The model is able to consider contexts of arbitrary length, unlike the previous feed-forward methods where a context size must be defined beforehand. The network can learn longer dependencies. It proves useful in tasks involving inflectional languages or languages with large vocabularies when compared to n-gram language models (Chen and Goodman, 1996; Kneser and Ney, 1995).

| Model | Year | Dimension | Training Corpus | NN Model | Aim | Knowledge-Base(s) | Feature(s) |
|---|---|---|---|---|---|---|---|
| Bengio et al. (Bengio et al., 2003) | 2003 | 100 | Brown | FFNN | Training | - | - |
| Morin and Bengio (Morin and Bengio, 2005) | 2005 | 100 | Brown | FFNN | Performance | WordNet (Miller, 1995) | Hierarchical Binary Tree |
| FNLM (Alexandrescu and Kirchhoff, 2006) | 2006 | 45-64 | LDC ECA (Gadalla et al., 1997), Turkish News (Hakkani-Tür et al., 2002) | FFNN | Training | LDC ECA (Gadalla et al., 1997), Turkish News (Hakkani-Tür et al., 2002) | Word Shape Features |
| LBL (Mnih and Hinton, 2007) | 2007 | 100 | APNews | RBM, FFNN | Training | - | - |
| HLBL (Mnih and Hinton, 2008) | 2008 | 100 | APNews | LBL | Performance | - | Hierarchical Binary Tree |
| C&W (Collobert and Weston, 2008) | 2008 | 15-100 | Wiki | FFNN, CNN | Training | - | - |
| RNNLM (Mikolov et al., 2010) | 2010 | 60-400 | Gigaword | RNN | Training | - | - |
| CBOW (Mikolov et al., 2013a) | 2013 | 300-1000 | Google News | FFNN | Training | - | - |
| Skip-Gram (Mikolov et al., 2013a) | 2013 | 300-1000 | Google News | FFNN | Training | - | - |
| SGNS (Mikolov et al., 2013b) | 2013 | 300 | Google News | FFNN | Performance | - | Negative Sampling |
| ivLBL/vLBL (Mnih and Kavukcuoglu, 2013) | 2013 | 100-600 | Wiki | LBL | Performance | - | NCE (Gutmann and Hyvärinen, 2012) |
| GloVe (Pennington et al., 2014) | 2014 | 300 | Wiki, Gigaword, Commoncrawl | LBL + co-occurrence matrix | Training | - | - |
| DEPS (Levy and Goldberg, 2014) | 2014 | 300 | Wiki | CBOW | Training | Stanford tagger (Toutanova et al., 2003), Dependency parser (Goldberg and Nivre, 2012) | POS, Dependency relation |
| Ling et al. (Ling et al., 2015b) | 2015 | 50 | Wiki | CBOW + Attn. | Training | - | - |
| SWE (Liu et al., 2015) | 2015 | 300 | Wiki | Skip-Gram | Training | WordNet (Miller, 1995) | Ordinal Semantic Rules |
| Faruqui et al. (Faruqui et al., 2015) | 2015 | - | - | - | Fine-tuning | PPDB (Pavlick et al., 2015), FrameNet (Baker et al., 1998), WordNet (Miller, 1995) | Semantic Relations |
| Yin and Schütze (Yin and Schütze, 2016) | 2016 | 200 | - | - | Ensemble | - | - |
| Ngram2vec (Zhao et al., 2017) | 2017 | 300 | Wiki | SGNS + n-gram | Training | - | - |
| Dict2vec (Tissier et al., 2017) | 2017 | 300 | Wiki | Skip-Gram | Training | Oxford, Cambridge and Collins dict. | - |

Table 1. Properties of word embedding models.

3.1. Early Word Embeddings

Word2vec (Mikolov et al., 2013a) is the first neural word embedding model that efficiently computes representations by leveraging the context of target words. Thus, it can be considered the initiator of early word embeddings.

Mikolov et al. (2013a) propose word2vec to learn high-quality word vectors. The authors remove the non-linearity in the hidden layer of the model architecture proposed by Bengio et al. (2003) to gain an advantage in computational complexity. Thanks to this basic change, the system can be trained on billions of words efficiently. word2vec has two variants: the continuous bag-of-words model (CBOW) and the Skip-gram model.

In CBOW, a middle word is predicted given its context, i.e., the set of neighboring left and right words. When the input sentence "nature is pleased with simplicity" is processed, the system predicts the middle word "pleased" given the left and right contexts. Every input word is one-hot encoded as a vector of vocabulary size |V| that is all zeros except for a one at that word's index. In the single hidden layer, the average of the neighboring left and right word vectors, h, is computed to represent the context instead of applying a nonlinear transformation. As the order of words is discarded by averaging, it is named a bag-of-words model. Then the middle word's (w_t) probability given the context (h) is calculated through a softmax over the dot products between the context vector and the output word vectors (Equation 3). Finally, the output loss is calculated as the cross-entropy between the system's predicted distribution and the ground-truth middle word.

P(w_t \mid h) = \frac{\exp({v'_{w_t}}^{\top} h)}{\sum_{w \in V} \exp({v'_{w}}^{\top} h)}   (3)
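A minimal sketch of the CBOW probability in Equation 3, with random toy parameters assumed: the context vector is the average of the input vectors of the neighboring words, and the softmax is taken over its dot products with every output word vector.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 20, 8                           # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, d))         # input (context) embeddings
W_out = rng.normal(size=(V, d))        # output (middle-word) embeddings

def cbow_prob(context_ids, middle_id):
    """P(middle word | context) = softmax over dot products with the averaged context vector."""
    h = W_in[context_ids].mean(axis=0)      # bag-of-words: order is discarded by averaging
    scores = W_out @ h
    e = np.exp(scores - scores.max())
    return (e / e.sum())[middle_id]

# "nature is pleased with simplicity": predict the middle word from its neighbours.
print(cbow_prob(context_ids=[2, 5, 9, 11], middle_id=7))
```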

In Skip-gram, the system predicts the most probable context words for a given input word. In terms of a language model, while CBOW predicts an individual word's probability, Skip-gram outputs the probabilities of a set of words, defined by a given context size. Due to the high dimensionality of the output layer (all vocabulary words have to be considered), Skip-gram has higher computational complexity than CBOW. Rather than traversing the whole vocabulary in the output layer, Skip-gram with Negative Sampling (SGNS) (Mikolov et al., 2013b) formulates the problem as a binary classification where one class represents the probability that the current word-context pair comes from the corpus, while the other class covers the probability of all other vocabulary terms occurring in the present context. In the latter probability calculation, a negative sampling method is incorporated (Mnih and Teh, 2012), which is influenced by Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2012), to speed up the training process. As vocabulary terms are not distributed uniformly across contexts, samples are drawn from a distribution that takes the corpus frequencies of vocabulary words into consideration. SGNS incorporates this sampling idea by replacing Skip-gram's objective function. The new objective function (Equation 4) maximizes P(D=1 | w, c), where (w, c) is a word-context pair; this denotes the probability that (w, c) comes from the corpus data. Additionally, P(D=0 | w, c) should be maximized if the (w, c) pair is not included in the corpus data. Such pairs are, as the name suggests, negatively sampled k times for each observed pair.

\log \sigma({v'_{c}}^{\top} v_{w}) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(w)} \left[ \log \sigma(-{v'_{c_i}}^{\top} v_{w}) \right]   (4)
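The SGNS objective in Equation 4 can be sketched as follows (toy, randomly initialized parameters): the sigmoid of the word-context dot product models P(D=1 | w, c) for an observed pair, and k negative contexts drawn from a frequency-skewed noise distribution contribute the remaining terms.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, k = 20, 8, 5                     # vocabulary size, dimension, negatives per positive
W_word = rng.normal(scale=0.1, size=(V, d))
W_ctx = rng.normal(scale=0.1, size=(V, d))
unigram_freq = rng.random(V)
noise_dist = unigram_freq ** 0.75      # frequency-skewed noise distribution, as in word2vec
noise_dist /= noise_dist.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(word_id, ctx_id):
    """log sigma(v_c . v_w) + sum over k negatives of log sigma(-v_neg . v_w)."""
    w = W_word[word_id]
    pos = np.log(sigmoid(W_ctx[ctx_id] @ w))
    negatives = rng.choice(V, size=k, p=noise_dist)
    neg = np.log(sigmoid(-(W_ctx[negatives] @ w))).sum()
    return pos + neg                    # maximized during training (gradient ascent)

print(sgns_objective(word_id=3, ctx_id=10))
```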

Both word2vec variants produced word embeddings that can capture multiple degrees of similarity including both syntactic and semantic regularities.

Mnih and Kavukcuoglu (2013) introduce speedups to the CBOW and Skip-gram models (Mikolov et al., 2013a), called vLBL and ivLBL, by using noise-contrastive estimation (NCE) for the training of the unnormalized counterparts of these models. Training the normalized model has a high cost due to the normalization over the whole vocabulary (the denominator term in Equation 3). NCE trains the unnormalized model by fitting a logistic regression classifier to discriminate between samples from the model and samples from a noise distribution. Thus, the computational cost and accuracy become dependent on the number of noise samples. With a relatively small number of noise samples, the same accuracy level as the normalized models is achieved in considerably shorter training times.

Pennington et al. (2014) combine global matrix factorization and local context window-based prediction to form a global log-bilinear model called GloVe. GloVe uses ratios of co-occurrence probabilities of words as weights in its objective function to cancel out the noise from non-discriminative words. As distinct from CBOW and Skip-gram (Mikolov et al., 2013a), GloVe uses weighted least squares regression in its objective function instead of cross-entropy. For the same corpus, vocabulary, window size, and training time, GloVe consistently outperforms word2vec.
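For reference, GloVe's weighted least squares objective over the word-word co-occurrence matrix X, as given by Pennington et al. (2014), is:

```latex
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```

The weighting function f caps the influence of very frequent co-occurrences and down-weights rare, noisy ones.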

Zhao et al. (2017) (ngram2vec) improve word representations by adding n-gram co-occurrence statistics to the SGNS (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), and PPMI models (Levy et al., 2015). In order to incorporate these statistics into the SGNS model, instead of just predicting the context words, they also predict the context n-gram of words. In order to add it to the other systems, they just add n-gram statistics to the co-occurrence matrix of words. They show improved scores over the models that they are built upon.

Levy and Goldberg (2014) argue that although the word embeddings with Skip-gram are able to capture very useful representations, they also learn from unwanted co-occurrences in the context, e.g. Australian and discovers in the sentence ”Australian scientist discovers stars with telescope”. In order to create a different context, they use dependency trees to link each word in the sentence to the other according to the relations they have. Their experimental results show that while their model (DEPS) is significantly better at representing syntactic relationships, it is worse at finding semantic relationships. In this work, they also share a non-trivial interpretation of how word embeddings learn representations, which is very rare in neural network solutions, by examining the activations of context for specific words.

Ling et al. (2015b) augment CBOW (Mikolov et al., 2013a) with an attention model in order to address its shortcomings: the inability to account for word order and the lack of differentiated treatment of context words' importance. They show that their method obtains better word representations than CBOW while still being faster than its complementary model, Skip-gram (Mikolov et al., 2013a).

Yin and Schütze (2016) put forward the idea of ensembling existing embeddings in order to achieve performance enhancement and improved coverage of the vocabulary. They propose four different ensemble approaches on five different word embeddings: Skip-gram (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), Collobert&Weston (Collobert and Weston, 2008), Huang (Huang et al., 2012), and Turian (Turian et al., 2010). The first method, CONC, simply concatenates the word embeddings from the five different models. SVD reduces the dimensionality of CONC. 1toN creates meta-embeddings, and 1toN+ handles out-of-vocabulary (OOV) words of the individual sets by randomly initializing the embeddings for OOVs and the meta-embeddings, then uses a setup similar to 1toN to update the meta-embeddings as well as the OOV embeddings. They also propose a MUTUALLEARNING method to solve the OOV problem in CONC, SVD, and 1toN. They show that the ensemble approach outperforms individual embeddings on similarity, analogy, and POS tagging tasks.
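The CONC and SVD ensembling ideas can be sketched as follows, with randomly initialized stand-ins for the source embedding sets: each word's vectors are concatenated, and the concatenation is optionally reduced to a lower dimension with a truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["cat", "dog", "car", "bus", "tree"]
# Stand-ins for embeddings from different models (e.g., Skip-gram, GloVe, C&W, ...).
sources = [rng.normal(size=(len(vocab), d)) for d in (50, 100, 25)]

# CONC: concatenate the length-normalized embeddings of each word.
conc = np.concatenate([E / np.linalg.norm(E, axis=1, keepdims=True) for E in sources], axis=1)

# SVD: project the concatenation down to a lower-dimensional meta-embedding.
k = 3
U, S, _ = np.linalg.svd(conc, full_matrices=False)
meta = U[:, :k] * S[:k]

print(conc.shape, meta.shape)   # (5, 175) (5, 3)
```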

There has been some work on improving early word embeddings through knowledge bases.

Liu et al. (2015) (SWE) try to improve word embeddings by subjecting them to ordinal knowledge inequality constraints. They form three different types of constraints:

  1. Synonym-antonym rule: A synonym of a word should be more similar to it than an antonym. They find these pairs of words from the WordNet (Miller, 1995) synsets.

  2. Semantic category rule: The similarity of words that belong to the same category should be larger than the similarity of words that are in different categories, e.g., the (hacksaw, jigsaw) similarity should be greater than the (hacksaw, mallet) similarity.

  3. Semantic hierarchy rule: Shorter distances in the hierarchy should imply larger similarities between words than longer distances, e.g., the (mallet, hammer) similarity should be larger than the (mallet, tool) similarity.

The last two rules are constructed from the hypernymy-hyponymy information in WordNet. They combine these constraints with the Skip-gram algorithm (Mikolov et al., 2013b) to train word embeddings and show that they improve upon the baseline algorithm.
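One way such ordinal inequality constraints can be turned into a training penalty is sketched below (a hypothetical hinge-loss formulation, not the exact SWE objective): each rule states that sim(a, b) should exceed sim(a, c), and violations add a hinge penalty to the Skip-gram loss.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ordinal_penalty(emb, rules, margin=0.0):
    """Hinge penalty for ordinal rules of the form sim(a, b) > sim(a, c),
    e.g. (word, synonym, antonym) or (hacksaw, jigsaw, mallet)."""
    total = 0.0
    for a, b, c in rules:
        violation = margin + cosine(emb[a], emb[c]) - cosine(emb[a], emb[b])
        total += max(0.0, violation)
    return total

rng = np.random.default_rng(4)
emb = {w: rng.normal(size=16) for w in ["hot", "warm", "cold", "hacksaw", "jigsaw", "mallet"]}
rules = [("hot", "warm", "cold"), ("hacksaw", "jigsaw", "mallet")]
print(ordinal_penalty(emb, rules))   # added to the word embedding loss during training
```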

Faruqui et al. (2015) aim to improve word embeddings with information from lexicons using a method called retrofitting. They use a word graph where each word is a vertex and each relation in the knowledge base is an edge between words. Their algorithm brings closer the words that are connected in the word graph as well as the words that are found to be similar from text. In other words, while bringing together words related in synsets, it also preserves the similarity in the underlying pre-trained word embeddings (Skip-gram (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), etc.). They use various knowledge bases such as PPDB (Pavlick et al., 2015), WordNet (Miller, 1995), and FrameNet (Baker et al., 1998).
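A minimal sketch of the retrofitting update under simplifying assumptions (uniform weights, a toy synonym graph): each retrofitted vector is repeatedly set to an average of its pretrained vector and its neighbors in the relation graph.

```python
import numpy as np

def retrofit(pretrained, graph, iterations=10, alpha=1.0, beta=1.0):
    """Iteratively move each vector toward its graph neighbors while staying close
    to its pretrained position (uniform alpha/beta weights for simplicity)."""
    q = {w: v.copy() for w, v in pretrained.items()}
    for _ in range(iterations):
        for w, neighbors in graph.items():
            neighbors = [n for n in neighbors if n in q]
            if not neighbors:
                continue
            neighbor_sum = sum(q[n] for n in neighbors)
            q[w] = (alpha * pretrained[w] + beta * neighbor_sum) / (alpha + beta * len(neighbors))
    return q

rng = np.random.default_rng(9)
pretrained = {w: rng.normal(size=8) for w in ["happy", "glad", "joyful", "table"]}
synonym_graph = {"happy": ["glad", "joyful"], "glad": ["happy"], "joyful": ["happy"], "table": []}
retrofitted = retrofit(pretrained, synonym_graph)
```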

Tissier et al. (2017) (dict2vec) improve word2vec (Mikolov et al., 2013b) by incorporating dictionary information, in the form of strong and weak pairs of words, into the training process. If a word w_a is in the dictionary definition of a word w_b and w_b is in the definition of w_a too, then (w_a, w_b) is a strong pair. On the other hand, if w_a is in the definition of w_b but w_b is not in the definition of w_a, then they form a weak pair. The authors add this positive sampling information into the training process, weighted by hyperparameters.
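The strong/weak pair construction can be sketched as follows with a toy dictionary and naive whitespace tokenization; dict2vec then uses such pairs as additional positive samples during training.

```python
# Toy dictionary: word -> definition text. Dict2vec uses Oxford, Cambridge and Collins definitions.
definitions = {
    "car": "a road vehicle with an engine and four wheels",
    "vehicle": "a machine such as a car or bus that carries people",
    "engine": "the part of a vehicle that produces power to make it move",
    "wheel": "a circular object that turns to move a vehicle",
}

tokens = {w: set(d.split()) for w, d in definitions.items()}

strong_pairs, weak_pairs = set(), set()
for a in definitions:
    for b in definitions:
        if a >= b:                      # consider each unordered pair once
            continue
        a_in_b = a in tokens[b]
        b_in_a = b in tokens[a]
        if a_in_b and b_in_a:           # each word appears in the other's definition
            strong_pairs.add((a, b))
        elif a_in_b or b_in_a:          # only one direction holds
            weak_pairs.add((a, b))

print("strong:", strong_pairs)          # e.g. {('car', 'vehicle')}
print("weak:", weak_pairs)
```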

Despite the success of these earlier word embeddings, there were still many limitations in terms of the accuracy of representations, each of which is targeted by many research works. In the succeeding subsections, we discuss these limitations (such as morphology, senses, antonymy/synonymy, and so on) with the proposed solutions from the literature.

3.2. Embeddings Targeting Specific Semantic Relations

Although the initial word embedding models successfully identified semantic and syntactic similarities of words, they still need to be improved to address specific semantic relations among words such as synonymy-antonymy and hyponymy-hypernymy. To illustrate, consider the sentences ”She took a sip of hot coffee” and ”He is taking a sip of cold water.” The antonyms ”cold” and ”hot” are deemed to be similar since their context is similar. Therefore, it becomes an issue to differentiate the synonyms ”warm” and ”hot” from the antonyms ”cold” and ”hot” considering they have similar contexts in most occurrences.

| Work | Base Model | Year | Knowledge-Base / Morphological Features | Specific Semantic Relations |
|---|---|---|---|---|
| dLCE (Nguyen et al., 2016) | SGNS (Mikolov et al., 2013b) | 2016 | WordNet (Miller, 1995) and Wordnik | Synonym-Antonym |
| Mrkšić et al. (2016) | GloVe (Pennington et al., 2014) and paragram-SL999 (Wieting et al., 2015) | 2016 | WordNet (Miller, 1995) and PPDB 2.0 (Pavlick et al., 2015) | Synonym-Antonym |
| Vulić et al. (2017) | SGNS (Mikolov et al., 2013b) | 2017 |  | Synonym-Antonym |
| Yu et al. (2015) |  | 2015 | Probase (Wu et al., 2012) | Hyponym-Hypernym |
| Luu et al. (2016) |  | 2016 | WordNet (Miller, 1995) | Hyponym-Hypernym |
| Nguyen et al. (2017) | SGNS (Mikolov et al., 2013b) | 2017 | WordNet (Miller, 1995) | Hyponym-Hypernym |
| Wang et al. (2019) | Skip-gram (Mikolov et al., 2013a) | 2019 |  | Synonym-Antonym, Hyponym-Hypernym, Meronym |

Table 2. Embeddings targeting specific semantic relations.

Table 2 presents the main approaches addressing synonym-antonym relations, hyponym-hypernym relations, and a study covering all types of relations.

Nguyen et al. (2016) propose a weight update for SGNS (Mikolov et al., 2013b) to identify synonyms and antonyms from word embeddings. Their system (dLCE) increases weights if there is a synonym in the context and makes a reduction in the case of an antonym. In order to come up with a list of antonyms and synonyms, they use WordNet (Miller, 1995) and Wordnik. They report state-of-the-art results in similarity tasks and synonym-antonym distinguishing datasets.

Mrkšić et al. (2016) propose a counter-fitting method to inject antonymy (REPEL) and synonymy (ATTRACT) constraints into vector space representations to improve word vectors. The idea behind the ATTRACT rule is that synonymous words should be closer to each other than any other word in the dictionary. Similarly, the REPEL constraint assumes that an antonym of a word should be farther away from the word than any other word in the dictionary. As knowledge-bases, they use WordNet (Miller, 1995) and PPDB 2.0 (Pavlick et al., 2015), and as pre-trained word vectors they use GloVe (Pennington et al., 2014) and paragram-SL999 (Wieting et al., 2015). They report state-of-the-art results on various datasets.
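A rough sketch of the ATTRACT/REPEL intuition (hypothetical step sizes, not the exact counter-fitting optimization): synonym pairs are nudged toward each other, antonym pairs are pushed apart, and a regularization term keeps every vector close to its pretrained position.

```python
import numpy as np

def counter_fit_step(emb, attract, repel, original, lr=0.1, reg=0.05):
    """One gradient-style update: pull ATTRACT pairs together, push REPEL pairs apart,
    and pull every vector back toward its original pretrained position."""
    for a, b in attract:
        diff = emb[a] - emb[b]
        emb[a] -= lr * diff
        emb[b] += lr * diff
    for a, b in repel:
        diff = emb[a] - emb[b]
        emb[a] += lr * diff
        emb[b] -= lr * diff
    for w in emb:
        emb[w] -= reg * (emb[w] - original[w])

rng = np.random.default_rng(5)
words = ["hot", "warm", "cold"]
emb = {w: rng.normal(size=8) for w in words}
original = {w: v.copy() for w, v in emb.items()}
counter_fit_step(emb, attract=[("hot", "warm")], repel=[("hot", "cold")], original=original)
```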

Vulić et al. (2017) use ATTRACT and REPEL constraints on pretrained word embeddings. Their algorithm aims to pull ATTRACT pairs together while pushing REPEL pairs apart. To form the ATTRACT and REPEL constraints, the inflectional and derivational morphological rules of four languages are used: English, Italian, Russian, and German. ATTRACT constraints consist of suffixes such as (-s, -ed, -ing) to create ATTRACT word pairs such as (look, looking) and (create, created). On the other hand, REPEL constraints consist of prefixes like (il, dis, anti, mis, ir, ...) to create REPEL word pairs such as (literate, illiterate) and (regular, irregular). In order to balance the changes made to the original embeddings (they use SGNS (Mikolov et al., 2013b)), a third constraint pulls word embeddings back toward their original positions.

In their work, Yu et al. (2015) train term embeddings for hypernymy identification. They use Probase (Wu et al., 2012) as their training data for hypernym/hyponym pairs and impose three constraints on the training process: 1) hypernyms and hyponyms should be similar to each other (dog and animal), 2) co-hyponyms should be similar (dog and cat), 3) co-hypernyms should be similar (car and auto). They create a neural network architecture to update word embeddings without optimizing parameters. They use 1-norm distance as a similarity measure. They use an SVM on the output term embeddings to decide whether a word is a hypernym/hyponym to another word.

Luu et al. (2016) aim to identify the is-a relationship through a neural network architecture. First, they extract hypernyms and hyponyms using the relations in WordNet (Miller, 1995) to form a training set. Second, they create (hypernym, hyponym, context word) triples by finding all sentences in the dataset containing two hypernym/hyponym terms found in the first step and using the words between the hypernym and hyponym as context words. Then, they give the hyponym and context words as input to the neural network and try to predict the hypernym by aggregating them with a feed-forward neural network. The resultant hypernym-hyponym pairs, along with an offset vector, are given to an SVM to predict whether there is an is-a relationship or not. The authors state that since their method takes context words into account, their embeddings have good generalization capability and are able to identify unseen words.

Nguyen et al. (2017) aim to learn hierarchical embeddings for hypernymy. They leverage hypernymy-hyponymy information from WordNet (Miller, 1995) and propose objective functions on top of SGNS embeddings (Mikolov et al., 2013b) to move hypernymy-hyponymy pairs closer. The first objective function is based on the distributional inclusion hypothesis, while the second adopts distributional informativeness. They also propose an unsupervised hypernymy measure to be used with their hierarchical embeddings. In the proposed measure, the cosine similarity between the hypernym and hyponym vectors (to detect the hypernymy) is multiplied by the hypernym-to-hyponym magnitude ratio (to account for the directionality of the relation, under the assumption that hypernyms are more general terms, being more frequent and thus having a larger magnitude than hyponyms). Their evaluation also tests the generalization capability of their hypernymy solution, showing that the model learns rather than memorizes prototypical hypernyms.
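The hypernymy measure described above can be sketched with toy vectors: cosine similarity detects relatedness, and the magnitude ratio favours the hyponym-to-hypernym direction under the assumption that hypernyms, being more frequent and general, have larger vectors.

```python
import numpy as np

def hyper_score(hyponym_vec, hypernym_vec):
    """Cosine similarity scaled by the hypernym-to-hyponym magnitude ratio."""
    cos = hyponym_vec @ hypernym_vec / (np.linalg.norm(hyponym_vec) * np.linalg.norm(hypernym_vec))
    return cos * (np.linalg.norm(hypernym_vec) / np.linalg.norm(hyponym_vec))

animal = np.array([2.0, 2.0, 1.0])     # hypothetical "general" term with larger magnitude
dog = np.array([1.0, 0.9, 0.6])        # hypothetical "specific" term

print(hyper_score(dog, animal))        # larger in the hyponym -> hypernym direction
print(hyper_score(animal, dog))
```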

Wang et al. (2019) propose a neural representation learning model for predicting different types of lexical relations, e.g., hypernymy, synonymy, meronymy, etc. Their solution avoids the ”lexical memorization problem” because relation triples’ embeddings are learned rather than computing those relations through individual word embeddings. In order to learn a relation embedding for a pair of words, they use the Skip-gram model (Mikolov et al., 2013a) over the neighborhood pairs where the similarity between pairs is defined on hyperspheres. Their lexical relation classification results verify the effectiveness of their approach.

3.3. Sense Embeddings

Another drawback of early word embeddings is that they unite all the senses of a word into one representation. In reality, however, a word gets its meaning from its use and can mean different things in varying contexts. For example, even though the words "hot" and "warm" are very similar when they refer to temperature levels, they are not similar in the sentences "She took a sip of hot coffee" and "He received a warm welcome". In the transition period to contextual embeddings, different supervised and unsupervised solutions were proposed for obtaining sense embeddings.

Schütze (1998) was the first work aimed at identifying senses in texts. He defines the problem of word sense discrimination as the decomposition of a word's occurrences into same-sense groups. This definition is unsupervised in nature. When it comes to labeling those sense groups, the task becomes a supervised one and is named word sense disambiguation. The reader can refer to Navigli (2009) for a comprehensive survey on word sense disambiguation and to Camacho-Collados and Pilehvar (2018) for an in-depth examination of sense embedding methods and their development.

Table 3 provides a classification of the studies that we analyze in this section. The classification dimensions include unsupervised/supervised, topical or not, knowledge base, probabilistic approach, exploiting syntactic information or not, and neural network (NN) model.

Unsupervised: R&M (Reisinger and Mooney, 2010)

Supervised:

| Work | Topical | Knowledge Base | Probabilistic | Syntactic Information | NN Model |
|---|---|---|---|---|---|
| Huang et al. (2012) |  |  | Spherical k-means |  | Custom language model using both local and global context |
| Pelevina et al. (2016) |  |  | Graph clustering on ego network |  | CBOW (Mikolov et al., 2013a) |
| TWE (Liu et al., 2015) | LDA (Blei et al., 2003) |  |  |  | Skip-gram (Mikolov et al., 2013a) |
| SenseEmbed (Iacobacci et al., 2015) |  | BabelNet (Navigli and Ponzetto, 2012) |  |  | CBOW (Mikolov et al., 2013a) |
| Chen et al. (2015) |  | WordNet (Miller, 1995) | Context clustering |  | CNN |
| Jauhar et al. (2015) |  | WordNet (Miller, 1995) | Expectation-Maximization (EM) |  | Skip-gram (Mikolov et al., 2013a) |
| Chen et al. (2014) |  | WordNet (Miller, 1995) |  |  | Skip-gram (Mikolov et al., 2013a) |
| Tian et al. (2014) |  |  | Mixture of Gaussians (EM) |  | Skip-gram (Mikolov et al., 2013a) |
| Nieto Piña and Johansson (2015) |  | SALDO (Borin et al., 2013) |  |  | Skip-gram (Mikolov et al., 2013a) |
| MSSG (Neelakantan et al., 2014) |  |  |  |  | Skip-gram (Mikolov et al., 2013a) |
| SAMS (Cheng and Kartsaklis, 2015) |  |  |  |  | Recursive Neural Network |
| Li and Jurafsky (2015) |  |  | Chinese Restaurant Process (CRP) |  | CBOW-Skip-gram (Mikolov et al., 2013a), SENNA (Collobert et al., 2011) |
| MSWE (Nguyen et al., 2017) | LDA (Blei et al., 2003) |  |  |  | Skip-gram (Mikolov et al., 2013a) |
| Guo et al. (2014) |  |  | Affinity Propagation Algorithm |  | RNNLM model (Mikolov et al., 2010) |
| LSTMEmbed (Iacobacci and Navigli, 2019) |  | BabelNet (Navigli and Ponzetto, 2012) |  |  | LSTM |
| Kumar et al. (2019) |  | Knowledge Graph Embedding |  |  | Framework consisting of different types of Encoders |

Table 3. Sense embeddings.

At the outset, unsupervised learning is used to discriminate the different senses of a word.

Reisinger and Mooney (2010) propose a multi-prototype based word sense discovery approach. In their approach (R&M), all occurrences of a word are collected as a set of feature vectors and are clustered by a centroid-based clustering algorithm. The resultant clusters (of a fixed number) for each word are expected to capture meaningful variation in word usage rather than matching traditional word senses. They define the similarity of words A and B as the "maximum cosine similarity between one of A's vectors and one of B's vectors" and provide experimental evidence on similarity judgments and near-synonym prediction. Moreover, the variance in the prototype similarities is found to predict variation in human ratings.
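The multi-prototype similarity used by R&M can be sketched as follows (toy prototype matrices): the similarity of two words is the maximum cosine similarity over all pairs of their prototype vectors.

```python
import numpy as np

def max_sim(prototypes_a, prototypes_b):
    """Maximum cosine similarity between any prototype of word A and any prototype of word B."""
    A = prototypes_a / np.linalg.norm(prototypes_a, axis=1, keepdims=True)
    B = prototypes_b / np.linalg.norm(prototypes_b, axis=1, keepdims=True)
    return float((A @ B.T).max())

rng = np.random.default_rng(6)
bank_prototypes = rng.normal(size=(3, 16))    # e.g. three induced usages of "bank"
river_prototypes = rng.normal(size=(2, 16))   # two induced usages of "river"
print(max_sim(bank_prototypes, river_prototypes))
```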

Following Reisinger and Mooney (2010), Huang et al. (2012) also aim at creating multi-prototype word embeddings. They compute vectors using a feed forward neural network architecture with one layer to produce single prototype word vectors and then perform spherical k-means to cluster them into multiple prototypes. They also introduce the idea of using global context where the vectors of words in a document are averaged to create a global semantic vector. The final score of embeddings is then calculated as the sum of scores of each word vector along with the global semantic vector.

The authors also argue that available test sets for similarity measurements are not sufficient for testing multi-prototype word embeddings because the scores of word pairs in those test sets are given in isolation, which lacks the contextual information for senses. Therefore, they introduce a new test set in which the word pairs are scored within a context by mechanical turkers, where context is usually a paragraph from Wikipedia that contains the given word. Finally, they show that their model is capable of outperforming the former models when such a test set is used, although its performance is similar to others in previous test sets.

Pelevina et al. (2016) aim at creating sense embeddings without using knowledge bases. Their model takes the existing single-prototype word embeddings and transforms them into multi-prototype sense embeddings by constructing an ego network and performing graph clustering over it. In fact, the senses of a word they learn do not have to correspond to the senses of that word in the dictionary. They evaluate their method on their crowd-sourced dataset.

Liu et al. (2015) propose three different methods to create topical embeddings (TWE). They create their topical embeddings without the use of any knowledge base, but instead rely on LDA (Blei et al., 2003) to find the topics of each document the word occurs in. Topical embeddings they create are similar to sense embeddings with the only difference being that the number of topics may not correspond to the number of senses in the dictionary.

In their first model, named TWE-1, they learn word embeddings and topic embeddings separately but simultaneously with the Skip-gram method by treating topic embeddings as pseudo-words that appear in all the positions of words under the given topic. The sense embedding of a word w for a topic z is then constructed by concatenating the word embedding with the corresponding topic embedding. Their second model, TWE-2, treats word-topic pairs as tuples and trains an embedding for each pair. This method may lead to sparsity issues since some words under a specific topic may not be frequent. The last method they propose, TWE-3, also trains word and topic embeddings together, but this time the weights of embeddings are shared over all word-topic pairs. They show that the TWE-1 method gives the best results overall, and the independence assumption between words and topics in the first model is given as the reason behind its performance.

Exploiting vast information in knowledge bases to learn sense representations has proved useful. The approaches that rely mainly on knowledge bases to compute sense embeddings include Iacobacci et al. (2015), Chen et al. (2015), Jauhar et al. (2015), and Chen et al. (2014).

Iacobacci et al. (2015) (SenseEmbed) use BabelNet (Navigli and Ponzetto, 2012) as a knowledge base to retrieve word senses and to tag words with the correct sense. They train on the sense-tagged corpora with the CBOW architecture and achieve state-of-the-art results on various word similarity and relatedness datasets.

Chen et al. (2015) also use a knowledge-base (WordNet) to solve the sense-embedding problem. They use CNN to initialize sense-embeddings from the example sentences of synsets in WordNet. Then, they apply context clustering to create distributed representations of senses. The representations they obtain achieve promising results.

Jauhar et al. (2015) propose two models for learning sense embeddings using ontological resources like WordNet (Miller, 1995). In their first model, they retrofit pretrained embeddings by imposing two conditions on them: pulling together the words that are ontologically related (using the graphs constructed from the relationships in WordNet) and leveraging the tension between sense-agnostic neighbors from the same graph. They implement the first method over Skip-gram (Mikolov et al., 2013b) and Huang et al. (2012) and show that it improves on the previous methods. Their second method constructs embeddings from scratch by training them with an Expectation-Maximization (EM) objective that, like the first model, pulls together ontologically related words; it finds the correct sense of each word from WordNet and creates a vector for each sense.

Chen et al. (2014) propose a unified model for word sense representation (WSR) and word sense disambiguation (WSD). The main idea behind this is that both models may benefit from each other. Their solution is composed of three steps: First, they initialize single-prototype word vectors using Skip-gram (Mikolov et al., 2013b) and initialize the sense embeddings using the glosses in WordNet (Miller, 1995). They take the average of words in WordNet synset glosses to initialize the sense embeddings. Second, they perform word sense disambiguation using some rules on the given word vectors and sense vectors. Finally, using the disambiguated senses, they learn sense vectors by modifying the Skip-gram objective such that both context words and context words’ senses must be optimized given the middle word in context.

Tian et al. (2014) propose a probabilistic approach to provide a solution to sense embeddings. They improve the Skip-gram algorithm by introducing the mixture of Gaussians idea to represent the given middle word in context in the objective function. Every Gaussian represents a specific sense, and the mixture is their multi-prototype vector. The number of Gaussians, in other words, the number of senses, is a hyperparameter of the model. They use Expectation-Maximization (EM) algorithm to solve the probabilistic model.

Nieto Piña and Johansson (2015) extend the Skip-gram (Mikolov et al., 2013a) method to find sense representations of words. They get the number of senses from a knowledge-base and for each word in the training corpus, they find the most probable sense by using the likelihoods of context words. They only train the sense with the highest probability. They train their system on Swedish text and measure their success by comparing the senses to the ground-truth in the knowledge-base (SALDO (Borin et al., 2013)).

Neelakantan et al. (2014) (MSSG) also aim at creating word vectors for each sense of a word. Unlike most other models, they do it by introducing the sense prediction into the neural network and jointly performing sense vector calculation and word sense discrimination. Their first model relies on Skip-gram and induces senses by clustering the context word representations around each word. Then, the word is assigned to the closest sense by calculating the distance to the sense clusters’ centers. Here the count of clusters is the same for all words and is a hyperparameter. Their second model is a non-parametric variant of the first one where a varying number of senses is learned for each word. A new cluster (sense) for a word type is created with probability proportional to the distance of its context to the nearest cluster (sense). They show that their second method can outperform the first since it can better learn the senses’ nature.
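A simplified, thresholded variant of the non-parametric sense assignment idea is sketched below (the hypothetical fixed threshold replaces MSSG's distance-proportional criterion): a context vector is assigned to its nearest existing sense cluster, and a new sense is created when no cluster is close enough.

```python
import numpy as np

def assign_sense(context_vec, sense_centroids, new_sense_threshold=0.4):
    """Return the index of the assigned sense, creating a new sense cluster if
    the closest centroid's cosine similarity is below the threshold."""
    if sense_centroids:
        sims = [context_vec @ c / (np.linalg.norm(context_vec) * np.linalg.norm(c))
                for c in sense_centroids]
        best = int(np.argmax(sims))
        if sims[best] >= new_sense_threshold:
            return best
    sense_centroids.append(context_vec.copy())   # spawn a new sense for this word type
    return len(sense_centroids) - 1

rng = np.random.default_rng(7)
centroids = []                                   # senses discovered so far for one word
for _ in range(5):
    print(assign_sense(rng.normal(size=16), centroids))
print("number of senses:", len(centroids))
```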

Cheng and Kartsaklis (2015) consider capturing syntactic information to better address senses. They use recursive neural networks on parsed sentences to learn sense embeddings. Each input word is disambiguated to its sense by calculating the average distance of the embeddings of the words in the sentence to the sense cluster means. They define two negative sampling methods to train the network. One type of negative example is created by swapping the target word with a random word (as in (Mikolov et al., 2013b) and (Gutmann and Hyvärinen, 2012)); the other changes the order of words in a sentence, which further forces the model (SAMS) to learn syntactic dependencies.

Li and Jurafsky (2015) decide the number of senses in an unsupervised fashion by using the Chinese Restaurant Process (CRP). They combine CRP with neural network training by determining the sense of a word from its context. They also compare sense-embedding methods with single-prototype models across various NLP tasks to see whether they are beneficial. They state that in some tasks (POS tagging, semantic relatedness, semantic relation identification), sense embeddings outperform single-prototype methods, but they fail to improve the scores on some other tasks (NER, sentiment analysis).

Instead of getting the number of senses from a knowledge-base, Nguyen et al. (2017) (MSWE) use LDA (Blei et al., 2003) to find word to topic and topic to document probability distributions. Here the number of topics is a parameter to the model. They train different weights for each sense of a word using two different optimization methods. The first model learns word vectors based on the most suitable topic. On the other hand, their second model considers all topics to learn them. They conclude that this second method can be considered as a generalization of the Skip-gram model (Mikolov et al., 2013a) given the fact that it behaves as Skip-gram if the mixture weights are set to zero.

Guo et al. (2014) exploit bilingual resources to find sense embeddings, motivated by the idea that if a word in a source language translates into multiple words in a target language, the different words in the target language correspond to different senses of the source word. For this purpose, they use Chinese-to-English translation data to induce senses in an unsupervised fashion. They represent the initial words with the word embeddings from C&W (Collobert and Weston, 2008) and use the affinity propagation algorithm to cluster the translated words into a dynamic number of clusters, which means that their method can learn a different number of senses for each word. Then, they use the RNNLM model (Mikolov et al., 2010) to train the sense embeddings.

Iacobacci and Navigli (2019) propose an LSTM-based architecture (LSTMEmbed) to jointly learn word and sense embeddings. Input contexts are provided from semantically annotated data, and one bidirectional LSTM processes the left context while another one handles the right one. As an extra layer, the concatenation of both outputs is linearly projected into a dense representation. Then, the optimization objective tries to maximize the similarity between the produced dense output and pretrained word embeddings from SGNS. Consideration of these pretrained word embeddings in the final phase increases the vocabulary use of the proposed system. Their experiments on the word to sense similarity and word-based semantic evaluations prove the usefulness of their approach.

Kumar et al. (2019) propose a framework that combines a context encoder with a definition encoder to provide sense predictions for out of vocabulary words. In the case of rare and unseen words, most word sense disambiguation (WSD) systems rely on the most frequent sense (MFS) on the training set. In the part of the definition encoder, sentence encoders along with knowledge graph embeddings are utilized. Here instead of using discrete labels for senses, the score for each sense in the inventory is calculated by the dot product of the sense embedding with the projected context-aware embedding.

3.4. Morpheme Embeddings

The quest for morphological representations is a result of two important limitations of earlier word embedding models. The first is that words are not the smallest units of meaning in languages; morphemes are. Even if a model does not see the word unpleasant during training, it should be able to deduce that it is the negative form of pleasant. Word embedding methods that do not take morphological information into account cannot produce any result in such a situation. The second limitation is the data scarcity problem of morphologically rich and agglutinative languages. Unlike English, morphologically rich languages have many more noun and/or verb forms inflected by gender, case, or number, which may not exist in the training corpora. The same holds for agglutinative languages, in which words can take many forms according to the suffix(es) they receive. Therefore, models that take morphemes/lexemes into account are needed.

| Model | Year | Training Corpus | Knowledge-Base | NN Model | Dimension |
|---|---|---|---|---|---|
| Luong et al. (Luong et al., 2013) | 2013 | Wiki | Morfessor (Creutz and Lagus, 2007) | recNN | 50 |
| CLBL (Botha and Blunsom, 2014) | 2014 | ACL MT | Morfessor (Creutz and Lagus, 2007) | LBL | - |
| Qiu et al. (Qiu et al., 2014) | 2014 | Wiki | Morfessor (Creutz and Lagus, 2007), Root, Syllable (Liang, 1983b) | CBOW | 200 |
| Bian et al. (Bian et al., 2014) | 2014 | Wiki | Morfessor (Creutz and Lagus, 2007), WordNet (Miller, 1995), Freebase (Bollacker et al., 2008), Longman Dict. | CBOW | 600 |
| CharWNN (Dos Santos and Zadrozny, 2014) | 2014 | Wiki | - | CNN | 100 |
| KNET (Cui et al., 2015) | 2015 | Wiki | Morfessor (Creutz and Lagus, 2007), Syllable (Liang, 1983b) | Skip-Gram | 100 |
| AutoExtend (Rothe and Schütze, 2015) | 2015 | Google News | WordNet (Miller, 1995) | Autoencoder | 300 |
| Morph-LBL (Cotterell and Schütze, 2015) | 2015 | TIGER (Brants et al., 2004) | TIGER (Brants et al., 2004) | LBL | 200 |
| Soricut and Och (Soricut and Och, 2015) | 2015 | Wiki | - | Skip-Gram | 500 |
| C2W (Ling et al., 2015a) | 2015 | Wiki | - | biLSTM (Graves and Schmidhuber, 2005) | 50 |
| Cotterell et al. (Cotterell et al., 2016) | 2016 | Wiki | CELEX (H Baayen et al., 1993) | GGM | 100 |
| Fasttext (Bojanowski et al., 2016) | 2016 | Wiki | - | Skip-Gram | 300 |
| char2vec (Cao and Rei, 2016) | 2016 | text8 (wiki) | - | LSTM (Hochreiter and Schmidhuber, 1997) + Attn | 256 |
| Kim et al. (2016) | 2016 | ACL MT | - | CNN + LSTM | 300-650 |
| LMM (Xu et al., 2018) | 2018 | Gigaword | Morfessor (Creutz and Lagus, 2007) | CBOW | 200 |

Table 4. Morpheme embedding models.

Researchers propose several ways to exploit morphological information, both to obtain sub-word information that addresses the rare/unknown word problem of earlier word embedding methods and to obtain better representations of words for morphologically rich languages. While some of the works train embeddings directly from morphemes/lexemes, others adjust the representations of other word embedding models. A summary of these models and their properties can be seen in Table 4.

3.4.1. Training Morphological Embeddings from Scratch

There are two main ways for training morpheme embeddings from scratch: While some methods ((Luong et al., 2013), (Botha and Blunsom, 2014), (Qiu et al., 2014), (Bian et al., 2014), (Cui et al., 2015), (Cotterell and Schütze, 2015), (Xu et al., 2018), (Soricut and Och, 2015)) propose to use tools or special rules for dissecting a text to its morphemes, others ((Bojanowski et al., 2016), (Cao and Rei, 2016), (Ling et al., 2015a), (Dos Santos and Zadrozny, 2014)) prefer using characters or character n-grams as input to learn morphemes along with their representations.

Luong et al. (2013) present the first work that attempts to incorporate morphological information into word embeddings. They train morphological embeddings with recursive neural networks. They divide words into (prefix, stem, affix) tuples by using Morfessor (Creutz and Lagus, 2007) and feed them to a recursive neural network. Word embeddings are then constructed by a word-based neural language model (NLM). Instead of initializing the vectors with random numbers, they initialize them with the pre-trained word embeddings of Collobert et al. (2011) and Huang et al. (2012) in order to focus on learning the morphemic semantics.

Similar to Luong et al. (2013), Botha and Blunsom (2014) (CLBL) also use morfessor (Creutz and Lagus, 2007) to find the morphemes of words in text and train both the target word and context words by first factoring them into their morphemes. They learn the morphology-based word representations with an additive-LBL of their factor embeddings, e.g., surface form, stem, affixes, etc.

Qiu et al. (2014) incorporate morphemes into the CBOW (Mikolov et al., 2013a) architecture: Instead of predicting a word from the context words, they propose to use both morphemes and words as input and for prediction. They control the relative contributions of words and morphemes with two parameters that weigh the information to be extracted from each input. They use three different tools for extracting morphemes from corpus: Morfessor (Creutz and Lagus, 2007), root, and syllable (Liang, 1983b).

Bian et al. (2014) investigate three different methods for finding better representations for words and morphemes. First, they transform CBOW (Mikolov et al., 2013a) by using morphemes (segmented with Morfessor (Creutz and Lagus, 2007)) instead of words as the basis and represent words as the aggregate of the morphemes they are composed of. Second, they provide additional information to their first model by feeding semantic and syntactic information vectors as inputs along with the morpheme vectors. As semantic and syntactic information, they use synsets, syllables, syntactic transformations, and antonyms and synonyms from Freebase (Bollacker et al., 2008), WordNet (Miller, 1995), and the Longman dictionaries (www.longmandictionariesonline.com). Finally, they use syntactic knowledge (a POS tagging vector) and semantic knowledge (an entity vector and a relation matrix) as auxiliary tasks, where the syntactic/semantic information is predicted as output around the center word. Their relation matrix consists of relations such as belong-to and is-a. They examine the effects of both semantic and syntactic information compared to the baseline model (CBOW) and report the relative effects in various tasks.

Soricut and Och (2015) aim at improving word vectors and solving the rare word problem by using morphology induction. In their method, they first extract candidate morphological rules: they collect word pairs in which one word can be obtained from the other by substituting a prefix or suffix, and they form candidate rules from these pairs. After training their embeddings with the Skip-gram method (Mikolov et al., 2013a), they keep a rule only if the words of its supporting pairs are close in the embedding space; otherwise, the rule is removed from the candidate list. Thus, they can use their morphological rules to obtain representations for rare words that may or may not be in the training set.

Cui et al. (2015) (KNET) use co-occurrence statistics to construct word embeddings with sub-word information. They leverage four types of morphological information inspired by advances in cognitive psychology: i) edit distance similarity, ii) longest common sub-string similarity, iii) morpheme similarity (shared roots, affixes, etc., obtained with Morfessor (Creutz and Lagus, 2007)), and iv) syllable similarity (obtained with a hyphenation tool (Liang, 1983a)). They combine this morphological information into a relation matrix and construct morphological embeddings from it. They also create word embeddings with the Skip-gram method (Mikolov et al., 2013b). The final word embeddings are obtained as a weighted average of these two embeddings. Unlike most other word embedding methods, the authors do not replace digits in the text with zeros; instead, they replace them with their spelled-out counterparts to preserve the information better.

Different from other morphology-based models, Cotterell and Schütze (2015) implement a semi-supervised approach (MorphLBL) that uses a partially morphologically tagged dataset (the TIGER corpus of German newspaper text (Brants et al., 2004)). They augment the LBL model (Mnih and Hinton, 2007) to predict the word and its morphology jointly. They also introduce a new metric, called MorphDist, for measuring the success of morphological models.

Dos Santos and Zadrozny (2014), Ling et al. (2015a), Bojanowski et al. (2016), and Cao and Rei (2016) come up with character-based solutions instead of using a tool/knowledge-base to find morphemes in sentences.

In their work (CharWNN), Dos Santos and Zadrozny (2014) use word embeddings together with character embeddings to compensate for the need for hand-crafted features in part-of-speech (POS) tagging, where the morphological structure of words plays a significant role. In their architecture, they use Skip-gram (Mikolov et al., 2013a) for word embeddings and train their character embeddings from scratch.

The compositional model of Ling et al. (2015a), called C2W, takes the characters of a word as input and uses bidirectional-LSTM to construct word vectors by concatenating the last state of LSTM in each direction.
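A minimal sketch of this composition, assuming a toy character vocabulary and untrained parameters, is given below; all sizes are illustrative rather than those of the original model.

import torch
import torch.nn as nn

char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = nn.Embedding(len(char_vocab), 16)
bilstm = nn.LSTM(input_size=16, hidden_size=25, bidirectional=True, batch_first=True)

def c2w_vector(word):
    ids = torch.tensor([[char_vocab[c] for c in word]])   # (1, number of characters)
    _, (h_n, _) = bilstm(char_emb(ids))
    # h_n holds the final state of each direction; concatenate them into the word vector.
    return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)

print(c2w_vector("cats").shape)  # torch.Size([50])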

Bojanowski et al. (2016) propose a model, called Fasttext, that extracts the character n-grams (of lengths 3 to 6) of each word and represents the word with a bag of these n-grams. For example, for the word "where" and n = 3, the n-grams are <wh, whe, her, ere, re>, where < and > are special characters denoting the beginning and end of the word, respectively. The n-gram vectors are then summed to produce word embeddings. Since the model shares representations across words, it can produce better representations for rare words. They perform extensive tests on morphologically rich languages to see how their model works and learns the subword information.
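The subword extraction itself is easy to reproduce. The sketch below enumerates the boundary-marked character n-grams of a word and sums their vectors into a word representation; the n-gram vector table is a random stand-in rather than trained fastText parameters.

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    token = "<" + word + ">"          # add boundary symbols
    grams = {token}                   # the full word is kept as a special sequence
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

dim = 300
ngram_vectors = {}                    # stand-in for the trained n-gram table

def subword_word_vector(word):
    # Bag of character n-grams: the word vector is the sum of its n-gram vectors.
    vecs = [ngram_vectors.setdefault(g, np.random.randn(dim)) for g in char_ngrams(word)]
    return np.sum(vecs, axis=0)

print(sorted(g for g in char_ngrams("where") if len(g) == 3))
# ['<wh', 'ere', 'her', 're>', 'whe']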

Cao and Rei (2016) aim at jointly solving unsupervised morphology induction and learning word embeddings by using bidirectional LSTMs with Bahdanau attention (Bahdanau et al., 2014) over characters. The output of the attention layer is fed to the Skip-gram (Mikolov et al., 2013a) algorithm to compute word representations. They show that the attention layer learns to split words into morphemes and that their algorithm outperforms other morpheme induction methods, although it is not designed solely for that problem. They also show that, since their method (char2vec) focuses on finding morpheme representations through characters, it is better at tasks that measure syntactic similarity. On the other hand, they argue that it is worse at tasks that measure semantic similarity, since characters alone do not convey the semantics of words.

To address both syntactic and semantic features, Kim et al. (2016) use a mixture of character and word-level features. In their model, at the lowest level of the hierarchy, character-level features are processed by a CNN; after transferring these features over a highway network, high-level features are learned using an LSTM. Thus, the resulting embeddings show good syntactic and semantic patterns. For instance, the closest words to the word richard are returned as eduard, gerard, edward, and carl, where all of them are person names and have a high syntactic similarity to the query word. Due to character-aware processing, their models are able to produce good representations for out-of-vocabulary words.
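The highway connection that sits between the character CNN and the LSTM can be sketched as below: a transform gate decides, per dimension, how much of a nonlinear transform of the input to mix with the input carried through unchanged. The weights and feature size here are random stand-ins, not the trained model's.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    t = sigmoid(W_t @ x + b_t)          # transform gate in [0, 1]
    h = np.maximum(0.0, W_h @ x + b_h)  # candidate transform (ReLU)
    return t * h + (1.0 - t) * x        # mix the transform with the carried input

d = 525                                 # stand-in for the concatenated CNN filter outputs
x = np.random.randn(d)
W_h, W_t = np.random.randn(d, d) * 0.01, np.random.randn(d, d) * 0.01
b_h, b_t = np.zeros(d), np.zeros(d)
print(highway_layer(x, W_h, b_h, W_t, b_t).shape)  # (525,)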

Xu et al. (2018) (LMM) also aim at enhancing word representations with morphological information. Instead of the morphemes themselves, the authors suggest using the latent meaning of morphemes. They state that although words such as "incredible" and "unbelievable" have similar semantics, methods based on surface morphemes cannot capture it. Instead, they use the latent meanings of morphemes, which they extract from knowledge bases (e.g., in=not, un=not, ible=able, able=able, cred=believe, believ=believe). They use CBOW (Mikolov et al., 2013a) as the pre-trained word embeddings and show that their method improves upon them.

3.4.2. Adjusting the Existing Embeddings

Among the models that adjust pre-trained word embeddings, Rothe and Schütze (2015) take any word embeddings and transform them into embeddings for lexemes and synsets. To do that, they use WordNet (Miller, 1995) synsets and lexemes, although they note that their model (AutoExtend) can also obtain this information from other knowledge bases such as Freebase (Bollacker et al., 2008). They consider words and synsets as the sum of their respective lexemes and enforce three constraints on the system: i) a synset constraint, ii) a lexeme constraint, and iii) a WordNet constraint (because some synsets contain only a single word). They use an autoencoder where the result of the encoding corresponds to synset vectors, and the hidden layer in encoding and its counterpart in decoding correspond to lexeme vectors. The two lexeme vectors are then averaged to produce the final lexeme embeddings.

On the other hand, Cotterell et al. (2016) use a Gaussian graphical model in which word embeddings are represented as the sum of their morpheme embeddings. Their system takes the output of other word embedding methods as input and converts it by learning morpheme embeddings and computing word embeddings as their sum. They also note that their method can extrapolate embeddings for out-of-vocabulary (OOV) words, since the embeddings of their morphemes can be computed from the same morphemes occurring in other words.

4. Contextual Representations

As shown in the previous section, many methods have been proposed to address the deficiencies of word embedding methods. Each of them specializes in a single problem, such as sense representation or morpheme representation, but none of them combines these different aspects into a single model, a single solution. The idea of contextual representations is to provide a solution that covers each of these aspects. The main idea behind contextual representations is that a word should not have a single representation used in every context; instead, a representation should be computed separately for each context. Contextual representation methods compute the embedding of a word from its surrounding words each time the word is seen, contrary to earlier methods where each word is represented with a fixed vector of weights. This implicitly solves many problems such as sense representation, antonymy/synonymy, and hypernymy/hyponymy, since multi-sense words can now have different representations according to their context. Furthermore, some of these models take characters as input, which also incorporates sub-word information into the embeddings. Therefore, the contextual representation models described below are able to incorporate these different aspects into a single model. Liu et al. (2020) examine contextual embeddings in detail by comparing their pre-training methods, objectives, and downstream learning approaches.

In one of the first attempts to create contextual representations, Melamud et al. (2016) develop a neural network architecture based on bidirectional LSTMs to learn context embeddings jointly with target word embeddings. They feed words to a bidirectional LSTM network in order to predict a target word in a sentence: using the sentence as context, the left side of the target word is fed to the left-to-right (forward) LSTM and the right side to the right-to-left (backward) LSTM. To jointly learn context and target word embeddings, they use the Skip-gram objective function sampled on context-word occurrences, and they show that this is equivalent to the factorization of a context-target word co-occurrence matrix. Although previous word embedding models create both context and target word embeddings, they only use target-target similarity as representations and ignore the context embeddings. In this work, the authors also use context-context and context-target similarities to show that contextual embeddings can significantly improve the performance of NLP systems. They also show that, since bidirectional LSTM structures can learn long-term contextual dependencies, their model, context2vec, is able to differentiate polysemous words with a high success rate.
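A rough sketch of this context encoder is given below: the left context goes through a forward LSTM, the reversed right context through a backward LSTM, and their final states are combined by a small feed-forward layer into the context embedding. The layer sizes, vocabulary, and combination layer are illustrative, not the released context2vec configuration.

import torch
import torch.nn as nn

embed = nn.Embedding(1000, 100)                 # toy vocabulary and dimensions
fwd_lstm = nn.LSTM(100, 300, batch_first=True)
bwd_lstm = nn.LSTM(100, 300, batch_first=True)
mlp = nn.Sequential(nn.Linear(600, 600), nn.ReLU())

def context_embedding(token_ids, target_pos):
    x = embed(torch.tensor([token_ids]))
    left = x[:, :target_pos, :]                              # words before the target
    right = torch.flip(x[:, target_pos + 1:, :], dims=[1])   # words after, reversed
    _, (h_left, _) = fwd_lstm(left)
    _, (h_right, _) = bwd_lstm(right)
    return mlp(torch.cat([h_left[-1], h_right[-1]], dim=-1))

print(context_embedding(list(range(8)), target_pos=3).shape)  # torch.Size([1, 600])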

CoVe (McCann et al., 2017) uses GloVe (Pennington et al., 2014) as the initial word embeddings and feeds them to a machine translation architecture to learn contextual representations. The authors argue that pre-training the contextual representations on machine translation, where vast amounts of data are available, can lead to better contextual representations to transfer to other downstream tasks. They concatenate the output of the encoder of a machine translation model (as contextual embeddings) with the GloVe embeddings to construct their final word representations.

Using language modeling and word representation learning as the pre-training objective, then fine-tuning the architecture on downstream tasks, is first proposed by Dai and Le (2015) and Howard and Ruder (2018). While Dai and Le (2015) propose to use RNNs and autoencoders, ULMFiT (Howard and Ruder, 2018) introduces novel fine-tuning ideas such as discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing for their LSTM model, inspired by advances in transfer learning in computer vision. After the success of these models, the aim shifted from creating word representations to using such systems as pre-trained models and fine-tuning a classifier on top for downstream tasks.

ELMO (Peters et al., 2018) improves on the character-aware neural language model of Kim et al. (2016). The architecture takes characters as input to a CNN, whose output is fed to a two-layer bidirectional LSTM network to predict a target word. They show that this architecture can learn various aspects of words such as semantic, syntactic, and sub-word information. First, since the model takes characters as input, it is able to learn sub-word information even for unseen words. Second, while the first biLSTM layer better captures the syntactic properties of words, the second layer better captures their semantics. Therefore, they propose to use the different layers of the model to create word representations, combining them with a weighted averaging method. They show that including ELMO representations can improve many state-of-the-art models in various NLP tasks.
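The layer-combination step can be sketched as follows: task-specific scalars are softmax-normalized and used to take a weighted sum of the biLM layer representations of a token, scaled by a global factor. The layer vectors and weights below are random stand-ins.

import numpy as np

def combine_layers(layer_reps, scalars, gamma=1.0):
    # Softmax-normalize the task-specific scalars, then mix the layers.
    weights = np.exp(scalars) / np.sum(np.exp(scalars))
    return gamma * sum(w * h for w, h in zip(weights, layer_reps))

layers = [np.random.randn(1024) for _ in range(3)]       # token layer + two biLSTM layers
print(combine_layers(layers, scalars=np.array([0.2, 0.5, 0.3])).shape)  # (1024,)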

Instead of using words as input, Flair (Akbik et al., 2018) uses a character-level language model to learn contextual word representations. Different from ELMO (Peters et al., 2018), where character-level inputs are later converted into word features, the authors propose to operate on characters only. They feed the characters of an input string to a single-layer LSTM network and predict the next character. They then form the word representation by concatenating the backward LSTM output at the beginning of the word with the forward LSTM output at the end of the word. They also try concatenating other pre-trained word vectors with their contextual representations in downstream tasks and show that this can improve the results.
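A sketch of how a word vector is read off the two character language models is shown below, with random matrices standing in for the trained LMs' per-character hidden states: a word spanning characters [start, end) of the sentence string is represented by the forward state after its last character concatenated with the backward state at its first character.

import numpy as np

def char_lm_word_embedding(fwd_states, bwd_states, start, end):
    # Forward state after the word's last character + backward state at its first character.
    return np.concatenate([fwd_states[end - 1], bwd_states[start]])

sentence = "the grass is green"
hid = 2048                                   # illustrative hidden size
fwd = np.random.randn(len(sentence), hid)    # stand-ins for the char-LM hidden states
bwd = np.random.randn(len(sentence), hid)
print(char_lm_word_embedding(fwd, bwd, start=4, end=9).shape)   # "grass" -> (4096,)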

BERT (Devlin et al., 2019) uses the bidirectional Transformer (Vaswani et al., 2017) architecture to learn contextual word representations. Different from earlier approaches (ELMO (Peters et al., 2018), Melamud et al. (2016)), BERT is deeply bidirectional: although ELMO also considers both sides of a target word, it considers them separately as left and right contexts, whereas BERT spans the entire sentence with a single model that attends in both directions. To do so without trivially revealing the target word, they mask it out; therefore, they call this model a masked language model (MLM).

In addition to the token (word) embeddings, they also use segment (sentence) embeddings and position embeddings (the word's position in the segment) as input, which enables BERT to consider multiple sentences as context and to represent inter-sentence relations. Giving multiple sentences as input allows BERT to be integrated easily into downstream tasks that require inter-sentence connections, such as question answering (QA) and natural language inference (NLI), without any additional architecture. For further details, the reader can refer to the work of Rogers et al. (2020), which provides an in-depth survey of how exactly BERT works and what kind of information it captures during pre-training and fine-tuning.
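The input construction can be sketched as the element-wise sum of three lookup tables, as below; the vocabulary size, dimensions, and token ids are illustrative stand-ins rather than the released model's.

import numpy as np

vocab_size, max_pos, n_segments, dim = 30000, 512, 2, 768
token_table = np.random.randn(vocab_size, dim) * 0.02
position_table = np.random.randn(max_pos, dim) * 0.02
segment_table = np.random.randn(n_segments, dim) * 0.02

def input_embeddings(token_ids, segment_ids):
    # Per position: token embedding + segment embedding + position embedding.
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + segment_table[segment_ids] + position_table[positions]

ids = np.array([0, 11, 12, 13, 1, 21, 22, 1])   # e.g. [CLS] w1 w2 w3 [SEP] w4 w5 [SEP]
segs = np.array([0, 0, 0, 0, 0, 1, 1, 1])       # sentence A vs. sentence B
print(input_embeddings(ids, segs).shape)        # (8, 768)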

XLNet (Yang et al., 2019) is an autoregressive method that combines the advantages of two language modeling approaches: autoregressive models (e.g., Transformer-XL (Dai et al., 2019)) and autoencoder models (e.g., BERT (Devlin et al., 2019)). Specifically, it considers both sides of the target word by employing a permutation language modeling objective without masking any words as BERT does, which also allows the model to capture dependencies between the predicted word and all context words, unlike BERT.

ALBERT (Lan et al., 2020) aims at lowering the memory consumption and training time of BERT (Devlin et al., 2019). To accomplish this, they make two changes to the original BERT model: they factorize the embeddings into two matrices to use smaller dimensions, and they apply weight sharing to decrease the number of parameters. They state that weight sharing also allows the model to generalize better. They show that although ALBERT obtains better results than BERT with fewer parameters, it requires a longer time to train.

RoBERTa (Liu et al., 2019) revises the pre-training design choices of BERT (Devlin et al., 2019) by trying alternatives in a controlled way. Specifically, dynamic masking for the masked language model (MLM), an input format of full sentences from a single document with the next sentence prediction (NSP) loss removed, and a byte-level byte pair encoding (BPE) vocabulary yield better performance. Moreover, they increase the training data size and the mini-batch size. As a result, RoBERTa (Liu et al., 2019) achieves state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks.
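The difference between static and dynamic masking is easy to illustrate: instead of fixing the masked positions once at preprocessing time, the masking pattern is re-sampled every time a sequence is fed to the model. The simplified sketch below only inserts [MASK] tokens; the full MLM procedure also replaces a fraction of the selected tokens with random tokens or keeps them unchanged.

import random

def dynamic_mask(tokens, mask_token="[MASK]", prob=0.15):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < prob:
            masked.append(mask_token)
            labels.append(tok)      # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)     # not part of the MLM loss
    return masked, labels

sentence = "the cat sat on the mat".split()
print(dynamic_mask(sentence))       # a different masking pattern on each call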

In their work, called ERNIE, Sun et al. (2019) improve on BERT by introducing two knowledge masking strategies into their masked language modeling. In addition to masking out random words in the sentence, they mask phrases and named entities to incorporate real-world knowledge into language modeling/representation. In their subsequent work, ERNIE 2.0 (Sun et al., 2020), they implement continual multi-task learning. Including the one in ERNIE, they define seven pre-training tasks categorized into word-aware, structure-aware, and semantic-aware pre-training tasks, aiming to capture lexical, syntactic, and semantic relations, respectively.

GPT and its variants rely on a meta-learning idea using a conditional language model for diverse NLP tasks. This conditional language model predicts the next word conditioned both on an unsupervised pre-trained language model and on the previous words in the context. In GPT-3, Brown et al. (2020) pre-train a 175-billion-parameter transformer-based language model on a sufficiently large and diverse corpus and test its performance in zero-shot, one-shot, and few-shot settings. Their learning curves for these three settings show that a larger model better learns a task from contextual information. The authors apply task-specific input transformations, e.g., delimiting the context and question from the answer in reading comprehension, to test the model's performance in different NLP tasks. Their few-shot results prove the effectiveness of the approach by outperforming the state of the art on the LAMBADA language modeling dataset (Paperno et al., 2016), the TriviaQA closed-book open-domain question answering dataset (Joshi et al., 2017), and the PhysicalQA (PIQA) common sense reasoning dataset (Bisk et al., 2019).

5. Performance of Word Representations

Due to the popularity of the field, many benchmark datasets have been proposed. In this section, we describe the structure of these datasets and report the performance of the aforementioned word embedding models on them.

5.1. Datasets

Depending on their aim, the datasets produced to measure the success of embedding models can be divided into four categories: similarity tasks, the analogy task, synonym selection tasks, and downstream tasks.

5.1.1. Similarity Tasks

These datasets provide pairs of words whose similarity is rated by human judgments. They all use Spearman's rank correlation (ρ) between the model's similarity scores and the average human judgment to measure the performance and quality of embeddings; a minimal evaluation sketch following this protocol is given after the list below.

  • WordSim-353 (WS-353): Finkelstein et al. (2001) produced a corpus that contains human judgments, rated from 0 to 10, on 353 pairs of words.

  • SCWS: Huang et al. (2012) introduced this dataset of 2,003 word pairs scored by mechanical turkers within a context, which is usually a paragraph from Wikipedia that contains the given word. The reason for introducing such a dataset is that the available test sets for similarity measurement are not sufficient for testing multi-prototype word embeddings, because the scores of word pairs in those test sets are given in isolation, which lacks the contextual information needed for senses.

  • RG-65: This dataset, developed by Rubenstein and Goodenough (1965), is composed of 65 noun pairs whose similarity is rated by humans.

  • MC-30: The dataset (Miller and Charles, 1991) contains 30 pairs of words.

  • MEN: It (Bruni et al., 2014) contains 3,000 pairs of words together with human-assigned similarity scores obtained from Amazon Mechanical Turk.

  • YP-130: Similar to the previous test sets, YP-130 (Yang and Powers, 2006) also contains human-assigned similarity scores for 130 verb pairs.

  • RW: Unlike the previous word similarity datasets, RW (Luong et al., 2013) consists of 2,034 pairs of rare words that are not frequently seen in texts. The motivation behind this dataset is to provide a sufficient number of complex and rare words to test the expressiveness of morphological models, since the previous datasets mostly contain frequent words that are insufficient for such tests.

  • Simlex-999: The Simlex-999 dataset (Hill et al., 2015) contains 999 pairs of words whose similarity is annotated by mechanical turkers.
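The evaluation protocol shared by these datasets can be sketched as follows, assuming a dict mapping words to vectors and a list of (word1, word2, human_score) triples; real evaluations additionally report how many pairs are covered by the model's vocabulary.

import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, embeddings):
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            model.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            human.append(score)
    rho, _ = spearmanr(human, model)    # rank correlation with the human judgments
    return rho

words = ["car", "automobile", "banana", "fruit"]
toy_vectors = {w: np.random.randn(50) for w in words}
toy_pairs = [("car", "automobile", 9.0), ("banana", "fruit", 7.5), ("car", "banana", 1.0)]
print(evaluate_similarity(toy_pairs, toy_vectors))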

5.1.2. Analogy Task

The Semantic-Syntactic Word Relationship test set (Google Analogy Task), introduced by Mikolov et al. (2013a), consists of word pairs in the form "a is to b as c is to d" (such as Paris is to France as London is to England). The aim is to find d, given a, b, and c (cosine distance is used as the distance metric to find the missing word). There are 8,869 semantic and 10,675 syntactic questions in the dataset, and success is measured by accuracy.
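A minimal sketch of this evaluation, assuming a dict of word vectors, compares the offset b - a + c against all vocabulary vectors by cosine similarity, excluding the three query words:

import numpy as np

def solve_analogy(a, b, c, embeddings):
    # "a is to b as c is to ?": rank candidates by cosine similarity to b - a + c.
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = float(np.dot(target, vec) / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. solve_analogy("paris", "france", "london", vectors) should ideally return "england";
# accuracy is the fraction of questions for which the top-ranked word is the gold answer.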

5.1.3. Synonym Selection Tasks

Given a word as input, this task aims to select the most synonym-like word among a list of candidates. Accuracy (%) is used to measure the performance; a minimal selection sketch is given after the list below.

  • ESL-50: Contains 50 synonym selection questions from ESL (English as a Second Language) tests.

  • TOEFL-80: Contains 80 synonym selection questions from TOEFL (Test of English as a Foreign Language) tests.

  • RD-300: Contains 300 synonym selection questions from the Reader's Digest Word Power Game.
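A minimal selection sketch, assuming a dict of word vectors, simply returns the candidate with the highest cosine similarity to the query word; accuracy is then the fraction of questions for which this choice matches the gold synonym.

import numpy as np

def select_synonym(query, candidates, embeddings):
    q = embeddings[query]
    def cosine(word):
        v = embeddings[word]
        return np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
    return max(candidates, key=cosine)   # most synonym-like candidate

# e.g. select_synonym("enormously", ["appropriately", "uniquely", "tremendously", "decidedly"], vectors)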

5.1.4. Downstream Tasks

As representations and models get better and the gap between word embedding methods and language models narrows, experiments have shifted from similarity tasks to downstream tasks.

The GLUE benchmark (Wang et al., 2018) is introduced to provide a stable testing environment for researchers. It consists of several downstream tasks:

  • CoLA: The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a sentence classification task where the aim is to determine whether a sentence is linguistically acceptable or not. It contains sentences from linguistics publications, and success is measured by the Matthews Correlation Coefficient (MCC).

  • SST-2: The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences from movie reviews. The aim is to classify the sentiment of sentences. Accuracy is used to measure the performance.

  • MRPC: Microsoft Research Paraphrase Corpus (Dolan et al., 2004) contains pairs of sentences from news sources on the web. Each pair is annotated by humans indicating whether they are semantically equivalent or not. Performance is measured by accuracy.

  • STS-B: The Semantic Textual Similarity Benchmark (Cer et al., 2017) is composed of pairs of sentences from various sources, annotated with a similarity score between 0 and 5. Success is measured by Spearman's rank correlation (ρ).

  • QQP: The Quora Question Pairs (Chen et al., 2018) dataset contains over 400,000 question pairs, where the aim is to determine whether the two questions are semantically equivalent or not. Success is measured by accuracy.

  • MNLI: Multi-Genre Natural Language Inference (Williams et al., 2018) dataset is composed of crowd-sourced sentence pairs annotated with entailment information. The aim is to predict whether a second sentence is a contradiction, entailment, or neutral to the first one. Accuracy is used to measure the performance.

  • QNLI: The Question Natural Language Inference (Rajpurkar et al., 2018a) dataset is a modified version of the SQuAD dataset (Rajpurkar et al., 2016). It contains question/context pairs, where the aim is to determine whether the context contains the answer to the question.

  • RTE: Recognizing Textual Entailment (Bentivogli et al., 2009) is similar to MNLI, where the aim is to predict the type of entailment between a paragraph and a sentence, entailment, contradiction, and unknown being the choices.

  • WNLI: Winograd Natural Language Inference (Levesque et al., 2012) dataset also concerns natural language inference similar to the MNLI and the RTE datasets.

The Stanford Question Answering Dataset (SQuAD 1.1 (Rajpurkar et al., 2016) and SQuAD 2.0 (Rajpurkar et al., 2018b)) is a reading comprehension dataset composed of Wikipedia articles and questions about them. The aim is to find the text segment that answers the related question. SQuAD 2.0 combines the more than 100,000 answerable questions of SQuAD 1.1 with over 50,000 questions that are unanswerable from the given context article. Any model built for this task should therefore also determine whether a question is answerable, in addition to answering it.

The RACE dataset (Lai et al., 2017) is also a reading comprehension dataset, collected from English exams for Chinese middle and high school students. The aim is to find, among the given choices, the correct answer to a question about a specific text passage. There are approximately 28,000 passages and 100,000 questions.

One can find the links to the datasets in Appendix B. The leaderboards of the current state-of-the-art can be tracked either from the respective websites or from the ACL Wiki (https://aclweb.org/aclwiki/State_of_the_art). The reader can refer to Bakarov (2018) for comparisons, advantages, and disadvantages of the evaluation methods for word embedding models.

5.2. Results

(All scores are Spearman's ρ; superscripts mark results taken from other sources, see Appendix B.)
Model | Dim. | WS-353 | SCWS avgSim | SCWS avgSimC | SCWS globalSim | SCWS localSim | SCWS MaxSimC | RG-65 | MEN | YP-130 | RW | MC-30 | Simlex-999
HLBL (Mnih and Hinton, 2008) | 100 | 33.2^3 | - | - | - | - | - | - | - | - | - | - | -
C&W (Collobert and Weston, 2008) | 50 | 29.5^3 | - | - | 57.0^8 | - | - | 48.0^7 | 57.0^7 | - | - | - | -
C&W (Collobert and Weston, 2008) | 50 | 49.8^3 | - | - | - | - | - | - | - | - | - | - | -
R&M (Reisinger and Mooney, 2010) | - | 73.4^3 | 60.4^3 | 60.5^3 | 62.5^8 | - | 60.4^8 | - | - | - | - | - | -
RNNLM (Mikolov et al., 2010) | 640 | - | - | - | - | - | - | - | - | - | - | - | -
Huang et al. (Huang et al., 2012) | 50 | 71.3 | 62.8 | 65.7 | 58.6^2 | 26.1^2 | - | - | - | - | - | - | -
CBOW (Mikolov et al., 2013a) | 400 | 69.4^7 | 64.2^7 | - | - | - | - | 73.2^7 | 66.5^7 | 34.3^7 | - | - | -
Skip-Gram (Mikolov et al., 2013a) | 100 | 58.9^5 | - | - | - | - | - | - | - | - | - | - | -
Skip-Gram (Mikolov et al., 2013b) | 300 | 70.4^9 | 66.6^6 | 66.6^6 | 65.2^9 | - | - | - | - | - | - | - | -
Skip-Gram (Mikolov et al., 2013b) | 256 | 66.7^1 | - | - | - | - | - | - | 55.7^1 | - | 38.8^1 | - | -
Luong et al. (Luong et al., 2013) | 50 | 64.6 | - | - | 48.5 | - | - | 65.4 | - | - | 34.4 | 71.7 | -
CLBL (Botha and Blunsom, 2014) | - | 39.0 | - | - | - | - | - | 41.0 | - | - | 30.0 | - | -
Tian et al. (Tian et al., 2014) | 50 | - | - | 65.4 | - | - | 63.6 | - | - | - | - | - | -
Qiu et al. (Qiu et al., 2014) | 200 | 65.2 | - | - | 53.4 | - | - | 67.4 | - | - | 32.9 | 81.6 | -
MSSG (Neelakantan et al., 2014) | 300 | 70.9 | 67.3 | 69.1 | 65.5 | 59.8 | - | - | - | - | - | - | -
Chen et al. (Chen et al., 2014) | 200 | - | 66.2 | 68.9 | 64.2 | - | - | - | - | - | - | - | -
GloVe (Pennington et al., 2014) | 300 | 75.9 | - | - | 59.6 | - | - | 82.9 | - | - | 47.8 | 83.6 | 41.0^15
Guo et al. (Guo et al., 2014) | 50 | - | 49.3 | - | - | - | 55.4 | - | - | - | - | - | -
KNET (Cui et al., 2015) | 100 | 66.1 | - | - | - | - | - | - | - | - | 39.3 | - | -
CNN-VMSSG (Chen et al., 2015) | 300 | - | 65.7 | 66.4 | 66.3 | 61.1 | - | - | - | - | - | - | -
AutoExtend (Rothe and Schütze, 2015) | 300 | - | 68.9 | 69.8 | - | - | - | - | - | - | - | - | -
SenseEmbed (Iacobacci et al., 2015) | 400 | 77.9 | 62.4 | - | - | - | - | 89.4 | 80.5 | 73.4 | - | - | -
TWE-1 (Liu et al., 2015) | 400 | - | - | 68.1 | - | - | 67.3 | - | - | - | - | - | -
Jauhar et al. (Jauhar et al., 2015) | 80 | 63.9 | - | - | 65.7 | - | - | 73.4 | 64.6 | - | - | 75.8 | -
SAMS (Cheng and Kartsaklis, 2015) | 300 | - | 62.5 | - | 59.9 | 58.5 | - | - | - | - | - | - | -
SWE (Liu et al., 2015) | 300 | 72.8 | - | - | - | - | - | - | - | - | - | - | -
Soricut and Och (Soricut and Och, 2015) | 500 | 71.2 | - | - | - | - | - | 75.1 | - | - | 41.8 | - | -
Cotterell et al. (Cotterell et al., 2016) | 100 | 58.9 | - | - | - | - | - | - | - | - | - | - | -
char2vec (Cao and Rei, 2016) | 256 | 34.5 | - | - | - | - | - | - | 32.2 | - | 28.2 | - | -
Bojanowski et al. (Bojanowski et al., 2016) | 300 | 71.0 | - | - | - | - | - | - | - | - | 47.0 | - | -
Yin and Schütze (Yin and Schütze, 2016) | 200 | 76.0 | - | - | - | - | - | - | 82.5 | - | 61.6 | 85.7 | 48.5
dLCE (Nguyen et al., 2016) | 500 | - | - | - | - | - | - | - | - | - | - | - | 59.0
Ngram2vec (Zhao et al., 2017) | 300 | - | - | - | - | - | - | - | 76.0 | - | 44.6 | - | 42.1
MSWE (Nguyen et al., 2017) | 300 | 72.4 | 66.7 | 66.7 | 66.8 | - | - | - | 76.4 | - | 35.6 | - | 39.2
Dict2vec (Tissier et al., 2017) | 300 | 75.6 | - | - | - | - | - | 87.5 | 75.6 | 64.6 | 48.2 | 86.0 | -
LMM (Xu et al., 2018) | 200 | 61.5 | - | - | 63.0 | - | - | 63.1 | - | - | 43.1 | - | -
LSTMEmbed (Iacobacci and Navigli, 2019) | 400 | 61.2 | - | - | - | - | - | - | - | - | - | - | -

Table 5. Word embedding models’ performances in similarity tasks (in chronological order).

In this section, we report the results obtained by the models examined in this survey on the aforementioned datasets. Tables 5, 6, 7, and 8 give the results on similarity, analogy, synonym selection, and downstream tasks, respectively.

While reporting the results, we follow a few criteria to make the comparison as fair and simple as possible:

  • Unless noted otherwise, all of the results are taken from the original papers. (The results taken from other sources are marked with numbered superscripts. See Appendix B for details.)

  • If more than one paper reports results for the same model, we take the one from the original paper.

  • If the author(s) provide several variations of a model, we report only the one with the best score.

Although some of the differences in the performance of word representations are due to the models themselves, it should be noted that the corpora the models are trained on can differ in size, which can affect the fairness of the comparison.

Table 5 shows the word embedding models' performances in similarity tasks. SenseEmbed (Iacobacci et al., 2015) is the best performing model on the WS-353, RG-65, and YP-130 datasets according to the reported results. Yin and Schütze (2016) has superior performance on the MEN and RW datasets, while Dict2vec (Tissier et al., 2017) outperforms the others on MC-30. On SCWS, AutoExtend (Rothe and Schütze, 2015) gives the highest correlation coefficient scores. In general, GloVe (Pennington et al., 2014), SenseEmbed (Iacobacci et al., 2015), Yin and Schütze (2016), and Dict2vec (Tissier et al., 2017) perform well on similarity datasets.

SenseEmbed's (Iacobacci et al., 2015) success can be attributed to its capability to disambiguate senses by being trained on sense-tagged corpora. GloVe (Pennington et al., 2014) is generally robust, as it mixes global co-occurrence statistics with local context-based learning. Yin and Schütze (2016) is an ensemble of existing embeddings, including GloVe, which produces better representations for OOV words due to its ensemble nature and thus has good coverage of the words in similarity datasets. Dict2vec's (Tissier et al., 2017) performance demonstrates the effectiveness of its positive sampling over word2vec (Mikolov et al., 2013b).

Word embedding models' performances are also tested on the Google Analogy Task, which includes both syntactic and semantic analogies (Table 6). The best accuracy scores in this category are obtained by Yin and Schütze (2016), with GloVe (Pennington et al., 2014) as the second-best performing model. The results on the Google Analogy Task can be interpreted much like those on the similarity tasks.

(Superscripts mark results taken from other sources, see Appendix B.)
Model | Dimension | Syntactic (acc. %) | Semantic (acc. %) | Total (acc. %)
C&W (Collobert and Weston, 2008) | 50 | 9.3^4 | 12.3^4 | 11.0^4
RNNLM (Mikolov et al., 2010) | 640 | 8.6^4 | 36.5^4 | 24.6^4
CBOW (Mikolov et al., 2013a) | 1000 | 57.3 | 68.9 | 63.7
Skip-Gram (Mikolov et al., 2013a) | 1000 | 66.1 | 65.1 | 65.6
Skip-Gram (Mikolov et al., 2013b) | 100 | 36.4^13 | 28.0^13 | 32.6^13
Skip-Gram (Mikolov et al., 2013b) | 300 | 61.0 | 61.0 | 61.0
Skip-Gram (Mikolov et al., 2013b) | 256 | 51.3^1 | 33.9^1 | 43.6^1
ivLBL (Mnih and Kavukcuoglu, 2013) | 100 | 46.1 | 40.0 | 43.3
ivLBL (Mnih and Kavukcuoglu, 2013) | 300 | 63.0 | 65.2 | 64.0
vLBL (Mnih and Kavukcuoglu, 2013) | 300 | 64.8 | 54.0 | 60.0
vLBL (Mnih and Kavukcuoglu, 2013) | 600 | 67.1 | 60.5 | 64.1
Qiu et al. (Qiu et al., 2014) | 200 | 58.4 | 25.0 | 43.3
MSSG (Neelakantan et al., 2014) | 300 | - | - | 64.0^10
GloVe (Pennington et al., 2014) | 300 | 69.3 | 81.9 | 75.0
KNET (Cui et al., 2015) | 100 | 46.9 | 24.9 | 36.3
char2vec (Cao and Rei, 2016) | 256 | 52.5 | 2.5 | 35.5
Bojanowski et al. (Bojanowski et al., 2016) | 300 | 74.9 | 77.8 | -
Yin and Schütze (Yin and Schütze, 2016) | 200 | 76.3 | 92.5 | 77.0
Ngram2vec (Zhao et al., 2017) | 300 | 71.0 | 74.2 | 72.5
MSWE (Nguyen et al., 2017) | 50 | - | - | 69.9
LMM (Xu et al., 2018) | 200 | 20.4 | - | -

Table 6. Word embedding models’ performances in analogy task (in chronological order).

In synonym selection tasks, results for three models (Skip-Gram (Mikolov et al., 2013b), Jauhar et al. (Jauhar et al., 2015), and SWE (Liu et al., 2015)) are reported (Table 7). On the ESL-50 and RD-300 datasets, the only model with reported performance is Jauhar et al. (Jauhar et al., 2015). Among these, on TOEFL-80, SWE (Liu et al., 2015) outperforms the others. Here, SWE's success can be explained by its use of a synonym-antonym rule in learning word embeddings.

Table 8 presents the word embedding models' performances on downstream tasks. In the GLUE benchmark, CBOW (Mikolov et al., 2013a), BiLSTM+Cove+Attn (McCann et al., 2017), and BiLSTM+Elmo+Attn (Peters et al., 2018) are behind the human baselines on every task except QQP. On QQP, CBOW still underperforms, but BiLSTM+Cove+Attn (McCann et al., 2017) and BiLSTM+Elmo+Attn (Peters et al., 2018) are superior to human performance.

As for the original BERT (Devlin et al., 2019) and its variants, they consistently outperform the human baselines on MRPC, QQP, and QNLI. On SST-2, MNLI, RTE, and WNLI, human performance is better. On STS-B, the only model with performance superior to humans is ALBERT (Lan et al., 2020). On CoLA and on the question answering (SQuAD 2.0) and reading comprehension (RACE) tasks, performance above the human baseline is observed starting from XLNet (Yang et al., 2019). GPT-3 (Brown et al., 2020) is promising with its language-model-as-meta-learner idea and gives its best performance in the few-shot setting. Although it is behind the state of the art by a large margin on the GLUE benchmark, on RTE its score is above those of CBOW (Mikolov et al., 2013a), BiLSTM+Cove+Attn (McCann et al., 2017), and BiLSTM+Elmo+Attn (Peters et al., 2018).

Table 8 demonstrates the success of contextual representations, especially the transformer-based models (BERT (Devlin et al., 2019) and its successors), which go beyond human performance on most of the downstream tasks. However, on natural language inference tasks such as MNLI, WNLI, and RTE, these probabilistic language representations still show limitations in meeting the requirements of causal inference.

(Superscripts mark results taken from other sources, see Appendix B.)
Model | Dimension | ESL-50 (%) | TOEFL-80 (%) | RD-300 (%)
Skip-Gram (Mikolov et al., 2013b) | 300 | - | 83.7^11 | -
Skip-Gram (Mikolov et al., 2013b) | 400 | 62.0^14 | 87.0^14 | -
GloVe (Pennington et al., 2014) | 300 | 60.0^14 | 88.7^14 | -
MSSG (Neelakantan et al., 2014) | 300 | 57.1^14 | 78.3^14 | -
Jauhar et al. (Jauhar et al., 2015) | 80 | 63.6 | 73.3 | 66.7
Jauhar et al. (Jauhar et al., 2015) | 80 | 73.3^14 | 80.0^14 | -
Li and Jurafsky (Li and Jurafsky, 2015) | 300 | 50.0^14 | 82.6^14 | -
SWE (Liu et al., 2015) | 300 | - | 88.7 | -
LSTMEmbed (Iacobacci and Navigli, 2019) | 400 | 72.0 | 92.5 | -

Table 7. Word embedding models’ performances in synonym selection tasks (in chronological order).

(Superscripts mark results taken from other sources, see Appendix B.)
Model | CoLA (mcc) | SST-2 (%) | MRPC (F1) | STS-B (ρ) | QQP (F1) | MNLI m/mm (%/%) | QNLI (%) | RTE (%) | WNLI (%) | SQuAD 2.0 (F1) | RACE (%)
CBOW (Mikolov et al., 2013a) | 0.0 | 80.0 | 81.5 | 58.7 | 51.4 | 56.0/56.4 | 72.1 | 54.1 | 62.3 | - | -
BiLSTM+Cove+Attn (McCann et al., 2017) | 8.3 | 80.7 | 80.0 | 68.4 | 60.5 | 68.1/68.6 | 72.9 | 56.0 | 18.3 | - | -
BiLSTM+Elmo+Attn (Peters et al., 2018) | 33.6 | 90.4 | 84.4 | 72.3 | 63.1 | 74.1/74.5 | 79.8 | 58.9 | 65.1 | - | -
GLUE Human Baselines | 66.4 | 97.8 | 86.3 | 92.6 | 59.5 | 92.0/92.8 | 91.2 | 93.6 | 95.9 | - | -
SQuAD Human Baselines (Rajpurkar et al., 2018b) | - | - | - | - | - | - | - | - | - | 89.4 | -
Turkers (Lai et al., 2017) | - | - | - | - | - | - | - | - | - | - | 73.3
BERT (Devlin et al., 2019) | 60.5 | 94.9 | 89.3 | 86.5 | 72.1 | 86.7/85.9 | 91.1 | 70.1 | 65.1 | 89.1^12 | 72.0^12
ERNIE 2.0 (Sun et al., 2020) | 63.5 | 95.6 | 90.2 | 90.6 | 73.8 | 88.7/88.8 | 94.6 | 80.2 | 67.8 | - | -
XLNet (Yang et al., 2019) (ensemble) | 67.8 | 96.8 | 92.9 | 91.6 | 74.7 | 90.2/89.7 | 98.6 | 86.3 | 90.4 | 89.1^12 | 81.8^12
RoBERTa (Liu et al., 2019) (ensemble) | 67.8 | 96.7 | 92.3 | 91.9 | 74.3 | 90.8/90.2 | 98.9 | 88.2 | 89.0 | 89.8^12 | 83.2^12
ALBERT (Lan et al., 2020) | 71.4 | 96.9 | 90.9 | 93.0 | - | 90.8 | 95.3 | 89.2 | - | 90.9 | 86.5
ALBERT (Lan et al., 2020) (ensemble) | 69.1 | 97.1 | 93.4 | 92.5 | 74.2 | 91.3/91.0 | 99.2 | 89.2 | 91.8 | 92.2 | 89.4
GPT-3 Few-Shot (Brown et al., 2020) | - | - | - | - | - | -/- | - | 69.0 | - | 69.8 | 45

Table 8. Word embedding models’ performances in downstream tasks.

6. Conclusion

Human-level language understanding is one of the oldest challenges in computer science. Many scientific works have been dedicated to finding good representations for the semantic units (words, morphemes, characters) of languages, since such representations are a prerequisite for all downstream tasks in NLP. Most of these studies rely on the distributional hypothesis, where the meaning of a word is derived from its neighboring words.

Distributed representation through a neural network is intuitive in that it resembles the human mind's representation of concepts. Beyond that, the knowledge of pre-trained language models has been transferred to fine-tuned task-specific models, which introduced a boost in performance. To summarize, neural language models, with their updated weights as well as the learned representations in their layers, have become a source of knowledge.

From the release of early word embeddings to current contextual representations, the study of semantics has experienced a transformation, evident from the substantial performance improvements in all NLP tasks. The idea of pre-training a language model and then fine-tuning it on a downstream task has become a de facto standard in almost all subfields of NLP.

Recently, contextual models such as BERT and its variants have shown great success in downstream NLP tasks using masked language modeling and transformer structures. They have become the state of the art in word representation and have obtained human-level results on some of the downstream tasks.

Over the last few years, there has been an increase in studies that consider experiential (visual) information by building multi-modal language models and representations (Sezerer and Tekir, 2021; Lu et al., 2019; Yang et al., 2016). The idea of multi-modal language modeling is based on human language acquisition, where learning starts with concrete concepts through images early on (as pointed out by the "pointing phase" in children (Liszkowski et al., 2006, 2008)) and then continues with learning abstract ideas through text (Vigliocco et al., 2009; Andrews et al., 2009; Griffiths et al., 2007). Fueled by the success of text-based language models and advancements in cognitive psychology, this type of multi-modal language modeling may be the next goal to tackle.

Acknowledgements.
We want to thank Tuğkan Tuğlular for his valuable comments and feedback in the development of this survey.

References

  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pp. 1638–1649. Cited by: §4.
  • A. Alexandrescu and K. Kirchhoff (2006) Factored neural language models. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, Stroudsburg, PA, USA, pp. 1–4. Cited by: Table 1, §3.
  • M. Andrews, G. Vigliocco, and D. Vinson (2009) Integrating experiential and distributional data to learn semantic representations.. Psychological Review 116 (3), pp. 463–498. Cited by: §6, §6.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §3.4.1.
  • A. Bakarov (2018) A survey of word embeddings evaluation methods. CoRR abs/1801.09536. External Links: Link, 1801.09536 Cited by: §5.1.4.
  • C. F. Baker, C. J. Fillmore, and J. B. Lowe (1998) The berkeley framenet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, COLING ’98, Stroudsburg, PA, USA, pp. 86–90. External Links: Document Cited by: §3.1, Table 1.
  • Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin (2003) A neural probabilistic language model. J. Mach. Learn. Res. 3, pp. 1137–1155. External Links: ISSN 1532-4435 Cited by: Figure 1, §2.4, §3.1, Table 1, §3, §3.
  • L. Bentivogli, I. Dagan, H. T. Dang, D. Giampiccolo, and B. Magnini (2009) The fifth pascal recognizing textual entailment challenge. In In Proc Text Analysis Conference (TAC’09, Cited by: 8th item.
  • J. Bian, B. Gao, and T. Liu (2014) Knowledge-powered deep learning for word embedding. In Proceedings of the 2014th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECMLPKDD’14, Berlin, Heidelberg, pp. 132–148. External Links: ISBN 978-3-662-44847-2, Document Cited by: §3.4.1, §3.4.1, Table 4.
  • Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2019) PIQA: reasoning about physical commonsense in natural language. CoRR abs/1911.11641. External Links: Link, 1911.11641 Cited by: §4.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. J. Mach. Learn. Res. 3, pp. 993–1022. External Links: ISSN 1532-4435 Cited by: §3.3, §3.3, Table 3.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2016) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §3.4.1, §3.4.1, §3.4.1, Table 4, Table 5, Table 6.
  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, New York, NY, USA, pp. 1247–1250. External Links: ISBN 978-1-60558-102-6, Document Cited by: §3.4.1, §3.4.2, Table 4.
  • L. Borin, M. Forsberg, and L. Lönngren (2013) SALDO: a touch of yin to wordnet’s yang. Language Resources and Evaluation 47 (4), pp. 1191–1211. External Links: ISSN 1574-0218, Document, Link Cited by: §3.3, Table 3.
  • J. A. Botha and P. Blunsom (2014) Compositional morphology for word representations and language modelling. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pp. II–1899–II–1907. Cited by: §3.4.1, §3.4.1, Table 4, Table 5.
  • S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit (2004) TIGER: linguistic interpretation of a german corpus. Research on Language and Computation 2 (4), pp. 597–620. External Links: Document Cited by: §3.4.1, Table 4.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. CoRR abs/2005.14165. External Links: Link, 2005.14165 Cited by: §4, §5.2, Table 8.
  • E. Bruni, N. K. Tran, and M. Baroni (2014) Multimodal distributional semantics. J. Artif. Int. Res. 49 (1), pp. 1–47. External Links: ISSN 1076-9757 Cited by: 5th item.
  • J. Camacho-Collados and M. T. Pilehvar (2018) From word to sense embeddings: a survey on vector representations of meaning. J. Artif. Int. Res. 63 (1), pp. 743–788. External Links: ISSN 1076-9757, Link, Document Cited by: §3.3.
  • K. Cao and M. Rei (2016) A joint model for word embedding and word morphology. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, pp. 18–26. External Links: Document Cited by: §B.2, §3.4.1, §3.4.1, §3.4.1, Table 4, Table 5, Table 6.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14. External Links: Document Cited by: 4th item.
  • S. F. Chen and J. Goodman (1996) An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL ’96, pp. 310–318. Cited by: §2.3, §2.4, §3, §3, §3.
  • T. Chen, R. Xu, Y. He, and X. Wang (2015) Improving distributed representation of word sense via wordnet gloss composition and context clustering. In ACL, Cited by: §B.2, §3.3, §3.3, Table 3, Table 5.
  • X. Chen, Z. Liu, and M. Sun (2014) A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1025–1035. External Links: Document Cited by: §3.3, §3.3, Table 3, Table 5.
  • Z. chen, H. Zhang, X. Zhang, and L. Zhao (2018) Quora question pairs. External Links: Link Cited by: 5th item.
  • J. Cheng and D. Kartsaklis (2015) Syntax-aware multi-sense word embeddings for deep compositional models of meaning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1531–1542. External Links: Document Cited by: §3.3, Table 3, Table 5.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, pp. 2493–2537. External Links: ISSN 1532-4435 Cited by: §3.4.1, Table 3, §3.
  • R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 160–167. External Links: ISBN 978-1-60558-205-4, Document Cited by: §3.1, §3.3, Table 1, §3, Table 5, Table 6.
  • R. Cotterell, H. Schütze, and J. Eisner (2016) Morphological smoothing and extrapolation of word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1651–1660. External Links: Document Cited by: §B.2, §3.4.2, Table 4, Table 5.
  • R. Cotterell and H. Schütze (2015) Morphological word-embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1287–1292. External Links: Document Cited by: §3.4.1, §3.4.1, Table 4.
  • M. Creutz and K. Lagus (2007) Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. 4 (1), pp. 3:1–3:34. External Links: ISSN 1550-4875, Document Cited by: §3.4.1, §3.4.1, §3.4.1, §3.4.1, §3.4.1, Table 4.
  • Q. Cui, B. Gao, J. Bian, S. Qiu, H. Dai, and T. Liu (2015) KNET: a general framework for learning word embedding using morphological knowledge. ACM Trans. Inf. Syst. 34 (1), pp. 4:1–4:25. External Links: ISSN 1046-8188, Document Cited by: §3.4.1, §3.4.1, Table 4, Table 5, Table 6.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, Cambridge, MA, USA, pp. 3079–3087. Cited by: §4.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2978–2988. External Links: Document Cited by: §4.
  • S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990) Indexing by latent semantic analysis. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE 41 (6), pp. 391–407. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §4, §4, §4, §4, §5.2, §5.2, Table 8.
  • B. Dolan, C. Quirk, and C. Brockett (2004) Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, pp. 350–356. Cited by: 3rd item.
  • C. N. Dos Santos and B. Zadrozny (2014) Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pp. II–1818–II–1826. Cited by: §3.4.1, §3.4.1, §3.4.1, Table 4.
  • J. L. Elman (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211. Cited by: §2.4.
  • M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith (2015) Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1606–1615. External Links: Document Cited by: §3.1, Table 1.
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001) Placing search in context: the concept revisited. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, New York, NY, USA, pp. 406–414. External Links: ISBN 1-58113-348-0 Cited by: 1st item.
  • H. Gadalla, H. Kilany, H. Arram, A. Yacoub, A. El-Habashi, A. Shalaby, K. Karins, E. Rowson, R. MacIntyre, P. Kingsbury, et al. (1997) CALLHOME egyptian arabic transcripts. Linguistic Data Consortium, Philadelphia. Cited by: Table 1.
  • Y. Goldberg and J. Nivre (2012) A dynamic oracle for arc-eager dependency parsing. In COLING, Cited by: Table 1.
  • A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks : the official journal of the International Neural Network Society 18 5-6, pp. 602–10. Cited by: Table 4.
  • T. L. Griffiths, J. B. Tenenbaum, and M. Steyvers (2007) Topics in semantic representation. Psychological Review 114, pp. 2007. Cited by: §6, §6.
  • J. Guo, W. Che, H. Wang, and T. Liu (2014) Learning sense-specific word embeddings by exploiting bilingual resources. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 497–507. Cited by: §3.3, Table 3, Table 5.
  • M. U. Gutmann and A. Hyvärinen (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13 (1), pp. 307–361. External Links: ISSN 1532-4435 Cited by: §3.1, §3.3, Table 1.
  • R. H. Baayen, R. Piepenbrock, and H. Rijn (1993) The CELEX lexical database on CD-ROM. Cited by: Table 4.
  • D. Z. Hakkani-Tür, K. Oflazer, and G. Tür (2002) Statistical morphological disambiguation for agglutinative languages. Computers and the Humanities 36 (4), pp. 381–410. Cited by: Table 1.
  • Z. S. Harris (1954) Distributional structure. Word 10 (2-3), pp. 146–162. External Links: Document Cited by: §2.1.
  • F. Hill, R. Reichart, and A. Korhonen (2015) SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. External Links: Link, Document Cited by: 8th item.
  • G. E. Hinton, J. L. McClelland, and D. E. Rumelhart (1986) Parallel distributed processing: explorations in the microstructure of cognition, vol. 1. D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group (Eds.), pp. 77–109. External Links: ISBN 0-262-68053-X Cited by: §2.2, §2.4.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link Cited by: Table 4.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339. External Links: Document Cited by: §4.
  • E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng (2012) Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, Stroudsburg, PA, USA, pp. 873–882. Cited by: §B.2, §3.1, §3.3, §3.3, §3.4.1, Table 3, 2nd item, Table 5.
  • I. Iacobacci and R. Navigli (2019) LSTMEmbed: learning word and sense representations from a large semantically annotated corpus with long short-term memories. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 1685–1695. Cited by: §B.2, §3.3, Table 3, Table 5, Table 7.
  • I. Iacobacci, M. T. Pilehvar, and R. Navigli (2015) SensEmbed: learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 95–105. External Links: Document Cited by: §B.2, §3.3, §3.3, Table 3, §5.2, §5.2, Table 5.
  • S. K. Jauhar, C. Dyer, and E. Hovy (2015) Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 683–693. External Links: Document Cited by: §3.3, §3.3, Table 3, §5.2, Table 5, Table 7.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan (Eds.), pp. 1601–1611. External Links: Link, Document Cited by: §4.
  • F. K. Khattak, S. Jeblee, C. Pou-Prom, M. Abdalla, C. Meaney, and F. Rudzicz (2019) A survey of word embeddings for clinical text. Journal of Biomedical Informatics: X 4, pp. 100057. External Links: ISSN 2590-177X, Document, Link Cited by: §1.
  • Y. Kim, Y. Jernite, D. A. Sontag, and A. M. Rush (2016) Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, D. Schuurmans and M. P. Wellman (Eds.), pp. 2741–2749. External Links: Link Cited by: §3.4.1, Table 4, §4.
  • R. Kneser and H. Ney (1995) Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 181–184 vol.1. Cited by: §2.3, §2.4, §3, §3, §3.
  • S. Kumar, S. Jat, K. Saxena, and P. Talukdar (2019) Zero-shot word sense disambiguation using sense definition embeddings. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 5670–5681. Cited by: §3.3, Table 3.
  • A. Kutuzov, L. Øvrelid, T. Szymanski, and E. Velldal (2018) Diachronic word embeddings and semantic shifts: a survey. ArXiv abs/1806.03537. Cited by: §1.
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017) RACE: large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 785–794. External Links: Document Cited by: §5.1.4, Table 8.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, Cited by: §B.2, §4, §5.2, Table 8.
  • R. Lebret and R. Collobert (2014) Word embeddings through hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp. 482–490. External Links: Link, Document Cited by: §1.
  • H. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, Cited by: 9th item.
  • O. Levy, Y. Goldberg, and I. Dagan (2015) Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225. External Links: Document Cited by: §3.1.
  • O. Levy and Y. Goldberg (2014) Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 302–308. External Links: Document Cited by: §3.1, Table 1.
  • J. Li and D. Jurafsky (2015) Do multi-sense embeddings improve natural language understanding?. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1722–1732. External Links: Document Cited by: §3.3, Table 3, Table 7.
  • F. M. Liang (1983a) Word hy-phen-a-tion by com-put-er (hyphenation, computer). Ph.D. Thesis, Stanford University, Stanford University, Stanford, CA, USA. Note: AAI8329742 Cited by: §3.4.1.
  • F. M. Liang (1983b) Word hy-phen-a-tion by com-put-er. Technical report. Cited by: §3.4.1, Table 4.
  • W. Ling, C. Dyer, A. W. Black, I. Trancoso, R. Fermandez, S. Amir, L. Marujo, and T. Luís (2015a) Finding function in form: compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1520–1530. External Links: Document Cited by: §3.4.1, §3.4.1, §3.4.1, Table 4.
  • W. Ling, Y. Tsvetkov, S. Amir, R. Fermandez, C. Dyer, A. W. Black, I. Trancoso, and C. Lin (2015b) Not all contexts are created equal: better word representations with variable attention. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1367–1372. External Links: Document Cited by: §3.1, Table 1.
  • U. Liszkowski, M. Carpenter, T. Striano, and M. Tomasello (2006) 12- and 18-month-olds point to provide information for others. Journal of Cognition and Development 7 (2), pp. 173–187. External Links: Document Cited by: §6.
  • U. Liszkowski, M. Carpenter, and M. Tomasello (2008) Twelve-month-olds communicate helpfully and appropriately for knowledgeable and ignorant partners. Cognition 108 (3), pp. 732–739. External Links: ISSN 0010-0277, Document, Link Cited by: §6.
  • Q. Liu, M. J. Kusner, and P. Blunsom (2020) A survey on contextual embeddings. ArXiv abs/2003.07278. Cited by: §4.
  • Q. Liu, H. Jiang, S. Wei, Z. Ling, and Y. Hu (2015) Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1501–1511. External Links: Document Cited by: §B.2, §3.1, Table 1, §5.2, Table 5, Table 7.
  • Y. Liu, Z. Liu, T. Chua, and M. Sun (2015) Topical word embeddings. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2418–2424. External Links: ISBN 0-262-51129-0 Cited by: §B.2, §3.3, Table 3, Table 5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. ArXiv abs/1907.11692. Cited by: §4, Table 8.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §6.
  • K. Lund and C. Burgess (1996) Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior research methods, instruments, & computers 28 (2), pp. 203–208. Cited by: §1.
  • T. Luong, R. Socher, and C. Manning (2013) Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, pp. 104–113. Cited by: §3.4.1, §3.4.1, §3.4.1, Table 4, 7th item, Table 5.
  • A. T. Luu, Y. Tay, S. C. Hui, and S. K. Ng (2016) Learning term embeddings for taxonomic relation identification using dynamic weighting neural network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 403–413. External Links: Document Cited by: §3.2, Table 2.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors.. In NIPS, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 6297–6308. Cited by: §4, §5.2, §5.2, Table 8.
  • O. Melamud, J. Goldberger, and I. Dagan (2016) Context2vec: learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pp. 51–61. Cited by: §4, §4.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. CoRR abs/1301.3781. External Links: 1301.3781 Cited by: §B.2, §3.1, §3.1, §3.1, §3.1, §3.1, §3.1, §3.2, §3.3, §3.3, §3.4.1, §3.4.1, §3.4.1, §3.4.1, §3.4.1, §3.4.1, Table 1, Table 2, Table 3, §5.1.2, §5.2, §5.2, Table 5, Table 6, Table 8.
  • T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur (2010) Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048. Cited by: §3.3, Table 1, Table 3, §3, Table 5, Table 6.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 3111–3119. Cited by: §3.1, §3.1, §3.1, §3.1, §3.1, §3.2, §3.2, §3.2, §3.3, §3.3, §3.3, §3.4.1, Table 1, Table 2, §5.2, §5.2, Table 5, Table 6, Table 7.
  • G. A. Miller and W. G. Charles (1991) Contextual correlates of semantic similarity. Language and Cognitive Processes 6 (1), pp. 1–28. External Links: Document Cited by: 4th item.
  • G. A. Miller (1995) WordNet: a lexical database for English. Commun. ACM 38 (11), pp. 39–41. External Links: ISSN 0001-0782, Document Cited by: item 1, §3.1, §3.2, §3.2, §3.2, §3.2, §3.3, §3.3, §3.4.1, §3.4.2, Table 1, Table 2, Table 3, Table 4.
  • A. Mnih and G. Hinton (2007) Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pp. 641–648. External Links: ISBN 978-1-59593-793-3 Cited by: §3.4.1, Table 1, §3.
  • A. Mnih and G. Hinton (2008) A scalable hierarchical distributed language model. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’08, USA, pp. 1081–1088. External Links: ISBN 978-1-6056-0-949-2 Cited by: Table 1, §3, Table 5.
  • A. Mnih and K. Kavukcuoglu (2013) Learning word embeddings efficiently with noise-contrastive estimation. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 2265–2273. Cited by: §B.2, §3.1, Table 1, Table 6.
  • A. Mnih and Y. W. Teh (2012) A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, USA, pp. 419–426. External Links: ISBN 978-1-4503-1285-1 Cited by: §3.1.
  • F. Morin and Y. Bengio (2005) Hierarchical probabilistic neural network language model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, R. G. Cowell and Z. Ghahramani (Eds.), pp. 246–252. Cited by: Table 1, §3, §3.
  • N. Mrkšić, D. Ó Séaghdha, B. Thomson, M. Gašić, L. M. Rojas-Barahona, P. Su, D. Vandyke, T. Wen, and S. Young (2016) Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 142–148. External Links: Document Cited by: §B.2, §3.2, Table 2.
  • R. Navigli and S. P. Ponzetto (2012) BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, pp. 217–250. External Links: ISSN 0004-3702, Document Cited by: §3.3, Table 3.
  • R. Navigli (2009) Word sense disambiguation: a survey. ACM Comput. Surv. 41 (2). External Links: ISSN 0360-0300, Link, Document Cited by: §3.3.
  • A. Neelakantan, J. Shankar, A. Passos, and A. McCallum (2014) Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1059–1069. External Links: Document Cited by: §B.2, §3.3, Table 3, Table 5, Table 6, Table 7.
  • D. Q. Nguyen, D. Q. Nguyen, A. Modi, S. Thater, and M. Pinkal (2017) A mixture model for learning multi-sense word embeddings. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), Vancouver, Canada, pp. 121–127. External Links: Document Cited by: §B.2, §3.3, Table 3, Table 5, Table 6.
  • K. A. Nguyen, M. Köper, S. Schulte im Walde, and N. T. Vu (2017) Hierarchical embeddings for hypernymy detection and directionality. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 233–243. External Links: Document Cited by: §3.2, Table 2.
  • K. A. Nguyen, S. Schulte im Walde, and N. T. Vu (2016) Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 454–459. External Links: Document Cited by: §3.2, Table 2, Table 5.
  • L. Nieto Piña and R. Johansson (2015) A simple and efficient method to generate word sense representations. In Proceedings of the International Conference Recent Advances in Natural Language Processing, Hissar, Bulgaria, pp. 465–472. Cited by: §3.3, Table 3.
  • D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández (2016) The LAMBADA dataset: word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, External Links: Link, Document Cited by: §4.
  • E. Pavlick, P. Rastogi, J. Ganitkevitch, B. Van Durme, and C. Callison-Burch (2015) PPDB 2.0: better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 425–430. External Links: Document Cited by: §3.1, §3.2, Table 1, Table 2.
  • M. Pelevina, N. Arefiev, C. Biemann, and A. Panchenko (2016) Making sense of word embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 174–183. External Links: Document Cited by: §3.3, Table 3.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §3.1, §3.1, §3.1, §3.1, §3.2, Table 1, Table 2, §4, §5.2, §5.2, §5.2, Table 5, Table 6, Table 7.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 2227–2237. Cited by: §4, §4, §4, §5.2, §5.2, Table 8.
  • S. Qiu, Q. Cui, J. Bian, B. Gao, and T. Liu (2014) Co-learning of word representations and morpheme representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 141–150. Cited by: §3.4.1, §3.4.1, Table 4, Table 5, Table 6.
  • P. Rajpurkar, R. Jia, and P. Liang (2018a) Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 784–789. External Links: Document Cited by: 7th item.
  • P. Rajpurkar, R. Jia, and P. Liang (2018b) Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 784–789. External Links: Document Cited by: 14th item, §5.1.4, Table 8.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. External Links: Document Cited by: 14th item, 7th item, §5.1.4.
  • J. Reisinger and R. J. Mooney (2010) Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, Stroudsburg, PA, USA, pp. 109–117. External Links: ISBN 1-932432-65-5 Cited by: §3.3, §3.3, Table 3, Table 5.
  • A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in BERTology: what we know about how BERT works. ArXiv abs/2002.12327. Cited by: §4.
  • S. Rothe and H. Schütze (2015) AutoExtend: extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1793–1803. External Links: Document Cited by: §B.2, §3.4.2, Table 4, §5.2, Table 5.
  • H. Rubenstein and J. B. Goodenough (1965) Contextual correlates of synonymy. Commun. ACM 8 (10), pp. 627–633. External Links: ISSN 0001-0782, Document Cited by: 3rd item.
  • S. Ruder, I. Vulić, and A. Søgaard (2019) A survey of cross-lingual word embedding models. J. Artif. Int. Res. 65 (1), pp. 569–630. External Links: ISSN 1076-9757, Document Cited by: §1.
  • G. Salton, A. Wong, and C. S. Yang (1975) A vector space model for automatic indexing. Commun. ACM 18 (11), pp. 613–620. External Links: ISSN 0001-0782, Link, Document Cited by: §1.
  • H. Schütze (1998) Automatic word sense discrimination. Computational Linguistics 24 (1), pp. 97–123. Cited by: §3.3, §6.
  • E. Sezerer and S. Tekir (2021) Incorporating concreteness in multi-modal language models with curriculum learning. Applied Sciences 11 (17). External Links: Link, ISSN 2076-3417, Document Cited by: §6.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642. Cited by: 2nd item.
  • R. Soricut and F. J. Och (2015) Unsupervised morphology induction using word embeddings. In HLT-NAACL, Cited by: §3.4.1, §3.4.1, Table 4, Table 5.
  • Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu (2019) ERNIE: enhanced representation through knowledge integration. ArXiv abs/1904.09223. Cited by: §4.
  • Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang (2020) ERNIE 2.0: a continual pre-training framework for language understanding. ArXiv abs/1907.12412. Cited by: §4, Table 8.
  • F. Tian, H. Dai, J. Bian, B. Gao, R. Zhang, E. Chen, and T. Liu (2014) A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 151–160. Cited by: §3.3, Table 3, Table 5.
  • J. Tissier, C. Gravier, and A. Habrard (2017) Dict2vec: learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 254–263. External Links: Document Cited by: §3.1, Table 1, §5.2, §5.2, Table 5.
  • K. Toutanova, D. Klein, C. D. Manning, and Y. Singer (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, USA, pp. 173–180. External Links: Document Cited by: Table 1.
  • J. Turian, L. Ratinov, and Y. Bengio (2010) Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, Stroudsburg, PA, USA, pp. 384–394. Cited by: §3.1.
  • P. D. Turney and P. Pantel (2010) From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37 (1), pp. 141–188. External Links: ISSN 1076-9757 Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §4.
  • G. Vigliocco, L. Meteyard, M. Andrews, and S. Kousta (2009) Toward a theory of semantic representation. Language and Cognition 1 (2), pp. 219–247. Cited by: §6, §6.
  • I. Vulić, N. Mrkšić, R. Reichart, D. Ó Séaghdha, S. Young, and A. Korhonen (2017) Morph-fitting: fine-tuning word vector spaces with simple language-specific rules. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 56–68. External Links: Document Cited by: §3.2, Table 2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Document Cited by: §5.1.4.
  • C. Wang, X. He, and A. Zhou (2019) SphereRE: distinguishing lexical relations with hyperspherical relation embeddings. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 1727–1737. Cited by: §3.2, Table 2.
  • Y. Wang, S. Liu, N. Afzal, M. Rastegar-Mojarad, L. Wang, F. Shen, P. Kingsbury, and H. Liu (2018) A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics 87, pp. 12–20. External Links: ISSN 1532-0464, Document, Link Cited by: §1.
  • A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7, pp. 625–641. Cited by: 1st item.
  • J. Wieting, M. Bansal, K. Gimpel, and K. Livescu (2015) From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics 3, pp. 345–358. External Links: Document Cited by: §3.2, Table 2.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Cited by: 6th item.
  • L. Wittgenstein (1953) Philosophical investigations. Basil Blackwell. Cited by: §2.1.
  • W. Wu, H. Li, H. Wang, and K. Q. Zhu (2012) Probase: a probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, New York, NY, USA, pp. 481–492. External Links: ISBN 978-1-4503-1247-9, Document Cited by: §3.2, Table 2.
  • W. Xu and A. Rudnicky (2000) Can artificial neural networks learn language models?. In Sixth International Conference on Spoken Language Processing, Cited by: §2.4.
  • Y. Xu, J. Liu, W. Yang, and L. Huang (2018) Incorporating latent meanings of morphological compositions to enhance word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1232–1242. External Links: Document Cited by: §3.4.1, §3.4.1, Table 4, Table 5, Table 6.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 5753–5763. Cited by: §4, §5.2, Table 8.
  • Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21–29. External Links: Document Cited by: §6.
  • W. Yin and H. Schütze (2016) Learning word meta-embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1351–1360. External Links: Document Cited by: §3.1, Table 1, §5.2, §5.2, §5.2, Table 5, Table 6.
  • Z. Yu, H. Wang, X. Lin, and M. Wang (2015) Learning term embeddings for hypernymy identification. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 1390–1397. External Links: ISBN 978-1-57735-738-4 Cited by: §3.2, Table 2.
  • Z. Zhao, T. Liu, S. Li, B. Li, and X. Du (2017) Ngram2vec: learning improved word representations from ngram co-occurrence statistics. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 244–253. External Links: Document Cited by: §3.1, Table 1, Table 5, Table 6.

Appendix A Reading Guide for Beginners

Figure 2 lists the milestone papers of each subtopic. The list is not exhaustive; it is intended as a starting point for researchers who are not familiar with the subject. For the details of each subfield, readers should refer to the corresponding sections of the survey. Dashed sets represent non-neural models, which are not the subject of this survey.

Figure 2. Evolution of neural word embeddings.

Appendix B Details on Datasets and Results

B.1. Datasets

  • WS-353: http://gabrilovich.com/resources/data/wordsim353/wordsim353.zip

  • SCWS: http://www-nlp.stanford.edu/~ehhuang/SCWS.zip

  • RG-65: no official download link is available for this dataset

  • MC-30: no official download link is available for this dataset

  • MEN: https://staff.fnwi.uva.nl/e.bruni/MEN

  • YP-130: https://www.researchgate.net/publication/257946337_Verb_similarity_on_the_taxonomy_of_WordNet_-_dataset/link/02e7e5266fe99269cc000000/download

  • RW: http://www-nlp.stanford.edu/~lmthang/morphoNLM/rw.zip

  • SimLex-999: https://fh295.github.io/simlex.html

  • Google Analogy Task: http://download.tensorflow.org/data/questions-words.txt

  • ESL-50: https://www.apperceptual.com/home (personal communication)

  • TOEFL-80: http://lsa.colorado.edu/mail_sub.html (personal communication)

  • RD-300: https://arxiv.org/ftp/arxiv/papers/1204/1204.0140.pdf (Appendix K; also contains TOEFL-80 and ESL-50)

  • GLUE Benchmark (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI):
    https://gluebenchmark.com/tasks

  • Stanford Question Answering Dataset (SQuAD 1.1 (Rajpurkar et al., 2016) and SQuAD 2.0 (Rajpurkar et al., 2018b)):
    https://rajpurkar.github.io/SQuAD-explorer/

  • RACE dataset: http://www.cs.cmu.edu/~glai1/data/race/
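
For readers who wish to reproduce the intrinsic evaluations on the word-similarity benchmarks listed above (e.g., WS-353, SimLex-999, RW), the standard protocol is to compute the cosine similarity of each word pair's embeddings and report the Spearman correlation against the human ratings. The following is a minimal sketch of that protocol, not the evaluation code used by any of the surveyed papers; it assumes a tab- or space-separated pairs file with columns word1, word2, score and a plain-text embedding file in word2vec format, and the file names are purely illustrative.

    # Minimal sketch: word-similarity evaluation via cosine similarity
    # and Spearman correlation (file names/formats are assumptions).
    import numpy as np
    from scipy.stats import spearmanr

    def load_embeddings(path):
        """Load a plain-text embedding file: one 'word v1 v2 ... vd' per line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split()
                if len(parts) < 3:           # skip a possible "vocab dim" header
                    continue
                vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
        return vectors

    def evaluate_similarity(vectors, pairs_path):
        """Spearman correlation between model cosine similarities and gold scores."""
        model_scores, gold_scores = [], []
        with open(pairs_path, encoding="utf-8") as f:
            for line in f:
                w1, w2, gold = line.split()[:3]
                if w1 in vectors and w2 in vectors:   # out-of-vocabulary pairs are skipped
                    v1, v2 = vectors[w1], vectors[w2]
                    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
                    model_scores.append(cos)
                    gold_scores.append(float(gold))
        return spearmanr(model_scores, gold_scores).correlation

    # Example usage (hypothetical file names):
    # vecs = load_embeddings("embeddings.txt")
    # print(evaluate_similarity(vecs, "wordsim353.tsv"))

Note that papers differ in how they handle out-of-vocabulary pairs (skipping them versus backing off to a default vector), which can affect comparability of the reported correlations.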

B.2. Results

The sources of the experimental results are listed below. Each number corresponds to a numbered superscript used in the tables:
1: reported in (Cao and Rei, 2016)
2: reported in (Chen et al., 2015)
3: reported in (Huang et al., 2012)
4: reported in (Mikolov et al., 2013a)
5: reported in (Cotterell et al., 2016)
6: reported in (Rothe and Schütze, 2015)
7: reported in (Iacobacci et al., 2015)
8: reported in (Liu et al., 2015)
9: reported in (Neelakantan et al., 2014)
10: reported in (Nguyen et al., 2017)
11: reported in (Liu et al., 2015)
12: reported in (Lan et al., 2020)
13: reported in (Mnih and Kavukcuoglu, 2013)
14: reported in (Iacobacci and Navigli, 2019)
15: reported in (Mrkšić et al., 2016)