A Comparative Study of Lexical Substitution Approaches based on Neural Language Models

05/29/2020 ∙ by Nikolay Arefyev, et al. ∙ 0

Lexical substitution in context is an extremely powerful technology that can be used as a backbone of various NLP applications, such as word sense induction, lexical relation extraction, data augmentation, etc. In this paper, we present a large-scale comparative study of popular neural language and masked language models (LMs and MLMs), such as context2vec, ELMo, BERT, XLNet, applied to the task of lexical substitution. We show that already competitive results achieved by SOTA LMs/MLMs can be further improved if information about the target word is injected properly, and compare several target injection methods. In addition, we provide analysis of the types of semantic relations between the target and substitutes generated by different models providing insights into what kind of words are really generated or given by annotators as substitutes.




1 Introduction

Lexical substitution is the task of generating words which can replace a given word in a given textual context. For instance, in the sentence “My daughter purchased a new car” the word car can be substituted by its synonym vehicle keeping the same meaning, but also with the co-hyponym bike, or even the hypernym means of transport while keeping the original sentence grammatical. Lexical substitution can be useful in various applications, such as word sense induction Amrami and Goldberg (2018), lexical relation extraction Schick and Schütze (2019), paraphrase generation, semantic spelling correction, text simplification, textual data augmentation, etc.

The new generation of language models (LMs) based on deep neural networks, such as ELMo Peters et al. (2018), BERT [1], and XLNet Yang et al. (2019), enabled a profound breakthrough in many NLP tasks, ranging from sentiment analysis to named entity recognition. Commonly, these models are used to pre-train deep neural networks, which are then fine-tuned to perform some task other than language modelling Howard and Ruder (2018). In this paper we provide the first large-scale comparison and analysis of various neural LMs/MLMs applied to the task of lexical substitution and to two tasks which exploit lexical substitution, namely word sense induction and text data augmentation. More specifically, the main contributions of the paper are as follows:

  • A comparative study of neural language models and masked language models (context2vec, ELMo, BERT, XLNet) applied to lexical substitution, based on both intrinsic and extrinsic evaluations.

  • A study of types of semantic relations (synonyms, co-hyponyms, etc.) produced by substitution models and human annotators.

  • A study of methods of target word inclusion for improvement of lexical substitution quality.

2 Related Work

The paper which is arguably most similar to our study is Zhou et al. (2019), where an end-to-end lexical substitution approach based on BERT is proposed, similar to the baseline BERT-based approaches studied in our paper. However, our study goes beyond evaluation on the SemEval-based lexical substitution task alone: we also test performance on other intrinsic datasets and in the context of two applications, word sense induction and data augmentation. Besides, our study is not limited to BERT but compares head-to-head three recently introduced neural LMs (BERT, ELMo, and XLNet) and their variants.

More generally, solving the lexical substitution task requires finding words that are both appropriate in the given context and related to the target word in some sense (which may vary depending on the application of the generated substitutes). To achieve this, unsupervised substitution models rely heavily on distributional similarity models of words (DSMs) and language models (LMs). Probably the most commonly used DSM is the word2vec model, which trains word embeddings and context embeddings to be similar when they tend to occur together. Contexts are either nearby words Mikolov et al. (2013) or syntactically related words Levy and Goldberg (2014), resulting in similar embeddings for distributionally similar words. In Melamud et al. (2015), several metrics for lexical substitution were proposed based on the embedding similarity of substitutes both to the target word and to the words in the given context. Later, Roller and Erk (2016) improved this approach by switching from cosine similarity to the dot product and applying an additional trainable transformation to context word embeddings.

A more sophisticated context2vec model producing embeddings for a word in a particular context (contextualized word embeddings) was proposed in Melamud et al. (2016) and was shown to outperform previous models in a ranking scenario where candidate substitutes are given. The training objective is similar to word2vec, but the context representation is produced by two LSTMs (a forward one for the left context and a backward one for the right context), whose final outputs are combined by a feed-forward NN. For lexical substitution, candidate word embeddings are ranked by their similarity to the given context representation. A similar architecture consisting of a forward and a backward LSTM is employed in ELMo Peters et al. (2018). However, each LSTM was trained with the LM objective instead. To rank candidate substitutes using ELMo, Soler et al. (2019) proposed calculating the cosine similarity between the contextualized ELMo embeddings of the target word and of each candidate substitute (this requires feeding the original example with the target word replaced by one candidate substitute at a time). The average of all ELMo layers’ outputs at the target timestep performed best. However, they found that context2vec performs even better, and explain this by its training objective, which is more closely related to the task.

Recently, deep Transformer NNs pre-trained on huge corpora with an LM or similar objective have consistently shown SOTA results in a variety of NLP tasks. BERT [1] is trained to restore a word replaced with a special [MASK] token at its input given both left and right context (masked LM objective), and XLNet Yang et al. (2019) predicts a word at a specified position given randomly selected words from the context with their positions. In Zhou et al. (2019), BERT was reported to perform poorly for lexical substitution (which is contrary to our experiments), and two improvements were proposed to achieve SOTA results using it. First, dropout is applied to the target word embedding before showing it to the model. Second, the similarity between the original contextualized representations of the context words and their representations after replacing the target by one of the possible substitutes is integrated into the ranking metric to ensure minimal changes in the sentence’s meaning. We are not aware of any work applying XLNet to lexical substitution, but our experiments show that it outperforms BERT by a large margin.

Supervised approaches to lexical substitution were also proposed, including Szarvas et al. (2013, 2013); Hintz and Biemann (2016). These approaches rely on manually curated lexical resources like WordNet, so, unlike those described above, they are not easily transferable to other languages. Also, the latest unsupervised methods like Zhou et al. (2019) were shown to perform better.

3 Neural Language Models for Lexical Substitution

To generate a substitute we take a text fragment and a target word position in it as input, and produce a list of substitutes with their probabilities using a neural LM/MLM. We experiment with a naive application of MLMs to predict the probability distribution over words that can appear in place of the target word given its left and right context, and also with combinations of several probability distributions, including distributional similarity to the target. Combinations yield better results for WSI according to prior studies Amrami and Goldberg (2019); Arefyev et al. (2019), and further improve intrinsic and extrinsic metrics in our experiments. More specifically, we test various methods of including information about the target word.

In our experiments, we use the following models as substitute probability estimators: context2vec Melamud et al. (2016), ELMo Peters et al. (2018), BERT [1], and XLNet Yang et al. (2019). Our experiments show that substitutes generated by neural LMs should preserve the meaning of the target word, so information about the target should be presented to the substitute generator in some way. This is why we experiment with several ways to inject information about the target word. Suppose we have an example $LTR$, where $T$ is the target word, and $C = (L, R)$ is its context (left and right correspondingly).

The first option is to combine the distribution provided by the substitute probability estimator, $P(s \mid C)$, with a distribution that comes from measuring the proximity between the target and substitutes, $P(s \mid T)$. The latter distribution is computed from the inner product between the respective embeddings. If we simply multiply these distributions, the second will have almost no effect because the first is very peaky. To align the orders of the distributions we use softmax with temperature $\tau$: $P(s \mid T) \propto \exp(\langle \mathrm{emb}_s, \mathrm{emb}_T \rangle / \tau)$. The final distribution is obtained by the formula $P(s \mid C, T) \propto P(s \mid C)\, P(s \mid T) / P(s)^{\beta}$, where $\beta$ is a parameter controlling how we penalize frequent words; for more details see Arefyev et al. (2019).
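The combination can be illustrated with a minimal sketch. It assumes that the combined score is proportional to the context-based probability times the temperature-softmaxed target similarity, divided by a word prior raised to a power beta; the function names, the toy vocabulary, and all numbers below are made up for illustration and are not taken from our implementation.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw similarity scores into a distribution; a higher temperature flattens it."""
    scaled = {w: s / temperature for w, s in scores.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - m) for w, v in scaled.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def combine(p_context, sim_to_target, prior, temperature=1.0, beta=1.0):
    """Assumed combination: P(s|C) * P(s|T) / P(s)**beta, renormalized over the vocabulary."""
    p_target = softmax_with_temperature(sim_to_target, temperature)
    unnorm = {w: p_context[w] * p_target[w] / (prior[w] ** beta) for w in p_context}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

# Toy inputs (hypothetical values for the target word "car").
p_context = {"vehicle": 0.05, "house": 0.60, "bike": 0.35}   # from the LM/MLM
sim_to_target = {"vehicle": 5.0, "house": 0.5, "bike": 3.0}  # inner products with the target embedding
prior = {"vehicle": 0.2, "house": 0.5, "bike": 0.3}          # unigram word probabilities

combined = combine(p_context, sim_to_target, prior, temperature=1.0, beta=1.0)
best = max(combined, key=combined.get)
```

In this toy case the context model alone prefers "house", while the combined distribution moves the mass towards words close to the target. Raising the temperature flattens the similarity term and reduces its influence.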

The second option is to use dynamic patterns. For example, the pattern “T and then _”, proposed in Arefyev et al. (2019), means that we replace the target with this construction, where T stands for the target word. The probability estimator should predict words at timestep ‘_’. Dynamic patterns thus make the target word visible to the model.
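As a rough illustration of how a dynamic pattern rewrites the input, the sketch below replaces the target token with the pattern and returns the position at which the estimator should predict. The function name `apply_pattern` and the whitespace tokenization are assumptions made for this example only.

```python
def apply_pattern(tokens, target_index, pattern="T and then _"):
    """Replace the target token with the pattern; return new tokens and the '_' position."""
    pattern_tokens = pattern.split()
    # Locate the prediction slot before substituting the target into the pattern.
    slot_offset = pattern_tokens.index("_")
    target = tokens[target_index]
    filled = [target if p == "T" else p for p in pattern_tokens]
    new_tokens = tokens[:target_index] + filled + tokens[target_index + 1:]
    return new_tokens, target_index + slot_offset

tokens = "My daughter purchased a new car".split()
new_tokens, predict_index = apply_pattern(tokens, target_index=5)
# new_tokens is the sentence with "car" expanded to "car and then _";
# predict_index points at the "_" slot where substitutes are predicted.
```

For a masked LM the token at `predict_index` would then be replaced by the model's mask token; for a left-to-right LM the prediction is taken at that timestep.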

Finally, we can give no information about the target word to the probability estimator. By default, ELMo does not have this information. BERT has a special mask token, so we replace the target word with this token, thus hiding the target from the model. For XLNet we use a special attention mask so that words in the context don’t see the target word.

More specifically, we experiment with the following baseline models and their upgraded versions, which include one of these approaches:


ELMo

To use ELMo as a probability estimator, we divide a sentence into left and right contexts with respect to the target word. We obtain two independent distributions over the vocabulary: one from the forward model for the left context, $P(s \mid L)$, and another from the backward model for the right context, $P(s \mid R)$. We combine these distributions using the BComb-LMs method proposed in Arefyev et al. (2019), which yields a single distribution conditioned on both contexts. The substitutes are the most probable words according to this distribution. Additionally, we study this model with two types of target injection: proximity according to ELMo embeddings, denoted as ELMo+embs, and the use of dynamic patterns, denoted as ELMo+pat.


BERT

In order to generate substitutes with BERT, we give the full context as input to the model and take the distribution over the vocabulary at the target word position. Since BERT is a masked LM, we can mask out the target word, thus giving the model no information about it. We call such a generator BERT-notgt. As for ELMo, we furthermore analyze other target word injections: combination with first-layer BERT embeddings (BERT+embs) and combination with a dynamic pattern (BERT+pat).


XLNet

We obtain the substitute distribution with XLNet in the same way as for BERT. In the base (non-masked) model, elements at context positions can attend to the element at the target position. In a similar vein as for BERT and ELMo, we consider three additional models: combination with embeddings (XLNet+embs), masking of the target word (XLNet-notgt), and use of a dynamic pattern (XLNet+pat). We find that for short contexts XLNet gives an erroneous distribution. To mitigate this problem, we prepend the initial context with some text that ends with the end-of-document special symbol.

4 Baseline Lexical Substitution Models

Lexical substitution models based on the three state-of-the-art neural LMs described above are compared to the three following strong models specifically developed for the lexical substitution task: OOC Roller and Erk (2016), nPIC Roller and Erk (2016), and context2vec Melamud et al. (2016).

OOC: Out of Context

This model ranks words by their cosine similarity with the target word and completely ignores context. Following Roller and Erk (2016), we use dependency-based embeddings (http://www.cs.biu.ac.il/nlp/resources/downloads/lexsub_embeddings) released by Melamud et al. (2015).
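The out-of-context ranking can be sketched in a few lines. The three-dimensional vectors below are toy values invented for the example; in practice the dependency-based embeddings mentioned above would be loaded instead, and the function name `ooc_rank` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def ooc_rank(target, embeddings, top_k=10):
    """Rank all other vocabulary words by cosine similarity to the target, ignoring context."""
    target_vec = embeddings[target]
    scored = [(w, cosine(target_vec, vec)) for w, vec in embeddings.items() if w != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings (made-up values).
toy_embeddings = {
    "car": [1.0, 0.2, 0.0],
    "vehicle": [0.9, 0.3, 0.1],
    "bike": [0.6, 0.7, 0.0],
    "banana": [0.0, 0.1, 1.0],
}
ranking = ooc_rank("car", toy_embeddings)
```

Because the context is never consulted, the same ranking is returned for every occurrence of the target, which is why this model cannot separate the senses of a polysemous word.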

nPIC: non-Parameterized probability In Context

nPIC is a measure that consists of two independent components that measure appropriateness of a substitute to the context (words that are directly connected to the target) and to the target, see Roller and Erk (2016). Each component is based on dependency based word and context embeddings and takes form of a softmax.


context2vec

This model builds the vector representation of the context using an LSTM-based NN and ranks possible substitutes by their dot-product similarity to the context representation. We use the original implementation (https://github.com/orenmel/context2vec) and the weights (http://u.cs.biu.ac.il/~nlp/resources/downloads/context2vec) pre-trained on the ukWac dataset.

Model         SemEval 2007                  CoInCo
              GAP    P@1    P@3    R@10     GAP    P@1    P@3    R@10
OOC           42.8   15.9   12.5   18.1     44.5   10.9    8.6   14.3
nPIC          50.6   23.1   17.3   26.5     48.1   26.3   19.8   18.0
context2vec   53.4    7.6    5.6   10.8     47.5    8.4    7.0    7.9
ELMo          51.7   11.4    8.4   14.0     48.9   13.6   11.0   11.6
BERT          52.8   37.1   27.0   38.8     50.3   44.4   33.6   29.8
XLNet         57.0   31.3   22.4   34.2     52.8   39.2   29.4   27.2
ELMo+embs     53.6   33.8   23.3   33.8     53.3   38.4   28.7   26.0
BERT+embs     52.1   38.5   29.4   42.6     50.3   44.8   35.2   31.9
XLNet+embs    57.3   45.6   32.8   46.0     54.8   49.0   38.9   34.9

Table 1: Results for candidate ranking (GAP) and all-words ranking (P@1, P@3, R@10), based on our re-implementation of the baselines to ensure that the models use the same post-processing.

5 Intrinsic Evaluation

We perform an intrinsic evaluation of neural LMs on the lexical substitution task on two datasets.

5.1 Experimental Settings

The lexical substitution task is concerned with finding appropriate substitutes for a target word in a given context. For example, possible substitutes for the word trade in the sentence “Angels make a trade to get outfield depth.” are swap, exchange, deal, barter, transaction, etc. Irrelevant ones are skill or craft, which correspond to different meanings of trade. This task was originally introduced as the SemEval 2007 evaluation competition McCarthy and Navigli (2007) and is well suited for evaluating how distributional models handle polysemous words. In a lexical substitution task, annotators are provided with the target word and the context. Their task is to propose possible substitutes. Since there are several annotators, we have a weighted list of substitutes for each target word in a given context.

We compute the probability of a substitute for a target word in a context by obtaining a distribution over the vocabulary or over a candidate list. The lexical substitution task comes in two variations: candidate ranking and all-words ranking. In the candidate ranking task, models are provided with a list of candidates. Following previous works, we obtain this list by merging all the substitutes of the target lemma and POS tag over the corpus. We measure performance on this task with Generalized Average Precision (GAP), which was introduced in Thater et al. (2010). GAP is similar to Mean Average Precision; the difference is in the weights, which come from how many times annotators selected a particular substitute (see the original paper on GAP for more details). Following Melamud et al. (2015), we discard all multi-word expressions from the gold substitutes and omit all instances that are left without gold substitutes.
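For concreteness, GAP for a single instance can be sketched as follows. This assumes the commonly used formulation in which the numerator accumulates the running average gold weight at each rank holding a gold substitute, and the denominator is the same quantity for the ideal ordering; the function name and the toy weights are illustrative and our actual implementation may differ in details.

```python
def gap(ranked_candidates, gold_weights):
    """Generalized Average Precision for one instance.

    ranked_candidates: candidate words, best first.
    gold_weights: dict mapping gold substitutes to annotator counts.
    """
    def accumulate(weights):
        total, running = 0.0, 0.0
        for rank, w in enumerate(weights, start=1):
            running += w
            if w > 0:  # only ranks that hold a gold substitute contribute
                total += running / rank
        return total

    system = [gold_weights.get(c, 0) for c in ranked_candidates]
    ideal = sorted(gold_weights.values(), reverse=True)
    denom = accumulate(ideal)
    return accumulate(system) / denom if denom else 0.0

# Toy gold annotation for "trade" (made-up annotator counts).
gold = {"swap": 3, "deal": 2, "exchange": 1}
perfect = gap(["swap", "deal", "exchange", "skill"], gold)
reversed_order = gap(["skill", "exchange", "deal", "swap"], gold)
```

A ranking that follows the gold weights exactly scores 1, and placing an irrelevant candidate first while reversing the gold order lowers the score. The corpus-level figure is the mean over instances.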

In the all-words ranking task the model is not given candidate substitutes, so it is a much harder task than the previous one. The model should give a higher probability to gold substitutes than to other words in its vocabulary, which may contain thousands of words. Commonly, datasets don’t have many annotators, while many words have a lot of possible substitutes; e.g. the word violet could be replaced with many other colors. Hence, it is challenging for the model to generate exactly the substitutes that were chosen by the annotators. Following Roller and Erk (2016), we use mean precision at 1 and 3 (P@1, P@3) as evaluation metrics for this task. Additionally, we look at recall at 10 (R@10).
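The per-instance metrics can be sketched as below, assuming precision is the fraction of the top-k predictions that are gold substitutes and recall is the fraction of gold substitutes found among the top-k predictions. These are the textbook definitions; the exact normalization in our evaluation scripts may differ, and the word lists are toy data.

```python
def precision_at_k(predicted, gold, k):
    """Fraction of the top-k predicted substitutes that appear in the gold set."""
    top = predicted[:k]
    return sum(1 for w in top if w in gold) / k

def recall_at_k(predicted, gold, k):
    """Fraction of gold substitutes that appear among the top-k predictions."""
    if not gold:
        return 0.0
    top = set(predicted[:k])
    return sum(1 for w in gold if w in top) / len(gold)

# Toy example for the target "trade".
predicted = ["swap", "skill", "deal", "craft"]
gold = {"swap", "deal", "exchange", "barter"}

p_at_1 = precision_at_k(predicted, gold, 1)
p_at_3 = precision_at_k(predicted, gold, 3)
r_at_10 = recall_at_k(predicted, gold, 10)
```

The reported numbers are these values averaged over all instances of a dataset.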

We used two lexical substitution datasets:

SemEval 2007 task McCarthy and Navigli (2007) consists of 300 dev and 1710 test sentences for 201 polysemous words. For each target word, 10 sentences are provided. Annotators’ task was to give up to 3 possible substitutes.

CoInCo or Concepts-In-Context dataset Kremer et al. (2014) consists of over 15K target instances with a given 35%/65% split. There are over 2500 sentences that come from fiction, emails, and newswires. Annotators provided at least 6 substitutes for each target.

5.2 Discussion of Results

Comparison to previously published baselines

Table 2 contains metrics for the candidate and all-words ranking tasks. We compare our best model (XLNet+embs) with the baseline models presented in Roller and Erk (2016), the context2vec (c2v) model Melamud et al. (2016), and BERT for lexical substitution presented in Zhou et al. (2019). The proposed model improves over the baselines. On the SemEval07 task, our models show results comparable to c2v, but outperform it on the CoInCo dataset. BERT for lexical substitution outperforms XLNet+embs on all tasks. Zhou et al. (2019) add a substitute validation metric that improves predictions. We expect that our models could also be improved with this technique, which leaves room for future research. It is worth mentioning that BERT and XLNet work on a sub-token level; hence, their vocabularies are smaller than those of ELMo or c2v and contain a lot of non-word tokens. We hypothesize that these models could be improved by integrating multi-token generation so that they could cover more words.

Model                  SemEval 2007          CoInCo
                       GAP    P@1/P@3        GAP    P@1/P@3
Sup. learning          55.0   -              -      -
Trans. learning        51.9   -              -      -
PIC                    52.4   19.7/14.8      48.3   18.2/13.8
Substitute vector      55.1   -              50.2   -
context2vec            56.0   -              47.9   -
BERT                   60.5   51.1/-         57.6   56.3/-
XLNet+embs             57.3   45.6/32.8      54.8   49.0/38.9
  (w/o trg excl)       57.4   13.2/18.8      51.6   14.8/24.2
  (w/o lemmat)         58.6   24.4/17.8      51.5   25.5/19.3
  (c2v post-proc)      58.8   25.8/21.0      52.9   17.7/16.8

Table 2: Comparison to previously published results. Post-processing and metrics implementation details may differ. Models: Supervised Learning Szarvas et al. (2013), Transfer Learning Hintz and Biemann (2016), PIC Roller and Erk (2016), Substitute vector Melamud et al. (2015), context2vec Melamud et al. (2016), BERT Zhou et al. (2019).

Further, Table 2 provides results for different post-processing of the substitute distribution from our XLNet+embs model. We see that post-processing has a great impact on the metrics. In PIC, the authors use the NLTK English stemmer to exclude words sharing a stem with the target, i.e. they assign zero probability to words whose stem is equal to the target stem. The code of context2vec uses the NLTK WordNet lemmatizer to lemmatize only candidates. We use the spaCy lemmatizer in our post-processing. In order to analyze the substitute distributions provided by different vectorizers independently of post-processing steps, we fixed the following post-processing variants: default post-processing (i.e. with lemmatization and target exclusion), w/o lemmatization, w/o target lemma exclusion, and c2v post-processing. In Table 1 we present results for our re-implementations of the baselines, context2vec, and the proposed generators.

Re-implementation of the baselines

Table 1 contains metrics (P@1, P@3, R@10) for the all-words ranking variation of the lexical substitution task. First, we note that pipelines based on the new line of NLP models (ELMo, BERT, XLNet) substantially outperform the word2vec-based PIC and OOC methods. We have approximately 50% relative improvement in precision@1 for SemEval07 and 60% for CoInCo. This indicates that the proposed models are better at capturing the meaning of a word in context, thus providing more accurate substitutes. We also note that combination with embeddings substantially improves the basic models. The greatest improvement in precision and recall comes for the XLNet model, e.g. for SemEval07 precision@1 improves by approximately 14%.

Injection of information about target word

Here we compare the substitute generation models described in Section 3, which differ in the type of target information injection. Figure 1 shows the Recall@10 metric on the SemEval 2007 McCarthy and Navigli (2007) dataset for each substitute generator. Applying a dynamic pattern worsens the results of the XLNet-notgt and BERT-notgt generators, but ELMo with the pattern ‘T and _’ (proposed in Amrami and Goldberg (2018)) slightly outperforms ELMo-notgt. Perhaps this is because people generate substitutes for the original sentence without a pattern, whereas in our case, although we show the target word to the substitute generator, the pattern can spoil the predictions. When we show the target word in the sentence to the substitute generator (BERT-base or XLNet-base), we overtake BERT-notgt by several percent, because the target word information allows the generator to produce more relevant substitutes. Also, the combination of a probability distribution with embedding similarity leads to a significant increase in Recall@10. For example, ELMo+embs outperforms ELMo-notgt by more than 50 percent, and XLNet+embs outperforms XLNet-base by more than 12 percent. This result means that correct information about the target word makes it possible to generate substitutes that are more similar to human substitutes and more appropriate for the context.

Figure 1: Comparison of various target information injection methods on the SemEval 2007 dataset. By default, XLNet and BERT see the target and ELMo doesn’t (ELMo-notgt is the same model as ELMo-default).

Model                         SemEv10     SemEv13
Amrami and Goldberg (2019)    53.6±1.2    37.0±0.5
Amrami and Goldberg (2018)    -
ELMo                          41.8        27.6
BERT                          52.0        34.5
XLNet                         49.6        33.7
ELMo+embs                     45.3        28.2
BERT+embs                     53.8        36.8
XLNet+embs                    51.1        36.1

Table 3: Evaluation on the word sense induction task in comparison to the current state-of-the-art models on the SemEval-2010 and SemEval-2013 tasks.

6 Extrinsic Evaluation

In this section, we show the usefulness of lexical substitution based on neural LMs in the context of two tasks: word sense induction and textual data augmentation.

6.1 Word Sense Induction

WSI is the task of identifying the senses of a target word given its usages in different contexts. In this problem, we are commonly provided with a corpus of sentences that contain a target lemma with a part of speech (POS) tag, and the word occurrences need to be clustered to obtain word senses. For example, suppose that we have the following sentences:

  1. He settled down on the river bank and contemplated the beauty of nature.

  2. They unloaded the tackle from the boat to the bank.

  3. Grand River bank now offers a profitable mortgage.

Sentences 1 and 2 must belong to one cluster, but sentence 3 must be assigned to another. This task was proposed in several SemEval competitions Agirre and Soroa (2007); Manandhar et al. (2010); Jurgens and Klapaftis (2013). The current state-of-the-art approach Amrami and Goldberg (2019) relies on substitute vectors, i.e. each word usage is represented as a context-dependent distribution over probable substitutes and clustering is performed over these distributions.

We implemented an algorithm for the word sense induction task in order to compare the proposed generators. The algorithm is based on techniques that were described in Amrami and Goldberg (2018, 2019). In the first step, we generate substitutes for each instance, lemmatize them, and take the 200 most probable. In the next step, we represent these 200 substitutes as a vector using TF-IDF. Finally, we cluster the obtained vectors using agglomerative clustering with average linkage and cosine distance.
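The three steps can be sketched in pure Python as follows. This is a simplified illustration rather than our implementation: substitutes are assumed to be already generated and lemmatized, the TF-IDF weighting uses one of several possible smoothing variants, and the number of clusters is passed in explicitly, which is an assumption of this example.

```python
import math
from collections import Counter

def tfidf_vectors(substitute_lists):
    """Represent each instance (a list of substitutes) as a sparse TF-IDF vector."""
    n = len(substitute_lists)
    df = Counter(w for subs in substitute_lists for w in set(subs))
    vectors = []
    for subs in substitute_lists:
        tf = Counter(subs)
        vectors.append({w: c * (math.log(n / df[w]) + 1.0) for w, c in tf.items()})
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity for sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (norm_u * norm_v) if norm_u and norm_v else 1.0

def average_linkage_clustering(vectors, n_clusters):
    """Greedy agglomerative clustering: repeatedly merge the two closest clusters."""
    dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[a][b] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)  # average linkage
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels

# Toy substitutes for three usages of "bank" (hypothetical model output).
substitutes = [
    ["shore", "side", "edge", "coast"],
    ["shore", "coast", "dock", "side"],
    ["lender", "institution", "company", "firm"],
]
labels = average_linkage_clustering(tfidf_vectors(substitutes), n_clusters=2)
```

On this toy input the two river-bank usages share most substitutes and end up in one cluster, while the financial usage forms its own. The naive merge loop is cubic in the number of instances, so a library implementation would be preferable at scale.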

We evaluate lexical substitutes based on neural LMs on the following datasets: SemEval-2013 and SemEval-2010. We compare our models with the current SOTA on the WSI task, Amrami and Goldberg (2019). Table 3 demonstrates that combination with embeddings helps to substantially improve the generators. For example, a combination of BERT with its embeddings (BERT+embs) improves the results of the BERT model by about 2% on both datasets. Likewise, a combination of the forward LM, the backward LM, and the proximity of ELMo embeddings between substitute and target word, i.e. the ELMo+embs generator, raises results on the SemEval-2010 task by about 4%.

Figure 2: Accuracy on Intent Classification task with different train sizes on SNIPS dataset.

6.2 Data Augmentation

Another task that could benefit from contextual substitution is data augmentation. Data augmentation techniques, e.g. image rotation and cropping, are widely used in computer vision and audio. For textual data, we don’t have straightforward augmentation techniques due to the high complexity of language. Several papers address this problem by using contextual substitutions, Kobayashi (2018); Gao et al. (2019); Wu et al. (2018) to mention a few. Since we can generate substitutes for a word in a sentence, they can be used to create simple paraphrases. In this paper, we analyze data augmentation with contextual substitutions on the Intent Classification task.

Intent Classification is necessary for personal digital assistants to decide which action to take in response to a user utterance. This is essentially a multi-class classification task. When new skills are introduced in an assistant, the number of classes grows rapidly. The number of examples for these new classes is usually small, which makes the application of modern deep learning models difficult and requires techniques like data augmentation.

In this paper we use the SNIPS dataset to study how augmentation affects Intent Classification quality. The SNIPS dataset Coucke et al. (2018) is a popular public dataset for the Intent Classification and Slot Tagging tasks, which contains 7 intents and 13084/700/700 samples in train/dev/test, respectively. Also, SNIPS has a nice feature: it is well balanced by intent. As a model for the Intent Classification task, we chose the SOTA model on SNIPS, Capsule NLU, which is a capsule-based neural network model Zhang et al. (2018). We train this model using the hyperparameters which were selected in the original paper.

To generate new examples, we use the following algorithm: we select one random word in the sentence corresponding to some slot, next we generate substitutes for this word, and then we sample one substitute with probabilities corresponding to the generated substitutes and replace the original word with the sampled substitute.
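The augmentation step can be sketched as below. The substitute generator is passed in as a callable; `toy_generator` is a hypothetical stand-in for one of the neural substitution models, and the utterance and probabilities are invented for the example.

```python
import random

def augment(tokens, slot_positions, generate_substitutes, rng=random):
    """Create one augmented example by replacing a random slot word with a sampled substitute.

    generate_substitutes(tokens, position) must return a dict {substitute: probability}.
    """
    if not slot_positions:
        return list(tokens)
    position = rng.choice(slot_positions)             # pick one word belonging to some slot
    substitutes = generate_substitutes(tokens, position)
    if not substitutes:
        return list(tokens)
    words = list(substitutes)
    weights = [substitutes[w] for w in words]
    sampled = rng.choices(words, weights=weights, k=1)[0]  # sample proportionally to probability
    return tokens[:position] + [sampled] + tokens[position + 1:]

def toy_generator(tokens, position):
    """Hypothetical stand-in for a neural substitute generator."""
    return {"rock": 0.7, "blues": 0.3}

rng = random.Random(0)
utterance = "play some jazz music".split()
augmented = augment(utterance, slot_positions=[2], generate_substitutes=toy_generator, rng=rng)
```

Sampling rather than always taking the most probable substitute keeps the augmented examples diverse when the procedure is applied several times to the same utterance.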

Figure 3: Percent of substitutes (log-scale) related to the target by various semantic relations (according to WordNet) on the CoInCo data set.

6.2.1 Results

We evaluate how data augmentation based on different substitution models affects Intent Classification performance depending on the number of examples in the train set. For all intents, we randomly sampled without replacement the same number of examples, ranging from 1% to 100%.

In Figure 2 we see that the quality of Intent Classification begins to decrease sharply when the size of the train data is reduced to 10%. Even 30% of the train set is enough to get an accuracy score close (0.5% difference) to the performance on the full dataset. Our augmentation improves the quality of Intent Classification. Figure 2 shows that augmentation gives a greater increase with a small initial train set (1%, 3%, 10%) than with a larger train set (30%, 50%, 100%).

Figure 4: Visualisation interface for the considered neural substitution models, facilitating interpretation of the results. The input sentence is placed at the top. The target word is marked by a dashed box. Gold substitutes follow, with their weights given in brackets. After the gold substitutes come the model predictions, with the number of true positives to the right of each model name. Each word is colored according to the WordNet relation between it and the target word. Here, words with a similar-to relation are highlighted in blue and words with no relation in red. For each model, the web application provides the ranks of the gold substitutes.

7 Analysis of Semantic Relation Types

In this section, we provide an analysis of types of semantic relations produced by various neural language models.

7.1 Experimental Settings

We use two lexical substitution corpora for this analysis, which were described above: the SemEval 2007 dataset McCarthy and Navigli (2007) and the CoInCo dataset Kremer et al. (2014). For each target word and each substitute, we search for the two most similar synsets in WordNet Miller (1995). Then a relation between these words is identified. If a direct relation is not available, we search for a transitive relation: for hypo/hypernyms with no limitation on path length, and for co-hyponyms with a maximum length of three hops in the graph. We give examples of the considered relations in the appendix. Then we compute statistics of relation types.

For better interpretability of various neural lexical substitution models, we developed a graphical user interface presented in Figure 4. It allows the user to select the most suitable model based on interactive processing of user input texts.

7.2 Discussion of Results

Figure 3 presents the results of the experiment. We used several neural language models to show the difference between the produced relation types for nouns and verbs. First, as one can observe, for both parts of speech a substantial fraction of words, even those from the original gold standard annotations, has no direct relation to the target in terms of WordNet. Also, we note that even human annotators make errors in the POS of a substitute or a target, e.g. for bright as an adjective someone gave glitter as a substitute. For adjectives and adverbs such cases account for 15% and 25%, respectively, and for verbs and nouns for less than 7%. Analyzing the substitutes provided by the baseline models, OOC and nPIC, we see that the unknown-word relation prevails, accounting for 40%. Partly this happens because their vocabularies contain words with typos, but we also see that these models don’t capture the POS of the target word properly for some instances. The proposed models produce far fewer substitutes that are unknown-word according to WordNet for a given POS. BERT and XLNet generate a proportion of such words comparable to the gold standard. This suggests that these models better capture the POS tag of the target word and the relations between words in a sentence.

Second, for nouns the majority of substitutes fall into either the synonym or the (transitive) co-hyponym relation class. We observe that combinations with embeddings consistently produce more synonyms than the corresponding single models, although still fewer than humans. When combined with embeddings, BERT and XLNet are on par. Without embeddings, BERT outperforms XLNet. If we look at transitive co-hyponyms (co-hyponyms-3 in the figure), we observe the opposite: models combined with embeddings produce fewer substitutes of this type, and XLNet outperforms BERT. We hypothesize that the addition of information from embeddings inclines models to produce words that are more closely related to the target word, as they lie closer to it in the WordNet tree. Analyzing other relations, we see evidence for this: the proportion of transitive hypernyms, transitive co-hyponyms, and unknown-relation decreases, while the proportion of direct hypernyms, direct hyponyms, and co-hyponyms increases.

Further, c2v and ELMo without embeddings, which don’t see the target, generate the smallest percentage of synonyms for all parts of speech except verbs. Also, these models produce many more substitutes with an unknown relation to the target word than other models. Combining these models with embeddings increases the share of all meaningful relations, i.e. co-hyponyms, transitive co-hyponyms, synonyms, etc., as we inject information about the target word.

8 Conclusion

In this paper, we presented the first large-scale computational study of three state-of-the-art neural language models (BERT, ELMo, and XLNet) and their variants on the task of lexical substitution in context. In addition to extensive experimental comparisons on several intrinsic lexical substitution benchmarks, we presented a comparison of the models in the context of two applications: word sense induction and text data augmentation.

Our findings suggest that (i) simple unsupervised approaches based on large pre-trained neural language models yield results comparable to sophisticated traditional supervised baseline approaches; (ii) integrating information about the target substantially boosts the quality of lexical substitution and should be used whenever possible.

In addition to the comparison on the benchmarks, we also showed which models tend to produce which types of semantic relations (synonyms, hypernyms, meronyms, etc.), providing valuable guidelines to practitioners aiming to use lexical substitution in applications. Indeed, the choice of neural LM should depend on the type of semantic relations required in a given NLP application.


  • [1] Cited by: §1, §2, §3.
  • E. Agirre and A. Soroa (2007) SemEval-2007 task 02: evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, pp. 7–12. External Links: Link Cited by: §6.1.
  • A. Amrami and Y. Goldberg (2018) Word sense induction with neural biLM and symmetric patterns. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4860–4867. External Links: Link, Document Cited by: §1, §5.2, Table 3, §6.1.
  • A. Amrami and Y. Goldberg (2019) Towards better substitution-based word sense induction. CoRR abs/1905.12598. External Links: Link, 1905.12598 Cited by: §3, Table 3, §6.1, §6.1, §6.1.
  • N. Arefyev, B. Sheludko, A. Davletov, D. Kharchev, A. Nevidomsky, and A. Panchenko (2019) Neural GRANNy at SemEval-2019 task 2: a combined approach for better modeling of semantic relationships in semantic frame induction. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 31–38. External Links: Link, Document Cited by: §3.
  • N. Arefyev, B. Sheludko, and A. Panchenko (2019) Combining lexical substitutes in neural word sense induction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’19), RANLP ’19, Varna, Bulgaria, pp. 62–70. External Links: Link Cited by: §3, §3, §3.
  • A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, M. Primet, and J. Dureau (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. ArXiv abs/1805.10190. Cited by: §6.2.
  • F. Gao, J. Zhu, L. Wu, Y. Xia, T. Qin, X. Cheng, W. Zhou, and T. Liu (2019) Soft contextual data augmentation for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5539–5544. External Links: Link, Document Cited by: §6.2.
  • G. Hintz and C. Biemann (2016) Language transfer learning for supervised lexical substitution. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 118–129. External Links: Link, Document Cited by: §2, Table 2.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339. External Links: Link, Document Cited by: §1.
  • D. Jurgens and I. Klapaftis (2013) SemEval-2013 task 13: word sense induction for graded and non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 290–299. External Links: Link Cited by: §6.1.
  • S. Kobayashi (2018) Contextual augmentation: data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 452–457. External Links: Link, Document Cited by: §6.2.
  • G. Kremer, K. Erk, S. Padó, and S. Thater (2014) What substitutes tell us - analysis of an “all-words” lexical substitution corpus. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp. 540–549. External Links: Link, Document Cited by: §5.1, §7.1.
  • O. Levy and Y. Goldberg (2014) Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Maryland, pp. 302–308. External Links: Link, Document Cited by: §2.
  • S. Manandhar, I. Klapaftis, D. Dligach, and S. Pradhan (2010) SemEval-2010 task 14: word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 63–68. External Links: Link Cited by: §6.1.
  • D. McCarthy and R. Navigli (2007) SemEval-2007 task 10: English lexical substitution task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, pp. 48–53. External Links: Link Cited by: §5.1, §5.1, §5.2, §7.1.
  • O. Melamud, J. Goldberger, and I. Dagan (2016) Context2vec: learning generic context embedding with bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 51–61. External Links: Link, Document Cited by: §2, §3, §4, §5.2, Table 2.
  • O. Melamud, I. Dagan, and J. Goldberger (2015) Modeling word meaning in context with substitute vectors. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 472–482. External Links: Link, Document Cited by: §5.1, Table 2.
  • O. Melamud, O. Levy, and I. Dagan (2015) A simple word embedding model for lexical substitution. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, Colorado, pp. 1–7. External Links: Link, Document Cited by: §2, §4.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. CoRR abs/1310.4546. External Links: Link, 1310.4546 Cited by: §2.
  • G. A. Miller (1995) WordNet: a lexical database for English. Communications of the ACM 38 (11), pp. 39–41. Cited by: §7.1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. External Links: Link, Document Cited by: §1, §2, §3.
  • S. Roller and K. Erk (2016) PIC a different word: a simple model for lexical substitution in context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 1121–1126. External Links: Link, Document Cited by: §2, §4, §4, §4, §5.1, §5.2, Table 2.
  • T. Schick and H. Schütze (2019) Rare words: a major problem for contextualized embeddings and how to fix it by attentive mimicking. arXiv preprint arXiv:1904.06707. Cited by: §1.
  • A. G. Soler, A. Cocos, M. Apidianaki, and C. Callison-Burch (2019) A comparison of context-sensitive models for lexical substitution. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, Gothenburg, Sweden, pp. 271–282. External Links: Link Cited by: §2.
  • G. Szarvas, C. Biemann, and I. Gurevych (2013) Supervised all-words lexical substitution using delexicalized features. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 1131–1141. External Links: Link Cited by: §2.
  • G. Szarvas, R. Busa-Fekete, and E. Hüllermeier (2013) Learning to rank lexical substitutions. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1926–1932. External Links: Link Cited by: §2, Table 2.
  • S. Thater, H. Fürstenau, and M. Pinkal (2010) Contextualizing semantic representations using syntactically enriched vector models. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 948–957. External Links: Link Cited by: §5.1.
  • X. Wu, S. Lv, L. Zang, J. Han, and S. Hu (2018) Conditional BERT contextual augmentation. CoRR abs/1812.06705. External Links: Link, 1812.06705 Cited by: §6.2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237. Pretrained models and code are available at https://github.com/zihangdai/xlnet. External Links: Link Cited by: §1, §2, §3.
  • C. Zhang, Y. Li, N. Du, W. Fan, and P. S. Yu (2018) Joint slot filling and intent detection via capsule neural networks. ArXiv abs/1812.09471. Cited by: §6.2.
  • W. Zhou, T. Ge, K. Xu, F. Wei, and M. Zhou (2019) BERT-based lexical substitution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3368–3373. External Links: Link, Document Cited by: §2, §2, §2, §5.2, Table 2.