To lemmatize or not to lemmatize: how word normalisation affects ELMo performance in word sense disambiguation

09/06/2019
by   Andrey Kutuzov, et al.
0

We critically evaluate the widespread assumption that deep learning NLP models do not require lemmatized input. To test this, we trained versions of contextualised word embedding ELMo models on raw tokenized corpora and on the corpora with word tokens replaced by their lemmas. Then, these models were evaluated on the word sense disambiguation task. This was done for the English and Russian languages. The experiments showed that while lemmatization is indeed not necessary for English, the situation is different for Russian. It seems that for rich-morphology languages, using lemmatized training and testing data yields small but consistent improvements: at least for word sense disambiguation. This means that the decisions about text pre-processing before training ELMo should consider the linguistic nature of the language in question.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/11/2021

Evaluation of Morphological Embeddings for English and Russian Languages

This paper evaluates morphology-based embeddings for English and Russian...
research
03/14/2020

Word Sense Disambiguation for 158 Languages using Word Embeddings Only

Disambiguation of word senses in context is easy for humans, but is a ma...
research
05/12/2018

Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation

We release to the community six large-scale sense-annotated datasets in ...
research
09/20/2021

BERT Has Uncommon Sense: Similarity Ranking for Word Sense BERTology

An important question concerning contextualized word embedding (CWE) mod...
research
03/07/2019

Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain

Word embeddings are already well studied in the general domain, usually ...
research
11/02/2018

Improving the Coverage and the Generalization Ability of Neural Word Sense Disambiguation through Hypernymy and Hyponymy Relationships

In Word Sense Disambiguation (WSD), the predominant approach generally i...
research
04/29/2017

Extending and Improving Wordnet via Unsupervised Word Embeddings

This work presents an unsupervised approach for improving WordNet that b...

Please sign up or login with your details

Forgot password? Click here to reset