On the Role of Morphological Information for Contextual Lemmatization

02/01/2023
by Olia Toporkov, et al.

Lemmatization is a Natural Language Processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. It is one of the basic tasks that facilitate downstream NLP applications, and it is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without analyzing whether this is optimal in terms of downstream performance. Thus, in this paper we empirically investigate the role of morphological information in developing contextual lemmatizers for six languages spanning a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitute, after all, their most common application use. The results of our study are rather surprising: (i) providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages; (ii) in fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain good contextual lemmatizers without seeing any explicit morphological signal; (iii) the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology; (iv) current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
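To make the task concrete, the following is a minimal illustrative sketch, not the paper's models or evaluation setup. It shows how the lemma of an ambiguous form can depend on its morphosyntactic category, and how a simple word-level accuracy score is computed. The lexicon entries and the names `TOY_LEXICON`, `lemmatize` and `word_accuracy` are hypothetical and chosen only for illustration.

```python
# Toy lookup keyed by (lowercased word form, UPOS tag).
# Entries are hand-written for illustration, not taken from the paper.
TOY_LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}


def lemmatize(form, upos):
    """Return the lemma of `form` given its UPOS tag.

    Falls back to the lowercased form when the pair is not in the lexicon.
    """
    key = (form.lower(), upos)
    return TOY_LEXICON.get(key, form.lower())


def word_accuracy(predicted, gold):
    """Fraction of tokens whose predicted lemma exactly matches the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    tokens = [("She", "PRON"), ("saw", "VERB"), ("the", "DET"), ("leaves", "NOUN")]
    predicted = [lemmatize(form, upos) for form, upos in tokens]
    gold = ["she", "see", "the", "leaf"]
    print(predicted)
    print(word_accuracy(predicted, gold))
```

The same form ("saw", "leaves") maps to different lemmas depending on the tag, which is the intuition behind supplying morphological information to a lemmatizer. A contextual model would infer that category from the surrounding sentence rather than from a lookup table.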

research
02/13/2017

A Morphology-aware Network for Morphological Disambiguation

Agglutinative languages such as Turkish, Finnish and Hungarian require m...
research
03/11/2021

Evaluation of Morphological Embeddings for English and Russian Languages

This paper evaluates morphology-based embeddings for English and Russian...
research
05/25/2023

Morphological Inflection: A Reality Check

Morphological inflection is a popular task in sub-word NLP with both pra...
research
04/25/2020

Hierarchical Multi Task Learning with Subword Contextual Embeddings for Languages with Rich Morphology

Morphological information is important for many sequence labeling tasks ...
research
11/24/2020

Enhancing deep neural networks with morphological information

Currently, deep learning approaches are superior in natural language pro...
research
06/16/2023

How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese

This paper investigates the effect of tokenizers on the downstream perfo...
research
08/09/2021

A Neural Approach for Detecting Morphological Analogies

Analogical proportions are statements of the form "A is to B as C is to ...
