Unsupervised Lemmatization as Embeddings-Based Word Clustering

08/22/2019
by Rudolf Rosa, et al.

We focus on the task of unsupervised lemmatization, i.e., grouping inflected forms of one word under one label (a lemma) without using annotated training data. We propose to perform agglomerative clustering of word forms with a novel distance measure. The measure is based on the observation that inflections of the same word tend to be similar both string-wise and in meaning. We therefore combine word embedding cosine similarity, which serves as a proxy for similarity in meaning, with Jaro-Winkler edit distance. Our experiments on 23 languages show the approach to be promising: it surpasses the baseline on 23 of the 28 evaluation datasets.
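
The idea can be illustrated with a minimal Python sketch. The abstract does not state how the two signals are combined, so several choices below are assumptions made only for illustration: the string and embedding similarities are multiplied after rescaling cosine to [0, 1], clusters are merged with average linkage, merging stops at a hand-picked threshold, and the 2-dimensional vectors are toy data rather than trained embeddings. The names form_distance and cluster are hypothetical.

```python
import math
from itertools import combinations


def jaro(s, t):
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def form_distance(w1, w2, emb):
    # ASSUMPTION: combine by multiplying the string similarity with the
    # cosine similarity rescaled from [-1, 1] to [0, 1].
    semantic = (cosine(emb[w1], emb[w2]) + 1) / 2
    return 1 - jaro_winkler(w1, w2) * semantic


def cluster(words, emb, threshold):
    """Naive average-linkage agglomerative clustering (assumed linkage)."""
    clusters = [[w] for w in words]

    def linkage(a, b):
        total = sum(form_distance(x, y, emb) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break  # the closest pair is too far apart, so stop merging
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Toy vectors (made up): forms of "walk" point one way, forms of "talk" another.
emb = {
    "walk": (1.0, 0.1), "walks": (0.9, 0.2), "walked": (1.0, 0.0),
    "talk": (0.1, 1.0), "talks": (0.2, 0.9),
}
words = list(emb)
clusters = cluster(words, emb, threshold=0.25)
print(clusters)
```

In this toy setting "walk" and "talk" are close as strings, and it is the embedding term that keeps them in separate clusters. With real data the threshold and linkage would need tuning, and a library implementation of agglomerative clustering would scale better than this quadratic loop.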

research
10/25/2018

Word Embedding based Edit Distance

Text similarity calculation is a fundamental problem in natural language...
research
11/16/2022

Neural Unsupervised Reconstruction of Protolanguage Word Forms

We present a state-of-the-art neural approach to the unsupervised recons...
research
05/04/2018

A Rank-Based Similarity Metric for Word Embeddings

Word Embeddings have recently imposed themselves as a standard for repre...
research
09/16/2018

Semi-Supervised Multi-Task Word Embeddings

Word embeddings have been shown to benefit from ensembling several word ...
research
12/22/2017

Novel Ranking-Based Lexical Similarity Measure for Word Embedding

Distributional semantics models derive word space from linguistic items ...
research
04/17/2021

DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages

Word meaning is notoriously difficult to capture, both synchronically an...
research
07/23/2022

Context based lemmatizer for Polish language

Lemmatization is the process of grouping together the inflected forms of...
