Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa – A Large Romanian Sentiment Data Set

01/11/2021
by Anca Maria Tache, et al.

Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, composed of 15,000 positive and negative reviews collected from one of the largest Romanian e-commerce platforms. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings follow a distribution closer to Zipf's law, which is known to govern natural language. We also demonstrate the generalization capacity of SOM-based clustering of word embeddings on another recently introduced Romanian data set for text categorization by topic.
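
The bag-of-word-embeddings idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the SOM below is a textbook online variant written in plain Python, and the function names (`train_som`, `bag_of_word_embeddings`), the hyperparameters, the 2-D toy vectors, the example words, and the normalization of the histogram are all assumptions made for illustration. Each word embedding is assigned to its closest SOM unit, and a document is then represented by how often each unit occurs among its words.

```python
import math
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(units, vec):
    """Index of the SOM unit (cluster centre) closest to vec."""
    return min(range(len(units)), key=lambda i: sq_dist(units[i], vec))

def train_som(vectors, rows, cols, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM with linearly decaying rate and radius."""
    rng = random.Random(seed)
    # Initialise each unit from a randomly chosen input vector.
    units = [list(rng.choice(vectors)) for _ in range(rows * cols)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    total, step = epochs * len(vectors), 0
    for _ in range(epochs):
        order = vectors[:]
        rng.shuffle(order)
        for vec in order:
            frac = 1.0 - step / total
            lr = lr0 * frac
            sigma = max(sigma0 * frac, 1e-3)
            br, bc = grid[nearest(units, vec)]  # best-matching unit
            for i, (r, c) in enumerate(grid):
                # Gaussian neighbourhood on the grid around the winner.
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                units[i] = [u + lr * h * (x - u) for u, x in zip(units[i], vec)]
            step += 1
    return units

def bag_of_word_embeddings(tokens, embeddings, units):
    """Normalised histogram of SOM units hit by the document's words."""
    counts = Counter(nearest(units, embeddings[t]) for t in tokens if t in embeddings)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[i] / total for i in range(len(units))]

# Hypothetical 2-D "embeddings"; real ones would have hundreds of dimensions.
embeddings = {
    "bun": [1.0, 0.9], "excelent": [0.9, 1.0],
    "prost": [-1.0, -0.9], "slab": [-0.9, -1.0],
}
units = train_som(list(embeddings.values()), rows=1, cols=2)
hist = bag_of_word_embeddings(["bun", "excelent", "slab"], embeddings, units)
print(hist)
```

The resulting histogram has one entry per SOM unit and could be passed to any standard classifier. Swapping `train_som` for a k-means routine that returns centroids would give the k-means baseline under the same assumptions, since only the way the cluster centres are learned changes.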


