Tracing cultural diachronic semantic shifts in Russian using word embeddings: test sets and baselines

05/16/2019
by Vadim Fomin, et al.

The paper introduces manually annotated test sets for the task of tracing diachronic (temporal) semantic shifts in Russian. The two test sets are complementary: the first covers comparatively strong semantic changes in nouns and adjectives from pre-Soviet to Soviet times, while the second covers comparatively subtle, socially and culturally determined shifts that occurred between 2000 and 2014. The second test set also offers a more granular classification of the degree of shift, but is limited to adjectives. The test sets allowed us to evaluate several well-established algorithms for semantic shift detection (posed as a classification problem), most of which have never been tested on Russian material. All of these algorithms use distributional word embedding models trained on the corresponding in-domain corpora. The resulting scores provide solid comparison baselines for future studies tackling similar tasks. We publish the datasets, code, and trained models to facilitate further research on automatically detecting temporal semantic shifts in Russian words across time periods of different granularities.
