Gibberish Semantics: How Good is Russian Twitter in Word Semantic Similarity Task?

02/28/2016
by   Nikolay N. Vasiliev, et al.

The most studied and most successful language models have been developed and evaluated mainly for English and closely related European languages such as French and German. It is important to study the applicability of these models to other languages. The use of vector space models for Russian has recently been studied on several corpora, such as Wikipedia, RuWac, and lib.ru, and the resulting models were evaluated on the word semantic similarity task. To our knowledge, Twitter has not been considered as a corpus for this task; with this work we fill that gap. Vectors trained on a Twitter corpus are comparable in accuracy to other single-corpus models, although the best performance is currently achieved by combining multiple corpora.
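The word semantic similarity evaluation mentioned above is typically run by scoring word pairs with the model (e.g. cosine similarity between vectors) and correlating those scores with human judgments via Spearman's rank correlation. A minimal sketch, using hypothetical toy vectors and made-up gold-standard scores purely for illustration (the actual corpora, vectors, and benchmark pairs are not reproduced here):

```python
import math

# Hypothetical toy embeddings standing in for Twitter-trained word vectors.
vectors = {
    "кот":    [0.9, 0.1, 0.0],
    "кошка":  [0.8, 0.2, 0.1],
    "машина": [0.1, 0.9, 0.3],
    "авто":   [0.2, 0.7, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def spearman(xs, ys):
    """Spearman rank correlation (no-ties case)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical gold-standard human similarity judgments: (word1, word2, score).
gold = [
    ("кот", "кошка", 0.95),
    ("машина", "авто", 0.90),
    ("кот", "машина", 0.10),
    ("кошка", "авто", 0.15),
]

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in gold]
human_scores = [s for _, _, s in gold]
rho = spearman(model_scores, human_scores)
print(rho)  # → 1.0 (the toy model ranks all pairs in the same order as the gold scores)
```

A higher Spearman correlation means the model's similarity ranking agrees more closely with human judgments; this is the single number usually reported when comparing corpora.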


