Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

01/19/2018
by   Andrey Kutuzov, et al.

In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to a model trained on the Russian National Corpus (RNC). The two corpora differ substantially in both size and compilation procedure. We examine the impact of these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similarity dataset. We detect and describe numerous issues in this dataset and publish a new, corrected version. Beyond the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain fine-grained differences in how the models handle the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, we describe the learning curves for both models, showing that the RNC is generally more robust as training material for this task.


Related research

10/04/2017
Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
We present DepCC, the largest to date linguistically analyzed corpus in ...

09/15/2021
SWEAT: Scoring Polarization of Topics across Different Corpora
Understanding differences of viewpoints across corpora is a fundamental ...

02/28/2016
Gibberish Semantics: How Good is Russian Twitter in Word Semantic Similarity Task?
The most studied and most successful language models were developed and ...

05/19/2023
Contextualized Word Vector-based Methods for Discovering Semantic Differences with No Training nor Word Alignment
In this paper, we propose methods for discovering semantic differences i...

05/16/2019
Tracing cultural diachronic semantic shifts in Russian using word embeddings: test sets and baselines
The paper introduces manually annotated test sets for the task of tracin...

01/10/2017
Implicitly Incorporating Morphological Information into Word Embedding
In this paper, we propose three novel models to enhance word embedding b...

04/30/2015
Texts in, meaning out: neural language models in semantic similarity task for Russian
Distributed vector representations for natural language vocabulary get a...
