Evaluation of Croatian Word Embeddings

11/06/2017
by Lukas Svoboda, et al.

Croatian is a poorly resourced and highly inflected language from the Slavic language family. Most current research focuses on English. We created a new word analogy corpus based on the original English Word2vec word analogy corpus and added some linguistic aspects specific to Croatian. We also created Croatian WordSim353 and RG65 corpora for a basic evaluation of word similarities. We evaluated two popular word representation models, based on the Word2Vec and fastText tools, on the created corpora. The models were trained on a 1.37B-token training corpus and tested on the new robust Croatian word analogy corpus. The results show that the models are able to create meaningful word representations. This research shows that the free word order and higher morphological complexity of Croatian influence the quality of the resulting word embeddings.
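To make the word analogy evaluation concrete, here is a minimal sketch of how such a test is commonly scored: for a question "a is to b as c is to ?", the answer is the vocabulary word whose vector is closest (by cosine similarity) to b - a + c, with the three question words excluded. This vector-offset scoring is the usual approach for Word2vec-style analogy corpora and is assumed here; the abstract does not state the exact procedure used. The toy 2-dimensional vectors and the function names (`cosine`, `solve_analogy`, `analogy_accuracy`) are hypothetical and only illustrate the idea; they are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The target is b - a + c; the question words themselves are
    excluded from the candidate answers.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        1 for a, b, c, expected in questions
        if solve_analogy(vectors, a, b, c) == expected
    )
    return correct / len(questions)


# Hypothetical toy embeddings (assumption: hand-made, not trained).
TOY_VECTORS = {
    "muškarac": [1.0, 0.0],   # man
    "žena": [1.0, 1.0],       # woman
    "kralj": [3.0, 0.0],      # king
    "kraljica": [3.0, 1.0],   # queen
    "grad": [0.0, 3.0],       # city (distractor)
}

# Hypothetical analogy questions in the (a, b, c, expected) format.
TOY_QUESTIONS = [
    ("muškarac", "žena", "kralj", "kraljica"),
    ("kralj", "kraljica", "muškarac", "žena"),
]

if __name__ == "__main__":
    print(analogy_accuracy(TOY_VECTORS, TOY_QUESTIONS))
```

With real embeddings, `vectors` would be loaded from a trained model and the candidate set would be the full vocabulary. For a highly inflected language this makes the task harder, since many inflected forms of the same lemma compete as near neighbours of the target.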

