Clustering Comparable Corpora of Russian and Ukrainian Academic Texts: Word Embeddings and Semantic Fingerprints

04/18/2016
by Andrey Kutuzov, et al.

We present our experience in applying distributional semantics (neural word embeddings) to the problem of representing and clustering documents in a bilingual comparable corpus. Our data is a collection of Russian and Ukrainian academic texts, whose topics are the academic fields they belong to. In order to build language-independent semantic representations of these documents, we train neural distributional models on monolingual corpora and learn the optimal linear transformation of vectors from one language to the other. The resulting vectors are then used to produce 'semantic fingerprints' of documents, which serve as input to a clustering algorithm. The presented method is compared to several baselines, including 'orthographic translation' with Levenshtein edit distance, and outperforms them by a large margin. We also show that language-independent 'semantic fingerprints' are superior to multilingual clustering algorithms proposed in previous work, while requiring fewer linguistic resources.
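The abstract outlines a three-step pipeline: learn a linear map between the two monolingual embedding spaces from a seed dictionary, average the (mapped) word vectors of a document into a 'semantic fingerprint', and cluster the fingerprints. Below is a minimal sketch of that idea in Python. It assumes gensim-style KeyedVectors models for Ukrainian and Russian and a small list of translation pairs; the least-squares fit, the function names, and the use of KMeans are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_mapping(src_vectors, tgt_vectors):
    """Least-squares linear map W such that src @ W approximates tgt
    (a 'translation matrix' between the two embedding spaces)."""
    W, *_ = np.linalg.lstsq(src_vectors, tgt_vectors, rcond=None)
    return W

def semantic_fingerprint(tokens, model, mapping=None):
    """Average of (optionally mapped) word vectors -> one normalized document vector."""
    vecs = [model[w] for w in tokens if w in model]
    if not vecs:
        return np.zeros(model.vector_size)
    doc = np.mean(vecs, axis=0)
    if mapping is not None:
        doc = doc @ mapping          # project into the target-language space
    return doc / (np.linalg.norm(doc) or 1.0)

def cluster_bilingual(uk_docs, ru_docs, uk_model, ru_model, seed_pairs, k):
    # seed_pairs: (Ukrainian word, Russian word) translation pairs
    pairs = [(u, r) for u, r in seed_pairs if u in uk_model and r in ru_model]
    src = np.vstack([uk_model[u] for u, _ in pairs])
    tgt = np.vstack([ru_model[r] for _, r in pairs])
    W = learn_mapping(src, tgt)      # map Ukrainian vectors into the Russian space

    fingerprints = np.vstack(
        [semantic_fingerprint(d, uk_model, W) for d in uk_docs]
        + [semantic_fingerprint(d, ru_model) for d in ru_docs]
    )
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(fingerprints)
```

With language-independent fingerprints of this kind, documents from both languages live in one vector space, so a standard clustering algorithm can group them by academic field without any per-language tuning; the orthographic-translation baseline mentioned above would instead match words across languages by Levenshtein edit distance before building document representations.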


