Training Cross-Lingual Embeddings for Setswana and Sepedi

11/11/2021
by Mack Makgatho, et al.

African languages still lag behind in advances in Natural Language Processing, one reason being the lack of representative data; a technique that can transfer information between languages can help mitigate this data-scarcity problem. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create Setswana-Sepedi cross-lingual embeddings in order to perform cross-lingual transfer.

Word embeddings are word vectors that represent words as continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space. The idea of word embeddings is based on the distributional hypothesis, which states that semantically similar words are distributed in similar contexts (Harris, 1954). Cross-lingual embeddings leverage monolingual embeddings by learning a shared vector space for two separately trained monolingual vector spaces, such that words with similar meanings are represented by similar vectors.

In this paper, we investigate cross-lingual embeddings for Setswana-Sepedi monolingual word vectors. We use the unsupervised cross-lingual embedding method in VecMap to train Setswana-Sepedi cross-lingual word embeddings. We evaluate the quality of the Setswana-Sepedi cross-lingual word representations on a semantic similarity task. For this task, we translated the WordSim and SimLex datasets into Setswana and Sepedi, and we release these datasets as part of this work for other researchers. We evaluate the intrinsic quality of the embeddings to determine whether there is an improvement in the semantic representation of the word embeddings.
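The core idea of mapping two separately trained monolingual spaces into one shared space can be sketched with an orthogonal Procrustes alignment, which is the kind of linear mapping VecMap learns. The sketch below uses tiny random toy vectors rather than real Setswana/Sepedi embeddings, and the seed-pair setup is a simplifying assumption (VecMap's unsupervised mode bootstraps the dictionary instead of being given one):

```python
import numpy as np

# Toy "monolingual" embeddings for a handful of seed word pairs.
# These are hypothetical vectors; in the paper the vectors come from
# monolingual training on Setswana and Sepedi corpora.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                    # source vectors (e.g. Setswana)
true_W, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ true_W                                 # target vectors (e.g. Sepedi)

# Orthogonal Procrustes: find orthogonal W minimising ||XW - Y||_F,
# solved in closed form via the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

mapped = X @ W                                 # source vectors in the shared space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After mapping, each source vector should align with its translation.
print(round(cosine(mapped[0], Y[0]), 3))       # → 1.0 on this noise-free toy data
```

With real embeddings the alignment is only approximate, which is why the paper evaluates intrinsic quality on translated WordSim and SimLex similarity tasks rather than expecting exact recovery.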


