AfriVEC: Word Embedding Models for African Languages. Case Study of Fon and Nobiin

03/08/2021
by Bonaventure F. P. Dossou, et al.

From Word2Vec to GloVe, word embedding models have played key roles in the current state-of-the-art results achieved in Natural Language Processing. Designed to give meaningful and unique vector representations of words and entities, these models have proven able to efficiently extract similarities and establish relationships that reflect semantic and contextual meaning among words and entities. African Languages, representing more than 31 spoken languages, have recently been the subject of much research. However, to the best of our knowledge, there are currently few, if any, word embedding models for the words and entities of those languages, and none for the languages studied in this paper. After describing how GloVe, Word2Vec, and Poincaré embeddings work, we build Word2Vec and Poincaré word embedding models for Fon and Nobiin, which show promising results. We test the applicability of transfer learning between these models as a landmark for African Languages to jointly take part in mitigating the scarcity of their resources, and attempt to provide linguistic and social interpretations of our results. Our main contribution is to arouse more interest in creating ready-to-use word embedding models specific to African Languages that can significantly improve the performance of downstream Natural Language Processing tasks on them. The official repository and implementation are at https://github.com/bonaventuredossou/afrivec
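
To illustrate how the two model families named in the abstract measure closeness between words, here is a minimal sketch in plain Python. It is not taken from the AfriVEC repository: the function names and the toy vectors are assumptions chosen for illustration only. Word2Vec-style vectors are usually compared with cosine similarity, while Poincaré embeddings place points inside the unit ball and compare them with the hyperbolic distance d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors (Euclidean view)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Hypothetical toy vectors, not real Fon or Nobiin embeddings.
origin = [0.0, 0.0]
near_centre = [0.5, 0.0]
near_boundary = [0.95, 0.0]

# Distances grow quickly as a point approaches the boundary of the ball,
# which is what lets hyperbolic space encode hierarchies compactly.
d_centre = poincare_distance(origin, near_centre)
d_boundary = poincare_distance(origin, near_boundary)
```

In this toy setup `d_boundary` is larger than `d_centre` by much more than the Euclidean gap between the points would suggest, which is the property that makes the Poincaré ball suited to tree-like relations among entities.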

research
08/02/2016

New word analogy corpus for exploring embeddings of Czech words

The word embedding methods have been proven to be very useful in many ta...
research
05/19/2017

A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese

Psycholinguistic properties of words have been used in various approache...
research
10/19/2022

DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models

We study the way DALLE-2 maps symbols (words) in the prompt to their ref...
research
07/29/2019

A Mathematical Model for Linguistic Universals

Inspired by chemical kinetics and neurobiology, we propose a mathematica...
research
02/27/2020

The Spectral Underpinning of word2vec

word2vec due to Mikolov et al. (2013) is a word embedding method that is...
research
06/10/2018

LexNLP: Natural language processing and information extraction for legal and regulatory texts

LexNLP is an open source Python package focused on natural language proc...
