Learning Word Embeddings for Low-resource Languages by PU Learning

05/09/2018
by   Chao Jiang, et al.

Word embedding is a key component in many downstream applications for processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse, as the co-occurrences of many word pairs are unobserved. In contrast to existing approaches, which often sample only a few unobserved word pairs as negative samples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We then design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix and validate the proposed approach on four different languages.
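The core idea in the abstract, treating zero entries as weakly labeled negatives rather than ignoring them, can be illustrated with a weighted matrix factorization. The sketch below is a minimal illustration, not the authors' implementation: the weight `rho` for zero entries, the rank `k`, and the plain gradient-descent solver are all assumptions made for the example.

```python
import numpy as np

def pu_factorize(M, k=8, rho=0.05, lam=0.1, lr=0.05, epochs=200, seed=0):
    """PU-style weighted factorization of a co-occurrence matrix M.

    Observed (nonzero) entries receive full weight 1.0; unobserved (zero)
    entries are treated as weak negatives with a small weight rho, instead
    of being dropped or subsampled as in negative-sampling approaches.
    Returns word vectors U and context vectors V with M ~ U @ V.T.
    """
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    W = np.where(M > 0, 1.0, rho)        # per-entry confidence weights
    for _ in range(epochs):
        R = U @ V.T - M                  # reconstruction residual
        G = W * R                        # weighted residual
        U -= lr * (G @ V + lam * U)      # gradient step, L2-regularized
        V -= lr * (G.T @ U + lam * V)
    return U, V
```

Setting `rho = 0` recovers a factorization that fits only observed pairs; a small positive `rho` pushes embeddings of never-co-occurring words apart, which is the extra signal the zero entries carry.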


Related research

03/09/2020 · Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages
The contrast between the need for large amounts of data for current Natu...

06/22/2020 · Dirichlet-Smoothed Word Embeddings for Low-Resource Settings
Nowadays, classical count-based word embeddings using positive pointwise...

01/11/2023 · Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk sampling
Word embedding has become ubiquitous and is widely used in various text ...

07/06/2019 · TEAGS: Time-aware Text Embedding Approach to Generate Subgraphs
Contagions (e.g. virus, gossip) spread over the nodes in propagation gra...

07/06/2019 · TEALS: Time-aware Text Embedding Approach to Leverage Subgraphs
Given a graph over which the contagions (e.g. virus, gossip) propagate, ...

06/09/2022 · Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures
This paper simulates a low-resource setting across 17 languages in order...

10/26/2022 · Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages
In the process of numerically modeling natural languages, developing lan...
