Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures

06/09/2022
by Jonathan Dunn, et al.

This paper simulates a low-resource setting across 17 languages in order to evaluate embedding similarity, stability, and reliability under different conditions. The goal is to use corpus similarity measures before training to predict properties of embeddings after training. The main contribution of the paper is to show that it is possible to predict downstream embedding similarity using upstream corpus similarity measures. This finding is then applied to low-resource settings by modelling the reliability of embeddings created from very limited training data. Results show that it is possible to estimate the reliability of low-resource embeddings using corpus similarity measures that remain robust on small amounts of data. These findings have significant implications for the evaluation of truly low-resource languages in which such systematic downstream validation methods are not possible because of data limitations.
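The abstract does not say which corpus similarity measure is used, so the following is only a minimal sketch of one frequency-based option: Spearman rank correlation between the relative word frequencies of two corpora, computed over their most frequent shared vocabulary. The function names (`rank`, `pearson`, `corpus_similarity`) and the `top_k` parameter are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from collections import Counter


def rank(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def corpus_similarity(tokens_a, tokens_b, top_k=1000):
    """Spearman correlation of relative word frequencies in two token lists.

    Assumption: the vocabulary is the top_k most frequent words of the
    combined corpora; other feature choices are equally plausible.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocab = [w for w, _ in (freq_a + freq_b).most_common(top_k)]
    rel_a = [freq_a[w] / len(tokens_a) for w in vocab]
    rel_b = [freq_b[w] / len(tokens_b) for w in vocab]
    # Spearman correlation is Pearson correlation applied to the ranks.
    return pearson(rank(rel_a), rank(rel_b))


if __name__ == "__main__":
    a = "the cat sat on the mat the end".split()
    b = "mat mat mat end end cat".split()
    print(corpus_similarity(a, a))  # identical corpora
    print(corpus_similarity(a, b))  # differing frequency profiles
```

A score of this kind needs only raw text, which is why it can be computed before any embeddings are trained. How well it predicts downstream embedding similarity is the empirical question the paper addresses.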

research
06/09/2022

Corpus Similarity Measures Remain Robust Across Diverse Languages

This paper experiments with frequency-based corpus similarity measures a...
research
06/22/2020

Dirichlet-Smoothed Word Embeddings for Low-Resource Settings

Nowadays, classical count-based word embeddings using positive pointwise...
research
03/09/2020

Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages

The contrast between the need for large amounts of data for current Natu...
research
10/06/2017

Low-resource bilingual lexicon extraction using graph based word embeddings

In this work we focus on the task of automatically extracting bilingual ...
research
05/09/2018

Learning Word Embeddings for Low-resource Languages by PU Learning

Word embedding is a key component in many downstream applications in pro...
research
09/17/2020

More Embeddings, Better Sequence Labelers?

Recent work proposes a family of contextual embeddings that significantl...
research
10/01/2020

An Ultra Lightweight CNN for Low Resource Circuit Component Recognition

In this paper, we present an ultra lightweight system that can effective...
