Word Embeddings as Statistical Estimators

01/17/2023
by Neil Dey et al.

Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and a rigorous understanding of their theoretical properties is lacking. This paper studies word embeddings from the perspective of statistical theory, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that, under this model, the now-classical Word2Vec method can be interpreted as a statistical estimator of the theoretical pointwise mutual information (PMI). Next, building on the work of Levy and Goldberg (2014), we develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon the truncation-based method proposed by Levy and Goldberg (2014). The proposed estimator also performs comparably to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set.
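The theoretical PMI that the paper's estimators target has a simple empirical counterpart computed from co-occurrence counts. The following is a minimal sketch of empirical PMI on a toy corpus; the corpus, window size, and symmetric-pair counting are illustrative assumptions, not the paper's actual setup.

```python
import math
from collections import Counter

# Toy corpus and window size (illustrative assumptions only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
window = 2

word_counts = Counter()   # unigram counts
pair_counts = Counter()   # unordered co-occurrence counts within the window
total_pairs = 0

for sentence in corpus:
    word_counts.update(sentence)
    for i, w in enumerate(sentence):
        # Count each word with its right-hand neighbors inside the window.
        for j in range(i + 1, min(i + 1 + window, len(sentence))):
            pair_counts[tuple(sorted((w, sentence[j])))] += 1
            total_pairs += 1

total_words = sum(word_counts.values())

def pmi(w1, w2):
    """Empirical PMI: log p(w1, w2) / (p(w1) p(w2)); -inf if never co-occurring."""
    joint = pair_counts[tuple(sorted((w1, w2)))]
    if joint == 0:
        return float("-inf")
    p_joint = joint / total_pairs
    return math.log(p_joint / ((word_counts[w1] / total_words) *
                               (word_counts[w2] / total_words)))
```

The `-inf` values for never-co-occurring pairs are exactly the missing entries that motivate both the truncation-based fix of Levy and Goldberg (2014) and the missing value-based estimator proposed here.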


