PIP Distance: A Unitary-invariant Metric for Understanding Functionality and Dimensionality of Vector Embeddings

03/01/2018
by Zi Yin et al.

In this paper, we present a theoretical framework for understanding vector embeddings, a fundamental building block of many deep learning models, especially in NLP. We discover a natural unitary-invariance in vector embeddings, which is required by the distributional hypothesis. This unitary-invariance states that two embeddings are essentially equivalent if one can be obtained from the other by a relative-geometry-preserving transformation, such as a rotation. This idea leads to the Pairwise Inner Product (PIP) loss, a natural unitary-invariant metric for the distance between two embeddings. We demonstrate that the PIP loss captures the difference in functionality between embeddings. By formulating the embedding training process as matrix factorization under noise, we reveal a fundamental bias-variance tradeoff in dimensionality selection. With tools from perturbation and stability theory, we provide an upper bound on the PIP loss using the signal spectrum and noise variance, both of which can be readily inferred from data. Our framework sheds light on many empirical phenomena, including the existence of an optimal dimension and the robustness of embeddings against over-parametrization. The bias-variance tradeoff of the PIP loss explicitly answers the fundamental open problem of dimensionality selection for vector embeddings.
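As a minimal sketch of the metric the abstract describes: the PIP loss between two embedding matrices E1 and E2 can be written as the Frobenius norm of the difference of their pairwise-inner-product matrices, ||E1 E1ᵀ − E2 E2ᵀ||_F. Because (E1 U)(E1 U)ᵀ = E1 E1ᵀ for any unitary U, the metric is unitary-invariant by construction. The function and variable names below are illustrative, not from the paper's code.

```python
import numpy as np

def pip_loss(E1, E2):
    """PIP loss: Frobenius norm of the difference between the
    pairwise-inner-product (Gram) matrices of two embeddings.
    Unitary-invariant, since (E1 @ U) @ (E1 @ U).T == E1 @ E1.T
    for any unitary matrix U."""
    return np.linalg.norm(E1 @ E1.T - E2 @ E2.T, ord="fro")

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3))           # toy embedding: 5 tokens, dim 3
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random rotation

# A rotated copy of E has (numerically) zero PIP distance from E,
# while a rescaled copy does not.
print(pip_loss(E, E @ U))
print(pip_loss(E, 2 * E))
```

The first printed value is zero up to floating-point error, illustrating that rotated embeddings are functionally equivalent under this metric, while the second is strictly positive.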
