Understanding the Downstream Instability of Word Embeddings

02/29/2020
by Megan Leszczynski et al.

Many industrial machine learning (ML) systems require frequent retraining to keep up-to-date with constantly changing data. This retraining exacerbates a large challenge facing ML systems today: model training is unstable, i.e., small changes in training data can cause significant changes in the model's predictions. In this paper, we work on developing a deeper understanding of this instability, with a focus on how a core building block of modern natural language processing (NLP) pipelines—pre-trained word embeddings—affects the instability of downstream NLP models. We first empirically reveal a tradeoff between stability and memory: increasing the embedding memory 2x can reduce the disagreement in predictions due to small changes in training data by 5% to 37% (relative). To theoretically explain this tradeoff, we introduce a new measure of embedding instability—the eigenspace instability measure—which we prove bounds the disagreement in downstream predictions introduced by the change in word embeddings. Practically, we show that the eigenspace instability measure can be a cost-effective way to choose embedding parameters to minimize instability without training downstream models, outperforming other embedding distance measures and performing competitively with a nearest neighbor-based measure. Finally, we demonstrate that the observed stability-memory tradeoffs extend to other types of embeddings as well, including knowledge graph and contextual word embeddings.
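The paper defines the eigenspace instability measure via eigendecompositions of the two embedding matrices' Gram matrices. As a rough illustration of the underlying idea—not the authors' released code—the NumPy sketch below computes (a) the downstream prediction disagreement between two models and (b) a simplified subspace-based embedding distance in the spirit of the eigenspace overlap score from "On the Downstream Performance of Compressed Word Embeddings" (listed under related research below). All function names, shapes, and the simplified formula here are illustrative assumptions.

```python
# Illustrative sketch only (assumed names/shapes, not the paper's code).
import numpy as np


def prediction_disagreement(preds_a, preds_b):
    """Fraction of examples on which two downstream models' predicted
    labels differ -- the instability quantity studied in the paper."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))


def subspace_instability(X, X_tilde, k):
    """Simplified proxy for embedding instability: 1 minus the normalized
    overlap between the top-k left-singular subspaces of two embedding
    matrices (rows = words). Returns 0 for identical subspaces and
    approaches 1 for orthogonal ones."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    U, U_t = U[:, :k], U_t[:, :k]
    # ||U^T U_t||_F^2 is at most k for orthonormal bases, so this is in [0, 1].
    overlap = np.linalg.norm(U.T @ U_t, ord="fro") ** 2 / k
    return 1.0 - overlap


# Toy usage: a second embedding trained on slightly perturbed data is
# simulated by adding small noise to the first.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))  # 1,000 words, 50 dimensions
X_tilde = X + 0.1 * rng.standard_normal(X.shape)
print(subspace_instability(X, X_tilde, k=25))
```

One motivation for comparing left-singular subspaces is that, for linear downstream models, predictions depend on the embedding matrix largely through the subspace its columns span over the vocabulary, so small subspace changes suggest small changes in downstream predictions.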

Related research:

- Second-Order Word Embeddings from Nearest Neighbor Topological Features (05/23/2017): We introduce second-order vector representations of words, induced from ...
- On the Downstream Performance of Compressed Word Embeddings (09/03/2019): Compressing word embeddings is important for deploying NLP models in mem...
- Word Embeddings: Stability and Semantic Change (07/23/2020): Word embeddings are computed by a class of techniques within natural lan...
- Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems (08/11/2021): The industrial machine learning pipeline requires iterating on model fea...
- Pairwise Inner Product Distance: Metric for Functionality, Stability, Dimensionality of Vector Embedding (03/01/2018): In this paper, we present a theoretical framework for understanding vect...
- PIP Distance: A Unitary-invariant Metric for Understanding Functionality and Dimensionality of Vector Embeddings (03/01/2018): In this paper, we present a theoretical framework for understanding vect...
- MOFSimplify: Machine Learning Models with Extracted Stability Data of Three Thousand Metal-Organic Frameworks (09/16/2021): We report a workflow and the output of a natural language processing (NL...
