1 Introduction
Data is more dynamic than ever before: every input, interaction, and response is captured and archived in hopes of extracting insights with machine learning (ML) models. To stay up-to-date, models must be frequently retrained, with the freshness of models becoming a requirement for user satisfaction in numerous products, from ads He et al. (2014) to recommendation systems Covington et al. (2016). However, frequent retraining can lead to large and unwanted fluctuations in model predictions due to the instability of many machine learning training algorithms: minimal changes in training data can produce significantly different predictions Fard et al. (2016). From discussions with engineers at an e-commerce firm, an online social media company, and a Fortune 500 software company, we found that instability from retraining is one of their largest, and also most under-addressed, pain points. As a result of instability, ML engineers struggle to identify genuine concept shifts, spend more time tracking down regressions, and require more resources to retrain downstream model dependencies. Diagnosing and reducing instability in a cost-effective way is a major challenge for today's machine learning pipelines.
In this work, we take a first step toward addressing the problem of ML model instability by examining in detail a core building block of most modern natural language processing (NLP) applications: word embeddings Mikolov et al. (2013a, b); Pennington et al. (2014); Bojanowski et al. (2017). Several recent works have shown that word embeddings are unstable, with the nearest neighbors to words varying significantly across embeddings trained under different settings Hellrich & Hahn (2016); Antoniak & Mimno (2018); Wendlandt et al. (2018); Pierrejean & Tanguy (2018); Chugh et al. (2018); Hellrich et al. (2019). These results may cause researchers using embeddings for analysis to reassess the reliability of their conclusions. Moreover, these results raise questions about how the embedding instability impacts downstream NLP tasks—an area which remains largely unexplored and which we focus on in this work. We define the downstream instability between a pair of word embeddings as the percentage of predictions which change between the models trained on the two embeddings for a given task. By this notion of instability, we find that 15%
of predictions on a sentiment analysis task can disagree due to training the embeddings on an accumulated dataset with just 1% more data. In embedding servers, where an embedding is reused among multiple downstream tasks
Hermann & Balso (2017); Gordon (2018); Shiebler et al. (2018); Sell & Pienaar (2019), the impact of this instability can be quickly amplified. Understanding this downstream instability is challenging, however, because it requires both theoretical and empirical insights into how the embedding instability propagates to downstream tasks. The goal of this paper is to develop a deeper understanding of the downstream instability of word embeddings. This understanding could both drive the design choices for embedding systems (i.e., choosing hyperparameters) and lead to efficient techniques to distinguish between unstable and stable embeddings without training downstream models. To achieve this, we perform a study on the downstream instability of word embeddings across multiple embedding algorithms and downstream tasks. Our study exposes a novel tradeoff between stability and another critical property of embeddings—
memory. We find that increasing the memory can lead to more stable embeddings, with a 2× increase in memory reducing the percentage prediction disagreement on downstream tasks by 5% to 37% (relative). Determining how the memory affects the instability is not straightforward: factors like the dimension, a hyperparameter controlling the expressiveness of the embedding, and the precision, the number of bits used per entry in the embedding after compression, can independently affect the instability and interact in unexpected ways. To better understand the stability-memory tradeoff empirically, we study the effects of dimension and precision both in isolation and together. This important stability-memory tradeoff leads us to ask two key questions: (1) theoretically, how can we explain this tradeoff, and (2) practically, how can we select the dimension-precision parameters (for brevity, we refer to a pair of dimension and precision parameters as the "dimension-precision" parameters) to minimize the downstream instability? To theoretically explain the stability-memory tradeoff, we introduce a new measure for embedding instability—the eigenspace instability measure—which we theoretically relate to downstream instability in the case of linear regression models. The eigenspace instability measure builds on the eigenspace overlap score
May et al. (2019), and measures the degree of similarity between the eigenvectors of the Gram matrices of a pair of embeddings, weighted by their eigenvalues. We show that the expected downstream disagreement between the linear regression models trained on two embedding matrices can be expressed in terms of the eigenspace instability measure. Furthermore, these theoretical insights have a practical application: we propose using the eigenspace instability measure to efficiently select dimension-precision parameters with low downstream instability, without having to train downstream models.
We empirically validate that the eigenspace instability measure correlates strongly with the downstream instability and that the measure is effective as a selection criterion for the dimension-precision parameters. First, we show that the theoretically grounded eigenspace instability measure more strongly correlates with downstream instability than the majority of the other embedding distance measures (i.e., semantic displacement Hamilton et al. (2016), the PIP loss Yin & Shen (2018), and the eigenspace overlap score May et al. (2019)) and attains Spearman correlations from 0.04 better to 0.09 worse than the other top-performing measure, the k-NN measure (e.g., Hellrich & Hahn (2016); Antoniak & Mimno (2018); Wendlandt et al. (2018)), which lacks theoretical guarantees. Next, we show that when using an embedding distance measure to choose the more stable dimension-precision parameters out of a pair of choices, the eigenspace instability measure achieves up to 3.33× lower error rates than the weaker measures and error rates comparable to that of the k-NN measure. On the more challenging task of selecting the combination of dimension and precision under a memory budget, we show that the eigenspace instability measure attains a difference in prediction disagreement relative to the oracle up to 2.98% (absolute) better than the weaker baselines and within 0.35% (absolute) of the k-NN measure.
To summarize, we make the following contributions:


We study the downstream instability of word embeddings, revealing a novel stability-memory tradeoff. In particular, we study the impact of two key parameters, dimension and precision, and propose a simple rule of thumb relating the embedding memory and downstream instability (Section 3).

To theoretically explain this tradeoff, we introduce a new measure for embedding instability, the eigenspace instability measure, that we prove theoretically determines the expected downstream disagreement on a linear regression task (Section 4).

To empirically validate our theory, we perform an evaluation of methods for selecting embedding hyperparameters to minimize downstream instability. Practically, we show that the eigenspace instability measure can outperform the majority of other embedding distance measures and performs similarly to the k-NN measure, for which we have no theoretical guarantees (Section 5).

Finally, we show that the stability-memory tradeoff extends to knowledge graph embeddings Bordes et al. (2013) and contextual word embeddings, such as BERT embeddings Devlin et al. (2019). For instance, we find that increasing the memory of knowledge graph embeddings 2× decreases the instability on a link prediction task by 7% to 19% (relative) (Section 6).
2 Preliminaries
We begin by formally defining the notion of instability we use in this work. We then review the word embedding algorithms and compression technique used in our study, and discuss existing measures to compare two embeddings.
2.1 Instability Definition
We define the downstream instability as follows:
Definition 1.
Let $X, \tilde{X} \in \mathbb{R}^{n \times d}$ be two embedding matrices, and let $f_X$ and $f_{\tilde{X}}$ represent models trained using $X$ and $\tilde{X}$, respectively, for a downstream task $T$. Then the instability between $X$ and $\tilde{X}$ with respect to task $T$ is defined as
$$DI(X, \tilde{X}; T) = \frac{1}{|S_T|} \sum_{x \in S_T} \ell\big(f_X(x), f_{\tilde{X}}(x)\big),$$
where $S_T$ is a held-out set for task $T$, and $\ell$ is a fixed loss function.
When the zero-one loss is used for $\ell$, this measure captures the percentage of predictions on which the downstream models trained on each embedding disagree.
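As a minimal illustration (with hypothetical predictions, not the paper's code), the zero-one instance of this measure is simply the fraction of held-out predictions on which the two downstream models disagree:

```python
import numpy as np

def downstream_instability(preds_a, preds_b):
    """Definition 1 with the zero-one loss: the fraction of held-out
    predictions on which models trained on the two embeddings disagree."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# e.g., sentiment predictions from models trained on Wiki'17 vs. Wiki'18 embeddings
assert downstream_instability([1, 0, 1, 1], [1, 1, 1, 0]) == 0.5
```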
2.2 Word Embedding Algorithms
Word embedding algorithms learn distributed representations of words by taking as input a textual corpus $C$ and returning the word embedding matrix $X \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of the embeddings and $n$ is the vocabulary size. We evaluate matrix completion (MC) Jin et al. (2016), GloVe Pennington et al. (2014), and continuous bag-of-words (CBOW) Mikolov et al. (2013a, b) embedding algorithms. MC and GloVe factor the co-occurrence matrix generated from $C$, whereas CBOW operates on the sequential corpus directly. We elaborate below.
Matrix completion (MC)
Matrix completion uses the word embeddings to approximate the observed word co-occurrence statistics, and can be formally written as:
$$\min_{X} \sum_{(i,j) \in \Omega} \left(A_{ij} - x_i^\top x_j\right)^2,$$
where $\Omega$ is the set of observed (non-zero) entries in $A$. Following standard technique, $A$ is the positive pointwise mutual information (PPMI) matrix, rather than the raw co-occurrence matrix Bullinaria & Levy (2007).
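A minimal sketch (hypothetical and simplified, not the paper's implementation) of fitting such an objective with SGD over the observed entries, in the spirit of the online training described in the text:

```python
import numpy as np

def mc_sgd_pass(X, observed, lr=0.02):
    """One SGD pass over observed PPMI entries (i, j, a_ij), nudging the
    embedding rows so that <x_i, x_j> approaches a_ij. A simplified,
    hypothetical sketch of online matrix-completion training."""
    for i, j, a in observed:
        err = X[i] @ X[j] - a          # residual on this entry
        gi, gj = err * X[j], err * X[i]
        X[i] -= lr * gi
        X[j] -= lr * gj
    return X
```

Repeated passes over sampled entries drive the factorization loss down on small synthetic examples.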
We solve the matrix completion problem using an online algorithm similar to that proposed in Jin et al. (2016): we iteratively train $X$ via stochastic gradient descent (SGD), computing the loss on sampled entries of the observed co-occurrence matrix $A$.
GloVe
Similar to MC, GloVe solves a matrix factorization problem, but approximates the co-occurrence information in a weighted form to reduce the noise from rare co-occurrences. GloVe models the word and context embeddings separately.
Continuous bagofwords (CBOW)
The CBOW algorithm predicts a word given its local context words. The embedding matrix $X$ is trained via SGD, where the loss maximizes the probability that an observed word and context pair co-occurs in the corpus and minimizes the probability that a negative sample co-occurs. We use the word2vec implementation of CBOW (https://github.com/tmikolov/word2vec).
2.3 Compression Technique
We use a standard technique—uniform quantization—to compress word embeddings. Recent work May et al. (2019) demonstrates that uniform quantization performs on par, in terms of downstream quality, with more complex compression techniques, such as k-means compression Andrews (2016) and deep compositional code learning Shu & Nakayama (2018). We leverage their implementation (https://github.com/HazyResearch/smallfry) to apply uniform quantization to word embeddings and study the impact of the precision on instability. Under uniform quantization, each entry in the word embedding matrix is rounded to a discrete value in a set of $2^b$ equally spaced values within an interval, such that each entry can be represented with just $b$ bits. For more details on the way we use uniform quantization for our experiments, see Appendix C.2.
2.4 Embedding Distance Measures
We consider four embedding distance measures from the literature to quantify the differences between embeddings. For each measure, we assume we have a pair of embeddings $X \in \mathbb{R}^{n \times d}$ and $\tilde{X} \in \mathbb{R}^{n \times d}$ trained on corpora $C$ and $\tilde{C}$, respectively, where $n$ is the size of the vocabulary and $d$ is the dimension of the embedding. For computational efficiency, and due to our observation that downstream tasks predominantly use high-frequency words, we only consider the top 10k most frequent words when computing each measure (including the eigenspace instability measure).
k-Nearest Neighbors (k-NN) Measure
Variants of the k-NN measure were used in recent works on word embedding stability to characterize the intrinsic stability of embeddings (e.g., Hellrich & Hahn (2016); Antoniak & Mimno (2018); Wendlandt et al. (2018)). The k-NN measure is defined as $1 - \frac{1}{m} \sum_{i=1}^{m} \frac{|N_k(X, q_i) \cap N_k(\tilde{X}, q_i)|}{k}$, where $m$ is the number of randomly sampled query words (we use $m = 1000$), and the function $N_k$ takes an embedding and the index of a query word, and returns the indices of the $k$ most similar words to the query word by the cosine distance.
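A hypothetical sketch of this measure, treating it as a distance (one minus the average overlap of the k-nearest-neighbor sets under the two embeddings):

```python
import numpy as np

def knn_distance(X, Xt, queries, k=5):
    """One minus the average overlap of the k-nearest-neighbor sets (by
    cosine similarity) of each query word under the two embeddings.
    A simplified sketch; ties are broken arbitrarily."""
    def neighbors(E, q):
        En = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = En @ En[q]
        sims[q] = -np.inf                  # exclude the query word itself
        return set(np.argsort(-sims)[:k])
    overlap = [len(neighbors(X, q) & neighbors(Xt, q)) / k for q in queries]
    return 1.0 - float(np.mean(overlap))
```

Identical embeddings yield a distance of zero; embeddings with entirely disjoint neighbor sets yield a distance of one.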
Semantic Displacement
Semantic displacement measures the average cosine distance between the corresponding word vectors of two embeddings after aligning them with orthogonal Procrustes Hamilton et al. (2016).
Pairwise Inner Product Loss
The Pairwise Inner Product (PIP) loss was proposed for dimensionality selection to optimize for the intrinsic quality of an embedding Yin & Shen (2018). The PIP loss is defined as $\|X X^\top - \tilde{X} \tilde{X}^\top\|_F$.
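Assuming the standard form of the PIP loss (the Frobenius distance between the pairwise inner product matrices), a one-line sketch:

```python
import numpy as np

def pip_loss(X, Xt):
    """PIP loss: Frobenius norm of the difference between the pairwise
    inner product matrices of the two embeddings."""
    return float(np.linalg.norm(X @ X.T - Xt @ Xt.T, ord="fro"))
```

Note that the measure is invariant to orthogonal rotations of either embedding, since $ (XQ)(XQ)^\top = X X^\top$ for orthogonal $Q$.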
Eigenspace Overlap Score
The eigenspace overlap score was recently proposed as a measure of compression quality May et al. (2019). The eigenspace overlap is defined as $\frac{1}{\max(d, \tilde{d})} \|U^\top \tilde{U}\|_F^2$, where $X = U \Sigma V^\top$ and $\tilde{X} = \tilde{U} \tilde{\Sigma} \tilde{V}^\top$ are the singular value decompositions (SVDs) of $X$ and $\tilde{X}$.
3 A Stability-Memory Tradeoff
We now present the empirical study that exposes the tradeoff we observe between downstream stability and embedding memory, and demonstrate that as the memory increases, the instability decreases. We consider the dimension and precision of the embedding as two important axes controlling the memory of the embedding. We first study the impact of the embedding’s dimension and precision on downstream instability in isolation in Sections 3.1 and 3.2, respectively, followed by a discussion of their joint effect in Section 3.3.
Corpora
We use two full Wikipedia dumps (https://dumps.wikimedia.org): Wiki'17 and Wiki'18, which we collected approximately a year apart, to train embeddings. The corpora are preprocessed by a Facebook script (https://github.com/facebookresearch/fastText/blob/master/getwikimedia.sh), which we modify to keep the letter cases. We use these two corpora as examples of the temporal changes which can occur to the text corpora used to train word embeddings. Each corpus has about 4.5 billion tokens, and when training the embeddings, we only learn embeddings for the top 400k most frequent words.
Downstream NLP Tasks
After training the word embeddings, we compress the embeddings with uniform quantization and train models for downstream NLP tasks on top of the embeddings, fixing the embeddings during training. We train word embeddings with three seeds, and use the same corresponding seeds for the downstream models. Results are reported as averages over the three seeds, with error bars indicating the standard deviation. We also align all pairs of Wiki’17 and Wiki’18 embeddings (same dimension and seed) with orthogonal Procrustes
Schönemann (1966) prior to compressing and training downstream models, as preliminary experiments found this helped to decrease instability. For each downstream task, we perform a hyperparameter search for the learning rate using 400-dimensional Wiki'17 embeddings, and use the same learning rate across all dimensions to minimize the impact of the learning rate on our analysis. Here, we discuss the two standard downstream NLP tasks we consider throughout our paper. Please see Appendix C.3 for more experimental setup details.
Sentiment Analysis. We evaluate on a binary sentiment analysis task where, given a sentence, the model determines if the sentence is positive or negative Kim (2014). We train a linear bag-of-words model for this task and evaluate on four benchmark datasets: MR Pang & Lee (2005), MPQA Wiebe et al. (2005), Subj Pang & Lee (2004), and SST-2 Socher et al. (2013b). We show results on SST-2; for more results, see Appendix D.1.
Named Entity Recognition (NER).
The named entity recognition task is a multiclass classification task to predict whether each token in the dataset is an entity, and if so, what type. We use a BiLSTM model
Akbik et al. (2018) for this task and evaluate on the benchmark CoNLL-2003 dataset Tjong Kim Sang & De Meulder (2003). Each token is assigned an entity label of PER, ORG, LOC, or MISC, or an O label, indicating that it is outside any entity (i.e., no entity). We measure instability only over the tokens whose true label is an entity. We use the BiLSTM without the conditional random field (CRF) decoding layer for computational efficiency; in Appendix E.2 we show that the trends also hold on a subset of the results with a BiLSTM-CRF.
3.1 Effect of Dimension
We evaluate the impact of the dimension of the embedding on its downstream stability, and show that generally as the dimension increases, the instability decreases.
Tradeoffs
To perform our tradeoff study, we train Wiki’17 and Wiki’18 embeddings with dimensions in {25, 50, 100, 200, 400, 800}, and train downstream models on top of the embeddings. We compute the prediction disagreement between models trained on Wiki’17 and Wiki’18 embeddings of the same dimension. In Figure 1 (top), we see that as the dimension increases, the downstream instability often decreases across embedding algorithms and downstream tasks, plateauing at larger dimensions. In Section 3.3, we see that these trends are even more pronounced in lower memory regimes when we also consider different precisions.
3.2 Effect of Precision
We evaluate the effect of the precision, the number of bits used to store each entry of the embedding matrix, on the downstream stability, and show that as the precision increases, the instability decreases.
Tradeoffs
We compress 100-dimensional Wiki'17 and Wiki'18 embeddings with uniform quantization to precisions $b \in \{1, 2, 4, 8, 16, 32\}$ (where $b = 32$ signifies full-precision embeddings), and train downstream models on top of the compressed embeddings. We compute the prediction disagreement between models trained on Wiki'17 and Wiki'18 embeddings of the same precision. In Figure 1 (bottom), we show that as the precision increases, the instability generally decreases on sentiment analysis and NER tasks for the CBOW, GloVe, and MC embedding algorithms. Moreover, we see that for precisions greater than 4 bits, the impact of compression on instability is minimal.
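For concreteness, a minimal sketch of the uniform quantization step from Section 2.3 (a simplified version; the smallfry implementation may choose the quantization interval differently):

```python
import numpy as np

def uniform_quantize(X, b):
    """Round each entry of X to one of 2**b equally spaced values spanning
    [X.min(), X.max()], so each entry needs only b bits (plus the shared
    interval endpoints). Simplified sketch; assumes b >= 1."""
    lo, hi = float(X.min()), float(X.max())
    levels = 2 ** b
    step = (hi - lo) / (levels - 1)
    return lo + np.round((X - lo) / step) * step
```

Each entry moves by at most half a quantization step, and the matrix contains at most $2^b$ distinct values.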
3.3 Joint Effect of Dimension and Precision
We study the effect of dimension and precision together, and show that overall, as the memory increases, the downstream instability decreases. We also propose a simple rule of thumb relating the memory and instability, and evaluate the relative impact of dimension and precision on the instability. Finally, we discuss two key questions based on our empirical observations, which motivate the rest of the work.
Tradeoffs
We uniformly quantize the Wiki'17 and Wiki'18 embeddings of dimensions {25, 50, 100, 200, 400, 800} to precisions {1, 2, 4, 8, 16, 32} to generate many dimension-precision pairs spanning a wide range of memory budgets. Across the memory budgets, embedding algorithms, and tasks, we see that as we increase the memory, the downstream instability decreases (Figure 2). To propose a simple rule of thumb for the stability-memory tradeoff, we fit a single linear-log model to the dimension-precision pairs at all memory budgets below the point at which the instability plateaus, across five downstream tasks (i.e., the four sentiment analysis tasks and one NER task) and two embedding algorithms. We find the following average stability-memory relationship for the downstream instability $DI$ of a task with respect to the memory (bits/word) $m$: $DI(m) \approx c_T - 1.3\% \cdot \log_2(m)$, where $c_T$ is a task-specific constant. For instance, if we increase the memory 2×, then the instability decreases on average by 1.3% (absolute). Across the tasks, embedding algorithms, and memory budgets we consider, this 1.3% (absolute) difference corresponds to an approximately 5% to 37% relative reduction in downstream instability, depending on the original instability value (3.6% to 25.9%).
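The linear-log fit behind this rule of thumb can be sketched as follows (the data points here are hypothetical, not the paper's measurements):

```python
import numpy as np

# hypothetical (bits/word, instability %) observations for one task
memory      = np.array([32, 64, 128, 256, 512])
instability = np.array([14.2, 12.8, 11.6, 10.3, 9.1])

# fit instability ≈ c_T + slope * log2(memory); the paper's rule of thumb
# corresponds to a slope near -1.3% (absolute) per doubling of memory
slope, c_T = np.polyfit(np.log2(memory), instability, 1)
print(f"doubling memory changes instability by {slope:.2f}% (absolute)")
```

The fitted slope directly reads off the per-doubling change in downstream instability.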
To understand the relative impact on instability of increasing the dimension vs. the precision, we fit independent linear-log models to each parameter. We find that precision has a larger impact on instability than dimension, with a 2× increase in precision decreasing instability by 1.4% (absolute) vs. a 2× increase in dimension decreasing instability by 1.2% (absolute). Please see Appendix C.4 for more details on how we fit these trends. In Appendix E, we further demonstrate the robustness of the stability-memory tradeoff (e.g., to more complex downstream models, other sources of downstream randomness).
This stability-memory tradeoff raises two key questions: (1) how can we theoretically explain this tradeoff between the embedding memory and the downstream stability, and (2) how can we jointly select the embedding's dimension-precision parameters to minimize the downstream instability? Practically, choosing these parameters is important, because downstream instability can vary by over 3% (absolute) across the different combinations of dimension and precision for a given memory budget (Figure 2). The goal of the remainder of the paper is to shed light on these questions.
4 Analyzing Embedding Instability
To address both questions raised above, we present a new measure of embedding instability, the eigenspace instability measure, which we show is both theoretically and empirically related to the downstream instability of the embeddings. The goal of this measure is to efficiently estimate, given two embeddings, how different the predictions of models trained with these embeddings will be. We first define the eigenspace instability measure and present its theoretical connection with downstream instability in Section
4.1; we then propose using this measure to efficiently select parameters to minimize downstream instability in Section 4.2.
4.1 Eigenspace Instability Measure
We now define the eigenspace instability measure between two embeddings, and show that this measure is directly related to the expected disagreement between linear regression models trained using these embeddings.
Definition 2.
Let $X = U \Sigma V^\top$ and $\tilde{X} = \tilde{U} \tilde{\Sigma} \tilde{V}^\top$ be the singular value decompositions (SVDs) of two embedding matrices $X \in \mathbb{R}^{n \times d}$ and $\tilde{X} \in \mathbb{R}^{n \times \tilde{d}}$, and let $M \in \mathbb{R}^{n \times n}$ be a positive semidefinite matrix. Then the eigenspace instability measure between $X$ and $\tilde{X}$, with respect to $M$, is defined as
$$\mathcal{EI}_M(X, \tilde{X}) = \frac{\mathrm{tr}\big((U U^\top - \tilde{U} \tilde{U}^\top)\, M\, (U U^\top - \tilde{U} \tilde{U}^\top)\big)}{\mathrm{tr}(M)}.$$
Intuitively, this measure captures how different the subspaces spanned by the left singular vectors of $X$ and $\tilde{X}$ are to one another; the measure equals zero when the left singular vectors of $X$ and $\tilde{X}$ span identical subspaces of $\mathbb{R}^n$, and equals one when these singular vectors span orthogonal subspaces whose union covers the whole space. We note that the left singular vectors are particularly important in the case of linear regression models, because the predictions of the learned model on the training examples depend only on the label vector and the left singular vectors of the data matrix: the linear regression model trained on data matrix $X = U \Sigma V^\top$ with label vector $y$ makes predictions $U U^\top y$ on the training points.
We now present our result showing that the expected mean squared difference between the linear regression models trained on $X$ vs. $\tilde{X}$ is equal to the eigenspace instability measure, where $M$ corresponds to the covariance matrix of the regression label vector. For the proof, see Appendix B.
Proposition 1.
Let $X \in \mathbb{R}^{n \times d}$, $\tilde{X} \in \mathbb{R}^{n \times \tilde{d}}$ be two full-rank embedding matrices, where $x_i$ and $\tilde{x}_i$ correspond to the $i$-th rows of $X$ and $\tilde{X}$, respectively. Let $y \in \mathbb{R}^n$ be a random regression label vector with zero mean and covariance $M = \mathbb{E}[y y^\top]$. Then the (normalized) expected mean squared difference between the linear models $f$ and $\tilde{f}$ (the least-squares solutions $f(x) = x^\top (X^\top X)^{-1} X^\top y$ and $\tilde{f}(\tilde{x}) = \tilde{x}^\top (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top y$) trained on label vector $y$ using embeddings $X$ and $\tilde{X}$ satisfies
$$\frac{\mathbb{E}_y\big[\sum_{i=1}^{n} (f(x_i) - \tilde{f}(\tilde{x}_i))^2\big]}{\mathbb{E}_y\big[\|y\|^2\big]} = \mathcal{EI}_M(X, \tilde{X}). \tag{1}$$
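As a sanity check of this identity (assuming the trace form of the measure, $\mathcal{EI}_M = \mathrm{tr}(\Delta M \Delta)/\mathrm{tr}(M)$ with $\Delta = U U^\top - \tilde{U} \tilde{U}^\top$), one can compare the closed form against a Monte-Carlo estimate of the normalized disagreement on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 5
X  = rng.standard_normal((n, d))
Xt = rng.standard_normal((n, d))

U,  _, _ = np.linalg.svd(X,  full_matrices=False)
Ut, _, _ = np.linalg.svd(Xt, full_matrices=False)

M = np.eye(n)                      # label covariance (isotropic for simplicity)
Delta = U @ U.T - Ut @ Ut.T
ei = np.trace(Delta @ M @ Delta) / np.trace(M)

# Monte-Carlo estimate of the left-hand side: least-squares predictions
# on the training points are U U^T y (resp. Ut Ut^T y)
num = den = 0.0
for _ in range(2000):
    y = rng.standard_normal(n)     # zero mean, covariance M = I
    f, ft = U @ (U.T @ y), Ut @ (Ut.T @ y)
    num += np.sum((f - ft) ** 2)
    den += np.sum(y ** 2)

assert abs(num / den - ei) < 0.03
```

The empirical ratio converges to the closed-form measure as the number of sampled label vectors grows.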
The above result exactly characterizes the expected downstream instability of linear regression models trained on $X$ and $\tilde{X}$, in terms of the eigenspace instability measure, given the covariance matrix of the label vector; but how should we select $M$? One desirable property for $M$ could be that it produces label vectors with higher variance in directions believed to be important, for example because they correspond to eigenvectors with large eigenvalues of an embedding's Gram matrix. In Section 5, where we evaluate the instability of pairs of embeddings of various dimensions and precisions, we consider $M = (E E^\top)^{\alpha} + (\tilde{E} \tilde{E}^\top)^{\alpha}$; in those experiments, $E$ and $\tilde{E}$ are the highest-dimensional ($d = 800$), full-precision embeddings for Wiki'17 and Wiki'18, respectively, and $\alpha$ is a scalar controlling the relative importance of the directions of high eigenvalue. This choice of $M$ results in label vectors with large variance in the directions of high eigenvalues of these embedding matrices. In Section 5.1 we show that when $\alpha$ is chosen appropriately, there is strong empirical correlation between the eigenspace instability measure (with this $M$) and downstream instability.
4.2 Jointly Selecting Dimension and Precision
We now demonstrate a practical utility of the eigenspace instability measure: we propose using the measure to efficiently select embedding dimension-precision parameters to minimize downstream instability without training the downstream models. In particular, we propose an algorithm that takes two or more pairs of embeddings with different dimension-precision parameters as input, and outputs the pair with the lowest eigenspace instability measure between its embeddings. In Section 5.2, we evaluate the performance of this proposed selection algorithm in two settings: first, a simple setting where the goal is to select the pair with the lower downstream instability out of two randomly selected pairs, and second, a more challenging setting where the goal is to select the pair with the lowest downstream instability out of two or more pairs with the same memory budget. In both settings, we demonstrate that the eigenspace instability measure outperforms the majority of embedding distance measures and is competitive with the other top-performing embedding distance measure, the k-NN measure.
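The selection procedure can be sketched as follows (a hypothetical interface; `distance` would be the eigenspace instability measure or any of the baseline measures):

```python
def select_most_stable(candidates, distance):
    """Given candidate (X, X~) embedding pairs -- one per dimension-precision
    setting -- return the index of the pair whose embedding distance is
    smallest, as a proxy for downstream instability. No downstream models
    are trained."""
    scores = [distance(X, Xt) for X, Xt in candidates]
    return min(range(len(scores)), key=scores.__getitem__)
```

The same routine serves both evaluation settings: comparing two arbitrary dimension-precision pairs, or all pairs under a fixed memory budget.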
5 Experiments
We now empirically validate the eigenspace instability measure's relation with downstream instability and demonstrate that the eigenspace instability measure is an effective selection criterion for dimension-precision parameters. In Section 5.1, we show that the theoretically grounded eigenspace instability measure strongly correlates with downstream instability, attaining Spearman correlations greater than the weaker baselines (semantic displacement, PIP loss, and eigenspace overlap score) and between 0.04 better and 0.09 worse than the strongest baseline (the k-NN measure). In Section 5.2, when selecting dimension-precision parameters without training the downstream models, we show that the eigenspace instability measure attains substantially lower error rates than the weaker baselines and error rates comparable to that of the k-NN measure. (Our code is available at https://github.com/HazyResearch/anchorstability.)
Table 1: Spearman correlations between embedding distance measures and downstream prediction disagreement.

Downstream Task         |      SST-2       |       Subj       |    CoNLL-2003
Embedding Algorithm     | CBOW GloVe  MC   | CBOW GloVe  MC   | CBOW GloVe  MC
Eigenspace Instability  | 0.68  0.84  0.84 | 0.72  0.77  0.78 | 0.80  0.78  0.83
k-NN Measure            | 0.74  0.86  0.89 | 0.74  0.76  0.76 | 0.76  0.86  0.92
Semantic Displacement   | 0.70  0.34  0.28 | 0.45  0.43  0.46 | 0.53  0.16  0.32
PIP Loss                | 0.40  0.06  0.39 | 0.14  0.14  0.56 | 0.01  0.11  0.44
Eigenspace Overlap      | 0.63  0.18  0.26 | 0.50  0.29  0.45 | 0.58  0.01  0.31
Experimental Setup
To evaluate how predictive the various embedding distance measures are of downstream instability, we take the embedding pairs and corresponding downstream models we trained in Section 3 and compute the embedding distance measures between these pairs of embeddings. Specifically, we compute the k-NN measure, semantic displacement, PIP loss, eigenspace overlap score, and eigenspace instability measure between the embedding pairs (Section 2.4). Recall that the k-NN measure and the eigenspace instability measure each have an important hyperparameter: the $k$ in the k-NN measure, which determines how many neighbors we compare, and the $\alpha$ in the eigenspace instability measure, which controls how important the eigenvectors of high eigenvalue are. For both hyperparameters, we choose the values with the highest average correlation across four sentiment analysis tasks (SST-2, MR, Subj, and MPQA), one NER task (CoNLL-2003), and two embedding algorithms (CBOW and MC; these values also worked well for GloVe, which was added later) when using validation datasets for the downstream tasks ($k = 5$ and $\alpha = 3$). See Appendix D.3 for more details. The eigenspace instability measure also requires additional embeddings $E$ and $\tilde{E}$: we use 800-dimensional, full-precision Wiki'17 and Wiki'18 embeddings, as these are the highest-dimensional, full-precision embeddings in our study.
5.1 Predictive Performance of the Eigenspace Instability Measure
We evaluate how predictive the eigenspace instability measure is of downstream instability, showing that the theoretically grounded eigenspace instability measure correlates strongly with downstream instability and is competitive with other embedding distance measures. To do this, we measure the Spearman correlations between the downstream prediction disagreement and the embedding distance measure for each of the five tasks and three embedding algorithms. The Spearman correlation quantifies how similar the ranking of the pairs of embeddings based on the embedding distance measure is to the ranking of the pairs of embeddings based on their downstream prediction disagreement, with a maximum value of 1.0. In Table 1, we see that the eigenspace instability measure and the k-NN measure are the top-performing embedding distance measures by Spearman correlation, with the eigenspace instability measure attaining Spearman correlations between 0.04 better and 0.09 worse than the k-NN measure on all tasks. Moreover, the strong correlation of at least 0.68 for the eigenspace instability measure across embedding algorithms and downstream tasks validates our theoretical claim that this measure relates to downstream disagreement. In Appendix D.4, we include additional plots showing the downstream prediction disagreement versus the embedding distance measures.
Table 2: Selection error rates when using each measure to choose the more stable dimension-precision parameters out of a pair of choices (lower is better).

Downstream Task         |      SST-2       |       Subj       |    CoNLL-2003
Embedding Algorithm     | CBOW GloVe  MC   | CBOW GloVe  MC   | CBOW GloVe  MC
Eigenspace Instability  | 0.23  0.15  0.17 | 0.24  0.21  0.20 | 0.20  0.20  0.17
k-NN Measure            | 0.21  0.14  0.13 | 0.23  0.21  0.21 | 0.21  0.16  0.11
Semantic Displacement   | 0.24  0.40  0.42 | 0.34  0.36  0.34 | 0.29  0.47  0.41
PIP Loss                | 0.64  0.50  0.35 | 0.57  0.54  0.28 | 0.50  0.44  0.32
Eigenspace Overlap      | 0.28  0.46  0.43 | 0.32  0.41  0.34 | 0.29  0.52  0.41
Table 3: Average absolute difference (%) in downstream instability between the pair selected by each measure and the most stable "oracle" pair under fixed memory budgets (lower is better).

Downstream Task         |      SST-2       |       Subj       |    CoNLL-2003
Embedding Algorithm     | CBOW GloVe  MC   | CBOW GloVe  MC   | CBOW GloVe  MC
Eigenspace Instability  | 0.65  0.55  1.42 | 0.39  0.41  0.63 | 0.28  0.45  0.43
k-NN Measure            | 0.57  0.43  1.07 | 0.38  0.44  0.57 | 0.32  0.48  0.23
Semantic Displacement   | 0.37  1.58  3.73 | 0.48  0.64  0.94 | 0.27  0.89  1.17
PIP Loss                | 3.63  2.54  3.32 | 1.16  1.71  0.74 | 0.83  0.83  0.99
Eigenspace Overlap      | 0.88  1.58  3.60 | 0.34  0.64  0.93 | 0.20  0.89  1.15
High Precision          | 0.85  1.58  3.94 | 0.61  0.64  1.01 | 0.60  0.89  1.28
Low Precision           | 3.63  2.54  1.23 | 1.16  1.71  1.39 | 0.83  0.83  0.74
5.2 Embedding Distance Measures for DimensionPrecision Selection
We demonstrate that the eigenspace instability measure is an effective selection criterion for dimension-precision parameters, outperforming the majority of existing embedding distance measures and competitive with the k-NN measure, for which there are no theoretical guarantees. Specifically, we evaluate the embedding distance measures as selection criteria in two settings of increasing difficulty: in the first setting, the goal is, given two pairs of embeddings (each corresponding to an arbitrary dimension-precision combination), to select the pair with the lower downstream instability. In the second, more challenging setting, the goal is to select, among all dimension-precision combinations corresponding to the same total memory, the one with the lowest downstream instability. This setting is challenging, as for many memory budgets there are more than two choices of embedding pairs, and some choices may have very similar expected downstream instability. We now discuss each of these settings, and the corresponding results, in more detail.
For the first, simpler setting, we first form all groupings of two embedding pairs with different dimension-precision combinations. For instance, a grouping may have one embedding pair with dimension 800 and precision 32, and another embedding pair with dimension 200 and precision 2, where a pair consists of a Wiki'17 and a Wiki'18 embedding from the same algorithm. For each embedding distance measure, we report the fraction of groupings where the embedding distance measure correctly chooses the embedding pair with lower downstream instability on a given task. We repeat over three seeds, comparing embedding pairs of the same seed, and report the average. In Table 2, we show that the eigenspace instability measure and k-NN measure are the most accurate embedding distance measures, with up to 3.33x and 3.73x lower selection error rates than the other embedding distance measures, respectively. Moreover, across downstream tasks, the eigenspace instability measure attains error rates comparable to those of the k-NN measure.
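This pairwise selection protocol can be sketched in a few lines; the function name and the toy numbers below are illustrative, not taken from the paper's released code:

```python
import itertools

def selection_error(measure_vals, instabilities):
    """Fraction of groupings of two embedding pairs where ranking by the
    distance measure disagrees with ranking by downstream instability.

    measure_vals[i]  -- distance-measure value for embedding pair i
    instabilities[i] -- downstream instability (% predictions changed) for pair i
    """
    errors, total = 0, 0
    for i, j in itertools.combinations(range(len(measure_vals)), 2):
        if instabilities[i] == instabilities[j]:
            continue  # no wrong choice possible; skip ties
        # the measure "chooses" the pair with the smaller measure value
        chosen = i if measure_vals[i] < measure_vals[j] else j
        best = i if instabilities[i] < instabilities[j] else j
        errors += chosen != best
        total += 1
    return errors / total

# toy example: a measure that ranks five of six groupings correctly
vals = [0.1, 0.4, 0.3, 0.9]   # hypothetical measure values
inst = [5.0, 6.0, 7.0, 8.0]   # hypothetical % disagreement
print(selection_error(vals, inst))  # 1 of 6 groupings mis-ranked
```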
For the second, more challenging setting, we enumerate all embedding pairs with different dimension-precision combinations corresponding to the same total memory. For each embedding distance measure, we report the average absolute percentage difference between the downstream instability of the pair selected by the measure and that of the most stable "oracle" pair, across different memory budgets. We also introduce two naive baselines that do not require an embedding distance measure: high precision, which selects the pair with the highest precision possible at each memory budget, and low precision, which selects the pair with the lowest precision possible at each memory budget. As before, we repeat over three seeds, comparing embedding pairs of the same seed, and report the average. We see that the eigenspace instability measure and k-NN measure again outperform the other baselines on the majority of downstream tasks, with the eigenspace instability measure attaining a distance up to 2.98% (absolute) closer to the oracle than the other baselines, and an average distance to the oracle ranging from 0.03% (absolute) better to 0.35% (absolute) worse than the k-NN measure across downstream tasks (Table 3). For both settings, we include additional results measuring the worst-case performance of the embedding distance measures in Appendix D.5, where we find that the eigenspace instability measure and k-NN measure continue to be the top-performing measures.
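The oracle-gap metric for this second setting can be sketched similarly; the function name and numbers are again hypothetical, not from the paper's artifact:

```python
def oracle_gap(measure_vals, instabilities):
    """Absolute difference (in % predictions changed) between the downstream
    instability of the pair the measure selects (smallest measure value) and
    that of the most stable ("oracle") pair at one memory budget."""
    picked = min(range(len(measure_vals)), key=lambda i: measure_vals[i])
    return abs(instabilities[picked] - min(instabilities))

# hypothetical per-budget data: {memory budget: (measure values, instabilities)}
budgets = {
    32: ([0.2, 0.5, 0.4], [7.0, 6.5, 8.0]),  # measure picks 7.0; oracle is 6.5
    64: ([0.1, 0.3], [4.0, 4.2]),            # measure picks the oracle pair
}
gaps = [oracle_gap(m, i) for m, i in budgets.values()]
print(sum(gaps) / len(gaps))  # average absolute gap to the oracle
```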
6 Extensions
We demonstrate that the stability-memory tradeoffs we observe with pretrained word embeddings can extend to knowledge graph embeddings and contextual word embeddings: as the memory of the embedding increases, the instability decreases. We first show how these trends hold on knowledge graph embeddings in Section 6.1 and then on contextual word embeddings in Section 6.2.
6.1 Knowledge Graph Embeddings
Knowledge graph embeddings (KGEs) are a popular type of embedding used for multi-relational data, such as social networks, knowledge bases, and recommender systems. Here, we show that as the dimension and precision of the KGE increase, the stability on two standard KGE tasks improves, aligning with the trends we observed with pretrained word embedding algorithms. Unlike word embedding algorithms, the input to KGE algorithms is a directed graph, where the relations are the edges and the entities are the nodes. The graph can be represented as a set of triplets (h, r, t), where the head entity h is related to the tail entity t by the relation r. The output is two sets of embeddings: (1) entity embeddings and (2) relation embeddings. We study the stability of these embeddings on two standard benchmark tasks: link prediction and triplet classification. We summarize the datasets and protocols, and then discuss the results.
Datasets
We use two datasets to train KGE embeddings: FB15K-95 and FB15K. FB15K was introduced in Bordes et al. (2013) and is composed of a subset of triplets from the Freebase knowledge base. We construct FB15K-95 by randomly sampling 95% of the triplets from the training dataset of FB15K. The validation and test datasets remain the same for both datasets. We use these datasets to study the stability of KGEs under small changes in training data.
Training Protocol
We consider a standard KGE algorithm, TransE Bordes et al. (2013). The TransE objective function minimizes the distance d(h + r, t) for observed triplets (h, r, t) and maximizes it for negatively sampled triplets, in which either the head h or the tail t has been corrupted; the embeddings are learned iteratively via stochastic gradient descent.
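As a concrete sketch of the TransE scoring function (our illustration, not the OpenKE implementation; we use the L2 norm here, though TransE also supports L1):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE models a relation as a translation: a true triplet (h, r, t)
    should satisfy h + r ≈ t, so a lower distance means a better fit.
    L2 norm used here; the original TransE supports L1 or L2."""
    return np.linalg.norm(h + r - t)

# toy 2-d embeddings: a triplet that fits the translation exactly
h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t = np.array([1.0, 1.0])
print(transe_score(h, r, t))       # 0.0 -- consistent triplet
print(transe_score(h, r, -t) > 0)  # a corrupted tail scores worse
```

During training, these scores are plugged into a margin-based ranking loss that pushes observed triplets below corrupted ones.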
To measure the impact of the dimension and precision on the stability of TransE embeddings, we train TransE embeddings of dimensions {10, 20, 50, 100, 200, 400} and then uniformly quantize the entity and relation embeddings for each TransE embedding to b in {1, 2, 4, 8, 16, 32} bits per entry (the same dimension is used for both the entity and the relation embeddings). We perform a hyperparameter sweep on the learning rate using dimension 50, and select the best learning rate on the validation set for link prediction. We use this learning rate for all dimensions to minimize the impact of the learning rate on our analysis. We take other training hyperparameters from the TransE paper Bordes et al. (2013) for the FB15K dataset, and use three seeds to train each dimension using the OpenKE repository Han et al. (2018).
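Uniform quantization of an embedding matrix can be sketched as follows; the exact scheme used in the paper (range choice, rounding rule) may differ, so treat this as an assumption-laden illustration:

```python
import numpy as np

def uniform_quantize(emb, bits):
    """Snap each entry of `emb` to one of 2**bits evenly spaced levels
    spanning [emb.min(), emb.max()] (assumes emb.max() > emb.min())."""
    lo, hi = emb.min(), emb.max()
    levels = 2 ** bits
    scale = (hi - lo) / (levels - 1)       # spacing between adjacent levels
    codes = np.round((emb - lo) / scale)   # integer code in [0, levels - 1]
    return lo + codes * scale              # map codes back to real values

emb = np.array([[-1.0, -0.5],
                [ 0.5,  1.0]])
q = uniform_quantize(emb, 1)  # 1 bit: every entry snaps to -1.0 or 1.0
print(q)
```

With b bits and dimension d, each vector occupies b * d bits, which is the "memory per vector" axis used throughout the stability plots.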
Evaluation Protocol
For each dimension-precision combination, we evaluate all pairs of embeddings trained on FB15K-95 and FB15K on the link prediction and triplet classification tasks. For each test triplet, the link prediction task evaluates the mean predicted rank of the observed triplet among all corrupted triplets. We measure instability on this task with unstable-rank@10: the fraction of test triplets whose rank changes by more than 10 between the two embeddings.
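A minimal sketch of the unstable-rank@10 metric, assuming we already have each test triplet's predicted rank under the two embeddings (function name and numbers are ours):

```python
def unstable_rank_at_k(ranks_a, ranks_b, k=10):
    """Fraction of test triplets whose predicted rank changes by more
    than k between two embeddings (unstable-rank@10 when k=10)."""
    changed = sum(abs(a - b) > k for a, b in zip(ranks_a, ranks_b))
    return changed / len(ranks_a)

# hypothetical ranks of four test triplets under Wiki-style paired models
ranks_fb15k95 = [1, 50, 200, 8]
ranks_fb15k   = [3, 75, 190, 9]
print(unstable_rank_at_k(ranks_fb15k95, ranks_fb15k))  # only 50 -> 75 moves by > 10
```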
The triplet classification task was introduced in Socher et al. (2013a) and is a binary classification task to determine whether or not a triplet occurs in the knowledge graph. For each relation r, a threshold is determined based on the validation set, such that if the distance d(h + r, t) falls below the threshold, the triplet is predicted as positive. For each dimension-precision pair, we set the thresholds using the FB15K-95 embeddings and reuse the same thresholds for the FB15K embeddings. We include results with thresholds set independently for each embedding in Appendix D.6. As for classification with downstream NLP tasks, we define instability on the triplet classification task as the percentage prediction disagreement.
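Prediction disagreement itself is straightforward to compute; a minimal sketch (names and labels ours):

```python
def prediction_disagreement(preds_a, preds_b):
    """Percentage of examples on which two models' predictions differ."""
    diff = sum(a != b for a, b in zip(preds_a, preds_b))
    return 100.0 * diff / len(preds_a)

# hypothetical binary triplet-classification outputs from the two models
preds_fb15k95 = [1, 0, 1, 1]
preds_fb15k   = [1, 1, 1, 0]
print(prediction_disagreement(preds_fb15k95, preds_fb15k))  # 50.0
```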
Results
We find that the stability-memory tradeoffs continue to hold for TransE embeddings on the link prediction and triplet classification tasks: overall, as the memory increases, the instability decreases, and specifically, as the dimension and precision increase, the instability decreases. In Figure 3 (left), we show for link prediction that as the memory per vector increases, the unstable-rank@10 measure decreases. Each line represents a different precision, and each point on a line represents a different dimension. Thus, we can also see that as the dimension increases, the unstable-rank@10 decreases, and as the precision increases, this measure also decreases. When fitting a linear-log model to the dimension-precision combinations for all memory budgets, we find that increasing the memory 2x decreases the instability by 7% to 19% (relative). In Figure 3 (right), we similarly show for triplet classification that as the memory per vector increases, the prediction disagreement between the embeddings trained on the two datasets decreases. Finally, as we saw with word embeddings, the effect of the dimension or precision on stability is more pronounced in low-memory regimes.
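The linear-log fit mentioned above can be reproduced with an ordinary least-squares fit of instability against log2 of the memory; the numbers below are made up for illustration:

```python
import numpy as np

# memory per vector (bits) and observed instability (%): hypothetical values
memory = np.array([32, 64, 128, 256, 512])
instability = np.array([12.0, 10.1, 8.3, 6.2, 4.4])

# fit: instability ≈ slope * log2(memory) + intercept
slope, intercept = np.polyfit(np.log2(memory), instability, 1)

# under this model, each doubling of memory changes instability by `slope`
# percentage points (negative slope = more memory, more stability)
print(round(float(slope), 2))
```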
6.2 Contextual Word Embeddings
Unlike pretrained word embeddings, contextual word embeddings Peters et al. (2018); Vaswani et al. (2017) extract word representations dynamically, with awareness of the input context. We find that the stability-memory tradeoff observed on pretrained embeddings can still hold for contextual word embeddings, though with noisier trends: higher dimensionality and higher precision can demonstrate better downstream stability. We pretrain shallow, 3-layer versions of BERT Devlin et al. (2019) on subsampled Wiki'17 and Wiki'18 dumps (200 million tokens) as feature extractors, with transformer layer output dimensionalities ranging from a quarter as large to 4x as large as the hidden size of BERT_BASE (i.e., 768). (For scale, the 12-layer BERT_BASE model is pretrained with 3 billion tokens from BooksCorpus Zhu et al. (2015) and Wikipedia, and requires 16 TPU chips training for 4 days.)
To evaluate the effect of precision, we use uniform quantization to compress the output of the last transformer layer in the BERT models. Finally, we measure the prediction disagreement between linear classifiers trained on top of the Wiki’17 and Wiki’18 BERT models, with the BERT model parameters fixed.
Across four sentiment analysis tasks, we observe reduced instability with higher-dimensional BERT embeddings (Figure 10(a) in Appendix D.7); however, the reduction in instability from increasing the dimension is noisier than with pretrained word embeddings. We hypothesize this is due to the instability of training the BERT model itself, which is a far more complex model than a pretrained word embedding. We also observe that increasing the precision can decrease the downstream instability: using 1 or 2 bits of precision often visibly degrades stability, while precisions above 4 bits have negligible influence on stability (Figure 10(b) in Appendix D.7). For more details on the training and evaluation, see Appendix D.7.
7 Related work
There have been many recent works studying word embedding instability Hellrich & Hahn (2016); Antoniak & Mimno (2018); Wendlandt et al. (2018); Pierrejean & Tanguy (2018); Chugh et al. (2018); Hellrich et al. (2019); these works have focused on the intrinsic instability of word embeddings, meaning the stability measured between the embedding matrices without training a downstream model. Wendlandt et al. (2018) do consider a downstream task (part-of-speech tagging), but focus on how the intrinsic instability impacts the error of words on this task. In contrast, we focus on the downstream instability (i.e., prediction disagreement), evaluating how different parameters of embeddings impact downstream instability with large-scale Wikipedia embeddings over multiple downstream NLP tasks. Furthermore, we provide theoretical analysis specific to the downstream instability setting to help explain our empirical observations.
More broadly, researchers have also studied the general problem of ML model instability in the context of online training and incremental learning. Fard et al. (2016) study the problem of reducing the prediction churn between consecutively trained classifiers by introducing a Monte Carlo stabilization operator as a form of regularization. Cotter et al. (2016) further define stability as a design goal for classifiers in real-world applications, along with goals such as precision, recall, and fairness, and propose an algorithm to optimize for these multiple design goals. Other researchers have also studied the problem of catastrophic forgetting when models are incrementally trained Yang et al. (2019), which shares the similar goal of learning new information while minimizing changes with respect to previous models. As these works focus on changes to the downstream model training to reduce instability, we believe they are complementary to our work, which focuses on better understanding the instability introduced by word embeddings.
Lastly, although the bias-variance tradeoff is a commonly used tool in ML for analyzing model stability, there is an important difference in our setting. While the variance of a model quantifies the expected deviation of the model from its mean (typically over randomness in the training sample), in our work we analyze the disagreement between two separate models trained with different fixed data matrices on the same random label vector.
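The contrast can be stated compactly; the notation below is our own sketch (with w-hat(y) the least-squares solution for data matrix X and label vector y), not the paper's exact formulation:

```latex
% Variance of one model \hat{f}, over randomness in the training sample S:
\mathrm{Var}(\hat{f}) \;=\; \mathbb{E}_{S}\!\left[ \big\| \hat{f}_S - \mathbb{E}_{S}[\hat{f}_S] \big\|^2 \right]

% Disagreement studied here: two models with fixed data matrices X and
% \tilde{X}, sharing the same random label vector y:
\mathbb{E}_{y}\!\left[ \big\| X\,\hat{w}(y) - \tilde{X}\,\tilde{w}(y) \big\|^2 \right],
\qquad \hat{w}(y) = \arg\min_{w} \|Xw - y\|^2 .
```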
8 Conclusion
We performed the first in-depth study of the downstream instability of word embeddings. In our study, we exposed a novel stability-memory tradeoff, showing that increasing the embedding dimension or precision decreases downstream instability. To better understand these empirical results, we introduced a new measure for embedding instability, the eigenspace instability measure, which we theoretically relate to downstream prediction disagreement. We showed that this theoretically grounded embedding measure correlates strongly with downstream instability, and can be used to select dimension-precision parameters, performing better than or competitively with other embedding measures on minimizing downstream instability without training the downstream tasks. Finally, we demonstrated that the stability-memory tradeoff extends to other types of embeddings, including contextual word embeddings and knowledge graph embeddings. We hope our study motivates future work on ML model instability in more complex pipelines.
Acknowledgements
We thank Charles Kuang, Shoumik Palkar, Fred Sala, Paroma Varma, and the anonymous reviewers for their valuable feedback. We gratefully acknowledge the support of DARPA under Nos. FA8750-17-2-0095 (D3M), FA8650-18-2-7865 (SDH), and FA8650-18-2-7882 (ASED); NIH under No. U54EB020405 (Mobilize); NSF under Nos. CCF-1763315 (Beyond Sparsity), CCF-1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N00014-17-1-2266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, the NSF Graduate Research Fellowship under No. DGE-1656518, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.
References
 Akbik et al. (2018) Akbik, A., Blythe, D., and Vollgraf, R. Contextual string embeddings for sequence labeling. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 1638–1649, 2018.
 Andrews (2016) Andrews, M. Compressing word embeddings. In International Conference on Neural Information Processing (ICONIP), pp. 413–422, 2016.
 Antoniak & Mimno (2018) Antoniak, M. and Mimno, D. Evaluating the stability of embeddingbased word similarities. Transactions of the Association for Computational Linguistics (TACL), 6:107–119, 2018.
 Bojanowski et al. (2017) Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135–146, 2017.
 Bordes et al. (2013) Bordes, A., Usunier, N., GarciaDuran, A., Weston, J., and Yakhnenko, O. Translating embeddings for modeling multirelational data. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2787–2795, 2013.
 Bullinaria & Levy (2007) Bullinaria, J. A. and Levy, J. P. Extracting semantic representations from word cooccurrence statistics: A computational study. Behavior Research Methods, 39:510–526, 2007.

 Chugh et al. (2018) Chugh, M., Whigham, P. A., and Dick, G. Stability of word embeddings using word2vec. In AI 2018: Advances in Artificial Intelligence, pp. 812–818, 2018.
 Cotter et al. (2016) Cotter, A., Friedlander, M. P., Goh, G., and Gupta, M. R. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2415–2423, 2016.

 Covington et al. (2016) Covington, P., Adams, J., and Sargin, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), pp. 191–198, 2016.
 Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186, 2019.
 Fard et al. (2016) Fard, M. M., Cormier, Q., Canini, K. R., and Gupta, M. R. Launch and iterate: Reducing prediction churn. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3179–3187, 2016.

 Gardner et al. (2018) Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer, L. AllenNLP: A deep semantic natural language processing platform. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), pp. 1–6, 2018.
 Gordon (2018) Gordon, J. Introducing TensorFlow Hub: A library for reusable machine learning modules in TensorFlow, 2018. URL https://medium.com/tensorflow/introducing-tensorflow-hub-a-library-for-reusable-machine-learning-modules-in-tensorflow-cdee41fa18f9.
 Hamilton et al. (2016) Hamilton, W. L., Leskovec, J., and Jurafsky, D. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1489–1501, 2016.
 Han et al. (2018) Han, X., Cao, S., Lv, X., Lin, Y., Liu, Z., Sun, M., and Li, J. OpenKE: An open toolkit for knowledge embedding. In Proceedings of EMNLP, 2018.
 He et al. (2014) He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., and Candela, J. Q. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (ADKDD), pp. 5:1–5:9, 2014.
 Hellrich & Hahn (2016) Hellrich, J. and Hahn, U. Bad company–neighborhoods in neural embedding spaces considered harmful. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 2785–2796, 2016.
 Hellrich et al. (2019) Hellrich, J., Kampe, B., and Hahn, U. The influence of downsampling strategies on SVD word embedding stability. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pp. 18–26, 2019.
 Hermann & Balso (2017) Hermann, J. and Balso, M. D. Meet michelangelo: Uber’s machine learning platform, 2017. URL https://eng.uber.com/michelangelo/.
 Jin et al. (2016) Jin, C., Kakade, S. M., and Netrapalli, P. Provable efficient online matrix completion via nonconvex stochastic gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4527–4535, 2016.
 Kim (2014) Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, 2014.
 May et al. (2019) May, A., Zhang, J., Dao, T., and Ré, C. On the downstream performance of compressed word embeddings. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
 Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
 Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119, 2013b.
 Pang & Lee (2004) Pang, B. and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 271–278, 2004.
 Pang & Lee (2005) Pang, B. and Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 115–124, 2005.
 Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
 Peters et al. (2018) Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 2227–2237, 2018.
 Pierrejean & Tanguy (2018) Pierrejean, B. and Tanguy, L. Predicting word embeddings variability. In The Seventh Joint Conference on Lexical and Computational Semantics (*SEM), pp. 154–159, 2018.
 Schönemann (1966) Schönemann, P. H. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
 Sell & Pienaar (2019) Sell, T. and Pienaar, W. Introducing Feast: an open source feature store for machine learning, 2019. URL https://cloud.google.com/blog/products/ai-machine-learning/introducing-feast-an-open-source-feature-store-for-machine-learning.
 Shiebler et al. (2018) Shiebler, D., Green, C., Belli, L., and Tayal, A. Embeddings@Twitter, 2018. URL https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html.
 Shu & Nakayama (2018) Shu, R. and Nakayama, H. Compressing word embeddings via deep compositional code learning. In International Conference on Learning Representations (ICLR), 2018.

 Socher et al. (2013a) Socher, R., Chen, D., Manning, C. D., and Ng, A. Y. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems (NeurIPS), pp. 926–934, 2013a.
 Socher et al. (2013b) Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1631–1642, 2013b.
 Tjong Kim Sang & De Meulder (2003) Tjong Kim Sang, E. F. and De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147, 2003.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008, 2017.
 Wendlandt et al. (2018) Wendlandt, L., Kummerfeld, J., and Mihalcea, R. Factors influencing the surprising instability of word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 2092–2102, 2018.
 Wiebe et al. (2005) Wiebe, J., Wilson, T. S., and Cardie, C. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39:165–210, 2005.
 Yang et al. (2019) Yang, Y., Zhou, D.W., Zhan, D.C., Xiong, H., and Jiang, Y. Adaptive deep models for incremental learning: Considering capacity scalability and sustainability. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 74–82, 2019.
 Yin & Shen (2018) Yin, Z. and Shen, Y. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems (NeurIPS), pp. 887–898, 2018.

 Zhu et al. (2015) Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 19–27, 2015.
Appendix A Artifact Appendix
A.1 Abstract
This artifact reproduces the memory-stability tradeoff and the embedding distance measure results for the sentiment analysis experiments on the word2vec CBOW and matrix completion (MC) embedding algorithms. It contains the pretrained CBOW and MC embeddings of six different dimensions, trained on the Wiki'17 and Wiki'18 datasets, and the scripts and data for training the sentiment analysis tasks on these embeddings. It can validate the results in Figures 1 and 2, and Tables 1, 2, and 3, for the sentiment analysis tasks with the MC and CBOW embedding algorithms. We describe the specific steps to reproduce the SST-2 sentiment analysis results (which received the ACM badges); however, the steps can be easily modified to validate the MR, Subj, and MPQA sentiment tasks.
Our experimental pipeline consists of 3 main steps: (1) train and compress embeddings, (2) train downstream models and compute metrics, and (3) run analyses. Because step (1) is very computationally expensive (takes approximately 600 CPU hours to train all CBOW and MC embeddings using 56 threads), we provide the pretrained embeddings (they must still be compressed). This artifact supports reproducing steps (1), (2), and (3), starting from the compression of the embeddings. The full artifact requires 1.1 TB of disk space for storing all embeddings and model output, and requires at least 1 GPU (tested on NVIDIA K80s) for training downstream models. We also provide a lightweight option to start from (3), which does not require training downstream models or space to store embeddings and can be run on a local machine. To do this, we provide CSVs of the precomputed embedding distance measures and downstream instabilities.
A.2 Artifact checklist (meta-information)

Model: Linear bag-of-words model for sentiment analysis (included).

Data set: SST-2 (included). We describe how to modify the scripts for the other provided datasets (MR, Subj, MPQA) in Section A.7.

Runtime environment: Debian GNU/Linux, or Ubuntu 16.04 with CUDA (≥ 9.0).

Hardware: Compute node (Amazon EC2 p2.16xlarge or equivalent) with at least 1 NVIDIA K80 for model training.

Metrics: Embedding distance measures and downstream instability (defined in Section 2).

Output: Reproduces SST-2 results in Figures 1 and 2, and in Tables 1, 2, and 3.

Experiments: Included shell scripts, Jupyter notebook for plotting.

How much disk space required (approximately)?: 900 GB for storing all embeddings, 200 GB for storing SST-2 model and analysis results.

How much time is needed to prepare workflow (approximately)?: 30 minutes for installing dependencies.

How much time is needed to complete experiments (approximately)?: 17 CPU hours for embedding compression, 43 GPU hours for model training, 15 CPU hours for metric computation, and 13 minutes for analysis. Note embedding compression, model training, and metric computation are easily parallelizable.

Publicly available?: Yes.

Code licenses (if publicly available)?: MIT License.
A.3 Description
A.3.1 How to access
Our source code is publicly available on GitHub: https://github.com/HazyResearch/anchorstability. Pretrained embeddings are currently stored in a publicly accessible Google Cloud storage bucket (script to download from the bucket is provided in the GitHub repository in run_get_embs.sh).
We also have the source code and pretrained embeddings permanently available at https://doi.org/10.5281/zenodo.3687120, which obtained the ACM badges.
A.3.2 Hardware dependencies
We recommend an Amazon EC2 p2.16xlarge or equivalent for the embedding compression, model training, and metric computation steps. For a base AMI, we suggest the Deep Learning AMI (Ubuntu 16.04) Version 26.0 (ami-025ed45832b817a35).
A.3.3 Software dependencies
We tested our implementation on Ubuntu 16.04 with CUDA 9.0. We recommend using a conda environment or Python virtualenv, and we provide a requirements.txt
file with the Python dependencies. We tested our implementation with Python 3.6 and PyTorch 1.0.
A.4 Installation
Please see the https://github.com/HazyResearch/anchorstability/blob/master/README.md file for detailed installation instructions and scripts.
A.5 Experiment workflow
We provide shell scripts to reproduce each of the steps. (If resource-limited, you can reproduce only the analysis results: we provide CSVs of the embedding distance measures between pairs of embeddings and the downstream instabilities between pairs of corresponding models.) Here we summarize the workflow; please see the README.md for more detailed instructions and the specific commands to run.

Obtain the pretrained MC and CBOW embeddings trained on Wiki'17 and Wiki'18 and compress all embeddings to b in {1, 2, 4, 8, 16, 32} bits of precision.

Train the downstream models on top of all of the compressed embeddings for the SST-2 task. After the models are done training, compute the embedding distance measures and downstream instability between pairs of embeddings trained on Wiki'17 and Wiki'18, and between their corresponding pairs of models.

Run the analysis script to evaluate the Spearman correlations of the embedding distance measures with the downstream instabilities, and the selection criterion results for the tasks described in Section 5.2. Finally, graph the memory-stability tradeoff results with the Jupyter notebooks provided.
A.6 Evaluation and expected result
Step 3 in Section A.5 should reproduce the results for the CBOW and MC embeddings for the SST-2 sentiment analysis task in Figures 1 and 2, as well as the results in Tables 1, 2, and 3, using the compressed embeddings, trained models, and measured instabilities generated in Steps 1 and 2. Note there might be slight variance in the k-NN results (±0.03 for Spearman correlation and selection error).
Using our provided CSV files (see the results directory), Step 3 should also reproduce the remaining analysis results for the MR, Subj, and MPQA sentiment analysis tasks, as well as the CoNLL-2003 NER task, found in Tables 1, 2, and 3, and the linear-log trends described in Section 3.
A.7 Experiment customization
To reproduce the complete pipeline of results on the MR, Subj, and MPQA tasks, modify the run_models.sh and run_collect_results.sh script to use the MC and CBOW learning rates (MC_LR and CBOW_LR) that we found from our grid search (Appendix C.3.1) for the new task and update the DATASET variable to the new task. Then pass the new task name to run_analysis.sh (e.g., with bash run_analysis.sh mr).
In terms of extending the results, the pretrained embeddings we provide could be used to train more models to further measure the impact of the embedding instability on downstream instability. New embedding distance measures could also be added to anchor/embedding.py and easily evaluated against the measures presented in this paper in terms of their correlation with downstream instability.
A.8 Methodology
Submission, reviewing and badging methodology:
Appendix B Eigenspace Instability: Theory
We present the proof of Proposition 1, which shows that the expected prediction disagreement between the linear regression models trained on embedding matrices X and X̃ is equal to the eigenspace instability measure EI(X, X̃) between X and X̃.
Proposition 1.
Let X, X̃ be two full-rank n-by-d embedding matrices, with rows x_i and x̃_i respectively, and let y be a random regression label vector with zero mean and covariance E[yyᵀ] = Σ. Then the (normalized) expected disagreement between the linear models f(x_i) = x_iᵀŵ and f̃(x̃_i) = x̃_iᵀw̃, where ŵ = argmin_w ||Xw − y||² and w̃ = argmin_w ||X̃w − y||², trained on label vector y using embedding matrices X and X̃ respectively, satisfies

    E_y ||Xŵ − X̃w̃||² / E_y ||y||² = EI(X, X̃).    (2)
Proof.
Let $X = USV^T$ and $\tilde{X} = \tilde{U}\tilde{S}\tilde{V}^T$ be the SVDs of $X$ and $\tilde{X}$ respectively. Recall that the parameter vector $w$ which minimizes $\|Xw - y\|_2$ is given by $w = (X^TX)^{-1}X^Ty$ (where here we use the assumption that $X$ is full-rank to know that $X^TX$ is invertible). Thus, the linear regression model trained on data matrix $X$ with label vector $y$ makes predictions $Xw = X(X^TX)^{-1}X^Ty = UU^Ty$ on the training points. So if we train linear models with data matrices $X$ and $\tilde{X}$, using the same label vector $y$, these models will make predictions $UU^Ty$ and $\tilde{U}\tilde{U}^Ty$ on the training points, respectively. Thus, the expected disagreement between the predictions made using $X$ vs. $\tilde{X}$, over the randomness in $y$, can be expressed as follows:

$\mathbb{E}\big[\|UU^Ty - \tilde{U}\tilde{U}^Ty\|_2^2\big] = \mathbb{E}\big[y^T (UU^T - \tilde{U}\tilde{U}^T)^2 y\big] = \operatorname{tr}\big((UU^T - \tilde{U}\tilde{U}^T)^2 M\big) = \operatorname{tr}\big((UU^T - \tilde{U}\tilde{U}^T)\,M\,(UU^T - \tilde{U}\tilde{U}^T)\big).$

Furthermore, we can easily compute the expected norm of the label vector: $\mathbb{E}\big[\|y\|_2^2\big] = \mathbb{E}[y^Ty] = \operatorname{tr}(M)$.
Thus, we have successfully shown that

$\frac{\mathbb{E}\big[\|Xw - \tilde{X}\tilde{w}\|_2^2\big]}{\mathbb{E}\big[\|y\|_2^2\big]} = \frac{\operatorname{tr}\big((UU^T - \tilde{U}\tilde{U}^T)\,M\,(UU^T - \tilde{U}\tilde{U}^T)\big)}{\operatorname{tr}(M)},$

as desired. ∎
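As a sanity check, the identity above can be verified numerically. The following NumPy sketch uses hypothetical random matrices standing in for real embeddings, checks that the least-squares predictions equal the projection of the labels, and compares the closed-form trace expression against a Monte Carlo estimate of the expected disagreement:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 5

# Random full-rank "embedding" matrices and a random PSD label covariance M.
X = rng.standard_normal((n, d))
Xt = rng.standard_normal((n, d))
A = rng.standard_normal((n, n))
M = A @ A.T

# Orthonormal bases for the column spans (U and U-tilde from the thin SVDs).
U = np.linalg.svd(X, full_matrices=False)[0]
Ut = np.linalg.svd(Xt, full_matrices=False)[0]
P, Pt = U @ U.T, Ut @ Ut.T  # projections X (X^T X)^{-1} X^T and its analogue

# Least-squares predictions on the training points equal P @ y.
y0 = rng.standard_normal(n)
w0 = np.linalg.lstsq(X, y0, rcond=None)[0]
assert np.allclose(X @ w0, P @ y0)

# Closed form from Proposition 1: tr((P - Pt) M (P - Pt)) / tr(M).
D = P - Pt
closed_form = np.trace(D @ M @ D) / np.trace(M)

# Monte Carlo estimate of E||Xw - Xt wt||^2 / E||y||^2 over y ~ N(0, M).
Y = np.linalg.cholesky(M) @ rng.standard_normal((n, 30000))
mc = np.sum((D @ Y) ** 2) / np.sum(Y ** 2)
print(closed_form, mc)  # the two agree up to Monte Carlo error
```

The Monte Carlo ratio converges to the trace expression as the number of sampled label vectors grows.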
b.1 Efficiently Computing the Eigenspace Instability Measure
We now discuss an efficient way of computing the eigenspace instability measure, assuming $M = \sum_i \lambda_{1,i}^{\alpha} u_{1,i} u_{1,i}^T + \sum_i \lambda_{2,i}^{\alpha} u_{2,i} u_{2,i}^T$, as discussed in Section 4.1. Here, $E_1$ and $E_2$ correspond to fixed embedding matrices (in our experiments, $E_1$ and $E_2$ are the highest-dimensional, full-precision embeddings for Wiki’17 and Wiki’18, respectively), $E_1 = U_1 S_1 V_1^T$ and $E_2 = U_2 S_2 V_2^T$ are the SVDs of $E_1$ and $E_2$ respectively, $u_{j,i}$ denotes the $i^{\text{th}}$ column of $U_j$, $\lambda_{j,i}$ denotes the corresponding eigenvalue of $E_j E_j^T$, and $\alpha \geq 0$ is a hyperparameter.
Recall the definition of the eigenspace instability measure:

$\mathrm{EI}_M(X, \tilde{X}) = \frac{\operatorname{tr}\big((UU^T - \tilde{U}\tilde{U}^T)\,M\,(UU^T - \tilde{U}\tilde{U}^T)\big)}{\operatorname{tr}(M)}.$

We now show that both traces in this expression can be computed efficiently.

(3)  $\operatorname{tr}\big((UU^T - \tilde{U}\tilde{U}^T)\,M\,(UU^T - \tilde{U}\tilde{U}^T)\big) = \sum_{j \in \{1,2\}} \sum_i \lambda_{j,i}^{\alpha}\, \big\|(UU^T - \tilde{U}\tilde{U}^T)\, u_{j,i}\big\|_2^2 = \sum_{j \in \{1,2\}} \sum_i \lambda_{j,i}^{\alpha} \Big( \|U^T u_{j,i}\|_2^2 + \|\tilde{U}^T u_{j,i}\|_2^2 - 2\,(U^T u_{j,i})^T (U^T \tilde{U}) (\tilde{U}^T u_{j,i}) \Big)$

(4)  $\operatorname{tr}(M) = \sum_{j \in \{1,2\}} \sum_i \lambda_{j,i}^{\alpha}$, since the $u_{j,i}$ are unit vectors.
We now note that the trace in Equation (4), and all the matrix products in Equation (3), can be computed efficiently and with low memory (there is no need to ever store an $n$ by $n$ Gram matrix, for example), assuming the embedding matrices are “tall and thin” (large vocabulary, relatively low-dimensional). More specifically, letting $d_{\max}$ denote the largest of the dimensions of the four embedding matrices involved ($X$, $\tilde{X}$, and the two fixed reference embeddings), the eigenspace instability measure can be computed in time $O(n d_{\max}^2)$ and memory $O(n d_{\max})$. Thus, even for large vocabulary $n$, the eigenspace instability measure can be computed relatively efficiently (assuming the dimension isn’t too large).
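A low-memory computation along these lines can be sketched as follows, assuming $M = W\,\mathrm{diag}(\lambda)\,W^T$ with unit-norm columns $w_i$ in $W$ (the form discussed in Section 4.1); `eigenspace_instability` is a hypothetical helper name, and the naive dense computation is included only to check it on a small problem:

```python
import numpy as np

def eigenspace_instability(U, Ut, W, lam):
    # M = W diag(lam) W^T, with unit-norm columns w_i in W. Computes
    # tr((U U^T - Ut Ut^T) M (U U^T - Ut Ut^T)) / tr(M) without ever
    # forming an n-by-n matrix.
    Aw = U.T @ W            # d1 x k
    Bw = Ut.T @ W           # d2 x k
    C = U.T @ Ut            # d1 x d2 cross-term
    # ||(P - Pt) w||^2 = ||U^T w||^2 + ||Ut^T w||^2 - 2 (U^T w)^T C (Ut^T w)
    per_dir = (Aw * Aw).sum(0) + (Bw * Bw).sum(0) \
        - 2 * np.einsum('ij,ij->j', Aw, C @ Bw)
    return float(lam @ per_dir / lam.sum())

# Check against the naive dense computation on a small problem.
rng = np.random.default_rng(1)
n = 30
U = np.linalg.qr(rng.standard_normal((n, 4)))[0]
Ut = np.linalg.qr(rng.standard_normal((n, 6)))[0]
W = np.linalg.qr(rng.standard_normal((n, 8)))[0]
lam = rng.uniform(0.1, 1.0, size=8)
M = W @ np.diag(lam) @ W.T
Dp = U @ U.T - Ut @ Ut.T
naive = np.trace(Dp @ M @ Dp) / np.trace(M)
```

All intermediate products are at most $n \times d_{\max}$, so the memory footprint stays linear in the vocabulary size.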
Appendix C Experimental Setup Details
We discuss the experimental protocols used for each of our experiments. In Appendix C.1, we discuss the training procedures for the word embeddings, and in Appendix C.2, we discuss how we compress and post-process the embeddings. In Appendix C.3, we describe the models, datasets, and training procedures used for the downstream tasks in our study, and in Appendix C.4, we discuss how we analyze the instability trends we observe on these tasks. Finally, in Appendix C.5 and Appendix C.6, we describe the setup details for the extension experiments on knowledge graph and contextual word embeddings, respectively.
c.1 Word Embedding Training
We use Google’s C implementation of word2vec CBOW (https://github.com/tmikolov/word2vec), the original GloVe implementation (https://github.com/stanfordnlp/GloVe), and our own C++ implementation of MC to train word embeddings. For CBOW, we use the default learning rate. For GloVe, we use a learning rate of 0.01 (as the default of 0.05 resulted in NaNs on 800-dimensional Wiki embeddings). For MC, since we are using our own implementation, we use a learning rate which we found to achieve low loss on Wiki’17. We include the full details on the hyperparameters used for all three embedding algorithms in Table 4.
Algorithm  Hyperparameter  Value
Shared  Training epochs  50
Shared  Window size  15
Shared  Minimum count  5
Shared  Threads  56
CBOW  Learning rate  0.05
CBOW  Negative samples  5
GloVe  Learning rate  0.01
GloVe  x_max  100
GloVe  α (weighting exponent)  0.75
MC  Learning rate  0.2
MC  LR decay epochs  20
MC  Batch size  128
MC  Stopping tolerance  0.0001
c.2 Word Embedding Compression and Post-Processing
We now discuss some important implementation details for uniform quantization related to stability. We use the techniques and implementation from May et al. (2019). To minimize confounding factors with stability, we use deterministic rounding for each word. The bounds of the interval for uniform quantization are determined by computing an optimal clipping threshold based on the distribution of the real numbers to be quantized. As we assume that the Wiki’17 and Wiki’18 embeddings of a pair have similar distributions of vector values, we use the same clipping threshold for both embeddings to avoid unnecessary sources of instability, and we compute this threshold using the Wiki’17 embedding. Finally, we apply orthogonal Procrustes to align the Wiki’18 embedding to the Wiki’17 embedding before compressing the embeddings and training downstream models. Preliminary results indicated that this alignment decreased instability, particularly at high compression rates, and we use this technique throughout our experiments.
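This pipeline can be sketched in NumPy; `procrustes_align`, `uniform_quantize`, and the percentile-based clipping rule below are illustrative stand-ins (the paper uses the optimal clipping threshold of May et al. (2019), not a percentile):

```python
import numpy as np

def procrustes_align(X_src, X_ref):
    # Orthogonal Procrustes: rotate X_src by the orthogonal R minimizing
    # ||X_src @ R - X_ref||_F, where X_src^T X_ref = U S V^T.
    U, _, Vt = np.linalg.svd(X_src.T @ X_ref)
    return X_src @ (U @ Vt)

def uniform_quantize(X, bits, clip):
    # Deterministic uniform quantization to 2**bits levels on [-clip, clip].
    levels = 2 ** bits - 1
    Xc = np.clip(X, -clip, clip)
    code = np.round((Xc + clip) / (2 * clip) * levels)  # integer codes 0..levels
    return code / levels * (2 * clip) - clip

rng = np.random.default_rng(0)
emb17 = rng.standard_normal((1000, 25))  # toy stand-ins for Wiki'17 / Wiki'18
emb18 = rng.standard_normal((1000, 25))

# Shared clipping threshold computed from the Wiki'17 embedding only.
clip = np.percentile(np.abs(emb17), 99.9)

emb18_aligned = procrustes_align(emb18, emb17)  # align before compressing
q17 = uniform_quantize(emb17, bits=4, clip=clip)
q18 = uniform_quantize(emb18_aligned, bits=4, clip=clip)
```

Sharing one threshold and one rotation across the pair removes two incidental sources of disagreement before the downstream models are ever trained.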
c.3 Downstream Tasks
We discuss the models, datasets, and training procedures we use for the sentiment analysis and NER tasks.
c.3.1 Sentiment Analysis
We use a simple bag-of-words model for sentiment analysis. The goal of the task is to classify a sentence as positive or negative. For each sentence, the bag-of-words model averages the word embeddings of the words in the sentence and then passes the sentence embedding through a linear classifier. This simple model allows us to study the impact of the embedding on the downstream task in a controlled setting, where the downstream model itself is expected to be fairly stable.
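A minimal sketch of this bag-of-words model, together with the prediction-disagreement (instability) metric used throughout the paper, with hypothetical helper names:

```python
import numpy as np

def sentence_embedding(emb, word_idxs):
    # Bag-of-words: average the embeddings of the words in the sentence.
    return emb[word_idxs].mean(axis=0)

def predict(emb, sentences, w, b):
    # Linear classifier on the averaged embedding; 1 = positive, 0 = negative.
    feats = np.stack([sentence_embedding(emb, s) for s in sentences])
    return (feats @ w + b > 0).astype(int)

def prediction_disagreement(preds_a, preds_b):
    # Downstream instability: percentage of predictions that differ between
    # the models trained on two different embeddings.
    return 100.0 * float(np.mean(preds_a != preds_b))
```

Comparing `predict` outputs from models trained on a Wiki’17 embedding and a Wiki’18 embedding with `prediction_disagreement` gives exactly the instability percentage reported in our experiments.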
We use four datasets for the sentiment analysis task: SST2, MR, Subj, and MPQA. These are the four largest binary classification datasets used in Kim (2014) (https://github.com/harvardnlp/sent-conv-torch/tree/master/data). We use the given train/validation/test splits for SST2. For MR, Subj, and MPQA, which do not have these splits, we take 10% of the data for the validation set, 10% for the test set, and use the remaining 80% for the training set.
We tune the learning rate for each dataset and embedding algorithm. We use the 400-dimensional Wiki’17 embeddings to tune the learning rate over the grid {1e-6, 1e-5, 0.0001, 0.001, 0.01, 0.1, 1}. We choose the learning rate which achieves the highest validation accuracy on average across three seeds for each dataset, and report the selected values in Table 5(a). To avoid choosing unstable learning rates, we also throw out learning rate values for which the validation errors increase by 15% or more between any consecutive epochs. We include the hyperparameters shared among all datasets in Table 5(b).
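The selection rule can be sketched as follows, reading the 15% criterion as a relative increase in validation error between consecutive epochs (an assumption on our part; the helper names are hypothetical):

```python
def is_stable(val_errors, max_rel_increase=0.15):
    # Reject a learning rate whose validation error grows by 15% or more
    # between any two consecutive epochs (relative increase).
    return all(e2 < e1 * (1 + max_rel_increase)
               for e1, e2 in zip(val_errors, val_errors[1:]))

def select_learning_rate(histories):
    # histories maps lr -> (mean validation accuracy, per-epoch val errors).
    # Keep only the stable learning rates, then pick the most accurate one.
    stable = {lr: acc for lr, (acc, errs) in histories.items()
              if is_stable(errs)}
    return max(stable, key=stable.get)
```

The filter runs before the accuracy comparison, so a learning rate with the best accuracy but an erratic validation curve is never selected.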


c.3.2 Named Entity Recognition
We use the single-layer BiLSTM model from Akbik et al. (2018) for named entity recognition (https://github.com/zalandoresearch/flair). We turn off the conditional random field (CRF) for computational efficiency and include a smaller subset of results with the CRF turned on in Appendix E.2.
We use the standard English CoNLL-2003 dataset with the default setup for dataset splits Tjong Kim Sang & De Meulder (2003). Following Gardner et al. (2018) (https://github.com/allenai/allennlp), we ignore article divisions (denoted with “-DOCSTART-”) and do not consider them as sentences.
We tune the learning rate per embedding algorithm, and otherwise follow the training hyperparameter settings of Akbik et al. (2018). Using the 400-dimensional Wiki’17 embeddings, we sweep the learning rate over the grid {0.001, 0.01, 0.1, 1, 10}, and choose the one which achieves the highest validation micro-F1 score on average across three seeds for each embedding algorithm. We train with vanilla SGD without momentum and use learning rate decay, stopping early if the learning rate becomes too small. We provide the selected learning rates in Table 6(a) and the hyperparameters shared across embeddings in Table 6(b).


c.4 Fitting Linear-Log Models to Trends
We describe in detail how we fit linear-log models to the memory, dimension, and precision trends in Section 3.3. To propose the simple rule of thumb relating stability and memory, we consider 10 tasks to form a data matrix for the linear-log model: 5 downstream tasks (the four sentiment tasks in our study and the NER task) for two embedding algorithms (CBOW and MC embeddings). Let $N$ denote the number of Wiki’17/Wiki’18 pairs of embedding matrices from our experiments which correspond to a combination of dimension $d$, precision $b$, and random seed (we consider 3 random seeds) such that the number of bits per row $d \cdot b$ is less than our memory cutoff; in our case $N = 63$, because we have 3 random seeds and 21 pairs of dimension and precision satisfying the cutoff. For each task $t$ (out of $T = 10$ total tasks), we construct a data matrix $A_t$ and a label vector $y_t$ as follows: each row in $A_t$ corresponds to one of the above pairs of Wiki’17/Wiki’18 embedding matrices. For each of these embedding matrix pairs, we compute the memory $m$ in bits occupied per row of the embedding matrices, as well as the downstream prediction disagreement percentage between the models trained on those embeddings. We then set the corresponding row in $A_t$ to be $[\log_2(m),\, e_t]$, where $e_t \in \{0,1\}^T$ is a binary vector with a one at index $t$ and zeros everywhere else, and the corresponding entry of $y_t$ to the prediction disagreement; note that appending $e_t$ allows us to learn a different bias term (i.e., intercept) per task. We then vertically concatenate all the matrices $A_t$ and label vectors $y_t$ to form a single data matrix $A$ and label vector $y$. To fit our linear-log model, we solve the least squares problem $\min_\beta \|A\beta - y\|_2^2$ using the closed-form solution $\beta = (A^TA)^{-1}A^Ty$. Given $\beta$, for each task $t$ we can extract the fitted linear-log trend: the predicted disagreement is $\beta_1 \log_2(m) + \beta_{1+t}$, where $\beta_1$ is the first element of $\beta$ (the shared slope) and $\beta_{1+t}$ is the $(1+t)^{\text{th}}$ element of $\beta$ (the intercept for task $t$). The fitted slope implies that doubling the memory of the embeddings on average leads to a 1.3% reduction in downstream prediction disagreement.
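The fit can be sketched on synthetic data with a shared log-slope of −1.3 (mirroring the fitted trend) and hypothetical per-task intercepts; the design matrix rows are [log2(memory), one-hot task indicator] as described above:

```python
import numpy as np

T, mems = 3, [100, 200, 400, 800, 1600]   # tasks and memory (bits per row)
true_slope, intercepts = -1.3, [30.0, 20.0, 10.0]  # synthetic ground truth

rows, labels = [], []
for t in range(T):
    for m in mems:
        onehot = np.eye(T)[t]
        rows.append(np.concatenate(([np.log2(m)], onehot)))  # [log2(m), e_t]
        labels.append(true_slope * np.log2(m) + intercepts[t])
A, y = np.array(rows), np.array(labels)

# Closed-form least squares: beta = (A^T A)^{-1} A^T y.
beta = np.linalg.solve(A.T @ A, A.T @ y)
slope, fitted_intercepts = beta[0], beta[1:]
```

Because the slope column is shared across tasks while each task keeps its own intercept column, the fit recovers a single "per doubling of memory" effect pooled over all tasks.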
To fit the individual dimension and precision linear-log trends, we follow a protocol very similar to the above. For the dimension (respectively, precision) trend, the primary difference is that instead of having an independent intercept term per task, we have an independent intercept term for each combination of task and precision (resp., dimension). Furthermore, in the rows of the data matrices, instead of $\log_2$ of the memory $m$, we use $\log_2$ of the dimension $d$ (resp., precision $b$).
We also use the linear-log model for stability-memory to compute the minimum and maximum relative percentage decreases in downstream instability when increasing the memory of word embeddings. In particular, our goal is to understand how large the 1.3% decrease in prediction disagreement is in relative terms. To do this, we consider the combination of downstream task and embedding algorithm which is most stable at high memory (task: Subj; embedding algorithm: CBOW), and the combination which is least stable at low memory (task: MR; embedding algorithm: MC). At these extreme points, the instability is approximately 2.2% and 25.9%, respectively. A 1.3% absolute decrease in instability from 3.5% to 2.2% corresponds to a relative decrease of approximately 37% (1.3/3.5 ≈ 0.37). Similarly, a 1.3% absolute decrease in instability from 25.9% to 24.6% corresponds to a relative decrease of approximately 5% (1.3/25.9 ≈ 0.05). Thus, we conclude that this 1.3% absolute decrease in instability corresponds to a relative decrease in instability between 5% and 37%, across the tasks and embedding algorithms we consider.
We repeat the procedures above to fit a linear-log model to the stability-memory trend for knowledge graph embeddings in Section 6.1.
c.5 Knowledge Graph Embeddings
We use the OpenKE repository (https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch) to generate knowledge graph embeddings Han et al. (2018). We follow the training hyperparameters described in Bordes et al. (2013) for TransE embeddings on the FB15K dataset where available, and use the default parameters from the OpenKE repository otherwise. We modify the repository to implement the early stopping procedure and entity-embedding normalization of Bordes et al. (2013). We additionally sweep the learning rate over {1e-5, 0.0001, 0.001, 0.01, 0.1} using dimension 50 on the FB15K-95 dataset, and choose the learning rate which attains the lowest mean rank (i.e., highest quality) on the validation set for the link prediction task. We include the full hyperparameters in Table 7. We also note that, unlike with word embeddings, we do not align embeddings with orthogonal Procrustes before compressing them with uniform quantization: we found alignment to result in a quality drop on knowledge graph embeddings, likely because two sets of embeddings are jointly learned (relation and entity embeddings), which would require more advanced alignment techniques.
Hyperparameter  Value 
Optimizer  SGD 
Max. training epochs  1000 
Num. batches  100 
Threads  8 
Early stopping patience  10 
Head/tail replacement strategy  Uniform 
Entity negative rate  1 
Relation negative rate  0 
Margin  1 
Distance  
Learning rate  0.001 
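For reference, TransE scoring and threshold-based triplet classification can be sketched as follows, assuming the L2 distance (the helper names and toy embeddings are illustrative, not OpenKE’s API):

```python
import numpy as np

def transe_score(ent, rel, h, r, t):
    # TransE plausibility: negative L2 distance between (head + relation)
    # and tail; a perfect triplet scores 0, worse triplets score lower.
    return -np.linalg.norm(ent[h] + rel[r] - ent[t])

def classify_triplets(ent, rel, triplets, threshold):
    # Triplet classification: predict "true" iff the score clears the
    # threshold (tuned on validation data in the experiments).
    return [transe_score(ent, rel, h, r, t) > threshold
            for h, r, t in triplets]

# Toy example: entity 1 sits exactly at entity 0 translated by relation 0,
# while entity 2 does not.
ent = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rel = np.array([[1.0, 0.0]])
```

The per-dataset threshold tuning mentioned in Appendix D.6 amounts to choosing `threshold` separately for each evaluation set.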
c.6 Contextual Word Embeddings
To study the downstream instability of contextual word embeddings, we pretrain BERT Devlin et al. (2019) models and then use them as fixed feature extractors to train downstream task models. We use BERT without fine-tuning its parameters for downstream tasks because our goal is to isolate and study the instability resulting from the difference in pretraining corpora; this is analogous to our study in Section 3 on the instability of conventional fixed pretrained embeddings.
Pretraining
In the pretraining phase, we use Wikipedia dumps (the major component of the corpus used by Devlin et al. (2019)) to train the BERT models. We use the Wiki’17 and Wiki’18 dumps, respectively, for pretraining to study the instability introduced by the change in corpora. We pretrain BERT models with 3 transformer layers on 10% subsampled articles from the Wikipedia dumps, which amounts to approximately 200 million tokens. We use these shallower BERT models on the subsampled pretraining corpus to allow for computationally feasible training of BERT models with different transformer output dimensionalities. As our corpus size differs from the one used for the original BERT model Devlin et al. (2019), we first grid search the pretraining learning rate on the subsampled Wiki’17 corpus using the same transformer output dimensionality as BERT_{BASE}. We then use this learning rate to pretrain the BERT models with different transformer dimensionalities for both the Wiki’17 and Wiki’18 corpora (following the experiment design for pretrained word embeddings, we use the same learning rate for pretraining BERT models with different transformer configurations).
Downstream Evaluation
To evaluate the downstream instability of pretrained BERT models, we take BERT model pairs with the same model configuration but trained on Wiki’17 and Wiki’18, respectively. We measure the percentage of disagreement in downstream task predictions of the BERT pairs as a proxy for downstream instability. Specifically, we evaluate the instability on the sentiment analysis task using the SST2, Subj, MR, and MPQA datasets. In these tasks, we use linear bag-of-words models on top of the last transformer layer output; this output acts as the contextual word vector representation. To train the sentiment analysis task models, we first grid search the learning rate using BERT with 768-dimensional transformer output (the dimensionality of the original BERT_{BASE} Devlin et al. (2019)) for each dataset and choose the value with the highest validation accuracy. We then use this learning rate to train the sentiment analysis models using the different pretrained BERT models. To ensure statistically meaningful results, we use three random seeds to pretrain the BERT models and train the downstream sentiment analysis models. We otherwise use the same hyperparameters reported in Table 5(b).
Appendix D Extended Empirical Results
We now present additional experimental results to further validate the claims in this paper and provide deeper analysis of our results. We organize this section as follows:

In Appendix D.1, we present additional results showing that the stability-memory trends (and individual dimension and precision trends) hold on sentiment analysis tasks.

In Appendix D.2, we evaluate another important property, quality, exploring the tradeoffs of quality with memory and stability for the tasks in our study.

In Appendix D.3, we discuss how we choose the additional hyperparameters required for both the kNN measure and the eigenspace instability measure.

In Appendix D.4, we use visualizations to further analyze the relationship between the downstream instability and the embedding distance measures.

In Appendix D.5, we include additional results on sentiment analysis tasks for the evaluation of the embedding distance measures. We also evaluate the worst-case performance of the embedding distance measures as selection criteria, showing that the eigenspace instability measure and kNN measure remain the top-performing measures overall.

In Appendix D.6, we experiment with a modified setup for the triplet classification task, showing that the trends continue to hold, but the instability plateaus faster under this modification.
d.1 Stability-Memory Tradeoff
We validate that the stability-memory tradeoff holds on three more sentiment tasks (Subj, MR, and MPQA) for dimension and precision, first in isolation and then together. As always, we train embeddings and downstream models over three seeds, and the error bars indicate the standard deviation over these seeds. In Figure 4, we see more evidence that as the dimension increases, the downstream instability often decreases, with the trends more consistent for lower-precision embeddings. In Figure 5, we further validate that as the precision increases, the downstream instability decreases. Finally, in Figure 6, we show on all four sentiment tasks (SST2, Subj, MR, and MPQA) that when jointly varying dimension and precision, the instability decreases as the memory increases.
d.2 Quality Tradeoffs
We also evaluate the quality-memory and quality-stability tradeoffs for the CBOW and MC embedding algorithms, finding that, like stability, quality also increases with the embedding memory. In Figures 7 (a) and 8 (a), we show the quality-memory tradeoff across sentiment analysis and NER tasks and CBOW and MC embedding algorithms for different dimension-precision combinations. We see that the dimension tends to impact the quality significantly more than the precision (i.e., changing the dimension at a fixed precision affects the quality more than changing the precision at a fixed dimension does). Recall that, in contrast, for instability we saw in Section 3.3 that the precision actually had a slightly greater effect than the dimension. In Figures 7 (b) and 8 (b) we also show the quality-stability tradeoffs. For many of the sentiment analysis tasks, there is not significant evidence of a strong relationship between the two; however, for the NER task, we can clearly see that as the instability increases, the quality decreases. For several of the tasks (e.g., CBOW on MR; CBOW on MPQA), we see that across different precisions (i.e., lines), the instability changes significantly while the quality remains relatively constant. This aligns with the previous observation that the precision tends to impact the instability more than it does the quality. Similarly, across different dimensions (i.e., points), the quality can change significantly while the instability stays relatively constant, especially at higher precisions (e.g., CBOW on SST2; CBOW on MPQA).
d.3 Selecting Hyperparameters for Embedding Distance Measures
The eigenspace instability measure and the kNN measure each have a single hyperparameter to tune. For the eigenspace instability measure, α determines how heavily the directions of high variance (large eigenvalues) are weighted. For the kNN measure, k determines how many neighbors are compared for each query word. To tune these hyperparameters, we compute the Spearman correlation between the embedding distance measure and the downstream prediction disagreement on the validation datasets for the five tasks in our study and the MC and CBOW embedding algorithms. In Table 8(a) we report the average Spearman correlation for different values of α for the eigenspace instability measure, and in Table 8(b) we report the average Spearman correlation for different values of k for the kNN measure. Based on these results, we use the top-performing values of α and k for our experiments throughout the paper.
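The tuning procedure can be sketched as follows; `spearman` (rank-based, ignoring ties) and `best_hyperparameter` are hypothetical helpers, not the repository’s API:

```python
import numpy as np

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the ranks
    # (double argsort yields ranks; assumes no ties).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def best_hyperparameter(candidates, measure_values, disagreements):
    # measure_values[c] lists the distance measure under hyperparameter c
    # for each embedding pair; pick the c whose values correlate best with
    # the observed downstream prediction disagreements.
    return max(candidates,
               key=lambda c: spearman(measure_values[c], disagreements))
```

Averaging the per-task correlations before taking the maximum, as done for Tables 8(a) and 8(b), is a straightforward extension of this sketch.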


d.4 Predictive Performance of the Eigenspace Instability Measure
We now provide additional results validating the strong relationship between the eigenspace instability measure and downstream instability. In addition to the Spearman correlation results in Table 1, we visualize downstream instability vs. embedding distance measure for the CoNLL-2003 NER task in Figure 9 with CBOW and MC embeddings, taking the average over three seeds. We see that the kNN measure and the eigenspace instability measure achieve strong correlations, as the lines are generally monotonically increasing for both the CBOW and MC embedding algorithms.
d.5 Embedding Distance Measures for Dimension-Precision Selection
We first include Spearman correlation results and selection task results for CBOW, GloVe, and MC on the two additional downstream tasks (MR and MPQA) in Tables 9(a), 9(b), and 9(c), where we see that the eigenspace instability measure and the kNN measure continue to outperform the other measures.



We also evaluate the worst-case performance of the embedding distance measures when used as a selection criterion for dimension-precision parameters. First, on the easier task of choosing the more stable dimension-precision pair out of two choices, we define the worst-case performance as the maximum increase in instability that may occur by using the embedding distance measure to choose the dimension-precision parameters (rather than the ground-truth choice). On the more challenging task of choosing the most stable dimension-precision pair under a memory budget, we define the worst-case performance as the worst-case absolute percentage error relative to the oracle parameters under a given memory budget. We see in Tables 10 and 11 that the eigenspace instability measure and kNN measure are the top-performing measures overall across both tasks.
Downstream Task  SST2  Subj  CoNLL-2003
Embedding Algorithm  CBOW  GloVe  MC  CBOW  GloVe  MC  CBOW  GloVe  MC
Eigenspace Instability  10.43  6.48  13.18  3.50  3.00  3.40  3.30  4.04  4.11
kNN  10.43  4.78  11.75  2.80  3.00  3.40  2.17  2.39  3.16
Semantic Displacement  11.70  9.23  16.80  5.40  6.00  7.10  5.13  8.05  7.03
PIP Loss  16.14  14.61  15.76  6.40  9.10  4.40  6.78  9.69  5.77
Eigenspace Overlap  12.69  11.92  16.80  5.50  8.50  7.10  5.86  9.23  7.03
Downstream Task  SST2  Subj  CoNLL-2003
Embedding Algorithm  CBOW  GloVe  MC  CBOW  GloVe  MC  CBOW  GloVe  MC
Eigenspace Instability  3.08  4.78  11.37  1.80  2.50  2.40  0.84  1.96  1.73
kNN  3.02  4.78  11.37  1.80  2.50  2.60  1.29  1.96  1.01
Semantic Displacement  2.47  5.82  13.95  1.80  2.70  3.10  1.73  3.07  3.92
PIP Loss  7.96  6.37  13.95  3.30  3.10  3.10  2.02  2.00  3.92
Eigenspace Overlap  10.43  5.82  13.95  1.80  2.70  3.10  1.29  3.07  3.92
High Precision  10.43  5.82  13.95  1.80  2.70  3.10  2.03  3.07  3.92
Low Precision  7.96  6.37  5.33  3.30  3.10  4.60  2.02  2.00  2.33
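The budgeted selection task and its worst-case evaluation can be sketched as follows, taking memory per row as dimension × precision bits (the helper names are hypothetical):

```python
def most_stable_under_budget(configs, budget, predicted):
    # configs: list of (dim, prec) pairs; memory per row = dim * prec bits.
    # Pick the feasible pair the distance measure predicts to be most stable.
    feasible = [c for c in configs if c[0] * c[1] <= budget]
    return min(feasible, key=lambda c: predicted[c])

def worst_case_gap(configs, budgets, predicted, true_instability):
    # Worst-case increase in true instability (in % disagreement) from
    # following the measure's pick instead of the oracle's, over all budgets.
    gap = 0.0
    for b in budgets:
        feasible = [c for c in configs if c[0] * c[1] <= b]
        if not feasible:
            continue
        pick = min(feasible, key=lambda c: predicted[c])
        oracle = min(feasible, key=lambda c: true_instability[c])
        gap = max(gap, true_instability[pick] - true_instability[oracle])
    return gap
```

A measure with a small worst-case gap is safe to use as a selection criterion even when its ranking of individual pairs is imperfect.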
d.6 Knowledge Graph Embeddings
In Section 6.1, we showed that as the memory of the TransE embeddings increases, the instability on the link prediction and triplet classification tasks decreases; we now experiment with a modified setup for the triplet classification experiments. In Figure 10, we use thresholds tuned per dataset (in Figure 3 (right) we use the same threshold on both the FB15K-95 and FB15K datasets) and see that the stability-memory tradeoffs are less pronounced for higher precisions.
d.7 Contextual Word Embeddings
We include the plots for the contextual word embedding experiments with BERT embeddings in Figures 10(a) and 10(b) for dimension and precision, respectively. As discussed in Section 6.2, although noisier than the trends with pretrained word embeddings, we see that generally as the dimension and precision increase, the downstream instability decreases.
Appendix E Robustness of Trends
We explore the robustness of our study with preliminary investigations of the effects of subword embedding algorithms, more complex downstream models, other sources of randomness introduced by the downstream model (e.g., model initialization and sampling order), fine-tuning embeddings downstream, and the downstream model learning rate on downstream instability.
e.1 Subword Embeddings
We experiment with fastText Bojanowski et al. (2017) embeddings to evaluate whether the stability-memory tradeoff holds for subword embedding methods. In Figure 12, we see that overall, as the memory increases, the downstream instability decreases on the SST2 and CoNLL-2003 tasks. However, the trend with respect to dimension is weaker on the SST2 task at larger precisions.
e.2 Complex Downstream Models
In the main text, our primary downstream models are a simple linear bag-of-words model for sentiment analysis and a single-layer BiLSTM for NER. We now demonstrate that more complex downstream models, such as CNNs or BiLSTM-CRFs, also exhibit the stability-memory tradeoff: as the memory increases, the instability decreases. In Figure 12(a), we show that when using a CNN for the SST2 sentiment analysis task, embeddings with very low memory budgets result in high instability. The instability quickly plateaus as the memory budget grows for the CBOW embeddings, but continues to decrease over a wider range of memory budgets for the MC embeddings. The CNN architecture has one convolutional layer, with kernels of widths 3, 4, and 5, and 100 output channels. The convolutional layer is followed by a ReLU layer, a max-pooling layer, and finally a linear classification layer. We sweep the learning rate over the grid {1e-5, 0.0001, 0.001, 0.01, 0.1} and choose the best learning rate by validation accuracy. Selected learning rates are shown in Table 12(a) and shared hyperparameters are shown in Table 12(b).


We now demonstrate that the BiLSTM-CRF is also subject to the stability-memory tradeoff: as the dimension and precision increase, the instability decreases (Figure 12(b)). We use the same hyperparameters as in Table 6(b) and repeat our setup for the BiLSTM with the CRF turned on for the CoNLL-2003 NER task. Because the CRF is computationally expensive, we train a representative subset of points (dimensions in {25, 100, 800} and precisions in {1, 4, 32}). For each embedding algorithm, we grid search the learning rate for the BiLSTM-CRF with 400-dimensional embeddings over {0.001, 0.01, 0.1, 1.0, 10.0} and find that a learning rate of 0.1 is best for both embedding algorithms.
e.3 Sources of Randomness Downstream
We study the impact of two sources of randomness in the downstream model training, the model initialization seed and the sampling seed, on downstream instability. First, we fix the embeddings and vary the model initialization seed and sampling seed independently. We vary the sampling order by shuffling the order of the batches in the training dataset. We compare the instability from these sources of randomness in the downstream model training to the instability from the embeddings. For each source of randomness downstream, we fix the embedding (using a single seed of the Wiki’17, full-precision, 400-dimensional embedding) and measure the instability between models trained with different random seeds. We repeat over three pairs of models and report the average. We see in Table 13 that across the four sentiment analysis tasks using linear bag-of-words models, the sampling order seed introduces instability comparable to that from the change in embedding training data with full-precision, 400-dimensional embeddings, while the model initialization seed often contributes less instability. However, as shown in Figure 6, using smaller memory budgets for the embeddings introduces much greater instability from the change in embedding training data.
In our experiments, we had also fixed the model initialization seeds and sampling order seeds to match that of the embedding, such that the seeds were the same between any two models we compared. We now remove this constraint and vary the model initialization and sampling order seed of the model corresponding to the Wiki’18 embedding, such that no two models compared have the same seeds, and otherwise repeat the experimental protocol described in Section 3 and Appendix C.3. In Figure 13(a), we see that the stability-memory tradeoffs continue to hold and the trends are very similar to those with fixed seeds (Figure 2). We note that many of the instability values themselves, particularly for CBOW, are slightly higher in Figure 13(a) than in Figure 2, likely due to the additional instability from the change in downstream model initialization and sampling seeds.
Downstream Task  SST2  MR  Subj  MPQA  
Embedding Algorithm  CBOW  MC  CBOW  MC  CBOW  MC  CBOW  MC 
Model Initialization Seed  3.48  7.08  2.44  9.28  1.10  4.53  1.45  4.30 
Sampling Order Seed  8.99  5.96  5.87  10.09  0.57  6.13  5.59  1.92 
Embedding Training Data  6.59  8.66  4.00  11.22  1.50  3.40  3.30  4.78 
e.4 Effect of Fine-tuning Embeddings Downstream
We study the impact of fine-tuning the embeddings downstream and find that the stability-memory tradeoff becomes noisier but continues to hold under fine-tuning, and that fine-tuning can dramatically decrease the downstream instability. In Figure 13(b), we show that as the memory increases, the instability generally decreases for both CBOW and MC embeddings, even when we allow the embeddings to be updated (i.e., fine-tuned) when training the downstream models. We note that we do not compress the embeddings during training in these experiments; the memory therefore denotes the memory required to store the embedding prior to training. To perform the fine-tuning experiments, we follow the procedure described in Appendix C.3 and perform an additional learning rate sweep per embedding algorithm with fine-tuning over the grid {1e-5, 0.0001, 0.001, 0.01, 0.1, 1, 10}. We found the optimal learning rate for both algorithms on the SST2 sentiment analysis task with fine-tuning to be 0.0001. We also see that overall the instability decreases with fine-tuning compared to fixing the embeddings (as we did in Figure 2). We note that the learning rate for the downstream model with MC and fine-tuning is smaller than with fixed embeddings, which may also contribute to the reduced instability; however, from Figure 15, the reduction in instability with fine-tuning still appears greater than that which can be achieved from a small change in learning rate alone.
e.5 Effect of Downstream Learning Rate
We now study the impact of the downstream model learning rate on instability, showing that the learning rate of the downstream model is another factor that affects downstream instability. In Figure 15, we show the instability of CBOW and MC embeddings on the SST2 and MR sentiment analysis tasks when different learning rates are used for the downstream linear model. We mark the optimal learning rate by validation accuracy with a red star. We see that very small and very large learning rates tend to be the most unstable for both 100- and 400-dimensional embeddings. Moreover, the optimal learning rates do not significantly increase the instability compared to the other learning rates in our sweep. Since the learning rate further contributes to the instability, we fix it across different precisions and dimensions in our main study to provide a controlled setting for studying the impact of dimension and precision on instability.