Mittens: An Extension of GloVe for Learning Domain-Specialized Representations

by Nicholas Dingwall et al.
Stanford University

We present a simple extension of the GloVe representation learning model that begins with general-purpose representations and updates them based on data from a specialized domain. We show that the resulting representations can lead to faster learning and better results on a variety of tasks.




1 Introduction

Many NLP tasks have benefited from the public availability of general-purpose vector representations of words trained on enormous datasets, such as those released by the GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) teams. These representations, when used as model inputs, have been shown to lead to faster learning and better results in a wide variety of settings (Erhan et al., 2009, 2010; Cases et al., 2017).

However, many domains require more specialized representations but lack sufficient data to train them from scratch. We address this problem with a simple extension of the GloVe model (Pennington et al., 2014) that synthesizes general-purpose representations with specialized data sets. The guiding idea comes from the retrofitting work of Faruqui et al. (2015), which updates a space of existing representations with new information from a knowledge graph while also staying faithful to the original space (see also Yu and Dredze 2014; Mrkšić et al. 2016; Pilehvar and Collier 2016). We show that the GloVe objective is amenable to a similar retrofitting extension. We call the resulting model ‘Mittens’, evoking the idea that it is ‘GloVe with a warm start’ or a ‘warmer GloVe’.

Our hypothesis is that Mittens representations synthesize the specialized data and the general-purpose pretrained representations in a way that gives us the best of both. To test this, we conducted a diverse set of experiments. In the first, we learn GloVe and Mittens representations on IMDB movie reviews and test them on separate IMDB reviews using simple classifiers. In the second, we learn our representations from clinical text and apply them to a sequence labeling task using recurrent neural networks, and to edge detection using simple classifiers. These experiments support our hypothesis about Mittens representations and help identify where they are most useful.

2 Mittens

This section defines the Mittens objective. We first vectorize GloVe to help reveal why it can be extended into a retrofitting model.

2.1 Vectorizing GloVe

                              CPU                 GPU
Implementation              5K  10K  20K       5K  10K  20K
Non-vectorized TensorFlow
Vectorized Numpy
Vectorized TensorFlow
Official GloVe
Table 1: Speed comparisons. The values are seconds per iteration, averaged over 10 iterations each on 5 simulated corpora that produced count matrices with about non-zero cells. Only the training step for each model is timed. The CPU experiments were done on a machine with a 3.1 GHz Intel Core i7 chip and 16 GB of memory, and the GPU experiments were done on a machine with a 16 GB NVIDIA Tesla V100 GPU and 61 GB of memory. Dashes mark tests that aren’t applicable because the implementation doesn’t perform GPU computations.

For a word $i$ from vocabulary $V$ occurring in the context of word $j$, GloVe learns representations $w_i$ and $\tilde{w}_j$ whose inner product approximates the logarithm of the probability of the words’ co-occurrence. Bias terms $b_i$ and $\tilde{b}_j$ absorb the overall occurrences of $i$ and $j$. A weighting function $f$ is applied to emphasize word pairs that occur frequently and to reduce the impact of noisy, low-frequency pairs. This results in the objective

$$J = \sum_{i,j \,:\, X_{ij} > 0} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count of $i$ and $j$. Since $\log X_{ij}$ is only defined for $X_{ij} > 0$, the sum excludes zero-count word pairs. As a result, existing implementations of GloVe use an inner loop to compute this cost and associated derivatives.

However, since $f(0) = 0$, the second bracket is irrelevant whenever $X_{ij} = 0$, and so replacing $\log X_{ij}$ with

$$g(X_{ij}) = \begin{cases} \log X_{ij} & \text{if } X_{ij} > 0 \\ k & \text{otherwise} \end{cases}$$

(for any $k$) does not affect the objective and reveals that the cost function can be readily vectorized as

$$J = \operatorname{sum}\!\left( f(X) \odot M \odot M \right), \qquad M = W^\top \widetilde{W} + b\mathbf{1}^\top + \mathbf{1}\tilde{b}^\top - g(X)$$

where $g$ is applied elementwise. $W$ and $\widetilde{W}$ are matrices whose columns comprise the word and context embedding vectors, and $f$ is applied elementwise. Because $f(X_{ij})$ is a factor of all terms of the derivatives, the gradients are identical to those of the original GloVe implementation too.
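The vectorized cost can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's released implementation: it assumes the standard GloVe weighting $f(x) = \min((x/x_{\max})^{\alpha}, 1)$, column-embedding matrices of shape (d, n), and uses $k = 0$ for the zero-count cells (which $f$ masks out anyway).

```python
import numpy as np

def glove_cost(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """Vectorized GloVe cost. W, W_tilde: (d, n) matrices whose columns are
    word/context vectors; b, b_tilde: (n,) bias vectors; X: (n, n) counts."""
    # Weighting f(x) = min((x / x_max)^alpha, 1); note f(0) = 0.
    fX = np.minimum((X / x_max) ** alpha, 1.0)
    # g(X): log where X > 0, arbitrary constant (here 0) elsewhere.
    gX = np.where(X > 0, np.log(np.where(X > 0, X, 1.0)), 0.0)
    M = W.T @ W_tilde + b[:, None] + b_tilde[None, :] - gX
    return np.sum(fX * M * M)
```

Because f zeroes out the cells with zero counts, this matches a loop over the non-zero pairs exactly, while letting the whole computation run as dense matrix operations on CPU or GPU.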

To assess the practical value of vectorizing GloVe, we implemented the model in pure Python/Numpy (van der Walt et al., 2011) and in TensorFlow (Abadi et al., 2015), and we compared these implementations to a non-vectorized TensorFlow implementation and to the official GloVe C implementation (Pennington et al., 2014). (We also considered a non-vectorized Numpy implementation, but it was too slow to be included in our tests: a single iteration with a 5K vocabulary took 2 hrs 38 mins.) The results of these tests are in tab. 1. Though the C implementation is the fastest (and scales to massive vocabularies), our vectorized TensorFlow implementation is a strong second-place finisher, especially where GPU computations are possible.

2.2 The Mittens Objective Function

This vectorized implementation makes it apparent that we can extend GloVe into a retrofitting model by adding a term to the objective that penalizes the squared euclidean distance from each learned embedding $w_i$ to an existing one, $r_i$:

$$J_{\text{Mittens}} = J + \mu \sum_{i \in R} \lVert w_i - r_i \rVert^2$$

Here, $R$ contains the subset of words in the new vocabulary for which prior embeddings are available (i.e., $R = V \cap V'$, where $V'$ is the vocabulary used to generate the prior embeddings), and $\mu$ is a non-negative real-valued weight. When $\mu = 0$ or $R$ is empty (i.e., there is no original embedding), the objective reduces to GloVe’s.
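A minimal NumPy sketch of the penalized objective. The names are illustrative: `R` is a hypothetical dict mapping word index to its pretrained vector, and the penalty here is applied to the word vectors `W[:, i]` (whether the final embedding is $w_i$ or $w_i + \tilde{w}_i$ is an implementation choice left open in this sketch).

```python
import numpy as np

def mittens_cost(W, W_tilde, b, b_tilde, X, R, mu=0.1,
                 x_max=100.0, alpha=0.75):
    """GloVe cost plus mu times the squared distance from learned word
    vectors to pretrained ones. R: {word index -> pretrained vector}."""
    fX = np.minimum((X / x_max) ** alpha, 1.0)          # weighting, f(0) = 0
    gX = np.where(X > 0, np.log(np.where(X > 0, X, 1.0)), 0.0)
    M = W.T @ W_tilde + b[:, None] + b_tilde[None, :] - gX
    glove = np.sum(fX * M * M)
    # Retrofitting penalty over the words with prior embeddings.
    penalty = sum(np.sum((W[:, i] - r) ** 2) for i, r in R.items())
    return glove + mu * penalty
```

With `mu=0` or an empty `R`, the function returns the plain GloVe cost, mirroring the reduction noted above.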

As in retrofitting, this objective encodes two opposing pressures: the GloVe objective (left term), which favors changing the representations, and the distance measure (right term), which favors remaining true to the original inputs. We can control this trade-off by decreasing or increasing $\mu$.

In our experiments, we always begin with the 50-dimensional ‘Wikipedia 2014 + Gigaword 5’ GloVe representations – henceforth ‘External GloVe’ – but the model is compatible with any kind of “warm start”.

2.3 Notes on Mittens Representations

GloVe’s objective is that the log probability of words $i$ and $j$ co-occurring be proportional to the dot product of their learned vectors. One might worry that Mittens distorts this, thereby diminishing the effectiveness of GloVe. To assess this, we simulated 500-dimensional square count matrices and original embeddings for 50% of the words. Then we ran Mittens with a range of values of $\mu$. The results for five trials are summarized in fig. 1: for reasonable values of $\mu$, the desired correlation remains high (fig. 1(a)), even as vectors with initial embeddings stay close to those inputs, as desired (fig. 1(b)).

(a) Correlations between the dot product of pairs of learned vectors and their log probabilities.
(b) Distances between initial and learned embeddings, for words with and without pretrained initializations. As $\mu$ gets larger, the pressure to stay close to the original increases.
Figure 1: Simulations assessing Mittens’ faithfulness to the original GloVe objective and to its input embeddings. $\mu = 0$ is regular GloVe.

3 Sentiment Experiments

For our sentiment experiments, we train our representations on the unlabeled part of the IMDB review dataset released by Maas et al. (2011). This simulates a common use-case: Mittens should enable us to achieve specialized representations for these reviews while benefiting from the large datasets used to train External GloVe.

Representations         Accuracy    95% CI
External GloVe
IMDB GloVe
Mittens
Table 2: IMDB test-set classification results. A difference of corresponds to examples. For all but ‘External GloVe’, we report means (with bootstrapped confidence intervals) over five runs of creating the embeddings and cross-validating the classifier’s hyperparameters, mainly to help verify that the differences do not derive from variation in the representation learning phase.

3.1 Word Representations

All our representations begin from a common count matrix obtained by tokenizing the unlabeled movie reviews in a way that splits out punctuation, downcases words unless they are written in all uppercase, and preserves emoticons and other common social media mark-up. We say word $w'$ co-occurs with word $w$ if $w'$ is within 10 words to the left or right of $w$, with the counts weighted by $1/d$, where $d$ is the distance in words from $w$. Only words with at least 300 tokens are included in the matrix, yielding a vocabulary of 3,133 words.
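The distance-weighted counting scheme can be sketched as follows; this is an illustrative implementation of the $1/d$ weighting within a 10-word window, not the paper's tokenization pipeline.

```python
from collections import defaultdict

def cooccurrence(docs, window=10):
    """Distance-weighted co-occurrence counts: each pair of words within
    `window` positions contributes 1/d to both ordered entries, where d is
    the distance in words. `docs` is a list of token lists."""
    counts = defaultdict(float)
    for tokens in docs:
        for i, w in enumerate(tokens):
            # Look only at the left context; symmetry fills the right.
            for j in range(max(0, i - window), i):
                d = i - j
                counts[(w, tokens[j])] += 1.0 / d
                counts[(tokens[j], w)] += 1.0 / d
    return counts
```

For example, in the document `["a", "b", "c"]`, the pair (a, b) is at distance 1 and contributes 1.0, while (a, c) is at distance 2 and contributes 0.5.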

For regular GloVe representations derived from the IMDB data – ‘IMDB GloVe’ – we train 50-dimensional representations and use the default parameters from Pennington et al. (2014): $x_{\max} = 100$, $\alpha = 0.75$, and a learning rate of $0.05$. We optimize with AdaGrad (Duchi et al., 2011), also as in the original paper, training for 50K epochs.

For Mittens, we begin with External GloVe. The few words in the IMDB vocabulary that are not in this GloVe vocabulary receive random initializations with a standard deviation that matches that of the GloVe representations. Informed by our simulations, we train representations with a small positive Mittens weight $\mu$. The GloVe hyperparameters and optimization settings are as above. Extending the correlation analysis of fig. 1(a) to these real examples, we find that both GloVe and Mittens achieve high Pearson correlations, with Mittens slightly higher. We speculate that the improved correlation is due to the low-variance external GloVe embeddings smoothing out noise from our co-occurrence matrix.
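The initialization for out-of-vocabulary words can be sketched like this; the function name and dictionary interface are illustrative, and the only substantive assumption is that the random draws match the standard deviation of the pretrained vectors, as described above.

```python
import numpy as np

def init_oov(vocab, pretrained, dim=50, seed=0):
    """Return a full embedding dict: pretrained vectors where available,
    otherwise random vectors whose standard deviation matches that of the
    pretrained embeddings."""
    rng = np.random.default_rng(seed)
    sd = np.std(np.stack(list(pretrained.values())))
    return {w: pretrained[w] if w in pretrained
            else rng.normal(0.0, sd, size=dim)
            for w in vocab}
```

Matching the scale of the pretrained space keeps the randomly initialized vectors from dominating (or vanishing in) the dot products during the first epochs of training.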

1. No/O eye/R pain/R or/O eye/R discharge/R ./O
2. Asymptomatic/D bacteriuria/D ,/O could/O be/O neurogenic/C bladder/C disorder/C ./O
3. Small/C embolism/C in/C either/C lung/C cannot/O be/O excluded/O ./O
(a) Short disease diagnosis labeled examples. ‘O’: ‘Other’; ‘D’: ‘Positive Diagnosis’; ‘C’: ‘Concern’; ‘R’: ‘Ruled Out’.
Table 3: Disease diagnosis examples.
Figure 2: Disease diagnosis test-set accuracy as a function of training epoch, with bootstrapped confidence intervals. Mittens learns fastest for all categories.

3.2 IMDB Sentiment Classification

The labeled part of the IMDB sentiment dataset defines a positive/negative classification problem with 25K labeled reviews for training and 25K for testing. We represent each review by the element-wise sum of the representations of each word in the review, and train a random forest model (Ho, 1995; Breiman, 2001) on these representations. The rationale behind this experimental set-up is that it fairly directly evaluates the vectors themselves; whereas the neural networks we evaluate next can update the representations, this model relies heavily on their initial values.
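The review featurization above amounts to a few lines; this sketch assumes the embeddings are held in a dict and that out-of-vocabulary tokens are simply skipped (the handling of OOV tokens at classification time is an assumption on our part).

```python
import numpy as np

def featurize(tokens, embeddings, dim=50):
    """Represent a review as the element-wise sum of its word vectors;
    tokens absent from `embeddings` contribute nothing."""
    v = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            v += embeddings[tok]
    return v
```

The resulting fixed-length vector is what the random forest sees, so classification accuracy directly reflects how much sentiment signal the word vectors carry.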

Via cross-validation on the training data, we optimize the number of trees, the number of features at each split, and the maximum depth of each tree. To help factor out variation in the representation learning step (Reimers and Gurevych, 2017), we report the average accuracies over five separate complete experimental runs.

Our results are given in tab. 2. Mittens outperforms External GloVe and IMDB GloVe, indicating that it effectively combines complementary information from both.

4 Clinical Text Experiments

Our clinical text experiments begin with 100K clinical notes (transcriptions of the reports healthcare providers create summarizing their interactions with patients during appointments) from Real Health Data. These notes are divided into informal segments that loosely follow the ‘SOAP’ convention for such reporting (Subjective, Objective, Assessment, Plan). The sample has 1.3 million such segments, and these segments provide our notion of ‘document’.

4.1 Word Representations

The count matrix is created from the clinical text using the specifications described in sec. 3.1, but with the count threshold set to 500 to speed up optimization. The final matrix has a 6,519-word vocabulary. We train Mittens and GloVe as in sec. 3.1. The correlations in the sense of fig. 1(a) are comparable for both GloVe and Mittens.

Subgraph    Nodes    Edges
(a) Subgraph sizes.
Representations       disorder  procedure  finding  organism  substance
External GloVe
Clinical text GloVe
Mittens
(b) Mean macro-F1 by subgraph (averages over 10 random train/test splits). Italics mark systems whose difference from the numerically best system is not statistically significant according to a Wilcoxon signed-rank test.
Table 4: SNOMED subgraphs and results. For the ‘disorder’ graph (the largest), a difference of corresponds to examples. For the ‘substance’ graph (the smallest), it corresponds to examples.

4.2 Disease Diagnosis Sequence Modeling

Here we use a recurrent neural network (RNN) to evaluate our representations. We sampled 3,206 sentences from clinical texts (disjoint from the data used to learn word representations) containing disease mentions, and labeled these mentions as ‘Positive diagnosis’, ‘Concern’, ‘Ruled Out’, or ‘Other’. Tab. 3(a) provides some examples. We treat this as a sequence labeling problem, using ‘Other’ for all unlabeled tokens. Our RNN has a single 50-dimensional hidden layer with LSTM cells (Hochreiter and Schmidhuber, 1997), and the inputs are updated during training.

Fig. 2 summarizes the results of these experiments, based on 10 random train/test splits with 30% of the sentences allocated for testing. Since the inputs can be updated, we expect all the initialization schemes to converge to approximately the same performance eventually (though this seems not to be the case in practical terms for Random or External GloVe). However, Mittens learns fastest for all categories, reinforcing the notion that Mittens is a sensible default choice for leveraging both domain-specific and large-scale data.

4.3 SNOMED CT edge prediction

Finally, we wished to see if Mittens representations would generalize beyond the specific dataset they were trained on. SNOMED CT is a public, widely-used graph of healthcare concepts and their relationships (Spackman et al., 1997). It contains 327K nodes, classified into 169 semantic types, and 3.8M edges. Our clinical notes are more colloquial than SNOMED’s node names and cover only some of its semantic spaces, but the Mittens representations should still be useful here.

For our experiments, we chose the five largest semantic types; tab. 4(a) lists these subgraphs along with their sizes. Our task is edge prediction: given a pair of nodes in a subgraph, the models predict whether there should be an edge between them. We sample 50% of the non-existent edges to create a balanced problem. Each node is represented by the sum of the vectors for the words in its primary name, and the classifier is trained on the concatenation of these two node representations. To help assess whether the input representations truly generalize to new cases, we ensure that the sets of nodes seen in training and testing are disjoint (which entails that the edge sets are disjoint as well), and we train on just 50% of the nodes. We report the results of ten random train/test splits.
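The node and edge featurization described above can be sketched as follows; the whitespace tokenization of SNOMED primary names and the zero vector for out-of-vocabulary tokens are assumptions of this sketch.

```python
import numpy as np

def edge_features(name_a, name_b, embeddings, dim=50):
    """One edge-prediction example: concatenate the two node vectors, each
    the sum of the word vectors in the node's primary name."""
    def node_vec(name):
        v = np.zeros(dim)
        for tok in name.lower().split():
            v += embeddings.get(tok, np.zeros(dim))
        return v
    return np.concatenate([node_vec(name_a), node_vec(name_b)])
```

Because the classifier only ever sees these summed-and-concatenated vectors, any gain over External GloVe here must come from the clinical specialization of the underlying word representations.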

The large scale of these problems prohibits the large hyperparameter search described in sec. 3.2, so we used the best settings from those experiments (500 trees per forest, square root of the total features at each split, no depth restrictions).

Our results are summarized in tab. 4(b). Though the differences are small numerically, they are meaningful because of the large size of the graphs (tab. 4(a)). Overall, these results suggest that Mittens is at its best where there is a highly specialized dataset for learning representations, but that it is a safe choice even when seeking to transfer the representations to a new domain.

5 Conclusion

We introduced a simple retrofitting-like extension to the original GloVe model and showed that the resulting representations were effective in a number of tasks and models, provided a substantial (unsupervised) dataset in the same domain is available to tune the representations. The most natural next step would be to study similar extensions of other representation-learning models.

6 Acknowledgements

We thank Real Health Data for providing our clinical texts; Ben Bernstein, Andrew Maas, Devini Senaratna, and Kevin Reschke for valuable comments and discussion; and Grady Simon for making his TensorFlow implementation of GloVe available (Simon, 2017).