mittens
A fast implementation of GloVe, with optional retrofitting
We present a simple extension of the GloVe representation learning model that begins with general-purpose representations and updates them based on data from a specialized domain. We show that the resulting representations can lead to faster learning and better results on a variety of tasks.
Many NLP tasks have benefitted from the public availability of general-purpose vector representations of words trained on enormous datasets, such as those released by the GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) teams. These representations, when used as model inputs, have been shown to lead to faster learning and better results in a wide variety of settings (Erhan et al., 2009, 2010; Cases et al., 2017).
However, many domains require more specialized representations but lack sufficient data to train them from scratch. We address this problem with a simple extension of the GloVe model (Pennington et al., 2014) that synthesizes general-purpose representations with specialized data sets. The guiding idea comes from the retrofitting work of Faruqui et al. (2015), which updates a space of existing representations with new information from a knowledge graph while also staying faithful to the original space (see also Yu and Dredze 2014; Mrkšić et al. 2016; Pilehvar and Collier 2016). We show that the GloVe objective is amenable to a similar retrofitting extension. We call the resulting model ‘Mittens’, evoking the idea that it is ‘GloVe with a warm start’ or a ‘warmer GloVe’.
Our hypothesis is that Mittens representations synthesize the specialized data and the general-purpose pretrained representations in a way that gives us the best of both. To test this, we conducted a diverse set of experiments. In the first, we learn GloVe and Mittens representations on IMDB movie reviews and test them on separate IMDB reviews using simple classifiers. In the second, we learn our representations from clinical text and apply them to a sequence labeling task using recurrent neural networks, and to edge detection using simple classifiers. These experiments support our hypothesis about Mittens representations and help identify where they are most useful.
This section defines the Mittens objective. We first vectorize GloVe to help reveal why it can be extended into a retrofitting model.
[Table 1. Speed comparison of GloVe implementations on CPU and GPU for vocabulary sizes 5K, 10K, and 20K. Rows: Nonvectorized TensorFlow, Vectorized Numpy, Vectorized TensorFlow, Official GloVe. Cell values are missing.]
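For a word $i$ from vocabulary $V$ occurring in the context of word $j$, GloVe learns representations $w_i$ and $\tilde{w}_j$ whose inner product approximates the logarithm of the probability of the words’ co-occurrence. Bias terms $b_i$ and $\tilde{b}_j$ absorb the overall occurrences of $i$ and $j$. A weighting function $f$ is applied to emphasize word pairs that occur frequently and reduce the impact of noisy, low-frequency pairs. This results in the objective
$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $X_{ij}$ is the co-occurrence count of $i$ and $j$. Since $\log X_{ij}$ is only defined for $X_{ij} > 0$, the sum excludes zero-count word pairs. As a result, existing implementations of GloVe use an inner loop to compute this cost and associated derivatives.
However, since $f(0) = 0$, the second bracket is irrelevant whenever $X_{ij} = 0$, and so replacing $\log X_{ij}$ with
$$g(X_{ij}) = \begin{cases} k, & X_{ij} = 0 \\ \log X_{ij}, & \text{otherwise} \end{cases}$$
(for any $k$) does not affect the objective and reveals that the cost function can be readily vectorized as
$$J = \sum_{i,j} \left[ f(X) \odot M \odot M \right]_{ij}$$
where $M = W^\top \tilde{W} + b\mathbf{1}^\top + \mathbf{1}\tilde{b}^\top - g(X)$ and $\odot$ is the elementwise product. $W$ and $\tilde{W}$ are matrices whose columns comprise the word and context embedding vectors, and $f$ and $g$ are applied elementwise. Because $f(X_{ij})$ is a factor of all terms of the derivatives, the gradients are identical to the original GloVe implementation too.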
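The equivalence between the looped and the vectorized cost can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names are hypothetical, the weighting function uses the standard GloVe form with x_max = 100 and alpha = 0.75 as assumed defaults, and the count matrix is toy data. It evaluates the cost once with an explicit loop over nonzero cells and once in vectorized form with two different fill values k.

```python
import numpy as np


def glove_weight(X, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)**alpha below x_max, 1 otherwise; f(0) = 0."""
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)


def glove_cost_loop(W, C, b, c, X):
    """Reference cost: explicit loop over the nonzero co-occurrence cells only."""
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[:, i] @ C[:, j] + b[i] + c[j] - np.log(X[i, j])
        total += glove_weight(X[i, j]) * diff ** 2
    return float(total)


def glove_cost_vectorized(W, C, b, c, X, k=0.0):
    """Vectorized cost: log X is replaced by g(X), which equals k where X == 0."""
    safe = np.where(X > 0, X, 1.0)            # avoids log(0) on empty cells
    g = np.where(X > 0, np.log(safe), k)
    M = W.T @ C + b[:, None] + c[None, :] - g
    return float(np.sum(glove_weight(X) * M ** 2))


rng = np.random.default_rng(0)
n_words, dim = 6, 4
X = rng.poisson(1.0, size=(n_words, n_words)).astype(float)  # toy counts
W = rng.normal(size=(dim, n_words))   # word vectors as columns
C = rng.normal(size=(dim, n_words))   # context vectors as columns
b = rng.normal(size=n_words)          # word biases
c = rng.normal(size=n_words)          # context biases

loop_cost = glove_cost_loop(W, C, b, c, X)
vec_cost_k0 = glove_cost_vectorized(W, C, b, c, X, k=0.0)
vec_cost_k7 = glove_cost_vectorized(W, C, b, c, X, k=7.0)
```

The three values agree up to floating-point error, because every zero-count cell is multiplied by a weight of zero regardless of the fill value.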
To assess the practical value of vectorizing GloVe, we implemented the model (https://github.com/roamanalytics/mittens) in pure Python/Numpy (van der Walt et al., 2011) and in TensorFlow (Abadi et al., 2015), and we compared these implementations to a nonvectorized TensorFlow implementation and to the official GloVe C implementation (Pennington et al., 2014). (We also considered a nonvectorized Numpy implementation, but it was too slow to be included in our tests: a single iteration with a 5K vocabulary took 2 hrs 38 mins.) The results of these tests are in tab. 1. Though the C implementation is the fastest (and scales to massive vocabularies), our vectorized TensorFlow implementation is a strong second-place finisher, especially where GPU computations are possible.
This vectorized implementation makes it apparent that we can extend GloVe into a retrofitting model by adding a term to the objective that penalizes the squared Euclidean distance from the learned embedding $\hat{w}_i = w_i + \tilde{w}_i$ to an existing one, $r_i$:
$$J_{\text{Mittens}} = J + \mu \sum_{i \in R} \lVert \hat{w}_i - r_i \rVert^2$$
Here, $J$ is the GloVe objective, $R$ contains the subset of words in the new vocabulary for which prior embeddings are available (i.e., $R = V \cap V'$, where $V'$ is the vocabulary used to generate the prior embeddings), and $\mu$ is a non-negative real-valued weight. When $\mu = 0$ or $R$ is empty (i.e., there is no original embedding), the objective reduces to GloVe’s.
As in retrofitting, this objective encodes two opposing pressures: the GloVe objective (left term), which favors changing representations, and the distance measure (right term), which favors remaining true to the original inputs. We can control this trade-off by decreasing or increasing $\mu$.
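The distance term can be written in a few lines of NumPy. This is a sketch under stated assumptions rather than the reference implementation: the function names and toy data are hypothetical, embeddings are stored as columns, and the learned embedding of a word is assumed to be the sum of its word and context vectors. The weight `mu` plays the role of the non-negative Mittens weight.

```python
import numpy as np


def mittens_penalty(W, C, R_index, R_vectors, mu):
    """mu times the summed squared Euclidean distance to the prior embeddings.

    W, C      : (dim, n_words) word and context vectors as columns
    R_index   : integer indices of words that have a prior embedding
    R_vectors : (dim, len(R_index)) prior embeddings for those words
    """
    learned = (W + C)[:, R_index]   # assumed learned embedding: word + context
    return mu * float(np.sum((learned - R_vectors) ** 2))


def mittens_cost(glove_cost, W, C, R_index, R_vectors, mu):
    """GloVe cost plus the retrofitting penalty."""
    return glove_cost + mittens_penalty(W, C, R_index, R_vectors, mu)


rng = np.random.default_rng(0)
dim, n_words = 3, 5
W = rng.normal(size=(dim, n_words))
C = rng.normal(size=(dim, n_words))
R_index = np.array([0, 2])
# Toy priors that differ from the learned vectors by exactly 1 in every entry.
R_vectors = (W + C)[:, R_index] + 1.0

penalty = mittens_penalty(W, C, R_index, R_vectors, mu=0.1)
no_weight = mittens_penalty(W, C, R_index, R_vectors, mu=0.0)
no_priors = mittens_penalty(W, C, np.array([], dtype=int), np.zeros((dim, 0)), mu=0.1)
```

With a zero weight or an empty set of priors the penalty vanishes and only the GloVe cost remains, which matches the reduction described above.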
In our experiments, we always begin with 50-dimensional ‘Wikipedia 2014 + Gigaword 5’ GloVe representations (http://nlp.stanford.edu/data/glove.6B.zip) – henceforth ‘External GloVe’ – but the model is compatible with any kind of “warm start”.
GloVe’s objective is that the log probability of words $i$ and $j$ co-occurring be proportional to the dot product of their learned vectors. One might worry that Mittens distorts this, thereby diminishing the effectiveness of GloVe. To assess this, we simulated 500-dimensional square count matrices and original embeddings for 50% of the words. Then we ran Mittens with a range of values of $\mu$. The results for five trials are summarized in fig. 1: for reasonable values of $\mu$, the desired correlation remains high (fig. 1(a)), even as vectors with initial embeddings stay close to those inputs, as desired (fig. 1(b)).
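One way to measure the correlation in question is sketched below. The exact procedure used for the figure is not given in the text, so this is an assumption: the function name is hypothetical, probabilities are taken to be row-normalized counts, only nonzero cells are compared, and a single vector per word is used for the dot products.

```python
import numpy as np


def log_prob_correlation(vectors, X):
    """Pearson correlation between dot products and log co-occurrence probabilities.

    vectors : (n_words, dim) learned embeddings, one row per word
    X       : (n_words, n_words) co-occurrence counts
    """
    mask = X > 0                                  # log is undefined for zero counts
    prob = X / X.sum(axis=1, keepdims=True)       # row-normalized probabilities
    dots = vectors @ vectors.T
    return float(np.corrcoef(dots[mask], np.log(prob[mask]))[0, 1])


rng = np.random.default_rng(0)
vectors = rng.normal(size=(30, 5))
X = np.exp(vectors @ vectors.T)   # toy counts whose logs equal the dot products
rho = log_prob_correlation(vectors, X)
```

Running the same function on embeddings trained with different weights on the distance term would give one point per setting for a plot of correlation against that weight.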
For our sentiment experiments, we train our representations on the unlabeled part of the IMDB review dataset released by Maas et al. (2011). This simulates a common use case: Mittens should enable us to achieve specialized representations for these reviews while benefiting from the large datasets used to train External GloVe.
[Table 2. Columns: Representations, Accuracy, 95% CI. Rows: Random, External GloVe, IMDB GloVe, Mittens. Cell values are missing.]
examples. For all but ‘External GloVe’, we report means (with bootstrapped confidence intervals) over five runs of creating the embeddings and cross-validating the classifier’s hyperparameters, mainly to help verify that the differences do not derive from variation in the representation learning phase.
All our representations begin from a common count matrix obtained by tokenizing the unlabeled movie reviews in a way that splits out punctuation, downcases words unless they are written in all uppercase, and preserves emoticons and other common social media markup. We say word $i$ co-occurs with word $j$ if $i$ is within 10 words to the left or right of $j$, with the counts weighted by $1/d$, where $d$ is the distance in words from $j$. Only words with at least 300 tokens are included in the matrix, yielding a vocabulary of 3,133 words.
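The counting scheme can be sketched in plain Python. Everything here is illustrative: the function name and toy documents are hypothetical, the weight for a pair at distance d is assumed to be 1/d (as in the original GloVe), and distance is measured on the unfiltered token sequence, which the text does not specify. Tokenization is assumed to have happened already.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(docs, window=10, min_count=1):
    """Distance-weighted co-occurrence counts over tokenized documents.

    docs      : list of token lists
    window    : maximum distance (in tokens) to the left or right
    min_count : words with fewer tokens than this are dropped from the vocabulary
    """
    freq = Counter(tok for doc in docs for tok in doc)
    vocab = {w for w, n in freq.items() if n >= min_count}
    counts = defaultdict(float)
    for doc in docs:
        for pos, word in enumerate(doc):
            if word not in vocab:
                continue
            lo = max(0, pos - window)
            hi = min(len(doc), pos + window + 1)
            for ctx_pos in range(lo, hi):
                ctx = doc[ctx_pos]
                if ctx_pos == pos or ctx not in vocab:
                    continue
                # Closer context words contribute more: weight 1/d.
                counts[(word, ctx)] += 1.0 / abs(ctx_pos - pos)
    return counts


docs = [["the", "movie", "was", "great"], ["the", "movie", "was", "bad"]]
counts = cooccurrence_counts(docs, window=2)
filtered = cooccurrence_counts(docs, window=2, min_count=2)
```

For the settings described above, one would call this with `window=10` and `min_count=300` and then map the dictionary onto a dense or sparse matrix.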
For regular GloVe representations derived from the IMDB data – ‘IMDB GloVe’ – we train 50-dimensional representations and use the default parameters from Pennington et al. 2014: $\alpha = 0.75$, $x_{\max} = 100$, and a learning rate of $0.05$. We optimize with AdaGrad (Duchi et al., 2011), also as in the original paper, training for 50K epochs.
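For readers unfamiliar with AdaGrad, a single update step can be sketched as follows. This is a generic illustration and not code from the paper: the function name is hypothetical, the learning rate of 0.05 is assumed to match the GloVe default, and the quadratic toy loss only stands in for the GloVe cost.

```python
import numpy as np


def adagrad_step(param, grad, hist, lr=0.05, eps=1e-8):
    """One AdaGrad update: per-coordinate step scaled by accumulated squared gradients."""
    hist = hist + grad ** 2
    param = param - lr * grad / (np.sqrt(hist) + eps)
    return param, hist


# Toy problem: minimize x**2, whose gradient is 2 * x.
x = np.array([1.0])
hist = np.zeros_like(x)
first_x, first_hist = adagrad_step(x, 2 * x, hist)

for _ in range(100):
    x, hist = adagrad_step(x, 2 * x, hist)
```

Because the accumulated history only grows, the effective step size shrinks over time, which is why coordinates with frequent large gradients are updated more cautiously.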
For Mittens, we begin with External GloVe. The few words in the IMDB vocabulary that are not in this GloVe vocabulary receive random initializations with a standard deviation that matches that of the GloVe representations. Informed by our simulations, we train representations with a fixed Mittens weight $\mu$. The GloVe hyperparameters and optimization settings are as above. Extending the correlation analysis of fig. 1(a) to these real examples, we find that Pearson’s $\rho$ is generally higher for Mittens than for the GloVe representations. We speculate that the improved correlation is due to the low-variance External GloVe embedding smoothing out noise from our co-occurrence matrix.
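The warm start can be sketched as below. The helper name and toy vectors are hypothetical, and one detail is an assumption: the standard deviation is computed over all entries of the pretrained vectors that match the vocabulary, with zero-mean Gaussian noise for the unmatched words.

```python
import numpy as np


def warm_start(vocab, pretrained, dim, seed=0):
    """Build an initial embedding matrix from pretrained vectors.

    Words found in `pretrained` copy their vector; the rest get Gaussian noise
    whose standard deviation matches that of the matched pretrained vectors.
    Returns the matrix and a boolean mask of words that have a prior.
    """
    rng = np.random.default_rng(seed)
    known = np.array([pretrained[w] for w in vocab if w in pretrained])
    scale = known.std() if len(known) else 1.0
    init = np.empty((len(vocab), dim))
    has_prior = np.zeros(len(vocab), dtype=bool)
    for row, word in enumerate(vocab):
        if word in pretrained:
            init[row] = pretrained[word]
            has_prior[row] = True
        else:
            init[row] = rng.normal(0.0, scale, size=dim)
    return init, has_prior


pretrained = {"good": np.array([1.0, 2.0]), "bad": np.array([-1.0, 0.0])}
vocab = ["good", "zzz", "bad"]
init, has_prior = warm_start(vocab, pretrained, dim=2)
```

The returned mask identifies the words to which the distance penalty applies; words without a prior are constrained only by the GloVe cost.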
1. No/O eye/R pain/R or/O eye/R discharge/R ./O
2. Asymptomatic/D bacteriuria/D ,/O could/O be/O neurogenic/C bladder/C disorder/C ./O
3. Small/C embolism/C in/C either/C lung/C cannot/O be/O excluded/O ./O
The labeled part of the IMDB sentiment dataset defines a positive/negative classification problem with 25K labeled reviews for training and 25K for testing. We represent each review by the element-wise sum of the representation of each word in the review, and train a random forest model (Ho, 1995; Breiman, 2001) on these representations. The rationale behind this experimental setup is that it fairly directly evaluates the vectors themselves; whereas the neural networks we evaluate next can update the representations, this model relies heavily on their initial values.
Via cross-validation on the training data, we optimize the number of trees, the number of features at each split, and the maximum depth of each tree. To help factor out variation in the representation learning step (Reimers and Gurevych, 2017), we report the average accuracies over five separate complete experimental runs.
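The review featurization can be sketched as follows. The function name and toy embeddings are hypothetical, and skipping out-of-vocabulary tokens is an assumption, since the text does not say how they are handled. The resulting fixed-length vectors would then be passed to a random forest classifier, which is omitted here.

```python
import numpy as np


def review_vector(tokens, embeddings, dim):
    """Element-wise sum of the word vectors in a review (unknown tokens are skipped)."""
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            vec += embeddings[tok]
    return vec


embeddings = {"great": np.array([1.0, 0.0]), "movie": np.array([0.0, 2.0])}
features = review_vector(["great", "movie", "great", "unknownword"], embeddings, dim=2)
```

Because the sum has no trainable parameters, the downstream classifier sees the embeddings almost directly, which is the point of this evaluation setup.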
Our results are given in tab. 2. Mittens outperforms External GloVe and IMDB GloVe, indicating that it effectively combines complementary information from both.
Our clinical text experiments begin with 100K clinical notes (transcriptions of the reports healthcare providers create summarizing their interactions with patients during appointments) from Real Health Data (http://www.realhealthdata.com). These notes are divided into informal segments that loosely follow the ‘SOAP’ convention for such reporting (Subjective, Objective, Assessment, Plan). The sample has 1.3 million such segments, and these segments provide our notion of ‘document’.
The count matrix is created from the clinical text using the specifications described in sec. 3.1, but with the count threshold set to 500 to speed up optimization. The final matrix has a 6,519-word vocabulary. We train Mittens and GloVe as in sec. 3.1. The correlations in the sense of fig. 1(a) are about the same for both GloVe and Mittens.


Here we use a recurrent neural network (RNN) to evaluate our representations. We sampled 3,206 sentences from clinical texts (disjoint from the data used to learn word representations) containing disease mentions, and labeled these mentions as ‘Positive diagnosis’, ‘Concern’, ‘Ruled Out’, or ‘Other’. Tab. 3(a) provides some examples. We treat this as a sequence labeling problem, using ‘Other’ for all unlabeled tokens. Our RNN has a single 50-dimensional hidden layer with LSTM cells (Hochreiter and Schmidhuber, 1997), and the inputs are updated during training.
Fig. 2 summarizes the results of these experiments based on 10 random train/test splits with 30% of the sentences allocated for testing. Since the inputs can be updated, we expect all the initialization schemes to converge to approximately the same performance eventually (though this seems not to be the case in practical terms for Random or External GloVe). However, Mittens learns fastest for all categories, reinforcing the notion that Mittens is a sensible default choice to leverage both domain-specific and large-scale data.
Finally, we wished to see if Mittens representations would generalize beyond the specific dataset they were trained on. SNOMED CT is a public, widely used graph of healthcare concepts and their relationships (Spackman et al., 1997). It contains 327K nodes, classified into 169 semantic types, and 3.8M edges. Our clinical notes are more colloquial than SNOMED’s node names and cover only some of its semantic spaces, but the Mittens representations should still be useful here.
For our experiments, we chose the five largest semantic types; tab. 4(a) lists these subgraphs along with their sizes. Our task is edge prediction: given a pair of nodes in a subgraph, the models predict whether there should be an edge between them. We sample 50% of the nonexistent edges to create a balanced problem. Each node is represented by the sum of the vectors for the words in its primary name, and the classifier is trained on the concatenation of these two node representations. To help assess whether the input representations truly generalize to new cases, we ensure that the sets of nodes seen in training and testing are disjoint (which entails that the edge sets are disjoint as well), and we train on just 50% of the nodes. We report the results of ten random train/test splits.
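The feature construction and the node-disjoint split can be sketched as below. All names are hypothetical and the toy nodes are not taken from SNOMED CT. Two details are assumptions: names are lowercased and split on whitespace, and an edge is kept only when both endpoints fall in the same partition. Sampling of nonexistent edges is left out.

```python
import random

import numpy as np


def node_vector(name, embeddings, dim):
    """Sum of the word vectors in a node's name (unknown words are skipped)."""
    vec = np.zeros(dim)
    for word in name.lower().split():
        if word in embeddings:
            vec += embeddings[word]
    return vec


def edge_features(u, v, embeddings, dim):
    """Classifier input for a node pair: concatenation of the two node vectors."""
    return np.concatenate([node_vector(u, embeddings, dim),
                           node_vector(v, embeddings, dim)])


def split_nodes(nodes, train_frac=0.5, seed=0):
    """Partition nodes into disjoint train and test sets."""
    shuffled = sorted(nodes)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return set(shuffled[:cut]), set(shuffled[cut:])


def edges_within(edges, node_set):
    """Keep only edges whose endpoints both belong to node_set."""
    return [(u, v) for u, v in edges if u in node_set and v in node_set]


nodes = ["eye pain", "eye discharge", "lung embolism", "bladder disorder"]
edges = [("eye pain", "eye discharge"),
         ("lung embolism", "bladder disorder"),
         ("eye pain", "lung embolism")]
train_nodes, test_nodes = split_nodes(nodes)
train_edges = edges_within(edges, train_nodes)
test_edges = edges_within(edges, test_nodes)
```

Splitting by node rather than by edge means every test pair involves names the classifier has never seen, so any accuracy above chance has to come from the word representations.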
The large scale of these problems prohibits the large hyperparameter search described in sec. 3.2, so we used the best settings from those experiments (500 trees per forest, square root of the total features at each split, no depth restrictions).
Our results are summarized in tab. 4(b). Though the differences are small numerically, they are meaningful because of the large size of the graphs (tab. 4(a)). Overall, these results suggest that Mittens is at its best where there is a highly specialized dataset for learning representations, but that it is a safe choice even when seeking to transfer the representations to a new domain.
We introduced a simple retrofitting-like extension to the original GloVe model and showed that the resulting representations were effective in a number of tasks and models, provided a substantial (unsupervised) dataset in the same domain is available to tune the representations. The most natural next step would be to study similar extensions of other representation-learning models.
We thank Real Health Data for providing our clinical texts, Ben Bernstein, Andrew Maas, Devini Senaratna, and Kevin Reschke for valuable comments and discussion, and Grady Simon for making his TensorFlow implementation of GloVe available (Simon, 2017).
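Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. 2009. The difficulty of training deep architectures and the effect of unsupervised pre-training. In International Conference on Artificial Intelligence and Statistics, pages 153–160.
Pennington, J., Socher, R., and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.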