1 Introduction

The performance of models for core NLP problems relies heavily on the availability of large amounts of high-quality training data. Word embedding, language modeling and text classification are tasks that all benefit from training at scale. Pre-trained models optimized on massive corpora are readily available for most tasks, and are often used in downstream applications. For example, popular pre-trained word vectors like word2vec or fastText are trained for weeks on a large mix of web data and serve as building blocks for many NLP applications. In a similar way, text classifiers are trained offline on a large, fixed labeled training set before being shipped to an application. A good example is the fastText language identifier (https://fasttext.cc/docs/en/language-identification.html), which was trained on data from Tatoeba (https://tatoeba.org) and can serve as a component of a larger system.
However, such pre-trained models are not without their flaws. First, these general models, intended for a broad range of applications, suffer from a lack of specialization. Indeed, despite their size, large web corpora such as Common Crawl lack coverage of highly technical expert fields such as medicine or law. Second, many applications are sensitive to the temporal aspect of training and test data, as the language distribution can change drastically over time: new words enter the vocabulary, new named entities gain sudden importance, and new trends emerge rapidly. Because of this, in many situations general-purpose pre-trained models require adaptation to fit the distribution of the task at hand.
The simplest way to adapt a model to specialized data is to retrain the model from scratch on the relevant data. However, that is not always possible, as it would require (i) having access to the large dataset that was used for pre-training, and (ii) retaining the data history and processing an ever-growing dataset. Another approach is to fine-tune the model on new data to fit the new distribution. This solution is technically challenging, as one has to carefully select the relevant hyperparameters. Moreover, even when carried out carefully, it leads to a loss of important statistics gathered on the original large dataset.
In this work, we propose a simple method for combining a pre-trained model with a model trained on the new data. We frame this as a word vector alignment problem, taking inspiration from recent progress in bilingual lexicon induction. Our approach requires little retraining and only needs to store the previous model, not the data. When working with large datasets, this represents a considerable computational advantage. We show experimentally that our approach successfully adapts word vector models as well as text classifiers to new data.
2 Problem formulation
In this work, we deal with models based on word vectors, in particular word embedding and text classification models. We suppose that we are given a model with word vectors $\mathbf{X}$ that is pre-trained on a large corpus $\mathcal{D}$. We also suppose that we have access to a novel corpus $\tilde{\mathcal{D}}$ of limited size. $\tilde{\mathcal{D}}$ can differ from $\mathcal{D}$ in a variety of ways, from a subtle shift in the word distribution to the appearance of new words (for instance, neologisms). We are interested in updating the model's word vectors to the specificities of the small corpus $\tilde{\mathcal{D}}$ while retaining most of the information from the original vectors. In what follows, let us denote by $L$ (respectively $\tilde{L}$) the lexicon found in $\mathcal{D}$ (respectively $\tilde{\mathcal{D}}$).
The classical solution to this problem is to train a new model on $\tilde{\mathcal{D}}$ while initializing the parameters with the vectors from $\mathbf{X}$. We refer to this solution as fine-tuning. Two main issues arise with this approach. First, it only updates words that are in $\tilde{L}$, leaving the vectors of words in $L \setminus \tilde{L}$ untouched; this procedure can therefore create a large discrepancy between words in $\tilde{L}$ and words in $L \setminus \tilde{L}$. Second, aggressive fine-tuning may lead to a loss of information from the original dataset $\mathcal{D}$. Indeed, words in $\tilde{L}$ (including those common to the two lexicons) will specialize to reflect the distribution of $\tilde{\mathcal{D}}$, discarding the useful statistics learned on $\mathcal{D}$.
In this work, to adapt our model to $\tilde{\mathcal{D}}$, we propose to train a model on $\tilde{\mathcal{D}}$, then align it with $\mathbf{X}$ and average the two. The advantage of this alignment-based approach is that all the word vectors are updated. We denote the word vectors trained on $\tilde{\mathcal{D}}$ by $\tilde{\mathbf{X}}$. Using words in $L \cap \tilde{L}$, we find a linear mapping $\mathbf{W}$ that aligns the word vectors in $\tilde{\mathbf{X}}$ with those in $\mathbf{X}$. Given the mapped vector $\mathbf{W}\tilde{\mathbf{x}}_w$ and the original vector $\mathbf{x}_w$ of a word $w$, we construct the final word vector by simply taking an average:

$$\mathbf{y}_w = \frac{1}{2}\left(\mathbf{x}_w + \mathbf{W}\tilde{\mathbf{x}}_w\right).$$

Note that the same formulation allows modeling the confidence in the new data by replacing the average with a weighted sum:

$$\mathbf{y}_w = (1-\lambda)\,\mathbf{x}_w + \lambda\,\mathbf{W}\tilde{\mathbf{x}}_w,$$

where $\lambda$ is a parameter in $[0, 1]$ governing the confidence in the new data.
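Assuming the mapping has already been computed (how to obtain it is the subject of the next section), the update itself is a one-line operation. A minimal NumPy sketch; the function name and argument names are ours, not from any released implementation:

```python
import numpy as np

def merge_embeddings(X_old, X_new, W, lam=0.5):
    """Combine pre-trained vectors with aligned new vectors.

    X_old: (n, d) vectors from the pre-trained model.
    X_new: (n, d) vectors trained on the new corpus, row-aligned with X_old.
    W:     (d, d) orthogonal map taking new vectors into the old space.
    lam:   confidence in the new data, in [0, 1]; 0.5 gives the plain average.
    """
    return (1.0 - lam) * X_old + lam * X_new @ W.T
```

With `lam=0.5` this reduces to the simple average above; moving `lam` toward 1 trusts the new corpus more.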
3 Word vector alignment
As described in the previous section, we are given two sets of word vectors of dimension $d$, stacked in two matrices $\mathbf{X}$ and $\mathbf{Y}$. In the case of bilingual word vector alignment, the rows of $\mathbf{X}$ and $\mathbf{Y}$ have to be put in correspondence using a bilingual lexicon. When working with monolingual data, we assume without loss of generality that corresponding word vectors have the same index in $\mathbf{X}$ and $\mathbf{Y}$. Mikolov et al. (2013b) propose to frame the word vector alignment problem as linear least squares, which results in a quadratic optimization problem. The linear mapping matrix $\mathbf{W}$ can be found by solving:

$$\min_{\mathbf{W} \in \mathbb{R}^{d \times d}} \sum_{i=1}^{n} \left\| \mathbf{W}\mathbf{x}_i - \mathbf{y}_i \right\|_2^2,$$

where $\mathbf{x}_i$ and $\mathbf{y}_i$ are corresponding rows of $\mathbf{X}$ and $\mathbf{Y}$,
which admits a closed-form solution. Restraining $\mathbf{W}$ to the set of orthogonal matrices $\mathcal{O}_d$ has been shown to improve the alignments (Xing et al., 2015). In that case, the resulting problem is known as Orthogonal Procrustes, and still admits a closed-form solution obtained using a singular value decomposition (Schönemann, 1966).
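The closed-form orthogonal solution can be sketched in a few lines of NumPy. The function name is ours, and the rows of `X` and `Y` are assumed to be the paired vectors:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing sum_i ||W x_i - y_i||^2.

    X, Y: (n, d) matrices whose corresponding rows are paired vectors.
    The solution is read off the SVD of the cross-covariance X^T Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return Vt.T @ U.T
```

A quick sanity check is that the procedure recovers a known rotation: if every $\mathbf{y}_i = \mathbf{R}\mathbf{x}_i$ for an orthogonal $\mathbf{R}$, the returned matrix equals $\mathbf{R}$.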
Using an orthogonal mapping is also critical when working with classification models. Having $\mathbf{W}$ in $\mathcal{O}_d$ ensures that we can preserve the scoring function. In a linear classification model, the probability for a sample $\mathbf{x}$ to be of class $c$ can be written as:

$$p(c \mid \mathbf{x}) \propto \exp\left(\mathbf{v}_c^\top \mathbf{x}\right),$$

where $\mathbf{v}_c$ is the classifier vector of class $c$. This probability is unchanged if we map both the features and the classifiers, provided that the mapping $\mathbf{W}$ is orthogonal:

$$(\mathbf{W}\mathbf{v}_c)^\top (\mathbf{W}\mathbf{x}) = \mathbf{v}_c^\top \mathbf{W}^\top \mathbf{W} \mathbf{x} = \mathbf{v}_c^\top \mathbf{x}.$$
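This invariance is easy to verify numerically. The snippet below (all names are ours) maps a random linear classifier and a random feature vector by the same orthogonal matrix and checks that the class scores, and hence the softmax probabilities, are unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 5, 3
x = rng.standard_normal(d)             # feature vector of one sample
V = rng.standard_normal((C, d))        # one classifier vector v_c per class
# a random orthogonal matrix, obtained via QR of a Gaussian matrix
W, _ = np.linalg.qr(rng.standard_normal((d, d)))

scores = V @ x                         # v_c^T x for every class c
mapped_scores = (V @ W.T) @ (W @ x)    # map both classifiers and features

# W^T W = I, so the scores agree up to numerical precision
assert np.allclose(scores, mapped_scores)
```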
Alternative loss function.
The $\ell_2$ norm used in the previous paragraphs is intrinsically associated with the Euclidean nearest neighbor (NN) retrieval criterion. This criterion suffers from the existence of “hubs”, which are data points that are nearest neighbors to many other data points (Dinu et al., 2014). To alleviate that problem, alternative criteria have been suggested in the literature, such as the inverted softmax (Smith et al., 2017) and CSLS (Conneau et al., 2017). Joulin et al. (2018) show that minimizing a loss inspired by CSLS can significantly improve the quality of the retrieved word alignments. Their loss function, called RCSLS, is defined as:

$$\frac{1}{n} \sum_{i=1}^{n} \left( -2\, \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_i + \frac{1}{k} \sum_{\mathbf{y}_j \in \mathcal{N}_Y(\mathbf{W}\mathbf{x}_i)} \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_j + \frac{1}{k} \sum_{\mathbf{W}\mathbf{x}_j \in \mathcal{N}_X(\mathbf{y}_i)} \mathbf{x}_j^\top \mathbf{W}^\top \mathbf{y}_i \right),$$

where $\mathcal{N}_Y(\mathbf{W}\mathbf{x}_i)$ denotes the $k$ nearest neighbors of $\mathbf{W}\mathbf{x}_i$ among the target vectors, and $\mathcal{N}_X(\mathbf{y}_i)$ the $k$ nearest neighbors of $\mathbf{y}_i$ among the mapped source vectors. This loss is a tight convex relaxation of the CSLS criterion for normalized word vectors. The problem of learning an orthogonal alignment using the RCSLS loss can be solved using a projected subgradient descent method.
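For concreteness, the RCSLS objective for a fixed $\mathbf{W}$ can be written down directly in NumPy. This is a sketch under the assumptions above (normalized, row-paired vectors); the brute-force neighbor search is for clarity, not efficiency, and the function names are ours:

```python
import numpy as np

def rcsls_loss(W, X, Y, k=10):
    """RCSLS objective for a mapping W (illustrative, not optimized).

    X, Y: (n, d) normalized word vectors; row i of X is paired with row i of Y.
    k:    number of nearest neighbors used in the hubness penalty terms.
    """
    n = X.shape[0]
    sim = (X @ W.T) @ Y.T               # sim[i, j] = (W x_i)^T y_j
    loss = -2.0 * np.trace(sim) / n     # reward similarity of aligned pairs
    # penalize similarity to the k nearest target / mapped-source neighbors
    knn_rows = np.sort(sim, axis=1)[:, -k:]   # k largest entries per row
    knn_cols = np.sort(sim, axis=0)[-k:, :]   # k largest entries per column
    loss += knn_rows.mean(axis=1).sum() / n
    loss += knn_cols.mean(axis=0).sum() / n
    return loss

def project_orthogonal(W):
    """Nearest orthogonal matrix, applied after each subgradient step."""
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt
```

A projected subgradient method then alternates a gradient step on `rcsls_loss` with a call to `project_orthogonal`.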
4 Experiments

We empirically show that the alignment procedure that we propose successfully updates pre-trained models on new training data. We evaluate this in three experiments: one concerning word embeddings, and two related to text classification. In all experiments, our approach aligns $\mathbf{X}$ and $\tilde{\mathbf{X}}$ using the RCSLS loss, and takes for $\tilde{\mathbf{X}}$ the vectors obtained by fine-tuning on $\tilde{\mathcal{D}}$.
4.1 Updating word vectors
In this first experiment, we want to check how well our method updates word vector models, especially when the new corpus contains many new words never seen in $\mathcal{D}$. As this kind of data is hard to find, we simulate this setup by discarding from $\mathcal{D}$ the lines containing selected words, which remain present in $\tilde{\mathcal{D}}$. In order to measure how well the update procedure works, we create two test sets, with or without new words.
We evaluate our word vectors on word analogies (Mikolov et al., 2013a). This dataset is composed of 19,544 questions grouped in 14 categories, with a vocabulary of 904 words. In each category, we select 10% of the words, which we then remove from $\mathcal{D}$. We split the analogy dataset in two: a first set (Out of vocab) that only contains questions with at least one removed word, and a second set (In vocab) containing the leftover questions.
We take two subsets of the May 2017 dump of Common Crawl, which we preprocess following Grave et al. (2018). In order to avoid case-related problems, we lowercase the training data. The two subsets are of imbalanced size: $\mathcal{D}$ contains 8.8 billion words, while $\tilde{\mathcal{D}}$ has 440 million words. From $\mathcal{D}$ we discard the lines that contain at least one of the removed words, yielding a dataset of 974 million words. We train our word vectors using the fastText library, with no character $n$-grams, sampling 10 negatives and training the model for 10 epochs.
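The simulated setup, dropping every line of the large corpus that mentions one of the held-out words, is straightforward to reproduce. A minimal sketch with illustrative names:

```python
def filter_corpus(lines, removed_words):
    """Drop every line containing at least one of the removed words.

    lines:         iterable of whitespace-tokenized text lines.
    removed_words: words whose occurrences disqualify a line.
    """
    removed = set(removed_words)
    return [line for line in lines
            if not (set(line.split()) & removed)]
```

In practice this would be run in a streaming fashion over the corpus rather than on an in-memory list.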
[Table 1: results on the Out of vocab and In vocab question sets.]
We consider two baselines:
Fine-tune: train word vectors on $\tilde{\mathcal{D}}$, initializing the input matrix with the word vectors pre-trained on $\mathcal{D}$.
Subwords: train on $\tilde{\mathcal{D}}$ with subwords, using character $n$-grams of bounded length. We build word vectors for all words by summing their character $n$-gram vectors.
As a topline, we also provide the performance of a model trained on the union of the two corpora. This variant should be considered as the oracle solution, since it requires access to the old data.
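The Subwords baseline builds a word's vector by summing the vectors of its character $n$-grams. A minimal sketch of fastText-style $n$-gram extraction; the `minn`/`maxn` defaults here are illustrative, not taken from the experimental setup:

```python
def char_ngrams(word, minn=3, maxn=6):
    """fastText-style character n-grams, with < and > as boundary markers.

    The subword vector of a word is obtained by summing the vectors
    associated with these n-grams.
    """
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]
```

Because vectors exist for $n$-grams rather than whole words, this representation can produce a vector even for a word never seen during training.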
The quantitative results are presented in Table 1. First, we see that fine-tuning on $\tilde{\mathcal{D}}$ leads to decent performance on the Out of vocab questions, but saps the accuracy on In vocab questions. As mentioned before, learning vectors on $\tilde{\mathcal{D}}$ initialized with $\mathbf{X}$ may lead to a loss of important statistics learned on $\mathcal{D}$: the total accuracy on In vocab questions drops.
Second, we observe that training subword-enriched vectors on $\tilde{\mathcal{D}}$ (Subwords baseline) improves performance on In vocab questions when compared with simple training on $\tilde{\mathcal{D}}$. However, this baseline fails to provide good features for the Out of vocab questions, leading to poor accuracy.
Third, RCSLS+Fine-tune leads to the best average performance on both tasks, outperforming Fine-tune and Subwords. On In vocab questions, our approach even leads to better accuracy than training on the union of the two corpora. All in all, the proposed approach adapts the pre-trained model to words that were only present in $\tilde{\mathcal{D}}$ without losing precious information from $\mathcal{D}$, making it an effective method for updating word vector models.
4.2 Updating classification models in time
In this second experiment, we want to check whether our approach can adapt text classification models to new data. As opposed to the previous experiment, we perform this one on real data: user reviews written at different moments in time. In that setup, we observe a significant change in the language distribution between data splits, due to language change over time: many named entities have changed, and the set of most discriminative words and $n$-grams for predicting sentiment may have changed too.
We focus on a linear classifier on top of unigram and bigram embeddings, and use the fastText library (https://github.com/facebookresearch/fastText) (Joulin et al., 2016). We train models with ten hidden dimensions, and tune the number of epochs for each subset of each dataset. Given two models trained on the old and the new data, we learn an orthogonal mapping between the word vectors using RCSLS, taking as training pairs only the 1000 words with the highest word-vector norms. Using the learned alignment, we combine the word and $n$-gram vectors as well as the classifiers, and evaluate the resulting model on the test set. In this experiment, we only report the Fine-tune baseline, where we train a classifier on the new data and initialize its parameters with those obtained on the old data.
We consider the Yelp dataset provided in the Yelp 2019 Challenge (https://www.yelp.com/dataset/challenge). It is composed of business reviews written by Yelp users from 2013 to 2018. For our experiment, we split the data into a large training set of reviews taken from 2013-2014, and a smaller training set of reviews taken from 2018. As we want to measure the effect of the size of the new training set on the performance of the baseline and our method, we consider four variants obtained by growing the size of the 2018 subset. We evaluate our models on a test set composed of reviews from 2018.
We present the results of this experiment in Table 2. First of all, we observe that the performance of the models trained on the old and the new data strongly depends on the size of the new training set. When the two datasets are of the same size (500k), the best performing model is the one trained on the 2018 data, as in that case there is no train/test distribution discrepancy. However, when the 2018 set is small (10k or 30k), it is better to use the larger yet ill-distributed 2013-2014 dataset. We also observe that a model trained on the concatenation of the two performs at least as well as the best of them.
Second, for all sizes of the new training set, our method outperforms the fine-tuning baseline. When the new set is large, it benefits from the fine-tuning, while when it is small, it takes advantage of the initial model trained on the old data. A surprising observation is that, in this experiment, aligning the fine-tuned vectors works as well as training on the concatenation of the two datasets. This shows that in some applications one can simply retain the model while discarding the old data.
4.3 Merging text classification models
In this final experiment, we want to evaluate how our approach compares to a standard model ensembling technique. To this end, we perform a control experiment in which we evaluate how well we can combine models trained on two shards of data of similar size. A standard technique for building an ensemble of classification models is voting, or averaging the outputs of the scoring functions.
In the case of linear models, averaging the output of the scoring function is exactly equivalent to averaging the parameters. However, classification models such as the one used by Joulin et al. (2016) are based on a low-rank parametrization of the classifier: the scores of an input $\mathbf{x}$ are given by $\mathbf{B}\mathbf{A}\mathbf{x}$, where $\mathbf{A}$ embeds the input features and $\mathbf{B}$ holds the classifiers. Because of that, directly averaging the parameters of two models $(\mathbf{A}_1, \mathbf{B}_1)$ and $(\mathbf{A}_2, \mathbf{B}_2)$ is not possible:

$$\frac{1}{2}\left(\mathbf{B}_1 + \mathbf{B}_2\right) \times \frac{1}{2}\left(\mathbf{A}_1 + \mathbf{A}_2\right) \neq \frac{1}{2}\left(\mathbf{B}_1 \mathbf{A}_1 + \mathbf{B}_2 \mathbf{A}_2\right).$$

By finding an orthogonal matrix $\mathbf{Q}$ that maps $\mathbf{A}_2$ to $\mathbf{A}_1$, we should be able to do so:

$$\frac{1}{2}\left(\mathbf{B}_1 + \mathbf{B}_2 \mathbf{Q}^\top\right) \times \frac{1}{2}\left(\mathbf{A}_1 + \mathbf{Q}\mathbf{A}_2\right) \approx \frac{1}{2}\left(\mathbf{B}_1 \mathbf{A}_1 + \mathbf{B}_2 \mathbf{A}_2\right),$$

since $\mathbf{B}_2 \mathbf{A}_2 = (\mathbf{B}_2 \mathbf{Q}^\top)(\mathbf{Q}\mathbf{A}_2)$ and $\mathbf{Q}\mathbf{A}_2 \approx \mathbf{A}_1$.
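Under this notation, the merge amounts to rotating the second model's parameters by $\mathbf{Q}$ and averaging. A sketch (the function name is ours):

```python
import numpy as np

def merge_low_rank(A1, B1, A2, B2, Q):
    """Average two low-rank classifiers with scores B_i @ A_i @ x.

    A_i: (d, V) embedding matrices; B_i: (C, d) classifier matrices.
    Q:   (d, d) orthogonal matrix such that Q @ A2 is close to A1.
    """
    A = 0.5 * (A1 + Q @ A2)        # embeddings, in the space of model 1
    B = 0.5 * (B1 + B2 @ Q.T)      # classifiers, rotated consistently
    return A, B
```

When the second model is exactly a rotated copy of the first, the merged product $\mathbf{B}\mathbf{A}$ recovers the average of the two scoring functions exactly; in general the equality holds only approximately, to the extent that $\mathbf{Q}\mathbf{A}_2 \approx \mathbf{A}_1$.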
For all classifiers trained in this experiment, we follow the same procedure as in the previous experiment: we train a fastText model with bigrams and tune the number of epochs on the validation set.
For this experiment, we use a subset of the datasets proposed by Zhang et al. (2015): Sogou News, Amazon and Yelp full reviews. We randomly split each dataset into two subsets of increasing size, up to the full dataset. For any split and any method, we test our classifier on the full corresponding test set.
As in the previous experiments, we report a Fine-tune baseline. To this end, when training a model on the second split, we initialize the input matrix with the word and $n$-gram embeddings obtained when learning a classifier on the first split. In that case, the classifiers are initialized randomly.
We also report the performance obtained by training a model on each of the two splits and then aggregating the predictions by voting. Since we only use two models, in case of disagreement we take the most confident prediction. For the reasons exposed at the beginning of this section, this baseline should perform the same as our method.
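This disagreement rule can be stated in a few lines (names are ours): each model contributes its per-class probabilities, and when the two argmaxes differ, the more confident model wins:

```python
import numpy as np

def vote(p1, p2):
    """Two-model vote: on disagreement, keep the most confident prediction.

    p1, p2: per-class probability vectors from the two classifiers.
    Returns the index of the predicted class.
    """
    c1, c2 = int(np.argmax(p1)), int(np.argmax(p2))
    if c1 == c2:
        return c1
    return c1 if p1[c1] >= p2[c2] else c2
```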
We report the performance of a model trained on each of the two splits alone, the two baselines, and a topline obtained by training on the full dataset. The quantitative results are presented in Table 3. First of all, we observe that our approach reduces the gap between training on a single split and the topline. This effect is especially pronounced on the small versions of the datasets. The main reason for this improvement is that each split has an incomplete coverage of the discriminative words and $n$-grams.
Second, and most importantly, we notice that our approach performs comparably to the Vote baseline, which validates our claims. By aligning the word vectors and averaging the models, we manage to get the performance of a model ensemble, while only storing a single model.
5 Conclusion

We presented a simple method for updating word vectors to the distribution of a new corpus. Our method is not a definitive solution to this challenging task; rather, it constitutes a proof of concept. Experiments indicate that the proposed approach could be used to extend the lexicon, allowing the aggregation of low-frequency words from several corpora. While this task is of prime importance, we lack proper evaluation datasets for rare words. We leave the construction of adapted evaluation datasets for future work, and posit that such resources would greatly fuel research in that direction.
References

- Conneau et al. (2017). Word translation without parallel data. arXiv preprint arXiv:1710.04087.
- Dinu et al. (2014). Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.
- Grave et al. (2018). Learning word vectors for 157 languages. In LREC.
- Joulin et al. (2018). Loss in translation: learning bilingual word mapping with a retrieval criterion. In EMNLP.
- Joulin et al. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- Mikolov et al. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Mikolov et al. (2013b). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
- Schönemann (1966). A generalized solution of the orthogonal Procrustes problem. Psychometrika.
- Smith et al. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
- Xing et al. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In NAACL.
- Zhang et al. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.