Advances in Pre-Training Distributed Word Representations

12/26/2017 ∙ by Tomas Mikolov, et al. ∙ Facebook

Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.



1 Introduction

Pre-trained continuous word representations have become basic building blocks of many Natural Language Processing (NLP) and Machine Learning applications. These pre-trained representations provide distributional information about words, which typically improves the generalization of models learned on a limited amount of data [Collobert et al., 2011]. This information is typically derived from statistics gathered from a large unlabeled corpus of text data [Deerwester et al., 1990]. A critical aspect of their training is thus to capture efficiently as much statistical information as possible from rich and vast sources of data.

A standard approach for learning word representations is to train log-bilinear models based on either the skip-gram or the continuous bag-of-words (cbow) architectures, as implemented in word2vec [Mikolov et al., 2013a] and fastText [Bojanowski et al., 2017]. In the skip-gram model, nearby words are predicted given a source word, while in the cbow model, the source word is predicted according to its context. These architectures and their implementations have been optimized to produce high-quality word representations that transfer to many tasks, while maintaining a sufficiently high training speed to scale to massive amounts of data.

Recently, word2vec representations have been widely used in NLP pipelines to improve their performance. Their impressive capability at transferring to new problems suggests that they capture important statistics about the training corpora [Baroni and Lenci, 2010]. As can be expected, the more data a model is trained on, the better the representations are at transferring to other NLP problems. Training such models on massive data sources, like Common Crawl, can be cumbersome, and many NLP practitioners prefer to use publicly available pre-trained word vectors over training the models themselves. In this work, we provide new pre-trained word vectors that show consistent improvement over the currently available ones, making them potentially very useful to a wide community of researchers.

We show that several modifications of the standard word2vec training pipeline significantly improve the quality of the resulting word vectors. We focus mainly on known modifications and data pre-processing strategies that are rarely used together: the position-dependent features introduced by Mnih and Kavukcuoglu [2013], the phrase representations used in Mikolov et al. [2013b] and the use of subword information [Bojanowski et al., 2017].

We measure the quality of the resulting vectors on standard benchmarks: syntactic, semantic and phrase-based analogies [Mikolov et al., 2013b], the rare words dataset [Luong et al., 2013], and as features in a question answering pipeline on the Squad question answering dataset [Rajpurkar et al., 2016, Chen et al., 2017].

2 Model Description

In this section, we briefly describe the cbow model as it was used in word2vec, and then explain several known improvements to learn richer word representations.

2.1 Standard cbow model

The cbow model as used in Mikolov et al. [2013a] learns word representations by predicting a word according to its context. The context is defined as a symmetric window containing all the surrounding words. More precisely, given a sequence of $T$ words $w_1, \dots, w_T$, the objective of the cbow model is to maximize the log-likelihood of the probability of the words given their surrounding, i.e.:

$$\sum_{t=1}^{T} \log p(w_t \mid C_t), \tag{1}$$

where $C_t$ is the context of the $t$-th word, e.g., the words $w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}$ for a context window of size $2c$. From now on, we assume that we have access to a scoring function between a word $w$ and its context $C$, denoted by $s(w, C)$. This scoring function will later be parametrized by the word vectors, or representations. A natural candidate for the conditional probability in Eq. (1) is a softmax function over the scores of a context and the words in the vocabulary. This choice is however impractical for large vocabularies. An alternative is to replace this probability by independent binary classifiers over words. The correct word is learned in contrast with a set of sampled negative candidates. More precisely, the log-probability of a word $w$ given its context $C$ in Eq. (1) is replaced by the following quantity:

$$-\log\left(1 + e^{-s(w, C)}\right) - \sum_{n \in N_C} \log\left(1 + e^{s(n, C)}\right), \tag{2}$$

where $N_C$ is a set of negative examples sampled from the vocabulary. The objective function maximized by the cbow model is obtained by replacing the log probability in Eq. (1) by the quantity defined in Eq. (2), i.e.:

$$\sum_{t=1}^{T} \left[ -\log\left(1 + e^{-s(w_t, C_t)}\right) - \sum_{n \in N_{C_t}} \log\left(1 + e^{s(n, C_t)}\right) \right].$$

A natural parametrization for this model is to represent each word $w$ by a vector $v_w$. Similarly, a context is represented by the average of the word vectors $u_{w'}$ of each word $w'$ in its window. The scoring function is simply the dot product between these two quantities, i.e.,

$$s(w, C) = \frac{1}{|C|} \sum_{w' \in C} u_{w'}^\top v_w. \tag{3}$$

Note that different parametrizations are used for the words in a context and the predicted word.
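
As a rough illustration of the scoring function and the negative-sampling objective described above, the following sketch evaluates both on toy vectors. It is a minimal pure-Python example, not the word2vec or fastText implementation; the function names and the toy values are assumptions made for illustration, and the objective is written as a sum of log-sigmoid terms so that larger values are better.

```python
import math

def score(word_vec, context_vecs):
    """s(w, C): dot product of the averaged context vectors with the word vector."""
    dim = len(word_vec)
    avg = [sum(u[i] for u in context_vecs) / len(context_vecs) for i in range(dim)]
    return sum(a * v for a, v in zip(avg, word_vec))

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x))), i.e. -log(1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def negative_sampling_objective(target_vec, negative_vecs, context_vecs):
    """Objective for one position: high score for the true word, low for negatives."""
    obj = log_sigmoid(score(target_vec, context_vecs))
    for n_vec in negative_vecs:
        obj += log_sigmoid(-score(n_vec, context_vecs))
    return obj

# Toy example (hypothetical 2-dimensional vectors).
context = [[1.0, 0.0], [0.0, 1.0]]   # context-side vectors u of two context words
target = [2.0, 2.0]                  # target-side vector v of the true word
negatives = [[-2.0, -2.0]]           # target-side vectors of sampled negatives

print(score(target, context))        # 2.0
objective = negative_sampling_objective(target, negatives, context)
```

The objective is always negative and approaches zero as the true word scores higher and the negatives score lower, which is what training pushes towards.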

Word subsampling.

The word frequency distribution in a standard text corpus follows a Zipf distribution, which implies that most of the words in a text belong to a small subset of the entire vocabulary [Li, 1992]. Considering all the occurrences of words equally would lead to overfitting the parameters of the model on the representations of the most frequent words, while underfitting on the rest. A common strategy introduced in Mikolov et al. [2013a] is to subsample frequent words, with the following probability $p_{disc}$ of discarding a word $w$:

$$p_{disc}(w) = 1 - \sqrt{t / f_w},$$

where $f_w$ is the frequency of the word $w$, and $t > 0$ is a parameter.
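
The subsampling rule can be sketched as below, assuming the discard probability is one minus the square root of the threshold divided by the word's relative frequency, clipped at zero for rare words. The function names, the toy corpus and the threshold value are illustrative assumptions, not the settings used for the released models.

```python
import math
import random
from collections import Counter

def discard_probability(freq, threshold):
    """Probability of dropping one occurrence of a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(threshold / freq))

def subsample(tokens, threshold, rng):
    """Randomly drop occurrences of frequent words; words at or below the threshold are always kept."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        p = discard_probability(counts[tok] / total, threshold)
        if rng.random() >= p:
            kept.append(tok)
    return kept

# Toy corpus: one very frequent word and two rare ones.
tokens = ["the"] * 98 + ["cat", "sat"]
kept = subsample(tokens, threshold=0.01, rng=random.Random(0))
```

In this toy run the two rare words have a discard probability of zero and always survive, while roughly nine out of ten occurrences of the frequent word are dropped.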

2.2 Position-dependent Weighting

The context vector described above is simply the average of the word vectors contained in it. This representation is oblivious to the position of each word. Explicitly encoding a representation for both a word and its position would be impractical and prone to overfitting. A simple yet effective solution, introduced in the context of word representations by Mnih and Kavukcuoglu [2013], is to learn position representations and use them to reweight the word vectors. This position-dependent weighting offers a richer context representation at a minimal computational cost.

Each position $p$ in a context window is associated with a vector $d_p$. The context vector is then the average of the context words reweighted by their position vectors. More precisely, denoting by $P$ the set of relative positions $\{-c, \dots, -1, 1, \dots, c\}$ in a context window with $c$ words on each side, the context vector $v_C$ of the word $w_t$ is:

$$v_C = \frac{1}{|P|} \sum_{p \in P} d_p \odot u_{w_{t+p}},$$

where $u_{w_{t+p}}$ is the vector of the context word at relative position $p$ and $\odot$ is the pointwise multiplication of vectors.
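
A minimal sketch of the position-dependent context vector follows. Each context word vector is multiplied element-wise by the vector of its relative position, and the results are averaged. The function name, the toy vectors, and the choice to average only over positions that fall inside the sequence are assumptions made for illustration.

```python
def position_weighted_context(tokens, t, c, word_vecs, pos_vecs):
    """Average of (position vector * context word vector), element-wise, around index t."""
    dim = len(next(iter(pos_vecs.values())))
    total = [0.0] * dim
    used = 0
    for p in range(-c, c + 1):
        # Skip the target word itself and positions outside the sequence.
        if p == 0 or not 0 <= t + p < len(tokens):
            continue
        u = word_vecs[tokens[t + p]]
        d = pos_vecs[p]
        for i in range(dim):
            total[i] += d[i] * u[i]
        used += 1
    return [x / used for x in total] if used else total

# Toy example with 2-dimensional vectors and one word of context on each side.
tokens = ["a", "b", "c"]
word_vecs = {"a": [1.0, 2.0], "b": [0.0, 0.0], "c": [3.0, 4.0]}
pos_vecs = {-1: [1.0, 0.0], 1: [0.0, 1.0]}
ctx = position_weighted_context(tokens, t=1, c=1, word_vecs=word_vecs, pos_vecs=pos_vecs)
print(ctx)  # [0.5, 2.0]
```

Here the left position keeps only the first coordinate and the right position only the second, so the same word would contribute differently depending on which side of the target it appears.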

2.3 Phrase representations

The original cbow model is based only on unigrams and is therefore insensitive to word order. We enrich this model with word n-grams to capture richer information. Directly incorporating the n-grams in the model is quite challenging, as it clutters the model with uninformative content due to the huge increase in the number of parameters. Instead, we follow the approach of Mikolov et al. [2013b], where n-grams are selected by iteratively applying a mutual information criterion to bigrams. Then, in a data pre-processing step, we merge the words in a selected n-gram into a single token.

For example, words with high mutual information like "New York" are merged into a bigram token, "New_York". This pre-processing step is repeated several times to form longer n-gram tokens, like "New_York_City" or "New_York_University" (the experiments in Section 4 use 6 iterations). We used the word2phrase tool from the word2vec project. Note that unigrams with high mutual information are merged only with some probability, so we still keep a significant number of unigram occurrences. Interestingly, even if the phrase representations are not further used in an application, they effectively improve the quality of the word vectors, as is shown in the experimental section.
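
The iterative merging can be sketched as follows. This is not the word2phrase tool: it assumes a plain pointwise mutual information score with a minimum bigram count, and the function names, thresholds and toy corpus are hypothetical. The `merge_prob` argument mimics merging a qualifying pair only with some probability.

```python
import math
import random
from collections import Counter

def merge_pass(sentences, min_count, threshold, merge_prob, rng):
    """One pass: join adjacent tokens whose pointwise mutual information exceeds `threshold`."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    total = sum(unigrams.values())

    def pmi(a, b):
        if bigrams[(a, b)] < min_count:
            return float("-inf")
        return math.log(bigrams[(a, b)] * total / (unigrams[a] * unigrams[b]))

    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if (i + 1 < len(s) and pmi(s[i], s[i + 1]) > threshold
                    and rng.random() < merge_prob):
                out.append(s[i] + "_" + s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged

def build_phrases(sentences, passes, **kwargs):
    """Repeat the merge pass so that merged tokens can grow into longer n-grams."""
    for _ in range(passes):
        sentences = merge_pass(sentences, **kwargs)
    return sentences

corpus = [
    ["i", "love", "new", "york", "city"],
    ["new", "york", "city", "is", "big"],
    ["she", "left", "new", "york", "city"],
]
phrases = build_phrases(corpus, passes=2, min_count=2, threshold=1.0,
                        merge_prob=1.0, rng=random.Random(0))
print(phrases[0])  # ['i', 'love', 'new_york_city']
```

The first pass produces `new_york`, and the second pass merges it with `city`, which is how repeated passes over bigrams yield longer n-gram tokens.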

2.4 Subword information

Standard word vectors ignore the internal structure of words, which contains rich information. This information could be useful for computing representations of rare or misspelled words, as well as for morphologically rich languages like Finnish or Turkish. A simple yet effective approach is to enrich the word vectors with a bag of character n-gram vectors that is either derived from the singular value decomposition of the co-occurrence matrix [Schütze, 1993] or directly learned from a large corpus of data [Bojanowski et al., 2017]. In the latter, each word $w$ is decomposed into its set of character n-grams $\mathcal{N}_w$ and each n-gram $g$ is represented by a vector $x_g$. The word vector is then simply the sum of both representations, i.e.:

$$v_w + \frac{1}{|\mathcal{N}_w|} \sum_{g \in \mathcal{N}_w} x_g,$$

where $v_w$ is the vector of the word itself.

In practice, the set of n-grams $\mathcal{N}_w$ is restricted to n-grams whose length in characters lies in a fixed range. Storing all of these additional vectors is memory demanding. We use the hashing trick to circumvent this issue [Weinberger et al., 2009].
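
The decomposition into character n-grams and the hashing trick can be sketched as below. The `<` and `>` boundary markers follow Bojanowski et al. [2017]. The default range of 3 to 6 characters is the fastText library default rather than a value taken from the text above, and the FNV-1a hash, the table size and all function names are illustrative assumptions.

```python
def char_ngrams(word, min_n, max_n):
    """All character n-grams of `word` with boundary markers, for min_n <= n <= max_n."""
    marked = "<" + word + ">"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def word_vector(word, word_vecs, ngram_table, min_n=3, max_n=6):
    """Word vector plus the average of its hashed n-gram vectors."""
    grams = char_ngrams(word, min_n, max_n)
    dim = len(ngram_table[0])
    sub = [0.0] * dim
    for g in grams:
        # Hashing trick: n-grams share a fixed number of rows instead of one row each.
        row = ngram_table[fnv1a_32(g) % len(ngram_table)]
        sub = [a + b for a, b in zip(sub, row)]
    if grams:
        sub = [x / len(grams) for x in sub]
    # Out-of-vocabulary words fall back to the n-gram part alone.
    base = word_vecs.get(word, [0.0] * dim)
    return [a + b for a, b in zip(base, sub)]

print(char_ngrams("cat", 3, 4))  # ['<ca', 'cat', 'at>', '<cat', 'cat>']

ngram_table = [[1.0, 1.0] for _ in range(4)]   # tiny hashed table, 4 buckets
word_vecs = {"cat": [0.5, -0.5]}
in_vocab = word_vector("cat", word_vecs, ngram_table)
out_of_vocab = word_vector("cats", word_vecs, ngram_table)
```

Because unseen words still decompose into n-grams, this scheme yields a vector for rare or misspelled words that have no entry of their own, at the cost of occasional hash collisions between n-grams.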

3 Training Data

We used several publicly available sources of text data, together with the Gigaword dataset, as described in Table 1. In particular, we used English Wikipedia from June 2017, from which we took the meta pages archive, resulting in a text corpus of more than 9 billion words. Further, we used all Statmt news datasets from the years 2007-2016, the UMBC corpus [Han et al., 2013], the English Gigaword, and Common Crawl from May 2017.

For Common Crawl, we wrote a simple data extractor based on a unigram language model that retrieves the documents written in English and discards low-quality data. The same approach can in fact be used to extract text data for many other languages from Common Crawl.

We decided to perform no complex data normalization or pre-processing, as we want the resulting word vectors to be easy to use for a wide community (text normalization can be done on top of the published word vectors as a post-processing step). We only used a publicly available script from the Moses MT project. We observed that de-duplicating large text training corpora, especially Common Crawl, significantly improves the quality of the resulting word vectors.
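
Sentence-level de-duplication can be sketched as follows. The text does not say how duplicates were detected, so this example assumes exact matching after stripping surrounding whitespace and stores a hash of each sentence to bound memory use. The function name is hypothetical.

```python
import hashlib

def deduplicate_sentences(sentences):
    """Yield each distinct sentence once, in first-seen order."""
    seen = set()
    for sentence in sentences:
        # Store a fixed-size digest rather than the sentence itself.
        key = hashlib.sha1(sentence.strip().encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield sentence

lines = ["a b", "c d", "a b ", "c d"]
unique = list(deduplicate_sentences(lines))
print(unique)  # ['a b', 'c d']
```

Written as a generator, the function can stream over a corpus that does not fit in memory, keeping only the set of digests.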

Corpus                  Size [billion]
Wikipedia meta-pages    9.2
Statmt News             4.2
UMBC News               3.2
Gigaword                3.3
Common Crawl            630
Table 1: Training corpora and their size in billions of words after tokenization and sentence de-duplication.

4 Results

In the following, we report results for models trained either on Common Crawl or on a combination of Wikipedia, Statmt News, UMBC and Gigaword. These corpora are comparable to those used to train other models that attempted to improve upon word2vec, notably the GloVe model from the Stanford NLP group [Pennington et al., 2014]. Although a careful analysis performed in Levy et al. [2015] shows that the original word2vec is faster to train, produces more accurate models and takes significantly less memory than the GloVe algorithm, the availability of large pre-trained GloVe models proved to be a useful resource for many researchers who do not have time to train their own model on very large datasets like Common Crawl.

We used the cbow architecture described in Section 2.1 with window size 5 for the baseline models and window size 15 for the models that learn position-dependent weights (described in Section 2.2). We used 10 negative examples for negative sampling and subsampled frequent words using the threshold described in Section 2.1.

In Table 2, we report results on the word analogies from Mikolov et al. [2013a] using a baseline cbow model trained on Common Crawl with de-duplicated sentences, with phrases (we used 6 iterations of building the phrases by merging bigrams with high mutual information), and with the position-dependent weighting as used in Mnih and Kavukcuoglu [2013]. The training itself took three days on a single multi-core machine.

Model                               Sem   Syn   Tot
cbow + uniq                         79    73    76
cbow + uniq + phrases               82    78    80
cbow + uniq + phrases + weighting   87    82    85
Table 2: Accuracies on semantic and syntactic analogy datasets for models trained on Common Crawl (630B words). By performing sentence-level de-duplication, adding position-dependent weighting and phrases, the model quality improves significantly.

Table 3 compares cbow as implemented in the fastText library [Bojanowski et al., 2017] with GloVe models trained on comparable corpora. The 87% accuracy on the word analogy tasks is, to our knowledge, the best published result so far by a large margin, and much better than the existing GloVe models. We improved this result further to 88.5% accuracy by adding the subword features.

We also report very strong performance on the Rare Words dataset [Luong et al., 2013], again outperforming GloVe models by a large margin. Finally, we replaced the GloVe pre-trained vectors with the new fastText vectors in a question answering system trained on the Squad dataset [Rajpurkar et al., 2016]. In a setup that is further described in Chen et al. [2017], we observed a significant improvement in accuracy.

Model                  Analogy   RW     Squad
GloVe Wiki + news      72        0.38   77.7%
fastText Wiki + news   87        0.50   78.8%
GloVe Crawl            75        0.52   78.9%
fastText Crawl         85        0.58   79.8%
Table 3: Results on Word Analogy, Rare Words and Squad datasets with fastText models trained on various corpora (see Table 1) or Common Crawl (see Table 2), and comparison to GloVe models trained on comparable datasets.

The models trained on the Wikipedia and News corpora, and on Common Crawl, were published on the fastText website and are available to NLP researchers. Further, we experimented with the phrase-based analogy dataset introduced in Mikolov et al. [2013b], and achieved 88% accuracy using the model trained on Crawl, which again is, to our knowledge, a new state-of-the-art result. We plan to release the model containing all the phrases in the near future.

GloVe Wiki+news
GloVe Crawl
fastText Wiki+news 88.3 85.0
fastText Crawl 78.2 81.1 92.5 82.0 82.7
Table 4: Comparison of different pre-trained models on supervised text classification tasks.

Finally, in Table 4, we use a script provided by Conneau et al. [2017] to measure the influence of different pre-trained word vector models on several text classification tasks (MRPC, MR, CR, SUBJ, MPQA, SST and TREC). We performed the classification using the standard fastText toolkit running in supervised mode [Joulin et al., 2016], using the pre-trained models to initialize the classifier. Overall, the new fastText word vectors result in superior text classification performance.

5 Discussion

In this work, we have focused on providing a very high-quality set of pre-trained word and phrase vector representations. Our findings indicate that improvements can be achieved by training well-known algorithms on very large text datasets, and that using certain tricks can provide further gains in quality. Notably, we have found it very important to de-duplicate sentences in large corpora such as Common Crawl before training the models. Next, we have used an algorithm for building phrases in a pre-processing step. Finally, adding the position-dependent weights and subword features to the cbow model architecture gave us the final boost in accuracy. The models described in this paper are freely available to researchers and engineers at the fastText webpage, and we hope that they will be useful in various projects that use textual data.

6 Acknowledgements

We thank Marco Baroni and German Kruszewski for useful discussions and suggestions, Adam Fisch for help with the experiments on the Squad dataset, and Qun Luo for suggesting the use of position-dependent weighting on the word2vec discussion forum.

7 Bibliographical References