Pre-trained continuous word representations have become basic building blocks of many Natural Language Processing (NLP) and Machine Learning applications. These pre-trained representations provide distributional information about words, which typically improves the generalization of models learned on limited amounts of data [Collobert et al.2011]. This information is typically derived from statistics gathered from large unlabeled corpora of text data [Deerwester et al.1990]. A critical aspect of their training is thus to efficiently capture as much statistical information as possible from rich and vast sources of data.
A standard approach for learning word representations is to train log-bilinear models based on either the skip-gram or the continuous bag-of-words (cbow) architectures, as implemented in word2vec [Mikolov et al.2013a] and fastText [Bojanowski et al.2017] (https://fasttext.cc/). In the skip-gram model, nearby words are predicted given a source word, while in the cbow model, the source word is predicted according to its context. These architectures and their implementations have been optimized to produce high quality word representations that transfer well to many tasks, while maintaining a sufficiently high training speed to scale to massive amounts of data.
Recently, word2vec representations have been widely used in NLP pipelines to improve their performance. Their impressive capability to transfer to new problems suggests that they capture important statistics about the training corpora [Baroni and Lenci2010]. As can be expected, the more data a model is trained on, the better the representations transfer to other NLP problems. Training such models on massive data sources, like Common Crawl, can be cumbersome, and many NLP practitioners prefer to use publicly available pre-trained word vectors rather than training the models themselves. In this work, we provide new pre-trained word vectors that show consistent improvements over the currently available ones, making them potentially very useful to a wide community of researchers.
We show that several modifications of the standard word2vec training pipeline significantly improve the quality of the resulting word vectors. We focus mainly on known modifications and data pre-processing strategies that are rarely used together: the position-dependent features introduced by [Mnih and Kavukcuoglu2013], the phrase representations used in [Mikolov et al.2013b], and the use of subword information [Bojanowski et al.2017].
2 Model Description
In this section, we briefly describe the cbow model as it was used in word2vec, and then explain several known improvements to learn richer word representations.
2.1 Standard cbow model
The cbow model as used in [Mikolov et al.2013a] learns word representations by predicting a word according to its context. The context is defined as a symmetric window containing all the surrounding words. More precisely, given a sequence of words $w_1, \dots, w_T$, the objective of the cbow model is to maximize the log-likelihood of the words given their surrounding words, i.e.:

$$\sum_{t=1}^{T} \log p(w_t \mid C_t), \qquad (1)$$

where $C_t$ is the context of the $t$-th word, e.g., the words $w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}$ for a context window of size $2c$. From now on, we assume that we have access to a scoring function between a word $w$ and its context $C$, denoted by $s(w, C)$. This scoring function will later be parametrized by the word vectors, or representations. A natural candidate for the conditional probability in Eq. (1)
is a softmax function over the scores of a context and words in the vocabulary. This choice is however impractical for large vocabularies. An alternative is to replace this probability by independent binary classifiers over words. The correct word is learned in contrast with a set of sampled negative candidates. More precisely, the log-probability of a word $w_t$ given its context $C_t$ in Eq. (1) is replaced by the following quantity:

$$\log\left(1 + e^{-s(w_t, C_t)}\right) + \sum_{n \in \mathcal{N}_t} \log\left(1 + e^{s(n, C_t)}\right), \qquad (2)$$

where $\mathcal{N}_t$ is the set of negative examples sampled from the vocabulary for position $t$. The objective function minimized by the cbow model is obtained by replacing the log-probability in Eq. (1) by the quantity defined in Eq. (2), i.e.:

$$\sum_{t=1}^{T} \left[ \log\left(1 + e^{-s(w_t, C_t)}\right) + \sum_{n \in \mathcal{N}_t} \log\left(1 + e^{s(n, C_t)}\right) \right]. \qquad (3)$$
A natural parametrization for this model is to represent each word $w$ by a vector $v_w$. Similarly, a context is represented by the average of the vectors $u_{w'}$ of the words $w'$ in its window. The scoring function is simply the dot product between these two quantities, i.e.:

$$s(w_t, C_t) = \left( \frac{1}{|C_t|} \sum_{w' \in C_t} u_{w'} \right)^\top v_{w_t}. \qquad (4)$$

Note that different parametrizations are used for the words in a context (the vectors $u$) and for the predicted word (the vectors $v$).
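To make these definitions concrete, here is a minimal NumPy sketch of the scoring function of Eq. (4) and the binary logistic loss of Eq. (2); the variable names and toy dimensions are ours, not those of the actual word2vec implementation:

```python
import numpy as np

def score(context_ids, target_id, U, V):
    # s(w, C) of Eq. (4): dot product between the averaged context
    # vectors and the vector of the candidate word
    c = U[context_ids].mean(axis=0)
    return c @ V[target_id]

def negative_sampling_loss(context_ids, target_id, negative_ids, U, V):
    # quantity of Eq. (2): the true word should score high,
    # the sampled negatives should score low
    pos = np.log1p(np.exp(-score(context_ids, target_id, U, V)))
    neg = sum(np.log1p(np.exp(score(context_ids, n, U, V)))
              for n in negative_ids)
    return pos + neg

# toy usage: vocabulary of 10 words, 5-dimensional vectors
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(10, 5))  # context vectors u
V = rng.normal(scale=0.1, size=(10, 5))  # word vectors v
print(negative_sampling_loss([1, 2, 4, 5], 3, [7, 8], U, V))
```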
The word frequency distribution in a standard text corpus follows a Zipf distribution, which implies that most of the words belong to a small subset of the entire vocabulary [Li1992]. Considering all occurrences of words equally would lead to overfitting the parameters of the model on the representations of the most frequent words, while underfitting on the rest. A common strategy introduced in [Mikolov et al.2013a] is to subsample frequent words, with the following probability of discarding a word:

$$P(w) = 1 - \sqrt{\frac{t}{f_w}},$$

where $f_w$ is the frequency of the word $w$, and $t > 0$ is a parameter.
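A direct transcription of this rule (the function and parameter names are ours, and the threshold value is only a commonly used setting) decides independently for each occurrence of a word whether to keep it:

```python
import math
import random

def keep_occurrence(word_count, total_count, t=1e-5):
    # relative frequency f_w of the word in the corpus
    f = word_count / total_count
    # probability of discarding an occurrence: 1 - sqrt(t / f_w),
    # clipped at 0 so that rare words are never discarded
    p_discard = max(0.0, 1.0 - math.sqrt(t / f))
    return random.random() >= p_discard
```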
2.2 Position-dependent Weighting
The context vector described above is simply the average of the word vectors it contains. This representation is oblivious to the position of each word. Explicitly encoding a representation for each combination of word and position would be impractical and prone to overfitting. A simple yet effective solution, introduced in the context of word representations by [Mnih and Kavukcuoglu2013], is to learn position representations and use them to reweight the word vectors. This position-dependent weighting offers a richer context representation at a minimal computational cost.
Each position $p$ in a context window is associated with a vector $d_p$. The context vector is then the average of the context word vectors reweighted by their position vectors. More precisely, denoting by $P$ the set of relative positions in the context window, the context vector $c_t$ of the word $w_t$ is:

$$c_t = \frac{1}{|P|} \sum_{p \in P} d_p \odot u_{t+p},$$

where $\odot$ is the pointwise multiplication of vectors.
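A minimal sketch of this reweighting (variable names are ours; in training, the position vectors would be learned jointly with the word vectors, here they are simply given):

```python
import numpy as np

def position_weighted_context(window_ids, D, U):
    # each context word vector u is reweighted elementwise by the
    # vector d_p of its slot in the window, then the results are averaged
    vecs = [D[p] * U[w] for p, w in enumerate(window_ids)]
    return np.mean(vecs, axis=0)

# toy usage: a window of 4 context words, 5-dimensional vectors
rng = np.random.default_rng(0)
U = rng.normal(size=(10, 5))  # word vectors
D = rng.normal(size=(4, 5))   # one position vector per window slot
print(position_weighted_context([1, 2, 4, 5], D, U))
```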
2.3 Phrase representations
The original cbow model is based only on unigrams, and is thus insensitive to word order. We enrich this model with word n-grams to capture richer information. Directly incorporating the n-grams into the model is quite challenging, as it clutters the model with uninformative content due to a huge increase in the number of parameters. Instead, we follow the approach of [Mikolov et al.2013b], where n-grams are selected by iteratively applying a mutual information criterion to bigrams. Then, in a data pre-processing step, we merge the words in a selected n-gram into a single token.
For example, words with high mutual information like "New York" are merged into a bigram token, "New_York". This pre-processing step can be repeated to form longer n-gram tokens, like "New_York_City" or "New_York_University". In practice, we repeat this process several times to build tokens representing longer n-grams. We used the word2phrase tool from the word2vec project (https://github.com/tmikolov/word2vec). Note that unigrams with high mutual information are merged only with a certain probability, so we still keep a significant number of unigram occurrences. Interestingly, even if the phrase representations are not further used in an application, they effectively improve the quality of the word vectors, as shown in the experimental section.
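The following sketch illustrates one merging pass in the spirit of word2phrase, using the discounted bigram score of [Mikolov et al.2013b]; unlike the actual tool, it merges every qualifying bigram deterministically, and the threshold value here is arbitrary:

```python
from collections import Counter

def merge_phrases(corpus, delta=5, threshold=0.02):
    # count unigrams and bigrams over the tokenized corpus
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    out = []
    for sent in corpus:
        merged, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                a, b = sent[i], sent[i + 1]
                # discounted score: (count(ab) - delta) / (count(a) * count(b))
                score = (bigrams[a, b] - delta) / (unigrams[a] * unigrams[b])
                if score > threshold:
                    merged.append(a + "_" + b)
                    i += 2
                    continue
            merged.append(sent[i])
            i += 1
        out.append(merged)
    return out

# "new york" always co-occurs, so only it gets merged on this toy corpus;
# repeated application would build longer tokens such as "new_york_city"
corpus = ([["he", "visited", "new", "york"]] * 30 +
          [["she", "visited", "paris"]] * 30 +
          [["they", "visited", "berlin"]] * 30)
print(merge_phrases(corpus)[0])  # ['he', 'visited', 'new_york']
```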
2.4 Subword information
Standard word vectors ignore word-internal structure, which contains rich information. This information can be useful for computing representations of rare or misspelled words, as well as for morphologically rich languages like Finnish or Turkish. A simple yet effective approach is to enrich the word vectors with a bag of character n-gram vectors that is either derived from the singular value decomposition of the co-occurrence matrix [Schütze1993] or directly learned from a large corpus of data [Bojanowski et al.2017]. In the latter, each word $w$ is decomposed into its set of character n-grams $\mathcal{G}_w$, and each n-gram $g$ is represented by a vector $x_g$. The word vector is then simply the sum of both representations, i.e.:

$$v_w + \sum_{g \in \mathcal{G}_w} x_g.$$

In practice, the set of n-grams $\mathcal{G}_w$ is restricted to the n-grams with 3 to 6 characters. Storing vectors for all these additional n-grams is memory demanding. We use the hashing trick to circumvent this issue [Weinberger et al.2009].
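A sketch of the n-gram extraction and hashing follows; Python's built-in hash stands in for the FNV-style hash of the actual fastText implementation, and the bucket count is illustrative:

```python
def subword_ngram_ids(word, nmin=3, nmax=6, buckets=2_000_000):
    # add boundary markers so prefixes and suffixes are distinguished,
    # as in fastText ("where" -> "<where>")
    token = "<" + word + ">"
    grams = [token[i:i + n]
             for n in range(nmin, nmax + 1)
             for i in range(len(token) - n + 1)]
    # hashing trick: map each n-gram into a fixed number of buckets so
    # all n-gram vectors fit in one embedding matrix of bounded size
    return [hash(g) % buckets for g in grams]

print(subword_ngram_ids("where"))
```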
3 Training Data
We used several sources of text data that are publicly available, as well as the Gigaword dataset, as described in Table 1. In particular, we used the English Wikipedia dump from June 2017, from which we took the meta-pages archive, resulting in a text corpus of more than 9 billion words (https://dumps.wikimedia.org/enwiki/latest/). Further, we used all news datasets from statmt.org from the years 2007-2016, the UMBC corpus [Han et al.2013], the English Gigaword, and Common Crawl from May 2017 (https://commoncrawl.org/2017/06).
In the case of Common Crawl, we wrote a simple data extractor based on a unigram language model that retrieves documents written in English and discards low quality data. The same approach can in fact be used to extract text data from Common Crawl for many other languages.
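The extractor itself is not published with this paper; a minimal sketch of the underlying idea, assuming a table of unigram log-probabilities estimated from a known-English corpus, could look like this:

```python
import math

def avg_logprob(text, logprob, oov_penalty=-10.0):
    # average per-token log-probability under a unigram English
    # language model; out-of-vocabulary tokens get a fixed penalty
    tokens = text.lower().split()
    if not tokens:
        return float("-inf")
    return sum(logprob.get(tok, oov_penalty) for tok in tokens) / len(tokens)

# toy model; real probabilities would come from a large English corpus
logprob = {"the": math.log(0.05), "cat": math.log(1e-4), "sat": math.log(1e-4)}
keep = avg_logprob("the cat sat", logprob) > -9.0  # cutoff is arbitrary
```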
We decided to perform no complex data normalization or pre-processing, as we want the resulting word vectors to be easily usable by a wide community (text normalization can be applied on top of the published word vectors as a post-processing step). We only used a publicly available script from the Moses MT project (https://github.com/moses-smt). We observed that de-duplicating large text training corpora, especially Common Crawl, significantly improves the quality of the resulting word vectors.
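De-duplication at this scale can be done in a single streaming pass; a sketch (the granularity of de-duplication, here one line at a time, is our assumption):

```python
import hashlib

def dedup_lines(in_path, out_path):
    # keep the first occurrence of each line; storing 16-byte digests
    # instead of full lines keeps memory usage manageable
    seen = set()
    with open(in_path, encoding="utf-8", errors="ignore") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            digest = hashlib.md5(line.strip().encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                fout.write(line)
```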
Further, we report results for models trained either on Common Crawl, or on a combination of Wikipedia, Statmt News, UMBC and Gigaword. These corpora are comparable to those used to train other models that attempted to improve upon word2vec, notably the GloVe model from the Stanford NLP group [Pennington et al.2014]. Although a careful analysis performed in [Levy et al.2015] shows that the original word2vec is faster to train, produces more accurate models and takes significantly less memory than the GloVe algorithm, the availability of large pre-trained GloVe models has proved to be a useful resource for many researchers who do not have the time to train their own models on very large datasets like Common Crawl.
We used the cbow architecture described in Section 2.1 with window size 5 for the baseline models, and window size 15 for the models that learn position-dependent weights (described in Section 2.2). We used 10 negative examples for training with negative sampling, together with subsampling of frequent words as described in Section 2.1.
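With the fasttext Python bindings, the baseline configuration just described would be expressed roughly as follows (file names are placeholders; parameters we do not list are left at the library defaults):

```python
import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt",   # de-duplicated, phrase-merged training text
    model="cbow",
    ws=5,           # window size of the baseline model
    neg=10,         # number of negative examples
    dim=300,        # a typical dimensionality for released vectors
)
model.save_model("cbow_baseline.bin")
```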
In Table 2 we report results on the word analogies from [Mikolov et al.2013a] using the baseline cbow model trained on Common Crawl with de-duplicated sentences, with phrases (we used 6 iterations of building the phrases by merging bigrams with high mutual information), and with the position-dependent weighting of [Mnih and Kavukcuoglu2013]. The training itself took three days on a single multi-core machine.
| Model | Semantic | Syntactic | Total |
| --- | --- | --- | --- |
| cbow + uniq | 79 | 73 | 76 |
| cbow + uniq + phrases | 82 | 78 | 80 |
| cbow + uniq + phrases + weighting | 87 | 82 | 85 |
In Table 3, we compare cbow as implemented in the fastText library [Bojanowski et al.2017] with GloVe models trained on comparable corpora. The 87% accuracy on the word analogy tasks is, to our knowledge, the best published result so far by a large margin, and considerably better than the existing GloVe models trained on comparable corpora. We improved this result further to 88.5% accuracy by adding the subword features.
We also report very strong performance on the Rare Words dataset [Luong et al.2013], again outperforming the GloVe models by a large margin. Finally, we replaced the GloVe pre-trained vectors with the new fastText vectors in a question answering system trained on the SQuAD dataset [Rajpurkar et al.2016]. In the setup described further in [Chen et al.2017], we observed a significant improvement in accuracy.
| Model | Word analogy | Rare Words | SQuAD |
| --- | --- | --- | --- |
| GloVe Wiki + news | 72 | 0.38 | 77.7% |
| fastText Wiki + news | 87 | 0.50 | 78.8% |
The models trained on the Wikipedia and news corpora, and on Common Crawl, have been published at the fasttext.cc website and are available to NLP researchers. Further, we experimented with the phrase-based analogy dataset introduced in [Mikolov et al.2013b], and achieved 88% accuracy using the model trained on Common Crawl, which again is, to our knowledge, a new state-of-the-art result. We plan to release the model containing all the phrases in the near future.
Finally, in Table 4, we use a script provided by [Conneau et al.2017] to measure the influence of different pre-trained word vector models on several text classification tasks (MRPC, MR, CR, SUBJ, MPQA, SST and TREC). We performed the classification using the standard fastText toolkit running in supervised mode [Joulin et al.2016], using the pre-trained models to initialize the classifier. Overall, the new fastText word vectors result in superior text classification performance.
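With the fasttext Python bindings, initializing the supervised classifier from pre-trained vectors is a one-liner; the file names are placeholders, and dim must match the dimensionality of the pre-trained vectors:

```python
import fasttext

clf = fasttext.train_supervised(
    "train.txt",                    # one "__label__X text" line per example
    pretrainedVectors="crawl.vec",  # pre-trained word vectors in .vec format
    dim=300,
)
```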
In this work, we have focused on providing a very high quality set of pre-trained word and phrase vector representations. Our findings indicate that substantial improvements can be achieved by training well-known algorithms on very large text datasets, and that certain tricks provide further gains in quality. Notably, we found it very important to de-duplicate sentences in large corpora such as Common Crawl before training the models. Next, we used an algorithm for building phrases in a pre-processing step. Finally, adding the position-dependent weights and the subword features to the cbow model architecture gave us the final boost in accuracy. The models described in this paper are freely available to researchers and engineers at the fastText webpage, and we hope that they will be useful in various projects that use textual data.
We thank Marco Baroni and German Kruszewski for useful discussions and suggestions, Adam Fisch for help with the experiments on the SQuAD dataset, and Qun Luo for suggesting the use of the position-dependent weighting on the word2vec discussion forum.
7 Bibliographical References
- [Baroni and Lenci2010] Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.
- [Bojanowski et al.2017] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- [Chen et al.2017] Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017). Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.
- [Collobert et al.2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
- [Conneau et al.2017] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
- [Deerwester et al.1990] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391.
- [Han et al.2013] Han, L., Kashyap, A. L., Finin, T., Mayfield, J., and Weese, J. (2013). UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, June.
- [Joulin et al.2016] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- [Levy et al.2015] Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
- [Li1992] Li, W. (1992). Random texts exhibit zipf’s-law-like word frequency distribution. IEEE Transactions on information theory, 38(6):1842–1845.
- [Luong et al.2013] Luong, T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In CoNLL, pages 104–113.
- [Mikolov et al.2013a] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- [Mikolov et al.2013b] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- [Mnih and Kavukcuoglu2013] Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems, pages 2265–2273.
- [Pennington et al.2014] Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- [Rajpurkar et al.2016] Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- [Schütze1993] Schütze, H. (1993). Word space. In Advances in neural information processing systems, pages 895–902.
- [Weinberger et al.2009] Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. (2009). Feature hashing for large scale multitask learning. In ICML.