Word2Vec is a prominent tool for Natural Language Processing (NLP) tasks. Similar inspiration is found in the distributed embeddings of state-of-the-art (sota) deep neural networks. However, the wrong combination of hyper-parameters can produce poor-quality vectors. The objective of this work is to show that an optimal combination of hyper-parameters exists, and to evaluate various combinations. We compare them with the original model released by Mikolov. Both intrinsic and extrinsic (downstream) evaluations were carried out, including Named Entity Recognition (NER) and Sentiment Analysis (SA). The downstream tasks reveal that the best model is task-specific, that high analogy scores do not necessarily correlate positively with F1 scores, and that the same applies to more data. Increasing vector dimension size beyond a point leads to poor quality or performance. If ethical considerations to save time, energy and the environment are made, then reasonably smaller corpora may do just as well or even better in some cases. Besides, using a small corpus, we obtain better human-assigned WordSim scores, corresponding Spearman correlation and better downstream (NER, SA) performance compared to Mikolov's model, which was trained on a 100 billion-word corpus.
There have been many implementations of the word2vec model in either of the two architectures it provides: continuous skipgram and continuous bag of words (CBoW) (Mikolov et al. (2013a)).
Similar distributed models of word or subword embeddings (or vector representations) find usage in sota, deep neural networks like Bidirectional Encoder Representations from Transformers (BERT) and its successors (Devlin et al. (2018); Liu et al. (2019); Raffel et al. (2019)).
These deep networks generate contextual representations of words after being trained for extended periods on large corpora, unsupervised, using attention mechanisms (Vaswani et al. (2017)).
It has been observed that various hyper-parameter combinations have been used in different research involving word2vec, with the possibility that many of them are sub-optimal (Naili et al. (2017); Wang et al. (2018); Dhingra et al. (2017)). Therefore, the authors seek to address the research question: what is the optimal combination of word2vec hyper-parameters for intrinsic and extrinsic NLP purposes? There is an astronomically high number of possible hyper-parameter combinations for neural networks, even with just a few layers. Hence, the scope of our extensive work over three corpora is on dimension size, training epochs, window size and vocabulary size for the training algorithms (hierarchical softmax and negative sampling) of both skipgram and CBoW. The corpora used for word embeddings are the English Wiki News Abstract by Wikipedia (2019a) of about 15MB, the English Wiki Simple (SW) Articles by Wikipedia (2019b) of about 711MB and the Billion Word (BW) corpus of 3.9GB by Chelba et al. (2013). The corpus used for sentiment analysis is the Internet Movie Database (IMDb) dataset of movie reviews by Maas et al. (2011), while that for NER is the Groningen Meaning Bank (GMB) by Bos et al. (2017), containing 47,959 sentence samples. The IMDb dataset used has a total of 25,000 sentences, with half being positive sentiments and the other half negative. The GMB dataset has 17 labels, with 9 main labels and 2 context tags. It is, however, unbalanced due to the high percentage of tokens with the label 'O'. This skew in the GMB dataset is typical of NER datasets.
The objective of this work is to determine the optimal combinations of word2vec hyper-parameters for intrinsic evaluation (semantic and syntactic analogies) and extrinsic evaluation tasks (Zhang et al. (2019); Wang et al. (2019)), like SA and NER. It is not our objective in this work to record sota results. Some of the main contributions of this research are the empirical establishment of optimal combinations of word2vec hyper-parameters for NLP tasks, discovering the behaviour of the quality of vectors vis-à-vis increasing dimensions, and the confirmation that embeddings are task-specific for downstream tasks. The rest of this paper is organised as follows: the literature review, which briefly surveys distributed representation of words, particularly word2vec; the methodology employed in this research work; the results obtained; and the conclusion.
Breaking away from the non-distributed (high-dimensional, sparse) representations of words, typical of traditional bag-of-words or one-hot-encoding (Turian et al. (2010)), Mikolov et al. (2013a) created word2vec. Word2Vec consists of two shallow neural network architectures: continuous skipgram and CBoW. It uses distributed (low-dimensional, dense) representations of words that group similar words. This new model traded the complexity of deep neural network architectures, proposed by other researchers, for more efficient training over large corpora. Its architectures have two training algorithms: negative sampling and hierarchical softmax (Mikolov et al. (2013b)). The released model was trained on a Google News dataset of 100 billion words. Implementations of the model have been undertaken by researchers in the programming languages Python and C++, though the original was done in C (Řehůřek and Sojka (2010)).
Continuous skipgram predicts (by maximizing classification of) words before and after the center word, for a given range. Since distant words are less connected to a center word in a sentence, less weight is assigned to such distant words in training. CBoW, on the other hand, uses words from the history and future in a sequence, with the objective of correctly classifying the target word in the middle. It works by projecting all history or future words within a chosen window into the same position, averaging their vectors. Hence, the order of words in the history or future does not influence the averaged vector. This is similar to the traditional bag-of-words, which is oblivious of the order of words in its sequence. A log-linear classifier is used in both architectures (Mikolov et al. (2013a)). In further work, they extended the model to be able to do phrase representations and subsample frequent words (Mikolov et al. (2013b)). Being a Neural Network Language Model (NNLM), word2vec assigns probabilities to words in a sequence, like other NNLMs such as feedforward networks or recurrent neural networks (Turian et al. (2010)). Earlier models like Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) exist and effectively achieve low-dimensional vectors by matrix factorization (Deerwester et al. (1990); Levy et al. (2015)).
It’s been shown that word vectors are beneficial for NLP tasks (Turian et al. (2010)), such as sentiment analysis and named entity recognition.
Besides, Mikolov et al. (2013a) showed with vector space algebra that relationships among words can be evaluated, expressing the quality of vectors produced from the model.
The famous semantic example: vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen") can be verified using cosine distance.
Another type of semantic meaning is the relationship between a capital city and its corresponding country.
Syntactic relationship examples include plural verbs and past tense, among others.
A combined set of both syntactic and semantic analogies (totaling over 19,000 questions) is provided as the Google analogy test set by Mikolov et al. (2013a).
WordSimilarity-353 test set is another analysis tool for word vectors (Finkelstein et al. (2002)).
Unlike Google analogy score, which is based on vector space algebra, WordSimilarity is based on human expert-assigned semantic similarity on two sets of English word pairs.
Both tools rank from 0 (totally dissimilar) to 1 (very much similar or exact, in Google analogy case).
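The analogy test described above reduces to vector arithmetic plus cosine similarity. A self-contained sketch with hand-made 2-d vectors (the toy vectors are ours, constructed so the analogy holds exactly; real evaluations use trained embeddings):

```python
import numpy as np

# Toy 2-d embeddings constructed so that king - man + woman = queen.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
    "apple": np.array([5.0, 5.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve a : b :: c : ? by ranking candidates against b - a + c,
    excluding the three query words, as is standard practice."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman"))  # -> queen
```

The Google analogy score is essentially the fraction of such questions answered correctly; WordSim instead compares cosine similarities against human judgments via Spearman correlation.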
A typical artificial neural network (ANN) has many hyper-parameters which may be tuned. Hyper-parameters are values which may be manually adjusted, and include vector dimension size, type of algorithm and learning rate (Levy et al. (2015)). Mikolov et al. (2013a) tried various hyper-parameters with both architectures of their model, ranging from 50 to 1,000 dimensions, 30,000 to 3,000,000 vocabulary sizes and 1 to 3 epochs, among others. In our work, we extended research to 3,000 dimensions. Different observations were noted from the many trials. They observed diminishing returns after a certain point, despite additional dimensions or larger, unstructured training data. However, quality increased when both dimensions and data size were increased together. Although Mikolov et al. (2013b) pointed out that the choice of optimal hyper-parameter configurations depends on the NLP problem at hand, they identified architecture, dimension size, subsampling rate and window size as the most important factors. In addition, it has been observed that variables like the size of datasets improve the quality of word vectors and, potentially, performance on downstream tasks (Adewumi et al. (2019); Mikolov et al. (2013a)).
The models were generated on a shared cluster running Ubuntu 16 with 32 CPUs of 32x Intel Xeon 4110 at 2.1GHz. The Gensim (Řehůřek and Sojka (2010)) Python library implementation of word2vec was used with parallelization to utilize all 32 CPUs. The downstream experiments were run on a Tesla GPU on a shared DGX cluster running Ubuntu 18. The PyTorch deep learning framework was used. Gensim was chosen because of its relative stability, popular support and to minimize the time required in writing and testing a new implementation in Python from scratch. It should be noted, however, that Gensim multithreading for 30 and 40 epochs seemed unstable and crashed, preventing any related experiments.
To form the vocabulary, words occurring fewer than 5 times in the corpora were dropped, stop words were removed using the Natural Language Toolkit (NLTK) (Loper and Bird (2002)) and data pre-processing was carried out. Table 1 describes most hyper-parameters explored for each dataset. In all, 80 runs (of about 160 minutes) were conducted for the 15MB Wiki Abstract dataset, with 80 serialized models totaling 15.136GB, while 80 runs (for over 320 hours) were conducted for the 711MB SW dataset, with 80 serialized models totaling over 145GB. Experiments for all combinations for 300 dimensions were conducted on the 3.9GB training set of the BW corpus, plus additional runs for other dimensions for the window 8 + skipgram + hierarchical softmax combination, to verify the trend of quality of word vectors as dimensions are increased.
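The vocabulary-building steps above can be sketched as follows. The paper uses NLTK's stop-word list; a tiny hand-written list stands in for it here, and `min_count` mirrors the 5-occurrence threshold:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is"}  # stand-in for NLTK's list

def build_vocab(tokenized_sentences, min_count=5):
    """Count tokens, then drop stop words and words below min_count."""
    counts = Counter(
        tok for sent in tokenized_sentences for tok in sent
    )
    return {
        tok for tok, n in counts.items()
        if n >= min_count and tok not in STOP_WORDS
    }

# "cat" appears 6 times and survives; "sat"/"ran" appear 3 times and are
# dropped; "the" is a stop word.
sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]] * 3
print(build_vocab(sentences))  # -> {'cat'}
```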
| Hyper-parameter | Values |
| Dimension size | 300, 1200, 1800, 2400, 3000 |
| Window size (w) | 4, 8 |
| Architecture | Skipgram (s1), CBoW (s0) |
| Algorithm | H. Softmax (h1), N. Sampling (h0) |
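The two window sizes, two architectures and two training algorithms in table 1 give the eight combinations (w4/w8 × s0/s1 × h0/h1) referred to throughout the results. A short sketch enumerating them as run labels:

```python
from itertools import product

windows = [4, 8]
architectures = {"s0": "CBoW", "s1": "skipgram"}
algorithms = {"h0": "negative sampling", "h1": "hierarchical softmax"}

# Cartesian product of the three hyper-parameter axes -> 2*2*2 = 8 runs.
combos = [
    (f"w{w}{s}{h}", architectures[s], algorithms[h])
    for w, s, h in product(windows, architectures, algorithms)
]

for label, arch, algo in combos:
    print(label, "->", arch, "+", algo)
```

Each label (e.g. w8s1h1) matches the shorthand used in the result tables and figures.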
Google (semantic and syntactic) analogy tests and WordSimilarity-353 (with Spearman correlation) by Finkelstein et al. (2002) were chosen for intrinsic evaluations. They measure the quality of word vectors. The analogy scores are averages of both semantic and syntactic tests. NER and SA were chosen for extrinsic evaluations. The GMB dataset for NER was trained in an LSTM network, which had an embedding layer for input. The network diagram is shown in fig. 2. The IMDb dataset for SA was trained in a BiLSTM network, which also used an embedding layer for input. Its network diagram is given in fig. 2. It includes an additional hidden linear layer. Hyper-parameter details of the two networks for the downstream tasks are given in table 2. The metrics for extrinsic evaluation include F1, precision, recall and accuracy scores. In both tasks, the default PyTorch embedding was tested before being replaced by the pre-trained embeddings released by Mikolov et al. (2013a) and ours. In each case, the dataset was shuffled before training and split in the ratio 70:15:15 for training, validation (dev) and test sets. A batch size of 64 was used. For each task, experiments for each embedding were conducted four times, and an average value was calculated and reported in the next section.
| Task | Archi | Epochs | Hidden Dim | LR | Loss | Optimizer |
| NER | LSTM | 40 | 128 | 0.01 | Cross Entropy | Adam |
| SA | BiLSTM | 20 | 128 * 2 | 0.0001 | BCELoss | Adam |
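A minimal sketch of the SA network described above: an embedding layer (where pre-trained word2vec vectors would be loaded) feeding a BiLSTM, an extra hidden linear layer, then a sigmoid output for BCELoss. Hidden size and batch size follow table 2; the vocabulary size and sequence length are placeholders of ours, and this is a sketch rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class SentimentBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        # Pre-trained vectors would be loaded here, e.g. via
        # nn.Embedding.from_pretrained(weight_matrix).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.hidden = nn.Linear(hidden_dim * 2, hidden_dim)  # 128 * 2, per table 2
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        x = self.embedding(token_ids)           # (batch, seq, embed_dim)
        _, (h_n, _) = self.bilstm(x)            # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=1)  # concat both directions
        h = torch.relu(self.hidden(h))
        return torch.sigmoid(self.out(h)).squeeze(1)  # probability per sample

model = SentimentBiLSTM(vocab_size=10_000)
batch = torch.randint(0, 10_000, (64, 50))  # batch size 64, as in the paper
probs = model(batch)
print(probs.shape)  # torch.Size([64])
```

Training would pair these probabilities with `nn.BCELoss` and the Adam optimizer at LR 0.0001, per table 2; the NER network is analogous but unidirectional, with a per-token cross-entropy loss.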
Table 3 summarizes key results from the intrinsic evaluations for 300 dimensions.
Table 4 reveals the training time (in hours) and average embedding loading time (in seconds) representative of the various models used.
Tables 5 and 6 summarize key results for the extrinsic evaluations.
Figures 3, 4, 5, 6 and 7 present line graphs of the eight combinations for different dimension sizes for Simple Wiki, the trend of the Simple Wiki and Billion Word corpora over several dimension sizes, analogy score comparison for models across datasets, NER mean F1 scores on the GMB dataset and SA mean F1 scores on the IMDb dataset, respectively.
The combination of skipgram using hierarchical softmax and a window size of 8 for 300 dimensions outperformed the others in analogy scores for the Wiki Abstract. However, its results, because of the tiny file size, are so poor that they are not worth reporting here.
Hence, we’ll focus on results from the Simple Wiki and Billion Word corpora.
The best combination changes when corpus size increases, as can be seen in table 3.
In terms of analogy score, for 10 epochs, w8s0h0 performs best while w8s1h0 performs best in terms of WordSim and corresponding Spearman correlation.
Meanwhile, increasing the corpus size to BW, w4s1h0 performs best in terms of analogy score while w8s1h0 maintains its position as the best in terms of WordSim and Spearman correlation.
Beyond quality metrics, it can be observed from table 4 that the comparative ratio of values between the models is not commensurate with the intrinsic or extrinsic results, especially when we consider the amount of time and energy spent, since more training time results in more energy consumption (Adewumi and Liwicki (2019)).
| Google News - Mikolov (s1h0) |
| Analogy: 0.740 | WordSim: 0.624 | Spearman: 0.659 |

| Model | Training (hours) | Loading Time (s) |
Information on the length of training time for the released Mikolov model is not readily available.
However, it is interesting to note that their presumed best model, which was the one released, is also s1h0.
Its analogy score, which we tested and report, is confirmed in their paper.
It beats our best models in only analogy score (even for Simple Wiki), performing worse in others.
This is in spite of it using a much bigger corpus, with a vocabulary size of 3,000,000 and 100 billion words, while Simple Wiki had a vocabulary size of 367,811 and is 711MB.
It is very likely our analogy scores will improve when we use a much larger corpus, as can be observed from table 3, which involves just one billion words.
Although the two best combinations in analogy (w8s0h0 & w4s0h0) for SW, as shown in fig. 3, decreased only slightly compared to others with increasing dimensions, the increased training time and much larger serialized model size render any possible minimal score advantage at higher dimensions undesirable. As can be observed in fig. 4, from 100 dimensions, scores improve but start to drop after over 300 dimensions for SW and after over 400 dimensions for BW. More becomes worse! This trend is true for all combinations for all tests. Polynomial interpolation may be used to determine the optimal dimension in both corpora. Our models are available for confirmation and the source code is available on GitHub: https://github.com/tosingithub/sdesk
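The polynomial-interpolation idea can be sketched with numpy: fit a low-degree polynomial to (dimension, score) pairs and read off the vertex. The score values below are made-up placeholders shaped like the reported trend, not the paper's measurements:

```python
import numpy as np

# Hypothetical (dimension, analogy-score) pairs: scores rise up to
# roughly 300-400 dimensions, then fall, as in the paper's figures.
dims = np.array([100, 200, 300, 400, 500], dtype=float)
scores = np.array([0.20, 0.30, 0.35, 0.34, 0.30])

a, b, c = np.polyfit(dims, scores, deg=2)  # quadratic least-squares fit
peak_dim = -b / (2 * a)                    # vertex of the fitted parabola

print(round(peak_dim))  # falls between 300 and 400 for this data
```

A negative leading coefficient confirms the concave "more becomes worse" shape; the vertex gives the estimated optimal dimension.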
With regard to NER, most pre-trained embeddings outperformed the default PyTorch embedding, with our BW w4s1h0 model (which is best in BW analogy score) performing best in F1 score, closely followed by the Mikolov et al. (2013a) model. On the other hand, with regard to SA, the PyTorch embedding outperformed the pre-trained embeddings, but was closely followed by our SW w8s0h0 model (which also had the best SW analogy score). The Mikolov et al. (2013a) embedding performed second worst of all, despite originating from a very large corpus. The combinations w8s0h0 & w4s0h0 of SW performed reasonably well in both extrinsic tasks, just as the default PyTorch embedding did.
| Metric | Default | Mikolov | w8 s0 h0 | w8 s1 h0 | BW w4 s1 h0 |
| | Dev, Test | Dev, Test | Dev, Test | Dev, Test | Dev, Test |
| F1 | 0.661, 0.661 | 0.679, 0.676 | 0.668, 0.669 | 0.583, 0.676 | 0.679, 0.677 |
| Precision | 0.609, 0.608 | 0.646, 0.642 | 0.636, 0.637 | 0.553, 0.642 | 0.644, 0.642 |
| Recall | 0.723, 0.724 | 0.716, 0.714 | 0.704, 0.706 | 0.618, 0.715 | 0.717, 0.717 |
| Accuracy | 0.939, 0.939 | 0.944, 0.944 | 0.942, 0.942 | 0.913, 0.943 | 0.944, 0.944 |
| Metric | Default | Mikolov | w8 s0 h0 | w8 s1 h0 | BW w4 s1 h0 |
| | Dev, Test | Dev, Test | Dev, Test | Dev, Test | Dev, Test |
| F1 | 0.810, 0.805 | 0.384, 0.386 | 0.798, 0.799 | 0.548, 0.553 | 0.498, 0.390 |
| Precision | 0.805, 0.795 | 0.600, 0.603 | 0.814, 0.811 | 0.510, 0.524 | 0.535, 0.533 |
| Recall | 0.818, 0.816 | 0.303, 0.303 | 0.788, 0.792 | 0.717, 0.723 | 0.592, 0.386 |
| Accuracy | 0.807, 0.804 | 0.549, 0.550 | 0.801, 0.802 | 0.519, 0.522 | 0.519, 0.517 |
This work analyses, empirically, optimal combinations of hyper-parameters for embeddings, specifically for word2vec.
It further shows that for downstream tasks, like NER and SA, there’s no silver bullet!
However, some combinations show strong performance across tasks.
Performance of embeddings is task-specific and high analogy scores do not necessarily correlate positively with performance on downstream tasks.
This point on correlation is somewhat similar to results by Chiu et al. (2016) and Wang et al. (2019).
It was discovered that increasing dimension size degrades performance after a point.
If strong considerations of saving time, energy and the environment are made, then reasonably smaller corpora may suffice or even be better in some cases.
The on-going drive by many researchers to use ever-growing data to train deep neural networks can benefit from the findings of this work.
Indeed, hyper-parameter choices are very important in neural network systems (Levy et al. (2015)).
Future work may investigate the performance of other architectures of word or sub-word embeddings, the performance and comparison of embeddings applied to languages other than English, and how embeddings perform in other downstream tasks. In addition, since the actual reason for the change in best model as corpus size increases is not clear, this is also suitable for further research.
The work on this project is partially funded by Vinnova under project number 2019-02996, "Språkmodeller för svenska myndigheter" (Language models for Swedish authorities).
Conversational systems in machine learning from the point of view of the philosophy of science—using alime chat and related studies. Philosophies, 4(3):41, 2019.
Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.