1 Weight Initialization
Pre-training has long been recognized as useful for training Neural Networks. The large number of local optima, and the combinatorial number of equally optimal solutions, mean that the initial weights have a large effect on the final answer. To illustrate the combinatorics involved, consider the following simple Neural Network with just one optimal solution. Fully Connected layers mean that there can be different combinations for the same values of weights, which means that there could be several solutions which yield the same globally optimal value. The situation can easily get more complicated for local optima, where Neural Networks quite often end up stuck.
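The symmetry argument can be checked numerically. The sketch below is only an illustration and is not taken from the original work: the network shape (one tanh hidden layer and a linear output), the weights and the input are all made-up values. It reorders the hidden units of a small fully connected network in every possible way and evaluates the output each time.

```python
import itertools
import math

def forward(x, w_hidden, w_out):
    """One hidden layer of tanh units followed by a single linear output unit."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w_hidden]
    return sum(wo * h for wo, h in zip(w_out, hidden))

def permuted_outputs(x, w_hidden, w_out):
    """Network output for every reordering of the hidden units."""
    n = len(w_hidden)
    outputs = []
    for perm in itertools.permutations(range(n)):
        # Reorder the incoming and outgoing weights of the hidden units together.
        outputs.append(forward(x, [w_hidden[i] for i in perm], [w_out[i] for i in perm]))
    return outputs

# Hypothetical toy values, chosen only for illustration.
x = [0.5, -1.0]
w_hidden = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]
w_out = [1.0, -2.0, 0.5]
outputs = permuted_outputs(x, w_hidden, w_out)
```

With three hidden units there are 3! = 6 reorderings, and all of them compute the same function (up to floating-point rounding), so a single optimum in function space corresponds to several equivalent points in weight space.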
Work on Unsupervised Pre-training showed the immense importance of Weight Initialization. That work uses an unsupervised objective and claims that the weight initialization determines which optimum the model reaches, as illustrated by the following figure. Starting at two different initial points will yield different minima.
Transfer Learning is yet another way to utilize weight initialization. The concept of Transfer Learning has been examined in prior work, which points out two key constraints: which weights to transfer (relating to the co-adaptability of the layers), and the negative effect on the optimization of the higher-level layers (how to build higher-level features using lower-level features of a non-target task). Because we use a single-layer Neural Network (Word2vec), both of these issues are alleviated.
Our work is similar to Transfer Learning in many aspects, but because the neural networks being used are shallow, it is closer in spirit to Weight Initialization than to reuse of weights. Nevertheless, we show in the experiments section that our method can give better performance than training without initialization.
The final motivation to use weight initialization is to reduce the training time. We hypothesize that if the initialized weights are already good enough for most aspects, the number of epochs taken to fine-tune them will be much smaller than what would be required to train them from scratch.
1.2 Using WordNet for initialization
The importance of weight initialization also makes it crucial to ensure the weights initialized are useful for a large range of tasks. The Word2vec model (detailed in the next section) works with context words. The following characteristics were identified as important to ensure good initialization.
Small Corpus so that training time is reduced
Large vocabulary size to ensure sufficient coverage
Representative Context words even with small corpus
A dictionary seemed like a good option, as it has all of the above characteristics. WordNet in particular is arranged in a hierarchy, and we expected that the context words and examples used in its definitions would also encode this hierarchy in some way. Other successes of algorithms which use WordNet glosses motivated us to use this corpus.
The Word2vec work introduces models to learn word embeddings from a corpus. Specifically, it introduces the CBOW (Continuous Bag of Words) model, which predicts a word given its context words. It also introduces the Skipgram architecture, which predicts the context words given the current word and weighs nearby context words more than distant ones.
As stated in the earlier section, a dictionary-type corpus was deemed fit for a good initialization, which motivated us to use the WordNet glosses. The model was trained on the word definitions from WordNet to learn the initial word weights. We tried the learning algorithm on different variants of the WordNet gloss corpus, the details of which are as follows:
The corpus was created by appending to each word its definition. For example, “enamel any smooth glossy coating that resembles ceramic glaze” is a part of the corpus, and this sentence was formed by concatenating “enamel” and its gloss “any smooth glossy coating that resembles ceramic glaze”. Let us call this wordnetOnce.
Another corpus was created by inserting the word before every word of the gloss definition. For example, “enamel any enamel smooth enamel glossy enamel coating enamel that enamel resembles enamel ceramic enamel glaze” is a part of the corpus, and this sentence was formed by inserting “enamel” throughout its gloss definition, which is “any smooth glossy coating that resembles ceramic glaze”. Let us call this wordnetMultiple.
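The two corpus variants can be sketched as follows. This is a minimal illustration that reproduces the “enamel” examples given above; the function names are hypothetical, and whitespace tokenization of the gloss is an assumption.

```python
def wordnet_once(word, gloss):
    """One corpus line for wordnetOnce: the word followed by its gloss."""
    return f"{word} {gloss}"

def wordnet_multiple(word, gloss):
    """One corpus line for wordnetMultiple: the word placed before every gloss token."""
    tokens = []
    for token in gloss.split():  # assumption: the gloss is split on whitespace
        tokens.extend([word, token])
    return " ".join(tokens)

gloss = "any smooth glossy coating that resembles ceramic glaze"
once_line = wordnet_once("enamel", gloss)
multiple_line = wordnet_multiple("enamel", gloss)
```

In wordnetMultiple the headword appears inside every context window of its gloss, whereas in wordnetOnce it only co-occurs with the first few gloss words for small window sizes.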
The model was trained as CBOW, and it turned out that the wordnetOnce corpus performed better. The similarity between “banana” and “fruit” was reported as 0.442 by wordnetOnce and 0.253 by wordnetMultiple. Thus we have used the wordnetOnce corpus for the rest of the experiments.
|window size -2||window size -8|
We also tested whether CBOW or Skipgram performs better. We found (see subsequent sections) that the two models perform very similarly, and hence we chose to run the experiments using the CBOW model.
The WordSim-353 dataset contains English word pairs along with human-assigned similarity judgements. During evaluation, we calculate the similarity between the 353 word pairs in the WordSim-353 dataset and compute the correlation between the similarity values given in the dataset and the values given by the models.
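The first half of this evaluation, scoring each word pair with the model, might look like the sketch below. It assumes that model similarity is the cosine between the two word vectors and that pairs with an out-of-vocabulary word are skipped; neither detail is stated explicitly for this step, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def score_pairs(vectors, pairs):
    """Return parallel lists of human scores and model similarities.

    `vectors` maps a word to its embedding; `pairs` holds
    (word1, word2, human_score) tuples. Pairs containing a word without
    a vector are skipped (an assumption, not taken from the source).
    """
    human, model = [], []
    for word1, word2, human_score in pairs:
        if word1 in vectors and word2 in vectors:
            human.append(human_score)
            model.append(cosine(vectors[word1], vectors[word2]))
    return human, model
```

The two returned lists are what a rank correlation measure would then compare.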
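We use the Spearman correlation metric as opposed to Pearson. Spearman correlation assesses how well the relationship between two variables can be described by a monotonic function, since it is computed on the ranks of the values rather than on the values themselves.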
Spearman: Spearman is a rank correlation measure which is used to measure the degree of association between two variables. The following formula is used to calculate the Spearman rank correlation:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the ranks of corresponding variables and $n$ is the number of observations.
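As a worked illustration of the rank-difference definition, the sketch below computes the Spearman correlation as 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the per-observation rank difference and n the number of observations. It assumes there are no tied values, since this closed form is exact only in that case; the helper names are hypothetical.

```python
def ranks(values):
    """Rank of each value (1 = smallest). Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

For example, `spearman([1, 2, 3], [1, 3, 2])` has rank differences 0, -1 and 1, so the sum of squares is 2 and the result is 1 - 12/24 = 0.5. When ties are present, a library routine that averages tied ranks would be the safer choice.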
3.2 Word Analogy
One of the goals was to test the performance of the obtained vectors on the analogy task. Prior work states that the vector representations learnt by the word2vec models capture complex relations like word analogies as well. An analogy task would be of the type “What is the word that is similar to small in the same sense as biggest is similar to big?”
The test was run on the family test set. Suppose the test case is boy:girl :: brother:sister. We then compute a query vector from the vectors of the other three words and search for the 10 closest (via cosine similarity) words to it. We say that the model performed correctly on this test case if girl was among the most similar words, and that it performed incorrectly otherwise.
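Finally, we report an accuracy measure over the test cases, where the accuracy is
$$\text{accuracy} = \frac{\text{number of correct test cases}}{\text{total number of test cases}}$$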
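The analogy evaluation can be sketched as below. Because the exact query-vector expression is not recoverable from the text, this sketch assumes the common vector-offset form v(b) - v(a) + v(c) for a test case a:b :: c:expected, and it excludes the three query words from the candidates; both choices, and the function names, are assumptions rather than the authors' confirmed procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def closest_words(vectors, query, exclude, k=10):
    """The k vocabulary words whose vectors are most cosine-similar to `query`."""
    scored = [(cosine(vec, query), word)
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

def analogy_accuracy(vectors, cases, k=10):
    """Fraction of (a, b, c, expected) cases with `expected` among the k closest words."""
    correct = 0
    for a, b, c, expected in cases:
        # Assumed offset form: v(b) - v(a) + v(c).
        query = [vb - va + vc
                 for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
        if expected in closest_words(vectors, query, exclude={a, b, c}, k=k):
            correct += 1
    return correct / len(cases)
```

A case counts as correct when the expected word appears anywhere in the top k, so the reported accuracy is a top-k accuracy rather than an exact-match one.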
Table 2 lists a few sets of words of the form word1:word2 :: word3:word4 from the test set.
|word1||word2||word3||10 closest words predicted||word4||Remark|
4.1 Corpus Details
B - This is subset ‘B’ of the British National Corpus
AB - This is the concatenation of ‘A’ and ‘B’ subsets.
ABC - This is the concatenation of ‘A’, ‘B’ and ‘C’ subsets
Partial Wikipedia / enWiki - This dataset comprises the first billion characters from Wikipedia. This amounts to less than 10% of the information available on Wikipedia. The dataset is publicly available.
4.2 Skipgram vs CBOW
We ran the experiments on the AB corpus once using the skipgram model and then using the CBOW model, both with a window size of 8. The observations are plotted below:
We conclude that the observations are very close, and there is no significant difference in performance. Hence we chose to run the further experiments on the CBOW model.
4.3 Performance of pretrained vectors compared to non-pretrained vectors given equivalent training
Four corpora, referred to as B, AB, ABC and Partial Wikipedia, were used to run these experiments. Vector embeddings were learnt by the CBOW word2vec model with a window size of 8.
Let us call the word vectors learnt on the wordnetOnce corpus wordnetVectors. The experiments were run once with the word vectors initialized to wordnetVectors, which we call the Pretrained setup, and once without any extra initialization, which we refer to as the Without Pretraining setup. The following observations show how the vectors performed on correlation scores with WordSim.
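The difference between the two setups lies only in how the embedding table is filled before training starts. The sketch below is a hedged illustration: the function name is hypothetical, and the treatment of words that have no WordNet vector (a small uniform random vector, in the style of the reference word2vec implementation) is an assumption that the text does not specify.

```python
import random

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Build the initial embedding table for a training corpus.

    Words found in `pretrained` start from their pretrained vector; all
    other words get a small random vector (assumption: uniform in
    [-0.5/dim, 0.5/dim]).
    """
    rng = random.Random(seed)
    embeddings = {}
    for word in vocab:
        if word in pretrained:
            if len(pretrained[word]) != dim:
                raise ValueError(f"dimension mismatch for {word!r}")
            embeddings[word] = list(pretrained[word])
        else:
            embeddings[word] = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    return embeddings
```

Calling it with an empty `pretrained` dictionary corresponds to the Without Pretraining setup, while passing the wordnetVectors corresponds to the Pretrained setup.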
From Figure 11 we can observe that, after a sufficient amount of training, the experiments in which the vectors were pretrained give better correlation scores with WordSim than the ones without any particular word vector initialization. The effect is more clearly visible when the word vectors are trained for a larger number of epochs. The Partial Wikipedia corpus, when trained for 40 epochs, gave a correlation score of 0.6598 when the word vectors were initialized with vectors pretrained on the wordnetOnce corpus, compared to a correlation score of 0.5759 when no vector initialization was done.
Given the above evidence, we conclude that vectors pretrained with the wordnetOnce corpus give better similarity scores than vectors learnt without any pretraining.
4.4 Training time needed to reach a given correlation score, compared across corpora
The aim of this experiment was to find out the effect of pretraining on the training time. The goal was to find the number of epochs after which the pretrained vectors reach a correlation score with WordSim greater than the one obtained by training the vectors without initialization for 20 epochs. The observations are listed in Table 4.
|Corpus||Size||Correlation score at 20 epochs (vectors without pretraining)||# epochs for pretrained vectors|
From Table 4 it is quite evident that pretraining also helps reduce the training time needed to achieve a particular correlation score.
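The quantity compared in this experiment can be expressed as a small helper. This is a sketch with a hypothetical function name: it takes the per-epoch correlation scores of the pretrained run and the score reached by the non-pretrained run after 20 epochs, and returns the first epoch at which the pretrained run is strictly better.

```python
def epochs_to_beat(pretrained_scores, baseline_score):
    """First epoch (1-based) at which the pretrained score exceeds the baseline.

    `pretrained_scores[i]` is the correlation score after epoch i + 1.
    Returns None if the baseline is never exceeded.
    """
    for epoch, score in enumerate(pretrained_scores, start=1):
        if score > baseline_score:
            return epoch
    return None
```

A return value below 20 means the pretrained vectors needed fewer epochs than the baseline to reach the same quality.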
4.5 Variation of correlation score for a given training time, varying size of corpus
To understand the effect of pretraining across corpora of different sizes, we split the Partial Wikipedia (enwiki) corpus into parts of 239MB, 477MB, 716MB and 954MB (full corpus) and learnt the vector representations both with and without pretraining. The training algorithm was run for 20 epochs and the observations are reported in Table 5.
|Corpus||Size||Correlation Score Without Pretraining||Correlation Score With Pretraining|
One of the observations we can draw is that after 20 epochs all the above corpora perform better when they are pretrained than when no pretraining is done. Another interesting observation is that there is no significant difference in correlation score across different sizes, owing to the similar nature of the corpora. Thus for better vector representations we do not necessarily have to train with full data.
4.6 Performance on analogy task
For the experiments conducted, evaluation was also done on the accuracy score on the analogy task as described in the earlier sections. The results are plotted in Figure 16.
On this made-up Word Analogy task as well, we can see that pre-training seems to give a better analogy score than training without pre-training. On the Partial Wikipedia corpus, however, we do not see a better score. These experiments are not as decisive as the previous ones, but we can still claim that pre-training gives an improvement on smaller corpus sizes. The Partial Wikipedia corpus is slightly bigger than the other corpora considered.
We thus conclude that in the best case pre-training helps in Analogy tasks for small corpora, and in the worst case it does not degrade the performance. Note that Word Analogy is almost never the main property we look for in Word Vectors. Correlation score or some extrinsic measure is more reliable because of the nuances present in Word Analogy.
5 Domain Transfer
Since WordNet is a general corpus, it may not be able to model domain specific vectors well. It is well known that “meaning” changes with the domain. Domain adaptation aims to minimize the computation so as to have vectors which are suited to multiple domains. Experiments were conducted to check the transferability of the WordNet vectors to a relatively niche domain. More specifically, the sci.med category from 20 Newsgroups  was used to see how well WordNet vectors can adapt to the Medical Domain.
Experiments are conducted with and without pre-trained vectors, and the evaluation is fourfold:
Correlation scores on WordSim
Word Analogy score
Similarity of words from medical domain
Similarity of general words
The motivation for the first three parts is clear and the fourth evaluation is to check if training on another corpus degrades the previously encoded useful information. We definitely expect some decrease in the same, but the disparity between the values is what we want to check.
The following are the set of words that are used for the medical domain. The words were chosen such that each word has at least occurrences in sci.med corpus, and a few of the pairs had sufficiently different meaning in the medical domain as compared to colloquial language.
Similarly, the following were the general words used.
The words from Table 6 are expected to have a high similarity when trained only on the sci.med corpus because of high occurrence and less polysemy in domain specific corpus. The following are the similarity scores and their comparisons. The training was done for epochs.
|Word 1||Word 2||Sci.med||Pretrained||WordNet|
The conclusions that can be drawn from the above table are as follows. For word pairs on which WordNet already has a high similarity, the pre-trained vectors maintain that similarity. For word pairs like “family, planning” and “children, vaccine”, which have a low similarity in WordNet and a higher similarity in sci.med, the similarity of the pre-trained vectors increases.
The results in table 9 are for the general words. As expected, the pretrained vectors perform much better.
|Word 1||Word 2||Sci.med||Pretrained||WordNet|
From the results in table 9, it can be seen that the pre-trained vectors are able to capture the nuances of the medical domain while largely maintaining the information they obtained when trained only on WordNet. Though there is some loss in similarity with respect to the vectors trained on just WordNet, the loss is very small in all the cases. We can conclude that WordNet vectors can be adapted well to more niche domains because the model naturally augments the knowledge it has already acquired.
The words we have chosen to measure similarity are such that we want their similarity to be as high as possible. To represent the results pictorially, the following procedure was followed. For both sets of words above, we measure the similarity of all the word pairs after each epoch and then take an average. Ideally, a weighted average should be taken, but since we created our own set, weighing the word pairs based only on our prior experience may bias the results. We therefore stuck to a uniform weight for each word pair. Figure 17 plots the results for the words in table 7.
As expected, the average score stays almost the same for the pre-trained vectors because they have already been trained to do well on this data. Without pre-training, the sci.med corpus alone is not able to find the correlation. We perform the same experiment with the domain-specific words and plot it in Figure 18.
Figure 18 shows that the score for Pretraining remains almost constant, not changing much. The score for without Pre-training is that high initially because of the fact that all the similarities start from
. Though the similarity score remained constant we could see that the scores are changing for many values. So we decided to measure the variance in scores to check if there are any changes.
The results in Figure 19 are quite revealing. Though the sum of the scores remains constant, the similarity is being redistributed among the words. This is because some word pairs in the WordNet corpus have a very high similarity compared to the domain corpus, while others have a lower one. After seeing the new corpus, the model tries to even out the scores because it has “realised” (through data) that the words are all somewhat equally similar, which is pretty amusing!
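The per-epoch summary behind these plots, a uniform average of the pair similarities together with their variance, can be sketched as follows. The function name is hypothetical, and the use of the population variance is an assumption, since the text does not say which variance estimator was used.

```python
from statistics import mean, pvariance

def summarize_epochs(similarities_by_epoch):
    """Uniform mean and population variance of pair similarities for each epoch.

    `similarities_by_epoch[e]` is the list of similarity scores of all
    word pairs after epoch e + 1.
    """
    return [(mean(scores), pvariance(scores)) for scores in similarities_by_epoch]
```

Two epochs can share the same mean while differing in variance, which is the situation described above: the average stays flat while the individual pair similarities even out.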
We observed the effect of weight initialization on the word vectors and on their correlation with the WordSim-353 similarity scores. We conclude that when the weights are initialized with word embeddings learnt over a dictionary-like corpus, the pretrained vectors perform better at the word similarity task than the non-pretrained vectors. To reach an equivalent correlation score, the pretrained vectors need less training time. In one of the experiments we also showed that we do not need the full corpus: a good representative subset of a corpus gives similar performance, and a smaller corpus implies less training time.
- Torrey, Lisa, and Jude Shavlik. "Transfer learning." Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques 1 (2009): 242.
- Yosinski, Jason, et al. "How transferable are features in deep neural networks?" Advances in Neural Information Processing Systems. 2014.
- Miller, George A., et al. "Introduction to WordNet: An on-line lexical database." International Journal of Lexicography 3.4 (1990): 235-244.
- Banerjee, Satanjeev, and Ted Pedersen. "An adapted Lesk algorithm for word sense disambiguation using WordNet." International Conference on Intelligent Text Processing and Computational Linguistics. Springer, Berlin, Heidelberg, 2002.
- https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
- Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality."
- http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
- Agirre, Eneko, et al. "A study on similarity and relatedness using distributional and wordnet-based approaches." Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space."
- http://www.natcorp.ox.ac.uk/
- http://mattmahoney.net/dc/enwik9.zip