Weight Initialization in Neural Language Models

05/12/2018 · Ameet Deshpande et al.

Semantic Similarity is an important task which finds use in many downstream NLP applications. Though the task is mathematically defined, the essence of semantic similarity is to capture the notions of similarity ingrained in humans. Machines use heuristics to calculate the similarity between words, but these are typically corpus dependent or useful only for specific domains. The difference between Semantic Similarity and Semantic Relatedness motivates the development of new algorithms: for a human, the words car and road are probably as related as car and bus, but this may not be the case for computational methods. Ontological methods are good at encoding Semantic Similarity, and Vector Space models are better at encoding Semantic Relatedness, yet there is a dearth of methods which leverage ontologies to create better vector representations. The aim of this proposal is to explore a hybrid method which combines statistical/vector space methods like Word2Vec with ontological methods like WordNet to leverage the advantages of both.


1 Weight Initialization

1.1 Motivation

Pre-training has long been recognized as useful for training Neural Networks. The large number of local optima, and the combinatorial number of equally optimal solutions, mean that the initial weights have a large effect on the final solution. To illustrate the combinatorics involved, consider the simple Neural Network below with just one optimal function: because the layers are fully connected, permuting the hidden units produces many different weight assignments that compute exactly the same function, so there can be many solutions which yield the same globally optimal value. The situation easily gets more complicated for local optima, at which Neural Networks quite often end up stuck.

[Figure: a simple fully connected network with an input layer, a hidden layer, and an output layer]

Work on unsupervised pre-training [1] showed the immense importance of weight initialization. Their work uses an unsupervised objective, and they claim that the weight initialization determines which optimum the model reaches, as illustrated by the following figure: starting from different initial points yields different minima.

Figure 1: Multiple minima

Transfer Learning [2] is yet another way to utilize weight initialization. The concept has been examined by [3], who point out two key constraints: which weights to transfer (relating to the co-adaptation of layers) and the negative effect on optimization of the higher-level layers (how to build higher-level features on top of lower-level features from a non-target task). The fact that we are using a single-layer neural network (Word2vec) alleviates both these issues.

Our work is similar to Transfer Learning in many respects, but because the neural networks involved are shallow, it is closer in spirit to weight initialization than to reuse of learned weights. Nevertheless, we show in the experiments section that our method gives better performance than training without initialization.

The final motivation for weight initialization is to reduce training time. We hypothesize that if the initialized weights are already good enough for most aspects, the number of epochs needed to fine-tune them will be much smaller than the number required to train them from scratch.

1.2 Using WordNet for initialization

The importance of weight initialization also makes it crucial to ensure that the initialized weights are useful for a wide range of tasks. The Word2vec model (detailed in the next section) works with context words. The following characteristics were identified as important to ensure a good initialization.

  • Small Corpus so that training time is reduced

  • Large vocabulary size to ensure sufficient coverage

  • Representative Context words even with small corpus

A dictionary seemed like a good option with all the above characteristics. WordNet [4] in particular is arranged in a hierarchy, and we expected that the context words and examples used in its glosses would also encode this hierarchy in some way. Other successes [5] of algorithms that use WordNet glosses further motivated us to use this corpus.

2 Method

The work in [7] introduces models to learn word embeddings from a corpus. Specifically, it introduces the CBOW (continuous bag of words) model, in which the model predicts a word given its context words, and the Skip-gram architecture, which predicts the context words given the current word and weighs nearby context words more than distant ones.

As stated in the earlier section, a dictionary-style corpus was deemed fit for a good initialization, and we were motivated to use the WordNet glosses for this purpose. The model was trained on the word definitions from WordNet to learn the initial word weights. We tried the learning algorithm on different variants of the WordNet gloss corpus, described below; a small construction sketch follows the list.

  • The first corpus was created by appending to each word its definition. For example, "enamel any smooth glossy coating that resembles ceramic glaze" is part of the corpus; this sentence was formed by concatenating "enamel" and its gloss "any smooth glossy coating that resembles ceramic glaze". We call this corpus wordnetOnce.

  • Another corpus was created by inserting the word before every token of its gloss. For example, "enamel any enamel smooth enamel glossy enamel coating enamel that enamel resembles enamel ceramic enamel glaze" is part of the corpus; this sentence was formed by interleaving "enamel" with its gloss "any smooth glossy coating that resembles ceramic glaze". We call this corpus wordnetMultiple.
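The paper does not include its corpus-construction code; the following is a minimal sketch, assuming NLTK's WordNet interface, lower-casing, and plain whitespace tokenization (the exact preprocessing and choice of lemmas are our assumptions), of how the two gloss corpora could be built.

```python
# Minimal sketch (assumptions: NLTK's WordNet, whitespace tokenization,
# lower-casing) of how wordnetOnce and wordnetMultiple could be built.
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") beforehand

def build_gloss_corpora():
    wordnet_once, wordnet_multiple = [], []
    for synset in wn.all_synsets():
        gloss_tokens = synset.definition().lower().split()
        for lemma in synset.lemma_names():
            word = lemma.lower().replace("_", " ")
            # wordnetOnce: the word followed by its gloss.
            wordnet_once.append([word] + gloss_tokens)
            # wordnetMultiple: the word inserted before every gloss token.
            interleaved = []
            for token in gloss_tokens:
                interleaved.extend([word, token])
            wordnet_multiple.append(interleaved)
    return wordnet_once, wordnet_multiple
```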

The model was trained with CBOW, and the wordnetOnce corpus turned out to perform better: the similarity between "banana" and "fruit" was reported as 0.442 by wordnetOnce and 0.253 by wordnetMultiple. We therefore use the wordnetOnce corpus for the rest of the experiments.
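As a rough illustration, training and querying could look like the following gensim sketch (API names follow gensim 4.x; hyperparameters other than the CBOW choice are assumptions, not values reported here).

```python
# Sketch: train CBOW on wordnetOnce with gensim and query similarities as in
# Table 1. vector_size, window, min_count and epochs are assumptions; sg=0 selects CBOW.
from gensim.models import Word2Vec

wordnet_once, wordnet_multiple = build_gloss_corpora()  # from the sketch above

model = Word2Vec(sentences=wordnet_once, sg=0, window=8,
                 vector_size=100, min_count=1, epochs=20)

print(model.wv.similarity("banana", "fruit"))   # 0.442 is reported for wordnetOnce
print(model.wv.most_similar("banana", topn=9))  # compare with Table 1
```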

wordnetOnce     wordnetMultiple (window size 2)   wordnetMultiple (window size 8)
Musa            Monstera                          banana
bananas         bananas                           Citron
fruit           Treelike                          Monstera
Phillippine     perfumed                          Glycerine
Hazel           banana                            Tent
Shrubby         Crescent-shaped                   Apricot
Citrus          Marang                            Write
Liquidambar     Anaras                            Corozo
buckthorn       Kernels                           One-seeded
Table 1: Words most similar to "banana"

We also examined whether CBOW or Skip-gram performs better. We found (see the subsequent sections) that the two models' performance is very similar, and hence we chose to run the experiments using the CBOW model.

3 Evaluation

3.1 WordSim

The WordSim-353 [9] dataset contains English word pairs along with human-assigned similarity judgements. During evaluation, we compute the similarity between the 353 word pairs in the WordSim-353 dataset and measure the correlation between the similarity values in the dataset and the values given by the models.

We use the Spearman correlation metric as opposed to Pearson, since Spearman assesses how well the relationship between two variables can be described in terms of their ranks.

Spearman: Spearman is a rank correlation measure used to measure the degree of association between two variables. The Spearman rank correlation is calculated as

    ρ = 1 − (6 Σ dᵢ²) / (n (n² − 1))

where dᵢ is the difference between the ranks of the corresponding observations and n is the number of observations.
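A sketch of this evaluation is given below. It assumes a tab-separated WordSim-353 file with one "word1, word2, human score" triple per line and a gensim model; skipping out-of-vocabulary pairs is an assumption, since the paper does not say how such pairs were handled.

```python
# Sketch of the WordSim-353 evaluation: Spearman correlation between human
# judgements and model cosine similarities. The file format (word1<TAB>word2<TAB>score
# per line) and the skipping of out-of-vocabulary pairs are assumptions.
from scipy.stats import spearmanr

def wordsim_correlation(model, path="wordsim353.tsv"):
    human, predicted = [], []
    with open(path) as f:
        for line in f:
            w1, w2, score = line.strip().split("\t")
            if w1 in model.wv and w2 in model.wv:
                human.append(float(score))
                predicted.append(model.wv.similarity(w1, w2))
    rho, _ = spearmanr(human, predicted)
    return rho
```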

3.2 Word Analogy

One of the goals was to test the obtained vectors’ performance on the analogy task. The work in [10] states that the vector representations learnt by the word2vec models capture complex relations like word analogies as well. An analogy task would be of the type - “What is the word that is similar to small in the same sense as biggest is similar to big?”

The test was run on the family test set from [8]. Suppose the test case is boy:girl :: brother:sister. We compute the vector vec("girl") − vec("boy") + vec("brother") and search for the words closest to it under cosine similarity. We say the model performed correctly on this test case if the expected word, "sister" here, appears among the k most similar words, and incorrectly otherwise.

We then report an accuracy measure over the test cases:

    accuracy = (number of correct test cases) / (total number of test cases)
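A minimal sketch of this check using gensim's most_similar is shown below. Note that most_similar excludes the query words from its results, whereas the closest-word lists in Table 3 include them, so this sketch only approximates the procedure; the cut-off k is also an assumption.

```python
# Sketch of the analogy check for one test case word1:word2 :: word3:word4.
# gensim's most_similar ranks words by cosine similarity to word2 - word1 + word3,
# but it excludes the query words from the results, unlike the lists in Table 3.
def analogy_correct(model, word1, word2, word3, word4, k=10):
    candidates = model.wv.most_similar(positive=[word2, word3],
                                       negative=[word1], topn=k)
    return word4 in {word for word, _ in candidates}

def analogy_accuracy(model, test_cases, k=10):
    # test_cases: iterable of (word1, word2, word3, word4) tuples
    correct = sum(analogy_correct(model, *case, k=k) for case in test_cases)
    return correct / len(test_cases)
```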


Table 2 lists a few sets of words of the form word1:word2 :: word3:word4 from the test set.

word1 word2 word3 word4
boy girl brother sister
brother sister dad mom
father mother king queen
grandfather grandmother grandpa grandma
groom bride prince princess
uncle aunt man woman
son daughter nephew niece
Table 2: Analogy task example set
word1 | word2 | word3 | 10 closest words | expected word4 | Remark
boy | girl | brother | brother, daughter, wife, wife, sister, son, father, mother, lover, nephew | sister | Correct
boy | girl | dad | girl, chipotle, mammal, blue-violet, thyrsus, rosette, hedgerow, volva, lubrica, scherzerianum | mom | Incorrect
Table 3: Classifying the performance of the model over test cases

4 Experiments

4.1 Corpus Details

For the experiments we primarily used the British National Corpus [11] and a partial Wikipedia corpus [12]. In particular, we will refer to the following corpora:

  • B - This is subset ‘B’ of the British National Corpus

  • AB - This is the concatenation of ‘A’ and ‘B’ subsets.

  • ABC - This is the concatenation of ‘A’, ‘B’ and ‘C’ subsets

  • Partial Wikipedia / enWiki - This dataset comprises the first billion characters of Wikipedia, amounting to less than 10% of the text available on Wikipedia; it can be found at [12].

4.2 Skipgram vs CBOW

We ran the experiments on the AB corpus once using the skipgram model and then using the CBOW model, both with a window size of 8. The observations are plotted below:

Figure 3: CBOW
Figure 4: Skipgram
Figure 5: Pretrained
Figure 6: CBOW vs Skipgram

We conclude that the observations are very close, and there is no significant difference in performance. Hence we chose to run the further experiments on the CBOW model.

4.3 Performance of pretrained vectors compared to non-pretrained vectors given equivalent training

Four corpora, referred to as B, AB, ABC and Partial Wikipedia, were used to run these experiments. Vector embeddings were learnt by the CBOW word2vec model with a window size of 8.

Let us call the word vectors learnt on the wordnetOnce corpus wordnetVectors. The experiments were run once with the word vectors initialized to wordnetVectors, which we call the Pretrained setup, and once without any extra initialization, which we refer to as the Without Pretraining setup. The following figures show how the resulting vectors performed in terms of correlation with WordSim.
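A sketch of the Pretrained setup is given below: the model's vocabulary is built on the target corpus and the vectors of all shared words are then overwritten with wordnetVectors before training. API names follow gensim 4.x and the hyperparameters are assumptions; the paper does not spell out the implementation.

```python
# Sketch of the Pretrained setup: build the vocabulary on the target corpus,
# overwrite the randomly initialized vectors of shared words with wordnetVectors,
# then train as usual. API names follow gensim 4.x.
from gensim.models import Word2Vec

def train_pretrained(target_sentences, wordnet_model, epochs=20):
    model = Word2Vec(vector_size=wordnet_model.wv.vector_size,
                     sg=0, window=8, min_count=5)       # min_count is an assumption
    model.build_vocab(target_sentences)
    for word, index in model.wv.key_to_index.items():
        if word in wordnet_model.wv:
            model.wv.vectors[index] = wordnet_model.wv[word]  # copy wordnetVectors
    model.train(target_sentences, total_examples=model.corpus_count, epochs=epochs)
    return model
```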

Figure 7: Corpus B
Figure 8: Corpus AB
Figure 9: Corpus ABC
Figure 10: Corpus Partial Wikipedia
Figure 11: Correlation Score - Pretrained vs Without Pretraining

From Figure 11 we can observe that after a sufficient amount of training, the experiments in which the vectors were pretrained give better correlation scores with WordSim than the ones without any particular word vector initialization. The effect is more clearly visible when the word vectors are trained for a larger number of epochs. The partial Wikipedia corpus, trained for 40 epochs, gave a correlation score of 0.6598 when the word vectors were initialized with vectors pretrained on the wordnetOnce corpus, compared to 0.5759 when no vector initialization was done.

Given the above evidence, we conclude that vectors pretrained on the wordnetOnce corpus give better similarity scores than vectors learnt without any pretraining.

4.4 Training time required to reach a given correlation score across corpora

The aim of this experiment was to find the effect of pretraining on training time. The goal was to find the number of epochs after which the correlation score of the pretrained vectors with WordSim exceeded the score obtained by training the vectors without initialization for 20 epochs. A sketch of this loop is shown below; the observations are listed in Table 4.
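The sketch reuses the wordsim_correlation helper from the earlier WordSim section; training one epoch at a time glosses over gensim's internal learning-rate schedule, so this is an approximation rather than the authors' exact procedure.

```python
# Sketch of the Section 4.4 loop: train the pretrained model one epoch at a time
# and report the first epoch whose WordSim correlation exceeds target_score (the
# 20-epoch score of the non-pretrained model). Epoch-by-epoch train() calls do not
# reproduce gensim's learning-rate decay exactly, so this is an approximation.
def epochs_to_reach(model, sentences, target_score, max_epochs=20):
    for epoch in range(1, max_epochs + 1):
        model.train(sentences, total_examples=model.corpus_count, epochs=1)
        if wordsim_correlation(model) >= target_score:  # helper from the WordSim sketch
            return epoch
    return None
```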

Corpus              Size     Correlation at 20 epochs (no pretraining)   # epochs for pretrained vectors to exceed it
B                   39MB     0.5159                                      9
AB                  118MB    0.5435                                      13
ABC                 217MB    0.5636                                      15
Partial Wikipedia   954MB    0.6026                                      13
Table 4: Performance for desired correlation score

From Table 4 it is quite evident that pretraining helps reduce the training time too, while trying to achieve a particular correlation score.

4.5 Variation of correlation score with corpus size for a given training time

To understand the effect of pretraining across corpora of different sizes, we split the partial Wikipedia (enwiki) corpus into parts of 239MB, 477MB, 716MB and 954MB (the full corpus) and learnt vector representations both with and without pretraining. The training algorithm was run for 20 epochs, and the observations are reported in Table 5.

Corpus Size Correlation Score Without Pretraining Correlation Score With Pretraining
239MB 0.6113 0.6479
477MB 0.6215 0.6565
716MB 0.6186 0.6479
954MB 0.6026 0.6472
Table 5: Effect on varying corpus size

One observation we can draw is that after 20 epochs, all of the above corpora perform better with pretraining than without. Another interesting observation is that there is no significant difference in correlation score across the different sizes, owing to the similar nature of the corpora; thus, for good vector representations we do not necessarily have to train on the full data.

4.6 Performance on analogy task

For the experiments conducted, evaluation was also done using the accuracy score on the analogy task described in the earlier sections. The results are plotted in Figure 16.

Figure 12: Corpus B
Figure 13: Corpus AB
Figure 14: Corpus ABC
Figure 15: Corpus Partial Wikipedia
Figure 16: Accuracy Score on Analogy Task - Pretrained vs Without Pretraining

On this word analogy task as well, pre-training seems to give a better analogy score than no pre-training; on the partial Wikipedia corpus, however, we do not see a better score. These experiments are not as decisive as the previous ones, but we can still claim an improvement on smaller corpus sizes. The partial Wikipedia corpus is slightly bigger than the other corpora considered.

We thus conclude that, in the best case, pre-training helps on analogy tasks for small corpora, and in the worst case it does not degrade performance. Note that word analogy is almost never the main property we look for in word vectors; the correlation score or some extrinsic measure is more reliable because of the nuances present in the word analogy task.

5 Domain Transfer

Since WordNet is a general corpus, it may not model domain-specific vectors well; it is well known that "meaning" changes with the domain. Domain adaptation aims to minimize the computation needed to obtain vectors suited to multiple domains. Experiments were conducted to check the transferability of the WordNet vectors to a relatively niche domain: specifically, the sci.med category from 20 Newsgroups [6] was used to see how well the WordNet vectors adapt to the medical domain.

Experiments are conducted with and without pre-trained vectors, and the evaluation is four-fold:

  1. Correlation scores on WordSim

  2. Word Analogy score

  3. Similarity of words from medical domain

  4. Similarity of general words

The motivation for the first three parts is clear; the fourth checks whether training on another corpus degrades the previously encoded useful information. We certainly expect some decrease, but what we want to measure is how large the disparity is.

The following are the word pairs used for the medical domain. The words were chosen such that each word occurs sufficiently often in the sci.med corpus, and a few of the pairs have a sufficiently different meaning in the medical domain compared to colloquial language.

doctor nurse
doctor syringe
doctor medicine
syringe medicine
hospital nurse
disease medicine
hospital nurse
hospital health
health medicine
hospital problem
treatment cancer
breast cancer
database medical
depression medicine
depression chemical
family planning
children vaccine
Table 6: Medical Words

Similarly, the following were the general words used.

school children
university students
company industry
boy girl
mother father
national international
library books
Table 7: General Words

The word pairs in Table 6 are expected to have high similarity when trained only on the sci.med corpus, because of their high frequency and low polysemy in a domain-specific corpus. The following are the similarity scores and their comparisons; training was done for a fixed number of epochs.

Word 1 Word 2 Sci.med Pretrained WordNet
doctor nurse 0.479 0.614 0.672
doctor syringe 0.180 0.120 0.156
doctor medicine 0.321 0.366 0.432
syringe medicine 0.237 0.208 0.329
hospital nurse 0.281 0.449 0.537
disease medicine 0.255 0.357 0.438
hospital nurse 0.281 0.449 0.537
hospital health 0.390 0.459 0.518
health medicine 0.353 0.415 0.492
hospital problem 0.258 0.274 0.268
treatment cancer 0.564 0.555 0.652
breast cancer 0.647 0.508 0.588
database medical 0.344 0.413 0.485
depression medicine 0.200 0.162 0.227
depression chemical 0.352 0.229 0.256
family planning 0.326 0.256 0.231
children vaccine 0.579 0.233 0.151
Table 8: Comparisons on words from Medical Domain

The conclusions that can be drawn from the above table are: for word pairs on which WordNet already gives a high similarity, the pre-trained vectors maintain that similarity, and for word pairs like "family, planning" and "children, vaccine", which have a low similarity in WordNet and a higher similarity in sci.med, the similarity of the pre-trained vectors increases.

The results in Table 9 are for the general words. As expected, the pretrained vectors perform far better.

Word 1 Word 2 Sci.med Pretrained WordNet
school children 0.034 0.462 0.463
university students 0.275 0.425 0.435
company industry 0.0434 0.524 0.520
boy girl 0.089 0.804 0.863
mother father 0.613 0.829 0.849
national international 0.687 0.650 0.687
library books 0.435 0.566 0.560
Table 9: Comparisons on general words

From the results in Table 9, it can be seen that the pre-trained vectors are able to capture the nuances of the medical domain while largely maintaining the information obtained when trained only on WordNet. Though there is some loss in similarity with respect to the vectors trained on just WordNet, the loss is very small in all cases. We can conclude that WordNet vectors adapt well to more niche domains because the model naturally augments the knowledge it has already acquired.
The word pairs we have chosen are ones whose similarity we want to be as high as possible. To represent the results pictorially, the following procedure was followed: for both sets of word pairs above, we measure the similarity of all the pairs after each epoch and then take an average. Ideally a weighted average would be used, but since we created the word sets ourselves, weighing the pairs based only on our prior experience may bias the results, so we use a uniform weight for each pair. Figure 17 plots the results for the word pairs in Table 7.

Figure 17: General Words

As expected, the average score stays almost the same for the pre-trained vectors, because they have already been trained to do well on this data; without pre-training, the sci.med corpus alone is not able to recover these similarities. We perform the same experiment with the domain-specific words and plot it in Figure 18.

Figure 18: Domain Specific Words

Figure 18 shows that the score with pretraining remains almost constant. The score without pre-training is that high initially only because of the value all the similarities start from. Though the average similarity score remained constant, we could see that the individual scores were changing for many pairs, so we decided to measure the variance of the scores to check for such changes; a sketch of this tracking is shown below.
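The sketch below shows one way the per-epoch tracking could be done (the mean behind Figures 17 and 18 and the variance behind Figure 19); the pair lists and epoch count are placeholders, not values taken from the paper.

```python
# Sketch of the per-epoch tracking behind Figures 17-19: after each epoch, compute
# the cosine similarity of every word pair, then record the (uniformly weighted)
# mean and the variance of those scores.
import numpy as np

def track_pair_scores(model, sentences, pairs, epochs=20):
    means, variances = [], []
    for _ in range(epochs):
        model.train(sentences, total_examples=model.corpus_count, epochs=1)
        scores = [model.wv.similarity(w1, w2)
                  for w1, w2 in pairs
                  if w1 in model.wv and w2 in model.wv]
        means.append(float(np.mean(scores)))
        variances.append(float(np.var(scores)))
    return means, variances
```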

Figure 19: Variance of scores

The results in Figure 19 are quite revealing. Though the sum of the scores remains roughly constant, the similarity is being redistributed among the word pairs: some pairs have a very high similarity under the WordNet corpus compared to the domain corpus, while others have a lower one. After seeing the new corpus, the model evens out the scores because it has "realised" (through the data) that the pairs are all roughly equally similar, which is quite amusing!

6 Conclusion

We studied the effect of weight initialization on word vectors and their correlation with the WordSim-353 similarity scores. We conclude that when the weights are initialized with word embeddings learnt over a dictionary-like corpus, the pretrained vectors perform better on the word similarity task than non-pretrained vectors, and they need less training time to reach an equivalent correlation score. In one of the experiments we also showed that the full corpus is not needed: a representative subset of a corpus gives similar performance, and a smaller corpus implies less training time.

References

  • [1] Erhan, Dumitru, et al. "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research 11 (Feb 2010): 625-660.
  • [2] Torrey, Lisa, and Jude Shavlik. "Transfer learning." Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques 1 (2009): 242.
  • [3] Yosinski, Jason, et al. "How transferable are features in deep neural networks?" Advances in Neural Information Processing Systems. 2014.
  • [4] Miller, George A., et al. "Introduction to WordNet: An on-line lexical database." International Journal of Lexicography 3.4 (1990): 235-244.
  • [5] Banerjee, Satanjeev, and Ted Pedersen. "An adapted Lesk algorithm for word sense disambiguation using WordNet." International Conference on Intelligent Text Processing and Computational Linguistics. Springer, Berlin, Heidelberg, 2002.
  • [6] Twenty Newsgroups dataset: https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
  • [7] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality."
  • [8] http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
  • [9] Agirre, Eneko, et al. "A study on similarity and relatedness using distributional and WordNet-based approaches." Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009.
  • [10] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space."
  • [11] British National Corpus: http://www.natcorp.ox.ac.uk/
  • [12] http://mattmahoney.net/dc/enwik9.zip