Distributed word representations make it efficient to compute word similarity and word relations (e.g., mean squared distance, cosine similarity), and they have been used as input to neural networks. Various algorithms for generating distributed word representations have been proposed, but most are based on the basic ideas of CBOW (Continuous Bag-of-Words) and skip-gram [Mikolov et al.2013]. Both CBOW and skip-gram are unsupervised algorithms that learn word vectors from patterns of word order: they maximize the probability of a center word given its neighbor words, or of the neighbor words given a center word. Due to this nature, distributed word representations are weak at capturing semantic and relational meanings of words that cannot be inferred from word order [Lenci2018]. To compensate for this weakness, research on retrofitting with external resources has emerged [Faruqui et al.2014, Mrkšić et al.2016, Speer, Chin, and Havasi2017, Vulić, Mrkšić, and Korhonen2017, Camacho-Collados, Pilehvar, and Navigli2015]. Although there are many word embedding algorithms and pretrained word vectors, the benefit of retrofitting is that it can reflect additional resources in the word vectors without re-training on all the data. Another strong point is that retrofitting can be applied to any kind of pretrained word vectors because it is a post-processing method: it injects the information of external resources by modifying word vector values. Finally, retrofitting can modify word vectors to be specialized for a specific task. For example, when retrofitting is applied to sentiment analysis in the movie domain, it aggregates the less relevant word vectors of movie titles, characters, and other entities so that the sentiment analysis model can depend more on sentiment words.
The first successful approach is Faruqui et al.'s retrofitting [Faruqui et al.2014], which modifies word vectors by weighted averaging of the word vectors with semantic lexicons. In that work, they extracted synonym pairs from PPDB [Ganitkevitch, Van Durme, and Callison-Burch2013], WordNet [Miller1995], and FrameNet [Baker, Fillmore, and Lowe1998], and applied them to retrofitting. Retrofitting dramatically improves word similarity between synonyms, and the result not only corresponds to human intuition about words but also performs better on document classification tasks in comparison to the original word embeddings [Kiela, Hill, and Clark2015]. After that, Mrkšić et al. proposed counter-fitting [Mrkšić et al.2016], which uses synonym pairs to collect word vectors and antonym pairs to push word vectors apart from one another; counter-fitting showed good performance at specialization. Next, ATTRACT-REPEL [Mrkšić et al.2017] suggested a method for injecting linguistic constraints into word vectors by learning from a defined cost function with mono- and cross-lingual synonym and antonym constraints. Explicit Retrofitting [Glavaš and Vulić2018] directly learns mapping functions for linguistic constraints with a deep neural network architecture and retrofits the word vectors. Previous research focused on explicit retrofitting, using manually defined or learned functions to make synonyms close and antonyms distant. As a result, these approaches were strongly dependent on external resources and pretrained word vectors; that is, they were good at specialization but bad at generalization. Furthermore, we believe that making synonyms close together is reasonable even though a synonym may carry a different nuance in some contexts, but pushing antonyms far apart needs further investigation. For example, love and hate are grouped as antonyms, but they must share the meaning of 'emotion' in their representations.
Lastly, the usefulness of word vector specialization should be investigated further. Previous works showed that specialized word vectors improve the performance on domain-specific downstream tasks, but they did not report the effect of specialized word vectors on conventional NLP tasks such as text classification.
Jo and Choi presented extrofitting [Jo and Choi2018], a method to enrich not only the word representations but also their vector space using semantic lexicons. The method implicitly retrofits word vectors by expanding and reducing their dimensions, without an explicit retrofitting function. While adjusting the dimension of the vector space, the algorithm strengthens the meaning of each word, making synonyms close together and non-synonyms far from each other, and finally projects a new vector space in accordance with the distribution of the word vectors. Extrofitting generates generalized word vectors without using antonyms. In this paper, we propose deep extrofitting: in-depth stacking of extrofitting, a method to be used for both word vector specialization and word vector generalization. We first describe the background of retrofitting and extrofitting, two different methods to retrofit word vectors. In Section 3, we define our method in detail. Next, we introduce the experiment data, including pretrained word vectors, semantic lexicons, and word similarity datasets. After that, we show results on word similarity tasks, as well as further analysis of the effect of deep extrofitting with respect to word vector specialization and generalization. Finally, we show the performance on text classification tasks with post-processed word representations as a downstream task.
Retrofitting [Faruqui et al.2014] is a post-processing method that enriches word vectors using synonyms in semantic lexicons. The algorithm learns the word embedding matrix Q with the objective function:

\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big]

where an original word vector is \hat{q}_i, its synonym vector is q_j, the inferred word vector is q_i, and E denotes the set of synonym pairs in the semantic lexicon. The hyperparameters \alpha and \beta control the relative strengths of the associations.
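The objective above is minimized by an iterative update that moves each word vector toward a weighted average of its original vector and its synonyms' current vectors. A minimal sketch in Python, assuming uniform \alpha and \beta for simplicity (the function and variable names are ours, not from the original implementation):

```python
import numpy as np

def retrofit(W, synonym_pairs, alpha=1.0, beta=1.0, n_iters=10):
    """Iteratively retrofit word vectors toward their synonyms.

    W: (n_words, dim) array of original vectors (q_hat).
    synonym_pairs: list of (i, j) index pairs from the lexicon (E).
    """
    Q = W.copy()
    # adjacency list of synonym indices for each word
    neighbors = {i: [] for i in range(len(W))}
    for i, j in synonym_pairs:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(n_iters):
        for i, nbrs in neighbors.items():
            if not nbrs:
                continue  # words without synonyms keep their original vectors
            # closed-form coordinate update of the objective:
            # q_i = (alpha * q_hat_i + beta * sum of synonym vectors) / (alpha + beta * |nbrs|)
            Q[i] = (alpha * W[i] + beta * Q[nbrs].sum(axis=0)) / (alpha + beta * len(nbrs))
    return Q
```

Words that never appear in a synonym pair are left untouched, which is why retrofitting only specializes the vocabulary covered by the lexicon.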
Extrofitting [Jo and Choi2018] follows 3 steps: (i) expanding the word vector with enrichment, (ii) transferring semantic knowledge, and (iii) enriching its vector space. Step 1 simply adds a dimension to the original vectors, filled with the mean value of all the elements in the word vector. Step 2 sets the values of a word pair within the added dimension to their mean value if the words are a synonym pair. With these two steps, we can keep both the word dimension and the characteristics (e.g., meaning, semantics) of the word vectors. Step 3 uses Linear Discriminant Analysis (LDA) [Welling2005] to find a new vector space that clusters the synonyms and differentiates the word vectors, using a between-class scatter matrix and a within-class scatter matrix, and reduces the added dimension. We describe extrofitting briefly in Section 3.
3 Deep Extrofitting
Extrofitting first expands the word embedding matrix W:

w'_i = [w_i ; \mu_i], \quad \mu_i = \frac{1}{D} \sum_{d=1}^{D} w_{id}, \quad \mu_i \leftarrow \frac{\mu_i + \mu_j}{2} \ \text{if} \ (i, j) \in L

where W \in \mathbb{R}^{|V| \times D} is the word embedding table, and \mu_i is the mean value of the elements in word vector w_i. L denotes the semantic lexicon, and (i, j) \in L denotes synonym pairs. Next, we define Trans as calculating the transform matrix, given a word embedding matrix:
Trans(W') = \arg\max_{U} \frac{|U^{\top} S_b U|}{|U^{\top} S_w U|}, \quad S_b = \sum_{c} N_c (\mu_c - \mu)(\mu_c - \mu)^{\top}, \quad S_w = \sum_{c} \sum_{i \in c} (w'_i - \mu_c)(w'_i - \mu_c)^{\top}

where w'_i is a word vector and c is a class, which is the index of a synonym pair in the lexicon. The overall average of the word vectors is \mu, and the class average in class c is denoted by \mu_c. This formula finds a transform matrix U which minimizes the variance within the same class and maximizes the variance between different classes; each class is defined as the index of a synonym pair. Simple extrofitting is then formulated as follows:

Extro(W) = W' \cdot Trans(W')
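One extrofitting pass can be sketched in Python, assuming Step 3 may be implemented with an off-the-shelf LDA (the original work cites LDA [Welling2005]; scikit-learn's LinearDiscriminantAnalysis is our stand-in, and the function name and grouping scheme are ours):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def extrofit(W, synonym_groups, out_dim=None):
    """One extrofitting pass (sketch): expand, transfer, reduce with LDA.

    W: (n_words, dim) embedding matrix.
    synonym_groups: list of index lists; each list is one synonym class.
    """
    n, d = W.shape
    out_dim = out_dim or d
    # Step 1: append one dimension holding each word vector's mean value
    expanded = np.hstack([W, W.mean(axis=1, keepdims=True)])
    # Step 2: synonyms share the mean of their appended values and a class id
    labels = np.arange(n)  # by default, every word is its own class
    for c, group in enumerate(synonym_groups):
        expanded[group, -1] = expanded[group, -1].mean()
        labels[group] = n + c
    # Step 3: LDA finds a projection that clusters synonym classes together
    # while separating different classes, reducing back to out_dim dimensions
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    return lda.fit_transform(expanded, labels)
```

Note that LDA constrains the output dimension to at most min(n_classes - 1, n_features), so in practice the lexicon must supply enough synonym classes to project back to the original dimension.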
Building on extrofitting, we define three variations of deep extrofitting: Stacked Extrofitting (Extro), RExtrofitting (RExtro), and ERetrofitting (ERetro).
We first stack extrofitting, keeping the original dimension of the word vectors. We want to see whether our method can be specialized/overfitted to the semantic lexicons like retrofitting. Stacked extrofitting (Extro) is formulated as follows:

Extro^{(k)}(W) = Extro(Extro^{(k-1)}(W)), \quad Extro^{(0)}(W) = W
Next, we do not keep the original dimension. That is, we skip Step 1 (expanding the word vector with enrichment) and Step 2 (transferring semantic knowledge), which originally preserve the dimension of the word vectors. Thereby we can focus on the effect of Step 3, enriching the vector space by reducing its dimension. Stacked extrofitting without keeping the dimension is formulated as follows:

\widetilde{Extro}(W) = W \cdot Trans(W), \quad \widetilde{Extro}^{(k)}(W) = \widetilde{Extro}(\widetilde{Extro}^{(k-1)}(W))

so the dimension of the word vectors is reduced at every iteration.
Extrofitting with Retrofitting
Retrofitting can overfit to the semantic lexicons, whereas extrofitting results in generalized word vectors [Jo and Choi2018]. We therefore expect the two methods to complement each other. Denoting retrofitting as Retro, we first extrofit the word vectors and then retrofit the extrofitted word vectors:

RExtro^{(m)}(W) = Retro^{m}(Extro(W))

We can utilize them in the reversed way, extrofitting the retrofitted word vectors:

ERetro^{(n)}(W) = Extro^{n}(Retro(W))

We also can use them one by one, alternating the two methods:

Stepwise\,RExtro^{(k)}(W) = (Extro \circ Retro)^{k}(W), \qquad Stepwise\,ERetro^{(k)}(W) = (Retro \circ Extro)^{k}(W)
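These compositions can be written generically. In the sketch below, retro and extro are placeholders for any retrofitting and extrofitting implementations; the toy stand-ins used in the usage example are not the real algorithms:

```python
import numpy as np

def rextro(W, retro, extro, n_retro=1):
    """RExtro: retrofit extrofitted word vectors n_retro times."""
    W = extro(W)
    for _ in range(n_retro):
        W = retro(W)
    return W

def eretro(W, retro, extro, n_extro=1):
    """ERetro: extrofit retrofitted word vectors n_extro times."""
    W = retro(W)
    for _ in range(n_extro):
        W = extro(W)
    return W

def stepwise(W, retro, extro, n=1, retro_first=True):
    """Stepwise RExtro (retrofitting first) / Stepwise ERetro (extrofitting first)."""
    for _ in range(n):
        W = extro(retro(W)) if retro_first else retro(extro(W))
    return W
```

For example, with toy stand-ins `retro = lambda W: 0.5 * W` and `extro = lambda W: W + 1.0`, `stepwise(W, retro, extro, n=1)` computes `extro(retro(W))`.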
4 Experiment Data
Pretrained Word Vectors
Pretrained word vectors consist of words represented as n-dimensional float vectors. Each vector is obtained from training data through unsupervised algorithms. The main pretrained word vectors we use are GloVe [Pennington, Socher, and Manning2014]. The algorithm learns word vectors by making the dot products of word vectors equal to the logarithm of the words' probability of co-occurrence. We use glove.42B.300d trained on Common Crawl data, which contains 1,917,493 unique words as 300-dimensional vectors. Even though many word embedding algorithms and pretrained word vectors have been suggested since GloVe, GloVe has still been used as a strong baseline on word similarity tasks [Cer et al.2017, Camacho-Collados et al.2017].
Semantic Lexicon
We use WordNet [Miller1995], which consists of approximately 150,000 words and 115,000 synset pairs. We borrow Faruqui et al.'s WordNet lexicon, comprised of synonyms, hypernyms, and hyponyms. WordNet overlaps with GloVe in 70,411 words, which is 3.67% of the words in GloVe. Faruqui et al. reported that their method performed best when paired with WordNet. Extrofitting [Jo and Choi2018] also worked well with WordNet.
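Faruqui et al.'s released lexicon files typically store one entry per line: a head word followed by its related words. Assuming that format (an assumption worth verifying against the actual files), the synonym pairs can be read as follows; the function name and the optional vocabulary filter are ours:

```python
def load_lexicon(path, vocab=None):
    """Parse a semantic lexicon file into (head, synonym) pairs.

    Assumes one entry per line: 'word synonym1 synonym2 ...'.
    If vocab is given, only pairs fully covered by the vocabulary are kept.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.lower().split()
            if len(words) < 2:
                continue  # a word with no listed relations contributes no pairs
            head, synonyms = words[0], words[1:]
            for syn in synonyms:
                if vocab is None or (head in vocab and syn in vocab):
                    pairs.append((head, syn))
    return pairs
```

Filtering against the embedding vocabulary up front avoids carrying pairs that cannot affect any vector.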
Word Similarity Datasets
Word similarity datasets consist of word pairs with human-rated similarity scores between the words. We use 4 different datasets: MEN-3k (MEN) [Bruni et al.2014], WordSim-353 (WS) [Finkelstein et al.2001], SimLex-999 (SL) [Hill, Reichart, and Korhonen2015], and SimVerb-3500 (SV) [Gerz et al.2016]. We experiment with our methods on as many datasets as possible to see the effect of generalization and specialization, while also avoiding overfitting to a specific dataset. When we use MEN-3k, WordSim-353, and SimVerb-3500, we combine the train (or dev) set and the test set solely for evaluation purposes. Other datasets are left for future work because they are either too small or contain numerous out-of-vocabulary words.
5 Experiments on Word Similarity Task
The word similarity task calculates Spearman's correlation [Daniel1990] between the human-rated similarity scores and the cosine similarities of the word vector pairs. We first apply stacked extrofitting to GloVe, keeping its original dimension (see Section 3), and present the result in Table 1. We observe that stacked extrofitting improves the performance on word similarity tasks for a few iterations, but the performance gap becomes smaller as we stack more extrofitting.
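The evaluation itself can be sketched as follows. Here spearman is a minimal tie-ignoring rank correlation (scipy.stats.spearmanr would be the usual choice), and the function names are ours:

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation via Pearson on ranks (ties ignored for brevity)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def evaluate_similarity(pairs, human_scores, vectors):
    """Correlate human similarity ratings with cosine similarities.

    pairs: list of (word1, word2); vectors: dict word -> np.ndarray.
    """
    cos, gold = [], []
    for (w1, w2), score in zip(pairs, human_scores):
        if w1 not in vectors or w2 not in vectors:
            continue  # skip out-of-vocabulary pairs
        v1, v2 = vectors[w1], vectors[w2]
        cos.append(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))
        gold.append(score)
    return spearman(cos, gold)
```

Skipping out-of-vocabulary pairs mirrors the standard protocol for these benchmarks, though reported coverage should accompany the score.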
Next, we perform stacked extrofitting without keeping the dimension. The result is presented in Table 2, and the performance is not much different from Table 1. For ease of comparison with other methods in 300-dimensional vectors, we use stacked extrofitting with the dimension kept as our default method.
[Table: Cue Word | Post-Processed | Top-10 Nearest Words (Cosine Similarity Score)]
We plot the top-100 nearest words using t-SNE [Maaten and Hinton2008], as shown in Figure 1. We can see that stacking more extrofitting makes the word vectors utilize a broader vector space in general while still collecting synonyms relatively close together. As a result, we lose word similarity score (see Table 3) but gain overall performance improvement, as shown in Table 1. We interpret the results as generalization in that the word vectors obtain a generalized representation by moving away from each other.
To improve the performance on word similarity tasks, we combine retrofitting with extrofitting. We first apply retrofitting to extrofitted word vectors, denoted as RExtro. The result is presented in Table 4 and shows that applying retrofitting more than once does not significantly improve the performance. Second, we apply extrofitting to retrofitted word vectors, denoted as ERetro. The result is presented in Table 5 and shows that stacking more extrofitting improves the performance, although the performance gap becomes smaller as we add more extrofitting. We also observe that word vectors retrofitted only once show the best performance.
Next, we stack retrofitting and extrofitting one by one. When we apply retrofitting first, we denote it as Stepwise RExtro; when we apply extrofitting first, we denote it as Stepwise ERetro. We report the results in Table 6 and Table 7, respectively. Stepwise RExtro and Stepwise ERetro perform well at specialization on the SimLex-999 and SimVerb-3500 datasets. Since the word pairs in these datasets overlap 100% with the synonym pairs in WordNet, applying retrofitting improves the similarity on these datasets while concurrently degrading the performance on the other datasets. Note that the performance of retrofitting converges within a few iterations (see Table 3), but we can specialize beyond retrofitting with the help of extrofitting, which finds a new enriched vector space at every iteration. On the other hand, the weakness of extrofitting, namely that it cannot strongly collect word vectors, is compensated by retrofitting.

We compare our best results with previous retrofitting models in Table 8. We define the average similarity score on MEN-3k and WordSim-353 as the generalization score (GenScore) because MEN-3k and WordSim-353 include words that are not part of the WordNet lexicon. The average score on the other datasets, SimLex-999 and SimVerb-3500, is defined as the specialization score (SpecScore) because their words fully overlap with the WordNet lexicon. Our methods, Stacked Extro, Stepwise RExtro, and Stepwise ERetro, significantly outperform state-of-the-art retrofitting models despite using only synonyms. Furthermore, if we combine extrofitting with retrofitting in a greedy way, the results could be improved further. Although ATTRACT-REPEL [Mrkšić et al.2017]
is better than our methods on SimLex-999, we specialize the word vectors with only synonyms, thus using fewer external resources than ATTRACT-REPEL. Second, ATTRACT-REPEL cannot use GloVe without preprocessing because of the limit on memory allocation for a single variable in TensorFlow. This constraint is critical because well-known pretrained word vectors have large vocabularies and data sizes, surpassing the memory limit of TensorFlow. Lastly, Glavaš and Vulić showed that ATTRACT-REPEL specializes only the words seen in the semantic lexicons, whereas our methods include the strong point of ER-CNT [Glavaš and Vulić2018], which enriches word vectors not included in the semantic lexicons.
We plot the 100 nearest word vectors in Figure 2 and list the top-10 nearest words with cosine similarity in Table 9. Both Stepwise RExtro and Stepwise ERetro collect the words included in the semantic lexicon more strongly than retrofitted GloVe while dispersing the word vectors not included in the semantic lexicon. The difference in concentration is due to the substantial effect of the first retrofitting or extrofitting layer. For words not included in the semantic lexicon, only extrofitting is effective, resulting in a loss of cosine similarity.
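The nearest-word listings can be produced with a simple cosine ranking; a sketch (the function name is ours):

```python
import numpy as np

def top_k_nearest(cue, vocab, W, k=10):
    """Return the top-k nearest words to a cue word by cosine similarity."""
    index = {w: i for i, w in enumerate(vocab)}
    # normalize rows so that dot products equal cosine similarities
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = Wn @ Wn[index[cue]]
    order = np.argsort(-sims)
    # drop the cue word itself, keep (word, score) pairs
    return [(vocab[i], float(sims[i])) for i in order if i != index[cue]][:k]
```

The same ranking, applied before and after post-processing, shows how strongly each method pulls lexicon words toward the cue.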
| Method | MEN | WS | GenScore | SL | SV | SpecScore |
|---|---|---|---|---|---|---|
| Retrofitting (Syn) | 0.7305 | 0.5332 | 0.6319 | 0.4644 | 0.3017 | 0.3831 |
| Extro (Syn) | 0.8259 | 0.6784 | 0.7522 | 0.4996 | 0.3730 | 0.4363 |
| Stepwise RExtro (Syn) | 0.6555 | 0.5275 | 0.5915 | 0.6089 | 0.6020 | 0.6055 |
| Stepwise ERetro (Syn) | 0.6965 | 0.5426 | 0.6200 | 0.6165 | 0.5978 | 0.6072 |
| Stepwise ERetro (Syn) | 0.6697 | 0.5275 | 0.5986 | 0.6055 | 0.6028 | 0.6042 |
[Table: Cue Word | Post-Processed | Top-10 Nearest Words (Cosine Similarity Score)]
6 Experiments on Text Classification Task
| Setting | DBpedia | Yahoo!Answers (super) | Yahoo!Answers (sub) | Yelp |
|---|---|---|---|---|
| NOT Trainable Word Vectors | | | | |
| (1) Without Pretrained | 0.9740 | 0.6282 | 0.4095 | 0.6719 |
| (7) Stepwise RExtro (GloVe) | 0.9835 | 0.7038 | 0.4866 | 0.6778 |
| (8) Stepwise ERetro (GloVe) | 0.9853 | 0.7083 | 0.4917 | 0.6774 |
| Trainable Word Vectors | | | | |
| (1) Without Pretrained | 0.9822 | 0.6687 | 0.4392 | 0.6796 |
| (7) Stepwise RExtro (GloVe) | 0.9853 | 0.7180 | 0.4990 | 0.6826 |
| (8) Stepwise ERetro (GloVe) | 0.9861 | 0.7232 | 0.5070 | 0.6839 |
We examine the effect of word vector specialization and generalization on text classification tasks. We use 2 topic classification datasets, DBpedia ontology [Lehmann et al.2015] and Yahoo!Answers, and 1 sentiment classification dataset, Yelp reviews. We utilize the Yahoo!Answers dataset for 2 different tasks: classifying super (upper-level) categories and classifying sub (lower-level) categories.
Since we believe that keeping the sequence of words is important, we build a simple TextCNN [Kim2014] rather than a classifier based on Bag-of-Words (BoW), as Faruqui et al. used, because BoW neglects the word sequence by averaging all the word vectors. Our TextCNN uses the first 100 words as input, and the classifier consists of 2 convolutional layers with channel sizes of 32 and 16, respectively. We adopt a multi-channel approach, implementing multiple kernel sizes: 2, 3, 4, and 5. The learned kernels go through a ReLU activation function and are max-pooled, and we concatenate the outputs after every max-pooling layer. We set the size of the word embeddings to 300 and the learning rate to 0.001, using early stopping to prevent overfitting.
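A single-layer numpy sketch of the multi-kernel convolution, ReLU, and max-over-time pooling described above; the second convolutional layer and the dense classifier are omitted, and the filter count of 8 per kernel size is an arbitrary choice of ours:

```python
import numpy as np

def textcnn_features(token_ids, emb, kernels, kernel_sizes=(2, 3, 4, 5)):
    """Multi-kernel TextCNN feature extraction (forward-pass sketch).

    token_ids: sequence of word indices; emb: (vocab, dim) embedding table;
    kernels: dict kernel_size -> (n_filters, kernel_size, dim) weight array.
    """
    x = emb[token_ids]                                    # (seq_len, emb_dim)
    features = []
    for k in kernel_sizes:
        W = kernels[k]
        # slide a window of width k over the token sequence
        windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
        conv = np.einsum("nke,fke->nf", windows, W)       # 1-D convolution
        conv = np.maximum(conv, 0.0)                      # ReLU
        features.append(conv.max(axis=0))                 # max-pool over time
    return np.concatenate(features)                       # concatenate channels
```

Because each kernel size contributes one pooled vector, the concatenated feature length is n_filters times the number of kernel sizes, regardless of sequence length.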
To observe the effect of word vector specialization and generalization, we experiment in 2 different settings: fixed word vectors or trainable word vectors. With fixed word vectors, we can evaluate the usefulness of the word vectors themselves. With trainable word vectors, we can see the improvement in classification performance when initialized with the enriched word vectors. In each setting, we measure the performance of the classifier (1) without any pretrained word vectors, (2) with GloVe, (3) GloVe with retrofitting, (4) GloVe with counter-fitting, (5) GloVe with extrofitting, (6) GloVe with stacked extrofitting, (7) GloVe with Stepwise RExtro, and (8) GloVe with Stepwise ERetro. The results are presented in Table 10. We can see that the generalized word vectors, (5) and (6), perform better than the specialized word vectors, (7) and (8), on topic classification tasks, both when the word vectors are trainable and when they are frozen. However, the performance gap is small in sentiment classification (Yelp reviews). This might be because WordNet contains numerous emotional words. The result implies that although generalized word vectors perform better in general, specialized word vectors can be useful for domain-specific tasks if we have sufficiently specialized semantic lexicons.
7 Conclusion
We develop retrofitting models that generate specialized and generalized word vectors using in-depth expansional retrofitting, called deep extrofitting. We show that stacked extrofitting improves the performance on word similarity tasks overall, and that the combination of extrofitting with retrofitting performs well at word vector specialization. These models outperform previous state-of-the-art models, specializing on SimLex-999 and SimVerb-3500 with only synonyms and generalizing on MEN-3k and WordSim-353. We also show not only that extrofitting helps retrofitting find a new vector space for specialization, preventing retrofitting from converging within a few iterations, but also that retrofitting helps extrofitting collect word vectors strongly. Our method depends only on the distribution of the pretrained word vectors and synonym pairs, and needs no antonym pairs, hyperparameters, or explicit mapping functions. As future work, we will extend our method to utilize antonym pairs as well.
- [Baker, Fillmore, and Lowe1998] Baker, C. F.; Fillmore, C. J.; and Lowe, J. B. 1998. The berkeley framenet project. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, 86–90. Association for Computational Linguistics.
- [Bruni et al.2014] Bruni, E.; Tran, N. K.; and Baroni, M. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research 49:1–47.
- [Camacho-Collados et al.2017] Camacho-Collados, J.; Pilehvar, M. T.; Collier, N.; and Navigli, R. 2017. Semeval-2017 task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 15–26.
- [Camacho-Collados, Pilehvar, and Navigli2015] Camacho-Collados, J.; Pilehvar, M. T.; and Navigli, R. 2015. Nasari: a novel approach to a semantically-aware representation of items. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 567–577.
- [Cer et al.2017] Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 1–14.
- [Daniel1990] Daniel, W. W. 1990. Spearman rank correlation coefficient. Applied nonparametric statistics 358–365.
- [Faruqui et al.2014] Faruqui, M.; Dodge, J.; Jauhar, S. K.; Dyer, C.; Hovy, E.; and Smith, N. A. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.
- [Finkelstein et al.2001] Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; and Ruppin, E. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, 406–414. ACM.
- [Ganitkevitch, Van Durme, and Callison-Burch2013] Ganitkevitch, J.; Van Durme, B.; and Callison-Burch, C. 2013. Ppdb: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 758–764.
- [Gerz et al.2016] Gerz, D.; Vulić, I.; Hill, F.; Reichart, R.; and Korhonen, A. 2016. Simverb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2173–2182.
- [Glavaš and Vulić2018] Glavaš, G., and Vulić, I. 2018. Explicit retrofitting of distributional word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 34–45.
- [Hill, Reichart, and Korhonen2015] Hill, F.; Reichart, R.; and Korhonen, A. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665–695.
- [Jo and Choi2018] Jo, H., and Choi, S. J. 2018. Extrofitting: Enriching word representation and its vector space with semantic lexicons. arXiv preprint arXiv:1804.07946.
- [Kiela, Hill, and Clark2015] Kiela, D.; Hill, F.; and Clark, S. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2044–2048.
- [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.
- [Lehmann et al.2015] Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. 2015. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6(2):167–195.
- [Lenci2018] Lenci, A. 2018. Distributional models of word meaning. Annual review of Linguistics 4:151–171.
- [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov):2579–2605.
- [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
- [Miller1995] Miller, G. A. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
- [Mrkšić et al.2016] Mrkšić, N.; Séaghdha, D. O.; Thomson, B.; Gašić, M.; Rojas-Barahona, L.; Su, P.-H.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.
- [Mrkšić et al.2017] Mrkšić, N.; Vulić, I.; Séaghdha, D. Ó.; Leviant, I.; Reichart, R.; Gašić, M.; Korhonen, A.; and Young, S. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association of Computational Linguistics 5(1):309–324.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
- [Speer, Chin, and Havasi2017] Speer, R.; Chin, J.; and Havasi, C. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, 4444–4451.
- [Vulić, Mrkšić, and Korhonen2017] Vulić, I.; Mrkšić, N.; and Korhonen, A. 2017. Cross-lingual induction and transfer of verb classes based on word vector space specialisation. arXiv preprint arXiv:1707.06945.
- [Welling2005] Welling, M. 2005. Fisher linear discriminant analysis. Department of Computer Science, University of Toronto 3(1).