Deep Extrofitting: Specialization and Generalization of Expansional Retrofitting Word Vectors using Semantic Lexicons

08/22/2018 ∙ by Hwiyeol Jo, et al. ∙ 0

Retrofitting techniques, which inject external resources into word representations, have compensated for the weakness of distributed representations in capturing semantic and relational knowledge between words. Our earlier work on implicitly retrofitting word vectors by an expansional technique (extrofitting) showed that the method outperforms retrofitting on word similarity tasks and also generalizes well. In this paper, we propose deep extrofitting, which stacks extrofitting in depth. Furthermore, inspired by learning theory, we combine retrofitting with extrofitting, optimizing the trade-off between specialization and generalization. Experimenting with GloVe, we show that deep extrofitting not only outperforms previous methods on most word similarity tasks but also requires only synonyms. We also analyze the effect of deeply extrofitted word vectors on text classification tasks, where they improve performance.




1 Introduction

The distributed word representation makes it efficient to compute the similarity of words and word relations (e.g., mean square distance, cosine similarity), and has been used as the input of neural networks. Various algorithms to generate distributed word representations have been proposed, but most are based on the basic ideas of CBOW (Continuous Bag-of-Words) and skip-gram [Mikolov et al.2013]. Both CBOW and skip-gram are unsupervised algorithms that learn word vectors from patterns of word order: they maximize the probability of a center word given its neighbor words, or of the neighbor words given the center word. Due to this nature, the distributed word representation is weak at representing semantic and relational meanings of words that cannot be captured from word order [Lenci2018]. To compensate for this weakness, research on retrofitting has started to use external resources [Faruqui et al.2014, Mrkšić et al.2016, Speer, Chin, and Havasi2017, Vulić, Mrkšić, and Korhonen2017, Camacho-Collados, Pilehvar, and Navigli2015]. Although there are many word embedding algorithms and pretrained word vectors, the benefit of retrofitting is that it can reflect additional resources in the word vectors without re-training on all the data. Another strong point is that retrofitting can be applied to any kind of pretrained word vectors because it is a post-processing method, injecting the information of external resources by modifying word vector values. Finally, retrofitting can modify word vectors to be specialized for a specific task. For example, when retrofitting is applied to sentiment analysis in the movie domain, it gathers the less relevant word vectors of movie titles, characters, and other entities so that the sentiment analysis model can depend more on sentiment words.

The first successful approach is Faruqui et al.'s retrofitting [Faruqui et al.2014], which modifies word vectors by weighted averaging them with the word vectors of their semantic-lexicon neighbors. In that work, the authors extracted synonym pairs from PPDB [Ganitkevitch, Van Durme, and Callison-Burch2013], WordNet [Miller1995], and FrameNet [Baker, Fillmore, and Lowe1998], and applied them to retrofitting. Retrofitting dramatically improves word similarity between synonyms, and the result not only corresponds to human intuition about words but also performs better on document classification tasks compared with the original word embeddings [Kiela, Hill, and Clark2015]. After that, counter-fitting [Mrkšić et al.2016] was proposed, which uses synonym pairs to collect word vectors and antonym pairs to push word vectors away from one another; it showed good performance at specialization. Next, ATTRACT-REPEL [Mrkšić et al.2017] suggested injecting linguistic constraints into word vectors by learning from a defined cost function with mono- and cross-lingual synonym and antonym constraints. Explicit Retrofitting [Glavaš and Vulić2018] directly learns mapping functions for linguistic constraints with a deep neural network architecture and retrofits the word vectors. This previous research focused on explicit retrofitting, using manually defined or learned functions to make synonyms close and antonyms distant. As a result, those approaches are strongly dependent on the external resources and pretrained word vectors; that is, they are good at specialization but bad at generalization. Furthermore, we believe that making synonyms close together is reasonable even though synonyms can carry different nuances in some contexts, but antonyms require further investigation before being pushed apart. For example, love and hate are grouped as antonyms, but they must share the meaning of 'emotion' in their representations. Lastly, the usefulness of word vector specialization should be further investigated. Previous work showed that specialized word vectors improve the performance of domain-specific downstream tasks but did not report their effect on conventional NLP tasks such as text classification.

Jo and Choi [Jo and Choi2018] presented extrofitting, a method to enrich not only the word representation but also its vector space using semantic lexicons. The method implicitly retrofits word vectors by expanding and reducing their dimensions, without an explicit retrofitting function. While adjusting the dimension of the vector space, the algorithm strengthens the meaning of each word, making synonyms close together and non-synonyms far from each other, and finally projects a new vector space in accordance with the distribution of the word vectors. Extrofitting generates generalized word vectors without using antonyms. In this paper, we propose deep extrofitting: in-depth stacking of extrofitting, a method usable for both word vector specialization and word vector generalization. We first describe the backgrounds of retrofitting and extrofitting, two different methods to retrofit word vectors. In Section 3, we define our method in detail. Next, we introduce the experiment data, including pretrained word vectors, semantic lexicons, and word similarity datasets. After that, we show results on word similarity tasks as well as further analysis of the effect of deep extrofitting with respect to word vector specialization and generalization. Finally, we show the performance on text classification tasks with post-processed word representations as a downstream task.

2 Backgrounds


Retrofitting [Faruqui et al.2014] is a post-processing method to enrich word vectors using synonyms in semantic lexicons. The algorithm learns the word embedding matrix Q = (q_1, ..., q_n) with the objective function:

\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \, \| q_i - \hat{q}_i \|^2 + \sum_{(i,j) \in E} \beta_{ij} \, \| q_i - q_j \|^2 \Big]

where an original word vector is \hat{q}_i, its synonym vector is q_j, the inferred word vector is q_i, and E denotes the synonym pairs in the semantic lexicon. The hyperparameters \alpha and \beta control the relative strengths of the associations.
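The objective above is typically minimized by the iterative update Faruqui et al. describe: each inferred vector becomes a weighted average of its original vector and its neighbors' current vectors. A minimal sketch, assuming \alpha_i = 1 and \beta_{ij} = 1/degree(i) as in the original paper:

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    """Iteratively pull each word vector toward its synonyms' vectors.

    vectors: dict word -> np.ndarray (original embeddings, left unchanged)
    lexicon: dict word -> list of synonym words
    """
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, syns in lexicon.items():
            syns = [s for s in syns if s in new]
            if word not in new or not syns:
                continue
            beta = 1.0 / len(syns)  # assumption: beta_ij = 1 / degree(i)
            num = alpha * vectors[word] + beta * sum(new[s] for s in syns)
            new[word] = num / (alpha + beta * len(syns))
    return new
```

Each update is the closed-form minimizer of the objective with all other vectors held fixed, so repeated sweeps converge quickly, which matches the few-iteration convergence observed later in the paper.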


Extrofitting [Jo and Choi2018] follows 3 steps: (i) expanding the word vector with enrichment, (ii) transferring semantic knowledge, and (iii) enriching its vector space. Step 1 simply adds a dimension to the original vectors, filled with the mean value of all the elements in each word vector. Step 2 sets the values of the added dimension to a shared mean value for words that are synonym pairs. With these two steps, we can keep both the word dimension and the characteristics (e.g., meaning, semantics) of the word vectors. The third step uses Linear Discriminant Analysis (LDA) [Welling2005] to find a new vector space that clusters the synonyms and differentiates the other word vectors, using the between-class scatter matrix and the within-class scatter matrix and reducing the added dimension. We describe extrofitting briefly in Section 3.

3 Deep Extrofitting

Extrofitting first expands the word embedding matrix W:

Expand(W; L): w_i \leftarrow [w_i ; e_i], where e_i = mean(w_i) and, for every synonym pair (i, j) \in L, e_i = e_j = (mean(w_i) + mean(w_j)) / 2

where W is the word embedding table, mean(w_i) is the mean value of the elements in word vector w_i, L denotes the semantic lexicon, and (i, j) denotes a synonym pair. Next, we define Trans as calculating a transform matrix T, given a word embedding matrix W:

T = argmax_T \, |T^\top S_B T| / |T^\top S_W T|, with S_W = \sum_c \sum_{w \in c} (w - m_c)(w - m_c)^\top and S_B = \sum_c N_c (m_c - m)(m_c - m)^\top

where w is a word vector and c is a class, which is the index of a synonym pair in the lexicon. The overall average of w is m, and the class average within class c is denoted by m_c. This formula finds the transform matrix T that minimizes the variance within the same class and maximizes the variance between different classes. Each class is defined as the index of a synonym pair. Then simple extrofitting is formulated as follows:

Extro(W; L) = Expand(W; L) \cdot Trans(Expand(W; L))
Based on extrofitting, we define variations of deep extrofitting: stacked extrofitting (Extro), RExtrofitting (RExtro), and ERetrofitting (ERetro).

Stacked Extrofitting

We first stack extrofitting, keeping the original dimension of the word vectors. We want to see whether our method can be specialized/overfitted to semantic lexicons like retrofitting. Stacked extrofitting (Extro) is formulated as follows:

W^{(k)} = Extro(W^{(k-1)}; L), with W^{(0)} = W

Next, we do not keep the original dimension. That is, we skip Step 1 (expanding word vectors with enrichment) and Step 2 (transferring semantic knowledge), which together preserve the original dimension of the word vectors. Thereby we can focus on the effect of Step 3, which enriches the vector space by reducing its dimension. Stacked extrofitting without keeping the dimension is formulated as follows:

W^{(k)} = W^{(k-1)} \cdot Trans(W^{(k-1)}), reducing one dimension at each iteration

Extrofitting with Retrofitting

Retrofitting can be overfitted to semantic lexicons, whereas extrofitting results in generalized word vectors [Jo and Choi2018]. We therefore expect that retrofitting and extrofitting complement each other. So, we first apply retrofitting to word vectors and then extrofit the retrofitted word vectors. Denoting retrofitting as Retro, this gives:

ERetro(W) = Extro(Retro(W; L); L)

We can utilize them in the reversed way:

RExtro(W) = Retro(Extro(W; L); L)

We can also use them one by one, alternating the two at every iteration:

Stepwise ERetro: W^{(k)} = Extro(Retro(W^{(k-1)})), Stepwise RExtro: W^{(k)} = Retro(Extro(W^{(k-1)}))
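Since every variant is just a composition of the two post-processing operators, they can all be expressed as one stacking helper. A sketch, where `retro` and `extro` are hypothetical callables over the embedding matrix:

```python
def stack(transforms, W):
    """Apply a sequence of post-processing transforms, left to right."""
    for f in transforms:
        W = f(W)
    return W

# With hypothetical callables `retro` and `extro`:
#   ERetro          : stack([retro, extro], W)      # retrofit, then extrofit
#   RExtro          : stack([extro, retro], W)      # extrofit, then retrofit
#   Stepwise ERetro : stack([retro, extro] * k, W)  # alternate k times
```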

4 Experiment Data

Pretrained Word Vectors

Pretrained word vectors consist of words, each represented as an n-dimensional float vector obtained from training data through unsupervised algorithms. The major pretrained word vectors we use are GloVe [Pennington, Socher, and Manning2014]. The algorithm learns word vectors by making the dot products of word vectors equal to the logarithm of the words' probability of co-occurrence. We use glove.42B.300d trained on Common Crawl data, which contains 1,917,493 unique words as 300-dimensional vectors. Even though many word embedding algorithms and pretrained word vectors have been suggested since GloVe, it has still been used as a strong baseline on word similarity tasks [Cer et al.2017, Camacho-Collados et al.2017].
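For reference, a glove.42B.300d-style text file (one word followed by its float values per line) can be loaded as follows; the local file path in the usage comment is an assumption:

```python
import numpy as np

def load_glove(path):
    """Load a GloVe text file: each line is a word followed by its floats."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# vectors = load_glove('glove.42B.300d.txt')  # hypothetical local path
```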

Semantic Lexicons

We use WordNet [Miller1995], which consists of approximately 150,000 words and 115,000 synset pairs. We borrow Faruqui et al.'s WordNet lexicon [Faruqui et al.2014], comprised of synonyms, hypernyms, and hyponyms. WordNet overlaps 70,411 words with GloVe, which is 3.67% of the words in GloVe. Faruqui et al. reported that their method performed best when paired with WordNet. Extrofitting [Jo and Choi2018] also worked well with WordNet.

Evaluation Data

Word similarity datasets consist of word pairs with a human-rated similarity score between the words. We use 4 different datasets: MEN-3k (MEN) [Bruni et al.2014], WordSim-353 (WS) [Finkelstein et al.2001], SimLex-999 (SL) [Hill, Reichart, and Korhonen2015], and SimVerb-3500 (SV) [Gerz et al.2016]. We experiment on as many datasets as possible to observe the effects of generalization and specialization while avoiding overfitting to a specific dataset. For MEN-3k, WordSim-353, and SimVerb-3500, we combine the train (or dev) set and the test set, solely for evaluation purposes. Other datasets are left for future work since they either are too small or contain numerous out-of-vocabulary words.

5 Experiments on Word Similarity Task

The word similarity task is to calculate Spearman's correlation [Daniel1990] between human-rated scores and the similarities of word pairs computed from the word vectors. We first apply stacked extrofitting to GloVe, keeping its original dimension (see Section 3), and present the result in Table 1. We observe that stacked extrofitting improves the performance on word similarity tasks for a few iterations, but the performance gap becomes smaller as we stack more extrofitting.
Next, we perform stacked extrofitting without keeping the dimension. The result is presented in Table 2; the performance is not much different from Table 1. For convenient comparison with other methods in 300-dimensional vectors, we use stacked extrofitting with the dimension kept as the default method.
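The evaluation itself is straightforward: cosine similarity per pair, then rank correlation against the human scores. A minimal sketch (the ranking here ignores tied values, which is a simplifying assumption; a full implementation would use tie-corrected ranks):

```python
import numpy as np

def spearman(x, y):
    """Spearman's correlation, assuming no tied values."""
    def rank(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a), dtype=float)
        return r
    rx, ry = rank(np.asarray(x)), rank(np.asarray(y))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def evaluate(vectors, pairs):
    """pairs: list of (word1, word2, human_score); OOV pairs are skipped."""
    preds, golds = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            a, b = vectors[w1], vectors[w2]
            preds.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
            golds.append(score)
    return spearman(preds, golds)
```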

Iter. MEN WS SL SV
0 0.7435 0.5516 0.3738 0.2264
1 0.8223 0.6638 0.4858 0.3573
2 0.8253 0.6644 0.4985 0.3684
3 0.8260 0.6704 0.4996 0.3723
4 0.8260 0.6747 0.4994 0.3719
5 0.8260 0.6675 0.4989 0.3730
6 0.8262 0.6618 0.4993 0.3730
7 0.8259 0.6784 0.4996 0.3730
8 0.8261 0.6735 0.4997 0.3729
9 0.8261 0.6708 0.4992 0.3726
10 0.8259 0.6618 0.4995 0.3722
Table 1: Spearman's correlation of stacked extrofitted word vectors, keeping the original dimension, on word similarity tasks. Stacking more extrofitting improves performance. Iteration 0 is GloVe without any processing. Underlines indicate the best average performance.
Dim. MEN WS SL SV
300 0.7435 0.5516 0.3738 0.2264
299 0.8250 0.6688 0.4940 0.3676
298 0.8306 0.6762 0.4969 0.3734
297 0.8302 0.6772 0.4986 0.3781
296 0.8304 0.6769 0.5007 0.3811
295 0.8278 0.6806 0.4981 0.3809
294 0.8267 0.6840 0.4942 0.3757
293 0.8238 0.6859 0.4974 0.3772
292 0.8210 0.6724 0.5003 0.3763
291 0.8164 0.6831 0.5039 0.3774
290 0.8164 0.6666 0.5039 0.3774
Table 2: Spearman's correlation of stacked extrofitted word vectors without keeping the dimension, on word similarity tasks. Stacking more extrofitting improves performance. Dim 300 is GloVe without any processing; underlines indicate the best average performance.
Figure 1: Plots of the top-100 nearest words of cue words across extrofitting layers. We choose two cue words: one included in the semantic lexicon (love; left), and one not (soo; right).
Cue Word Post-Processed Top-10 Nearest Words (Cosine Similarity Score)
love Raw
loved(.7745), i(.7338), loves(.7311), know(.7286), loving(.7263),
really(.7196), always(.7193), want(.7192), hope(.7127), think(.7110)
+ Extro
adore(.5958), hate(.5917), loved(.5801), luv(.5394), loooove(.5285),
looooove(.5230), loving(.5202), want(.5176), loveeee(.5175), looove(.5065)
+ Extro
adore(.5817), hate(.5770), loved(.5592), luv(.5297), loooove(.5256),
looooove(.5207), loveeee(.5199), looove(.5047), loooooove(.4983), loving(.4953)
+ Extro
adore(.5794), hate(.5729), loved(.5496), luv(.5267), loooove(.5236),
looooove(.5216), loveeee(.5205), looove(.5040), loooooove(.4982), loadsss(.4920)
soo Raw
sooo(.8394), soooo(.7938), sooooo(.7715), soooooo(.7359), sooooooo(.6844),
haha(.6574), hahah(.6320), damn(.6247), omg(.6244), hahaha(.6219)
+ Extro
sooo(.8312), soooo(.7879), sooooo(.7760), soooooo(.7562), sooooooo(.7263), soooooooo(.6896),
sooooooooo(.6830), soooooooooo(.6559), tooo(.6501), sooooooooooo(.6465)
+ Extro
sooo(.8284), soooo(.7841), sooooo(.7720), soooooo(.7520), sooooooo(.7221), soooooooo(.6841),
sooooooooo(.6780), soooooooooo(.6507), tooo(.6435), sooooooooooo(.6408)
+ Extro
sooo(.8276), soooo(.7832), sooooo(.7709), soooooo(.7510), sooooooo(.7205), soooooooo(.6833),
sooooooooo(.6775), soooooooooo(.6497), tooo(.6423), sooooooooooo(.6403)
Table 3: List of the top-10 nearest words of cue words across extrofitting layers. We show cosine similarity scores for two cue words, one included in the semantic lexicon (love) and one not (soo).

We plot the top-100 nearest words using t-SNE [Maaten and Hinton2008], as shown in Figure 1. We can see that stacking more extrofitting makes the word vectors use a broader vector space overall while still collecting synonyms relatively close together. As a result, we lose word similarity scores (see Table 3) but gain an overall performance improvement, as shown in Table 1. We interpret these results as generalization, in that the word vectors obtain more generalized representations by moving farther from each other.
To improve performance on the word similarity tasks, we combine retrofitting with extrofitting. We first apply retrofitting to extrofitted word vectors, denoted as RExtro. The result is presented in Table 4 and shows that adding retrofitting more than once does not significantly improve performance. Second, we apply extrofitting to retrofitted word vectors, denoted as ERetro. The result is presented in Table 5 and shows that stacking more extrofitting improves performance, but the gap becomes smaller as we add more extrofitting. We also observe that a single retrofitting iteration shows the best performance.

(Retro, Extro) MEN WS SL SV
(0,1) 0.8223 0.6638 0.4858 0.3573
(1,1) 0.8066 0.6414 0.5717 0.4403
(2,1) 0.8059 0.6421 0.5676 0.4292
(3,1) 0.8052 0.6405 0.5664 0.4274
(4,1) 0.8051 0.6413 0.5661 0.4271
(5,1) 0.8051 0.6406 0.5660 0.4271
(10,1) 0.8051 0.6407 0.5661 0.4271
Table 4: Spearman's correlation of RExtrofitted word vectors on word similarity tasks using the semantic lexicon. We apply retrofitting to extrofitted word vectors.
(Extro, Retro) MEN WS SL SV
(0,1) 0.7305 0.4986 0.4700 0.3134
(0,2) 0.7316 0.5024 0.4663 0.3041
(0,3) 0.7307 0.5018 0.4647 0.3021
(1,1) 0.7980 0.6225 0.5688 0.4434
(1,2) 0.7976 0.6269 0.5679 0.4355
(1,3) 0.7971 0.6260 0.5670 0.4338
(2,1) 0.8119 0.6415 0.5797 0.4553
(2,2) 0.8119 0.6488 0.5778 0.4476
(2,3) 0.8115 0.6438 0.5770 0.4460
(3,1) 0.8137 0.6416 0.5810 0.4614
(3,2) 0.8136 0.6468 0.5791 0.4535
(3,3) 0.8133 0.6442 0.5784 0.4520
Table 5: Spearman's correlation of ERetrofitted word vectors on word similarity tasks using the semantic lexicon. We apply extrofitting to retrofitted word vectors.

Next, we stack retrofitting and extrofitting one by one. When we stack retrofitting first, we denote it as Stepwise RExtro; otherwise, stacking extrofitting first, we denote it as Stepwise ERetro. We report the results in Table 6 and Table 7, respectively. Stepwise RExtro and Stepwise ERetro perform well at specialization on the SimLex-999 and SimVerb-3500 datasets. Since the word pairs in these datasets fully overlap with the synonym pairs in WordNet, applying retrofitting improves the similarity on them while concurrently degrading the performance on the other datasets. Note that the performance of retrofitting converges within a few iterations (see Table 4), but with the help of extrofitting we can specialize beyond plain retrofitting by finding a newly enriched vector space at every iteration. On the other hand, the weakness of extrofitting, that it cannot strongly collect word vectors, is compensated by retrofitting. We compare our best results with previous retrofitting models in Table 8. We define the average similarity score on MEN-3k and WordSim-353 as the generalization score (GenScore) because MEN-3k and WordSim-353 include words that are not part of the WordNet lexicon. The average score on the other datasets, SimLex-999 and SimVerb-3500, is defined as the specialization score (SpecScore) because the words in these datasets fully overlap with the WordNet lexicon. Our methods, Stacked Extro, Stepwise RExtro, and Stepwise ERetro, significantly outperform state-of-the-art retrofitting models despite using only synonyms. Furthermore, if we combine extrofitting with retrofitting in a greedy way, the results can be further improved. Although ATTRACT-REPEL [Mrkšić et al.2017] is better than our methods on SimLex-999, we specialize the word vectors with only synonyms, thus using fewer external resources than ATTRACT-REPEL. Second, ATTRACT-REPEL cannot use GloVe without preprocessing because of TensorFlow's limit on the memory allocated to a single variable; this constraint is critical because well-known pretrained word vectors have vocabulary and data sizes surpassing that limit. Lastly, Glavaš and Vulić [Glavaš and Vulić2018] showed that ATTRACT-REPEL specializes only words seen in the semantic lexicons, whereas our methods share the strong point of ER-CNT [Glavaš and Vulić2018] of enriching word vectors not included in the semantic lexicons.

Iter. MEN WS SL SV
1 0.8066 0.6414 0.5717 0.4403
2 0.7444 0.5881 0.5940 0.5050
3 0.7022 0.5472 0.5932 0.5389
4 0.6941 0.5397 0.5975 0.5628
5 0.6861 0.5334 0.6019 0.5793
6 0.6722 0.5313 0.6076 0.5925
7 0.6555 0.5275 0.6089 0.6006
8 0.6476 0.5175 0.6026 0.6020
9 0.6369 0.5194 0.5988 0.6015
10 0.6260 0.5238 0.5923 0.5990
Table 6: Spearman's correlation of Stepwise RExtrofitted word vectors on word similarity tasks.
Iter. MEN WS SL SV
1 0.7980 0.6225 0.5688 0.4434
2 0.7115 0.5359 0.5783 0.4962
3 0.6844 0.5023 0.5796 0.5314
4 0.7016 0.5182 0.5969 0.5605
5 0.7074 0.5360 0.6093 0.5818
6 0.6965 0.5426 0.6165 0.5978
7 0.6830 0.5316 0.6124 0.6004
8 0.6697 0.5275 0.6055 0.6028
9 0.6525 0.5281 0.6003 0.6028
10 0.6411 0.5255 0.5945 0.5983
Table 7: Spearman's correlation of Stepwise ERetrofitted word vectors on word similarity tasks.

We plot the 100 nearest word vectors in Figure 2 and list the top-10 nearest words with cosine similarities in Table 9. Both Stepwise RExtro and Stepwise ERetro collect words included in the semantic lexicon more strongly than retrofitted GloVe while dispersing word vectors not included in the lexicon. The difference in concentration is due to the substantial effect of the first retrofitting or extrofitting layer. For words not included in the semantic lexicon, only extrofitting can take effect, resulting in a loss of cosine similarity.

Model MEN WS GenScore SL SV SpecScore
Retrofitting (Syn) 0.7305 0.5332 0.6319 0.4644 0.3017 0.3831
Counter-fitting (Syn) 0.7149 0.5075 0.6112 0.4143 0.2845 0.3494
Counter-fitting (Syn+Ant) 0.6898 0.4633 0.5766 0.5415 0.4167 0.4791
ATTRACT-REPEL (Syn) 0.7156 0.5921 0.6539 0.5672 0.4416 0.5044
ATTRACT-REPEL (Syn+Ant) 0.7013 0.5523 0.6268 0.6397 0.5463 0.5930
ER-CNT (Syn) - - - 0.465 0.339 0.402
ER-CNT (Syn+Ant) - - - 0.582 0.439 0.5105
Extro (Syn) 0.8259 0.6784 0.7522 0.4996 0.3730 0.4363
Stepwise RExtro (Syn) 0.6555 0.5275 0.5915 0.6089 0.6020 0.6055
Stepwise ERetro (Syn) 0.6965 0.5426 0.6200 0.6165 0.5978 0.6072
Stepwise ERetro (Syn) 0.6697 0.5275 0.5986 0.6055 0.6028 0.6042
Table 8: Comparison of our methods with other retrofitting models. We use GloVe with synonym pairs (Syn) in the WordNet lexicon, and their antonym pairs (Ant) if the model uses antonyms as well. The GitHub code of ER-CNT is still under development, so we report the results from their paper.
Figure 2: Plots of the top-100 nearest words of cue words under different post-processing methods. One cue word is included in the semantic lexicon (love; left), and the other is not (soo; right). SRExtro and SERetro denote Stepwise RExtro and Stepwise ERetro, respectively.
Cue Word Post-Processed Top-10 Nearest Words (Cosine Similarity Score)
love Raw
loved(.7745), i(.7338), loves(.7311), know(.7286), loving(.7263),
really(.7196), always(.7193), want(.7192), hope(.7127), think(.7110)
+ Retro
loved(.7857), know(.7826), like(.7781), want(.7736), i(.7707),
feel(.7550), wish(.7549), think(.7491), enjoy(.7453), loving(.7451)
+Stepwise RExtro
devotedness(.8236), lovemaking(.8163), agape(.7995), heartstrings(.7787), eff(.7407),
infatuation(.7397), do_it(.7368), gaping(.7231), fornicate(.7229), dearest(.7111)
+Stepwise ERetro
devotedness(.8259), lovemaking(.8158), heartstrings(.7788), agape(.7692), infatuation(.7548),
cherish(.7222), eff(.7125), dearest(.7027), do_it(.6975), fornicate(.6876)
soo Raw
sooo(.8394), soooo(.7938), sooooo(.7715), soooooo(.7359), sooooooo(.6844),
haha(.6574), hahah(.6320), damn(.6247), omg(.6244), hahaha(.6219)
+ Retro
sooo(.8394), soooo(.7938), sooooo(.7715), soooooo(.7359), sooooooo(.6844), soooooooo(.6896),
haha(.6574), hahah(.6320), omg(.6244), hahaha(.6219), sooooooooo(.6189)
+Stepwise RExtro
sooo(.8189), soooo(.7666), sooooo(.7619), soooooo(.7455), sooooooo(.7187), sooooooooo(.6888),
soooooooo(.6622), sooooooooooo(.6511), soooooooooo(.6464), soooooooooooo(.6170)
+Stepwise ERetro
sooo(.8055), soooo(.7585), sooooo(.7411), soooooo(.7237), sooooooo(.6985), sooooooooo(.6539),
soooooooo(.6473), tooo(.6349), soooooooooo(.6073), sooooooooooo(.6057)
Table 9: List of the top-10 nearest words of cue words under different post-processing methods. We show cosine similarity scores for two cue words, one included in the semantic lexicon (love) and one not (soo).

6 Experiments on Text Classification Task

NOT Trainable Word Vectors
Model DBpedia Yahoo!Answers (super) Yahoo!Answers (sub) Yelp review
(1) Without Pretrained 0.9740 0.6282 0.4095 0.6719
(2) GloVe 0.8544 0.4692 0.3142 0.6528
(3) Retrofit(GloVe) 0.8660 0.4750 0.2861 0.6501
(4) Counterfit(GloVe) 0.7713 0.3794 0.2079 0.6156
(5) Extro(GloVe) 0.9864 0.7349 0.5190 0.6804
(6) Stacked Extro(GloVe) 0.9863 0.7374 0.5194 0.6812
(7) Stepwise RExtro(GloVe) 0.9835 0.7038 0.4866 0.6778
(8) Stepwise ERetro(GloVe) 0.9853 0.7083 0.4917 0.6774
Trainable Word Vectors
Model DBpedia Yahoo!Answers (super) Yahoo!Answers (sub) Yelp review
(1) Without Pretrained 0.9822 0.6687 0.4392 0.6796
(2) GloVe 0.9861 0.6960 0.4592 0.6805
(3) Retrofit(GloVe) 0.9785 0.6689 0.4186 0.6780
(4) Counterfit(GloVe) 0.9798 0.6555 0.4151 0.6799
(5) Extro(GloVe) 0.9874 0.7419 0.5280 0.6845
(6) Stacked Extro(GloVe) 0.9875 0.7434 0.5226 0.6856
(7) Stepwise RExtro(GloVe) 0.9853 0.7180 0.4990 0.6826
(8) Stepwise ERetro(GloVe) 0.9861 0.7232 0.5070 0.6839
Table 10: Accuracy on text classification datasets using TextCNN initialized with differently post-processed word vectors.

We examine the effect of word vector specialization and generalization on text classification tasks.

We use 2 topic classification datasets, DBpedia ontology [Lehmann et al.2015] and Yahoo!Answers, and 1 sentiment classification dataset, Yelp reviews. We use the Yahoo!Answers dataset for 2 different tasks: classifying super (upper-level) categories and classifying sub (lower-level) categories.

Since we believe that keeping the sequence of words is important, we build a simple TextCNN [Kim2014] rather than a classifier based on Bag-of-Words (BoW), as Faruqui et al. used, because BoW neglects word order by averaging all the word vectors. Our TextCNN takes the first 100 words as input, and the classifier consists of 2 convolutional layers with channel sizes of 32 and 16, respectively. We adopt a multi-channel approach, implementing multiple kernel sizes. We use 4 different kernel sizes (2, 3, 4, and 5) and concatenate their outputs after each max-pooling layer. The learned kernels pass through a ReLU activation and are max-pooled. We set the word embedding size to 300 and the learning rate to 0.001, using early stopping to prevent overfitting.
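The convolutional feature extractor described above can be sketched in NumPy to make the shapes concrete. This sketch shows a single convolution per kernel size with random weights, rather than the full trained two-layer classifier, so it illustrates the multi-kernel design under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, kernels):
    """x: (seq_len, dim); kernels: (n_k, width, dim).
    Returns (seq_len - width + 1, n_k) feature map after ReLU."""
    n_k, width, dim = kernels.shape
    out = np.stack([
        np.einsum('kwd,wd->k', kernels, x[i:i + width])
        for i in range(x.shape[0] - width + 1)
    ])
    return np.maximum(out, 0.0)  # ReLU

def textcnn_features(x, kernel_sizes=(2, 3, 4, 5), channels=32):
    """One convolution per kernel size, global max-pool, concatenate."""
    feats = []
    for w in kernel_sizes:
        k = rng.standard_normal((channels, w, x.shape[1])) * 0.01
        feats.append(conv1d_relu(x, k).max(axis=0))  # global max-pool
    return np.concatenate(feats)  # (len(kernel_sizes) * channels,)

x = rng.standard_normal((100, 300))  # first 100 words, 300-dim embeddings
features = textcnn_features(x)       # 4 kernel sizes * 32 channels = 128
```

A classifier head (and the second 16-channel convolution from the paper) would sit on top of these concatenated features.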


To observe the effect of word vector specialization and generalization, we experiment in 2 different settings: fixed word vectors or trainable word vectors. With fixed word vectors, we can evaluate the usefulness of the word vectors themselves. With trainable word vectors, we can see the improvement in classification performance when initialized with the enriched word vectors. In each setting, we test the classifier (1) without any pretrained word vectors, (2) with GloVe, (3) GloVe with retrofitting, (4) GloVe with counter-fitting, (5) GloVe with extrofitting, (6) GloVe with stacked extrofitting, (7) GloVe with Stepwise RExtro, and (8) GloVe with Stepwise ERetro. The results are presented in Table 10. We can see that generalized word vectors, (5) and (6), perform better than specialized word vectors, (7) and (8), on topic classification tasks whether the word vectors are trainable or frozen. However, the performance gap is small in sentiment classification (Yelp review). This might be because WordNet contains numerous emotional words. The result implies that although generalized word vectors perform better in general, specialized word vectors can be useful for domain-specific tasks given sufficiently specialized semantic lexicons.

7 Conclusion

We develop retrofitting models that generate specialized and generalized word vectors using in-depth expansional retrofitting, called deep extrofitting. We show that stacked extrofitting improves performance on word similarity tasks overall, and that combining extrofitting with retrofitting performs well at word vector specialization. These models outperform previous state-of-the-art models, specializing on SimLex-999 and SimVerb-3500 with only synonyms and generalizing on MEN-3k and WordSim-353. We also observe not only that extrofitting helps retrofitting find a new vector space for specialization, preventing retrofitting from converging within a few iterations, but also that retrofitting helps extrofitting collect word vectors more strongly. Our method depends only on the distribution of pretrained word vectors and synonym pairs, and needs no antonym pairs, hyperparameters, or explicit mapping functions. As future work, we will extend our method to utilize antonym pairs as well.


  • [Baker, Fillmore, and Lowe1998] Baker, C. F.; Fillmore, C. J.; and Lowe, J. B. 1998. The berkeley framenet project. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, 86–90. Association for Computational Linguistics.
  • [Bruni et al.2014] Bruni, E.; Tram, N.; Baroni, M.; et al. 2014. Multimodal distributional semantics. The Journal of Artificial Intelligence Research.
  • [Camacho-Collados et al.2017] Camacho-Collados, J.; Pilehvar, M. T.; Collier, N.; and Navigli, R. 2017. Semeval-2017 task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 15–26.
  • [Camacho-Collados, Pilehvar, and Navigli2015] Camacho-Collados, J.; Pilehvar, M. T.; and Navigli, R. 2015. Nasari: a novel approach to a semantically-aware representation of items. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 567–577.
  • [Cer et al.2017] Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 1–14.
  • [Daniel1990] Daniel, W. W. 1990. Spearman rank correlation coefficient. Applied nonparametric statistics 358–365.
  • [Faruqui et al.2014] Faruqui, M.; Dodge, J.; Jauhar, S. K.; Dyer, C.; Hovy, E.; and Smith, N. A. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.
  • [Finkelstein et al.2001] Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; and Ruppin, E. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, 406–414. ACM.
  • [Ganitkevitch, Van Durme, and Callison-Burch2013] Ganitkevitch, J.; Van Durme, B.; and Callison-Burch, C. 2013. Ppdb: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 758–764.
  • [Gerz et al.2016] Gerz, D.; Vulić, I.; Hill, F.; Reichart, R.; and Korhonen, A. 2016. Simverb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2173–2182.
  • [Glavaš and Vulić2018] Glavaš, G., and Vulić, I. 2018. Explicit retrofitting of distributional word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 34–45.
  • [Hill, Reichart, and Korhonen2015] Hill, F.; Reichart, R.; and Korhonen, A. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665–695.
  • [Jo and Choi2018] Jo, H., and Choi, S. J. 2018. Extrofitting: Enriching word representation and its vector space with semantic lexicons. arXiv preprint arXiv:1804.07946.
  • [Kiela, Hill, and Clark2015] Kiela, D.; Hill, F.; and Clark, S. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2044–2048.
  • [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.
  • [Lehmann et al.2015] Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. 2015. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6(2):167–195.
  • [Lenci2018] Lenci, A. 2018. Distributional models of word meaning. Annual review of Linguistics 4:151–171.
  • [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. Journal of machine learning research.

  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
  • [Miller1995] Miller, G. A. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
  • [Mrkšić et al.2016] Mrkšić, N.; Séaghdha, D. O.; Thomson, B.; Gašić, M.; Rojas-Barahona, L.; Su, P.-H.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.
  • [Mrkšić et al.2017] Mrkšić, N.; Vulić, I.; Séaghdha, D. Ó.; Leviant, I.; Reichart, R.; Gašić, M.; Korhonen, A.; and Young, S. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association of Computational Linguistics 5(1):309–324.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
  • [Speer, Chin, and Havasi2017] Speer, R.; Chin, J.; and Havasi, C. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI, 4444–4451.
  • [Vulić, Mrkšić, and Korhonen2017] Vulić, I.; Mrkšić, N.; and Korhonen, A. 2017. Cross-lingual induction and transfer of verb classes based on word vector space specialisation. arXiv preprint arXiv:1707.06945.
  • [Welling2005] Welling, M. 2005. Fisher linear discriminant analysis. Department of Computer Science, University of Toronto 3(1).