Vector representations of words (embeddings) have become the cornerstone of modern Natural Language Processing (NLP), as learning word vectors and utilizing them as features in downstream NLP tasks is the de facto standard. Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) are typically trained in an unsupervised way on large monolingual corpora. Whilst such word representations are able to capture some syntactic as well as semantic information, their ability to map relations (e.g. synonymy, antonymy) between words is limited. To alleviate this deficiency, a set of refinement post-processing methods, called retrofitting or semantic specialization, has been introduced. In the next section, we discuss the intricacies of these methods in more detail.
To summarize, our contributions in this work are as follows:
We introduce a set of new linguistic constraints (i.e. synonyms and antonyms) created with BabelNet for three languages: English, German and Italian.
We show that the proposed approach achieves performance improvements on an intrinsic task (word similarity) as well as on a downstream task (dialog state tracking).
2 Related Work
Numerous methods have been introduced for incorporating structured linguistic knowledge from external resources into word embeddings. Fundamentally, there exist three categories of semantic specialization approaches: (a) joint methods, which incorporate lexical information during the training of distributional word vectors; (b) specialization methods, also referred to as retrofitting methods, which use post-processing techniques to inject semantic information from external lexical resources into pre-trained word vector representations; and (c) post-specialization methods, which use linguistic constraints to learn a general mapping function that specializes the entire distributional vector space.
In general, joint methods perform worse than the other two categories, and are not model-agnostic, as they are tightly coupled to the distributional word vector models (e.g. Word2Vec, GloVe). Therefore, in this work we concentrate on the specialization and post-specialization methods. Approaches in the former category can be considered local specialization methods, and the most prominent examples are the following. Retrofitting (Faruqui et al., 2015) is a post-processing method that enriches word embeddings with knowledge from semantic lexicons; in this case, it brings semantically similar words closer together. Counter-fitting (Mrkšić et al., 2016) likewise fine-tunes word representations; however, in contrast to retrofitting, it counter-fits the embeddings with respect to both similarity and antonymy constraints. Attract-Repel (Mrkšić et al., 2017) uses linguistic constraints obtained from external lexical resources to semantically specialize word embeddings. Similarly to counter-fitting, it injects synonymy and antonymy constraints into distributional word vector spaces; unlike counter-fitting, it does not ignore how updates of the example word vector pairs affect their relations to other word vectors.
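The intuition behind these local specialization methods can be sketched with a toy update rule. This is only an illustration of the attract/repel idea, not any of the cited methods' actual optimization; the vectors, learning rate, and function names are ours:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attract_step(u, v, lr=0.1):
    """Pull a synonym pair closer: move each vector toward the other."""
    u2 = [a + lr * (b - a) for a, b in zip(u, v)]
    v2 = [b + lr * (a - b) for a, b in zip(u, v)]
    return u2, v2

def repel_step(u, v, lr=0.1):
    """Push an antonym pair apart: move each vector away from the other."""
    u2 = [a - lr * (b - a) for a, b in zip(u, v)]
    v2 = [b - lr * (a - b) for a, b in zip(u, v)]
    return u2, v2

# Toy 3-d embeddings for a synonym pair.
u, v = [1.0, 0.2, 0.0], [0.6, 0.8, 0.1]
u2, v2 = attract_step(u, v)
assert cosine(u2, v2) > cosine(u, v)   # synonyms end up closer
u3, v3 = repel_step(u, v)
assert cosine(u3, v3) < cosine(u, v)   # antonyms end up further apart
```

The actual methods differ in how they balance these updates against preserving the original distributional space, which is what the Attract-Repel regularization term below formalizes.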
On the other hand, the latter group, post-specialization methods, performs global specialization of distributional spaces. We can distinguish: explicit retrofitting (Glavaš and Vulić, 2018), which was the first attempt to use external constraints (i.e. synonyms and antonyms) as training examples for learning an explicit mapping function that specializes words not observed in the constraints. Later, the more robust DFFN (Vulić et al., 2018) method was introduced with the same goal: to specialize the full vocabulary by leveraging the already specialized subspace of seen words.
In this paper, we propose an approach that builds upon previous works (Vulić et al., 2018; Ponti et al., 2018). The process of specializing distributional vectors is a two-step procedure (as shown in Figure 1). First, an initial specialization is performed (see §3.1). In the second step, a global specialization mapping function is learned, which allows the model to generalize to unseen words (see §3.2).
3.1 Initial Specialization
In this step, the subspace of distributional vectors for words that occur in the external constraints is specialized. To this end, fine-tuning of seen words can be performed using any specialization method. In this work, we utilize the Attract-Repel model (Mrkšić et al., 2017), as it offers state-of-the-art performance. This method makes use of both synonymy (attract) and antonymy (repel) constraints. More formally, given a set A of attract word pairs and a set R of repel word pairs, let V_s be the vocabulary of words seen in the constraints. Hence, each word pair (w_l, w_r) is represented by a corresponding vector pair (x_l, x_r). The model optimization operates over mini-batches: a mini-batch B_A of synonymy pairs (of size k_1) and a mini-batch B_R of antonymy pairs (of size k_2). The corresponding pairs of negative examples T_A and T_R are drawn from the word vectors in V_s.
The negative examples serve the purpose of pulling synonym pairs closer together and pushing antonym pairs further away with respect to their corresponding negative examples. For synonyms:

Att(B_A, T_A) = Σ_{i=1..k_1} [ τ(δ_att + x_l^i · t_l^i − x_l^i · x_r^i) + τ(δ_att + x_r^i · t_r^i − x_l^i · x_r^i) ]

where τ(x) = max(0, x) is the rectifier function, and δ_att is the similarity margin determining how much closer synonymous vectors should be to each other than to their negative examples. Similarly, the equation for antonyms is given as:

Rep(B_R, T_R) = Σ_{i=1..k_2} [ τ(δ_rpl + x_l^i · x_r^i − x_l^i · t_l^i) + τ(δ_rpl + x_l^i · x_r^i − x_r^i · t_r^i) ]
A distributional regularization term is used to retain the quality of the original distributional vector space by means of L2-regularization:

Reg(B_A, B_R) = λ_reg Σ_{x_i ∈ V(B_A ∪ B_R)} ||x̂_i − x_i||_2

where λ_reg is the L2-regularization constant, and x̂_i is the original distributional vector for the word w_i.
Consequently, the final cost function is formulated as follows:

C(B_A, B_R) = Att(B_A, T_A) + Rep(B_R, T_R) + Reg(B_A, B_R)
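The attract and repel hinge costs can be sketched in a few lines of Python. This is a simplified, single-pair version of the mini-batch objective; the margin defaults and helper names are illustrative:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tau(x):
    """Rectifier tau(x) = max(0, x)."""
    return max(0.0, x)

def attract_cost(x_l, x_r, t_l, t_r, delta_att=0.6):
    """Hinge cost pulling the synonym pair (x_l, x_r) closer to each
    other than to their negative examples (t_l, t_r)."""
    return (tau(delta_att + dot(x_l, t_l) - dot(x_l, x_r))
            + tau(delta_att + dot(x_r, t_r) - dot(x_l, x_r)))

def repel_cost(x_l, x_r, t_l, t_r, delta_rpl=0.0):
    """Hinge cost pushing the antonym pair (x_l, x_r) further from each
    other than from their negative examples."""
    return (tau(delta_rpl + dot(x_l, x_r) - dot(x_l, t_l))
            + tau(delta_rpl + dot(x_l, x_r) - dot(x_r, t_r)))

# Identical unit vectors with orthogonal negatives incur zero attract cost...
x, t = [1.0, 0.0], [0.0, 1.0]
assert attract_cost(x, x, t, t) == 0.0
# ...but as an antonym pair they are penalized on both sides.
assert repel_cost(x, x, t, t) == 2.0
```

The full objective sums these costs over the mini-batches and adds the distributional regularization term.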
3.2 Proposed Post-Specialization Model
Once the initial specialization is completed, post-specialization methods can be employed. This step is important because local specialization affects only the words seen in the constraints, and thus only a subset of the original distributional space. Post-specialization methods, in contrast, learn a global specialization mapping function that allows them to generalize to unseen words.
Given the specialized word vectors from the vocabulary of seen words, our proposed method propagates this signal to the entire distributional vector space using a generative adversarial network (GAN) (Goodfellow et al., 2014). Hence, in our model, following the approach of Ponti et al. (2018), we introduce adversarial losses. More specifically, the mapping function is learned through a combination of a standard L2-loss with adversarial losses. The motivation behind this is to make the mappings more natural and to ensure that vectors specialized for the full vocabulary are more realistic. To this end, we use the Wasserstein distance incorporated in the generative adversarial network (WGAN) (Arjovsky et al., 2017) as well as its improved variant with gradient penalty (WGAN-GP) (Gulrajani et al., 2017). For brevity, we call our model WGAN-postspec, an umbrella term for the WGAN and WGAN-GP methods implemented in the proposed post-specialization model. One of the benefits of using WGANs over vanilla GANs is that WGANs are generally more stable and do not suffer from vanishing gradients.
Our proposed post-specialization approach is based on the principles of GANs, as it is composed of two elements: a generator network G and a discriminator network D. The gist of this concept is to improve the generated samples through a min-max game between the generator and the discriminator.
In our post-specialization model, a multi-layer feed-forward neural network, which trains a global mapping function, acts as the generator. Consequently, the generator is trained to produce predictions G(x_i) that are as similar as possible to the corresponding initially specialized word vectors y_i. Therefore, a global mapping function is trained using word vector pairs (x_i, y_i), such that G(x_i) ≈ y_i. On the other hand, the discriminator D, which is a multi-layer classification network, tries to distinguish the generated samples from the initially specialized vectors. In this process, the differences between predictions and initially specialized vectors are used to improve the generator, resulting in more realistic-looking outputs.
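The roles of the two networks can be illustrated with a stripped-down forward pass. Single linear layers stand in for the multi-layer networks of the model, and all weights and vectors here are toy values of ours:

```python
import math

def linear(W, x):
    """Single linear layer: one row of W per output dimension."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Generator G: maps a distributional vector into the specialized space.
W_g = [[1.0, 0.1], [0.0, 0.9]]
def G(x):
    return linear(W_g, x)

# Discriminator D: scores how "real" (initially specialized) a vector looks.
w_d = [0.5, -0.3]
def D(y):
    return sigmoid(sum(w * y_i for w, y_i in zip(w_d, y)))

x = [0.2, -0.4]      # distributional vector of a seen word
y = [0.25, -0.35]    # its initially specialized counterpart
fake, real = D(G(x)), D(y)

# G is trained to make D(G(x)) indistinguishable from D(y);
# D is trained to drive D(G(x)) down and D(y) up.
g_loss = -math.log(fake)
d_loss = -(math.log(real) + math.log(1.0 - fake))
assert g_loss > 0 and d_loss > 0
```

In training these two losses are minimized alternately, which is the min-max game described above.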
In general, for the GAN model we can define the loss of the generator as:

L_G = − Σ_i log D(G(x_i))

while the loss of the discriminator is given as:

L_D = − Σ_i log D(y_i) − Σ_i log(1 − D(G(x_i)))

In principle, the losses with the Wasserstein distance can be formulated as follows:

L_G = − Σ_i D(G(x_i)),    L_D = Σ_i D(G(x_i)) − Σ_i D(y_i)
An alternative scenario (WGAN-GP) adds a gradient penalty term λ (||∇_ŷ D(ŷ)||_2 − 1)^2 to the discriminator loss, where λ is the gradient penalty coefficient and ŷ is sampled between real and generated vectors.
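For a linear critic D(y) = w·y the input gradient ∇_ŷ D(ŷ) is simply w, so the gradient penalty has a closed form; the toy sketch below contrasts the Wasserstein critic loss with and without the penalty (made-up weights and vectors of ours, not the trained model):

```python
import math

w = [0.8, 0.6]                      # weights of a linear critic D(y) = w . y
def D(y):
    return sum(wi * yi for wi, yi in zip(w, y))

real = [1.0, 0.0]                   # initially specialized vector
fake = [0.7, 0.1]                   # generator output G(x)

# WGAN critic objective: maximize D(real) - D(fake),
# i.e. minimize D(fake) - D(real).
critic_loss = D(fake) - D(real)

# Gradient penalty: for a linear critic the gradient w.r.t. the input
# is w everywhere, so the penalty depends only on ||w||.
lam = 10.0
grad_norm = math.sqrt(sum(wi * wi for wi in w))
penalty = lam * (grad_norm - 1.0) ** 2
critic_loss_gp = critic_loss + penalty

assert abs(grad_norm - 1.0) < 1e-9  # this toy critic is already 1-Lipschitz
assert abs(penalty) < 1e-9          # so the penalty vanishes
```

With a multi-layer critic the gradient norm varies per input, which is why WGAN-GP evaluates it at points interpolated between real and generated samples.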
[Table 1: Number of synonym and antonym word pairs in the external and external + babelnet constraint sets for English, German and Italian.]
Pre-trained Word Embeddings.
In order to evaluate our proposed approach as well as to compare our results with respect to current state-of-the-art post-specialization approaches, we use popular and readily available 300-dimensional pre-trained word vectors. Word2Vec (Mikolov et al., 2013) embeddings for English were trained using skip-gram with negative sampling on the cleaned and tokenized Polyglot Wikipedia (Al-Rfou’ et al., 2013) by Levy and Goldberg (2014), while German and Italian embeddings were trained using CBOW with negative sampling on WacKy corpora (Dinu et al., 2015; Artetxe et al., 2017, 2018). Moreover, GloVe vectors for English were trained on Common Crawl (Pennington et al., 2014).
To perform semantic specialization of word vector spaces, we exploit linguistic constraints used in previous works (Zhang et al., 2014; Ono et al., 2015; Vulić et al., 2018) (referred to as external) as well as introduce a new set of constraints collected by us (referred to as babelnet) for three languages: English, German and Italian. We use constraints in two different settings: disjoint and overlap. In the first setting, we remove all linguistic constraints that contain any of the words available in SimLex (Hill et al., 2015), SimVerb (Gerz et al., 2016) and WordSim (Leviant and Reichart, 2015) evaluation datasets. In the overlap setting, we let the SimLex, SimVerb and WordSim words remain in the constraints. To summarize, we present the number of word pairs for English, German and Italian constraints in Table 1.
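The disjoint setting can be reproduced with a simple filter; the sketch below uses toy word pairs of ours rather than the actual constraint files:

```python
def filter_disjoint(constraints, eval_words):
    """Drop every constraint pair that mentions any evaluation word."""
    eval_words = set(eval_words)
    return [(a, b) for a, b in constraints
            if a not in eval_words and b not in eval_words]

synonyms = [("cheap", "inexpensive"), ("smart", "clever"), ("big", "large")]
eval_vocab = {"smart", "hard"}       # e.g. words from SimLex/SimVerb/WordSim

disjoint = filter_disjoint(synonyms, eval_vocab)
assert disjoint == [("cheap", "inexpensive"), ("big", "large")]
# In the overlap setting the constraints are left untouched.
```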
Let us discuss in more detail how the lists of constraints were constructed. In this work, we use two sets of linguistic constraints: external and babelnet. The first set of constraints was retrieved from WordNet (Fellbaum, 1998) and Roget’s Thesaurus (Kipfer, 2009), resulting in 1,023,082 synonymy and 380,873 antonymy word pairs. The second set of constraints, which is a part of our contribution, comprises synonyms and antonyms obtained using NASARI lexical embeddings (Camacho-Collados et al., 2016) and BabelNet (Navigli and Ponzetto, 2012). As NASARI provides lexical information for BabelNet words in five languages (EN, ES, FR, DE and IT), we collected each word with its related BabelNetID (a sense database identifier) to extract the list of its synonyms and antonyms using the BabelNet API.
Furthermore, to improve the list of Italian words, we also followed the approach proposed by Sucameli and Lenci (2017), who provide a dataset of semantically related Italian word pairs. The dataset includes nouns, adjectives and verbs with their synonyms, antonyms and hypernyms; the information was gathered by its authors through crowdsourcing from a pool of Italian native speakers. In this way, we could add these Italian word pairs to obtain a more complete list of synonyms and antonyms.
Similarly, we refer to the work of Scheible and Schulte im Walde (2014) that presents a new collection of semantically related word pairs in German, which was compiled through human evaluation. Relying on GermaNet and the respective JAVA API, the list of the word pairs was generated with a sampling technique. Finally, we used these word pairs in our experiments as external resources for the German language.
Initial Specialization and Post-Specialization.
Although initially specialized vector spaces show gains over non-specialized word embeddings, linguistic constraints represent only a fraction of their total vocabulary. Therefore, semantic specialization is a two-step process. First, we perform initial specialization of the pre-trained word vectors by means of the Attract-Repel algorithm (see §2). The hyperparameter values are set according to their defaults, with mini-batch sizes k_1 = k_2 = 50. Afterward, to specialize the entire vocabulary, a global specialization mapping function is learned. In our proposed WGAN-postspec approach, the post-specialization model uses a GAN with loss functions improved by means of the Wasserstein distance and gradient penalty. Importantly, the optimization process differs depending on the algorithm implemented in our model. In the case of a vanilla GAN (AuxGAN), standard stochastic gradient descent is used, while in the WGAN model we employ RMSProp (Tieleman and Hinton, 2012). Finally, in the case of WGAN-GP, the Adam optimizer (Kingma and Ba, 2015) is applied.
5.1 Word Similarity
We report our experimental results with respect to a common intrinsic word similarity task, using standard benchmarks: SimLex-999 and WordSim-353 for English, German and Italian, as well as SimVerb-3500 for English. Each dataset contains human similarity ratings, and we evaluate the similarity measure using the Spearman’s rank correlation coefficient. In Table 2, we present results for English benchmarks, whereas results for German and Italian are reported in Table 3.
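Spearman's rank correlation, used throughout this evaluation, compares the rankings induced by model scores and human ratings; a minimal implementation (ignoring tied ranks for simplicity; the rating values are toy examples of ours):

```python
def ranks(values):
    """1-based rank positions of each value (ties not averaged)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the squared rank-difference formula."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

human = [9.1, 7.5, 3.2, 1.0]          # e.g. similarity ratings
model = [0.81, 0.66, 0.40, 0.12]      # cosine similarities from embeddings
assert spearman(human, model) == 1.0  # perfectly monotonic agreement
```

Because only ranks matter, the measure rewards embeddings that order word pairs like humans do, regardless of the absolute similarity values.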
Word embeddings are evaluated in two scenarios: disjoint where words observed in the benchmark datasets are removed from the linguistic constraints; and overlap where all words provided in the linguistic constraints are utilized. We use the overlap setting in a downstream task (see §5.2). In the tasks we report scores for Original (non-specialized) word vectors, initial specialization method Attract-Repel (Mrkšić et al., 2017), and three post-specialization methods: DFFN (Vulić et al., 2018), AuxGAN (Ponti et al., 2018) and our proposed model WGAN-postspec (in two scenarios: WGAN and WGAN-GP).
The results suggest that the post-specialization methods bring improvements in the specialization of the distributional word vector space. Overall, the highest correlation scores are reported for the models with adversarial losses. We also observe that the proposed WGAN-postspec achieves fairly consistent correlation gains with GloVe vectors on the SimLex dataset. Interestingly, while exploiting additional constraints (i.e. external + babelnet) generally boosts correlation scores for German and Italian, the results are not conclusive in the case of English, and thus they require further investigation.
5.2 Dialog State Tracking
We also evaluate our proposed approach on a dialog state tracking (DST) downstream task. This task is a standard language understanding task, which makes it possible to differentiate between word similarity and relatedness. To perform the evaluation, we follow previous works (Henderson et al., 2014; Williams et al., 2016; Mrkšić et al., 2017). Concretely, a DST model computes the probability of a dialog state based only on pre-trained word embeddings. We use the Wizard-of-Oz (WOZ) v2.0 dataset (Wen et al., 2017; Mrkšić et al., 2017), composed of 600 training dialogues as well as 200 development and 400 test dialogues.
In our experiments, we report results with a standard joint goal accuracy (JGA) score. The results in Table 4 confirm our findings from the previous word similarity task, as initial semantic specialization and post-specialization (in particular WGAN-postspec) yield improvements over original distributional word vectors. We expect this conclusion to hold in all settings; however, additional experiments for different languages and word embeddings would be beneficial.
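Joint goal accuracy counts a turn as correct only when every slot in the predicted dialog state matches the gold state; a minimal version (toy dialog states of ours, not the WOZ annotation format):

```python
def joint_goal_accuracy(predicted, gold):
    """Fraction of turns whose full predicted state matches the gold state."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

gold = [{"food": "thai", "area": "north"},
        {"food": "thai", "area": "north", "price": "cheap"}]
pred = [{"food": "thai", "area": "north"},
        {"food": "thai", "area": "south", "price": "cheap"}]  # one slot wrong

assert joint_goal_accuracy(pred, gold) == 0.5
```

A single wrong slot thus invalidates the whole turn, which makes JGA a strict measure of how well the embeddings support the tracker.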
6 Conclusion and Future Work
In this work, we presented a method to perform semantic specialization of word vectors. Specifically, we compiled a new set of constraints obtained from BabelNet. Moreover, we improved a state-of-the-art post-specialization method by incorporating adversarial losses based on the Wasserstein distance. Our results, obtained in an intrinsic and an extrinsic task, suggest that our method yields performance gains over current methods.
In the future, we plan to introduce constraints for asymmetric relations as well as extend our proposed method to leverage them. Moreover, we plan to experiment with adapting our model to a multilingual scenario, to be able to use it in a neural machine translation task. We make the code and resources available at: https://github.com/mbiesialska/wgan-postspec
We thank the anonymous reviewers for their insightful comments. This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund through the postdoctoral senior grant Ramón y Cajal and by the Agencia Estatal de Investigación through the projects EUR2019-103819 and PCIN-2017-079.
- Polyglot: distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, pp. 183–192.
- Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 214–223.
- Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 451–462.
- Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5012–5019.
- NASARI: integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240, pp. 36–64.
- Improving zero-shot learning by mitigating the hubness problem. In Proceedings of ICLR.
- Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1606–1615.
- WordNet: an electronic lexical database. MIT Press.
- SimVerb-3500: a large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2173–2182.
- Explicit retrofitting of distributional word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 34–45.
- Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680.
- Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, pp. 5767–5777.
- The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), Philadelphia, PA, U.S.A., pp. 263–272.
- SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695.
- Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Roget’s 21st century thesaurus (3rd edition). Philip Lief Group.
- Separated by an un-common language: towards judgment language informed vector space modeling.
- Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Maryland, pp. 302–308.
- Distributed representations of words and phrases and their compositionality. In NIPS.
- Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 142–148.
- Neural belief tracker: data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1777–1788.
- Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics 5, pp. 309–324.
- BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193, pp. 217–250.
- Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 984–989.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- Adversarial propagation and zero-shot cross-lingual transfer of word vector specialization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 282–293.
- A database of paradigmatic semantic relation pairs for German nouns, verbs, and adjectives. In Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing, Dublin, Ireland, pp. 111–119.
- PARAD-it: eliciting Italian paradigmatic relations with crowdsourcing. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.
- Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4 (2), pp. 26–31.
- Post-specialisation: retrofitting vectors of words unseen in lexical resources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 516–527.
- A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 438–449.
- The dialog state tracking challenge series: a review. Dialogue & Discourse 7, pp. 4–33.
- Word semantic representations using Bayesian probabilistic tensor factorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1522–1531.