Word-embedding methods handle word semantics in natural language processing Mikolov et al. (2013a, b); Pennington et al. (2014); Vilnis and McCallum (2015); Bojanowski et al. (2017). Word-embedding models such as skip-gram with negative sampling (SGNS; Mikolov et al., 2013b) and GloVe Pennington et al. (2014) capture analogic relations such as $v_{king} - v_{man} + v_{woman} \approx v_{queen}$. Previous work Levy and Goldberg (2014b); Arora et al. (2016); Gittens et al. (2017); Ethayarajh et al. (2019); Allen and Hospedales (2019) offers theoretical explanations, based on Pointwise Mutual Information (PMI; Church and Hanks, 1990), of why word vectors maintain such analogic relations.
These relations can be used to transfer a certain attribute of a word, such as changing king into queen by transferring its gender. This transfer can be applied to data augmentation, for example, rewriting He is a boy to She is a girl, or to generating negative examples for natural language inference. We tackle a novel task that transfers any word associated with certain attributes: word attribute transfer.
A naive way to perform word attribute transfer is to use a difference vector based on analogic relations, such as adding $v_{woman} - v_{man}$ to $v_{king}$ to obtain a vector close to $v_{queen}$. This requires explicit knowledge of whether an input word is male or female; for the gender transfer, we have to add the difference vector to a male word and subtract it from a female word. We also have to avoid changing words without gender attributes, such as is and a in the example above, since they are non-attribute words. Developing such knowledge is very costly for many words and attributes in practice. In this work, we propose a novel framework for word attribute transfer based on reflection that does not require explicit knowledge of the given words in its prediction.
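As a concrete illustration, the difference-vector approach can be sketched with toy vectors. The embeddings below are hypothetical low-dimensional values for illustration only, not real word2vec or GloVe vectors:

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values for illustration;
# real models use e.g. 300-dimensional word2vec or GloVe vectors).
emb = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

# Difference vector encoding the male -> female direction.
d = emb["woman"] - emb["man"]

# Transferring "king" requires the explicit knowledge that it is a male
# word (so d is ADDED); a female word would instead need d subtracted.
v = emb["king"] + d
assert np.allclose(v, emb["queen"])
```

The need to decide add-versus-subtract per input word is exactly the "explicit knowledge" that the reflection-based method described later avoids.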
The contribution of this work is two-fold: (1) We propose a word attribute transfer method that obtains a vector with an inverted binary attribute without explicit knowledge. (2) The proposed method demonstrates more accurate word attribute transfer for words that have target attributes than other baselines without changing the words that do not have the target attributes.
2 Word Attribute Transfer Task
In this task, we focus on modeling binary attributes (e.g., male and female; gender-specific words are sometimes considered socially problematic, and we use them here only as an example of the man-woman relation). Let $w$ denote a word and let $v_w$ denote its vector representation. We assume that $v_w$ is learned in advance with an embedding model, such as skip-gram. The task takes two inputs, a word $w$ and a vector $z$ that represents a certain target attribute, and outputs word $t$ with the inverted attribute $z$ of $w$. In this paper, $z$ is a 300-dimensional vector embedded from a target attribute ID using an embedding function of a deep learning framework. For example, given a set of attributes, we assign different random vectors $z_{gender}$ for gender and $z_{antonym}$ for antonym, respectively. Let $D_z$ denote a set of triplets $(w, t, z)$, e.g., $(man, woman, z_{gender})$, and let $N_z$ denote a set of words without attribute $z$, e.g., person. This task transfers input word vector $v_w$ to target word vector $v_t$ by a transfer function $f$ that inverts attribute $z$ of $w$:

$$v_t = f(v_w, z)$$
The following properties must be satisfied: (1) attribute words are transferred to their counterparts, and (2) non-attribute words are not changed (transferred back into themselves). For instance, with $z = z_{gender}$, given input word man, gender attribute transfer should result in a vector close to $v_{woman}$. Given another input word person, which belongs to $N_{z_{gender}}$, the result should be close to $v_{person}$ itself.
3 Analogy-based Word Attribute Transfer
Analogy is a general idea that can be used for word attribute transfer. PMI-based word embeddings, such as SGNS and GloVe, capture analogic relations Mikolov et al. (2013c); Levy and Goldberg (2014a); Linzen (2016), including

$$v_{king} - v_{man} + v_{woman} \approx v_{queen} \quad \text{(Eq. 2)}$$

By rearranging Eq. 2, Eq. 3 is obtained:

$$v_{queen} - v_{king} \approx v_{woman} - v_{man} \quad \text{(Eq. 3)}$$
The analogy-based transfer function is

$$f_{analogy}(v_w, z) = \begin{cases} v_w + d & (w \in X_z) \\ v_w - d & (w \in Y_z) \\ v_w & (\text{otherwise}) \end{cases} \quad \text{(Eq. 4)}$$

where $X_z$ is a set of words with a target attribute (e.g., male), $Y_z$ is a set of words with the inverse attribute (e.g., female), and $d$ is a difference vector, such as $v_{woman} - v_{man}$. Eq. 4 indicates that the operation changes depending on whether the input word $w$ belongs to $X_z$ or $Y_z$. However, to transfer the word attribute by analogy, we need explicit knowledge of the attribute value of the input word ($X_z$, $Y_z$, or neither).
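A minimal sketch of this analogy-based transfer, assuming toy embeddings and hand-built attribute sets standing in for the "explicit knowledge":

```python
import numpy as np

# Toy embeddings and explicit attribute knowledge (hypothetical values).
emb = {"man": np.array([1.0, 0.0]),
       "woman": np.array([0.0, 1.0]),
       "person": np.array([0.5, 0.5])}
male, female = {"man"}, {"woman"}

def analogy_transfer(v_w, word, d, male_words, female_words):
    """Analogy-based transfer (sketch of Eq. 4): the operation depends on
    knowing which attribute set the input word belongs to."""
    if word in male_words:
        return v_w + d      # male word: add the male -> female direction
    if word in female_words:
        return v_w - d      # female word: subtract it
    return v_w              # non-attribute word: leave unchanged

d = emb["woman"] - emb["man"]
assert np.allclose(analogy_transfer(emb["man"], "man", d, male, female),
                   emb["woman"])
assert np.allclose(analogy_transfer(emb["person"], "person", d, male, female),
                   emb["person"])
```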
4 Reflection-based Word Attribute Transfer
4.1 Ideal Transfer without Knowledge
What is the ideal transfer function $f$ for word attribute transfer? An ideal transfer function should satisfy the following:

$$f(v_w, z) = v_t \quad \text{(Eq. 5)}$$

$$f(v_t, z) = v_w \quad \text{(Eq. 6)}$$

$$f(v_u, z) = v_u \quad (u \in N_z) \quad \text{(Eq. 7)}$$
Such a function enables a word to be transferred without explicit knowledge, because the operation does not change depending on whether input word $w$ belongs to $X_z$ or $Y_z$. By combining Eqs. 5, 6 and 7, we obtain the following formulas:

$$f(f(v_w, z), z) = v_w \quad \text{(Eq. 8)}$$

$$f(f(v_t, z), z) = v_t \quad \text{(Eq. 9)}$$

$$f(f(v_u, z), z) = v_u \quad \text{(Eq. 10)}$$

that is, for any input vector $v$,

$$f(f(v, z), z) = v \quad \text{(Eq. 11)}$$
Hence, the ideal transfer function is a mapping that becomes an identity mapping when applied twice to any $v$. Such a mapping is called an involution in geometry; $f(x) = -x$ is one example of an involution.
4.2 Reflection

Reflection is an ideal transfer function in this sense because this mapping is an involution:

$$\mathrm{Ref}(\mathrm{Ref}(x)) = x$$
Reflection swaps the locations of two vectors in a Euclidean space through a hyperplane called a mirror. Reflection differs from a general inverse mapping: when $w$ and $t$ are paired words, a single reflection mapping transfers $v_w$ and $v_t$ to each other, satisfying Eqs. 5 and 6, which an inverse mapping alone cannot guarantee. Given vector $x$ in Euclidean space, the formula for the reflection through the mirror is

$$\mathrm{Ref}_{a,c}(x) = x - 2\,\frac{(x - c) \cdot a}{a \cdot a}\,a \quad \text{(Eq. 12)}$$

where $a$ is a vector orthogonal to the mirror and $c$ is a point through which the mirror passes; $a$ and $c$ are the parameters that determine the mirror.
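The involution and fixed-point properties of this reflection are easy to check numerically; the sketch below uses random 300-dimensional vectors rather than real embeddings:

```python
import numpy as np

def reflect(x, a, c):
    """Reflection of x through the hyperplane (mirror) with normal vector a
    passing through point c: Ref(x) = x - 2 * ((x - c).a / (a.a)) * a."""
    return x - 2.0 * np.dot(x - c, a) / np.dot(a, a) * a

rng = np.random.default_rng(0)
x = rng.normal(size=300)
a = rng.normal(size=300)
c = rng.normal(size=300)

# Reflection is an involution: applying it twice is the identity map.
assert np.allclose(reflect(reflect(x, a, c), a, c), x)

# A point on the mirror (here, c itself) is a fixed point of the mapping.
assert np.allclose(reflect(c, a, c), c)
```

The fixed-point check previews why non-attribute words should lie on (or near) the mirror: vectors on the mirror are left unchanged by the transfer.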
4.3 Proposed method: Reflection-based Word Attribute Transfer
We apply reflection to the word attribute transfer. We learn a mirror (hyperplane) in a pre-trained embedding space using training word pairs with a binary attribute (Fig. 2). Since the mirror is uniquely determined by the two parameter vectors $a$ and $c$, we estimate $a$ and $c$ from target attribute vector $z$ using fully connected multi-layer perceptrons (MLPs):

$$a = \mathrm{MLP}_a(z; \theta_a) \quad \text{(Eq. 13)}$$

$$c = \mathrm{MLP}_c(z; \theta_c) \quad \text{(Eq. 14)}$$

where $\theta_a$ and $\theta_c$ are the sets of trainable parameters of the MLPs, optimized for each attribute dataset. Transferred vector $\hat{v}_t$ is obtained by inverting attribute $z$ of $v_w$ by reflection:

$$\hat{v}_t = \mathrm{Ref}_{a,c}(v_w) \quad \text{(Eq. 15)}$$
Reflection with a mirror given by Eqs. 13 and 14 assumes a single mirror that depends only on $z$. The discussion so far assumed stable word pairs, such as king and queen. However, since gendered words often do not come in such pairs, gender is not stable enough to be modeled by a single mirror. For example, although actress is exclusively feminine, actor is gender-neutral in many cases; actor is thus not as clear a masculine counterpart as king. In fact, bias exists in gender words in the embedding space Zhao et al. (2018); Kaneko and Bollegala (2019), and this phenomenon can occur not only with gender but also with other attributes. The single-mirror assumption forces the mirror to be a hyperplane that passes through the midpoints of all the word vector pairs. However, the vector pair actor-actress shown on the right in Fig. 3 cannot be transferred well, since the single mirror (the green line) does not satisfy this constraint due to the bias of the embedding space. To solve this problem, we propose parameterized mirrors, based on the idea of using different mirrors for different words. We define mirror parameters $a$ and $c$ using the word vector $v_w$ to be transferred in addition to attribute vector $z$:

$$a = \mathrm{MLP}_a([z; v_w]; \theta_a) \quad \text{(Eq. 16)}$$

$$c = \mathrm{MLP}_c([z; v_w]; \theta_c) \quad \text{(Eq. 17)}$$
where $[\cdot\,;\cdot]$ indicates vector concatenation along the column. Parameterized mirrors are expected to work more flexibly across different words than a single mirror, because they dynamically determine similar mirrors for similar words. For instance, as shown in Fig. 3, suppose we learned the mirror (the blue line) that transfers king to queen in advance. If the input word vector resembles $v_{king}$, a mirror resembling the one for king should be derived and used for the transfer.
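A forward pass of the parameterized-mirror transfer can be sketched as follows. The one-hidden-layer MLPs, their widths, and the random untrained weights are assumptions for illustration; only the overall flow (concatenate $[z; v_w]$, predict $a$ and $c$, then reflect) follows the method described above:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 300

# Hypothetical one-hidden-layer MLPs (random, untrained weights) mapping
# the concatenation [z; v_w] to mirror parameters a and c. The actual
# layer sizes used in the paper are not reproduced here.
W1_a, W2_a = rng.normal(0, 0.1, (dim, 2 * dim)), rng.normal(0, 0.1, (dim, dim))
W1_c, W2_c = rng.normal(0, 0.1, (dim, 2 * dim)), rng.normal(0, 0.1, (dim, dim))

def relu(x):
    return np.maximum(x, 0.0)

def mirror_params(z, v_w):
    h = np.concatenate([z, v_w])   # column-wise concatenation [z; v_w]
    a = W2_a @ relu(W1_a @ h)      # word-dependent mirror normal (Eq. 16)
    c = W2_c @ relu(W1_c @ h)      # word-dependent mirror point (Eq. 17)
    return a, c

def transfer(v_w, z):
    # Reflect v_w through the mirror predicted for this particular word.
    a, c = mirror_params(z, v_w)
    return v_w - 2.0 * np.dot(v_w - c, a) / np.dot(a, a) * a

v_w, z = rng.normal(size=dim), rng.normal(size=dim)
assert transfer(v_w, z).shape == (dim,)
```

In a real implementation the MLP weights would be trained with the loss described below rather than drawn at random.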
On the other hand, reflection works as an identity mapping for any vector on the mirror (e.g., $v_{person}$ in Fig. 3). That is, the proposed method assumes that non-attribute word vectors are located on the mirror. Since we used a 300-dimensional embedding space in the experiments, the non-attribute word vectors are assumed to lie in a 299-dimensional subspace.
Note that Eq. 11 may not hold for parameterized mirrors. In reflection with a single mirror, $\mathrm{Ref}_{a,c}(\mathrm{Ref}_{a,c}(x)) = x$ always holds. However, with the parameterized reflection this is not guaranteed, because the mirror parameters $a$ and $c$ depend on the input word vector (Eqs. 16 and 17). Thus, we exclude this constraint and employ only the constraints given by Eqs. 5-7 for our loss function.
The following properties must be satisfied in word attribute transfer: (1) words with attribute $z$ are transferred and (2) words without it are not. Thus, the loss is defined as:

$$L = \underbrace{\sum_{(w, t, z) \in D_z} \| v_t - \hat{v}_t \|^2}_{\text{(Eq. 18)}} \;+\; \underbrace{\sum_{u \in N_z} \| v_u - \hat{v}_u \|^2}_{\text{(Eq. 19)}}$$

where Eq. 18 is a term that draws target word vector $v_t$ closer to the corresponding transferred vector $\hat{v}_t$, Eq. 19 is a term that prevents words without the target attribute from being moved by the transfer function, and $\hat{v}_t$, $\hat{v}_u$ are outputs of the reflection (Eq. 15).
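The two loss terms can be sketched as below. Using the squared Euclidean distance is an assumption of this sketch, and `f` stands for any transfer function, such as the reflection above:

```python
import numpy as np

def transfer_loss(f, z, pairs, non_attr):
    """Training loss sketch. The pair term (cf. Eq. 18) pulls each
    transferred vector toward its target; the stability term (cf. Eq. 19)
    penalizes moving non-attribute words. The squared Euclidean distance
    is an assumption here, not taken from the paper."""
    pair_term = sum(np.sum((v_t - f(v_w, z)) ** 2) for v_w, v_t in pairs)
    stab_term = sum(np.sum((v_u - f(v_u, z)) ** 2) for v_u in non_attr)
    return pair_term + stab_term

# With an identity "transfer", the stability term vanishes and only the
# pair term remains: (0-1)^2 + (1-0)^2 = 2.
z = np.zeros(2)
identity = lambda v, z: v
pairs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))]
non_attr = [np.array([0.5, 0.5])]
assert transfer_loss(identity, z, pairs, non_attr) == 2.0
```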
5 Experiments

We evaluated the performance of the word attribute transfer on data with four different attributes. We used 300-dimensional word2vec and GloVe vectors as the pre-trained word embeddings. We used four datasets of word pairs with binary attributes: Male-Female, Singular-Plural, Capital-Country, and Antonym (Table 1). These word pairs were collected from analogy test sets Mikolov et al. (2013a); Gladkova et al. (2016) and the Internet. Noun antonyms were taken from the literature Nguyen et al. (2017). For the non-attribute dataset $N_z$, we sampled words from the vocabulary of the word embedding: from 4 to 50 words for training and 1000 words for the test.
5.1 Evaluation Metrics
We measured the accuracy and stability of the word attribute transfer. Accuracy measures how many input words in $D_z$ were transferred correctly to the corresponding target words. Stability measures how many words in $N_z$ were not mapped to other words. For example, in the Male-Female transfer, given man, the transfer is regarded as correct if woman is the closest word to the transferred vector; given person, the transfer is regarded as correct if person itself is the closest word to the transferred vector. The accuracy and stability scores are calculated by the following formulas:

$$\mathrm{Accuracy} = \frac{1}{|D_z^{test}|} \sum_{(w, t, z) \in D_z^{test}} \mathbb{1}\!\left[\, t = \operatorname*{argmax}_{w' \in V} \cos(v_{w'}, f(v_w, z)) \,\right]$$

$$\mathrm{Stability} = \frac{1}{|N_z^{test}|} \sum_{u \in N_z^{test}} \mathbb{1}\!\left[\, u = \operatorname*{argmax}_{w' \in V} \cos(v_{w'}, f(v_u, z)) \,\right]$$
where $V$ is the vocabulary of the word embedding model and $\cos(\cdot, \cdot)$ is the cosine similarity measure, defined as $\cos(x, y) = \dfrac{x \cdot y}{\|x\|\,\|y\|}$.
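These metrics can be sketched with a nearest-neighbor search over a toy vocabulary; the embeddings and the identity "transfer" below are placeholders for illustration:

```python
import numpy as np

def cos(x, y):
    # Cosine similarity between two vectors.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def nearest_word(v, emb):
    # Word in the vocabulary whose vector is most cosine-similar to v.
    return max(emb, key=lambda w: cos(emb[w], v))

def accuracy(f, z, test_pairs, emb):
    # Fraction of attribute words transferred to their exact counterpart.
    return np.mean([nearest_word(f(emb[w], z), emb) == t
                    for w, t in test_pairs])

def stability(f, z, non_attr_words, emb):
    # Fraction of non-attribute words mapped back to themselves.
    return np.mean([nearest_word(f(emb[u], z), emb) == u
                    for u in non_attr_words])

emb = {"man": np.array([1.0, 0.1]),
       "woman": np.array([0.1, 1.0]),
       "person": np.array([1.0, 1.0])}
identity = lambda v, z: v

# An identity mapping is perfectly stable but never transfers anything.
assert stability(identity, None, ["person"], emb) == 1.0
assert accuracy(identity, None, [("man", "woman")], emb) == 0.0
```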
5.2 Methods and Configurations
In our experiment, we compared our proposed method with the following baseline methods (our code and datasets are available at: https://github.com/ahclab/reflection):
- Ref: Reflection-based word attribute transfer with a single mirror (Section 4.3).
- RefPM: Reflection-based word attribute transfer with parameterized mirrors. We used the same MLP architecture as for Ref.
- MLP: A fully connected MLP with 300 hidden units and ReLU activations that directly outputs the transferred vector. The highest-accuracy models with SGNS were a 2-layer MLP for Capital-Country and a 3-layer MLP for the other datasets; with GloVe, a 2-layer MLP for Singular-Plural and a 3-layer MLP for the other datasets.
- Diff: Analogy-based word attribute transfer with a difference vector $d = v_y - v_x$, where $x$ and $y$ are a word pair in the training data of $D_z$. We chose the pair that achieved the best accuracy on the validation data of $D_z$, and determined whether to add or subtract $d$ to $v_w$ based on the explicit knowledge (Eq. 4). Variants of Diff transfer with the difference vector regardless of the explicit knowledge, always adding or always subtracting it for any input word vector.
- Analogy-based word attribute transfer with a mean difference vector $\bar{d} = \frac{1}{|D_z|} \sum_{(x, y) \in D_z} (v_y - v_x)$. We determined whether to add or subtract $\bar{d}$ to $v_w$ based on the explicit knowledge (Eq. 4).
For the proposed methods, we used the Adam optimizer Kingma and Ba (2015), with one learning rate for Male-Female, Singular-Plural, and Capital-Country and another for Antonym (the other hyperparameters were the same as in the original paper Kingma and Ba (2015)). We did not use regularization methods such as dropout Srivastava et al. (2014) or batch normalization Ioffe and Szegedy (2015) because they did not show any improvement in our pilot test. We implemented Ref, RefPM, and MLP with Chainer Tokui et al. (2019).
5.3 Evaluation in Accuracy and Stability
Table 2 shows the accuracy and stability results. Different pre-trained word embeddings (word2vec or GloVe) gave similar results. RefPM achieved the best accuracy among the methods that did not use explicit attribute knowledge; for example, the accuracy of RefPM was 76% on Capital-Country, while the accuracy of Diff was 26%. For stability, the reflection-based transfers achieved outstanding scores that exceeded 99%. These results show that our proposed method transfers an input word if it has the target attribute and otherwise leaves it unchanged, with better scores than the baselines, even though it uses no attribute knowledge of the input words. MLP performed poorly in both accuracy and stability. On the Antonym dataset, although the transfer accuracy of the proposed method was slightly lower than that of MLP, the proposed method's stability was 100% while that of MLP was extremely poor, at about 1%.
We investigated the relation between the training data size of the non-attribute words and the stability of the learning-based methods by conducting an additional experiment that varied $|N_z|$. The stability scores of MLP did not improve (Table 3). On the other hand, RefPM achieved high stability scores even with small $|N_z|$ and maintained its accuracy. We hypothesized that the high stability comes from the distance between a word and its mirror: if non-attribute words are distributed on the mirror, they will not be transferred. We therefore investigated the distance between input word vector $v_w$ and its mirror (Fig. 4). The results show that non-attribute words lie close to the mirror, while attribute words are distributed away from it. In Male-Female and Singular-Plural, attribute words are not as far from the mirror as in Antonym and Capital-Country; if the distance between paired words is very small, the distance between each word and its mirror is also small. Fig. 5 shows the distribution of the distance between input word vector $v_w$ and target word vector $v_t$: the distances for Male-Female and Singular-Plural are much smaller than those for Capital-Country and Antonym.
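The word-to-mirror distance used in this analysis is the standard point-to-hyperplane distance, which can be computed as follows (toy 2-dimensional vectors for illustration):

```python
import numpy as np

def mirror_distance(v, a, c):
    """Point-to-hyperplane distance |(v - c) . a| / ||a|| between a word
    vector v and the mirror with normal a passing through point c."""
    return abs(np.dot(v - c, a)) / np.linalg.norm(a)

# A vector lying on the mirror has distance 0 and is a fixed point of the
# reflection, which is why non-attribute words near the mirror stay put.
a = np.array([1.0, 0.0])   # mirror normal: the mirror is the y-axis
c = np.array([0.0, 0.0])   # a point on the mirror

assert mirror_distance(np.array([0.0, 3.0]), a, c) == 0.0  # on the mirror
assert mirror_distance(np.array([2.0, 1.0]), a, c) == 2.0  # 2 units away
```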
5.4 Visualization of Parameterized Mirrors
Figure 6 shows the t-SNE visualization of the mirror parameters obtained for the test words. The mirror parameters of each paired word are connected by a line segment. Fig. 6 suggests that the mirror parameters of paired words resemble each other and that those with the same attribute form a cluster; that is, words with the same attribute have similar mirror parameters.
5.5 Transfer Example
Table 4 shows the gender transfer results for a small example sentence, where the attribute transfer was applied to every word in the sentence. MLP made many wrong transfers, and analogy-based transfers can transfer in only one direction, whereas RefPM transferred only the attribute words. Table 5 shows that words with different target attributes were transferred by each reflection-based transfer.
6 Related Work
The theory of analogic relations in word embeddings has been widely discussed (Levy and Goldberg, 2014b; Arora et al., 2016; Gittens et al., 2017; Ethayarajh et al., 2019; Allen and Hospedales, 2019; Linzen, 2016). In our work, we focus on the analogic relations in a word embedding space and propose a novel framework to obtain a word vector with inverted attributes. The style transfer task (Niu et al., 2018; Prabhumoye et al., 2018; Logeswaran et al., 2018; Jain et al., 2019; Dai et al., 2019; Lample et al., 2019) resembles ours. In style transfer, the text style of input sentences is changed; for instance, Jain et al. (2019) transferred formal sentences to informal ones. Style transfer tasks use sentence pairs and change sentence styles, whereas our word attribute transfer task uses word pairs and changes word attributes. Soricut and Och (2015) studied morphological transformation based on character information. Our work aims at more general attribute transfer, such as gender and antonym transfer, and is not limited to morphological transformation.
This research aims to transfer binary word attributes (e.g., gender) for applications such as sentence-level data augmentation. Word attributes can be transferred with the analogy of word vectors, but this requires explicit knowledge of whether the input word has the attribute (e.g., man is male, woman is female, person has no gender). The proposed method transfers binary word attributes using reflection-based mappings and keeps non-attribute words unchanged, without requiring attribute knowledge at inference time. The experimental results showed that the proposed method outperforms analogy-based and MLP baselines in transfer accuracy for attribute words and in stability for non-attribute words.
- C. Allen and T. Hospedales (2019). Analogies Explained: Towards Understanding Word Embeddings. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), pp. 223–231.
- S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski (2016). A Latent Variable Model Approach to PMI-based Word Embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399.
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
- K. W. Church and P. Hanks (1990). Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16(1), pp. 22–29.
- N. Dai, J. Liang, X. Qiu, and X. Huang (2019). Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation. In Proceedings of ACL 2019, pp. 5997–6007.
- K. Ethayarajh, D. Duvenaud, and G. Hirst (2019). Towards Understanding Linear Word Analogies. In Proceedings of ACL 2019, pp. 3253–3262.
- A. Gittens, D. Achlioptas, and M. W. Mahoney (2017). Skip-Gram − Zipf + Uniform = Vector Additivity. In Proceedings of ACL 2017, pp. 69–76.
- A. Gladkova, A. Drozd, and S. Matsuoka (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the NAACL Student Research Workshop (SRW@HLT-NAACL 2016), pp. 8–15.
- X. Glorot, A. Bordes, and Y. Bengio (2011). Deep Sparse Rectifier Neural Networks. In Proceedings of AISTATS 2011, pp. 315–323.
- S. Ioffe and C. Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of ICML 2015, pp. 448–456.
- P. Jain, A. Mishra, A. P. Azad, and K. Sankaranarayanan (2019). Unsupervised Controllable Text Formalization. In Proceedings of AAAI 2019, pp. 6554–6561.
- M. Kaneko and D. Bollegala (2019). Gender-preserving Debiasing for Pre-trained Word Embeddings. In Proceedings of ACL 2019, pp. 1641–1650.
- D. P. Kingma and J. Ba (2015). Adam: A Method for Stochastic Optimization. In Proceedings of ICLR 2015.
- G. Lample, S. Subramanian, E. M. Smith, L. Denoyer, M. Ranzato, and Y. Boureau (2019). Multiple-Attribute Text Rewriting. In Proceedings of ICLR 2019.
- O. Levy and Y. Goldberg (2014a). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of CoNLL 2014, pp. 171–180.
- O. Levy and Y. Goldberg (2014b). Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pp. 2177–2185.
- T. Linzen (2016). Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP (RepEval@ACL 2016), pp. 13–18.
- L. Logeswaran, H. Lee, and S. Bengio (2018). Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 5108–5118.
- T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a). Efficient Estimation of Word Representations in Vector Space. In ICLR 2013 Workshop Track.
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013b). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pp. 3111–3119.
- T. Mikolov, W. Yih, and G. Zweig (2013c). Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL-HLT 2013, pp. 746–751.
- K. A. Nguyen, S. Schulte im Walde, and N. T. Vu (2017). Distinguishing Antonyms and Synonyms in a Pattern-based Neural Network. In Proceedings of EACL 2017, pp. 76–85.
- X. Niu, S. Rao, and M. Carpuat (2018). Multi-Task Neural Models for Translating Between Styles Within and Across Languages. In Proceedings of COLING 2018, pp. 1008–1021.
- J. Pennington, R. Socher, and C. D. Manning (2014). GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP 2014, pp. 1532–1543.
- S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018). Style Transfer Through Back-Translation. In Proceedings of ACL 2018, pp. 866–876.
- R. Soricut and F. Och (2015). Unsupervised Morphology Induction Using Word Embeddings. In Proceedings of NAACL-HLT 2015, pp. 1627–1637.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.
- S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Yamazaki Vincent (2019). Chainer: A Deep Learning Framework for Accelerating the Research Cycle. In Proceedings of KDD 2019, pp. 2002–2011.
- L. Vilnis and A. McCallum (2015). Word Representations via Gaussian Embedding. In Proceedings of ICLR 2015.
- J. Zhao, Y. Zhou, Z. Li, W. Wang, and K. Chang (2018). Learning Gender-Neutral Word Embeddings. In Proceedings of EMNLP 2018, pp. 4847–4853.