Word embeddings have become an integral part of modern NLP. They capture semantic and syntactic similarities and are typically used as features in training NLP models for diverse tasks like named entity tagging, sentiment analysis, and classification, to name a few. Word embeddings are learnt in an unsupervised manner from large text corpora, and a number of pre-trained embeddings are readily available. The quality of the word embeddings, however, depends on various factors like the size and genre of the training corpora as well as the training method used. This has led to ensemble approaches for creating meta-embeddings from different original embeddings (Yin and Schütze, 2016; Coates and Bollegala, 2018; Bao and Bollegala, 2018; O'Neill and Bollegala, 2020). Meta-embeddings are appealing because: (a) they can improve the quality of embeddings on account of noise cancellation and the diversity of data sources and algorithms, (b) they require no retraining of the original models, (c) they can be built even when the original corpus is not available, and (d) they may increase vocabulary coverage.
Various approaches have been proposed to learn meta-embeddings; they can be broadly classified into two categories: (a) simple linear methods like averaging, concatenation, or a low-dimensional projection via singular value decomposition (Yin and Schütze, 2016; Coates and Bollegala, 2018), and (b) non-linear methods that aim to learn meta-embeddings as a shared representation using auto-encoding or transformations between a common representation and each embedding set (Muromägi et al., 2017; Bollegala et al., 2018; Bao and Bollegala, 2018; O'Neill and Bollegala, 2020).
In this work, we focus on simple linear methods such as averaging and concatenation for computing meta-embeddings, which are very easy to implement and have shown highly competitive performance (Yin and Schütze, 2016; Coates and Bollegala, 2018). Due to the nature of the underlying embedding generation algorithms (Mikolov et al., 2013; Pennington et al., 2014), correspondences between dimensions, e.g., of two embeddings $u_w$ and $v_w$ of the same word $w$, are usually not known. Hence, averaging may be detrimental in cases where the dimensions are negatively correlated. Consider the scenario where $v_w = -u_w$. Here, simple averaging of $u_w$ and $v_w$ would result in the zero vector. Similarly, when $v_w$ is a (dimension-wise) permutation of $u_w$, simple averaging would result in a sub-optimal meta-embedding vector compared to averaging aligned embeddings. Therefore, we propose to align the embeddings (of a given word) as an important first step towards generating meta-embeddings.
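As a concrete illustration, the following NumPy sketch (with a hypothetical 4-dimensional embedding and a hypothetical permutation) shows that plain averaging of a permuted copy degrades the vector, while averaging after undoing the permutation, i.e., after an orthogonal alignment, is lossless:

```python
import numpy as np

# Two toy 4-dimensional embeddings of the same word. v carries the
# same information as u, but with its coordinates reordered.
u = np.array([0.9, -0.2, 0.4, -0.7])
perm = [2, 0, 3, 1]                # hypothetical dimension permutation
v = u[perm]

# Plain averaging mixes unrelated coordinates.
plain_avg = (u + v) / 2.0

# Aligning v back to u's coordinate system first (by the inverse
# permutation, which is an orthogonal transformation) recovers u,
# so the aligned average is lossless.
P = np.eye(4)[perm]                # permutation matrix with P @ u == v
aligned_avg = (u + P.T @ v) / 2.0  # P.T is the inverse of P

print(np.allclose(aligned_avg, u))  # aligned average preserves u
print(np.allclose(plain_avg, u))    # plain average does not
```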
To this end, we develop a geometric framework for learning meta-embeddings, by aligning different embeddings in a common latent space, where the dimensions of different embeddings (of a given word) are in coherence. Mathematically, we perform different orthogonal transformations of the source embeddings to learn a latent space along with a Mahalanobis metric that scales different features appropriately. The meta-embeddings are, subsequently, learned in the latent space, e.g., using averaging or concatenation. Empirical results on the word similarity and the word analogy tasks show that the proposed geometrically aligned meta-embeddings outperform strong baselines such as the plain averaging and the plain concatenation models.
2 Proposed Geometric Modeling
Consider two (monolingual) embeddings $u_w$ and $v_w$ of a given word $w$ in a $d$-dimensional space. As discussed earlier, embeddings generated from different algorithms (Mikolov et al., 2013; Pennington et al., 2014) may express different characteristics (of the same word). Hence, the goal of learning a meta-embedding $m_w$ (corresponding to word $w$) is to generate a representation that inherits the properties of the different source embeddings (e.g., $u_w$ and $v_w$).
Our framework imposes orthogonal transformations on the given source embeddings to enable alignment. In this latent space, we additionally induce a Mahalanobis metric $B$ to incorporate the feature correlation information (Jawanpuria et al., 2019). The Mahalanobis similarity generalizes the cosine similarity measure, which is commonly used for evaluating the relatedness between word embeddings. The combination of orthogonal transformation and Mahalanobis metric learning allows capturing any affine relationship between the different available source embeddings of a given word (Bonnabel and Sepulchre, 2009; Mishra et al., 2014).
Overall, we formulate the problem of learning geometric transformations – the orthogonal rotations and the metric scaling – via a binary classification problem. The meta-embeddings are subsequently computed using these transformations. The following sections formalize the proposed latent space and meta-embedding models.
2.1 Learning the Latent Space
In this section, we learn the latent space using geometric transformations.
Let $W_1$ and $W_2 \in \mathcal{O}(d)$ be orthogonal transformations for embeddings $u_w$ and $v_w$, respectively, for all words $w$. Here $\mathcal{O}(d)$ represents the set of $d \times d$ orthogonal matrices. The aligned embeddings in the latent space corresponding to $u_w$ and $v_w$ can then be expressed as $W_1 u_w$ and $W_2 v_w$, respectively. We next induce the Mahalanobis metric $B$ in this (aligned) latent space, where $B$ is a symmetric positive-definite matrix. In this latent space, the similarity between the two embeddings $u_w$ and $v_w$ is given by the following expression: $(W_1 u_w)^\top B (W_2 v_w)$. An equivalent interpretation is that the expression boils down to the standard scalar product (cosine similarity) between $B^{1/2} W_1 u_w$ and $B^{1/2} W_2 v_w$, where $B^{1/2}$ denotes the matrix square root of the symmetric positive-definite matrix $B$.
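This latent-space similarity can be sketched as follows; the matrices here are randomly generated stand-ins for the learned transformations, and the sketch verifies that the two equivalent forms of the similarity agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Toy source embeddings of the same word (illustrative values).
u = rng.standard_normal(d)
v = rng.standard_normal(d)

# Random orthogonal transformations W1, W2 (QR of a random matrix).
W1, _ = np.linalg.qr(rng.standard_normal((d, d)))
W2, _ = np.linalg.qr(rng.standard_normal((d, d)))

# A symmetric positive-definite Mahalanobis metric B = A A^T + I.
A = rng.standard_normal((d, d))
B = A @ A.T + np.eye(d)

# Similarity in the latent space: (W1 u)^T B (W2 v).
sim = (W1 @ u) @ B @ (W2 @ v)

# Equivalent view: scalar product of B^{1/2} W1 u and B^{1/2} W2 v,
# where B^{1/2} is the symmetric matrix square root of B.
w_eig, V_eig = np.linalg.eigh(B)
B_half = V_eig @ np.diag(np.sqrt(w_eig)) @ V_eig.T
sim_sqrt = (B_half @ W1 @ u) @ (B_half @ W2 @ v)

print(np.isclose(sim, sim_sqrt))  # True: the two forms agree
```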
The orthogonal transformations as well as the Mahalanobis metric are learned via the following binary classification problem: pairs $(u_w, v_w)$ of word embeddings of the same word $w$ belong to the positive class while pairs $(u_w, v_{w'})$ belong to the negative class (for $w \neq w'$). We consider the similarity between the two embeddings in the latent space as the decision function of the proposed binary classification problem. Let $U$ and $V$ be the $d \times n$ word embedding matrices for $n$ words, where the columns correspond to different words. In addition, let $Y$ denote the $n \times n$ label matrix, where $Y_{ij} = 1$ for $i = j$ and $Y_{ij} = -1$ for $i \neq j$. The proposed optimization problem employs the simple-to-optimize square loss function:

$$\min_{\substack{W_1, W_2 \in \mathcal{O}(d) \\ B \succ 0}} \; \left\lVert U^\top W_1^\top B\, W_2 V - Y \right\rVert_F^2 + \lambda \lVert B \rVert_F^2, \tag{1}$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm and $\lambda$ is the regularization parameter.
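A minimal NumPy sketch of this objective, assuming the squared-loss form described above (the embedding matrices and the regularization weight are illustrative):

```python
import numpy as np

def objective(W1, W2, B, U, V, Y, lam):
    """Squared-loss objective of problem (1):
    || U^T W1^T B W2 V - Y ||_F^2 + lam * ||B||_F^2.
    U, V are d x n matrices whose columns are word embeddings;
    Y is the n x n label matrix (+1 on the diagonal, -1 off it)."""
    S = U.T @ W1.T @ B @ (W2 @ V)   # n x n latent-space similarity matrix
    return (np.linalg.norm(S - Y, 'fro') ** 2
            + lam * np.linalg.norm(B, 'fro') ** 2)

# Tiny sanity check with identity transformations and V = U.
rng = np.random.default_rng(1)
d, n = 4, 6
U = rng.standard_normal((d, n))
Y = 2 * np.eye(n) - 1               # +1 for same word, -1 otherwise
val = objective(np.eye(d), np.eye(d), np.eye(d), U, U, Y, lam=0.1)
print(val >= 0.0)                   # the objective is non-negative
```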
2.2 Averaging and Concatenation in Latent Space
Meta-embeddings constructed by averaging or concatenating the given word embeddings have been shown to obtain highly competitive performance (Yin and Schütze, 2016; Coates and Bollegala, 2018). Hence, we propose to learn meta-embeddings as averaging or concatenation in the learned latent space.
The meta-embedding $m_w$ of a word $w$ is generated as an average of the (aligned) word embeddings in the latent space. The latent space representation of $u_w$, as a function of orthogonal transformation $W_1$ and metric $B$, is $B^{1/2} W_1 u_w$ (Jawanpuria et al., 2019). Hence, we obtain $m_w = \frac{1}{2}\left(B^{1/2} W_1 u_w + B^{1/2} W_2 v_w\right)$.
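A small sketch of this geometry-aware averaging, with `B_half` standing for the matrix square root $B^{1/2}$ (all inputs are illustrative); with identity transformations it reduces to plain averaging:

```python
import numpy as np

def geo_avg(u, v, W1, W2, B_half):
    """Geometry-aware average: mean of the aligned latent embeddings
    B^{1/2} W1 u and B^{1/2} W2 v."""
    return 0.5 * (B_half @ W1 @ u + B_half @ W2 @ v)

# With W1 = W2 = B = I, geometry-aware averaging coincides with
# plain averaging of the source embeddings.
d = 3
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
I = np.eye(d)
print(np.allclose(geo_avg(u, v, I, I, I), (u + v) / 2))  # True
```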
It should be noted that the proposed geometry-aware averaging approach generalizes (Coates and Bollegala, 2018), which becomes a particular case in our framework by choosing $W_1$, $W_2$, and $B$ as identity matrices. Our framework easily generalizes to the case of more than two source embeddings, by learning different source-embedding-specific orthogonal transformations and a common Mahalanobis metric.
We next propose to concatenate the aligned embeddings in the learned latent space. For a given word $w$, with $u_w$ and $v_w$ as different source embeddings, the meta-embedding learned by the proposed geometry-aware concatenation model is $m_w = \left[B^{1/2} W_1 u_w;\; B^{1/2} W_2 v_w\right]$. It can be easily observed that the plain concatenation (Yin and Schütze, 2016) is a special case of the proposed geometry-aware concatenation (by setting $W_1 = W_2 = B = I_d$, where $I_d$ is the $d$-dimensional identity matrix).
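Geometry-aware concatenation can be sketched similarly; setting all transformations to the identity recovers plain concatenation (inputs are illustrative):

```python
import numpy as np

def geo_conc(u, v, W1, W2, B_half):
    """Geometry-aware concatenation of the aligned latent embeddings,
    yielding a 2d-dimensional meta-embedding."""
    return np.concatenate([B_half @ W1 @ u, B_half @ W2 @ v])

u = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
I = np.eye(2)
# With W1 = W2 = B = I_2 this is exactly plain concatenation.
print(np.allclose(geo_conc(u, v, I, I, I), np.concatenate([u, v])))  # True
```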
2.3 Optimization

The proposed optimization problem (1) employs the square loss function and $\ell_2$-norm regularization, both of which are well studied in the literature. In addition, the proposed problem involves optimization over smooth constraint sets such as the set of symmetric positive-definite matrices and the set of orthogonal matrices. Such sets have a well-known Riemannian manifold structure (Lee, 2003) that allows computationally efficient iterative optimization algorithms. We employ the popular Riemannian optimization framework (Absil et al., 2008) to solve (1). Recently, Jawanpuria et al. (2019) studied a similar optimization problem in the context of learning cross-lingual word embeddings.
Our implementation is done using the Pymanopt toolbox (Townsend et al., 2016), a publicly available Python toolbox for Riemannian optimization algorithms. In particular, we use the conjugate gradient algorithm of Pymanopt. For this, we need only supply the objective function of (1), which can be computed efficiently with a few matrix multiplications. The overall computational cost of our implementation scales linearly with the number of words in the vocabulary sets.
3 Experiments

In this section, we evaluate the performance of the proposed meta-embedding models.
3.1 Evaluation Tasks and Datasets
Word similarity: In this task, we compare the human-annotated similarity scores between pairs of words with the corresponding cosine similarity computed via the constructed meta-embeddings. We report results on the following benchmark datasets: RG (Rubenstein and Goodenough, 1965), MC (Miller and Charles, 1991), WS (Finkelstein et al., 2001), MTurk (Halawi et al., 2012), RW (Luong et al., 2013), and SL (Hill et al., 2015). Following previous works (Yin and Schütze, 2016; Coates and Bollegala, 2018; O'Neill and Bollegala, 2020), we report the Spearman correlation score (higher is better) between the cosine similarity (computed via meta-embeddings) and the human scores.
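This evaluation loop can be sketched as follows; the mini-benchmark and embeddings are hypothetical, and the Spearman correlation is computed here as the Pearson correlation of ranks (adequate for distinct scores):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors
    (no tie handling, which suffices when all scores are distinct)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical mini-benchmark: (word1, word2, human similarity score).
emb = {"cat": np.array([1.0, 0.2]), "dog": np.array([0.9, 0.3]),
       "car": np.array([-0.5, 1.0]), "bus": np.array([-0.4, 0.9])}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5), ("cat", "car", 1.0)]

model = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human = [s for _, _, s in pairs]
rho = spearman(np.array(model), np.array(human))
print(rho)  # → 0.5
```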
Word analogy: In this task, the aim is to answer questions which have the form "A is to B as C is to ?" (Mikolov et al., 2013). After generating the meta-embeddings of the terms A, B, and C, the answer is chosen to be the term whose meta-embedding has the maximum cosine similarity with the predicted vector (Mikolov et al., 2013). The benchmark datasets include MSR (Gao et al., 2014), GL (Mikolov et al., 2013), and SemEval (Jurgens et al., 2012). Following previous works (Yin and Schütze, 2016; Coates and Bollegala, 2018; O'Neill and Bollegala, 2020), we report the percentage of correct answers for the MSR and GL datasets, and the Spearman correlation score for SemEval. In both cases, a higher score implies better performance.
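The analogy prediction rule can be sketched as below; the vocabulary and embeddings are toy values constructed so that the analogy holds, and the query words themselves are excluded from the candidates as is standard:

```python
import numpy as np

def answer_analogy(a, b, c, emb):
    """Return the word d maximizing cos(e_d, e_b - e_a + e_c),
    excluding the query words a, b, c from the candidates."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, e in emb.items():
        if w in (a, b, c):
            continue
        sim = (e / np.linalg.norm(e)) @ target
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Hypothetical embeddings where the analogy structure holds by design.
emb = {"man":   np.array([1.0, 0.0, 0.1]),
       "woman": np.array([1.0, 1.0, 0.1]),
       "king":  np.array([1.0, 0.0, 1.0]),
       "queen": np.array([1.0, 1.0, 1.0]),
       "apple": np.array([-1.0, 0.2, 0.0])}
print(answer_analogy("man", "woman", "king", emb))  # → queen
```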
We learn the meta-embeddings from the following publicly available 300-dimensional pre-trained word embeddings for English.
CBOW (Mikolov et al., 2013): word embeddings trained on the Google News corpus.
GloVe (Pennington et al., 2014): word embeddings trained on billions of tokens of web data from the Common Crawl.
fastText (Bojanowski et al., 2017): word embeddings trained on the Common Crawl.
The meta-embeddings are learned on the common set of words for each pair of source embeddings: GloVe–CBOW, GloVe–fastText, and CBOW–fastText.
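The shared vocabulary for a pair of sources is simply the set intersection of their vocabularies (the vocabularies below are hypothetical):

```python
# Hypothetical per-source vocabularies; the shared vocabulary is the
# set intersection, on which both source embeddings are defined.
vocab_glove = {"cat", "dog", "car", "bus", "tree"}
vocab_cbow = {"cat", "dog", "bus", "house"}
common = sorted(vocab_glove & vocab_cbow)
print(common)  # → ['bus', 'cat', 'dog']
```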
3.2 Results and Discussion
The performance of our geometry-aware averaging and concatenation models, henceforth termed Geo-AVG and Geo-CONC, respectively, is reported in Table 1. We also report the performance of the meta-embedding models AVG (Coates and Bollegala, 2018) and CONC (Yin and Schütze, 2016), which perform plain averaging and concatenation, respectively. In addition, we report the performance of the individual source embeddings (CBOW, GloVe, and fastText), which serve as a benchmark that meta-embedding algorithms should ideally surpass in order to justify their usage.
We observe that the proposed geometry-aware models, Geo-AVG and Geo-CONC, outperform the individual source embeddings on all the datasets. The proposed models also easily surpass the AVG and CONC models on both the word similarity and the word analogy tasks. This shows that aligning the word embedding spaces with orthogonal rotations and the Mahalanobis metric improves the overall quality of the meta-embeddings.
4 Conclusion

We propose a geometric framework for learning meta-embeddings of words from various sources of word embeddings. Our framework aligns the embeddings in a common latent space. The importance of learning the latent space is shown on several benchmark datasets, where the proposed algorithms (Geo-AVG and Geo-CONC) outperform the plain averaging and plain concatenation models. Extending the proposed framework to generating sentence meta-embeddings remains a future research direction.
References

- P.-A. Absil, R. Mahony, and R. Sepulchre (2008). Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ.
- C. Bao and D. Bollegala (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1650–1661.
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. https://fasttext.cc/docs/en/english-vectors.html
- D. Bollegala, K. Hayashi, and K. Kawarabayashi (2018). Think globally, embed locally — locally linear meta-embedding of words. In IJCAI.
- S. Bonnabel and R. Sepulchre (2009). Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications 31(3), pp. 1055–1070.
- J. Coates and D. Bollegala (2018). Frustratingly easy meta-embedding — computing meta-embeddings by averaging source word embeddings. In Proceedings of NAACL-HLT 2018, pp. 194–198.
- L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001). Placing search in context: the concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pp. 406–414.
- B. Gao, J. Bian, and T.-Y. Liu (2014). WordRep: a benchmark for research on learning word representations. arXiv preprint arXiv:1407.1640.
- G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren (2012). Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1406–1414.
- F. Hill, R. Reichart, and A. Korhonen (2015). SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, pp. 665–695.
- P. Jawanpuria, A. Balgovind, A. Kunchukuttan, and B. Mishra (2019). Learning multilingual word embeddings in latent metric space: a geometric approach. Transactions of the Association for Computational Linguistics 7, pp. 107–120.
- D. Jurgens, S. Mohammad, P. Turney, and K. Holyoak (2012). SemEval-2012 task 2: measuring degrees of relational similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pp. 356–364.
- J. M. Lee (2003). Introduction to smooth manifolds. Second edition, Graduate Texts in Mathematics, Vol. 218, Springer-Verlag, New York.
- T. Luong, R. Socher, and C. D. Manning (2013). Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL), pp. 104–113.
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119.
- G. A. Miller and W. G. Charles (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, pp. 1–28.
- B. Mishra, G. Meyer, S. Bonnabel, and R. Sepulchre (2014). Fixed-rank matrix factorizations and Riemannian low-rank optimization. Computational Statistics 29(3), pp. 591–621.
- A. Muromägi, K. Sirts, and S. Laur (2017). Linear ensembles of word embedding models. In Proceedings of the 21st Nordic Conference on Computational Linguistics, Gothenburg, Sweden, pp. 96–104.
- J. O'Neill and D. Bollegala (2020). Meta-embedding as auxiliary task regularization. In ECAI.
- J. Pennington, R. Socher, and C. D. Manning (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- H. Rubenstein and J. B. Goodenough (1965). Contextual correlates of synonymy. Communications of the ACM, pp. 627–633.
- J. Townsend, N. Koep, and S. Weichwald (2016). Pymanopt: a Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research 17(137), pp. 1–5.
- Z. Yin and H. Schütze (2016). Learning word meta-embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1351–1360.