The advent of word embeddings (Mikolov et al., 2013a; Pennington et al., 2014) has shifted the entire field of Natural Language Processing (NLP) from sparse representations, such as Bag-of-Words, to dense, vectorial representations that have proven capable of capturing meaningful syntactic and semantic concepts. Word embeddings are widely used in, e.g., text classification (Joulin et al., 2016) and machine translation (Mikolov et al., 2013b). Consequently, word embeddings have a crucial impact on downstream applications; moreover, such models inherit (hidden) assumptions and properties of the data.
Text corpora for training word embeddings are typically composed of subsets with different properties. Properties can manifest, e.g., as U.K./U.S. English, but can also be induced by the authors, e.g., texts written by different genders, in different periods of time, or in different contexts such as arts and politics. While the primary intention is to capture semantic and syntactic information from the data in the best possible way, ideally by learning from as much data as possible, we argue that on second thought it is desirable to influence the composition of the data (sub)sets.
Given a corpus where one category outnumbers the other, joint word embeddings will expose a bias towards the larger category, yet this might not reflect the actual word semantics appropriately, or can simply be undesired.
Consider, for example, a transfer learning setting in which word embeddings trained on a large data set are fine-tuned on a small task-specific data set. Or, in order to achieve semantics of cultural diversity, several smaller newspaper data sets with different foci could be added to a large base newspaper data set with a Euro-/US-centered focus.
The problem of bias is aggravated because word embeddings often serve as a starting point for downstream tasks, whose methods usually operate as black boxes with decision making that is difficult to inspect.
Typical state-of-the-art embedding learning algorithms do not distinguish between different data subsets and thus merge their properties in an incidental manner. A notable exception is the work of Goikoetxea et al. (2016), which shows how text-based and wordnet-based (Miller, 1995) embeddings can be combined to improve the embedding quality, yet it does not align the contributions of the individual data sets. For more details on related work we refer to Section 2.
In this contribution we research if and how the influence of individual subsets can be aligned, while retaining embedding quality w.r.t. word vectors learned on all the data. For this aim we propose a measure for the retained semantics of a subset in the final embedding and compare a total of 9 different combination methods (1-9), which are explained in detail in Section 4. The combinations vary in that they are (1) trained on the complete data set, (2-4) created without consideration of the data distribution (Goikoetxea et al., 2016), and (5-9) created with consideration of the data distribution (our approaches).
2 Related work
Various authors combined text-based word embeddings with additional resources, as for instance wordnet-based information, embeddings trained by different algorithms or additional data sets (Goikoetxea et al., 2016; Rothe and Schütze, 2017; Speer and Chin, 2016; Henriksson et al., 2014). The main goal in those articles is to improve the quality of word embeddings overall.
However, to the best of our knowledge, no one has so far systematically addressed the influence that subsets have on a combined embedding, in order to balance the impact of different data sets after their composition while retaining the quality of the word embeddings.
3 Evaluating the influence of data subsets on word embeddings
[Table 1: analogy-test results (in %) for New York Times and Wikipedia.]
Considering how embeddings encode word contexts, we illustrate the influence of data subsets on the final embedding on two real-world data sets.
New York Times 1990-2016: The New York Times data set (NYT, https://sites.google.com/site/zijunyaorutgers/) contains headlines and lead texts of news articles published online and offline in the New York Times between 1990 and 2016, with a total of 99,872 documents. Political offices as well as sports teams are discussed in close detail through their representatives, players, hot topics and current scores, and this context changes over time. As word embeddings are mainly based on the context of a word, their connotation and vectorial representation are influenced by those changes. We investigate the influence of these changes on common word embeddings by splitting this data set into two subsets, the first one reaching from 1990-1999 (33,383 articles) and the second from 2000-2016 (62,058 articles).
The Wikipedia data set (Wiki) contains articles from the English Wikipedia snapshot from April 1st, 2019. We select 12,236 articles from the category Arts as well as 24,473 articles from the category Politics to analyse the individual influence of those two fields on joint word embeddings.
As a first example, we consider the word shooting, whose nearest neighbors (NNs) in both category groups of the Wiki data set are shown in Fig. 1. Clearly, within Politics, shooting refers mostly to the firing of a gun; for Arts, shooting rather relates to a photo or movie shooting. When we train embeddings on the joint data set, the new vector reflects both realities, but is biased towards Politics due to the larger number of articles (23/100 and 51/100 common neighbors with the embeddings from Arts and Politics, respectively).
Given this intuition, we would like to quantify the retained influence of the data subsets (a) and (b) on embeddings trained on the combined data. Inspired by the Jaccard index, we compare the neighboring words of a given embedding trained on a subset with those of the composed embedding. In more detail, given the sets N_k(E, w) and N_k(E', w) of the k nearest neighbors of a word w in two embeddings E and E', the ratio of shared nearest words is

  o_k(w) = |N_k(E, w) ∩ N_k(E', w)| / k,

and we denote the average over all words as ō_k(E, E'). For instance, ō_k(E_a, E) = 0.5 would mean that words in E_a and E share on average 50% of their NNs. We will use ō_k(E_a, E) and ō_k(E_b, E) to indicate the retained influence of the according subsets on a resulting embedding E. We use the cosine similarity to compute NNs for neighborhoods of different sizes k.
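This overlap measure can be sketched in a few lines of numpy; the function names and the helper `nearest_neighbors` are our own illustration, not code from the paper:

```python
import numpy as np

def nearest_neighbors(E, vocab, word, k):
    """Return the set of k nearest neighbors of `word` in the embedding
    matrix E (one row per word) under cosine similarity."""
    idx = vocab[word]
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = En @ En[idx]
    sims[idx] = -np.inf                                # exclude the word itself
    top = np.argpartition(-sims, k)[:k]                # indices of k most similar words
    inv = {i: w for w, i in vocab.items()}
    return {inv[i] for i in top}

def overlap(E1, E2, vocab, word, k):
    """o_k(word): ratio of shared k-NNs between embeddings E1 and E2."""
    n1 = nearest_neighbors(E1, vocab, word, k)
    n2 = nearest_neighbors(E2, vocab, word, k)
    return len(n1 & n2) / k

def mean_overlap(E1, E2, vocab, k=10):
    """Average of o_k over the whole vocabulary."""
    return float(np.mean([overlap(E1, E2, vocab, w, k) for w in vocab]))
```

Two identical embeddings yield a mean overlap of 1.0, while unrelated embeddings approach 0.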
We use a number of different embeddings that can be divided into three groups: merging the data before learning the embeddings (1), static merging of separately trained embeddings (2-4), and dynamic merging approaches (5-9).
Baselines - (a), (b), (1) As baselines we train word embeddings with GloVe (Pennington et al., 2014) on NYT on articles from (a) 1990-1999, (b) 2000-2016, and (1) the merged data. The resulting embeddings learned on (a), (b), and (1) are denoted as E_a, E_b, and E, respectively. We further trained word embeddings with GloVe on Wiki for (a) Arts (E_a), (b) Politics (E_b) and (1) the merged data (E).
GloVe embeddings are trained as 50-dimensional word embeddings on both NYT and Wiki. We choose a context window size of 15 for NYT and 5 for Wiki, as the Wiki data set is considerably larger than NYT. We select one vocabulary for each data set and consider only words that occur at least 40 (NYT) and 250 (Wiki) times in the whole data set, which leads to vocabularies of size 21,398 (NYT) and 19,936 (Wiki).
Static merging - (2), (3), (4) In contrast to (1), which merges the data before learning, the following approaches merge trained embeddings; they were proposed by Goikoetxea et al. (2016). Given the embeddings E_a and E_b of the subsets, method (2) is to average them, i.e., (E_a + E_b)/2, (3) is to concatenate them to a 100-dimensional embedding, and (4) extends (3) by extracting the 50 most informative dimensions using PCA. Both (3) and (4) obtained good results in Goikoetxea et al. (2016).
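The three static merging operations can be sketched as follows, assuming the subset embeddings are given as aligned numpy matrices with one row per vocabulary word (function names are illustrative):

```python
import numpy as np

def merge_average(Ea, Eb):
    # (2) dimension-wise average of the two subset embeddings
    return (Ea + Eb) / 2.0

def merge_concat(Ea, Eb):
    # (3) concatenation: two 50-d embeddings yield one 100-d embedding
    return np.concatenate([Ea, Eb], axis=1)

def merge_concat_pca(Ea, Eb, d=50):
    # (4) concatenate, then keep the d most informative dimensions via PCA
    C = merge_concat(Ea, Eb)
    C = C - C.mean(axis=0, keepdims=True)    # center before PCA
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    return C @ Vt[:d].T                      # project onto top-d principal axes
```

Note that averaging and concatenating require the two embeddings to share the same vocabulary ordering.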
Dynamic merging - (5), (6), (7), (8), (9)
We found that the previously presented embeddings are biased towards the larger subset: ō_k(E_b, E) > ō_k(E_a, E).
To alleviate this we propose the following approaches.
A first attempt (5) is to upsample the smaller subset to the size of the larger one. This leads to embeddings with a high score in analogy tests but a decrease in the average overlap ō.
We further intend to balance the impact of the subsets by taking an average that is weighted by their inverse proportions (6): E_w = (n_b E_a + n_a E_b) / (n_a + n_b), where n_a and n_b denote the sizes of the subsets.
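As a sketch, assuming the subset embeddings are aligned numpy matrices and n_a, n_b the subset sizes (the function name is our own):

```python
import numpy as np

def weighted_average(Ea, Eb, n_a, n_b):
    """(6) Average weighted by the inverse subset proportions: the smaller
    subset receives the larger weight, balancing the two contributions."""
    return (n_b * Ea + n_a * Eb) / (n_a + n_b)
```

With equal subset sizes this reduces to the plain average of method (2).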
Unfortunately, we found that this approach results in embeddings of inferior quality. We therefore define an optimization problem that on the one hand optimizes the GloVe loss to obtain embeddings of good quality, and on the other hand balances the influence of the respective subsets by regularizing the distance of the solution to the weighted embeddings E_w. Given the co-occurrence matrix X and the GloVe weighting function f (Pennington et al., 2014), the embeddings Ê are created by optimizing

  min_Ê  Σ_{i,j} [ f(X) ⊙ (Ê Ẽ^T + b 1^T + 1 b̃^T - log X)^2 ]_{i,j} + λ ||Ê - E_w||^2,   (4)

where Ẽ, b and b̃ denote the context vectors and bias terms, the square is applied point-wise, and ⊙ denotes a point-wise multiplication. The regularization parameter λ allows to trade off between embedding quality and a balanced influence. We restrict the solution space to the "rectangle" between E and E_w and leave exploring an unconstrained version to future work. We optimize Eq. 4 with gradient descent, using Adam with default values for its parameters; the optimization is stopped after a fixed number of steps. We have implemented this in PyTorch.
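The regularized objective can be sketched in numpy for illustration (the actual implementation uses PyTorch with Adam; the function names, the dense co-occurrence matrix, and the handling of zero counts here are our own simplifications):

```python
import numpy as np

def glove_f(X, x_max=100.0, alpha=0.75):
    # standard GloVe weighting function f (Pennington et al., 2014)
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)

def regularized_loss(E, E_ctx, b, b_ctx, X, E_w, lam):
    """GloVe loss plus lam * ||E - E_w||^2, tying the solution to the
    weighted-average embedding E_w. Zero co-occurrences contribute nothing.

    E, E_ctx : (V, d) word / context vectors
    b, b_ctx : (V,) bias terms
    X        : (V, V) co-occurrence counts
    E_w      : (V, d) weighted-average target embedding
    lam      : regularization strength (lambda)
    """
    logX = np.log(np.where(X > 0, X, 1.0))             # log X on observed pairs only
    resid = E @ E_ctx.T + b[:, None] + b_ctx[None, :] - logX
    fit = np.sum(np.where(X > 0, glove_f(X) * resid ** 2, 0.0))
    reg = lam * np.sum((E - E_w) ** 2)
    return fit + reg
```

Setting lam = 0 recovers the plain GloVe loss, while larger lam pulls the solution towards E_w.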
We evaluate the quality of the obtained embeddings by measuring their performance on analogy tests (Mikolov et al., 2013b), and how well the influence of the subsets is balanced by measuring the overlaps ō_k(E_a, ·) and ō_k(E_b, ·), their average and their difference (see Section 3). ō_s and ō_l indicate the respective averages over ō_k for the small (s) and large (l) slice. Results for all methods, evaluated and averaged over the entire vocabulary, are summarized in Table 1.
First we note that the embeddings of the subsets (a) and (b) have only a few NNs in common.
Furthermore, when trained on both subsets the embeddings (1) show a clear shift towards the larger subset (b).
Qualitatively this can also be observed in Table 2, where we depict the NNs of the word "war" in subset (a) and subset (b), and the position of each word in the ranked neighbors of (1) in the column "90/00".
We observe that most of the NNs of (a) are not present in the first NNs of (1),
while for (b) the set of 4 NNs is identical with (1).
Moreover, we note that also the static merging approaches (2), (3) & (4) exhibit the same shift (see Table 1).
We try to increase the influence of the smaller subset (a) by upsampling (5) it to the size of (b) before training GloVe embeddings. This leads to the same (or even better) quality of the word embeddings as (1), but also results in a decreased average overlap ō. To alleviate this we propose a weighted average (6) that takes the subset proportions into account. The results in Table 1 indicate that this simple approach indeed yields, in terms of our measure, balanced embeddings. This can also be observed in Table 2, where the NNs of (6) overlap much more with the NNs of the respective subsets.
Unfortunately, the embedding quality suffers when performing a weighted average. With the aim of aligning both desiderata, balanced influence of the subsets and quality of the embeddings, we proposed an optimization procedure (7-9). From Table 1 we read that the resulting embeddings for different regularization strengths are balanced, but surprisingly the influence of the respective subsets decreases in comparison to (1). As a control experiment we consider the embeddings given by a weighted average between (1) and (6) (Figure 2), where this drop of influence cannot be observed. Yet none of the averaged embeddings yields both good performance and balance, which justifies the application of an optimization procedure.
We measure the embedding quality by means of analogy tests.
The embeddings trained on all the data (1) perform best in this context — hinting that it is beneficial to leverage as much information from data as possible.
The statically merged embeddings (2), (3), (4) do not perform as well on our task, in contrast to the results of Goikoetxea et al. (2016).
Furthermore, we note that the weighted average (6) also results in a decrease in embedding quality. In contrast, we find that our optimization approach is able to both retain embedding quality and balance the influence of the subsets.
Considering that text corpora are often composed of subsets, embedding learners merge them in an incidental manner, either by merging the text before or the word vectors after training. We argue that this can lead to undesired shifts in the embedded semantics, and propose a measure for this shift as well as approaches to balance the composition of the subsets.
Our preliminary results show that one can indeed level the impact of different subsets. A weighted average of the subset embeddings yields balanced word embeddings, yet their quality decreases. The proposed optimization routine results in word vectors of good quality with a balanced, yet decreased, influence of the subsets.
As future work we aim to extend our empirical results and investigate the proposed optimization routine in more detail, e.g., by removing the constraints. As additional experiments we would like to investigate the influence of the different combination methods on downstream tasks, such as classification of sub-categories of the Wikipedia articles. This will further our understanding of the workings of the combination methods, in comparison to the analogy tests, which are not data-slice specific. As an alternative to the current regularization, which minimizes the distance to another, presumably balanced embedding, we would like to develop a (differentiable) regularization term that is more closely related to our measure ō. Adapting the work of Berman et al. (2018), which proposes surrogate losses for the Jaccard index, seems to be a promising direction for this goal.
An interesting question posed by our results is how the merging of data subsets impacts the resulting embedding semantics, considering that many NNs of the merged embedding E are not NNs in the subset embeddings E_a and E_b.
This work was supported by the Federal Ministry of Education and Research (BMBF) for the Berlin Big Data Center BBDC (01IS14013A) and for the MALT III project (01IS17058). We thank L. Ruff, T. Schnake, O. Eberle and S. Dogadov for fruitful discussions. We also thank the reviewers from ACL and the Workshop on Ethical, Social and Governance Issues in AI at NeurIPS 2018 for their valuable comments.
- Berman et al. (2018) Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko. 2018. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Goikoetxea et al. (2016) Josu Goikoetxea, Eneko Agirre, and Aitor Soroa. 2016. Single or multiple? Combining word representations independently learned from text and wordnet. In Thirtieth AAAI Conference on Artificial Intelligence.
- Henriksson et al. (2014) Aron Henriksson, Hans Moen, Maria Skeppstedt, Vidas Daudaravičius, and Martin Duneld. 2014. Synonym extraction and abbreviation expansion with ensembles of semantic spaces. Journal of Biomedical Semantics, 5(1):6.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
- Mikolov et al. (2013b) Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
- Miller (1995) George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Rothe and Schütze (2017) Sascha Rothe and Hinrich Schütze. 2017. Autoextend: Combining word embeddings with semantic resources. Computational Linguistics, 43(3):593–617.
- Speer and Chin (2016) Robert Speer and Joshua Chin. 2016. An ensemble method to produce high-quality word embeddings. arXiv preprint arXiv:1604.01692.