Causally Denoise Word Embeddings Using Half-Sibling Regression
Distributional representations of words, also known as word vectors, have become crucial for modern natural language processing tasks due to their wide applications. Recently, a growing body of word vector postprocessing algorithms has emerged, aiming to render off-the-shelf word vectors even stronger. In line with these investigations, we introduce a novel word vector postprocessing scheme under a causal inference framework. Concretely, the postprocessing pipeline is realized by Half-Sibling Regression (HSR), which allows us to identify and remove confounding noise contained in word vectors. Compared to previous work, our proposed method has the advantages of interpretability and transparency due to its causal inference grounding. Evaluated on a battery of standard lexical-level evaluation tasks and downstream sentiment analysis tasks, our method reaches state-of-the-art performance.
Distributional representations of words have become an indispensable asset in natural language processing (NLP) research due to their wide application in downstream tasks such as parsing [bansal2014tailoring], named entity recognition [lample2016neural], and sentiment analysis [tang2014learning]. Of these, “neural” word vectors such as Word2Vec [Mikolov2013], GloVe [Pennington2014], and Paragram [wieting2015paraphrase] are amongst the most prevalently used, and they are the focus of this article.
There has been a recent thrust in the study of word vector postprocessing methods [Faruqui2015, Fried2015, Mrksic2016, Mrksic2017, Shiue2017, Mu2018, Liu2019, Tang2019]. These methods operate directly on word embeddings and effectively enhance their linguistic regularities in a light-weight fashion. Nonetheless, existing postprocessing methods usually come with a few limitations. For example, some rely on external linguistic resources such as English WordNet [Faruqui2015, Fried2015, Mrksic2016, Mrksic2017, Shiue2017], leaving out-of-database word vectors untouched. Others use heuristic methods to flatten the spectrum of word vector embedding matrices [Mu2018, Liu2019, Wang2019, Tang2019]. Although effective, these spectral flattening algorithms are primarily motivated by experimental observations and lack direct interpretability.
In this paper, we propose a novel word vector postprocessing approach that addresses these limitations. Under a causal inference framework, the proposed method meets the joint desiderata of (1) theoretical interpretability, (2) empirical effectiveness, and (3) computational efficiency. Concretely, the postprocessing pipeline is realized by Half-Sibling Regression (HSR) [Scholkopf2016], a method for identifying and removing confounding noise of word vectors. Using a simple linear regression method, we obtain results that are on par with or better than state-of-the-art results on a wide battery of lexical-level evaluation tasks and downstream sentiment analysis tasks. More specifically, our contributions are as follows:
We formulate the word vector postprocessing task as a confounding noise identification problem under a putative causal graph. This formulation brings causal interpretability and theoretical support to our postprocessing algorithm.
The proposed method is data-thrifty and computationally simple. Unlike many existing methods, it does not require external linguistic resources (e.g., synonym relationships); besides, the method can be implemented easily via simple linear regressions.
The proposed postprocessing method yields highly competitive empirical results. For example, on 20 semantic textual similarity tasks, our proposed method achieves the best performance: on average, it brings 4.71%, 7.54%, and 6.54% improvement over the previously best results, and 7.13%, 22.06%, and 9.83% improvement over the original word embeddings, when testing on Word2Vec, GloVe, and Paragram respectively.
The rest of the paper is organized as follows. We first briefly review prior work on word vector postprocessing. Next, we introduce Half-Sibling Regression as a causal inference framework to remove confounding noise; we then proceed to explain how to apply Half-Sibling Regression to remove noise from word embeddings. Then, we showcase the effectiveness of the Half-Sibling Ridge Regression model on word similarity tasks, semantic textual similarity tasks, and downstream sentiment analysis tasks using three different pre-trained English word embeddings. Finally, we conduct statistical significance tests on all experimental results. Our code is available at https://github.com/KunkunYang/denoiseHSR-AAAI.
In this section, we review prior art for word vector postprocessing. Modern word vector postprocessing methods can be broadly divided into two streams: (1) lexical and (2) spatial approaches.
The lexical approach uses lexical relational resources to enhance the quality of word vectors. These lexical relational resources specify semantic relationships of words such as synonym and antonym relationships. For example, [Faruqui2015] inject synonym lexical information into pre-trained collections of word vectors. [Mrksic2016] generalize this approach and insert both antonymy and synonymy constraints into word vectors. [Mrksic2017] use constraints from mono- and cross-lingual lexical resources to fine-tune word vectors. [Fried2015] and [Shiue2017] propose to use hierarchical semantic relations such as hypernym semantics to enrich word vectors. To make sure that word vectors satisfy the lexical relational constraints, supervised machine learning algorithms are used.
The spatial approach differs from the lexical approach in that it does not require external knowledge bases. The general principle of this approach is to force word vectors to be more “isotropic”, i.e., more spread out in space. This goal is usually achieved by flattening the spectrum of word vectors. For example, [Mu2018] propose the All-But-The-Top (ABTT) method, which removes the leading principal components of word vectors; [Wang2019] extend this idea by softly shrinking the principal components of the word embedding matrix using a variance normalization method; [Liu2019] propose the Conceptor Negation (CN) method, which employs regularized identity maps to filter away high-variance latent features of word vectors; more recently, [Tang2019] develop SearchBeta (SB), which uses a centralized kernel alignment method to smooth the spectrum of word vectors.
The lexical and spatial approaches introduced in the previous section have empirically proven to be effective. Nonetheless, they also suffer from a few limitations. A shortcoming of the lexical approach is that it is unable to postprocess out-of-database word vectors. Indeed, lexical relational resources like synonym-antonym relationships are informative for word meaning, in particular word meaning of adjectives. However, many non-adjective words do not have abundant lexical connections with other words, and for this reason, they are not well-represented in lexical-relationship databases. For instance, most nouns (e.g., car) and verbs (e.g., write) have few synonyms and even fewer antonyms, making the lexical postprocessing methods inapplicable to these words. The spatial approach favorably avoids this problem by lifting the requirement of lexical relational resources. Yet, one major downside of the spatial approach is its lack of direct interpretability. For example, many spatial approaches propose to completely or softly remove a few leading principal components (PCs) of word vectors. However, it is rather unclear what exactly has been encoded by these leading PCs other than the empirical finding that these leading PCs are somehow correlated with word frequencies [Mu2018].
In this paper, we go beyond the lexical and spatial schemes and introduce a novel causal inference approach for postprocessing word vectors. The method does not seek to infer the causal structure of words or word vectors; instead, in line with [Scholkopf2012On] and [Scholkopf2016], it incorporates causal beliefs and assumptions for empirical objectives, which in our case means postprocessing off-the-shelf word vectors. Concretely, this is achieved by identifying and removing confounding noise of word vectors using the Half-Sibling Regression (HSR) method [Scholkopf2016]. Here we first briefly introduce HSR and then explain how to apply HSR to word vectors.
We introduce HSR mainly based on the presentation of [Scholkopf2016]. Consider a hypothetical causal graph, shown in Figure 1, where the vertices labeled $Q$, $N$, $X$, and $Y$ are random variables defined on an underlying probability space and each directed edge indicates a probabilistic dependency between two random variables. We are mostly interested in the quantities taken by the random variable $Q$. Unfortunately, it is not possible to observe these quantities directly. Instead, we are given only corrupted observations of $Q$, whose values are taken by the random variable $Y$. That is, intuitively, $Y$ can be seen as a noisy, lossy version of $Q$. A natural assumption about $Y$ is that it statistically depends on its “clean” version $Q$ as well as on some noise, whose values are taken by an unobservable random variable $N$ that encodes the noise source. We further assume that the noise source $N$ affects another random variable, $X$, whose quantities are directly observable. Importantly, we require $X$ to be independent of $Q$.
Recall that we are mostly interested in the unobservable random variable $Q$. Hence the question we aim to answer is: how can we reconstruct the quantities taken by $Q$ by leveraging the underlying statistical dependency structure in Figure 1? HSR provides a simple yet effective solution to this question. It estimates $Q$ via its approximation $\hat{Q}$, which is defined as

$$\hat{Q} := Y - \mathbb{E}[Y \mid X]. \tag{1}$$
The HSR Equation 1 can be straightforwardly interpreted as follows. Recall that $X$ is independent of $Q$, and therefore $X$ is not predictive of $Q$ or of $Q$’s influence on $Y$. However, $X$ is predictive of $Y$, because $X$ and $Y$ are both influenced by the same noise source $N$. When predicting $Y$ based on $X$, realized by the term $\mathbb{E}[Y \mid X]$, the signals of $Y$ coming from $Q$ cannot be predicted by $X$, so only the noise contained in $Y$ coming from $N$ can be captured. To reconstruct $Q$ from $Y$, we can therefore remove the captured noise $\mathbb{E}[Y \mid X]$ from $Y$, resulting in the reconstruction $\hat{Q} = Y - \mathbb{E}[Y \mid X]$, which is Equation 1. This procedure is referred to as Half-Sibling Regression because $X$ and $Y$ share one parent vertex $N$: $Y$ is regressed upon its half-sibling $X$ to capture the components of $Y$ inherited from their shared parent vertex $N$.
HSR enjoys a few appealing theoretical properties. In particular, it is possible to show that $\hat{Q}$ reconstructs $Q$ (up to its expectation $\mathbb{E}[Q]$) at least as well as the mean subtraction $Y - \mathbb{E}[Y]$ does. We refer the readers to [Scholkopf2016] for detailed theoretical discussions.
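The following toy simulation is a minimal sketch of the HSR idea, not part of the original experiments. It assumes a purely additive model (the observation is signal plus noise), Gaussian variables, and a linear estimate of the conditional expectation; the variable names and the coefficients 2.0 and 0.1 are illustrative choices. It compares the HSR reconstruction against plain mean subtraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

q = rng.normal(size=n)                        # unobserved signal of interest
noise = rng.normal(size=n)                    # unobserved shared noise source
x = 2.0 * noise + 0.1 * rng.normal(size=n)    # half-sibling: driven by the noise only
y = q + noise                                 # observation (toy additive assumption)

# Estimate E[y | x] with a linear least-squares fit (slope and intercept).
design = np.column_stack([x, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_given_x = design @ coef

q_hat = y - y_given_x        # HSR reconstruction: observation minus E[y | x]
q_mean_sub = y - y.mean()    # baseline: mean subtraction


def mse_up_to_mean(estimate, target):
    """Mean squared error against the centred target (recovery is up to the mean)."""
    return float(np.mean((estimate - (target - target.mean())) ** 2))


hsr_error = mse_up_to_mean(q_hat, q)
baseline_error = mse_up_to_mean(q_mean_sub, q)
print(f"HSR error: {hsr_error:.4f}, mean-subtraction error: {baseline_error:.4f}")
```

In this setting the half-sibling carries almost all of the noise and none of the signal, so the regression residual should be much closer to the centred signal than the mean-subtracted observation is.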
We now explain how we apply HSR to remove noise from word vectors. Before getting into the details, we first recall some linguistic basics of words, which are the key enablers of our approach. Semantically, words can be divided into two basic classes: (1) content or open-class words and (2) function or closed-class words (also known as stop words). Content words are those that have meaning or semantic value, such as nouns, verbs, adjectives, and adverbs. Function words have little lexical meaning; rather, they mainly exist to explain grammatical or structural relationships with other words. In English, examples of function words include a, to, for, of, the, and more.
Based on these basic linguistic facts, we posit that content-word vectors and function-word vectors can be seen as half-siblings, as their linguistic properties align well with the HSR foundations. Indeed, since function-word vectors carry little semantic content, they cannot be predictive of clean content-word vectors. Additionally, since content-word vectors and function-word vectors are induced from the same training corpora, we hypothesize that they are subject to the same noise profile. In the HSR language of Figure 1, this means we can model the off-the-shelf stop-word vectors with $X$, off-the-shelf content-word vectors with $Y$, and “clean” yet unseen content-word vectors with $Q$. Under the HSR framework, when we regress content-word vectors upon function-word vectors, only the noise of the former is captured. Once this noise is identified, it can be directly subtracted, so that the clean content-word vectors are reconstructed.
The procedure described above can be mathematically realized as follows. Let $\{f_i\}_{i=1}^{m}$ be a collection of function-word vectors and let $\{c_j\}_{j=1}^{n}$ be a collection of content-word vectors. To postprocess the content-word vectors $\{c_j\}_{j=1}^{n}$, we run a simple two-step algorithm. In the first step, we estimate the parameters of a linear multiple-output model [Hastie2001, Section 3.2.4], in which we use the model inputs $\{f_i\}_{i=1}^{m}$ to predict the model outputs $\{c_j\}_{j=1}^{n}$. This amounts to estimating weights $w_{ij}$ such that $c_j \approx \sum_{i=1}^{m} w_{ij} f_i$ for each $j \in \{1, \dots, n\}$. In the second step, we remove the regression result from the target of the regression. That is, we let $\hat{c}_j := c_j - \sum_{i=1}^{m} w_{ij} f_i$ be the postprocessed content-word vector.
So far, we have described how to postprocess content-word vectors. To postprocess function-word vectors, we can employ a similar pipeline with the predictor and target flipped. That is, to identify confounding noise contained in stop-word vectors, we use off-the-shelf content-word vectors as features to predict off-the-shelf stop-word vectors. The full algorithm is summarized in Algorithm 1.
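The two-step procedure can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' released implementation: it assumes a closed-form ridge regression for the linear multiple-output model, word vectors stored as matrix columns, and content-word columns sorted by frequency. The function names, the `n_predictors` argument, and the default regularization values are all hypothetical.

```python
import numpy as np


def ridge_fit(features, targets, alpha):
    """Closed-form multiple-output ridge regression.

    features: (d, k) matrix whose columns are the predictor word vectors.
    targets:  (d, n) matrix whose columns are the word vectors to predict.
    Returns W of shape (k, n) such that features @ W approximates targets.
    """
    k = features.shape[1]
    gram = features.T @ features + alpha * np.eye(k)
    return np.linalg.solve(gram, features.T @ targets)


def hsr_postprocess(content, function, alpha_content=1.0, alpha_function=1.0,
                    n_predictors=None):
    """Remove from each group of vectors the part predicted by the other group.

    The alpha values are placeholders, not the settings used in the paper.
    """
    # Step 1: regress content-word vectors on function-word vectors, then subtract.
    w_content = ridge_fit(function, content, alpha_content)
    content_clean = content - function @ w_content

    # Step 2: regress function-word vectors on the ORIGINAL content-word vectors
    # (optionally only the first n_predictors columns), then subtract.
    predictors = content if n_predictors is None else content[:, :n_predictors]
    w_function = ridge_fit(predictors, function, alpha_function)
    function_clean = function - predictors @ w_function
    return content_clean, function_clean


# Tiny synthetic example: 50-dimensional vectors, 5 function words, 200 content words.
rng = np.random.default_rng(0)
function_vectors = rng.normal(size=(50, 5))
content_vectors = rng.normal(size=(50, 200)) + function_vectors @ rng.normal(size=(5, 200))

content_clean, function_clean = hsr_postprocess(
    content_vectors, function_vectors, n_predictors=20
)
```

Treating the embedding dimensions as regression samples and the predictor words as features keeps the linear system small (one row and column per predictor word), which is why a short stop-word list is cheap to use as the feature set.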
We provide a few remarks on the practical implementation and further generalizations of Algorithm 1. Our first remark concerns how to identify the function and content words in practice. Throughout our experiments, to identify function words, we use the stop word list provided by the Natural Language Toolkit (NLTK) package (https://www.nltk.org/), which is a list of 179 words. We regard words outside of this list as content words. A small number of stop words works efficiently when postprocessing tens of thousands of content-word vectors because, in this case, we only have a handful of features. However, postprocessing stop-word vectors is cumbersome because the number of content words available as features is too large for the regression to be implemented efficiently. For this reason, in practice, we only use a small sample of commonly used content-word vectors as features for postprocessing stop-word vectors. Specifically, borrowing the word list provided by [Arora2017], we use the most frequent 1000 content words as features in Step 2.1 and Step 2.2 of Algorithm 1.
Moreover, while our framework postprocesses both content and function words, we have tried only postprocessing content words and leaving function words unchanged. We discover that the experimental results are still better than the baseline spatial approaches but worse than postprocessing both content and function words. The reason might be that stop words play non-trivial roles in various NLP tasks. As all baseline spatial approaches postprocess both content and function words, we follow this setting.
Finally, we remark that the linear model used in Algorithm 1 can be straightforwardly generalized to non-linear models. For this, we have formulated and tested Multilayer Perceptrons (MLPs) as extensions to the linear model used in Algorithm 1. The detailed MLP version of Algorithm 1 is deferred to the appendix.
We evaluate the HSR postprocessing algorithm described in Algorithm 1 (denoted by HSR-RR as it is based on ridge regression). We test it on three different pre-trained English word embeddings: Word2Vec (https://code.google.com/archive/p/word2vec/) [Mikolov2013], GloVe (https://nlp.stanford.edu/projects/glove/) [Pennington2014], and Paragram (https://www.cs.cmu.edu/~jwieting/) [wieting2015paraphrase]. The original word vectors, as well as word vectors postprocessed by ABTT [Mu2018], CN [Liu2019], and SB [Tang2019], are set as baselines. The performance of these baselines is compared against HSR-RR on word similarity tasks, semantic textual similarity tasks, and downstream sentiment analysis tasks. A statistical significance test is conducted on all experimental results to verify whether our method yields significantly better results compared to the baselines. For ABTT, we set the number of removed principal components separately for GloVe and for Word2Vec and Paragram, as suggested by the original authors. For CN, we fix its hyperparameter for all word embeddings as suggested by the original authors. For HSR-RR, we fix the regularization constants, using values that we generally recommend for HSR-RR and other HSR models. Furthermore, we construct a Multilayer Perceptron HSR model (denoted by HSR-MLP), and the experimental results of HSR-MLP are shown in the appendix.
We use seven popular word similarity tasks to evaluate the proposed postprocessing method. The seven tasks are RG65 [Rubenstein1965], WordSim-353 [Finkelstein2002], Rare-words [Luong2013], MEN [Bruni2014], MTurk [Radinsky2011], SimLex-999 [Hill2015], and SimVerb-3500 [Gerz2016].
For each task, we calculate the cosine similarity between the vector representations of the two words in each pair, and report in Table 1 the Spearman’s rank correlation coefficient [Myers1995] of the estimated rankings against the human rankings. In the table, the result marked in bold is the best, and the underlined results are the top three.
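The evaluation protocol above can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding table and the human scores are made-up toy values, the helper names are hypothetical, and pairs containing out-of-vocabulary words are simply skipped, which the text does not specify.

```python
import numpy as np


def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def evaluate_word_similarity(embeddings, pairs):
    """pairs: iterable of (word1, word2, human_score) triples."""
    predicted, gold = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:  # skip OOV pairs (assumption)
            predicted.append(cosine_similarity(embeddings[w1], embeddings[w2]))
            gold.append(score)
    return spearman(predicted, gold)


# Toy two-dimensional embeddings and made-up human ratings.
toy_embeddings = {
    "car": np.array([1.0, 0.0]),
    "automobile": np.array([0.9, 0.1]),
    "write": np.array([0.0, 1.0]),
    "pen": np.array([0.2, 0.9]),
}
toy_pairs = [("car", "automobile", 9.0), ("write", "pen", 5.0), ("car", "write", 1.0)]
toy_score = evaluate_word_similarity(toy_embeddings, toy_pairs)
```

The same function can be applied to the original and the postprocessed vectors to compare them on a given benchmark.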
From the table, we see that while no postprocessing method dominates the others, HSR-RR has the best overall performance: it achieves the best result on the largest number of tasks for two of the three word embeddings, namely Word2Vec and Paragram. HSR-RR generally achieves the best results on five tasks: WordSim-353, MEN, MTurk, SimLex-999, and SimVerb-3500. Notably, HSR-RR has the best performance on SimVerb-3500 for all three word embeddings, achieving 8.72%, 40.04%, and 1.98% improvement respectively relative to the original word embeddings and 2.84%, 9.58%, and 1.04% improvement compared to the runner-up method. Since SimVerb-3500 is the state-of-the-art task that contains the highest number of word pairs and distinguishes genuine word similarity from conceptual association [Hill2015], the result obtained on SimVerb-3500 is usually deemed to be more telling than those of other tasks [Liu2019].
Next, we test the sentence-level effectiveness of our proposed HSR method on semantic textual similarity (STS) tasks, which measure the degree of semantic equivalence between two texts [Agirre2012]. The STS tasks we employ comprise 20 tasks from the 2014 SemEval Semantic Relatedness task (SICK) and the SemEval STS tasks from 2012 to 2015 [Marco2014, Agirre2012, Agirre2013, Agirre2014, Agirre2015].
To construct the embedding of each sentence in the tasks, we first tokenize the sentence into a list of words, then average the word embeddings of all words in the list to obtain the vector representation of the sentence. Following [Agirre2012], we calculate the cosine similarity between the two sentence embeddings and record the Pearson correlation coefficient of the estimated sentence similarities against the human ratings.
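A minimal sketch of this sentence-level pipeline is given below. The whitespace tokenizer, the handling of out-of-vocabulary tokens (skipped), the helper names, and the toy data are all assumptions made for illustration; the text does not specify which tokenizer was used.

```python
import numpy as np


def sentence_embedding(sentence, embeddings, dim):
    """Average the vectors of all in-vocabulary tokens of a sentence."""
    tokens = sentence.lower().split()  # whitespace tokenization (assumption)
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def cosine_similarity(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def sts_pearson(sentence_pairs, gold_scores, embeddings, dim):
    """Pearson correlation between predicted cosine similarities and gold scores."""
    predicted = [
        cosine_similarity(
            sentence_embedding(a, embeddings, dim),
            sentence_embedding(b, embeddings, dim),
        )
        for a, b in sentence_pairs
    ]
    return float(np.corrcoef(predicted, gold_scores)[0, 1])


# Toy vocabulary of three "words" and made-up gold similarity scores.
toy_embeddings = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}
toy_sentence_pairs = [("a a", "a"), ("a", "c"), ("a", "b")]
toy_gold = [5.0, 3.0, 0.0]
toy_correlation = sts_pearson(toy_sentence_pairs, toy_gold, toy_embeddings, dim=2)
```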
In Table 2, we present the results of the 20 STS tasks as well as the average result for each year. From the table, we observe that HSR-RR consistently outperforms the original word embeddings as well as the other postprocessing methods: the per-year average result of HSR-RR is the best for all tasks except the SICK task on Word2Vec. On average, HSR-RR improves the Pearson correlation coefficient by 4.71%, 7.54%, and 6.54% respectively over the 20 STS tasks compared to the previously best results, and it achieves 7.13%, 22.06%, and 9.83% improvement respectively compared to the original word embeddings.
Table 4: P-values of the one-tailed Student’s t-test for the three experiments (word similarity, semantic textual similarity, and sentiment analysis).
Since success on intrinsic lexical evaluation tasks does not imply success on downstream tasks, we test the performance of HSR on four sentiment analysis tasks. The datasets we adopt include Amazon reviews (AR; https://www.kaggle.com/bittlingmayer/amazonreviews#train.ft.txt.bz2), customer reviews (CR) [hu2004mining], IMDB movie reviews (IMDB) [maas2011learning], and SST binary sentiment classification (SST-B) [socher2013recursive], which are all binary sentence-level sentiment classification tasks. Sentiment analysis is an important task in NLP which has been widely applied in business areas such as e-commerce and customer service.
Similar to the STS tasks, we first tokenize the sentence, then average the corresponding word embeddings to obtain the vector representation of the sentence. We use a logistic regression model trained by minimizing the cross-entropy loss to classify the sentence embeddings as positive or negative. This procedure was adopted in previous studies such as [zeng2017socialized]. We report the five-fold cross-validation accuracy of the sentiment classification results in Table 3.
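The classification step can be sketched as follows. This is an illustrative stand-in, not the setup used in the experiments: it fits logistic regression with plain full-batch gradient descent, uses hypothetical learning-rate and epoch values, and runs on synthetic features in place of averaged sentence embeddings. In practice an off-the-shelf logistic regression implementation would serve equally well.

```python
import numpy as np


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Binary logistic regression fitted by gradient descent on the cross-entropy loss."""
    weights = np.zeros(features.shape[1])
    bias = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
        error = probs - labels                      # gradient of the loss w.r.t. the logits
        weights -= lr * features.T @ error / len(labels)
        bias -= lr * error.mean()
    return weights, bias


def predict(features, weights, bias):
    return (features @ weights + bias > 0).astype(int)


def cross_validated_accuracy(features, labels, n_folds=5, seed=0):
    """Mean held-out accuracy over n_folds random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(labels)), n_folds)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        weights, bias = train_logistic_regression(features[train_idx], labels[train_idx])
        predictions = predict(features[test_idx], weights, bias)
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return float(np.mean(accuracies))


# Synthetic stand-in for sentence embeddings: two well-separated Gaussian classes.
rng = np.random.default_rng(1)
positive = rng.normal(loc=2.0, size=(100, 5))
negative = rng.normal(loc=-2.0, size=(100, 5))
toy_features = np.vstack([positive, negative])
toy_labels = np.concatenate([np.ones(100), np.zeros(100)]).astype(int)

toy_accuracy = cross_validated_accuracy(toy_features, toy_labels)
```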
From Table 3, we could observe that HSR-RR has the best downstream-task performance among all the tested postprocessing methods. Specifically, for Paragram, HSR-RR achieves the highest classification accuracy on all four tasks; for Word2Vec and GloVe, HSR-RR performs the best on three out of the four tasks.
To show that our proposed method yields significant improvement compared to the baselines, we employ the one-tailed Student’s t-test. The p-value of the t-test of HSR-RR against other methods for all three experiments is shown in Table 4 in scientific notation. We use the convention that a p-value is significant if it is smaller than 0.05, and all significant p-values are marked in bold.
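As a hedged illustration of such a test, the sketch below runs a one-tailed two-sample Student's t-test with SciPy on made-up per-task scores. Whether the test in the experiments was paired or unpaired is not stated here, so the unpaired, equal-variance form is an assumption, as are the score values and the helper name. The `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Made-up per-task scores for a candidate method and a reference method.
candidate_scores = np.array([0.70, 0.72, 0.68, 0.75, 0.71])
reference_scores = np.array([0.61, 0.63, 0.57, 0.66, 0.60])


def one_tailed_t_test(candidate, reference):
    """P-value for the alternative hypothesis mean(candidate) > mean(reference)."""
    result = stats.ttest_ind(candidate, reference, equal_var=True, alternative="greater")
    return float(result.pvalue)


p_forward = one_tailed_t_test(candidate_scores, reference_scores)
p_reverse = one_tailed_t_test(reference_scores, candidate_scores)  # hypotheses switched

print(f"candidate > reference: p = {p_forward:.2e}")
print(f"reference > candidate: p = {p_reverse:.2e}")
```

Swapping the two arguments switches the null and alternative hypotheses, which is a quick way to check whether the reference method is itself significantly better.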
From Table 4, we observe that on word similarity and STS tasks, the improvements yielded by HSR are significant when compared to all three original word vectors. On sentiment analysis tasks, the improvement on Paragram is significant. We also test the significance of the improvement of HSR-RR over the other state-of-the-art baseline methods (ABTT, CN, and SB). We find that, for STS tasks, the improvements against all three baseline methods on all three word vectors are significant; for sentiment analysis, the improvements against all three baseline methods on Word2Vec and Paragram are significant; for word similarity, only two results (SB on GloVe and CN on Paragram) are significant. In the remaining cases, the improvements of HSR-RR over the original word vectors and the baseline algorithms are not significant; conversely, the baseline methods and the original word vectors also fail to surpass the performance of HSR-RR when the null hypothesis and alternative hypothesis are switched. Therefore, we conclude that HSR-RR yields solid improvement when compared to the original word vectors, and it is either significantly better than or on par with other state-of-the-art baseline methods.
We want to remark that, while statistical significance tests are useful for algorithm comparison, they are mostly absent from previous word vector evaluation papers [Bullinaria2007, Levy2015, Faruqui2015, Fried2015, Mrksic2016, Mrksic2017, Shiue2017, Mu2018, Liu2019, tang2014learning], and there could be a valid reason for this. As pointed out by [dror2018hitchhiker], the way existing NLP datasets are structured tends to cripple widely adopted significance tests: while most statistical significance tests (e.g., the t-test) assume that the test set consists of independent observations, NLP datasets usually violate this assumption. For instance, most STS datasets only contain sentences from a certain source (e.g., news or image captions), and word similarity datasets usually contain words of specialized types (e.g., verbs). This makes a proper significance test quite challenging. Some NLP researchers even argue for abandoning the null hypothesis statistical significance testing approach due to this hard-to-meet assumption [koplenig2019against, mcshane2019abandon].
In this paper, we present a simple, fast-to-compute, and effective framework for postprocessing word embeddings, which is inspired by the recent development of causal inference. Specifically, we employ Half-Sibling Regression to remove confounding noise contained in word vectors and to reconstruct latent, “clean” word vectors of interest. The key enabler of the proposed Half-Sibling Regression is the linguistic fact that function words and content words are lexically irrelevant to each other, making them natural “half-siblings”. The experimental results on both intrinsic lexical evaluation tasks and downstream sentiment analysis tasks reveal that the proposed method efficiently eliminates noise and improves performance over the existing alternative methods on three different brands of word embeddings.
The current work has a few limitations, which we wish to address in the future. The first limitation resides in the way we formulate the regression. Note that, when performing the multiple-output regression step in the HSR algorithm (Step 1.1 and Step 2.1 of Algorithm 1), we do not take the correlation of targets into account. Such correlations, however, could be beneficial in some cases. Consider, for instance, the task of predicting content words based on stop words (Step 1.1 of Algorithm 1). As the content words used as targets are strongly correlated (e.g., synonyms and antonyms), such correlations can be further exploited to facilitate the regression with well-studied methods such as reduced-rank regression [Anderson1949]. For a survey of multiple-outcome regression methods that take output correlations into account, please see [Hastie2001, Section 3.7].
The second line of future work concerns how to use a non-linear model for HSR more effectively. Although we have tried neural-network-based HSR algorithms for various tasks (see the appendix for details), empirically they bring marginally improved results, if not slightly worsened. One hypothesis for explaining this phenomenon is that neural networks tend to be highly expressive, overfitting small datasets easily. For future work, we plan to explore more regularization methods which may improve the results of neural-network-based HSR.
The third line of future work is to develop a unified framework for understanding word vector postprocessing. As various word vector postprocessing algorithms yield (sometimes surprisingly) similar results in a few cases, it is our hope to establish connections between these approaches in the future. The recent work by [zhou2019getting] points toward this direction.
Last but not least, we believe that there remain ample opportunities for using HSR in other NLP tasks and models. For instance, recently, we have observed that pre-trained language models such as BERT [devlin2019bert] start to replace word vectors as default feature representations for downstream NLP tasks. The HSR framework, in principle, can be incorporated in language model postprocessing pipelines as well. We would like to explore these possibilities in the future.
This work was partially supported by the National Natural Science Foundation of China (grant number 71874197). We appreciate the anonymous reviewers for their detailed and constructive comments. We thank all the people who helped Zekun Yang flee from Hong Kong to Shenzhen on Nov. 12th, 2019 such that she could safely finish writing the camera-ready version of this paper.