Unsupervised Post-processing of Word Vectors via Conceptor Negation

11/17/2018 ∙ by Tianlin Liu, et al. ∙ Jacobs University Bremen ∙ University of Pennsylvania

Word vectors are at the core of many natural language processing tasks. Recently, there has been interest in post-processing word vectors to enrich their semantic information. In this paper, we introduce a novel word vector post-processing technique based on matrix conceptors (Jaeger 2014), a family of regularized identity maps. More concretely, we propose to use conceptors to suppress those latent features of word vectors having high variances. The proposed method is purely unsupervised: it does not rely on any corpus or external linguistic database. We evaluate the post-processed word vectors on a battery of intrinsic lexical evaluation tasks, showing that the proposed method consistently outperforms existing state-of-the-art alternatives. We also show that post-processed word vectors can be used for the downstream natural language processing task of dialogue state tracking, yielding improved results in different dialogue domains.

Introduction

Distributional representations of words, better known as word vectors, are a cornerstone of practical natural language processing (NLP). Examples of word vectors include Word2Vec [Mikolov et al.2013], GloVe [Pennington, Socher, and Manning2014], Eigenwords [Dhillon, Foster, and Ungar2015], and Fasttext [Bojanowski et al.2017]. These word vectors are usually referred to as distributional word vectors, as their training methods rely on the distributional hypothesis of semantics [Firth1957].

Recently, there has been interest in post-processing distributional word vectors to enrich their semantic content. The post-processing procedures are usually performed in a lightweight fashion, i.e., without re-training word vectors on a text corpus. In one line of study, researchers used supervised methods to enforce linguistic constraints (e.g., synonym relations) on word vectors [Faruqui et al.2015, Mrksic et al.2016, Mrksic et al.2017], where the linguistic constraints are extracted from an external linguistic knowledge base such as WordNet [Miller1995] or PPDB [Pavlick et al.2015]. In another line of study, researchers devised unsupervised methods to post-process word vectors. Spectral-decomposition methods such as singular value decomposition (SVD) and principal component analysis (PCA) are usually used in this line of research [Caron2001, Bullinaria and Levy2012, Turney2012, Levy and Goldberg2014, Levy, Goldberg, and Dagan2015, Mu and Viswanath2018]. The current paper is in line with the second, unsupervised, research direction.

Among different unsupervised word vector post-processing schemes, the all-but-the-top approach [Mu and Viswanath2018] is a prominent example. Empirically studying the latent features encoded by principal components (PCs) of distributional word vectors, Mu and Viswanath (2018) found that the variances explained by the leading PCs “encode the frequency of the word to a significant degree”. Since word frequencies are arguably unrelated to lexical semantics, they recommend removing such leading PCs from word vectors using a PCA reconstruction.

The current work advances the findings of Mu and Viswanath (2018) and improves their post-processing scheme. Instead of discarding a fixed number of PCs, we softly filter word vectors using matrix conceptors [Jaeger2014, Jaeger2017], which characterize the linear space of those word vector features having high variances – the features most contaminated by word frequencies according to Mu and Viswanath (2018). The proposed approach is mathematically simple and computationally efficient, as it is founded on elementary linear algebra. Besides these traits, it is also practically effective: using a standard set of lexical-level intrinsic evaluation tasks and a deep neural network-based dialogue state tracking task, we show that conceptor-based post-processing considerably enhances linguistic regularities captured by word vectors. Our contributions are as follows:

  1. We propose an unsupervised algorithm that leverages Boolean operations of conceptors to post-process word vectors. The resulting word vectors achieve up to 18.86% and 28.34% improvement on the SimLex-999 and SimVerb-3500 datasets, respectively, relative to the original word representations.

  2. A closer look at the proposed algorithm reveals commonalities across several existing post-processing techniques for neural-based word vectors and pointwise mutual information (PMI) matrix based word vectors. Unlike the existing alternatives, the proposed approach is flexible enough to remove lexically-unrelated noise, while general-purpose enough to handle word vectors induced by different learning algorithms.

The rest of the paper is organized as follows. We first briefly review the principal component nulling approach for unsupervised word vector post-processing introduced in [Mu and Viswanath2018], upon which our work is based. We then introduce our proposed approach, Conceptor Negation (CN). Analytically, we reveal the links and differences between the CN approach and the existing alternatives. Finally, we showcase the effectiveness of the CN method with numerical experiments. (Footnote 1: Our code is available at https://github.com/liutianlin0121/Conceptor-Negation-WV.)

Notation

We assume a collection of words $w \in V$, where $V$ is a vocabulary set. Each word $w \in V$ is embedded as an $n$-dimensional real-valued vector $v_w \in \mathbb{R}^n$. An identity matrix will be denoted by $I$. For a vector $v$, we denote diag($v$) as the diagonal matrix with $v$ on its diagonal. We write $[n] = \{1, 2, \ldots, n\}$ for a positive integer $n$.

Post-processing word vectors by PC removal

This section is an overview of the all-but-the-top (ABTT) word vector post-processing approach introduced by Mu and Viswanath (2018). In brief, the ABTT approach is based on two key observations of distributional word vectors. First, using a PCA, Mu and Viswanath (2018) revealed that word vectors are strongly influenced by a few leading principal components (PCs). Second, they provided an interpretation of such leading PCs: they empirically demonstrated a correlation between the variances explained by the leading PCs and word frequencies. Since word frequencies are arguably unrelated to lexical semantics, they recommend eliminating top PCs from word vectors via a PCA reconstruction. This method is described in Algorithm 1.

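Input : (i) $\{v_w \in \mathbb{R}^n : w \in V\}$: word vectors with a vocabulary $V$; (ii) $d$: the number of PCs to be removed.
1 Center the word vectors: Let $\bar{v}_w := v_w - \mu$ for all $w \in V$, where $\mu$ is the mean of the input word vectors.
2 Compute the first $d$ PCs $\{u_i \in \mathbb{R}^n\}_{i \in [d]}$ of the column-wisely stacked centered word vectors $[\bar{v}_w]_{w \in V}$ via a PCA.
3 Process the word vectors: $\tilde{v}_w^{\mathrm{ABTT}} := \bar{v}_w - \sum_{i=1}^{d} (u_i^\top \bar{v}_w)\, u_i$.
Output : $\{\tilde{v}_w^{\mathrm{ABTT}} : w \in V\}$
Algorithm 1 The all-but-the-top (ABTT) algorithm for word vector post-processing.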
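
As a rough illustration, the three steps of Algorithm 1 can be sketched in NumPy. This is a minimal sketch and not the authors' released code: the function name `abtt`, the row-wise layout of the input matrix, and the toy data are assumptions made for the example.

```python
import numpy as np

def abtt(vectors: np.ndarray, d: int) -> np.ndarray:
    """All-but-the-top sketch: remove the mean and the top-d principal components.

    vectors: array of shape (num_words, n), one word vector per row (assumed layout).
    """
    # Step 1: center the word vectors.
    centered = vectors - vectors.mean(axis=0)
    # Step 2: rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:d]  # shape (d, n)
    # Step 3: subtract the projection onto the span of the top-d PCs.
    return centered - centered @ top.T @ top

# Toy data (hypothetical): two directions carry most of the variance.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 5)) * np.array([10.0, 5.0, 1.0, 1.0, 1.0])
processed = abtt(toy, d=2)
```

After the call, `processed` has no component left along the two dominant directions, so its rank drops from 5 to 3.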

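In practice, Mu and Viswanath (2018) found that the improvements yielded by ABTT are particularly impressive for word similarity tasks. Here, we provide a straightforward interpretation of the effects. Concretely, consider two arbitrary words $w_1$ and $w_2$ with word vectors $v_{w_1}$ and $v_{w_2}$. Without loss of generality, we assume $v_{w_1}$ and $v_{w_2}$ are normalized, i.e., $\|v_{w_1}\|_2 = \|v_{w_2}\|_2 = 1$. Given PCs $\{u_1, \ldots, u_n\}$ of the word vectors $\{v_w : w \in V\}$, we re-write $v_{w_1}$ and $v_{w_2}$ via linear combinations with respect to the basis $\{u_1, \ldots, u_n\}$: $v_{w_1} := \sum_{i=1}^{n} \beta_i u_i$ and $v_{w_2} := \sum_{i=1}^{n} \beta'_i u_i$, for some $\beta_i, \beta'_i \in \mathbb{R}$ and for all $i \in [n]$. We see

$$\cos(v_{w_1}, v_{w_2}) = \frac{v_{w_1}^\top v_{w_2}}{\|v_{w_1}\|_2 \, \|v_{w_2}\|_2} \overset{(*)}{=} v_{w_1}^\top v_{w_2} = \Big(\sum_{i=1}^{n} \beta_i u_i\Big)^{\top} \Big(\sum_{i=1}^{n} \beta'_i u_i\Big) \overset{(**)}{=} \sum_{i=1}^{n} \beta_i \beta'_i \tag{2}$$

where $(*)$ holds because the word vectors were assumed to be normalized and $(**)$ holds because $\{u_1, \ldots, u_n\}$ is an orthonormal basis of $\mathbb{R}^n$. Via Equation 2, the similarity between word $w_1$ and $w_2$ can be seen as the overall “compatibility” of their measurements $\beta_i$ and $\beta'_i$ with respect to each latent feature $u_i$. If leading PCs encode the word frequencies, removing the leading PCs should, in theory, help the word vectors capture semantic similarities, and consequently improve the experiment results of word similarity tasks.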
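
The decomposition of cosine similarity into per-feature products can be checked numerically. The snippet below is a small sketch under stated assumptions: a random orthonormal basis stands in for the PCs, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Columns of U form an orthonormal basis of R^n (a stand-in for the PCs).
U, _ = np.linalg.qr(rng.normal(size=(n, n)))

# Two normalized toy "word vectors".
v1 = rng.normal(size=n)
v2 = rng.normal(size=n)
v1 /= np.linalg.norm(v1)
v2 /= np.linalg.norm(v2)

beta1 = U.T @ v1  # coordinates of v1 in the basis
beta2 = U.T @ v2  # coordinates of v2 in the basis

cosine = v1 @ v2
per_feature = beta1 * beta2  # contribution of each latent feature to the similarity

# Inner product that remains once the first d features are dropped
# (not re-normalized; shown only to make the effect of PC removal visible).
d = 2
inner_without_top = per_feature[d:].sum()
```

Here `per_feature.sum()` equals `cosine` up to floating-point error, which is the content of the last equality in the derivation.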

Post-processing word vectors via Conceptor Negation

Removing the leading PCs of word vectors using the ABTT algorithm described above is effective in practice, as seen in the elaborate experiments conducted by Mu and Viswanath (2018). However, the method comes with a potential limitation: for each latent feature taking form as a PC of the word vectors, ABTT either completely removes the feature or keeps it intact. For this reason, Khodak et al. (2018) argued that ABTT is liable either to not remove enough noise or to cause too much information loss.

The objective of this paper is to address the limitations of ABTT. More concretely, we propose to use matrix conceptors [Jaeger2017] to gate away variances explained by the leading PCs of word vectors. As will be seen later, the proposed Conceptor Negation method removes noise in a “softer” manner when compared to ABTT. We show that it shares the spirit of an eigenvalue weighting approach for PMI-based word vector post-processing. We proceed by providing the technical background of conceptors.

Conceptors

Conceptors are a family of regularized identity maps introduced by Jaeger (2014). We present a sketch of conceptors by heavily re-using [Jaeger2014, He and Jaeger2018], sometimes verbatim. In brief, a matrix conceptor $C$ for some vector-valued random variable $x$ taking values in $\mathbb{R}^n$ is defined as a linear transformation that minimizes the following loss function:

$$\mathbb{E}_x\big[\|x - Cx\|_2^2\big] + \alpha^{-2} \|C\|_F^2 \tag{3}$$

where $\alpha$ is a control parameter called aperture, $\|\cdot\|_2$ is the $\ell_2$ norm, and $\|\cdot\|_F$ is the Frobenius norm. This optimization problem has a closed-form solution

$$C = R\,(R + \alpha^{-2} I)^{-1} \tag{4}$$

where $R = \mathbb{E}_x[x x^\top]$ and $I$ are $n \times n$ matrices. If $R = U \Sigma U^\top$ is the SVD of $R$, then the SVD of $C$ is given as $U S U^\top$, where the singular values $s_i$ of $C$ can be written in terms of the singular values $\sigma_i$ of $R$: $s_i = \sigma_i / (\sigma_i + \alpha^{-2}) \in [0, 1)$ for $\alpha \in (0, \infty)$. In intuitive terms, $C$ is a soft projection matrix on the linear subspace where the samples of $x$ lie, such that for a vector $y$ in this subspace, $C$ acts like the identity: $Cy \approx y$, and when some $\epsilon$ orthogonal to the subspace is added to $y$, $C$ reconstructs $y$: $C(y + \epsilon) \approx y$.
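
The closed-form solution and the relation between the spectra of $R$ and $C$ can be illustrated with a few lines of NumPy. This is a sketch under assumptions: the expectation is replaced by a sample average, and the function name `conceptor`, the aperture value, and the toy data are chosen only for the example.

```python
import numpy as np

def conceptor(samples: np.ndarray, alpha: float) -> np.ndarray:
    """C = R (R + alpha^-2 I)^-1, with R estimated as the sample mean of x x^T.

    samples: array of shape (num_samples, n), one sample per row (assumed layout).
    """
    n = samples.shape[1]
    R = samples.T @ samples / samples.shape[0]
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(n))

# Toy samples (hypothetical) with very different variances per direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 1.0, 0.3, 0.1])
alpha = 2.0

C = conceptor(X, alpha)
R = X.T @ X / X.shape[0]

sigma = np.linalg.eigvalsh(R)          # spectrum of R, ascending order
s = np.linalg.eigvalsh((C + C.T) / 2)  # spectrum of C, symmetrized against round-off
predicted = sigma / (sigma + alpha ** -2)
```

The arrays `s` and `predicted` agree numerically, and every entry lies in $[0, 1)$: directions with large variance get values close to 1, directions with small variance get values close to 0.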

Moreover, operations that satisfy most laws of Boolean logic, such as NOT $\neg$, OR $\vee$, and AND $\wedge$, can be defined on matrix conceptors. These operations all have an interpretation on the data level, i.e., on the distribution of the random variable $x$ (details in [Jaeger2014, Section 3.9]). Among these operations, the negation operation NOT $\neg$ is relevant for the current paper:

$$\neg C := I - C \tag{5}$$

Intuitively, the negated conceptor, $\neg C$, softly projects the data onto a linear subspace that can be roughly understood as the orthogonal complement of the subspace characterized by $C$.

Post-processing word vectors with Conceptor Negation

This subsection explains how conceptors can be used to post-process word vectors. The intuition behind our approach is simple. Consider a random variable $x$ taking values on word vectors $\{v_w \in \mathbb{R}^n : w \in V\}$. We can estimate a conceptor $C$ that describes the distribution of $x$ using Equation 4. Recall that [Mu and Viswanath2018] found that the directions along which $x$ has the highest variances encode word frequencies, which are unrelated to word semantics. To suppress such word-frequency related features, we can simply pass all word vectors through the negated conceptor $\neg C$, so that $\neg C$ dampens the directions along which $x$ has the highest variances. This simple method is summarized in Algorithm 2.

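Input : (i) $\{v_w \in \mathbb{R}^n : w \in V\}$: word vectors of a vocabulary $V$; (ii) $\alpha \in \mathbb{R}$: a hyper-parameter
1 Compute the conceptor $C$ from word vectors: $C = R\,(R + \alpha^{-2} I)^{-1}$, where $R$ is estimated by $\frac{1}{|V|} \sum_{w \in V} v_w v_w^\top$
2 Compute $\neg C := I - C$
3 Process the word vectors: $\tilde{v}_w^{\mathrm{CN}} := \neg C\, v_w$
Output : $\{\tilde{v}_w^{\mathrm{CN}} : w \in V\}$
Algorithm 2 The conceptor negation (CN) algorithm for word vector post-processing.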
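
Algorithm 2 translates almost line by line into NumPy. The following is a minimal sketch rather than the authors' released implementation: the function name `conceptor_negation`, the row-wise input layout, and the toy data are assumptions.

```python
import numpy as np

def conceptor_negation(vectors: np.ndarray, alpha: float) -> np.ndarray:
    """Conceptor negation sketch for word vectors stored as rows of `vectors`."""
    num_words, n = vectors.shape
    # Step 1: estimate R as the average outer product and form the conceptor C.
    R = vectors.T @ vectors / num_words
    C = R @ np.linalg.inv(R + alpha ** -2 * np.eye(n))
    # Step 2: negate the conceptor.
    not_C = np.eye(n) - C
    # Step 3: apply NOT C to every word vector (rows, hence the transpose).
    return vectors @ not_C.T

# Toy vectors (hypothetical): the first coordinate has by far the largest variance.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(500, 4)) * np.array([5.0, 1.0, 0.5, 0.1])
processed = conceptor_negation(toy_vectors, alpha=2.0)

print(toy_vectors.std(axis=0))  # spread per coordinate before post-processing
print(processed.std(axis=0))    # high-variance coordinates are damped the most
```

The high-variance coordinate shrinks by a much larger factor than the low-variance one, which is the soft suppression the algorithm is designed to achieve.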

The hyper-parameter $\alpha$ of Algorithm 2 governs the “sharpness” of the suppressing effects on word vectors employed by $\neg C$. Although in this work we are mostly interested in $\alpha \in (0, \infty)$, it is nonetheless illustrative to consider the extreme cases where $\alpha = 0$ or $\alpha = \infty$: for $\alpha = 0$, $\neg C$ will be an identity matrix, meaning that word vectors will be kept intact; for $\alpha = \infty$, $\neg C$ will be a zero matrix, meaning that all word vectors will be nulled to zero vectors. The computational costs of Algorithm 2 are dominated by its step 1: one needs to calculate the matrix product $[v_w]_{w \in V} [v_w]_{w \in V}^\top$ for $[v_w]_{w \in V}$ being the matrix whose columns are word vectors. Since modern word vectors usually come with a vocabulary of some millions of words (e.g., Google News Word2Vec contains 3 million tokens), performing a matrix product on such large matrices is computationally laborious. But considering that there are many uninteresting words in the vast vocabulary, we find it is empirically beneficial to only use a subset of the vocabulary, whose words are not too peculiar. (Footnote 2: This trick has also been used for ABTT by Mu and Viswanath (2018) (personal communications).) Specifically, borrowing the word list provided by Arora, Liang, and Ma (2017) (Footnote 3: https://github.com/PrincetonML/SIF/tree/master/auxiliary_data), we use the words that appear at least 200 times in a 2015 Wikipedia dump to estimate $R$. This greatly boosts the computation speed. Somewhat surprisingly, the trick also improves the performance of Algorithm 2. This might be due to the higher quality of word vectors of common words compared with infrequent ones.
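
The subset trick and the two limiting cases of the aperture can be sketched as follows. Everything here is illustrative: `frequent_idx` is a hypothetical stand-in for the indices of frequent words, and very small and very large aperture values are used to approximate the limits 0 and infinity.

```python
import numpy as np

def negated_conceptor(vectors: np.ndarray, alpha: float) -> np.ndarray:
    """Return NOT C = I - R (R + alpha^-2 I)^-1, with R estimated from the rows of `vectors`."""
    n = vectors.shape[1]
    R = vectors.T @ vectors / vectors.shape[0]
    return np.eye(n) - R @ np.linalg.inv(R + alpha ** -2 * np.eye(n))

rng = np.random.default_rng(0)
all_vectors = rng.normal(size=(1000, 8))  # toy stand-in for a full vocabulary
frequent_idx = np.arange(100)             # hypothetical: indices of frequent words

# Estimate NOT C on the frequent subset only, then apply it to every word vector.
not_C = negated_conceptor(all_vectors[frequent_idx], alpha=2.0)
processed = all_vectors @ not_C.T

# Limiting behaviour of the aperture, approximated numerically.
near_identity = negated_conceptor(all_vectors, alpha=1e-4)  # alpha close to 0
near_zero = negated_conceptor(all_vectors, alpha=1e4)       # alpha very large
```

Estimating $R$ from 100 rows instead of 1000 reduces the cost of the matrix product accordingly, while the resulting matrix is still applied to the whole vocabulary.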

Analytic comparison with other methods

Since most of the existing unsupervised word vector post-processing methods are ultimately based on linear data transformations, we hypothesize that there should be commonalities between the methods. In this section, we show CN resembles ABTT in that both methods can be interpreted as “spectral encode-decode processes”; when applied to word vectors induced by a pointwise mutual information (PMI) matrix, CN shares the spirit with the eigenvalue weighting (EW) post-processing [Caron2001, Levy, Goldberg, and Dagan2015]: they both assign weights on singular vectors of a PMI matrix. A key distinction of CN is that it does soft noise removal (unlike ABTT) and that it is not restricted to post-processing PMI-matrix induced word vectors (unlike EW).

Relation to ABTT

In this subsection, we reveal the connection between CN and ABTT. To do this, we will re-write the last step of both algorithms into different formats. For the convenience of comparison, throughout this section, we will assume that the word vectors in Algorithm 1 and Algorithm 2 possess a zero mean, although this is not a necessary requirement in general.

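We first re-write the equation in step 3 of Algorithm 1. We let $U$ be the matrix whose columns are the PCs estimated from the word vectors. Let $U_{:,1:d}$ be the first $d$ columns of $U$. It is clear that step 3 of Algorithm 1, under the assumption that word vectors possess zero mean, can be re-written as

$$\tilde{v}_w^{\mathrm{ABTT}} = v_w - U_{:,1:d}\, U_{:,1:d}^\top v_w = U \, \mathrm{diag}\big([\underbrace{0, \ldots, 0}_{d}, \underbrace{1, \ldots, 1}_{n-d}]\big) \, U^\top v_w \tag{6}$$

Next, we re-write step 3 of the Conceptor Negation (CN) method of Algorithm 2. Note that for word vectors with zero mean, the estimation for $R$ is a (sample) covariance matrix of a random variable taking values as word vectors, and therefore the singular vectors of $R$ are PCs of word vectors. Letting $R = U \Sigma U^\top$ be the SVD of $R$, the equation in step 3 of Algorithm 2 can be re-written via elementary linear algebraic operations:

$$\tilde{v}_w^{\mathrm{CN}} = \neg C\, v_w = (I - C)\, v_w = U \, \mathrm{diag}\Big(\Big[\frac{\alpha^{-2}}{\sigma_1 + \alpha^{-2}}, \ldots, \frac{\alpha^{-2}}{\sigma_n + \alpha^{-2}}\Big]\Big) \, U^\top v_w \tag{7}$$

where $\sigma_1 \geq \cdots \geq \sigma_n$ are the diagonal entries of $\Sigma$.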

Examining Equations 6 and 7, we see ABTT and CN share some similarities. In particular, they both can be unified into “spectral encode-decode processes,” which contain the following three steps:

  1. PC encoding. Load word vectors on PCs by multiplying $U^\top$ with $v_w$.

  2. Variance gating. Pass the PC-encoded data through the variance gating matrices $\mathrm{diag}([0, \ldots, 0, 1, \ldots, 1])$ and $\mathrm{diag}\big(\big[\frac{\alpha^{-2}}{\sigma_1 + \alpha^{-2}}, \ldots, \frac{\alpha^{-2}}{\sigma_n + \alpha^{-2}}\big]\big)$, respectively, for ABTT and CN.

  3. PC decoding. Transform the data back to the usual coordinates using the matrix $U$.

With the above encode-decode interpretation, we see that CN differs from ABTT in its variance gating step. In particular, ABTT does a hard gating, in the sense that the diagonal entries of the variance gating matrix (call them variance gating coefficients) take values in the set $\{0, 1\}$. The CN approach, on the other hand, does a softer gating as the entries take values in $(0, 1)$:

$$0 < \frac{\alpha^{-2}}{\sigma_i + \alpha^{-2}} < 1$$

for all $i \in [n]$ with $\sigma_i > 0$ and all $\alpha \in (0, \infty)$. To illustrate the gating effects, we plot the variance gating coefficients for ABTT and CN for Word2Vec in Figure 1.

Figure 1: The variance gating coefficients of ABTT and CN for Word2Vec. Hyper-parameters: for ABTT and for CN.
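
The difference between hard and soft gating is easy to see on a toy spectrum. The sketch below is illustrative only: the eigenvalues, the number of removed PCs, and the aperture are made-up values, not those used for Figure 1.

```python
import numpy as np

def abtt_gates(n: int, d: int) -> np.ndarray:
    """Hard gating: 0 for the first d PCs, 1 for the rest."""
    gates = np.ones(n)
    gates[:d] = 0.0
    return gates

def cn_gates(sigma: np.ndarray, alpha: float) -> np.ndarray:
    """Soft gating: alpha^-2 / (sigma_i + alpha^-2) for each eigenvalue sigma_i of R."""
    return alpha ** -2 / (sigma + alpha ** -2)

# Toy eigenvalues of R in descending order (hypothetical values).
sigma = np.array([4.0, 1.0, 0.25, 0.05])

hard = abtt_gates(len(sigma), d=1)
soft = cn_gates(sigma, alpha=2.0)
print(hard)
print(soft)
```

The hard gates jump from 0 to 1 at the cut-off, whereas the soft gates rise gradually as the eigenvalues shrink, so no feature is removed entirely and none is left fully intact.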

Relation with eigenvalue weighting

We relate the conceptor approach to the eigenvalue weighting approach for post-processing PMI-based word vectors. This effort is in line with the ongoing research in the NLP community that envisages a connection between “neural word embedding” and PMI-matrix factorization based word embedding [Levy and Goldberg2014, Pennington, Socher, and Manning2014, Levy, Goldberg, and Dagan2015].

In the PMI approach for word association modeling, for each word $w$ and each context (i.e., sequence of words) $q$, the PMI matrix $M$ assigns a value for the pair $(w, q)$: $M(w, q) = \log \frac{p(w, q)}{p(w)\, p(q)}$. In practical NLP tasks, the sets of words and contexts tend to be large, and therefore, directly working with $M$ is inconvenient. To alleviate the problem, one way is to perform a truncated SVD on $M$, factorizing $M$ into the product of three smaller matrices $M \approx \Theta \Lambda \Gamma^\top$, where $\Theta$ contains the first $n$ left singular vectors of the matrix $M$, $\Lambda$ is the diagonal matrix containing the $n$ leading singular values of $M$, and $\Gamma$ contains the first $n$ right singular vectors of the matrix $M$. A generic way to induce word vectors from $M$ is to let

$$E := \Theta \Lambda,$$

which is a matrix containing word vectors as rows. Coined by Levy, Goldberg, and Dagan (2015), the term eigenvalue weighting (EW) refers to a post-processing technique for PMI-matrix-induced word vectors. (Footnote 4: It seems to us that the term “singular value weighting” is more appropriate because the weighting is based on singular values of a PMI matrix $M$ but not eigenvalues of $M$. The term “eigenvalue” is relevant here only because the singular values of $M$ are also the square roots of eigenvalues of $M^\top M$.) This technique has its roots in Latent Semantic Analysis (LSA): Caron (2001) first proposed to define the post-processed version of $E$ as

$$\tilde{E}^{\mathrm{EW}} := \Theta \Lambda^{p},$$

where $p$ is the weighting exponent determining the relative weights assigned to each singular vector of $M$. While an optimal $p$ depends on specific task demands, previous research suggests that $p < 1$ is generally preferred, i.e., the contributions of the initial singular vectors of $M$ should be suppressed. For instance, such exponents are recommended in [Caron2001, Bullinaria and Levy2012, Levy, Goldberg, and Dagan2015]. Bullinaria and Levy (2012) argue that the initial singular vectors of $M$ tend to be contaminated most by aspects other than lexical semantics.

We now show that applying CN to PMI-matrix-based word embeddings has an effect tantamount to the “suppressing initial singular vectors” of EW. Applying the negated conceptor $\neg C$ to the word vectors of $E$ (i.e., rows of $E$), we get the post-processed word vectors as rows of $\tilde{E}^{\mathrm{CN}}$:

$$\tilde{E}^{\mathrm{CN}} := (\neg C\, E^\top)^\top = E\,(I - C) = \Theta \, \mathrm{diag}\Big(\Big[\frac{\alpha^{-2} \lambda_1}{\lambda_1^2/|V| + \alpha^{-2}}, \ldots, \frac{\alpha^{-2} \lambda_n}{\lambda_n^2/|V| + \alpha^{-2}}\Big]\Big),$$

where $\lambda_1 \geq \cdots \geq \lambda_n$ are the diagonal entries of $\Lambda$, and where we used that $R$ is estimated by $\frac{1}{|V|} E^\top E = \Lambda^2 / |V|$. Since

$$0 < \frac{\alpha^{-2}}{\lambda_i^2/|V| + \alpha^{-2}} < 1$$

for all $i \in [n]$ and $\alpha \in (0, \infty)$, and since this factor shrinks as $\lambda_i$ grows, these weights suppress the contribution of the initial singular vectors, similar to what has been done in EW.
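
The comparison between eigenvalue weighting and conceptor negation on SVD-induced word vectors can be reproduced on a toy matrix. This is a sketch under assumptions: a random matrix stands in for a PMI matrix, the exponent 0.5 and the aperture 2.0 are arbitrary illustrative choices, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 20))  # random stand-in for a (words x contexts) PMI matrix

# Truncated SVD: keep the n leading singular vectors and singular values.
theta, lam, _ = np.linalg.svd(M, full_matrices=False)
n = 5
theta, lam = theta[:, :n], lam[:n]

E = theta * lam            # word vectors as rows (left singular vectors scaled by singular values)
E_ew = theta * lam ** 0.5  # eigenvalue weighting with an illustrative exponent of 0.5

# Conceptor negation applied to the rows of E.
alpha = 2.0
num_words = E.shape[0]
R = E.T @ E / num_words
not_C = np.eye(n) - R @ np.linalg.inv(R + alpha ** -2 * np.eye(n))
E_cn = E @ not_C.T

# Closed-form gating factor applied to each singular value by conceptor negation.
gate = alpha ** -2 / (lam ** 2 / num_words + alpha ** -2)
```

Numerically, `E_cn` coincides with `theta * (lam * gate)`, and `gate` is smallest for the largest singular values, so both methods down-weight the leading singular vectors relative to `E`.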

Experiments

We evaluate the post-processed word vectors on a variety of lexical-level intrinsic tasks and a down-stream deep learning task. We use the publicly available pre-trained Google News Word2Vec [Mikolov et al.2013] (Footnote 5: https://code.google.com/archive/p/word2vec/) and Common Crawl GloVe [Pennington, Socher, and Manning2014] (Footnote 6: https://nlp.stanford.edu/projects/glove/) to perform lexical-level experiments. For CN, we fix the aperture $\alpha$ to a single value for both Word2Vec and GloVe throughout the experiments. (Footnote 7: Analytical optimization methods for the aperture are available from [Jaeger2014]; connecting them with the word vector post-processing scheme remains future work.) For ABTT, we set the number of removed PCs $d$ separately for Word2Vec and for GloVe, as suggested by Mu and Viswanath (2018).

Word similarity

We test the performance of CN on seven benchmarks that have been widely used to measure word similarity: the RG65 [Rubenstein and Goodenough1965], the WordSim-353 (WS) [Finkelstein et al.2002], the rare-words (RW) [Luong, Socher, and Manning2013], the MEN dataset [Bruni, Tran, and Baroni2014], the MTurk [Radinsky et al.2011], the SimLex-999 (SimLex) [Hill, Reichart, and Korhonen2015], and the SimVerb-3500 [Gerz et al.2016]. To evaluate the word similarity, we calculate the cosine similarity between the vectors of two words using Equation 2. We report the Spearman’s rank correlation coefficient [Myers and Well1995] of the estimated rankings against the rankings by humans in Table 1. We see that the proposed CN method outperforms the original word embedding (orig.) and the word embedding post-processed by ABTT on most of the benchmarks.

           WORD2VEC                 GLOVE
           orig.   ABTT    CN       orig.   ABTT    CN
RG65       76.08   78.34   78.92    76.96   74.36   78.40
WS         68.29   69.05   69.30    73.79   76.79   79.08
RW         53.74   54.33   58.04    46.41   52.04   58.98
MEN        78.20   79.08   78.67    80.49   81.78   83.38
MTurk      68.23   69.35   66.81    69.29   70.85   71.07
SimLex     44.20   45.10   46.82    40.83   44.97   48.53
SimVerb    36.35   36.50   38.30    28.33   32.23   36.36
Table 1: Post-processing results (Spearman’s rank correlation coefficient × 100) under seven word similarity benchmarks. The baseline results (orig. and ABTT) are collected from [Mu and Viswanath2018].
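
The evaluation protocol behind Table 1 (cosine similarity per word pair, then Spearman correlation against human ratings) can be sketched as follows. The word vectors and human scores below are made up for illustration, and the rank computation assumes there are no ties.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman's rho as the Pearson correlation of ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical two-dimensional word vectors.
vectors = {
    "cup": np.array([1.0, 0.0]),
    "mug": np.array([0.9, 0.1]),
    "coffee": np.array([1.0, 1.0]),
    "tree": np.array([0.0, 1.0]),
}
# Hypothetical human similarity ratings for three word pairs.
pairs = [("cup", "mug", 9.0), ("cup", "coffee", 5.0), ("cup", "tree", 1.0)]

model_scores = np.array([cosine(vectors[a], vectors[b]) for a, b, _ in pairs])
human_scores = np.array([score for _, _, score in pairs])
rho = spearman(model_scores, human_scores)
```

In this toy case the model ranks the pairs in the same order as the human ratings, so `rho` is 1 up to floating-point error; on a real benchmark one would typically use a tie-aware rank correlation routine instead.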

The improvements yielded by CN are particularly impressive for two “modern” word similarity benchmarks, SimLex and SimVerb – these two benchmarks carefully distinguish genuine word similarity from conceptual association [Hill, Reichart, and Korhonen2015]. For instance, coffee is associated with cup but by no means similar to cup, a confusion often made by earlier benchmarks. In particular, SimLex has been heavily used to evaluate word vectors yielded by supervised word vector fine-tuning algorithms, which perform gradient descent on word vectors with respect to linguistic constraints such as synonym and antonym relationships of words extracted from WordNet and/or PPDB. Our results on SimLex are comparable to those of the recent supervised counter-fitting approach reported by Mrksic et al. (2016), as shown in Table 2.

Post-processing method                          WORD2VEC   GLOVE
supervised     Counter-Fitting + syn.           0.45       0.46
               Counter-Fitting + ant.           0.33       0.43
               Counter-Fitting + syn. + ant.    0.47       0.50
unsupervised   CN                               0.47       0.49
Table 2: Comparing the testing results (Spearman’s rank correlation coefficient) on SimLex with those of the Counter-Fitting approach (results collected from [Mrksic et al.2016, Table 2] and [Mrksic et al.2017, Table 3]). The linguistic constraints for Counter-Fitting are synonym (syn.) and/or antonym (ant.) relationships extracted from English PPDB.

Semantic Textual Similarity

In this subsection, we showcase the effectiveness of the proposed post-processing method using semantic textual similarity (STS) benchmarks, which are designed to test the semantic similarities of sentences. We use the 2012-2015 SemEval STS tasks [Agirre et al.2012, Agirre et al.2013, Agirre et al.2014, Agirre et al.2015] and the 2014 SemEval Semantic Relatedness task (SICK) [Marelli et al.2014].

Concretely, for each pair of sentences, $s_1$ and $s_2$, we computed $v_{s_1}$ and $v_{s_2}$ by averaging their constituent word vectors. We then calculated the cosine similarity between the two sentence vectors $v_{s_1}$ and $v_{s_2}$. This naive method has been shown to be a strong baseline for STS tasks [Wieting et al.2016]. As in Agirre et al. (2012), we used the Pearson correlation of the estimated sentence similarities against the human ratings to assess model performance.
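
The averaging baseline described above can be sketched in a few lines. The vocabulary, the three-dimensional vectors, the sentence pairs, and the gold scores are all made up for illustration; skipping out-of-vocabulary words is an assumption of this sketch, not something stated in the text.

```python
import numpy as np

def sentence_vector(sentence: str, vectors: dict) -> np.ndarray:
    """Average the vectors of the in-vocabulary words of a whitespace-tokenized sentence."""
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors.
vectors = {
    "a": np.array([0.1, 0.1, 0.1]),
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "sleeps": np.array([0.0, 1.0, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
    "drives": np.array([0.0, 0.2, 1.0]),
}
# Hypothetical sentence pairs with gold similarity scores.
pairs = [
    ("a cat sleeps", "a dog sleeps", 4.5),
    ("a cat sleeps", "a car drives", 0.5),
    ("a dog sleeps", "a dog sleeps", 5.0),
]

predicted = np.array([
    cosine(sentence_vector(s1, vectors), sentence_vector(s2, vectors))
    for s1, s2, _ in pairs
])
gold = np.array([score for _, _, score in pairs])
pearson = float(np.corrcoef(predicted, gold)[0, 1])
```

Post-processing enters this pipeline only through the `vectors` dictionary: replacing its entries with ABTT- or CN-processed vectors leaves the rest of the evaluation unchanged.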

In Table 3, we report the average result for the STS tasks each year (detailed results are in the supplemental material). Again, our CN method outperforms the alternatives in most cases.

            WORD2VEC                 GLOVE
            orig.   ABTT    CN       orig.   ABTT    CN
STS 2012    57.22   57.67   54.31    48.27   54.06   54.38
STS 2013    56.81   57.98   59.17    44.83   51.71   55.51
STS 2014    62.89   63.30   66.22    51.11   59.23   62.66
STS 2015    62.74   63.35   67.15    47.23   57.29   63.74
SICK        70.10   70.20   72.71    65.14   67.85   66.42
Table 3: Post-processing results (× 100) on the semantic textual similarity tasks. The baseline results (orig. and ABTT) are collected from [Mu and Viswanath2018].

Concept Categorization

In the concept categorization task, we used $k$-means to cluster words into concept categories based on their vector representations (for example, “bear” and “cat” belong to the concept category of animals). We use three standard datasets: (i) a rather small dataset, ESSLLI 2008 [Baroni, Evert, and Lenci2008], that contains 44 concepts in 9 categories; (ii) the Almuhareb-Poesio (AP) dataset [Poesio and Almuhareb2005], which contains 402 concepts divided into 21 categories; and (iii) the BM dataset [Battig and Montague1969], which contains 5321 concepts divided into 56 categories. Note that the datasets of ESSLLI, AP, and BM are increasingly challenging for clustering algorithms, due to the increasing numbers of words and categories.

Following [Baroni, Dinu, and Kruszewski2014, Schnabel et al.2015, Mu and Viswanath2018], we used “purity” of clusters [Manning, Raghavan, and Schütze2008, Section 16.4] as the evaluation criterion. Note that the results of $k$-means depend heavily on two hyper-parameters: (i) the number of clusters and (ii) the initial centroids of clusters. We follow previous research [Baroni, Dinu, and Kruszewski2014, Schnabel et al.2015, Mu and Viswanath2018] to set $k$ as the ground-truth number of categories. The settings of the initial centroids of clusters, however, are less well-documented in previous work – it is not clear how many initial centroids have been sampled, or if different centroids have been sampled at all. To avoid the influences of initial centroids in $k$-means (which are particularly undesirable for this case because word vectors live in the high-dimensional space $\mathbb{R}^n$), in this work, we simply fixed the initial centroids as the average of original, ABTT-processed, and CN-processed word vectors respectively from ground-truth categories. This initialization is fair because all post-processing methods make use of the ground-truth information equally, similar to the usage of the ground-truth numbers of clusters. We report the experiment results in Table 4.

          WORD2VEC                 GLOVE
          orig.   ABTT    CN       orig.   ABTT    CN
ESSLLI    100.0   100.0   100.0    100.0   100.0   100.0
AP        87.28   88.3    89.31    86.43   87.19   90.95
BM        58.15   59.24   60.19    65.34   67.35   67.63
Table 4: Purity (× 100) of the clusters in concept categorization task with fixed centroids.
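
The clustering protocol (k-means initialized at the ground-truth category means, scored by purity) can be sketched with plain NumPy. This is an illustrative sketch on synthetic, well-separated data; the function names, the number of iterations, and the toy data are assumptions.

```python
import numpy as np

def kmeans_fixed_init(X: np.ndarray, centroids: np.ndarray, num_iter: int = 20) -> np.ndarray:
    """Lloyd's algorithm started from the given centroids; returns cluster assignments."""
    centroids = centroids.copy()
    for _ in range(num_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return assign

def purity(assign: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of points that carry the majority label of their cluster."""
    total = 0
    for k in np.unique(assign):
        total += np.bincount(labels[assign == k]).max()
    return total / len(labels)

# Synthetic data (hypothetical): three well-separated categories in two dimensions.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 10)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = means[labels] + rng.normal(scale=0.5, size=(30, 2))

# Initial centroids fixed to the per-category averages, as in the protocol above.
init = np.stack([X[labels == c].mean(axis=0) for c in range(3)])
assign = kmeans_fixed_init(X, init)
score = purity(assign, labels)
```

Because the initialization is deterministic, repeated runs give the same clustering, which removes the dependence on random centroid sampling discussed above.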

The proposed method and the baseline methods performed equally well on the smallest dataset, ESSLLI. As the datasets got larger, the results differed and the proposed CN approach outperformed the baselines.

A Downstream NLP task: Neural Belief Tracker

The experiments we have reported so far are all intrinsic lexical evaluation benchmarks. Only evaluating the post-processed word vectors using these benchmarks, however, invites an obvious critique: the success of intrinsic evaluation tasks may not transfer to downstream NLP tasks, as suggested by previous research [Schnabel et al.2015]. Indeed, when supervised learning tasks are performed, post-processing methods such as ABTT and CN can in principle be absorbed into a classifier such as a neural network. Nevertheless, good initialization for classifiers is crucial. We hypothesize that the post-processed word vectors serve as a good initialization for those downstream NLP tasks in which the semantic knowledge contained in word vectors is needed.

To validate this hypothesis, we conducted an experiment using Neural Belief Tracker (NBT), a deep neural network based dialogue state tracking (DST) model [Mrksic et al.2017, Mrkšić and Vulić2018]. As a concrete example to illustrate the purpose of the task, consider a dialogue system designed to help users find restaurants. When a user wants to find a Sushi restaurant, the system is expected to know that Japanese restaurants have a higher probability to be a good recommendation than Italian restaurants or Thai restaurants. Word vectors are important for this task because NBT needs to absorb useful semantic knowledge from word vectors using a neural network.

In our experiment with NBT, we used the model specified in [Mrkšić and Vulić2018] with default hyper-parameter settings (Footnote 8: https://github.com/nmrksic/neural-belief-tracker). We report the goal accuracy, a default DST performance measure, defined as the proportion of dialogue turns where all the user’s search goal constraints match with the model predictions. The test data was Wizard-of-Oz (WOZ) 2.0 [Wen et al.2017], where the goal constraints of users were divided into three domains: food, price range, and area. The experiment results are reported in Table 5.

               WORD2VEC                 GLOVE
               orig.   ABTT    CN       orig.   ABTT    CN
Food           48.6    84.7    78.5     86.4    83.7    88.8
Price range    90.2    88.1    92.2     91.0    93.9    94.7
Area           83.1    82.4    86.1     93.5    94.9    93.7
Average        74.0    85.1    85.6     90.3    90.8    92.4
Table 5: The goal accuracy of food, price range, and area.

Further discussions

Besides the NBT task, we have also tested the ABTT and CN methods on other downstream NLP tasks such as text classification (not reported). We found that ABTT and CN yield equivalent results in such tasks. One explanation is that the ABTT and CN post-processed word vectors differ only up to a small perturbation. With a sufficient amount of training data and an appropriate regularization method, a neural network should generalize over such a perturbation. With relatively little training data (e.g., the 600 dialogues for training the NBT task), however, we found that the initialization provided by the word vectors matters, and in such cases, CN post-processed word vectors yield favorable results. Another interesting finding is that, having tested ABTT and CN on Fasttext [Bojanowski et al.2017], we found that neither post-processing method provides a visible gain. We hypothesize that this might be because Fasttext includes subword (character-level) information in its word representation during training, which suppresses the word frequency features contained in word vectors. It remains for future work to validate this hypothesis.

Conclusion

We propose a simple yet effective method for post-processing word vectors via the negation operation of conceptors. With a battery of intrinsic evaluation tasks and a down-stream deep-learning empowered dialogue state tracking task, the proposed method enhances linguistic regularities captured by word vectors and consistently improves performance over existing alternatives.

There are several possibilities for future work. We envisage that the logical operations and abstract ordering admitted by conceptors can be used in other NLP tasks. As concrete examples, the AND operation can be potentially applied to induce and fine-tune bi-lingual word vectors, by mapping word representations of individual languages into a shared linear space; the OR together with NOT operation can be used to study the vector representations of polysemous words, by joining and deleting sense-specific vector representations of words; the abstraction ordering is a natural tool to study graded lexical entailment of words.

Acknowledgement

We thank the anonymous reviewers for their constructive comments. We thank Xu He, Jordan Rodu, Daphne Ippolito, and Chris Callison-Burch for helpful discussions.

References

  • [Agirre et al.2012] Agirre, E.; Diab, M.; Cer, D.; and Gonzalez-Agirre, A. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, SemEval ’12, 385–393. Stroudsburg, PA, USA: Association for Computational Linguistics.
  • [Agirre et al.2013] Agirre, E.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; and Guo, W. 2013. Sem 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics, volume 1, 32–43.
  • [Agirre et al.2014] Agirre, E.; Banea, C.; Cardie, C.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Guo, W.; Mihalcea, R.; Rigau, G.; and Wiebe, J. 2014. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation, 81–91.
  • [Agirre et al.2015] Agirre, E.; Banea, C.; Cardie, C.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Guo, W.; Lopez-Gazpio, I.; Maritxalar, M.; Mihalcea, R.; Rigau, G.; Uria, L.; and Wiebe, J. 2015. Semeval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation, 252–263.
  • [Arora, Liang, and Ma2017] Arora, S.; Liang, Y.; and Ma, T. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
  • [Baroni, Dinu, and Kruszewski2014] Baroni, M.; Dinu, G.; and Kruszewski, G. 2014. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the ACL 2014, 238–247. Association for Computational Linguistics.
  • [Baroni, Evert, and Lenci2008] Baroni, M.; Evert, S.; and Lenci, A. 2008. Verb categorization: a shared tasks from the ESSLLI 2008 workshop.
  • [Battig and Montague1969] Battig, W. F., and Montague, W. E. 1969. Category norms of verbal items in 56 categories: a replication and extension of the Connecticut category norms. Journal of Experimental Psychology 80(3p2):1.
  • [Bojanowski et al.2017] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.
  • [Bruni, Tran, and Baroni2014] Bruni, E.; Tran, N. K.; and Baroni, M. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research 49(1):1–47.
  • [Bullinaria and Levy2012] Bullinaria, J. A., and Levy, P. J. 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods 44(3):890–907.
  • [Caron2001] Caron, J. 2001. Experiments with LSA scoring: Optimal rank and basis. Computational Information Retrieval 157–169.
  • [Dhillon, Foster, and Ungar2015] Dhillon, P. S.; Foster, D. P.; and Ungar, L. H. 2015. Eigenwords: Spectral word embeddings. Journal of Machine Learning Research 16:3035–3078.
  • [Faruqui et al.2015] Faruqui, M.; Dodge, J.; Jauhar, S. K.; Dyer, C.; Hovy, E. H.; and Smith, N. A. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the NAACL HLT 2015, 1606–1615.
  • [Finkelstein et al.2002] Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; and Ruppin, E. 2002. Placing search in context: the concept revisited. ACM Transactions on Information Systems 20(1):116–131.
  • [Firth1957] Firth, J. R. 1957. A synopsis of linguistic theory 1930-55. In Studies in Linguistic Analysis (special volume of the Philological Society), volume 1952-59, 1–32. Oxford: The Philological Society.
  • [Gerz et al.2016] Gerz, D.; Vulic, I.; Hill, F.; Reichart, R.; and Korhonen, A. 2016. SimVerb-3500: a large-scale evaluation set of verb similarity. In Proceedings of the EMNLP 2016, 2173–2182.
  • [He and Jaeger2018] He, X., and Jaeger, H. 2018. Overcoming catastrophic interference using conceptor-aided backpropagation. In International Conference on Learning Representations.
  • [Hill, Reichart, and Korhonen2015] Hill, F.; Reichart, R.; and Korhonen, A. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665–695.
  • [Jaeger2014] Jaeger, H. 2014. Controlling recurrent neural networks by conceptors. Technical report, Jacobs University Bremen.
  • [Jaeger2017] Jaeger, H. 2017. Using conceptors to manage neural long-term memories for temporal patterns. Journal of Machine Learning Research 18(13):1–43.
  • [Khodak et al.2018] Khodak, M.; Saunshi, N.; Liang, Y.; Ma, T.; Stewart, B.; and Arora, S. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of ACL.
  • [Levy and Goldberg2014] Levy, O., and Goldberg, Y. 2014. Neural word embedding as implicit matrix factorization. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2177–2185.
  • [Levy, Goldberg, and Dagan2015] Levy, O.; Goldberg, Y.; and Dagan, I. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.
  • [Luong, Socher, and Manning2013] Luong, M.; Socher, R.; and Manning, C. D. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the CoNLL 2013.
  • [Manning, Raghavan, and Schütze2008] Manning, C. D.; Raghavan, P.; and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
  • [Marelli et al.2014] Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; and Zamparelli, R. 2014. A sick cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA).
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 26. Curran Associates, Inc. 3111–3119.
  • [Miller1995] Miller, G. A. 1995. Wordnet: A lexical database for English. Communications of the ACM 38(11):39–41.
  • [Mrkšić and Vulić2018] Mrkšić, N., and Vulić, I. 2018. Fully statistical neural belief tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 108–113. Association for Computational Linguistics.
  • [Mrksic et al.2016] Mrksic, N.; Séaghdha, D.; Thomson, B.; Gasic, M.; Rojas-Barahona, L. M.; Su, P.; Vandyke, D.; Wen, T.; and Young, S. J. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the NAACL HLT 2016, 142–148.
  • [Mrksic et al.2017] Mrksic, N.; Vulic, I.; Séaghdha, D. Ó.; Leviant, I.; Reichart, R.; Gasic, M.; Korhonen, A.; and Young, S. J. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. TACL 5:309–324.
  • [Mu and Viswanath2018] Mu, J., and Viswanath, P. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
  • [Myers and Well1995] Myers, J. L., and Well, A. D. 1995. Research Design & Statistical Analysis. Routledge, 1st edition.
  • [Pavlick et al.2015] Pavlick, E.; Rastogi, P.; Ganitkevitch, J.; Durme, B. V.; and Callison-Burch, C. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the ACL 2015 (Volume 2: Short Papers), 425–430. Beijing, China: Association for Computational Linguistics.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP, 1532–1543.
  • [Poesio and Almuhareb2005] Poesio, M., and Almuhareb, A. 2005. Identifying concept attributes using a classifier. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, 18–27. Stroudsburg, PA, USA: Association for Computational Linguistics.
  • [Radinsky et al.2011] Radinsky, K.; Agichtein, E.; Gabrilovich, E.; and Markovitch, S. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International World Wide Web Conference, 337–346.
  • [Rubenstein and Goodenough1965] Rubenstein, H., and Goodenough, J. B. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627–633.
  • [Schnabel et al.2015] Schnabel, T.; Labutov, I.; Mimno, D. M.; and Joachims, T. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of EMNLP 2015, 298–307.
  • [Turney2012] Turney, P. D. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research 44(1):533–585.
  • [Wen et al.2017] Wen, T.; Vandyke, D.; Mrkšić, N.; Gašić, M.; Rojas-Barahona, L. M.; Su, P.; Ultes, S.; and Young, S. 2017. A network-based end-to-end trainable task-oriented dialogue system. In EACL, 438–449. Valencia, Spain: Association for Computational Linguistics.
  • [Wieting et al.2016] Wieting, J.; Bansal, M.; Gimpel, K.; and Livescu, K. 2016. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations.

Appendix

Detailed experiments in Semantic Textual Similarity (STS) tasks

In the main body of the paper, we reported the average results for the STS tasks by year. A detailed list of the STS tasks is given in Table 6; the datasets can be downloaded from http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark.

| STS 2012 | STS 2013 | STS 2014 | STS 2015 |
|---|---|---|---|
| MSRpar | FNWN | deft forum | answers-forums |
| MSRvid | OnWN | deft news | answers-students |
| OnWN | headlines | headline | belief |
| SMTeuroparl | | images | headline |
| SMTnews | | OnWN | images |
| | | tweet news | |

Table 6: STS tasks in the years 2012–2015. Note that tasks with shared names in different years are different tasks.

We report the detailed experiment results for the above STS tasks in Table 7.

| Task | Word2Vec orig. | Word2Vec ABTT | Word2Vec CN | GloVe orig. | GloVe ABTT | GloVe CN |
|---|---|---|---|---|---|---|
| MSRpar | 42.12 | 43.85 | 40.30 | 44.54 | 44.09 | 41.19 |
| MSRvid | 72.07 | 72.16 | 75.22 | 64.47 | 68.05 | 62.50 |
| OnWN | 69.38 | 69.48 | 70.82 | 53.07 | 65.67 | 67.96 |
| SMTeuroparl | 53.15 | 54.32 | 35.14 | 41.74 | 45.28 | 52.58 |
| SMTnews | 49.37 | 48.53 | 50.08 | 37.54 | 47.22 | 47.69 |
| STS 2012 | 57.22 | 57.67 | 54.31 | 48.27 | 54.06 | 54.38 |
| FNWN | 40.70 | 41.96 | 43.99 | 37.54 | 39.34 | 42.07 |
| OnWN | 67.87 | 68.17 | 68.76 | 47.22 | 58.60 | 57.45 |
| headlines | 61.88 | 63.81 | 64.78 | 49.73 | 57.20 | 67.00 |
| STS 2013 | 56.81 | 57.98 | 59.17 | 44.83 | 51.71 | 55.51 |
| OnWN | 74.61 | 74.78 | 75.08 | 57.41 | 67.56 | 66.43 |
| deft-forum | 32.19 | 33.26 | 42.80 | 21.55 | 29.39 | 37.57 |
| deft-news | 66.83 | 65.96 | 65.57 | 65.14 | 71.45 | 69.08 |
| headlines | 58.01 | 59.58 | 61.09 | 47.05 | 52.60 | 61.71 |
| images | 73.75 | 74.17 | 78.24 | 57.22 | 68.28 | 65.81 |
| tweet-news | 71.92 | 72.07 | 74.55 | 58.32 | 66.13 | 75.37 |
| STS 2014 | 62.89 | 63.30 | 66.22 | 51.11 | 59.23 | 62.66 |
| forum | 46.35 | 46.80 | 53.66 | 30.02 | 39.86 | 48.62 |
| students | 68.07 | 67.99 | 71.45 | 49.20 | 62.38 | 69.68 |
| belief | 59.72 | 60.42 | 61.29 | 44.05 | 57.68 | 59.77 |
| headlines | 61.47 | 63.45 | 68.88 | 46.22 | 53.31 | 69.18 |
| images | 78.09 | 78.08 | 80.48 | 66.63 | 73.20 | 71.43 |
| STS 2015 | 62.74 | 63.35 | 67.15 | 47.23 | 57.29 | 63.74 |
| SICK | 70.10 | 70.20 | 72.71 | 65.14 | 67.85 | 66.42 |

Table 7: Before-After results (×100) on the semantic textual similarity tasks.