Introduction
Distributional representations of words, better known as word vectors, are a cornerstone of practical natural language processing (NLP). Examples of word vectors include Word2Vec [Mikolov et al.2013], GloVe [Pennington, Socher, and Manning2014], Eigenwords [Dhillon, Foster, and Ungar2015], and Fasttext [Bojanowski et al.2017]. These word vectors are usually referred to as distributional word vectors, as their training methods rely on the distributional hypothesis of semantics [Firth1957].
Recently, there has been interest in postprocessing distributional word vectors to enrich their semantic content. The postprocessing procedures are usually performed in a lightweight fashion, i.e., without retraining word vectors on a text corpus. In one line of study, researchers used supervised methods to enforce linguistic constraints (e.g., synonym relations) on word vectors [Faruqui et al.2015, Mrksic et al.2016, Mrksic et al.2017], where the linguistic constraints are extracted from an external linguistic knowledge base such as WordNet [Miller1995] or PPDB [Pavlick et al.2015]. In another line of study, researchers devised unsupervised methods to postprocess word vectors. Spectral-decomposition methods such as singular value decomposition (SVD) and principal component analysis (PCA) are commonly used in this line of research [Caron2001, Bullinaria and Levy2012, Turney2012, Levy and Goldberg2014, Levy, Goldberg, and Dagan2015, Mu and Viswanath2018]. The current paper is in line with the second, unsupervised, research direction.
Among the different unsupervised word vector postprocessing schemes, the all-but-the-top approach [Mu and Viswanath2018] is a prominent example. Empirically studying the latent features encoded by the principal components (PCs) of distributional word vectors, Mu and Viswanath (2018) found that the variances explained by the leading PCs “encode the frequency of the word to a significant degree”. Since word frequencies are arguably unrelated to lexical semantics, they recommend removing such leading PCs from word vectors using a PCA reconstruction.
The current work advances the findings of Mu and Viswanath (2018) and improves their postprocessing scheme. Instead of discarding a fixed number of PCs, we softly filter word vectors using matrix conceptors [Jaeger2014, Jaeger2017], which characterize the linear space of those word vector features having high variances – the features most contaminated by word frequencies according to Mu and Viswanath (2018). The proposed approach is mathematically simple and computationally efficient, as it is founded on elementary linear algebra. Besides these traits, it is also practically effective: using a standard set of lexical-level intrinsic evaluation tasks and a deep neural-network-based dialogue state tracking task, we show that conceptor-based postprocessing considerably enhances the linguistic regularities captured by word vectors. A more detailed list of our contributions is as follows:

We propose an unsupervised algorithm that leverages the Boolean operations of conceptors to postprocess word vectors. The resulting word vectors achieve up to 18.86% and 28.34% improvements on the SimLex-999 and SimVerb-3500 datasets, respectively, relative to the original word representations.

A closer look at the proposed algorithm reveals commonalities across several existing postprocessing techniques for neural-network-based word vectors and pointwise mutual information (PMI) matrix-based word vectors. Unlike the existing alternatives, the proposed approach is flexible enough to remove lexically unrelated noise, while general-purpose enough to handle word vectors induced by different learning algorithms.
The rest of the paper is organized as follows. We first briefly review the principal component nulling approach for unsupervised word vector postprocessing introduced in [Mu and Viswanath2018], upon which our work is based. We then introduce our proposed approach, Conceptor Negation (CN). Analytically, we reveal the links and differences between the CN approach and existing alternatives. Finally, we showcase the effectiveness of the CN method with numerical experiments (our code is available at https://github.com/liutianlin0121/ConceptorNegationWV).
Notation
We assume a collection of words $\{w_1, \cdots, w_n\} \subset \mathcal{V}$, where $\mathcal{V}$ is a vocabulary set. Each word $w$ is embedded as a $d$-dimensional real-valued vector $v_w \in \mathbb{R}^d$. The identity matrix will be denoted by $I$. For a vector $x$, we denote by $\operatorname{diag}(x)$ the diagonal matrix with $x$ on its diagonal. We write $[m] := \{1, \cdots, m\}$ for a positive integer $m$.
Postprocessing word vectors by PC removal
This section is an overview of the all-but-the-top (ABTT) word vector postprocessing approach introduced by Mu and Viswanath (2018). In brief, the ABTT approach is based on two key observations about distributional word vectors. First, using a PCA, Mu and Viswanath (2018) revealed that word vectors are strongly influenced by a few leading principal components (PCs). Second, they provided an interpretation of these leading PCs: they empirically demonstrated a correlation between the variances explained by the leading PCs and word frequencies. Since word frequencies are arguably unrelated to lexical semantics, they recommend eliminating the top PCs from word vectors via a PCA reconstruction. This method is described in Algorithm 1; a code sketch follows.
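For concreteness, the following is a minimal NumPy sketch of the ABTT procedure as described above; the function and variable names are our own, not from the original pseudocode.

```python
import numpy as np

def abtt(V, D):
    """All-but-the-top: remove the common mean and the top-D PCs.

    V : (n_words, dim) array, one word vector per row.
    D : number of leading principal components to null.
    """
    mu = V.mean(axis=0)                      # step 1: common mean vector
    V_tilde = V - mu                         # centered word vectors
    # step 2: the PCs are the right singular vectors of the centered matrix
    _, _, Ut = np.linalg.svd(V_tilde, full_matrices=False)
    U_top = Ut[:D]                           # (D, dim) matrix of top-D PCs
    # step 3: subtract each vector's projection onto the top-D PCs
    return V_tilde - V_tilde @ U_top.T @ U_top
```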
In practice, Mu and Viswanath (2018) found that the improvements yielded by ABTT are particularly impressive for word similarity tasks. Here, we provide a straightforward interpretation of these effects. Concretely, consider two arbitrary words $w_i$ and $w_j$ with word vectors $v_{w_i}$ and $v_{w_j}$. Without loss of generality, we assume $v_{w_i}$ and $v_{w_j}$ are normalized, i.e., $\|v_{w_i}\| = \|v_{w_j}\| = 1$. Given PCs $\{u_k\}_{k=1}^{d}$ of the word vectors, we rewrite $v_{w_i}$ and $v_{w_j}$ as linear combinations with respect to the basis $\{u_k\}_{k=1}^{d}$: $v_{w_i} = \sum_{k=1}^{d} a_k u_k$ and $v_{w_j} = \sum_{k=1}^{d} b_k u_k$, for some $a_k, b_k \in \mathbb{R}$ for all $k \in [d]$. We see

$$\operatorname{CosSim}(v_{w_i}, v_{w_j}) \overset{(a)}{=} v_{w_i}^\top v_{w_j} = \Big(\sum_{k=1}^{d} a_k u_k\Big)^{\!\top} \Big(\sum_{k=1}^{d} b_k u_k\Big) \overset{(b)}{=} \sum_{k=1}^{d} a_k b_k, \qquad (2)$$

where $(a)$ holds because the word vectors are assumed to be normalized and $(b)$ holds because $\{u_k\}_{k=1}^{d}$ is an orthonormal basis of $\mathbb{R}^d$. Via Equation 2, the similarity between words $w_i$ and $w_j$ can be seen as the overall “compatibility” of their measurements $a_k$ and $b_k$ with respect to each latent feature $u_k$. If the leading PCs encode word frequencies, then removing them should, in theory, help the word vectors capture semantic similarities, and consequently improve results on word similarity tasks.
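The identity in Equation 2 is easy to verify numerically; below is a small self-contained check on synthetic data (all names here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 50))          # toy stand-in for word vectors
_, _, Ut = np.linalg.svd(V - V.mean(axis=0), full_matrices=False)

v = V[0] / np.linalg.norm(V[0])              # two normalized "word vectors"
w = V[1] / np.linalg.norm(V[1])
a, b = Ut @ v, Ut @ w                        # coordinates a_k, b_k in the PC basis
assert np.isclose(v @ w, a @ b)              # Equation 2: cos-sim = sum_k a_k b_k
```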
Postprocessing word vectors via Conceptor Negation
Removing the leading PCs of word vectors using the ABTT algorithm described above is effective in practice, as seen in the elaborate experiments conducted by Mu and Viswanath (2018). However, the method comes with a potential limitation: for each latent feature taking the form of a PC of the word vectors, ABTT either removes the feature completely or keeps it intact. For this reason, Khodak et al. (2018) argued that ABTT is liable either to not remove enough noise or to cause too much information loss.
The objective of this paper is to address this limitation of ABTT. More concretely, we propose to use matrix conceptors [Jaeger2017] to gate away the variances explained by the leading PCs of word vectors. As will be seen later, the proposed Conceptor Negation (CN) method removes noise in a “softer” manner than ABTT. We show that it shares the spirit of an eigenvalue weighting approach for PMI-based word vector postprocessing. We proceed by providing the technical background of conceptors.
Conceptors
Conceptors are a family of regularized identity maps introduced by Jaeger (2014). We present a sketch of conceptors by heavily reusing [Jaeger2014, He and Jaeger2018], sometimes verbatim. In brief, a matrix conceptor $C$ for some vector-valued random variable $x$ taking values in $\mathbb{R}^d$ is defined as the linear transformation that minimizes the following loss function:

$$\mathbb{E}_x\!\left[\|x - Cx\|^2\right] + \alpha^{-2}\,\|C\|_{\mathrm{fro}}^2, \qquad (3)$$

where $\alpha$ is a control parameter called the aperture, $\|\cdot\|$ is the $\ell_2$ norm, and $\|\cdot\|_{\mathrm{fro}}$ is the Frobenius norm. This optimization problem has a closed-form solution

$$C = R\,(R + \alpha^{-2} I)^{-1}, \qquad (4)$$

where $R = \mathbb{E}_x[x x^\top]$ and $I$ are $d \times d$ matrices. If $R = U \Sigma U^\top$ is the SVD of $R$, then the SVD of $C$ is given as $C = U S U^\top$, where the singular values $s_i$ of $C$ can be written in terms of the singular values $\sigma_i$ of $R$: $s_i = \sigma_i / (\sigma_i + \alpha^{-2}) \in [0, 1)$ for $i \in [d]$. In intuitive terms, $C$ is a soft projection matrix on the linear subspace where the samples of $x$ lie, such that for a vector $y$ in this subspace, $C$ acts like the identity: $Cy \approx y$; and when some noise $\epsilon$ orthogonal to the subspace is added to $y$, $C$ reconstructs $y$: $C(y + \epsilon) \approx y$.
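A minimal NumPy sketch of Equation 4 follows, assuming the samples of $x$ are stacked as columns of a matrix; the function name and data layout are our choices.

```python
import numpy as np

def conceptor(X, alpha):
    """Estimate C = R (R + alpha^{-2} I)^{-1} from samples (Equation 4).

    X     : (dim, n_samples) array whose columns are samples of x.
    alpha : the aperture parameter.
    """
    d, n = X.shape
    R = X @ X.T / n                          # sample estimate of R = E[x x^T]
    return R @ np.linalg.inv(R + alpha ** (-2) * np.eye(d))
```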
Moreover, operations that satisfy most laws of Boolean logic, such as NOT $\neg$, OR $\vee$, and AND $\wedge$, can be defined on matrix conceptors. These operations all have interpretations on the data level, i.e., on the distribution of the random variable $x$ (details in [Jaeger2014, Section 3.9]). Among these operations, the negation operation NOT is the one relevant for the current paper:

$$\neg C := I - C. \qquad (5)$$

Intuitively, the negated conceptor $\neg C$ softly projects the data onto a linear subspace that can be roughly understood as the orthogonal complement of the subspace characterized by $C$.
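This soft-projection intuition can be checked numerically; below is a small sketch reusing conceptor() from above on synthetic rank-deficient data (the aperture value and tolerances are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
# rank-3 data embedded in R^5: all samples lie in a 3-dimensional subspace
X = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 2000))

C = conceptor(X, alpha=100.0)                # conceptor() from the sketch above
not_C = np.eye(5) - C                        # Equation 5: NOT C = I - C

x = X[:, 0]
print(np.allclose(C @ x, x, atol=1e-2))      # C ~ identity on the data subspace
print(np.allclose(not_C @ x, 0, atol=1e-2))  # NOT C ~ nulls the data subspace
```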
Postprocessing word vectors with Conceptor Negation
This subsection explains how conceptors can be used to postprocess word vectors. The intuition behind our approach is simple. Consider a random variable $x$ taking values on the word vectors $\{v_w\}_{w \in \mathcal{V}}$. We can estimate a conceptor $C$ that describes the distribution of $x$ using Equation 4. Recall that Mu and Viswanath (2018) found that the directions along which $x$ has the highest variances encode word frequencies, which are unrelated to word semantics. To suppress such word-frequency-related features, we can simply pass all word vectors through the negated conceptor $\neg C$, which dampens the directions along which $x$ has the highest variances. This simple method is summarized in Algorithm 2; a code sketch follows below.

The hyperparameter $\alpha$ of Algorithm 2 governs the “sharpness” of the suppressing effects that $\neg C$ exerts on word vectors. Although in this work we are mostly interested in intermediate values of $\alpha$, it is nonetheless illustrative to consider the extreme cases: as $\alpha \to 0$, $\neg C$ tends to an identity matrix, meaning that word vectors are kept intact; as $\alpha \to \infty$, $\neg C$ tends to a zero matrix, meaning that all word vectors are nulled to zero vectors.
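The following is a minimal sketch of Algorithm 2 as we read it, combining the pieces above; the optional V_subset argument anticipates the vocabulary-subset trick discussed next, and all names are our own.

```python
import numpy as np

def conceptor_negation(V, alpha, V_subset=None):
    """Postprocess word vectors with the negated conceptor (Algorithm 2).

    V        : (n_words, dim) array, one word vector per row.
    alpha    : the aperture hyperparameter.
    V_subset : optional (m, dim) array of common-word vectors used to
               estimate R (the vocabulary-subset trick described below).
    """
    X = (V if V_subset is None else V_subset).T   # (dim, m) sample matrix
    d, m = X.shape
    R = X @ X.T / m                               # step 1: correlation matrix
    C = R @ np.linalg.inv(R + alpha ** (-2) * np.eye(d))
    not_C = np.eye(d) - C                         # step 2: NOT C = I - C
    return V @ not_C.T                            # step 3: pass vectors through NOT C
```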
The computational costs of Algorithm 2 are dominated by its step 1: one needs to calculate the matrix product $XX^\top$ for $X$ being the matrix whose columns are word vectors. Since modern word vectors usually come with a vocabulary of some millions of words (e.g., Google News Word2Vec contains 3 million tokens), performing a matrix product on such large matrices is computationally laborious. But considering that there are many uninteresting words in the vast vocabulary, we find it empirically beneficial to use only a subset of the vocabulary, whose words are not too peculiar (this trick has also been used for ABTT by Mu and Viswanath, personal communication). Specifically, borrowing the word list provided by Arora, Liang, and Ma (2017) (https://github.com/PrincetonML/SIF/tree/master/auxiliary_data), we use the words that appear at least 200 times in a 2015 Wikipedia dump to estimate $C$. This greatly boosts the computation speed. Somewhat surprisingly, the trick also improves the performance of Algorithm 2, which might be due to the higher quality of the word vectors of common words compared with infrequent ones.
Analytic comparison with other methods
Since most of the existing unsupervised word vector postprocessing methods are ultimately based on linear data transformations, we hypothesize that there should be commonalities between the methods. In this section, we show that CN resembles ABTT in that both methods can be interpreted as “spectral encode-decode processes”; and that, when applied to word vectors induced by a pointwise mutual information (PMI) matrix, CN shares the spirit of the eigenvalue weighting (EW) postprocessing [Caron2001, Levy, Goldberg, and Dagan2015]: both assign weights to the singular vectors of a PMI matrix. Key distinctions of CN are that it does soft noise removal (unlike ABTT) and that it is not restricted to postprocessing PMI-matrix-induced word vectors (unlike EW).
Relation to ABTT
In this subsection, we reveal the connection between CN and ABTT. To do this, we rewrite the last step of each algorithm into a common spectral form. For the convenience of comparison, throughout this section we assume that the word vectors in Algorithm 1 and Algorithm 2 possess zero mean, although this is not a necessary requirement in general.
We first rewrite the equation in step 3 of Algorithm 1. We let $U = [u_1, \cdots, u_d]$ be the matrix whose columns are the PCs estimated from the word vectors. Let $U_{1:D} = [u_1, \cdots, u_D]$ contain the first $D$ columns of $U$. Under the assumption that word vectors possess zero mean, the equation in step 3 of Algorithm 1 can be rewritten as

$$\tilde{v}_w = v_w - U_{1:D} U_{1:D}^\top v_w = \big(I - U_{1:D} U_{1:D}^\top\big)\, v_w = U \operatorname{diag}(\underbrace{0, \cdots, 0}_{D}, \underbrace{1, \cdots, 1}_{d-D})\, U^\top v_w. \qquad (6)$$
Next, we rewrite step 3 of the Conceptor Negation (CN) method of Algorithm 2. Note that for word vectors with zero mean, the estimate of $R$ is a (sample) covariance matrix of a random variable taking values as word vectors, and therefore the singular vectors of $R$ are the PCs of the word vectors. Letting $R = U \Sigma U^\top$ be the SVD of $R$, the equation in step 3 of Algorithm 2 can be rewritten via elementary linear algebraic operations:

$$\tilde{v}_w = \neg C\, v_w = \big(I - R (R + \alpha^{-2} I)^{-1}\big)\, v_w = U \operatorname{diag}\!\Big(\frac{\alpha^{-2}}{\sigma_1 + \alpha^{-2}}, \cdots, \frac{\alpha^{-2}}{\sigma_d + \alpha^{-2}}\Big)\, U^\top v_w, \qquad (7)$$

where $\sigma_1, \cdots, \sigma_d$ are the diagonal entries of $\Sigma$.
Examining Equations 6 and 7, we see that ABTT and CN share some similarities. In particular, both can be unified into “spectral encode-decode processes”, which consist of the following three steps:

PC encoding. Load the word vectors onto the PCs by multiplying them with $U^\top$.

Variance gating. Pass the PC-encoded data through the variance gating matrix: $\operatorname{diag}(0, \cdots, 0, 1, \cdots, 1)$ for ABTT and $\operatorname{diag}\big(\frac{\alpha^{-2}}{\sigma_1 + \alpha^{-2}}, \cdots, \frac{\alpha^{-2}}{\sigma_d + \alpha^{-2}}\big)$ for CN.

PC decoding. Transform the data back to the original coordinates by multiplying with the matrix $U$.
With the above encode-decode interpretation, we see that where CN differs from ABTT is in the variance gating step. In particular, ABTT does hard gating, in the sense that the diagonal entries of its variance gating matrix (call them variance gating coefficients) take values in the set $\{0, 1\}$. The CN approach, on the other hand, does softer gating, as its coefficients take values in the open interval $(0, 1)$: $\frac{\alpha^{-2}}{\sigma_i + \alpha^{-2}} \in (0, 1)$ for all $i \in [d]$ and $\alpha \in (0, \infty)$. To illustrate the gating effects, we plot the variance gating coefficients of ABTT and CN for Word2Vec in Figure 1.
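The two gating profiles can be computed directly from the singular values $\sigma_1 \geq \cdots \geq \sigma_d$ of $R$; below is a sketch of the coefficients plotted in Figure 1 (function names are ours).

```python
import numpy as np

def abtt_gates(sigma, D):
    """Hard gates: 0 for the top-D PCs, 1 elsewhere (cf. Equation 6).

    sigma : array of singular values of R, sorted in decreasing order."""
    g = np.ones_like(sigma)
    g[:D] = 0.0
    return g

def cn_gates(sigma, alpha):
    """Soft gates alpha^{-2} / (sigma_i + alpha^{-2}) in (0, 1) (cf. Equation 7)."""
    return alpha ** (-2) / (sigma + alpha ** (-2))
```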
Relation with eigenvalue weighting
We now relate the conceptor approach to the eigenvalue weighting approach for postprocessing PMI-based word vectors. This effort is in line with ongoing research in the NLP community that envisages a connection between “neural” word embeddings and PMI-matrix-factorization-based word embeddings [Levy and Goldberg2014, Pennington, Socher, and Manning2014, Levy, Goldberg, and Dagan2015].
In the PMI approach for word association modeling, for each word $w$ and each context (i.e., sequence of words) $c$, the PMI matrix $M$ assigns a value for the pair $(w, c)$: $M_{w,c} = \operatorname{PMI}(w, c) = \log \frac{\hat{p}(w, c)}{\hat{p}(w)\,\hat{p}(c)}$. In practical NLP tasks, the sets of words and contexts tend to be large, and therefore directly working with $M$ is inconvenient. To alleviate the problem, one way is to perform a truncated SVD on $M$, factorizing it into the product of three smaller matrices $M \approx U_d \Sigma_d V_d^\top$, where the columns of $U_d$ are the first $d$ left singular vectors of $M$, $\Sigma_d$ is the diagonal matrix containing the $d$ leading singular values of $M$, and the columns of $V_d$ are the first $d$ right singular vectors of $M$. A generic way to induce word vectors from the factorization is to let

$$W = U_d \Sigma_d,$$

which is a matrix containing word vectors as rows. Coined by Levy, Goldberg, and Dagan (2015), the term eigenvalue weighting (EW) refers to a postprocessing technique for PMI-matrix-induced word vectors. (It seems to us that the term “singular value weighting” would be more appropriate, because the weighting is based on the singular values of the PMI matrix and not on its eigenvalues; the term “eigenvalue” is relevant here only because the singular values of $M$ are also the square roots of the eigenvalues of $M^\top M$.) This technique has its root in Latent Semantic Analysis (LSA): Caron (2001) first proposed to define the postprocessed version of $W$ as

$$W^{(p)} = U_d \Sigma_d^{\,p},$$

where $p$ is a weighting exponent determining the relative weights assigned to each singular vector of $M$. While an optimal $p$ depends on specific task demands, previous research suggests that $p < 1$ is generally preferred, i.e., the contributions of the initial singular vectors of $M$ should be suppressed; exponents such as $p = 0$ and $p = 0.5$ are recommended in [Caron2001, Bullinaria and Levy2012, Levy, Goldberg, and Dagan2015]. Bullinaria and Levy (2012) argue that the initial singular vectors of $M$ tend to be contaminated most by aspects other than lexical semantics.
We now show that applying CN to the PMI-matrix-based word embedding has an effect tantamount to the “suppressing of initial singular vectors” of EW. Letting the negated conceptor $\neg C$ (estimated from the rows of $W$) act on the word vectors of $W$, we get the postprocessed word vectors as the rows of the matrix

$$W \neg C^\top = U_d \Sigma_d \operatorname{diag}\!\Big(\frac{\alpha^{-2}}{\sigma_1^2/n + \alpha^{-2}}, \cdots, \frac{\alpha^{-2}}{\sigma_d^2/n + \alpha^{-2}}\Big),$$

where $n$ is the number of words and $\sigma_1 \geq \cdots \geq \sigma_d$ are the leading singular values of $M$. Since

$$\frac{\alpha^{-2}}{\sigma_i^2/n + \alpha^{-2}} \in (0, 1) \quad \text{is decreasing in } \sigma_i$$

for all $i \in [d]$ and $\alpha \in (0, \infty)$, these weights suppress the contributions of the initial singular vectors, similar to what is done in EW.
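For reference, here is a sketch of EW on a dense PMI matrix; in practice $M$ is large and sparse, and a truncated sparse SVD such as scipy.sparse.linalg.svds would be used instead (the function name and dense layout are our simplifications).

```python
import numpy as np

def ew_vectors(M, d, p):
    """Truncated SVD of a PMI matrix M, then eigenvalue weighting.

    Returns W^(p) = U_d Sigma_d^p, whose rows are word vectors."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * s[:d] ** p      # broadcasting scales column i by s_i^p
```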
Experiments
We evaluate the postprocessed word vectors on a variety of lexical-level intrinsic tasks and a downstream deep learning task. We use the publicly available pretrained Google News Word2Vec [Mikolov et al.2013] (https://code.google.com/archive/p/word2vec/) and Common Crawl GloVe [Pennington, Socher, and Manning2014] (https://nlp.stanford.edu/projects/glove/) to perform the lexical-level experiments. For CN, we fix the aperture $\alpha$ to the same value for Word2Vec and GloVe throughout the experiments (analytical optimization methods for the aperture are available from [Jaeger2014]; connecting them with the word vector postprocessing scheme remains future work). For ABTT, we set the number of removed PCs $D$ for Word2Vec and GloVe as suggested by Mu and Viswanath (2018).
Word similarity
We test the performance of CN on seven benchmarks that have been widely used to measure word similarity: RG65 [Rubenstein and Goodenough1965], WordSim-353 (WS) [Finkelstein et al.2002], the rare-words dataset (RW) [Luong, Socher, and Manning2013], the MEN dataset [Bruni, Tran, and Baroni2014], MTurk [Radinsky et al.2011], SimLex-999 (SimLex) [Hill, Reichart, and Korhonen2015], and SimVerb-3500 [Gerz et al.2016]. To evaluate word similarity, we calculate the cosine similarity between the vectors of two words (Equation 2). We report Spearman's rank correlation coefficient [Myers and Well1995] of the estimated rankings against the rankings by humans in Table 1; a sketch of the evaluation protocol follows the table. We see that the proposed CN method outperforms both the original word embeddings (orig.) and the embeddings postprocessed by ABTT on most of the benchmarks.
WORD2VEC  GLOVE  

orig.  ABTT  CN  orig.  ABTT  CN  
RG65  76.08  78.34  78.92  76.96  74.36  78.40 
WS  68.29  69.05  69.30  73.79  76.79  79.08 
RW  53.74  54.33  58.04  46.41  52.04  58.98 
MEN  78.20  79.08  78.67  80.49  81.78  83.38 
MTurk  68.23  69.35  66.81  69.29  70.85  71.07 
SimLex  44.20  45.10  46.82  40.83  44.97  48.53 
SimVerb  36.35  36.50  38.30  28.33  32.23  36.36 
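The evaluation protocol above amounts to a few lines; here is a sketch with benchmark loading omitted (the data structures are our assumptions: a dict of word vectors, a list of word pairs, and the human ratings).

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_score(vec, pairs, human_scores):
    """vec: dict word -> vector; pairs: list of (w1, w2); human_scores: floats."""
    cosines = []
    for w1, w2 in pairs:
        v1, v2 = vec[w1], vec[w2]
        cosines.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    # Spearman's rank correlation, scaled by 100 as in Table 1
    return 100 * spearmanr(cosines, human_scores).correlation
```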
The improvements achieved by CN are particularly impressive on two “modern” word similarity benchmarks, SimLex and SimVerb; these two benchmarks carefully distinguish genuine word similarity from conceptual association [Hill, Reichart, and Korhonen2015]. For instance, coffee is associated with cup but by no means similar to cup, a confusion often made by earlier benchmarks. SimLex, in particular, has been heavily used to evaluate word vectors yielded by supervised word vector fine-tuning algorithms, which perform gradient descent on word vectors with respect to linguistic constraints such as synonym and antonym relationships extracted from WordNet and/or PPDB. Our results on SimLex are comparable to those achieved by the recent supervised counter-fitting approach of Mrksic et al. (2016), as shown in Table 2.
Postprocessing method  WORD2VEC  GLOVE  

supervised  CounterFitting + syn.  0.45  0.46 
CounterFitting + ant.  0.33  0.43  
CounterFitting + syn. + ant.  0.47  0.50  
unsupervised  CN  0.47  0.49 
Semantic Textual Similarity
In this subsection, we showcase the effectiveness of the proposed postprocessing method using semantic textual similarity (STS) benchmarks, which are designed to test the semantic similarities of sentences. We use the 2012–2015 SemEval STS tasks [Agirre et al.2012, Agirre et al.2013, Agirre et al.2014, Agirre et al.2015] and the SemEval 2014 Semantic Relatedness task (SICK) [Marelli et al.2014].
Concretely, for each pair of sentences $s_1$ and $s_2$, we computed sentence vectors $v_{s_1}$ and $v_{s_2}$ by averaging their constituent word vectors. We then calculated the cosine similarity between the two sentence vectors. This naive method has been shown to be a strong baseline for STS tasks [Wieting et al.2016]; a sketch follows below. As in Agirre et al. (2012), we used the Pearson correlation of the estimated sentence similarities against the rankings by humans to assess model performance.
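A sketch of this averaging baseline (tokenization and out-of-vocabulary handling are simplified; names are ours):

```python
import numpy as np

def sentence_vector(tokens, vec):
    """Average the vectors of the in-vocabulary tokens of a sentence."""
    return np.mean([vec[w] for w in tokens if w in vec], axis=0)

def sts_similarity(s1, s2, vec):
    """Cosine similarity of the two averaged sentence vectors."""
    v1, v2 = sentence_vector(s1, vec), sentence_vector(s2, vec)
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
```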
In Table 3, we report the average results for the STS tasks of each year (detailed results are in the appendix). Again, our CN method outperforms the alternatives in most cases.
WORD2VEC  GLOVE  

orig.  ABTT  CN  orig.  ABTT  CN  
STS 2012  57.22  57.67  54.31  48.27  54.06  54.38 
STS 2013  56.81  57.98  59.17  44.83  51.71  55.51 
STS 2014  62.89  63.30  66.22  51.11  59.23  62.66 
STS 2015  62.74  63.35  67.15  47.23  57.29  63.74 
SICK  70.10  70.20  72.71  65.14  67.85  66.42 
Concept Categorization
In the concept categorization task, we use $k$-means to cluster words into concept categories based on their vector representations (for example, “bear” and “cat” belong to the category of animals). We use three standard datasets: (i) a rather small dataset, ESSLLI 2008 [Baroni, Evert, and Lenci2008], which contains 44 concepts in 9 categories; (ii) the Almuhareb-Poesio (AP) dataset [Poesio and Almuhareb2005], which contains 402 concepts divided into 21 categories; and (iii) the BM dataset [Battig and Montague1969], which contains 5321 concepts divided into 56 categories. Note that ESSLLI, AP, and BM are increasingly challenging for clustering algorithms, due to their increasing numbers of words and categories.
Following [Baroni, Dinu, and Kruszewski2014, Schnabel et al.2015, Mu and Viswanath2018], we used the “purity” of clusters [Manning, Raghavan, and Schütze2008, Section 16.4] as the evaluation criterion (a sketch of the purity computation is given at the end of this subsection). Note that the results of $k$-means heavily depend on two hyperparameters: (i) the number of clusters $k$ and (ii) the initial centroids of the clusters. We follow previous research [Baroni, Dinu, and Kruszewski2014, Schnabel et al.2015, Mu and Viswanath2018] in setting $k$ to the ground-truth number of categories. The settings of the initial centroids, however, are less well documented in previous work: it is not clear how many initial centroids were sampled, or whether different centroids were sampled at all. To avoid the influence of the initial centroids of $k$-means (which is particularly undesirable here because the word vectors live in a high-dimensional space), we simply fixed the initial centroids as the averages of the original, ABTT-processed, and CN-processed word vectors of each ground-truth category, respectively. This initialization is fair because all postprocessing methods make use of the ground-truth information equally, similar to the usage of the ground-truth numbers of clusters. We report the experiment results in Table 4.
WORD2VEC  GLOVE  

orig.  ABTT  CN  orig.  ABTT  CN  
ESSLLI  100.0  100.0  100.0  100.0  100.0  100.0 
AP  87.28  88.3  89.31  86.43  87.19  90.95 
BM  58.15  59.24  60.19  65.34  67.35  67.63 
The proposed method and the baseline methods performed equally well on the smallest dataset, ESSLLI. As the datasets grew larger, the results diverged, and the proposed CN approach outperformed the baselines.
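As promised above, here is a sketch of the purity criterion (the integer label encoding is our assumption).

```python
import numpy as np

def purity(cluster_ids, true_ids):
    """Purity of a clustering: each cluster is credited with its majority
    ground-truth category [Manning, Raghavan, and Schütze 2008, Sec. 16.4].

    Both arguments are integer label arrays of equal length."""
    correct = 0
    for k in np.unique(cluster_ids):
        members = true_ids[cluster_ids == k]   # gold labels within cluster k
        correct += np.bincount(members).max()  # size of the majority category
    return correct / len(true_ids)
```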
A Downstream NLP task: Neural Belief Tracker
The experiments we have reported so far are all intrinsic lexical evaluation benchmarks. Evaluating the postprocessed word vectors only on these benchmarks, however, invites an obvious critique: success on intrinsic evaluation tasks may not transfer to downstream NLP tasks, as suggested by previous research [Schnabel et al.2015]. Indeed, when supervised learning tasks are performed, postprocessing methods such as ABTT and CN can in principle be absorbed into a classifier such as a neural network. Nevertheless, good initialization of classifiers is crucial. We hypothesize that the postprocessed word vectors serve as a good initialization for those downstream NLP tasks in which the semantic knowledge contained in word vectors is needed.
To validate this hypothesis, we conducted an experiment using the Neural Belief Tracker (NBT), a deep neural network based dialogue state tracking (DST) model [Mrksic et al.2017, Mrkšić and Vulić2018]. As a concrete example to illustrate the purpose of the task, consider a dialogue system designed to help users find restaurants. When a user wants to find a sushi restaurant, the system is expected to know that Japanese restaurants have a higher probability of being a good recommendation than Italian or Thai restaurants. Word vectors are important for this task because the NBT needs to absorb useful semantic knowledge from the word vectors using a neural network.
In our experiment with the NBT, we used the model specified in [Mrkšić and Vulić2018] with default hyperparameter settings (https://github.com/nmrksic/neuralbelieftracker). We report the goal accuracy, a default DST performance measure, defined as the proportion of dialogue turns where all of the user's search goal constraints match the model predictions (a sketch follows Table 5). The test data was Wizard-of-Oz (WOZ) 2.0 [Wen et al.2017], where the goal constraints of users are divided into three domains: food, price range, and area. The experiment results are reported in Table 5.
WORD2VEC  GLOVE  

orig.  ABTT  CN  orig.  ABTT  CN  
Food  48.6  84.7  78.5  86.4  83.7  88.8 
Price range  90.2  88.1  92.2  91.0  93.9  94.7 
Area  83.1  82.4  86.1  93.5  94.9  93.7 
Average  74.0  85.1  85.6  90.3  90.8  92.4 
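For clarity, the goal accuracy reported in Table 5 is simply the following quantity (the per-turn representation of constraints as dictionaries is our assumption).

```python
def goal_accuracy(predicted, gold):
    """Fraction of dialogue turns whose predicted constraints exactly
    match the gold constraints (e.g., dicts of slot -> value)."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```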
Further discussions
Besides the NBT task, we have also tested the ABTT and CN methods on other downstream NLP tasks such as text classification (not reported). We found that ABTT and CN yield equivalent results on such tasks. One explanation is that the ABTT- and CN-postprocessed word vectors differ only up to a small perturbation; with a sufficient amount of training data and an appropriate regularization method, a neural network should generalize over such a perturbation. With relatively small training data (e.g., the 600 dialogues used for training the NBT), however, we found that the word vectors used as initialization matter, and in such cases the CN-postprocessed word vectors yield favorable results. Another interesting finding is that, having tested ABTT and CN on Fasttext [Bojanowski et al.2017], we found that neither postprocessing method provides a visible gain. We hypothesize that this might be because Fasttext includes subword (character-level) information in its word representations during training, which suppresses the word-frequency features contained in word vectors. It remains for future work to validate this hypothesis.
Conclusion
We propose a simple yet effective method for postprocessing word vectors via the negation operation of conceptors. Using a battery of intrinsic evaluation tasks and a downstream deep-learning-based dialogue state tracking task, we show that the proposed method enhances the linguistic regularities captured by word vectors and consistently improves performance over existing alternatives.
There are several possibilities for future work. We envisage that the logical operations and abstraction ordering admitted by conceptors can be used in other NLP tasks. As concrete examples: the AND operation can potentially be applied to induce and fine-tune bilingual word vectors, by mapping the word representations of individual languages into a shared linear space; the OR and NOT operations can be used to study the vector representations of polysemous words, by joining and deleting sense-specific vector representations; and the abstraction ordering is a natural tool to study the graded lexical entailment of words.
Acknowledgement
We appreciate the anonymous reviewers for their constructive comments. We thank Xu He, Jordan Rodu, Daphne Ippolito, and Chris Callison-Burch for helpful discussions.
References
 [Agirre et al.2012] Agirre, E.; Diab, M.; Cer, D.; and GonzalezAgirre, A. 2012. Semeval2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, SemEval ’12, 385–393. Stroudsburg, PA, USA: Association for Computational Linguistics.
 [Agirre et al.2013] Agirre, E.; Cer, D.; Diab, M.; GonzalezAgirre, A.; and Guo, W. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics, volume 1, 32–43.
 [Agirre et al.2014] Agirre, E.; Banea, C.; Cardie, C.; Cer, D.; Diab, M.; GonzalezAgirre, A.; Guo, W.; Mihalcea, R.; Rigau, G.; and Wiebe, J. 2014. Semeval2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation, 81–91.
 [Agirre et al.2015] Agirre, E.; Banea, C.; Cardie, C.; Cer, D.; Diab, M.; GonzalezAgirre, A.; Guo, W.; LopezGazpio, I.; Maritxalar, M.; Mihalcea, R.; Rigaua, G.; Uriaa, L.; and Wiebeg, J. 2015. Semeval2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation, 252–263.
 [Arora, Liang, and Ma2017] Arora, S.; Liang, Y.; and Ma, T. 2017. A simple but toughtobeat baseline for sentence embeddings. In International Conference on Learning Representations.
 [Baroni, Dinu, and Kruszewski2014] Baroni, M.; Dinu, G.; and Kruszewski, G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the ACL 2014, 238–247. Association for Computational Linguistics.
 [Baroni, Evert, and Lenci2008] Baroni, M.; Evert, S.; and Lenci, A. 2008. Verb categorization: a shared task from the ESSLLI 2008 workshop.
 [Battig and Montague1969] Battig, W. F., and Montague, W. E. 1969. Category norms of verbal items in 56 categories: a replication and extension of the Connecticut category norms. Journal of Experimental Psychology 80(3p2):1.
 [Bojanowski et al.2017] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.

 [Bruni, Tran, and Baroni2014] Bruni, E.; Tran, N. K.; and Baroni, M. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research 49(1):1–47.
 [Bullinaria and Levy2012] Bullinaria, J. A., and Levy, P. J. 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods 44(3):890–907.
 [Caron2001] Caron, J. 2001. Experiments with LSA scoring: Optimal rank and basis. Computational Information Retrieval 157–169.

 [Dhillon, Foster, and Ungar2015] Dhillon, P. S.; Foster, D. P.; and Ungar, L. H. 2015. Eigenwords: Spectral word embeddings. Journal of Machine Learning Research 16:3035–3078.
 [Faruqui et al.2015] Faruqui, M.; Dodge, J.; Jauhar, S. K.; Dyer, C.; Hovy, E. H.; and Smith, N. A. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the NAACL HLT 2015, 1606–1615.
 [Finkelstein et al.2002] Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; and Ruppin, E. 2002. Placing search in context: the concept revisited. ACM Transactions on Information Systems 20(1):116–131.
 [Firth1957] Firth, J. R. 1957. A synopsis of linguistic theory 1930–55. In Studies in Linguistic Analysis (special volume of the Philological Society), volume 1952–59, 1–32. Oxford: The Philological Society.
 [Gerz et al.2016] Gerz, D.; Vulic, I.; Hill, F.; Reichart, R.; and Korhonen, A. 2016. SimVerb-3500: a large-scale evaluation set of verb similarity. In Proceedings of the EMNLP 2016, 2173–2182.

 [He and Jaeger2018] He, X., and Jaeger, H. 2018. Overcoming catastrophic interference using conceptor-aided backpropagation. In International Conference on Learning Representations.
 [Hill, Reichart, and Korhonen2015] Hill, F.; Reichart, R.; and Korhonen, A. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665–695.

 [Jaeger2014] Jaeger, H. 2014. Controlling recurrent neural networks by conceptors. Technical report, Jacobs University Bremen.
 [Jaeger2017] Jaeger, H. 2017. Using conceptors to manage neural long-term memories for temporal patterns. Journal of Machine Learning Research 18(13):1–43.
 [Khodak et al.2018] Khodak, M.; Saunshi, N.; Liang, Y.; Ma, T.; Stewart, B.; and Arora, S. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of the ACL.
 [Levy and Goldberg2014] Levy, O., and Goldberg, Y. 2014. Neural word embedding as implicit matrix factorization. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 2177–2185.
 [Levy, Goldberg, and Dagan2015] Levy, O.; Goldberg, Y.; and Dagan, I. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.
 [Luong, Socher, and Manning2013] Luong, M.; Socher, R.; and Manning, C. D. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the CoNLL 2013.
 [Manning, Raghavan, and Schütze2008] Manning, C. D.; Raghavan, P.; and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
 [Marelli et al.2014] Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; and Zamparelli, R. 2014. A sick cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA).
 [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 26. Curran Associates, Inc. 3111–3119.
 [Miller1995] Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.
 [Mrkšić and Vulić2018] Mrkšić, N., and Vulić, I. 2018. Fully statistical neural belief tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 108–113. Association for Computational Linguistics.
 [Mrksic et al.2016] Mrksic, N.; Séaghdha, D.; Thomson, B.; Gasic, M.; RojasBarahona, L. M.; Su, P.; Vandyke, D.; Wen, T.; and Young, S. J. 2016. Counterfitting word vectors to linguistic constraints. In Proceedings of the NAACL HLT 2016, 142–148.
 [Mrksic et al.2017] Mrksic, N.; Vulic, I.; Séaghdha, D. Ó.; Leviant, I.; Reichart, R.; Gasic, M.; Korhonen, A.; and Young, S. J. 2017. Semantic specialization of distributional word vector spaces using monolingual and crosslingual constraints. TACL 5:309–324.
 [Mu and Viswanath2018] Mu, J., and Viswanath, P. 2018. Allbutthetop: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
 [Myers and Well1995] Myers, J. L., and Well, A. D. 1995. Research Design & Statistical Analysis. Routledge, 1 edition.
 [Pavlick et al.2015] Pavlick, E.; Rastogi, P.; Ganitkevitch, J.; Durme, B. V.; and Callison-Burch, C. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the ACL 2015 (Volume 2: Short Papers), 425–430. Beijing, China: Association for Computational Linguistics.
 [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP, 1532–1543.
 [Poesio and Almuhareb2005] Poesio, M., and Almuhareb, A. 2005. Identifying concept attributes using a classifier. In Proceedings of the ACLSIGLEX Workshop on Deep Lexical Acquisition, 18–27. Stroudsburg, PA, USA: Association for Computational Linguistics.
 [Radinsky et al.2011] Radinsky, K.; Agichtein, E.; Gabrilovich, E.; and Markovitch, S. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International World Wide Web Conference, 337–346.
 [Rubenstein and Goodenough1965] Rubenstein, H., and Goodenough, J. B. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627–633.
 [Schnabel et al.2015] Schnabel, T.; Labutov, I.; Mimno, D. M.; and Joachims, T. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of EMNLP 2015, 298–307.
 [Turney2012] Turney, P. D. 2012. Domain and function: A dualspace model of semantic relations and compositions. Journal of Artificial Intelligence Research 44(1):533–585.
 [Wen et al.2017] Wen, T.; Vandyke, D.; Mrkšić, N.; Gašić, M.; Rojas-Barahona, L. M.; Su, P.; Ultes, S.; and Young, S. 2017. A network-based end-to-end trainable task-oriented dialogue system. In EACL, 438–449. Valencia, Spain: Association for Computational Linguistics.
 [Wieting et al.2016] Wieting, J.; Bansal, M.; Gimpel, K.; and Livescu, K. 2016. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations.
Appendix
Detailed experiments in Semantic Textual Similarity (STS) tasks
In the main body of the paper we reported the average results for the STS tasks by year. A detailed list of the STS tasks can be found in Table 6; the data can be downloaded from http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark.
STS 2012  STS 2013  STS 2014  STS 2015 

MSRpar  FNWN  deft forum  answers-forums 
MSRvid  OnWN  deft news  answers-students 
OnWN  headlines  headlines  belief 
SMTeuroparl  –  images  headlines 
SMTnews  –  OnWN  images 
–  –  tweet news  – 
We report the detailed experiment results for the above STS tasks in Table 7.
WORD2VEC  GLOVE  
orig.  ABTT  CN  orig.  ABTT  CN  
MSRpar  42.12  43.85  40.30  44.54  44.09  41.19 
MSRvid  72.07  72.16  75.22  64.47  68.05  62.50 
OnWN  69.38  69.48  70.82  53.07  65.67  67.96 
SMTeuroparl  53.15  54.32  35.14  41.74  45.28  52.58 
SMTnews  49.37  48.53  50.08  37.54  47.22  47.69 
STS 2012  57.22  57.67  54.31  48.27  54.06  54.38 
FNWN  40.70  41.96  43.99  37.54  39.34  42.07 
OnWN  67.87  68.17  68.76  47.22  58.60  57.45 
headlines  61.88  63.81  64.78  49.73  57.20  67.00 
STS 2013  56.81  57.98  59.17  44.83  51.71  55.51 
OnWN  74.61  74.78  75.08  57.41  67.56  66.43 
deftforum  32.19  33.26  42.80  21.55  29.39  37.57 
deftnews  66.83  65.96  65.57  65.14  71.45  69.08 
headlines  58.01  59.58  61.09  47.05  52.60  61.71 
images  73.75  74.17  78.24  57.22  68.28  65.81 
tweetnews  71.92  72.07  74.55  58.32  66.13  75.37 
STS 2014  62.89  63.30  66.22  51.11  59.23  62.66 
forum  46.35  46.80  53.66  30.02  39.86  48.62 
students  68.07  67.99  71.45  49.20  62.38  69.68 
belief  59.72  60.42  61.29  44.05  57.68  59.77 
headlines  61.47  63.45  68.88  46.22  53.31  69.18 
images  78.09  78.08  80.48  66.63  73.20  71.43 
STS 2015  62.74  63.35  67.15  47.23  57.29  63.74 
SICK  70.10  70.20  72.71  65.14  67.85  66.42 