Vector space representations of words capture many aspects of word similarity, but such methods tend to make vector spaces in which antonyms (as well as synonyms) are close to each other. We present a new signed spectral normalized graph cut algorithm, signed clustering, that overlays existing thesauri upon distributionally derived vector representations of words, so that antonym relationships between word pairs are represented by negative weights. Our signed clustering algorithm produces clusters of words which simultaneously capture distributional and synonym relations. We evaluate these clusters against the SimLex-999 dataset (Hill et al., 2014) of human judgments of word pair similarities, and also show the benefit of using our clusters to predict the sentiment of a given text.
While vector space models (Turney et al., 2010) such as Eigenwords, Glove, or word2vec capture relatedness, they do not adequately encode synonymy and similarity (Mohammad et al., 2013; Scheible et al., 2013). Our goal was to create clusters of synonyms or semantically equivalent words and linguistically motivated unified constructs. We developed a novel theory and method that extends multiclass normalized cuts (K-cluster) to signed graphs (Gallier, 2016), which allows the incorporation of semi-supervised information. Negative edges represent repellent or opposite relationships between nodes.
In distributional vector representations opposite relations are not fully captured. Take, for example, words such as “great” and “awful”, which can appear with similar frequency in the same sentence structure: “today is a great day” and “today is an awful day”. Word embeddings, which are successful in a wide array of NLP tasks, fail to capture this antonymy because they follow the distributional hypothesis that similar words are used in similar contexts (Harris, 1954), thus assigning small cosine or Euclidean distances between the vector representations of “great” and “awful”. Our signed spectral normalized graph cut algorithm (henceforth, signed clustering) builds antonym relations into the vector space, while maintaining distributional similarity. Another strength of K-clustering of signed graphs is that it can be used together with other methods for augmenting semantic meaning. Signed clustering leads to improved clusters over spectral clustering of word embeddings, and has better coverage than thesaurus look-up. This is because thesauri erroneously give equal weight to rare senses of a word, such as “absurd” and its rarely used synonym “rich”. Also, the overlap between thesauri is small, due to their manual creation. (Lin (1998) reported a value of 0.178397 between the synonym sets from Roget’s Thesaurus and WordNet 1.5.) We also found similarly small overlap between all three thesauri tested.
We evaluated our clusters by comparing them to different vector representations. In addition, we evaluated our clusters against SimLex-999. Finally, we tested our method on the sentiment analysis task. Overall, signed spectral clustering is a clean and elegant augmentation to current methods, and may have broad application to many fields. Our main contributions are the novel method for signed clustering of signed graphs by Gallier (2016), the application of this method to create semantic word clusters which are agnostic to both vector space representations and thesauri, and finally, the systematic evaluation and creation of word clusters using thesauri.
Semantic word clusters and distributional thesauri have been well studied (Lin, 1998; Curran, 2004). Recently there has been a lot of work on incorporating synonyms and antonyms into word embeddings. Most recent models either attempt to make richer contexts, in order to find semantic similarity, or overlay thesaurus information in a supervised or semi-supervised manner. Tang et al. (2014) created sentiment-specific word embeddings (SSWE), which were trained for Twitter sentiment. Yih et al. (2012) proposed polarity induced latent semantic analysis (PILSA) using thesauri, which was extended by Chang et al. (2013) to a multi-relational setting. The Bayesian tensor factorization model (BPTF) was introduced in order to combine multiple sources of information (Zhang et al., 2014). Faruqui et al. (2015) used belief propagation to modify existing vector space representations. The word embeddings on Thesauri and Distributional information (WE-TD) model (Ono et al., 2015) incorporated thesauri by altering the objective function for word embedding representations. Similarly, Pham et al. (2015) introduced the multitask Lexical Contrast Model, which extended the word2vec Skip-gram method to optimize for both context and synonym/antonym relations. Our approach differs from the aforementioned methods in that we created word clusters using the antonym relationships as negative links. Similar to Faruqui et al. (2015), our signed clustering method uses existing vector representations to create word clusters.
To our knowledge, Gallier (2016) is the first theoretical foundation of multiclass signed normalized cuts. Hou (2005) used positive degrees of nodes in the degree matrix of a signed graph with weights (-1, 0, 1), which was advanced by Kolluri et al. (2004) and Kunegis et al. (2010) using absolute values of weights in the degree matrix. Although must-link and cannot-link soft spectral clustering (Rangapuram & Hein, 2012) shares similarities with our method, this similarity only applies to cases where cannot-link edges are present. Our method excludes a weight term of cannot-link, as well as the volume of cannot-link edges within the clusters. Furthermore, our optimization method differs from that of must-link / cannot-link algorithms. We developed a novel theory and algorithm that extends the clustering of Shi & Malik (2000) and Yu & Shi (2003) to the multi-class signed graph case (Gallier, 2016).
Weighted graphs for which the weight matrix is a symmetric matrix in which negative and positive entries are allowed are called signed graphs. Such graphs (with weights $(-1, 0, +1)$) were introduced as early as 1953 by Harary (1953), to model social relations involving disliking, indifference, and liking. The problem of clustering the nodes of a signed graph arises naturally as a generalization of the clustering problem for weighted graphs. Figure 1 shows a signed graph of word similarities with a thesaurus overlay.
Gallier (2016) extends normalized cuts to signed graphs in order to incorporate antonym information into word clusters.
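A weighted graph is a pair $G = (V, W)$, where $V = \{v_1, \ldots, v_m\}$ is a set of nodes or vertices, and $W$ is a symmetric matrix called the weight matrix, such that $w_{ij} \geq 0$ for all $i, j \in \{1, \ldots, m\}$, and $w_{ii} = 0$ for $i = 1, \ldots, m$. We say that a set $\{v_i, v_j\}$ is an edge iff $w_{ij} > 0$. The corresponding (undirected) graph $(V, E)$ with $E = \{\{v_i, v_j\} \mid w_{ij} > 0\}$ is called the underlying graph of $G$.
Given a signed graph $G = (V, W)$ (where $W$ is a symmetric matrix with zero diagonal entries), the underlying graph of $G$ is the graph with node set $V$ and set of (undirected) edges $E = \{\{v_i, v_j\} \mid w_{ij} \neq 0\}$.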
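If $(V, W)$ is a signed graph, where $W$ is an $m \times m$ symmetric matrix with zero diagonal entries and with the other entries $w_{ij} \in \mathbb{R}$ arbitrary, for any node $v_i \in V$, the signed degree of $v_i$ is defined as
$$\bar{d}_i = \bar{d}(v_i) = \sum_{j=1}^{m} |w_{ij}|,$$
and the signed degree matrix $\bar{D}$ as
$$\bar{D} = \mathrm{diag}(\bar{d}(v_1), \ldots, \bar{d}(v_m)).$$
For any subset $A$ of the set of nodes $V$, let
$$\mathrm{vol}(A) = \sum_{v_i \in A} \bar{d}_i = \sum_{v_i \in A} \sum_{j=1}^{m} |w_{ij}|.$$
For any two subsets $A$ and $B$ of $V$, define $\mathrm{links}^{+}(A, B)$, $\mathrm{links}^{-}(A, B)$, and $\mathrm{cut}(A, \bar{A})$ by
$$\mathrm{links}^{+}(A, B) = \sum_{\substack{v_i \in A,\, v_j \in B \\ w_{ij} > 0}} w_{ij}, \qquad \mathrm{links}^{-}(A, B) = \sum_{\substack{v_i \in A,\, v_j \in B \\ w_{ij} < 0}} -w_{ij}, \qquad \mathrm{cut}(A, \bar{A}) = \sum_{\substack{v_i \in A,\, v_j \in \bar{A} \\ w_{ij} \neq 0}} |w_{ij}|.$$
Then, the signed Laplacian $\bar{L}$ is defined by
$$\bar{L} = \bar{D} - W,$$
and its normalized version $\bar{L}_{\mathrm{sym}}$ by
$$\bar{L}_{\mathrm{sym}} = \bar{D}^{-1/2}\, \bar{L}\, \bar{D}^{-1/2} = I - \bar{D}^{-1/2}\, W\, \bar{D}^{-1/2}.$$
For a graph without isolated vertices, we have $\bar{d}(v_i) > 0$ for $i = 1, \ldots, m$, so $\bar{D}^{-1/2}$ is well defined.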
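For any $m \times m$ symmetric matrix $W = (w_{ij})$, if we let $\bar{L} = \bar{D} - W$ where $\bar{D}$ is the signed degree matrix associated with $W$, then we have
$$x^\top \bar{L}\, x = \frac{1}{2} \sum_{i,j=1}^{m} |w_{ij}| \left(x_i - \mathrm{sgn}(w_{ij})\, x_j\right)^2 \quad \text{for all } x \in \mathbb{R}^m.$$
Consequently, $\bar{L}$ is positive semidefinite.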
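The identity above can be checked numerically. The following is a minimal sketch, assuming the signed degree is the row sum of absolute weights and the signed Laplacian is the signed degree matrix minus $W$; the 4-node weight matrix and the function names are made up for illustration and are not taken from the paper.

```python
# Sketch: signed degree, signed Laplacian, and the quadratic-form identity.
# The weight matrix W below is a hypothetical toy example.

def signed_degrees(W):
    """Signed degree of each node: sum of absolute weights in its row."""
    return [sum(abs(w) for w in row) for row in W]

def signed_laplacian(W):
    """Signed Laplacian: diag(signed degrees) minus W."""
    d = signed_degrees(W)
    m = len(W)
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(m)]
            for i in range(m)]

def quadratic_form(L, x):
    """Compute x^T L x for a square matrix L given as nested lists."""
    m = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(m) for j in range(m))

def edge_sum(W, x):
    """Compute (1/2) * sum_ij |w_ij| * (x_i - sgn(w_ij) * x_j)^2."""
    def sgn(w):
        return (w > 0) - (w < 0)
    m = len(x)
    return 0.5 * sum(abs(W[i][j]) * (x[i] - sgn(W[i][j]) * x[j]) ** 2
                     for i in range(m) for j in range(m))

# Symmetric, zero diagonal, with two negative (antonym-like) edges.
W = [
    [0.0, 0.8, -0.5, 0.0],
    [0.8, 0.0, 0.0, -0.3],
    [-0.5, 0.0, 0.0, 0.9],
    [0.0, -0.3, 0.9, 0.0],
]
L = signed_laplacian(W)
x = [1.0, -2.0, 0.5, 3.0]
lhs = quadratic_form(L, x)   # x^T L x
rhs = edge_sum(W, x)         # sum over edges; never negative
```

Because every term of the edge sum is a non-negative weight times a square, agreement of the two quantities for arbitrary `x` illustrates why the signed Laplacian is positive semidefinite.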
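Given a partition of $V$ into $K$ clusters $(A_1, \ldots, A_K)$, we represent the $j$th block of this partition by a vector $X^j$ such that
$$X^j_i = \begin{cases} a_j & \text{if } v_i \in A_j, \\ 0 & \text{if } v_i \notin A_j, \end{cases}$$
for some $a_j \neq 0$.
The signed normalized cut $\mathrm{sNcut}(A_1, \ldots, A_K)$ of the partition $(A_1, \ldots, A_K)$ is defined as
$$\mathrm{sNcut}(A_1, \ldots, A_K) = \sum_{j=1}^{K} \frac{\mathrm{cut}(A_j, \bar{A}_j)}{\mathrm{vol}(A_j)} + 2 \sum_{j=1}^{K} \frac{\mathrm{links}^{-}(A_j, A_j)}{\mathrm{vol}(A_j)}.$$
Another formulation is
$$\mathrm{sNcut}(A_1, \ldots, A_K) = \sum_{j=1}^{K} \frac{(X^j)^\top \bar{L}\, X^j}{(X^j)^\top \bar{D}\, X^j},$$
where $X$ is the $m \times K$ matrix whose $j$th column is $X^j$.
Observe that minimizing $\mathrm{sNcut}(A_1, \ldots, A_K)$ amounts to minimizing the number of positive and negative edges between clusters, and also minimizing the number of negative edges within clusters. This second minimization captures the intuition that nodes connected by a negative edge should not be together (they do not “like” each other; they should be far from each other).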
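To make the objective concrete, the sketch below evaluates a signed normalized cut for two candidate partitions of a small hypothetical graph. It assumes the cut-plus-twice-negative-links form divided by cluster volume, with sums taken over ordered node pairs; the graph and the function name `signed_ncut` are illustrative only.

```python
# Sketch: signed normalized cut of a partition, on a hypothetical toy graph.

def signed_ncut(W, clusters):
    """Sum over clusters of (cut + 2 * negative links inside) / volume.

    All sums run over ordered pairs (i, j), so an undirected negative
    edge inside a cluster contributes twice to the negative-links term.
    """
    m = len(W)
    total = 0.0
    for cluster in clusters:
        inside = set(cluster)
        vol = sum(abs(W[i][j]) for i in inside for j in range(m))
        cut = sum(abs(W[i][j]) for i in inside for j in range(m)
                  if j not in inside)
        neg_within = sum(-W[i][j] for i in inside for j in inside
                         if W[i][j] < 0)
        total += (cut + 2.0 * neg_within) / vol
    return total

# Nodes 0-1 and 2-3 are joined by positive edges; 0-2 and 1-3 by negative ones.
W = [
    [0.0, 0.8, -0.5, 0.0],
    [0.8, 0.0, 0.0, -0.3],
    [-0.5, 0.0, 0.0, 0.9],
    [0.0, -0.3, 0.9, 0.0],
]
good = signed_ncut(W, [[0, 1], [2, 3]])  # negative edges run between clusters
bad = signed_ncut(W, [[0, 2], [1, 3]])   # negative edges fall inside clusters
```

In this toy case the partition that keeps the negatively linked nodes apart has the smaller objective value, which matches the intuition stated above.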
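We have our first formulation of $K$-way clustering of a graph using normalized cuts, called problem PNC1 (the notation PNCX is used in Yu & Shi (2003), Section 2.1):
If we let
$$\mathcal{X} = \left\{ [X^1 \cdots X^K] \;\middle|\; X^j = a_j (x^j_1, \ldots, x^j_m)^\top,\ x^j_i \in \{1, 0\},\ a_j \in \mathbb{R},\ X^j \neq 0 \right\},$$
our solution set is
$$\mathcal{K} = \left\{ X \in \mathcal{X} \;\middle|\; (X^i)^\top \bar{D}\, X^j = 0,\ 1 \leq i, j \leq K,\ i \neq j \right\}.$$
$K$-way Clustering of a graph using Normalized Cut, Version 1 (Problem PNC1):
$$\text{minimize } \sum_{j=1}^{K} \frac{(X^j)^\top \bar{L}\, X^j}{(X^j)^\top \bar{D}\, X^j} \quad \text{subject to } (X^i)^\top \bar{D}\, X^j = 0,\ 1 \leq i, j \leq K,\ i \neq j,\ X \in \mathcal{X}.$$
An equivalent version of the optimization problem is Problem PNC2:
$$\text{minimize } \mathrm{tr}(X^\top \bar{L}\, X) \quad \text{subject to } X^\top \bar{D}\, X = I,\ X \in \mathcal{X}.$$
The natural relaxation of problem PNC2 is to drop the condition that $X \in \mathcal{X}$, and we obtain the relaxed problem $(*_2)$:
$$\text{minimize } \mathrm{tr}(X^\top \bar{L}\, X) \quad \text{subject to } X^\top \bar{D}\, X = I.$$
If $X$ is a solution to the relaxed problem, then $XQ$ is also a solution, where $Q \in O(K)$.
We make the change of variable $Y = \bar{D}^{1/2} X$, or equivalently $X = \bar{D}^{-1/2} Y$.
However, since $Y^\top Y = I$, we have
$$Y^{+} = Y^\top,$$
so we get the equivalent problem $(**_2)$:
$$\text{minimize } \mathrm{tr}(Y^\top \bar{D}^{-1/2}\, \bar{L}\, \bar{D}^{-1/2}\, Y) \quad \text{subject to } Y^\top Y = I.$$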
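As a hedged aside, the relaxed problem admits the standard closed-form solution given by the Rayleigh-Ritz (Ky Fan) theorem. The notation in the block is assumed rather than quoted: $\bar{D}$ is the signed degree matrix, $\bar{L} = \bar{D} - W$ the signed Laplacian, $\bar{L}_{\mathrm{sym}} = \bar{D}^{-1/2} \bar{L} \bar{D}^{-1/2}$ its normalized version with eigenvalues $\nu_1 \leq \cdots \leq \nu_m$ and orthonormal eigenvectors $u_1, \ldots, u_m$, and $K$ the number of clusters.

```latex
\begin{align*}
Y = \bar{D}^{1/2} X
  &\;\Longrightarrow\;
  X^\top \bar{D} X = Y^\top Y,
  \qquad
  X^\top \bar{L} X = Y^\top \bar{L}_{\mathrm{sym}} Y, \\
\min_{Y^\top Y = I} \operatorname{tr}\!\left(Y^\top \bar{L}_{\mathrm{sym}} Y\right)
  &= \nu_1 + \cdots + \nu_K,
  \qquad \text{attained at } Y = [\,u_1 \;\cdots\; u_K\,], \\
Z &= \bar{D}^{-1/2}\,[\,u_1 \;\cdots\; u_K\,]
  \qquad \text{(a minimizer of the relaxed problem in } X\text{)}, \\
\operatorname{tr}\!\left((ZQ)^\top \bar{L}\,(ZQ)\right)
  &= \operatorname{tr}\!\left(Q^\top Z^\top \bar{L} Z\, Q\right)
   = \operatorname{tr}\!\left(Z^\top \bar{L} Z\right)
  \qquad \text{for any orthogonal } Q.
\end{align*}
```

The last line is the reason the relaxed solution is only determined up to an orthogonal $K \times K$ factor, which is what the discretization step has to resolve.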
Given a solution $Z$ of the relaxed problem $(*_2)$, we look for pairs $(X, Q)$ with $X \in \mathcal{X}$ and where $Q$ is a $K \times K$ matrix with nonzero and pairwise orthogonal columns, with $\|X\|_F = \|Z\|_F$, that minimize
$$\varphi(X, Q) = \|X - ZQ\|_F.$$
Here, $\|A\|_F$ is the Frobenius norm of $A$.
This is a difficult nonlinear optimization problem involving two unknown matrices $X$ and $Q$. To simplify the problem, we proceed by alternating steps during which we minimize $\varphi(X, Q) = \|X - ZQ\|_F$ with respect to $X$ holding $Q$ fixed, and steps during which we minimize $\varphi(X, Q)$ with respect to $Q$ holding $X$ fixed.
This second step, in which $X$ is held fixed, has been studied, but it is still a hard problem for which no closed-form solution is known. Consequently, we further simplify the problem. Since $Q$ is of the form $Q = R\Lambda$ where $R \in O(K)$ and $\Lambda$ is a diagonal invertible matrix, we minimize $\|X - ZR\Lambda\|_F$ in two stages.
Stage 1: We set $\Lambda = I$ and find $R \in O(K)$ that minimizes $\|X - ZR\|_F$.
Stage 2: Given $X$, $Z$, and $R$, find a diagonal invertible matrix $\Lambda$ that minimizes $\|X - ZR\Lambda\|_F$.
The matrix $R\Lambda$ is not a minimizer of $\|X - ZR\Lambda\|_F$ in general, but it is an improvement on $R$ alone, and both stages can be solved quite easily.
In stage 1, the matrix $Q = R$ is orthogonal, so $QQ^\top = I$, and since $Z$ and $X$ are given, the problem reduces to minimizing $-2\,\mathrm{tr}(Q^\top Z^\top X)$; that is, maximizing $\mathrm{tr}(Q^\top Z^\top X)$.
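The reduction in stage 1 follows from expanding the squared Frobenius norm, and the resulting maximization is the classical orthogonal Procrustes problem, which has a known solution through a singular value decomposition. The derivation below is a sketch under assumed notation: $X$ is the current discrete candidate, $Z$ the relaxed solution, $Q$ an orthogonal $K \times K$ matrix, and $U \Sigma V^\top$ a singular value decomposition of $Z^\top X$ with singular values $\sigma_i$.

```latex
\begin{align*}
\|X - ZQ\|_F^2
  &= \operatorname{tr}(X^\top X)
   - 2\operatorname{tr}(Q^\top Z^\top X)
   + \operatorname{tr}(Q^\top Z^\top Z Q) \\
  &= \|X\|_F^2 + \|Z\|_F^2 - 2\operatorname{tr}(Q^\top Z^\top X)
  \qquad \text{since } QQ^\top = I, \\
\operatorname{tr}(Q^\top Z^\top X)
  &= \operatorname{tr}(Q^\top U \Sigma V^\top)
   = \operatorname{tr}\!\left((V^\top Q^\top U)\,\Sigma\right)
   \;\le\; \sum_{i=1}^{K} \sigma_i, \\
\text{with equality when } V^\top Q^\top U &= I,
  \quad \text{that is, } Q = U V^\top .
\end{align*}
```

The inequality holds because $V^\top Q^\top U$ is orthogonal, so its diagonal entries are at most 1 in absolute value.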
The evaluation of clusters is non-trivial to generalize. We used both intrinsic and extrinsic methods of evaluation. Intrinsic evaluation is twofold. First, we examine cluster entropy, purity, the number of disconnected components, and the number of negative edges; we also compare multiple word embeddings and thesauri to show the stability of our method. Second, we compare against a gold standard designed for the task of capturing word similarity, using accuracy and recall as our evaluation metrics. For extrinsic evaluation, we use our clusters to identify polarity and apply this to the sentiment prediction task.
For clustering there are several choices to make. The first is the similarity metric. In this paper we chose the heat kernel based on the Euclidean distance between word vector representations. We define the distance between two words $w_i$ and $w_j$ as $d(w_i, w_j) = \|w_i - w_j\|$. In the paper by Belkin & Niyogi (2003), the authors show that the heat kernel where
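The next choice, how to combine the word embeddings with the thesauri in order to make a signed graph, also has hyperparameters. We can represent the thesaurus as a matrix $T$ where
$$T_{ij} = \begin{cases} 1 & \text{if words } i \text{ and } j \text{ are synonyms}, \\ -1 & \text{if words } i \text{ and } j \text{ are antonyms}, \\ 0 & \text{otherwise}. \end{cases}$$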
Another alternative is to only look at the antonym information, so
We can write the signed graph as or in matrix form where computes Hadamard product (element-wise multiplication); however, the graph will only contain the overlapping vocabulary. In order to solve this problem we use .
It is important to note that this metric does not require a gold standard. Obviously we want this number to be as small as possible.
We used thesaurus information to define two other novel metrics: the number of negative edges (NNE) within the clusters, and the number of disconnected components (NDC) within the clusters when only synonym edges are used.
The NDC has the disadvantage that it depends on thesaurus coverage. Figure 2 shows a graphical representation of the number of disconnected components and negative edges.
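The two counts can be sketched as follows. This is one plausible reading rather than the exact procedure used here: NNE counts antonym pairs whose two words land in the same cluster, and NDC sums, over clusters, the number of connected components induced by synonym edges, so that a singleton cluster counts as one component. The function names and the toy data are hypothetical.

```python
# Sketch: number of negative edges (NNE) and number of disconnected
# components (NDC) for a clustering, given thesaurus pairs.
# Summing per-cluster component counts is an assumption of this sketch.

def count_negative_edges(clusters, antonym_pairs):
    """Count antonym pairs whose two words share a cluster."""
    cluster_of = {w: c for c, words in enumerate(clusters) for w in words}
    return sum(1 for a, b in antonym_pairs
               if a in cluster_of and b in cluster_of
               and cluster_of[a] == cluster_of[b])

def count_components(clusters, synonym_pairs):
    """Sum over clusters of connected components under synonym edges."""
    total = 0
    for words in clusters:
        parent = {w: w for w in words}  # union-find forest for this cluster

        def find(w):
            while parent[w] != w:
                parent[w] = parent[parent[w]]  # path halving
                w = parent[w]
            return w

        for a, b in synonym_pairs:
            if a in parent and b in parent:
                parent[find(a)] = find(b)
        total += len({find(w) for w in words})
    return total

clusters = [["good", "great", "bad"], ["awful"]]
synonym_pairs = [("good", "great")]
antonym_pairs = [("good", "bad"), ("great", "awful")]

nne = count_negative_edges(clusters, antonym_pairs)
ndc = count_components(clusters, synonym_pairs)
```

Under this reading, words missing from the thesaurus always form their own component, which is the coverage disadvantage noted above.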
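Next we evaluate our clusters using an external gold standard. Cluster purity and entropy (Zhao & Karypis, 2001) are defined as
$$\mathrm{Purity} = \sum_{r=1}^{k} \frac{1}{n} \max_{i}\left(n_r^i\right), \qquad \mathrm{Entropy} = \sum_{r=1}^{k} \frac{n_r}{n} \left( -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r} \right),$$
where $q$ is the number of classes, $k$ the number of clusters, $n$ the total number of data points, $n_r$ is the size of cluster $r$, and $n_r^i$ the number of data points in class $i$ clustered in cluster $r$. The purity and entropy measures improve (increased purity, decreased entropy) monotonically with the number of clusters.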
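A minimal sketch of these two measures is given below, assuming the usual Zhao & Karypis form: purity sums the majority-class count of each cluster divided by the total number of points, and entropy is the size-weighted class entropy of each cluster normalized by the logarithm of the number of classes. The function name and the inputs are illustrative.

```python
# Sketch: cluster purity and entropy against gold-standard class labels.
import math
from collections import Counter

def purity_and_entropy(cluster_ids, class_ids):
    """Return (purity, entropy) for parallel lists of cluster and class ids."""
    n = len(cluster_ids)
    q = len(set(class_ids))          # number of classes
    per_cluster = {}                 # cluster id -> Counter of class ids
    for r, i in zip(cluster_ids, class_ids):
        per_cluster.setdefault(r, Counter())[i] += 1

    purity = 0.0
    entropy = 0.0
    for class_counts in per_cluster.values():
        n_r = sum(class_counts.values())
        purity += max(class_counts.values()) / n
        if q > 1:  # entropy normalization is undefined for a single class
            h = -sum((c / n_r) * math.log(c / n_r)
                     for c in class_counts.values())
            entropy += (n_r / n) * h / math.log(q)
    return purity, entropy

# Two pure clusters versus one fully mixed cluster.
pure = purity_and_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = purity_and_entropy([0, 0, 0, 0], ["a", "a", "b", "b"])
```

Placing every point in its own cluster gives perfect purity and zero entropy, which is why these measures are read alongside the number of clusters rather than on their own.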
In this section we begin with intrinsic analysis of the resulting clusters. We then compare empirical clusters with SimLex-999 as a gold standard for semantic word similarity. Finally, we evaluate our method using the sentiment prediction task. Our synonym clusters are well suited for this task, as including antonyms in clusters results in incorrect predictions.
In order to evaluate our signed graph clustering method, we first focused on intrinsic measures of cluster quality. Figure 3.2 demonstrates that the number of negative edges within a cluster is minimized using our clustering algorithm on simulated data. However, as the number of clusters becomes large, the number of disconnected components, which includes clusters of size one, increases. For our further empirical analysis, we used both the number of disconnected components as well as the number of antonyms within clusters in order to set the cluster size.
For comparison, we used four different word embedding methods: Skip-gram vectors (word2vec) (Mikolov et al., 2013), Global vectors (GloVe) (Pennington et al., 2014), Eigenwords (Dhillon et al., 2015), and Global Context (GloCon) (Huang et al., 2012) vector representation. We used word2vec 300 dimensional embeddings which were trained using word2vec code on several billion words of English comprising the entirety of Gigaword and the English discussion forum data gathered as part of BOLT. A minimal tokenization was performed based on CMU’s twokenize (https://github.com/brendano/ark-tweet-nlp). For GloVe we used pretrained 200 dimensional vector embeddings (http://nlp.stanford.edu/projects/GloVe/) trained using Wikipedia 2014 + Gigaword 5 (6B tokens). Eigenwords were trained on English Gigaword with no lowercasing or cleaning. Finally, we used 50 dimensional vector representations from Huang et al. (2012), which used the April 2010 snapshot of the Wikipedia corpus (Lin, 1998; Shaoul, 2010), with a total of about 2 million articles and 990 million tokens.
Several thesauri were used in order to test robustness, including Roget’s Thesaurus, the Microsoft Word English (MS Word) thesaurus from Samsonovic et al. (2010), and WordNet 3.0 (Miller, 1995). Jarmasz & Szpakowicz (2004) and Hale (1998) have shown that Roget’s thesaurus has better semantic similarity than WordNet. This is consistent with our results using a larger dataset of SimLex-999.
We chose a subset of 5108 words for the training dataset, which had high overlap between various sources. Changes to the training dataset had minimal effects on the optimal parameters. Within the training dataset, each of the thesauri had roughly 3700 antonym pairs, and combined they had 6680. However, the number of distinct connected components varied, with Roget’s Thesaurus having the least (629), and MS Word Thesaurus (1162) and WordNet (2449) having the most. These ratios were consistent across the full dataset.
One of our main goals was to go beyond qualitative analysis into quantitative measures of synonym clusters and word similarity. In Table 1, we show the 4 most-associated words with “accept”, “negative” and “unlike”.
|Ref word||Roget||WordNet||MS Word||W2V||GloDoc||EW||Glove|
|accept||accept your fate||get||swallow||reject||consider||declare||reject|
|accept||be fooled by||fancy||consent||agree||know||endorse||willin|
For a similarity metric between any two words, we use the heat kernel of Euclidean distance, so $W_{ij} = e^{-\|w_i - w_j\|^2 / \sigma}$. The thesaurus matrix entry $T_{ij}$ has a weight of 1 if words $i$ and $j$ are synonyms, -1 if words $i$ and $j$ are antonyms, and 0 otherwise. Thus the weight matrix entries .
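The construction of a signed weight matrix from embeddings and a thesaurus can be sketched as follows. Several choices here are assumptions and not the exact recipe of this paper: the kernel bandwidth is called `sigma`, antonym pairs receive the negated kernel value, synonym pairs the kernel value, and pairs absent from the thesaurus either keep the kernel value or are zeroed depending on the hypothetical `keep_unlisted` flag. The toy vectors are made up.

```python
# Sketch: heat-kernel similarities with a thesaurus overlay that flips the
# sign of antonym pairs. How unlisted pairs are weighted is an assumption.
import math

def heat_kernel(u, v, sigma):
    """exp(-||u - v||^2 / sigma) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / sigma)

def signed_weights(vectors, synonyms, antonyms, sigma=1.0, keep_unlisted=True):
    """Return (sorted word list, symmetric signed weight matrix)."""
    words = sorted(vectors)
    m = len(words)
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            pair = frozenset((words[i], words[j]))
            k = heat_kernel(vectors[words[i]], vectors[words[j]], sigma)
            if pair in antonyms:
                w = -k              # thesaurus entry -1: repel
            elif pair in synonyms or keep_unlisted:
                w = k               # thesaurus entry +1, or embedding only
            else:
                w = 0.0             # drop pairs the thesaurus does not list
            W[i][j] = W[j][i] = w
    return words, W

# Hypothetical 2-dimensional embeddings in which antonyms sit close together.
vectors = {"great": [1.0, 0.0], "awful": [0.9, 0.1], "superb": [1.0, 0.2]}
synonyms = {frozenset(("great", "superb"))}
antonyms = {frozenset(("great", "awful"))}

words, W = signed_weights(vectors, synonyms, antonyms, sigma=1.0)
```

Even though "great" and "awful" are nearly identical in this toy embedding, the overlay gives their edge a negative weight, which is the property the signed clustering relies on.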
|Word2Vec + Roget||0.7||0.04||750||0.033||0.94||0.09|
|Eigenword + MSW||1.0||0.08||200||0.042||0.95||0.01|
|GloCon + Roget||0.9||0.06||750||0.048||0.94||0.02|
|Glove + Roget||11.0||0.01||1000||0.070||0.91||0.10|
Table 2 shows results from the grid search of hyperparameter optimization. Here we show that Eigenword + MSW outperforms Eigenword + Roget, which is in contrast with the other word embeddings where the combination with Roget performs better.
As a baseline, we created clusters using K-means where the number of K clusters was set to 750. All K-means clusters have a statistically significant difference in the number of antonym pairs relative to random assignment of labels. When compared with the MS Word thesaurus, Word2Vec, Eigenword, GloCon, and GloVe word embeddings had a total of 286, 235, 235, and 220 negative edges, respectively. The results are similar with the other thesauri. This shows that there are a significant number of antonym pairs in the K-means clusters derived from the word embeddings. By optimizing the hyperparameters using normalized cuts without thesauri information, we found a significant decrease in the number of negative edges, which was indistinguishable from random assignment and corresponded to a roughly ninety percent decrease across clusters. When analyzed using an out-of-sample thesaurus and 27081 words, the number of antonym clusters decreased to under 5 for all word embeddings, with the addition of antonym relationship information.
When we examined the number of distinct connected components within the different word clusters, we observed that with K-means the number of disconnected components differed significantly from random labelling. This suggests that the word embeddings capture synonym relationships. By optimizing the hyperparameters we found roughly a 10 percent decrease in distinct connected components using normalized cuts. When we added the signed antonym relationships using our signed clustering algorithm, on average we found a thirty-nine percent decrease over the K-means clusters. Again, this shows that the hyperparameter optimization is highly effective.
SimLex-999 is a gold standard resource for semantic similarity, not relatedness, based on ratings by human annotators. The differentiation between relatedness and similarity was a problem with previous datasets such as WordSim-353. Hill et al. (2014) provide a further comparison of SimLex-999 to previous datasets. Table 3 shows the difference between SimLex-999 and WordSim-353.
|Pair||Simlex-999 rating||WordSim-353 rating|
|coast - shore||9.00||9.10|
|clothes - closet||1.96||8.00|
SimLex-999 comprises multiple parts of speech, with 666 Noun-Noun pairs, 222 Verb-Verb pairs and 111 Adjective-Adjective pairs. In a perfect setting, all word pairs rated highly similar by human annotators would be in the same cluster, and all words which were rated dissimilar would be in different clusters. Since our clustering algorithm produced sets of words, we used this evaluation instead of the more commonly reported correlations.
|MS Thes Lookup||0.70||0.57|
|Roget Thes Lookup||0.63||0.99|
|WordNet Thes Lookup||0.43||1.00|
|Combined Thes Lookup||0.90||1.00|
In Table 4 we show the results of the evaluation with SimLex-999. Accuracy increased for all of the clustering methods aside from Eigenwords+CombThes. However, we achieved better results when we exclusively used the MS Word thesaurus. Combining thesaurus lookup and word2vec+CombThes clusters yielded an accuracy of 0.96.
We used the Socher et al. (2013) sentiment treebank (http://nlp.stanford.edu/sentiment/treebank.html) with coarse-grained labels on phrases and sentences from movie review excerpts. The treebank is split into training (6920), development (872), and test (1821) datasets. We trained a norm-regularized logistic regression (Friedman et al., 2001) using our word clusters in order to predict the coarse-grained sentiment at the sentence level. We compared our model against existing models: Naive Bayes with bag of words (NB), sentence word embedding averages (VecAvg), retrofitted sentence word embeddings (RVecAvg) (Faruqui et al., 2015), simple recursive neural network (RNN), recursive neural tensor network (RNTN) (Socher et al., 2013), and the state-of-the-art convolutional neural network (CNN) (Kim, 2014). Table 5 shows that although our model does not outperform the state of the art, signed clustering performs better than comparable models, including the recursive neural network, which has access to more information.
|NB (Socher et al., 2013)||0.818|
|VecAvg (W2V, GV, GC) (Faruqui et al., 2015)||0.812, 0.796, 0.678|
|RVecAvg (W2V, GV, GC) (Faruqui et al., 2015)||0.821, 0.822, 0.689|
|RNN, RNTN (Socher et al., 2013)||0.824, 0.854|
|CNN (Le & Zuidema, 2015)||0.881|
We developed a novel theory for signed normalized cuts as well as an algorithm for finding the discrete solution. We showed that we can find superior synonym clusters which do not require new word embeddings, but simply overlay thesaurus information. The clusters are general and can be used with several out-of-the-box word embeddings. By accounting for antonym relationships, our algorithm greatly outperforms simple normalized cuts, even with Huang’s word embeddings, which are designed to capture semantic relations. Finally, we examined our clustering method on the sentiment analysis task from the Socher et al. (2013) sentiment treebank dataset and showed improved performance versus comparable models.
This method could be applied to a broad range of NLP tasks, such as prediction of social group clustering, identification of personal versus non-personal verbs, and analysis of clusters which capture positive, negative, and objective emotional content. It could also be used to explore multi-view relationships, such as aligning synonym clusters across multiple languages. Another possibility is to use thesauri and word vector representations together with word sense disambiguation to generate synonym clusters for multiple senses of words. Finally, our signed clustering could be extended to evolutionary signed clustering.
Retrofitting word vectors to semantic lexicons. Proceedings of NAACL 2015, Denver, CO, 2015.
Recent Advances in Natural Language Processing III: Selected Papers from RANLP, 2003:111, 2004.
International Conference on Artificial Intelligence and Statistics (AISTATS), 22:1143-1151, 2012.