1 Introduction

The success of unsupervised word embeddings has motivated researchers to learn embeddings for larger chunks of text such as sentences. Current research in sentence embedding is mainly advancing along two lines. In one line, researchers use powerful and complex models such as deep neural networks and recurrent neural networks to capture the semantics of sentences (Blunsom et al., 2014; Iyyer et al., 2015; Yin and Schütze, 2015; Cer et al., 2018). In a complementary second line, researchers have invented computationally cheap alternatives that embed sentences using simple linear algebraic operations (Wieting et al., 2016; Arora et al., 2017; Mu et al., 2017a; Khodak et al., 2018; Ethayarajh, 2018). Surprisingly, many simple methods yield comparable or even better results than complicated methods, particularly in out-of-domain tasks (Wieting et al., 2016). The current paper follows the second avenue of research.
Among all methods for sentence embedding, arguably the simplest one is to compute a sentence embedding as the average of the sentence's word vectors. This naive approach has proven to be a formidable baseline for many downstream natural language processing (NLP) tasks (Faruqui et al., 2014; Wieting et al., 2016). However, it comes with a limitation: since the word vectors of a given sentence are spanned by a few leading directions (Mu et al., 2017b; Khodak et al., 2018), averaging the word vectors amplifies these leading directions while diminishing the useful signals contained in the trailing directions. We refer to this problem as the common direction bias in linear representations of sentences.
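For concreteness, the averaging baseline can be written in a couple of lines of numpy; the toy four-dimensional vocabulary below is ours, purely for illustration:

```python
import numpy as np

# Hypothetical 4-dimensional word vectors (illustrative only).
word_vectors = {
    "the": np.array([0.9, 0.1, 0.0, 0.0]),
    "cat": np.array([0.2, 0.8, 0.1, 0.0]),
    "sat": np.array([0.1, 0.2, 0.7, 0.3]),
}

def average_embedding(tokens, vectors):
    """Embed a sentence as the plain average of its word vectors."""
    return np.mean([vectors[t] for t in tokens], axis=0)

emb = average_embedding(["the", "cat", "sat"], word_vectors)
```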
To correct the common direction bias, researchers have invented a "common component removal" trick (Arora et al., 2017; Mu and Viswanath, 2018). This technique removes the top one or top few principal components from the word vectors (Mu and Viswanath, 2018) or from a weighted average thereof (Arora et al., 2017). Intuitively, since dominating directions of word vectors tend to influence the additive composition in the same way, nulling out such directions ameliorates the effect. Post-processed with such a technique, linear representations of sentences usually deliver strong performance, sometimes outperforming sophisticated deep-learning based methods, including RNNs and LSTMs, on standard benchmarks (Arora et al., 2017; Mu and Viswanath, 2018; Ethayarajh, 2018).
Although common component removal has proven to be effective, the technique is liable either to not remove enough noise or to cause too much information loss (Khodak et al., 2018). In this paper, we propose a novel and simple way to address this issue. Our proposed method can be regarded as a “soft” version of common component removal. Specifically, given a sequence of word vectors, we softly down-weight principal components (PCs) with the assistance of a regularized identity map called a Conceptor (Jaeger, 2017).
The rest of the paper is organized as follows. We first review the linear representation of sentences by Arora et al. (2017). We then introduce the Conceptor approach for soft common component removal, which is the main contribution of this paper. After that, we demonstrate the effectiveness of the proposed method on the Clinical STS dataset of the BioCreative/OHNLP Challenge 2018.
2 Linear representation of sentences
Arora et al. (2017) model the generation of each sentence $s$ as driven by a discourse vector $c_s$, which is a vector-valued random variable taking values in $\mathbb{R}^d$. Arora et al. (2017) further assume that there exists a fixed common discourse vector $c_0 \in \mathbb{R}^d$ which is orthogonal to all realizations of $c_s$, i.e., $\langle c_0, c_s \rangle = 0$. Given a discourse $\tilde{c}_s := \beta c_0 + (1 - \beta) c_s$, the emitting probability for a word $w$ is assumed to be

$$\Pr[w \mid \tilde{c}_s] = \alpha\, p(w) + (1 - \alpha)\, \frac{\exp(\langle \tilde{c}_s, v_w \rangle)}{Z_{\tilde{c}_s}},$$

where $v_w \in \mathbb{R}^d$ is the word vector for the word $w$, $p(w)$ is the monogram probability of the word $w$, $Z_{\tilde{c}_s} = \sum_{w \in \mathcal{V}} \exp(\langle \tilde{c}_s, v_w \rangle)$ is the normalizing term, and $\alpha, \beta \in [0, 1]$ are scalar hyper-parameters. As a result, this model favors producing two types of words: words with high monogram probability and words whose vector representations lie close to both $c_0$ and $c_s$ (up to the balancing parameter $\beta$). Using this model, Arora et al. (2017) derived a sentence embedding algorithm which contains two steps. In the first step, $c_s$ is approximated by the weighted average $\hat{c}_s$, which has the form

$$\hat{c}_s := \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)}\, v_w \tag{1}$$

for a scalar hyper-parameter $a$; in the second step, the common discourse $c_0$ is estimated as the first PC $u_1$ of the set of sentence embeddings $\{\hat{c}_s : s \in \mathcal{S}\}$ via an uncentered PCA. The final sentence embedding is consequently obtained by removing the projection of $\hat{c}_s$ on the first PC, i.e., by letting

$$v_s := \hat{c}_s - u_1 u_1^\top \hat{c}_s \tag{2}$$

as an approximation of $c_s$.
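The two-step procedure above can be sketched in a few lines of numpy. This is our own simplified illustration of the scheme, not the authors' reference implementation; the function name and the default value of the weighting parameter are ours:

```python
import numpy as np

def sif_embeddings(sentences, vectors, probs, a=1e-3):
    """Step 1: probability-weighted average of word vectors.
    Step 2: remove the projection on the first PC, estimated by an
    uncentered PCA over all sentence embeddings."""
    embs = np.stack([
        np.mean([a / (a + probs[w]) * vectors[w] for w in s], axis=0)
        for s in sentences
    ])
    # First left singular vector of the d x n matrix of embeddings
    # = first PC of the uncentered PCA.
    u1 = np.linalg.svd(embs.T, full_matrices=False)[0][:, 0]
    return embs - np.outer(embs @ u1, u1)
```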
We now take a more abstract view on the common component removal step, i.e., Equation 2. This step relies on a key assumption: there exists a single direction $u_1$ which represents a syntax (i.e., function word)-related "discourse". As a straightforward generalization, one can also assume that there exists a proper $K$-dimensional linear subspace $\mathcal{F} \subset \mathbb{R}^d$, where $K < d$, such that all syntax-related discourses lie in $\mathcal{F}$. Under this assumption, one can define a projection matrix $P$ which characterizes the subspace $\mathcal{F}$ of common discourses. To separate $c_s$ from $\mathcal{F}$, one projects $\hat{c}_s$ to the orthogonal complement of $\mathcal{F}$, written as $\mathcal{F}^\perp$, by letting $v_s := (I - P)\hat{c}_s$, where $I$ is a $d \times d$ identity matrix. In particular, choosing $P = u_1 u_1^\top$, where $u_1$ is the first PC of a set of sentence embeddings $\{\hat{c}_s : s \in \mathcal{S}\}$, we recover the second step of Arora et al. (2017). As an alternative, we can also choose $P = U U^\top$, where $U$ is a matrix whose columns are the first $K$ PCs of a set of sentences or a set of words. This alternative has been investigated by Mu and Viswanath (2018).
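In code, this family of hard projections (removing the top $K$ PCs, in the spirit of Mu and Viswanath (2018)) might be sketched as follows; the function name is ours:

```python
import numpy as np

def remove_top_pcs(X, k):
    """Hard common-component removal: project each row of the n x d
    matrix X onto the orthogonal complement of the subspace spanned
    by the top-k principal directions from an uncentered PCA."""
    U = np.linalg.svd(X.T, full_matrices=False)[0][:, :k]  # d x k leading PCs
    return X - (X @ U) @ U.T                               # rows times (I - U U^T)
```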
As shown in Mu and Viswanath (2018), the number $K$ plays a crucial role in the effect of common component(s) removal. In many situations, a fixed integer $K$ makes this approach liable either to not remove enough noise or to cause too much information loss (Khodak et al., 2018). We therefore propose an alternative method which removes the common components in a "softer" manner.
Our starting point is a relaxation of the key assumptions of Arora et al. (2017) and Mu and Viswanath (2018). Instead of assuming that function words are allocated along a single direction (Arora et al., 2017) or constrained in a proper linear subspace (Mu and Viswanath, 2018), we allow function and common words to span the whole of $\mathbb{R}^d$. This assumption admits a more realistic modeling: indeed, we find that the word vectors of stop words (Stone et al., 2011) span the entire space $\mathbb{R}^d$.
Allowing function words to span the whole of $\mathbb{R}^d$, however, leads to an obstacle: we cannot project the sentence embedding to the orthogonal complement of such a space. To address this issue, we use the Conceptor matrix (Jaeger, 2017) to approximate the space occupied by function words.
3 Conceptors as soft subspace projection maps
In this section we briefly introduce matrix Conceptors, sometimes using the wordings of Jaeger (2017). Consider a set of vectors $\{x_1, \dots, x_n\}$ with $x_i \in \mathbb{R}^d$ for all $i \in \{1, \dots, n\}$. A Conceptor matrix $C$ (under the assumption that the data points are identically distributed) can be defined as a regularized identity map that minimizes

$$\frac{1}{n} \sum_{i=1}^{n} \| x_i - C x_i \|^2 + \alpha^{-2} \| C \|_{\mathrm{F}}^2, \tag{3}$$

where $\|\cdot\|_{\mathrm{F}}$ is the Frobenius norm and $\alpha > 0$ is a scalar parameter called aperture. It can be shown that $C$ has a closed-form solution:

$$C = R \left( R + \alpha^{-2} I \right)^{-1}, \tag{4}$$

where $R = \frac{1}{n} X X^\top$ and $X$ is a data collection matrix whose $i$-th column is $x_i$. Assuming that the singular value decomposition (SVD) of $R$ has the form $R = U \Sigma U^\top$, we can re-write $C$ as

$$C = U \Sigma' U^\top, \tag{5}$$

where the singular values $\sigma'_i$ of $C$ can be written in terms of the singular values $\sigma_i$ of $R$: $\sigma'_i = \sigma_i / (\sigma_i + \alpha^{-2}) \in [0, 1)$. Applying Conceptors on the averaged word vectors $\{\hat{c}_s : s \in \mathcal{S}\}$, i.e., using $\hat{c}_s$ in place of $x_i$ in Equation 3, we see that the columns of the matrix $U$ are exactly the PCs estimated via the uncentered PCA of $\{\hat{c}_s : s \in \mathcal{S}\}$. In particular, the first column of $U$ is the first PC $u_1$ used in Equation 2 by Arora et al. (2017), introduced in the previous section.
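As a sanity check of the closed form and the singular-value relation above, here is a direct numpy transcription; `conceptor` and the aperture default are our naming choices, not an established API:

```python
import numpy as np

def conceptor(X, alpha=1.0):
    """Closed-form conceptor C = R (R + alpha^{-2} I)^{-1} for a
    d x n data collection matrix X, with R = X X^T / n."""
    d, n = X.shape
    R = X @ X.T / n
    return R @ np.linalg.inv(R + alpha ** (-2) * np.eye(d))
```

Because $R$ is symmetric positive semi-definite, the eigenvalues of $C$ are exactly $\sigma_i / (\sigma_i + \alpha^{-2})$ and therefore always lie in $[0, 1)$.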
We now study the matrix $F := I - C$. This matrix characterizes a linear subspace that can be roughly understood as the orthogonal complement of the subspace characterized by $C$. This fact can be seen via the following representation:

$$F = I - C = U \left( I - \Sigma' \right) U^\top,$$

where the diagonal entries of $I - \Sigma'$ are $1 - \sigma'_i = \alpha^{-2} / (\sigma_i + \alpha^{-2})$. Note that $F$ can be considered as a soft projection matrix which down-weights the leading PCs of $\{x_1, \dots, x_n\}$: for vectors $x$ in the linear subspace spanned by the leading PCs with large variance, $F x \approx 0$; for vectors $x$ in the linear subspace spanned by trailing PCs with low variance, $F x \approx x$. The soft projection has the following relationship with the hard projection in Equation 2: if we modify $\Sigma'$ into $\Sigma_0 := \operatorname{diag}(1, 0, \dots, 0)$ (cf. Equation 5), we recover the result in Equation 2:

$$v_s = \left( I - U \Sigma_0 U^\top \right) \hat{c}_s = \hat{c}_s - u_1 u_1^\top \hat{c}_s.$$
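The soft-versus-hard behavior can be seen on toy data with one dominant direction; the construction below is our own illustration, with aperture $\alpha = 1$:

```python
import numpy as np

# Toy data: a strong direction along e1 and a weak direction along e2.
X = np.array([[10.0, -10.0],
              [0.1,   0.1]])          # d=2, n=2 data collection matrix
R = X @ X.T / X.shape[1]              # correlation matrix, here diag(100, 0.01)
C = R @ np.linalg.inv(R + np.eye(2))  # conceptor with aperture alpha = 1
F = np.eye(2) - C                     # soft projection map

# The high-variance direction is almost nulled out;
# the low-variance direction passes almost unchanged.
strong = F @ np.array([1.0, 0.0])
weak = F @ np.array([0.0, 1.0])
```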
Besides applying Conceptors on the averaged word vectors $\{\hat{c}_s : s \in \mathcal{S}\}$, another reasonable approach is to directly apply Conceptors on all word vectors which constitute the sentences in a dataset. The Conceptors learned in this way have a more transparent interpretation: they characterize the shared linear subspace of mainly two types of words: (i) function words that have little lexical meaning, and (ii) frequent but non-function words in a particular dataset, which can be regarded as shared background information of the dataset. In practice, we find that learning Conceptors directly from word vectors usually delivers better results than from averaged word vectors, and we therefore use the former method throughout the numerical experiments presented below.
To help capture the space spanned by the two types of words introduced above, we find that we obtain better results if we estimate $C$ not only based on the word vectors appearing in the set of actual sentences but also based on the word vectors appearing in a predefined set of stop words. Such a set of stop words can be thought of as a prior which describes the subspace of common words. The overall sentence embedding procedure is displayed in Algorithm 1.
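The overall procedure can be sketched as follows. This is our own reading of the steps described above (a Conceptor learned from the corpus word vectors plus a stop-word prior, then the soft projection applied to weighted-average embeddings); all names and defaults are illustrative, not the paper's exact Algorithm 1:

```python
import numpy as np

def conceptor_sentence_embeddings(sentences, vectors, probs, stop_words,
                                  a=1e-3, alpha=1.0):
    """Soft common-component removal for sentence embeddings (sketch)."""
    d = len(next(iter(vectors.values())))
    # Learn the conceptor from corpus word vectors plus the stop-word prior.
    words = {w for s in sentences for w in s} | {w for w in stop_words if w in vectors}
    X = np.stack([vectors[w] for w in sorted(words)], axis=1)   # d x n
    R = X @ X.T / X.shape[1]
    C = R @ np.linalg.inv(R + alpha ** (-2) * np.eye(d))
    F = np.eye(d) - C                                           # soft removal map
    # Weighted-average embeddings, then apply the soft projection.
    embs = np.stack([
        np.mean([a / (a + probs[w]) * vectors[w] for w in s], axis=0)
        for s in sentences
    ])
    return embs @ F.T
```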
4 Experiments

We apply our proposed method to the BioCreative/OHNLP Challenge 2018 (Wang et al., 2018). Similar to the SemEval STS challenge series (Cer et al., 2017), the BioCreative/OHNLP Challenge 2018 offers a platform to evaluate the semantic similarity between a pair of sentences and to compare the results with manual annotations. Constructing a dataset by gathering naturally occurring pairs of sentences in the clinical context is a challenging task in its own right. For a detailed description of the dataset, we refer the reader to Wang et al. (2018).
For preprocessing, we use the nltk Python package (Bird et al., 2009) to tokenize the words in the sentences. We discard all punctuation. To estimate the monogram probabilities of words, we use word frequencies collected from Wikipedia (https://github.com/PrincetonML/SIF/blob/master/auxiliary_data/enwiki_vocab_min200.txt). We use two sets of pretrained word vectors, GloVe (Pennington et al., 2014) and Paragram-SL999 (Wieting et al., 2015). For the hyper-parameters in Algorithm 1, we use the set of stop words collected by Stone et al. (2011); we fix the aperture $\alpha$ for all experiments; and we choose the weighting parameter $a$ as done in Arora et al. (2017). The experimental results, in the metric of Spearman's rank correlation coefficient for sentence similarities, are shown in Figure 1, where the similarity between two sentence vectors is evaluated using cosine similarity.
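The scoring step is straightforward; a self-contained sketch of cosine scoring and Spearman's rank correlation (our own helper functions, assuming no tied scores) is:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman's rank correlation (no ties) as the Pearson
    correlation of the rank-transformed scores."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])
```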
5 Conclusion

In this paper, we described how to use a regularized identity map named Conceptor to correct the common direction bias in linear sentence embeddings. The goal is to softly project the sentence embeddings away from those principal components of word vectors which correspond to high variances. Empirically, we find the proposed method outperforms the baseline method of Arora et al. (2017). In future work, we will combine this method with the recently proposed unsupervised random walk sentence embedding (Ethayarajh, 2018).
Acknowledgments

The authors thank the anonymous reviewers for their helpful comments. Tianlin Liu appreciates a travel grant of the BioCreative/OHNLP Challenge 2018 at the ACM-BCB 2018 (grant number 5R01GM080646-12).
References

- Arora et al. (2017) S. Arora, Y. Liang, and T. Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.
- Bird et al. (2009) S. Bird, E. Klein, and E. Loper. 2009. Natural Language Processing with Python, 1st edition. O’Reilly Media, Inc.
- Blunsom et al. (2014) P. Blunsom, E. Grefenstette, and N. Kalchbrenner. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
- Cer et al. (2017) D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
- Cer et al. (2018) D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
- Ethayarajh (2018) K. Ethayarajh. 2018. Unsupervised random walk sentence embeddings: A strong but simple baseline. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 91–100. Association for Computational Linguistics.
- Faruqui et al. (2014) M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.
- Iyyer et al. (2015) M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.
- Jaeger (2017) H. Jaeger. 2017. Using conceptors to manage neural long-term memories for temporal patterns. Journal of Machine Learning Research, 18(13):1–43.
- Khodak et al. (2018) M. Khodak, N. Saunshi, Y. Liang, T. Ma, B. Stewart, and S. Arora. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. To appear in Proceedings of the Association for Computational Linguistics (ACL).
- Mu et al. (2017a) J. Mu, S. Bhat, and P. Viswanath. 2017a. Representing sentences as low-rank subspaces. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pages 629–634.
- Mu et al. (2017b) J. Mu, S. Bhat, and P. Viswanath. 2017b. Representing sentences as low-rank subspaces. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 629–634. Association for Computational Linguistics.
- Mu and Viswanath (2018) J. Mu and P. Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.
- Pennington et al. (2014) J. Pennington, R. Socher, and C. D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Stone et al. (2011) B. Stone, S. Dennis, and P. J. Kwantes. 2011. Comparing methods for single paragraph similarity analysis. Topics in Cognitive Science, 3(1):92–122.
- Wang et al. (2018) Y. Wang, N. Afzal, S. Liu, M. Rastegar-Mojarad, L. Wang, F. Shen, S. Fu, and H. Liu. 2018. Overview of the BioCreative/OHNLP Challenge 2018 task 2: Clinical semantic textual similarity. In Proceedings of the BioCreative/OHNLP Challenge 2018.
- Wieting et al. (2016) J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. 2016. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations.
- Wieting et al. (2015) J. Wieting, M. Bansal, K. Gimpel, K. Livescu, and D. Roth. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics, 3:345–358.
- Yin and Schütze (2015) W. Yin and H. Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 901–911.