There has been a large body of work on learning word representations, either in the form of word clusters (Brown et al., 1992) or vectors (Sahlgren, 2006; Turney & Pantel, 2010; Mikolov et al., 2013; Pennington et al., 2014; Baroni et al., 2014; Bansal et al., 2014), and these representations have proven useful in many NLP applications (Koo et al., 2008; Turian et al., 2010).
An ideal lexical representation should compress the space of lexical words while retaining the essential properties of words, in order to make predictions that generalize correctly across words. The typical approach is to first induce a lexical representation in a task-agnostic setting and then use it as features in different tasks. A different approach is to learn a lexical representation tailored to a particular task. In this work we explore the second approach, and employ the formulation by Madhyastha et al. (2014) to induce task-specific word embeddings. This method starts from a given lexical vector space and compresses it such that the resulting word embeddings are good predictors for a given lexical relation.
Specifically, we learn functions that compute compatibility scores between pairs of lexical items under some linguistic relation; we refer to these functions as bilexical operators. As an instance of this problem, consider learning a model that predicts the probability that an adjective modifies a noun in a sentence. In this case, we would like the bilexical operator to capture the fact that some adjectives are more compatible with some nouns than with others.
Given the complexity of lexical relations, one expects that the properties of words that are relevant for one relation differ from those relevant for another. This can affect the quality of an embedding, both in terms of its predictive power and of the compression it obtains. If we employ a task-agnostic low-dimensional embedding, will it retain all the lexical properties that matter for an arbitrary relation? And, given a fixed relation, can we further compress an existing word representation? In this work we present experiments along these lines, which confirm that task-specific embeddings can benefit both the quality and the efficiency of lexicalized predictive models.
Let V be a vocabulary, and let w ∈ V denote a word. We are interested in modeling a target bilexical relation, that is, a relation between pairs of words without context. For example, in a noun-adjective relation we model which nouns can be assigned to which adjectives. We will denote by Q ⊆ V the set of query words, i.e. the words that appear on the left side of the bilexical relation, and by C ⊆ V the set of candidate words, which appear on the right side of the relation.
In this paper we experiment with the log-linear models by Madhyastha et al. (2014) that, given a query word q, compute a conditional distribution over candidate words c ∈ C. The models take the following form:

Pr(c | q) = exp{ φ(q)ᵀ W φ(c) } / Z(q),

where Z(q) = Σ_{c′ ∈ C} exp{ φ(q)ᵀ W φ(c′) } is a normalizer, φ : V → ℝⁿ is a distributional representation of words, and W ∈ ℝ^{n×n} is a bilinear form.
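To make the model concrete, the distribution can be computed as a softmax over bilinear scores. The following is a minimal sketch with toy dimensions and random placeholder vectors standing in for φ and W:

```python
import numpy as np

def conditional_distribution(phi_q, Phi_C, W):
    """Pr(c | q) proportional to exp{ phi(q)^T W phi(c) } over all candidates.

    phi_q:  (n,)   representation of the query word
    Phi_C:  (m, n) stacked representations of the m candidate words
    W:      (n, n) bilinear form
    """
    scores = Phi_C @ (W.T @ phi_q)   # m bilinear scores phi(q)^T W phi(c)
    scores -= scores.max()           # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy setup: 4-dimensional representations, 5 candidate words.
rng = np.random.default_rng(0)
n, m = 4, 5
phi_q = rng.normal(size=n)
Phi_C = rng.normal(size=(m, n))
W = rng.normal(size=(n, n))
p = conditional_distribution(phi_q, Phi_C, W)
```

The softmax normalization makes the m scores a proper distribution over candidates for the given query.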
The learning problem is to obtain φ and W from data, and we approach it in a semi-supervised fashion. There exist many approaches to learning φ from unlabeled data, and in this paper we experiment with two: (a) a simple distributional approach where we represent words with a bag of contextual words; and (b) the skip-gram model by Mikolov et al. (2013). To learn W we assume access to labeled data in the form of pairs of compatible examples, i.e. pairs (q, c) with q ∈ Q and c ∈ C. The goal is to be able to predict query-candidate pairs that are unseen during training. Recall that we model relations between words without context; thus the lexical representation is essential to generalize to pairs involving unseen words.
With φ fixed, we learn W by minimizing the negative log-likelihood of the labeled data using a regularized objective, L(W) = − Σ_{(q,c)} log Pr(c | q) + λ R(W), where R(W) is a regularization penalty and λ is a constant controlling the trade-off.
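A toy sketch of this objective, using made-up representations and labeled pairs, with an ℓ2 (squared Frobenius) penalty standing in for the generic R(W):

```python
import numpy as np

def objective(W, Phi, pairs, lam):
    """Negative log-likelihood of labeled (q, c) pairs plus lam * R(W),
    where R(W) is taken here as the squared Frobenius norm."""
    nll = 0.0
    for q, c in pairs:                    # q, c index rows of Phi
        scores = Phi @ (W.T @ Phi[q])     # phi(q)^T W phi(c') for every c'
        scores -= scores.max()            # numerical stability
        log_z = np.log(np.exp(scores).sum())
        nll -= scores[c] - log_z          # accumulate -log Pr(c | q)
    return nll + lam * np.sum(W ** 2)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 3))             # 6 words, 3-dim representations
pairs = [(0, 1), (2, 3)]                  # toy compatible (query, candidate) pairs
loss = objective(np.zeros((3, 3)), Phi, pairs, lam=0.1)
```

With W = 0 all scores are equal, so each pair contributes log |C| to the loss; any optimizer can start from this uniform-distribution baseline.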
We are interested in regularizers that induce low-rank parameters W, since they lead to task-specific embeddings. Assume that W has rank k, such that W = U Vᵀ with U, V ∈ ℝ^{n×k}. If we consider the product φ(q)ᵀ U Vᵀ φ(c), we can now interpret Uᵀ φ(q) as a k-dimensional embedding of query words, and Vᵀ φ(c) as a k-dimensional embedding of candidate words. Thus, if we obtain a low-rank W that is highly predictive, we can interpret U and V as task-specific compressions of the original embedding, tailored for the target bilexical relation, from n down to k dimensions.
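The factorization and the resulting task-specific embeddings can be sketched as follows (toy matrices; an SVD is just one convenient way to obtain U and V from a low-rank W):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 2
U0 = rng.normal(size=(n, k))
V0 = rng.normal(size=(n, k))
W = U0 @ V0.T                      # a rank-k parameter matrix

# Factor W = U V^T via SVD, keeping the k non-zero singular values.
u, s, vt = np.linalg.svd(W)
U = u[:, :k] * s[:k]               # fold singular values into U
V = vt[:k].T

phi_q = rng.normal(size=n)         # original n-dim representations
phi_c = rng.normal(size=n)
e_q = U.T @ phi_q                  # k-dim task-specific query embedding
e_c = V.T @ phi_c                  # k-dim task-specific candidate embedding

full = phi_q @ W @ phi_c           # score in the original n-dim space
low = e_q @ e_c                    # identical score in the k-dim space
```

The two scores coincide exactly, which is why a predictive low-rank W doubles as a pair of compressed, relation-specific embeddings.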
Since minimizing the rank of a matrix is hard, we employ a convex relaxation based on the nuclear norm of the matrix, ‖W‖∗ (that is, the ℓ1 norm of its singular values; see Srebro et al. (2005)). In our experiments we compare the low-rank approach to ℓ1 and ℓ2 regularization penalties, which are common in linear prediction tasks. For all settings we use the forward-backward splitting (FOBOS) optimization algorithm by Duchi & Singer (2009).
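Under FOBOS, the nuclear-norm penalty yields a proximal step that soft-thresholds the singular values of the current iterate, which is what drives the rank down during learning. A sketch with an illustrative matrix and threshold:

```python
import numpy as np

def prox_nuclear(W, tau):
    """Proximal operator of tau * ||W||_* : shrink singular values by tau."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)     # soft-threshold the spectrum
    return u @ np.diag(s_shrunk) @ vt

rng = np.random.default_rng(3)
W = rng.normal(size=(5, 5))
W_new = prox_nuclear(W, tau=1.0)

# Singular values at most tau are zeroed out, so shrinkage can only
# keep or reduce the rank of the iterate, never increase it.
rank_before = np.linalg.matrix_rank(W)
rank_after = np.linalg.matrix_rank(W_new)
```

In a full FOBOS run this step would alternate with a gradient step on the negative log-likelihood; only the proximal step is shown here.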
We note that if we set W to the identity matrix, our model scores become inner products between the query and candidate embeddings, a common way to evaluate semantic similarity in unsupervised distributional approaches. In general, we can compute a low-dimensional projection of φ down to k dimensions using SVD, and perform the inner product in the projected space. We refer to this as the unsupervised approach, since the projected embeddings make no use of the labeled dataset specifying the target relation.
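This unsupervised baseline can be sketched as follows (toy representation matrix; the projection dimension k is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(8, 5))       # 8 words, 5-dim representations

# Project the representations down to k dimensions via truncated SVD.
k = 2
u, s, vt = np.linalg.svd(Phi, full_matrices=False)
Phi_k = Phi @ vt[:k].T              # k-dim projected embeddings

def similarity(q, c):
    """Unsupervised score: inner product in the projected space."""
    return float(Phi_k[q] @ Phi_k[c])

score = similarity(0, 1)
```

No labeled pairs enter this computation, which is exactly why the supervised low-rank models can improve on it for a specific relation.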
3 Experiments with Syntactic Relations
We conducted a set of experiments to test the performance of the learning algorithm with respect to the initial lexical representation φ, for different configurations of the representation and the learner. We experiment with six bilexical syntactic relations using the Penn Treebank corpus (Marcus et al., 1993), following the experimental setting of Madhyastha et al. (2014). For a relation between queries and candidate words, such as noun-adjective, we partition the query words into train, development and test sets, so that test pairs always involve unseen queries.
To report performance, we measure pairwise accuracy with respect to the efficiency of the model in terms of the number of active parameters. To measure the efficiency of a model, we consider the number of floating-point (double) operations needed to compute, given a query word, the scores for all candidates in the vocabulary. See Madhyastha et al. (2014) for details.
We experiment with two types of initial representation φ. The first is a simple high-dimensional distributional representation based on contextual bags of words (BoW): each word is represented by the bag of words that appear in its context windows. In our experiments these were sparse 2,000-dimensional vectors. The second is the low-dimensional skip-gram embeddings (SKG) of Mikolov et al. (2013), for which we used 300 dimensions. In both cases we induced the representations from the BLLIP corpus (Charniak et al., 2000) using a context window of size 10. Thus the main difference is that the bag-of-words is an uncompressed representation, while the skip-gram embeddings are a neural-net-style compression of the same contextual windows.
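As an illustration of how such bag-of-words vectors are built (a toy corpus and window; the actual vectors were sparse and 2,000-dimensional, induced from the BLLIP corpus with a window of 10):

```python
from collections import Counter

def bow_vectors(sentences, window):
    """Map each word to the counts of words seen in its context windows."""
    vectors = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), i + window + 1
            context = sent[lo:i] + sent[i + 1:hi]   # neighbors, excluding w
            vectors.setdefault(w, Counter()).update(context)
    return vectors

corpus = [["the", "big", "city"], ["the", "old", "city"]]
vecs = bow_vectors(corpus, window=2)
```

Here `vecs["city"]` counts "the" twice and "big" and "old" once each; in practice the counts would be restricted to the 2,000 most relevant context words to fix the dimensionality.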
As for the bilexical model, we test it under three regularization schemes, namely ℓ1, ℓ2, and nuclear norm (ℓ∗). For the first two, the efficiency of computing predictions is a function of the number of non-zero entries in W, while for the latter it is the rank of W, which defines the dimension of the task-specific embeddings. We also test the baseline unsupervised approach (UNS).
4 Results and Discussion
Figure 1 shows the performance of the models for noun-adjective, verb-object and verb-subject relations (in both directions). In line with the results of Madhyastha et al. (2014), we observe that the supervised approach outperforms the unsupervised one in all cases, and that the nuclear-norm scheme provides the best trade-off between accuracy and speed: other regularizers can reach similar accuracies, but low-rank constraints during learning favor very low-dimensional embeddings that are highly predictive.
Comparing the two initial representations, the bag-of-words vectors are clearly better in three relations, while the skip-gram embeddings are clearly better in the other three. We conclude that task-agnostic embeddings do identify useful relevant properties of words, but at the same time not all necessary properties are retained. In all cases, the nuclear-norm regularizer successfully compresses the initial representation, even for the skip-gram embeddings, which are already low-dimensional.
Table 1 presents the best result for each relation, initial representation and regularization scheme. In addition, for the ℓ∗ regularizer we present results at three different ranks, namely 5, 10, and the rank that obtains the best result for each relation. These highly compressed embeddings perform nearly as well as the best-performing model for each relation.
| Query noun | Nearest neighbors under one relation | Nearest neighbors under another relation |
|---|---|---|
| city | province, area, island, township, freeways | residents, towns, marchers, streets, mayor |
| securities | bonds, mortgage, issuers, debt, loans | bonds, memberships, equities, certificates, syndicate |
| board | committee, directors, commission, nominees, refusal | slate, membership, committee, appointment, stockholder |
| debt | loan, loans, debts, financing, mortgage | reinvestment, indebtedness, expenditures, outlay, repayment |
| law | laws, constitution, code, legislation, immigration | laws, ordinance, decree, statutes, state |
| director | assistant, editor, treasurer, postmaster, chairman | firm, consultant, president, manager, leader |
Table 2 shows a set of query nouns, and two sets of neighbor query nouns, using the embeddings for two different relations to compute the two sets. We can see that, by changing the target relation, the set of close words changes. This suggests that words have a wide range of different behaviors, and different relations might exploit lexical properties that are specific to the relation.
We have presented a set of experiments in which we compute word embeddings specific to target linguistic relations. We observe that low-rank penalties favor embeddings that are good both in terms of predictive accuracy and of efficiency: in certain cases, models using very low-dimensional embeddings perform nearly as well as the best models.
In certain tasks, we have shown that we can refine low-dimensional skip-gram embeddings, making them more compressed while retaining their predictive properties. In other tasks, we have shown that our method can improve over skip-gram models when starting from uncompressed distributional representations. This suggests that skip-gram embeddings do not retain all the necessary information of the original words. This motivates future research that aims at general-purpose embeddings that do retain all necessary properties, and can be further compressed in light of specific lexical relations.
We thank the reviewers for their helpful comments. This work has been partially funded by the Spanish Government through the SKATER project (TIN2012-38584-C06-01).
- Bansal et al. (2014) Bansal, Mohit, Gimpel, Kevin, and Livescu, Karen. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL, 2014.
- Baroni et al. (2014) Baroni, Marco, Dinu, Georgiana, and Kruszewski, Germán. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 238–247, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P14-1023.
- Brown et al. (1992) Brown, Peter F., deSouza, Peter V., Mercer, Robert L., Pietra, Vincent J. Della, and Lai, Jenifer C. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992.
- Charniak et al. (2000) Charniak, Eugene, Blaheta, Don, Ge, Niyu, Hall, Keith, and Johnson, Mark. BLLIP 1987–89 WSJ Corpus Release 1, LDC No. LDC2000T43. Linguistic Data Consortium, 2000.
- Duchi & Singer (2009) Duchi, John and Singer, Yoram. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
- Koo et al. (2008) Koo, Terry, Carreras, Xavier, and Collins, Michael. Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, pp. 595–603, Columbus, Ohio, June 2008. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P08/P08-1068.
- Madhyastha et al. (2014) Madhyastha, Swaroop Pranava, Carreras, Xavier, and Quattoni, Ariadna. Learning task-specific bilexical embeddings. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 161–171. Dublin City University and Association for Computational Linguistics, 2014. URL http://aclweb.org/anthology/C14-1017.
- Marcus et al. (1993) Marcus, Mitchell P., Santorini, Beatrice, and Marcinkiewicz, Mary A. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
- Mikolov et al. (2013) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Pennington et al. (2014) Pennington, Jeffrey, Socher, Richard, and Manning, Christopher. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Association for Computational Linguistics, 2014. URL http://aclweb.org/anthology/D14-1162.
- Sahlgren (2006) Sahlgren, Magnus. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, 2006.
- Srebro et al. (2005) Srebro, Nathan, Rennie, Jason D. M., and Jaakola, Tommi S. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pp. 1329–1336. MIT Press, 2005.
- Turian et al. (2010) Turian, Joseph, Ratinov, Lev-Arie, and Bengio, Yoshua. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P10-1040.
- Turney & Pantel (2010) Turney, Peter D. and Pantel, Patrick. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, January 2010. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=1861751.1861756.