Type-level word embeddings map a word type (i.e., a surface form) to a dense vector of real numbers such that similar word types have similar embeddings. When pre-trained on a large corpus of unlabeled text, they provide an effective mechanism for generalizing statistical models to words which do not appear in the labeled training data for a downstream task.
In accordance with standard terminology, we make the following distinction between types and tokens in this paper: By word types, we mean the surface form of the word, whereas by tokens we mean the instantiation of the surface form in a context. For example, the same word type ‘pool’ occurs as two different tokens in the sentences “He sat by the pool,” and “He played a game of pool.”
Most word embedding models define a single vector for each word type. However, a fundamental flaw in this design is its inability to distinguish between different meanings and abstractions of the same word. In the two sentences shown above, the word ‘pool’ has different meanings, but the same representation is typically used for both of them. Similarly, the fact that ‘pool’ and ‘lake’ are both kinds of water bodies is not explicitly incorporated in most type-level embeddings. Furthermore, it has become standard practice to tune pre-trained word embeddings as model parameters during training for an NLP task (e.g., Chen and Manning, 2014; Lample et al., 2016), potentially allowing the parameters of a frequent word in the labeled training data to drift away from related but rare words in the embedding space.
Previous work partially addresses these problems by estimating concept embeddings in WordNet (e.g., Rothe and Schütze, 2015), or improving word representations using information from knowledge graphs (e.g., Faruqui et al., 2015). However, it is still not clear how to use a lexical ontology to derive context-sensitive token embeddings.
In this work, we represent a word token in a given context by estimating a context-sensitive probability distribution over relevant concepts in WordNet (Miller, 1995) and use the expected value (i.e., weighted sum) of the concept embeddings as the token representation (see §2). We take a task-centric approach towards doing this, and learn the token representations jointly with the task-specific parameters. In addition to providing context-sensitive token embeddings, the proposed method implicitly regularizes the embeddings of related words by forcing related words to share similar concept embeddings. As a result, the representation of a rare word which does not appear in the training data for a downstream task benefits from all the updates to related words which share one or more concept embeddings.
Our approach to context-sensitive embeddings assumes the availability of a lexical ontology. While this work relies on WordNet, and we exploit the order of senses given by WordNet, our model is, in principle, applicable to any ontology, with appropriate modifications. In this work, we do not assume the inputs are sense tagged. We use the proposed embeddings to predict prepositional phrase (PP) attachments (see §3), a challenging problem which emphasizes the selectional preferences between words in the PP and each of the candidate head words. Our empirical results and detailed analysis (see §4) show that the proposed embeddings effectively use WordNet to improve the accuracy of PP attachment predictions.
2 WordNet-Grounded Context-Sensitive Token Embeddings
In this section, we focus on defining our context-sensitive token embeddings. We first describe our grounding of word types using WordNet concepts. Then, we describe our model of context-sensitive token-level embeddings as a weighted sum of WordNet concept embeddings.
2.1 WordNet Grounding
We use WordNet to map each word type to a set of synsets, including possible generalizations or abstractions. Among the labeled relations defined in WordNet between different synsets, we focus on the hypernymy relation to help model generalization and selectional preferences between words, which is especially important for predicting PP attachments (Resnik, 1993). To ground a word type, we identify the set of (direct and indirect) hypernyms of the WordNet senses of that word. A simplified grounding of the word ‘pool’ is illustrated in Figure 1. This grounding is key to our model of token embeddings, to be described in the following subsections.
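The grounding step can be illustrated with a minimal sketch. It uses a tiny hand-written stand-in for WordNet, so the synset names, the links between them, and the helper names `hypernym_chain` and `ground` are all hypothetical and chosen only for illustration. It also assumes that a word missing from the ontology receives one unique sense with no hypernyms.

```python
from collections import deque

# Toy stand-in for WordNet: hypothetical synset names and hypernym links.
TOY_SENSES = {
    "pool": ["pool.water", "pool.game"],
    "lake": ["lake.water"],
}
TOY_HYPERNYMS = {
    "pool.water": ["body_of_water"],
    "lake.water": ["body_of_water"],
    "body_of_water": ["thing"],
    "pool.game": ["table_game"],
    "table_game": ["game"],
    "game": ["activity"],
}


def hypernym_chain(synset):
    """Return the synset followed by its direct and indirect hypernyms (breadth-first)."""
    seen, order, queue = {synset}, [synset], deque([synset])
    while queue:
        current = queue.popleft()
        for parent in TOY_HYPERNYMS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def ground(word_type):
    """Map a word type to {sense: [sense, hypernym, ...]}.

    Words missing from the toy ontology get a single unique sense with no hypernyms.
    """
    senses = TOY_SENSES.get(word_type, [word_type + ".unk"])
    return {sense: hypernym_chain(sense) for sense in senses}


pool_grounding = ground("pool")
```

In this toy ontology, ‘pool’ and ‘lake’ share the concept `body_of_water`, which is the kind of overlap that lets related words share concept embeddings.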
2.2 Context-Sensitive Token Embeddings
Our goal is to define a context-sensitive model of token embeddings which can be used as a drop-in replacement for traditional type-level word embeddings.
Let $\text{Senses}(w)$ be the list of synsets defined as possible word senses of a given word type $w$ in WordNet, and $\text{Hypernyms}(s)$ be the list of hypernyms for a synset $s$. (For notational convenience, we assume that $s \in \text{Hypernyms}(s)$.) Figure 1 shows both lists for the word type ‘pool’.
Each WordNet synset $s$ is associated with a set of parameters $v_s \in \mathbb{R}^n$ which represent its embedding. This parameterization is similar to that of Rothe and Schütze (2015).
Given a sequence of tokens $t$ and their corresponding word types $w$, let $u_i \in \mathbb{R}^n$ be the embedding of the word token $t_i$ at position $i$. Unlike most embedding models, the token embeddings $u_i$ are not parameters. Rather, $u_i$ is computed as the expected value of concept embeddings used to ground the word type $w_i$ corresponding to the token $t_i$:
$$u_i = \sum_{s \in \text{Senses}(w_i)} \; \sum_{s' \in \text{Hypernyms}(s)} p(s, s' \mid t, w, i) \, v_{s'} \tag{1}$$
The distribution which governs the expectation over synset embeddings factorizes into two components:
$$p(s, s' \mid t, w, i) \propto \lambda_{w_i} \exp\left(-\lambda_{w_i} \, \text{rank}(s, w_i)\right) \times \text{MLP}\left([v_{s'}; \text{context}(i, t)]\right) \tag{2}$$
The first component, $\lambda_{w_i} \exp(-\lambda_{w_i} \, \text{rank}(s, w_i))$, is a sense prior which reflects the prominence of each word sense for a given word type. Here, we exploit the fact that WordNet senses are ordered in descending order of their frequencies, obtained from sense-tagged corpora, and parameterize the sense prior like an exponential distribution. (For ontologies where such information is not available, our method is still applicable but without this component. We show the effect of using a uniform sense prior in §4.2.) $\text{rank}(s, w_i)$ denotes the rank of sense $s$ for the word type $w_i$; thus $\text{rank}(s, w_i) = 0$ corresponds to $s$ being the first sense of $w_i$. The scalar parameter ($\lambda_{w_i}$) controls the decay of the probability mass, and is learned along with the other parameters in the model. Note that sense priors are defined for each word type ($w_i$), and are shared across all tokens which have the same word type.
$\text{MLP}([v_{s'}; \text{context}(i, t)])$, the second component, is what makes the token representations context-sensitive. It scores each concept in the WordNet grounding of $w_i$ by feeding the concatenation of the concept embedding and a dense vector that summarizes the textual context into a multilayer perceptron (MLP) with two tanh layers followed by a softmax layer. This component is inspired by the soft attention often used in neural machine translation (Bahdanau et al., 2014). (Although the soft attention mechanism is typically used to explicitly represent the importance of each item in a sequence, it can also be applied to non-sequential items.) The definition of the context function is dependent on the encoder used to encode the context. We describe a specific instantiation of this function in §3.
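As a rough illustration of how a sense prior and a context-dependent attention score can be combined into one token embedding, the following sketch computes a weighted sum of concept vectors in plain Python. Several things here are assumptions made only for the example: the function names, the use of a dot product (`dot`) as a stand-in for the attention MLP, the choice to normalize the attention scores over all (sense, hypernym) pairs of the token, and the convention that the first sense has rank 0.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_embedding(senses, vectors, decay, context, score_fn):
    """Expected concept embedding for one token.

    senses:   hypernym chains, one per sense, in ontology sense order
    vectors:  dict mapping each concept id to its embedding (list of floats)
    decay:    sense-prior decay parameter (assumed positive)
    score_fn: hypothetical stand-in for the attention MLP
    """
    # One (sense rank, concept) pair per concept in each sense's chain.
    pairs = [(rank, concept) for rank, chain in enumerate(senses) for concept in chain]
    attention = softmax([score_fn(vectors[c], context) for _, c in pairs])
    prior = [decay * math.exp(-decay * rank) for rank, _ in pairs]
    unnormalized = [a * p for a, p in zip(attention, prior)]
    total = sum(unnormalized)
    weights = [w / total for w in unnormalized]
    dim = len(vectors[pairs[0][1]])
    embedding = [
        sum(w * vectors[c][d] for w, (_, c) in zip(weights, pairs)) for d in range(dim)
    ]
    return embedding, weights


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical two-dimensional concept vectors for a toy grounding of 'pool'.
vectors = {"pool.water": [1.0, 0.0], "body_of_water": [0.5, 0.5], "pool.game": [0.0, 1.0]}
senses = [["pool.water", "body_of_water"], ["pool.game"]]
embedding, weights = token_embedding(senses, vectors, 1.0, [1.0, 0.0], dot)
```

With a context vector that favors the first dimension, most of the probability mass goes to the water-related concepts, so the resulting embedding leans toward them; a different context vector would shift the weights without changing any concept embedding.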
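To summarize, Figure 2 illustrates how to compute the embedding of a word token in a given context:
1. compute a summary of the context of the token, $\text{context}(i, t)$,
2. enumerate the related concepts (senses and their hypernyms) for the token $t_i$,
3. compute the probability $p(s, s' \mid t, w, i)$ for each (sense, hypernym) pair $(s, s')$, and
4. compute the token embedding $u_i$ as the expected value of the concept embeddings under this distribution.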
In the following section, we describe our model for predicting PP attachments, including our definition for context.
3 PP Attachment
Disambiguating PP attachments is an important and challenging NLP problem. Since modeling hypernymy and selectional preferences is critical for successful prediction of PP attachments (Resnik, 1993), it is a good fit for evaluating our WordNet-grounded context-sensitive embeddings.
Figure 3, reproduced from Belinkov et al. (2014), illustrates an example of the PP attachment prediction problem. The accuracy of a competitive English dependency parser at predicting the head word of an ambiguous prepositional phrase is 88.5%, significantly lower than the overall unlabeled attachment accuracy of the same parser (94.2%); see Table 2 in §4 for detailed results.
This section formally defines the problem of PP attachment disambiguation, describes our baseline model, then shows how to integrate the token-level embeddings in the model.
3.1 Problem Definition
We follow the definition of the PP attachment problem given by Belinkov et al. (2014). Given a preposition $p$ and its direct dependent $d$ in the prepositional phrase (PP), our goal is to predict the correct head word for the PP among an ordered list of candidate head words $h$. Each example in the train, validation, and test sets consists of an input tuple $\langle h, p, d \rangle$ and an output index $k$ to identify the correct head among the candidates in $h$. Note that the order of words that form each $\langle h, p, d \rangle$ is the same as that in the corresponding original sentence.
3.2 Model Definition
Both our proposed and baseline models for PP attachment use a bidirectional RNN with LSTM cells (bi-LSTM) to encode the sequence $t = \langle h_1, \ldots, h_K, p, d \rangle$.
We score each candidate head by feeding the concatenation of the output bi-LSTM vectors for the head $h_k$, the preposition $p$ and the direct dependent $d$ (denoted $l_k$, $l_p$ and $l_d$, respectively) through an MLP, with a fully connected tanh layer to obtain a non-linear projection of the concatenation, followed by a fully connected softmax layer:
$$p(h_k \text{ is head}) = \text{MLP}_{\text{attach}}([l_k; l_p; l_d]) \tag{3}$$
To train the model, we use cross-entropy loss at the output layer for each candidate head in the training set. At test time, we predict the candidate head with the highest probability according to the model in Eq. 3, i.e.,
$$\hat{k} = \arg\max_{k} \; p(h_k \text{ is head})$$
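The scoring and prediction steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the encoder outputs are passed in as ready-made lists of floats, the parameter names `W_hidden`, `b_hidden` and `w_out` are hypothetical, and the softmax is assumed to normalize one scalar score per candidate across the candidate heads of an example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def score_heads(head_vecs, prep_vec, dep_vec, W_hidden, b_hidden, w_out):
    """Return (probabilities over candidate heads, index of the predicted head)."""
    logits = []
    for head_vec in head_vecs:
        features = head_vec + prep_vec + dep_vec  # concatenation of the three vectors
        hidden = [math.tanh(z + b) for z, b in zip(matvec(W_hidden, features), b_hidden)]
        logits.append(sum(w * h for w, h in zip(w_out, hidden)))
    # Softmax across the candidate heads of this example.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, best


# Two candidate heads with hypothetical one-dimensional encoder outputs.
probs, best = score_heads(
    head_vecs=[[1.0], [-1.0]],
    prep_vec=[0.0],
    dep_vec=[0.0],
    W_hidden=[[1.0, 0.0, 0.0]],
    b_hidden=[0.0],
    w_out=[1.0],
)
```

Training with cross-entropy would then compare `probs` against the gold head index; the sketch covers only the forward computation and the arg-max prediction.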
This model is inspired by the Head-Prep-Child-Ternary model of Belinkov et al. (2014). The main difference is that we replace the input features for each token with the output bi-RNN vectors.
We now describe the difference between the proposed and the baseline models. Generally, let $x_i$ and $l_i$ represent the input and output vectors of the bi-LSTM for each token $t_i$ in the sequence. The outputs at each timestep are obtained by concatenating those of the forward and backward LSTMs.
In the baseline model, we use type-level word embeddings to represent the input vector $x_i$ for a token $t_i$ in the sequence. The word embedding parameters are initialized with pre-trained vectors, then tuned along with the parameters of the bi-LSTM and the attachment MLP of Eq. 3. We call this model LSTM-PP.
In the proposed model, we use token-level word embeddings as described in §2 as the input to the bi-LSTM, i.e., $x_i = u_i$. The context used for the attention component is simply the hidden state from the previous timestep. However, since we use a bi-LSTM, the model essentially has two RNNs, and accordingly we have two context vectors and associated attentions. That is, $\text{context}_f(i, t) = l^f_{i-1}$ for the forward RNN and $\text{context}_b(i, t) = l^b_{i+1}$ for the backward RNN. Consequently, each token gets two representations, one from each RNN. The synset embedding parameters are initialized with pre-trained vectors and tuned along with the sense decay ($\lambda$) and MLP parameters from Eq. 2, the parameters of the bi-LSTM and those of the attachment MLP of Eq. 3. We call this model OntoLSTM-PP.
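The interplay between the recurrent encoder and the attention context can be sketched as a loop in which the input at each position is computed from the hidden state of the previous timestep in that direction. The sketch below is schematic: `embed_token` and `step` are hypothetical callables standing in for the context-sensitive embedding and for one recurrent cell update (an LSTM in the model described above), and the toy arithmetic used in the example has no meaning beyond making the data flow visible.

```python
def run_direction(num_tokens, embed_token, step, initial_state, reverse=False):
    """Encode a sequence in one direction.

    embed_token(i, context) -> input vector for position i, given the context
    step(x, state)          -> new hidden state
    """
    positions = range(num_tokens - 1, -1, -1) if reverse else range(num_tokens)
    state = initial_state
    outputs = [None] * num_tokens
    for i in positions:
        x_i = embed_token(i, state)  # context = hidden state from the previous timestep
        state = step(x_i, state)
        outputs[i] = state
    return outputs


def encode_bidirectional(num_tokens, embed_token, step, initial_state):
    """Concatenate forward and backward hidden states at every position."""
    forward = run_direction(num_tokens, embed_token, step, initial_state)
    backward = run_direction(num_tokens, embed_token, step, initial_state, reverse=True)
    return [f + b for f, b in zip(forward, backward)]


# Toy callables: the "embedding" adds the context to the position index,
# and the "cell" adds its input to the previous state.
toy_embed = lambda i, context: [float(i) + context[0]]
toy_step = lambda x, state: [x[0] + state[0]]
encoded = encode_bidirectional(3, toy_embed, toy_step, [0.0])
```

Because the two directions see different contexts, `embed_token` is called twice per position and can return different inputs, which mirrors the statement that each token gets two representations.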
4 Experiments
Dataset and evaluation.
We used the English PP attachment dataset created and made available by Belinkov et al. (2014). The training and test splits contain 33,359 and 1,951 labeled examples respectively. As explained in §3.1, the input for each example is 1) an ordered list of candidate head words, 2) the preposition, and 3) the direct dependent of the preposition. The head words are either nouns or verbs and the dependent is always a noun. All examples in this dataset have at least two candidate head words. As discussed in Belinkov et al. (2014), this dataset is a more realistic PP attachment task than the RRR dataset (Ratnaparkhi et al., 1994). The RRR dataset is a binary classification task with exactly two head word candidates in all examples. The context for each example in the RRR dataset is also limited, which defeats the purpose of our context-sensitive embeddings.
Model specifications and hyperparameters.
For efficient implementation, we use mini-batch updates with the same number of senses and hypernyms for all examples, padding zeros and truncating senses and hypernyms as needed. For each word type, we use a maximum of $S$ senses and $H$ indirect hypernyms from WordNet. In our initial experiments on a held-out development set (10% of the training data), we found that values of $S$ and $H$ larger than the ones we use did not improve performance. We also used the development set to tune the number of layers in the attachment MLP separately for OntoLSTM-PP and LSTM-PP, and the number of layers in the attention MLP in OntoLSTM-PP. When a synset has multiple hypernym paths, we use the shortest one. Finally, word types which do not appear in WordNet are assumed to have one unique sense per word type with no hypernyms. Since the POS tag for each word is included in the dataset, we exclude WordNet synsets which are incompatible with the POS tag. The synset embedding parameters are initialized using the synset vectors obtained by running AutoExtend (Rothe and Schütze, 2015) on 100-dimensional GloVe (Pennington et al., 2014) vectors for WordNet 3.1. We refer to this embedding as GloVe-extended. Representations for the OOV word types in LSTM-PP and OOV synset types in OntoLSTM-PP were randomly drawn from a uniform 100-d distribution. Initial sense prior parameters ($\lambda$) were also drawn from a uniform 1-d distribution.
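The padding and truncation described above can be sketched as follows. This is an illustration under assumptions: the padding symbol `PAD`, the function name `pad_grounding`, and the accompanying 0/1 mask are hypothetical conveniences added for the example, and the limits passed in the usage line are arbitrary.

```python
PAD = "<pad>"  # hypothetical padding symbol


def pad_grounding(senses, max_senses, max_hypernyms):
    """Force a grounding into a fixed (max_senses x max_hypernyms) shape.

    senses: hypernym chains, one per sense, in ontology sense order.
    Returns (ids, mask); mask is 1 for real concepts and 0 for padding.
    """
    ids, mask = [], []
    for chain in senses[:max_senses]:  # drop senses beyond the limit
        kept = chain[:max_hypernyms]   # drop hypernyms beyond the limit
        missing = max_hypernyms - len(kept)
        ids.append(kept + [PAD] * missing)
        mask.append([1] * len(kept) + [0] * missing)
    while len(ids) < max_senses:       # pad words that have fewer senses
        ids.append([PAD] * max_hypernyms)
        mask.append([0] * max_hypernyms)
    return ids, mask


ids, mask = pad_grounding([["a", "b", "c"], ["d"]], max_senses=3, max_hypernyms=2)
```

A mask of this kind is one way to keep padded positions from receiving probability mass when the attention scores are normalized.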
In our experiments, we compare our proposed model, OntoLSTM-PP, with three baselines: LSTM-PP initialized with GloVe embeddings, LSTM-PP initialized with GloVe vectors retrofitted to WordNet using the approach of Faruqui et al. (2015) (henceforth referred to as GloVe-retro), and finally the best performing standalone PP attachment system from Belinkov et al. (2014), referred to as HPCD (full) in that paper. HPCD (full) is a neural network model that learns to compose the vector representations of each of the candidate heads with those of the preposition and the dependent, and predict attachments. The input representations are enriched using syntactic context information, POS, WordNet and VerbNet (Kipper et al., 2008) information, and the distance of the head word from the PP is explicitly encoded in the composition architecture. In contrast, we do not use syntactic context, VerbNet and distance information, and do not explicitly encode POS information.
4.1 PP Attachment Results
|HPCD (full)||Syntactic-SG||Type||WordNet, VerbNet||88.7|
Table 1 shows that our proposed token-level embedding scheme OntoLSTM-PP outperforms the better variant of our baseline LSTM-PP (with GloVe-retro initialization) by an absolute accuracy difference of 4.9%, or a relative error reduction of 32%. OntoLSTM-PP also outperforms HPCD (full), the previous best result on this dataset.
Initializing the word embeddings with GloVe-retro (which uses WordNet as described in Faruqui et al. (2015)) instead of GloVe amounts to a small improvement, compared to the improvements obtained using OntoLSTM-PP. This result illustrates that our approach of dynamically choosing a context-sensitive distribution over synsets is a more effective way of making use of WordNet.
Effect on dependency parsing.
Following Belinkov et al. (2014), we used the RBG parser (Lei et al., 2014), and modified it by adding a binary feature indicating the PP attachment predictions from our model.
We compare four ways to compute the additional binary features: 1) the predictions of the best standalone system HPCD (full) in Belinkov et al. (2014), 2) the predictions of our baseline model LSTM-PP, 3) the predictions of our improved model OntoLSTM-PP, and 4) the gold labels Oracle PP.
Table 2 shows the effect of using the PP attachment predictions as features within a dependency parser. We note there is a relatively small difference in unlabeled attachment accuracy for all dependencies (not only PP attachments), even when gold PP attachments are used as additional features to the parser. However, when gold PP attachments are used, we note a large potential improvement of 10.46 points in PP attachment accuracies (between the PPA accuracy for RBG and RBG + Oracle PP), which confirms that adding PP predictions as features is an effective approach. Our proposed model RBG + OntoLSTM-PP recovers 15% of this potential improvement, while RBG + HPCD (full) recovers 10%, which illustrates that PP attachment remains a difficult problem with plenty of room for improvement even when using a dedicated model to predict PP attachments and using its predictions in a dependency parser.
We also note that, although we use the same predictions of the HPCD (full) model in Belinkov et al. (2014), we report different results than Belinkov et al. (2014). (The authors kindly provided their predictions for 1,942 test examples, out of 1,951 examples in the full test set. In Table 2, we use the same subset of 1,942 test examples and will include a link to the subset in the final draft.) For example, the unlabeled attachment score (UAS) of the baselines RBG and RBG + HPCD (full) are 94.17 and 94.19, respectively, in Table 2, compared to 93.96 and 94.05, respectively, in Belinkov et al. (2014). This is due to the use of different versions of the RBG parser; we use the latest commit (SHA: e07f74) on the GitHub repository of the RBG parser.
|System||Full UAS||PPA Acc.|
|RBG||94.17||88.51|
|RBG + HPCD (full)||94.19||89.59|
|RBG + LSTM-PP||94.14||86.35|
|RBG + OntoLSTM-PP||94.30||90.11|
|RBG + Oracle PP||94.60||98.97|
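The recovery percentages quoted in the discussion of Table 2 can be checked with a short calculation. The sketch below assumes that "recovered" means the gain over the RBG baseline divided by the gain that the oracle features would give; the RBG baseline PPA accuracy is inferred from the 10.46-point gap stated in the text, and the function name `recovered_fraction` is only illustrative.

```python
def recovered_fraction(baseline, system, oracle):
    """Share of the oracle's potential improvement that a system achieves."""
    return (system - baseline) / (oracle - baseline)


oracle_ppa = 98.97                 # RBG + Oracle PP, from Table 2
baseline_ppa = oracle_ppa - 10.46  # RBG alone, implied by the 10.46-point gap
onto_share = recovered_fraction(baseline_ppa, 90.11, oracle_ppa)  # RBG + OntoLSTM-PP
hpcd_share = recovered_fraction(baseline_ppa, 89.59, oracle_ppa)  # RBG + HPCD (full)

print(f"OntoLSTM-PP recovers {onto_share:.1%}, HPCD (full) recovers {hpcd_share:.1%}")
```

The two shares come out at roughly 15% and 10%, matching the figures in the text.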
4.2 Analysis
In this subsection, we analyze different aspects of our model in order to develop a better understanding of its behavior.
Effect of context sensitivity and sense priors.
We now show some results that indicate the relative strengths of two components of our context-sensitive token embedding model. The second row in Table 3 shows the test accuracy of a system trained without sense priors (that is, making the sense prior that enters Eq. 1 a uniform distribution), and the third row shows the effect of making the token representations context-insensitive by giving a similar attention score to all related concepts, essentially making them type-level representations, but still grounded in WordNet. As can be seen, removing context sensitivity has an adverse effect on the results. This illustrates the importance of the sense priors and the attention mechanism.
It is interesting that, even without sense priors and attention, the results with WordNet grounding are still higher than those of the two LSTM-PP systems in Table 1. This result illustrates the regularization behavior of sharing concept embeddings across multiple words, which is especially important for rare words.
Effect of training data size.
Since OntoLSTM-PP uses external information, the gap between the model and LSTM-PP is expected to be more pronounced when the training data sizes are smaller. To test this hypothesis, we trained the two models with different amounts of training data and measured their accuracies on the test set. The plot is shown in Figure 4. As expected, the gap tends to be larger at smaller data sizes. Surprisingly, even with 2000 sentences in the training data set, OntoLSTM-PP outperforms LSTM-PP trained with the full data set. When both models are trained with the full dataset, LSTM-PP reaches a training accuracy of 95.3%, whereas OntoLSTM-PP reaches 93.5%. The fact that LSTM-PP overfits the training data more indicates the regularization capability of OntoLSTM-PP.
|- sense priors||88.4|
To better understand the effect of WordNet grounding, we took a sample of 100 sentences from the test set whose PP attachments were correctly predicted by OntoLSTM-PP but not by LSTM-PP. A common pattern observed was that those sentences contained words not seen frequently in the training data. Figure 5 shows two such cases. In both cases, the weights assigned by OntoLSTM-PP to infrequent words are also shown. The word types soapsuds and buoyancy do not occur in the training data, but OntoLSTM-PP was able to leverage the parameters learned for the synsets that contributed to their token representations. Another important observation is that the word type buoyancy has four senses in WordNet (we consider the first three), none of which is the metaphorical sense that is applicable to markets as shown in the example here. Selecting a combination of relevant hypernyms from various senses may have helped OntoLSTM-PP make the right prediction. This shows the value of using hypernymy information from WordNet. Moreover, this indicates the strength of the hybrid nature of the model, that lets it augment ontological information with distributional information.
We note that the vocabulary sizes in OntoLSTM-PP and LSTM-PP are comparable as the synset types are shared across word types. In our experiments with the full PP attachment dataset, we learned embeddings for 18k synset types with OntoLSTM-PP and 11k word types with LSTM-PP. Since the biggest contribution to the parameter space comes from the embedding layer, the complexities of both the models are comparable.
5 Related Work
This work is related to various lines of research within the NLP community: dealing with synonymy and homonymy in word representations both in the context of distributed embeddings and more traditional vector spaces; hybrid models of distributional and knowledge based semantics; and selectional preferences and their relation with syntactic and semantic relations.
The need for going beyond a single vector per word type has been well established for a while, and many efforts were focused on building multi-prototype vector space models of meaning (Reisinger and Mooney, 2010; Huang et al., 2012; Chen et al., 2014; Jauhar et al., 2015; Neelakantan et al., 2015; Arora et al., 2016, etc.). However, the target of all these approaches is obtaining multi-sense word vector spaces, either by incorporating sense-tagged information or other kinds of external context. The number of vectors learned is still fixed, based on the preset number of senses. In contrast, our focus is on learning a context-dependent distribution over those concept representations. Other work not necessarily related to multi-sense vectors, but still related to our work, includes that of Belanger and Kakade (2015), who proposed a Gaussian linear dynamical system for estimating token-level word embeddings, and that of Vilnis and McCallum (2015), who proposed mapping each word type to a density instead of a point in a space to account for uncertainty in meaning. These approaches do not make use of lexical ontologies and are not amenable to joint training with a downstream NLP task.
Related to the idea of concept embeddings is the work of Rothe and Schütze (2015), who estimated WordNet synset representations, given pre-trained type-level word embeddings. In contrast, our work focuses on estimating token-level word embeddings as context-sensitive distributions of concept embeddings.
There is a large body of work that tried to improve word embeddings using external resources. Yu and Dredze (2014) extended the CBOW model (Mikolov et al., 2013) by adding an extra term in the training objective for generating words conditioned on similar words according to a lexicon. Jauhar et al. (2015) extended the skipgram model (Mikolov et al., 2013) by representing word senses as latent variables in the generation process, and used a structured prior based on the ontology. Faruqui et al. (2015) used belief propagation to update pre-trained word embeddings on a graph that encodes lexical relationships in the ontology. Similarly, Johansson and Piña (2015) improved word embeddings by representing each sense of the word in a way that reflects the topology of the semantic network they belong to, and then representing the words as convex combinations of their senses. In contrast to previous work that was aimed at improving type-level word representations, we propose an approach for obtaining context-sensitive embeddings at the token level, while jointly optimizing the model parameters for the NLP task of interest.
Resnik (1993) showed the applicability of semantic classes and selectional preferences to resolving syntactic ambiguity. Zapirain et al. (2013) applied models of selectional preferences automatically learned from WordNet and distributional information to the problem of semantic role labeling. Resnik (1993), Brill and Resnik (1994), Agirre (2008) and others have used WordNet information towards improving prepositional phrase attachment predictions.
6 Conclusion
In this paper, we proposed a grounding of lexical items which acknowledges the semantic ambiguity of word types using WordNet and a method to learn a context-sensitive distribution over their representations. We also showed how to integrate the proposed representation with recurrent neural networks for disambiguating prepositional phrase attachments, showing that the proposed WordNet-grounded context-sensitive token embeddings outperform standard type-level embeddings for predicting PP attachments. We provided a detailed qualitative and quantitative analysis of the proposed model.
Implementation and code availability.
This approach may be extended to other NLP tasks that can benefit from using encoders that can access WordNet information. WordNet also has some drawbacks, and may not always have sufficient coverage given the task at hand. As we have shown in §4.2, our model can deal with missing WordNet information by augmenting it with distributional information. Moreover, the methods described in this paper can be extended to other kinds of structured knowledge sources like Freebase which may be more suitable for tasks like question answering.
The first author is supported by a fellowship from the Allen Institute for Artificial Intelligence. We would like to thank Matt Gardner, Jayant Krishnamurthy, Julia Hockenmaier, Oren Etzioni, Hector Liu, Filip Ilievski, and anonymous reviewers for their comments.
- Agirre (2008) Eneko Agirre. 2008. Improving parsing and PP attachment performance with sense information. In ACL.
- Arora et al. (2016) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. Linear algebraic structure of word senses, with applications to polysemy. arXiv preprint arXiv:1601.03764 .
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- Belanger and Kakade (2015) David Belanger and Sham M. Kakade. 2015. A linear dynamical system model for text. In ICML.
- Belinkov et al. (2014) Yonatan Belinkov, Tao Lei, Regina Barzilay, and Amir Globerson. 2014. Exploring compositional architectures and word vector representations for prepositional phrase attachment. Transactions of the Association for Computational Linguistics 2:561–572.
- Brill and Resnik (1994) Eric Brill and Philip Resnik. 1994. A rule-based approach to prepositional phrase attachment disambiguation. In Proceedings of the 15th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, pages 1198–1204.
- Chen and Manning (2014) Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP. pages 740–750.
- Chen et al. (2014) Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In EMNLP. pages 1025–1035.
- Chollet (2015) François Chollet. 2015. Keras. https://github.com/fchollet/keras.
- Faruqui et al. (2015) Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In NAACL.
- Huang et al. (2012) Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, pages 873–882.
- Jauhar et al. (2015) Sujay Kumar Jauhar, Chris Dyer, and Eduard H. Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In NAACL.
- Johansson and Piña (2015) Richard Johansson and Luis Nieto Piña. 2015. Embedding a semantic network in a word space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics–Human Language Technologies.
- Kipper et al. (2008) Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of english verbs. Language Resources and Evaluation 42(1):21–40.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.
- Lei et al. (2014) Tao Lei, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In ACL. Association for Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR abs/1310.4546.
- Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
- Neelakantan et al. (2015) Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654 .
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
- Ratnaparkhi et al. (1994) Adwait Ratnaparkhi, Jeff Reynar, and Salim Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In Proceedings of the workshop on Human Language Technology.
- Reisinger and Mooney (2010) Joseph Reisinger and Raymond J Mooney. 2010. Multi-prototype vector-space models of word meaning. In HLT-ACL.
- Resnik (1993) Philip Resnik. 1993. Semantic classes and syntactic ambiguity. In Proceedings of the workshop on Human Language Technology. Association for Computational Linguistics.
- Rothe and Schütze (2015) Sascha Rothe and Hinrich Schütze. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In ACL.
- Vilnis and McCallum (2015) Luke Vilnis and Andrew McCallum. 2015. Word representations via gaussian embedding. In ICLR.
- Yu and Dredze (2014) Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In ACL.
- Zapirain et al. (2013) Benat Zapirain, Eneko Agirre, Lluis Marquez, and Mihai Surdeanu. 2013. Selectional preferences for semantic role classification. Computational Linguistics .