Olive Oil is Made of Olives, Baby Oil is Made for Babies: Interpreting Noun Compounds using Paraphrases in a Neural Model

03/21/2018 · by Vered Shwartz, et al. · Google

Automatic interpretation of the relation between the constituents of a noun compound, e.g. olive oil (source) and baby oil (purpose) is an important task for many NLP applications. Recent approaches are typically based on either noun-compound representations or paraphrases. While the former has initially shown promising results, recent work suggests that the success stems from memorizing single prototypical words for each relation. We explore a neural paraphrasing approach that demonstrates superior performance when such memorization is not possible.




1 Introduction

Automatic classification of a noun-compound (NC) to the implicit semantic relation that holds between its constituent words is beneficial for applications that require text understanding. For instance, a personal assistant asked “do I have a morning meeting tomorrow?” should search the calendar for meetings occurring in the morning, while for group meeting it should look for meetings with specific participants. The NC classification task is a challenging one, as the meaning of an NC is often not easily derivable from the meaning of its constituent words Spärck Jones (1983).

Previous work on the task falls into two main approaches. The first maps NCs to paraphrases that express the relation between the constituent words (e.g. Nakov and Hearst, 2006; Nulty and Costello, 2013), such as mapping coffee cup and garbage dump to the pattern [w2] containing [w1]. The second approach computes a representation for NCs from the distributional representations of their individual constituents. While the latter approach initially yielded promising results, Dima (2016) recently showed that similar performance is achieved by representing the NC as a concatenation of its constituent embeddings, and attributed this success to the lexical memorization phenomenon Levy et al. (2015).

In this paper we apply lessons learned from the parallel task of semantic relation classification. We adapt HypeNET Shwartz et al. (2016) to the NC classification task, using its path embeddings to represent paraphrases and combining them with distributional information. We experiment with various evaluation settings, including settings that make lexical memorization impossible. In these settings, the integrated method performs better than the baselines. Even so, the performance is mediocre for all methods, suggesting that the task is difficult and warrants further investigation. The code is available at https://github.com/tensorflow/models/tree/master/research/lexnet_nc.

2 Background

Various tasks have been suggested to address noun-compound interpretation. NC paraphrasing extracts texts explicitly describing the implicit relation between the constituents, for example student protest is a protest led by, be sponsored by, or be organized by students (e.g. Nakov and Hearst, 2006; Kim and Nakov, 2011; Hendrickx et al., 2013; Nulty and Costello, 2013). Compositionality prediction determines to what extent the meaning of the NC can be expressed in terms of the meaning of its constituents, e.g. spelling bee is non-compositional, as it is not related to bee (e.g. Reddy et al., 2011). In this paper we focus on the NC classification task, which is defined as follows: given a pre-defined set of relations, classify the NC w1 w2 to the relation that holds between w1 and w2. We review the various features used in the literature for classification, leaving out features derived from lexical resources (e.g. Nastase and Szpakowicz, 2003; Tratz and Hovy, 2010).

2.1 Compositional Representations

In this approach, classification is based on a vector representing the NC, v_w1w2, which is obtained by applying a function to its constituents' distributional representations: v_w1w2 = f(v_w1, v_w2). Various functions have been proposed in the literature.

Mitchell and Lapata (2010) proposed three simple combinations of v_w1 and v_w2 (additive, multiplicative, dilation). Others suggested representing compositions by applying linear functions, encoded as matrices, over word vectors. Baroni and Zamparelli (2010) focused on adjective-noun compositions (AN), representing adjectives as matrices, nouns as vectors, and ANs as their product. Matrices were learned with the objective of minimizing the distance between the learned vector and the observed vector (computed from corpus occurrences) of each AN. The full-additive model Zanzotto et al. (2010); Dinu et al. (2013) is a similar approach that works on any two-word composition, multiplying each word vector by a square matrix: v_w1w2 = A · v_w1 + B · v_w2.
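As a concrete illustration, the additive, multiplicative, and full-additive compositions can be sketched with NumPy. The dimension and the random vectors and matrices below are placeholders; in the actual models the matrices A and B are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # embedding dimension (illustrative)

v_w1 = rng.normal(size=d)  # modifier vector, e.g. "olive"
v_w2 = rng.normal(size=d)  # head vector, e.g. "oil"

# Simple pointwise combinations (Mitchell and Lapata, 2010):
additive = v_w1 + v_w2
multiplicative = v_w1 * v_w2

# Full-additive model: each constituent vector is transformed by a
# square matrix before summing; A and B are learned in the real model.
A = rng.normal(size=(d, d))
B = rng.normal(size=(d, d))
full_additive = A @ v_w1 + B @ v_w2
```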

Socher et al. (2012) suggested a non-linear composition model. A recursive neural network operates bottom-up on the output of a constituency parser to represent variable-length phrases. Each constituent is represented by a vector that captures its meaning and a matrix that captures how it modifies the meaning of constituents it combines with. For a binary NC, v_w1w2 = g(W · [M_w2 · v_w1 ; M_w1 · v_w2]), where M_w is the matrix of word w, W is a learned matrix, and g is a non-linear function.

These representations were used as features in NC classification, often achieving promising results (e.g. Van de Cruys et al., 2013; Dima and Hinrichs, 2015). However, Dima (2016) recently showed that similar performance is achieved by representing the NC as a concatenation of its constituent embeddings, and argued that it stems from memorizing prototypical words for each relation, for example, classifying any NC with the head oil to the source relation, regardless of the modifier.

2.2 Paraphrasing

In this approach, the paraphrases of an NC, i.e. the patterns connecting the joint occurrences of the constituents in a corpus, are treated as features. For example, both paper cup and steel knife may share the feature made of. Ó Séaghdha and Copestake (2013) leveraged this "relational similarity" in a kernel-based classification approach. They combined the relational information with the complementary lexical features of each constituent separately: two NCs labeled with the same relation may consist of similar constituents (paper-steel, cup-knife) and may also appear with similar paraphrases. Combining the two information sources was shown to be beneficial, but it was also noted that the relational information suffered from data sparsity: many NCs had very few paraphrases, and paraphrase similarity was based on n-gram overlap.

Recently, Surtani and Paul (2015) suggested representing NCs in a vector space model (VSM) using paraphrases as features. These vectors were used to classify new NCs based on the nearest neighbor in the VSM. However, the model was only tested on a small dataset and performed similarly to previous methods.

3 Model

Figure 1: An illustration of the classification models for the NC coffee cup. The model consists of two parts: (1) the distributional representations of the NC (left, orange) and each word (middle, green). (2) the corpus occurrences of coffee and cup, in the form of dependency path embeddings (right, purple).

We similarly investigate the use of paraphrasing for NC relation classification. To generate a signal for the joint occurrences of w1 and w2, we follow the approach used by HypeNET Shwartz et al. (2016). For an NC w1 w2 in the dataset, we collect all the dependency paths that connect w1 and w2 in the corpus and learn path embeddings, as detailed in Section 3.2. Section 3.1 describes the classification models with which we experimented.

3.1 Classification Models

Figure 1 provides an overview of the three models: path-based, integrated, and integrated-NC, each of which incrementally adds new features not present in the previous model. In the following sections, x denotes the input vector representing the NC. The network classifies the NC to the highest scoring relation: r = argmax_i softmax(o)_i, where o is the output layer. All networks contain a single hidden layer, and the dimension of the output layer is the number of relations in the dataset. See Appendix A for additional technical details.


Path-based. Classifies the NC based only on the paths connecting the joint occurrences of w1 and w2 in the corpus, denoted paths(w1, w2). We define the feature vector as the average of its path embeddings, where the path embedding v_p of a path p is weighted by its frequency f_p:

x = ( Σ_{p ∈ paths(w1,w2)} f_p · v_p ) / ( Σ_{p ∈ paths(w1,w2)} f_p )
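The frequency-weighted pooling above can be sketched in a few lines of NumPy (the function name and the toy values are ours):

```python
import numpy as np

def pooled_path_vector(path_embeddings, path_counts):
    """Frequency-weighted average of an NC's path embeddings."""
    E = np.asarray(path_embeddings, dtype=float)  # (num_paths, dim)
    f = np.asarray(path_counts, dtype=float)      # (num_paths,)
    return (f[:, None] * E).sum(axis=0) / f.sum()

# Two toy 3-d path embeddings; the first path occurred 3 times, the second once.
v = pooled_path_vector([[1., 0., 0.], [0., 1., 0.]], [3, 1])
# v == [0.75, 0.25, 0.0]
```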

Integrated. We concatenate w1's and w2's word embeddings to the path vector, to add distributional information: x = [ v_w1 , v_paths(w1,w2) , v_w2 ]. Potentially, this allows the network to utilize the contextual properties of each individual constituent, e.g. assigning high probability to substance-material-ingredient for edible w1s (e.g. vanilla pudding, apple cake).


Integrated-NC. We add the NC's observed vector v_w1w2 as additional distributional input, providing the contexts in which w1 w2 occurs as an NC: x = [ v_w1 , v_paths(w1,w2) , v_w2 , v_w1w2 ]. Like Dima (2016), we learn NC vectors using the GloVe algorithm Pennington et al. (2014), by replacing each NC occurrence in the corpus with a single token.

This information can potentially help clustering NCs that appear in similar contexts despite having low pairwise similarity scores between their constituents. For example, gun violence and abortion rights belong to the topic relation and may appear in similar news-related contexts, while (gun, abortion) and (violence, rights) are dissimilar.
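The single-token preprocessing step can be sketched as follows. The helper name and the naive regex-based matching are our own illustration; the real pipeline operates on a tokenized corpus.

```python
import re

def merge_nc_tokens(sentence, ncs):
    """Replace each occurrence of a known NC with a single token, so that
    GloVe learns one vector for the whole compound."""
    for w1, w2 in ncs:
        sentence = re.sub(rf"\b{w1}\s+{w2}\b", f"{w1}_{w2}", sentence)
    return sentence

merge_nc_tokens("gun violence dominated the news", [("gun", "violence")])
# -> "gun_violence dominated the news"
```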

3.2 Path Embeddings

Following HypeNET, for a path p composed of edges e_1, ..., e_k, we represent each edge by the concatenation of its lemma, part-of-speech tag, dependency label, and direction vectors: v_e = [ v_lemma , v_pos , v_dep , v_dir ]. The edge vectors are encoded using an LSTM Hochreiter and Schmidhuber (1997), and the last output vector is used as the path embedding.
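To make the encoding step concrete, here is a toy single-layer LSTM in NumPy that consumes a sequence of edge vectors and returns its last output as the path embedding. The parameters are randomly initialised here; in the model they are trained jointly with the classifier, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 60, 20  # edge-vector and hidden dimensions (illustrative)

# Fused parameters for the input, forget, output, and candidate gates.
W = rng.normal(scale=0.1, size=(4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_output(edge_vectors):
    """Run the LSTM over the path's edge vectors; the final output
    vector serves as the path embedding."""
    h, c = np.zeros(d_h), np.zeros(d_h)
    for x in edge_vectors:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

path = [rng.normal(size=d_in) for _ in range(3)]  # a toy 3-edge path
path_embedding = lstm_last_output(path)
```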

We use the NC labels as distant supervision. While HypeNET predicts a word pair's label from the frequency-weighted average of the path vectors, we differ from it slightly and compute the label from the frequency-weighted average of the predictions obtained from each path separately:

r = argmax_r ( Σ_{p ∈ paths(w1,w2)} f_p · softmax(o_p)_r ) / ( Σ_{p ∈ paths(w1,w2)} f_p )
We conjecture that label distribution averaging allows for more efficient training of path embeddings when a single NC contains multiple paths.
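This prediction averaging can be sketched with toy logits (function names are ours):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_relation(per_path_logits, path_counts):
    """Average the per-path softmax distributions, weighted by path
    frequency, and return the highest-scoring relation index."""
    probs = np.array([softmax(z) for z in per_path_logits])
    f = np.asarray(path_counts, dtype=float)
    avg = (f[:, None] * probs).sum(axis=0) / f.sum()
    return int(np.argmax(avg))

# Path 1 votes for relation 0, path 2 for relation 1; path 1 was seen
# 5 times, so relation 0 wins.
predict_relation([[2., 0., 0.], [0., 3., 0.]], [5, 1])  # -> 0
```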

4 Evaluation

Dataset       Split  Best Freq  Dist   Dist-NC  Best Comp  Path   Int    Int-NC
Tratz-fine    Rand   0.319      0.692  0.673    0.725      0.538  0.714  0.692
              Lex    0.222      0.458  0.449    0.450      0.448  0.510  0.478
              Lex    0.292      0.574  0.559    0.607      0.472  0.613  0.600
              Lex    0.066      0.363  0.360    0.334      0.423  0.421  0.429
Tratz-coarse  Rand   0.256      0.734  0.718    0.775      0.586  0.736  0.712
              Lex    0.225      0.501  0.497    0.538      0.518  0.558  0.548
              Lex    0.282      0.630  0.600    0.645      0.548  0.646  0.632
              Lex    0.136      0.406  0.409    0.372      0.472  0.475  0.478
Table 1: All methods' performance (F1) on the various splits. Best Freq: the best performing frequency baseline (head / modifier); in practice, in lexical-full this is a random baseline, in lexical-head it is the modifier frequency baseline, and in lexical-mod it is the head frequency baseline. Best Comp: the best model from Dima (2016).
Dataset       Split  Train   Validation  Test
Tratz-fine    Lex    4,730   1,614       869
              Lex    9,185   5,819       4,154
              Lex    9,783   5,400       3,975
              Rand   14,369  958         3,831
Tratz-coarse  Lex    4,746   1,619       779
              Lex    9,214   5,613       3,964
              Lex    9,732   5,402       3,657
              Rand   14,093  940         3,758
Table 2: Number of instances in each dataset split.

4.1 Dataset

We follow Dima (2016) and evaluate on the Tratz (2011) dataset, with 19,158 instances and two levels of labels: fine-grained (Tratz-fine, 37 relations) and coarse-grained (Tratz-coarse, 12 relations). We report results on both versions. See Tratz (2011) for the list of relations.

Dataset Splits

Dima (2016) showed that a classifier based only on v_w1 and v_w2 performs on par with compound representations, and that the success comes from lexical memorization Levy et al. (2015): memorizing the majority label of single words in particular slots of the compound (e.g. topic for travel guide, fishing guide, etc.). This memorization paints a skewed picture of the state-of-the-art performance on this difficult task.

To better test this hypothesis, we evaluate on 4 different splits of the datasets into train, test, and validation sets: (1) random, in a 75:20:5 ratio; (2) lexical-full, in which the train, test, and validation sets each consist of a distinct vocabulary. This split was suggested by Levy et al. (2015): it randomly assigns words to distinct sets, such that, for example, including travel guide in the train set guarantees that fishing guide will not be included in the test set, so the models cannot benefit from memorizing that the head guide is always annotated as topic. Given that this split discards many NCs, we experimented with two additional splits: (3) lexical-mod, in which the modifier words are unique to each set, and (4) lexical-head, in which the head words are unique to each set. Table 2 displays the sizes of each split.
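The lexical-full split can be sketched as follows. This is a simplified illustration of the idea from Levy et al. (2015); the exact ratios and procedure used in our experiments may differ.

```python
import random

def lexical_full_split(ncs, seed=0):
    """Assign every word to one of train/val/test, then keep an NC only
    if both of its constituents landed in the same set. NCs whose words
    fall in different sets are discarded, shrinking the dataset."""
    words = {w for nc in ncs for w in nc}
    rng = random.Random(seed)
    assignment = {
        w: rng.choices(["train", "val", "test"], weights=[75, 5, 20])[0]
        for w in words
    }
    splits = {"train": [], "val": [], "test": []}
    for w1, w2 in ncs:
        if assignment[w1] == assignment[w2]:
            splits[assignment[w1]].append((w1, w2))
    return splits
```

By construction, no head or modifier seen in training ever appears in the test set, so memorizing per-word majority labels cannot help.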

Relation                       Path                             Examples
measure                        [] varies by []                  state limit, age limit
                               2,560 [] portion of []           acre estate
personal title                 [] Anderson []                   Mrs. Brown
                               [] Sheridan []                    Gen. Johnson
create-provide-generate-sell   [] produce []                    food producer, drug group
                               [] selling []                     phone company, merchandise store
                               [] manufacture []                 engine plant, sugar company
time-of1                       [] begin []                       morning program
                               [] held Saturday []               afternoon meeting, morning session
substance-material-ingredient  [] made of wood and []            marble table, vinyl siding
                               [] material includes type of []   steel pipe
Table 3: Indicative paths for selected relations, along with NC examples.

4.2 Baselines

Frequency Baselines.

mod freq classifies w1 w2 to the most common relation in the train set for NCs with the same modifier (w1), while head freq considers NCs with the same head (w2). Unseen heads/modifiers are assigned a random relation.
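The head freq baseline amounts to a per-head majority vote over the training labels; a minimal sketch with toy data (function names are ours):

```python
import random
from collections import Counter, defaultdict

def train_head_freq(labeled_ncs):
    """head freq: record the majority relation for each head word."""
    by_head = defaultdict(Counter)
    for (w1, w2), rel in labeled_ncs:
        by_head[w2][rel] += 1
    return {head: counts.most_common(1)[0][0]
            for head, counts in by_head.items()}

def predict_head_freq(model, nc, relations):
    w1, w2 = nc
    # Unseen heads are assigned a random relation.
    return model.get(w2, random.choice(relations))

model = train_head_freq([(("olive", "oil"), "source"),
                         (("baby", "oil"), "purpose"),
                         (("corn", "oil"), "source")])
predict_head_freq(model, ("sunflower", "oil"), ["source", "purpose"])
# -> "source" (the majority label for head "oil")
```

The mod freq baseline is identical with w1 in place of w2.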

Distributional Baselines.

Ablation of the path-based component from our models: Dist uses only w1's and w2's word embeddings: x = [ v_w1 , v_w2 ], while Dist-NC also includes the NC embedding: x = [ v_w1 , v_w2 , v_w1w2 ]. The network architecture is defined similarly to our models (Section 3.1).

Compositional Baselines.

We re-train Dima’s Dima (2016) models, various combinations of NC representations Zanzotto et al. (2010); Socher et al. (2012) and single word embeddings in a fully connected network.666We only include the compositional models, and omit the “basic” setting which is similar to our Dist model. For the full details of the compositional models, see W16-1604.

4.3 Results

Table 1 shows the performance of the various methods on the datasets. The compositional models of Dima (2016) perform best among the baselines and, on the random split, better than all our methods. On the lexical splits, however, the baselines exhibit a dramatic drop in performance and are outperformed by our methods. The gap is larger in the lexical-full split. Finally, there is usually no gain from the added NC vector in Dist-NC and Integrated-NC.

5 Analysis

Test NC               Test Label                 Most Similar NC        Its Label
majority party        equative                   minority party         whole+part_or_member_of
enforcement director  objective                  enforcement chief      perform&engage_in
fire investigator     objective                  fire marshal           organize&supervise&authority
stabilization plan    objective                  stabilization program  perform&engage_in
investor sentiment    experiencer-of-experience  market sentiment       topic_of_cognition&emotion
alliance member       whole+part_or_member_of    alliance leader        objective
Table 4: Examples of NCs from the Tratz-fine random split test set, along with the most similar NC in the embeddings, where the two NCs have different labels.

Path Embeddings.

To focus on the changes from previous work, we analyze the performance of the path-based model on the Tratz-fine random split. This dataset contains 37 relations, and the model's performance varies across them. Some relations, such as measure and personal_title, yield reasonable performance (F1 scores of 0.87 and 0.68, respectively). Table 3 focuses on these relations and illustrates the indicative paths that the model has learned for each relation. We compute these by performing the analysis of Shwartz et al. (2016), where each path is fed into the path-based model and assigned to its best-scoring relation. For each relation, we consider only the highest-scoring paths.

Other relations achieve very low F1 scores, indicating that the model is unable to learn them at all. Interestingly, the four relations with the lowest performance in our model (lexicalized, topic_of_cognition&emotion, whole+attribute&feat, partial_attr_transfer) are also those with the highest error rate in Dima (2016), very likely because they express complex relations. For example, the lexicalized relation contains non-compositional NCs (soap opera) or lexical items whose meanings have departed from the combination of the constituent meanings. As expected, there are no paths that indicate lexicalization. In partial_attribute_transfer (bullet train), w1 transfers an attribute to w2 (e.g. bullet transfers speed to train). These relations are not expected to be expressed in text, unless the text aims to explain them (e.g. a train as fast as a bullet).

Looking closer at the model's confusions shows that it often defaulted to general relations like objective (recovery plan) or relational-noun-complement (eye shape). The latter is described as "indicating the complement of a relational noun (e.g., son of, price of)", and the indicative paths for this relation indeed contain many variants of "[] of []", which can potentially occur with NCs in other relations. The model also confused relations with subtle differences, such as the different topic relations. Given that these relations were conflated to a single relation in the inter-annotator agreement computation in Tratz and Hovy (2010), we can conjecture that even humans find it difficult to distinguish between them.

NC Embeddings.

To understand why the NC embeddings did not contribute to the classification, we looked into the embeddings of the Tratz-fine test NCs; 3,091/3,831 (81%) of them had embeddings. For each NC, we looked for the 10 most similar NC vectors (in terms of cosine similarity) and compared their labels. We found that only 27.61% of the NCs were mostly similar to NCs with the same label. The problem seems to be inconsistency of annotations rather than low embedding quality. Table 4 displays examples of NCs from the test set, along with their most similar NC in the embeddings, where the two NCs have different labels.
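The nearest-neighbour label check described above can be sketched as follows (the function name is ours; in the real analysis the vectors are the learned NC embeddings):

```python
import numpy as np

def label_agreement(nc_vectors, labels, k=10):
    """Fraction of NCs whose k nearest neighbours (by cosine similarity)
    mostly share the NC's own label."""
    V = np.asarray(nc_vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = V @ V.T
    np.fill_diagonal(sims, -np.inf)  # exclude the NC itself
    agree = 0
    for i in range(len(V)):
        nn = np.argsort(sims[i])[::-1][:k]
        same = sum(labels[j] == labels[i] for j in nn)
        agree += same > k / 2
    return agree / len(V)

# Two well-separated clusters with consistent labels agree perfectly.
label_agreement([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]],
                ["a", "a", "b", "b"], k=1)  # -> 1.0
```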

6 Conclusion

We used an existing neural dependency path representation to represent noun-compound paraphrases and, along with distributional information, applied it to the NC classification task. Following previous work suggesting that distributional methods succeed due to lexical memorization, we showed that when lexical memorization is not possible, the performance of all methods is much worse. Adding the path-based component helps mitigate this issue and increases performance.


Acknowledgments

We would like to thank Marius Pasca, Susanne Riehemann, Colin Evans, Octavian Ganea, and Xiang Li for the fruitful conversations, and Corina Dima for her help in running the compositional baselines.


Appendix A Technical Details

To extract paths, we use a concatenation of English Wikipedia and the Gigaword corpus (https://catalog.ldc.upenn.edu/ldc2003t05). We consider sentences with up to 32 words and dependency paths with up to 8 edges, including satellites, and keep only 1,000 paths for each noun-compound. We compute the path embeddings in advance for all the paths connecting NCs in the dataset (§3.2), and then treat them as fixed embeddings during classification (§3.1).

We use TensorFlow Abadi et al. (2016) to train the models, fixing the values of the hyper-parameters after performing preliminary experiments on the validation set. We set the mini-batch size to 10, use the Adam optimizer Kingma and Ba (2014) with the default learning rate, and apply word dropout with probability 0.1. We train up to 30 epochs with early stopping, stopping the training when the F1 score on the validation set drops 8 points below the best performing score.
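The stopping criterion amounts to a small check over the validation history; a sketch (the helper name is ours, and `patience_drop=0.08` corresponds to the 8-point threshold):

```python
def should_stop(val_f1_history, patience_drop=0.08):
    """Stop once the latest validation F1 falls more than
    patience_drop below the best score seen so far."""
    best = max(val_f1_history)
    return val_f1_history[-1] < best - patience_drop

should_stop([0.50, 0.55, 0.54])  # -> False (still within 8 points of best)
should_stop([0.50, 0.55, 0.40])  # -> True
```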

We initialize the distributional embeddings with the 300-dimensional pre-trained GloVe embeddings Pennington et al. (2014) and the lemma embeddings (for the path-based component) with the 50-dimensional ones. Unlike HypeNET, we do not update the embeddings during training. The lemma, POS, and direction embeddings are initialized randomly and updated during training. NC embeddings are learned using a concatenation of Wikipedia and Gigaword. Similarly to the original GloVe implementation, we only keep the most frequent 400,000 vocabulary terms, which means that roughly 20% of the noun-compounds do not have vectors and are initialized randomly in the model.