“Our feeling is that an effective characterization of knowledge can result in a real understanding system in the not too distant future.” These were the words of Roger Schank and Robert Abelson more than 40 years ago [Schank and Abelson, 1975].
A key challenge in language understanding is that most texts are prohibitively difficult to understand in isolation. Their meaning only becomes apparent when interpreted in combination with background knowledge that reflects a previously acquired view of the world. This is true for many tasks in language understanding, ranging from co-reference resolution, to negation detection, to prepositional phrase attachment, and even high-level language understanding tasks such as entity linking and relation extraction. Earlier systems relied on manually specified background knowledge. Schank and Abelson [1975]
introduced scripts as predefined structures that describe stereotypical sequences of events for common situations. For example, a restaurant script describes the sequence of events that happen between the time a customer enters a restaurant and when they leave. Such a restaurant script can then be used to infer that if “Alice left a restaurant after a good meal”, then with some probability, she paid the bill and left a tip. Scripts fall within the theme of knowledge-aware machine reading. However, it is clear that with manually specified knowledge, we cannot hope to accumulate comprehensive background knowledge. In this paper, we study machine reading methods that leverage automatically generated, high volume background knowledge.
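As a toy illustration of the script idea, the snippet below encodes a restaurant script as an ordered event list and infers unstated earlier events from a mention of a later one. The event list and inference rule are invented for illustration and are not Schank and Abelson's actual representation.

```python
# A "script" as a stereotypical, ordered sequence of events.
RESTAURANT_SCRIPT = ["enter", "order", "eat", "pay bill", "leave tip", "leave"]

def likely_preceding_events(observed_event):
    """Events that plausibly happened before the observed one,
    according to the script's stereotypical ordering."""
    i = RESTAURANT_SCRIPT.index(observed_event)
    return RESTAURANT_SCRIPT[:i]

# "Alice left a restaurant after a good meal" -> she probably paid and tipped.
print(likely_preceding_events("leave"))
```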
Knowledge bases are structures for characterizing and storing world knowledge. They have been extensively studied in the past 10 years, resulting in a plethora of large resources containing hundreds of millions of assertions about real world entities [Auer et al., 2007; Bollacker et al., 2008; Mitchell, 2015; Suchanek et al., 2007]. We aim to build language understanding systems that make use of the abundant background knowledge found in knowledge bases.
We have developed two methods for sentence level machine reading that make use of background knowledge:
The first method addresses a difficult case of syntactic ambiguity caused by prepositions. Prepositions such as “in”, “at”, and “for” express important details about the where, when, and why of relations and events. However, prepositions are a major source of syntactic ambiguity and still pose problems in language analysis. In particular, they cause the problem of prepositional phrase attachment ambiguity, which arises, for example, in cases such as “she caught the butterfly with the spots” vs. “she caught the butterfly with the net”. In the first case, the prepositional phrase “with the spots” modifies the noun “butterfly”, while in the second case, “with the net” modifies the verb “caught”. Disambiguating these two attachments requires knowing that butterflies can have spots, and that a net is an instrument that can be used for catching. Our approach uses this type of knowledge within a semi-supervised machine learning algorithm that learns from both labeled and unlabeled data. The approach produces state-of-the-art results on two datasets and performs significantly better than the Stanford syntactic parser, which is commonly used in natural language processing pipelines.
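This kind of knowledge-based disambiguation can be sketched as a toy lookup. The triple store and decision rule below are illustrative only; the actual method, described in Section 2, learns a statistical model over knowledge-derived features rather than applying hard rules.

```python
# Hypothetical background-knowledge triples (not the paper's data).
KNOWLEDGE = {
    ("butterfly", "has", "spots"),   # noun-noun attribute relation
    ("net", "caught", "butterfly"),  # instrument-style SVO relation
}

def attach(verb, n1, prep, n2):
    """Return 'noun' if n2 describes n1, 'verb' if n2 relates to the verb."""
    if (n1, "has", n2) in KNOWLEDGE:
        return "noun"
    if (n2, verb, n1) in KNOWLEDGE:
        return "verb"
    return "unknown"

print(attach("caught", "butterfly", "with", "spots"))  # noun
print(attach("caught", "butterfly", "with", "net"))    # verb
```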
The second method exploits background knowledge to extract relationships from compound nouns. Compound nouns consist mostly of adjectives and nouns; they do not contain verbs. As a result, there are many lexical variations even across compound nouns that express similar semantic information between the nouns involved. Therefore, methods that rely on co-occurrences of lexical items are bound to be limited in the task of compound noun analysis. On the other hand, relationships such as a person’s job title, nationality, or stance on a political issue are often expressed using compound nouns. For example, “pro-choice Democratic gubernatorial candidate James Florio” and “White House spokesman Marlin Fitzwater” are compound nouns expressing useful information. We have developed a knowledge-aware method for compound noun analysis which accurately extracts relationships from compound nouns.
In summary, our contributions are as follows: 1) Knowledge-Aware Machine Reading: We study machine reading methods
that leverage background knowledge. While the problem of machine reading has attracted a lot of attention in recent years,
there has been little work on machine reading with background knowledge. We show compelling results on knowledge-aware machine reading within the context of two problems: prepositional phrase attachment and compound noun relation extraction.
2) Prepositional Phrase Attachment: We present a knowledge-aware method for prepositional phrase attachment.
Previous solutions to this problem largely rely on corpus statistics. Our approach draws upon
diverse sources of background knowledge, leading to significant performance improvements.
In addition to training on labeled data, we also make use of a large amount of unlabeled data. This enhances our method’s ability to generalize to diverse data sets.
In addition to the standard Wall Street Journal corpus (WSJ) [Ratnaparkhi et al., 1994], we labeled two new datasets for testing purposes, one from Wikipedia (WKP), and the other from the New York Times Corpus (NYTC). We make these datasets freely available for future research. In addition, we have applied our model to over 4 million 5-tuples of the form ⟨noun0, verb, noun1, preposition, noun2⟩, and we also make this dataset available at http://rtw.ml.cmu.edu/resources/ppa.
This work was first published in [Nakashole and Mitchell, 2015]; in this paper we report additional experiments on ternary relations.
We also place this work in the larger context of knowledge-aware machine reading.
3) Compound Noun Analysis: We introduce a knowledge-aware method for extracting relations from compound nouns. We collected over 2 million compound nouns, from which we learned fine-grained semantic type sequences that express ontological relations from the NELL knowledge base. Our experiments show that we obtain significantly higher accuracy than a baseline.
The rest of the paper is organized as follows. Section 2 presents our method for prepositional phrase attachment disambiguation. In addition to our main results, we also present findings on how we can use our method in high-level machine reading tasks such as relation extraction, in particular, ternary relation extraction. Section 3 introduces our approach to compound noun analysis. In addition to the task-specific related work presented in each of the first two sections, Section 4 presents additional work related to knowledge-aware machine reading. Lastly, in Section 5 we discuss the implications of our results, and bring forward a number of open questions.
2 Prepositional Phrase Attachment
Prepositional phrases (PPs) express crucial information that information extraction methods need to extract. However, PPs are a major source of syntactic ambiguity. In this paper, we introduce an algorithm that uses background knowledge to improve PP attachment accuracy. Prepositions such as “in”, “at”, and “for” express details about the where, when, and why of relations and events. PPs also state attributes of nouns.
As an example, consider the following sentences: S1.) Alice caught the butterfly with the spots. S2.) Alice caught the butterfly with the net. S1 and S2 are identical except in their final nouns. However, their parses differ, as seen in Figure 1. This is because in S1 the butterfly has spots, and therefore the PP “with the spots” attaches to the noun. In the task of relation extraction, we obtain a binary relation of the form: Alice caught butterfly with spots. However, in S2 the net is the instrument used for catching, and therefore the PP “with the net” attaches to the verb. For relation extraction, we get a ternary extraction of the form: Alice caught butterfly with net.
The PP attachment problem is often defined as follows: given a PP occurring within a sentence where there are multiple possible attachment sites for the PP, choose the most plausible attachment site. In the literature, prior work going as far back as [Brill and Resnik, 1994; Ratnaparkhi et al., 1994; Collins and Brooks, 1995] has focused on the language pattern that causes most PP ambiguities, which is the 4-word sequence ⟨v, n1, p, n2⟩ (e.g., ⟨caught, butterfly, with, spots⟩). The task is to determine if the prepositional phrase ⟨p, n2⟩ attaches to the verb v or to the first noun n1. Following common practice, we focus on PPs occurring as quadruples; we shall refer to these as PP quads.
The approach we present here differs from prior work in two main ways. First, we make extensive use of semantic knowledge about nouns, verbs, prepositions, pairs of nouns, and the discourse context in which a PP quad occurs. Table 1
summarizes the types of knowledge we considered in our work. Second, in training our model, we rely on both labeled and unlabeled data, employing an expectation maximization (EM) algorithm [Dempster et al., 1977].
Table 1: Types of background knowledge considered.
| Knowledge type | Examples |
| Noun-noun binary relations | (Paris, located in, France); (net, caught, butterfly) |
| Noun semantic categories | (butterfly, isA, animal) |
| Verb roles | caught(agent, patient, instrument) |
| Preposition definitions | f(for) = used for, has purpose, …; f(with) = has, contains, … |
2.1 State of the Art
To quantitatively assess existing tools, we analyzed the performance of the widely used Stanford parser (http://nlp.stanford.edu:8080/parser/, as of 2014) and the established baseline algorithm of Collins and Brooks [1995], which has stood the test of time. We first manually labeled PP quads from the NYTC dataset, then prepended the noun phrase appearing before each quad, effectively creating sentences made up of 5 lexical items. We then applied the Stanford parser, obtaining the results summarized in Figure 2. The parser performs well on some prepositions, for example “of”, which tends to occur with noun-attaching PPs, as can be seen in Figure 3. However, for prepositions with an even distribution over verb and noun attachments, such as “on”, precision is as low as 50%. The Collins baseline achieves 84% accuracy on the benchmark Wall Street Journal PP dataset. However, drawing a distinction between the precision on different prepositions provides useful insights into its performance. We re-implemented this baseline and found that when we remove the trivial preposition “of”, whose PPs are by default attached to the noun by this baseline, precision drops to 78%. This analysis suggests there is substantial room for improvement.
2.2 Related Work
Prominent prior methods learn to perform PP attachment based on corpus co-occurrence statistics, gathered either from manually annotated training data [Collins and Brooks, 1995; Brill and Resnik, 1994] or from automatically acquired training data that may be noisy [Ratnaparkhi, 1998; Pantel and Lin, 2000]. These models collect statistics on how often a given quadruple ⟨v, n1, p, n2⟩ occurs in the training data as a verb attachment as opposed to a noun attachment. The issue with this approach is sparsity: many quadruples occurring in the test data might not have been seen in the training data. Smoothing techniques are often employed to overcome sparsity. For example, Collins and Brooks [1995] proposed a back-off model that uses subsets of the words in the quadruple, by also keeping frequency counts of triples, pairs, and single words. Another approach to overcoming sparsity has been to use WordNet [Fellbaum, 1998] classes, replacing nouns with their WordNet classes [Stetina and Nagao, 1997; Toutanova et al., 2004] to obtain less sparse corpus statistics. Corpus-derived clusters of similar nouns and verbs have also been used [Pantel and Lin, 2000].
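As an illustration, the back-off idea of Collins and Brooks can be mimicked in a few lines. The miniature training corpus, counts, and back-off levels below are invented for illustration and greatly simplify the original model.

```python
from collections import defaultdict

# Tiny made-up training set: (v, n1, p, n2, label), label 1 = verb attachment.
train = [
    ("caught", "butterfly", "with", "net", 1),
    ("ate", "pizza", "with", "fork", 1),
    ("caught", "butterfly", "with", "spots", 0),
    ("saw", "man", "with", "telescope", 1),
]

def subtuples(v, n1, p, n2):
    # Back-off levels: full quad, triples, pairs, preposition alone;
    # every sub-tuple retains the preposition.
    yield [(v, n1, p, n2)]
    yield [(v, n1, p), (v, p, n2), (n1, p, n2)]
    yield [(v, p), (n1, p), (p, n2)]
    yield [(p,)]

counts, verb_counts = defaultdict(int), defaultdict(int)
for v, n1, p, n2, y in train:
    for level in subtuples(v, n1, p, n2):
        for t in level:
            counts[t] += 1
            verb_counts[t] += y

def p_verb(v, n1, p, n2):
    """Estimate p(verb attach) at the most specific level with support."""
    for level in subtuples(v, n1, p, n2):
        total = sum(counts[t] for t in level)
        if total > 0:  # back off until some level has counts
            return sum(verb_counts[t] for t in level) / total
    return 0.5

print(p_verb("caught", "butterfly", "with", "net"))
```

An unseen quad such as ⟨threw, ball, with, glove⟩ backs off all the way to the preposition-only counts.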
Hindle and Rooth proposed a lexical association approach based on how words are associated with each other [Hindle and Rooth, 1993]. Lexical preference is computed from co-occurrence frequencies (lexical associations) of verbs and nouns with prepositions. In this manner, they would discover, for example, that the verb “send” is highly associated with the preposition “from”, indicating that in this case the PP is likely to be a verb attachment.
These methods are based on high-level observations that are then generalized into heuristics for PP attachment decisions. Kimball [1988] proposed a right-association method, whose premise is that a word tends to attach to another word immediately to its right. Frazier [1978] introduced a minimal-attachment method, which posits that words attach to an existing word using the fewest additional syntactic nodes. While simple, in practice these methods have been found to perform poorly [Whittemore et al., 1990].
Brill and Resnik [1994] proposed methods that learn a set of transformation rules from a corpus. The rules consist of nouns, verbs, and prepositions, and can therefore be too specific to have broad applicability, resulting in low recall. To address low recall, knowledge about nouns, as found in WordNet, is used to replace certain words in rules with their WordNet classes.
In addition to prior work on prepositional phrase attachment, a highly related problem is preposition sense disambiguation [Hovy et al., 2011; Srikumar and Roth, 2013]. Even a syntactically correctly attached PP can still be semantically ambiguous with respect to machine reading questions such as where, when, and who. The same preposition can express many semantic relations. For example, in the sentence “Poor care caused her death from pneumonia”, the preposition “from” expresses the relation Cause(death, pneumonia). But “from” can denote other relations, for example in “copied the scene from another film” (Source) and in “recognized him from the start” (Temporal) [Srikumar and Roth, 2013]. Therefore, when extracting information from prepositions, the problem of preposition sense disambiguation (semantics) has to be addressed in addition to prepositional phrase attachment disambiguation (syntax). In this paper, we consider both the syntactic and semantic aspects of prepositions.
Table 2: Feature types and their sources.
| Feature type | Source |
| Noun-noun binary relations | SVOs |
| Noun semantic categories | WordNet classes / KB types |
| Verb role fillers | VerbNet |
| Discourse features | Surrounding sentence(s) |
| Lexical features | PP quads |
|Prepositional Phrase Quadruple||Feature|
|<Alice caught the butterfly with the net >:||F1:|
|<The dog caught the butterfly with spots >||F1: n/a|
Our approach consists of first generating features from background knowledge and then training a model to learn with these features. The trained model is applied to new sentences after also annotating them with background knowledge features. The types of features considered in our experiments are summarized in Table 2. Additionally, Table 3 shows examples of the instantiated features for two prepositional phrase quadruples. The choice of features was motivated by our empirically driven characterization of the problem shown in Table 4.
Table 4: Three-way characterization of PP attachment.
| Attachment | Characterization |
| Verb attach | v has-slot-filler n2 |
| Noun attach (a) | n1 described-by n2 |
| Noun attach (b) | n2 described-by n1 |
We sampled PP quads from the WSJ dataset, labeled with noun or verb attachment. We found that every noun or verb attachment could be explained using our three-way characterization in Table 4. In particular, we found that in verb-attaching PPs, the second noun n2 is usually a role filler for the verb. Going back to our verb-attaching PP in “Alice caught the butterfly with the net”, we can see that the net fills the role of an instrument for the verb “caught”. On the other hand, for noun-attaching PPs, one noun describes or elaborates on the other. In particular, we found two kinds of noun attachments. In the first kind of noun attachment, the second noun n2 describes the first noun n1; for example, n2 might be an attribute or property of n1. For example, in “Alice caught the butterfly with the spots”, the spots (n2) are an attribute of the butterfly (n1). In the second kind of noun attachment, the first noun n1 describes the second noun n2. For example, in the PP quad ⟨expect, decline, in, rates⟩, the PP “in rates” attaches to the noun “decline”: the decline that is expected is in the rates. We make this labeling available with the rest of the datasets.
We next describe in more detail how each type of feature is derived from the background knowledge in Table 1. We generate boolean-valued features for all the feature types we describe in this section.
2.3.1 Noun-Noun Binary Relations
The noun-noun binary relation features, F1-2 in Table 2, are boolean features n1-*-n2 (where * is any verb) and n2-v-n1 (where v is the verb in the PP quad, and the roles of n1 and n2 are reversed). These features describe diverse semantic relations between pairs of nouns (e.g., butterfly-has-spots, clapton-played-guitar). To obtain this type of knowledge, we dependency parsed all sentences in the 500 million English web pages of the ClueWeb09 corpus, then extracted subject-verb-object (SVO) triples from these parses, along with the frequency of each SVO triple in the corpus. The value of any given feature is defined to be 1 if the corresponding SVO triple was found at least a threshold number of times in these SVO triples, and 0 otherwise.
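A minimal sketch of how such features might be computed from SVO counts follows. The triples, counts, and threshold are illustrative stand-ins, not values drawn from the ClueWeb09 corpus.

```python
# Hypothetical SVO corpus counts and frequency threshold.
SVO_COUNTS = {
    ("butterfly", "has", "spots"): 42,
    ("net", "caught", "butterfly"): 17,
}
THRESHOLD = 5  # assumed minimum corpus frequency

def f1(n1, n2):
    """F1: 1 if n1-*-n2 holds for any verb with sufficient frequency."""
    return int(any(c >= THRESHOLD for (s, _, o), c in SVO_COUNTS.items()
                   if s == n1 and o == n2))

def f2(v, n1, n2):
    """F2: 1 if n2-v-n1 (quad verb, roles reversed) is frequent enough."""
    return int(SVO_COUNTS.get((n2, v, n1), 0) >= THRESHOLD)

print(f1("butterfly", "spots"), f2("caught", "butterfly", "net"))  # 1 1
```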
To see why these relations are relevant, suppose that we have the knowledge that butterfly-has-spots. From this, we can infer that the PP in ⟨caught, butterfly, with, spots⟩ is likely to attach to the noun. Similarly, suppose we know that net-caught-butterfly. The fact that a net can be used to catch a butterfly can be used to predict that the PP in ⟨caught, butterfly, with, net⟩ is likely to attach to the verb.
2.3.2 Noun Semantic Categories
Noun semantic type features, F3-4, are boolean features type(n1) = c and type(n2) = c, where c is a noun category in a noun categorization scheme such as WordNet classes. Knowledge about semantic types of nouns, for example that a butterfly is an animal, enables extrapolating predictions to other PP quads that contain nouns of the same type. We ran experiments with several noun categorizations, including WordNet classes, knowledge base ontological types, and an unsupervised noun categorization produced by clustering nouns based on the verbs and adjectives with which they co-occur (distributional similarity).
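A sketch of the resulting boolean features, using a toy category map in place of WordNet classes or KB types:

```python
# Illustrative noun categorization (stand-in for WordNet / KB types).
CATEGORY = {"butterfly": "animal", "net": "artifact", "spots": "attribute"}

def category_features(n1, n2, all_categories=("animal", "artifact", "attribute")):
    """One boolean per (slot, category) pair: type(n1) = c and type(n2) = c."""
    feats = {}
    for c in all_categories:
        feats[("n1", c)] = int(CATEGORY.get(n1) == c)
        feats[("n2", c)] = int(CATEGORY.get(n2) == c)
    return feats

f = category_features("butterfly", "net")
print(f[("n1", "animal")], f[("n2", "artifact")])  # 1 1
```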
2.3.3 Verb Role Fillers
The verb role feature, F5, is a boolean feature fills-role(n2, r, v), where r is a role that n2 can fill for the verb v in the PP quad, according to background knowledge. Notice that if n2 fills a role for the verb, then the PP is a verb attachment. Consider the quad ⟨caught, butterfly, with, net⟩: if we know that a net can play the role of an instrument for the verb catch, this suggests a likely verb attachment. We obtained background knowledge of verbs and their possible roles from the VerbNet lexical resource [Kipper et al., 2008]. From VerbNet we obtained labeled sentences containing PP quads (verbs in the same VerbNet group are considered synonymous), along with the labeled semantic roles filled by the second noun in the PP quad. We use these example sentences to label similar PP quads, where similarity of PP quads is defined by verbs from the same VerbNet group.
2.3.4 Preposition Definitions
The preposition definition feature, F6, is a boolean feature based on def(p), a mapping of prepositions to verb phrases. This mapping defines prepositions using verbs in our ClueWeb09-derived SVO corpus, in order to capture preposition senses; it contains definitions such as def(with, *) = contains, accompanied by, …. If “with” is used in the sense of “contains”, then the PP is a likely noun attachment; however, if “with” is used in the sense of “accompanied by”, then the PP is a likely verb attachment.
To obtain the mapping, we took the labeled PP quads (WSJ [Ratnaparkhi et al., 1994]) and computed a ranked list of verbs from SVOs that appear frequently between pairs of nouns for a given preposition. Other sample mappings are: def(for, *) = used for, and def(in, *) = located in. Notice that this feature is a selective, more targeted version of the noun-noun relation features F1-2.
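The mapping construction can be sketched as follows, with toy SVO triples and noun pairs; the real mapping is computed from the ClueWeb09-derived SVO corpus over the labeled WSJ quads.

```python
from collections import Counter

# Toy SVO triples and noun pairs from (hypothetical) noun-attaching
# quads for the preposition "with".
svo = [("butterfly", "has", "spots"), ("butterfly", "has", "spots"),
       ("bag", "contains", "books"), ("butterfly", "contains", "spots")]
noun_pairs_for_with = [("butterfly", "spots"), ("bag", "books")]

def preposition_defs(pairs, svo_triples, top=2):
    """Rank verbs that frequently link the noun pairs of a preposition."""
    pair_set = set(pairs)
    counts = Counter(v for s, v, o in svo_triples if (s, o) in pair_set)
    return [v for v, _ in counts.most_common(top)]

print(preposition_defs(noun_pairs_for_with, svo))  # e.g. ['has', 'contains']
```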
2.3.5 Discourse and Lexical Features
The discourse feature, F7, is a boolean feature, one for each noun category found in a noun category ontology such as WordNet semantic types. For example, we might pick up the fact that a PP quad is surrounded by many mentions of people, or food, or organizations. By doing this we leverage the context of the PP quad, which can contain relevant information for attachment decisions. We also take into account the noun preceding a PP quad, in particular its semantic type. This in effect turns the PP quad into a PP 5-tuple ⟨n0, v, n1, p, n2⟩, where n0 provides additional context.
Finally, we use lexical features in the form of PP quads, features F8-15. To overcome sparsity of occurrences of PP quads, we also use counts of shorter sub-sequences, including triples, pairs, and singles. We only use sub-sequences that contain the preposition, as the preposition has been found to be highly crucial in PP attachment decisions [Collins and Brooks, 1995].
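The preposition-containing sub-sequences can be enumerated as follows; this is a sketch, and feature indexing details are omitted.

```python
from itertools import combinations

def lexical_features(v, n1, p, n2):
    """All sub-sequences of the quad that retain the preposition:
    the quad itself, triples, pairs, and the preposition alone."""
    words = [v, n1, p, n2]
    feats = []
    for r in range(1, 5):
        for combo in combinations(words, r):
            if p in combo:  # keep only preposition-containing subsequences
                feats.append(combo)
    return feats

feats = lexical_features("caught", "butterfly", "with", "net")
print(len(feats))  # 8 subsequences contain the preposition
```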
2.4 Disambiguation Algorithm
We use the described features to train a model for making PP attachment decisions. Our goal is to compute p(y | x), the probability that the PP in the tuple attaches to the verb (y = 1) or to the first noun (y = 0), given a feature vector x describing that tuple. As input to training the model, we are given a collection of PP quads Q. A small subset Q_L ⊆ Q is labeled data; thus for each q_i ∈ Q_L we know the corresponding y_i. The rest of the quads, Q_U = Q \ Q_L, are unlabeled, hence their corresponding y_i's are unknown. From each PP quad q_i, we extract a feature vector x_i according to the feature generation process discussed earlier.
2.4.1 The Model
To model p(y | x), there are various possibilities. One could use a generative model (e.g., Naive Bayes) or a discriminative model (e.g., logistic regression). In our experiments we used both kinds of models, but found that the discriminative model performed better. Therefore, we present details only for our discriminative model. We use the logistic function:

p(y = 1 | x; θ) = 1 / (1 + exp(−θᵀx))
Here θ is a vector of model parameters. To estimate these parameters, we could use only the labeled data as training data and standard gradient descent to minimize the logistic regression cost function. However, we also leverage the unlabeled data.
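In code, the model is just the logistic function applied to a weighted feature sum; the weights below are illustrative, not learned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_verb_attach(theta, x):
    """p(y = 1 | x; theta) for a boolean/real feature vector x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Illustrative two-feature model: first feature favors verb attachment,
# second favors noun attachment.
print(p_verb_attach([2.0, -1.5], [1, 0]))  # sigmoid(2.0)
```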
2.4.2 Parameter Estimation
To estimate model parameters based on both labeled and unlabeled data, we use an Expectation Maximization (EM) algorithm. EM estimates model parameters that maximize the expected log likelihood of the full (observed and unobserved) data.
Since we are using a discriminative model, our likelihood function is a conditional likelihood function:

L(θ) = ∏_i p(y_i | x_i; θ)

where i indexes over the training examples.
The EM algorithm produces parameter estimates that correspond to a local maximum in the expected log likelihood of the data under the posterior distribution of the labels. In the E-step, we use the current parameters θ^(t) to compute the posterior distribution over the labels, given by p(y | x; θ^(t)). We then use this posterior distribution to find the expectation of the log of the complete-data conditional likelihood; this expectation is given by Q(θ | θ^(t)), defined as:

Q(θ | θ^(t)) = Σ_i Σ_{y ∈ {0,1}} p(y | x_i; θ^(t)) log p(y | x_i; θ)
In the M-step, a new estimate θ^(t+1) is then produced by maximizing this function with respect to θ:

θ^(t+1) = argmax_θ Q(θ | θ^(t))
EM iteratively computes parameters θ^(t), using the above update rule at each iteration t, halting when there is no further improvement in the value of the Q function. Our algorithm is summarized in Algorithm 1. The M-step solution for θ^(t+1) is obtained using gradient ascent to maximize the Q function.
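A compact sketch of this EM loop for semi-supervised logistic regression follows. The data, learning rate, and iteration counts are illustrative, and the M-step runs a fixed number of gradient ascent steps rather than iterating to convergence as in Algorithm 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: labeled quads fix y; unlabeled quads contribute fractionally.
labeled = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]   # (features, attachment label)
unlabeled = [[1.0, 0.2], [0.1, 1.0]]

def em(theta, iters=20, lr=0.5):
    for _ in range(iters):
        # E-step: posterior over labels for the unlabeled examples.
        post = [sigmoid(sum(t * f for t, f in zip(theta, x)))
                for x in unlabeled]
        # M-step: gradient ascent on the expected conditional log-likelihood.
        for _ in range(25):
            grad = [0.0] * len(theta)
            examples = labeled + list(zip(unlabeled, post))
            for x, y in examples:  # y may be a fractional posterior
                err = y - sigmoid(sum(t * f for t, f in zip(theta, x)))
                grad = [g + err * f for g, f in zip(grad, x)]
            theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

theta = em([0.0, 0.0])
print(sigmoid(theta[0]))  # high: first feature indicates verb attachment
```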
2.5 Experimental Evaluation
We evaluated our method on several datasets containing PP quads of the form ⟨v, n1, p, n2⟩. The task is to predict if the PP ⟨p, n2⟩ attaches to the verb v or to the first noun n1.
2.5.1 Experimental Setup
| DataSet | # Training quads | # Test quads |
Table 5 shows the datasets used in our experiments. As labeled training data, we used the Wall Street Journal (WSJ) dataset. For the unlabeled training data, we extracted PP quads from Wikipedia (WKP) and randomly selected a sample, which we found to be a sufficient amount of unlabeled data. The largest labeled test dataset is WSJ, but it is also made up of a large fraction, 30%, of “of” PP quads, which trivially attach to the noun, as already seen in Figure 3. The New York Times (NYTC) and Wikipedia (WKP) datasets are smaller but contain smaller proportions of “of” PP quads: 15% and 14%, respectively. Additionally, we applied our model to over 4 million unlabeled 5-tuples from Wikipedia. We make this data available for download, along with our manually labeled NYTC and WKP datasets. For the WKP and NYTC corpora, each quad has a preceding noun, n0, as context, resulting in PP 5-tuples of the form ⟨n0, v, n1, p, n2⟩. The WSJ dataset was only available to us in the form of PP quads with no other sentence information.
Methods Under Comparison
1) PPAD (Prepositional Phrase Attachment Disambiguator) is our proposed method. It uses diverse types of semantic knowledge, a mixture of labeled and unlabeled training data, a logistic regression classifier, and expectation maximization (EM) for parameter estimation. 2) Collins is the established baseline among PP attachment algorithms [Collins and Brooks, 1995]. 3) Stanford Parser is a state-of-the-art dependency parser (the 2014 online version). 4) PPAD Naive Bayes (NB) is the same as PPAD but uses a generative model, as opposed to the discriminative model used in PPAD.
2.5.2 PPAD vs. Baselines
Comparison results of our method and the three baselines are shown in Table 6. For each dataset, we also show results when the “of” quads are removed, shown as “WKP\of”, “NYTC\of”, and “WSJ\of”. Our method yields improvements over the baselines. Improvements are especially significant on the datasets for which no labeled data was available (NYTC and WKP). On WKP, our method is 7% and 9% ahead of the Collins baseline and the Stanford parser, respectively. On NYTC, our method is 4% and 6% ahead of the Collins baseline and the Stanford parser, respectively. On WSJ, which is the source of the labeled data, our method is not significantly better than the Collins baseline. We could not evaluate the Stanford parser on the WSJ dataset: the parser requires well-formed sentences, which we could not generate from the WSJ dataset as it was only available to us in the form of PP quads with no other sentence information. For the same reason, we could not generate discourse features, F7, for the WSJ PP quads. For the NYTC and WKP datasets, we generated well-formed short sentences containing only the PP quad and the noun preceding it.
2.5.3 Feature Analysis
We found that certain features did not improve performance and therefore excluded them from the final model, PPAD. In particular, binary noun-noun relations were not useful when used permissively, but we found them to be useful when used selectively. Our attempt at mapping prepositions to verb definitions produced some noisy mappings, resulting in the preposition definition feature producing mixed results. To analyze the impact of the unlabeled data, we inspected the features and their weights as produced by the PPAD model. From the unlabeled data, new lexical features were discovered that were not in the original labeled data. Some sample new features with high weights for verb attachments are: (perform, song, for, *), (lose, *, by, *), (buy, property, in, *). And for noun attachments: (*, conference, on, *), (obtain, degree, in, *), (abolish, taxes, on, *).
We evaluated several variations of PPAD, the results are shown in Figure 4. For “PPAD-WordNet Verbs”, we expanded the data by replacing verbs in PP quads with synonymous WordNet verbs, ignoring verb senses. This resulted in more instances of features F1, F8-10, & F12.
We also used different types of noun categorizations: WordNet classes, semantic types from the NELL knowledge base [MitchellMitchell2015] and unsupervised types. The KB types and the unsupervised types did not perform well, possibly due to the noise found in these categorizations. WordNet classes showed the best results, hence they were used in the final PPAD model for features F3-4 & F7. In Section 2.5.1, PPAD corresponds to the best model.
2.5.4 Application to Ternary Relations
Through the application of ternary relation extraction, we further tested PPAD’s PP disambiguation accuracy and illustrated its usefulness for knowledge base population. Recall that a PP 5-tuple of the form ⟨n0, v, n1, p, n2⟩, whose enclosed PP attaches to the verb v, denotes a ternary relation with arguments n0, n1, and n2. Therefore, we can extract a ternary relation from every 5-tuple for which our method predicts a verb attachment. If we have a mapping between verbs and binary relations from a knowledge base (KB), we can extend KB relations to ternary relations by augmenting the KB relations with a third argument, n2.
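The extraction step can be sketched as follows; the verb-to-relation mapping here is a hypothetical stand-in for the NELL-derived mapping, and the attachment decision is taken as an input rather than predicted.

```python
# Hypothetical (verb, preposition) -> KB relation mapping.
VERB_TO_RELATION = {("joined", "as"): "worksFor", ("married", "in"): "hasSpouse"}

def ternary(n0, v, n1, p, n2, attachment):
    """Build a ternary relation from a 5-tuple with a verb-attached PP."""
    if attachment != "verb":
        return None  # only verb attachments denote ternary relations
    rel = VERB_TO_RELATION.get((v, p))
    return (rel, n0, n1, n2) if rel else None

print(ternary("Shubert", "joined", "CNN", "as", "reporter", "verb"))
```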
Table 7: KB relations extended to ternary relations.
| Relation | Preposition | % verb attachments | Example |
| acquired | from | 99.97 | BNY Mellon acquired Insight from Lloyds. |
| hasSpouse | in | 91.54 | David married Victoria in Ireland. |
| worksFor | as | 99.98 | Shubert joined CNN as reporter. |
| playsInstrument | with | 98.40 | Kushner played guitar with rock band Weezer. |
We considered four NELL KB binary relations and their instances. We then took the collection of 4 million 5-tuples that we extracted from Wikipedia and mapped verbs in 5-tuples to KB relations, based on overlap between the noun pairs in instances of the KB relations and the ⟨n0, n1⟩ pairs in the Wikipedia PP 5-tuple collection. We found that, for example, instances of the noun-noun KB relation “worksFor” match pairs in tuples where v = “joined” and p = “as”, with n2 referring to the job title (e.g., “Shubert joined CNN as reporter”). Other binary relations extended are: “hasSpouse”, extended by “in” with the wedding location (e.g., “David married Victoria in Ireland”), and “acquired”, extended by “from” with the seller of the company being acquired (e.g., “BNY Mellon acquired Insight from Lloyds”). Examples are shown in Table 7. In all these mappings, the proportion of verb attachments in the corresponding PP quads is significantly high (above 91%). PPAD is overwhelmingly making the right attachment decisions in this setting.
Efforts in temporal and spatial relation extraction have shown that higher n-ary relation extraction is challenging. Since prepositions specify details that transform binary relations into higher n-ary relations, our method can be used to read information that can augment binary relations already in KBs. As future work, we would like to incorporate our method into a pipeline for reading beyond binary relations. One possible direction is to read details about the where, why, and who of events and relations, effectively moving from extracting only binary relations to reading at a more general level.
2.5.5 Labeled Ternary Arguments
In the above experiment, we studied the case of extending existing KB relations to ternary relations. However, we did not provide any semantic information about the role of the third argument. In this section, we study the case where we want to label the role of the third argument. For example, for the acquisition instance “BNY Mellon acquired Insight from Lloyds”, we want to predict that the label of “Lloyds” is “Source”, indicating the source company of the acquisition. As another example, consider the sentence “Bailey bought earrings for Josie”; we want to predict that the label of “Josie” is “Beneficiary”, indicating the beneficiary of the items bought.
To obtain labels for ternary relations, we make use of VerbNet [Kipper et al., 2008]. VerbNet provides, for each verb, frames of the different use cases of the verb. Here we consider only verb uses that involve prepositions. In VerbNet, these frames are described using a label of the form “primary=NP V NP PP.label”, where “label” is the role of the argument to the right of the prepositional phrase. One example of a VerbNet frame is “primary=NP V NP PP.instrument”; each such frame is accompanied by an example sentence, in this case “Paula hit the ball with a stick”, where the stick takes the role of the instrument. Notice that a given verb and preposition combination does not necessarily invoke a given label. For example, in “Paula hit the ball with joy”, “joy” does not play the role of the instrument. Therefore, we introduce further constraints. We learn these constraints from the collection of 4 million 5-tuples that we extracted from Wikipedia, as explained in Section 2.5.1. In particular, we replace mentions of entities with their NELL and WordNet semantic types. Using this approach, we generate templates of the form:
<np_v_np_pp.LABEL><verb><typeofArg1><preposition><typeofArg2>.
We used five labels from VerbNet: np_v_np_pp.beneficiary, np_v_np_pp.instrument, np_v_np_pp.asset, np_v_np_pp.source, and np_v_np_pp.topic. The labels form ternary relations as follows. Consider the sentence “Paula hit the ball with a stick”. This sentence matches the label np_v_np_pp.instrument. The binary relation is hit(Paula, ball). Extending this to a ternary relation, we get hit_pp.instrument(Paula, ball, stick). Table 8 shows an example of the ternary relation label np_v_np_pp.beneficiary. Additional examples of learned templates for each of the five labels are shown in Table 9.
|Template:||<np_v_np_pp.beneficiary><buy><jewelry><for><person>|
|Sentence:||Sue bought earrings for Mary|
|Ternary Relation:||buy_pp.beneficiary(Sue, earrings, Mary)|
|<np_v_np_pp.instrument><shoot><person><with><weapon>|
|<np_v_np_pp.asset><sell><company><for><amount>|
|<np_v_np_pp.source><buy><organization><from><organization>|
|<np_v_np_pp.topic><ask><person><for><advice>|
|<np_v_np_pp.topic><ask><person><for><divorce>|
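The labeling step described above, matching a typed 5-tuple against a learned template to produce a labeled ternary relation, can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the template entries and the toy type lookup stand in for the learned templates and the NELL/WordNet type system.

```python
# Hypothetical learned templates keyed by (verb, type of arg1, preposition,
# type of arg2); values are VerbNet-style labels, as in Table 9.
TEMPLATES = {
    ("buy", "jewelry", "for", "person"): "np_v_np_pp.beneficiary",
    ("shoot", "person", "with", "weapon"): "np_v_np_pp.instrument",
}

def label_ternary(subj, verb, obj, prep, arg3, type_of):
    """Return a labeled ternary relation string, or None if no template matches."""
    label = TEMPLATES.get((verb, type_of(obj), prep, type_of(arg3)))
    if label is None:
        return None
    role = label.split(".")[-1]
    # e.g. buy_pp.beneficiary(Sue,earrings,Mary)
    return f"{verb}_pp.{role}({subj},{obj},{arg3})"

# Toy stand-in for a NELL/WordNet semantic type lookup.
types = {"earrings": "jewelry", "Mary": "person", "Sue": "person"}
rel = label_ternary("Sue", "buy", "earrings", "for", "Mary", types.get)
# rel == "buy_pp.beneficiary(Sue,earrings,Mary)", matching the Table 8 example
```

The type check on both arguments is what enforces the constraint that, e.g., “with joy” does not match the instrument template.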
Table 10 shows sample instances of the different learned templates for labeled ternary arguments. We randomly sampled 100 such instances and evaluated them for accuracy, finding a sampled accuracy of 88%.
|Ternary argument label||Instance|
|np_v_np_pp.beneficiary||danai udomchoke won gold medal for thailand|
|np_v_np_pp.beneficiary||alton cooked breakfast for crew|
|np_v_np_pp.beneficiary||boys cooked cakes for girls|
|np_v_np_pp.beneficiary||bailey buys earrings for josie|
|np_v_np_pp.beneficiary||jim buys bracelet for kathy|
|np_v_np_pp.beneficiary||leonard buys engagement ring for michelle|
|np_v_np_pp.beneficiary||headmaster bought goggles for children|
|np_v_np_pp.instrument||lord edward thynne shot golden eagle with rifle|
|np_v_np_pp.instrument||mohawks opened fire with gunshots|
|np_v_np_pp.instrument||unidentified militants opened fire with grenade launcher|
|np_v_np_pp.instrument||jarvis opened fire with 5-inch guns|
|np_v_np_pp.instrument||prince stabs vizier with dagger|
|np_v_np_pp.instrument||isaac van scoy killed british soldier with pitchfork|
|np_v_np_pp.instrument||ambush positions opened fire with mortars|
|np_v_np_pp.instrument||tamalika karmakar killed rebecca with knife|
|np_v_np_pp.source||telugu film homam drew inspiration from martin scorsese|
|np_v_np_pp.source||john coltrane received call from davis|
|np_v_np_pp.source||kenneth o’keefe received letter from state department|
|np_v_np_pp.source||tony receives letter from mandy|
|np_v_np_pp.source||peter receives call from claire|
|np_v_np_pp.source||huppertz drew inspiration from richard wagner|
|np_v_np_pp.source||fiz receives call from alan hoyle|
|np_v_np_pp.source||elbaz drew inspiration from bruce willis|
|np_v_np_pp.source||smolensky bought company from wheeler|
|np_v_np_pp.topic||wittenberg asked jan kazimierz for permission|
|np_v_np_pp.topic||brando asked john gielgud for advice|
|np_v_np_pp.topic||lutician delegates asked conrad for help|
|np_v_np_pp.topic||logan asked scott for help|
|np_v_np_pp.topic||philadelphia quakers asked nhl for permission|
|np_v_np_pp.topic||steven asks frank for advice|
|np_v_np_pp.topic||rowe asked jackson for divorce|
2.6 Prepositional Phrase Attachment Ambiguity Summary
We have presented a knowledge-aware approach to prepositional phrase (PP) attachment disambiguation, which is a type of syntactic ambiguity. Our method incorporates knowledge about verbs, nouns, discourse, and noun-noun binary relations. We trained a model using both labeled data and unlabeled data, making use of expectation maximization for parameter estimation. Our method can be seen as an example of tapping into a positive feedback loop for machine reading enabled by recent advances in information extraction and knowledge base construction techniques.
3 Compound Noun Analysis
Noun phrases exhibit a number of challenging compositional phenomena, including implicit relations. Compound nouns such as “pro-choice Democratic gubernatorial candidate James Florio” or “White House spokesman Marlin Fitzwater” consist primarily of nouns and adjectives; they do not contain verbs. This means that traditional pattern detection algorithms, which detect relations through lexical regularities, will not work well on compound nouns. On the other hand, beliefs such as a person’s job title, nationality, or stance on a political issue are often expressed using compound nouns. We propose a knowledge-aware algorithm for extracting semantic relations from compound nouns that learns, through distant supervision, to map fine-grained type sequences of compound nouns to the relations they express. Consider the following compound nouns.
|1.a) Giants cornerback Aaron Ross|
|1.b) Patriots quarterback Matt Cassel|
|1.c) Colts receiver Bryan Fletcher|
|2.a) Japanese astronaut Soichi Noguchi|
|2.b) Irish golfer Padraig Harrington|
|2.c) French philosopher Jean-Paul Sartre|
|3.a) Seabiscuit author Laura Hillenbrand|
|3.b) Harry Potter author J.K. Rowling|
|3.c) Walking the Bible author Bruce Feiler|
The concepts in the compound noun sequences (1.a – c), (2.a – c), and (3.a – c) are of the semantic type sequences <sportsteam><profession><person>, <country><profession><person>, and <book>“author”<person>, respectively.
Therefore, our task is to learn semantic type sequences and their mappings to knowledge base relations. We use relations from the NELL knowledge base. Since NELL relations are binary, taking only two arguments, while compound nouns contain more than two noun phrases, we additionally keep track of the position information for the two arguments of the relation. For example, from the type sequence <country><profession><person>, we generate mappings to two different relations.
To learn mappings from compound nouns to binary relations, as shown in Table 11, we use distant supervision, that is, we use the NELL knowledge base as the only form of supervision. In general, the intuition behind distant supervision is that a sentence containing a pair of entities that participate in a known relation is likely to express that relation. In our case, the sentence is just a compound noun, for example “Japanese astronaut Soichi Noguchi”. Therefore, we first extract compound nouns from a large collection of documents. For every compound noun, we map its noun phrases to entities in NELL. The entities are then replaced by their NELL types. This creates type sequences of the form <country><profession><person>. Each type sequence has a support set, which is the collection of compound nouns that satisfy the type sequence. For example, “Japanese astronaut Soichi Noguchi” is a supporting compound noun for the type sequence <country><profession><person>. We retain type sequences whose support set sizes are above a threshold of in our experiments. For these retained type sequences, we use their support sets to learn mappings from type sequences to relations using distant supervision. That is, from each supporting compound noun we collect pairs of entities, and do lookups in NELL to determine which relations hold between the pair of entities. We additionally keep track of the position of the entities within the compound noun. This gives us mappings from type sequences to relations such as: <citizenofcountry><3><1><country><profession>. We only retain mappings whose support set size (relation instances in NELL) is above a threshold of in our experiments.
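The pipeline just described, extract compound nouns, replace entities with their types, collect support sets, then look up relations in the KB while recording argument positions, can be sketched as follows. All entity types, KB relations, and the `min_support` value below are toy stand-ins for actual NELL lookups and the elided thresholds, not real data.

```python
from collections import defaultdict

# Toy stand-ins for NELL type and relation lookups.
ENTITY_TYPES = {"Japanese": "country", "astronaut": "profession",
                "Soichi Noguchi": "person", "Irish": "country",
                "golfer": "profession", "Padraig Harrington": "person"}
KB_RELATIONS = {("Soichi Noguchi", "Japanese"): "citizenofcountry",
                ("Padraig Harrington", "Irish"): "citizenofcountry"}

def type_sequence(compound):
    """Replace each token of a compound noun with its semantic type."""
    return tuple(ENTITY_TYPES.get(tok, "np") for tok in compound)

def learn_mappings(compounds, min_support=2):
    """Map type sequences to (relation, argument positions) with support counts."""
    support = defaultdict(list)
    for c in compounds:
        support[type_sequence(c)].append(c)
    mappings = defaultdict(int)
    for seq, members in support.items():
        if len(members) < min_support:
            continue  # discard type sequences with too little support
        for c in members:
            # Check every ordered pair of positions against the KB.
            for i, a in enumerate(c):
                for j, b in enumerate(c):
                    if i != j and (a, b) in KB_RELATIONS:
                        mappings[(seq, KB_RELATIONS[(a, b)], i, j)] += 1
    return dict(mappings)

compounds = [("Japanese", "astronaut", "Soichi Noguchi"),
             ("Irish", "golfer", "Padraig Harrington")]
m = learn_mappings(compounds)
# One learned mapping: <country><profession><person> expresses
# citizenofcountry between the person (position 2) and the country (position 0).
```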
3.1 Experimental Evaluation
We extracted compound nouns from three different corpora: the New York Times archive, which includes about 1.8 million articles from the years 1987 to 2007; the English edition of Wikipedia, with about 3.8 million articles; and the KBP dataset [SurdeanuSurdeanu2013], which contains over 2 million documents drawn from Gigaword newswire and Web documents. We extracted a total of compound nouns. From these compound nouns, we extract 10 relations that are expressed by compound nouns and are in the NELL knowledge base. From this dataset we learned 291 mappings from type sequences to relations. Using these mappings, we then predicted new relation instances. We report recall and accuracy in Table 12. We compare our system to a baseline that is not knowledge aware. The baseline is created as follows: we generate sequences that have no awareness of background knowledge by discarding type information from the sequences. For example, the sequence <book>“author”<person> becomes <noun phrase>“author”<noun phrase>, where “noun phrase” refers to any noun phrase found in text regardless of its type. In cases where the entire template is made up of semantic types only, for example <country><profession><person>, discarding semantic types results in a template which is too general. Therefore, for the baseline, we discarded such overly permissive templates, which would otherwise present a significant disadvantage for the baseline.
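The baseline construction above, stripping semantic types while keeping lexical anchors and discarding templates made up of types only, can be sketched as follows. The quoting convention used to mark lexical anchors is an assumption of this sketch.

```python
def to_baseline(template):
    # Replace each semantic type with a generic noun-phrase placeholder,
    # keeping quoted lexical anchors such as "author" intact.
    return tuple(t if t.startswith('"') else "noun phrase" for t in template)

def is_too_general(template):
    # A types-only template collapses to <noun phrase><noun phrase>... under
    # to_baseline, so it is discarded when building the baseline.
    return all(not t.startswith('"') for t in template)

baseline = to_baseline(("book", '"author"', "person"))
# baseline == ("noun phrase", '"author"', "noun phrase")
# is_too_general(("country", "profession", "person")) is True, so that
# template would be dropped rather than handicap the baseline.
```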
As shown in Table 12, K-nom has high accuracy across all the relations in comparison to the baseline. K-nom also achieves high recall for some of the relations. The reason K-nom yields low recall for other relations is probably that, while those relations are occasionally expressed using compound nouns, they are more commonly expressed in other forms, such as with verbs.
We have incorporated K-nom into the NELL reading software. A screenshot of extracting relations from compound nouns is shown in Figure 5.
3.2 Related Work
Much of the work on information extraction has focused on extracting relations expressed by verb phrases that occur between pairs of noun phrases. Extracting knowledge base relations from noun phrases alone has been much less explored. In [Yahya, Whang, Gupta, HalevyYahya et al.2014], a method is developed that learns noun phrase structure for open information extraction. This differs from our work in that we extract knowledge base relations as opposed to performing open information extraction; accordingly, the authors do not ground their extracted attributes in an external knowledge base.
The work of [Choi, Kwiatkowski, ZettlemoyerChoi et al.2015] developed a semantic parser for extracting relations from noun phrases. An input noun phrase is first transformed into a logical form, an intermediate unambiguous representation of the noun phrase. The logical form is chosen such that it closely matches the linguistic structure of the input noun phrase. The logical form is then transformed into one that, where possible, uses Freebase ontology predicates [Bollacker, Evans, Paritosh, Sturge, TaylorBollacker et al.2008]. These predicates can then be read off as relations expressed about the entities described by the noun phrase. The authors test their work on Wikipedia category names. Since each Wikipedia category describes a set of entities, by extracting relations from a category name, one learns relations about all the members of the category. Consider the Wikipedia category Symphonic Poems by Jean Sibelius. An example of the knowledge base transformed logical form for this category name would be:
λx. genre(x, Symphonic poem) ∧ composer(x, Jean Sibelius),
where one can now extract attributes for the entities, such as The Bard, Finlandia, Pohjola’s Daughter, En Saga, Spring Song, Tapiola, …, that fall under this category: in particular, that for all x in this category, genre(x, Symphonic poem) and composer(x, Jean Sibelius) hold, where genre and composer are Freebase attributes. In generating the logical forms, several features are used that capture some background knowledge, in particular, a number of features that enable soft type checking on the produced logical form, and features that test agreement of these types on different parts of the produced logical form.
In a related but different line of work, the NomBank project [Meyers, Reeves, Macleod, Szekely, Zielinska, Young, GrishmanMeyers et al.2004, Gerber ChaiGerber Chai2010] annotated the argument structures of common nouns. For example, from the expression “Greenspan’s replacement Ben Bernanke”, the arguments of the nominal “replacement” are: “Ben Bernanke” is ARG0 and “Greenspan” is ARG1. The resulting annotations have been used as training data for work on semantic role labeling of nominals [Jiang NgJiang Ng2006, Liu NgLiu Ng2007]. Again, this work differs from ours in that no knowledge base relations are extracted.
There has also been work on the broader topic of the semantic structure of noun phrases. In [Sawai, Shindo, MatsumotoSawai et al.2015], a method is proposed that parses noun phrases into the Abstract Meaning Representation in order to detect argument structures and noun-noun relations in compound nouns.
3.3 Compound Noun Analysis Summary
We have presented a knowledge-aware method for relation extraction from compound nouns. Our method uses semantic types of concepts in compound noun sequences to predict relations expressed by novel compound noun sequences containing concepts that we have not seen before. This method can be seen as another example of tapping into a positive feedback loop for machine reading made possible by projects that construct large-scale knowledge bases. Compound nouns are non-trivial to interpret in many ways beyond the noun-noun relations problem we addressed here. For example, one problem that could benefit from background knowledge is that of analyzing the internal structure of noun phrases through bracketing [Vadas CurranVadas Curran2007, Vadas CurranVadas Curran2008]. For instance, in the noun phrase (lung cancer) deaths, the task would be to determine that lung cancer modifies the head deaths. Additionally, as future work we can increase our predicate vocabulary to learn more commonsense relations from compound nouns; for example, from the noun phrase cooking pot, we can extract the relation purpose, to mean that the pot is used for cooking.
4 Related Work
Though few, there have been other approaches that make use of knowledge bases for machine reading. [Krishnamurthy MitchellKrishnamurthy Mitchell2014] introduced a method for training a joint syntactic and semantic parser. Their parser makes use of a knowledge base to produce logical forms containing knowledge base predicates. However, their use of the knowledge base is limited to unary predicates for determining semantic types of concepts. In contrast, in this paper we make extensive use of a knowledge base augmented by linguistic resources and corpus statistics. This results in a huge collection of world knowledge that our knowledge-aware methods have access to at inference time.
Understanding a piece of text requires both background knowledge and context. Our focus in this paper is on background knowledge. Recurrent neural networks (RNNs), including long short-term memory networks (LSTMs), applied to language understanding focus on managing memory to enable the model to store and retrieve context. For example, RNNs have been applied to the task of answering queries about self-contained synthetic short stories, and to the task of language modeling, where the goal is to predict the next word(s) in a text sequence given the previous words [Mikolov, Karafiát, Burget, Cernocký, KhudanpurMikolov et al.2010, Sundermeyer, Schlüter, NeySundermeyer et al.2012, Weston, Chopra, BordesWeston et al.2015, Sukhbaatar, Weston, Fergus, et al.Sukhbaatar et al.2015].
These tasks are treated as instances of sequence processing, and words are stored in memory as they are read. To retrieve relevant memories, smooth lookups are performed, whereby each memory is scored for its relevance; this may not scale well to our case, where an entire knowledge base is considered. A notable exception in this line of work is the approach of [Weston, Chopra, BordesWeston et al.2015], which introduced memory networks combining RNN inference with a long-term memory. One of their experiments was performed on a question answering task that requires background knowledge in the form of statements stored as (subject, relation, object) triples. However, this setting is not a machine reading task but an information retrieval task, since no reading is required to answer the questions; instead, lookups are performed to find the triples most relevant to the question.
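The smooth lookup mentioned above can be sketched as scoring every stored memory against the query and normalizing the scores with a softmax, so that every memory receives some attention weight. The toy vectors below are assumptions; real systems learn the embeddings.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smooth_lookup(query, memories):
    """Return softmax relevance weights over all stored memories."""
    scores = [dot(query, m) for m in memories]
    mx = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

memories = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
weights = smooth_lookup([1.0, 0.0], memories)
# The first memory gets the largest weight, but every memory gets a
# nonzero weight, which is why this scheme scales poorly when the
# "memory" is an entire knowledge base rather than a short story.
```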
5 Discussion

In this paper, we have presented results that illustrate that background knowledge is useful for machine reading. Going forward, there are still some open questions: (i) have we captured all types of knowledge? (ii) is our coverage of the current knowledge comprehensive? (iii) what other natural language understanding tasks could benefit from background knowledge? (iv) how does context interact with background knowledge?
5.1 Knowledge Breadth
Have we captured all types of knowledge required to do inference for language understanding? Currently, some types of knowledge are not covered by our knowledge sources. For example, knowledge about actions is lacking in both the subject-verb-object corpus statistics and the ontological knowledge bases: consider commonsense knowledge about the actions a person can perform as opposed to the actions an animal can perform. A person can cook, sing, and write a book, while an animal cannot. Additionally, our sources lack commonsense knowledge pertaining to sound, for example which sounds are typical of which scenes. Our sources cannot tell us whether loud music is more likely to be heard in a bar than in a hospital. Our knowledge sources also lack spatial information. For example, they cannot tell us that a street can be found in a city but not inside a car or a building. When these voids are filled, our methods can yield even greater performance gains and become applicable to more language understanding tasks.
5.2 Knowledge Density
Knowledge found in knowledge bases is tied to a formal ontological representation, and is therefore highly suited to the kind of reasoning performed in machine reading. However, the mechanisms for building knowledge bases still have coverage limitations. For example, the NELL knowledge graph contains 1.34 facts per entity [Hegde TalukdarHegde Talukdar2015]. This knowledge sparsity curtails the performance gains we can obtain from knowledge-aware reading methods. To mitigate this problem, in this paper we augmented the ontological knowledge with corpus statistics consisting of subject-verb-object triples. While the corpus statistics have broad coverage, they are noisy and riddled with ambiguity that can negatively impact performance. For that reason, we believe our approach will benefit from improvements in the coverage of the much cleaner ontological knowledge found in knowledge bases.
5.3 Other Natural Language Understanding Tasks
In this paper we focused on the tasks of prepositional phrase attachment and compound noun analysis. We believe a variety of tasks in natural language understanding can benefit from background knowledge. In noun phrase segmentation, coordinator terms such as “and” and “or” introduce ambiguity. For example, in the sentence “my daughter likes cartoons, so every Friday we watched Tom and Jerry”, it is not clear whether “Tom and Jerry” denotes a single name or two. However, since the context refers to cartoons, we can use the knowledge that “Tom and Jerry” can refer to an animated film to make the correct segmentation. Co-reference resolution is still a difficult problem in natural language understanding, because there are often many candidate mentions that can co-refer. With relevant background knowledge, some of those candidates can be ruled out, thereby improving the accuracy of co-reference resolution. Consider the sentence “The bee landed on the flower because it wanted pollen.” If we know that bees feed on pollen, we can correctly determine that “it” refers to the bee and not the flower. In negation detection, consider the sentence: “Things would be different if Microsoft was headquartered in Texas.” From this sentence alone, a machine reading program might incorrectly extract a relationship that Microsoft is headquartered in Texas. But with the prior knowledge that Microsoft was never headquartered in Texas, we might better detect the negation, in addition to using syntactic cues such as “if”. One direction for future work is to develop knowledge-aware machine reading methods for these additional tasks.
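The “Tom and Jerry” segmentation decision can be sketched as a KB lookup gated by discourse context: if the whole coordinated span is a known entity whose type fits the context, keep it as one segment; otherwise split on the coordinator. The KB entry and type names below are hypothetical stand-ins for a real knowledge base.

```python
# Toy stand-in for a knowledge base mapping surface forms to semantic types.
KB = {"tom and jerry": {"animatedfilm"}}

def segment_coordination(span, context_types):
    """Return one segment if the span is a known entity fitting the context,
    otherwise split the coordination on "and"."""
    key = span.lower()
    if key in KB and KB[key] & context_types:
        return [span]  # a single named entity, e.g. an animated film
    return [p.strip() for p in span.split(" and ")]

segment_coordination("Tom and Jerry", {"animatedfilm"})  # ["Tom and Jerry"]
segment_coordination("Alice and Bob", {"person"})        # ["Alice", "Bob"]
```

The context type set would in practice come from the preceding discourse (here, the mention of cartoons).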
5.4 Context vs. Background Knowledge
Understanding a piece of writing requires drawing not only upon background knowledge, but also upon discourse context. Instead of reading each sentence of a document as a self-contained unit, a machine reading program needs to keep track of what has been stated in preceding sentences. This is useful for dealing with basic language phenomena such as entity co-reference, but also for keeping track of concepts already mentioned. Consider the sentence “John saw the girl with the binoculars”. In the absence of context, the likely interpretation is that John used the binoculars to see the girl. However, if context suggests that there is a girl in possession of binoculars, the interpretation of the sentence changes. In the current work, we completely ignore context. Therefore, one direction for future work is to explore how background knowledge interacts with context.
We are at a time when high-impact technologies call for effective language understanding algorithms: robotics, mobile phone voice assistants, and entertainment systems software. With knowledge-aware machine reading, we have the ambitious goal of exploiting advances in knowledge engineering to push natural language understanding systems toward human-level performance.
Lastly, exploiting the success in building large machine learning models consisting of millions of parameters will likely further improve results produced by our approach. This is due to the ability of such models to establish non-trivial connections between different pieces of evidence.
This research was supported by DARPA under contract number FA8750-13-2-0005. Any opinions, findings, conclusions and recommendations expressed in this paper are the authors’ and do not necessarily reflect those of the sponsor.
- [Agirre, Baldwin, MartinezAgirre et al.2008] Agirre, E., Baldwin, T., Martinez, D. 2008. Improving parsing and PP attachment performance with sense information In Proceedings of ACL-08: HLT, 317–325.
- [Altmann SteedmanAltmann Steedman1988] Altmann, G. Steedman, M. 1988. Interaction with context during human sentence processing Cognition, 30, 191–238.
- [Anguiano CanditoAnguiano Candito2011] Anguiano, E. H. Candito, M. 2011. Parse correction with specialized models for difficult attachment types In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP, 1222–1233.
- [Atkeson SchaalAtkeson Schaal1995] Atkeson, C. G. Schaal, S. 1995. Memory-based neural networks for robot learning Neurocomputing, 9(3), 243–269.
- [Atterer SchützeAtterer Schütze2007] Atterer, M. Schütze, H. 2007. Prepositional phrase attachment without oracles Computational Linguistics, 33(4), 469–476.
- [Auer, Bizer, Kobilarov, Lehmann, Cyganiak, IvesAuer et al.2007] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z. G. 2007. Dbpedia: A nucleus for a web of open data In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007., 722–735.
- [Banko, Cafarella, Soderland, Broadhead, EtzioniBanko et al.2007] Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., Etzioni, O. 2007. Open information extraction for the web In IJCAI, 7, 2670–2676.
- [Bollacker, Evans, Paritosh, Sturge, TaylorBollacker et al.2008] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J. 2008. Freebase: A collaboratively created graph database for structuring human knowledge In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, 1247–1250.
- [Brill ResnikBrill Resnik1994] Brill, E. Resnik, P. 1994. A rule-based approach to prepositional phrase attachment disambiguation In 15th International Conference on Computational Linguistics, COLING, 1198–1204.
- [Carlson, Betteridge, Wang, Hruschka, MitchellCarlson et al.2010] Carlson, A., Betteridge, J., Wang, R. C., Hruschka, Jr., E. R., Mitchell, T. M. 2010. Coupled semi-supervised learning for information extraction In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, 101–110.
- [Choi, Kwiatkowski, ZettlemoyerChoi et al.2015] Choi, E., Kwiatkowski, T., Zettlemoyer, L. S. 2015. Scalable semantic parsing with partial ontologies In Association for Computational Linguistics (ACL).
- [Collins BrooksCollins Brooks1995] Collins, M. Brooks, J. 1995. Prepositional phrase attachment through a backed-off model In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 27–38.
- [de Marneffe, MacCartney, Manningde Marneffe et al.2006] de Marneffe, M.-C., MacCartney, B., Manning, C. D. 2006. Generating typed dependency parses from phrase structure parses In Proceedings of the International Conference on Language Resources and Evaluation (LREC), 449–454.
- [Del Corro GemullaDel Corro Gemulla2013] Del Corro, L. Gemulla, R. 2013. Clausie: Clause-based open information extraction In Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13, 355–366.
- [Dempster, Laird, RubinDempster et al.1977] Dempster, A. P., Laird, N. M., Rubin, D. B. 1977. Maximum likelihood from incomplete data via the em algorithm Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
- [Fader, Soderland, EtzioniFader et al.2011a] Fader, A., Soderland, S., Etzioni, O. 2011a. Identifying relations for open information extraction In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, 1535–1545.
- [Fader, Soderland, EtzioniFader et al.2011b] Fader, A., Soderland, S., Etzioni, O. 2011b. Identifying relations for open information extraction In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1535–1545. Association for Computational Linguistics.
- [FellbaumFellbaum1998] Fellbaum, C.. 1998. WordNet: an electronic lexical database. MIT Press.
- [FrazierFrazier1978] Frazier, L. 1978. On comprehending sentences: Syntactic parsing strategies. Ph.D. thesis, University of Connecticut.
- [Gerber ChaiGerber Chai2010] Gerber, M. Chai, J. Y. 2010. Beyond nombank: A study of implicit arguments for nominal predicates In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, 1583–1592.
- [GravesGraves2013] Graves, A. 2013. Generating sequences with recurrent neural networks arXiv preprint arXiv:1308.0850.
- [Harabagiu PascaHarabagiu Pasca1999] Harabagiu, S. M. Pasca, M. 1999. Integrating symbolic and statistical methods for prepositional phrase attachment In Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, FLAIRS, 303–307.
- [Hegde TalukdarHegde Talukdar2015] Hegde, M. Talukdar, P. P. 2015. An entity-centric approach for overcoming knowledge graph sparsity In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 530–535.
- [Hindle RoothHindle Rooth1993] Hindle, D. Rooth, M. 1993. Structural ambiguity and lexical relations Computational Linguistics, 19(1), 103–120.
- [Hochreiter SchmidhuberHochreiter Schmidhuber1997] Hochreiter, S. Schmidhuber, J. 1997. Long short-term memory Neural computation, 9(8), 1735–1780.
- [Hovy, Vaswani, Tratz, Chiang, HovyHovy et al.2011] Hovy, D., Vaswani, A., Tratz, S., Chiang, D., Hovy, E. 2011. Models and training for unsupervised preposition sense disambiguation In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, 323–328.
- [Jiang NgJiang Ng2006] Jiang, Z. P. Ng, H. T. 2006. Semantic role labeling of nombank: A maximum entropy approach In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 138–145. Association for Computational Linguistics.
- [KimballKimball1988] Kimball, J. 1988. Seven principles of surface structure parsing in natural language Cognition, 2, 15–47.
- [Kipper, Korhonen, Ryant, PalmerKipper et al.2008] Kipper, K., Korhonen, A., Ryant, N., Palmer, M. 2008. A large-scale classification of english verbs Language Resources and Evaluation, 42(1), 21–40.
- [Klein ManningKlein Manning2003] Klein, D. Manning, C. D. 2003. Accurate unlexicalized parsing In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics,ACL, 423–430.
- [Krishnamurthy MitchellKrishnamurthy Mitchell2014] Krishnamurthy, J. Mitchell, T. M. 2014. Joint syntactic and semantic parsing with combinatory categorial grammar In ACL.
- [Lao, Mitchell, CohenLao et al.2011] Lao, N., Mitchell, T., Cohen, W. W. 2011. Random walk inference and learning in a large scale knowledge base In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 529–539. Association for Computational Linguistics.
- [Liu NgLiu Ng2007] Liu, C. Ng, H. T. 2007. Learning predictive structures for semantic role labeling of nombank In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.
- [Meyers, Reeves, Macleod, Szekely, Zielinska, Young, GrishmanMeyers et al.2004] Meyers, A., Reeves, R., Macleod, C., Szekely, R., Zielinska, V., Young, B., Grishman, R. 2004. Annotating noun argument structure for nombank In LREC. European Language Resources Association.
- [Mikolov, Karafiát, Burget, Cernocký, KhudanpurMikolov et al.2010] Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., Khudanpur, S. 2010. Recurrent neural network based language model In 11th Annual Conference of the International Speech Communication Association, (INTERSPEECH).
- [Mintz, Bills, Snow, JurafskyMintz et al.2009] Mintz, M., Bills, S., Snow, R., Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, 1003–1011. Association for Computational Linguistics.
- [Mitchell LapataMitchell Lapata2008] Mitchell, J. Lapata, M. 2008. Vector-based models of semantic composition In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL, 236–244.
- [MitchellMitchell2015] Mitchell, T. M. 2015. Never-ending learning In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., 2302–2310.
- [Nakashole MitchellNakashole Mitchell2014] Nakashole, N. Mitchell, T. M. 2014. Language-aware truth assessment of fact candidates In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, 1009–1019.
- [Nakashole MitchellNakashole Mitchell2015] Nakashole, N. Mitchell, T. M. 2015. A knowledge-intensive model for prepositional phrase attachment In ACL (1), 365–375. The Association for Computer Linguistics.
- [Nakashole, Theobald, WeikumNakashole et al.2011] Nakashole, N., Theobald, M., Weikum, G. 2011. Scalable knowledge harvesting with high precision and high recall In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, 227–236.
- [Nakashole, Tylenda, WeikumNakashole et al.2013] Nakashole, N., Tylenda, T., Weikum, G. 2013. Fine-grained semantic typing of emerging entities In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL, 1488–1497.
- [Nakashole WeikumNakashole Weikum2012] Nakashole, N. Weikum, G. 2012. Real-time population of knowledge bases: opportunities and challenges In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, 41–45. Association for Computational Linguistics.
- [Nigam, McCallum, Thrun, MitchellNigam et al.2000] Nigam, K., McCallum, A., Thrun, S., Mitchell, T. M. 2000. Text classification from labeled and unlabeled documents using EM Machine Learning, 39(2/3), 103–134.
- [Pantel LinPantel Lin2000] Pantel, P. Lin, D. 2000. An unsupervised approach to prepositional phrase attachment using contextually similar words In 38th Annual Meeting of the Association for Computational Linguistics, ACL.
- [RatnaparkhiRatnaparkhi1998] Ratnaparkhi, A. 1998. Statistical models for unsupervised prepositional phrase attachment In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL, 1079–1085.
- [Ratnaparkhi, Reynar, RoukosRatnaparkhi et al.1994] Ratnaparkhi, A., Reynar, J., Roukos, S. 1994. A maximum entropy model for prepositional phrase attachment In Proceedings of the Workshop on Human Language Technology, HLT ’94, 250–255.
- [Sawai, Shindo, MatsumotoSawai et al.2015] Sawai, Y., Shindo, H., Matsumoto, Y. 2015. Semantic structure analysis of noun phrases using abstract meaning representation In ACL (2), 851–856.
- [Schenk AbelsonSchenk Abelson1975] Schank, R. C. Abelson, R. P. 1975. Scripts, plans, and knowledge In IJCAI, 151–157.
- [Srikumar RothSrikumar Roth2013] Srikumar, V. Roth, D. 2013. Modeling semantic relations expressed by prepositions TACL, 1, 231–242.
- [Stetina NagaoStetina Nagao1997] Stetina, J. Nagao, M. 1997. Prepositional phrase attachment through a backed-off model In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 66–80.
- [Suchanek, Kasneci, WeikumSuchanek et al.2007] Suchanek, F. M., Kasneci, G., Weikum, G. 2007. Yago: a core of semantic knowledge In Proceedings of the 16th international conference on World Wide Web, 697–706. ACM.
- [Sukhbaatar, Weston, Fergus, et al.Sukhbaatar et al.2015] Sukhbaatar, S., Weston, J., Fergus, R., et al. 2015. End-to-end memory networks In Advances in Neural Information Processing Systems, 2431–2439.
- [Sundermeyer, Schlüter, NeySundermeyer et al.2012] Sundermeyer, M., Schlüter, R., Ney, H. 2012. LSTM neural networks for language modeling In INTERSPEECH, 194–197.
- [SurdeanuSurdeanu2013] Surdeanu, M. 2013. Overview of the TAC2013 knowledge base population evaluation: English slot filling and temporal slot filling In TAC. NIST.
- [Toutanova, Manning, NgToutanova et al.2004] Toutanova, K., Manning, C. D., Ng, A. Y. 2004. Learning random walk models for inducing word dependency distributions In Machine Learning, Proceedings of the Twenty-first International Conference, ICML.
- [Vadas CurranVadas Curran2007] Vadas, D. Curran, J. R. 2007. Adding noun phrase structure to the Penn Treebank In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.
- [Vadas CurranVadas Curran2008] Vadas, D. Curran, J. R. 2008. Parsing noun phrase structure with CCG In ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, 335–343.
- [van Herwijnen, van den Bosch, Terken, Marsivan Herwijnen et al.2003] van Herwijnen, O., van den Bosch, A., Terken, J. M. B., Marsi, E. 2003. Learning PP attachment for filtering prosodic phrasing In 10th Conference of the European Chapter of the Association for Computational Linguistics, EACL, 139–146.
- [Wehbe, Vaswani, Knight, MitchellWehbe et al.2014] Wehbe, L., Vaswani, A., Knight, K., Mitchell, T. M. 2014. Aligning context-based statistical models of language with brain activity during reading In EMNLP, 233–243. ACL.
- [Weston, Chopra, BordesWeston et al.2015] Weston, J., Chopra, S., Bordes, A. 2015. Memory networks In International Conference on Learning Representations, ICLR.
- [Whittemore, Ferrara, BrunnerWhittemore et al.1990] Whittemore, G., Ferrara, K., Brunner, H. 1990. Empirical study of predictive powers of simple attachment schemes for post-modifier prepositional phrases In 28th Annual Meeting of the Association for Computational Linguistics, ACL, 23–30.
- [Wijaya, Nakashole, MitchellWijaya et al.2014] Wijaya, D., Nakashole, N., Mitchell, T. 2014. CTPs: Contextual temporal profiles for time scoping facts via entity state change detection In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- [Yahya, Whang, Gupta, HalevyYahya et al.2014] Yahya, M., Whang, S., Gupta, R., Halevy, A. Y. 2014. Renoun: Fact extraction for nominal attributes In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 325–335. ACL.
- [Zhao LinZhao Lin2004] Zhao, S. Lin, D. 2004. A nearest-neighbor method for resolving PP-attachment ambiguity In Natural Language Processing - First International Joint Conference, IJCNLP, 545–554.