ASER: A Large-scale Eventuality Knowledge Graph

05/01/2019 ∙ by Hongming Zhang, et al. ∙ The Hong Kong University of Science and Technology

Understanding human language requires complex world knowledge. However, existing large-scale knowledge graphs mainly focus on knowledge about entities while ignoring knowledge about activities, states, or events, which describe how entities or things act in the real world. To fill this gap, we develop ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more than 11 billion tokens of unstructured textual data. ASER contains 15 relation types belonging to five categories, 194 million unique eventualities, and 64 million unique edges among them. Both human and extrinsic evaluations demonstrate the quality and effectiveness of ASER.




1 Introduction

In his conceptual semantics theory, Ray Jackendoff, a Rumelhart Prize winner (Footnote 1: The David E. Rumelhart Prize is awarded for contributions to the theoretical foundations of human cognition.), describes semantic meaning as ‘a finite set of mental primitives and a finite set of principles of mental combination’ [Jackendoff 1990]. The primitive units of semantic meanings include Thing (or Object), Activity, State, Event, Place, Path, Property, Amount, etc. (Footnote 2: In his original book, he called Activity ‘Action’. But given the later definitions and terminologies we adopted [Mourelatos 1978, Bach 1986], it means Activity. The difference between an activity and an event is that an event is defined as an occurrence that is inherently countable [Mourelatos 1978]. For example, ‘The coffee machine brews a cup of coffee once more’ is an event because it admits a countable noun ‘a cup’ and a cardinal count adverbial ‘once’, while ‘The coffee machine brews coffee’ is not an event because it has an imperfective aspect, which is not countable.) Understanding the semantics related to the world requires the understanding of these units and their relations. Traditionally, linguists and domain experts built knowledge graphs (KGs) to formalize these units and enumerate their categories (or senses) and relations. (Footnote 3: Traditionally, people used the term knowledge base to describe a database containing human knowledge. In 2012, Google released its knowledge graph, where vertices and edges in a knowledge base are emphasized. We discuss in the context of the knowledge graph, as our knowledge is also constructed as a complex graph.) Typical KGs include WordNet [Fellbaum 1998] for words, FrameNet [Baker et al. 1998] for events, and Cyc [Lenat and Guha 1989] for commonsense knowledge. However, their small scales restricted their usage in real-world applications.

Nowadays, with the growth of Web contents, computational power, and the availability of crowdsourcing platforms, many modern and large-scale KGs, such as Freebase [Bollacker et al. 2008], KnowItAll [Etzioni et al. 2004], TextRunner [Banko et al. 2007], YAGO [Suchanek et al. 2007], DBpedia [Auer et al. 2007], NELL [Carlson et al. 2010], Probase [Wu et al. 2012], and Google Knowledge Vault [Dong et al. 2014], have been built based on semi-automatic mechanisms. Most of these KGs are designed and constructed based on Things or Objects, such as instances and their concepts, named entities and their categories, as well as their properties and relations. On top of them, many semantic understanding problems such as question answering [Berant et al. 2013] can be supported by grounding natural language texts on knowledge graphs, e.g., asking a bot for the nearest restaurants for lunch. Nevertheless, these KGs may fall short in circumstances that require not only knowledge about Things or Objects, but also knowledge about Activities, States, and Events. Consider the following utterance that a human might say to the bot: ‘I am hungry’, which may also imply a need for a restaurant recommendation. This, however, will not be possible unless the bot is able to identify that the consequence of being hungry would be ‘having lunch’ at noon.

In this paper, we propose an approach to discovering useful real-world knowledge about Activities (e.g., ‘I sleep’), States (e.g., ‘I am hungry’), Events (e.g., ‘I make a call’), and their Relations (e.g., ‘I am hungry’ may result in ‘I have lunch’), which we call ASER. In fact, Activities, States, and Events, which are expressed by verb-related clauses, are all eventualities following the commonly adopted terminology and categorization proposed by Mourelatos [Mourelatos 1978] and Bach [Bach 1986]. Thus, ASER is essentially an eventuality-centric knowledge graph.

For eventualities, traditional extraction approaches used in natural language processing based on FrameNet [Baker et al. 1998] or ACE [NIST 2005] first define complex structures of events by enumerating triggers with senses and arguments with roles. They then learn from limited annotated examples and try to generalize to other text contents. However, detecting trigger senses and argument roles suffers from the ambiguity and variability of the semantic meanings of words. For example, using the ACE training data, the current state-of-the-art system can only achieve about 40% overall F1 score with 33 event types [Li et al. 2013]. Different from them, we extract eventuality-centric knowledge based on dependency grammar, since English syntax is relatively fixed and consistent across domains and topics. Instead of defining complex triggers and role structures of events, we simply use syntactic patterns to extract all possible eventualities. We do not distinguish between semantic senses or categories of particular triggers or arguments in eventualities but treat all extracted words with their dependency relations as a hyperedge in a graph to define an eventuality.

For eventuality relations, we use the definition used in PDTB [Prasad et al. 2007]. In PDTB, the relations are defined between two sentences or clauses. In ASER, we focus on relations between two eventualities, which are defined with simple but semantically complete patterns. Moreover, as shown in PDTB, some connectives, e.g., ‘and’ and ‘but’, are less ambiguous than others, e.g., ‘while’. Thus, we use less ambiguous connectives as seed connectives to find initial relations and then bootstrap the eventuality relation extraction using large corpora.

Figure 1: ASER Demonstration. Eventualities are connected with weighted directed edges. Each eventuality is a dependency graph.

In ASER, we have extracted 194 million unique eventualities. After bootstrapping, ASER contains 64 million edges among eventualities. One example of ASER is shown in Figure 1. Table 1 provides a size comparison between ASER and existing eventuality-related (or simply verb-related) knowledge bases. Essentially, they are not as large as modern knowledge graphs and are inadequate for capturing the richness and complexity of eventualities and their relations. FrameNet [Baker et al. 1998] is considered the earliest knowledge base defining events and their relations. It provides annotations about relations among about 1,000 human-defined eventuality frames, which contain 27,691 eventualities. However, given the fine-grained definition of frames, the scale of the annotations is limited. ACE [NIST 2005] (and its follow-up evaluation TAC-KBP [Aguilar et al. 2014]) reduces the number of event types and annotates more examples for each event type. PropBank [Palmer et al. 2005] and NomBank [Meyers et al. 2004] build frames over syntactic parse trees and focus on annotating popular verbs and nouns. TimeBank focuses only on temporal relations between verbs [Pustejovsky et al. 2003]. While the aforementioned knowledge bases are annotated by domain experts, OMCS [Singh et al. 2002], Event2Mind [Smith et al. 2018], ProPora [Dalvi et al. 2018], and ATOMIC [Sap et al. 2018] leveraged crowdsourcing platforms or the general public to annotate commonsense knowledge about eventualities, in particular the relations among them. Furthermore, Knowlywood [Tandon et al. 2015] uses semantic parsing to extract activities (verb+object) from movie/TV scenes and novels and builds four types of relations (parent, previous, next, similarity) between activities using inference rules. Compared with all these eventuality-related KGs, ASER is larger by one or more orders of magnitude in terms of the numbers of eventualities and relations it contains. (Footnote 4: Some of the eventualities are not connected with others, but the frequency of an eventuality is also valuable for downstream tasks. One example is the coreference resolution task. Given the sentence ‘The dog is chasing the cat, it barks loudly’, we can correctly resolve ‘it’ to ‘dog’ rather than ‘cat’ because ‘dog barks’ appears 12,247 times in ASER, while ‘cat barks’ never appears.)

Resource | # Eventuality | # Relation | # R Types
FrameNet [Baker et al. 1998] | 27,691 | 1,709 | 7
ACE [Aguilar et al. 2014] | 3,290 | 0 | 0
PropBank [Palmer et al. 2005] | 112,917 | 0 | 0
NomBank [Meyers et al. 2004] | 114,576 | 0 | 0
TimeBank [Pustejovsky et al. 2003] | 7,571 | 8,242 | 1
OMCS [Singh et al. 2002] | 100,008 | 158,166 | 4
Event2Mind [Smith et al. 2018] | 24,716 | 57,097 | 3
ProPora [Dalvi et al. 2018] | 2,406 | 16,269 | 1
ATOMIC [Sap et al. 2018] | 309,515 | 877,108 | 9
Knowlywood [Tandon et al. 2015] | 964,758 | 2,644,415 | 4
ASER (core) | 27,565,673 | 10,361,178 | 15
ASER (full) | 194,000,677 | 64,351,959 | 15
Table 1: Size comparison of ASER and existing eventuality-related resources. # Eventuality, # Relation, and # R Types are the number of eventualities, the number of relations between these eventualities, and the number of relation types. For KGs containing knowledge about both entities and eventualities, we report the statistics for the eventuality subset. ASER (core) filters out eventualities that appear only once and thus has better accuracy, while ASER (full) covers more knowledge.

In summary, our contributions are as follows.

Definition of ASER. We define a brand new KG where the primitive units of semantics are eventualities. We organize our KG as a relational graph of hyperedges. Each eventuality instance is a hyperedge connecting several vertices, which are words. A relation between two eventualities in our KG represents one of the 14 relation types defined in PDTB [Prasad et al. 2007] or a co-occurrence relation.

Scalable Extraction of ASER. We perform eventuality extraction over large-scale corpora. We designed several high-quality patterns based on dependency parsing results and extract all eventualities that match these patterns. We use unambiguous connectives obtained from PDTB to find seed relations among eventualities. Then we leverage a neural bootstrapping framework to extract more relations from the unstructured textual data.

Inference over ASER. We also provide several ways of inference over ASER. We show that both eventuality and relation retrieval over one-hop or multi-hop relations can be modeled as conditional probability inference problems.

Evaluation and Applications of ASER. We conduct both human and extrinsic evaluations to validate the quality and effectiveness of ASER. For human evaluation, we sample instances of extracted knowledge in ASER over iterations and submit them to Amazon Mechanical Turk (AMT) for human workers to verify. For extrinsic evaluation, we use the Winograd Schema Challenge [Levesque et al. 2011] to test whether ASER can effectively address the language understanding problem, and a dialogue generation task to demonstrate the effect of using ASER for the language generation problem. The results of both evaluations show that ASER is a promising large-scale KG with great potential. The proposed ASER and supporting packages are available at:

2 Overview of ASER

Each eventuality in ASER is represented by a set of words, where the number of words varies from one eventuality to another. Thus, we cannot use a traditional graph representation such as triplets to represent knowledge in ASER. We devise the formal definition of our ASER KG as below.

Definition 1

ASER KG is a hybrid graph $\mathcal{H}$ of eventualities $E$. Each eventuality $E$ is a hyperedge linking to a set of vertices $v$. Each vertex $v$ is a word in the vocabulary. We define $v \in \mathcal{V}$ in the vertex set and $E \in \mathcal{E}$ in the hyperedge set. $\mathcal{E} \subseteq \mathcal{P}(\mathcal{V})$ is a subset of the power set of $\mathcal{V}$. We also define a relation $R_{i,j} \in \mathcal{R}$ between two eventualities $E_i$ and $E_j$, where $\mathcal{R}$ is the relation set. Each relation has a type $T \in \mathcal{T}$, where $\mathcal{T}$ is the type set. Overall, we have ASER KG $\mathcal{H} = \{\mathcal{V}, \mathcal{E}, \mathcal{R}, \mathcal{T}\}$.

ASER KG is a hybrid graph combining a hypergraph $\{\mathcal{V}, \mathcal{E}\}$, where each hyperedge is constructed over vertices, and a traditional graph $\{\mathcal{E}, \mathcal{R}\}$, where each edge is built among eventualities. For example, $E_1$=(I, am, hungry) and $E_2$=(I, eat, anything) are eventualities, where we omit the internal dependency structures for brevity. They have a relation $R_{1,2}$=Result, where Result is the relation type.
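The hybrid structure can be made concrete with a small data-structure sketch. This is an illustration only, not the released ASER package: the class names `Eventuality` and `AserGraph`, the index-based dependency triples, and the dependency labels chosen for the two example eventualities are all assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Eventuality:
    """A hyperedge over word vertices plus its internal dependency edges."""
    words: tuple                 # e.g. ("I", "am", "hungry")
    dependencies: tuple = ()     # (head_index, relation, dependent_index)


@dataclass
class AserGraph:
    """Hypothetical container for the vertex, hyperedge and relation sets."""
    vocabulary: set = field(default_factory=set)      # word vertices
    eventualities: set = field(default_factory=set)   # hyperedges
    relations: dict = field(default_factory=dict)     # (head, tail, type) -> weight

    def add_eventuality(self, ev):
        self.vocabulary.update(ev.words)
        self.eventualities.add(ev)

    def add_relation(self, head, tail, rel_type, weight=1.0):
        # Edges of the traditional graph connect whole eventualities.
        self.add_eventuality(head)
        self.add_eventuality(tail)
        key = (head, tail, rel_type)
        self.relations[key] = self.relations.get(key, 0.0) + weight


graph = AserGraph()
e1 = Eventuality(("I", "am", "hungry"), ((2, "nsubj", 0), (2, "cop", 1)))
e2 = Eventuality(("I", "eat", "anything"), ((1, "nsubj", 0), (1, "dobj", 2)))
graph.add_relation(e1, e2, "Result")
```

Because each eventuality is an immutable, hashable value, it can serve directly as an endpoint of a typed, weighted edge, which mirrors the two layers described above.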

Pattern | Code | Example
$n_1$-nsubj-$v_1$ | s-v | ‘The dog barks’
$n_1$-nsubj-$v_1$-dobj-$n_2$ | s-v-o | ‘I love you’
$n_1$-nsubj-$v_1$-xcomp-$a_1$ | s-v-a | ‘He felt ill’
$n_1$-nsubj-$v_1$(-iobj-$n_2$)-dobj-$n_3$ | s-v-o-o | ‘You give me the book’
$n_1$-nsubj-$a_1$-cop-$be$ | s-be-a | ‘The price is expensive’
$n_1$-nsubj-$n_2$-cop-$be$ | s-be-o | ‘He is a boy’
$n_1$-nsubj-$v_1$-xcomp-$a_1$-cop-$be$ | s-v-be-a | ‘I want to be slim’
$n_1$-nsubj-$v_1$-xcomp-$n_2$-cop-$be$ | s-v-be-o | ‘I want to be a hero’
$n_1$-nsubj-$v_1$-xcomp-$v_2$-dobj-$n_2$ | s-v-v-o | ‘I want to eat the apple’
$n_1$-nsubj-$v_1$-xcomp-$v_2$ | s-v-v | ‘I want to go’
($n_1$-nsubj-$a_1$-cop-$be$)-nmod-$n_2$-case-$p_1$ | s-be-a-p-o | ‘It’s cheap for the quality’
$n_1$-nsubj-$v_1$-nmod-$n_2$-case-$p_1$ | s-v-p-o | ‘He walks into the room’
($n_1$-nsubj-$v_1$-dobj-$n_2$)-nmod-$n_3$-case-$p_1$ | s-v-o-p-o | ‘He plays football with me’
$n_1$-nsubjpass-$v_1$ | spass-v | ‘The bill is paid’
$n_1$-nsubjpass-$v_1$-nmod-$n_2$-case-$p_1$ | spass-v-p-o | ‘The bill is paid by me’
Table 2: Selected eventuality patterns (‘v’ stands for normal verbs other than ‘be’, ‘be’ stands for ‘be’ verbs, ‘n’ stands for nouns, ‘a’ stands for adjectives, and ‘p’ stands for prepositions), their codes (to save space, we create a unique code for each pattern and use it in the rest of this paper), and the corresponding examples.

2.1 Eventuality

Different from named entities or concepts, which are noun phrases, eventualities are usually expressed as verb phrases, which are more complicated in structure. Our definition of eventualities is built upon the following two assumptions: (1) syntactic patterns of English are relatively fixed and consistent; (2) the eventuality’s semantic meaning is determined by the words it contains. To avoid the extracted eventualities being too sparse, we use words fitting certain patterns rather than a whole sentence to represent an eventuality. In addition, to make sure the extracted eventualities have complete semantics, we retain all necessary words extracted by patterns rather than those simple verbs or verb-object pairs in sentences. The selected patterns are shown in Table 2. For example, for the eventuality (dog, bark), we have a relation nsubj between the two words to indicate that there is a subject-of-a-verb relation in between. We now formally define an eventuality as follows.

Definition 2

An eventuality $E_i$ is a hyperedge linking multiple words $\{w_{i,1}, \ldots, w_{i,N_i}\}$, where $N_i$ is the number of words in eventuality $E_i$. Here, $w_{i,1}, \ldots, w_{i,N_i} \in \mathcal{V}$ are all in the vocabulary. A pair of words $(w_{i,j}, w_{i,k})$ in $E_i$ may follow a syntactic relation $e_{i,j,k}$.

We use patterns from dependency parsing to extract eventualities $E$ from unstructured large-scale corpora. Here $e_{i,j,k}$ is one of the relations that dependency parsing may return. Although recall is sacrificed in this way, our patterns are of high precision and we use very large corpora to extract as many eventualities as possible. This strategy is also shared with many other modern KGs [Etzioni et al. 2004, Banko et al. 2007, Carlson et al. 2010, Wu et al. 2012].

Relation | Explanation
<$E_1$, ‘Precedence’, $E_2$> | $E_1$ happens before $E_2$.
<$E_1$, ‘Succession’, $E_2$> | $E_1$ happens after $E_2$.
<$E_1$, ‘Synchronous’, $E_2$> | $E_1$ happens at the same time as $E_2$.
<$E_1$, ‘Reason’, $E_2$> | $E_1$ happens because $E_2$ happens.
<$E_1$, ‘Result’, $E_2$> | If $E_1$ happens, it will result in the happening of $E_2$.
<$E_1$, ‘Condition’, $E_2$> | Only when $E_2$ happens, can $E_1$ happen.
<$E_1$, ‘Contrast’, $E_2$> | $E_1$ and $E_2$ share a predicate or property and differ significantly on that property.
<$E_1$, ‘Concession’, $E_2$> | $E_1$ should result in the happening of $E_3$, but $E_2$ indicates that the opposite of $E_3$ happens.
<$E_1$, ‘Conjunction’, $E_2$> | $E_1$ and $E_2$ both happen.
<$E_1$, ‘Instantiation’, $E_2$> | $E_2$ is a more detailed description of $E_1$.
<$E_1$, ‘Restatement’, $E_2$> | $E_2$ restates the semantic meaning of $E_1$.
<$E_1$, ‘Alternative’, $E_2$> | $E_1$ and $E_2$ are alternative situations of each other.
<$E_1$, ‘ChosenAlternative’, $E_2$> | $E_1$ and $E_2$ are alternative situations of each other, but the subject prefers $E_1$.
<$E_1$, ‘Exception’, $E_2$> | $E_2$ is an exception of $E_1$.
<$E_1$, ‘Co-Occurrence’, $E_2$> | $E_1$ and $E_2$ appear in the same sentence.
Table 3: Eventuality relation types between two eventualities $E_1$ and $E_2$ and explanations.

2.2 Eventuality Relation

For relations among eventualities, as introduced in Section 1, we follow PDTB’s [Prasad et al. 2007] definition of relations between sentences or clauses but simplify it to eventualities. Following the CoNLL 2015 discourse parsing shared task [Xue et al. 2015], we select 14 discourse relation types and an additional co-occurrence relation to build our knowledge graph.

Definition 3

A relation $R_{i,j}$ between a pair of eventualities $E_i$ and $E_j$ has one of the following types $T \in \mathcal{T}$, and all types can be grouped into five categories: Temporal (including Precedence, Succession, and Synchronous), Contingency (including Reason, Result, and Condition), Comparison (including Contrast and Concession), Expansion (including Conjunction, Instantiation, Restatement, Alternative, ChosenAlternative, and Exception), and Co-Occurrence. The detailed definitions of these relation types are shown in Table 3. The weight of $R_{i,j}$ is defined as the number of times the tuple <$E_i$, $R_{i,j}$, $E_j$> appears in the whole corpora.
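The grouping of the 15 types into five categories, and the count-based weight, can be written down directly. The sketch below is illustrative: the dictionary name `CATEGORY_OF` and the toy list of observed tuples are assumptions, and eventualities are abbreviated to plain strings.

```python
from collections import Counter

# Relation type -> category, as listed in the definition above.
CATEGORY_OF = {
    "Precedence": "Temporal", "Succession": "Temporal", "Synchronous": "Temporal",
    "Reason": "Contingency", "Result": "Contingency", "Condition": "Contingency",
    "Contrast": "Comparison", "Concession": "Comparison",
    "Conjunction": "Expansion", "Instantiation": "Expansion",
    "Restatement": "Expansion", "Alternative": "Expansion",
    "ChosenAlternative": "Expansion", "Exception": "Expansion",
    "Co-Occurrence": "Co-Occurrence",
}

# Hypothetical extracted tuples (head eventuality, relation type, tail eventuality).
observed = [
    ("I am hungry", "Result", "I have lunch"),
    ("I am hungry", "Result", "I have lunch"),
    ("I am hungry", "Conjunction", "I am tired"),
]

# The weight of a relation is the number of times its tuple was observed.
weights = Counter(observed)
```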

2.3 KG Storage

All eventualities in ASER are small dependency graphs, where vertices are the words and edges are the internal dependency relations between these words. We store the information about eventualities and the relations among them separately in two tables in a SQL database. In the eventuality table, we record the eventuality ids, all the words, the dependency edges between words, and the frequencies. In the relation table, we record the ids of the head and tail eventualities and the relations between them.
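A two-table layout of this kind might look as follows in SQLite. The table names, column names, JSON encoding of dependency edges, and the `weight` column are assumptions chosen for illustration; the text does not specify the actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE eventualities (
        id           TEXT PRIMARY KEY,
        words        TEXT NOT NULL,   -- space-joined words
        dependencies TEXT NOT NULL,   -- JSON list of [head, relation, dependent]
        frequency    REAL NOT NULL
    );
    CREATE TABLE relations (
        head_id       TEXT NOT NULL REFERENCES eventualities(id),
        tail_id       TEXT NOT NULL REFERENCES eventualities(id),
        relation_type TEXT NOT NULL,
        weight        REAL NOT NULL,
        PRIMARY KEY (head_id, tail_id, relation_type)
    );
    """
)

conn.executemany(
    "INSERT INTO eventualities VALUES (?, ?, ?, ?)",
    [
        ("e1", "I am hungry", json.dumps([[2, "nsubj", 0], [2, "cop", 1]]), 3.0),
        ("e2", "I have lunch", json.dumps([[1, "nsubj", 0], [1, "dobj", 2]]), 5.0),
    ],
)
conn.execute("INSERT INTO relations VALUES (?, ?, ?, ?)", ("e1", "e2", "Result", 1.0))

# Look up the tails of all relations that start from eventuality e1.
rows = conn.execute(
    "SELECT e.words, r.relation_type FROM relations r "
    "JOIN eventualities e ON e.id = r.tail_id WHERE r.head_id = ?",
    ("e1",),
).fetchall()
```

Keeping relations in their own table lets a one-hop lookup be a single indexed join on the head id.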

3 Knowledge Extraction

In this section, we introduce the knowledge extraction methodologies for building ASER.

3.1 System Overview

We first introduce the overall framework of our knowledge extraction system, which is shown in Figure 2. After textual data collection, we first preprocess the texts with the dependency parser. Then we perform eventuality extraction using pattern matching. For each sentence, if we find more than two eventualities, we first group these eventualities into pairs. For each pair, we generate one training instance, where each training instance contains two eventualities and their original sentence. After that, we extract seed relations from these training instances based on the less ambiguous connectives obtained from PDTB [Prasad et al. 2007]. In the end, a bootstrapping process is conducted to learn more relations and train the new classifier repeatedly. In the following sub-sections, we will introduce each part of the system separately.
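The pairing step can be sketched in a few lines. The function name `build_training_instances` is hypothetical, and two choices here are assumptions rather than details given in the text: pairs are unordered and kept in sentence order, and pairing is applied whenever at least two eventualities are available.

```python
from itertools import combinations


def build_training_instances(sentence, eventualities):
    """Pair up the eventualities of one sentence.

    Each training instance holds two eventualities and the original sentence.
    """
    return [(first, second, sentence) for first, second in combinations(eventualities, 2)]


sentence = "I was hungry, so I cooked pasta and I ate it"
instances = build_training_instances(
    sentence, ["I was hungry", "I cooked pasta", "I ate it"]
)
```

With n eventualities in a sentence this yields n(n-1)/2 instances, so three eventualities give three pairs.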

3.2 Corpora

To ensure the broad coverage of ASER, we select corpora from different resources (reviews, news, forums, social media, movie subtitles, e-books) as the raw data. The details of these datasets are as follows.

Yelp: Yelp is a social media platform where users can write reviews for businesses, e.g., restaurants and hotels. The latest release of the Yelp dataset contains over five million reviews.

New York Times (NYT): The NYT corpus [Sandhaus and Evan 2008] contains over 1.8 million news articles published by the NYT over a 20-year period.

Name # Sentences # Tokens # Instances # Unique Eventualities
Yelp 48.9M 758.3M 54.2M 20.5M
NYT 56.8M 1,196.9M 41.6M 23.9M
Wiki 105.1M 2,347.3M 38.9M 38.4M
Reddit 235.9M 3,373.2M 185.7M 82.6M
Subtitles 445.0M 3,164.1M 137.6M 27.0M
E-books 27.6M 618.6M 22.1M 11.1M
Overall 919.2M 11,458.4M 480.1M 194.0M
Table 4: Statistics of used corpora. (M means millions.)

Wiki: Wikipedia is one of the largest free knowledge datasets. To build ASER, we select the English version of Wikipedia.

Reddit: Reddit is one of the largest online forums. In this work, we select the anonymized post records over a one-month period.

Movie Subtitles: The movie subtitles corpus was collected by Lison and Tiedemann [2016], and we select the English subset, which contains subtitles for more than 310K movies.

E-books: The last resource we include is the free English electronic books from Project Gutenberg.

We merge these resources as a whole to perform knowledge extraction. The statistics of different corpora are shown in Table 4.

Figure 2: ASER extraction framework. The seed relation selection and the bootstrapping process are shown in the orange dash-dotted box and the blue dashed box, respectively. The two gray databases are the resulting ASER.

INPUT: Parsed dependency graph $D$, center verb $v$, positive dependency edges $P^{pos}$, optional edges $P^{opt}$, and negative edges $P^{neg}$. OUTPUT: Extracted eventuality $E$.

Initialize eventuality $E$.
for each connection $d$ (a relation and the associated word) in positive dependency edges $P^{pos}$ do
     if $d$ is found in $D$ then
         Append $d$ to $E$.
     else
         Return NULL.
     end if
end for
for each connection $d$ in optional dependency edges $P^{opt}$ do
     if $d$ is found in $D$ then
         Append $d$ to $E$.
     end if
end for
for each connection $d$ in negative dependency edges $P^{neg}$ do
     if $d$ is found in $D$ then
         Return NULL.
     end if
end for
Return $E$.
Algorithm 1 Eventuality Extraction with One Pattern

3.3 Preprocessing and Eventuality Extraction

For each sentence $s$, we first parse it with the Stanford Dependency Parser. We then filter out all the sentences that contain clauses. As each sentence may contain multiple eventualities and verbs are the centers of them, we first extract all verbs. To make sure that all the extracted eventualities are semantically complete without being too complicated, we design 14 patterns to extract the eventualities via pattern matching. Each of the patterns contains three kinds of dependency edges: positive dependency edges, optional dependency edges, and negative dependency edges. All the positive edges are shown in Table 2. Six more dependency relations (advmod, amod, nummod, aux, compound, and neg) are optional dependency edges that can be associated with any of the selected patterns. We omit all optional edges in the table because they are the same for all patterns. All other dependency edges are considered negative dependency edges, which are designed to make sure all the extracted eventualities are semantically complete and all the patterns are exclusive with each other. Taking the sentence ‘I have a book’ as an example, we will only select ‘I’, ‘have’, ‘book’ rather than ‘I’, ‘have’ as the valid eventuality, because ‘have’-dobj-‘book’ is a negative dependency edge for pattern ‘s-v’. For each verb $v$ and each pattern, we first put $v$ in the position of $v_1$ and then try to find all the positive dependency edges. If we can find all the positive dependency edges around the center verb, we consider it as one potential valid eventuality and then add all the words connected via the optional dependency edges. In the end, we check whether any negative dependency edge can be found in the dependency graph. If not, we keep it as one valid eventuality. Otherwise, we disqualify it. The pseudo-code of our extraction algorithm is shown in Algorithm 1. The time complexity of eventuality extraction is $\mathcal{O}(|S| \cdot |D| \cdot |v|)$, where $|S|$ is the number of sentences, $|D|$ is the average number of dependency edges in a dependency parse tree, and $|v|$ is the average number of verbs in a sentence.
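The positive/optional/negative check can be sketched in Python for the simplest case. This is a reduced illustration under stated assumptions, not the authors' implementation: the function name `extract_eventuality` is hypothetical, words are plain strings rather than token indices, only patterns whose positive edges all leave the center verb (such as s-v and s-v-o) are handled, and only edges leaving the verb are inspected for the negative check.

```python
OPTIONAL_RELATIONS = {"advmod", "amod", "nummod", "aux", "compound", "neg"}


def extract_eventuality(edges, verb, positive_relations):
    """Match one pattern around a center verb.

    edges: list of (head, relation, dependent) triples from a dependency parse.
    Returns the matched words, or None when the pattern does not apply.
    """
    outgoing = [(rel, dep) for head, rel, dep in edges if head == verb]
    words = [verb]

    # Positive edges: every one of them must be present.
    for rel in positive_relations:
        dependents = [dep for r, dep in outgoing if r == rel]
        if not dependents:
            return None
        words.append(dependents[0])

    # Optional edges: attach the word when the edge exists.
    for rel, dep in outgoing:
        if rel in OPTIONAL_RELATIONS:
            words.append(dep)

    # Negative edges: any other edge on the verb disqualifies the match.
    for rel, _ in outgoing:
        if rel not in positive_relations and rel not in OPTIONAL_RELATIONS:
            return None
    return words


# 'I have a book'
parse = [("have", "nsubj", "I"), ("have", "dobj", "book"), ("book", "det", "a")]
as_s_v = extract_eventuality(parse, "have", ["nsubj"])
as_s_v_o = extract_eventuality(parse, "have", ["nsubj", "dobj"])
```

On this parse the s-v pattern is rejected because the dobj edge counts as negative for it, while s-v-o accepts the verb with its subject and object.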

Relation Type | Seed Patterns
Precedence | $E_1$ before $E_2$; $E_1$, then $E_2$; $E_1$ till $E_2$; $E_1$ until $E_2$
Succession | $E_1$ after $E_2$; $E_1$ once $E_2$
Synchronous | $E_1$, meanwhile $E_2$; $E_1$ meantime $E_2$; $E_1$, at the same time $E_2$
Reason | $E_1$, because $E_2$
Result | $E_1$, so $E_2$; $E_1$, thus $E_2$; $E_1$, therefore $E_2$; $E_1$, so that $E_2$
Condition | $E_1$, if $E_2$; $E_1$, as long as $E_2$
Contrast | $E_1$, but $E_2$; $E_1$, however $E_2$; $E_1$, by contrast $E_2$; $E_1$, in contrast $E_2$; $E_1$, on the other hand, $E_2$; $E_1$, on the contrary, $E_2$
Concession | $E_1$, although $E_2$
Conjunction | $E_1$ and $E_2$; $E_1$, also $E_2$
Instantiation | $E_1$, for example $E_2$; $E_1$, for instance $E_2$
Restatement | $E_1$, in other words $E_2$
Alternative | $E_1$ or $E_2$; $E_1$, unless $E_2$; $E_1$, as an alternative $E_2$; $E_1$, otherwise $E_2$
ChosenAlternative | $E_1$, $E_2$ instead
Exception | $E_1$, except $E_2$
Table 5: Selected seed connectives. Here relations are directed relations from $E_1$ to $E_2$. Each relation can have multiple seed connectives.

3.4 Eventuality Relation Extraction

For each training instance, we use a two-step approach to decide the relations between the two eventualities.

We first extract seed relations from the corpora by using the unambiguous connectives obtained from PDTB [Prasad et al. 2007]. According to PDTB’s annotation manual, we found that some of the connectives are less ambiguous than others. For example, in the PDTB annotations, the connective ‘so that’ is annotated 31 times and only with the Result relation. On the other hand, the connective ‘while’ is annotated as Conjunction 39 times, Contrast 111 times, Expectation 79 times, Concession 85 times, etc. When we identify connectives like ‘while’, we cannot determine the relation between the two eventualities related to it. Thus, we choose connectives that are less ambiguous, where more than 90% of the annotations of each indicate the same relation, to extract seed relations. The selected connectives are listed in Table 5. Formally, we denote one informative connective (a word or phrase) and its corresponding relation type as $c$ and $T$. Given a training instance $x$=($E_1$, $E_2$, $s$), if we can find a connective $c$ such that $E_1$ and $E_2$ are connected by $c$ according to the dependency parse, we select this instance as an instance for relation type $T$.
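A simplified version of the seed labeling rule is sketched below. It is a string-based stand-in, not the dependency-parse check described above: the function `seed_relation` is hypothetical, the connective table is a small subset chosen for illustration, and a connective is accepted only when it is the sole text between the two eventuality strings.

```python
# Illustrative subset of less ambiguous connectives and their relation types.
SEED_CONNECTIVES = {
    "before": "Precedence",
    "after": "Succession",
    "because": "Reason",
    "so that": "Result",
    "if": "Condition",
    "but": "Contrast",
    "although": "Concession",
}


def seed_relation(sentence, first, second):
    """Return a seed relation type when a seed connective links the two spans."""
    text = sentence.lower()
    start_first = text.find(first.lower())
    if start_first < 0:
        return None
    end_first = start_first + len(first)
    start_second = text.find(second.lower(), end_first)
    if start_second < 0:
        return None
    # Whatever sits between the two eventualities, minus punctuation and spaces.
    between = text[end_first:start_second].strip(" ,;")
    return SEED_CONNECTIVES.get(between)


reason = seed_relation("I went home because I was tired", "I went home", "I was tired")
ambiguous = seed_relation("I stayed inside while it rained", "I stayed inside", "it rained")
```

Ambiguous connectives such as ‘while’ are simply absent from the table, so instances containing them stay unlabeled and are left for the bootstrapping stage.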

Since the seed relations extracted with the selected connectives can cover only a limited portion of the knowledge, we use a bootstrapping framework to incrementally extract more eventuality relations. Bootstrapping [Agichtein and Gravano 2000] is a commonly used technique in information extraction. Here we use a neural network based approach to bootstrap. The general steps of bootstrapping are as follows.

Step 1: Use the extracted seed training instances as the initial labeled training instances.

Step 2: Train a classifier based on labeled training instances.

Step 3: Use the classifier to predict the relations of each training instance. If the prediction confidence of a certain relation type $T$ is higher than the selected threshold, we label this instance with $T$ and add it to the labeled training instances. Then go to Step 2.
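The three steps form a standard self-training loop, sketched below. Everything here is illustrative: `bootstrap` and `train_overlap_classifier` are hypothetical names, the word-overlap classifier is a toy stand-in for the neural classifier, the threshold is fixed rather than annealed, and newly labeled instances are removed from the pool as a simplification.

```python
def train_overlap_classifier(labeled):
    """Toy classifier: label of the most similar labeled example (Jaccard overlap)."""
    def predict(instance):
        best_label, best_score = None, 0.0
        tokens = set(instance.split())
        for example, label in labeled.items():
            other = set(example.split())
            score = len(tokens & other) / len(tokens | other)
            if score > best_score:
                best_label, best_score = label, score
        return best_label, best_score
    return predict


def bootstrap(seeds, unlabeled, train, threshold, iterations):
    labeled = dict(seeds)                   # Step 1: start from the seed instances
    pool = list(unlabeled)
    for _ in range(iterations):
        predict = train(labeled)            # Step 2: train on the current labels
        remaining = []
        for instance in pool:               # Step 3: label confident predictions
            label, confidence = predict(instance)
            if confidence > threshold:
                labeled[instance] = label
            else:
                remaining.append(instance)
        if len(remaining) == len(pool):     # nothing new was labeled, so stop
            break
        pool = remaining
    return labeled


seeds = {"he left because he was tired": "Reason"}
result = bootstrap(
    seeds,
    ["she left because she was tired", "it rains"],
    train_overlap_classifier,
    threshold=0.5,
    iterations=3,
)
```

The loop only grows the labeled set with predictions above the threshold, which is what keeps low-confidence guesses from feeding back into training.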

Figure 3: The overview of the neural classifier. For each instance $x$ = ($E_1$, $E_2$, $s$), we first encode the information of the two eventualities $E_1$, $E_2$ and the original sentence $s$ with three bidirectional LSTM [Hochreiter and Schmidhuber 1997] modules, and the output representations are $h_{E_1}$, $h_{E_2}$, and $h_s$ respectively. We then concatenate $h_{E_1}$, $h_{E_2}$, $h_{E_1} \circ h_{E_2}$, and $h_s$ together, where $\circ$ indicates element-wise multiplication, and feed them to a two-layer feed forward network. In the end, we use a softmax function to generate scores for different relation types.

The neural classifier architecture is shown in Figure 3. In the training process, we randomly select labeled training instances as the positive examples and unlabeled training instances as negative examples. The cross-entropy is used as the loss and the whole model is updated via Adam [Kingma and Ba 2015]. In the labeling process, for each training instance $x$, the classifier predicts a score for each relation type. For any relation type, if the output score is larger than a threshold $\tau_k$, where $k$ is the index of the bootstrapping iteration, we label $x$ with that relation type. To avoid error accumulation, we also use an annealing strategy that increases the threshold $\tau_k$ with $k$, where $K$ is the total number of iterations. The complexities of both the training and labeling processes in iteration $k$ are linear in the number of parameters in the LSTM cell $|L|$, the number of training examples $|I_{train}^{k}|$, and the number of instances to predict $|I_{pred}^{k}|$ in that iteration. So the overall complexity in iteration $k$ is $\mathcal{O}(|L| \cdot (|I_{train}^{k}| + |I_{pred}^{k}|))$.

The hyper-parameters and other implementation details are as follows. For preprocessing, we first parse all the raw corpora with the Stanford Dependency Parser, which takes eight days with two 12-core Intel Xeon Gold 5118 CPUs. After that, we extract eventualities, build the training instance set, and extract seed relations, which takes two days with the same CPUs. For bootstrapping, the Adam optimizer [Kingma and Ba 2015] is used and the initial learning rate is 0.001. The batch size is 512. We use GloVe as the pre-trained word embeddings. The dropout rate is 0.2 to prevent overfitting. The hidden sizes of the LSTMs are 256 and the hidden size of the two-layer feed forward network with ReLU is 512. As relation types belonging to different categories could both exist in one training instance, in each bootstrapping iteration, four different classifiers are trained corresponding to the four categories (Temporal, Contingency, Comparison, Expansion). Each classifier predicts, for each instance, one of the types belonging to that category or ‘None’. Therefore, the classifiers do not influence each other and can be processed in parallel. Each iteration using ASER (core) takes around one hour with the same CPUs and four TITAN X GPUs. We spend around eight hours predicting ASER (full) with the classifier learned in the 10th iteration.

4 Inference over ASER

In this section, we provide two kinds of inference (eventuality retrieval and relation retrieval) based on ASER. For each of them, inference over both one hop and multiple hops is provided. The complexities of these two retrieval algorithms are both $\mathcal{O}(A^k)$, where $A$ is the average number of adjacent eventualities per eventuality and $k$ is the number of hops. In this section, we show how to conduct these inferences over one hop and two hops as a demonstration.

4.1 Eventuality Retrieval

The eventuality retrieval inference is defined as follows. Given a head eventuality $E_h$ and a relation list $L$ = ($R_1, R_2, \ldots, R_k$), find related eventualities and their associated probabilities such that for each eventuality $E_t$ we can find a path from $E_h$ to $E_t$ that contains all the relations in $L$ in order. (Footnote 10: ASER also supports the prediction of head eventualities given a tail eventuality and relations. We omit it in this section for clarity of presentation.)

4.1.1 One-hop Inference

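For the one-hop inference, we assume the target relation is $R_1$. We then define the probability of any potential tail eventuality $E_t$ as:

$$P(E_t \mid E_h, R_1) = \frac{f(E_h, R_1, E_t)}{\sum_{E'_t :\, (E_h, R_1, E'_t) \in \text{ASER}} f(E_h, R_1, E'_t)}, \qquad (1)$$

where $f(E_h, R_1, E_t)$ is the relation weight, which is defined in Definition 3. If no eventuality is connected with $E_h$ via $R_1$, $P(E' \mid E_h, R_1)$ will be 0 for any eventuality $E'$.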

4.1.2 Two-hop Inference

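Building on Eq. (1), we can define the probability of $E_t$ in the two-hop setting. Assume the two relations are $R_1$ and $R_2$, in order. We define the probability as follows:

$$P(E_t \mid E_h, R_1, R_2) = \sum_{E_m \in \mathcal{E}_m} P(E_m \mid E_h, R_1)\, P(E_t \mid E_m, R_2),$$

where $\mathcal{E}_m$ is the set of intermediate eventualities $E_m$ such that $(E_h, R_1, E_m) \in \text{ASER}$ and $(E_m, R_2, E_t) \in \text{ASER}$.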
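The one-hop and two-hop eventuality retrieval described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `EDGES` dictionary, the relation labels, and all weights are invented toy values standing in for the relation weights of Definition 3.

```python
from collections import defaultdict

# Toy relation weights keyed by (head, relation) -> {tail: weight}.
# All eventualities, labels, and numbers here are invented for illustration.
EDGES = {
    ("i am tired", "Result"): {"i sleep": 3.0, "i rest on a bench": 1.0},
    ("i am hungry", "Synchronous"): {"i am tired": 2.0},
    ("i am hungry", "Result"): {"i have lunch": 4.0},
}

def one_hop(head, rel):
    """Normalize the weights of all tails reachable from `head` via `rel`."""
    tails = EDGES.get((head, rel), {})
    total = sum(tails.values())
    # No connected eventuality means every tail has probability 0 (empty dict).
    return {t: w / total for t, w in tails.items()} if total else {}

def two_hop(head, rel1, rel2):
    """Sum over intermediate eventualities: P(mid | head, rel1) * P(tail | mid, rel2)."""
    probs = defaultdict(float)
    for mid, p_mid in one_hop(head, rel1).items():
        for tail, p_tail in one_hop(mid, rel2).items():
            probs[tail] += p_mid * p_tail
    return dict(probs)
```

Calling `two_hop("i am hungry", "Synchronous", "Result")` on this toy graph walks through the intermediate eventuality "i am tired" and ranks "i sleep" above "i rest on a bench". Each extra hop multiplies the work by the average number of neighbours, which matches the exponential growth in the number of hops noted at the start of this section.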

4.2 Relation Retrieval

The relation retrieval inference is defined as follows. Given two eventualities $E_h$ and $E_t$, find all relation lists and their probabilities such that for each relation list $\mathcal{L} = (R_1, R_2, \ldots, R_k)$, we can find a path from $E_h$ to $E_t$ that contains all the relations in $\mathcal{L}$ in order.

4.2.1 One-hop Inference

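Assuming that the path length is one, we define the probability that a relation $R$ exists from $E_h$ to $E_t$ as:

$$P(R \mid E_h, E_t) = \frac{f(E_h, R, E_t)}{\sum_{R' \in \mathcal{R}} f(E_h, R', E_t)},$$

where $\mathcal{R}$ is the relation set.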

Figure 4: Examples of inference over ASER.

4.2.2 Two-hop Inference

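Similarly, given two eventualities $E_h$ and $E_t$, we define the probability of a two-hop connection ($R_1$, $R_2$) between them as follows:

$$P(R_1, R_2 \mid E_h, E_t) = \sum_{E_m \in \mathcal{E}_m} P(R_1, R_2, E_m \mid E_h, E_t) = \sum_{E_m \in \mathcal{E}_m} P(R_1 \mid E_h)\, P(E_m \mid R_1, E_h)\, P(R_2 \mid E_m, E_t),$$

where $P(R \mid E_h)$ is the probability of relation $R$, given head eventuality $E_h$, and is defined as follows:

$$P(R \mid E_h) = \frac{\sum_{E_t :\, (E_h, R, E_t) \in \text{ASER}} f(E_h, R, E_t)}{\sum_{R' \in \mathcal{R}} \sum_{E_t :\, (E_h, R', E_t) \in \text{ASER}} f(E_h, R', E_t)}.$$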
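A minimal Python sketch of relation retrieval follows, under the assumption that the graph is stored as a dictionary from (head, relation, tail) triples to relation weights. The dictionary `F`, the helper names, and all weights are hypothetical; the two-hop function multiplies the probability of the first relation given the head, the probability of the intermediate eventuality, and the probability of the second relation between the intermediate and tail eventualities.

```python
# Toy weighted edges: (head, relation, tail) -> weight. All values are invented.
F = {
    ("i am hungry", "Result", "i have lunch"): 5.0,
    ("i am hungry", "Precedence", "i have lunch"): 1.0,
    ("i am hungry", "Synchronous", "i am tired"): 2.0,
    ("i am tired", "Result", "i sleep"): 3.0,
    ("i am tired", "Precedence", "i sleep"): 1.0,
}

def _normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()} if total else {}

def p_rel_given_pair(head, tail):
    """One-hop relation retrieval: weights between a fixed pair, normalized over relations."""
    return _normalize({r: w for (h, r, t), w in F.items() if h == head and t == tail})

def p_rel_given_head(head):
    """Share of the head's outgoing weight carried by each relation."""
    mass = {}
    for (h, r, _), w in F.items():
        if h == head:
            mass[r] = mass.get(r, 0.0) + w
    return _normalize(mass)

def p_tail_given_head_rel(head, rel):
    """Probability of each tail given a head and a relation."""
    return _normalize({t: w for (h, r, t), w in F.items() if h == head and r == rel})

def two_hop_relations(head, tail):
    """Score every relation pair (r1, r2) that links head to tail via one intermediate."""
    scores = {}
    for r1, p_r1 in p_rel_given_head(head).items():
        for mid, p_mid in p_tail_given_head_rel(head, r1).items():
            for r2, p_r2 in p_rel_given_pair(mid, tail).items():
                scores[(r1, r2)] = scores.get((r1, r2), 0.0) + p_r1 * p_mid * p_r2
    return scores
```

On this toy graph, `two_hop_relations("i am hungry", "i sleep")` finds only paths through "i am tired", mirroring the case where two eventualities have no direct edge but share an intermediate one.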


4.3 Case Study

| Pattern Code | # Eventuality | # Unique | # Agreed | # Valid | Accuracy |
|---|---|---|---|---|---|
| s-v | 109.0M | 22.1M | 171 | 158 | 92.4% |
| s-v-o | 129.0M | 60.0M | 181 | 173 | 95.6% |
| s-v-a | 5.2M | 2.1M | 195 | 192 | 98.5% |
| s-v-o-o | 3.5M | 1.7M | 194 | 187 | 96.4% |
| s-be-a | 89.9M | 29.0M | 189 | 188 | 99.5% |
| s-v-be-a | 1.2M | 0.5M | 190 | 187 | 98.4% |
| s-v-be-o | 1.2M | 0.7M | 186 | 171 | 91.9% |
| s-v-v-o | 12.4M | 6.6M | 193 | 185 | 95.9% |
| s-v-v | 8.7M | 2.7M | 185 | 155 | 83.8% |
| s-be-a-p-o | 13.2M | 8.7M | 189 | 185 | 97.9% |
| s-v-p-o | 39.0M | 23.5M | 178 | 161 | 90.4% |
| s-v-o-p-o | 27.2M | 19.7M | 181 | 167 | 92.2% |
| spass-v | 15.1M | 6.2M | 177 | 155 | 87.6% |
| spass-v-p-o | 13.5M | 10.3M | 188 | 177 | 94.1% |
| Overall | 468.1M | 194.0M | | | 94.5% |

Table 6: Statistics and annotations of the eventuality extraction. # Eventuality and # Unique are the total number and the unique number of eventualities extracted with the corresponding pattern (‘M’ stands for millions). # Agreed is the number of agreed eventualities among five annotators. # Valid is the number of valid eventualities labeled by annotators. Accuracy = # Valid / # Agreed. The Overall accuracy is calculated based on the pattern distribution.
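The Overall row of Table 6 can be approximately reproduced with a frequency-weighted average. The sketch below assumes that "pattern distribution" means weighting each pattern's accuracy by its total count (# Eventuality); the text does not say whether total or unique counts were used, so that choice is an assumption. The function name is hypothetical and the numbers are copied from the table.

```python
# (pattern, # eventualities in millions, # agreed, # valid), copied from Table 6.
ROWS = [
    ("s-v", 109.0, 171, 158),
    ("s-v-o", 129.0, 181, 173),
    ("s-v-a", 5.2, 195, 192),
    ("s-v-o-o", 3.5, 194, 187),
    ("s-be-a", 89.9, 189, 188),
    ("s-v-be-a", 1.2, 190, 187),
    ("s-v-be-o", 1.2, 186, 171),
    ("s-v-v-o", 12.4, 193, 185),
    ("s-v-v", 8.7, 185, 155),
    ("s-be-a-p-o", 13.2, 189, 185),
    ("s-v-p-o", 39.0, 178, 161),
    ("s-v-o-p-o", 27.2, 181, 167),
    ("spass-v", 15.1, 177, 155),
    ("spass-v-p-o", 13.5, 188, 177),
]

def overall_accuracy(rows):
    """Average of per-pattern accuracy (valid / agreed), weighted by pattern frequency."""
    total = sum(count for _, count, _, _ in rows)
    return sum(count * valid / agreed for _, count, agreed, valid in rows) / total

# Total number of agreed annotations across the 14 patterns (200 samples each).
total_agreed = sum(agreed for _, _, agreed, _ in ROWS)
```

Under this weighting the result lands within a few tenths of a percentage point of the reported 94.5%, and `total_agreed` recovers the 2,597 agreed annotations out of 2,800 samples.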

In this section, we showcase several interesting inference examples with ASER in Figure 4, which are conducted over the extracted sub-graph of ASER shown in Figure 1. Using eventuality retrieval, we can easily find that ‘I am hungry’ usually results in having lunch and that the eventuality ‘I make a call’ often happens before someone goes or departs. More interestingly, with two-hop inference, given the eventuality ‘I sleep’, we can find an eventuality ‘I rest on a bench’ such that both are caused by the same reason, which is ‘I am tired’ in this example. From another angle, we can also retrieve possible relations between eventualities. For example, we can learn that ‘I am hungry’ is most likely the reason for ‘I have lunch’ rather than the other way around. Similarly, with two-hop inference, we can find that even though ‘I am hungry’ has no direct relation with ‘I sleep’, ‘I am hungry’ often appears at the same time as ‘I am tired’, which is one plausible reason for ‘I sleep’.

5 Evaluations

In this section, we present human evaluation and extrinsic experiments to evaluate the quality of ASER.

5.1 Human Evaluation

5.1.1 Eventualities Extraction

In this section, we present human evaluation to assess the quantity and quality of extracted eventualities. We first present the statistics of the extracted eventualities in Table 6, which shows that simpler patterns like ‘s-v-o’ appear more frequently than the complicated patterns like ‘s-v-be-a’.

Figure 5: Distribution of eventualities by their frequencies. Sampled eventualities are shown along with their frequencies.

The distribution of extracted eventualities is shown in Figure 5. In general, the distribution of eventualities follows Zipf’s law: only a few eventualities appear many times, while the majority appear only a few times. To better illustrate the distribution, we also show several representative eventualities along with their frequencies, and we make two observations. First, eventualities that can be used in general cases, like ‘You think’, appear far more often than other eventualities. Second, eventualities contained in ASER are more related to daily life, like ‘Food is tasty’ or ‘I sleep’, than to specific domains, such as ‘I learn python’.

After extracting the eventualities, we employ the Amazon Mechanical Turk platform (MTurk) for annotations. For each eventuality pattern, we randomly select 200 extracted eventualities and then provide these extracted eventualities along with their original sentences to the annotators. In the annotation task, we ask them to label whether an auto-extracted eventuality phrase fully and precisely represents the semantic meaning of the original sentence. If so, they should label it ‘Valid’. Otherwise, they should label it ‘Not Valid’. For each eventuality, we invite four workers to label it, and if at least three of them give the same annotation result, we consider it an agreed annotation. Otherwise, the extraction is considered disagreed. In total, we spent $201.6. The detailed result is shown in Table 6. We got 2,597 agreed annotations out of 2,800 randomly selected eventualities, and the overall agreement rate is 92.8%, which indicates that annotators can easily understand our task and provide consistent annotations. In addition, the overall accuracy of 94.5% supports the effectiveness of the proposed eventuality extraction method.
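The agreement rule used in this annotation (a label counts only when at least three of the four workers give it) can be written as a small helper. This is an illustrative sketch; the function name and the `min_votes` parameter are assumptions, not part of the original annotation pipeline.

```python
from collections import Counter

def agreed_label(labels, min_votes=3):
    """Return the majority label if at least `min_votes` annotators chose it, else None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None

def accuracy(annotations):
    """Accuracy = # Valid / # Agreed, ignoring items on which annotators disagreed."""
    agreed = [a for a in map(agreed_label, annotations) if a is not None]
    return sum(a == "Valid" for a in agreed) / len(agreed) if agreed else 0.0
```

For example, the votes `["Valid", "Valid", "Not Valid", "Not Valid"]` yield no agreed label, so that item is excluded from the accuracy computation.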

5.1.2 Relations Extraction

(a) Statistics and evaluation of bootstrapping.
(b) Distribution and accuracy of different relation types.
Figure 6: Human Evaluation of the bootstrapping process. Relation Co_Occurrence is not included in the figures since it is not influenced by the bootstrapping.

In this section, we evaluate the quantity and quality of extracted relations in ASER. To ensure the quality of the learned bootstrapping model, we filter out eventualities and eventuality pairs that appear only once and use the resulting training instances to train the bootstrapping model. The KG extracted from the selected data is called the core part of ASER. After the bootstrapping, we directly apply the final bootstrapping model to all training instances and get the full ASER. We first evaluate the bootstrapping process and then evaluate relations in the two versions of ASER (core and full).

For the bootstrapping process, similar to the evaluation of eventuality extraction, we invite annotators from Amazon Mechanical Turk to annotate the extracted edges. For each iteration, we randomly select 100 edges for each relation type. For each edge, we generate a question asking the annotators whether they think a certain relation exists between the two eventualities. If so, they should label it ‘Valid’. Otherwise, they should label it ‘Not Valid’. As before, if at least three of the four annotators give the same annotation result, we consider it an agreed one; the overall agreement rate is 82.8%. For simplicity, we report the average accuracy, which is calculated based on the distribution of different relation types, as well as the total number of edges in Figure 6(a). The number of edges grows very fast at the beginning and slows down later. After ten iterations of bootstrapping, the number of edges has grown fourfold, while accuracy has dropped by less than 6 percentage points (from 92.3% to 86.5%).

Finally, we evaluate the core and full versions of ASER. For both versions, we randomly select 100 edges per relation type and invite annotators to annotate them in the same way as for the bootstrapping process. Together with the evaluation of bootstrapping, we spent $1698.4. The accuracy along with the distribution of different relation types is shown in Figure 6(b). We also compute the overall accuracy for the core and full versions of ASER as the average of these accuracy scores weighted by frequency. The overall accuracies of the core and full versions are 86.5% and 84.3% respectively, which is comparable with KnowlyWood [Tandon et al., 2015] (85%), even though KnowlyWood relies only on human-designed patterns while ASER involves bootstrapping. From the result, we observe that, in general, the core version of ASER is more accurate than the full version, which fits our understanding that the quality of rare eventualities might be lower. From another perspective, the full version of ASER covers many more relations than the core version with acceptable accuracy.

5.2 Extrinsic Evaluations

In this section, we use two extrinsic experiments to demonstrate the importance of ASER. All the experiments are conducted with the support of the core version of ASER.

5.2.1 Winograd Schema Challenge

The Winograd Schema Challenge is known to be related to commonsense knowledge and has been argued to be a replacement for the Turing test [Levesque et al., 2011]. We are given two sentences $s_1$ and $s_2$, both of which contain two candidate noun phrases $n_1$ and $n_2$ and one targeting pronoun $p$. The goal is to detect the correct noun phrase that $p$ refers to. Here is an example [Levesque et al., 2011].

(1) The fish ate the worm. It was hungry. Which was hungry?

Answer: the fish.

(2) The fish ate the worm. It was tasty. Which was tasty?

Answer: the worm.

This task is challenging because $s_1$ and $s_2$ are quite similar to each other (only one word differs), but the result is totally reversed. Besides that, all the widely used features such as gender and number are removed, and thus conventional rule-based resolution systems fail on this task. For example, in the above example, both fish and worm can be hungry or tasty by themselves. We can solve the problem because the fish is the subject of ‘eat’ while the worm is the object, which requires understanding eventualities related to ‘eat’. Moreover, due to the small size of the Winograd schema challenge, supervised learning based methods are not practical.

To demonstrate the effectiveness of ASER, we try to solve Winograd questions using simple inference based on ASER. For each question sentence $s$, we first extract eventualities with the same method introduced in Section 3.3 and then select eventualities $E_{n_1}$, $E_{n_2}$, and $E_p$ that contain the candidate nouns $n_1$/$n_2$ and the target pronoun $p$, respectively. We then replace $n_1$, $n_2$, and $p$ with the placeholders $X$, $Y$, and $P$, and hence generate the pseudo-eventualities $E'_{n_1}$, $E'_{n_2}$, and $E'_p$. After that, if we can find one of the seed connectives in Table 5 between any two eventualities, we use the corresponding relation type as the relation type $T$. Otherwise, we use Co_Occurrence as the relation type. To evaluate a candidate $n$, we first replace the placeholder $P$ in $E'_p$ with the corresponding placeholder $X$ or $Y$ and then use the following equation to define its overall plausibility score:

$$F(n, p) = \mathrm{ASER}_T(E'_n, E'_p),$$

where $\mathrm{ASER}_T(E'_n, E'_p)$ indicates the number of edges in ASER that support the existence of a relation of type $T$ between the eventuality pair $E'_n$ and $E'_p$. For each edge ($E_h$, $T$, $E_t$) in ASER, if it meets the following three requirements:

  1. $E_h$ = $E'_n$ other than the words in the placeholder positions.

  2. $E_t$ = $E'_p$ other than the words in the placeholder positions.

  3. Assume the words in the placeholder positions of $E_h$ and $E_t$ are $w_h$ and $w_t$ respectively; $w_h$ has to be the same as $w_t$.

we consider that edge a valid edge supporting the observed eventuality pair. If either $E'_n$ or $E'_p$ cannot be extracted with our patterns, we assign 0 to $F(n, p)$. We then predict the candidate with the higher score to be the correct referent. If both candidates have the same score (including 0), we make no prediction.
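The edge-counting score described above can be sketched as follows. This is a simplified illustration rather than the authors' code: eventualities are assumed to be tuples of lemmatized tokens, placeholders are the literal strings "X" and "Y", and the toy `EDGES` list is invented. An edge supports a candidate when its head and tail match the two pseudo-eventualities outside the placeholder positions and the shared placeholder is filled by the same word on both sides.

```python
PLACEHOLDERS = {"X", "Y"}

def bind(pattern, eventuality):
    """Match token by token; return {placeholder: word}, or None if a fixed word differs."""
    if len(pattern) != len(eventuality):
        return None
    binding = {}
    for p_tok, e_tok in zip(pattern, eventuality):
        if p_tok in PLACEHOLDERS:
            if binding.setdefault(p_tok, e_tok) != e_tok:
                return None
        elif p_tok != e_tok:
            return None
    return binding

def support(pseudo_n, pseudo_p, relation, edges):
    """Count edges of type `relation` that match both patterns with consistent fillers."""
    count = 0
    for head, rel, tail in edges:
        if rel != relation:
            continue
        b_h, b_t = bind(pseudo_n, head), bind(pseudo_p, tail)
        if b_h is None or b_t is None:
            continue
        shared = b_h.keys() & b_t.keys()
        if shared and all(b_h[k] == b_t[k] for k in shared):
            count += 1
    return count

def resolve(score_1, score_2):
    """Return 1 or 2 for the higher-scoring candidate, or None on a tie (no prediction)."""
    if score_1 == score_2:
        return None
    return 1 if score_1 > score_2 else 2

# Invented toy edges; real ASER edges would come from the extracted graph.
EDGES = [
    (("i", "eat", "food"), "Co_Occurrence", ("i", "be", "hungry")),
    (("dog", "eat", "meat"), "Co_Occurrence", ("dog", "be", "hungry")),
    (("cat", "eat", "fish"), "Co_Occurrence", ("fish", "be", "tasty")),
]
```

For "The fish ate the worm. It was hungry.", the subject candidate is scored with `support(("X", "eat", "Y"), ("X", "be", "hungry"), "Co_Occurrence", EDGES)` and the object candidate with the tail pattern `("Y", "be", "hungry")`; `resolve` then picks the higher count or abstains on a tie.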

| Method | Correct | Wrong | NA | $A_p$ | $A_o$ |
|---|---|---|---|---|---|
| Random Guess | 83 | 82 | 0 | 50.3% | 50.3% |
| Deterministic | 75 | 71 | 19 | 51.4% | 51.2% |
| Statistical | 75 | 78 | 12 | 49.0% | 49.1% |
| Deep-RL | 80 | 76 | 9 | 51.3% | 51.2% |
| End2end | 79 | 84 | 2 | 48.5% | 48.5% |
| Knowledge Hunting | 94 | 71 | 0 | 56.9% | 56.9% |
| ELMo | 83 | 82 | 0 | 50.3% | 50.3% |
| BERT | 84 | 81 | 0 | 50.9% | 50.9% |
| ASER | 63 | 27 | 75 | 70.0% | 60.9% |

Table 7: Experimental results on the Winograd Schema Challenge. Correct indicates the number of correct answers, Wrong indicates the number of wrong answers, and NA means that the model cannot give a prediction. $A_p$ means the prediction accuracy without NA examples, and $A_o$ means the overall accuracy.
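The two accuracy columns of Table 7 are consistent with the computation sketched below. The prediction accuracy follows directly from the caption; the overall accuracy formula, which gives each unanswered question the 50% credit of a random guess, is an inference from the reported numbers and is not stated in the text. The function name is hypothetical.

```python
def accuracies(correct, wrong, no_answer):
    """Return (prediction accuracy, overall accuracy) in percent.

    Assumption: each unanswered question counts as half correct in the overall
    accuracy, as if it were resolved by a random guess between two candidates.
    """
    answered = correct + wrong
    total = answered + no_answer
    a_p = 100.0 * correct / answered if answered else 0.0
    a_o = 100.0 * (correct + 0.5 * no_answer) / total if total else 0.0
    return a_p, a_o
```

For instance, `accuracies(63, 27, 75)` matches the ASER row (70.0% and 60.9% after rounding), and `accuracies(75, 71, 19)` matches the Deterministic row.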

Baseline Methods. We compare against several state-of-the-art general co-reference resolution approaches: deterministic rules [Raghunathan et al., 2010], a statistical supervised model [Clark & Manning, 2015], deep reinforcement learning [Clark & Manning, 2016], and an end-to-end deep learning model [Lee et al., 2018]. Besides these general co-reference models, a knowledge hunting framework [Emami et al., 2018] was proposed to search for commonsense knowledge on search engines (e.g., Google) for the Winograd questions. Rule-based methods are then used to leverage the collected commonsense knowledge to make final decisions. This framework achieves the current state-of-the-art performance on the Winograd schema challenge. Finally, language models and contextual representations [Trinh & Le, 2018; Peters et al., 2018; Devlin et al., 2018] trained on large corpora have demonstrated a strong ability to encode semantic information. Hence, to test whether commonsense knowledge can be captured by these language models, we also consider the available pre-trained ELMo [Peters et al., 2018] and BERT [Devlin et al., 2018] as baseline methods. For a fair comparison, we use the originally released code for all the aforementioned baseline methods.

Result Analysis. We select all Winograd questions satisfying two criteria to form the dataset: (1) they have no subordinate clause; (2) the targeting pronoun is covered by an eventuality detected from the question. As a result, we get 165 out of the total 273 questions. We use one-hop relations in ASER to perform inference for Winograd questions. As shown in Table 7, the Winograd schema challenge is difficult for current co-reference systems, as none of the general co-reference models or contextual representations performs clearly better than random guessing. This is because the Winograd questions are specifically designed to test a model’s ability to do inference over commonsense knowledge. Compared with them, the knowledge hunting model [Emami et al., 2018], which leverages a search engine to acquire related commonsense knowledge, achieves better performance. However, this method requires the support of a search engine, and the result is not stable, since the results returned by the search engine might change every day. ASER achieves the best performance among the compared methods, but we also notice that a large percentage of questions remain unsolved with the proposed model. This is because (1) some of the questions, although their eventualities can be found using the same patterns, are not covered by ASER; and (2) they may require more complicated reasoning methods. We leave improving the coverage (e.g., using embeddings) and more advanced reasoning methods as future work.

Figure 7: Example of using ASER to solve Winograd questions. The numbers before the questions are the original question IDs in the Winograd dataset. The correct answer and the other candidate are marked with a purple underline and a red italic font, respectively.

Case Study. One example is shown in Figure 7. Our model correctly resolves ‘it’ to ‘the fish’ in question 97, because 18 edges in ASER support that the subject of ‘eat’ should be ‘hungry’, while only one edge supports that the object of ‘eat’ should be ‘hungry’. Similarly, our model correctly resolves ‘it’ to ‘the worm’ in question 98, because seven edges in ASER support that the object of ‘eat’ should be ‘tasty’, while no edge supports that the subject of ‘eat’ should be ‘tasty’.

5.2.2 Eventuality knowledge enhanced dialogue system

As one of the most direct ways for machines to interact with humans, dialogue systems have been a hot research topic. We conduct experiments to demonstrate that the knowledge contained in ASER can help generate better dialogue responses.

Experiment Details. To test the effectiveness of ASER in daily life rather than a specific domain, we select Dailydialog [Li et al., 2017] as the experimental dataset and use the widely used BLEU score (%) [Papineni et al., 2002] as the evaluation metric. We use the sequence-to-sequence model with attention mechanism [Luong et al., 2015] as the base model and leverage a memory module to incorporate knowledge about eventualities into the dialogue generation model, following [Ghazvininejad et al., 2018]. Two existing eventuality-related resources, OMCS [Singh et al., 2002] and KnowlyWood [Tandon et al., 2015], are selected as the baseline KGs. Originally, Dailydialog contains 13,118 conversations and 49,188 post-response pairs. We first count the number of conversation pairs whose eventualities can be covered by the three KGs. OMCS, KnowlyWood, and ASER cover 7,246, 17,183, and 20,494 pairs respectively. For each conversation pair, if it contains an eventuality that can be found in any of the three KGs, we select it as a valid experimental conversation pair. As a result, we have 30,145 pairs. These pairs are divided into training, validation, and test data following the original setting.

Coverage Statistics. The detailed coverage statistics of the different KGs are shown in Table 8. For each KG, we report the number of covered conversation pairs, the percentage of such pairs, and the number of unique covered eventualities. The statistics show that OMCS covers only a small portion of the conversation pairs due to its relatively small size, and that ASER covers the most conversation pairs. We also notice that, compared with ASER, KnowlyWood covers more eventualities in fewer conversation pairs. The reason is that the definition of eventuality is different. In KnowlyWood, each eventuality is represented with two words (verb+object), which may not be semantically complete but can be more easily found in the text. In ASER, we require the matched eventualities to be semantically complete; each typically contains 3-5 words, which makes them more difficult to match. Nonetheless, as ASER is extracted from diverse resources, it can cover the topics in more conversation pairs.

| KG | # Covered pairs | Coverage rate | # Unique matched events |
|---|---|---|---|
| OMCS | 7,246 | 24.04% | 1,195 |
| KnowlyWood | 17,183 | 57.00% | 30,036 |
| ASER | 20,494 | 67.98% | 9,511 |

Table 8: Statistics of the dialogue dataset. # Covered pairs means the number of conversation pairs whose eventualities can be covered by the corresponding KG. Coverage rate means the percentage of such pairs among the 30,145 valid pairs. # Unique matched events means the number of unique matched eventualities in the KG.

Implementation. For each conversation pair, which contains one post and one response, we first extract eventualities from the post $p$. (Footnote 12: Different KGs have different definitions of eventuality. Hence, we use different formats to extract eventualities based on their original settings: OMCS uses strings to represent eventualities, KnowlyWood uses a verb-object pair to define eventualities, and ASER uses dependency graphs to represent eventualities.) This gives an eventuality set $\mathcal{E}_p$, which contains eventualities $E_1, E_2, \ldots, E_k$. For each $E_i \in \mathcal{E}_p$, we search for it in the KG and retrieve all related edges, which are represented as triplets. For each triplet $\tau = (E_h, R, E_t)$, where $E_h$ is the eventuality we extract from the post, $E_t$ is the retrieved related eventuality, and $R$ is the relation type between them, we represent it as a concatenation of four vectors $[\mathbf{v}_\tau; \mathbf{v}_{E_h}; \mathbf{v}_R; \mathbf{v}_{E_t}]$, where $\mathbf{v}_\tau$, $\mathbf{v}_{E_h}$, $\mathbf{v}_R$, and $\mathbf{v}_{E_t}$ are the embeddings of the triplet, $E_h$, $R$, and $E_t$ respectively. All of them are set to be trainable. We group the representations of all triplets as a memory $M$. We use cross-entropy as the loss and Adam [Kingma & Ba, 2015] with an initial learning rate of 0.005 to update all parameters, which are initialized randomly. We use a 256-dimension two-layer biGRU as the encoder and a 512-dimension two-layer GRU as the decoder. The word embedding size is set to 300, and the embedding sizes of $\mathbf{v}_\tau$, $\mathbf{v}_{E_h}$, $\mathbf{v}_R$, and $\mathbf{v}_{E_t}$ are all 128. All the models are trained for up to 20 epochs, and the best models are selected based on the dev set. Dropout is set to 0.1. In the inference stage, the beam search size is set to five.
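The memory construction described above can be illustrated with a small, framework-free sketch. Everything here is an assumption made for illustration: the `Embedding` class uses fixed random vectors in place of trainable parameters, a single lookup table is shared by triplets, eventualities, and relations, and the last edge in `KG_EDGES` is invented (the first three are taken from the ASER rows of Table 10).

```python
import random

DIM = 128  # embedding size of the triplet, head, relation, and tail vectors

class Embedding:
    """Lazily created random vectors, standing in for trainable embeddings."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def __call__(self, key):
        if key not in self.table:
            self.table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
        return self.table[key]

def build_memory(post_eventualities, kg_edges, embed):
    """One memory row per retrieved triplet: [triplet; head; relation; tail]."""
    memory = []
    for head, rel, tail in kg_edges:
        if head in post_eventualities:
            row = embed((head, rel, tail)) + embed(head) + embed(rel) + embed(tail)
            memory.append(row)
    return memory

KG_EDGES = [
    ("i eat food", "Conjunction", "beef is good"),
    ("i eat food", "Condition", "i am hungry"),
    ("i eat food", "Concession", "i take picture"),
    ("i am tired", "Conjunction", "i sleep"),  # invented edge, not matched by the post
]

memory = build_memory({"i eat food"}, KG_EDGES, Embedding(DIM))
```

Each memory row has 4 × 128 = 512 entries. In a real model the rows would be attended over by the decoder, following the knowledge-grounded setup cited in the experiment details.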

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|
| Base | 30.16 (0.44) | 5.75 (0.37) | 2.28 (0.24) | 0.98 (0.16) |
| +OMCS | 30.89 (0.40) | 6.14 (0.15) | 2.60 (0.13) | 1.21 (0.12) |
| +KnowlyWood | 30.72 (0.19) | 6.26 (0.24) | 2.68 (0.17) | 1.29 (0.11) |
| +ASER | **32.10** (0.42) | **7.14** (0.17) | **3.54** (0.10) | **2.07** (0.08) |

Table 9: Experimental results on the dialogue task. BLEU scores with standard deviations in brackets are reported. The highest BLEU scores are in boldface. ‘Base’ represents the seq2seq model with the attention mechanism.

Result and Analysis. For each KG, we repeat the experiment five times and report the average performance as well as the standard deviation. From the results shown in Table 9, we observe that the effect of OMCS is small due to its limited coverage. KnowlyWood covers many more examples, but its effect is also limited due to its semantically incomplete definition of eventualities. ASER achieves the best performance on all four BLEU metrics, especially on BLEU-3 and BLEU-4. The reason is that knowledge about eventualities can help the system generate a response containing a more suitable eventuality rather than a single suitable word, so metrics that take longer word sequences into consideration benefit more from eventuality-related knowledge.

| Source | Content |
|---|---|
| Post | I should eat some food . |
| Response | Yeah, you must be hungry. Do you like to eat some beef? |
| OMCS | ‘eat food’, MotivatedByGoal, ‘you are hungry’ |
| | ‘eat food’, HasPrerequisite, ‘open your mouth’ |
| KnowlyWood | (eat, food), next, (keep, eating) |
| | (eat, food), next, (enjoy, taste) |
| | (eat, food), next, (stick, wasp) |
| ASER | i eat food [s-v-o], Conjunction, beef is good [s-be-a] |
| | i eat food [s-v-o], Condition, i am hungry [s-be-a] |
| | i eat food [s-v-o], Concession, i take picture [s-v-o] |

Table 10: Eventuality matching example.

One example is shown in Table 10. After getting the post ‘I should eat some food’, we extract the contained eventualities ‘eat food’, ‘eat food’, and ‘I eat food’ for the three KGs respectively, and then find the related eventualities in the KGs to generate the response. By retrieving from OMCS, we know that ‘eat food’ can be motivated by ‘you are hungry’ and has the prerequisite that we have to open our mouth. Similarly, by retrieving from KnowlyWood, we know that we often ‘keep eating’, ‘enjoy taste’, or ‘stick wasp’ after ‘eat food’. By retrieving from ASER, we know that ‘I eat food’ and ‘beef is good’ can happen at the same time, and that eating food often has the condition of being hungry.

In general, the knowledge in OMCS is accurate because it is written by humans. However, its small scale limits its usage. KnowlyWood has a larger scale, but its semantically incomplete definition of eventualities also limits its usage. In comparison, ASER leverages carefully designed patterns to ensure the semantic completeness of extracted eventualities and uses a neural bootstrapping model to automatically learn relations between eventualities from a large unlabeled corpus. Thus, it can provide eventuality knowledge at a larger scale and with higher quality.

6 Conclusions

In this paper, we introduce ASER, a large-scale eventuality knowledge graph. We extract eventualities from texts based on dependency graphs. We then build seed relations among eventualities using unambiguous connectives found in PDTB and use a neural bootstrapping framework to extract more relations. ASER is the first large-scale eventuality KG built using the above strategy. We conduct systematic experiments to evaluate the quality and applications of the extracted knowledge. Both human and extrinsic evaluations show that ASER is a promising large-scale eventuality knowledge graph with great potential in many downstream tasks.


References

  • [Agichtein & Gravano, 2000] Agichtein, E., & Gravano, L. 2000. Snowball: Extracting relations from large plain-text collections. In ACM DL, 85–94.
  • [Aguilar et al., 2014] Aguilar, J., Beller, C., McNamee, P., Van Durme, B., Strassel, S., Song, Z., & Ellis, J. 2014. A comparison of the events and relations across ACE, ERE, TAC-KBP, and FrameNet annotation standards. In Workshop on EVENTS: Definition, Detection, Coreference, and Representation, 45–53.
  • [Auer et al., 2007] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. 2007. DBpedia: A nucleus for a web of open data. Springer.
  • [Bach, 1986] Bach, E. 1986. The algebra of events. Linguistics and Philosophy, 9(1), 5–16.
  • [Baker et al., 1998] Baker, C. F., Fillmore, C. J., & Lowe, J. B. 1998. The Berkeley FrameNet project. In COLING-ACL, 86–90.
  • [Banko et al., 2007] Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. 2007. Open information extraction from the web. In IJCAI, 2670–2676.
  • [Berant et al., 2013] Berant, J., Chou, A., Frostig, R., & Liang, P. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, 1533–1544.
  • [Bollacker et al., 2008] Bollacker, K. D., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 1247–1250.
  • [Carlson et al., 2010] Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E. R., & Mitchell, T. M. 2010. Toward an architecture for never-ending language learning. In AAAI, 1306–1313.
  • [Clark & Manning, 2015] Clark, K., & Manning, C. D. 2015. Entity-centric coreference resolution with model stacking. In ACL-IJCNLP, 1, 1405–1415.
  • [Clark & Manning, 2016] Clark, K., & Manning, C. D. 2016. Deep reinforcement learning for mention-ranking coreference models. In EMNLP, 2256–2262.
  • [Dalvi et al., 2018] Dalvi, B., Huang, L., Tandon, N., Yih, W.-t., & Clark, P. 2018. Tracking state changes in procedural text: A challenge dataset and models for process paragraph comprehension. In NAACL.
  • [Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [Dong et al., 2014] Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., & Zhang, W. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In KDD, 601–610.
  • [Emami et al., 2018] Emami, A., Cruz, N. D. L., Trischler, A., Suleman, K., & Cheung, J. C. K. 2018. A knowledge hunting framework for common sense reasoning. In EMNLP, 1949–1958.
  • [Etzioni et al., 2004] Etzioni, O., Cafarella, M., & Downey, D. 2004. Webscale information extraction in KnowItAll (preliminary results). In WWW, 100–110.
  • [Fellbaum, 1998] Fellbaum, C. 1998. WordNet: an electronic lexical database. MIT Press.
  • [Ghazvininejad et al., 2018] Ghazvininejad, M., Brockett, C., Chang, M., Dolan, B., Gao, J., Yih, W., & Galley, M. 2018. A knowledge-grounded neural conversation model. In AAAI-IAAI-EAAI, 5110–5117.
  • [Hochreiter & Schmidhuber, 1997] Hochreiter, S., & Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8), 1735–1780.
  • [Jackendoff, 1990] Jackendoff, R. 1990. Semantic Structures. Cambridge, Massachusetts: MIT Press.
  • [Kingma & Ba, 2015] Kingma, D. P., & Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • [Lee et al., 2018] Lee, K., He, L., & Zettlemoyer, L. 2018. Higher-order coreference resolution with coarse-to-fine inference. In NAACL-HLT, 687–692.
  • [Lenat & Guha, 1989] Lenat, D. B., & Guha, R. V. 1989. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley.
  • [Levesque et al., 2011] Levesque, H. J., Davis, E., & Morgenstern, L. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 46, 47.
  • [Li et al., 2013] Li, Q., Ji, H., & Huang, L. 2013. Joint event extraction via structured prediction with global features. In ACL, 1, 73–82.
  • [Li et al., 2017] Li, Y., Su, H., Shen, X., Li, W., Cao, Z., & Niu, S. 2017. Dailydialog: A manually labelled multi-turn dialogue dataset. In IJCNLP, 986–995.
  • [Lison & Tiedemann, 2016] Lison, P., & Tiedemann, J. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles.
  • [Luong et al., 2015] Luong, T., Pham, H., & Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, 1412–1421.
  • [Meyers et al., 2004] Meyers, A., Reeves, R., Macleod, C., Szekely, R., Zielinska, V., Young, B., & Grishman, R. 2004. The NomBank project: An interim report. In Workshop Frontiers in Corpus Annotation at HLT-NAACL.
  • [NIST, 2005] NIST. 2005. The ACE evaluation plan.
  • [Mourelatos, 1978] Mourelatos, A. P. D. 1978. Events, processes, and states. Linguistics and Philosophy, 2, 415–434.
  • [Palmer et al., 2005] Palmer, M., Gildea, D., & Kingsbury, P. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1), 71–106.
  • [Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 311–318. Association for Computational Linguistics.
  • [Peters et al., 2018] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL-HLT, 2227–2237.
  • [Prasad et al., 2007] Prasad, R., Miltsakaki, E., Dinesh, N., Lee, A., Joshi, A., Robaldo, L., & Webber, B. L. 2007. The Penn Discourse Treebank 2.0 annotation manual.
  • [Pustejovsky et al., 2003] Pustejovsky, J., Hanks, P., Sauri, R., See, A., Gaizauskas, R., Setzer, A., Radev, D., Sundheim, B., Day, D., Ferro, L., et al. 2003. The TimeBank corpus. In Corpus Linguistics, 2003, 40.
  • [Raghunathan et al., 2010] Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., & Manning, C. 2010. A multi-pass sieve for coreference resolution. In EMNLP, 492–501.
  • [Sandhaus, 2008] Sandhaus, E. 2008. The New York Times annotated corpus LDC2008T19.
  • [Sap et al., 2018] Sap, M., LeBras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., & Choi, Y. 2018. ATOMIC: an atlas of machine commonsense for if-then reasoning. In AAAI.
  • [Singh et al., 2002] Singh, P., Lin, T., Mueller, E. T., Lim, G., Perkins, T., & Zhu, W. L. 2002. Open mind common sense: Knowledge acquisition from the general public. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", 1223–1237.
  • [Smith et al., 2018] Smith, N. A., Choi, Y., Sap, M., Rashkin, H., & Allaway, E. 2018. Event2mind: Commonsense inference on events, intents, and reactions. In ACL, 463–473.
  • [Suchanek et al., 2007] Suchanek, F. M., Kasneci, G., & Weikum, G. 2007. YAGO: a core of semantic knowledge. In WWW, 697–706.
  • [Tandon et al., 2015] Tandon, N., de Melo, G., De, A., & Weikum, G. 2015. Knowlywood: Mining activity knowledge from Hollywood narratives. In CIKM, 223–232.
  • [Trinh & Le, 2018] Trinh, T. H., & Le, Q. V. 2018. A simple method for commonsense reasoning. CoRR, abs/1806.02847.
  • [Wu et al., 2012] Wu, W., Li, H., Wang, H., & Zhu, K. Q. 2012. Probase: A probabilistic taxonomy for text understanding. In SIGMOD, 481–492.
  • [Xue et al., 2015] Xue, N., Ng, H. T., Pradhan, S., Prasad, R., Bryant, C., & Rutherford, A. 2015. The CoNLL-2015 shared task on shallow discourse parsing. In CoNLL Shared Task, 1–16.