1 Introduction
Understanding events described in text is crucial to many artificial intelligence (AI) applications, such as discourse understanding, intention recognition and dialog generation. Script event prediction is the most challenging task in this line of work. This task was first proposed by Chambers and Jurafsky (2008), who defined it as follows: given an existing event context, one needs to choose the most reasonable subsequent event from a candidate list (as shown in Figure 1).

Previous studies built prediction models based either on event pairs [Chambers and Jurafsky, 2008; Granroth-Wilding and Clark, 2016] or on event chains [Wang et al., 2017]. Despite the success of event pairs and chains, the rich connections among events are not fully explored. To better model event connections, we propose to solve the problem of script event prediction based on event graph structure, and to infer the correct subsequent event based on network embedding.
Figure 2(a) gives an example to motivate our idea of using broader event connections (i.e., graph structure). Given an event context A(enter), B(order), C(serve), we need to choose the most reasonable subsequent event from the candidate list D(eat) and E(talk), where D(eat) is the correct answer and E(talk) is a randomly selected candidate event that occurs frequently in various scenarios. Pair-based and chain-based models trained on event chain datasets (as shown in Figure 2(b)) are very likely to choose the wrong answer E, as the training data show that C and E have a stronger relation than C and D. As shown in Figure 2(c), by constructing an event graph from the training event chains, the context events B, C and the candidate event D compose a strongly connected component, which indicates that D is a more reasonable subsequent event given the context events A, B, C.
Abstract event evolutionary principles and patterns are valuable commonsense knowledge, which is crucial for understanding narrative text, human behavior and social development. We use the notion of event evolutionary graph (EEG) to denote a knowledge base that stores this kind of knowledge. Structurally, an EEG is a directed cyclic graph whose nodes are events and whose edges stand for relations between events, e.g., temporal and causal relations. In this paper, we construct an event evolutionary graph based on narrative event chains, which we call a narrative event evolutionary graph (NEEG). With a NEEG in hand, another challenging problem is how to infer the subsequent event on the graph. A possible solution is to learn event representations based on network embedding.
Duvenaud et al. (2015) introduced a convolutional neural network that can operate directly on graphs, enabling end-to-end learning on prediction tasks whose inputs are graphs of arbitrary size and shape. Kipf and Welling (2017) presented a scalable approach for semi-supervised learning on graphs based on an efficient variant of convolutional neural networks. They chose a localized first-order approximation of spectral graph convolutions as the convolutional architecture, to scale linearly in the number of graph edges and learn hidden layer representations that encode both local graph structure and node features. However, their models require the adjacency matrix to be symmetric and can only operate on undirected graphs. Gori et al. (2005) proposed the graph neural network (GNN), which extended recursive neural networks and can be applied to most practically useful kinds of graphs, including directed, undirected, labeled and cyclic graphs. However, the learning algorithm in their model required running the propagation to convergence, which can have trouble propagating information across a long range in a graph. To remedy this, Li et al. (2016) introduced modern optimization techniques and gated recurrent units to GNN. Nevertheless, their models can only operate on small graphs. In this paper, we further extend the work of Li et al. (2016) by proposing a scaled graph neural network (SGNN), which is feasible for large-scale graphs. We borrow the idea of divide and conquer in the training process: instead of computing representations on the whole graph, SGNN processes only the concerned nodes each time. By comparing the context event representations with the candidate event representations learned by SGNN, we can choose the correct subsequent event.
This paper makes the following two key contributions:


We are among the first to propose constructing an event graph, instead of event pairs or event chains, for the task of script event prediction.

We present a scaled graph neural network, which can model event interactions on large-scale, dense, directed graphs and learn better event representations for prediction.
Empirical results on the widely used New York Times corpus show that our model achieves the best performance compared to state-of-the-art baseline methods, using the standard multiple choice narrative cloze (MCNC) evaluation. The data and code are released at https://github.com/eecrazy/ConstructingNEEG_IJCAI_2018.
2 Model
As shown in Figure 3, our model consists of two steps. The first step is to construct an event evolutionary graph based on narrative event chains. The second is to apply a scaled graph neural network to solve the inference problem on the constructed event graph.
2.1 Narrative Event Evolutionary Graph Construction
NEEG construction consists of two steps: (1) extracting narrative event chains from a newswire corpus, and (2) constructing the NEEG based on the extracted event chains.
In order to compare with previous work, we adopt the same news corpus and event chain extraction methods as [Granroth-Wilding and Clark, 2016]. We extract a set of narrative event chains S = {s_1, s_2, …}, where each chain s = {T, e_1, e_2, …, e_m}. For example, s can be {T = customer, walk(T, restaurant, −), seat(T, −, −), read(T, menu, −), order(T, food, −), serve(waiter, food, T), eat(T, food, fork)}. T is the protagonist entity shared by all the events in the chain. Each e_i is an event consisting of four components (p, a_0, a_1, a_2), where p is the predicate verb, and a_0, a_1, a_2 are the subject, object and indirect object of the verb, respectively.
NEEG can be formally denoted as G = {V, E}, where V is the node set and E is the edge set. In order to overcome the sparsity problem of events, we represent each event e_i by its abstract form (v_i, r_i), where v_i is the non-lemmatized predicate verb and r_i is the grammatical dependency relation of v_i to the chain entity T, for example (v_i, r_i) = (eats, subj). This kind of event representation is called predicate-GR [Granroth-Wilding and Clark, 2016]. We count all the predicate-GR bigrams in the training event chains, and regard each predicate-GR bigram as an edge in E. Each edge is a directed edge v_i → v_j with a weight w, which can be computed by:

w(v_j | v_i) = count(v_i, v_j) / Σ_k count(v_i, v_k)    (1)

where count(v_i, v_j) is the frequency with which the bigram (v_i, v_j) appears in the training event chains.
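As a concrete sketch of this construction, counting predicate-GR bigrams and normalizing them into edge weights in the spirit of Eq. (1) might look like the following (illustrative function and variable names, not the authors' code):

```python
from collections import defaultdict

def build_neeg(chains):
    """Build NEEG edge weights from training event chains.

    Each chain is a list of predicate-GR nodes, e.g. ("order", "subj").
    Edge weights follow Eq. (1): the bigram count normalized by all
    outgoing bigram counts of the source node.
    """
    count = defaultdict(int)      # count(v_i, v_j)
    out_total = defaultdict(int)  # sum_k count(v_i, v_k)
    for chain in chains:
        for v_i, v_j in zip(chain, chain[1:]):  # adjacent predicate-GR bigrams
            count[(v_i, v_j)] += 1
            out_total[v_i] += 1
    # w(v_j | v_i) = count(v_i, v_j) / sum_k count(v_i, v_k)
    return {(v_i, v_j): c / out_total[v_i] for (v_i, v_j), c in count.items()}

chains = [
    [("walk", "subj"), ("order", "subj"), ("serve", "obj"), ("eat", "subj")],
    [("walk", "subj"), ("order", "subj"), ("eat", "subj")],
]
graph = build_neeg(chains)
```

On this toy input, the two outgoing edges of ("order", "subj") each receive weight 0.5, since the bigram counts are normalized per source node.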
The constructed NEEG has 104,940 predicate-GR nodes and 6,187,046 directed and weighted edges. Figure 3(a) illustrates a local subgraph of G, which describes the possible events involved in the restaurant scenario. Unlike event pairs or event chains, the event graph has dense connections among events and contains richer information about event interactions.
2.2 Scaled Graph Neural Network
GNN was first proposed by Gori et al. (2005). Li et al. (2016) further introduced modern optimization techniques of backpropagation through time and gated recurrent units to GNN, resulting in the gated graph neural network (GGNN). Nevertheless, GGNN needs to take the whole graph as input, so it cannot effectively handle large-scale graphs with hundreds of thousands of nodes. For the purpose of scaling to large-scale graphs, we borrow the idea of divide and conquer in the training process: we do not feed the whole graph into GGNN. Instead, only a subgraph (as shown in Figure 3(b)) containing the context and candidate event nodes is fed into it for each training instance. Finally, the learned node representations can be used to solve the inference problem on the graph.

As shown in Figure 3(c), the overall framework of SGNN has three main components. The first is a representation layer, which learns the initial event representations. The second is a gated graph neural network, which models the interactions among events and updates the initial event representations. The third computes the relatedness scores between context and candidate events, according to which we choose the correct subsequent event.
2.2.1 Learning Initial Event Representations
We learn the initial event representation by composing pretrained word embeddings of its verb and arguments. For arguments that consist of more than one word, we follow [Granroth-Wilding and Clark, 2016] and only use the head word identified by the parser. Out-of-vocabulary words and absent arguments are represented by zero vectors.
Given an event e_i = (p, a_0, a_1, a_2) and the word embeddings v_p, v_{a_0}, v_{a_1}, v_{a_2} ∈ R^d of its verb and arguments (d is the dimension of the embeddings), there are several ways to get the representation v_{e_i} of the whole event through a mapping function. Here, we introduce three widely used semantic composition methods:


Average: Use the mean of the verb and all argument vectors as the representation of the whole event.

Concatenation [Granroth-Wilding and Clark, 2016]: Concatenate the verb and all argument vectors as the representation of the whole event.

Nonlinear Composition: Feed the verb and argument vectors through a nonlinear (tanh) transformation to obtain the representation of the whole event.
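These composition methods can be sketched in NumPy as follows (a minimal sketch with a small embedding dimension; the exact form of the nonlinear composition evaluated in Table 3 is an assumption on our part):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension (128 in the paper; small here for illustration)

# Word embeddings of the predicate and three arguments; an absent
# argument would simply be a zero vector. Names are illustrative.
v_p, v_a0, v_a1, v_a2 = (rng.standard_normal(d) for _ in range(4))

def compose_average(*vecs):
    """Average composition: mean of verb and argument vectors."""
    return np.mean(vecs, axis=0)

def compose_concat(*vecs):
    """Concatenation composition: stack verb and argument vectors."""
    return np.concatenate(vecs)

# Parameters of an assumed nonlinear composition: a tanh layer over
# the concatenated vectors.
W = rng.standard_normal((d, 4 * d)) * 0.1
b = np.zeros(d)

def compose_nonlinear(*vecs):
    """Nonlinear composition: tanh transformation of the concatenation."""
    return np.tanh(W @ np.concatenate(vecs) + b)

e_avg = compose_average(v_p, v_a0, v_a1, v_a2)     # shape (d,)
e_cat = compose_concat(v_p, v_a0, v_a1, v_a2)      # shape (4d,)
e_tanh = compose_nonlinear(v_p, v_a0, v_a1, v_a2)  # shape (d,)
```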
2.2.2 Updating Event Representations Based on GGNN
As introduced above, GGNN is used to model the interactions among all context and candidate events. The main challenge is how to train it on a large-scale graph. To train the GGNN model on a NEEG with more than one hundred thousand event nodes, each time we feed it a small subgraph instead of the whole graph, which makes training feasible on large-scale graphs.
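The per-instance subgraph extraction can be sketched as follows (a minimal sketch with hypothetical names; the full NEEG would be stored as a sparse structure in practice):

```python
import numpy as np

def subgraph_adjacency(weights, nodes):
    """Build the adjacency matrix of the subgraph induced by `nodes`.

    `weights` maps directed edges (v_i, v_j) to w(v_j | v_i) in the full
    NEEG; `nodes` are the context and candidate predicate-GR nodes of one
    training instance. Entries for absent edges are 0.
    """
    n = len(nodes)
    A = np.zeros((n, n))
    for i, v_i in enumerate(nodes):
        for j, v_j in enumerate(nodes):
            A[i, j] = weights.get((v_i, v_j), 0.0)
    return A

# Toy NEEG edge weights and one instance's nodes (2 context + 2 candidates).
weights = {("order", "serve"): 0.6, ("serve", "eat"): 0.7, ("order", "talk"): 0.1}
nodes = ["order", "serve", "eat", "talk"]
A = subgraph_adjacency(weights, nodes)
```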
Inputs to GGNN are two matrices h^(0) and A. Here h^(0) ∈ R^{n×d} contains the initial context and subsequent candidate event vectors, with n = c + k (the number of context events c is 8 and the number of candidate events k is 5, the same as [Granroth-Wilding and Clark, 2016]), and A ∈ R^{n×n} is the corresponding subgraph adjacency matrix:

A[i, j] = w(v_j | v_i) if (v_i, v_j) ∈ E, and 0 otherwise.    (3)

The adjacency matrix A determines how nodes in the subgraph interact with each other. The basic recurrence of GGNN is:
a^(t) = A^T h^(t−1) + b    (4)
z^(t) = σ(W^z a^(t) + U^z h^(t−1))    (5)
r^(t) = σ(W^r a^(t) + U^r h^(t−1))    (6)
c^(t) = tanh(W a^(t) + U (r^(t) ⊙ h^(t−1)))    (7)
h^(t) = (1 − z^(t)) ⊙ h^(t−1) + z^(t) ⊙ c^(t)    (8)
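In code form, one propagation step of this recurrence can be sketched with NumPy as follows (a minimal sketch of the standard gated-graph-network updates; parameter shapes and names are ours, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(h, A, params):
    """One GGNN propagation step over node states.

    h: (n, d) node hidden states; A: (n, n) directed adjacency matrix.
    """
    Wz, Uz, Wr, Ur, W, U, b = params
    a = A.T @ h + b                    # message passing along directed edges
    z = sigmoid(a @ Wz + h @ Uz)       # update gate
    r = sigmoid(a @ Wr + h @ Ur)       # reset gate
    c = np.tanh(a @ W + (r * h) @ U)   # candidate state
    return (1.0 - z) * h + z * c       # gated new hidden state

rng = np.random.default_rng(1)
n, d = 5, 8
h = rng.standard_normal((n, d))
A = rng.random((n, n)) * (rng.random((n, n)) < 0.3)  # sparse directed weights
params = tuple(rng.standard_normal(s) * 0.1
               for s in [(d, d)] * 6) + (np.zeros(d),)
for _ in range(2):  # unroll a fixed number of steps
    h = ggnn_step(h, A, params)
```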
GGNN behaves like the widely used gated recurrent unit (GRU) [Cho et al., 2014]. Eq. (4) is the step that passes information between different nodes of the graph via the directed adjacency matrix A; a^(t) contains the activations from edges. The remaining equations are GRU-like updates that incorporate information from the other nodes and from the previous time step to update each node's hidden state. z^(t) and r^(t) are the update and reset gates, σ is the logistic sigmoid function, and ⊙ is element-wise multiplication. We unroll the above recurrent propagation for a fixed number of steps T, and use the output h^(T) of GGNN as the updated representations of the context and candidate events.

2.2.3 Choosing the Correct Subsequent Event
After obtaining a hidden state for each event, we model event pair relations using these hidden state vectors. A straightforward approach to model the relation between two events is a Siamese network [Granroth-Wilding and Clark, 2016]. The outputs of GGNN for the context events are h_1, …, h_c, and those for the candidate events are h_{c_1}, …, h_{c_k}. Given a context event e_i and a candidate event e_{c_j}, the relatedness score is calculated by s_{ij} = g(h_i, h_{c_j}), where g is the score function.
There are multiple choices for the score function g in our model. Here, we introduce four commonly used similarity metrics that can serve as g:


Manhattan Similarity scores a pair of vectors by their (negated) Manhattan distance: g(h_i, h_{c_j}) = −‖h_i − h_{c_j}‖_1, so that closer vectors score higher.

Cosine Similarity is the cosine of the angle between two vectors: g(h_i, h_{c_j}) = (h_i · h_{c_j}) / (‖h_i‖ ‖h_{c_j}‖).

Dot Similarity is the inner product of two vectors: g(h_i, h_{c_j}) = h_i · h_{c_j}.

Euclidean Similarity scores a pair of vectors by their (negated) Euclidean distance: g(h_i, h_{c_j}) = −‖h_i − h_{c_j}‖_2.
Given the relatedness score s_{ij} between each context event e_i and each subsequent candidate event e_{c_j}, the likelihood of e_{c_j} given the context can be calculated by averaging over the context events, s_j = (1/c) Σ_{i=1}^{c} s_{ij}; we then choose the correct subsequent event by j* = argmax_j s_j.
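Putting the score functions and the selection rule together, a minimal sketch might look like the following (assumed forms, with distances negated so that larger scores always mean "more related"):

```python
import numpy as np

# Four candidate score functions g.
def g_manhattan(a, b): return -np.abs(a - b).sum()
def g_cosine(a, b):    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
def g_dot(a, b):       return a @ b
def g_euclidean(a, b): return -np.linalg.norm(a - b)

def choose_event(context_h, candidate_h, g=g_euclidean):
    """Average the pairwise relatedness scores over the context events
    and pick the candidate with the highest score (no attention)."""
    scores = [np.mean([g(h_i, h_c) for h_i in context_h]) for h_c in candidate_h]
    return int(np.argmax(scores)), scores

# Toy hidden states: the first candidate lies close to the context.
context_h = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
candidate_h = [np.array([1.0, 0.1]), np.array([-1.0, 0.0])]
best, scores = choose_event(context_h, candidate_h)
```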
We also apply an attention mechanism [Bahdanau et al., 2015] to the context events, as different context events may carry different weight for choosing the correct subsequent event. We use an attentional neural network to calculate the relative importance of each context event with respect to each subsequent candidate event:

u_{ij} = v_a^T tanh(W_a h_i + U_a h_{c_j})    (9)
α_{ij} = exp(u_{ij}) / Σ_{k=1}^{c} exp(u_{kj})    (10)

Then the relatedness score of candidate e_{c_j} is calculated by:

s_j = Σ_{i=1}^{c} α_{ij} g(h_i, h_{c_j})    (11)
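An additive-attention sketch of this weighted scoring (the exact parameterization in the model may differ; parameter names here are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_score(context_h, h_c, g, Wa, Ua, va):
    """Attention-weighted relatedness score for one candidate:
    score each context event against the candidate, normalize the
    attention weights with a softmax, and mix the pairwise g-scores."""
    u = np.array([va @ np.tanh(Wa @ h_i + Ua @ h_c) for h_i in context_h])
    alpha = softmax(u)  # attention weights over context events
    return sum(a * g(h_i, h_c) for a, h_i in zip(alpha, context_h))

rng = np.random.default_rng(2)
d = 4
context_h = [rng.standard_normal(d) for _ in range(3)]
h_c = rng.standard_normal(d)
Wa, Ua = rng.standard_normal((d, d)), rng.standard_normal((d, d))
va = rng.standard_normal(d)
s = attentive_score(context_h, h_c, lambda a, b: -np.linalg.norm(a - b),
                    Wa, Ua, va)
```

Because the attention weights sum to one, the result is a convex combination of the pairwise scores.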
2.2.4 Training Details
All the hyperparameters are tuned on the development set, and we use a margin loss as the objective function:

L(Θ) = Σ_I Σ_{j≠y} max(0, margin − s_{Iy} + s_{Ij}) + (λ/2) ‖Θ‖^2

where s_{Ij} is the relatedness score between the I-th event context and its j-th subsequent candidate event, and y is the index of the correct subsequent event. The margin is set to 0.015. Θ is the set of model parameters, and λ is the coefficient for L2 regularization, which is set to 0.00001. The learning rate is 0.0001, the batch size is 1000, and the number of recurrent steps T is 2.

We use the DeepWalk algorithm [Perozzi et al., 2014] to train embeddings for predicate-GRs on the constructed NEEG (we find that embeddings trained by DeepWalk on the graph are better than those from Word2vec trained on event chains), and the Skip-gram algorithm [Mikolov et al., 2013] to train embeddings for arguments on event chains. The embedding dimension d is 128. The model parameters are optimized by the RMSprop algorithm, and early stopping is used to decide when to stop training.
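The margin objective described above can be sketched for a single training instance as follows (L2 regularization omitted; names are ours):

```python
import numpy as np

def margin_loss(scores, y, margin=0.015):
    """Margin loss for one instance: sum over the wrong candidates of
    max(0, margin - s_y + s_j), pushing the correct candidate's score
    above every wrong one by at least the margin."""
    s_y = scores[y]
    return sum(max(0.0, margin - s_y + s_j)
               for j, s_j in enumerate(scores) if j != y)

scores = np.array([0.2, 0.5, 0.49, 0.1, 0.3])  # 5 candidates
loss = margin_loss(scores, y=1)  # correct candidate is index 1
```

Here only the near-miss candidate (0.49, within the margin of 0.5) contributes to the loss.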
3 Evaluation
We evaluate the effectiveness of SGNN by comparing it with several state-of-the-art baseline methods. Accuracy (%) of choosing the correct subsequent event is used as the evaluation metric.
3.1 Baselines
We compare our model with the following baseline methods.


PMI [Chambers and Jurafsky, 2008] is a co-occurrence-based model that scores predicate-GR event pairs using Pointwise Mutual Information.

Bigram [Jans et al., 2012] is a counting-based skip-gram model that calculates event pair relations based on bigram probabilities.

Word2vec [Mikolov et al., 2013] is the widely used model that learns word embeddings from large-scale text corpora. The learned embeddings for verbs and arguments are used to compute pairwise event relatedness scores.

DeepWalk [Perozzi et al., 2014] is an unsupervised model that extends the word2vec algorithm to learn embeddings for networks.

EventComp [Granroth-Wilding and Clark, 2016] is a neural network model that simultaneously learns embeddings for the event verb and arguments, a function to compose the embeddings into a representation of the event, and a coherence function to predict the strength of association between two events.

PairLSTM [Wang et al., 2017] integrates event order information and pairwise event relations by calculating pairwise event relatedness scores using LSTM hidden states as event representations.
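To make the counting-based baselines concrete, PMI scoring of event pairs can be sketched as follows (illustrative names and toy counts, not the original implementation):

```python
import math
from collections import Counter

def pmi_scores(pair_counts):
    """Pointwise mutual information over event pair co-occurrence counts:
    PMI(a, b) = log( C(a, b) * N / (C(a) * C(b)) ), where marginals are
    taken over the observed ordered pairs."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (a, b), c in pair_counts.items():
        left[a] += c
        right[b] += c
    return {(a, b): math.log(c * total / (left[a] * right[b]))
            for (a, b), c in pair_counts.items()}

# Toy counts: "eat" follows "order" far more often than "talk" does.
pairs = Counter({("order", "eat"): 8, ("order", "talk"): 2, ("enter", "talk"): 2})
pmi = pmi_scores(pairs)
```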
3.2 Dataset
Table 1: Statistics of the dataset.

                    Training    Development   Test
#Documents          830,643     103,583       103,805
#Chains for NEEG    5,997,385   –             –
#Chains for SGNN    140,331     10,000        10,000
Following Granroth-Wilding and Clark (2016), we extract event chains from the New York Times portion of the Gigaword corpus. The C&C tools [Curran et al., 2007] are used for POS tagging and dependency parsing, and OpenNLP is used for phrase structure parsing and coreference resolution. There are 5 candidate subsequent events for each event context, and only one of them is correct. Detailed dataset statistics are shown in Table 1.
4 Results and Analysis
4.1 Overall Results
Table 2: Accuracy (%) of different methods on the test set.

Methods                                         Accuracy
Random                                          20.00
PMI [Chambers and Jurafsky, 2008]               30.52
Bigram [Jans et al., 2012]                      29.67
Word2vec [Mikolov et al., 2013]                 42.23
DeepWalk [Perozzi et al., 2014]                 43.01
EventComp [Granroth-Wilding and Clark, 2016]    49.57
PairLSTM [Wang et al., 2017]                    50.83
SGNN−attention (without attention)              51.56
SGNN (ours)                                     52.45
SGNN + PairLSTM                                 52.71
SGNN + EventComp                                54.15
SGNN + EventComp + PairLSTM                     54.93
Improvements are statistically significant under a t-test, except that between SGNN and PairLSTM. Experimental results are shown in Table 2, from which we can make the following observations.
(1) Word2vec, DeepWalk and the other neural network-based models (EventComp, PairLSTM, SGNN) achieve significantly better results than the counting-based PMI and Bigram models. The main reason is that learning low-dimensional dense embeddings for events is more effective than sparse feature representations for script event prediction.
(2) Comparisons between Word2vec and DeepWalk, and between EventComp/PairLSTM and SGNN, show that graph-based models achieve better performance than pair-based and chain-based models. This confirms our assumption that the event graph structure is more effective than event pairs and chains, and can provide richer event interaction information for script event prediction.
(3) Comparison between SGNN−attention and SGNN shows that the attention mechanism can effectively improve the performance of SGNN. This indicates that different context events have different significance for choosing the correct subsequent event.
(4) SGNN achieves the best single-model script event prediction accuracy of 52.45%, a 3.2% relative improvement over the best baseline model (PairLSTM).
We also experimented with combinations of different models, to observe whether they have complementary effects. We find that SGNN+EventComp boosts the performance from 52.45% to 54.15%, showing that the two models can benefit from each other. Nevertheless, SGNN+PairLSTM only boosts the performance from 52.45% to 52.71%. This is because the difference between SGNN and PairLSTM is not significant, which suggests that they may learn similar event representations, with SGNN learning them in a better way. The combination of SGNN, EventComp and PairLSTM achieves the best performance of 54.93%, mainly because the pair, chain and graph structures each have their own advantages and can complement each other.
The learning curves (accuracy over time) of SGNN and PairLSTM are shown in Figure 4. We find that SGNN quickly reaches a stable high accuracy and outperforms PairLSTM from start to end, which demonstrates the advantage of SGNN over the PairLSTM model.
4.2 Comparative Experiments
We conduct several comparative experiments on the development set to study the influence of various settings on SGNN.
4.2.1 Experiment with Different Event Semantic Composition Methods
Table 3: Accuracy (%) of different event composition methods on the development set.

Composition      Accuracy (%)
Average          43.42
Nonlinear        51.54
Concatenation    52.38
Table 4: Accuracy (%) of different score functions on the development set.

Score Metric   Accuracy (%)
Manhattan      50.11
Cosine         50.81
Dot            51.62
Euclidean      52.38
Given the word embeddings v_p, v_{a_0}, v_{a_1}, v_{a_2} of the verb and arguments of event e_i, we compare the three event semantic composition methods introduced in Section 2.2.1. Experimental results are shown in Table 3.

We find that concatenating the verb and argument embeddings achieves the best performance, while averaging the input vectors is the worst way to obtain the representation. The main reason is that many events do not have an indirect object a_2, which may harm the averaging operation.
4.2.2 Experiment with Different Score Functions
We compare the commonly used similarity metrics for g introduced in Section 2.2.3 to investigate their influence on performance. As shown in Table 4, different score metrics indeed have different effects, though the gaps among them are not large. The Euclidean metric achieves the best result, which is consistent with a previous study using word embeddings to compute document distances [Kusner et al., 2015].
5 Related Work
5.1 Statistical Script Learning
The use of scripts in AI dates back to the 1970s [Schank and Abelson, 1977]. In this conception, scripts are composed of complex events without probabilistic semantics. In recent years, a growing body of research has investigated learning probabilistic co-occurrence-based models with simpler events. Chambers and Jurafsky (2008) proposed unsupervised induction of narrative event chains from raw newswire text, with narrative cloze as the evaluation metric, and pioneered the recent line of work on statistical script learning. Jans et al. (2012) used a bigram model to explicitly capture the temporal order of event pairs. However, these approaches all used a very impoverished representation of events, in the form of (verb, dependency). To overcome this drawback, Pichotta and Mooney (2014) presented an approach that employed events with multiple arguments.
There have been a number of recent neural models for script learning. Pichotta and Mooney (2015) showed that an LSTM-based event sequence model outperformed previous co-occurrence-based methods. Pichotta and Mooney (2016) used a Seq2Seq model operating directly on raw tokens to predict sentences. Granroth-Wilding and Clark (2016) described a feedforward neural network that composes verbs and arguments into low-dimensional vectors, evaluating on a multiple-choice version of the narrative cloze task. Wang et al. (2017) integrated event order information and pairwise event relations by calculating pairwise event relatedness scores using LSTM hidden states.
Previous studies built prediction models based either on event pairs [Chambers and Jurafsky, 2008; Granroth-Wilding and Clark, 2016] or on event chains [Wang et al., 2017]. In this paper, we propose to solve the problem of script event prediction based on event graph structure.
5.2 Neural Network on Graphs
Graph-structured data appear frequently in domains such as social networks and knowledge bases. Perozzi et al. (2014) proposed the unsupervised DeepWalk algorithm, which extended the word2vec [Mikolov et al., 2013] algorithm to learn embeddings for graph nodes based on random walks. Later, unsupervised network embedding algorithms including LINE [Tang et al., 2015] and node2vec [Grover and Leskovec, 2016] were proposed following DeepWalk. Duvenaud et al. (2015) introduced a convolutional neural network that can operate directly on graphs. Kipf and Welling (2017) presented a scalable approach for semi-supervised learning on graphs based on an efficient variant of convolutional neural networks. However, their models require the adjacency matrix to be symmetric and can only operate on undirected graphs.
Most related to our SGNN model is the graph neural network introduced by Gori et al. (2005), which was capable of directly processing graphs. GNN extended recursive neural networks and can be applied to most practically useful kinds of graphs, including directed, undirected, labeled and cyclic graphs. However, the learning algorithm they used required running the propagation to convergence, which can have trouble propagating information across a long range in a graph. To remedy this, Li et al. (2016) introduced modern optimization techniques of backpropagation through time and gated recurrent units to GNN, resulting in the gated graph neural network model. Nevertheless, their models usually operate on small graphs.
In this paper, we further extend the GGNN model by proposing a scaled graph neural network, which is feasible for large-scale graphs. Instead of feeding the whole graph into the model, we borrow the idea of divide and conquer in the training process: only a subgraph that contains the concerned nodes is fed into GGNN for each training instance.
5.3 Graphbased Organization of Events
Some previous studies have also explored graph-based organization of events. Orkin (2007) described events in a multi-agent script, derived from data collected from video game players. Li et al. (2013) described an approach to construct graph-based narrative scripts using crowdsourcing for story generation. Glavaš and Šnajder (2015) proposed event graphs as a novel way of structuring event-based information from text. Other studies organized temporal relations [Chambers and Jurafsky, 2008] or causal relations [Zhao et al., 2017] between events into graph structure. In this paper, we propose the event evolutionary graph, a knowledge base that stores abstract event evolutionary patterns.

6 Conclusion
In this paper, we propose constructing an event graph to solve the script event prediction problem based on network embedding. In particular, to better exploit the dense connections among events, we construct a narrative event evolutionary graph (NEEG) based on extracted narrative event chains. To solve the inference problem on the NEEG, we present a scaled graph neural network (SGNN) to model event interactions and learn better event representations for choosing the correct subsequent event. Experimental results show that the event graph structure is more effective than event pairs and event chains, significantly boosting prediction performance and making the model more robust.
Acknowledgments
This work is supported by the National Key Basic Research Program of China via grant 2014CB340503 and the National Natural Science Foundation of China (NSFC) via grants 61472107 and 61702137. The authors would like to thank the anonymous reviewers for the insightful comments. They also thank Haochen Chen and Yijia Liu for the helpful discussion.
References
 [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
 [Chambers and Jurafsky2008] Nathanael Chambers and Daniel Jurafsky. Unsupervised learning of narrative event chains. In ACL, volume 94305, pages 789–797, 2008.
 [Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734, 2014.
 [Curran et al.2007] James R Curran, Stephen Clark, and Johan Bos. Linguistically motivated large-scale NLP with C&C and Boxer. In ACL, pages 33–36, 2007.
 [Duvenaud et al.2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pages 2224–2232, 2015.
 [Glavaš and Šnajder2015] Goran Glavaš and Jan Šnajder. Construction and evaluation of event graphs. Natural Language Engineering, 21(4):607–652, 2015.
 [Gori et al.2005] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In IJCNN, volume 2, pages 729–734. IEEE, 2005.
 [Granroth-Wilding and Clark2016] Mark Granroth-Wilding and Stephen Clark. What happens next? Event prediction using a compositional neural network model. In AAAI, 2016.
 [Grover and Leskovec2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.

 [Jans et al.2012] Bram Jans, Steven Bethard, Ivan Vulić, and Marie Francine Moens. Skip n-grams and ranking functions for predicting script events. In EACL, 2012.
 [Kipf and Welling2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
 [Kusner et al.2015] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In ICML, pages 957–966, 2015.
 [Li et al.2013] Boyang Li, Stephen LeeUrban, George Johnston, and Mark O. Riedl. Story generation with crowdsourced plot graphs. In AAAI, pages 598–604, 2013.
 [Li et al.2016] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ICLR, 2016.
 [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
 [Orkin2007] Jeffrey David Orkin. Learning plan networks in conversational video games. Thesis, 2007.
 [Perozzi et al.2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In KDD, pages 701–710, 2014.
 [Pichotta and Mooney2014] Karl Pichotta and Raymond J Mooney. Statistical script learning with multi-argument events. In EACL, volume 14, pages 220–229, 2014.

[Pichotta and Mooney2015] Karl Pichotta and Raymond J Mooney. Learning statistical scripts with LSTM recurrent neural networks. In AAAI, 2015.
 [Pichotta and Mooney2016] Karl Pichotta and Raymond J Mooney. Using sentence-level LSTM language models for script inference. In ACL, 2016.
 [Schank and Abelson1977] Roger C Schank and Robert P Abelson. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures (artificial intelligence series). 1977.
 [Tang et al.2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In WWW, pages 1067–1077, 2015.
 [Wang et al.2017] Zhongqing Wang, Yue Zhang, and ChingYun Chang. Integrating order information and event relation for script event prediction. In EMNLP, pages 57–67, 2017.
 [Zhao et al.2017] Sendong Zhao, Quan Wang, Sean Massung, Bing Qin, Ting Liu, Bin Wang, and ChengXiang Zhai. Constructing and embedding abstract event causality networks from text snippets. In WSDM, pages 335–344. ACM, 2017.