1 Introduction
Recently, event-centered commonsense knowledge has attracted much attention chambers2008unsupervised; SEGERS16.722; wang2017integrating; li2018constructing, because understanding events is an important component of NLP. Given a daily-life event, humans can easily understand it and reason about its causes, effects, and so on. However, this remains a challenging task for NLP systems. This is partly because most of them are trained on task-specific datasets or objectives, which results in models that are adept at finding task-specific underlying correlation patterns but have limited capability in simple and explainable commonsense reasoning sap2018atomic.
To facilitate this, rashkin2018event2mind (rashkin2018event2mind) build the Event2Mind dataset and sap2018atomic (sap2018atomic) present the Atomic dataset, which mainly focus on nine If-Then reasoning types that describe the causes, effects, intents and participant characteristics of events. Together with these datasets, a simple RNN-based encoder-decoder framework is proposed to conduct the If-Then reasoning.
However, two challenging problems remain. First, as illustrated in Figure 1, given an event “PersonX finds a new job”, there can be multiple plausible feelings of PersonX about that event (such as “needy/stressed out” and “relieved/joyful”). Previous work showed that for such a one-to-many problem, conventional RNN-based encoder-decoder models tend to generate generic responses rather than meaningful and specific answers li2016diversity; serban2016building.
Second, as a commonsense reasoning problem, If-Then reasoning requires rich background knowledge to generate reasonable inferences. For example, as shown in Figure 1, the feeling of PersonX upon the event “PersonX finds a new job” could be multiple. However, given the context “PersonX was fired”, the plausible inferences are narrowed down to “needy” or “stressed out”.
To better solve these problems, we propose a context-aware variational autoencoder (CWVAE) together with a two-stage training procedure. Variational Autoencoder (VAE) based models have shown great potential in modeling the one-to-many problem and generating diversified inferences bowman2015generating; zhao2017learning.
In addition to the traditional VAE structure, we introduce an extra context-aware latent variable in CWVAE to learn the event background knowledge. In the pretrain stage, CWVAE is trained on an auxiliary dataset (consisting of three narrative story corpora that contain rich event background knowledge) to learn the event background information through the context-aware latent variable. Subsequently, in the finetune stage, CWVAE is trained on the task-specific dataset to adapt the event background information to each specific aspect of the If-Then inferential target (e.g., intents, reactions, etc.).
Experiments on the Event2Mind and Atomic datasets show that our proposed approach outperforms baseline methods in both the accuracy and diversity of inferences. The code is released at https://github.com/sjcfr/CWVAE.
2 Background
Before describing the two datasets used in this paper, Event2Mind and Atomic, as well as the If-Then reasoning task, we define the following terminology for clarity:
Base event: the prerequisite event in If-Then reasoning, organized as a verb phrase with a predicate and its arguments, such as the event “PersonX finds a new job” shown in Figure 1.
Inference dimension: a particular If-Then reasoning type, e.g., the intents or effects of the base event. Details are shown in Table 1 and Table 2.
Target: the inferential results. For example, as shown in Figure 1, given the base event “PersonX finds a new job” and the inference dimension “xReact”, the targets could be “relieved” or “needy”. Notice that each inference dimension can have multiple targets.
Event2Mind Dataset The Event2Mind dataset contains 25K base events and 300K targets, annotated through crowdsourcing. Event2Mind is organized in a hierarchical form: each base event has three types of inference dimensions, and given a base event, several targets may simultaneously exist under one inference dimension. Table 1 shows the (base event, inference dimension, target) hierarchical structure through an example from Event2Mind.
Atomic Dataset Inspired by Event2Mind, the Atomic dataset shares the same hierarchical structure as Event2Mind, while scaling up the size of the dataset and expanding the scope to nine types of inference dimensions. Table 2 shows the (base event, inference dimension, target) hierarchical structure through an example from Atomic. Though Atomic covers the inference dimensions of Event2Mind, the base event collection of Event2Mind is not identical to that of Atomic.
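Table 1: An example from Event2Mind.
Base event  Inference Dim.  Target
PersonX writes PersonY a letter  xIntent
PersonX writes PersonY a letter  xReact
PersonX writes PersonY a letter  oReact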
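Table 2: An example from Atomic.
Base event  Inference Dim.  Target
PersonX adopts a child  xIntent
PersonX adopts a child  xNeed
PersonX adopts a child  xAttr
PersonX adopts a child  xEffect
PersonX adopts a child  xWant
PersonX adopts a child  xReact
PersonX adopts a child  oReact
PersonX adopts a child  oWant
PersonX adopts a child  oEffect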
Problem Definition The If-Then reasoning task can be formally defined as a conditional one-to-many generation problem: given a base event $x$ and one inference dimension $d$, the model is required to generate targets $y$ as close to the ground truths as possible. Both $x$ and $y$ consist of sequences of words: $x = \{x_1, \dots, x_{l_x}\}$ and $y = \{y_1, \dots, y_{l_y}\}$, where $l_x$ and $l_y$ denote the lengths of $x$ and $y$, respectively.
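The hierarchical (base event, inference dimension, target) structure and the one-to-many formulation can be illustrated with a small sketch. The dictionary layout and the helper name `flatten_examples` are assumptions made for illustration only; the event and targets are the ones given for Figure 1.

```python
# Illustrative layout only: base event -> inference dimension -> list of targets.
annotations = {
    "PersonX finds a new job": {
        "xReact": ["relieved", "needy"],
    },
}


def flatten_examples(data):
    """Turn the hierarchy into (x, d, y) training triples, one per target."""
    triples = []
    for event, dims in data.items():
        for dim, targets in dims.items():
            for target in targets:
                # Whitespace tokenization, so that x and y are word sequences.
                triples.append((event.split(), dim, target.split()))
    return triples


examples = flatten_examples(annotations)
```

Because one (event, dimension) pair maps to several targets, the same input appears in several triples, which is what makes the generation problem one-to-many.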
Conditional Variational Autoencoder The variational autoencoder (VAE) defines a generative framework suited for the one-to-many generation problem kingma2014auto, and the conditional variational autoencoder (CVAE) sohn2015learning extends VAE to the conditional generation problem. As shown in Figure 2 (a), CVAE characterizes the conditional one-to-many generation problem using three random variables: event $x$, target $y$ and a latent variable $z$, which models the latent semantic distribution over targets given an event. Hence, under a certain inference dimension, with regard to the latent semantic variable $z$, the conditional generation problem can be expressed as $p(y|x) = \int p(y|x,z)\,p(z|x)\,dz$. CVAE models $p(y|x,z)$ and $p(z|x)$ using deep neural networks (parameterized by $\theta$), $p_\theta(y|x,z)$ and $p_\theta(z|x)$. Then, as illustrated in Figure 2 (b), $y$ can be generated from $x$ and $z$.
CVAE is trained to maximize the conditional likelihood $p(y|x)$, which involves an intractable marginalization over the latent variable $z$. Instead, following kingma2014auto (kingma2014auto), a practical way is to introduce another deep network (parameterized by $\phi$), $q_\phi(z|x,y)$, to approximate the true posterior distribution $p(z|x,y)$ and to maximize the evidence lower bound (ELBO) of the log-likelihood function:
$$\mathcal{L}^{ELBO}(\theta,\phi) = \mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(y|x,z)\right] - \mathrm{KL}\left(q_\phi(z|x,y)\,\|\,p_\theta(z|x)\right) \le \log p(y|x) \quad (1)$$
Therefore, CVAE is composed of three neural networks in general. We refer to $p_\theta(z|x)$ as the prior network, $q_\phi(z|x,y)$ as the recognition network, and $p_\theta(y|x,z)$ as the neural decoder.
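The two terms of the ELBO can be made concrete with a minimal sketch. It assumes, as is common for VAEs, that both the recognition and the prior distribution are Gaussians with diagonal covariance, and that the reconstruction term is a single-sample estimate supplied by the caller. The function names are illustrative and are not taken from the released code.

```python
import math
import random


def sample_gaussian(mu, sigma, rng=random):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for two Gaussians with diagonal covariance."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
                  - 0.5)
    return total


def elbo(recon_log_prob, mu_q, sigma_q, mu_p, sigma_p):
    """ELBO estimate: log p(y | x, z) for a sampled z, minus KL(q || p)."""
    return recon_log_prob - kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
```

The KL term is zero when the recognition and prior distributions coincide, so the bound is then governed by the reconstruction term alone.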
3 Context-aware Variational Autoencoder
Traditional CVAE can model the event-target relation. In other words, given an observed event, CVAE can generate its corresponding targets. In this paper, in contrast, we model If-Then reasoning as a [(background), event]-target process. That is, in addition to the observed event, we also want to involve the event background knowledge (which can be learned from event contexts) to generate reasonable targets.
To this end, we propose a context-aware variational autoencoder (CWVAE) with two additional latent variables: a context-acquiring latent variable $z_c$ to directly acquire context information, and a context-aware latent variable $z_{c'}$ to learn background knowledge from $z_c$, as shown in Figure 3 (a). However, the event context information is absent in the Event2Mind and Atomic datasets. To learn from external event context information, we design the following two-stage training procedure for CWVAE.
Pretrain: Learning Event Background Knowledge from Auxiliary Dataset In the pretrain stage, CWVAE is trained on three narrative story corpora with rich event context information. As shown in Figure 3 (a), the context-acquiring latent variable $z_c$ is directly conditioned on the context $c$. Hence, $z_c$ can be employed for acquiring background knowledge from event contexts. Then, we minimize the distance between $z_c$ and the context-aware latent variable $z_{c'}$, by which the event background knowledge is transferred from $z_c$ to $z_{c'}$.
Finetune: Adapt Event Background Knowledge to Each Inference Dimension In the finetune stage, as shown in Figure 3 (b), CWVAE is trained on the Event2Mind and Atomic datasets without the event context information. The pretrained CWVAE is finetuned to learn the specific inferential knowledge of each inference dimension. After the training procedure, as shown in Figure 3 (c), samples of $z$ are generated based on $x$ and samples of $z_{c'}$, where $z_{c'}$ contains rich event background knowledge helpful for If-Then reasoning.
3.1 Architecture of CWVAE
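As shown in Figure 4, CWVAE is mainly composed of four parts: a neural encoder that provides distributed representations of base events/targets, a recognition network for inferring the posterior distributions of $z_c$, $z_{c'}$ and $z$, a prior network for modeling the prior distributions of $z_{c'}$ and $z$, and a neural decoder that integrates the information from $z$ and $z_{c'}$ to generate targets.
Neural Encoder We employ a bidirectional GRU as the neural encoder, which encodes the context $c$, event $x$ and target $y$ into distributed representations $h^c = \{h^c_1, \dots, h^c_{l_c}\}$, $h^x = \{h^x_1, \dots, h^x_{l_x}\}$ and $h^y = \{h^y_1, \dots, h^y_{l_y}\}$, where $l_c$, $l_x$ and $l_y$ are the lengths of $c$, $x$ and $y$, respectively.
Recognition Network The recognition network models the posterior distributions of $z_c$, $z_{c'}$ and $z$ based on $h^x$, $h^y$ and $h^c$.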
Following the traditional VAE, the above-mentioned three distributions are assumed to be multivariate Gaussian distributions with a diagonal covariance structure:
(2)
where $\mu$ denotes the mean of the distribution, $\sigma$ denotes the standard deviation of the distribution, and $I$ denotes the identity matrix.
Given $h^x$, $h^y$ and $h^c$, we propose a novel attention-based inferer (ABI) module to estimate the mean and standard deviation of each of the three distributions:
(3)
Briefly, through the attention mechanism, ABI can capture the semantic interaction between input sequences and estimate the parameters of the distributions based on it. The specific structure of ABI is introduced below.
Prior Network The prior network models the distributions of $z_{c'}$ and $z$ based on $h^x$. The distributions of $z_{c'}$ and $z$ are still assumed to be multivariate Gaussian, whereas the parameters are different:
(4)
where $\mu$ and $\sigma$ again denote the mean and the standard deviation of the respective distribution, and $I$ denotes the identity matrix.
Then the attention-based inferer module is again employed to estimate the parameters of the distributions:
(5) 
Neural Decoder Given the base event $x$, the semantic latent variable $z$, and the context-aware latent variable $z_{c'}$, the neural decoder defines the generation probability of $y$ as follows:
(6)
where the output function is an attention-based feed-forward model that combines the context vector with the hidden state of the decoder. We obtain the output function and the context vector in the same way as bahdanau2014neural (bahdanau2014neural). Our decoder differs from bahdanau2014neural (bahdanau2014neural) in that it integrates the context-aware latent variable $z_{c'}$ and the semantic latent variable $z$ into the computation of the hidden state, together with the word embeddings of target words.
Note that by concatenating $z$ and $z_{c'}$ with the target word embedding and the previous hidden state, the decoder hidden state is affected by the context-aware latent variable $z_{c'}$ and the semantic latent variable $z$. This allows the model to directly access the event background knowledge from $z_{c'}$. In addition, the randomness of $z$ and $z_{c'}$ increases the diversity of model generation.
Attention-based Inferer The attention mechanism has shown a strong ability in capturing semantic interactions gong2017natural. Inspired by the co-attention mechanism parikh2016decomposable, we propose an attention-based inferer (ABI) to estimate the mean and standard deviation of a distribution belonging to the prior network or the recognition network by capturing the semantic interactions of the input sequences.
Specifically, given two input sequences (e.g., representations of contexts and events) $a$ and $b$ with lengths $l_a$ and $l_b$, we first obtain the attention scores from each side through:
(7)
where $W_a$ and $W_b$ are parameter weights.
With these attention scores, the context vectors of both sequences are given by:
(8)
Then we perform a mean pooling operation on the context vectors of both sequences:
(9)
To obtain the mean and standard deviation, the pooled context vectors of the two sequences, which carry the semantic interaction between them, are concatenated and projected into a latent semantic space through a nonlinear transformation:
(10)
Finally, the mean and standard deviation are generated through a nonlinear transformation over the projected representation:
(11) 
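The ABI steps described above (attention scores from each side, context vectors, mean pooling, projection, and estimation of mean and standard deviation) can be sketched in plain Python. This is a sketch under stated assumptions, not the released implementation: the dot-product scoring of projected states, the tanh projection, the linear mean, the softplus standard deviation, and all names (`abi`, `W_a`, `W_b`, `W_h`, `W_mu`, `W_sigma`) are choices made here for illustration.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors weighted by the given attention weights."""
    return [sum(w * v[k] for w, v in zip(weights, vectors))
            for k in range(len(vectors[0]))]


def mean_pool(vectors):
    """Element-wise mean over a sequence of vectors."""
    return [sum(v[k] for v in vectors) / len(vectors)
            for k in range(len(vectors[0]))]


def abi(a, b, W_a, W_b, W_h, W_mu, W_sigma):
    """Estimate (mu, sigma) from two sequences of hidden states a and b."""
    proj_a = [matvec(W_a, h) for h in a]
    proj_b = [matvec(W_b, h) for h in b]
    # Pairwise scores between every position of a and every position of b.
    scores = [[sum(p * q for p, q in zip(u, v)) for v in proj_b] for u in proj_a]
    # Each position of a attends over b, and each position of b attends over a.
    ctx_a = [weighted_sum(softmax(row), b) for row in scores]
    ctx_b = [weighted_sum(softmax([row[j] for row in scores]), a)
             for j in range(len(b))]
    # Mean-pool both sets of context vectors and concatenate them.
    pooled = mean_pool(ctx_a) + mean_pool(ctx_b)
    hidden = [math.tanh(v) for v in matvec(W_h, pooled)]
    mu = matvec(W_mu, hidden)
    # Softplus keeps the standard deviation positive.
    sigma = [math.log1p(math.exp(v)) for v in matvec(W_sigma, hidden)]
    return mu, sigma


identity = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy context states, length 3
b = [[0.5, 0.5], [1.0, -1.0]]              # toy event states, length 2
W_h = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1], [0.3, 0.0, 0.0, -0.2]]
W_mu = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_sigma = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mu, sigma = abi(a, b, identity, identity, W_h, W_mu, W_sigma)
```

The two sequences may have different lengths, since mean pooling reduces each set of context vectors to a single vector before concatenation.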
3.2 Optimizing
With the incorporation of $z_{c'}$, the original log-likelihood can be decomposed as:
(12) 
Then following traditional CVAE, the ELBO of CWVAE is defined as follows:
(13) 
which is the objective function at the finetune stage.
In the pretrain stage, as we aim to learn background knowledge by minimizing the distance between $z_c$ and $z_{c'}$, a context-aware regularization term is introduced in addition to the ELBO:
(14)
where the context-aware regularization term is the KL divergence between the distributions of $z_c$ and $z_{c'}$. By minimizing the context-aware regularization term, we aim to pass event context knowledge from $z_c$ to the context-aware latent variable $z_{c'}$.
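How the two stages differ in their objective can be sketched as a loss to be minimized. This is only an illustration under assumptions: the ELBO and the KL value are taken as precomputed numbers, the objective is written as a negated ELBO, and the function name and the `stage` argument are hypothetical. The default coefficient 0.1 is the value reported in the training details.

```python
def training_loss(elbo_value, kl_context, stage, reg_coeff=0.1):
    """Loss to minimize for one example.

    elbo_value : estimate of the ELBO (to be maximized)
    kl_context : KL term between the distributions of z_c and z_c'
    stage      : "pretrain" or "finetune"
    """
    if stage == "finetune":
        # No event context is available, so only the ELBO is optimized.
        return -elbo_value
    if stage == "pretrain":
        # The regularizer pulls z_c' towards the context-conditioned z_c.
        return -elbo_value + reg_coeff * kl_context
    raise ValueError("stage must be 'pretrain' or 'finetune'")
```

For example, `training_loss(-3.0, 2.0, "pretrain")` adds 0.1 times the KL term to the negated ELBO, whereas the finetune loss ignores `kl_context` entirely.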
3.3 Training Details
To test the performance of CWVAE, we split the Event2Mind and Atomic datasets into training, development and test sets (80%, 10%, 10%) in the same way as rashkin2018event2mind (rashkin2018event2mind) and sap2018atomic (sap2018atomic), respectively.
We initialize the embedding layer from 300d GloVe word embeddings. The neural encoder is a biGRU with 300 hidden units. For the ABI module, size of and is set to be and respectively. The dimensions of $z_c$, $z_{c'}$ and $z$ are all set to 40. The neural decoder is a GRU with a 300d hidden state. The coefficient of the context-aware regularization term is set to 0.1. Models are trained using the Adam optimizer kinga2015method with a learning rate of 0.001.
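For reference, the reported hyperparameters can be collected in a single configuration object. The key names below are illustrative assumptions and do not come from the released code; only the values are taken from the text, and the ABI weight sizes are left out because they are not given here.

```python
# Hyperparameters as reported in the training details; key names are illustrative.
config = {
    "embedding": "GloVe",
    "embedding_dim": 300,
    "encoder": "biGRU",
    "encoder_hidden_units": 300,
    "latent_dim": 40,            # shared by all three latent variables
    "decoder": "GRU",
    "decoder_hidden_dim": 300,
    "context_reg_coeff": 0.1,    # weight of the context-aware regularization term
    "optimizer": "Adam",
    "learning_rate": 0.001,
}
```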
4 Experiments
4.1 Auxiliary Dataset
Table 3: An example from the auxiliary dataset.
Context  Event  Inference Target
  ④ he got the job .  ⑤ jason was much happier at his new job .
The auxiliary dataset is built upon three human-written story corpora: ROCStories mostafazadeh2016corpus, VIST huang2016visual and WritingPrompts fan2018hierarchical. ROCStories and VIST are composed of short stories with five sentences. We filter out stories of more than 1,000 words in WritingPrompts, and cut the remaining stories into five-sentence paragraphs.
For each five-sentence paragraph, we define the first three sentences as the contexts of the base event, the fourth sentence as the base event, and the fifth sentence as the inference target. For example, as shown in Table 3, the first three sentences describe a context in which Jason was unsatisfied with his job and applied for a new job. Hence, after the event “he got the job” happens, a plausible reaction to the event could be “jason was much happier at his new job”. In total, the auxiliary dataset contains 192,316 triples.
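The construction of (context, event, target) triples can be sketched as follows. The sketch assumes that stories are already split into sentences, that paragraphs are cut without overlap, and that a trailing remainder of fewer than five sentences is dropped; none of these details are stated in the text. The function name is hypothetical, and the three context sentences in the usage example are placeholders.

```python
def story_to_triples(sentences):
    """Cut a story into five-sentence paragraphs and build one triple each."""
    triples = []
    for start in range(0, len(sentences) - 4, 5):
        paragraph = sentences[start:start + 5]
        context = paragraph[:3]   # first three sentences
        event = paragraph[3]      # fourth sentence is the base event
        target = paragraph[4]     # fifth sentence is the inference target
        triples.append((context, event, target))
    return triples


story = [
    "<context sentence 1>",
    "<context sentence 2>",
    "<context sentence 3>",
    "he got the job .",
    "jason was much happier at his new job .",
]
triples = story_to_triples(story)
```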
4.2 Baselines
We compare our proposed model with the following four baseline methods:

RNN-based Seq2Seq proposed by sap2018atomic (sap2018atomic) for the If-Then reasoning on Atomic.

Variational Seq2Seq combines a latent variable with the encoder-decoder structure by converting the last hidden state of the RNN encoder into a Gaussian distributed latent variable bowman2015generating.

VRNMT Proposed by su2018variational (su2018variational), VRNMT combines CVAE with an attention-based encoder-decoder framework by introducing a latent variable to model the semantic distribution of targets.

CWVAE-Unpretrained refers to the CWVAE model without the pretrain stage.
Note that, for each baseline method, we train a distinct model for each inference dimension.
Table 4: Perplexity (PPL) and BLEU score on Event2Mind.
Metric  Methods  xIntent  xReact  oReact
PPL  RNN-based Seq2Seq  44.12  29.18  14.08
PPL  Variational Seq2Seq  42.06  28.22  12.62
PPL  VRNMT  33.45  25.54  11.93
PPL  CWVAE-Unpretrained  31.32  24.07  11.37
PPL  CWVAE  29.23  23.17  11.04
BLEU  RNN-based Seq2Seq  2.75  2.11  5.18
BLEU  Variational Seq2Seq  2.84  2.43  2.08
BLEU  VRNMT  3.94  4.81  6.61
BLEU  CWVAE-Unpretrained  5.52  7.36  5.33
BLEU  CWVAE  5.65  12.98  6.97
Table 5: Distinct-1 (dist-1) and distinct-2 (dist-2) scores on Event2Mind.
Metric  Methods  xIntent  xReact  oReact
dist-1  RNN-based Seq2Seq  0.0002  0.0002  0.0001
dist-1  Variational Seq2Seq  0.0006  0.0003  0.0001
dist-1  VRNMT  0.0002  0.0002  0.0003
dist-1  CWVAE-Unpretrained  0.0023  0.0017  0.0004
dist-1  CWVAE  0.0052  0.0033  0.0025
dist-2  RNN-based Seq2Seq  0.0005  0.0002  0.0002
dist-2  Variational Seq2Seq  0.0014  0.0002  0.0001
dist-2  VRNMT  0.0005  0.0003  0.0001
dist-2  CWVAE-Unpretrained  0.0061  0.0040  0.0013
dist-2  CWVAE  0.0146  0.0099  0.0063
Table 6: Perplexity (PPL) and BLEU score on Atomic.
Metric  Methods  xIntent  xNeed  xAttr  xEffect  xReact  xWant  oWant  oReact  oEffect
PPL  RNN-based Seq2Seq  22.54  24.69  33.54  65.13  29.52  26.63  16.76  14.99  35.17
PPL  Variational Seq2Seq  26.48  28.31  33.00  68.62  29.93  29.50  16.98  14.25  34.20
PPL  VRNMT  21.04  24.28  24.87  61.05  26.62  28.57  14.45  14.86  30.12
PPL  CWVAE-Unpretrained  20.73  23.72  25.80  60.62  25.75  26.71  15.93  12.82  32.00
PPL  CWVAE  15.93  20.32  23.85  50.74  21.39  24.02  14.02  11.70  29.13
BLEU  RNN-based Seq2Seq  8.17  12.35  2.96  5.26  3.43  13.44  7.08  4.09  6.42
BLEU  Variational Seq2Seq  8.31  12.05  2.13  6.07  2.52  11.71  7.40  4.08  6.38
BLEU  VRNMT  9.52  13.35  4.87  4.42  7.64  9.80  10.79  5.28  13.71
BLEU  CWVAE-Unpretrained  11.37  14.64  4.07  14.11  7.86  12.70  12.09  8.16  14.93
BLEU  CWVAE  12.12  15.67  5.63  14.64  8.13  15.01  13.83  8.58  11.63
Table 7: Distinct-1 (dist-1) and distinct-2 (dist-2) scores on Atomic.
Metric  Methods  xIntent  xNeed  xAttr  xEffect  xReact  xWant  oWant  oReact  oEffect
dist-1  RNN-based Seq2Seq  0.0012  0.0029  0.0004  0.0019  0.0001  0.0022  0.0006  0.0001  0.0006
dist-1  Variational Seq2Seq  0.0006  0.0018  0.0002  0.0002  0.0001  0.0013  0.0007  0.0001  0.0002
dist-1  VRNMT  0.0002  0.0001  0.0053  0.0005  0.0018  0.0022  0.0005  0.0001  0.0004
dist-1  CWVAE-Unpretrained  0.0019  0.0036  0.0119  0.0046  0.0021  0.0013  0.0018  0.0005  0.0006
dist-1  CWVAE  0.0055  0.0045  0.0142  0.0028  0.0043  0.0040  0.0021  0.0030  0.0033
dist-2  RNN-based Seq2Seq  0.0036  0.0081  0.0002  0.0018  0.0002  0.0006  0.0013  0.0001  0.0011
dist-2  Variational Seq2Seq  0.0013  0.0042  0.0001  0.0003  0.0002  0.0026  0.0002  0.0003  0.0006
dist-2  VRNMT  0.0002  0.0011  0.0002  0.0005  0.0001  0.0034  0.0005  0.0001  0.0004
dist-2  CWVAE-Unpretrained  0.0060  0.0088  0.0136  0.0113  0.0043  0.0029  0.0041  0.0011  0.0009
dist-2  CWVAE  0.0162  0.0112  0.0146  0.0072  0.0013  0.0107  0.0044  0.0068  0.0093
4.3 Evaluation Metrics
Automatic Evaluation
We first compare the perplexity of CWVAE with that of the baseline methods. Perplexity measures the probability that a model regenerates the exact targets, which is particularly suitable for evaluating model performance on one-to-many problems serban2017hierarchical. Further, we employ the BLEU score to evaluate the accuracy of generations papineni2002bleu, and the number of distinct n-grams to evaluate the diversity of generations li2016diversity. The number of distinct n-grams is normalized to [0, 1] by dividing by the total number of generated tokens.
Human Evaluation
Since automatic evaluation of generations is still a challenging task liu2016not, we also conduct human evaluations of the model performance. Five human experts are employed to evaluate the coherence, diversity and fluency of generated targets. For each generated target, the experts are asked to vote on whether the generation is fluent and whether it is coherent, and to give a 1-5 score for the diversity of generations. For both the Event2Mind and Atomic datasets, 100 events are randomly selected from the test set. For each method, the top 10 generated targets of each base event are used for evaluation. Finally, we report three overall averaged scores of coherence, diversity and fluency on both datasets.
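Two of the automatic metrics described under Automatic Evaluation are simple enough to sketch directly. The sketch assumes that perplexity is computed from per-token log-probabilities of the reference targets and that distinct-n counts unique n-grams over all generations before dividing by the total number of generated tokens; the function names and the toy generations are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per target token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def distinct_n(generations, n):
    """Number of distinct n-grams divided by the total number of generated tokens."""
    ngrams = set()
    total_tokens = 0
    for tokens in generations:
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_tokens if total_tokens else 0.0


# Toy generations: the first and third are identical, which lowers diversity.
toy = [["be", "productive"], ["accomplish", "goal"], ["be", "productive"]]
d1 = distinct_n(toy, 1)
d2 = distinct_n(toy, 2)
```

Repeated generations add tokens to the denominator without adding new n-grams, so a model that produces the same generic output many times receives a low distinct-n score.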
4.4 Overall Results
We list the perplexity and BLEU scores of CWVAE and the baseline methods on Event2Mind and Atomic in Table 4 and Table 6, respectively, and show the distinct-1 and distinct-2 scores on Event2Mind and Atomic in Table 5 and Table 7, respectively. We find that:
(1) As shown in Table 5 and Table 7, the comparison between RNN-based Seq2Seq and the variational-based methods, including Variational Seq2Seq, VRNMT, CWVAE-Unpretrained and CWVAE, shows that variational-based methods can increase the diversity of generations. This confirms one of our motivations: variational-based methods can capture the latent semantic distribution within targets and increase the diversity of If-Then reasoning.
(2) Comparing CWVAE-Unpretrained with the other baseline methods shows that, in general, CWVAE improves the accuracy and diversity on both datasets. These results indicate the effectiveness of CWVAE in capturing the latent semantic distribution of targets and generating more reasonable inferential results.
(3) The comparison between CWVAE and CWVAE-Unpretrained shows that the pretrain stage enhances the performance of CWVAE in both accuracy and diversity. This is mainly because event knowledge offers guidance for If-Then reasoning. In the pretrain stage, CWVAE captures the event background knowledge through the context-aware latent variable, and such knowledge is adapted to our task through the finetune stage.
Methods  Coherence  Diversity  Fluency 

RNNbased Seq2Seq  0.28  2.03  0.73 
Variational Seq2Seq  0.33  1.67  0.92 
VRNMT  0.32  2.60  0.83 
CWVAEUnpretrained  0.36  2.10  0.92 
CWVAE  0.43  2.85  0.96 
Methods  Coherence  Diversity  Fluency 

RNNbased Seq2Seq  0.21  2.66  0.78 
Variational Seq2Seq  0.22  2.70  0.90 
VRNMT  0.24  2.61  0.78 
CWVAEUnpretrained  0.25  2.72  0.83 
CWVAE  0.32  3.03  0.90 
Table 10: An example of model generations.
Base event  Inference dim.  Generations (CWVAE)  Generations (RNN-based Seq2Seq)  Ground truth
PersonX works tirelessly  xIntent
To further evaluate the effectiveness of our proposed approach, we also conduct human evaluations, the results of which are shown in Table 8 and Table 9. On both datasets, CWVAE-based methods achieve consistently better coherence, diversity and fluency. Compared with CWVAE-Unpretrained, the pretrain procedure improves the performance on coherence and fluency. The main reasons are twofold: first, CWVAE has an advantage in capturing the semantic distribution of targets; second, the event background knowledge learned in the pretrain stage is helpful for If-Then reasoning.
4.5 Case Study
Table 10 provides an example of model generations given the base event “PersonX works tirelessly” and the inference dimension “xIntent”. The generations of CWVAE mainly contain four kinds of semantics: (1) be productive, (2) finish his work soon, (3) accomplish goal, (4) earn more money. In contrast, the semantics of the generations of the baseline RNN-based Seq2Seq model are relatively limited. Furthermore, the first three kinds of semantics overlap the three ground truth targets, and the fourth kind is in accordance with daily-life commonsense. Compared to the RNN-based Seq2Seq model, our approach increases the diversity and rationality of generations while keeping the accuracy.
5 Related Work
5.1 Event-Centered Commonsense Reasoning
Understanding events and constructing event-centered commonsense knowledge are crucial to many NLP applications, such as intention recognition goldwasser2016understanding and dialog generation wen2017latent. Recently, a growing number of studies focus on event-centered commonsense reasoning, mainly concentrating on two areas: script event prediction and story ending generation/choosing.
Script event prediction concerns the temporal relationships between script events granroth2016happens, and requires models to choose the correct subsequent triple-organized event among the candidates wang2017integrating. Prior work mainly focused on modeling event pairs granroth2016happens, event chains wang2017integrating and event graphs li2018constructing to predict the subsequent event. Story ending generation focuses on generating plausible story endings mostafazadeh2016corpus, which requires models to understand the story context and keep the generated endings logically consistent with it peng2017joint; guan2019story. The above tasks mainly investigate the logical order of events, whereas the If-Then reasoning task focuses on inferring the mental states of event participants.
5.2 Variational AutoEncoder-Decoder Based Natural Language Generation
VAE kingma2014auto has been widely applied in various text generation tasks, such as dialogue and machine translation.
In dialogue generation, zhao2017learning (zhao2017learning) adapt VAE to the encoder-decoder framework to model the latent semantic distribution of answers, which can increase the diversity of generations. For the task of machine translation, su2018variational (su2018variational) and zhang2016variational (zhang2016variational) employ a latent variable to capture the semantic interaction between the source and target sentences, and regard the latent variable as a supplement to the attention mechanism. Wang2019Topic (Wang2019Topic) use the latent variable to model topic distributions in text generation. In this paper, we introduce an additional context-aware latent variable to effectively learn background knowledge and conduct If-Then reasoning under its guidance.
6 Conclusion
In this paper, we propose a novel context-aware VAE (CWVAE) framework with two training stages for If-Then commonsense reasoning. By introducing an additional context-aware latent variable, CWVAE is able to learn external background knowledge and conduct If-Then reasoning under its guidance. In the pretrain stage, CWVAE learns event background knowledge; then, in the finetune stage, CWVAE adapts such knowledge to each inference dimension. Experimental results demonstrate that CWVAE outperforms baseline methods in both the accuracy and diversity of generations.
7 Acknowledgments
We thank the anonymous reviewers for their constructive comments, and gratefully acknowledge the support of the National Key Research and Development Program of China (SQ2018AAA010010), the National Key Research and Development Program of China (2018YFB1005103), and the National Natural Science Foundation of China (NSFC) via Grant 61702137.