Story ending generation is the task of generating an ending sentence for a story given its context, where a story context is a sequence of sentences connecting characters and events. This task is challenging as it requires modelling the characters, events and objects in the context, and then generating a coherent and sensible ending based on them. Generalizing the semantics of the events and entities and their relationships across stories is non-trivial. An even harder challenge is to generate endings which are non-trivial and interesting. In this work, we focus on the story ending generation task: given a story context (a sequence of sentences from a story), the model has to generate the last sentence of the story.
|Story context|
|My friends and I did not know what to do for our friend's birthday. We sat around the living room trying to figure out what to do. We finally decided to go to the movies. We all drove to the theatre and bought tickets.|
|Specific response (ground truth)|
|The movie turned out to be terrible but our friend had a good time.|
|Generic response (seq2seq output)|
|We were so happy to see that we had a good|
Seq2Seq models have been widely used for text generation. Despite their popularity, in story generation tasks they suffer from the well-known issue of generating frequent phrases and generic outputs. When trained with the Maximum Likelihood Estimate (MLE) objective, these models learn to generate sequences close to the ground-truth sequences. However, in story generation tasks there can be multiple reasonable outputs for a given story context. The MLE objective then results in outputs which are safe (that is, likely to be acceptable in almost any context) but also bland and generic. Examples of such generic outputs in story ending generation are "He was sad", "They had a great time", etc. Table 1 shows an example story from the ROC Stories corpus Mostafazadeh et al. (2017) and the corresponding specific and generic responses.
There have been many attempts to solve the issue of generic responses. They can be broadly grouped into two categories:
1) Conditioning on additional inputs such as emotions, sentiments, keywords, etc. that act as factors on which the output is conditioned Li et al. (2016b); Hu et al. (2017a). When models focus on these conditionals given as additional input features, they tend to generate outputs which are more relevant and specific to the conditionals, leading to less generic outputs. In our models, we use the keyphrases present in the story context as conditionals.
2) Modifying the training objective to encourage the model to generate more diverse outputs. Our proposed model uses the ITF loss function suggested by Nakamura et al. (2018) to encourage the decoder to produce more interesting outputs.
We show that our proposed models can generate diverse and interesting outputs by conditioning on the keyphrases present in the story context and incorporating a modified training objective. Apart from human judgement based evaluation, we measure performance of the models in terms of 1) Diversity using DISTINCT-1,2,3 metrics and 2) Relevance by introducing an automatic metric based on Story Cloze loss. Experiments show that our models score higher than current state of the art models in terms of both diversity and relevance.
For reproducibility purposes we are making our codebase open source: https://github.com/witerforcing/WriterForcing.
2 Related Work
There has been a surge of work in recent years tackling the problem of story generation. One common theme is to employ advances in deep learning for the task. Jain et al. (2017) use Seq2Seq models Sutskever et al. (2014) to generate stories from descriptions of images. Huang et al. (2018) leverage hierarchical decoding, where a high-level decoder constructs a plan by generating a topic and a low-level decoder generates sentences based on the topic. A few works try to incorporate real-world knowledge during story generation. Guan et al. (2018) use an incremental encoding (IE) scheme and perform one-hop reasoning over the ConceptNet graph in order to augment the representation of words in the sentences. Chen et al. (2018) tackle the problem in a similar way, also including "commonsense knowledge" from ConceptNet. Several prior works focus on generating more coherent stories. Clark et al. (2018) model entity representations explicitly by combining them with representations of the previous sentence, and Martin et al. (2018) model event representations and then generate natural language sentences from those events (event2sentence). Li et al. (2018) use adversarial training to help the model generate more reasonable endings.
A common problem with such neural approaches in general is that the generated text is "safe and boring". There have been many recent efforts towards generating diverse outputs in problems such as dialogue systems, image captioning and story generation, in order to alleviate this. Methods include using self-attention Shao et al. (2017), GANs, etc. Xu et al. (2018) propose a Diversity-Promoting Generative Adversarial Network, which assigns low reward to repeatedly generated text and high reward to novel and fluent text using a language-model-based discriminator. Li et al. (2016a) propose a Maximum Mutual Information (MMI) objective function and show that it leads to a decrease in the proportion of generic response sequences. Nakamura et al. (2018) propose another loss function with the same goal. In our models we experiment with their loss function and observe similar effects.
Recent works have also made advances in controllable generation of text based on constraints to make the outputs more specific. Peng et al. (2018) use a conditional embedding matrix for valence to control the ending of the story. Hu et al. (2017b) use a toggle vector to introduce constraints on the output of text generation models using Variational Autoencoders Doersch (2016). Generating diverse responses based on conditioning has been done extensively in the field of dialogue systems: Xing et al. (2016); Zhou et al. (2018); Zhang et al. (2018) propose conditioning techniques using emotion and persona while generating responses. Conditioned generation has also been studied in the field of story generation, to plan and write stories Yao et al. (2018); Huang et al. (2018).
In this work, we focus on generating more diverse and interesting endings for stories by conditioning on keyphrases present in the story context and by modifying the training objective to encourage infrequent words in the outputs.
3.1 Sequence-to-Sequence with Attention
We use a Seq2Seq model with attention as our baseline model. Words $(w_1, \dots, w_T)$ belonging to the context of the story are fed one by one to the encoder (uni-directional but multi-layer), which produces the corresponding hidden representations $(h_1, \dots, h_T)$. The hidden representation at the final timestep $T$ is passed on to the decoder. During decoding, at each step $t$ the decoder (a uni-directional GRU) receives the word embedding of the previous word and the previous hidden state $s_{t-1}$. At training time, the word present in the target sentence at timestep $t-1$ is used; at test time, the word actually emitted by the decoder at timestep $t-1$ is used as input at the next timestep.
However, to augment the hidden representation that is passed from encoder to decoder, one can use the mechanism of attention Bahdanau et al. (2014). The attention weights at timestep $t$ during decoding, denoted $a^t$, can be calculated as:

$$e_i^t = v^{\top} \tanh(W_h h_i + W_s s_t + b_{attn}) \quad (1)$$
$$a^t = \mathrm{softmax}(e^t) \quad (2)$$

where $v$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters and $a_i^t$ denotes the $i$-th component of the attention weights. The attention weights can be viewed as a probability distribution over the source words that tells the decoder where to look to produce the next word. Next, the attention weights are used to produce a weighted sum of the encoder hidden states, known as the context vector $c_t$:

$$c_t = \sum_i a_i^t h_i \quad (3)$$

This context vector is then concatenated with the embedding of the input word and used by the decoder to produce a probability distribution $P_{vocab}$ over the whole vocabulary:

$$P_{vocab} = \mathrm{softmax}(V [s_t; c_t] + b) \quad (4)$$

During training, the loss for timestep $t$ is the negative log likelihood of the target word $w_t^*$ for that timestep:

$$\mathrm{loss}_t = -\log P_{vocab}(w_t^*) \quad (5)$$

Thus, the overall averaged loss of the whole sequence becomes:

$$\mathrm{loss} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{loss}_t \quad (6)$$
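For concreteness, the attention computation and the per-step loss above can be sketched in a few lines of plain Python. This is an illustration only, not the model code: toy numbers stand in for the learned projections that produce the attention scores.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Toy attention scores e_i^t, one per source word (in the real model
# these come from v^T tanh(W_h h_i + W_s s_t + b_attn)).
scores = [0.2, 1.5, -0.3, 0.8]
attn = softmax(scores)                  # a^t: distribution over source words

# Context vector c_t: attention-weighted sum of encoder hidden states h_i.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
c_t = [sum(a * h[d] for a, h in zip(attn, H)) for d in range(2)]

# Per-step loss: negative log likelihood of the target word under P_vocab.
p_vocab = softmax([2.0, 0.1, -1.0])     # toy distribution over a 3-word vocab
loss_t = -math.log(p_vocab[0])          # target word has index 0
```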
4 Model Overview
We now describe our model and its variations. We hypothesize that conditioning on the keyphrases present in the story context leads to more specific and interesting outputs. We experiment with several variants for incorporating keyphrases in our base model. We further adapt the loss suggested by Nakamura et al. (2018) to encourage the model to generate infrequent tokens.
4.1 Keyphrase Conditioning
In this section we briefly describe four variants for incorporating keyphrases from the story context. We first extract the top-k keyphrases from the story context using the RAKE algorithm Rose et al. (2010). RAKE extracts keyphrases from the text and assigns them scores based on the frequency and co-occurrence of words. We then sort the list of keyphrases by their scores and take the top-k. Note that each keyphrase can contain multiple words; each word in a multi-word keyphrase is assigned the same score as the keyphrase.
We use the top-k keyphrases and ignore the rest: we explicitly set the score of every keyphrase outside the top-k to 0. Next, the scores of the top-k keyphrases are normalized so that they sum to 1. We call this set of keyphrases $K$ and its corresponding score vector $s$. $s$ has length equal to the length of the story context, and the value of each element is the score of the keyphrase to which the corresponding word belongs (0 if it belongs to none).
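This score-vector construction can be sketched as follows. The RAKE phrase scores are assumed to be already computed; the function name is ours, not from the paper's codebase.

```python
def keyphrase_score_vector(context_tokens, phrase_scores, k):
    """Top-k phrases keep their normalized score; every other token gets 0.
    Each word of a multi-word phrase inherits the whole phrase's score."""
    top = sorted(phrase_scores.items(), key=lambda kv: -kv[1])[:k]
    total = sum(score for _, score in top) or 1.0
    word_score = {}
    for phrase, score in top:
        for w in phrase.split():
            word_score[w] = score / total
    return [word_score.get(tok, 0.0) for tok in context_tokens]

tokens = "we drove to the theatre and bought tickets".split()
rake_scores = {"drove": 1.0, "theatre": 4.0, "bought tickets": 5.0}
s = keyphrase_score_vector(tokens, rake_scores, k=2)
# "drove" falls outside the top-2, so its position scores 0
```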
In all the four model variants described next, we incorporate the score vector to encourage the model to condition on the keyphrases.
4.1.1 Keyphrase Addition
In a Seq2Seq model with attention, at every decoding timestep $t$ the model generates a distribution $a^t$, where $a_i^t$ is the weight given to source context word $w_i$. In this variant, the model is provided the keyphrase attention score vector $s$ in addition to the self-learnt attention weight vector $a^t$. To combine the two vectors, we simply add their values at each encoder position $i$ and normalize the resulting vector.
Now at each timestep of the decoder, we compute the new context vector as:

$$c_t = \sum_i \frac{a_i^t + s_i}{\sum_j (a_j^t + s_j)} \, h_i$$
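The addition-and-renormalization step can be sketched in plain Python (illustrative values; the resulting distribution is then used in place of the learnt attention when forming the context vector):

```python
def add_and_normalize(attn, key_scores):
    """Keyphrase Addition: elementwise sum of the learnt attention weights
    and the keyphrase score vector, renormalized to a distribution."""
    mixed = [a + k for a, k in zip(attn, key_scores)]
    z = sum(mixed)
    return [m / z for m in mixed]

attn = [0.1, 0.6, 0.1, 0.2]            # learnt attention a^t
key = [0.0, 0.0, 0.5, 0.5]             # normalized keyphrase scores s
a_new = add_and_normalize(attn, key)   # keyphrase positions gain weight
```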
4.1.2 Context Concatenation
This variant calculates two separate context vectors: one based on the attention weights learnt by the model, and another based on the keyphrase attention score vector. Both context vectors are then concatenated. The intuition for this method comes from multi-head attention Vaswani et al. (2017), where different attention heads compute attention over different parts of the encoder states. Similarly, we expect the model to capture salient features from both types of context vectors.
We use this new context vector to calculate our probabilities over the words as described in equation 4.
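A sketch of this two-headed context computation, in plain Python with toy encoder states (the function name is ours):

```python
def concat_contexts(attn, key_scores, H):
    """Compute one context vector from the learnt attention and one from
    the keyphrase scores, then concatenate them (cf. multi-head attention)."""
    dim = len(H[0])

    def context(weights):
        z = sum(weights) or 1.0
        w = [x / z for x in weights]          # normalize to a distribution
        return [sum(wi * h[d] for wi, h in zip(w, H)) for d in range(dim)]

    return context(attn) + context(key_scores)    # length 2 * dim

H = [[1.0, 0.0], [0.0, 1.0]]                      # toy encoder hidden states
c = concat_contexts([0.7, 0.3], [0.2, 0.8], H)
```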
4.1.3 Coverage Loss
This variant implicitly encourages the model to pay attention to all the words present in the context. We adapt the attention-coverage loss proposed by See et al. (2017), which also helps avoid repeated attention across different timesteps while decoding. Under this constraint, the model should focus on different words in the story and generate outputs conditioned on those words. The loss presented in that paper is:

$$\mathrm{covloss}_t = \sum_i \min(a_i^t, c_i^t)$$

where $c^t = \sum_{t'=0}^{t-1} a^{t'}$ is the sum of the attention weight vectors over all previous timesteps and $a^t$ is the attention weight vector at the current timestep $t$.
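The coverage penalty can be sketched as follows (plain Python; in the real model the loss is accumulated per decoding step inside the training loop):

```python
def coverage_loss(attn_steps):
    """covloss = sum_t sum_i min(a_i^t, c_i^t), where c^t is the sum of the
    attention distributions from all previous decoder steps."""
    n = len(attn_steps[0])
    coverage = [0.0] * n
    loss = 0.0
    for attn in attn_steps:
        loss += sum(min(a, c) for a, c in zip(attn, coverage))
        coverage = [a + c for a, c in zip(attn, coverage)]
    return loss

# Repeating the same attention is penalized; spreading it is not.
repeat = coverage_loss([[1.0, 0.0], [1.0, 0.0]])
spread = coverage_loss([[1.0, 0.0], [0.0, 1.0]])
```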
4.1.4 Keyphrase Attention Loss
In this variant, instead of explicitly forcing the model to attend to the keyphrases, we provide additional guidance so that it learns to attend to them. We introduce an attention-similarity loss. We first create a coverage vector $C$, the sum of the attention weight vectors across all of the decoder's timesteps. We then calculate the Mean Squared Error (MSE) between this vector and our keyphrase score vector $s$. This loss is calculated once per story, after decoding of the generated ending has finished. Unlike the first two variants, this method only nudges the model to pay more attention to the keyphrases instead of forcing attention on them. While backpropagating, we use two losses: the original reconstruction loss used in Seq2Seq models and this keyphrase-based attention loss. This can be summarised by the following equations:

$$C = \sum_t a^t$$
$$L = L_{MLE} + \lambda \cdot \mathrm{MSE}(C, s)$$

where MSE is the mean squared error between the coverage vector and the score distribution produced by the RAKE algorithm. This loss is weighted by a factor $\lambda$ and added to the cross-entropy loss.
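A sketch of the keyphrase attention loss term (plain Python, toy values; the reconstruction cross-entropy term is computed elsewhere and omitted here, and in the model the keyphrase vector comes from RAKE):

```python
def keyphrase_attention_loss(attn_steps, key_scores, lam=0.9):
    """lam * MSE between the coverage vector (attention summed over all
    decoder steps) and the keyphrase score vector; added to the MLE loss."""
    n = len(key_scores)
    coverage = [sum(step[i] for step in attn_steps) for i in range(n)]
    return lam * sum((c - s) ** 2 for c, s in zip(coverage, key_scores)) / n

# Toy values: total attention exactly matching the keyphrase scores gives 0,
# while attention concentrated on a non-keyphrase position is penalized.
attn_steps = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
key_scores = [0.5, 1.0, 0.5]
matched = keyphrase_attention_loss(attn_steps, key_scores)
skewed = keyphrase_attention_loss([[1.0, 0.0, 0.0]] * 2, key_scores)
```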
4.2 Inverse Token Frequency Loss
As mentioned earlier, Seq2Seq models tend to generate frequent words and phrases, which lead to very generic story endings. This happens due to the use of conditional likelihood as the objective, especially in problems where there can be one-to-many correspondences between the input and outputs. MLE loss unfairly penalizes the model for generating rare words which would be correct candidates, but are not present in the ground truth. This holds for our problem setting too, where for the same story context, there can be multiple possible story endings. Nakamura et al. (2018) proposed an alternative Inverse Token Frequency (ITF) loss which assigns smaller loss for frequent token classes and larger loss for rare token classes during training. This encourages the model to generate rare words more frequently compared to cross-entropy loss and thus leads to more interesting story ending outputs.
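The ITF idea can be sketched as a frequency-dependent weight on the per-token negative log likelihood. The exact weighting scheme and exponent are hyperparameters in Nakamura et al. (2018); the form below is our simplified illustration, not their exact formulation.

```python
import math

def itf_weights(token_counts, lam=0.4):
    """Inverse-frequency weights: rare tokens get larger loss weights."""
    return {tok: 1.0 / (count ** lam) for tok, count in token_counts.items()}

def weighted_nll(p_target, token, weights):
    """ITF-weighted negative log likelihood for one decoding step."""
    return weights[token] * -math.log(p_target)

w = itf_weights({"the": 10000, "pageant": 3})
# With the same predicted probability, a rare target token contributes a
# larger loss, pushing the model towards generating rarer words.
rare_loss = weighted_nll(0.5, "pageant", w)
freq_loss = weighted_nll(0.5, "the", w)
```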
5 Experimental Setup
5.1 Dataset
We used the ROCStories corpus Mostafazadeh et al. (2017) to generate our story endings. Each story in the dataset comprises five sentences: the input is the first four sentences of the story and the output is the last sentence. The number of stories used to train and test the model is shown in Table 2.
|Train Set||Dev set||Test Set|
5.2 Baselines and Proposed Methods
For the evaluation of story ending generation, we compare the following baselines and proposed models:
Seq2Seq: Seq2Seq model with attention trained with the vanilla Maximum Likelihood Estimate (MLE) loss.
IE + GA: model based on Incremental Encoding (IE) and Graph Attention (GA) Guan et al. (2019).
Seq2Seq + ITF: Seq2Seq model with attention trained with ITF loss.
Keyphrase Add + ITF: Our model variant described in section 4.1.1.
Context Concat + ITF: Our model variant described in section 4.1.2.
Coverage Loss + ITF: Our model variant described in section 4.1.3 based on See et al. (2017).
Keyphrase Loss + ITF: Our model variant described in section 4.1.4.
Keyphrase Loss: Our model variant described in section 4.1.4 without the ITF loss.
5.3 Experiment Settings
All our models use the same hyper-parameters. We used a two-layer encoder-decoder architecture with 512 GRU hidden units. We train our models using the Adam optimizer with a learning rate of 0.001. For the Keyphrase Attention Loss model, we assign a weight of 0.9 to the keyphrase loss and 0.1 to the reconstruction loss. We use the best win percentage from our Story-Cloze metric (described in the next section) for model selection. For the ITF loss we use the hyperparameters mentioned in the original paper. We also apply basic heuristics to prevent continuous repetition of words.
5.4 Automatic Evaluation Metrics
In this section, we briefly describe the metrics used to evaluate our models. We did not use perplexity or BLEU, as neither is likely to be effective in our setting: both measure performance against a single reference story ending in the test dataset, whereas there can be multiple valid endings for a story. Therefore, we use the following metrics instead.
DIST (Distinct): Distinct-1,2,3 is the number of distinct unigrams, bigrams and trigrams in the generated responses divided by the total number of unigrams, bigrams and trigrams. We denote these metrics as DIST-1,2,3 in the result tables. Higher Distinct scores indicate higher diversity in the generated outputs.
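The DIST-n computation can be sketched as follows (the function name is ours):

```python
def distinct_n(sentences, n):
    """DIST-n: distinct n-grams across all generated outputs divided by
    the total number of n-grams."""
    total, distinct = 0, set()
    for sent in sentences:
        toks = sent.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        distinct.update(grams)
    return len(distinct) / total if total else 0.0

outs = ["he was sad", "he was happy"]
d1 = distinct_n(outs, 1)   # 4 distinct unigrams out of 6 total
```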
Story-Cloze: Since it is difficult to perform human evaluation on all the stories, we use the Story-Cloze task Mostafazadeh et al. (2017) to create a metric both to pick our best model and to evaluate its efficacy against Seq2Seq and its variants. This metric measures the semantic relevance of the generated ending with respect to the context. In the Story-Cloze task, given two endings to a story, the task is to pick the correct one; we can therefore use it to identify the better of two endings. To do so, we fine-tune BERT Devlin et al. (2018) to identify the true ending between two candidates. Positive examples are obtained from the Story-Cloze dataset, while negative examples are obtained by randomly sampling endings from other stories. We fine-tune BERT in the two-sentence setting, providing the context as the first sentence and the final sentence as the second. We pick the ending with the greater probability (from BERT's output head) of being a true ending as the winner. With this approach we obtained a Story-Cloze test accuracy of 72%.
We now use this fine-tuned model to compare the IE + GA model with our models, selecting the winner based on the probability assigned by the fine-tuned BERT model.
We measure the performance of our models through automatic evaluation metrics as well as human evaluation. We use DIST-1, DIST-2 and DIST-3 to measure the diversity of our outputs. Additionally, we built an automatic evaluation system using BERT and the Story-Cloze task, following Fan et al. (2018), in order to compare our models against state-of-the-art models like the IE + GA model. We also perform human evaluation on the stories generated by our model to get an overall sense of its performance.
|Model||DIST-1||DIST-2||DIST-3|
|IE + GA||0.026||0.130||0.263|
|Seq2Seq + ITF||0.063||0.281||0.517|
|Keyphrase Add + ITF||0.065||0.289||0.539|
|Context Concat + ITF||0.065||0.294||0.558|
|Coverage Loss + ITF||0.066||0.315||0.590|
|Keyphrase Loss + ITF||0.068||0.318||0.588|
|Model||Story-Cloze Win %|
|Seq2Seq + ITF||54.1|
|Keyphrase Add + ITF||52.9|
|Context Concat + ITF||55.8|
|Coverage Loss + ITF||54.7|
|Keyphrase Loss + ITF||55.9|
We measure the performance of the models using an automated Story-Cloze classifier, which compares the outputs of each model with the outputs of the IE + GA model.
5.5.1 Model Comparison and Ablation Study
From Table 3, we observe that the Seq2Seq model and the incremental encoding + graph attention (IE + GA) model have the worst diversity. Although the IE + GA model has been shown to achieve a good BLEU score, it does not do as well on our automated metrics DIST-1, 2 and 3, because it has learnt to generate endings that match the output distribution as a whole instead of generating story-specific endings.
As expected, Seq2Seq + ITF loss model greatly outperforms the vanilla Seq2Seq model. As does the Keyphrase loss, showing that these models are indeed able to focus on different context words resulting in more diverse generations.
The Story-Cloze based performance of the models is presented in Table 4. The Keyphrase Loss + ITF model outperforms all the other models on both the diversity and Story-Cloze metrics; hence, we select it as our best model in further discussions. As an ablation study, we run the Keyphrase Loss model with the MLE loss instead of the ITF loss. This model performs worse than its ITF counterpart but still considerably better than the Seq2Seq model. We also note that the diversity of the Keyphrase Loss + ITF model is greater than that of the Seq2Seq + ITF model and of the Keyphrase Loss model without ITF. This shows that the combination of the keyphrase attention loss and the ITF loss achieves better performance than either component by itself.
5.5.2 Effect of varying number of keyphrases
In order to better understand the effect of keyphrases on the diversity and relevance of story endings, we ran the Coverage Loss model with varying number of keyphrases. Table 5 shows the results of the experiment. We see that both Story-Cloze loss and DIST-1,2,3 are low when we use 1 keyphrase and also when we use all the keyphrases. This is expected, since in the case of 1 keyphrase, the model has very little keyphrase related information. In the other extreme case, providing all keyphrases covers a large proportion of the original context itself, and thus does not provide any extra benefit. We see good performance within the range of 3-5 keyphrases, where using 5 keyphrases gives the best diversity and 3 keyphrases gives the best Story-Cloze score. Informed by this experiment, we use 5 keyphrases in all our other experiments.
5.6 Human Evaluation
Since automatic metrics cannot capture all qualitative aspects of the models, we performed a human evaluation study to compare them. We randomly selected 50 story contexts from the test set and showed them to three annotators. The annotators see the story context and the story endings generated by our best model and the baseline IE + GA model, in a random order. They are asked to select the better ending based on three criteria: 1) Relevance - the ending should be appropriate and reasonable given the story context; 2) Interestingness - the more interesting ending should be preferred; 3) Fluency - endings should be natural English and free of errors. We found that both models were preferred 50% of the time, that is, each model was picked for 25 stories. From a manual analysis of the human evaluation, we found that our model was often selected over the baseline for generating interesting endings, but was equally often penalized for losing relevance in some story endings. We discuss this aspect in more detail in section 5.7.
5.7 Qualitative Analysis
In Table 6, we show some example generations of our model and the baselines. From examples 1 and 2, it can be seen that the baseline models produce generic story endings without focusing much on the context and keyphrases of the story. Our model, by contrast, conditions on words like "pageant" in the story context and includes them in the output even though they are rare in the corpus. Another point to note is that our model tends to include more proper nouns and entities in its output, like "alicia" and "megan", instead of generic words like "he" and "she". However, our model is penalised a few times for being too adventurous, because it tends to generate rarer outputs based on the context. In example 3, it gets half of the output correct up to "katie was devastated", but the other half, "dumped her boyfriend", although more interesting than the baselines' outputs, is not relevant to the story context; the model also incorrectly refers to katie with the pronoun "himself". In example 4, our model's output is quite relevant and interesting apart from the token "catnip", for which it is penalized in human evaluation. Hence, although our model generates more interesting outputs, further work is needed to ensure that 1) the generated outputs entail the story context at both the semantic and token level, and 2) the generated output is logically sound and consistent.
In this paper we presented several models to overcome the problem of generic responses produced by state-of-the-art story generation systems. We have shown, both quantitatively and qualitatively, that our models achieve meaningful improvements over the baselines.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Baheti et al. (2018) Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan. 2018. Generating more interesting responses in neural conversation models with distributional constraints. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3970–3980, Brussels, Belgium. Association for Computational Linguistics.
- Chen et al. (2018) J. S. Chen, Jiaao Chen, and Zhou Yu. 2018. Incorporating structured commonsense knowledge in story completion. CoRR, abs/1811.00625.
- Clark et al. (2018) Elizabeth Clark, Yangfeng Ji, and Noah A Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250–2260.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Doersch (2016) Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
- Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
- Guan et al. (2018) Jian Guan, Yansen Wang, and Minlie Huang. 2018. Story ending generation with incremental encoding and commonsense knowledge. arXiv preprint arXiv:1808.10113.
- Guan et al. (2019) Jian Guan, Yansen Wang, and Minlie Huang. 2019. Story ending generation with incremental encoding and commonsense knowledge. CoRR, abs/1808.10113.
- Hu et al. (2017a) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017a. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1587–1596, International Convention Centre, Sydney, Australia. PMLR.
- Hu et al. (2017b) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017b. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1587–1596. JMLR. org.
- Huang et al. (2018) Qiuyuan Huang, Zhe Gan, Asli Celikyilmaz, Dapeng Wu, Jianfeng Wang, and Xiaodong He. 2018. Hierarchically structured reinforcement learning for topically coherent visual story generation. arXiv preprint arXiv:1805.08191.
- Jain et al. (2017) Parag Jain, Priyanka Agrawal, Abhijit Mishra, Mohak Sukhwani, Anirban Laha, and Karthik Sankaranarayanan. 2017. Story generation from sequence of independent short descriptions. arXiv preprint arXiv:1707.05501.
- Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
- Li et al. (2016b) Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.
- Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.
- Li et al. (2018) Zhongyang Li, Xiao Ding, and Ting Liu. 2018. Generating reasonable and diversified story ending using sequence to sequence model with adversarial training. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1033–1043, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Martin et al. (2018) Lara J. Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O. Riedl. 2018. Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Mostafazadeh et al. (2017) Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51.
- Nakamura et al. (2018) Ryo Nakamura, Katsuhito Sudoh, Koichiro Yoshino, and Satoshi Nakamura. 2018. Another diversity-promoting objective function for neural dialogue generation. arXiv preprint arXiv:1811.08100.
- Peng et al. (2018) Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. 2018. Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, pages 43–49.
- Rose et al. (2010) Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic Keyword Extraction from Individual Documents, pages 1 – 20.
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
- Shao et al. (2017) Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2210–2219, Copenhagen, Denmark. Association for Computational Linguistics.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Vijayakumar et al. (2018) Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7371–7379.
- Xing et al. (2016) Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2016. Topic augmented neural response generation with a joint attention mechanism. arXiv preprint arXiv:1606.08340, 2(2).
- Xu et al. (2018) Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. 2018. Diversity-promoting gan: A cross-entropy based generative adversarial network for diversified text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3940–3949.
- Yao et al. (2018) Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2018. Plan-and-write: Towards better automatic storytelling. arXiv preprint arXiv:1811.05701.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.
- Zhou et al. (2018) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Thirty-Second AAAI Conference on Artificial Intelligence.