Incremental Transformer with Deliberation Decoder for Document Grounded Conversations

07/20/2019 ∙ by Zekang Li, et al. ∙ Northeastern University, Tencent

Document Grounded Conversations is the task of generating dialogue responses when chatting about the content of a given document. Document knowledge plays a critical role in this task, yet existing dialogue models do not exploit this kind of knowledge effectively enough. In this paper, we propose a novel Transformer-based architecture for multi-turn document grounded conversations. In particular, we devise an Incremental Transformer to encode multi-turn utterances along with knowledge in related documents. Motivated by the human cognitive process, we design a two-pass decoder (Deliberation Decoder) to improve context coherence and knowledge correctness. Our empirical study on a real-world Document Grounded Dataset shows that responses generated by our model significantly outperform competitive baselines on both context coherence and knowledge relevance.




1 Introduction

The past few years have witnessed the rapid development of dialogue systems. Based on the sequence-to-sequence framework Sutskever et al. (2014), most models are trained in an end-to-end manner with large corpora of human-to-human dialogues and have obtained impressive success Shang et al. (2015); Vinyals and Le (2015); Li et al. (2016); Serban et al. (2016). However, there is still a long way to go to reach the ultimate goal of dialogue systems, which is to talk like humans. One essential capability for achieving this goal is the ability to make use of knowledge.

There are several works on dialogue systems exploiting knowledge. Mem2Seq Madotto et al. (2018) incorporates structured knowledge into end-to-end task-oriented dialogue. Liu et al. (2018) introduce fact-matching and knowledge-diffusion to generate meaningful, diverse and natural responses using structured knowledge triplets. Ghazvininejad et al. (2018), Parthasarathi and Pineau (2018), Yavuz et al. (2018), Dinan et al. (2018) and Lo and Chen (2019) apply unstructured text facts in open-domain dialogue systems. These works mainly focus on integrating factoid knowledge into dialogue systems, but factoid knowledge requires a lot of work to build up and is limited to expressing precise facts. Documents as a knowledge source provide a wide spectrum of knowledge, including but not limited to factoids, event updates, and subjective opinions. Recently, intensive research has been devoted to using documents as knowledge sources for Question-Answering Chen et al. (2017); Huang et al. (2018); Yu et al. (2018); Rajpurkar et al. (2018); Reddy et al. (2018).

Document Grounded Conversation is the task of generating natural dialogue responses when chatting about the content of a specific document. This task requires integrating document knowledge with the multi-turn dialogue history. Unlike previous knowledge grounded dialogue systems, Document Grounded Conversations use documents as the knowledge source and hence can draw on a wide spectrum of knowledge. The task also differs from document QA, since a contextually consistent conversational response must be generated. To address the Document Grounded Conversation task, it is important to: 1) exploit the document knowledge that is relevant to the conversation; 2) develop a unified representation combining multi-turn utterances with the relevant document knowledge.

In this paper, we propose a novel and effective Transformer-based Vaswani et al. (2017) architecture for Document Grounded Conversations, named Incremental Transformer with Deliberation Decoder. The encoder employs a transformer architecture to incrementally encode multi-turn history utterances and incorporates document knowledge into the multi-turn context encoding process. The decoder is a two-pass decoder similar to the Deliberation Network in Neural Machine Translation Xia et al. (2017), which is designed to improve the context coherence and knowledge correctness of the responses. The first-pass decoder focuses on contextual coherence, while the second-pass decoder refines the result of the first-pass decoder by consulting the relevant document knowledge, and hence increases the knowledge relevance and correctness. This is motivated by the human cognitive process. In real-world conversations, people usually first draft how to respond to the previous utterance, and then refine the answer or even raise questions by consulting background knowledge.

We test the effectiveness of our proposed model on the Document Grounded Conversations Dataset Zhou et al. (2018). Experimental results show that our model generates responses with better context coherence and knowledge relevance. In some cases, document knowledge is even used to guide the following conversation. Both automatic and manual evaluations show that our model substantially outperforms the competitive baselines.

Our contributions are as follows:

  • We build a novel Incremental Transformer to incrementally encode multi-turn utterances together with document knowledge.

  • We are the first to apply a two-pass decoder to generate responses for document grounded conversations. The two decoders focus on context coherence and knowledge correctness respectively.

2 Approach

2.1 Problem Statement

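Our goal is to incorporate the relevant document knowledge into multi-turn conversations. Formally, let $U = u^{(1)}, \dots, u^{(k)}, \dots, u^{(K)}$ be a whole conversation composed of $K$ utterances. We use $u^{(k)} = u^{(k)}_1, \dots, u^{(k)}_i, \dots, u^{(k)}_I$ to denote the $k$-th utterance containing $I$ words, where $u^{(k)}_i$ denotes the $i$-th word in the $k$-th utterance. For each utterance $u^{(k)}$, likewise, there is a specified relevant document $s^{(k)} = s^{(k)}_1, \dots, s^{(k)}_j, \dots, s^{(k)}_J$, which represents the document related to the $k$-th utterance and contains $J$ words. We define the document grounded conversations task as generating a response $u^{(k+1)}$ given its related document $s^{(k+1)}$ and the previous $k$ utterances $U_{\le k}$ with related documents $S_{\le k}$, where $U_{\le k} = u^{(1)}, \dots, u^{(k)}$ and $S_{\le k} = s^{(1)}, \dots, s^{(k)}$. Note that $s^{(k)}, s^{(k+1)}, \dots, s^{(k+n)}$ may be the same.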

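Therefore, the probability of generating the response $u^{(k+1)}$ is computed as:

$$P(u^{(k+1)} \mid U_{\le k}, S_{\le k+1}; \theta) = \prod_{i=1}^{I} P(u^{(k+1)}_i \mid U_{\le k}, S_{\le k+1}, u^{(k+1)}_{<i}; \theta) \tag{1}$$

where $u^{(k+1)}_{<i} = u^{(k+1)}_1, \dots, u^{(k+1)}_{i-1}$.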
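As a small illustration of this word-by-word factorization, the sketch below sums per-word log-probabilities to obtain the probability of a whole response. The function name and the probability values are hypothetical; in the model they would come from the decoder's softmax output.

```python
import math

def response_log_prob(token_probs):
    """Log-probability of a response under the chain-rule factorization.

    token_probs[i] stands for the conditional probability of the i-th word
    given the context, the documents and the previously generated words.
    The values passed in below are made up for illustration.
    """
    return sum(math.log(p) for p in token_probs)

# Hypothetical conditional probabilities for a three-word response.
probs = [0.5, 0.25, 0.8]
log_p = response_log_prob(probs)
joint = math.exp(log_p)  # equals the product of the three conditionals
```

Working in log space avoids numerical underflow when responses are long.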

2.2 Model Description

Figure 1: The framework of Incremental Transformer with Deliberation Decoder for Document Grounded Conversations.
Figure 2: (1) Detailed architecture of model components. (a) The Self-Attentive Encoder (SA). (b) Incremental Transformer (ITE). (c) Deliberation Decoder (DD). (2) Simplified version of our proposed model used to verify the validity of our proposed Incremental Transformer Encoder and Deliberation Decoder. (d) Knowledge-Attention Transformer (KAT). (e) Context-Knowledge-Attention Decoder (CKAD).

Figure 1 shows the framework of the proposed Incremental Transformer with Deliberation Decoder. Please refer to Figure 2 (1) for more details. It consists of three components:

1) Self-Attentive Encoder (SA) (in orange) is a transformer encoder as described in Vaswani et al. (2017), which encodes the document knowledge and the current utterance independently.

2) Incremental Transformer Encoder (ITE) (on the top) is a unified transformer encoder which encodes multi-turn utterances with knowledge representation using an incremental encoding scheme. This module takes the previous utterances $u^{(i)}$ and the SA representation of the document $s^{(i)}$ as input, and uses an attention mechanism to incrementally build up the representation of relevant context and document knowledge.

3) Deliberation Decoder (DD) (on the bottom) is a two-pass unified transformer decoder for better generating the next response. The first-pass decoder takes the SA representation of the current utterance $u^{(k)}$ and the ITE output as input, and mainly relies on conversation context for response generation. The second-pass decoder takes the SA representation of the first-pass result and the SA representation of the relevant document $s^{(k+1)}$ as input, and uses document knowledge to further refine the response.

Self-Attentive Encoder

As document knowledge often includes several sentences, it’s important to capture long-range dependencies and identify relevant information. We use multi-head self-attention Vaswani et al. (2017) to compute the representation of document knowledge.

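As shown in Figure 2 (a), we use a self-attentive encoder to compute the representation of the related document knowledge $s^{(k)}$. The input ($\mathrm{In}_s^{(k)}$) of the encoder is a sequence of document word embeddings with positional encoding added Vaswani et al. (2017):

$$\mathrm{In}_s^{(k)} = [\mathbf{s}^{(k)}_1, \dots, \mathbf{s}^{(k)}_J] \tag{2}$$

$$\mathbf{s}^{(k)}_j = \mathbf{e}_{s_j} + \mathrm{PE}(j) \tag{3}$$

where $\mathbf{e}_{s_j}$ is the word embedding of $s^{(k)}_j$ and $\mathrm{PE}(\cdot)$ denotes the positional encoding function.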
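The following is a minimal sketch of how position-aware inputs of this kind can be built. It assumes the sinusoidal positional encoding of Vaswani et al. (2017); the source only says that a positional encoding function is used, so that choice, the function names and the toy embedding values are illustrative assumptions.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017) for one position."""
    pe = []
    for i in range(d_model):
        # Dimensions 2m and 2m+1 share the frequency 1 / 10000^(2m / d_model).
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def encoder_input(word_embeddings):
    """Add the positional encoding of position j to the j-th word embedding.

    Positions are counted from 0 here, which is an implementation choice.
    """
    d_model = len(word_embeddings[0])
    return [
        [e + p for e, p in zip(emb, positional_encoding(j, d_model))]
        for j, emb in enumerate(word_embeddings)
    ]

# Toy 4-dimensional embeddings for a two-word document (hypothetical values).
embeddings = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
inputs = encoder_input(embeddings)
```

At position 0 every sine term is 0 and every cosine term is 1, so the first input row is simply the embedding shifted by the pattern 0, 1, 0, 1.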

The Self-Attentive Encoder contains a stack of $N_x$ identical layers. Each layer has two sub-layers. The first sub-layer is a multi-head self-attention ($\mathrm{MultiHead}$) Vaswani et al. (2017). $\mathrm{MultiHead}(Q, K, V)$ is a multi-head attention function that takes a query matrix $Q$, a key matrix $K$, and a value matrix $V$ as input. In the current case, $Q = K = V$, which is why it is called self-attention. The second sub-layer is a simple, position-wise fully connected feed-forward network ($\mathrm{FFN}$). This $\mathrm{FFN}$ consists of two linear transformations with a ReLU activation in between Vaswani et al. (2017):

$$A^{(1)} = \mathrm{MultiHead}(\mathrm{In}_s^{(k)}, \mathrm{In}_s^{(k)}, \mathrm{In}_s^{(k)}) \tag{4}$$

$$D^{(1)} = \mathrm{FFN}(A^{(1)}) \tag{5}$$

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \tag{6}$$

where $A^{(1)}$ is the hidden state computed by multi-head attention at the first layer, and $D^{(1)}$ denotes the representation of $s^{(k)}$ after the first layer. Note that residual connection and layer normalization are used in each sub-layer, which are omitted in the presentation for simplicity. Please refer to Vaswani et al. (2017) for more details.

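For each layer, repeat this process:

$$A^{(n)} = \mathrm{MultiHead}(D^{(n-1)}, D^{(n-1)}, D^{(n-1)}) \tag{7}$$

$$D^{(n)} = \mathrm{FFN}(A^{(n)}) \tag{8}$$

where $n = 1, \dots, N_x$ and $D^{(0)} = \mathrm{In}_s^{(k)}$.

We use $SA_s(\cdot)$ to denote this whole process:

$$d^{(k)} = D^{(N_x)} = SA_s(s^{(k)}) \tag{9}$$

where $d^{(k)}$ is the final representation for the document knowledge $s^{(k)}$.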
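The layer stack described above can be sketched in plain Python as follows. This is a simplified illustration rather than the authors' implementation: a single attention head stands in for multi-head attention, residual connections and layer normalization are left out (as in the presentation above), and the weights and inputs are toy values chosen only to make the example run.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention (stand-in for multi-head)."""
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

def self_attentive_encoder(inputs, layers):
    """Apply self-attention followed by the FFN once per layer."""
    d = inputs
    for w1, b1, w2, b2 in layers:
        a = attention(d, d, d)  # query, key and value are the same matrix
        d = ffn(a, w1, b1, w2, b2)
    return d

# Toy setup: 3 tokens, model size 2, two layers with identity FFN weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
layers = [(identity, zeros, identity, zeros)] * 2
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = self_attentive_encoder(tokens, layers)
```

The output keeps one row per input token, so the encoded document can later serve as keys and values for other attention sub-layers.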

Similarly, for each utterance $u^{(k)}$, we use $\mathrm{In}_u^{(k)} = [\mathbf{u}^{(k)}_1, \dots, \mathbf{u}^{(k)}_I]$ to represent the sequence of position-aware word embeddings. Then a Self-Attentive Encoder of the same architecture is used to compute the representation of the current utterance $u^{(k)}$, and we use $SA_u(u^{(k)})$ to denote this encoding result. The Self-Attentive Encoder is also used to encode the document $s^{(k+1)}$ and the first-pass decoding results in the second pass of the decoder. Note that $SA_s$ and $SA_u$ have the same architecture but different parameters. More details are given in the following sections.

Incremental Transformer Encoder

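To encode multi-turn document grounded utterances effectively, we design an Incremental Transformer Encoder. Incremental Transformer uses multi-head attention to incorporate document knowledge and context into the current utterance’s encoding process. This process can be stated recursively as follows:

$$c^{(k)} = \mathrm{ITE}(c^{(k-1)}, d^{(k)}, \mathrm{In}_u^{(k)}) \tag{10}$$

where $\mathrm{ITE}(\cdot)$ denotes the encoding function, $c^{(k)}$ denotes the context state after encoding utterance $u^{(k)}$, $c^{(k-1)}$ is the context state after encoding the last utterance $u^{(k-1)}$, $d^{(k)}$ is the representation of document $s^{(k)}$ and $\mathrm{In}_u^{(k)}$ is the embedding of the current utterance $u^{(k)}$.

As shown in Figure 2 (b), we use a stack of $N_u$ identical layers to encode $u^{(k)}$. Each layer consists of four sub-layers. The first sub-layer is a multi-head self-attention:

$$B^{(n)} = \mathrm{MultiHead}(C^{(n-1)}, C^{(n-1)}, C^{(n-1)}) \tag{11}$$

where $n = 1, \dots, N_u$, $C^{(n-1)}$ is the output of the last layer and $C^{(0)} = \mathrm{In}_u^{(k)}$. The second sub-layer is a multi-head knowledge attention:

$$E^{(n)} = \mathrm{MultiHead}(B^{(n)}, d^{(k)}, d^{(k)}) \tag{12}$$

The third sub-layer is a multi-head context attention:

$$F^{(n)} = \mathrm{MultiHead}(E^{(n)}, c^{(k-1)}, c^{(k-1)}) \tag{13}$$

where $c^{(k-1)}$ is the representation of the previous utterances. This is why we call the encoder "Incremental Transformer". The fourth sub-layer is a position-wise fully connected feed-forward network:

$$C^{(n)} = \mathrm{FFN}(F^{(n)}) \tag{14}$$

We use $c^{(k)}$ to denote the final representation at the $N_u$-th layer:

$$c^{(k)} = C^{(N_u)} \tag{15}$$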
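The incremental encoding scheme can be sketched as a loop over dialogue turns, where the context state produced for one utterance is attended to when encoding the next. The code below is an illustrative simplification under several assumptions: single-head attention replaces multi-head attention, residual connections are kept but layer normalization is omitted, the feed-forward sub-layer is reduced to a ReLU, and the initial context `init_context` is a hypothetical placeholder because the source does not say how the first turn is handled.

```python
import math

def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of row vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v_row[j] for e, v_row in zip(exps, v))
                    for j in range(len(v[0]))])
    return out

def add(x, y):
    """Residual connection: element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def ite_layer(c_in, doc_repr, prev_context):
    """One Incremental Transformer layer with its four sub-layers."""
    b = add(c_in, attention(c_in, c_in, c_in))              # 1) self-attention
    e = add(b, attention(b, doc_repr, doc_repr))            # 2) knowledge attention
    f = add(e, attention(e, prev_context, prev_context))    # 3) context attention
    return add(f, [[max(0.0, x) for x in row] for row in f])  # 4) FFN reduced to ReLU

def encode_dialogue(utterances, documents, init_context, num_layers=3):
    """Fold each (utterance, document) pair into the running context state."""
    context = init_context
    for utt, doc in zip(utterances, documents):
        c = utt
        for _ in range(num_layers):
            c = ite_layer(c, doc, context)
        context = c  # context state passed on to the next turn
    return context

# Toy data: two utterances (2 and 3 tokens) grounded in the same document.
utterances = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0], [0.0, 1.0]]]
documents = [[[1.0, 1.0], [0.0, 2.0]]] * 2
init_context = [[0.0, 0.0]]  # hypothetical empty context for the first turn
final_context = encode_dialogue(utterances, documents, init_context)
```

The returned state has one row per token of the last encoded utterance, yet through the context attention it carries information from all earlier turns and their documents.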


Deliberation Decoder

Motivated by the real-world human cognitive process, we design a Deliberation Decoder containing two decoding passes to improve the knowledge relevance and context coherence. The first-pass decoder takes the representation of the current utterance and the context as input and focuses on generating contextually coherent responses. The second-pass decoder takes the representation of the first-pass decoding results and the related document as input and focuses on increasing knowledge usage and guiding the following conversation within the scope of the given document.

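When generating the $i$-th response word $u^{(k+1)}_i$, we have the generated words $u^{(k+1)}_{<i}$ as input Vaswani et al. (2017). We use $\mathrm{In}_r^{(k+1)}$ to denote the matrix representation of $u^{(k+1)}_{<i}$ as follows:

$$\mathrm{In}_r^{(k+1)} = [\mathbf{u}^{(k+1)}_0, \mathbf{u}^{(k+1)}_1, \dots, \mathbf{u}^{(k+1)}_{i-1}] \tag{16}$$

where $\mathbf{u}^{(k+1)}_0$ is the vector representation of the sentence-start token.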

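As shown in Figure 2 (c), the Deliberation Decoder consists of a first-pass decoder and a second-pass decoder. These two decoders have the same architecture but different inputs for their sub-layers. Both decoders are composed of a stack of $N_y$ identical layers. Each layer has four sub-layers. For the first-pass decoder, the first sub-layer is a multi-head self-attention:

$$G_1^{(n)} = \mathrm{MultiHead}(R_1^{(n-1)}, R_1^{(n-1)}, R_1^{(n-1)}) \tag{17}$$

where $n = 1, \dots, N_y$, $R_1^{(n-1)}$ is the output of the previous layer, and $R_1^{(0)} = \mathrm{In}_r^{(k+1)}$. The second sub-layer is a multi-head context attention:

$$H_1^{(n)} = \mathrm{MultiHead}(G_1^{(n)}, c^{(k)}, c^{(k)}) \tag{18}$$

where $c^{(k)}$ is the representation of the context $u_{\le k}$. The third sub-layer is a multi-head utterance attention:

$$M_1^{(n)} = \mathrm{MultiHead}(H_1^{(n)}, SA_u(u^{(k)}), SA_u(u^{(k)})) \tag{19}$$

where $SA_u(\cdot)$ is a Self-Attentive Encoder which encodes the latest utterance $u^{(k)}$. Eq. (18) mainly encodes the context and document knowledge relevant to the latest utterance, while Eq. (19) encodes the latest utterance directly. We hope optimal performance can be achieved by combining both.

The fourth sub-layer is a position-wise fully connected feed-forward network:

$$R_1^{(n)} = \mathrm{FFN}(M_1^{(n)}) \tag{20}$$

After $N_y$ layers, we use softmax to get the word probabilities decoded by the first-pass decoder:

$$P(\hat{u}^{(k+1)}_{(1)}) = \mathrm{softmax}(R_1^{(N_y)}) \tag{21}$$

where $\hat{u}^{(k+1)}_{(1)}$ is the response decoded by the first-pass decoder. For the second-pass decoder:

$$G_2^{(n)} = \mathrm{MultiHead}(R_2^{(n-1)}, R_2^{(n-1)}, R_2^{(n-1)}) \tag{22}$$

$$H_2^{(n)} = \mathrm{MultiHead}(G_2^{(n)}, d^{(k+1)}, d^{(k+1)}) \tag{23}$$

$$M_2^{(n)} = \mathrm{MultiHead}(H_2^{(n)}, SA_u(\hat{u}^{(k+1)}_{(1)}), SA_u(\hat{u}^{(k+1)}_{(1)})) \tag{24}$$

$$R_2^{(n)} = \mathrm{FFN}(M_2^{(n)}) \tag{25}$$

$$P(\hat{u}^{(k+1)}_{(2)}) = \mathrm{softmax}(R_2^{(N_y)}) \tag{26}$$

where $R_2^{(n-1)}$ is the counterpart to $R_1^{(n-1)}$ in the second-pass decoder, referring to the output of the previous layer. $d^{(k+1)}$ is the representation of document $s^{(k+1)}$ using the Self-Attentive Encoder, and $\hat{u}^{(k+1)}_{(2)}$ is the output words after the second-pass decoder.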
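The two passes share one layer structure and differ only in what the second and third sub-layers attend to. The sketch below illustrates that idea under stated simplifications: single-head attention, residual connections without layer normalization, a ReLU in place of the full feed-forward network, and no output softmax. The name `run_pass` and all toy values are hypothetical, and the first-pass hidden states are reused directly as the draft representation, whereas the model described above re-encodes the decoded first-pass words with a Self-Attentive Encoder.

```python
import math

def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of row vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v_row[j] for e, v_row in zip(exps, v))
                    for j in range(len(v[0]))])
    return out

def add(x, y):
    """Residual connection: element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def decoder_layer(r_in, memory_a, memory_b):
    """Four sub-layers shared by both passes; only the memories differ."""
    g = add(r_in, attention(r_in, r_in, r_in))    # self-attention over generated words
    h = add(g, attention(g, memory_a, memory_a))  # pass 1: context, pass 2: document
    m = add(h, attention(h, memory_b, memory_b))  # pass 1: utterance, pass 2: draft
    return add(m, [[max(0.0, x) for x in row] for row in m])  # FFN reduced to ReLU

def run_pass(prefix, memory_a, memory_b, num_layers=3):
    r = prefix
    for _ in range(num_layers):
        r = decoder_layer(r, memory_a, memory_b)
    return r

# Toy representations (hypothetical values, model size 2).
prefix = [[0.1, 0.2]]                    # embedding of the sentence-start token
context = [[1.0, 0.0], [0.0, 1.0]]       # ITE output for the dialogue history
utterance = [[0.5, 0.5]]                 # encoding of the latest utterance
document = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]  # encoding of the next document

first = run_pass(prefix, context, utterance)  # pass 1: coherence with the context
draft = first                                 # stand-in for the re-encoded draft
second = run_pass(prefix, document, draft)    # pass 2: refinement with the document
```

Because the prefix holds only words that were already generated, no causal mask is needed in this toy setting; a batched training implementation would mask future positions in the self-attention sub-layer.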


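In contrast to the original Deliberation Network Xia et al. (2017), which proposes a complex joint learning framework using the Monte Carlo method, we minimize the following loss as Xiong et al. (2018) do:

$$L_{mle} = L_{mle1} + L_{mle2} \tag{27}$$

$$L_{mle1} = -\sum_{k=1}^{K} \sum_{i=1}^{I} \log P(\hat{u}^{(k+1)}_{(1)i}) \tag{28}$$

$$L_{mle2} = -\sum_{k=1}^{K} \sum_{i=1}^{I} \log P(\hat{u}^{(k+1)}_{(2)i}) \tag{29}$$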
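A minimal sketch of such a training objective is given below, assuming the loss is the sum of the negative log-likelihoods that the two passes assign to the gold response words. The function names and probability values are hypothetical and serve only to show the arithmetic.

```python
import math

def nll(gold_token_probs):
    """Negative log-likelihood of the gold words under one decoding pass."""
    return -sum(math.log(p) for p in gold_token_probs)

def deliberation_loss(pass1_probs, pass2_probs):
    """Joint loss: both passes are trained at once with summed NLL terms."""
    return nll(pass1_probs) + nll(pass2_probs)

# Hypothetical probabilities assigned to the gold words of a two-word response.
loss = deliberation_loss([0.5, 0.5], [0.25, 1.0])
```

Summing the two terms lets a single backward pass update both decoders, which avoids the sampling step needed by a Monte Carlo based objective.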


3 Experiments

3.1 Dataset

We evaluate our model using the Document Grounded Conversations Dataset Zhou et al. (2018). There are 72922 utterances for training, 3626 utterances for validation and 11577 utterances for testing. The utterances can be either casual chats or document grounded. Note that we treat consecutive utterances by the same person as one utterance. For example, we treat A: Hello! B: Hi! B: How’s it going? as A: Hello! B: Hi! How’s it going?. A related document is given for every several consecutive utterances, which may contain the movie name, cast, introduction, ratings, and some scenes. The average length of documents is about 200. Please refer to Zhou et al. (2018) for more details.
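The merging of same-speaker utterances can be sketched as follows. The function name and the choice of a single space as the joining separator are assumptions made for illustration.

```python
def merge_consecutive(turns):
    """Merge adjacent utterances by the same speaker into one utterance."""
    merged = []
    for speaker, text in turns:
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous turn: append to that utterance.
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged

turns = [("A", "Hello!"), ("B", "Hi!"), ("B", "How's it going?")]
merged = merge_consecutive(turns)
# merged == [("A", "Hello!"), ("B", "Hi! How's it going?")]
```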

| Model | PPL | BLEU(%) | Fluency | Knowledge Relevance | Context Coherence |
|---|---|---|---|---|---|
| Seq2Seq without knowledge | 80.93 | 0.38 | 1.62 | 0.18 | 0.54 |
| HRED without knowledge | 80.84 | 0.43 | 1.25 | 0.18 | 0.30 |
| Transformer without knowledge | 87.32 | 0.36 | 1.60 | 0.29 | 0.67 |
| Seq2Seq (+knowledge) | 78.47 | 0.39 | 1.50 | 0.22 | 0.61 |
| HRED (+knowledge) | 79.12 | 0.77 | 1.56 | 0.35 | 0.47 |
| Wizard Transformer | 70.30 | 0.66 | 1.62 | 0.47 | 0.56 |
| ITE+DD (ours) | 15.11 | 0.95 | 1.67 | 0.56 | 0.90 |
| ITE+CKAD (ours) | 64.97 | 0.86 | 1.68 | 0.50 | 0.82 |
| KAT (ours) | 65.36 | 0.58 | 1.58 | 0.33 | 0.78 |

Table 1: Automatic evaluation and manual evaluation results for baselines and our proposed models.

3.2 Baselines

We compare our proposed model with the following state-of-the-art baselines:

Models not using document knowledge:

Seq2Seq: A simple encoder-decoder model Shang et al. (2015); Vinyals and Le (2015) with global attention Luong et al. (2015). We concatenate the context utterances into one long sentence as input.

HRED: A hierarchical encoder-decoder model Serban et al. (2016), which is composed of a word-level LSTM for each sentence and a sentence-level LSTM connecting utterances.

Transformer: The state-of-the-art NMT model based on multi-head attention Vaswani et al. (2017). We concatenate the context utterances into one long sentence as its input.

Models using document knowledge:

Seq2Seq (+knowledge) and HRED (+knowledge) are based on Seq2Seq and HRED respectively. They both concatenate document knowledge representation and last decoding output embedding as input when decoding. Please refer to Zhou et al. (2018) for more details.

Wizard Transformer: A Transformer-based model for multi-turn open-domain dialogue with unstructured text facts Dinan et al. (2018). It concatenates context utterances and text facts to a long sequence as input. We replace the text facts with document knowledge.

Here, we also conduct an ablation study to illustrate the validity of our proposed Incremental Transformer Encoder and Deliberation Decoder.

ITE+CKAD: It uses Incremental Transformer Encoder (ITE) as encoder and Context-Knowledge-Attention Decoder (CKAD) as shown in Figure 2 (e). This setup is to test the validity of the deliberation decoder.

Knowledge-Attention Transformer (KAT): As shown in Figure 2 (d), the encoder of this model is a simplified version of the Incremental Transformer Encoder (ITE) that has no context-attention sub-layer. We concatenate the context utterances into one long sentence as its input. The decoder of the model is a simplified Context-Knowledge-Attention Decoder (CKAD), which has no context-attention sub-layer either. This setup tests how effectively the context is exploited in the full model.

| Model | Knowledge Relevance(%) | Context Coherence(%) |
|---|---|---|
| Wizard | 64/25/11 | 58/28/14 |
| ITE+CKAD | 67/16/17 | 40/37/23 |
| ITE+DD | 64/16/20 | 38/34/28 |

Table 2: The percent(%) of score (0/1/2) of Knowledge Relevance and Context Coherence for Wizard Transformer, ITE+CKAD and ITE+DD.
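As a worked check, the score distributions in Table 2 can be turned back into mean scores, assuming that the averages in Table 1 are plain means of the 0/1/2 scores. The helper name below is hypothetical.

```python
def mean_score(percentages):
    """Mean of a 0/1/2 score given the percentage of responses at each score."""
    return sum(score * pct for score, pct in enumerate(percentages)) / 100

table2 = {
    "Wizard": {"relevance": (64, 25, 11), "coherence": (58, 28, 14)},
    "ITE+CKAD": {"relevance": (67, 16, 17), "coherence": (40, 37, 23)},
    "ITE+DD": {"relevance": (64, 16, 20), "coherence": (38, 34, 28)},
}
means = {
    model: {metric: mean_score(dist) for metric, dist in metrics.items()}
    for model, metrics in table2.items()
}
```

For ITE+DD this gives 0.56 for Knowledge Relevance and 0.90 for Context Coherence, and for Wizard 0.47 and 0.56, matching Table 1. For ITE+CKAD the coherence mean comes out as 0.83 against the reported 0.82, a gap that is consistent with the percentages in Table 2 being rounded.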

3.3 Experiment Setup

We use OpenNMT-py Klein et al. (2017) as the code framework; the code and models are publicly available. For all models, the hidden size is set to 512. For RNN-based models (Seq2Seq, HRED), a 3-layer bidirectional LSTM Hochreiter and Schmidhuber (1997) and a 1-layer LSTM are applied for the encoder and decoder respectively. For transformer-based models, the number of layers of both encoder and decoder is set to 3. The number of attention heads in multi-head attention is 8 and the filter size is 2048. The word embedding is shared by utterances, knowledge and generated responses. The dimension of word embedding is set to 512 empirically. We use Adam Kingma and Ba (2014) for optimization. When decoding, beam size is set to 5. We use the previous three utterances and their related documents as input.

Document moviename: despicable me; cast: steve carell as gru; rotten tomatoes: 81% … it is the debut film of illumination entertainment. … gru, a supervillain, is disheartened when an unknown supervillain steals the great pyramid of giza. gru, with the assistance of his colleague dr. Nefario and his minions, resolves to one-up this mystery rival by shrinking and stealing the moon. as it would be costly to steal the moon, gru seeks a loan from the bank of evil. … upon learning that an up-and-coming villain named vector was responsible for the pyramid theft …
Utterance 1 what do you think about despicable me?
ITE+DD i think it is an awesome movie, what about you?
Wizard i think it’s a great movie!
Utterance 2 i rather liked it. it wasn’t what i was expecting but it was still good.
ITE+DD yeah, it makes me laugh, even when they make a movie about it.
Wizard it’s a great movie for kids and adults.
Utterance 3 i like it too. funny and heart warming at the same time.
ITE+DD i think it’s interesting that gru decided to take a loan from the bank of evil.
Wizard i think they should have made a sequel.
Utterance 4 gru was a really interesting character, too. i liked steve carell and the voice he used for gru.
ITE+DD yeah, now that you mention it, what did you think of the minions?
Wizard i think the entire cast was great.
Utterance 5 yes, he did an excellent job portraying the character. 81% in ratings i think is great.
ITE+DD yes, 81% is a great score!
Wizard i also liked the villain of the movie.
Table 3: Responses generated by Incremental Transformer with Deliberation Decoder (ITE+DD) and the Wizard Transformer (Wizard). These utterances are from continuous dialogues. ITE+DD and Wizard generate responses using context utterances and a given document. Note that Utterance $i+1$ is the gold response to Utterance $i$.

3.4 Evaluation Metrics

Automatic Evaluation: We adopt perplexity (PPL) and BLEU Papineni et al. (2002) to automatically evaluate the response generation performance. Models are evaluated using perplexity of the gold response as described in Dinan et al. (2018). Lower perplexity indicates better performance. BLEU measures n-gram overlap between a generated response and a gold response. However, since there is only one reference for each response and there may exist multiple feasible responses, BLEU scores are extremely low. We compute BLEU score by the …

Manual Evaluation: Manual evaluations are essential for dialogue generation. We randomly sampled 30 conversations containing 606 utterances from the test set and obtained 5454 utterances from the nine models. Annotators scored these utterances given their previous utterances and related documents. We defined three metrics for manual evaluation: fluency, knowledge relevance Liu et al. (2018) and context coherence. All these metrics are scored 0/1/2.

fluency: Whether the response is natural and fluent. Score 0 represents not fluent and incomprehensible; 1 represents partially fluent but still comprehensible; 2 represents totally fluent.

knowledge relevance: Whether the response uses relevant and correct knowledge. Score 0 represents no relevant knowledge; 1 represents containing relevant knowledge but not correct; 2 represents containing relevant knowledge and correct.

context coherence: Whether the response is coherent with the context and guides the following utterances. Score 0 represents not coherent or leading the dialogue to an end; 1 represents coherent with the utterance history but not guiding the following utterances; 2 represents coherent with utterance history and guiding the next utterance.

3.5 Experimental Results

Table 1 shows the automatic and manual evaluation results for both the baseline and our models.

In manual evaluation, among the baselines, Wizard Transformer and Seq2Seq without knowledge have the highest fluency of 1.62, Wizard Transformer obtains the highest knowledge relevance of 0.47, and Transformer without knowledge gets the highest context coherence of 0.67. Among all models, ITE+CKAD obtains the highest fluency of 1.68, and ITE+DD has the highest Knowledge Relevance of 0.56 and the highest Context Coherence of 0.90.

In automatic evaluation, our proposed model has lower perplexity and higher BLEU scores than the baselines. For BLEU, HRED with knowledge obtains the highest score of 0.77 among the baselines, while ITE+DD reaches 0.95, the highest among all models. For perplexity, Wizard Transformer obtains the lowest value of 70.30 among the baseline models, and ITE+DD has a remarkably lower perplexity of 15.11 than all the other models. A detailed analysis is in Section 3.6.

3.6 Analysis and Discussion

To our surprise, ITE+DD reaches an extremely low ground-truth perplexity. We find that the ground-truth perplexity after the first-pass decoding is similar to that of ITE+CKAD. This shows that the second-pass decoder utilizes the document knowledge well and dramatically reduces the ground-truth perplexity.

As shown in Table 2, ITE+DD has a higher percent of score 2 both on Knowledge Relevance and Context Coherence than ITE+CKAD. This result also demonstrates that Deliberation Decoder can improve the knowledge correctness and guide the following conversations better.

Although the perplexity of ITE+CKAD is only slightly better than KAT, the BLEU score, Fluency, Knowledge Relevance and Context Coherence of ITE+CKAD all significantly outperform those of KAT model, which indicates that Incremental Transformer can deal with multi-turn document grounded conversations better.

Wizard Transformer performs well on Knowledge Relevance, second only to our proposed Incremental Transformer. However, its score on Context Coherence is lower than that of some other baselines. As shown in Table 2, Wizard Transformer has more than twice as many Knowledge Relevance results with score 1 as with score 2, which indicates that the model tends to generate responses with related but incorrect knowledge. Its poor performance on Context Coherence also shows that Wizard Transformer does not respond to the previous utterance well. This shows the limitation of representing context and document knowledge by simple concatenation.

ID 1
Utterance: I think rachel mcadams had an even better role as regina george however! would you agree?
First pass: i’m not a fan of kristen bell, but i think she did a great job.
Second pass: i’m not a huge fan of rachel mcadams, but he did a great job.

ID 2
Utterance: yeah, I guess that’s always worth it, and a truce was made as well.
First pass: yeah, not only does she reconcile with the plastics.
Second pass: yeah, she reconciles with janis , damien and aaron.

ID 3
Utterance: i liked the scene where buzz thinks he’s a big shot hero but then the camera reveals him to be a tiny toy.
First pass: i think that’s one of the best scenes in the movie.
Second pass: oh, i think that is what makes the movie unique as well. have you seen any of the other pixar movies?

Table 4: Examples of the two pass decoding. Underlined texts are the differences between two results. For each case, the first-pass response is on the top.

3.7 Case Study

In this section, we list some examples to show the effectiveness of our proposed model.

Table 3 lists some responses generated by our proposed Incremental Transformer with Deliberation Decoder (ITE+DD) and Wizard Transformer (which achieves overall best performance among baseline models). Our proposed model can generate better responses than Wizard Transformer on knowledge relevance and context coherence.

To demonstrate the effectiveness of the two-pass decoder, we compare the results from the first-pass decoding and the second-pass decoding. Table 4 shows the improvement after the second-pass decoding. For Case 1, the second-pass decoding result corrects the knowledge error in the first-pass decoding result. For Case 2, the second-pass decoder uses more detailed knowledge than the first-pass one. For Case 3, the second-pass decoder can not only respond to the previous utterance but also guide the following conversation by asking knowledge-related questions.

4 Related Work

The closest work to ours lies in the area of open-domain dialogue systems incorporating unstructured knowledge. Ghazvininejad et al. (2018) use an extended Encoder-Decoder where the decoder is provided with an encoding of both the context and the external knowledge. Parthasarathi and Pineau (2018) use an architecture containing a Bag-of-Words Memory Network fact encoder and an RNN decoder. Dinan et al. (2018) combine Memory Network architectures to retrieve, read and condition on knowledge, and Transformer architectures to provide text representation and generate outputs. Different from these works, we enhance the Transformer architecture to handle document knowledge in multi-turn dialogue in two aspects: 1) using an attention mechanism to combine document knowledge and context utterances; and 2) exploiting an incremental encoding scheme to encode multi-turn knowledge-aware conversations.

Our work is also inspired by several works in other areas. Zhang et al. (2018) introduce document context into the Transformer for the document-level Neural Machine Translation (NMT) task. Guan et al. (2018) devise an RNN-based incremental encoding scheme for the story ending generation task. In our work, we design an Incremental Transformer to achieve a knowledge-aware context representation using an incremental encoding scheme. Xia et al. (2017) first propose an RNN-based Deliberation Network for the NMT task. Our Deliberation Decoder differs in two aspects: 1) we explicitly design the two decoders to target context and knowledge respectively; 2) our second-pass decoder directly fine-tunes the first-pass result, while theirs uses both the hidden states and the results from the first pass.

5 Conclusion and Future Work

In this paper, we propose an Incremental Transformer with Deliberation Decoder for the task of Document Grounded Conversations. Through an incremental encoding scheme, the model achieves a knowledge-aware and context-aware conversation representation. By imitating the real-world human cognitive process, we propose a Deliberation Decoder to optimize knowledge relevance and context coherence. Empirical results show that the proposed model can generate responses with much more relevance, correctness, and coherence compared with the state-of-the-art baselines. In the future, we plan to apply reinforcement learning to further improve the performance.

6 Acknowledgments

This work is supported by 2018 Tencent Rhino-Bird Elite Training Program, National Natural Science Foundation of China (NO. 61662077, NO.61876174) and National Key R&D Program of China (NO.YS2017YFGH001428). We sincerely thank the anonymous reviewers for their thorough reviewing and valuable suggestions.


  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1870–1879. Cited by: §1.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of wikipedia: knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241. Cited by: §1, §3.2, §3.4, §4.
  • M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley (2018) A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §4.
  • J. Guan, Y. Wang, and M. Huang (2018) Story ending generation with incremental encoding and commonsense knowledge. arXiv preprint arXiv:1808.10113. Cited by: §4.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.3.
  • H. Huang, E. Choi, and W. Yih (2018) Flowqa: grasping flow in history for conversational machine comprehension. arXiv preprint arXiv:1810.06683. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL. Cited by: §3.3.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §1.
  • S. Liu, H. Chen, Z. Ren, Y. Feng, Q. Liu, and D. Yin (2018) Knowledge diffusion for neural dialogue generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1489–1498. Cited by: §1, §3.4.
  • H. Y. K. Lo and S. S. Y. Chen (2019) Knowledge-grounded response generation with deep attentional latent-variable model. Thirty-Third AAAI Conference on Artificial Intelligence. Cited by: §1.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: §3.2.
  • A. Madotto, C. Wu, and P. Fung (2018) Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1468–1478. Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In , pp. 311–318. Cited by: §3.4.
  • P. Parthasarathi and J. Pineau (2018) Extending neural generative conversational model using external knowledge sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 690–695. Cited by: §1, §4.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 784–789. Cited by: §1.
  • S. Reddy, D. Chen, and C. D. Manning (2018) Coqa: a conversational question answering challenge. arXiv preprint arXiv:1808.07042. Cited by: §1.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §1, §3.2.
  • L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1577–1586. Cited by: §1, §3.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2.2, §2.2, §2.2, §2.2, §2.2, §3.2.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §1, §3.2.
  • Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, and T. Liu (2017) Deliberation networks: sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pp. 1784–1794. Cited by: §1, §2.2, §4.
  • H. Xiong, Z. He, H. Wu, and H. Wang (2018) Modeling coherence for discourse neural machine translation. arXiv preprint arXiv:1811.05683. Cited by: §2.2.
  • S. Yavuz, A. Rastogi, G. Chao, D. Hakkani-Tür, and A. A. AI (2018) DEEPCOPY: grounded response generation with hierarchical pointer networks. Advances in Neural Information Processing Systems. Cited by: §1.
  • A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) Qanet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541. Cited by: §1.
  • J. Zhang, H. Luan, M. Sun, F. Zhai, J. Xu, M. Zhang, and Y. Liu (2018) Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 533–542. Cited by: §4.
  • K. Zhou, S. Prabhumoye, and A. W. Black (2018) A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 708–713. Cited by: §1, §3.1, §3.2.