One of the key goals of AI is to build a machine that can converse with humans on a given topic. To achieve this goal, the machine should be able to understand language with background knowledge, recall knowledge from memory or external resources, reason about these concepts together, and finally output appropriate and informative responses. Many research efforts have been devoted to chitchat-oriented conversation generation Ritter et al. (2011); Shang et al. (2015). However, these models tend to produce generic or incoherent responses for a given topic, since it is quite challenging to learn semantic interactions merely from dialogue data Ghazvininejad et al. (2018); Zhou et al. (2018) without the help of background knowledge.
Recently, some studies have introduced external knowledge into open-domain chitchat generation Ghazvininejad et al. (2018); Liu et al. (2018); Vougiouklis et al. (2016); Young et al. (2018); Zhou et al. (2018). These models usually recall background knowledge from a source, either an unstructured non-factoid knowledge base Ghazvininejad et al. (2018); Vougiouklis et al. (2016) or a structured factoid knowledge base Liu et al. (2018); Young et al. (2018); Zhou et al. (2018), and then generate more informative responses conditioned on the selected knowledge.
For factoid knowledge, e.g., facts about movies, triple attributes or graph edges provide high-quality candidates for the knowledge selection decision, and this prior information can enhance the generalization capability of knowledge selection models. However, factoid knowledge suffers from information insufficiency for response generation, since a single word or entity offers little material to build a response on. For non-factoid knowledge, e.g., comments about movies, the text sentences provide rich information for generation, but their unstructured (e.g., document-based) representation scheme demands a strong capability for models to perform knowledge selection or attention over the list of knowledge texts. Fusing knowledge triples (or a graph-based representation scheme) with text sentences might yield mutually reinforcing advantages for knowledge selection in conversation systems, but this direction has received little study.
To bridge the gap between the two lines of studies on knowledge aware conversation generation, we present an augmented knowledge graph based open-domain chatting machine (denoted as AKGCM), which consists of a knowledge selector and a response generator.
To construct the augmented knowledge graph, we take a factoid knowledge graph (KG) as its backbone, and align unstructured sentences of non-factoid knowledge with the factoid KG by linking entities from these sentences to vertices (containing entities) of the KG. Thus we augment the factoid KG with non-factoid knowledge while retaining its structured representation. Then we use this augmented KG to facilitate knowledge selection and response generation, as shown in Figure 1.
Operating on the graph, the knowledge selector first retrieves a vertex that matches an input message as a starting point, then learns to traverse the graph along multi-hop paths that may reflect conversation logic, and finally stops at an answer vertex with the correct knowledge. Our graph differs from previous KGs in that some of our vertices contain long texts, not a single entity or word. To fully leverage this long-text information, we improve a state-of-the-art reasoning algorithm Das et al. (2018), MINERVA, with machine reading comprehension (MRC) technology Seo et al. (2017) to conduct fine-grained semantic matching between an input message and candidate vertices.
Finally, for response generation, we use an encoder-decoder model to produce responses conditioned on selected knowledge.
In summary, this paper makes the following contributions:
This work is the first attempt to unify factoid and non-factoid knowledge as a graph, and to conduct more flexible multi-hop graph reasoning for knowledge selection. Supported by such knowledge and this knowledge selection method, our system can respond more appropriately and informatively.
2 Related Work
Conversation with Knowledge Graph There is growing interest in leveraging factoid knowledge Han et al. (2015); Liu et al. (2018); Zhu et al. (2017) or commonsense knowledge Young et al. (2018); Zhou et al. (2018) with graph-based representations for the generation of appropriate and informative responses. Compared with these works, we augment a previous KG with non-factoid knowledge and introduce multi-hop graph reasoning into conversational models. Wu et al. (2018) used a document reasoning network for modeling conversational contexts, but not for knowledge selection. Moreover, previous models cannot effectively utilize long texts from graph vertices, since they simply use one embedding to represent a whole vertex, without further text analysis.
Conversation with Unstructured Texts With the availability of a large amount of knowledge texts from Wikipedia or user-generated-content websites, e.g., Reddit, some studies focus on either modeling conversation generation with unstructured texts Ghazvininejad et al. (2018); Vougiouklis et al. (2016); Xu et al. (2017), or building benchmark dialogue data grounded on knowledge Dinan et al. (2019); Moghe et al. (2018). In comparison with them, we adopt a graph-based representation scheme for unstructured texts, which helps improve the generalization of knowledge selection, as shown in our experiments.
Knowledge Graph Reasoning Previous studies on KG reasoning can be categorized into path-based models Das et al. (2017a); Lao et al. (2011), embedding-based models Bordes et al. (2013); Wang et al. (2014), and models unifying embedding- and path-based technology Das et al. (2018); Lin et al. (2018); Xiong et al. (2017), which can predict missing links or entities for the completion of KGs. Our problem setting differs from theirs in that some vertices of our graph contain long texts, not limited to a single entity or word. This motivates us to improve previous reasoning models with machine reading technology to effectively leverage long-text information.
POMDP for Dialogue Modeling POMDP-based models have been extensively studied for policy learning, responsible for system action selection, in task-oriented multi-turn conversational systems with pipelined architectures Young et al. (2013). In this work, a POMDP-based reasoning model is used for knowledge selection within a single turn, not for modeling multi-turn dialogue. Moreover, our system targets open-domain chitchat, not task-oriented dialogue.
Fusion of KG triples and sentences In the context of QA, the combination of a KG and a text corpus has been studied with strategies of late fusion Gardner and Krishnamurthy (2017); Ryu et al. (2014) or early fusion Das et al. (2017b); Sun et al. (2018), which can help address the issue of low coverage of answers in KG-based models. In this work, we conduct this fusion for conversation generation, not QA, and our model can select a text span as the answer, not being restricted to entities as in those QA models.
3 The Proposed Model
3.1 Problem Definition and Model Overview
Our problem is formulated as follows: let G = (V, E, L) denote an augmented KG, where V is a set of vertices, E is a set of edges, and L is a set of edge labels (e.g., triple attributes, or vertex categories). Given an input message X and G, the goal is to generate a proper response Y. Essentially, the system consists of two stages: (1) knowledge selection: we select, from the vertices connected to the starting vertex, the vertex v* that maximizes the probability P(v | X, G) as an answer; (2) response generation: it estimates the probability P(Y | X, k_{v*}), where k_{v*} denotes the knowledge text of v*.
The overview of our augmented knowledge graph based chatting machine (AKGCM) is shown in Figure 2. The knowledge selector first takes as input a message X and retrieves a starting vertex from G that is closely related to X, then performs multi-hop graph reasoning on G, and finally arrives at a vertex v* that holds knowledge appropriate for response generation. The knowledge aware response generator produces a response with the knowledge from v*. At each decoding position, it attentively reads the selected knowledge text, and then generates a word from the vocabulary or copies a span from the knowledge text.
For model training, each [message, response] pair in the training data is associated with ground-truth knowledge and its vertex ID (ground-truth vertex) in G for knowledge grounding. These vertex IDs are used as ground truth for training the knowledge selector, while the [message, knowledge text, response] triples are used for training the response generator.
3.2 Augmented Knowledge Graph
Given a factoid KG and related documents containing non-factoid knowledge, we take the KG as a backbone, where each vertex contains a single entity or word, and each edge represents an attribute or a relation. Then we segment the documents into sentences and align each sentence with entries of the factoid KG by mapping entities from these sentences to entity vertices of the KG. Thus we augment the factoid KG with non-factoid knowledge while retaining its structured representation.
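As an illustration, the alignment step above can be sketched as follows. The substring-based entity matcher and the `topic/sentN` vertex-ID scheme are stand-ins for whatever entity-linking tooling and ID conventions are actually used; this is a minimal sketch, not the paper's implementation.

```python
from collections import defaultdict

def augment_kg(triples, documents, split_sentences):
    """Attach non-factoid sentences to the entity vertices of a factoid KG.

    `triples` is a list of (head, relation, tail) facts; `documents` maps a
    topic (e.g., a movie) to its raw non-factoid text. The entity matcher
    below is a toy case-insensitive substring check standing in for a real
    entity-linking system.
    """
    # Backbone: adjacency list of the factoid KG.
    graph = defaultdict(list)
    entities = set()
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
        entities.update([head, tail])

    # Augmentation: each sentence becomes a new text vertex, linked from
    # every entity vertex it mentions, so the structured scheme is retained.
    for topic, text in documents.items():
        for i, sent in enumerate(split_sentences(text)):
            text_vertex = f"{topic}/sent{i}"  # illustrative ID scheme
            for ent in entities:
                if ent.lower() in sent.lower():
                    graph[ent].append(("mentioned_in", text_vertex))
            if text_vertex not in graph:
                graph[text_vertex] = []  # keep the sentence as a vertex
    return dict(graph)
```

With a toy sentence splitter, `augment_kg` links each entity vertex to the sentences that mention it while leaving the original triples untouched.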
3.3 Knowledge Selection on Graph
Task Definition We formulate knowledge selection on G as a finite-horizon sequential decision making problem. It supports more flexible multi-hop walking on graphs, not limited to the one-hop walking of previous conversation models Han et al. (2015); Zhou et al. (2018); Zhu et al. (2017).
As shown in Figure 3, we begin by representing the environment as a deterministic partially observed Markov decision process (POMDP) on the graph G built in Section 3.2. Our reinforcement learning (RL) based agent is given an input query, i.e., the message X. Starting from a vertex corresponding to X in G, the agent follows a path in the graph, and stops at a vertex that it predicts as the answer. Using a training set of known answer vertices for message-response pairs, we train the agent using policy gradients Williams (1992) with control variates.
The difference between our problem setting and previous KG reasoning lies in that: (1) the content of our input queries is not limited to entities and attributes; (2) some vertices in our graph contain long texts, while vertices in previous KGs contain just a single entity or a short text. This motivates us to make a few improvements on previous models, as shown in Equations (5), (6), and (7).
Next we elaborate on the 5-tuple of the environment (states, observations, actions, transitions, rewards), and the policy network.
States A state S_t at time step t is represented by S_t = (v_t, v_ans), and the state space consists of all valid combinations in V x V, where v_t is the current location of the RL agent, v_ans is the ground-truth vertex, and V is the set of all possible vertices.
Observations The complete state of the environment cannot be observed. Intuitively, the agent knows its current location v_t and the query X, but not the ground-truth vertex v_ans, which remains hidden. Formally, the observation function is defined as O(S_t = (v_t, v_ans)) = v_t.
Actions The set of possible actions A_{S_t} from a state S_t consists of all outgoing edges of the vertex v_t in G. Formally, A_{S_t} = {(l, v) : (v_t, l, v) in E}. This means an agent at each state can select which outgoing edge it wishes to take, specified by the label l of the edge and the destination vertex v. We limit the length of the action sequence (horizon length) to a fixed number T of time steps. Moreover, we augment each vertex with a special action called 'NO_OP', which goes from a vertex to itself. This allows the agent to remain at a vertex for any number of time steps. It is especially helpful when the agent has managed to reach a correct vertex at some time step and can then stay at that vertex for the rest of the episode.
Transition The environment evolves deterministically by updating the state to the new vertex according to the edge selected by the agent. Formally, the transition function is defined by delta(S_t, A_t) = S_{t+1} = (v_{t+1}, v_ans), where A_t = (l_t, v_{t+1}); l_t is the label of the edge connecting v_t and v_{t+1}, and v_{t+1} is the destination vertex.
Rewards After T time steps, if the current vertex is the ground-truth one, the agent receives a reward of +1, and 0 otherwise. Formally, R(S_T) = 1 if v_T = v_ans and 0 otherwise, where S_T = (v_T, v_ans) is the final state.
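The five components above can be collected into a minimal environment sketch. The class and method names and the adjacency-list graph format are illustrative assumptions, not the paper's implementation; the point is that transitions are deterministic, only the current vertex is observed, and the 0/1 reward arrives at the horizon.

```python
class GraphEnv:
    """Deterministic POMDP environment for knowledge selection on a graph.

    The state is (current vertex, answer vertex); the agent observes only
    its current vertex; actions are outgoing edges plus a NO_OP self-loop;
    a 0/1 reward is given at the horizon T.
    Graph format: {vertex: [(edge_label, destination), ...]}.
    """

    def __init__(self, graph, horizon=3):
        self.graph = graph
        self.horizon = horizon

    def reset(self, start_vertex, answer_vertex):
        self.state = (start_vertex, answer_vertex)  # answer stays hidden
        self.t = 0
        return start_vertex  # observation O(S_t) is the current vertex

    def actions(self):
        current, _ = self.state
        # NO_OP lets the agent stay put once it has found a good vertex.
        return [("NO_OP", current)] + list(self.graph.get(current, []))

    def step(self, action):
        _, destination = action
        _, answer = self.state
        self.state = (destination, answer)  # deterministic transition
        self.t += 1
        done = self.t >= self.horizon
        reward = 1.0 if done and destination == answer else 0.0
        return destination, reward, done
```

For example, on a three-vertex chain with horizon 2, the agent earns reward 1.0 only if it stands on the answer vertex after its second step.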
Policy Network We design a randomized non-stationary policy pi = (pi_1, ..., pi_T), where pi_t is the policy at time step t. In this work, for each pi_t, we employ a policy network with three components to make the decision of choosing an action from all available actions (A_{S_t}) conditioned on the history.
The first component is a history-dependent feed-forward network (FFN) based model proposed in Das et al. (2018). We first employ an LSTM to encode the history H_t = (o_1, A_1, ..., A_{t-1}, o_t), the sequence of observations and actions taken, as a continuous vector h_t. It is defined by:

h_t = LSTM(h_{t-1}, [r_{t-1}; e_t]),

where r_{t-1} is the embedding of the relation corresponding to the label of the edge the agent chose at time t-1, and e_t is the embedding of the vertex corresponding to the agent's state at time t.
Recall that each possible action represents an outgoing edge with the information of the edge relation label l and destination vertex v. So let a = [r; e] denote an embedding for each action, and we obtain the matrix A_t by stacking the embeddings of all the outgoing edges. Then we build a two-layer feed-forward network with ReLU nonlinearity which takes in the current history representation h_t and the representation emb(X) of the message X. We use another single-layer feed-forward network for the computation of emb(X), which accepts the original sentence embedding (e.g., a BERT based embedding) of X as input. The updated FFN model for action decision is defined by:

d_FFN = A_t W_2 ReLU(W_1 [h_t; emb(X)]). (5)
Recall that in our graph, some vertices contain long texts, differentiating our graph from those in previous work. The original reasoning model Das et al. (2018), MINERVA, cannot effectively exploit the long-text information within vertices since it just learns an embedding representation for the whole vertex, without detailed analysis of the text in vertices. To fully leverage the long-text information in vertices, we employ two models, a machine reading comprehension (MRC) model Seo et al. (2017) and a bilinear model, to score each possible action from both a global and a local view.
For scoring from the global view, (1) we build a document by collecting the sentences from all possible actions' destination vertices, (2) we employ the MRC model to predict an answer span sp from the document, and (3) we score each action by calculating a BLEU-1 score of its vertex's sentence with sp as the reference, shown as follows:

d_MRC[i] = BLEU1(text(v_i), sp). (6)

Here, text(.) represents the operation of getting text contents, and BLEU1(., .) represents the operation of calculating a BLEU-1 score. We see that the MRC model can help determine which action is the best based on global information from the whole document.
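A simplified version of this global-view scorer might look as follows; BLEU-1 is reduced here to clipped unigram precision (no brevity penalty), and the whitespace tokenization is an assumption for illustration.

```python
from collections import Counter

def bleu1(candidate_tokens, reference_tokens):
    """Clipped unigram precision: a simplified BLEU-1 without the
    brevity penalty."""
    if not candidate_tokens:
        return 0.0
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / len(candidate_tokens)

def mrc_scores(candidate_sentences, predicted_span):
    """Score each candidate vertex's sentence against the span the MRC
    model extracted from the concatenated document."""
    ref = predicted_span.lower().split()
    return [bleu1(s.lower().split(), ref) for s in candidate_sentences]
```

A candidate sentence that overlaps heavily with the extracted span scores high, while unrelated candidates score zero.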
For scoring from the local view, we use another bilinear model to calculate the similarity between the message representation emb(X) and each action embedding a_i, shown as follows:

d_bi[i] = emb(X)^T W_b a_i. (7)
Finally, we calculate a weighted sum of the outputs of the three above-mentioned models and output a probability distribution over the possible actions, from which a discrete action is sampled:

P(A_t | S_t, X) = softmax(d_FFN + lambda_1 d_MRC + lambda_2 d_bi). (8)
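The local-view bilinear score and the final combination can be sketched together. The function names and the `lam1`/`lam2` weights are illustrative stand-ins for the weighting hyper-parameters; this is a sketch of the combination step, not the trained policy network.

```python
import numpy as np

def bilinear_scores(message_emb, W, action_embs):
    """Local-view score for each action: emb(X)^T W a_i, computed for
    every row a_i of `action_embs` at once."""
    return action_embs @ (W @ message_emb)

def action_distribution(ffn_scores, mrc_scores, bi_scores, lam1=1.0, lam2=1.0):
    """Weighted sum of the three per-action scores followed by a softmax,
    giving the distribution from which a discrete action is sampled."""
    total = (np.asarray(ffn_scores, dtype=float)
             + lam1 * np.asarray(mrc_scores, dtype=float)
             + lam2 * np.asarray(bi_scores, dtype=float))
    exp = np.exp(total - total.max())  # numerically stable softmax
    return exp / exp.sum()
```

For instance, if one action's combined score exceeds another's by log 3, the softmax assigns it three times the probability.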
Please see Section 3.1 for the definition of X. When the agent finally arrives at the vertex v_T, we take its knowledge text as the answer for response generation.
Training For the policy network (pi_theta) described above, we want to find parameters theta that maximize the expected reward:

J(theta) = E_{(X, v_ans) ~ D} E_{A_1, ..., A_T ~ pi_theta} [R(S_T)],

where we assume there is a true underlying distribution D over messages and their answer vertices, and R(S_T) is the terminal reward defined above.
3.4 Knowledge Aware Generation
Following the work of Moghe et al. (2018), we modify a text summarization model See et al. (2017) to suit our knowledge aware response generation task.
In the summarization task, the input is a document and the output is a summary, but in our case the input is a [selected knowledge, message] pair and the output is a response. Therefore we introduce two RNNs: one computes the representation of the selected knowledge, and the other that of the message. The decoder accepts the two representations and its own internal state representation as input, and then computes (1) a probability score indicating whether the next word should be generated or copied, (2) a probability distribution over the vocabulary if the next word is to be generated, and (3) a probability distribution over the input words if the next word is to be copied. These three probabilities are then combined to produce the next word in the response.
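The mixing step at the end of this paragraph can be sketched as follows. For simplicity this sketch assumes every source token is already in the vocabulary (real pointer-generator decoders extend the vocabulary with out-of-vocabulary source words); the names are illustrative.

```python
import numpy as np

def next_word_distribution(p_gen, vocab_dist, attention, src_tokens, vocab):
    """Mix the generate and copy distributions, pointer-generator style:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention weights on
    the source positions holding w.

    Assumes every token in `src_tokens` has an index in `vocab`.
    """
    final = p_gen * np.asarray(vocab_dist, dtype=float)
    for attn, tok in zip(attention, src_tokens):
        final[vocab[tok]] += (1.0 - p_gen) * attn
    return final
```

Because the attention weights and the vocabulary distribution each sum to one, the mixed distribution remains a proper probability distribution for any `p_gen` in [0, 1].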
4 Experiments and Results
We adopt a publicly available knowledge grounded multi-turn dialogue dataset for our experiments, the Reddit movie dataset released by Moghe et al. (2018), since it contains conversations grounded on both factoid and non-factoid knowledge. Other knowledge grounded datasets Dinan et al. (2019); Liu et al. (2018) focus on a single type of knowledge and are thus not applicable to our problem setting. The Reddit dataset contains movie chats wherein each response is explicitly generated by copying or modifying sentences from background knowledge such as triples about facts, plots, comments, or reviews (from Reddit or IMDB) about movies. It consists of 9K conversation sessions containing a total of 90K utterances pertaining to about 921 movies. We follow their data split for training, validation, and test (we use the single-reference mixed-short test set for evaluation; please see their paper for more details), which makes direct comparison with their reported results possible. The statistics are shown in Table 1. Non-factoid knowledge is involved in movie dialogue more frequently than factoid knowledge, indicating the importance of non-factoid knowledge.
Table 1: Statistics of the augmented KG.
| |Factoid knowledge|Non-factoid knowledge|
|#Total vertices|21028|96345|
|#Vertices in utterances|2740|31746|
4.2 Implementation Details
We implement our knowledge selection model based on the code of Das et al. (2018) (https://github.com/shehzaadzd/MINERVA) and that of Moghe et al. (2018) (https://github.com/nikitacs16/Holl-E). We set the maximum reasoning length to a fixed number of steps. We use TransE Bordes et al. (2013) to initialize vertex and relation representations in our augmented KG. The embedding size for TransE is set to 768 for compatibility with the setting of BERT Devlin et al. (2018) embeddings (https://github.com/hanxiao/bert-as-service). Here we use BERT to obtain the representations of input messages for the knowledge selection model. We use BiDAF as our MRC module, shown in Equation (6), and we train the MRC module on the same training set as our knowledge selection model. The weights lambda_1 and lambda_2 in Equation (8) are set empirically. The embeddings of vertices and input messages are fixed during training. We use the Adam optimizer with a mini-batch size of 32. The learning rate is 0.001. The model is run for at most 20 epochs. During preparation of the starting vertex for each message, if we cannot find any matching vertex, we take the movie-name vertex as the starting vertex. This operation helps improve the recall of correct knowledge. We implement the knowledge aware generation model based on the code of GTTP released by Moghe et al. (2018). The word embedding size is set to 300, the vocabulary size is limited to 30,000, and for other parameter settings we follow them. We will make the augmented KG and our code publicly available soon.
For the baselines, including Seq2Seq, HRED, MemNet, and CCM, we initialize word embeddings with GloVe. We also follow the parameter settings in their original papers for model training.
4.3 Experiment Settings
We follow existing work to conduct both automatic and human evaluation of our system. We also compare our system with a set of carefully selected baselines, described as follows.
Seq2Seq We implement a sequence-to-sequence model (Seq2Seq) Sutskever et al. (2014), which is widely used in open-domain conversational systems.
HRED We implement a hierarchical recurrent encoder-decoder model Serban et al. (2016).
MemNet We implement an end-to-end knowledge-grounded generation model Ghazvininejad et al. (2018), where top-k knowledge text candidates are selected by another retrieval model and then are stored into the memory units for generation.
GTTP It is an end-to-end text summarization model See et al. (2017). We use the code released by Moghe et al. (2018), where they modify GTTP to suit the task of knowledge aware conversation generation, taking a message and a document containing all available knowledge as input.
BiDAF+G It is a Bi-directional Attention Flow based QA model (BiDAF) Seo et al. (2017). We use the code released by Moghe et al. (2018), where they use BiDAF to find the answer span from a knowledge document, taking the input message as the query. Moreover, we use a response generator (the same as ours) for NLG with the predicted knowledge span.
CCM It is an end-to-end commonsense conversational model Zhou et al. (2018). We use the code released by the original authors and modify our graph to suit their setting, selecting each content word from the long texts as an individual vertex to replace the original long-text vertices, since their model cannot effectively process long texts in vertices.
AKGCM It is our two-stage system presented in Section 3. We use BiDAF as the MRC model.
4.4 Automatic Evaluations
Metrics Following the work of Moghe et al. (2018), we adopt BLEU-4 Papineni et al. (2002), ROUGE-2 Lin (2004) and ROUGE-L Lin and Och (2004) to evaluate how similar the output response is to the reference text.
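As a reference point for the ROUGE-L metric used above, a bare-bones version can be computed from the longest common subsequence of the two token sequences. This simplified sketch uses plain F1 rather than the official recall-weighted F-beta, and whitespace tokenization is an assumption.

```python
def rouge_l_f1(hypothesis, reference):
    """Simplified ROUGE-L: F1 over the longest common subsequence (LCS)
    of the whitespace-tokenized hypothesis and reference."""
    hyp, ref = hypothesis.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i, h in enumerate(hyp):
        for j, r in enumerate(ref):
            if h == r:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(hyp), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)
```

An exact match scores 1.0; a hypothesis covering part of the reference scores between 0 and 1 according to the LCS overlap.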
Results As shown in Table 2, on the full task (both knowledge selection and response generation), AKGCM obtains the highest scores on the test set in terms of ROUGE-2 and ROUGE-L, and the second-highest score in terms of BLEU-4, surpassing the other models, except BiDAF+G, by a large margin. This indicates that AKGCM can generate more informative and grammatical responses.
4.5 Human Evaluations
Metrics We resort to a web crowdsourcing service for human evaluations. We randomly sample 200 messages from the test set and run each model to generate responses, and then we conduct a pair-wise comparison between the response by AKGCM and the one by a baseline for the same message. In total, we have 1,200 pairs since there are six baselines. For each pair, we ask five evaluators to give a preference between the two responses, in terms of the following two metrics: (1) appropriateness (Appr.), i.e., whether the response is appropriate in relevance and logic; (2) informativeness (Infor.), i.e., whether the response provides new information and knowledge in addition to the input message, instead of generic responses such as "This movie is amazing". Ties are allowed. Note that system identifiers are masked during evaluation.
Annotation Statistics We calculate the agreements to measure inter-evaluator consistency. For appropriateness, the percentage of test instances that at least 3 evaluators give the same label (3/5 agreement) is 96%, and that for at least 4/5 agreement is 73%. For informativeness, the percentage for at least 3/5 agreement is 96% and that for at least 4/5 agreement is 75%.
Results In Table 3, each score for win/tie/lose is the percentage of messages for which AKGCM generates better/almost the same/worse responses, in comparison with a baseline. AKGCM outperforms all the baselines significantly (sign test, p-value < 0.005) in terms of the two metrics on the test set. Furthermore, AKGCM beats the strongest baseline, BiDAF+G. This demonstrates the effectiveness of our graph reasoning mechanism, which can use global graph structure information.
4.6 Model Analysis and Case Study
Table 4: Ablation results of AKGCM.
|w/o non-factoid knowledge|0.61|0.84|1.52|
|w/o Bilinear + MRC|6.71|4.69|12.47|
Table 5: A case study.
|Message|I thought she was interesting to watch, but I found Wade more intriguing. I was mortified with what happened to Glen.|
|Ground-truth response|Yeah, that was crazy when Wade douses Glen in gasoline and sets him and the garage on fire.|
|Seq2seq|It was an absurdist screwball comedy.|
|HRED|I liked Clerks, I loved Clerks 2.|
|MemNet|I think it was the best of the movie that is a great movie.|
|GTTP|I think it was a very slow moving novel like story but the character were played well.|
|BiDAF+G|I also liked the scene where Wade fight, and Wade clubs Glen, killing him.|
|Selected knowledge (BiDAF+G)|Wade fight, and Wade clubs Glen, killing him. Numbly, Wade douses Glen in gasoline and sets him and the garage on fire.|
|CCM|A empathy organized Auto named Martin _UNK ( his more ) made him a audience talk of _NONE.|
|AKGCM|It was crazy when Numbly, Wade douses Glen in gasoline and sets him and the garage on fire.|
|Selected knowledge (AKGCM)|Numbly, Wade douses Glen in gasoline and sets him and the garage on fire.|
AKGCM without (w/o) Non-factoid Knowledge To verify the contribution of non-factoid knowledge, we remove it from the augmented KG at test time, and report the performance of our system with only factoid knowledge in Table 4. Without non-factoid knowledge, the performance of our system drops significantly in terms of BLEU and ROUGE, indicating that non-factoid knowledge is essential for knowledge aware conversation generation.
AKGCM w/o the MRC Model or Bilinear One For the ablation study, we implement variants of our system without the bilinear model or the MRC model for knowledge selection. The results of these variants are reported in Table 4. Comparing the performance of our full model with its variants, we find that both the MRC model and the bilinear model bring performance improvements. This indicates that full interaction between the input message and knowledge texts via neural models is effective for knowledge selection.
Model Generalization As shown in Figure 4, when we gradually reduce the size of the training data, AKGCM still manages to achieve acceptable performance, even with extremely little training data (around 3,400 utterance-response pairs at the 10% point on the x-axis). In contrast, the performance of the two strongest baselines, BiDAF+G and GTTP, drops more dramatically. This indicates that our graph reasoning mechanism can effectively use graph structure information for knowledge selection, resulting in the better generalization capability of AKGCM.
Case Study As shown in Table 5, given an input message about movie characters and a scene, AKGCM selects appropriate knowledge text related to the scene, and then produces a more appropriate and informative response using the selected knowledge, compared with the other models. Furthermore, in comparison with the high-quality responses of BiDAF+G and GTTP, AKGCM's output is more similar to the ground-truth response. Moreover, Seq2seq and HRED generate comments about the movie, which are worse than the scene-description-related responses in terms of appropriateness and informativeness. MemNet and CCM fail to generate high-quality responses, probably because our training data is too small (around 34,000 conversation pairs), significantly less than in their original work.
5 Conclusion
In this paper, we propose an augmented knowledge graph based open-domain chatting machine (AKGCM) to facilitate conversation generation, making the first attempt to unify both factoid and non-factoid knowledge as a graph, and then combining multi-hop graph reasoning with machine reading technology for knowledge selection. Results indicate that although the machine reading based model (BiDAF+G) is a very strong baseline, AKGCM, supported by the graph reasoning mechanism, can outperform it, especially when given extremely small training data.
This work may be viewed as a step towards knowledge-aware and interpretable neural conversational models. In the future, we may extend AKGCM to conduct multi-turn dialogue or expand our graph for more content coverage by incorporating data beyond background knowledge.
References
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS, pages 2787–2795.
- Das et al. (2018) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. 2018. Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. In Proceedings of ICLR, pages 1–18.
- Das et al. (2017a) Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. 2017a. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of EACL, pages 132–141.
- Das et al. (2017b) Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017b. Question answering on knowledge bases and text using universal schema and memory networks. In Proceedings of ACL, pages 358–365.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: knowledge-powered conversational agents. In Proceedings of ICLR.
- Gardner and Krishnamurthy (2017) Matt Gardner and Jayant Krishnamurthy. 2017. Open-vocabulary semantic parsing with both distributional statistics and formal knowledge. In Proceedings of AAAI, pages 3195–3201.
- Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Proceedings of AAAI, pages 5110–5117.
- Han et al. (2015) Sangdo Han, Jeesoo Bang, Seonghan Ryu, and Gary Geunbae Lee. 2015. Exploiting knowledge base to generate responses for natural language dialog listening agents. In Proceedings of SIGDIAL, pages 129–133.
- Lao et al. (2011) Ni Lao, Tom M. Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of EMNLP, pages 529–539.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).
- Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL, pages 605–612.
- Lin et al. (2018) Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2018. Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of EMNLP, pages 3243–3253.
- Liu et al. (2018) Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu, and Dawei Yin. 2018. Knowledge diffusion for neural dialogue generation. In Proceedings of ACL, pages 1489–1498.
- Moghe et al. (2018) Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. In Proceedings of EMNLP, pages 2322–2332.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311–318.
- Ritter et al. (2011) Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of EMNLP, pages 583–593.
- Ryu et al. (2014) Pum-Mo Ryu, Myung-Gil Jang, and Hyun-Ki Kim. 2014. Open domain question answering using wikipedia-based knowledge model. Information Processing and Management, 50(5):683–692.
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: summarization with pointer-generator networks. In Proceedings of ACL, pages 1073–1083.
- Seo et al. (2017) Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.
- Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI, pages 3776–3784.
- Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of ACL, pages 1577–1586.
- Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William W. Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of EMNLP, pages 4231–4242.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104–3112.
- Vougiouklis et al. (2016) Pavlos Vougiouklis, Jonathon Hare, and Elena Simperl. 2016. A neural network approach for knowledge-driven response generation. In Proceedings of COLING, pages 3370–3380.
- Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of AAAI, pages 1112–1119.
- Williams (1992) Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
- Wu et al. (2018) Xianchao Wu, Ander Martínez, and Momo Klyen. 2018. Dialog generation using multi-turn reasoning neural networks. In Proceedings of NAACL-HLT, pages 2049–2059.
- Xiong et al. (2017) Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017. DeepPath: a reinforcement learning method for knowledge graph reasoning. In Proceedings of EMNLP, pages 564–573.
- Xu et al. (2017) Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, and Xiaolong Wang. 2017. Incorporating loose-structured knowledge into lstm with recall gate for conversation modeling. In Proceedings of IJCNN, pages 3506–3513.
- Young et al. (2013) Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: a review. Proceedings of the IEEE, 101(5).
- Young et al. (2018) Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Proceedings of AAAI, pages 4970–4977.
- Zhou et al. (2018) Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018. Commonsense knowledge aware conversation generation with graph attention. In Proceedings of IJCAI-ECAI, pages 4623–4629.
- Zhu et al. (2017) Wenya Zhu, Kaixiang Mo, Yu Zhang, Zhangbin Zhu, Xuezheng Peng, and Qiang Yang. 2017. Flexible end-to-end dialogue system for knowledge grounded conversation. CoRR, abs/1709.04264.