Challenges in Building Intelligent Open-domain Dialog Systems

by Minlie Huang, et al.
Tsinghua University

There is a resurgent interest in developing intelligent open-domain dialog systems due to the availability of large amounts of conversational data and the recent progress on neural approaches to conversational AI. Unlike traditional task-oriented bots, an open-domain dialog system aims to establish long-term connections with users by satisfying the human need for communication, affection, and social belonging. This paper reviews recent work on neural approaches devoted to addressing three challenges in developing such systems: semantics, consistency, and interactiveness. Semantics requires a dialog system to not only understand the content of the dialog but also identify the user's social needs during the conversation. Consistency requires the system to demonstrate a consistent personality to win users' trust and gain their long-term confidence. Interactiveness refers to the system's ability to generate interpersonal responses to achieve particular social goals such as entertainment, conforming, and task completion. The works we have selected to present here are based on our unique views and are by no means complete. Nevertheless, we hope that the discussion will inspire new research in developing more intelligent dialog systems.


1. Introduction

Building intelligent open-domain dialog systems that can converse with humans coherently and engagingly has been a long-standing goal of artificial intelligence (AI). Early dialog systems such as Eliza (Weizenbaum, 1966), Parry (Colby et al., 1971), and Alice (Wallace, 2009), despite being instrumental in advancing machine intelligence, worked well only in constrained environments. An open-domain social bot remained an elusive goal until recently. The Microsoft XiaoIce (literally ‘Little Ice’ in Chinese) system, since its release in May 2014, has attracted millions of users and can converse with users on a wide variety of topics for hours (Zhou et al., 2018a; Shum et al., 2018). In 2016, the Alexa Prize challenge was proposed to advance the research and development of social bots that are able to converse coherently and engagingly with humans on popular topics such as sports, politics, and entertainment, for at least 20 minutes (Ram et al., 2018) [1]. The evaluation metric, inspired by the Turing Test (Turing, 1950), is designed to test the social bots’ capacity to deliver coherent, relevant, interesting, free-form conversations and to keep users engaged as long as possible. However, the general intelligence demonstrated by these systems is still far behind that of humans. Building open-domain dialog systems that can converse on various topics like humans remains extremely challenging (Gao et al., 2019a).

[1] Even though the dialog systems in this challenge are very complicated, they are closer to informational systems in which the user's emotional needs receive less consideration.

In this paper we focus our discussion on three challenges in developing neural-based open-domain dialog systems, namely semantics, consistency, and interactiveness. The rest of the paper is structured as follows. In the rest of Section 1, we compare open-domain dialog bots with traditional task-oriented bots and elaborate on the three challenges. In Section 2, we survey three typical approaches to building neural-based open-domain dialog systems, namely, retrieval-based, generation-based, and hybrid methods. In Sections 3 to 5, we review the approaches that have been proposed to address the three challenges, respectively. In Section 6, we discuss dialog evaluation. We conclude the paper by presenting several future research trends in Section 7.

1.1. Open-Domain Dialog vs. Task-Oriented Dialog

Generally speaking, there are two types of dialog systems: task-oriented and open-domain. Task-oriented dialog systems are designed for very specific domains or tasks, such as flight booking, hotel reservation, customer service, and technical support, and have been successfully applied in some real-world applications. Open-domain dialog systems, however, are much more challenging to develop due to their open-ended goals.

As outlined by Gao et al. (2019a), although both task-oriented dialog and open-domain dialog can be formulated as an optimal decision making process with the goal of maximizing expected reward, the reward in the former is better defined and much easier to optimize than that in the latter. Consider a ticket-booking bot. It is straightforward to optimize the bot to get all necessary information to have the ticket booked in minimal dialog turns. The goal of an open-domain dialog agent, in contrast, is to maximize long-term user engagement. This is difficult to optimize mathematically because there are many different ways (known as dialog skills) to improve engagement (e.g., providing entertainment, giving recommendations, chatting on an interesting topic, providing emotional comfort), and it requires the system to have a deep understanding of the dialog context and the user’s emotional needs in order to select the right skill at the right time and generate interpersonal responses with a consistent personality.

Open-domain dialog systems also differ from task-oriented bots in system architecture. A task-oriented bot is typically developed based on a pre-defined task-specific scheme [2] and is designed as a modular system which consists of domain-specific components like language understanding, dialog management [3], and language generation [4]. These components can be either hand-crafted based on domain knowledge or trained on task-specific labeled data. In comparison, due to their open-ended nature, open-domain dialog systems need to deal with open-domain knowledge without any pre-defined task-specific schemas or labels. In recent years, there has been a trend towards developing fully data-driven unitary (non-modular) systems that map user input to system response using neural networks. Since the primary goal of open-domain dialog bots is to be AI companions to humans with an emotional connection rather than to complete specific tasks, they are often developed to mimic human conversations by training neural response generation models on large amounts of human conversational data (Sordoni et al., 2015; Vinyals and Le, 2015; Shang et al., 2015).

[2] A task scheme typically defines a set of user intents, and for each intent defines a set of dialog acts and slot-value pairs.
[3] Dialog management performs both dialog state tracking (Henderson et al., 2013; Mrksic et al., 2017) and response selection via policy (Zhao and Eskénazi, 2016; Peng et al., 2017; Su et al., 2016; Lipton et al., 2018).
[4] Recently, there are end-to-end methods (Rojas-Barahona et al., 2017; Bordes et al., 2017; Zhang et al., 2019a) that output a response given the previous dialog history, but in general, domain knowledge about the task should be explicitly considered, which differs significantly from open-domain dialog systems.

Unlike task-oriented bots, most neural response generation models developed for open-domain dialog systems are not grounded in the real world, which prevents these systems from effectively conversing about anything that relates to the user’s environment. Only recently have researchers begun to explore how to ground open-domain dialog systems in real-world entities and knowledge (Ghazvininejad et al., 2018; Mostafazadeh et al., 2017). Knowledge grounding is also crucial for the system to provide interpersonal responses. For instance, conversations between friends are quite different from those between strangers, so the system needs to be grounded in the personas of the speaker and addressee, respectively (Li et al., 2016b). The tone of system responses also needs to be adjusted according to the user’s emotional states and affect, so the system needs to be grounded in the affect or emotion of the user (Huber et al., 2018; Winata et al., 2017; Xu et al., 2018a).

Figure 1. Key issues (the middle block), key factors (the upper block), and key technologies (the bottom block) in open-domain dialog systems.

1.2. Three Challenges

Thanks to the recent success of neural approaches to natural language processing and conversational AI, there have been substantial advancements in open-domain dialog systems. Most neural approaches formulate conversation as a task of generating an output response given user input and dialog context. Almost all neural response generation models follow neural text generation frameworks, such as sequence-to-sequence (Seq2Seq) (Sutskever et al., 2014; Vinyals and Le, 2015; Shang et al., 2015), conditional variational autoencoder (CVAE) (Sohn et al., 2015), and generative adversarial network (GAN) (Li et al., 2017). These models are trained on large-scale conversational data, and can deliver plausible responses to any topic that users are interested in. However, these neural-based systems often fail to establish long-term emotional connections with users due to many issues, as illustrated by the examples in Table 1 and summarized in Figure 1. In what follows, we elaborate on three key issues which we will discuss in the rest of the paper.

Semantics Issue
    User:     I come from Shanghai.
    Machine:  So cool, I am from Beijing, the same province!

    User:     Back to Beijing now. So cold so happy!
    Machine:  When will you come to Beijing?

Consistency Issue
    User:     Are you a boy or a girl?
    Machine:  I am a boy.
    User:     Are you a girl?
    Machine:  Yes, I am a girl.

Interactiveness Issue
    User:     Terrible breakfast today.
    Machine:  How about together tomorrow morning?
    User:     Sure. Would be great with your recommendation.
    Machine:  Really?

Table 1. Examples for demonstrating the semantics, consistency, and interactiveness issues. The first two examples show semantic conflicts between user post and machine response, the third session shows bad consistency within the dialog context due to the lack of a coherent personality, and the last session has bad interactiveness due to the lack of grounding. The results in the first two blocks are from a standard Seq2Seq model with an attention mechanism, and the last session is from a commercial system.


Semantics is the heart of any dialog system. Conversation is a semantic activity (Zhou et al., 2018e). It requires the system not only to understand the content, context, and scene of the conversation, but also to process multi-modal information including the user’s personality and persona [5], emotion and sentiment, and profile and background. From the technical perspective, semantics mainly involves named entity recognition, entity linking, domain detection, topic and intent detection, sentiment and emotion detection, and knowledge and semantic reasoning.

[5] Personality is someone’s character or nature, while a persona is a superficial identity of the character or nature.


In order to gain the user’s long-term confidence and trust, it is crucial for a dialog system to present a consistent personality and respond consistently given the user’s input and dialog history (Li et al., 2016b; Qian et al., 2018; Zheng et al., 2019; Zhou et al., 2018a). For instance, a social bot should not deliver a response that conflicts with its pre-set persona, or with its previous responses in terms of temporal dependency, causality, or logic. From the technical perspective, consistency mainly involves personalization, multi-turn context modeling, knowledge grounding, and dialog planning.


As mentioned above, meeting the user’s social needs, such as emotional affection and social belonging, is the primary design goal of an open-domain dialog system. Interactiveness refers to the system’s ability to generate interpersonal responses to achieve a particular social goal such as entertainment, conforming, and task completion. To improve interactiveness, it is important to understand the user’s emotional state or affect (Zhou et al., 2018b, a), to respond not only reactively but also proactively (Yu et al., 2016; Wang et al., 2018b), to control topic maintenance or transition (Wang et al., 2018a), and to optimize the interaction strategy (i.e., dialog policy) in multi-turn conversations to maximize long-term user engagement. From the technical perspective, interactiveness mainly involves sentiment and emotion detection, context modeling, topic detection and recommendation, dialog planning, and dialog policy learning.

2. Frameworks for Building Open-domain Dialog Systems

As discussed in Section 1.1, open-domain dialog systems are typically implemented using a unitary architecture, rather than the modular architecture used by task-oriented bots, for which task-specific schemes and labels are available to develop dialog modules. At the heart of an open-domain dialog system is a response generation engine, which takes the user input $X_t$ at the $t$-th dialog turn and the dialog context $C_t$ (explained below), and generates a response $\hat{Y}_t$ by

$$\hat{Y}_t = \arg\max_{Y \in \Omega} P_\theta(Y \mid X_t, C_t)$$

where $\Omega$ denotes the set of all candidate responses, $P_\theta$ is a learned model for scoring candidate responses, parameterized by $\theta$, and $\arg\max$ is the search algorithm that finds, among all candidates, the one with the highest score.

This formulation unifies three typical methods of building open-domain dialog systems: retrieval-based, generation-based, and hybrid. In retrieval-based methods, the search space $\Omega$ is obtained by retrieving candidate responses from a pre-collected human conversational dataset consisting of input-context-response pairs. $P_\theta(Y \mid X_t, C_t)$ is implemented as a matching or ranking function which scores the relevance of each candidate given $X_t$ and $C_t$. In generation-based methods, the search space is very large, namely $|\Omega| = |V|^m$ where $|V|$ is the vocabulary size and $m$ is the response length, and $P_\theta(Y \mid X_t, C_t)$ is typically implemented as an auto-regressive model that generates a sentence word by word. In hybrid methods, it is typical to first retrieve prototype responses from a dataset and then generate a response by utilizing the prototype responses.
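As a rough illustration of how a single argmax-over-candidates interface can cover these methods, the sketch below selects the best response from a candidate set using a pluggable scoring function. The names `select_response` and `overlap_score` are hypothetical, and the word-overlap scorer is only a toy stand-in for a learned model; it is not taken from any of the surveyed systems.

```python
from typing import Callable, Iterable, Sequence


def select_response(
    candidates: Iterable[str],
    score: Callable[[str, str, Sequence[str]], float],
    user_input: str,
    context: Sequence[str],
) -> str:
    """Return the highest-scoring candidate (the argmax over the candidate set)."""
    return max(candidates, key=lambda y: score(y, user_input, context))


def overlap_score(candidate: str, user_input: str, context: Sequence[str]) -> float:
    """Toy scorer (assumption): fraction of candidate words seen in input or context."""
    seen = set(user_input.lower().split())
    for utterance in context:
        seen.update(utterance.lower().split())
    words = candidate.lower().split()
    return sum(w in seen for w in words) / max(len(words), 1)


candidates = ["i love jazz music too", "the weather is nice", "ok"]
best = select_response(candidates, overlap_score, "do you like jazz music", ["hi there"])
print(best)
```

In a retrieval-based system the candidate list would come from a repository lookup; in a generation-based system it would be the (implicit) set of all word sequences, searched approximately rather than enumerated.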

Note that the introduction of context $C_t$ offers a lot of flexibility to model various aspects of dialog. For instance, when $C_t = \emptyset$, it models single-turn dialog; setting $C_t = X_1 Y_1 X_2 Y_2 \cdots X_{t-1} Y_{t-1}$ models multi-turn dialogs. $C_t$ can also encode other contexts such as persona (Qian et al., 2018; Zhang et al., 2018b; Zheng et al., 2019) for personalized dialog generation, emotion labels (Zhou et al., 2018b; Asghar et al., 2018) for emotional response generation, and knowledge graphs (Zhou et al., 2018e; Ghazvininejad et al., 2018) for knowledge-aware response generation.

2.1. Retrieval-based Methods

Figure 2. Framework of retrieval-based methods. The online process finds the most relevant output from the retrieved candidates with a matching model, while the offline process trains the matching model with auto-constructed data.

Figure 2 illustrates the process of retrieval-based response generation methods. Using input $X$ [6] as a query, such methods first retrieve a list of candidates from a large repository which consists of input-context-output pairs, and choose the top-scored candidate as the output response using the matching function $\mathrm{match}_\theta(Y, X)$, which can be implemented using either traditional learning-to-rank algorithms (Liu, 2010) or modern neural matching models (Lu and Li, 2013; Huang et al., 2013; Fan et al., 2017). The model parameters $\theta$ are learned on pair-wise training data to minimize the margin-based pair-wise ranking loss as follows:

$$L(X, Y^+, Y^-; \theta) = \max\left(0,\ \gamma + \mathrm{match}_\theta(Y^-, X) - \mathrm{match}_\theta(Y^+, X)\right)$$

where $\gamma$ is a margin (a hyper-parameter), $Y^+$ is a ground-truth (positive) response, $Y^-$ is a negative response which can be randomly sampled from the dataset or generated by corrupting $Y^+$, and $\mathrm{match}_\theta(Y, X)$ is the matching function to be learned.

[6] Hereafter, we will use $X$ to denote the current input $X_t$ and the dialog context $C_t$.
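The margin-based pair-wise ranking loss can be sketched in a few lines. This is a minimal illustration that assumes the matching scores for the positive and negative responses have already been computed by some model; the function name `margin_ranking_loss` and the default margin of 0.5 are illustrative choices, not values from the paper.

```python
def margin_ranking_loss(pos_score: float, neg_score: float, margin: float = 0.5) -> float:
    """Hinge-style pair-wise loss: zero once the positive response
    outscores the negative one by at least `margin`."""
    return max(0.0, margin + neg_score - pos_score)


# Positive beats negative by more than the margin: no loss, no update.
print(margin_ranking_loss(2.0, 1.0, margin=0.5))
# Positive beats negative by less than the margin: a positive loss remains.
print(margin_ranking_loss(1.0, 0.8, margin=0.5))
```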

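Alternatively, we can also use a likelihood loss defined as:

$$L(X, Y^+, Y^-; \theta) = -\log P_\theta(Y^+ \mid X), \qquad P_\theta(Y^+ \mid X) = \frac{\exp\{\mathrm{match}_\theta(Y^+, X)\}}{\exp\{\mathrm{match}_\theta(Y^+, X)\} + \exp\{\mathrm{match}_\theta(Y^-, X)\}}$$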

Although both loss functions are widely used, in our experiments we find that the likelihood loss works better than the margin-based loss for response ranking. There are two possible explanations. First, the margin hyper-parameter $\gamma$ is difficult to tune. Second, in cases where there are highly competitive negative examples, the margin-based loss is close to zero, thereby leading to very little model update. The likelihood loss does not suffer from these issues.
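To make the contrast between the two losses concrete, the sketch below evaluates both on the same pair of scores. It assumes the likelihood loss is the negative log-softmax of the positive score against one or more negative scores, which is a common form but an assumption here; the function names are hypothetical. The example only shows that the hinge loss can be exactly zero while the likelihood loss still yields a nonzero training signal.

```python
import math
from typing import Sequence


def margin_loss(pos_score: float, neg_score: float, margin: float) -> float:
    """Margin-based pair-wise ranking loss (hinge form)."""
    return max(0.0, margin + neg_score - pos_score)


def likelihood_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log-softmax of the positive score against the negatives (assumed form)."""
    scores = [pos_score, *neg_scores]
    top = max(scores)  # subtract the max before exponentiating, for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - pos_score


pos, neg = 2.0, 1.0
print(margin_loss(pos, neg, margin=0.5))   # hinge is already satisfied
print(likelihood_loss(pos, [neg]))         # still strictly positive
```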

Modern neural models of $\mathrm{match}_\theta(Y, X)$ can be roughly grouped into two categories, shallow and deep interaction networks, as illustrated in Figure 3. In shallow interaction networks, candidate $Y$ and input $X$ are first encoded independently into two vectors, which then have some shallow interactions such as subtraction or element-wise multiplication before being fed to the classification layer. In deep interaction networks, $X$ and $Y$ interact via an interaction network to form a fused representation, which is then fed to the classification layer.

Figure 3. Frameworks of shallow and deep interaction networks. In shallow interaction network, the feature vectors of input and candidate are obtained independently, and there may be shallow interactions such as subtraction or element-wise multiplication between the two vectors before the classification layer. In deep interaction network, the input and candidate make interactions in the early stage to obtain a feature vector for the classification layer.
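The difference between the two families can be sketched with plain lists standing in for encoder outputs. In this hypothetical example, `shallow_interaction` combines two independently computed sentence vectors by concatenation, subtraction, and element-wise multiplication, while `interaction_matrix` computes word-by-word dot products, the kind of early, fine-grained signal a deep interaction network would process further. Both functions and their inputs are illustrative assumptions, not a specific published model.

```python
from typing import List


def shallow_interaction(x_vec: List[float], y_vec: List[float]) -> List[float]:
    """Features for a classifier built from two independently encoded vectors."""
    diff = [a - b for a, b in zip(x_vec, y_vec)]
    prod = [a * b for a, b in zip(x_vec, y_vec)]
    return x_vec + y_vec + diff + prod


def interaction_matrix(
    x_words: List[List[float]], y_words: List[List[float]]
) -> List[List[float]]:
    """Word-by-word dot products between input and candidate word vectors."""
    return [
        [sum(a * b for a, b in zip(xw, yw)) for yw in y_words]
        for xw in x_words
    ]


print(shallow_interaction([1.0, 2.0], [3.0, 4.0]))
print(interaction_matrix([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]))
```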

For shallow interaction networks, many efforts have been devoted to learning good representations for input and candidate independently. Huang et al. (2013) proposed to use deep structured similarity models (DSSMs) to extract semantic features from query and document independently before computing their relevance. DSSM is further augmented by introducing convolutional layers (Shen et al., 2014; Hu et al., 2014; Severyn and Moschitti, 2015) and recurrent layers with Long Short-Term Memory (LSTM) units (Palangi et al., 2016). To effectively incorporate dialog history, Yan et al. (2016) reformulated the input query $X$, and combined matching scores computed based on the reformulated and original queries, and retrieved queries and responses, respectively. Zhou et al. (2016) used a hierarchical Recurrent Neural Network (RNN) to encode a candidate and the utterance sequence in context, respectively, before computing their matching score. These shallow models are simple to implement and efficient to execute.

For deep interaction networks, query and response interact via a neural network to generate a single feature vector that preserves all query-response interaction information at different levels of abstraction. The matching score is then derived from the vector using another neural network. Hu et al. (2014) extracted matching features from all $n$-gram combinations of input $X$ and response $Y$ to obtain low-level feature maps with a Convolutional Neural Network (CNN). Afterwards, the feature maps are transformed with multiple CNN layers to form the final representation for classification.

Wu et al. (2017) proposed a sequential matching model for multi-turn dialog where each contextual utterance in $X$ is encoded conditioned on $Y$, and these utterances are connected sequentially by GRUs. The matching score is computed on top of the weighted sum of the GRUs’ states. Other matching models that were proposed originally for non-dialog tasks such as paraphrase detection, language inference, and reading comprehension (Wang et al., 2017a; Pang et al., 2016) have also been adapted and applied to dialog response ranking.

2.2. Generation-based Methods

Neural generative models have been widely applied to open-domain dialog generation. Inspired by the early template-based generation method (Higashinaka et al., 2014) and statistical machine translation (SMT) (Ritter et al., 2011), sequence-to-sequence (Seq2Seq) models (Sutskever et al., 2014; Vinyals and Le, 2015; Shang et al., 2015; Sordoni et al., 2015) have become the most popular choice for dialog generation. Other frameworks, including conditional variational autoencoder (CVAE) (Serban et al., 2017; Zhao et al., 2017; Ke et al., 2018; Shen et al., 2018; Zhao et al., 2018; Du et al., 2018) and generative adversarial network (GAN) (Li et al., 2017; Xu et al., 2018b), are also applied to dialog generation.

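Generation-based models usually formulate $P_\theta(Y \mid X)$ as below:

$$P_\theta(Y \mid X) = \prod_{t=1}^{m} P_\theta(y_t \mid y_{<t}, X)$$

where $Y = y_1 y_2 \cdots y_m$ and $y_{<t} = y_1 y_2 \cdots y_{t-1}$. Typically, the output response is generated word by word, e.g., at each time step $t$ a word $y_t$ is sampled according to $P_\theta(y_t \mid y_{<t}, X)$. Using RNNs, during the course of generation, the generated prefix is autoregressively encoded into the input to generate the next word.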

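Most neural generation models adopt an encoder-decoder framework. The encoder transforms the input into semantic vectors as

$$\mathbf{H} = \mathbf{h}_1 \mathbf{h}_2 \cdots \mathbf{h}_n = \mathrm{Encoder}(X)$$

Then, at each $t$-th step of generation, the decoder updates its state vector $\mathbf{s}_t$ and samples a word from the distribution $P_\theta(y_t \mid y_{<t}, X)$ as follows:

$$y_t \sim P_\theta(y_t \mid y_{<t}, X) = \mathrm{softmax}(\mathbf{W}_o \mathbf{s}_t)$$

where $\mathbf{W}_o$ is the weight matrix of the decoder. The decoder’s state is updated by

$$\mathbf{s}_t = \mathrm{RNN}(\mathbf{s}_{t-1}, [\mathbf{c}_{t-1}; \mathbf{e}(y_{t-1})])$$

where $\mathbf{c}_{t-1}$ is an attentive read of the encoded input conditioned on state $\mathbf{s}_{t-1}$, typically using an attention mechanism (Bahdanau et al., 2015); and $\mathbf{e}(y_{t-1})$ is the vector representation of the previously generated word $y_{t-1}$.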
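The attentive read of the encoded input can be sketched as a weighted sum of encoder states. For brevity, the example below uses dot-product scores between the decoder state and each encoder state, which is a simplification and an assumption: Bahdanau et al. (2015) score with a small feed-forward network instead. The function names are hypothetical.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_read(state: List[float], encoder_states: List[List[float]]) -> List[float]:
    """Weighted sum of encoder states; weights come from dot-product scores
    between the decoder state and each encoder state (simplified attention)."""
    scores = [sum(s * h for s, h in zip(state, h_k)) for h_k in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    return [
        sum(w * h_k[d] for w, h_k in zip(weights, encoder_states))
        for d in range(dim)
    ]


encoder_states = [[1.0, 0.0], [0.0, 1.0]]
# A zero decoder state scores every position equally, so the read is the mean.
print(attentive_read([0.0, 0.0], encoder_states))
# A state aligned with the first encoder state concentrates the read on it.
print(attentive_read([10.0, 0.0], encoder_states))
```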

The formulation of generation-based models mentioned above is auto-regressive in that these models generate a target sequence word by word, each word conditioned on the words that are previously generated. To make the decoding parallelizable, non-autoregressive models based on Transformer have been proposed to generate all the tokens simultaneously (Kaiser et al., 2018; Lee et al., 2018). Non-autoregressive modeling factorizes the distribution over a target sequence given a query into a product of conditionally independent per-step distributions, as follows:

$$P_\theta(Y \mid X) = \prod_{t=1}^{m} P_\theta(y_t \mid X)$$

Though the performance of such non-autoregressive models is still not as good as that of their autoregressive counterparts, they open new opportunities for fast training using very large scale datasets.
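The two factorizations can be compared on a toy example. The probability tables below are entirely hypothetical and ignore the conditioning on the query for brevity; in a real model every entry would be computed from the encoded input. The point is only that the autoregressive score of a word depends on the preceding word, whereas the non-autoregressive score depends only on the position.

```python
import math
from typing import Dict, List

# Hypothetical toy distributions (assumptions for illustration only).
BIGRAM: Dict[str, Dict[str, float]] = {
    "<s>": {"i": 0.6, "am": 0.4},
    "i": {"am": 0.9, "i": 0.1},
    "am": {"i": 0.5, "am": 0.5},
}
PER_STEP: List[Dict[str, float]] = [
    {"i": 0.6, "am": 0.4},
    {"i": 0.3, "am": 0.7},
]


def autoregressive_logprob(words: List[str]) -> float:
    """Sum of log P(y_t | y_{t-1}): each step is conditioned on the previous word."""
    prev, total = "<s>", 0.0
    for w in words:
        total += math.log(BIGRAM[prev][w])
        prev = w
    return total


def non_autoregressive_logprob(words: List[str]) -> float:
    """Sum of log P(y_t): steps are conditionally independent, so they can be
    scored (or generated) in parallel."""
    return sum(math.log(PER_STEP[t][w]) for t, w in enumerate(words))


print(autoregressive_logprob(["i", "am"]))
print(non_autoregressive_logprob(["i", "am"]))
```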

Figure 4. Typical encoder-decoder framework for generation-based models. The input $X$ is encoded into vectors $\mathbf{H}$. In the decoder, a word is sampled from $P_\theta(y_t \mid y_{<t}, X)$ and the decoder’s state is updated with $\mathbf{c}_{t-1}$ and $\mathbf{e}(y_{t-1})$ as input.

2.3. Hybrid Methods

Retrieval-based methods retrieve an output response from a repository of human-human conversations. Such human-produced conversations are fluent, grammatical, and of high quality. However, the scale of the repository is critical to the success of the methods. Moreover, retrieval-based methods cannot generate unseen conversations. On the other hand, generation-based methods can produce novel conversations, but they often generate undesirable responses that are either ungrammatical or irrelevant. Hybrid methods combine the strengths of both and usually adopt a two-stage procedure (Yang et al., 2019). In the first stage, some relevant conversations, known as prototype responses in (Wu et al., 2019), are retrieved from a dataset using the input as a query. Then, the prototype responses are used to help generate new responses in the second stage.

Based on the Seq2Seq architecture, Song et al. (2018) used additional encoders to represent the set of retrieved responses, and applied the attention (Bahdanau et al., 2015) and copy (Gu et al., 2016) mechanisms in decoding to generate new responses. Pandey et al. (2018) first retrieved similar conversations from training data using a TF-IDF model. The retrieved responses were used to create exemplar vectors that were used by the decoder to generate a new response. Wu et al. (2019) first retrieved a prototype response from training data and then edited the prototype response according to the differences between the prototype context and the current context. The motivation is that the retrieved prototype provides a good starting point for generation because it is grammatical and informative, and the post-editing process further improves the relevance and coherence of the prototype.
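The first (retrieval) stage of a hybrid method can be sketched with a small TF-IDF matcher. This is a minimal illustration in the spirit of TF-IDF retrieval, not the implementation used by Pandey et al. (2018); the conversation pairs, the smoothed IDF formula, and all function names are assumptions. The second stage, in which a generator edits or conditions on the retrieved prototype, is omitted.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def build_idf(contexts: List[str]) -> Dict[str, float]:
    """Inverse document frequency over contexts, offset by 1 so that words
    occurring in every context still carry some weight."""
    n = len(contexts)
    df = Counter(w for c in contexts for w in set(c.split()))
    return {w: math.log(n / d) + 1.0 for w, d in df.items()}


def vectorize(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; words unseen in the repository are dropped."""
    tf = Counter(w for w in text.split() if w in idf)
    return {w: count * idf[w] for w, count in tf.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_prototype(query: str, pairs: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (context, response) pair whose context best matches the query."""
    idf = build_idf([context for context, _ in pairs])
    q = vectorize(query, idf)
    return max(pairs, key=lambda p: cosine(q, vectorize(p[0], idf)))


# Hypothetical repository of (context, response) pairs.
PAIRS = [
    ("what music do you like", "i like jazz"),
    ("how is the weather today", "it is sunny"),
    ("do you like sports", "i play tennis"),
]
prototype = retrieve_prototype("what kind of music do you like", PAIRS)
print(prototype[1])
```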

3. Semantics

A typical symptom of a dialog system that suffers from the semantics issue is that it often generates bland and generic responses, such as “I don’t know”, “thank you”, “OK”, or simply repeats whatever a user says (Sordoni et al., 2015; Vinyals and Le, 2015; Serban et al., 2016; Gao et al., 2019a). We observe similar phenomena in human conversations. When we don’t understand what the other party is talking about but have to respond, we often pick those safe but bland responses like “OK” and “I don’t know”.

To make an engaging conversation, the dialog system needs to produce contentful, interesting, and interpersonal responses based on its understanding of the dialog content, user’s sentiment and emotion, and knowledge that is related to the dialog. In this section, we review some of the most prominent neural approaches that have been proposed recently to address the semantics issue. We first describe the ways of improving the encoder-decoder framework to generate diverse and informative responses. Then, we describe the methods of grounding dialog in real-world knowledge to make system responses more contentful.

3.1. Improving Diversity and Informativeness in Neural Response Generation

Most state-of-the-art neural response generation models are based on the encoder-decoder framework which consists of four components: (1) an encoder that encodes user input and dialog context, (2) an intermediate representation, (3) a decoder that generates candidate responses, and (4) a ranker that picks the best candidate as the response. In what follows, we review the proposed methods in four categories, each focusing on improving one of the four components.


Encoder

Encoding more information from query $X$, such as longer dialog history (Sordoni et al., 2015), persona (Li et al., 2016b), and hidden topics (Serban et al., 2017), has proved to be helpful for generating more informative results. Xing et al. (2017) extracted topic words, rather than hidden topics, using LDA, and encoded such words in a topic-aware model. The model generates a response by jointly attending to the input and the topic words. Topic words are also used to model topic transition in multi-turn conversations (Wang et al., 2018a). The hybrid methods described in Section 2.3 (Pandey et al., 2018; Song et al., 2018; Wu et al., 2019) encode the retrieved prototype responses to help generate more informative responses.

Intermediate Representation

Instead of encoding $X$ using a fixed-size vector as in (Sutskever et al., 2014), methods have been proposed to use richer intermediate representations (e.g., by using additional latent variables) to enhance the representation capability to address the one-to-many issue in dialog, and to improve the interpretability of the representation in order to better control the response generation. Zhao et al. (2017) introduced CVAE for dialogue generation and adopted a Gaussian distribution, rather than a fixed vector, as the form of representation, thus obtaining diverse responses via sampling the latent variable.

Du et al. (2018) introduced a sequence of continuous latent variables to model response diversity, and demonstrated empirically that it is more effective than using a single latent variable. Zhao et al. (2018) proposed an unsupervised representation learning method to use discrete latent variables, instead of dense continuous ones, which improves the interpretability of the representation. Zhou et al. (2017, 2018c) assumed that there exist some latent responding mechanisms, each of which can generate different responses for a single input post. These responding mechanisms are modeled as latent embeddings, and can be used to encode the input into a mechanism-aware context to generate responses with controlled generation styles and topics. Gao et al. (2019b) proposed a SpaceFusion model which induces a latent space that fuses the two latent spaces generated by Seq2Seq and auto-encoder, respectively, in such a way that after encoding $X$ into a vector in the space, the distance and direction from the predicted response vector given the context roughly match the relevance and diversity, respectively.


Decoder

Assigning additional probability mass to desirable words in the decoder is a commonly used method to gain control of what to generate. Mathematically, this can be implemented by adjusting the output word distribution as follows:

$$\tilde{P}(y_t \mid y_{<t}, X) = \frac{1}{Z}\left[P_\theta(y_t \mid y_{<t}, X) + \hat{P}(y_t \mid y_{<t}, X)\right]$$

where $y_{<t}$ is the generated prefix; $\hat{P}(y_t \mid y_{<t}, X)$ assigns additional probabilities to the words to be controlled; and $Z$ is a normalization term to ensure a probability distribution. Many existing works use this formulation. The most notable example is CopyNet (Gu et al., 2016), which copies infrequent words from the input to the output, thus assigning higher probabilities to those rare words. In (Zhang et al., 2018c), $\hat{P}$ is formulated as a Gaussian distribution, which assigns higher probabilities to rare words to control the specificity of a response, where the specificity score of a word is proportional to its IDF (inverse document frequency) score.
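The idea of adding probability mass to selected words and renormalizing can be sketched as follows. The additive combination, the function name `boost_distribution`, and the toy numbers are assumptions for illustration; CopyNet and the specificity-controlled model each compute the additional term in their own way.

```python
from typing import Dict


def boost_distribution(
    base_probs: Dict[str, float], bonus: Dict[str, float]
) -> Dict[str, float]:
    """Add extra mass to selected words, then renormalize so the result sums to 1."""
    combined = {w: p + bonus.get(w, 0.0) for w, p in base_probs.items()}
    z = sum(combined.values())  # normalization term
    return {w: v / z for w, v in combined.items()}


# The decoder prefers the generic word; the bonus favors the rarer content word.
base = {"ok": 0.7, "jazz": 0.3}
adjusted = boost_distribution(base, {"jazz": 0.4})
print(adjusted)
```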

Candidate Ranker

To obtain more diverse responses, beam search is commonly used to generate multiple candidates, which are then ranked by another model to pick the final response. The ranking model can use information that is not available in decoding (e.g., mutual information between input and response) or is too expensive to use in decoding (e.g., a large pre-trained language model such as BERT (Devlin et al., 2019)). Li et al. (2016a) proposed to use Maximum Mutual Information (MMI) as the objective to rank candidates to promote the diversity of generated responses. As the standard beam search often produces near-identical results, recent work improves it by encouraging diversity among (partial) hypotheses in the beam. For example, Li et al. (2016c) penalized lower-ranked siblings extended from the same parents, so that the N-best hypotheses in the beam at each time step are more likely to expand from different parents, and thus more diverse. Vijayakumar et al. (2018) divided the hypotheses into several groups and applied beam search group by group. The model favours the hypotheses that are dissimilar to the ones in the previous groups.
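Re-ranking beam candidates with an MMI-style objective can be sketched as below. The score used here, the conditional log-probability of the response minus a weighted language-model log-probability, is one common way to realize the mutual information idea; the candidate strings, their log-probabilities, and the weight `lam` are made-up values for illustration, and in practice both log-probabilities would come from trained models.

```python
from typing import List, Tuple

# Each candidate: (response, log p(response | input), log p(response)).
Candidate = Tuple[str, float, float]


def mmi_rerank(candidates: List[Candidate], lam: float = 0.5) -> List[Candidate]:
    """Sort candidates by log p(Y|X) - lam * log p(Y), penalizing responses
    that are likely regardless of the input (i.e., generic ones)."""
    return sorted(candidates, key=lambda c: c[1] - lam * c[2], reverse=True)


beam = [
    ("i don't know", -2.0, -1.0),   # likely under the model, but generic
    ("i love jazz", -2.5, -6.0),    # slightly less likely, far more specific
]
print(mmi_rerank(beam, lam=0.5)[0][0])
print(mmi_rerank(beam, lam=0.0)[0][0])
```

With `lam=0.0` the ranking reduces to plain likelihood and the generic response wins; with `lam=0.5` the specific response is preferred.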

3.2. Knowledge Grounded Dialog Models

Knowledge is crucial for language understanding and generation. To build effective human-machine interactions, it is indispensable to ground the concepts, entities, and relations in text to commonsense knowledge or real-world facts such as those stored in Freebase and Wikipedia. An open-domain dialog system, equipped with rich knowledge and knowledge grounding capability, should be able to identify the entities and topics mentioned in user input, link them to real-world facts, retrieve related background information, and thereby respond to users in a proactive way, e.g., by recommending new, related topics to discuss.

Knowledge has been shown to be useful in both retrieval-based and generation-based dialog systems. A well-known example of the former is Microsoft XiaoIce (Zhou et al., 2018a). XiaoIce relies on a large knowledge graph (KG) to identify the topics and knowledge related to user input for both response generation and topic management. In (Young et al., 2018), a Tri-LSTM model is proposed that uses commonsense knowledge as external memory, so that the model can encode commonsense assertions for response selection. An early example of using knowledge for generating responses is (Han et al., 2015), where responses are generated from manually crafted templates that are filled with relevant knowledge triples. In (Ghazvininejad et al., 2018), a knowledge-grounded model is proposed to generate a response by incorporating some retrieved posts that are relevant to the input. The knowledge in (Ghazvininejad et al., 2018) is in the form of unstructured posts retrieved by an information retrieval model, and the quality is mixed. Pre-compiled structured knowledge, which is in the form of fact triples, is believed to be of high quality and has been shown to help conversation generation (Zhu et al., 2017; Liu et al., 2018). Zhu et al. (2017) dealt with a scenario where two speakers are conversing based on each other’s private knowledge bases in the music domain. The generation model can generate a word in the response from either the context or the knowledge base. In (Liu et al., 2018), a knowledge diffusion model is proposed to not only answer factoid questions based on a knowledge base, but also generate an appropriate response containing knowledge base entities that are relevant to the input. Zhou et al. (2018e) exploited large-scale commonsense knowledge for conversation generation. First, a one-hop subgraph is retrieved from ConceptNet (Speer et al., 2017) for each word in an input post. Then, the word vectors, along with the graph vectors which extend the meaning of the word via its neighboring entities and relations, are used to encode the input post. During decoding, a graph attention mechanism is applied in which the model first attends to a knowledge graph and then to a triple within each graph, and the decoder chooses a word to generate from either the graph or the common vocabulary.
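The two steps attributed to Zhou et al. (2018e), one-hop retrieval and two-level attention, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the triples are toy data rather than ConceptNet entries, the function names are hypothetical, and the attention scores are supplied by hand, whereas in a trained model they would be computed from the decoder state.

```python
import math
from collections import defaultdict

# Toy (head, relation, tail) triples; illustrative only, not taken from ConceptNet.
TRIPLES = [
    ("rain", "RelatedTo", "umbrella"),
    ("rain", "Causes", "wet"),
    ("umbrella", "UsedFor", "staying_dry"),
    ("coffee", "IsA", "drink"),
]

def build_index(triples):
    """Index every triple under each entity it mentions (head or tail)."""
    index = defaultdict(list)
    for head, rel, tail in triples:
        index[head].append((head, rel, tail))
        if tail != head:
            index[tail].append((head, rel, tail))
    return index

def one_hop_subgraphs(post_tokens, index):
    """Map each token found in the graph to its one-hop triples."""
    return {tok: list(index[tok]) for tok in post_tokens if tok in index}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hierarchical_attention(graph_scores, triple_scores):
    """Two-level attention: P(triple) = P(graph) * P(triple | graph).

    graph_scores holds one score per retrieved graph; triple_scores holds
    one list of scores per graph, one score per triple in that graph.
    """
    graph_probs = softmax(graph_scores)
    return [[g * t for t in softmax(ts)]
            for g, ts in zip(graph_probs, triple_scores)]

index = build_index(TRIPLES)
graphs = one_hop_subgraphs(["i", "forgot", "my", "umbrella"], index)
weights = hierarchical_attention([0.0, 0.0], [[0.0, 0.0], [1.0]])
```

Here `graphs` contains a single subgraph, for the token "umbrella", and the entries of `weights` sum to one across all triples, so they can be read as a distribution over candidate knowledge triples.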

4. Consistency

A human-like dialog system needs to embody a consistent personality, so that it can gain the user’s confidence and trust (Shum et al., 2018; Zhou et al., 2018a). Personality settings include age, gender, language, speaking style, general (positive) attitude, level of knowledge, areas of expertise, and a proper voice accent. For example, the persona of XiaoIce (Zhou et al., 2018a) is designed as an 18-year-old girl who is always reliable, sympathetic, affectionate, and has a wonderful sense of humor. Despite being extremely knowledgeable (due to her access to large volumes of data), XiaoIce never comes across as egotistical and only demonstrates her wit and creativity when appropriate. However, modeling these factors in dialog systems remains very challenging because the embodiment of these personality features is often very implicit and subtle, especially when they have to be expressed using natural language.

Studies of personalization in dialog models can be roughly classified into two types: implicit personalization and explicit personalization. In the former type, the user’s persona is represented by a vector. For instance, Kim et al. (2014) proposed a ranking-based approach to integrate a personal knowledge base and user interests in a dialogue system. Bang et al. (2015) extended the user input by exploiting examples retrieved from her personal knowledge base to help identify the candidate responses that fit her persona. Li et al. (2016b) and Zhang et al. (2017) used an embedding vector to represent a user (speaker) persona and fed the user embedding into each decoding position of the decoder. Such models need to be trained using conversational data labeled with user identifiers, which is expensive to collect in large quantities. Thus, Wang et al. (2017c) proposed to train personalized models with only group attributes (e.g., male or female). The group attributes are embedded as vectors and then fed into the decoder for response generation. Although Zhang et al. (2018d) and Ouchi and Tsuboi (2016) showed that user embedding is an effective technique to distinguish the roles of speakers and addressees in multi-party conversation, personalization in these models is handled in an implicit way and is thus not easy to interpret or control when generating desired responses. In (Qian et al., 2018), an explicit personalization model is proposed to generate personality-coherent responses given a pre-specified profile. The chatbot’s personality is defined by a key-value table (i.e., profile) which consists of name, gender, age, hobbies, and so on. During generation, the model first chooses a key-value pair from the profile and then decodes a response from the chosen key-value pair forward and backward. This model can be trained on generic dialogue data without user identifiers. XiaoIce also uses an explicit personalization model (Zhou et al., 2018a).
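To make the implicit approach concrete, the sketch below shows what feeding a user embedding into each decoding position can look like at the input level. It is a simplified illustration under stated assumptions: the embedding table, its values, and the function name are hypothetical, and a real model would learn the speaker vectors jointly with the decoder rather than hard-code them.

```python
# Hypothetical speaker embeddings; in a trained model these are learned parameters.
SPEAKER_EMBEDDINGS = {
    "user_a": [0.9, 0.1],
    "user_b": [0.2, 0.7],
}

def persona_decoder_inputs(word_vectors, speaker_id, table=SPEAKER_EMBEDDINGS):
    """Concatenate the same speaker vector onto the word vector of every step.

    word_vectors: list of per-step word embeddings (lists of floats).
    Returns one input vector per decoding position.
    """
    speaker_vec = table[speaker_id]
    return [list(w) + list(speaker_vec) for w in word_vectors]

steps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
inputs = persona_decoder_inputs(steps, "user_a")
```

The same lookup works for group attributes: keying the table by an attribute value instead of a user identifier gives the kind of input described for Wang et al. (2017c), without requiring per-user labels.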

In the studies of (Zhang et al., 2017; Mo et al., 2018; Casanueva et al., 2015; Wang et al., 2017b), personalized conversation generation is cast as domain adaptation or transfer learning. The idea is to first train a general conversation model on a large corpus (source domain) and then to transfer the model to a new speaker or domain using small amounts of personalized data (target domain).

Casanueva et al. (2015) proposed to automatically gather dialogues from similar speakers to improve the performance of policy learning of personalized dialogue systems. Zhang et al. (2017) proposed a two-phase transfer learning approach, namely initialization then adaptation, to generate personalized responses. They also proposed a quasi-Turing test method to evaluate the performance of the generated personalized responses. Yang et al. (2017) presented a transfer learning framework similar to Zhang et al. (2017), but proposed to use a new adaptation mechanism based on reinforcement learning.

Luan et al. (2017) proposed a multi-task learning approach to take the response generation and utterance representation as two sub-tasks for speaker role adaptation.

Stylistic response generation (Wang et al., 2017b; Oraby et al., 2018) can be viewed as a form of personalization in conversation. The main challenges lie in two aspects: disentangling content and style in representation, and constructing parallel corpora containing the same content in different styles. Wang et al. (2017b) utilized small-scale stylistic data and proposed a topic embedding model to generate responses in specific styles and topics simultaneously. Oraby et al. (2018) demonstrated that it is possible to automate the construction of a parallel corpus where each meaning representation can be realized in different styles with controllable stylistic parameters.

There have been increasing efforts to build personalized dialogue corpora. In (Zhang et al., 2018b), a multi-turn dialogue corpus is constructed, where each dialogue session involves two speakers and the persona of each speaker is defined by several sentences describing the speaker’s hobbies or preferences. Mazaré et al. (2018) presented a simple method of constructing a large-scale personalized dataset from social media where a user’s persona is defined by a set of sentences of particular patterns describing their preferences. Joshi et al. (2017) developed a personalized version of the bAbI dialog dataset (Bordes et al., 2017) by associating each goal-oriented dialog with user traits such as gender, age, and favorite foods. In (Zheng et al., 2019), a large-scale personalized dialog corpus has been developed. The corpus consists of multi-turn conversations collected from Weibo with speaker IDs. Each speaker is associated with her personal information including gender, age, location, and interest.

5. Interactiveness

Interactiveness refers to the system’s ability to generate interpersonal responses to maximize long-term user engagement. To improve interactiveness, it is important to understand the user’s emotion and affect, in addition to dialog content, and to optimize the system’s behavior and interaction strategy in multi-turn conversations.

5.1. Modeling User Emotion

Emotion perception and expression are vital for building a human-like dialog system. Earlier attempts at building emotional dialog systems are mostly inspired by psychology findings. Those systems are either rule-based or trained on small-scale data, and work well only in a controlled environment. Thanks to the availability of large-scale data and the recent progress on neural conversational AI, many neural response generation models have been proposed to perceive and express emotions in an open-domain dialog setting. Zhou et al. (2018b) proposed the Emotional Chatting Machine (ECM) to generate emotional responses given a pre-specified emotion. ECM consists of three components: (1) an emotion category embedding which is fed into each decoding position, (2) an internal emotion state which is assumed to decay gradually during decoding and finally reach zero, and (3) an external memory which allows the model to choose emotional (e.g., lovely) or generic (e.g., person) words explicitly at each decoding step. The authors also presented some typical emotion interaction patterns in human-human conversations such as empathy and comfort, which may inspire the design of emotion interaction between human and machine. Asghar et al. (2018) developed a method of affective response generation that consists of three components: (1) the affective vectors based on Valence/Arousal/Dominance dimensions (Warriner et al., 2013), which serve as a supplement to word vectors; (2) the affective loss functions which maximize or minimize the affective consistency between a post and a response; and (3) the affective beam search algorithm for seeking affective responses. In (Zhou and Wang, 2018), a conditional variational autoencoder is proposed to generate more emotional responses conditioned on an input post and some pre-specified emojis. Huber et al. (2018) studied how emotion can be grounded in an image to generate more affective conversations. In addition to text, the decoder of the model also takes as input the scene, sentiment, and facial coding features extracted from a given image.
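The explicit choice between emotional and generic words in ECM's external memory can be pictured as a gated mixture of two word distributions. The sketch below is one plausible reading rather than the authors' implementation: the gate value `alpha`, the two toy distributions, and the function name are assumptions, and in a trained model both the gate and the distributions would be predicted from the decoder state at each step.

```python
def mix_vocabularies(p_generic, p_emotion, alpha):
    """Blend a generic-word and an emotion-word distribution with gate alpha.

    p_generic, p_emotion: dicts mapping word -> probability, each summing to 1.
    alpha: probability of choosing an emotional word at this decoding step.
    Returns a single distribution over the union of both vocabularies.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = {w: (1.0 - alpha) * p for w, p in p_generic.items()}
    for w, p in p_emotion.items():
        mixed[w] = mixed.get(w, 0.0) + alpha * p
    return mixed

step_distribution = mix_vocabularies(
    {"person": 0.6, "day": 0.4},   # generic words
    {"lovely": 1.0},               # emotional words
    alpha=0.25,
)
```

With a small gate value most of the probability mass stays on generic words; raising `alpha` shifts mass toward the emotion vocabulary, which is how an explicit gate gives control over when emotion is expressed.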

Controlling the emotion or sentiment has been a recent popular topic in language generation (Hu et al., 2017; Radford et al., 2017; Ghosh et al., 2017). In (Radford et al., 2017), an RNN-based language model is trained on large-scale review data where some neurons are reported to be highly correlated with sentiment expression. Ghosh et al. (2017) proposed an affective language model which generates an affective sequence from a leading context. At each decoding position, the model estimates an affective vector of the already generated prefix by keyword spotting using the Linguistic Inquiry and Word Count (LIWC) dictionary (Pennebaker et al., 2001). The vector is then used to generate the next word. In (Wang and Wan, 2018), to generate reviews of a particular polarity, the authors proposed a multi-class generative adversarial network which consists of multiple generators for multi-class polarities and a multi-class discriminator.
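Keyword spotting over a generated prefix, as described for the affective language model of Ghosh et al. (2017), can be sketched in a few lines. The lexicon below is a toy stand-in, since the LIWC dictionary is not reproduced here, and both the category names and the binary encoding are assumptions chosen for illustration.

```python
# Toy affect lexicon standing in for LIWC; categories and words are hypothetical.
TOY_LEXICON = {
    "angry": {"hate", "angry", "furious"},
    "positive": {"happy", "great", "love"},
    "sad": {"sad", "cry", "lonely"},
}

def affect_vector(prefix_tokens, lexicon=TOY_LEXICON):
    """Return one binary flag per category, in sorted category order.

    A flag is 1 if any token of the prefix appears in that category's word set.
    """
    tokens = {t.lower() for t in prefix_tokens}
    return [1 if tokens & lexicon[cat] else 0 for cat in sorted(lexicon)]

vec = affect_vector("I feel so sad and lonely".split())
```

The resulting vector (ordered as angry, positive, sad) would then be passed to the language model as an extra conditioning signal when predicting the next word.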

5.2. Modeling Conversation Behavior and Strategy

As pointed out in (Zhou et al., 2018a), an open-domain dialog system needs to have enough social skills to have engaging conversations with users and eventually establish long-term emotional connections with users. These social skills include topic planning and dialog policy which can determine whether to drive the conversation to a new topic when e.g., the conversation has stalled, or whether or not to be actively listening when the user herself is engaged in the conversation. Nothdurft et al. (2015) elucidated the challenges of proactiveness in dialogue systems and how they influence the effectiveness of turn-taking behaviour in multimodal and unimodal dialogue systems. Yu et al. (2016) proposed several generic conversational strategies to handle possible system breakdowns in non-task-oriented dialog systems, and designed policies to select these strategies according to dialog context. Zhang et al. (2018a) proposed a task of predicting from the very beginning of a conversation whether it will get out of hand. The authors developed a framework for capturing pragmatic devices, such as politeness strategies and rhetorical prompts, used to start a conversation, and analyzed their relation to its future trajectory. Applying this framework in a controlled setting, it is possible to detect early warning signs of antisocial behavior in online discussions.

The above studies inspire researchers to devise new methods of incorporating social skills into an open-domain dialog system. In (Li et al., 2016d), a retrieval-based method is proposed to first detect the sign of stalemate using rules, and then retrieve responses that contain the entities that are relevant to the input, assuming that a proactive reply should contain the entities that can be triggered from the ones in the input. Yan and Zhao (2018) proposed a proactive suggestion method where a look-ahead post for a user is decoded in addition to the system response, conditioned on the context and the previously generated response. The user can use the generated post directly, or type a new one during conversation. Wang et al. (2018b) argued that asking good questions is an important proactive behavior in conversation. A typed decoder is proposed to generate meaningful questions by predicting a type distribution over topic words, interrogatives, and ordinary words at each decoding position. The final output distribution is modeled by the type distribution, leading to a strong control over the question to be generated. Ke et al. (2018) conducted a systematic study of generating responses with different sentence functions, such as interrogative, imperative, and declarative sentences. These sentence functions play different roles in conversations. For instance, imperative responses are used to make requests, give directions and instructions, or elicit further interactions; and declarative responses make statements or explanations.
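One way to read the typed decoder is as a mixture: the probability of a word is its probability under each word type, weighted by the predicted probability of that type. The sketch below illustrates this reading with hand-picked numbers. The mixture form, the function name, and all values are assumptions for illustration, not a reproduction of the model in Wang et al. (2018b).

```python
def typed_output_distribution(type_probs, word_probs_by_type):
    """Compute P(word) = sum over types of P(type) * P(word | type).

    type_probs: dict mapping type name -> probability (summing to 1).
    word_probs_by_type: dict mapping type name -> {word: probability}.
    """
    output = {}
    for type_name, p_type in type_probs.items():
        for word, p_word in word_probs_by_type[type_name].items():
            output[word] = output.get(word, 0.0) + p_type * p_word
    return output

type_probs = {"interrogative": 0.5, "topic": 0.3, "ordinary": 0.2}
word_probs_by_type = {
    "interrogative": {"what": 0.8, "who": 0.2},
    "topic": {"movie": 1.0},
    "ordinary": {"the": 0.5, "what": 0.5},
}
dist = typed_output_distribution(type_probs, word_probs_by_type)
```

Because the type distribution scales every word probability, pushing mass toward the interrogative type at the first decoding position makes a question word far more likely there, which is the kind of control the paragraph above refers to.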

6. Dialog Evaluation

Evaluating the quality of an open-domain dialog system is challenging because open-domain conversations are inherently open-ended (Ram et al., 2018). For example, if a user asks the question “what do you think of Michael Jackson?”, there are hundreds of distinct but plausible responses. Evaluation of a dialog system can be performed manually or in an automatic way. In manual evaluation, human judges are hired to assess the generated results in terms of predefined metrics, with well-documented guidelines and exemplars. Evaluation is conducted by either scoring each individual result (point-wise) or comparing two competing results (pair-wise). In some dialog evaluation challenges, manual evaluation is commonly adopted in the final-stage competition (Dinan et al., 2019; Ram et al., 2018). For instance, the second conversational intelligence challenge (Dinan et al., 2019) adopted manual evaluation by paid workers from Amazon Mechanical Turk and unpaid volunteers, and the organizers reported the rating difference between the two user groups: the volunteers’ evaluation had relatively fewer good (i.e. long and consistent) dialogues, while paid workers tended to rate the models higher than the volunteers.

Since manual evaluation is expensive, time-consuming, and not always reproducible, automatic evaluation is more frequently used, especially at the early stage of development. For retrieval-based methods, traditional information retrieval evaluation metrics such as precision@k, mean average precision (MAP), and normalized Discounted Cumulative Gain (nDCG) (Manning et al., 2008) are applicable. For generation-based models, metrics such as perplexity, BLEU (Papineni et al., 2002), and distinct-n (Li et al., 2016a) are widely used. Perplexity measures how well a probabilistic model fits the data, and is a strong indicator of whether the generated text is grammatical. BLEU, adopted from machine translation, measures the lexical overlap between the generated responses and the reference ones. Distinct-n measures the diversity by computing the proportion of unique n-grams in a generated set. However, Liu et al. (2016) argued that automatic metrics such as BLEU, ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) all have low correlation with manual evaluation. But as pointed out in (Gao et al., 2019a), the correlation analysis in (Liu et al., 2016) is performed at the sentence level while BLEU is designed from the outset to be used as a corpus-level metric. Galley et al. (2015) showed that the correlation of string-based metrics (BLEU and deltaBLEU) increases significantly when the unit of measurement is larger than a sentence. Nevertheless, in open-domain dialog systems, the same input may have many plausible responses that differ significantly in topic or content. Low BLEU (or other metric) scores therefore do not necessarily indicate low quality, as the number of reference responses in a test set is always limited. As a result, there has been significant debate as to whether such automatic metrics are appropriate for evaluating open-domain dialog systems (Gao et al., 2019a).
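Several of the automatic metrics named above are short enough to state directly in code. The sketch below gives minimal versions of distinct-n, perplexity, precision@k, and nDCG under stated assumptions: responses are already tokenized, perplexity is computed from the probabilities a model assigned to the reference tokens, relevance labels are non-negative numbers, and the function names are chosen here for illustration. Published implementations differ in details such as the denominator of distinct-n and the gain function of nDCG. BLEU is omitted because a faithful version needs clipping and a brevity penalty.

```python
import math

def distinct_n(responses, n):
    """Proportion of unique n-grams among all n-grams in tokenized responses."""
    ngrams = [tuple(r[i:i + n]) for r in responses for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the reference tokens."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked candidates that are relevant (label > 0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for rel in ranked_relevance[:k] if rel > 0) / k

def ndcg_at_k(ranked_relevance, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0

generated = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
d1 = distinct_n(generated, 1)   # 5 unique unigrams out of 8
d2 = distinct_n(generated, 2)   # 4 unique bigrams out of 6
```

The two near-identical responses in `generated` show why distinct-n is used as a diversity signal: a system that keeps producing bland variants of the same reply scores low even when every individual reply is fluent.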

Recently, trainable metrics for dialog evaluation have attracted some research efforts. Lowe et al. (2017) proposed a machine-learned metric, called ADEM, for dialog evaluation. They presented a variant of the VHRED model (Serban et al., 2017) that takes context, user input, gold and system responses as input, and produces a qualitative score between 1 and 5. The authors claimed that the learned metric correlates better with human evaluation than BLEU and ROUGE. Tao et al. (2018) proposed an evaluation model, called RUBER, which does not rely on human-judged scores. RUBER consists of a referenced component to measure the overlap between a system response and a reference response, and an unreferenced component to measure the correlation between the system response and the input utterance. However, as pointed out in (Sai et al., 2019), ADEM can be easily fooled with a variation as simple as reversing the word order in the text. Their experiments on several such adversarial scenarios draw out counter-intuitive scores on the dialogue responses. In fact, any trainable metric leads to potential problems such as overfitting and “gaming of the metric” (Albrecht and Hwa, 2007) [footnote 7], which might explain why none of the previously proposed machine-learned evaluation metrics (Corston-Oliver et al., 2001; Kulesza and Shieber, 2004; Lita et al., 2005; Albrecht and Hwa, 2007; Giménez and Màrquez, 2008; Pado et al., 2009; Stanojević and Sima’an, 2014, etc.) is used in official machine translation benchmarks. Readers may refer to (Gao et al., 2019a) for a detailed discussion.

[Footnote 7] In discussing the potential pitfalls of machine-learned evaluation metrics, Albrecht and Hwa (2007) argued for example that it would be “prudent to defend against the potential of a system gaming a subset of the features.” In the case of deep learning, this gaming would be reminiscent of making non-random perturbations to an input to drastically change the network’s predictions, as was done, e.g., with images in (Szegedy et al., 2013) to show how easily deep learning models can be fooled. Readers may refer to Chapter 5 in Gao et al. (2019a) for a detailed discussion.
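The word-order attack reported by Sai et al. (2019) suggests a simple probe that can be run against any scoring function: score a response, score the same response with its words reversed, and compare. The sketch below is a hypothetical harness built around a deliberately naive, order-insensitive overlap metric. Neither function is ADEM or RUBER, and the names are introduced here only for illustration.

```python
def unigram_overlap(response, reference):
    """Toy metric: fraction of reference tokens that also occur in the response."""
    ref_tokens = reference.split()
    resp_tokens = set(response.split())
    if not ref_tokens:
        return 0.0
    return sum(1 for tok in ref_tokens if tok in resp_tokens) / len(ref_tokens)

def reversal_gap(metric, response, reference):
    """Score difference between a response and its word-reversed version.

    A gap near zero means the metric cannot tell fluent text from scrambled text.
    """
    reversed_response = " ".join(reversed(response.split()))
    return metric(response, reference) - metric(reversed_response, reference)

score = unigram_overlap("i love this movie", "i really love this movie")
gap = reversal_gap(unigram_overlap, "i love this movie", "i really love this movie")
```

For this bag-of-words metric the gap is exactly zero, so the reversed and ungrammatical response earns the same score as the original. A metric intended to reflect response quality should show a clearly positive gap on such pairs.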

All of this suggests that automatic evaluation of dialog systems is by no means a solved problem. We believe that developing a successful automatic evaluation metric has two prerequisites. First, there should be a fairly large, representative conversational dataset. This dataset should have a good coverage of daily life topics and domains. Second, for each input, there should be multiple appropriate responses to address the one-to-many essence in open-domain dialog.

7. Discussions and Future Trends

In this paper, we review the recent progress in developing open-domain dialog systems. We focus the discussion on neural approaches that have been proposed to deal with three key challenges: semantics, consistency, and interactiveness. We also review dialog evaluation metrics for both manual and automatic evaluation, and share our thoughts on how to develop automatic evaluation metrics.

Unlike early generations of dialog assistants, which are designed for tasks that require only short, domain-specific conversations, such as making a reservation or asking for information, open-domain dialog systems are designed to be AI companions that are able to have long, free-form social chats that occur naturally in social and professional human interactions (Ram et al., 2018; Zhou et al., 2018a). Despite the recent progress reviewed in this paper, achieving sustained, coherent, and engaging open-domain conversations remains very challenging. Here, we discuss some future trends that may contribute to building more intelligent open-domain dialog systems:

Topic and Knowledge Grounding

To deliver contentful conversations, it is important to ground conversations in real-world topics or entities (e.g., in knowledge bases). This is part of the semantics challenge we discussed in Section 3. Since natural language understanding in open domains is extremely challenging, knowledge grounding provides to some degree the ability of understanding language in dialog context, as shown in several preliminary studies (Zhou et al., 2018e; Liu et al., 2018; Zhu et al., 2017). Even though an open-domain dialog system has no access to annotated dialog acts (which are available only for task-oriented dialog) to learn to explicitly detect user intents (labeled by dialog acts), the system can still play a proactive role of leading the conversation by for example suggesting new topics or providing new information, if the key concepts and entities are correctly recognized and linked to a knowledge base (Fang et al., 2018; Pichl et al., 2018; Wang et al., 2018b; Zhou et al., 2018a). Some recently proposed corpora, such as Document-grounded conversation (Zhou et al., 2018d) and Wizard of Wikipedia (Dinan et al., 2018), which aim to build a chatbot for delivering conversations grounded in the topics of a document or a Wikipedia page, respectively, provide new test beds for this research.

Empathetic Computing

Sentiment/emotion is a key factor for making effective social interactions, and is crucial for building an empathetic social bot to improve interactiveness. Existing works (Zhou et al., 2018b; Asghar et al., 2018; Zhou et al., 2018a; Zhou and Wang, 2018) in this direction are still in their infancy, as they only deal with superficial expression of emotion. In the future, an empathetic machine should be able to perceive a user’s emotional state and how it changes, deliver emotionally influential conversations, and evaluate the emotional impact of its actions, much of which should be tightly aligned with psychological studies. These abilities become more crucial in more complicated scenarios such as psychological treatment, mental health, and emotional comforting. Moreover, it is insufficient for an empathetic machine to use only text information. The signals from other modalities such as facial expression and speech prosody should also be leveraged (Liao et al., 2018; Zhang et al., 2019b). To foster the research, Saha et al. (2017) developed a conversational dataset consisting of multi-modal dialog sessions in a fashion domain where each turn contains a textual utterance, one or more images, or a mix of text and images.

Personality of a Social Bot

A coherent personality is important for a social bot to gain human trust, thereby improving the consistency and interactiveness of human-machine conversations. Personality (e.g., the Big Five traits) has been well defined in psychology (Norman, 1963; Gosling et al., 2003). However, existing works (Li et al., 2016b; Qian et al., 2018; Zhang et al., 2018b; Zhou et al., 2018a) are still very preliminary, and need to be significantly extended by incorporating the results of multidisciplinary research covering psychology, cognitive science, computer science, etc. The central problem is how to ensure personality-coherent behaviors in conversations and evaluate such behaviors from the perspectives of multiple disciplines, particularly via psychological studies.

Controllability of Dialog Generation

Most existing open-domain dialog systems depend on neural dialog generation models. Because language generation relies on probabilistic sampling, controllability is a big issue: repetitive, bland, illogical, or even unethical responses are frequently observed. Controllability is closely related to the interpretability and robustness of neural network models, and solving it requires new methods, such as hybrid approaches that combine the strengths of both neural and symbolic methods.

8. Acknowledgement

This work was supported by the National Key R&D Program of China (Grant No. 2018YFC0830200), and partly by the National Science Foundation of China (Grant No.61876096/61332007).

We would like to thank Pei Ke, Qi Zhu, Yilin Niu, Zhihong Shao, Yaoqin Zhang, Hao Zhou, Chris Brockett, Bill Dolan, and Michel Galley for their discussions and contributions to this paper.


References
  • Albrecht and Hwa (2007) Joshua Albrecht and Rebecca Hwa. 2007. A Re-examination of Machine Learning Approaches for Sentence-Level MT Evaluation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Prague, Czech Republic, 880–887.
  • Asghar et al. (2018) Nabiha Asghar, Pascal Poupart, Jesse Hoey, Xin Jiang, and Lili Mou. 2018. Affective Neural Response Generation. In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings, Vol. 10772. 154–166.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR 2015, San Diego, CA, USA, May 7-9, 2015.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
  • Bang et al. (2015) Jeesoo Bang, Hyungjong Noh, Yonghee Kim, and Gary Geunbae Lee. 2015. Example-based chat-oriented dialogue system with personalized long-term memory. In 2015 International Conference on Big Data and Smart Computing, BIGCOMP 2015, Jeju, South Korea, February 9-11, 2015. 238–243.
  • Bordes et al. (2017) Antoine Bordes, Y.-Lan Boureau, and Jason Weston. 2017. Learning End-to-End Goal-Oriented Dialog. In ICLR 2017, Toulon, France, April 24-26, 2017.
  • Casanueva et al. (2015) Iñigo Casanueva, Thomas Hain, Heidi Christensen, Ricard Marxer, and Phil D. Green. 2015. Knowledge transfer between speakers for personalised dialogue management. In Proceedings of SIGDIAL 2015, 2-4 September 2015, Prague, Czech Republic. 12–21.
  • Colby et al. (1971) Kenneth Mark Colby, Sylvia Weber, and Franklin Dennis Hilf. 1971. Artificial paranoia. Artificial Intelligence 2, 1 (1971), 1–25.
  • Corston-Oliver et al. (2001) Simon Corston-Oliver, Michael Gamon, and Chris Brockett. 2001. A Machine Learning Approach to the Automatic Evaluation of Machine Translation. In Proceedings of 39th Annual Meeting of the Association for Computational Linguistics. Toulouse, France, 148–155.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL 2019, June 2–7, 2019, Minneapolis, USA.
  • Dinan et al. (2019) Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander H. Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander I. Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2019. The Second Conversational Intelligence Challenge (ConvAI2). CoRR abs/1902.00098 (2019).
  • Dinan et al. (2018) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-Powered Conversational agents. CoRR abs/1811.01241 (2018).
  • Du et al. (2018) Jiachen Du, Wenjie Li, Yulan He, Ruifeng Xu, Lidong Bing, and Xuan Wang. 2018. Variational Autoregressive Decoder for Neural Response Generation. In Proceedings of EMNLP 2018, Brussels, Belgium, October 31 - November 4, 2018. 3154–3163.
  • Fan et al. (2017) Yixing Fan, Liang Pang, Jianpeng Hou, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2017. MatchZoo: A Toolkit for Deep Text Matching. CoRR abs/1707.07270 (2017).
  • Fang et al. (2018) Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A. Smith, and Mari Ostendorf. 2018. Sounding Board: A User-Centric and Content-Driven Social Chatbot. In Proceedings of NAACL-HLT 2018, New Orleans, Louisiana, USA, June 2-4, 2018, Demonstrations. 96–100.
  • Galley et al. (2015) Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets. In ACL-IJCNLP. 445–450.
  • Gao et al. (2019a) Jianfeng Gao, Michel Galley, Lihong Li, et al. 2019a. Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval 13, 2-3 (2019), 127–298.
  • Gao et al. (2019b) Xiang Gao, Sungjin Lee, Yizhe Zhang, Chris Brockett, Michel Galley, Jianfeng Gao, and Bill Dolan. 2019b. Jointly Optimizing Diversity and Relevance in Neural Response Generation. arXiv preprint arXiv:1902.11205 (2019).
  • Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A Knowledge-Grounded Neural Conversation Model. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 5110–5117.
  • Ghosh et al. (2017) Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. 2017. Affect-LM: A Neural Language Model for Customizable Affective Text Generation. In Proceedings of ACL 2017, Vancouver, Canada, July 30 - August 4. 634–642.
  • Giménez and Màrquez (2008) Jesús Giménez and Lluís Màrquez. 2008. A Smorgasbord of Features for Automatic MT Evaluation. In Proceedings of the Third Workshop on Statistical Machine Translation. 195–198.
  • Gosling et al. (2003) Samuel D Gosling, Peter J Rentfrow, and William B Swann. 2003. A very brief measure of the Big-Five personality domains. Journal of Research in Personality 37, 6 (2003), 504–528.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of ACL 2016, August 7-12, 2016, Berlin, Germany.
  • Han et al. (2015) Sangdo Han, Jeesoo Bang, Seonghan Ryu, and Gary Geunbae Lee. 2015. Exploiting knowledge base to generate responses for natural language dialog listening agents. In Proceedings of SIGDIAL 2015, 2-4 September 2015, Prague, Czech Republic. 129–133.
  • Henderson et al. (2013) Matthew Henderson, Blaise Thomson, and Steve J. Young. 2013. Deep Neural Network Approach for the Dialog State Tracking Challenge. In Proceedings of SIGDIAL 2013, 22-24 August 2013, SUPELEC, Metz, France. 467–471.
  • Higashinaka et al. (2014) Ryuichiro Higashinaka, Kenji Imamura, Toyomi Meguro, Chiaki Miyazaki, Nozomi Kobayashi, Hiroaki Sugiyama, Toru Hirano, Toshiro Makino, and Yoshihiro Matsuo. 2014. Towards an open-domain conversational system fully based on natural language processing. In COLING 2014, August 23-29, 2014, Dublin, Ireland. 928–939.
  • Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In NIPS 2014, December 8-13 2014, Montreal, Quebec, Canada. 2042–2050.
  • Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward Controlled Generation of Text. In Proceedings of ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Vol. 70. 1587–1596.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P. Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013. 2333–2338.
  • Huber et al. (2018) Bernd Huber, Daniel McDuff, Chris Brockett, Michel Galley, and Bill Dolan. 2018. Emotional Dialogue Generation using Image-Grounded Language Models. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI 2018, Montreal, QC, Canada, April 21-26, 2018. 277.
  • Joshi et al. (2017) Chaitanya K. Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in Goal-Oriented Dialog. CoRR abs/1706.07503 (2017).
  • Kaiser et al. (2018) Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast Decoding in Sequence Models Using Discrete Latent Variables. In Proceedings of ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Vol. 80. 2395–2404.
  • Ke et al. (2018) Pei Ke, Jian Guan, Minlie Huang, and Xiaoyan Zhu. 2018. Generating Informative Responses with Controlled Sentence Function. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 1499–1508.
  • Kim et al. (2014) Yonghee Kim, Jeesoo Bang, Junhwi Choi, Seonghan Ryu, Sangjun Koo, and Gary Geunbae Lee. 2014. Acquisition and Use of Long-Term Memory for Personalized Dialog Systems. In Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction - Second International Workshop, MA3HMI 2014, Held in Conjunction with INTERSPEECH 2014, Singapore, Singapore, September 14, 2014, Vol. 8757. 78–87.
  • Kulesza and Shieber (2004) Alex Kulesza and Stuart M. Shieber. 2004. A Learning Approach to Improving Sentence-Level MT Evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation. Baltimore, MD.
  • Lee et al. (2018) Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. In Proceedings of EMNLP 2018, Brussels, Belgium, October 31 - November 4, 2018. 1173–1182.
  • Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A Diversity-Promoting Objective Function for Neural Conversation Models. In NAACL HLT 2016, San Diego, California, USA, June 12-17, 2016. 110–119.
  • Li et al. (2016b) Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016b. A Persona-Based Neural Conversation Model. In Proceedings of ACL 2016, August 7-12, 2016, Berlin, Germany.
  • Li et al. (2016c) Jiwei Li, Will Monroe, and Dan Jurafsky. 2016c. A Simple, Fast Diverse Decoding Algorithm for Neural Generation. CoRR abs/1611.08562 (2016).
  • Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial Learning for Neural Dialogue Generation. In Proceedings of EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 2157–2169.
  • Li et al. (2016d) Xiang Li, Lili Mou, Rui Yan, and Ming Zhang. 2016d. StalemateBreaker: A Proactive Content-Introducing Approach to Automatic Human-Computer Conversation. In Proceedings of IJCAI 2016, New York, NY, USA, 9-15 July 2016. 2845–2851.
  • Liao et al. (2018) Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2018. Knowledge-aware Multimodal Dialogue Systems. In 2018 ACM Multimedia Conference. 801–809.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004).
  • Lipton et al. (2018) Zachary C. Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. 2018. BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 5237–5244.
  • Lita et al. (2005) Lucian Vlad Lita, Monica Rogati, and Alon Lavie. 2005. BLANC: Learning Evaluation Metrics for MT. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05). 740–747.
  • Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. 2122–2132.
  • Liu et al. (2018) Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu, and Dawei Yin. 2018. Knowledge Diffusion for Neural Dialogue Generation. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 1489–1498.
  • Liu (2010) Tie-Yan Liu. 2010. Learning to rank for information retrieval. In Proceedings of SIGIR 2010, Geneva, Switzerland, July 19-23, 2010. 904.
  • Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. In Proceedings of ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. 1116–1126.
  • Lu and Li (2013) Zhengdong Lu and Hang Li. 2013. A Deep Architecture for Matching Short Texts. In NIPS 2013, December 5-8, 2013, Lake Tahoe, Nevada, United States. 1367–1375.
  • Luan et al. (2017) Yi Luan, Chris Brockett, Bill Dolan, Jianfeng Gao, and Michel Galley. 2017. Multi-Task Learning for Speaker-Role Adaptation in Neural Conversation Models. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017 - Volume 1: Long Papers. 605–614.
  • Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press.
  • Mazaré et al. (2018) Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training Millions of Personalized Dialogue Agents. In Proceedings of EMNLP 2018, Brussels, Belgium, October 31 - November 4, 2018. 2775–2779.
  • Mo et al. (2018) Kaixiang Mo, Yu Zhang, Shuangyin Li, Jiajun Li, and Qiang Yang. 2018. Personalizing a Dialogue System With Transfer Reinforcement Learning. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 5317–5324.
  • Mostafazadeh et al. (2017) Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation. In IJCNLP. 462–472.
  • Mrksic et al. (2017) Nikola Mrksic, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve J. Young. 2017. Neural Belief Tracker: Data-Driven Dialogue State Tracking. In Proceedings of ACL 2017, Vancouver, Canada, July 30 - August 4. 1777–1788.
  • Norman (1963) Warren T Norman. 1963. Toward an adequate taxonomy of personality attributes: Replicated factor structure in peer nomination personality ratings. The Journal of Abnormal and Social Psychology 66, 6 (1963), 574.
  • Nothdurft et al. (2015) Florian Nothdurft, Stefan Ultes, and Wolfgang Minker. 2015. Finding appropriate interaction strategies for proactive dialogue systems-an open quest. In Proceedings of the 2nd European and the 5th Nordic Symposium on Multimodal Communication, August 6-8, 2014, Tartu, Estonia. 73–80.
  • Oraby et al. (2018) Shereen Oraby, Lena Reed, Shubhangi Tandon, Sharath T. S., Stephanie M. Lukin, and Marilyn A. Walker. 2018. Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators. In Proceedings of SIGDIAL 2018, Melbourne, Australia, July 12-14, 2018. 180–190.
  • Ouchi and Tsuboi (2016) Hiroki Ouchi and Yuta Tsuboi. 2016. Addressee and Response Selection for Multi-Party Conversation. In Proceedings of EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. 2133–2143.
  • Pado et al. (2009) Sebastian Pado, Daniel Cer, Michel Galley, Dan Jurafsky, and Christopher D. Manning. 2009. Measuring Machine Translation Quality as Semantic Equivalence: A Metric Based on Entailment Features. Machine Translation (2009), 181–193.
  • Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab K. Ward. 2016. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval. IEEE/ACM Trans. Audio, Speech & Language Processing 24, 4 (2016), 694–707.
  • Pandey et al. (2018) Gaurav Pandey, Danish Contractor, Vineet Kumar, and Sachindra Joshi. 2018. Exemplar Encoder-Decoder for Neural Conversation Generation. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 1329–1338.
  • Pang et al. (2016) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition. In Proceedings of AAAI 2016, February 12-17, 2016, Phoenix, Arizona, USA. 2793–2799.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002, July 6-12, 2002, Philadelphia, PA, USA. 311–318.
  • Peng et al. (2017) Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Çelikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning. In Proceedings of EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 2231–2240.
  • Pennebaker et al. (2001) James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates 71 (2001).
  • Pichl et al. (2018) Jan Pichl, Petr Marek, Jakub Konrád, Martin Matulík, Hoang Long Nguyen, and Jan Sedivý. 2018. Alquist: The Alexa Prize Socialbot. CoRR abs/1804.06705 (2018).
  • Qian et al. (2018) Qiao Qian, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018. Assigning Personality/Profile to a Chatting Machine for Coherent Conversation Generation. In Proceedings of IJCAI 2018, July 13-19, 2018, Stockholm, Sweden. 4279–4285.
  • Radford et al. (2017) Alec Radford, Rafal Józefowicz, and Ilya Sutskever. 2017. Learning to Generate Reviews and Discovering Sentiment. CoRR abs/1704.01444 (2017).
  • Ram et al. (2018) Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Eric King, Kate Bland, Amanda Wartick, Yi Pan, Han Song, Sk Jayadevan, Gene Hwang, and Art Pettigrue. 2018. Conversational AI: The Science Behind the Alexa Prize. CoRR abs/1801.03604 (2018).
  • Ritter et al. (2011) Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-Driven Response Generation in Social Media. In Proceedings of EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK. 583–593.
  • Rojas-Barahona et al. (2017) Lina Maria Rojas-Barahona, Milica Gasic, Nikola Mrksic, Pei-Hao Su, Stefan Ultes, Tsung-Hsien Wen, Steve J. Young, and David Vandyke. 2017. A Network-based End-to-End Trainable Task-oriented Dialogue System. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017. 438–449.
  • Saha et al. (2017) Amrita Saha, Mitesh Khapra, and Karthik Sankaranarayanan. 2017. Multimodal Dialogs (MMD): A large-scale dataset for studying multimodal domain-aware conversations. arXiv preprint arXiv:1704.00200 (2017).
  • Sai et al. (2019) Ananya Sai, Mithun Das Gupta, Mitesh M. Khapra, and Mukundhan Srinivasan. 2019. Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses. In Proceedings of AAAI 2019, Honolulu, Hawaii, USA, January 27-February 1, 2019.
  • Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of AAAI 2016, February 12-17, 2016, Phoenix, Arizona, USA. 3776–3784.
  • Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In AAAI. 3295–3301.
  • Severyn and Moschitti (2015) Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks. In Proceedings of SIGIR 2015, Santiago, Chile, August 9-13, 2015. 373–382.
  • Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural Responding Machine for Short-Text Conversation. In Proceedings of ACL 2015, July 26-31, 2015, Beijing, China. 1577–1586.
  • Shen et al. (2018) Xiaoyu Shen, Hui Su, Shuzi Niu, and Vera Demberg. 2018. Improving Variational Encoder-Decoders in Dialogue Generation. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 5456–5463.
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014. 101–110.
  • Shum et al. (2018) Heung-Yeung Shum, Xiaodong He, and Di Li. 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of IT & EE 19, 1 (2018), 10–26.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning Structured Output Representation using Deep Conditional Generative Models. In NIPS 2015, December 7-12, 2015, Montreal, Quebec, Canada. 3483–3491.
  • Song et al. (2018) Yiping Song, Cheng-Te Li, Jian-Yun Nie, Ming Zhang, Dongyan Zhao, and Rui Yan. 2018. An Ensemble of Retrieval-Based and Generation-Based Human-Computer Conversation Systems. In Proceedings of IJCAI 2018, July 13-19, 2018, Stockholm, Sweden. 4382–4388.
  • Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In NAACL HLT 2015, Denver, Colorado, USA, May 31 - June 5, 2015. 196–205.
  • Speer et al. (2017) Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of AAAI 2017, February 4-9, 2017, San Francisco, California, USA. 4444–4451.
  • Stanojević and Sima’an (2014) Miloš Stanojević and Khalil Sima’an. 2014. Fitting Sentence Level Translation Evaluation with Many Dense Features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, 202–206.
  • Su et al. (2016) Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2016. On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems. In Proceedings of ACL 2016, August 7-12, 2016, Berlin, Germany.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS 2014, December 8-13, 2014, Montreal, Quebec, Canada. 3104–3112.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. CoRR abs/1312.6199 (2013).
  • Tao et al. (2018) Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 722–729.
  • Turing (1950) Alan M Turing. 1950. Computing machinery and intelligence. Mind 59, 236 (1950), 433–460.
  • Vijayakumar et al. (2018) Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse Beam Search for Improved Description of Complex Scenes. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 7371–7379.
  • Vinyals and Le (2015) Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. CoRR abs/1506.05869 (2015).
  • Wallace (2009) Richard S Wallace. 2009. The anatomy of ALICE. In Parsing the Turing Test. Springer, 181–210.
  • Wang et al. (2017b) Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017b. Steering Output Style and Topic in Neural Response Generation. In Proceedings of EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. 2140–2150.
  • Wang et al. (2017c) Jianan Wang, Xin Wang, Fang Li, Zhen Xu, Zhuoran Wang, and Baoxun Wang. 2017c. Group Linguistic Bias Aware Neural Response Generation. In Proceedings of the 9th SIGHAN Workshop on Chinese Language Processing, SIGHAN@IJCNLP 2017, Taipei, Taiwan, December 1, 2017. 1–10.
  • Wang and Wan (2018) Ke Wang and Xiaojun Wan. 2018. SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks. In Proceedings of IJCAI 2018, July 13-19, 2018, Stockholm, Sweden. 4446–4452.
  • Wang et al. (2018a) Wenjie Wang, Minlie Huang, Xin-Shun Xu, Fumin Shen, and Liqiang Nie. 2018a. Chat More: Deepening and Widening the Chatting Topic via A Deep Model. In SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018. 255–264.
  • Wang et al. (2018b) Yansen Wang, Chenyi Liu, Minlie Huang, and Liqiang Nie. 2018b. Learning to Ask Questions in Open-domain Conversational Systems with Typed Decoders. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 2193–2203.
  • Wang et al. (2017a) Zhiguo Wang, Wael Hamza, and Radu Florian. 2017a. Bilateral Multi-Perspective Matching for Natural Language Sentences. In Proceedings of IJCAI 2017, Melbourne, Australia, August 19-25, 2017. 4144–4150.
  • Warriner et al. (2013) Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior research methods 45, 4 (2013), 1191–1207.
  • Weizenbaum (1966) Joseph Weizenbaum. 1966. ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (1966), 36–45.
  • Winata et al. (2017) Genta Indra Winata, Onno Kampman, Yang Yang, Anik Dey, and Pascale Fung. 2017. Nora the empathetic psychologist. In Proc. Interspeech. 3437–3438.
  • Wu et al. (2019) Yu Wu, Furu Wei, Shaohan Huang, Zhoujun Li, and Ming Zhou. 2019. Response Generation by Context-aware Prototype Editing. In Proceedings of AAAI 2019, Honolulu, Hawaii, USA, January 27-February 1, 2019.
  • Wu et al. (2017) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots. In Proceedings of ACL 2017, Vancouver, Canada, July 30 - August 4. 496–505.
  • Xing et al. (2017) Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic Aware Neural Response Generation. In Proceedings of AAAI 2017, February 4-9, 2017, San Francisco, California, USA. 3351–3357.
  • Xu et al. (2018b) Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. 2018b. Diversity-Promoting GAN: A Cross-Entropy Based Generative Adversarial Network for Diversified Text Generation. In Proceedings of EMNLP 2018, Brussels, Belgium, October 31 - November 4, 2018. 3940–3949.
  • Xu et al. (2018a) Peng Xu, Andrea Madotto, Chien-Sheng Wu, Ji Ho Park, and Pascale Fung. 2018a. Emo2Vec: Learning Generalized Emotion Representation by Multi-task Training. arXiv preprint arXiv:1809.04505 (2018).
  • Yan et al. (2016) Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System. In Proceedings of SIGIR 2016, Pisa, Italy, July 17-21, 2016. 55–64.
  • Yan and Zhao (2018) Rui Yan and Dongyan Zhao. 2018. Smarter Response with Proactive Suggestion: A New Generative Neural Conversation Paradigm. In Proceedings of IJCAI 2018, July 13-19, 2018, Stockholm, Sweden. 4525–4531.
  • Yang et al. (2019) Liu Yang, Junjie Hu, Minghui Qiu, Chen Qu, Jianfeng Gao, W Bruce Croft, Xiaodong Liu, Yelong Shen, and Jingjing Liu. 2019. A Hybrid Retrieval-Generation Neural Conversation Model. arXiv preprint arXiv:1904.09068 (2019).
  • Yang et al. (2017) Min Yang, Zhou Zhao, Wei Zhao, Xiaojun Chen, Jia Zhu, Lianqiang Zhou, and Zigang Cao. 2017. Personalized Response Generation via Domain adaptation. In Proceedings of SIGIR 2017, Tokyo, Japan, August 7-11, 2017. 1021–1024.
  • Young et al. (2018) Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting End-to-End Dialogue Systems With Commonsense Knowledge. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 4970–4977.
  • Yu et al. (2016) Zhou Yu, Ziyu Xu, Alan W. Black, and Alexander I. Rudnicky. 2016. Strategy and Policy Learning for Non-Task-Oriented Conversational Systems. In Proceedings of SIGDIAL 2016, 13-15 September 2016, Los Angeles, CA, USA. 404–412.
  • Zhang et al. (2018a) Justine Zhang, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil, Lucas Dixon, Yiqing Hua, Dario Taraborelli, and Nithum Thain. 2018a. Conversations Gone Awry: Detecting Early Signs of Conversational Failure. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 1350–1361.
  • Zhang et al. (2018c) Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2018c. Learning to Control the Specificity in Neural Response Generation. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 1108–1117.
  • Zhang et al. (2018d) Rui Zhang, Honglak Lee, Lazaros Polymenakos, and Dragomir R. Radev. 2018d. Addressee and Response Selection in Multi-Party Conversations With Speaker Interaction RNNs. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 5690–5697.
  • Zhang et al. (2018b) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018b. Personalizing Dialogue Agents: I have a dog, do you have pets too? In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 2204–2213.
  • Zhang et al. (2017) Wei-Nan Zhang, Qingfu Zhu, Yifa Wang, Yanyan Zhao, and Ting Liu. 2017. Neural personalized response generation as domain adaptation. World Wide Web (2017), 1–20.
  • Zhang et al. (2019a) Zheng Zhang, Minlie Huang, Zhongzhou Zhao, Feng Ji, Haiqing Chen, and Xiaoyan Zhu. 2019a. Memory-augmented Dialogue Management for Task-oriented Dialogue Systems. ACM Transactions on Information Systems 1 (2019).
  • Zhang et al. (2019b) Zheng Zhang, Lizi Liao, Minlie Huang, Xiaoyan Zhu, and Tat-Seng Chua. 2019b. Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems. (2019).
  • Zhao and Eskénazi (2016) Tiancheng Zhao and Maxine Eskénazi. 2016. Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning. In Proceedings of SIGDIAL 2016, 13-15 September 2016, Los Angeles, CA, USA. 1–10.
  • Zhao et al. (2018) Tiancheng Zhao, Kyusong Lee, and Maxine Eskénazi. 2018. Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 1098–1107.
  • Zhao et al. (2017) Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In Proceedings of ACL 2017, Vancouver, Canada, July 30 - August 4. 654–664.
  • Zheng et al. (2019) Yinhe Zheng, Guanyi Chen, Minlie Huang, Song Liu, and Xuan Zhu. 2019. Personalized Dialogue Generation with Diversified Traits. CoRR abs/1901.09672 (2019).
  • Zhou et al. (2017) Ganbin Zhou, Ping Luo, Rongyu Cao, Fen Lin, Bo Chen, and Qing He. 2017. Mechanism-Aware Neural Machine for Dialogue Response Generation. In Proceedings of AAAI 2017, February 4-9, 2017, San Francisco, California, USA. 3400–3407.
  • Zhou et al. (2018c) Ganbin Zhou, Ping Luo, Yijun Xiao, Fen Lin, Bo Chen, and Qing He. 2018c. Elastic Responding Machine for Dialog Generation with Dynamically Mechanism Selecting. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 5730–5737.
  • Zhou et al. (2018b) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018b. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. In Proceedings of AAAI 2018, New Orleans, Louisiana, USA, February 2-7, 2018. 730–739.
  • Zhou et al. (2018e) Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018e. Commonsense Knowledge Aware Conversation Generation with Graph Attention. In Proceedings of IJCAI 2018, July 13-19, 2018, Stockholm, Sweden. 4623–4629.
  • Zhou et al. (2018d) Kangyan Zhou, Shrimai Prabhumoye, and Alan W. Black. 2018d. A Dataset for Document Grounded Conversations. In Proceedings of EMNLP 2018, Brussels, Belgium, October 31 - November 4, 2018. 708–713.
  • Zhou et al. (2018a) Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018a. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. CoRR abs/1812.08989 (2018).
  • Zhou et al. (2016) Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan Liu, and Rui Yan. 2016. Multi-view Response Selection for Human-Computer Conversation. In Proceedings of EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. 372–381.
  • Zhou and Wang (2018) Xianda Zhou and William Yang Wang. 2018. MojiTalk: Generating Emotional Responses at Scale. In Proceedings of ACL 2018, Melbourne, Australia, July 15-20, 2018. 1128–1137.
  • Zhu et al. (2017) Wenya Zhu, Kaixiang Mo, Yu Zhang, Zhangbin Zhu, Xuezheng Peng, and Qiang Yang. 2017. Flexible End-to-End Dialogue System for Knowledge Grounded Conversation. CoRR abs/1709.04264 (2017).