
Neural Approaches to Conversational AI

by   Jianfeng Gao, et al.

The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.





1.1 Who Should Read this Paper?

This paper is based on tutorials given at the SIGIR and ACL conferences in 2018 (Gao et al., 2018a, b), with the IR and NLP communities as the primary target audience. However, audiences with other backgrounds (such as machine learning) will also find it an accessible introduction to conversational AI with numerous pointers, especially to recently developed neural approaches.

We hope that this paper will prove a valuable resource for students, researchers, and software developers. It provides a unified view, as well as a detailed presentation of the important ideas and insights needed to understand and create modern dialogue agents that will be instrumental to making world knowledge and services accessible to millions of users in ways that seem natural and intuitive.

This survey is structured as follows:

  • The rest of this chapter introduces dialogue tasks and presents a unified view in which open-domain dialogue is formulated as an optimal decision making process.

  • Chapter 2 introduces basic mathematical tools and machine learning concepts, and reviews recent progress in the deep learning and reinforcement learning techniques that are fundamental to developing neural dialogue agents.

  • Chapter 3 describes question answering (QA) agents, focusing on neural models for knowledge-base QA and machine reading comprehension (MRC).

  • Chapter 4 describes task-oriented dialogue agents, focusing on applying deep reinforcement learning to dialogue management.

  • Chapter 5 describes social chatbots, focusing on fully data-driven neural approaches to end-to-end generation of conversational responses.

  • Chapter 6 gives a brief review of several conversational systems in industry.

  • Chapter 7 concludes the paper with a discussion of research trends.

1.2 Dialogue: What Kinds of Problems?

usr: Good morning!
agt: What can I do for you? You sound depressed.
usr: Thanks. I’d like to know where sales are lagging behind our forecast?
agt: The worst region is [country], where sales are 15% below projections.
usr: Do you know why?
agt: The forecast for [product] growth was overly optimistic.
usr: How can we turn this around?
agt: Here are the 10 customers in [country] with the most growth potential, per our CRM model.
usr: Can you set up a meeting with the CTO of [company]?
agt: Yes, I’ve set up a meeting with [person name] for next month when you are in [location].
usr: Thanks.
Table 1.1: A human-agent dialogue during the process of making a business decision. (usr: user, agt: agent)

Table 1.1 shows a human-agent dialogue during the process of making a business decision. The example illustrates the kinds of problems a dialogue system is expected to solve:

  • question answering: the agent needs to provide concise, direct answers to user queries based on rich knowledge drawn from various data sources including text collections such as Web documents and pre-compiled knowledge bases such as sales and marketing datasets.

  • task completion: the agent needs to accomplish user tasks ranging from restaurant reservation to meeting scheduling, and to business trip planning.

  • social chat: the agent needs to converse seamlessly and appropriately with users – like a human as measured by the Turing test – and provide useful recommendations.

One may envision that the above dialogue can be collectively accomplished by a set of agents, also known as bots, each of which is designed for solving a particular type of task, e.g., QA bots, task-completion bots, social chatbots. These bots can be grouped into two categories, task-oriented and chitchat, depending on whether the dialogue is conducted to assist users to achieve specific goals, e.g., obtain an answer to a query or have a meeting scheduled.

Most of the popular personal assistants in today’s market, such as Amazon Alexa, Apple Siri, Google Home, and Microsoft Cortana, are task-oriented bots. These can only handle relatively simple tasks, such as reporting weather and requesting songs. An example of a chitchat dialogue bot is Microsoft XiaoIce. Building a dialogue agent to fulfill complex tasks as in Table 1.1 remains one of the most fundamental challenges for the IR and NLP communities, and AI in general.

Figure 1.1: Two architectures of dialogue systems for (Top) traditional task-oriented dialogue and (Bottom) fully data-driven dialogue.

A typical task-oriented dialogue agent is composed of four modules, as illustrated in Fig. 1.1 (Top): (1) a Natural Language Understanding (NLU) module for identifying user intents and extracting associated information; (2) a state tracker for tracking the dialogue state that captures all essential information in the conversation so far; (3) a dialogue policy that selects the next action based on the current state; and (4) a Natural Language Generation (NLG) module for converting agent actions to natural language responses. In recent years there has been a trend towards developing fully data-driven systems by unifying these modules using a deep neural network that maps the user input to the agent output directly, as shown in Fig. 1.1 (Bottom).
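The four-module pipeline can be sketched in a few lines of Python. This is only an illustration of how the modules hand information to one another: the keyword-based NLU, the rule-based policy, the response templates and all names (`DialogueState`, `nlu`, `track_state`, `policy`, `nlg`, `respond`) are hypothetical stand-ins for components that would normally be learned from data.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Accumulates what the agent knows about the conversation so far."""
    intent: str = ""
    slots: dict = field(default_factory=dict)


def nlu(utterance: str) -> dict:
    """(1) NLU: toy keyword matcher standing in for a learned intent/slot model."""
    text = utterance.lower()
    frame = {"intent": "unknown", "slots": {}}
    if "meeting" in text:
        frame["intent"] = "schedule_meeting"
    for day in ("monday", "tuesday", "wednesday", "thursday", "friday"):
        if day in text:
            frame["slots"]["day"] = day
    return frame


def track_state(state: DialogueState, frame: dict) -> DialogueState:
    """(2) State tracker: merge the newest NLU frame into the running state."""
    if frame["intent"] != "unknown":
        state.intent = frame["intent"]
    state.slots.update(frame["slots"])
    return state


def policy(state: DialogueState) -> str:
    """(3) Dialogue policy: hand-written rules mapping the state to the next action."""
    if state.intent != "schedule_meeting":
        return "ask_intent"
    if "day" not in state.slots:
        return "request_day"
    return "confirm_meeting"


def nlg(action: str, state: DialogueState) -> str:
    """(4) NLG: template-based realization of the chosen action."""
    templates = {
        "ask_intent": "What can I do for you?",
        "request_day": "Which day works for the meeting?",
        "confirm_meeting": "I have set up the meeting for {day}.",
    }
    return templates[action].format(**state.slots)


def respond(state: DialogueState, utterance: str) -> str:
    """Run one dialogue turn through the four modules in order."""
    state = track_state(state, nlu(utterance))
    return nlg(policy(state), state)
```

Calling `respond` twice on the same `DialogueState`, first with "Can you set up a meeting?" and then with "Friday please", shows why the state tracker matters: the second utterance carries no intent of its own, yet the agent can confirm the meeting because the intent was retained from the first turn. A fully data-driven system would replace all four functions with a single trained network.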

Most task-oriented bots are implemented using a modular system, where the bot often has access to an external database on which to inquire about information to accomplish the task (Young et al., 2013; Tur and De Mori, 2011). Social chatbots, on the other hand, are often implemented using a (non-modular) unitary system. Since the primary goal of social chatbots is to be AI companions to humans with an emotional connection rather than completing specific tasks, they are often developed to mimic human conversations by training DNN-based response generation models on large amounts of human-human conversational data (Ritter et al., 2011; Sordoni et al., 2015b; Vinyals and Le, 2015; Shang et al., 2015). Only recently have researchers begun to explore how to ground the chitchat in world knowledge (Ghazvininejad et al., 2018) and images (Mostafazadeh et al., 2017) so as to make the conversation more contentful and interesting.

1.3 A Unified View: Dialogue as Optimal Decision Making

The example dialogue in Table 1.1 can be formulated as a sequential decision making process. It has a natural hierarchy: a top-level process selects what agent to activate for a particular subtask (e.g., answering a question, scheduling a meeting, providing a recommendation or just having a casual chat), and a low-level process, controlled by the selected agent, chooses primitive actions to complete the subtask.

Such hierarchical decision making processes can be formulated in the mathematical framework of options over Markov Decision Processes (MDPs) (Sutton et al., 1999b), where options generalize primitive actions to higher-level actions. This extends the traditional MDP setting, in which an agent can only choose a primitive action at each time step: with options, the agent can choose a “multi-step” action, which could be, for example, a sequence of primitive actions for completing a subtask.
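As a rough illustration of the two levels, the sketch below treats each option as a fixed sequence of primitive actions, which is the simplest case mentioned above. The option names, the primitive-action names and the functions `top_level_policy` and `run_dialogue` are hypothetical; in a learned system both levels would be policies rather than lookup tables.

```python
# Each option is a fixed sequence of primitive actions for one subtask
# (hypothetical names, chosen to mirror the subtasks in Table 1.1).
OPTIONS = {
    "answer_question": ["retrieve_facts", "generate_answer"],
    "schedule_meeting": ["ask_time", "check_calendar", "book_slot"],
    "chitchat": ["generate_reply"],
}


def top_level_policy(subtask: str) -> str:
    """Top-level process: choose which option (agent) to activate."""
    return subtask if subtask in OPTIONS else "chitchat"


def run_dialogue(subtasks):
    """Return the trace of (option, primitive action) pairs for a list of subtasks."""
    trace = []
    for subtask in subtasks:
        option = top_level_policy(subtask)
        # Low-level process: the selected option controls several consecutive steps.
        for primitive_action in OPTIONS[option]:
            trace.append((option, primitive_action))
    return trace
```

For three subtasks the top level makes three decisions, while the low level executes six primitive actions; this gap between the two time scales is what the options framework is designed to capture.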

If we view each option as an action, both top- and low-level processes can be naturally captured by the reinforcement learning framework. The dialogue agent navigates in an MDP, interacting with its environment over a sequence of discrete steps. At each step, the agent observes the current state, and chooses an action according to a policy. The agent then receives a reward and observes a new state, continuing the cycle until the episode terminates. The goal of dialogue learning is to find optimal policies to maximize expected rewards. Table 1.2 summarizes all dialogue agents using this unified view of RL.

The unified view of hierarchical MDPs has already been applied to guide the development of some large-scale open-domain dialogue systems. Recent examples include Sounding Board, a social chatbot that won the 2017 Amazon Alexa Prize, and Microsoft XiaoIce, arguably the most popular commercial social chatbot that has attracted more than 660 million users worldwide since its release in 2014. Both systems use a hierarchical dialogue manager: a master (top-level) that manages the overall conversation process, and a collection of skills (low-level) that handle different types of conversation segments (subtasks). These social chatbots are designed to maximize user engagement in the long run, measured by the expected reward function of Conversation-turns Per Session (CPS).

Although RL provides a unified ML framework for building dialogue agents, applying RL requires training a dialogue agent by interacting with real users, which can be very expensive in many domains. Hence, in practice, we often use a hybrid approach that combines the strengths of different ML methods. For example, we might use imitation and/or supervised learning methods (if a large human-human conversational corpus is available) to obtain a reasonably good agent before applying RL to continue improving it. In the remainder of the paper, we will survey these ML approaches and their use for training dialogue systems.

dialogue      | state                                   | action               | reward
QA            | understanding of user query intent      | … or answers         | relevance of answer, # of dialogue turns
task-oriented | understanding of user goal              | dialogue-act and …   | task success rate, # of dialogue turns
chitchat      | conversation history and user intent    | responses            | user engagement
top-level bot | understanding of user top-level intent  | options              | daily/monthly usage
Table 1.2: Reinforcement Learning for Dialogue.

1.4 The Transition of NLP to Neural Approaches

Figure 1.2: Traditional NLP Component Stack. Figure credit: Bird et al. (2009).

Neural approaches are now transforming the field of NLP and IR, where symbolic approaches have been dominating for decades.

NLP applications differ from other data processing systems in their use of language knowledge of various levels, including phonology, morphology, syntax, semantics and discourse (Jurafsky and Martin, 2009). Historically, much of the NLP field has organized itself around the architecture of Fig. 1.2, with researchers aligning their work with one or another component task, such as morphological analysis or parsing. These component tasks can be viewed as resolving (or realizing) natural language ambiguity (or diversity) at different levels by mapping (or generating) a natural language sentence to (or from) a series of human-defined, unambiguous, symbolic representations, such as Part-Of-Speech (POS) tags, context free grammar, and first-order predicate calculus. With the rise of data-driven and statistical approaches, these components have remained and have been adapted as a rich source of engineered features to be fed into a variety of machine learning models (Manning et al., 2014).

Neural approaches do not rely on any human-defined symbolic representations but learn a task-specific neural space where task-specific knowledge is implicitly represented as semantic concepts using low-dimensional continuous vectors. As illustrated in Fig. 1.3, neural methods often perform NLP tasks (e.g., machine reading comprehension and dialogue) in three steps: (1) encoding symbolic user input and knowledge into their neural semantic representations, where semantically related or similar concepts are represented as vectors that are close to each other; (2) reasoning in the neural space to generate a system response based on input and system state; and (3) decoding the system response into a natural language output in a symbolic space. Encoding, reasoning and decoding are implemented using neural networks (of different architectures), which can be stacked into a deep neural network trained in an end-to-end fashion via back propagation and stochastic gradient descent.

Figure 1.3: Symbolic and Neural Computation.

End-to-end training results in tighter coupling between the end application and the neural network architecture, lessening the need for traditional NLP component boundaries like morphological analysis and parsing. This drastically flattens the technology stack of Fig. 1.2, and substantially reduces the need for feature engineering. Instead, the focus has shifted to carefully tailoring the increasingly complex architecture of neural networks to the end application.

Although neural approaches have already been widely adopted in many AI tasks, including image processing, speech recognition and machine translation (see the review by Goodfellow et al. (2016)), their impact on conversational AI has come somewhat more slowly. Only recently have we begun to observe neural approaches establish state-of-the-art results on an array of conversation benchmarks for both component tasks and end applications and, in the process, sweep aside the traditional component-based boundaries that have defined research areas for decades. This symbolic-to-neural shift is also reshaping the conversational AI landscape by opening up new tasks and user experiences that were not possible with older techniques. One reason for this is that neural approaches provide a consistent representation for many modalities, capturing linguistic and non-linguistic (e.g., image and video (Mostafazadeh et al., 2017)) features in the same modeling framework.

2.1 Machine Learning Basics

Mitchell (1997) defines machine learning broadly to include any computer program that improves its performance at some task $T$, measured by $P$, through experience $E$.

Dialogue, as summarized in Table 1.2, is a well-defined learning problem with $T$, $P$, and $E$ specified as follows:

  • $T$: perform conversations with a user to fulfill the user’s goal.

  • $P$: cumulative reward defined in Table 1.2.

  • $E$: a set of dialogues, each of which is a sequence of user-agent interactions.

As a simple example, a single-turn QA dialogue agent might improve its performance as measured by accuracy or relevance of its generated answers at the QA task, through experience of human-labeled question-answer pairs.

A common recipe of building an ML agent using supervised learning (SL) consists of a dataset, a model, a cost function (a.k.a. loss function) and an optimization procedure.

  • The dataset consists of $(x, y^*)$ pairs, where for each input $x$, there is a ground-truth output $y^*$. In QA, $x$ consists of an input question and the documents from which an answer is generated, and $y^*$ is the desired answer provided by a knowledgeable external supervisor.

  • The model is typically of the form $y = f(x; \theta)$, where $f$ is a function (e.g., a neural network) parameterized by $\theta$ that maps input $x$ to output $y$.

  • The cost function is of the form $L(y^*, f(x; \theta))$. $L(\cdot)$ is often designed as a smooth function of error, and is differentiable w.r.t. $\theta$. A commonly used cost function that meets these criteria is the mean squared error, or MSE, defined as

    $$\frac{1}{M} \sum_{i=1}^{M} \left( y_i^* - f(x_i; \theta) \right)^2 .$$

  • The optimization can be viewed as a search algorithm to identify the best $\theta$ that minimizes $L(\cdot)$. Given that $L$ is differentiable, the most widely used optimization procedure for deep learning is mini-batch Stochastic Gradient Descent (SGD), which updates $\theta$ after each batch as

    $$\theta \leftarrow \theta - \frac{\alpha}{N} \sum_{i=1}^{N} \nabla_\theta L\left( y_i^*, f(x_i; \theta) \right), \tag{2.1}$$

    where $N$ is the batch size and $\alpha$ the learning rate.
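The four ingredients of this recipe can be made concrete with a small pure-Python sketch. It assumes a toy one-dimensional linear model and noise-free synthetic data; the function names (`f`, `mse`, `sgd`), the learning rate, the batch size and the number of epochs are illustrative choices, not values taken from any system discussed in this paper.

```python
import random


def f(x, theta):
    """Model y = f(x; theta): a line with slope theta[0] and intercept theta[1]."""
    return theta[0] * x + theta[1]


def mse(data, theta):
    """Cost function: mean squared error over a list of (x, y_star) pairs."""
    return sum((y - f(x, theta)) ** 2 for x, y in data) / len(data)


def sgd(data, theta, alpha=0.05, batch_size=4, epochs=200, seed=0):
    """Mini-batch SGD: theta <- theta - (alpha / N) * sum of per-example gradients."""
    rng = random.Random(seed)
    data = list(data)
    theta = list(theta)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = [0.0, 0.0]
            for x, y in batch:
                err = f(x, theta) - y      # gradient of (y - f)^2 w.r.t. f is 2 * (f - y)
                grad[0] += 2 * err * x     # chain rule: df/d(slope) = x
                grad[1] += 2 * err         # chain rule: df/d(intercept) = 1
            n = len(batch)
            theta = [t - alpha * g / n for t, g in zip(theta, grad)]
    return theta


# Dataset: noise-free pairs drawn from y = 2x + 1, so the minimizer is theta = (2, 1).
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
theta = sgd(data, theta=[0.0, 0.0])
```

After training, `theta` should be close to `(2, 1)` and `mse(data, theta)` close to zero. With a neural network in place of `f`, the only part that changes is how the per-example gradient is computed, which is what back propagation automates.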

While SL learns from a fixed dataset, in interactive problems such as dialogue it is often impractical to obtain examples of desired behaviors that are both correct and representative of all the states in which the agent has to act. (As shown in Table 1.2, dialogue learning is formulated as RL where the agent learns a policy $\pi$ that in each dialogue turn chooses an appropriate action $a$ from the set $\mathcal{A}$, based on dialogue state $s$, so as to achieve the greatest cumulative reward.) In unexplored territory, the agent has to learn how to act by interacting with an environment on its own. This is known as reinforcement learning (RL), where there is a feedback loop between the agent and its experiences. In other words, while SL learns from previous experiences provided by a knowledgeable external supervisor, RL learns by experiencing on its own. RL differs from SL in several important respects (Sutton and Barto, 2018; Mitchell, 1997):

  • Exploration-exploitation tradeoff. In RL, the agent needs to collect reward signals from the environment. This raises the question of which experimentation strategy produces the most effective learning. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore unknown states and actions in order to make better action selections in the future.

  • Delayed reward and temporal credit assignment. In RL, training information is not available in the form of $(x, y^*)$ as in SL. Instead, the environment provides only delayed rewards as the agent executes a sequence of actions. For example, we do not know whether a dialogue succeeds in completing a task until the end of the session. The agent, therefore, has to determine which of the actions in its sequence are to be credited with producing the eventual reward, a problem known as temporal credit assignment.

  • Partially observed states. In many RL problems, the observation perceived from the environment at each step, e.g., user input in each dialogue turn, provides only partial information about the entire state of the environment based on which the agent selects the next action. Neural approaches learn a deep neural network to represent the state by encoding all information observed at the current and past steps, e.g., all the previous dialogue turns and the retrieval results from external databases.

A central challenge in both SL and RL is generalization — the agent’s ability to perform well on unseen inputs. Many learning theories and algorithms have been proposed to address the challenge with some success by, e.g., seeking a good tradeoff between the amount of available training experiences and the model capacity to avoid underfitting and overfitting. Compared to previous techniques, neural approaches provide a potentially more effective solution by leveraging the representation learning power of deep neural networks, as we will review briefly in the next section.

2.2 Deep Learning

Deep learning (DL) involves training neural networks, which in their original form consisted of a single layer (i.e., the perceptron) (Rosenblatt, 1957). The perceptron is incapable of learning even simple functions such as the logical XOR, so subsequent work explored the use of “deep” architectures, which added hidden layers between input and output (Rosenblatt, 1962; Minsky and Papert, 1969), a form of neural network that is commonly called the multi-layer perceptron (MLP), or deep neural networks (DNNs). This section introduces some commonly used DNNs for NLP and IR. Interested readers are referred to Goodfellow et al. (2016) for a comprehensive discussion.

2.2.1 Foundations

Consider a text classification problem: labeling a text string (e.g., a document or a query) by a domain name such as “sport” and “politics”. As illustrated in Fig. 2.1 (left), a classical ML algorithm first maps a text string to a vector representation $\mathbf{x}$ using a set of hand-engineered features (e.g., word and character $n$-grams, entities, and phrases etc.), then learns a linear classifier with a softmax layer to compute the distribution of the domain labels $\mathbf{y} = f(\mathbf{x}; \mathbf{W})$, where $\mathbf{W}$ is a matrix learned from training data using SGD to minimize the misclassification error. The design effort is focused mainly on feature engineering.
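The classical pipeline can be sketched as follows. The feature set (`VOCAB`, four word unigrams), the label set and the weight matrix `W` are hypothetical; `W` is hard-coded as if it had already been learned, so the sketch shows only the forward computation of the label distribution, not training.

```python
import math

VOCAB = ["goal", "match", "vote", "senate"]   # hypothetical hand-engineered unigram features
LABELS = ["sport", "politics"]
# Hypothetical weight matrix W, one row per label, as if already learned with SGD.
W = [
    [1.5, 1.0, -1.0, -1.5],
    [-1.5, -1.0, 1.0, 1.5],
]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def featurize(text):
    """Map a text string to the feature vector x of unigram counts."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def classify(text):
    """Linear classifier with a softmax layer: a distribution over the domain labels."""
    x = featurize(text)
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    return dict(zip(LABELS, softmax(scores)))
```

All of the modeling effort here sits in `featurize`: words outside `VOCAB` are invisible to the classifier, which is the limitation that learned representations are meant to remove.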

Figure 2.1: Flowcharts of classic machine learning (Left) and deep learning (Right). A convolutional neural network is used as an example for deep learning.

Instead of using hand-designed features for $\mathbf{x}$, DL approaches jointly optimize the feature representation and classification using a DNN, as exemplified in Fig. 2.1 (right). We see that the DNN consists of two halves. The top half can be viewed as a linear classifier, similar to that in the classical ML model in Fig. 2.1 (left), except that its input vector is not based on hand-engineered features but is learned using the bottom half of the DNN, which can be viewed as a feature generator optimized together with the classifier in an end-to-end fashion. Unlike classical ML, the effort of designing a DL classifier is mainly on optimizing DNN architectures for effective representation learning.

For NLP tasks, depending on the type of linguistic structures that we hope to capture in the text, we may apply different types of neural network (NN) layer structures, such as convolutional layers for local word dependencies and recurrent layers for global word sequences. These layers can be combined and stacked to form a deep architecture to capture different semantic and context information at different abstract levels. Several widely used NN layers are described below:

Word Embedding Layers.

In a symbolic space each word is represented as a one-hot vector whose dimensionality $N$ is the size of a pre-defined vocabulary. The vocabulary is often large. We apply a (pre-trained) word embedding model, which is parameterized by a linear projection matrix $\mathbf{W}_e \in \mathbb{R}^{N \times M}$, to map each one-hot vector to an $M$-dimensional real-valued vector ($M \ll N$) in a neural space where the embedding vectors of the words that are more semantically similar are closer to each other.

Fully Connected Layers.

They perform linear projections as $\mathbf{W}^\top \mathbf{x}$. (We often omit the bias terms for simplifying notations in this paper.) We can stack multiple fully connected layers to form a deep feed-forward NN (FFNN) by introducing a nonlinear activation function $g$ after each linear projection. If we view a text string as a Bag-Of-Words (BOW) and let $\mathbf{x}$ be the sum of the embedding vectors of all words in the text, a deep FFNN can extract highly nonlinear features to represent hidden semantic topics of the text at different layers, e.g., $\mathbf{h}^{(1)} = g(\mathbf{W}^{(1)\top} \mathbf{x})$ at the first layer, and $\mathbf{h}^{(2)} = g(\mathbf{W}^{(2)\top} \mathbf{h}^{(1)})$ at the second layer, and so on, where the $\mathbf{W}$’s are trainable matrices.
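The embedding lookup and the stacked fully connected layers can be sketched together. The two-dimensional `EMBEDDINGS` table is hypothetical, the activation is assumed to be tanh, and each weight matrix is stored with one row per output unit (i.e., already transposed), so that a layer reduces to one dot product per row.

```python
import math

# Hypothetical pre-trained embedding table: word -> low-dimensional vector.
EMBEDDINGS = {"good": [1.0, 0.0], "morning": [0.0, 1.0], "sales": [1.0, 1.0]}


def bow_vector(text):
    """x: sum of the embedding vectors of all in-vocabulary words (word order is ignored)."""
    x = [0.0, 0.0]
    for word in text.lower().split():
        vec = EMBEDDINGS.get(word)
        if vec is not None:
            x = [a + b for a, b in zip(x, vec)]
    return x


def layer(W, v):
    """One fully connected layer followed by tanh; bias terms are omitted."""
    return [math.tanh(sum(w * a for w, a in zip(row, v))) for row in W]


def ffnn(x, weights):
    """Deep FFNN: apply the layers in order, feeding each output into the next layer."""
    h = x
    for W in weights:
        h = layer(W, h)
    return h
```

Because `bow_vector` sums the word vectors, "good morning" and "morning good" map to the same input, which is the BOW assumption that the convolutional and recurrent layers relax.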

Convolutional-Max-Pooling Layers.

An example is shown in Fig. 2.1 (right). A convolutional layer forms a local feature vector, denoted $\mathbf{u}_j$, of word $w_j$ in two steps. It first generates a contextual vector $\mathbf{c}_j$ by concatenating the word embedding vectors of $w_j$ and its surrounding words defined by a fixed-length window. It then performs a projection to obtain $\mathbf{u}_j = g(\mathbf{W}_c^\top \mathbf{c}_j)$, where $\mathbf{W}_c$ is a trainable matrix and $g$ is an activation function. The max-pooling operation applies the max operation over each “time” $j$ of the sequence of the vectors computed by the convolutional layer to obtain a global feature vector $\mathbf{h}$, where each element is computed as $h_i = \max_{j} u_{j,i}$.
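A minimal sketch of the two steps, followed by max-pooling, is given below. It assumes zero padding at the sentence boundaries, a tanh activation, and a weight matrix `W_c` stored with one row per output feature (i.e., already transposed); these are illustrative choices rather than details fixed by the description above.

```python
import math


def conv_max_pool(embeddings, W_c, window=3):
    """Convolution over word windows followed by max-pooling over positions.

    embeddings: list of word embedding vectors, one per word.
    W_c: list of rows, one per output feature; each row has window * dim weights.
    """
    dim = len(embeddings[0])
    pad = [[0.0] * dim] * (window // 2)        # zero vectors for the boundaries
    padded = pad + embeddings + pad
    local = []
    for j in range(len(embeddings)):
        # Step 1: contextual vector c_j, the concatenation of the window centred on word j.
        c_j = [v for vec in padded[j:j + window] for v in vec]
        # Step 2: local feature vector u_j = tanh(W_c c_j).
        u_j = [math.tanh(sum(w * c for w, c in zip(row, c_j))) for row in W_c]
        local.append(u_j)
    # Max-pooling: element i of the global vector is the maximum of u_{j,i} over positions j.
    return [max(u[i] for u in local) for i in range(len(W_c))]
```

The output has a fixed length (the number of rows of `W_c`) regardless of how many words the input contains, which is what allows a classifier to be stacked on top of it.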

Recurrent Layers.

An example of recurrent neural networks (RNNs) is shown in Fig. 2.2. RNNs are commonly used for sentence embedding where we view a text string as a sequence of words rather than a BOW. They map the text string to a dense and low-dimensional semantic vector by sequentially and recurrently processing each word, and mapping the subsequence up to the current word into a low-dimensional vector as $\mathbf{h}_t = g(\mathbf{W}_{ih}^\top \mathbf{x}_t + \mathbf{W}_r^\top \mathbf{h}_{t-1})$, where $\mathbf{x}_t$ is the word embedding of the $t$-th word in the text, $\mathbf{W}_{ih}$ and $\mathbf{W}_r$ are trainable matrices, and $\mathbf{h}_t$ is the semantic representation of the word sequence up to the $t$-th word.
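The recurrence can be written as a short loop. The sketch assumes a tanh activation, a zero initial hidden state, and weight matrices stored with one row per hidden unit (i.e., already transposed); the function name `rnn_encode` is illustrative.

```python
import math


def rnn_encode(word_vectors, W_ih, W_r):
    """Return the last hidden state of h_t = tanh(W_ih x_t + W_r h_{t-1}), with h_0 = 0."""
    hidden = [0.0] * len(W_r)
    for x_t in word_vectors:
        # The comprehension reads the previous hidden state; the assignment then replaces it.
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(W_ih[k], x_t))
                + sum(w * h for w, h in zip(W_r[k], hidden))
            )
            for k in range(len(W_r))
        ]
    return hidden
```

Unlike a BOW representation, the result depends on word order: feeding the same word vectors in a different order generally yields a different final state.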

Figure 2.2: An example of recurrent neural networks.

2.2.2 A case study of DSSM

Figure 2.3: The architecture of DSSM.

DSSM stands for Deep Structured Semantic Models, or more generally, Deep Semantic Similarity Model. DSSM is a deep learning model for measuring the semantic similarity of a pair of inputs $(x, y)$. It can be applied to a wide range of tasks depending on the definition of $(x, y)$. For example, $(x, y)$ is a query-document pair for Web search ranking (Huang et al., 2013; Shen et al., 2014), a document pair in recommendation (Gao et al., 2014b), a question-answer pair in QA (Yih et al., 2015a), a sentence pair of different languages in machine translation (Gao et al., 2014a), an image-text pair in image captioning (Fang et al., 2015), and so on.

As illustrated in Fig. 2.3, a DSSM consists of a pair of DNNs, $f_1$ and $f_2$, which map inputs $x$ and $y$ into corresponding vectors in a common low-dimensional semantic space. Then the similarity of $x$ and $y$ is measured by the cosine similarity of the two vectors. $f_1$ and $f_2$ can be of different architectures depending on $x$ and $y$. For example, to compute the similarity of an image-text pair, $f_1$ can be a deep convolutional NN and $f_2$ an RNN.

Let $\theta$ be the parameters of $f_1$ and $f_2$. $\theta$ is learned to identify the most effective feature representations of $x$ and $y$, optimized directly for end tasks. In other words, we learn a hidden semantic space, parameterized by $\theta$, where the semantics of distance between vectors in the space is defined by the task or, more specifically, the training data of the task. For example, in Web document ranking, the distance measures the query-document relevance, and $\theta$ is optimized using a pair-wise rank loss. Consider a query $x$ and two candidate documents $y^+$ and $y^-$, where $y^+$ is more relevant than $y^-$ to $x$. Let $\mathrm{sim}_\theta(x, y)$ be the similarity of $x$ and $y$ in the semantic space parameterized by $\theta$ as

$$\mathrm{sim}_\theta(x, y) = \cos\left( f_1(x), f_2(y) \right).$$

We want to maximize $\Delta = \mathrm{sim}_\theta(x, y^+) - \mathrm{sim}_\theta(x, y^-)$. We do so by optimizing a smooth loss function

$$L(\Delta; \theta) = \log\left( 1 + \exp(-\gamma \Delta) \right), \tag{2.2}$$

where $\gamma$ is a scaling factor, using SGD of Eqn. 2.1.
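The pair-wise rank loss can be computed directly once the two DNNs have produced their output vectors. The sketch below takes those vectors as given (the encoders themselves are not modeled) and assumes cosine similarity and a log-logistic loss with scaling factor `gamma`; the default `gamma=10.0` is an arbitrary illustrative value.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pairwise_rank_loss(query_vec, pos_vec, neg_vec, gamma=10.0):
    """log(1 + exp(-gamma * delta)), where delta = sim(x, y+) - sim(x, y-)."""
    delta = cosine(query_vec, pos_vec) - cosine(query_vec, neg_vec)
    return math.log(1.0 + math.exp(-gamma * delta))
```

The loss is close to zero when the relevant document is much closer to the query than the irrelevant one, equals `log 2` when the two are tied, and grows roughly linearly in `gamma * |delta|` when the ranking is wrong, which gives SGD a useful gradient exactly on the misordered pairs.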

2.3 Reinforcement Learning

This section gives a brief review of reinforcement learning that is most relevant to the discussions in later chapters. For a comprehensive survey, interested readers are referred to excellent textbooks and reviews, such as Sutton and Barto (2018); Kaelbling et al. (1996); Bertsekas and Tsitsiklis (1996); Szepesvári (2010); Wiering and van Otterlo (2012); Li (2019).

2.3.1 Foundations

Figure 2.4: Interaction between a reinforcement-learning agent and an external environment.

Reinforcement learning is a learning paradigm where an intelligent agent learns to make optimal decisions by interacting with an initially unknown environment (Sutton and Barto, 2018). Compared to supervised learning, a distinctive challenge in RL is to learn without a teacher (that is, without supervisory labels). As we will see, this will lead to algorithmic considerations that are often unique to RL.

As illustrated in Fig. 2.4, the agent-environment interaction is often modeled as a discrete-time Markov decision process, or MDP (Puterman, 1994), described by a five-tuple $M = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$:

  • $\mathcal{S}$ is a possibly infinite set of states the environment can be in;

  • $\mathcal{A}$ is a possibly infinite set of actions the agent can take in a state;

  • $P(s' \mid s, a)$ gives the transition probability of the environment landing in a new state $s'$ after action $a$ is taken in state $s$;

  • $R(s, a)$ is the average reward immediately received by the agent after taking action $a$ in state $s$; and

  • $\gamma \in (0, 1]$ is a discount factor.

The interaction can be recorded as a trajectory $(s_1, a_1, r_1, \ldots)$, generated as follows: at step $t = 1, 2, \ldots$,

  • the agent observes the environment’s current state $s_t \in \mathcal{S}$, and takes an action $a_t \in \mathcal{A}$;

  • the environment transitions to a next-state $s_{t+1}$, distributed according to the transition probabilities $P(\cdot \mid s_t, a_t)$;

  • associated with the transition is an immediate reward $r_t \in \mathbb{R}$, whose average is $R(s_t, a_t)$.

Omitting the subscript, each step results in a tuple $(s, a, r, s')$ that is called a transition. The goal of an RL agent is to maximize the long-term reward by taking optimal actions (to be defined soon). Its action-selection policy, denoted by $\pi$, can be deterministic or stochastic. In either case, we use $a \sim \pi(s)$ to denote selection of action by following $\pi$ in state $s$. Given a policy $\pi$, the value of a state $s$ is the average discounted long-term reward from that state:

$$V^\pi(s) := \mathbb{E}\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_1 = s,\ a_i \sim \pi(s_i),\ \forall i \ge 1 \right].$$

We are interested in optimizing the policy so that $V^\pi$ is maximized for all states. Denote by $\pi^*$ an optimal policy, and $V^*$ its corresponding value function (also known as the optimal value function). In many cases, it is more convenient to use another form of value function called the Q-function:

$$Q^\pi(s, a) := \mathbb{E}\left[ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_1 = s,\ a_1 = a,\ a_i \sim \pi(s_i),\ \forall i > 1 \right],$$

which measures the average discounted long-term reward by first selecting $a$ in state $s$ and then following policy $\pi$ thereafter. The optimal Q-function, corresponding to an optimal policy, is denoted by $Q^*$.
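The quantity inside both expectations is the discounted return of a single trajectory, which is straightforward to compute. The sketch below assumes a finite list of rewards; the function name `discounted_return` and the example reward scheme (a per-turn penalty of 1 and a success bonus of 20) are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward sequence."""
    total = 0.0
    # Work backwards so that each step applies one more factor of gamma to later rewards.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A three-turn dialogue: two turns cost 1 each, and the final turn completes the task.
example = discounted_return([-1.0, -1.0, 20.0], gamma=0.9)
```

Averaging this quantity over many trajectories that start in state $s$ and follow $\pi$ gives a Monte Carlo estimate of the state value; additionally fixing the first action to $a$ gives an estimate of the Q-function.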

2.3.2 Basic Algorithms

We now give a brief description of two popular classes of algorithms, exemplified by Q-learning and policy gradient, respectively.

Q-learning.

The first family is based on the observation that an optimal policy can be immediately retrieved if the optimal Q-function is available. Specifically, the optimal policy can be determined by

$$\pi^*(s) = \arg\max_{a} Q^*(s, a).$$

Therefore, a large family of RL algorithms focuses on learning $Q^*(s, a)$, and are collectively called value-function-based methods.

In practice, it is expensive to represent $Q(s, a)$ by a table, one entry for each $(s, a)$, when the problem at hand is large. For instance, the number of states in the game of Go is larger than $2 \times 10^{170}$ (Tromp and Farnebäck, 2006). Hence, we often use compact forms to represent $Q$. In particular, we assume the $Q$-function has a predefined parametric form, parameterized by some vector $\theta \in \mathbb{R}^d$. An example is linear approximation:

$$Q(s, a; \theta) = \phi(s, a)^\top \theta,$$

where $\phi(s, a)$ is a $d$-dimensional hand-coded feature vector for state-action pair $(s, a)$, and $\theta$ is the corresponding coefficient vector to be learned from data. In general, $Q(s, a; \theta)$ may take different parametric forms. For example, in the case of Deep Q-Network (DQN), $Q(s, a; \theta)$ takes the form of deep neural networks, such as multi-layer perceptron (Tesauro, 1995; Mnih et al., 2015), recurrent network (Hausknecht and Stone, 2015; Li et al., 2015), etc. Furthermore, it is possible to represent the Q-function in a non-parametric way, using decision trees (Ernst et al., 2005) or Gaussian processes (Engel et al., 2005), which is outside of the scope of this introductory section.

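To learn the Q-function, we modify the parameter $\theta$ using the following update rule, after observing a state transition $(s, a, r, s')$:

$$\theta \leftarrow \theta + \alpha \underbrace{\left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)}_{\text{temporal difference}} \nabla_\theta Q(s, a; \theta). \tag{2.3}$$

The above update is known as Q-learning (Watkins, 1989), which applies a small change to $\theta$, controlled by the step-size parameter $\alpha$ and computed from the temporal difference (Sutton, 1988).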
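To see the update at work, the following sketch runs Q-learning on a hypothetical four-state chain in which only reaching the right-most state is rewarded. It uses one-hot features, so the parameter vector reduces to a table and the gradient of the Q-function is simply the indicator of the visited state-action pair. The environment, the epsilon-greedy exploration scheme and all constants are illustrative assumptions.

```python
import random

N_STATES, ACTIONS = 4, (0, 1)        # action 0 = left, 1 = right; state 3 is terminal
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2


def step(state, action):
    """Toy deterministic chain: reward 1 only when the terminal state is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def choose_action(theta, state, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(theta[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if theta[(state, a)] == best])


def q_learning(episodes=300, seed=0):
    rng = random.Random(seed)
    # With one-hot features, theta has one entry per (state, action) pair.
    theta = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(theta, state, rng)
            nxt, reward, done = step(state, action)
            # No bootstrapping from a terminal state.
            bootstrap = 0.0 if done else max(theta[(nxt, a)] for a in ACTIONS)
            td = reward + GAMMA * bootstrap - theta[(state, action)]   # temporal difference
            theta[(state, action)] += ALPHA * td
            state = nxt
    return theta
```

After training, moving right should have the higher Q-value in every non-terminal state, with values near 0.81, 0.9 and 1.0 for states 0, 1 and 2, i.e., the reward discounted by the number of remaining steps.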

While popular, in practice, Q-learning can be unstable and requires many samples before reaching a good approximation of $Q^*$. Two modifications are often helpful in practice. The first is experience replay (Lin, 1992), popularized by Mnih et al. (2015). Instead of using an observed transition to update $\theta$ just once using Eqn. 2.3, one may store it in a replay buffer, and periodically sample transitions from it to perform Q-learning updates. This way, every transition can be used multiple times, thus increasing sample efficiency. Furthermore, it helps stabilize learning by preventing the data distribution from changing too quickly over time when updating parameter $\theta$.
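A replay buffer is a small data structure. The sketch below is one possible design, assuming a fixed capacity with first-in-first-out eviction and uniform sampling without replacement; the class name `ReplayBuffer` and its methods are illustrative and not taken from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s', done)."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # the oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample stored transitions without replacement."""
        size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)
```

In a training loop, each observed transition is added once and then sampled many times, so a single (possibly expensive) user interaction contributes to several parameter updates.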

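The second is a two-network implementation (Mnih et al., 2015). Here, the learner maintains an extra copy of the Q-function, called the target network, parameterized by $\theta_{\text{target}}$. During learning, $\theta_{\text{target}}$ is fixed and is used to compute temporal difference to update $\theta$. Specifically, Eqn. 2.3 now becomes:

$$\theta \leftarrow \theta + \alpha \left( r + \gamma \max_{a'} Q(s', a'; \theta_{\text{target}}) - Q(s, a; \theta) \right) \nabla_\theta Q(s, a; \theta). \tag{2.4}$$

Periodically, $\theta_{\text{target}}$ is updated to be $\theta$, and the process continues. This is in fact an instance of the more general fitted value iteration algorithm (Munos and Szepesvári, 2008).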

There have been a number of recent improvements to the basic Q-learning described above, such as dueling Q-network (Wang et al., 2016), double Q-learning (van Hasselt et al., 2016), and more recently the SBEED algorithm that is data-efficient and provably convergent (Dai et al., 2018b).

Policy Gradient.

The other family of algorithms tries to optimize the policy directly, without having to learn the Q-function. Here, the policy itself is directly parameterized by $\theta\in\mathbb{R}^d$, and $\pi(s;\theta)$ is often a distribution over actions. Given any $\theta$, the policy is naturally evaluated by the average long-term reward it gets in a trajectory of length $H$, $\tau = (s_1,a_1,r_1,\ldots,s_H,a_H,r_H)$:

$$J(\theta) := \mathbb{E}\Big[\sum_{t=1}^{H}\gamma^{t-1}r_t \,\Big|\, a_t\sim\pi(s_t;\theta)\Big].$$

(Footnote 3: We describe policy gradient in the simpler bounded-length trajectory case, although it can be extended to problems where the trajectory length is unbounded (Baxter and Bartlett, 2001; Baxter et al., 2001).)

If it is possible to estimate the gradient $\nabla_\theta J$ from sampled trajectories, one can do stochastic gradient ascent to maximize $J$:

$$\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta),$$

where $\alpha$ is again a stepsize parameter.

(Footnote 4: Stochastic gradient ascent is simply stochastic gradient descent on the negated objective function.)

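One such algorithm, known as REINFORCE (Williams, 1992), estimates the gradient as follows. Let $\tau$ be a length-$H$ trajectory generated by $\pi(\cdot;\theta)$; that is, $a_t\sim\pi(s_t;\theta)$ for every $t$. Then, a stochastic gradient based on this single trajectory is given by

$$\nabla_\theta J(\theta) = \sum_{t=1}^{H}\gamma^{t-1}\Big(\nabla_\theta\log\pi(a_t|s_t;\theta)\sum_{h=t}^{H}\gamma^{h-t}r_h\Big). \qquad (2.6)$$
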
REINFORCE may suffer high variance in practice, as its gradient estimate depends directly on the sum of rewards along the entire trajectory. Its variance may be reduced by the use of an estimated value function of the current policy, often referred to as the critic in actor-critic algorithms (Sutton et al., 1999a; Konda and Tsitsiklis, 1999). The policy gradient is now computed by

$$\nabla_\theta J(\theta) = \sum_{t=1}^{H}\gamma^{t-1}\Big(\nabla_\theta\log\pi(a_t|s_t;\theta)\,\hat{Q}(s_t,a_t)\Big),$$

where $\hat{Q}(s_t,a_t)$ is an estimated value function for the current policy that is used to approximate $\sum_{h=t}^{H}\gamma^{h-t}r_h$ in Eqn. 2.6. The estimated value function may be learned by standard temporal difference methods (similar to Q-learning already described), but there are many variants of how to learn $\hat{Q}$ from data. Moreover, there has been much work on how to compute a gradient $\nabla_\theta J$ that is more effective than the steepest descent in Eqn. 2.7. Interested readers can refer to a few related works and the references therein for further details (Kakade, 2001; Peters et al., 2005; Schulman et al., 2015a, b; Mnih et al., 2016; Gu et al., 2017; Dai et al., 2018a; Liu et al., 2018a).
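The quantity that REINFORCE multiplies with each log-probability gradient, and that the critic is meant to approximate, is the discounted sum of rewards from a given step to the end of the trajectory. A minimal sketch of how it can be computed in one backward pass is shown here; the function name and the toy reward sequence are illustrative assumptions.

```python
def rewards_to_go(rewards, gamma):
    """Return G_t = sum over h >= t of gamma**(h - t) * r_h, for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns = rewards_to_go([0.0, 0.0, 1.0], gamma=0.5)
print(returns)  # [0.25, 0.5, 1.0]
```

Because every entry depends on all later rewards, a single noisy reward affects many terms of the gradient estimate, which is one way to see where the high variance comes from.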

2.3.3 Exploration

So far we have described basic algorithms for updating either the value function or the policy, when transitions are given as input. Typically, a reinforcement-learning agent also has to determine how to select actions to collect desired transitions for learning. Always selecting the action that seems best (“exploitation”) is problematic. Selecting a novel action (one that is underrepresented, or even absent, in the data collected so far) is known as “exploration”; without it, the agent risks never seeing outcomes that are potentially better. Balancing exploration and exploitation efficiently is one of the unique challenges in reinforcement learning.

A basic exploration strategy is known as $\epsilon$-greedy. The idea is to choose the action that looks best with high probability (for exploitation), and a random action with small probability (for exploration). In the case of DQN, suppose $\theta$ is the current parameter of the Q-function, then the action-selection rule for state $s_t$ is given as follows:

$$a_t = \begin{cases}\arg\max_{a}Q(s_t,a;\theta) & \text{with probability } 1-\epsilon,\\ \text{a random action} & \text{with probability } \epsilon.\end{cases}$$

In many problems this simple approach is effective (although not necessarily optimal). A more in-depth discussion on exploration is found in Sec. 4.5.2.
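The selection rule can be sketched as follows, assuming the Q-values of all actions in the current state have already been computed and are passed in as a list. The function name and the example values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index given the Q-values of the current state."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)


greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=1.0,
                               rng=random.Random(0))
```

With `epsilon=0.0` the rule is purely greedy and with `epsilon=1.0` it is purely random; in practice a small value is used and is often decayed over time.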

3.1 Knowledge Base

Organizing the world’s facts and storing them in a structured database, large scale Knowledge Bases (KB) like DBPedia (Auer et al., 2007), Freebase (Bollacker et al., 2008) and Yago (Suchanek et al., 2007) have become important resources for supporting open-domain QA.

A typical KB consists of a collection of subject-predicate-object triples $(s,r,t)$ where $s,t\in\mathcal{E}$ are entities and $r\in\mathcal{R}$ is a predicate or relation. A KB in this form is often called a Knowledge Graph (KG) due to its graphical representation, i.e., the entities are nodes and the relations the directed edges that link the nodes.
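As a rough illustration of the triple view and the graph view of the same data, consider the sketch below. The two facts are the ones this chapter uses in its reasoning example in Sec. 3.4; the nested-dictionary layout and the helper name `objects_of` are assumptions made for illustration.

```python
from collections import defaultdict

# Triple view: a flat list of (subject, predicate, object) facts.
triples = [
    ("Obama", "born-in", "Hawaii"),
    ("Hawaii", "part-of", "USA"),
]

# Graph view: subject node -> predicate (edge label) -> set of object nodes.
graph = defaultdict(lambda: defaultdict(set))
for subj, pred, obj in triples:
    graph[subj][pred].add(obj)


def objects_of(entity, predicate):
    """Follow one labeled edge; returns an empty set if there is none."""
    return set(graph.get(entity, {}).get(predicate, set()))


birthplaces = objects_of("Obama", "born-in")
```

Using `get` for lookups avoids silently adding empty entries to the `defaultdict` when a query mentions an unknown entity or predicate.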

Fig. 3.1 (left) shows a small subgraph of Freebase related to the TV show Family Guy. Nodes include some names, dates and special Compound Value Type (CVT) entities. A directed edge describes the relation between two entities, labeled by the predicate.

(Footnote 1: CVT is not a real-world entity, but is used to collect multiple fields of an event or a special relationship.)

Figure 3.1: An example of semantic parsing for KB-QA. (Left) A subgraph of Freebase related to the TV show Family Guy. (Right) A question, its logical form in $\lambda$-calculus and query graph, and the answer. Figures adapted from Yih et al. (2015a).

3.2 Semantic Parsing for KB-QA

Most state-of-the-art symbolic approaches to KB-QA are based on semantic parsing, where a question is mapped to its formal meaning representation (e.g., logical form) and then translated to a KB query. The answers to the question can then be obtained by finding a set of paths in the KB that match the query and retrieving the end nodes of these paths (Richardson et al., 1998; Berant et al., 2013; Yao and Van Durme, 2014; Bao et al., 2014; Yih et al., 2015b).

We take the example used in Yih et al. (2015a) to illustrate the QA process. Fig. 3.1 (right) shows the logical form in $\lambda$-calculus and its equivalent graph representation, known as query graph, of the question “Who first voiced Meg on Family Guy?”. Note that the query graph is grounded in Freebase. The two entities, MegGriffin and FamilyGuy, are represented by two rounded rectangle nodes. The circle node $y$ means that there should exist an entity describing some casting relations like the character, actor and the time she started the role. $y$ is grounded in a CVT entity in this case. The shaded circle node $x$ is also called the answer node, and is used to map entities retrieved by the query. The diamond node $\arg\min$ constrains that the answer needs to be the earliest actor for this role. Running the query graph without the aggregation function against the graph as in Fig. 3.1 (Left) will match both LaceyChabert and MilaKunis. But only LaceyChabert is the correct answer as she started this role earlier (by checking the from property of the grounded CVT node).

Applying a symbolic KB-QA system to a very large KB is challenging for two reasons:

  • Paraphrasing in natural language: This leads to a wide variety of semantically equivalent ways of stating the same question, and in the KB-QA setting, this may cause mismatches between the natural language questions and the label names (e.g., predicates) of the nodes and edges used in the KB. As in the example of Fig. 3.1, we need to measure how likely the predicate used in the question matches that in Freebase, such as “Who first voiced Meg on Family Guy?” vs. cast-actor. Yih et al. (2015a) proposed to use a learned DSSM, which is conceptually an embedding-based method we will review in Sec. 3.3.

  • Search complexity: Searching all possible multi-step (compositional) relation paths that match complex queries is prohibitively expensive because the number of candidate paths grows exponentially with the path length. We will review symbolic and neural approaches to multi-step reasoning in Sec. 3.4.

3.3 Embedding-based Methods

To address the paraphrasing problem, embedding-based methods map entities and relations in a KB to continuous vectors in a neural space; see, e.g., Bordes et al. (2013); Socher et al. (2013); Yang et al. (2015); Yih et al. (2015b). This space can be viewed as a hidden semantic space where various expressions with the same semantic meaning map to the same continuous vector.

Most KB embedding models are developed for the Knowledge Base Completion (KBC) task: predicting the existence of a triple $(s,r,t)$ that is not seen in the KB. This is a simpler task than KB-QA since it only needs to predict whether a fact is true or not, and thus does not suffer from the search complexity problem.

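The bilinear model is one of the basic KB embedding models (Yang et al., 2015). It learns a vector $\mathbf{x}_e\in\mathbb{R}^d$ for each entity $e\in\mathcal{E}$ and a matrix $\mathbf{W}_r\in\mathbb{R}^{d\times d}$ for each relation $r\in\mathcal{R}$. The model scores how likely a triple $(s,r,t)$ holds using

$$\text{score}(s,r,t;\theta) = \mathbf{x}_s^\top\mathbf{W}_r\mathbf{x}_t. \qquad (3.1)$$

The model parameters $\theta$ (i.e., the embedding vectors and matrices) are trained on pair-wise training samples in a similar way to that of the DSSM described in Sec. 2.2.2. For each positive triple $(s,r,t)$ in the KB, denoted by $x^+$, we construct a set of negative triples $x^-$ by corrupting $s$, $t$, or $r$. The training objective is to minimize the pair-wise rank loss of Eqn. 2.2, or more commonly the margin-based loss defined as

$$L(\theta) = \sum_{(x^+,x^-)\in\mathcal{D}}\big[\gamma + \text{score}(x^-;\theta) - \text{score}(x^+;\theta)\big]_+,$$

where $[x]_+ := \max(0,x)$, $\gamma$ is the margin hyperparameter, and $\mathcal{D}$ the training set of triples.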
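The bilinear score and the margin-based loss for one positive/negative pair can be sketched with plain Python lists. The two-dimensional embeddings and the relation matrix below are made-up toy values, not learned parameters, and the function names are assumptions for illustration.

```python
def bilinear_score(x_s, W_r, x_t):
    """Compute x_s^T W_r x_t for list-based vectors and matrices."""
    return sum(x_s[i] * W_r[i][j] * x_t[j]
               for i in range(len(x_s)) for j in range(len(x_t)))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once the positive beats the negative by the margin."""
    return max(0.0, margin + neg_score - pos_score)


x_s = [1.0, 0.0]                 # subject embedding (toy values)
x_t = [0.0, 1.0]                 # true object embedding
x_corrupt = [1.0, 0.0]           # corrupted object for the negative triple
W_r = [[0.0, 2.0], [0.0, 0.0]]   # relation matrix

pos = bilinear_score(x_s, W_r, x_t)        # 2.0
neg = bilinear_score(x_s, W_r, x_corrupt)  # 0.0
loss = margin_loss(pos, neg)               # 0.0, the margin is satisfied
```

A full training loop would sum this loss over all sampled pairs and update the embeddings and matrices by gradient descent; that part is omitted here.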

These basic KB models have been extended to answer multi-step relation queries, also known as path queries, e.g., “Where did Tad Lincoln’s parents live?” (Toutanova et al., 2016; Guu et al., 2015; Neelakantan et al., 2015). A path query consists of an initial anchor entity $s$ (e.g., TadLincoln), followed by a sequence of relations to be traversed $(r_1,\ldots,r_k)$ (e.g., (parents, location)). We can use vector space compositions to combine the embeddings of individual relations $r_i$ into an embedding of the path $(r_1,\ldots,r_k)$. The natural composition of the bilinear model of Eqn. 3.1 is matrix multiplication. Thus, to answer how likely a path query $(q,t)$ holds, where $q = (s,r_1,\ldots,r_k)$, we would compute

$$\text{score}(q,t) = \mathbf{x}_s^\top\mathbf{W}_{r_1}\cdots\mathbf{W}_{r_k}\mathbf{x}_t.$$

These KB embedding methods are shown to have good generalization performance in terms of validating unseen facts (e.g., triples and path queries) given an existing KB. Interested readers are referred to Nguyen (2017) for a detailed survey of embedding models for KBC.
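Scoring a path query by composing relation matrices through multiplication can be sketched as follows. The one-hot entity vectors and 0/1 relation matrices are deliberately artificial assumptions (they make each relation matrix act like an adjacency matrix), chosen so the result is easy to check by hand; learned embeddings would be dense.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices of the same size."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def path_score(x_s, relation_matrices, x_t):
    """Compute x_s^T (W_r1 W_r2 ... W_rk) x_t for a path of k >= 1 relations."""
    composed = relation_matrices[0]
    for W in relation_matrices[1:]:
        composed = matmul(composed, W)
    n = len(x_s)
    return sum(x_s[i] * composed[i][j] * x_t[j]
               for i in range(n) for j in range(n))


# Toy one-hot "embeddings": index 0 = a child, 1 = a parent, 2 = a place.
child, parent, place = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
W_parents = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W_location = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

score_place = path_score(child, [W_parents, W_location], place)    # 1.0
score_parent = path_score(child, [W_parents, W_location], parent)  # 0.0
```

The path (parents, location) starting from the child reaches the place and nothing else, so only that candidate receives a non-zero score.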

3.4 Multi-Step Reasoning on KB

Figure 3.2: An example of knowledge base reasoning (KBR). We want to identify the answer node USA for a KB query (Obama, citizenship, ?). Figure adapted from Shen et al. (2018).

Knowledge Base Reasoning (KBR) is a subtask of KB-QA. As described in Sec. 3.2, KB-QA is performed in two steps: (1) semantic parsing translates a question into a KB query, then (2) KBR traverses the query-matched paths in a KB to find the answers.

To reason over a KB, for each relation $r_i$, we are interested in learning a set of first-order logical rules in the form of relational paths, $\pi = (r_1,\ldots,r_n)$. For the KBR example in Fig. 3.2, given the question “What is the citizenship of Obama?”, its translated KB query in the form of subject-predicate-object triple is (Obama, citizenship, ?). Unless the triple (Obama, citizenship, USA) is explicitly stored in the KB, a multi-step reasoning procedure is needed to induce the answer from the paths that contain relevant triples, such as (Obama, born-in, Hawaii) and (Hawaii, part-of, USA), using the learned relational paths such as (born-in, part-of).

(Footnote 2: As pointed out by Nguyen (2017), even very large KBs, such as Freebase and DBpedia, which contain billions of fact triples about the world, are still far from complete.)

Below, we describe three categories of multi-step KBR methods. They differ in whether reasoning is performed in a discrete symbolic space or a continuous neural space.

3.4.1 Symbolic Methods

ID | PRA Path | Comment
… | (athlete-plays-in-league, league-players, … | teams with many players in the athlete’s league
… | (athlete-plays-in-league, league-teams, team-against-team) | teams that play against many teams in the athlete’s league
… | … | city of the stadium with the same team
… | … | city of the stadium with the same location
… | … | stadium located in the same city with the query team
… | … | home stadium of teams which share players with the query
… | … | the league that the query team’s members belong to
… | … | the league that query team’s competing team belongs to
Table 3.1: A sample of relational paths learned by PRA. For each relation, its top-2 PRA paths are presented, adapted from Lao et al. (2011).

Path Ranking Algorithm (PRA) (Lao and Cohen, 2010; Lao et al., 2011) is one of the primary symbolic approaches to learning relational paths in large KBs. PRA uses random walks with restarts to perform multiple bounded depth-first searches to find relational paths. Table 3.1 shows a sample of relational paths learned by PRA. A relational path is a sequence $\pi = (r_1,\ldots,r_m)$. An instance of the relational path is a sequence of nodes $e_1,\ldots,e_{m+1}$ such that $(e_i,r_i,e_{i+1})$ is a valid triple.

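During KBR, given a query $q = (s,r,?)$, PRA selects the set of relational paths for $r$, denoted by $\mathcal{B}_r = \{\pi_1,\pi_2,\ldots\}$, then traverses the KB according to the query and $\mathcal{B}_r$, and scores each candidate answer $t$ using a linear model

$$\text{score}(q,t) = \sum_{\pi\in\mathcal{B}_r}\lambda_\pi P(t|s,\pi), \qquad (3.3)$$

where $\lambda_\pi$ is the learned weight, and $P(t|s,\pi)$ is the probability of reaching $t$ from $s$ by a random walk that instantiates the relational path $\pi$, also known as a path constrained random walk.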
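A path constrained random walk and the resulting linear score can be sketched on the small Obama example used in this chapter. The uniform choice among outgoing edges, the dictionary-based graph layout, the function names, and the path weight of 0.8 are all assumptions made for illustration; they are not taken from the PRA papers.

```python
def path_constrained_walk(graph, source, path):
    """Distribution over end nodes of a random walk from `source` that must
    follow the relations in `path` in order, choosing uniformly among the
    edges that carry the required relation at each step."""
    dist = {source: 1.0}
    for relation in path:
        next_dist = {}
        for node, prob in dist.items():
            neighbours = graph.get(node, {}).get(relation, [])
            # A node with no matching edge is a dead end and its mass is lost.
            for nxt in neighbours:
                share = prob / len(neighbours)
                next_dist[nxt] = next_dist.get(nxt, 0.0) + share
        dist = next_dist
    return dist


def pra_score(graph, source, target, weighted_paths):
    """Weighted sum of path-constrained walk probabilities of the target."""
    return sum(weight * path_constrained_walk(graph, source, path).get(target, 0.0)
               for path, weight in weighted_paths)


graph = {
    "Obama": {"born-in": ["Hawaii"]},
    "Hawaii": {"part-of": ["USA"]},
}
paths_for_citizenship = [(("born-in", "part-of"), 0.8)]  # hypothetical weight
score = pra_score(graph, "Obama", "USA", paths_for_citizenship)
```

Here the only instantiation of (born-in, part-of) from Obama ends at USA with probability one, so the score equals the path weight.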

Because PRA operates in a fully discrete space, it does not take into account semantic similarities among relations. As a result, PRA can easily produce millions of categorically distinct paths even for a small path length, which not only hurts generalization but makes reasoning prohibitively expensive. Lao et al. (2011) used heuristics and $\ell_1$ regularization to reduce the number of relational paths that need to be considered in KBR. To address these limitations, Gardner et al. (2014) proposed a modification to PRA that leverages the KB embedding methods, as described in Sec. 3.3, to collapse and cluster PRA paths according to their relation embeddings.

3.4.2 Neural Methods

Figure 3.3: An overview of the neural methods for KBR (Shen et al., 2017a; Yang et al., 2017a). The KB is embedded in neural space as matrix $\mathbf{M}$ that is learned to store compactly the connections between related triples (e.g., the relations that are semantically similar are stored as a cluster). The controller is designed to adaptively produce lookup sequences in $\mathbf{M}$ and decide when to stop, and the encoder and decoder are responsible for the mapping between the symbolic and neural spaces.

Implicit ReasoNet (IRN) (Shen et al., 2016, 2017a) and Neural Logic Programming (Neural LP) (Yang et al., 2017a) are two of the recently proposed methods that perform multi-step KBR in a neural space and achieve state-of-the-art results on popular benchmarks. The overall architecture of these methods is shown in Fig. 3.3, which can be viewed as an instance of the neural approaches illustrated in Fig. 1.3 (right). In what follows, we use IRN as an example to illustrate how these neural methods work. IRN consists of four modules: encoder, decoder, shared memory, and controller, as in Fig. 3.3.

Encoder and Decoder

These two modules are task-dependent. Given an input query $(s,r,?)$, the encoder maps $s$ and $r$, respectively, into their embedding vectors and then concatenates the two vectors to form the initial hidden state vector $\mathbf{s}_0$ of the controller.

(Footnote 3: The use of vectors rather than matrices for relation representations is inspired by the bilinear-diag model (Yang et al., 2015), which restricts the relation representations to the class of diagonal matrices.)

The decoder outputs a prediction vector $\mathbf{o} = \tanh(\mathbf{W}_o^\top\mathbf{s}_t + \mathbf{b}_o)$, a nonlinear projection from state $\mathbf{s}$ at time $t$, where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the weight matrix and bias vector, respectively. In KBR, we can map the answer vector $\mathbf{o}$ to its answer node (entity) $o$ in the symbolic space based on $L_1$ distance as $o = \arg\min_{e\in\mathcal{E}}\|\mathbf{o}-\mathbf{x}_e\|_1$, where $\mathbf{x}_e$ is the embedding vector of entity $e$.

Shared Memory

The shared memory $\mathbf{M}$ consists of a list of vectors $\{\mathbf{m}_i\}_{1\le i\le|\mathbf{M}|}$ that are randomly initialized and updated through back-propagation in training. $\mathbf{M}$ stores a compact version of KB optimized for the KBR task. For example, the system may fail to answer the question (Obama, citizenship, ?) even if it finds the relevant facts in $\mathbf{M}$, such as (Obama, born-in, Hawaii) and (Hawaii, part-of, USA), because it does not know that born-in and citizenship are semantically related relations. In order to correct the error, $\mathbf{M}$ needs to be updated using the gradient to encode the piece of new information by moving the two relation vectors closer to each other in the neural space.


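Controller

The controller is implemented as an RNN. Given initial state $\mathbf{s}_0$, it uses attention to iteratively lookup and fetch information from $\mathbf{M}$ to update the state $\mathbf{s}_t$ at time $t$ according to Eqn. 3.4, until it decides to terminate the reasoning process and calls the decoder to generate the output.

$$a_{t,i} = \frac{\exp\big(\lambda\cos(\mathbf{W}_1\mathbf{m}_i,\mathbf{W}_2\mathbf{s}_t)\big)}{\sum_k\exp\big(\lambda\cos(\mathbf{W}_1\mathbf{m}_k,\mathbf{W}_2\mathbf{s}_t)\big)}, \quad \mathbf{x}_t = \sum_{i=1}^{|\mathbf{M}|}a_{t,i}\mathbf{m}_i, \quad \mathbf{s}_{t+1} = g(\mathbf{W}_3\mathbf{s}_t + \mathbf{W}_4\mathbf{x}_t), \qquad (3.4)$$

where $\mathbf{W}$’s are learned projection matrices, $\lambda$ a scaling factor and $g$ a nonlinear activation function.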
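One attention-based lookup step over a memory of vectors can be sketched as below. To keep the example short, the learned projection matrices are replaced by the identity and the activation function is taken to be tanh; both are simplifying assumptions, as are the function names, the scale value, and the toy memory.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attend(memory, state, scale=10.0):
    """Softmax attention over memory rows using scaled cosine similarity."""
    logits = [scale * cosine(m, state) for m in memory]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    # Read vector: attention-weighted sum of the memory rows.
    read = [sum(w * m[k] for w, m in zip(weights, memory))
            for k in range(len(state))]
    return weights, read


def controller_step(memory, state):
    """Next state from the current state and the vector read from memory."""
    _, read = attend(memory, state)
    # Identity projections stand in for the learned matrices.
    return [math.tanh(s + x) for s, x in zip(state, read)]


memory = [[1.0, 0.0], [0.0, 1.0]]
weights, read = attend(memory, state=[1.0, 0.0])
next_state = controller_step(memory, [1.0, 0.0])
```

The state aligned with the first memory row puts almost all attention weight on that row, so the vector that is read back is close to it.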

The reasoning process of IRN can be viewed as a Markov Decision Process (MDP), as illustrated in Fig. 2.4. The number of steps in the information lookup and fetching sequence of Eqn. 3.4 is not given by training data, but is decided by the controller on the fly. More complex queries need more steps. Thus, IRN learns a stochastic policy to get a distribution over termination and prediction actions by the REINFORCE algorithm (Williams, 1992). Since all the modules of IRN are differentiable, IRN is an end-to-end differentiable neural model whose parameters, including the embedded KB matrix $\mathbf{M}$, can be jointly optimized using SGD on the training samples derived from a KB, as shown in Fig. 3.3.

As outlined in Fig. 1.3, neural methods operate in a continuous neural space, and do not suffer from the problems associated with symbolic methods. They are robust to paraphrase alternations because knowledge is implicitly represented by semantic classes via continuous vectors and matrices. They are also efficient even for a very large KB because they reason over a compact representation of a KB (e.g., the matrix $\mathbf{M}$ in the shared memory in IRN) rather than the KB itself.

One of the major limitations of these methods is the lack of interpretability. Unlike PRA, which traverses the paths in the graph explicitly as in Eqn. 3.3, IRN does not explicitly follow any path in the KB during reasoning but performs lookup operations over the shared memory $\mathbf{M}$ iteratively using the RNN controller with attention, each time using the revised internal state as a query for lookup. It remains challenging to recover the symbolic representations of queries and paths (or first-order logical rules) from the neural controller. See Shen et al. (2017a) and Yang et al. (2017a) for some interesting preliminary results on the interpretation of neural methods.

3.4.3 Reinforcement Learning based Methods

DeepPath (Xiong et al., 2017), MINERVA (Das et al., 2017b) and M-Walk (Shen et al., 2018) are among the recent examples that use RL for learning multi-step reasoning over a KB. They use a policy-based agent with continuous states based on KB embeddings to traverse the knowledge graph to identify the answer node (entity) for an input query. The RL-based methods are as robust as the neural methods due to the use of continuous vectors for state representation, and are as interpretable as symbolic methods because the agents explicitly traverse the paths in the graph.

We formulate KBR as an MDP defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{P})$, where $\mathcal{S}$ is the continuous state space, $\mathcal{A}$ the set of available actions, $\mathcal{P}$ the state transition probability matrix, and $\mathcal{R}$ the reward function. Below, we follow M-Walk and the example in Fig. 3.2 to describe these components in detail. We denote a KB as graph $\mathcal{G}(\mathcal{E},\mathcal{R})$ which consists of a collection of entity nodes $\mathcal{E}$ and the relation edges $\mathcal{R}$ that link the nodes. We denote a KB query as $q = (e_0,r,y)$, where $e_0$ and $r$ are the given source node and relation, respectively, and $y$ the answer node to be identified.


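States

Let $s_t$ denote the state at time $t$, which encodes information of all traversed nodes up to $t$, all the previous selected actions and the initial query $q$. $s_t$ can be defined recursively as follows:

$$s_0 := \{q,\mathcal{R}_{e_0},\mathcal{E}_{e_0}\}, \qquad s_t = s_{t-1}\cup\{a_{t-1},e_t,\mathcal{R}_{e_t},\mathcal{E}_{e_t}\},$$

where $a_t\in\mathcal{A}$ is the action selected by the agent at time $t$, $e_t$ is the currently visited node, $\mathcal{R}_{e_t}\subseteq\mathcal{R}$ is the set of all the edges connected to $e_t$, and $\mathcal{E}_{e_t}\subseteq\mathcal{E}$ is the set of all the nodes connected to $e_t$. Note that in RL-based methods, $s_t$ is represented as a continuous vector using, e.g., an RNN in M-Walk and MINERVA or an MLP in DeepPath.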


Actions

Based on $s_t$, the agent selects one of the following actions: (1) choosing an edge in $\mathcal{R}_{e_t}$ and moving to the next node $e_{t+1}\in\mathcal{E}_{e_t}$, or (2) terminating the reasoning process and outputting the current node $e_t$ as a prediction of the answer node $y$.


Transitions

The transitions are deterministic. As shown in Fig. 3.2, once action $a_t$ is selected, the next node $e_{t+1}$ and its associated $\mathcal{E}_{e_{t+1}}$ and $\mathcal{R}_{e_{t+1}}$ are known.


Rewards

We only have the terminal reward of $+1$ if $e_T$ is the correct answer, and $0$ otherwise.
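The four MDP components can be put together in a small environment sketch. The class and method names, the step limit, and the use of the Obama example as data are illustrative assumptions; in the cited systems the state handed to the policy is a learned continuous vector, whereas this sketch only tracks the current node.

```python
class KBWalkEnv:
    """Toy graph-walking environment with deterministic transitions."""

    STOP = "STOP"

    def __init__(self, triples, source, answer, max_steps=3):
        self.edges = {}
        for subj, rel, obj in triples:
            self.edges.setdefault(subj, []).append((rel, obj))
        self.node = source
        self.answer = answer
        self.steps_left = max_steps

    def actions(self):
        """Outgoing (relation, node) edges of the current node, plus STOP."""
        return self.edges.get(self.node, []) + [self.STOP]

    def step(self, action):
        """Apply an action and return (reward, done)."""
        if action != self.STOP:
            _, self.node = action      # deterministic move along the edge
            self.steps_left -= 1
            if self.steps_left > 0:
                return 0.0, False      # no intermediate reward
        # The walk ends: reward 1 only if it stopped on the answer node.
        return (1.0 if self.node == self.answer else 0.0), True


triples = [("Obama", "born-in", "Hawaii"), ("Hawaii", "part-of", "USA")]
env = KBWalkEnv(triples, source="Obama", answer="USA")
first = env.step(("born-in", "Hawaii"))   # (0.0, False)
second = env.step(("part-of", "USA"))     # (0.0, False)
final = env.step(KBWalkEnv.STOP)          # (1.0, True)
```

The reward arrives only at the final step, which is the sparsity issue that motivates techniques such as tree search over simulated paths.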

Policy Network

The policy $\pi_\theta(a|s)$ denotes the probability of selecting action $a$ given state $s$, and is implemented as a neural network parameterized by $\theta$. The policy network is optimized to maximize $\mathbb{E}[V_\theta(s_0)]$, which is the long-term reward of starting from $s_0$ and following the policy $\pi_\theta$ afterwards. In KBR, the policy network can be trained from training samples in the form of triples $(e_0,r,y)$ extracted from a KB using RL, such as the REINFORCE method. To address the reward sparsity issue (i.e., the reward is only available at the end of a path), Shen et al. (2018) proposed to use Monte Carlo Tree Search to generate a set of simulated paths with more positive terminal rewards by exploiting the fact that all the transitions are deterministic for a given knowledge graph.

3.5 Conversational KB-QA Agents

All of the KB-QA methods we have described so far are based on single-turn agents which assume that users can compose in one shot a complicated, compositional natural language query that can uniquely identify the answer in the KB.

Conversational KB-QA agents, on the other hand, allow users to query a KB interactively without composing complicated queries, motivated by the observations:

  • Users are more used to issuing simple queries of length less than 5 words (Spink et al., 2001).

  • In many cases, it is unreasonable to assume that users can construct compositional queries without prior knowledge of the structure of the KB to be queried.

A conversational KB-QA agent is useful for many interactive KB-QA tasks such as movie-on-demand, where a user attempts to find a movie based on certain attributes of that movie, as illustrated by the example in Fig. 3.4, where the movie DB can be viewed as an entity-centric KB consisting of entity-attribute-value triples.

Figure 3.4: An interaction between a user and a multi-turn KB-QA agent for the movie-on-demand task. Figure credit: Dhingra et al. (2017).

In addition to the core KB-QA engine which typically consists of a semantic parser and a KBR engine, a conversational KB-QA agent is also equipped with a Dialogue Manager (DM) which tracks the dialogue state and decides what question to ask to effectively help users navigate the KB in search of an entity (movie). The high-level architecture of the conversational agent for movie-on-demand is illustrated in Fig. 3.5. At each turn, the agent receives a natural language utterance as input, and selects an action as output. The action space consists of a set of questions, each for requesting the value of an attribute, and an action of informing the user with an ordered list of retrieved entities. The agent is a typical task-oriented dialogue system of Fig. 1.1 (Top), consisting of (1) a belief tracker module for resolving coreferences and ellipsis in user utterances using conversation context, identifying user intents, extracting associated attributes, and tracking the dialogue state; (2) an interface with the database to query for relevant results (i.e., the Soft-KB Lookup component, which can be implemented using the KB-QA models described in the previous sections, except that we need to form the query based on the dialogue history captured by the belief tracker, not just the current user utterance, as described in Suhr et al. (2018)); (3) a beliefs summary module to summarize the state into a vector; and (4) a dialogue policy which selects the next action based on the current state. The policy can be either programmed (Wu et al., 2015) or trained on dialogues (Wen et al., 2017; Dhingra et al., 2017).

Figure 3.5: An overview of a conversational KB-QA agent. Figure credit: Dhingra et al. (2017).

Wu et al. (2015) presented an Entropy Minimization Dialogue Management (EMDM) strategy. The agent always asks for the value of the attribute with maximum entropy over the remaining entries in the database. EMDM is proved optimal in the absence of language understanding errors. However, it does not take into account the fact that some questions are easy for users to answer, whereas others are not. For example, in the movie-on-demand task, the agent could ask users to provide the movie release ID which is unique to each movie but is often unknown to regular users.
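The attribute-selection rule of such an entropy-based strategy can be sketched as follows. The tiny movie table, its attribute names, and the function names are made up for illustration, and the sketch ignores language understanding errors and how hard a question is for the user to answer.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def next_attribute(rows, attributes):
    """Pick the attribute whose values are most spread out over the rows."""
    return max(attributes, key=lambda a: entropy([row[a] for row in rows]))


# Hypothetical remaining database entries after earlier dialogue turns.
rows = [
    {"genre": "comedy", "year": "1999"},
    {"genre": "comedy", "year": "2005"},
    {"genre": "comedy", "year": "2010"},
    {"genre": "drama", "year": "2010"},
]
choice = next_attribute(rows, ["genre", "year"])  # "year"
```

Asking about the year splits the four remaining entries more evenly than asking about the genre, so it is expected to narrow the search faster.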

Dhingra et al. (2017) proposed KB-InfoBot – a fully neural end-to-end multi-turn dialogue agent for the movie-on-demand task. The agent is trained entirely from user feedback. It does not suffer from the problem of EMDM, and always asks users easy-to-answer questions to help search in the KB. Like all KB-QA agents, KB-InfoBot needs to interact with an external KB to retrieve real-world knowledge. This is traditionally achieved by issuing a symbolic query to the KB to retrieve entries based on their attributes. However, such symbolic operations break the differentiability of the system and prevent end-to-end training of the dialogue agent. KB-InfoBot addresses this limitation by replacing symbolic queries with an induced posterior distribution over the KB that indicates which entries the user is interested in. The induction can be achieved using the neural KB-QA methods described in the previous sections. Experiments show that integrating the induction process with RL leads to higher task success rate and reward in both simulations and against real users.

Recently, several datasets have been developed for building conversational KB-QA agents. Iyyer et al. (2017) collected a Sequential Question Answering (SQA) dataset via crowdsourcing by leveraging WikiTableQuestions (WTQ (Pasupat and Liang, 2015)), which contains highly compositional questions associated with HTML tables from Wikipedia. As shown in the example in Fig. 3.6 (Left), each crowdsourcing task contains a long, complex question originally from WTQ as the question intent. The workers are asked to compose a sequence of simpler but inter-related questions that lead to the final intent. The answers to the simple questions are subsets of the cells in the table.

Figure 3.6: The examples from two conversational KB-QA datasets. (Left) An example question sequence created from a compositional question intent in the SQA dataset. Figure credit: Iyyer et al. (2017). (Right) An example dialogue from the CSQA dataset. Figure credit: Saha et al. (2018).

Saha et al. (2018) presented a dataset consisting of 200K QA dialogues for the task of Complex Sequence Question Answering (CSQA). CSQA combines two sub-tasks: (1) answering factoid questions through complex reasoning over a large-scale KB, and (2) learning to converse through a sequence of coherent QA pairs. As shown in the example in Fig. 3.6 (Right), CSQA calls for a conversational KB-QA agent that combines many technologies described in this chapter, including (1) parsing complex natural language queries as described in Sec. 3.2, (2) using conversation context to resolve coreferences and ellipsis in user utterances, as the belief tracker in Fig. 3.5, (3) asking clarification questions for ambiguous queries, as the dialogue manager in Fig. 3.5, and (4) retrieving relevant paths in the KB to answer questions as described in Sec. 3.4.

3.6 Machine Reading for Text-QA

Machine Reading Comprehension (MRC) is a challenging task: the goal is to have machines read a (set of) text passage(s) and then answer any question about the passage(s). The MRC model is the core component of text-QA agents.

The recent big progress on MRC is largely due to the availability of a multitude of large-scale datasets that the research community has created over various text sources such as Wikipedia (WikiReading (Hewlett et al., 2016), SQuAD (Rajpurkar et al., 2016), WikiHop (Welbl et al., 2017), DRCD (Shao et al., 2018)), news and other articles (CNN/Daily Mail (Hermann et al., 2015), NewsQA (Trischler et al., 2016), RACE (Lai et al., 2017)), fictional stories (MCTest (Richardson et al., 2013), CBT (Hill et al., 2015), NarrativeQA (Kočisky et al., 2017)), and general Web documents (MS MARCO (Nguyen et al., 2016), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), DuReader (He et al., 2017b)).

As shown in the example in Fig. 3.7 (Left), the MRC task defined on SQuAD involves a question and a passage, and aims to find an answer span in the passage. For example, in order to answer the question “what causes precipitation to fall?”, one might first locate the relevant part of the passage “precipitation … falls under gravity”, then reason that “under” refers to a cause (not location), and thus determine the correct answer: “gravity”. Although questions with span-based answers are more constrained than the real-world questions that users submit to Web search engines such as Google and Bing, SQuAD provides a rich diversity of questions and answer types and became one of the most widely used MRC datasets in the research community.

MS MARCO is a large scale real-world MRC dataset, released by Microsoft, aiming to address the limitations of other academic datasets. For example, MS MARCO differs from SQuAD in that (1) SQuAD consists of questions posed by crowdworkers while MS MARCO is sampled from real user queries; (2) SQuAD uses a small set of high quality Wikipedia articles while MS MARCO is sampled from a large amount of Web documents; (3) MS MARCO includes some unanswerable queries; and (4) SQuAD requires identifying an answer span in a passage while MS MARCO requires generating an answer (if there is one) from multiple passages that may or may not be relevant to the given question. As a result, MS MARCO is far more challenging, and requires more sophisticated reading comprehension skills. As shown in the example in Fig. 3.7 (Right), given the question “will I qualify for OSAP if I’m new in Canada”, one might first locate the relevant passage that includes: “you must be a 1 Canadian citizen; 2 permanent resident; or 3 protected person…” and reason that being new to the country is usually the opposite of citizen, permanent resident etc., and thus determine the correct answer: “no, you won’t qualify”.

(Footnote 4: SQuAD v2 (Rajpurkar et al., 2018) also includes unanswerable queries.)

Figure 3.7: The examples from two MRC datasets. (Left) Question-answer pairs for a sample passage in the SQuAD dataset, adapted from Rajpurkar et al. (2016). Each of the answers is a text span in the passage. (Right) A question-answer pair for a set of passages in the MS MARCO dataset, adapted from Nguyen et al. (2016). The answer, if there is one, is human generated.

3.7 Neural MRC Models

Figure 3.8: Two examples of state of the art neural MRC models. (Left) The Stochastic Answer Net (SAN) model. Figure credit: Liu et al. (2018c). (Right) The BiDirectional Attention Flow (BiDAF) model. Figure credit: Seo et al. (2016).

The description in this section is based on the state of the art models developed on SQuAD, where given a question $Q = (q_1,\ldots,q_I)$ and a passage $P = (p_1,\ldots,p_J)$, we need to locate an answer span $A = (a_{start},a_{end})$ in $P$.

In spite of the variety of model structures and attention types (Chen et al., 2016a; Xiong et al., 2016; Seo et al., 2016; Shen et al., 2017c; Wang et al., 2017b), a typical neural MRC model performs reading comprehension in three steps, as outlined in Fig. 1.3: (1) encoding the symbolic representation of the questions and passages into a set of vectors in a neural space; (2) reasoning in the neural space to identify the answer vector (e.g., in SQuAD, this is equivalent to ranking and re-ranking the embedded vectors of all possible text spans in $P$); and (3) decoding the answer vector into a natural language output in the symbolic space (e.g., this is equivalent to mapping the answer vector to its text span in $P$). Since the decoding module is straightforward for SQuAD models, we focus our discussion below on encoding and reasoning.
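The text above does not specify how candidate spans are scored. As one common simplification, which is an assumption here rather than the method of any particular cited model, suppose the model outputs one start score and one end score per passage token; ranking all spans then reduces to a bounded search for the best (start, end) pair. The function name, the length limit, and the scores below are hypothetical.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximizing start_scores[start] + end_scores[end]
    subject to start <= end and a span of at most max_len tokens."""
    best_score, best_start, best_end = float("-inf"), 0, 0
    for i, start_score in enumerate(start_scores):
        last = min(i + max_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score, best_start, best_end = score, i, j
    return best_start, best_end


# Hypothetical per-token scores for a three-token passage.
span = best_span([0.1, 2.0, 0.3], [0.5, 0.1, 1.5])  # (1, 2)
```

Decoding then amounts to returning the passage tokens between the two selected positions, which is why the decoding step is described as straightforward for span-based datasets.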

Fig. 3.8 shows two examples of neural MRC models. BiDAF (Seo et al., 2016) is among the most widely used state of the art MRC baseline models in the research community and SAN (Liu et al., 2018c) is the best documented MRC model on the SQuAD1.1 leaderboard as of Dec. 19, 2017.

3.7.1 Encoding

Most MRC models encode questions and passages through three layers: lexicon embedding layer, contextual embedding layer and attention layer.

Lexicon Embedding Layer.

It extracts information from the question $Q$ and the passage $P$ at the word level and normalizes for lexical variants. It typically maps each word to a vector space using a pre-trained word embedding model, such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), such that semantically similar words are mapped to vectors that are close to each other in the neural space (also see Sec. 2.2.1). Word embedding can be enhanced by concatenating each word embedding vector with other linguistic embeddings such as those derived from characters, Part-Of-Speech (POS) tags, and named entities. Given $Q = (q_1, \ldots, q_I)$ and $P = (p_1, \ldots, p_J)$, the word embeddings for the tokens in $Q$ form a matrix $E^q \in \mathbb{R}^{d \times I}$ and those for the tokens in $P$ form a matrix $E^p \in \mathbb{R}^{d \times J}$, where $d$ is the dimension of word embeddings.

Contextual Embedding Layer.

It utilizes contextual cues from surrounding words to refine the embedding of the words. As a result, the same word might map to different vectors in a neural space depending on its context, such as “bank of a river” vs. “bank of America”. This is typically achieved by using a Bi-directional Long Short-Term Memory (BiLSTM) network, an extension of the RNN of Fig. 2.2. As shown in Fig. 3.8, we place an LSTM in both directions, and concatenate the outputs of the two LSTMs. Hence, we obtain a matrix $H^q \in \mathbb{R}^{2d \times I}$ as a context-aware representation of the question $Q$ and a matrix $H^p \in \mathbb{R}^{2d \times J}$ as a context-aware representation of the passage $P$.

(Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks (RNNs). The units of an LSTM are used as building units for the layers of an RNN. LSTMs enable RNNs to remember their inputs over a long period of time because LSTMs contain their information in a gated cell, where gated means that the cell decides whether to store or delete information based on the importance it assigns to the information. The use of BiLSTM for contextual embedding is suggested by Melamud et al. (2016); McCann et al. (2017).)

ELMo (Peters et al., 2018) is the new state of the art contextual embedding model. It is based on deep BiLSTM. Instead of using only the output layer representations of BiLSTM, ELMo combines the intermediate layer representations in the BiLSTM, where the combination weights are optimized on task-specific training data.
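To make the layer combination concrete, here is a minimal NumPy sketch of a weighted sum over BiLSTM layer representations. The function name `combine_layers`, the softmax normalization of the weights, the scalar `gamma`, and the toy sizes are assumptions of this sketch, not details given in the text above; in a real system the logits would be learned on task-specific data.

```python
import numpy as np

def combine_layers(layer_reps, layer_logits, gamma=1.0):
    """Weighted sum of per-layer representations.

    layer_reps:   array of shape (num_layers, seq_len, dim)
    layer_logits: array of shape (num_layers,), trainable in a real model
    gamma:        scalar scale, also trainable in a real model
    """
    weights = np.exp(layer_logits - layer_logits.max())
    weights /= weights.sum()                      # softmax over layers
    # Contract the layer axis: result has shape (seq_len, dim).
    return gamma * np.tensordot(weights, layer_reps, axes=1)

rng = np.random.default_rng(0)
reps = rng.normal(size=(3, 5, 8))    # 3 layers, 5 tokens, 8 dims (toy sizes)
mixed = combine_layers(reps, np.zeros(3))   # equal logits give a plain average
```

With equal logits the result is the average of the layers; a large logit on one layer makes the combination approach that layer alone.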

Since RNN/LSTM is hard to train efficiently using parallel computing, Yu et al. (2018) present a new contextual embedding model which does not require an RNN: its encoder consists exclusively of convolution and self-attention, where convolution models local interactions and self-attention models global interactions. Such a model can be trained an order of magnitude faster than an RNN-based model on GPU clusters.

Attention Layer.

It couples the question and passage vectors and produces a set of query-aware feature vectors for each word in the passage, and generates the working memory $M$ over which reasoning is performed. This is achieved by summarizing information from both $H^q$ and $H^p$ via the attention process, which consists of the following steps (interested readers may refer to Table 1 in Huang et al. (2017) for a summarized view of the attention process used in several state of the art MRC models):

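  1. Compute an attention score, which signifies which query words are most relevant to each passage word: $s_{ij} = \text{sim}_{\theta_s}(h^q_i, h^p_j) \in \mathbb{R}$ for each $h^q_i$ in $H^q$, where $\text{sim}_{\theta_s}$ is the similarity function, e.g., a bilinear model, parameterized by $\theta_s$.

  2. Compute the normalized attention weights through softmax: $\alpha_{ij} = \exp(s_{ij}) / \sum_k \exp(s_{kj})$.

  3. Summarize information for each passage word via $\hat{h}^p_j = \sum_i \alpha_{ij} h^q_i$. Thus, we obtain a matrix $\hat{H}^p \in \mathbb{R}^{2d \times J}$ as a question-aware representation of $P$.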
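The three attention steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a bilinear similarity with a single weight matrix `W`; the function names, random inputs, and toy dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_aware_passage(Hq, Hp, W):
    """Hq: (dim, I) question vectors; Hp: (dim, J) passage vectors; W: (dim, dim)."""
    scores = Hq.T @ W @ Hp            # step 1: scores[i, j] = hq_i^T W hp_j, shape (I, J)
    alpha = softmax(scores, axis=0)   # step 2: normalize over question words i
    return Hq @ alpha                 # step 3: column j is sum_i alpha[i, j] * hq_i

rng = np.random.default_rng(0)
dim, I, J = 4, 3, 6
Hq, Hp = rng.normal(size=(dim, I)), rng.normal(size=(dim, J))
Hp_hat = question_aware_passage(Hq, Hp, rng.normal(size=(dim, dim)))
```

Each column of `Hp_hat` is a convex combination of the question vectors, so the output has one question-aware vector per passage word.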

Then, we form the working memory $M$ in the neural space as $M = f_\theta(\hat{H}^p, H^p)$, where $f_\theta$ is a function of fusing its input matrices, parameterized by $\theta$. $f_\theta$ can be an arbitrary trainable neural network. For example, the fusion function in SAN includes a concatenation layer, a self-attention layer and a BiLSTM layer. BiDAF computes attentions in two directions: from passage to question, $\hat{H}^q$, as well as from question to passage, $\hat{H}^p$. The fusion function in BiDAF includes a layer that concatenates three matrices $H^p$, $\hat{H}^p$ and $\hat{H}^q$, and a two-layer BiLSTM to encode for each word its contextual information with respect to the entire passage and the query.

3.7.2 Reasoning

MRC models can be grouped into different categories based on how they perform reasoning to generate the answer: single-step and multi-step models.

Single-Step Reasoning.

A single-step reasoning model matches the question and document only once and produces the final answers. We use the single-step version of SAN in Fig. 3.8 (Left) as an example to describe the single-step reasoning process. (This is a special version of SAN where the maximum number of reasoning steps is $T = 1$; SAN in Fig. 3.8 (Left) uses multiple reasoning steps.) We need to find the answer span (i.e., the start and end points) over the working memory $M$. First, a summarized question vector is formed as

$$h^q = \sum_i \beta_i h^q_i, \qquad (3.6)$$

where $\beta_i = \exp(w^\top h^q_i) / \sum_k \exp(w^\top h^q_k)$, and $w$ is a trainable vector. Then, a bilinear function is used to obtain the probability distribution of the start index over the entire passage by

$$p^{(start)} = \text{softmax}\left({h^q}^\top W^{(start)} M\right), \qquad (3.7)$$

where $W^{(start)}$ is a weight matrix. Another bilinear function is used to obtain the probability distribution of the end index, incorporating the information of the span start obtained by Eqn. 3.7, as

$$p^{(end)} = \text{softmax}\left(\left[h^q; \sum_j p^{(start)}_j m_j\right]^\top W^{(end)} M\right), \qquad (3.8)$$

where the semicolon mark indicates the vector or matrix concatenation operator, $p^{(start)}_j$ is the probability of the $j$-th word in the passage being the start of the answer span, $W^{(end)}$ is a weight matrix, and $m_j$ is the $j$-th vector of $M$.
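The single-step procedure (summarize the question, score start positions with a bilinear function, then score end positions conditioned on the expected start vector) can be sketched as follows. This is an illustrative NumPy sketch with random parameters; the function name `single_step_span` and all dimensions are assumptions, and a real model would learn the parameters by back-propagation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def single_step_span(Hq, M, w, W_start, W_end):
    """Hq: (d, I) question vectors; M: (m, J) working memory.
    w: (d,), W_start: (d, m), W_end: (d + m, m) are trainable parameters."""
    beta = softmax(w @ Hq)                   # attention weights over question words, (I,)
    hq = Hq @ beta                           # summarized question vector, (d,)
    p_start = softmax(hq @ W_start @ M)      # start distribution over passage words, (J,)
    start_summary = M @ p_start              # expected memory vector at the start, (m,)
    p_end = softmax(np.concatenate([hq, start_summary]) @ W_end @ M)
    return p_start, p_end

rng = np.random.default_rng(0)
d, m, I, J = 4, 6, 3, 7
p_start, p_end = single_step_span(
    rng.normal(size=(d, I)), rng.normal(size=(m, J)),
    rng.normal(size=d), rng.normal(size=(d, m)), rng.normal(size=(d + m, m)))
```

Both outputs are probability distributions over the `J` passage positions; a decoder would then pick a span whose start and end positions have high probability.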

Single-step reasoning is simple yet efficient, and the model parameters can be trained using the classical back-propagation algorithm; thus it is adopted by most systems (Chen et al., 2016b; Seo et al., 2016; Wang et al., 2017b; Liu et al., 2017; Chen et al., 2017a; Weissenborn et al., 2017; Hu et al., 2017). However, since humans often solve question answering tasks by re-reading and re-digesting the document multiple times before reaching the final answer (this may be based on the complexity of the questions and documents, as illustrated by the examples in Fig. 3.9), it is natural to devise an iterative way to find answers, i.e., multi-step reasoning.

Figure 3.9: (Top) A human reader can easily answer the question by reading the passage only once. (Bottom) A human reader may have to read the passage multiple times to answer the question.

Multi-Step Reasoning.

Multi-step reasoning models were pioneered by Hill et al. (2015); Dhingra et al. (2016); Sordoni et al. (2016); Kumar et al. (2016), who used a pre-determined fixed number of reasoning steps. Shen et al. (2017b, c) showed that multi-step reasoning outperforms single-step reasoning, and that dynamic multi-step reasoning further outperforms fixed multi-step reasoning, on two distinct MRC datasets (SQuAD and MS MARCO). But dynamic multi-step reasoning models have to be trained using RL methods, e.g., policy gradient, which are tricky to implement due to instability. SAN combines the strengths of both types of multi-step reasoning models. As shown in Fig. 3.8 (Left), SAN (Liu et al., 2018c) uses a fixed number of reasoning steps, and generates a prediction at each step. During decoding, the answer is based on the average of the predictions in all steps. During training, however, SAN drops predictions via stochastic dropout, and generates the final result based on the average of the remaining predictions. Albeit simple, this technique significantly improves the robustness and overall accuracy of the model. Furthermore, SAN can be trained using back-propagation, which is simple and efficient.

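Taking SAN as an example, the multi-step reasoning module computes over $T$ memory steps and outputs the answer span. It is based on an RNN, similar to IRN in Fig. 3.5. It maintains a state vector, which is updated at each step. At the beginning, the initial state $s_0$ is the summarized question vector computed by Eqn. 3.6. At time step $t$ in the range of $\{1, 2, \ldots, T-1\}$, the state is defined by $s_t = \text{RNN}(s_{t-1}, x_t)$, where $x_t$ contains retrieved information from memory $M$ using the previous state vector as a query via the attention process: $x_t = \sum_j \gamma_j m_j$ and $\gamma = \text{softmax}(s_{t-1}^\top W^{(att)} M)$, where $W^{(att)}$ is a trainable weight matrix. Finally, a bilinear function is used to find the start and end point of answer spans at each reasoning step $t \in \{0, 1, \ldots, T-1\}$, similar to Eqn. 3.7 and 3.8, as

$$p^{(start)}_t = \text{softmax}\left(s_t^\top W^{(start)} M\right), \qquad (3.9)$$

$$p^{(end)}_t = \text{softmax}\left(\left[s_t; \sum_j p^{(start)}_{t,j} m_j\right]^\top W^{(end)} M\right), \qquad (3.10)$$

where $p^{(start)}_{t,j}$ is the $j$-th value of the vector $p^{(start)}_t$, indicating the probability of the $j$-th passage word being the start of the answer span at reasoning step $t$.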
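The multi-step loop, together with the averaging of per-step predictions and the training-time dropout of whole predictions, can be sketched as below for the start index only. This is a simplified illustration: the plain tanh cell stands in for the RNN, and the parameter names (`W_att`, `W_start`, `U`, `V`), the dropout rate, and the toy sizes are assumptions rather than SAN's actual configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_step_start(s0, M, params, T=5, drop_prob=0.0, rng=None):
    """Average the start-index distributions produced over T reasoning steps.

    s0: (d,) initial state (summarized question); M: (m, J) working memory.
    params: dict with W_att (d, m), W_start (d, m), U (d, d), V (d, m).
    drop_prob > 0 mimics training-time dropout of whole step predictions.
    """
    s, preds = s0, []
    for t in range(T):
        if t > 0:
            gamma = softmax(s @ params["W_att"] @ M)        # attention over memory, (J,)
            x = M @ gamma                                   # retrieved information, (m,)
            s = np.tanh(params["U"] @ s + params["V"] @ x)  # stand-in RNN cell
        preds.append(softmax(s @ params["W_start"] @ M))    # prediction at step t
    preds = np.array(preds)
    if drop_prob > 0.0:
        keep = rng.random(T) >= drop_prob
        if not keep.any():                  # always keep at least one prediction
            keep[rng.integers(T)] = True
        preds = preds[keep]
    return preds.mean(axis=0)

rng = np.random.default_rng(0)
d, m, J = 4, 6, 7
params = {"W_att": rng.normal(size=(d, m)), "W_start": rng.normal(size=(d, m)),
          "U": rng.normal(size=(d, d)), "V": rng.normal(size=(d, m))}
M = rng.normal(size=(m, J))
s0 = rng.normal(size=d)
p_eval = multi_step_start(s0, M, params, T=5)                            # decoding
p_train = multi_step_start(s0, M, params, T=5, drop_prob=0.4, rng=rng)   # training
```

Because every per-step prediction is a distribution, the average over any non-empty subset of steps is again a distribution, which is why dropping predictions leaves the output well-formed.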

3.7.3 Training

A neural MRC model can be viewed as a deep neural network that includes all component modules (e.g., the embedding layers and reasoning engines) which by themselves are also neural networks. Thus, it can be optimized on training data in an end-to-end fashion via back-propagation and SGD, as outlined in Fig. 1.3. For SQuAD models, we optimize model parameters by minimizing the loss function defined as the sum of the negative log probabilities of the ground truth answer span start and end points by the predicted distributions, averaged over all training samples:


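$$\mathcal{L}(\theta) = -\frac{1}{|D|} \sum_{i=1}^{|D|} \left( \log p^{(start)}_{y^{start}_i} + \log p^{(end)}_{y^{end}_i} \right), \qquad (3.11)$$

where $D$ is the training set, $y^{start}_i$ and $y^{end}_i$ are the true start and end of the answer span of the $i$-th training sample, respectively, and $p_k$ is the $k$-th value of the vector $p$.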
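As a small worked example of this loss, the sketch below computes the averaged negative log-likelihood of the true start and end indices for two toy training samples. The function name `span_nll` and the probability values are made up for illustration.

```python
import numpy as np

def span_nll(p_start, p_end, y_start, y_end):
    """p_start, p_end: (N, J) predicted distributions for N samples.
    y_start, y_end: (N,) integer indices of the true span boundaries."""
    n = np.arange(len(y_start))
    log_lik = np.log(p_start[n, y_start]) + np.log(p_end[n, y_end])
    return -log_lik.mean()

p_start = np.array([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]])
p_end = np.array([[0.25, 0.5, 0.25], [0.1, 0.1, 0.8]])
loss = span_nll(p_start, p_end, np.array([0, 1]), np.array([1, 2]))
# Here loss = -(log 0.5 + log 0.5 + log 0.8 + log 0.8) / 2 = -log(0.4).
```

The loss shrinks toward zero as the predicted distributions put more mass on the true start and end positions.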

3.8 Conversational Text-QA Agents

While all the neural MRC models described in Sec. 3.7 assume a single-turn QA setting, in reality, humans often ask questions in a conversational context (Ren et al., 2018). For example, a user might ask the question “when was California founded?”, and then depending on the received answer, follow up with “who is its governor?” and “what is the population?”, where both refer to “California” mentioned in the first question. This incremental aspect, although making human conversations succinct, presents new challenges that most state-of-the-art single-turn MRC models do not address directly, such as referring back to conversational history using coreference and pragmatic reasoning (Reddy et al., 2018). (Pragmatic reasoning is defined as “the process of finding the intended meaning(s) of the given, and it is suggested that this amounts to the process of inferring the appropriate context(s) in which to interpret the given” (Bell, 1999). The analysis by Jia and Liang (2017); Chen et al. (2016a) revealed that state of the art neural MRC models, e.g., developed on SQuAD, mostly excel at matching questions to local context via lexical matching and paraphrasing, but struggle with questions that require reasoning.)

Figure 3.10: Examples from two conversational QA datasets. (Left) A QA dialogue example in the QuAC dataset. The student, who does not see the passage (section text), asks questions. The teacher provides answers in the form of text spans and dialogue acts. These acts include (1) whether the student should, could, or should not ask a follow-up; (2) affirmation (Yes / No), and, when appropriate, (3) No answer. Figure credit: Choi et al. (2018). (Right) A QA dialogue example in the CoQA dataset. Each dialogue turn contains a question ($Q_i$), an answer ($A_i$) and a rationale ($R_i$) that supports the answer. Figure credit: Reddy et al. (2018).

A conversational text-QA agent uses an architecture similar to that of Fig. 3.5, except that the Soft-KB Lookup module is replaced by a text-QA module which consists of a search engine (e.g., Google or Bing) that retrieves relevant passages for a given question, and an MRC model that generates the answer from the retrieved passages. The MRC model needs to be extended to address the aforementioned challenges in the conversation setting, and is henceforth referred to as a conversational MRC model.

Recently, several datasets have been developed for building conversational MRC models. Among them are CoQA (Conversational Question Answering (Reddy et al., 2018)) and QuAC (Question Answering in Context (Choi et al., 2018)), as shown in Fig. 3.10. The task of conversational MRC is defined as follows. Given a passage $P$, the conversation history in the form of question-answer pairs $\{Q_1, A_1, Q_2, A_2, \ldots, Q_{i-1}, A_{i-1}\}$ and a question $Q_i$, the MRC model needs to predict the answer $A_i$.

A conversational MRC model extends the models described in Sec. 3.7 in two aspects. First, the encoding module is extended to encode not only the passage $P$ and the current question $Q_i$ but also the conversation history. Second, the reasoning module is extended to be able to generate an answer (via pragmatic reasoning) that might not overlap $P$. For example, Reddy et al. (2018) proposed a reasoning module that combines the text-span MRC model of DrQA (Chen et al., 2017a) and the generative model of PGNet (See et al., 2017). To generate a free-form answer, DrQA first points to the answer evidence in the text (e.g., R5 in Fig. 3.10 (Right)), and PGNet then generates an answer (e.g., A5) based on the evidence.

3.9 TREC Open Benchmarks

In addition to the public text-QA and KB-QA datasets we described in the previous sections, TREC also provides a series of open QA benchmarks.

The automated QA track.

This was one of the most popular tracks in TREC for many years, until 2007 (Dang et al., 2007; Agichtein et al., 2015). It focused on the task of providing automatic answers for human questions. The track primarily dealt with factual questions, and the answers provided by participants were extracted from a corpus of news articles. While the task evolved to model increasingly realistic information needs, addressing question series, list questions, and even interactive feedback, a major limitation remained: the questions did not come directly from real users in real time.

The LiveQA track.

The LiveQA track at TREC started in 2015 (Agichtein et al., 2015), focusing on answering user questions in real time. Real user questions, i.e., fresh questions submitted on the Yahoo Answers (YA) site that had not yet been answered, were sent to the participant systems, which provided an answer in real time. Returned answers were judged by TREC editors on a 4-level Likert scale. LiveQA revived the popular QA track, which had been frozen for several years, attracting significant attention from the QA research community.

4.1 Overview and Architecture

We start with a brief overview of task-oriented dialogue systems, focusing on aspects that facilitate later discussions. For more information and historical developments, readers are referred to Jurafsky and Martin (2018, Chapters 25&26).

Slot-filling Dialogues

Throughout the chapter, we focus on a relatively simple yet important class of dialogues that involve filling in values for a predefined set of slots before a task can be successfully completed. These dialogues, known as slot-filling or form-filling, find a number of uses in practice. Table 4.1 gives an example conversation between a user and a dialogue system. To successfully complete a transaction to book tickets, the system must collect necessary information by conversing with the user.

System Hello! How may I assist you?
User Can I get tickets for Zoolander 2 tomorrow?
System Which city are you in?
User I want to watch in Seattle.
System How many tickets do you need?
User 2 tickets please. And I’d like to watch at Regal Meridian 16.
System 9:25 pm is available. Does it work for you?
User Yes.
System Okay, I was able to book 2 tickets for Zoolander 2 tomorrow at Regal Meridian 16 theater in Seattle at 9:25 pm.
User Thank you.
System Good bye.
Table 4.1: An example movie-booking dialogue, adapted from Li et al. (2016d).

Slot-filling dialogue may be applied to many domains, such as movie-ticket booking (as shown in the example above), restaurant reservation, air ticket booking, etc. For each domain, a set of slots is defined by domain experts and is application specific. For example, in the movie domain, slots like movie name, theater name, time, date, ticket price, number of tickets, etc. are necessary.

Dialogue Acts

The interaction between a dialogue agent and a user, as shown in the previous example, mirrors the interaction between an RL agent and the environment (Fig. 2.4), where a user utterance is like an observation, and the system utterance is an action selected by the dialogue agent. The dialogue acts theory gives a formal foundation for this intuition (Core and Allen, 1997; Traum, 1999).

In this framework, the utterances of a user or agent are considered actions that can change the (mental) state of both the user and the system, thus the state of the conversation. These actions can have a type of suggesting, informing, requesting, among others. A simple dialogue act is greeting, such as “Hello! How may I assist you?”, which allows the system to greet the user and start a conversation. Some dialogue acts may have slots as their parameter. For example, the following question in the movie-booking example above:

“How many tickets do you need?”

is to collect information about a certain slot:

request(num_tickets)

Furthermore, some dialogue acts may even contain slot-value pairs as parameters, such as inform(city="seattle") in the example:

“I want to watch it at Seattle.”
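One simple way to represent such dialogue acts in code is a small record holding the act type, the requested slots, and any slot-value pairs. The sketch below is only an illustration: the class name `DialogueAct`, its fields, and the slot name `num_tickets` are hypothetical choices, not part of any dialogue-act standard.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class DialogueAct:
    act_type: str                     # e.g. "greeting", "request", "inform"
    slots: Tuple[str, ...] = ()       # slots the speaker asks about
    slot_values: Dict[str, str] = field(default_factory=dict)  # slot-value pairs provided

    def __str__(self):
        args = list(self.slots) + [f'{k}="{v}"' for k, v in self.slot_values.items()]
        return f"{self.act_type}({', '.join(args)})"

greet = DialogueAct("greeting")                               # no parameters
ask_tickets = DialogueAct("request", slots=("num_tickets",))  # slot as parameter
tell_city = DialogueAct("inform", slot_values={"city": "seattle"})  # slot-value pair
```

Printing `tell_city` renders the act in the same functional notation used in the text.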

Dialogue as Optimal Decision Making

Equipped with dialogue acts, we are ready to model multi-turn conversations between a dialogue agent and a user as an RL problem. Here, the dialogue system is the RL agent, and the user is the environment. At every turn of the dialogue,

  • the agent keeps track of the dialogue state, based on information revealed so far in the conversation, and then takes an action; the action may be a response to the user in the form of dialogue acts, or an internal operation such as database lookup;

  • the user responds with the next utterance, which will be used by the agent to update its internal dialogue state in the next turn;

  • associated with this dialogue turn is an immediate reward.

This process is precisely the agent-environment interaction discussed in Sec. 2.3. We now discuss how a reward function is determined.

An appropriate reward function should capture desired features of a dialogue system. In task-oriented dialogues, we would like the system to succeed in helping the user in as few turns as possible. Therefore, it is natural to give a high reward at the end of the conversation if the task is successfully solved, or a low reward otherwise. Furthermore, we may give a small penalty (i.e., a small negative reward) to every intermediate turn of the conversation, so that the agent is encouraged to make the dialogue as short as possible. The above is of course just a simplistic illustration of how to set a reward function for task-oriented dialogues; in practice, more sophisticated reward functions may be used, such as those that measure diversity and coherence of the conversation. Further discussion of the reward function can be found in Sec. 4.5.6 and Sec. 5.4.
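The simple reward scheme just described can be written down directly. In the sketch below the numeric values (10, -10, and -1) are purely illustrative assumptions chosen for this example, as are the function names.

```python
def turn_reward(is_last_turn, task_success, success_reward=10.0,
                failure_reward=-10.0, turn_penalty=-1.0):
    """Reward for one dialogue turn; all default values are illustrative."""
    if not is_last_turn:
        return turn_penalty                    # small penalty for each intermediate turn
    return success_reward if task_success else failure_reward

def dialogue_return(num_turns, task_success, **kwargs):
    """Undiscounted sum of per-turn rewards over a whole dialogue."""
    return sum(turn_reward(t == num_turns - 1, task_success, **kwargs)
               for t in range(num_turns))

short_success = dialogue_return(5, True)    # 4 * (-1) + 10
long_success = dialogue_return(8, True)     # 7 * (-1) + 10
short_failure = dialogue_return(5, False)   # 4 * (-1) - 10
```

Under these values a shorter successful dialogue earns a higher return than a longer one, and any failed dialogue earns less than a successful one of the same length.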

Figure 4.1: An architecture for multi-turn task-oriented dialogues

To build a system in practice, the architecture depicted in Fig. 4.1 is often used. It is pipelined and consists of the following modules.

  • Language Understanding (LU): This module takes the user’s raw utterance as input and converts it to the semantic form of dialogue acts.

  • Dialogue Management (DM): This module is the central controller of the dialogue system. It often has a State Tracking (ST) sub-module that is responsible for keeping track of the current dialogue state. The other sub-module, the policy, relies on the internal state provided by ST to select an action. Note that here, an action can be a response to the user, or some operation on backend databases (e.g., looking up certain information).

  • Language Generation (LG): If the policy chooses to respond to the user, LG will convert this action, often a dialogue act, into a natural language form.
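The pipeline above can be illustrated with a toy, rule-based implementation in which each module is a plain function. Everything here is a hypothetical sketch: the slot set, the keyword rules, and the response templates are made up for this example, and practical systems replace each function with a learned model.

```python
import re

SLOTS = ("city", "num_tickets")          # hypothetical slot set

def understand(utterance):
    """LU: map a raw utterance to a dialogue act (toy keyword rules)."""
    values = {}
    m = re.search(r"\b(\d+) tickets?\b", utterance)
    if m:
        values["num_tickets"] = m.group(1)
    m = re.search(r"\bin (\w+)", utterance)
    if m:
        values["city"] = m.group(1)
    return {"act": "inform", "slots": values}

def track_state(state, user_act):
    """ST: merge newly informed slot values into the dialogue state."""
    return {**state, **user_act["slots"]}

def policy(state):
    """Policy: request the first unfilled slot, otherwise book."""
    for slot in SLOTS:
        if slot not in state:
            return {"act": "request", "slot": slot}
    return {"act": "book", "slots": dict(state)}

def generate(action):
    """LG: render the system action as text (toy templates)."""
    if action["act"] == "request":
        return f"Please tell me the {action['slot'].replace('_', ' ')}."
    return "Booking {num_tickets} tickets in {city}.".format(**action["slots"])

state, replies = {}, []
for utterance in ["I want to watch in Seattle.", "2 tickets please."]:
    state = track_state(state, understand(utterance))   # LU, then ST
    replies.append(generate(policy(state)))             # policy, then LG
```

After the first utterance only the city is known, so the policy requests the number of tickets; after the second, both slots are filled and the policy books.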

4.2 Evaluation and User Simulation

Evaluation has been an important topic for dialogue systems. Different approaches have been used, including corpus-based approaches, user simulation, lab user studies, and actual user studies. We discuss the pros and cons of these methods, organizing the discussion along several desiderata of an ideal evaluation method so that the trade-offs among existing methods are easier to see.

4.2.1 Evaluation Metrics

While individual components in a dialogue system can often be optimized against more well-defined metrics such as accuracy, precision/recall, F1 and BLEU scores, evaluating a whole dialogue system requires a more holistic view and is more challenging (Walker et al., 1997, 1998; Hartikainen et al., 2004). In the reinforcement learning framework, it implies that the reward function has to take multiple aspects of dialogue quality into consideration. In practice, the reward function is often a linear combination of a subset of the following metrics.

The first class of metrics measures task success. The most common choice is perhaps task success rate, i.e., the fraction of dialogues that successfully solve the user’s problem (buying the right movie tickets, finding proper restaurants, etc.). Effectively, the reward corresponding to this metric is zero for every turn except the last, where it is high for a successful dialogue and low otherwise. Many examples are found in the literature (Walker et al., 1997; Williams, 2006; Peng et al., 2017). Other variants have also been used, such as those used to measure partial success (Singh et al., 2002; Young et al., 2016).

The second class measures the cost incurred in a dialogue, such as time elapsed. A simple yet useful example is the number of turns, which reflects the desideratum that, with everything else being equal, a more succinct dialogue is preferred. The corresponding reward is simply $-1$ per turn of conversation, although more complicated choices exist (Walker et al., 1997).
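As an illustration of how these two classes of metrics can be computed and linearly combined, the sketch below evaluates a small batch of logged dialogues. The log format, the weights `w_success` and `w_turn`, and the numbers are hypothetical.

```python
def evaluate(dialogues, w_success=1.0, w_turn=0.05):
    """dialogues: list of (success: bool, num_turns: int) pairs.
    Returns success rate, average turns, and a linear-combination score."""
    n = len(dialogues)
    success_rate = sum(s for s, _ in dialogues) / n
    avg_turns = sum(t for _, t in dialogues) / n
    score = w_success * success_rate - w_turn * avg_turns
    return success_rate, avg_turns, score

logs = [(True, 6), (False, 10), (True, 8), (True, 4)]   # made-up evaluation logs
success_rate, avg_turns, score = evaluate(logs)
```

The relative weights encode how many extra turns one is willing to trade for a given gain in success rate, which is a design decision rather than something the metrics themselves determine.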

In addition, other aspects of dialogue quality may also be encoded into the reward function, although this is a relatively under-investigated direction. In the context of chatbots (Chapter 5), coherence, diversity and personal styles have been used to result in more human-like dialogues (Li et al., 2016a, b).

4.2.2 Simulation-Based Evaluation

Typically, an RL algorithm needs to interact with a user to learn (Sec. 2.3). But running RL on either recruited users or actual users can be expensive and even risky. A natural way to get around this challenge is to build a simulated user, with which an RL algorithm can interact at virtually no cost. Essentially, a simulated user tries to mimic what a real user does in a conversation: it keeps track of the dialogue state, and converses with an RL dialogue system.

Substantial research has gone into building realistic user simulators. There are many different dimensions along which to categorize a user simulator, such as deterministic vs. stochastic, content-based vs. collaboration-based, static vs. non-static user goals during the conversations, among others. Here, we highlight two dimensions, and refer interested readers to a survey for further details on creating and evaluating user simulators (Schatzmann et al., 2006):

  • Along the granularity dimension, the user simulator can operate either at the dialogue-act level (also known as intention level), or at the utterance level (Jung et al., 2009).

  • Along the methodology dimension, the user simulator can be implemented using a rule-based approach, or a model-based approach with the model learned from real conversational corpus.

Agenda-Based Simulation.

As an example, we describe a popular hidden agenda-based user simulator developed by Schatzmann and Young (2009), as instantiated in Li et al. (2016d) and Ultes et al. (2017c). Each dialogue simulation starts with a randomly generated user goal that is unknown to the dialogue manager. In general the user goal consists of two parts: the inform-slots contain a number of slot-value pairs that serve as constraints the user wants to impose on the dialogue; the request-slots are slots whose values are initially unknown to the user and will be filled out during the conversation. For instance, Fig. 4.2 shows a user goal in a movie domain.

Figure 4.2: (Left) An example user goal in the movie-ticket-booking domain, and (Right) a dialogue between a simulated user based on the user goal and an agent (Li et al., 2016d).

Furthermore, to make the user goal more realistic, domain-specific constraints are added, so that certain slots are required to appear in the user goal. For instance, it makes sense to require a user to know the number of tickets she wants in the movie domain.

During the course of a dialogue, the simulated user maintains a stack data structure known as user agenda. Each entry in the agenda corresponds to a pending intention the user aims to achieve, and their priorities are implicitly determined by the first-in-last-out operations of the agenda stack. Therefore, the agenda provides a convenient way of encoding the history of conversation and the “state-of-mind” of the user. Simulation of a user boils down to how to maintain the agenda after each turn of the dialogue, when more information is revealed. Machine learning or expert-defined rules can be used to set parameters in the stack-update process.
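The agenda mechanism can be illustrated with a toy simulated user whose agenda is a Python list used as a stack. This is a deliberately simplified sketch: the class name `AgendaUser` and the single update rule (answer a requested constraint first) are hypothetical and far simpler than the simulators cited above.

```python
class AgendaUser:
    """Toy agenda-based simulated user; the update rule is hypothetical."""

    def __init__(self, inform_slots, request_slots):
        self.goal = dict(inform_slots)
        # Push in reverse so that the first constraint ends up on top of the stack.
        self.agenda = [("request", s) for s in reversed(request_slots)]
        self.agenda += [("inform", s) for s in reversed(list(inform_slots))]

    def respond(self, system_act):
        """Update the agenda given the system act, then pop the next user act."""
        act, slot = system_act
        if act == "request" and slot in self.goal:
            # The system asked for a constraint: move that intention to the top.
            self.agenda = [a for a in self.agenda if a != ("inform", slot)]
            self.agenda.append(("inform", slot))
        if not self.agenda:
            return ("bye", None, None)
        act, slot = self.agenda.pop()      # the top of the stack has highest priority
        return (act, slot, self.goal.get(slot))

user = AgendaUser({"movie": "Zoolander 2", "city": "Seattle"}, ["start_time"])
first = user.respond(("greeting", None))
second = user.respond(("request", "city"))
third = user.respond(("greeting", None))
fourth = user.respond(("greeting", None))
```

Here the user first informs the movie, then answers the system's request for the city, then asks for the start time, and finally ends the dialogue once the agenda is empty.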

Model-Based Simulation.

Another approach to building user simulators is entirely based on data (Eckert et al., 1997; Levin et al., 2000; Chandramohan et al., 2011). Here, we describe a recent example due to Asri et al. (2016). Similar to the agenda-based approach, the simulator also starts an episode with a randomly generated user goal and constraints. These are fixed during a conversation.

In each turn, the user model takes as input a sequence of contexts collected so far in the conversation, and outputs the next action. Specifically, the context at a turn of conversation consists of:

  • the most recent machine action,

  • inconsistency between machine information and user goal,

  • constraint status, and

  • request status.

With these contexts, an LSTM is used to output the next user utterance. In practice, combining rule-based and model-based techniques to create user simulators often works well.

Further Remarks on User Simulation.

While there has been much work on user simulation, building a human-like simulator is still a challenging task. In fact, even user simulator evaluation itself is not obvious (Pietquin and Hastie, 2013), and remains an ongoing research direction. In practice, it is often observed that dialogue policies that are overfitted to a particular user simulator may not work well when serving real humans (Dhingra et al., 2017). The gap between a user simulator and humans is the biggest limitation of user-simulation-based dialogue policy optimization.

Some user simulators are publicly available for research purposes. Other than the aforementioned agenda-based simulator of Li et al. (2016d), a much larger corpus with an evaluation environment, called AirDialogue, was recently made available (Wei et al., 2018). At the IEEE workshop on Spoken Language Technology in 2018, Microsoft is organizing a dialogue challenge of building end-to-end task-oriented dialogue systems by providing an experiment platform with built-in user simulators in several domains (Li et al., 2018).

4.2.3 Human-Based Evaluation

Due to the discrepancy between simulated users and human users, it is often necessary to test a dialogue system on human users to reliably evaluate its metrics. There are roughly two types of human users.

The first is human subjects recruited in a lab study, possibly through crowd-sourcing platforms. Typically, the participants are asked to test-use a dialogue system to achieve a given task (depending on the domain of the dialogues), so that a collection of dialogues is obtained. Metrics of interest such as task-completion rate and average turns per dialogue can be measured, as done with a simulated user. In other cases, a fraction of these subjects are asked to test-use a baseline dialogue system, so that the two can be compared against various metrics.

Many published studies involving human subjects are of the first type (Singh et al., 2002; Gašić et al., 2013; Young et al., 2016; Lipton et al., 2018; Peng et al., 2017). While this approach has benefits over simulation-based evaluation, it is rather expensive and time-consuming to get a large number of subjects that can participate for a long time. Consequently, it has the following limitations:

  • The small number of subjects prevents detection of statistically significant but small differences in metrics, often leading to inconclusive results.

  • Only a very small number of dialogue systems may be compared.

  • It is often impractical to run an RL agent that learns by interacting with these users, except in relatively simple dialogue applications.

The other type of humans for dialogue system evaluation is actual users (e.g., Black et al. (2011)). They are similar to the first type of users, except that they come with their actual tasks to be solved by conversing with the system. Consequently, metrics evaluated on them are even more reliable than those computed on recruited human subjects. Furthermore, the number of actual users can be much larger, thus resulting in more flexibility in evaluation. In this process, many online and offline evaluation techniques such as A/B-testing can be used (Hofmann et al., 2016). The major downside of experimenting with actual users is the potential risk of negative user experience.

4.2.4 Other Evaluation Techniques

Recently, researchers have started to investigate a different approach to evaluation that is inspired by the self-play technique in RL (Tesauro, 1995; Mnih et al., 2015). This technique is typically used in a two-player game (such as the game of Go), where both players are controlled by the same RL agent, possibly initialized differently. By playing the agent against itself, a large number of trajectories can be generated at relatively low cost, from which the RL agent can learn a good policy.

Self-play can be adapted to dialogue management, although the two parties involved in a dialogue often play asymmetric roles (unlike in games like Go). Shah et al. (2018) described a dialogue self-play procedure, which can generate conversations between a simulated user and the system agent. Promising results have been observed in negotiation dialogues (Lewis et al., 2017) and task-oriented dialogues (Shah et al., 2018; Wei et al., 2018). It provides an interesting solution to avoid the evaluation cost of involving human users as well as overfitting to untruthful simulated users.

In practice, it is reasonable to have a hybrid approach to evaluation. One possibility is to start with simulated users, then validate or fine-tune the dialogue policy on human users (cf., Shah et al. (2018)). Furthermore, there are more systematic approaches to using both sources of users for policy learning (see Sec. 4.5.5).

4.3 Traditional Approaches

There is a huge literature on managing (spoken) dialogue systems. A comprehensive survey is beyond the scope of this chapter. Interested readers are referred to earlier examples (Cole, 1999; Larsson and Traum, 2000; Rich et al., 2001; Bos et al., 2003; Bohus and Rudnicky, 2009), as well as excellent surveys like McTear (2002) and Young et al. (2013) for more information. Here, we review a small subset of traditional approaches that are relevant to the decision-theoretic view we take in this paper.

Levin et al. (2000) framed dialogue design as a decision optimization problem. Walker (2000) and Singh et al. (2002) are two early applications of reinforcement learning to dialogue systems. While promising, these approaches assumed that the dialogue state can only take finitely many possible values, and is fully observable. Both assumptions are often violated in real-world applications.

To handle uncertainty inherent in dialogue systems, Roy et al. (2000) and Williams and Young (2007) proposed to use Partially Observable Markov Decision Process (POMDP) as a principled mathematical framework for modeling and optimizing dialogue systems. The idea is to use observed user utterances to maintain a posterior distribution of the unobserved dialogue state. Since exact optimization in POMDPs is computationally intractable, approximation techniques are used (Roy et al., 2000; Williams and Young, 2007; Young et al., 2010; Li et al., 2009; Gašić and Young, 2014). Still, compared to the neural approaches covered in later sections, these methods require substantial domain knowledge to engineer features and design states.
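The idea of maintaining a posterior over the unobserved dialogue state can be illustrated with the standard discrete Bayesian belief update. The sketch below assumes a tiny state space with known transition and observation probabilities; the two "user goals" and all numbers are made up for illustration.

```python
import numpy as np

def belief_update(belief, transition, obs_likelihood):
    """belief: (S,) current distribution over hidden dialogue states.
    transition: (S, S) with transition[s, s2] = P(s2 | s) under the chosen action.
    obs_likelihood: (S,) with P(observed utterance | s2)."""
    predicted = belief @ transition               # predict the next-state distribution
    unnormalized = obs_likelihood * predicted     # weight by the evidence
    return unnormalized / unnormalized.sum()      # renormalize to a distribution

belief = np.array([0.5, 0.5])       # two hypothetical user goals, equally likely
transition = np.eye(2)              # assume the goal does not change between turns
obs = np.array([0.9, 0.3])          # noisy evidence favoring goal 0
posterior = belief_update(belief, transition, obs)   # approximately [0.75, 0.25]
```

With realistic state spaces the sums in this update become too large to compute exactly, which is one reason the approximation techniques cited above are needed.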

Another important limitation of traditional approaches is that each module is optimized separately. Consequently, when the system does not perform well, it can be challenging to solve the “credit assignment” problem, namely, to identify which component in the system causes undesired system response and to improve that component. Indeed, as argued by McTear (2002), “[t]he key to a successful dialogue system is the integration of these components into a working system.” The recent marriage of differentiable neural models and reinforcement learning allows a dialogue system to be optimized in an end-to-end fashion, potentially leading to higher conversation quality; see Sec. 4.6 for further details.

4.4 Natural Language Understanding and Dialogue State Tracking

NLU and state tracking are two closely related and essential components of a dialogue system, and can have a significant impact on the overall system’s performance (with evidence from the literature such as Li et al. (2017e)). This section reviews some of the standard and state-of-the-art AI approaches to NLU and state tracking.

4.4.1 Natural Language Understanding

The NLU module takes a user utterance as input, and performs three tasks: domain detection, intent determination, and slot tagging. Typically, a pipelined approach is taken, so that the three tasks are solved one after another. Accuracy, F1 score, and Area-Under-Curve (AUC) are among the most common metrics used to evaluate a model’s prediction quality. NLU is a preprocessing step for later modules in the dialogue system, and its quality has a significant impact on the system’s overall quality (Li et al., 2017d).

Among them, the first two tasks are often framed as a classification problem, which infers the domain or intent (from a predefined set of candidates) based on the current user utterance (Schapire and Singer, 2000; Yaman et al., 2008; Sarikaya et al., 2014). Neural approaches to multi-class classification have been used in the recent literature and outperformed traditional statistical methods. Ravuri and Stolcke (2015) studied the use of standard recurrent neural networks, and found them to be more effective; see Ravuri and Stolcke (2016) for further results. For short sentences where information has to be inferred from the context, Lee and Dernoncourt (2016) proposed to use recurrent and convolutional neural networks to also consider texts prior to the current utterance, and achieved better results on several benchmarks.

The more challenging task of slot tagging is often treated as sequence classification, where the classifier predicts semantic class labels for subsequences of the input utterance (Wang et al., 2005; Mesnil et al., 2013). Table 4.2 shows an ATIS (Airline Travel Information System) utterance example in the Inside-Outside-Beginning (IOB) format (Ramshaw and Marcus, 1995), where for each word the model is to predict a semantic tag.

Table 4.2: ATIS utterance example of IOB representation. Figure credit: Mesnil et al. (2015).
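As an illustration of the IOB representation, the sketch below converts word-aligned IOB tags into (slot, value) pairs. The utterance and the tag names (`fromloc`, `toloc`, `date`) are hypothetical examples chosen for readability and are not claimed to match the ATIS label set.

```python
def iob_to_slots(words, tags):
    """Collect (slot_name, value) pairs from word-aligned IOB tags."""
    slots, name, span = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if name is not None:
                slots.append((name, " ".join(span)))
            name, span = tag[2:], [word]
        elif tag.startswith("I-") and name == tag[2:]:
            # An I- tag continues the currently open span of the same slot.
            span.append(word)
        else:
            # "O" (or an inconsistent I- tag) closes the current span.
            if name is not None:
                slots.append((name, " ".join(span)))
            name, span = None, []
    if name is not None:
        slots.append((name, " ".join(span)))
    return slots


words = "show flights from boston to new york today".split()
tags = ["O", "O", "O", "B-fromloc", "O", "B-toloc", "I-toloc", "B-date"]
slots = iob_to_slots(words, tags)
```

A slot tagger predicts the `tags` sequence; a post-processing step of this kind then recovers multi-word slot values such as "new york".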

Yao et al. (2013) and Mesnil et al. (2015) applied recurrent neural networks to slot tagging, where inputs are one-hot encodings of the words in the utterance, and obtained higher accuracy than statistical baselines such as conditional random fields and support vector machines. Moreover, it is also shown that a priori word information can be effectively incorporated into basic recurrent models to yield further accuracy gains.

In many situations, the present utterance alone can be ambiguous or lack all necessary information. Contexts that include information from previous utterances are expected to help improve model accuracy. Hori et al. (2015) treated conversation history as a long sequence of words, with alternating roles (words from user vs. words from system), and proposed a variant of LSTM with role-dependent layers. Chen et al. (2016b) built on memory networks that learn which part of the contextual information should be attended to when making slot-tagging predictions. Both models achieved higher accuracy than context-free models.

Although the three NLU tasks are often studied separately, there are benefits to solving them jointly (similar to multi-task learning), and over multiple domains, so that less labeled data may be required when creating NLU models for a new domain (Hakkani-Tür et al., 2016; Liu and Lane, 2016). Another line of interesting work that can lead to substantial reduction of labeling cost in new domains is zero-shot learning, where slots from different domains are represented in a shared latent semantic space through embedding of the slots’ (text) descriptions (Bapna et al., 2017; Lee and Jha, 2018). Interested readers are referred to recent tutorials, such as Chen and Gao (2017) and Chen et al. (2017d), for more details.

4.4.2 Dialogue State Tracking

Dialogue State Tracking (DST) is a critical component in a successful dialogue system. In slot-filling problems, a dialogue state contains all information about what the user is looking for at the current turn of the conversation. This state is what the dialogue policy takes as input for deciding what action to take next (Fig. 4.1).

For example, in the restaurant domain, where a user tries to make a reservation, the dialogue state may consist of the following components (Henderson, 2015):

  • The goal constraint for every informable slot, in the form of a value assignment to that slot. The value can be “don’t care” (if the user has no preference) or “none” (if the user has not yet specified the value).

  • The subset of requested slots that the user has asked the system to inform.

  • The current dialog search method, taking values in by constraint, by alternative and finished. It encodes how the user is trying to interact with the dialog system.
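The three components listed above can be held in a small data structure. The following is a minimal sketch; the class name, field names, and example slot names (`food`, `pricerange`, `phone`) are assumptions for illustration rather than the schema of any particular tracker.

```python
from dataclasses import dataclass, field

SEARCH_METHODS = ("by constraint", "by alternative", "finished")


@dataclass
class DialogueState:
    # Informable slot -> value; a missing slot is read back as "none".
    goal_constraints: dict = field(default_factory=dict)
    # Slots the user has asked the system to inform.
    requested_slots: set = field(default_factory=set)
    # How the user is currently interacting with the system.
    search_method: str = "by constraint"

    def __post_init__(self):
        if self.search_method not in SEARCH_METHODS:
            raise ValueError(f"unknown search method: {self.search_method!r}")

    def constraint(self, slot):
        """Return the goal constraint for a slot, or "none" if unspecified."""
        return self.goal_constraints.get(slot, "none")


state = DialogueState()
state.goal_constraints["food"] = "japanese"
state.goal_constraints["pricerange"] = "don't care"
state.requested_slots.add("phone")
```

A statistical or neural tracker would typically maintain a distribution over such states rather than a single assignment; the sketch shows only the structure of one hypothesis.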

In the past, state trackers were either created by experts, or obtained from data by statistical learning algorithms like conditional random fields (Henderson, 2015). More recently, neural approaches have started to gain popularity, with applications of deep neural networks (Henderson et al., 2013) and recurrent networks (Mrkšić et al., 2015) as some of the early examples.

Figure 4.3: Neural Belief Tracker. Figure credit: Mrkšić et al. (2017).

A more recent DST model is the Neural Belief Tracker proposed by Mrkšić et al. (2017), shown in Fig. 4.3. The model takes three items as input. The first two are the last system and user utterances, each of which is first mapped to an internal vector representation. The authors studied two models for representation learning, based on multi-layer perceptrons and convolutional neural networks, both of which take advantage of pre-trained collections of word vectors and output an embedding for the input utterance. The third input is any slot-value pair that is being tracked by DST. Then, the three embeddings may interact among themselves for context modeling, to provide further contextual information from the flow of conversation, and semantic decoding, to decide if the user explicitly expressed an intent matching the input slot-value pair. Finally, the context modeling and semantic decoding vectors go through a softmax layer to produce a final prediction. The same process is repeated for all possible candidate slot-value pairs.

A different representation of dialogue states, called belief spans, is explored by Lei et al. (2018) in the Sequicity framework. A belief span consists of two fields: one for informable slots and the other for requestable slots. Each field collects values that have been found for respective slots in the conversation so far. One of the main benefits of belief spans and Sequicity is that they facilitate the use of neural sequence-to-sequence models to learn dialogue systems, which take the belief spans as input and output system responses. This greatly simplifies system design and optimization, compared to more traditional, pipelined approaches (cf. Sec. 4.3 and Sec. 4.6).

Dialogue State Tracking Challenge (DSTC) is a series of challenges that provide common testbeds and evaluation measures for dialogue state tracking. Starting with Williams et al. (2013), it has successfully attracted many research teams to focus on a wide range of technical problems in DST (Henderson et al., 2014b, a; Kim et al., 2016a, b; Hori et al., 2017). Corpora used by DSTC over the years have covered human-computer and human-human conversations, different domains such as restaurant and tourist, and cross-language learning. More information may be found on the DSTC website.

4.5 Dialogue Policy Learning

4.5.1 Deep RL for Policy Optimization

The dialogue policy may be optimized by many standard reinforcement learning algorithms. There are two ways to use RL: online and batch. The online approach requires the learner to interact with users to improve its policy; the batch approach assumes a fixed set of transitions, and optimizes the policy based on the data only, without interacting with users (Li et al., 2009; Pietquin et al., 2011). In this chapter, we focus more on the online setting which often has batch learning as an internal step. Many covered topics can be useful to the batch setting.

Here, we use the DQN as an example, following Lipton et al. (2018), to illustrate the basic workflow. The use of alternative algorithms such as policy gradient is found in the literature, including many covered in this section. Even in the DQN family of solutions, many variants exist. A recent example uses a graph neural network to model the Q-function, with nodes in the graph corresponding to slots of the domain (Chen et al., 2018). The nodes may share some of the parameters, therefore increasing learning speed. Another example is described in further detail in Sec. 4.5.2.

Model Architecture.

The DQN’s input is an encoding of the current dialogue state. One option is to encode it as a feature vector, consisting of the following: (1) one-hot representations of the dialogue act and slot corresponding to the last user action; (2) the same one-hot representations of the dialogue act and slot corresponding to the last system action; (3) a bag of slots corresponding to all previously filled slots in the conversation so far; (4) the current turn count; and (5) the number of results from the knowledge base that match the already filled-in constraints for informed slots.

DQN outputs a vector, whose entries correspond to all possible (dialogue-act, slot) pairs that can be chosen by the dialogue system. Available prior knowledge can be used to reduce the number of outputs, if some (dialogue-act, slot) pairs do not make sense for a system, such as request(price).
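The five-part input encoding described above can be sketched as follows. The dialogue-act and slot vocabularies, and the scaling of the turn count and knowledge-base match count to the unit interval, are assumptions made for this illustration; the text does not prescribe them.

```python
DIALOGUE_ACTS = ["inform", "request", "confirm", "bye"]   # hypothetical vocabulary
SLOTS = ["food", "area", "price"]                          # hypothetical vocabulary


def one_hot(item, vocabulary):
    """One-hot vector for item; all zeros if item is not in the vocabulary."""
    return [1.0 if item == v else 0.0 for v in vocabulary]


def encode_state(last_user, last_system, filled_slots, turn, kb_matches,
                 max_turns=20, max_matches=100):
    """Build the DQN input vector from the five feature groups."""
    user_act, user_slot = last_user
    sys_act, sys_slot = last_system
    features = []
    # (1) last user action: dialogue act and slot
    features += one_hot(user_act, DIALOGUE_ACTS) + one_hot(user_slot, SLOTS)
    # (2) last system action: dialogue act and slot
    features += one_hot(sys_act, DIALOGUE_ACTS) + one_hot(sys_slot, SLOTS)
    # (3) bag of slots filled so far
    features += [1.0 if s in filled_slots else 0.0 for s in SLOTS]
    # (4) turn count, clipped and scaled (an assumption, for numerical stability)
    features.append(min(turn, max_turns) / max_turns)
    # (5) number of matching KB results, clipped and scaled (also an assumption)
    features.append(min(kb_matches, max_matches) / max_matches)
    return features


x = encode_state(("inform", "food"), ("request", "area"), {"food"},
                 turn=3, kb_matches=12)
```

With these vocabularies the vector has 4 + 3 + 4 + 3 + 3 + 1 + 1 = 19 entries, and the network would output one Q-value per allowed (dialogue-act, slot) pair.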

Warm-Start Policy.

Learning the policy from scratch is often slow, but can be significantly sped up by initializing it to be a reasonably good policy before online interaction with (simulated) users starts. A popular approach is to use imitation learning to mimic an expert-provided policy (Li et al., 2014; Dhingra et al., 2017). Lipton et al. (2018) proposed a simpler yet effective alternative of Replay Buffer Spiking (RBS) that is particularly suited to DQN. The idea is to pre-fill the experience replay buffer of DQN with a small number of dialogues generated by running a naïve yet occasionally successful, rule-based agent. This technique is shown to be essential for DQN in simulated studies.
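The replay-buffer pre-filling idea can be sketched in a few lines. Everything about the environment here is a hypothetical stand-in: `toy_run_dialogue` replaces a real user simulator, the state is simply the number of filled slots, and the reward values are arbitrary. Only the pattern of filling the buffer before learning starts reflects the technique described above.

```python
import random
from collections import deque


def toy_run_dialogue(agent, rng, max_turns=6):
    """Hypothetical stand-in for a user simulator.

    Returns a list of (state, action, reward, next_state, done) transitions."""
    state, transitions = 0, []
    for turn in range(max_turns):
        action = agent(state)
        if action == "close":
            reward = 1.0 if state == 3 else -1.0
            transitions.append((state, action, reward, state, True))
            break
        # A request fills one more slot with probability 0.7.
        next_state = min(3, state + (1 if rng.random() < 0.7 else 0))
        done = turn == max_turns - 1
        transitions.append((state, action, -0.1, next_state, done))
        state = next_state
    return transitions


def rule_based_agent(state):
    """Naive hand-written policy: request until all slots are filled."""
    return "close" if state == 3 else "request"


def spike_replay_buffer(buffer, agent, rng, num_dialogues):
    """Pre-fill the buffer with transitions from the rule-based agent."""
    for _ in range(num_dialogues):
        buffer.extend(toy_run_dialogue(agent, rng))
    return buffer


replay_buffer = spike_replay_buffer(deque(maxlen=10_000), rule_based_agent,
                                    random.Random(0), num_dialogues=50)
```

The learner would then start sampling mini-batches from `replay_buffer` immediately, so that early updates already see some rewarded dialogues.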

Online Policy Learning.

Standard back-propagation on mini-batches can be used to update parameters as in the two-network approach (Sec. 2.3). The learner may also use simple heuristics such as $\epsilon$-greedy or Boltzmann exploration to select actions; see Sec. 4.5.2 for further discussions on the topic of exploration.
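The two exploration heuristics named above can be sketched as follows, given a list of Q-values for the available actions. The function names and the `temperature` parameter are choices made for this illustration.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a higher temperature gives more uniform exploration."""
    m = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def boltzmann(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


rng = random.Random(0)
q = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)
probs = boltzmann_probs(q, temperature=0.5)
sampled_action = boltzmann(q, temperature=0.5, rng=rng)
```

Both heuristics explore without regard to how uncertain the Q-estimates are, which is the limitation that motivates the more targeted strategies discussed next.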

4.5.2 Efficient Exploration and Domain Extension

Without the help of a teacher, an RL agent learns from data collected by interacting with an initially unknown environment. In general, the agent has to try new actions in novel states, in order to discover potentially better policies. Hence, it has to strike a good trade-off between exploitation (choosing good actions to maximize reward) and exploration (choosing novel actions to discover potentially better alternatives), leading to the need for efficient exploration (Sutton and Barto, 2018). In the context of dialogue policy learning, the implication is that the policy learner actively tries new ways to converse with a user, in the hope of discovering a better policy in the long run.

While exploration in finite-state RL is relatively well-understood (Strehl et al., 2009; Jaksch et al., 2010; Osband and Roy, 2017; Dann et al., 2017), exploration when deep models are used is an active research topic (Bellemare et al., 2016; Osband et al., 2016; Houthooft et al., 2016; Jiang et al., 2017). Here, we describe a general-purpose exploration strategy that is particularly suited for dialogue systems that may change over time.

After a task-oriented dialogue system is deployed to serve users, there may be a need over time to add intents and/or slots to make the system more versatile. This problem, referred to as “domain extension” (Gašić et al., 2014), makes efficient exploration even more challenging: the agent needs to explicitly quantify the uncertainty in parameters for intents/slots, so as to explore new ones more aggressively while avoiding exploring already learned ones. Lipton et al. (2018) approached the problem using a Bayesian-By-Backprop (BBQ) variant of DQN.

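Their model, called BBQ, is identical to DQN, except that it maintains an approximate posterior distribution $q$ over the network weights $\mathbf{w} = (w_1, w_2, \ldots, w_d)$. For computational convenience, $q$ is a multivariate Gaussian distribution with diagonal covariance, parameterized by $\theta = \{(\mu_i, \rho_i)\}_{i=1}^{d}$, where weight $w_i$ has a Gaussian posterior distribution, $\mathcal{N}(\mu_i, \sigma_i^2)$, and $\sigma_i = \log(1 + \exp(\rho_i))$. The posterior information leads to a natural exploration strategy, inspired by Thompson Sampling (Thompson, 1933; Chapelle and Li, 2012; Russo et al., 2018). When selecting actions, the agent simply draws a random weight $\tilde{\mathbf{w}} \sim q$, and then selects the action with the highest value output by the network. Experiments show that BBQ explores more efficiently than state-of-the-art baselines for dialogue domain extension.

The BBQ model is updated as follows. Given observed transitions $\mathcal{T} = \{(s, a, r, s')\}$, we used the target network (see Sec. 2.3) to compute the target values $y$ for each $(s, a)$ in $\mathcal{T}$, resulting in the set $\mathcal{D} = \{(x, y)\}$, where $x = (s, a)$ and $y$ may be computed as in DQN. Then, we learn $\theta$ by minimizing the variational free energy (Hinton and Van Camp, 1993), the KL-divergence between the variational approximation $q(\mathbf{w}|\theta)$ and the posterior $p(\mathbf{w}|\mathcal{D})$:

$$\theta^* = \arg\min_{\theta} \mathrm{KL}\big[q(\mathbf{w}|\theta) \,\|\, p(\mathbf{w}|\mathcal{D})\big] = \arg\min_{\theta} \Big\{ \mathrm{KL}\big[q(\mathbf{w}|\theta) \,\|\, p(\mathbf{w})\big] - \mathbb{E}_{q(\mathbf{w}|\theta)}\big[\log p(\mathcal{D}|\mathbf{w})\big] \Big\}.$$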
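The Thompson-sampling style action selection used by BBQ can be sketched as follows. Two simplifications are assumed here and are not part of the original model: a linear Q-function stands in for the deep network, and the standard deviation of each weight is obtained from a parameter `rho` through the softplus function, as is common in Bayes-by-Backprop implementations.

```python
import math
import random


def softplus(rho):
    """sigma = log(1 + exp(rho)), which keeps the standard deviation positive."""
    return math.log1p(math.exp(rho))


def sample_weights(mu, rho, rng):
    """Draw one weight vector from the diagonal Gaussian posterior."""
    return [rng.gauss(m, softplus(r)) for m, r in zip(mu, rho)]


def linear_q(weights, features, num_actions):
    """Hypothetical stand-in for the Q-network: one weight row per action."""
    d = len(features)
    return [sum(weights[a * d + j] * features[j] for j in range(d))
            for a in range(num_actions)]


def thompson_action(mu, rho, features, num_actions, rng):
    """Sample weights once, then act greedily with respect to the sampled Q-values."""
    w = sample_weights(mu, rho, rng)
    q = linear_q(w, features, num_actions)
    return max(range(num_actions), key=q.__getitem__)


rng = random.Random(0)
mu = [1.0, 0.0, 0.0, 1.0]          # two actions, two features
confident_rho = [-50.0] * 4        # near-zero variance: behaves almost greedily
action = thompson_action(mu, confident_rho, [0.0, 1.0], num_actions=2, rng=rng)
```

Weights attached to newly added intents or slots would start with a large `rho`, so the sampled Q-values for the corresponding actions vary widely and those actions are tried more often.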

4.5.3 Composite-Task Dialogues

In many real-world problems, a task may consist of a set of subtasks that need to be solved collectively. Similarly, dialogues can often be decomposed into a sequence of related subdialogues, each of which focuses on a subtopic (Litman and Allen, 1987). Consider for example a travel planning dialogue system, which needs to book flights, hotels and car rental in a collective way so as to satisfy certain cross-subtask constraints known as slot constraints (Peng et al., 2017). Slot constraints are application specific. In a travel planning problem, one natural constraint is that the outbound flight’s arrival time should be earlier than the hotel check-in time.

Complex tasks with slot constraints are referred to as composite tasks by Peng et al. (2017). Optimizing the dialogue policy for a composite task is challenging for two reasons. First, the policy has to handle many slots, as each subtask often corresponds to a domain with its own set of slots, and the slots of a composite task consist of slots from all subtasks. Furthermore, because of slot constraints, these subtasks cannot be solved independently. Therefore, the state space considered by a composite task is much larger. Second, a composite-task dialogue often requires many more turns to complete. Typical reward functions give a success-or-not reward only at the end of the whole dialogue. As a result, this reward signal is very sparse and considerably delayed, making policy optimization even harder.

Cuayáhuitl et al. (2010) proposed to use hierarchical reinforcement learning to optimize a composite task’s dialogue policy, with tabular versions of the MAXQ (Dietterich, 2000) and Hierarchical Abstract Machine (Parr and Russell, 1998) approaches. While promising, their solutions assume finite states, so do not scale well to large conversational problems.

More recently, Peng et al. (2017) tackled the composite-task dialogue policy learning problem under the more general options framework (Sutton et al., 1999b), where the task hierarchy has two levels. As illustrated in Fig. 4.4, a top-level policy $\pi_g$ selects which subtask $g$ to solve, and a low-level policy $\pi_{a,g}$ solves the subtask specified by $\pi_g$. Assuming predefined subtasks, they extend the DQN model, which results in substantially faster learning and superior policies. A similar approach is taken by Budzianowski et al. (2017), who used Gaussian process RL instead of deep RL for policy learning.
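The two-level decision structure can be sketched with lookup tables in place of the deep Q-networks. The state name, subtask names, action names, and Q-values below are all hypothetical; the sketch only shows how a top-level choice of subtask conditions the low-level choice of primitive action.

```python
def greedy(q_row):
    """Return the key with the highest Q-value in a {choice: value} mapping."""
    return max(q_row, key=q_row.get)


def hierarchical_step(state, top_q, low_q, active_subtask=None):
    """Pick a subtask if none is active, then a primitive action for that subtask."""
    if active_subtask is None:
        active_subtask = greedy(top_q[state])           # top-level policy
    action = greedy(low_q[active_subtask][state])        # low-level policy
    return active_subtask, action


# Hypothetical Q-tables for a travel-planning system.
top_q = {"start": {"book_flight": 0.9, "book_hotel": 0.4}}
low_q = {
    "book_flight": {"start": {"request_date": 0.7, "request_city": 0.2}},
    "book_hotel": {"start": {"request_checkin": 0.6, "request_city": 0.5}},
}
subtask, action = hierarchical_step("start", top_q, low_q)
forced = hierarchical_step("start", top_q, low_q, active_subtask="book_hotel")
```

In a full system the active subtask would stay fixed until it terminates, at which point the top-level policy is consulted again.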

A major assumption in options/subgoal-based hierarchical reinforcement learning is the need for reasonable options and subgoals. Tang et al. (2018) considered the problem of discovering subgoals from dialogue demonstrations. Inspired by a sequence segmentation approach that is successfully applied to machine translation (Wang et al., 2017a), the authors developed the Subgoal Discovery Network (SDN), which learns to identify “bottleneck” states in successful dialogues. It is shown that the hierarchical DQN optimized with subgoals discovered by SDN is competitive with expert-designed subgoals.

Finally, another interesting attempt is made by Casanueva et al. (2018) based on Feudal Reinforcement Learning (FRL) (Dayan and Hinton, 1993). In contrast to the above methods that decompose a task into temporally separated subtasks, FRL decomposes a complex decision spatially. In each turn of a dialogue, the feudal policy first decides between information-gathering actions and information-providing actions, then a primitive action is chosen conditioned on this high-level decision.

Figure 4.4: A two-level hierarchical dialogue policy. Figure credit: Peng et al. (2017).

4.5.4 Multi-Domain Dialogues

A multi-domain dialogue system can converse with a user in a conversation that may involve more than one domain (Komatani et al., 2006; Hakkani-Tür et al., 2012; Wang et al., 2014). Table 4.3 shows an example, where the dialogue covers both the hotel and restaurant domains, in addition to a special meta domain.

| Domain | Agent | Utterance |
|---|---|---|
| meta | system | “Hi! How can I help you?” |
| | user | “I’m looking for a hotel in Seattle on January 2nd for 2 nights.” |
| hotel | system | “A hotel for 2 nights in Seattle on January 2nd?” |
| | user | “Yes.” |
| | system | “I found Hilton Seattle.” |
| meta | system | “Anything else I can help with?” |
| | user | “I’m looking for cheap Japanese food in the downtown.” |
| restaurant | system | “Did you say Chinese food?” |
| | user | “Yes.” |
| | system | “I found the following results.” |
Table 4.3: An example of multi-domain dialogue, adapted from Cuayáhuitl et al. (2016).

Different from composite tasks, subdialogues corresponding to different domains in a conversation are separate tasks, without cross-task slot constraints. Similar to composite-task systems, a multi-domain dialogue system needs to keep track of a much larger dialogue state space that has slots from all domains, so directly applying RL can be inefficient. It therefore raises the need to learn re-usable policies whose parameters can be shared across multiple domains as long as they are related.

Gašić et al. (2015) proposed to use a Bayesian Committee Machine (BCM) for efficient multi-domain policy learning. During training time, a number of policies are trained on different, potentially small, datasets. The authors used Gaussian process RL algorithms to optimize those policies, although they can be replaced by deep learning alternatives. During test time, in each turn of a dialogue, these policies recommend an action, and all recommendations are aggregated into a final action to be taken by the BCM policy.
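One way to picture the aggregation step is the standard Bayesian Committee Machine rule for combining Gaussian estimates. The sketch below assumes that each committee member reports a Gaussian estimate (mean and variance) of the same Q-value and that the prior has zero mean; these assumptions, the function name, and the numbers are illustrative and are not a description of the authors' implementation.

```python
def bcm_combine(means, variances, prior_variance):
    """Combine M Gaussian estimates of one quantity with the BCM rule.

    Precisions add up, and the prior precision is subtracted (M - 1) times
    so that the shared prior is not counted once per committee member."""
    m = len(means)
    precision = sum(1.0 / v for v in variances) - (m - 1) / prior_variance
    if precision <= 0.0:
        raise ValueError("combined precision must be positive")
    mean = sum(mu / v for mu, v in zip(means, variances)) / precision
    return mean, 1.0 / precision


# Two hypothetical domain policies estimate the Q-value of the same action.
q_mean, q_var = bcm_combine(means=[1.0, 3.0], variances=[0.5, 0.5],
                            prior_variance=1.0)
```

A committee member that is confident about a domain (small variance) dominates the combined estimate, while members unfamiliar with that domain contribute little.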

Cuayáhuitl et al. (2016) developed another related technique known as NDQN—Network of DQNs, where each DQN is trained for a specialized skill to converse in a particular subdialogue. A meta-policy controls how to switch between these DQNs, and can also be optimized using (deep) reinforcement learning.

4.5.5 Integration of Planning and Learning

Figure 4.5: Three strategies for optimizing dialogue policies based on reinforcement learning. Figure credit: Peng et al. (2018).

As mentioned in Sec. 4.2, optimizing the policy of a task-oriented dialogue against humans is costly, since it requires many interactions between the dialogue system and humans. Simulated users provide an inexpensive alternative, but may not be a sufficiently truthful approximation of human users. These two approaches correspond to the left two panels in Fig. 4.5.

The trade-off between learning from real users and learning from simulated users is in fact a common phenomenon in reinforcement learning.

Here, we are concerned with the use of a user model to generate more data to improve sample complexity in optimizing a dialogue system. Inspired by the Dyna-Q framework (Sutton, 1990), Peng et al. (2018) proposed Deep Dyna-Q (DDQ) to handle large-scale problems with deep learning models, as shown by the right panel of Fig. 4.5. Intuitively, DDQ allows interactions with both human users and simulated users. Training of DDQ consists of three parts:

  • direct reinforcement learning: the dialogue system interacts with a real user, collects real dialogues and improves the policy by either imitation learning or reinforcement learning;

  • world model learning: the world model (i.e., user simulator) is refined using real dialogues collected by direct reinforcement learning;

  • planning: the dialogue policy is improved against simulated users by reinforcement learning.
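The three training parts can be sketched with a tabular Dyna-Q step. This is a deliberately simplified stand-in: DDQ uses deep networks for both the policy and the world model, whereas the sketch uses a Q-table and a deterministic lookup-table model. State names, action names, and hyperparameters are hypothetical.

```python
import random
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning update from a single transition."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def dyna_q_step(Q, model, transition, actions, rng, planning_steps=5):
    s, a, r, s_next = transition
    # Direct reinforcement learning: learn from the real transition.
    q_update(Q, s, a, r, s_next, actions)
    # World model learning: remember what the real user did.
    model[(s, a)] = (r, s_next)
    # Planning: replay transitions generated by the learned model.
    for _ in range(planning_steps):
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        q_update(Q, ps, pa, pr, ps_next, actions)


actions = ["ask", "close"]
Q, model = defaultdict(float), {}
dyna_q_step(Q, model, ("s0", "ask", 1.0, "s1"), actions,
            random.Random(0), planning_steps=2)
```

Each real transition is reused several times through the model, which is the source of the improved sample efficiency with respect to real user interactions.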

Human-in-the-loop experiments show that DDQ is able to efficiently improve the dialogue policy by interacting with real users, which is important for deploying dialogue systems in practice.

4.5.6 Reward Function Learning

The dialogue policy is often optimized to maximize long-term reward when interacting with users. The reward function is therefore critical to creating high-quality dialogue systems. One possibility is to have users provide feedback during or at the end of a conversation to rate the quality, but feedback like this is intrusive and costly. Often, easier-to-measure quantities such as time-elapsed are used to compute a reward function. Unfortunately, in practice, designing an appropriate reward function is not always obvious, and substantial domain knowledge is needed (Sec. 4.1). This inspires the use of machine learning to find a good reward function from data (Asri et al., 2012) which can better correlate with user satisfaction (Rieser and Lemon, 2011), or is more consistent with expert demonstrations (Li et al., 2014).

Su et al. (2015) proposed to rate dialogue success with two neural network models, a recurrent and a convolutional network. Their approach is found to result in competitive dialogue policies, when compared to a baseline that uses prior knowledge of user goals. However, these models assume the availability of labeled data in the form of (dialogue, success-or-not) pairs, in which the success-or-not feedback provided by users can be expensive to obtain. To reduce the labeling cost, Su et al. (2016) investigated an active learning approach based on Gaussian processes, which aims to learn the reward function and policy at the same time while interacting with human users. More discussions and results are provided by the authors’ follow-up work (Su et al., 2018).

Ultes et al. (2017a) argued that dialogue success only measures one aspect of the dialogue policy’s quality. Focusing on information-seeking tasks, the authors proposed a new reward estimator based on interaction quality that balances multiple aspects of the dialogue policy. Later on, Ultes et al. (2017b) used multi-objective RL to automatically learn how to linearly combine multiple metrics of interest in the definition of reward function.

4.6 End-to-End Learning

One of the benefits of neural models is that they are often differentiable and can be optimized by gradient-based methods like back-propagation (Goodfellow et al., 2016). In addition to language understanding, state tracking and policy learning that have been covered in previous sections, speech recognition & synthesis (for spoken dialogue systems) and language generation may be learned by neural models and back-propagation to achieve state-of-the-art performance (Hinton et al., 2012; van den Oord et al., 2016; Wen et al., 2015). In the extreme, if all components in a task-oriented dialogue system (Fig. 4.1) are differentiable, the whole system becomes a larger differentiable system that can be optimized by back-propagation. This is a potential advantage compared to traditional approaches that optimize individual components separately (Sec. 4.3).

There are two frameworks to build an end-to-end dialogue system. The first is based on supervised learning, where desired system responses are first collected and then used to train multiple components of a dialogue system in order to maximize prediction accuracy (Bordes et al., 2017; Wen et al., 2017; Yang et al., 2017b; Eric et al., 2017). Wen et al. (2017) introduced a modular neural dialogue system, where most modules are represented by a neural network. However, their approach relies on non-differentiable knowledge-base lookup operators, so training of the components is done separately in a supervised manner. This challenge is addressed by Dhingra et al. (2017) who proposed “soft” knowledge-base lookups; see Sec. 3.5 for more details. Bordes et al. (2017) treated dialogue system learning as the problem of learning a mapping from dialogue histories to system responses. They show memory networks and supervised embedding models outperform standard baselines on a number of simulated dialogue tasks. Finally, Eric et al. (2017) proposed an end-to-end trainable Key-Value Retrieval Network, which is equipped with an attention-based key-value retrieval mechanism over entries of a KB, and can learn to extract relevant information from the KB.

While supervised learning methods can produce promising results, they require training data that may be expensive to obtain. Furthermore, this approach does not allow a dialogue system to explore different policies that can potentially be better than expert policies that produce responses for supervised training. This inspires another line of work that uses reinforcement learning to optimize end-to-end dialogue systems (Zhao and Eskénazi, 2016; Williams and Zweig, 2016; Dhingra et al., 2017; Li et al., 2017d).

Zhao and Eskénazi (2016) proposed a model that takes a user utterance as input and outputs a semantic system action. Their model is a recurrent variant of DQN based on LSTM, which learns to compress the user utterance sequence to infer an internal state of the dialogue. Compared to classic approaches, this method is able to jointly optimize the policy as well as language understanding and state tracking beyond standard supervised learning.

Another approach, taken by Williams et al. (2017), is to use LSTM to avoid the tedious step of state tracking engineering, and jointly optimize the state tracker and the policy. Their model, called Hybrid Code Networks (HCN), also makes it easy for engineers to incorporate business rules and other prior knowledge via software and action templates. They show that HCN can be trained end-to-end and demonstrate much faster learning than several end-to-end techniques.

4.7 Concluding Remarks

In this chapter, we have surveyed some traditional as well as the more recent, neural approaches to optimizing task-oriented dialogue systems. This is a new area with exciting research opportunities. Here, we briefly describe a few of them.

Evaluation remains a major research challenge. Although user simulation can be useful (Schatzmann and Young, 2009; Li et al., 2016d; Wei et al., 2018), a more appealing solution is to use real human-human conversation corpora for evaluation. Unfortunately, this problem, known as off-policy evaluation in the RL literature, is challenging with numerous current research efforts (Precup et al., 2000; Jiang and Li, 2016; Thomas and Brunskill, 2016; Liu et al., 2018b). It is expected that off-policy techniques can find important use in evaluating and optimizing dialogue systems.

Another related line of research is deep reinforcement learning applied to text games (Narasimhan et al., 2015), which is in many ways similar to a conversation, except that the scenarios are predefined by the game designers. Recent advances for solving text games, such as handling natural-language actions (Narasimhan et al., 2015; He et al., 2016; Côté et al., 2018) and interpretable policies (Chen et al., 2017c) may find similar use in the case of dialogues.

5.1 End-to-End Conversation Models

Most of the earliest end-to-end (E2E) conversation models are inspired by statistical machine translation (SMT) (Koehn et al., 2003; Och and Ney, 2004), including neural machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Bahdanau et al., 2015). The casting of the conversational response generation task (i.e., predict a response $T_i$ based on the previous dialogue turn $T_{i-1}$) as an SMT problem is a relatively natural one, as one can treat turn $T_{i-1}$ as the “foreign sentence” and turn $T_i$ as its “translation”. This means one can apply any off-the-shelf SMT algorithm to a conversational dataset to build a response generation system. This was the idea originally proposed in one of the first works on fully data-driven conversational AI (Ritter et al., 2011), which applied a phrase-based translation approach (Koehn et al., 2003) to dialogue datasets extracted from Twitter (Serban et al., 2015). A different E2E approach was proposed by Jafarpour et al. (2010), but it relied on IR-based methods rather than machine translation.

While these two papers constituted a paradigm shift compared to earlier work in dialogue, they had several limitations. Their most significant limitation is their representation of the data as (query, response) pairs, which hinders their ability to generate responses that are contextually appropriate. This is a serious limitation as dialogue turns in chitchat are often short (e.g., a few-word utterance such as “really?”), in which case conversational models critically need longer contexts to produce plausible responses. This limitation motivated the work of Sordoni et al. (2015b), which proposed an RNN-based approach to conversational response generation (similar to Fig. 2.2) that exploited longer context. Together with the contemporaneous works (Shang et al., 2015; Vinyals and Le, 2015), these papers represented the first neural approaches to fully E2E conversation modeling. While these three papers have some distinct properties, they are all based on recurrent (RNN) architectures, which nowadays are often modeled with a Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014).

5.1.1 The LSTM Model

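We give an overview here of LSTM-based response generation, as LSTM is arguably the most popular seq2seq model, though alternative models such as GRU (Cho et al., 2014b) are often as effective. LSTM is an extension of the RNN model represented in Fig. 2.2, and is often more effective at exploiting long-term context. An LSTM-based response generation system is usually modeled as follows (Vinyals and Le, 2015; Li et al., 2016a): Given a dialogue history represented as a sequence of words $S = \{s_1, s_2, \ldots, s_{N_s}\}$ ($S$ here stands for source), the LSTM associates each time step $k$ with input, memory, and output gates, denoted respectively as $i_k$, $f_k$ and $o_k$. $N_s$ is the number of words in the source $S$. (Note: the notation distinguishes $e$ and $h$, where $e_k$ is the embedding vector for an individual word at time step $k$, and $h_k$ is the vector computed by the LSTM model at time $k$ by combining $e_k$ and $h_{k-1}$. $c_k$ is the cell state vector at time $k$, and $\sigma$ represents the sigmoid function.)

Then, the hidden state $h_k$ of the LSTM for each time step $k$ is computed as follows:

$$
\begin{aligned}
i_k &= \sigma(W_i [h_{k-1}, e_k]) \\
f_k &= \sigma(W_f [h_{k-1}, e_k]) \\
o_k &= \sigma(W_o [h_{k-1}, e_k]) \\
l_k &= \tanh(W_l [h_{k-1}, e_k]) \\
c_k &= f_k \circ c_{k-1} + i_k \circ l_k \\
h_k &= o_k \circ \tanh(c_k)
\end{aligned}
$$

where matrices $W_i$, $W_f$, $W_o$, $W_l$ belong to $\mathbb{R}^{d \times 2d}$, and $\circ$ denotes the element-wise product. As it is a response generation task, each conversational context $S$ is paired with a sequence of output words to predict: $T = \{t_1, t_2, \ldots, t_{N_t}\}$. (Note: $N_t$ is the length of the response and $t$ represents a word token that is associated with a $d$-dimensional word embedding $e_t$, distinct from the source.) The LSTM model defines the probability of the next token to predict using the softmax function:

$$
p(T \mid S) = \prod_{k=1}^{N_t} p(t_k \mid s_1, \ldots, s_{N_s}, t_1, \ldots, t_{k-1}) = \prod_{k=1}^{N_t} \frac{\exp\big(f(h_{k-1}, e_{t_k})\big)}{\sum_{t'} \exp\big(f(h_{k-1}, e_{t'})\big)}
$$

(Note: $f(h_{k-1}, e_{t_k})$ is the activation function between $h_{k-1}$ and $e_{t_k}$, where $h_{k-1}$ is the output hidden vector at time $k-1$.)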
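As an illustration, the gated recurrence and the softmax over next tokens described in this section can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the cited papers: the helper names (`lstm_step`, `next_token_probs`, `init_params`), the random initialization, the toy dimension `d = 4`, and the use of a dot product as the scoring function between the hidden state and a target embedding are all choices made here for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(params, h_prev, c_prev, e_k):
    """One LSTM time step; every gate reads the concatenation [h_{k-1}, e_k]."""
    x = h_prev + e_k  # list concatenation, length 2d
    i = [sigmoid(a) for a in matvec(params["W_i"], x)]    # input gate
    f = [sigmoid(a) for a in matvec(params["W_f"], x)]    # memory (forget) gate
    o = [sigmoid(a) for a in matvec(params["W_o"], x)]    # output gate
    l = [math.tanh(a) for a in matvec(params["W_l"], x)]  # candidate update
    # New cell state: keep part of the old state, add part of the candidate.
    c = [fj * cj + ij * lj for fj, cj, ij, lj in zip(f, c_prev, i, l)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [v / z for v in exps]

def next_token_probs(h, target_embeddings):
    """Assumption: score each candidate word by a dot product with h."""
    scores = [sum(a * b for a, b in zip(h, e)) for e in target_embeddings]
    return softmax(scores)

def init_params(d, seed=0):
    """Hypothetical initializer: four d x 2d matrices with small random weights."""
    rng = random.Random(seed)
    return {name: [[rng.uniform(-0.1, 0.1) for _ in range(2 * d)] for _ in range(d)]
            for name in ("W_i", "W_f", "W_o", "W_l")}

d = 4
params = init_params(d)
rng = random.Random(1)
source = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]  # 3 source word embeddings
vocab = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]   # 5 target word embeddings

# Encode the source left to right, then score every candidate first response word.
h, c = [0.0] * d, [0.0] * d
for e_k in source:
    h, c = lstm_step(params, h, c, e_k)
probs = next_token_probs(h, vocab)
```

In a full system the decoder would continue the same recurrence over the response tokens, feeding the embedding of each generated word back in as the next input.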

5.1.2 The HRED Model

While the LSTM model has been shown to be effective in encoding textual contexts up to 500 words (Khandelwal et al., 2018), dialogue histories can often be long and there is sometimes a need to exploit longer-term context. Hierarchical models were designed to address this limitation by capturing longer context (Yao et al., 2015; Serban et al., 2016, 2017; Xing et al., 2018). One popular approach is the Hierarchical Recurrent Encoder-Decoder (HRED) model, originally proposed in (Sordoni et al., 2015a) and applied to response generation in (Serban et al., 2016).

Figure 5.1: (a) Recurrent architecture used by models such as RNN, GRU, LSTM, etc. (b) Two-level hierarchy representative of HRED. Note: To simplify the notation, the figure represents utterances of length 3.

The HRED architecture is depicted in Fig. 5.1, where it is compared to the standard RNN architecture. HRED models dialogue using a two-level hierarchy that combines two RNNs: one at a word level and one at the dialogue turn level. This architecture models the fact that dialogue history consists of a sequence of turns, each consisting of a sequence of tokens. This model introduces a temporal structure that makes the hidden state of the current dialogue turn directly dependent on the hidden state of the previous dialogue turn, effectively allowing information to flow over longer time spans, and helping reduce the vanishing gradient problem (Hochreiter, 1991), a problem that limits RNN’s (including LSTM’s) ability to model very long word sequences. Note that, in this particular work, RNN hidden states are implemented using GRU (Cho et al., 2014b) instead of LSTM. The HRED model was later extended with the VHRED model (Serban et al., 2017) (further discussed in Sec. 5.2), which adds a latent variable to the target to address a different issue.
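The two-level hierarchy can be made concrete with a small sketch of the encoder side. To keep it short, the sketch assumes a plain tanh RNN cell as a stand-in for the GRU cell used in the cited work, and all names (`rnn_step`, `encode_sequence`, `hred_context`), weights, and dimensions are hypothetical. It only shows how a word-level RNN summarizes each turn and how a turn-level RNN then runs over those summaries; the decoder that would be conditioned on the resulting context is omitted.

```python
import math
import random

def rnn_step(W, h_prev, x):
    """Plain tanh RNN cell, h = tanh(W [h_prev, x]); a stand-in for a GRU cell."""
    v = h_prev + x  # list concatenation
    return [math.tanh(sum(w * a for w, a in zip(row, v))) for row in W]

def encode_sequence(W, vectors, d):
    """Run the cell over a sequence and return the final hidden state."""
    h = [0.0] * d
    for x in vectors:
        h = rnn_step(W, h, x)
    return h

def hred_context(W_word, W_turn, dialogue, d):
    """Word-level RNN per turn, then a turn-level RNN over the turn vectors."""
    turn_vectors = [encode_sequence(W_word, turn, d) for turn in dialogue]
    return encode_sequence(W_turn, turn_vectors, d)

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

d = 4
rng = random.Random(0)
W_word = random_matrix(rng, d, 2 * d)  # word-level weights
W_turn = random_matrix(rng, d, 2 * d)  # turn-level weights

# Three turns of three word embeddings each (utterances of length 3, as in Fig. 5.1).
dialogue = [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
            for _ in range(3)]
context = hred_context(W_word, W_turn, dialogue, d)
```

With three turns of three words, information from the first turn reaches the final context in three turn-level steps rather than nine word-level steps, which is the shortening of the dependency path that the hierarchy is meant to provide.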

5.1.3 Attention models

The seq2seq framework has been tremendously successful in text generation tasks such as machine translation, but its encoding of the entire source sequence into a fixed-size vector has certain limitations, especially when dealing with long source sequences. Attention-based models (Bahdanau et al., 2015) alleviate this limitation by allowing the model to search and condition on parts of a source sentence that are relevant to predicting the next target word, thus moving away from a framework that represents the entire source sequence merely as a single fixed-size vector. While attention models and variants (Bahdanau et al., 2015; Luong et al., 2015, etc.) have contributed to significant progress in the state of the art in translation (Wu et al., 2016) and are very commonly used in neural machine translation nowadays, attention models have been somewhat less effective in E2E dialogue modeling. This can probably be explained by the fact that attention models effectively attempt to “jointly translate and align” (Bahdanau et al., 2015), which is a desirable goal in machine translation, as each piece of information in the source sequence (foreign sentence) typically needs to be conveyed in the target (translation) exactly once, but this is less true in dialogue data. Indeed, in dialogue, entire spans of the source may not map to anything in the target and vice versa. (Ritter et al. (2011) also found that an off-the-shelf word aligner (Och and Ney, 2003) produced alignments of poor quality, and an extension of their work with attention models (Ritter 2018, pc) yielded attention scores that did not correspond to meaningful alignments.) Some specific attention models for dialogue have been shown to be useful (Yao et al., 2015; Mei et al., 2017; Shao et al., 2017), e.g., to avoid word repetitions (which are discussed further in Sec. 5.2).
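To make the mechanism concrete, the following sketch computes attention weights over a set of encoder states and the resulting context vector. It assumes dot-product scoring, one of the variants discussed by Luong et al. (2015), rather than the additive scoring network of Bahdanau et al. (2015); the function name `attend` and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [v / z for v in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights over source positions and the weighted context vector."""
    # Assumption: relevance of each source position is a dot product with the decoder state.
    scores = [sum(a * b for a, b in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: weighted average of the encoder states.
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states))
               for j in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source word
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

Because the context vector is recomputed at every decoding step, the decoder is no longer limited to a single fixed-size summary of the source. In this toy example the first and third source positions receive equal weight and the second receives less.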

5.1.4 Pointer-Network models

Multiple model extensions (Gu et al., 2016; He et al., 2017a) of the seq2seq framework improve the model’s ability to “copy and paste” words between the conversational context and the response. Compared to other tasks such as translation, this ability is particularly important in dialogue, as the response often repeats spans of the input (e.g., “good morning” in response to “good morning”) or uses rare words such as proper nouns, which the model would have difficulty generating with a standard RNN. Originally inspired by the Pointer Network model (Vinyals et al., 2015)—which produces an output sequence consisting of elements from the input sequence—these models hypothesize target words that are either drawn from a fixed-size vocabulary (akin to a seq2seq model) or selected from the source sequence (akin to a pointer network) using an attention mechanism. An instance of this model is CopyNet (Gu et al., 2016), which was shown to significantly improve over RNNs thanks to its ability to repeat proper nouns and other words of the input.
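The idea of choosing between generating from a fixed vocabulary and copying from the source can be illustrated with a simplified gated mixture. This sketch does not reproduce CopyNet's exact formulation: it assumes a scalar gate `p_gen` that interpolates between a vocabulary distribution and a copy distribution given by attention weights over source tokens. The function name `copy_mixture`, the token "Zurich", and all probabilities are made up for illustration.

```python
def copy_mixture(vocab_probs, attention, source_tokens, p_gen):
    """Mix a generation distribution with a copy distribution over source tokens.

    vocab_probs:   dict mapping each vocabulary word to its generation probability
    attention:     one weight per source token (sums to 1)
    source_tokens: the words of the conversational context
    p_gen:         hypothetical scalar gate, probability of generating vs. copying
    """
    final = {w: p_gen * p for w, p in vocab_probs.items()}
    for token, a in zip(source_tokens, attention):
        # Source words outside the vocabulary receive probability mass only by copying.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * a
    return final

vocab_probs = {"good": 0.2, "morning": 0.1, "i": 0.4, "<unk>": 0.3}
source_tokens = ["good", "morning", "Zurich"]  # "Zurich" is out of vocabulary
attention = [0.2, 0.2, 0.6]
final = copy_mixture(vocab_probs, attention, source_tokens, p_gen=0.5)
best = max(final, key=final.get)
```

Here the out-of-vocabulary proper noun ends up as the most probable next word even though the generation distribution cannot produce it, which is the behavior that motivates copy mechanisms in dialogue.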

5.2 Challenges and Remedies

The response generation task faces challenges that are rather specific to conversation modeling. Much of the recent research is aimed at addressing the following issues.

5.2.1 Response blandness

Utterances generated by neural response generation systems are often bland and deflective. While this problem has been noted in other tasks such as image captioning (Mao et al., 2015), it is particularly acute in E2E response generation, as commonly used models such as seq2seq tend to generate uninformative responses such as “I don’t know” or “I’m OK”. Li et al. (2016a) suggested that this is due to their training objective, which optimizes the likelihood of the training data according to