Sample-Efficient Model-based Actor-Critic for an Interactive Dialogue Task

04/28/2020 ∙ by Katya Kudashkina, et al.

Human-computer interactive systems that rely on machine learning are becoming paramount to the lives of millions of people who use digital assistants on a daily basis. Yet, further advances are limited by the availability of data and the cost of acquiring new samples. One way to address this problem is to improve the sample efficiency of current approaches. As a solution path, we present a model-based reinforcement learning algorithm for an interactive dialogue task. We build on commonly used actor-critic methods, adding an environment model and a planner that augment the learning agent so that it learns a model of the environment dynamics. Our results show that, on a simulation that mimics the interactive task, our algorithm requires 70 times fewer samples than a commonly used model-free baseline and demonstrates 2 times better asymptotic performance. Moreover, we introduce a novel contribution: computing a soft planner policy and using it to further update a model-free policy, which yields a less computationally expensive model-free agent that is as good as the model-based one. This model-based architecture serves as a foundation that can be extended to other human-computer interactive tasks, allowing further advances in this direction.





Intelligent assistants that amplify and augment human cognitive and physical abilities have a paramount influence on society. The ubiquitous nature of intelligent assistants raises their economic significance. Voice personal assistants, in particular, are on the rise. These are systems such as Amazon Alexa, Google Personal Assistant, Microsoft Cortana, Apple Siri, and many others, that assist people with a myriad of tasks: gathering information, inquiring about the weather, making purchases, booking appointments, or dictating and editing documents. Such technology that helps in performing tasks through a dialogue between a user and a system is often referred to as voice-assistive.

We center our attention on a particular class of voice-assistive systems: voice document-editing. Work on document editing via voice goes back more than three decades [Ades and Swinehart1986]. Despite recent progress, voice document-editing and voice-dictation systems are still in a primitive form. For example, in some cloud-based document editors, we can type, format, and edit using only specific commands: “Select line” in order to select a line, or “Go to the next character” to advance the cursor. The lack of sophistication in today’s voice-editing systems is due to the difficulty of training them. They require either a large, diverse training dataset of general document-editing dialogues or hours of online human-machine interactions. Such datasets do not currently exist, and online interactions are impractical: a substantial amount of training is needed before the agent can overcome users’ frustration with a system that has not been sufficiently trained.

General methods, such as supervised learning, that scale with increased computation, continue to build knowledge into our agents. However, this built-in knowledge is not enough when it comes to conversational AI agents assisting users, especially in complicated domains. Supervised learning is an important kind of learning, but alone it is not adequate for learning from interaction. In contrast, reinforcement learning provides an opportunity for an intuitive human-computer interaction.

The key contribution of this paper is a sample-efficient model-based reinforcement learning algorithm that can make best use of limited datasets and online learning. Its novelty is in computing a soft planner policy and updating a model-free policy. By integrating such an update with the rest of the algorithm, we can deploy a model-free agent that performs better than an agent trained without model-based components. We show that our algorithm is 70 times more sample-efficient than the one trained with a commonly used model-free actor-critic method that is already known for its good performance across many domains. In addition, asymptotically our algorithm demonstrates twice the performance in comparison to the model-free actor-critic baseline.

The intent of our algorithm is to address sample efficiency in real-world human-computer interactive systems, where the reward function truly depends on user feedback. Thus, we validate the algorithm on a challenging task that has this property and involves user interactions with an imperfect transcription system: a voice document-editing task. The work demonstrates the benefits of a model and planner with a novel soft-planner policy and shows how to fine-tune a model-free policy and value function. To evaluate our algorithm, we compare it to the current state-of-the-art implementations for such systems: model-free actor-critic methods.

Related Work & Background

Reinforcement Learning (RL) is a framework that allows an agent to learn by interacting with an environment. The agent maps situations to actions with the goal of maximizing some numerical reward signal. It is not surprising that RL techniques have found wide use in a number of dialogue system implementations [Walker and Grosz1978, Singh et al.2000, Shah, Hakkani-Tur, and Heck2016, Li et al.2016, Dhingra et al.2017]. Below we describe the key RL concepts and terminology, closely following the conventions of Sutton and Barto (2018) for notation and definitions.

Commonly, RL problems are cast as sequential decision-making problems modelled as Markov Decision Processes (MDPs). In an MDP, an agent and an environment interact in a sequence of discrete steps. The state at time step $t$ is denoted $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states. In cases of partial observability, the environment emits only observations $O_t$: a signal that depends on its state but carries only partial information about it. The observations may not contain all information necessary for predicting the future or for optimal control (i.e., learning policies to attain a large reward). In this case we operate within the framework known as Partially Observable MDPs (POMDPs).

An action that the agent takes at the current state is defined as $A_t \in \mathcal{A}(s)$, where $\mathcal{A}(s)$ is the set of all actions possible from state $s$. Action selection is driven by the information that state $s$ provides.

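The policy $\pi$ is a probability distribution over all possible actions $a \in \mathcal{A}(s)$ at a given state $s$: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$. Once an action is taken, the agent receives a numerical reward, $R_{t+1}$, and transitions to the next state $S_{t+1}$, according to the transition probabilities of the environment’s dynamics: $p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$, producing a sequence:

$$S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T, \tag{1}$$

where $T$ is the final timestep for an episodic task or infinity for a continuing task. The goal of the agent is to learn the optimal policy, which maximizes the total expected discounted return:

$$G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}, \tag{2}$$

where $\gamma \in [0, 1]$ is a parameter, called the discount rate.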
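As a small illustration of the discounted return, the sketch below accumulates a finite list of rewards backwards, so that each reward is discounted once per step of delay. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k] for a finite episode."""
    g = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3} from one short episode
    print(discounted_return([-1.0, -2.0, 0.0], gamma=0.9))
```

With `gamma=1.0` the function reduces to a plain sum of rewards, which matches the undiscounted episodic case.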

In value-based methods, we compute value functions. In prediction-type problems, it is common to estimate the state-value function as $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ when starting in state $s$ and following policy $\pi$. Prediction problems are also referred to as policy evaluation. In control-type problems, we estimate the action-value function $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$, from which we derive the policy: $\pi(s) = \arg\max_a q_\pi(s, a)$. In policy-based methods, we directly optimize and compute the policy.

The first suggestions of formalizing MDPs for dialogue go back forty years, followed by work using RL [Walker and Grosz1978, Biermann and Long1996, Levin, Pieraccini, and Eckert1997]. An example of an early end-to-end dialogue system that used RL is the RLDS software tool developed by Singh et al. (2000) at AT&T Labs. NJFun, developed by Litman et al. (2000), was another pioneering system that used RL. NJFun was formulated as an MDP with a manually designed state space for the dialogue.

Dialogue systems have rapidly advanced their performance in the last few years, following the introduction of the sequence-to-sequence paradigm by Sutskever, Vinyals, and Le (2014), which transformed machine translation among other natural language processing tasks. Much of this progress is now attributed to the increasing availability of large amounts of data and computational power via deep learning techniques combined with RL. Most current dialogue systems that apply RL combine value-based and policy-based methods within the actor-critic framework described next.

Actor-Critic Methods

Actor-critic algorithms [Barto, Sutton, and Anderson1983] combine value-based and policy-based methods. The ‘actor’ refers to the part of the agent responsible for producing the policy, while the ‘critic’ refers to the part of the agent responsible for producing a value function. Actor-critic methods are a subset of policy gradient algorithms, which we describe next.

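In policy gradient methods, we learn a parameterized policy:

$$\pi(a \mid s, \theta) = \Pr(A_t = a \mid S_t = s, \theta_t = \theta), \tag{3}$$

where $\theta \in \mathbb{R}^{d'}$, and $d'$ is the dimensionality of $\theta$. The goal of the learning agent is to maximize some scalar performance measure $J(\theta)$ with respect to the policy parameters. Policy gradient methods find a locally optimal solution to the problem of maximizing the objective and work by applying gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}, \tag{4}$$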
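where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$. The classic variant of policy gradient methods is REINFORCE [Williams1992], where the gradient of the objective function with respect to the parameters is

$$\nabla J(\theta) \propto \mathbb{E}_\pi\!\left[ G_t \, \nabla \ln \pi(A_t \mid S_t, \theta) \right]. \tag{5}$$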
In REINFORCE, the agent simply collects transitions from an episode using the current policy $\pi(\cdot \mid \cdot, \theta)$ and then uses them to update the policy parameters. The gradient estimate can have high variance when using only an actor. Hence, a large number of samples may be needed for the policy to converge. Actor-critic methods aim to reduce this variance by using an actor and a critic. The critic’s approximate value of a state, $\hat{v}(s, w)$, is parameterized by $w \in \mathbb{R}^{d}$, where $d$ is the number of components of the parameter vector. For the critic update, in the simplest case of 1-step Temporal Difference learning, we perform the following:

$$w_{t+1} = w_t + \alpha^{w} \left[ R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \right] \nabla \hat{v}(S_t, w_t). \tag{6}$$
In this work, we focus on the Advantage Actor-Critic (A2C) method [Mnih et al.2016], which makes updates to the policy using the advantage function: $A(s, a) = \hat{q}(s, a) - \hat{v}(s)$, where $\hat{q}(s, a)$ is an estimate of the value of the state-action pair and, like $\hat{v}$, has its own learned parameters. To update both estimates (using action-value estimates instead of state-value estimates for $\hat{q}$) we can use temporal difference updates as described in Equation 6. To update $\theta$, instead of using $G_t$, as in REINFORCE, we use a policy gradient update with $A(S_t, A_t)$:

$$\theta_{t+1} = \theta_t + \alpha \, A(S_t, A_t) \, \nabla \ln \pi(A_t \mid S_t, \theta_t). \tag{7}$$

In practice, instead of using a separate action-value network $\hat{q}$, empirically observed truncated $n$-step returns are used to calculate an estimate of the advantage: $A(S_t, A_t) \approx \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} + \gamma^{n} \hat{v}(S_{t+n}, w) - \hat{v}(S_t, w)$, where episode trajectory rollouts are performed for $n$ steps during which the value function remains constant.
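The truncated $n$-step advantage estimate can be sketched in a few lines. This is a minimal illustration under stated assumptions: the critic's values at the first and last state of the rollout are passed in as plain numbers, and the function name `n_step_advantage` is hypothetical rather than taken from the paper's implementation.

```python
def n_step_advantage(rewards, v_start, v_end, gamma):
    """Estimate the advantage from an n-step rollout.

    rewards: observed rewards R_{t+1}, ..., R_{t+n}
    v_start: critic estimate of the state where the rollout began
    v_end:   critic estimate of the state reached after n steps
             (use 0.0 if that state is terminal)
    """
    n = len(rewards)
    n_step_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return n_step_return + (gamma ** n) * v_end - v_start


if __name__ == "__main__":
    # Two-step rollout that ends the episode; the critic predicted -2.0 at the start.
    print(n_step_advantage([-1.0, -1.0], v_start=-2.0, v_end=0.0, gamma=0.9))
```

With a single reward in the list, the estimate reduces to the one-step temporal difference error, so the same quantity can drive both the actor update and the critic update.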

Model-based Reinforcement Learning

Learning the dynamics of the environment to augment the state or for planning—model-based RL—is a promising approach for improving sample efficiency [Kaiser et al.2019]. In this particular paradigm of model-based reinforcement learning, knowledge is represented in two parts: a state part and a dynamics part [Sutton2019]. The state part can be formulated as a state-update function which aims to capture all the information needed for predicting and controlling future state and rewards. In a classic MDP with full observability, a state, referred to as Markov state, contains all necessary information to predict future states and to take action.

The second part of model-based reinforcement learning is learning a model of the dynamics of the environment, captured by an environment model. A common approach has been to learn an environment model online by predicting future states and rewards and comparing them to the ones the agent sees later. An environment model can be learned by the agent during human-computer interactions to capture the dynamics of those interactions. These dynamics are everything that the agent needs to know about the environment. Environment models help the agent make informed action choices by allowing it to compute possible future outcomes when considering different actions. These may include actions that the agent does not take; this process is called planning. Planning is particularly important because it allows the agent to learn not only from the actions it takes, but also from the actions it does not select. Planning is exactly what allows the agent to sweep through all actions with the model and propagate information back, making learning easier and faster, in other words, sample-efficient.

However, current methods in dialogue systems that are based on policy-gradient methods do not include the critical components of model-based RL that allow an agent to learn the dynamics of the environment it operates in, in our case, the dynamics of a dialogue environment. A number of works combine policy-gradient methods and supervised learning [He et al.2015, Li et al.2016, Kaplan, Sauer, and Sosa2017, Weisz et al.2018], or take a different approach to improve sample complexity using imitation learning [Lipton et al.2018]. The closest to using a model-based RL approach is the work of Zhao and Eskenazi (2016). Their end-to-end architecture is close to the Dyna architecture [Sutton1991]. However, their additional model employs a known transition function from a database and not a model that is learned through interactions. One may compare their methods to the MBPO method [Janner et al.2019], which also uses offline data, while our method focuses on online learning.

Our algorithm allows the agent to learn a model online from interactions and removes the complexities of needing additional elements, such as “successful trajectories” [Lipton et al.2018] or data augmentation [Goyal, Niekum, and Mooney2019].

To demonstrate the advantages of our model-based RL algorithm within dialogue tasks, we selected the task of voice document-editing for two primary reasons. The first reason is that the voice document-editing application can be simulated while accurately reflecting real-life scenarios of the task, which permits the development and evaluation of models without user interaction. The second reason is that, in general, we think that model-based RL is a useful paradigm for human-computer interactive tasks with limited data. Voice document-editing, an example of such tasks, is a first step in this direction. Additionally, this task is particularly difficult for supervised approaches given the variability of users’ behavior. The next section describes the task in detail.

Voice Document-Editing Task

Voice document-editing systems could be incredibly powerful if they could fully process a dictation in free-form language. Anyone who creates documents could have a voice-assistant helping them with document editing in a timely and efficient manner. People could create and edit documents on-the-go: delete and insert words; create or edit itemized lists; change the order of words, paragraphs, or sentences; convert one tense to another; add signatures or salutations; format text; fix grammar; or fine-tune the style. Often, a user simply wants to delete or correct what they dictated a moment ago. The reason for this is that current speech recognition systems are imperfect. For example, a human might mean to say: “Good morning, George. I hope you have been well”, yet the dialogue system transcribes: “Good morning, George. I hope you pin well.” In this case, the user wishes to delete and re-dictate the final part of the sentence. In this paper, we focus on a specific problem in voice document-editing systems that manifests from the state of current speech recognition systems. Specifically, today’s systems do not allow the user to correct errors via voice.

We proceed with a setting where misrecognized words are at the end of the sentence, which is common in real-life scenarios. As soon as a dictation tool displays incorrectly processed words, a user sees it right away and stops dictating. In our setting, the user says “No” if she is not satisfied with the words at the end of the sentence that are displayed on the screen during dictation. When the user says “No”, then “No” is treated as a keyword for the agent to take an action: to identify which words were not recognized properly and to delete them. If the agent deletes too few words, the user can say “No” again and the process is repeated until the user is satisfied. If the agent deletes too many words, the user has to repeat the accidentally-deleted words before continuing dictation. The agent’s goal is to identify the correct number of words to delete in the least number of steps. We formalize our setting using RL concepts.
Environment: An environment comprises the text that is transcribed from a user’s dictation and the user speaking to the agent.
Action: An action is the number of words to delete.
Interaction: An interaction is an interactive process between the user and the agent. This interaction starts the moment a user says “No” for the first time, after dictating a new sequence of words, and terminates in one of the following two cases: 1) If the user is satisfied with the sentence and last edits (last action from the agent); or 2) If the agent deletes too many words. In the latter case, the user has to repeat some words that were previously dictated, because the agent deleted some well-recognized words by mistake. The satisfaction of the user is indicated by the fact that the user simply continues dictation and does not say “No” after the agent takes an action.
Intent: An intent is an integer representing the number of words that the user wishes to correct.
Word sequence: A word sequence is a sequence of words that is shown on the screen at time step $t$.
Speech transcription: A speech transcription is a new sequence of words dictated by a human at time step $t$. It is an output from a speech-to-text system.
Observation $O_t$: An observation is a combination of a word sequence and speech that the agent observes at time step $t$. An observation is emitted from the environment and serves as an input to a state-update function.
State $S_t$: A state is a representation of the entire interaction’s history at time $t$. The state is computed by the state-update function and it accumulates all the information since the beginning of an interaction: information about the observation, and the previous action and state.
Reward: The reward for each action is $0$ if the action corresponds to the intent. If the action is greater than the intent, the agent is penalized by the number of words that were not noisy but were deleted by the agent incorrectly; the reward is then the negative of that number. If the agent undershoots by deleting fewer words than the intent, the reward is $-k$, where $k$ is the number of steps within an interaction and is equal to the number of times the user has already said “No”. This reward function reflects user (dis)satisfaction, penalizing a longer time to task completion and overshoots that require further interaction.
Consider the previous example where the user says “Good morning, George. I hope you have been well”, yet the dialogue system transcribes “Good morning, George. I hope you pin well.” The initial intent here is 2, to delete “pin well”. If the agent takes an action that deletes one word, the updated sentence is “Good morning, George. I hope you pin” and the reward is $-1$. The updated intent is then 1. Next, the agent takes an action that deletes three words. The updated sentence is “Good morning, George. I” and the reward is $-2$, because two correctly recognized words were deleted. In a real-life application we built (this application is not part of the paper), we observe the action that is done by the user after the agent’s action. If the agent deletes three words, the user has to re-dictate the words “hope you”, indicating that they were deleted incorrectly. In our simulation (described further), when we inject noisy words into a sentence, we know the amount of noise and thus can simply calculate the reward for any given action.
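The interaction and reward rules described above can be sketched as a tiny simulation. The exact reward magnitudes are assumptions made for illustration: a reward of 0 for an exact match, minus the number of wrongly deleted words for an overshoot, and minus the step count for an undershoot. The names `reward` and `run_interaction` are hypothetical.

```python
def reward(action, intent, step):
    """Hypothetical reward for deleting `action` words when `intent` words are noisy.

    step counts how many times the user has said "No" so far (starting at 1).
    """
    if action == intent:
        return 0                      # assumed value for an exact match
    if action > intent:
        return -(action - intent)     # well-recognized words deleted by mistake
    return -step                      # undershoot: the user must say "No" again


def run_interaction(actions, intent):
    """Replay a list of agent actions and return the reward at each step."""
    rewards = []
    for step, action in enumerate(actions, start=1):
        rewards.append(reward(action, intent, step))
        if action >= intent:          # user is satisfied, or the agent overshot
            break
        intent -= action              # fewer noisy words remain at the end
    return rewards


if __name__ == "__main__":
    # The "pin well" example: intent 2, agent deletes 1 word, then 3 words.
    print(run_interaction([1, 3], intent=2))  # prints [-1, -2]
```

Because the simulation knows how many noisy words were injected, the reward can be computed directly for any action, without waiting for the user to re-dictate.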

Model-Based Actor-Critic Algorithm

The application of a common actor-critic model to our specific task was motivated by its strong performance across many domains. Actor-critic methods are among the most stable on-policy model-free algorithms. Our Model-Based Actor-Critic (MBAC) algorithm (Algorithm 1) includes the major components of model-based RL architectures following Sutton and Barto (2018). Figure 1 depicts these components and the interaction between them: the environment model, the state-update function $u$, the policy or ‘actor’ $\pi$, the state-value function or ‘critic’ $\hat{v}$, and a planner with policy $\pi_p$. Taking the parameterized function approximation setting, the model, the policy, and the state-value function each have their own parameters; as before, we write $\theta$ for the policy parameters and $w$ for the state-value parameters.

Environment models are fundamental in making long-term predictions and evaluating the consequences of actions. People quickly learn and adjust to a course of dialogues because they are good at those predictions. The goal of the model is to learn the transition dynamics of the environment—to predict the next states and rewards from the previous state and last action. Using the model, action choices are evaluated during planning by anticipating possible futures given different actions. An agent that uses planning and an environment model can be better in computing long-term consequences of actions, such as predictions about the estimated returns from states or state-action pairs. In other words, environment models can improve the agent’s prediction and control abilities. Thus, model-based approaches are foundational for applications of RL to real-life problems, such as interactive dialogues.

Instead of making the assumption that states are Markov, i.e., that states are fully observable, we generalize to the POMDP setting. Together with the previously observed action, $A_{t-1}$, and previous state, $S_{t-1}$, the observation $O_t$ is used by the state-update function $u$ to produce a state $S_t$. Incorporating a state-update function in the algorithm is necessary in order to encapsulate the course of a dialogue into a compact summary that is useful for choosing future actions.

Figure 1: MBAC architecture with the primary model-based components. Built on the work of Sutton and Barto (2018) (subscripts are omitted to avoid overlap of time-steps).

For example, if a user’s intent is to delete five words, but at the first step the agent deletes only one word, the agent should have the information about this first attempt before trying to delete the remaining four words. The idea is that the state should be sufficient for predicting and controlling future trajectories and understanding the consequences of actions, without having to store a complete history. The state is used as an input for the model and the planner. When training, the model uses this state together with the action taken and the subsequently observed reward and state. As the model becomes better in its predictions and the planner learns a better policy, both are used to update the model-free components, namely the policy $\pi$ and the value function $\hat{v}$.

Given the necessity of the state-update function, implementations like DQN [Mnih et al.2013] are not practical for this approach because they would require storing a potentially long sequence of state updates in a replay memory buffer. These implementations would be resource-inefficient and would suffer from staleness as the state-update function is being trained. Thus, online, on-policy policy-gradient methods, such as actor-critic, and in particular advantage actor-critic [Mnih et al.2016], are more suitable.

For each time step $t$ of an interaction, we compute the state $S_t$ from the previous state $S_{t-1}$, incorporating the information that followed in the transition: the action $A_{t-1}$ and the observation $O_t$ received by the agent after taking action $A_{t-1}$. Then, we perform three major steps (see Algorithm 1): 1) Planning and acting; 2) Updating the policy $\pi$ and the state-value function $\hat{v}$; and 3) Updating the model. In comparison, the model-free actor-critic algorithm does not include the planning portion of step (1) (lines 6-11) and step (3). In addition, in Step (2), we modify the updates for both the actor and critic.

Step 1. Planning and Acting (lines 6-13)

First, we compute the predicted next states $\hat{S}'(a)$ and the predicted rewards $\hat{R}(a)$ for each possible action $a \in \mathcal{A}$, using the model.

Algorithm 1 Model-Based Actor-Critic (MBAC)
3: for each episode do
4:     for t = 1…T do
6:         # Plan and act
7:         for each action $a \in \mathcal{A}$ do
12:         Sample action $A_t \sim \pi_p$
13:         Take action $A_t$ and get $R_{t+1}$ and $O_{t+1}$
14:         # Update actor and critic
24:         # Update model

Using the predicted next states $\hat{S}'(a)$ and rewards $\hat{R}(a)$, we then compute a set of estimated discounted returns $\hat{G}(a)$, one per action. While we can iterate the model for multiple steps at the cost of more computation, we found that one-step iteration was adequate. Thus, we use the value estimate of the one-step predicted state: $\hat{G}(a) = \hat{R}(a) + \gamma \hat{v}(\hat{S}'(a), w)$, where $\gamma$ is a hyper-parameter, called the discount rate.

We compute the planner policy $\pi_p$ by applying the softmax function to the vector of estimated returns $\hat{G}$. The state-value corresponding to the policy of the planner is computed by:

$$v_p(S_t) = \sum_{a} \pi_p(a \mid S_t) \, \hat{G}(a). \tag{8}$$

Using the planner policy $\pi_p$, the agent takes an action $A_t$ and receives reward $R_{t+1}$, followed by the observation $O_{t+1}$.
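The planning step can be sketched as follows: query a model once per action, form one-step return estimates, turn them into a soft policy with a softmax, and take the policy-weighted average as the planner's state value. This is a minimal sketch, not the paper's implementation; `plan`, `toy_model`, and `toy_value` are hypothetical stand-ins for the learned model and critic, and the toy state is simply the number of noisy words remaining.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def plan(state, actions, model, value_fn, gamma):
    """One-step planning: returns (planner policy, planner value, returns)."""
    returns = []
    for a in actions:
        next_state, r_hat = model(state, a)            # model prediction
        returns.append(r_hat + gamma * value_fn(next_state))
    policy = softmax(returns)                           # soft planner policy
    v_planner = sum(p * g for p, g in zip(policy, returns))
    return policy, v_planner, returns


def toy_model(state, action):
    """Assumed dynamics: state = noisy words left; penalty = deletion error."""
    return max(state - action, 0), -float(abs(state - action))


def toy_value(state):
    """Assumed critic: each remaining noisy word costs about one unit."""
    return -float(state)


if __name__ == "__main__":
    actions = list(range(5))
    policy, v_p, _ = plan(2, actions, toy_model, toy_value, gamma=0.9)
    chosen = random.choices(actions, weights=policy)[0]  # act with the planner policy
    print(policy, v_p, chosen)
```

Because the policy is a softmax rather than an argmax, every action keeps nonzero probability, which preserves exploration while still favoring the actions the model predicts to be best.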

Step 2. Updating Actor and Critic (lines 14-23)

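We compute the objective for the critic, $L_v(w)$, using the mean square error (MSE) between the state-value of the critic, $\hat{v}(S_t, w)$, and the state-value of the planner, $v_p(S_t)$, which was based on the model reward predictions and the critic’s estimates at the predicted states:

$$L_v(w) = \left( v_p(S_t) - \hat{v}(S_t, w) \right)^2. \tag{9}$$

The critic is updated at each time step by the following rule:

$$w \leftarrow w - \alpha^{w} \nabla_w L_v(w). \tag{10}$$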
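The actor objective, $L_\pi(\theta)$, is computed by using the relative entropy between the actor policy $\pi(\cdot \mid S_t, \theta)$ and the planner policy $\pi_p(\cdot \mid S_t)$:

$$L_\pi(\theta) = D_{\mathrm{KL}}\!\left( \pi_p(\cdot \mid S_t) \,\|\, \pi(\cdot \mid S_t, \theta) \right) - \beta H\!\left( \pi(\cdot \mid S_t, \theta) \right), \tag{11}$$

where $H$ is an entropy term, weighted by the entropy parameter $\beta$, being subtracted to encourage exploration similarly to Mnih et al. (2016). The entropy term in Equation 11 is defined as:

$$H\!\left( \pi(\cdot \mid S_t, \theta) \right) = -\sum_{a} \pi(a \mid S_t, \theta) \ln \pi(a \mid S_t, \theta). \tag{12}$$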
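The critic and actor objectives can be illustrated numerically. The sketch below is written under assumptions: the relative entropy is taken with the planner policy as the target distribution (the direction is an assumption here, since the text only says "between" the two policies), `beta` stands for the entropy weight, and all function names are hypothetical.

```python
import math


def critic_loss(v_critic, v_planner):
    """Squared error between the critic's value and the planner's value."""
    return (v_planner - v_critic) ** 2


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; q is floored at eps for safety."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def actor_loss(actor_policy, planner_policy, beta):
    """Assumed form: KL(planner || actor) minus a weighted entropy bonus."""
    return kl_divergence(planner_policy, actor_policy) - beta * entropy(actor_policy)


if __name__ == "__main__":
    planner = [0.1, 0.7, 0.2]
    actor = [0.3, 0.4, 0.3]
    print(critic_loss(v_critic=-2.0, v_planner=-1.5))
    print(actor_loss(actor, planner, beta=1.0))
```

Minimizing the first term pulls the actor toward the planner's soft policy, which is what lets the cheaper model-free policy inherit the planner's behavior, while the entropy term keeps the actor from collapsing onto a single action too early.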

The actor is updated similarly to the critic at each time step by the following rule:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L_\pi(\theta).$$

Figure 2: Construction of the RNN-input for the state-update function. The RNN-input includes the information of the previous state, the observation, and the reward. The dimensions of each array are shown underneath each entity.

Step 3. Model Update (lines 24-30)

We update the model by minimizing the state-objective and the reward-objective together. Using the current state $S_t$, the action taken $A_t$, and the new observation $O_{t+1}$ as an input for the state-update function $u$, we compute the next state $S_{t+1}$. We also compute the predicted next state $\hat{S}_{t+1}$ and reward $\hat{R}_{t+1}$ by using the model, given the current state $S_t$ and the action taken $A_t$. The state-objective measures how correct our model is in the state-prediction, using the MSE between the predicted and actual next states:

$$L_s = \mathrm{MSE}\!\left( \hat{S}_{t+1}, S_{t+1} \right).$$

Similarly, using the MSE between the predicted and actual rewards, we get the reward-objective $L_r = \mathrm{MSE}(\hat{R}_{t+1}, R_{t+1})$. Adding these two objectives together, we can then update the model parameters by a gradient step on $L_s + L_r$.


The state-update function $u$ is represented by a recurrent neural network (RNN): a two-layered bi-directional GRU [Cho et al.2014] with input and hidden sizes of 400 and 100, respectively. In order to compute an updated state using the RNN, we need to provide an RNN-input, which includes the previous state, the observation, and the reward (Figure 2). The observation is computed by concatenation of the encoded text and the user input.

Figure 3: The model architecture. The state $S_t$ and the action $A_t$ are inputs to the model, and the model computes predictions of the next reward $\hat{R}_{t+1}$ and the next state $\hat{S}_{t+1}$.

We use word2vec word embeddings [Mikolov et al.2013] to encode the text into an observation matrix. An action is encoded into an action-matrix. A bi-linear transformation is applied to the action-matrix and the previous state to compute the previous-state-action information. The previous-state-action information is concatenated with the observation to compute the RNN-input. We note that more recent developments for word embeddings, such as BERT [Devlin et al.2018], are not applicable in our setting. BERT models jointly condition on both the left and right contexts of a sentence, which would create a problem in our simulation given that we inject random noise. Instead, we learn temporal structures with an RNN.

We use single-threaded agents that learn from a single stream of experience. The use of a single-threaded agent mimics real human-computer interaction for the task. The actor and critic are implemented as a 1D convolutional neural network that extracts correlations in the temporal structure of the sentences. The network is shared up until the point where it splits into two separate heads: one producing a scalar value estimate and the other a probability distribution over actions as a policy. The shared architecture consists of three convolutional layers with filter sizes of 50, 50, and 100. The entropy term $H$ was used in both the advantage actor-critic and the MBAC architectures.

The model architecture (Figure 3) for state prediction is similar to [Oh et al.2015]. We use a stack of three convolutional layers with filter sizes of 50, 50, and 100. The action is encoded into a vector, and a Hadamard product is applied between the output of the 1D convolutional layers and the action vector. The output of this operation is then passed through a stack of three deconvolutional layers with filter sizes of 100, 50, and 100. The output of the Hadamard product operation is also passed through a linear layer to a single node for the scalar reward prediction. The state and the action are inputs to the model, and the model outputs predictions of the next reward and the next state.

Figure 4: Average reward per interaction for MBAC and A2C: MBAC needs 70 times fewer samples to achieve the average reward at which A2C approximately converges.

Initialization & Hyperparameters

The architecture was implemented in PyTorch 1.0. The RNN-input requires an initial state and an initial action for state construction. The initial state was initialized to ones, and the initial action was initialized to zeros. The range of actions was bounded to respect the real-world setting, where a user is most likely to observe the incorrectly recognized words, and therefore halt dictation, before a sentence becomes longer than 15 words. We used the Adam optimizer [Kingma and Ba2014] with the default parameters, with the exception of the learning rate. We used a discount rate of 0.9, and clipped gradients at 0.9. The word-embedding size was 300. The entropy parameter was set to 1.


The simulation of the chosen document-editing dialogue task required mimicking a sentence that was corrupted by the voice-recognition component. We chose the BookCorpus dataset [Zhu et al.2015]. Along with narratives, the books contain dialogue, emotion, and a wide range of interaction between characters [Kiros et al.2015]. The sentences from the dataset are fed to the model one by one, and noisy words are injected at a random point in a sentence. (Simulating noise that is phonetically close to the original words would require significant engineering efforts, such as creating datasets of phonemic dictionaries, that are beyond the scope of the present work.) This intentionally creates a difficult problem for the agent: there are no labels, and the noise is random. Thus the only way for the agent to learn is through dialogue interaction and by observing the dynamics of the environment.
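One way to picture the noise injection is the sketch below. It is only an illustration under assumptions: the sentence is cut at a random point and a random number of noise words is appended, so that the noisy words sit at the end of what the agent sees; the noise vocabulary, the `max_noise` bound, and the function name `inject_noise` are all hypothetical.

```python
import random


def inject_noise(sentence, vocab, max_noise, rng):
    """Return (noisy word list, number of injected noise words).

    The sentence is truncated at a random point and followed by random
    noise words; the number of noise words plays the role of the intent.
    """
    words = sentence.split()
    cut = rng.randint(1, len(words))             # keep at least one clean word
    n_noise = rng.randint(1, max_noise)          # how many words are corrupted
    noisy_tail = [rng.choice(vocab) for _ in range(n_noise)]
    return words[:cut] + noisy_tail, n_noise


if __name__ == "__main__":
    rng = random.Random(0)                        # fixed seed for repeatability
    noise_vocab = ["pin", "bell", "tin"]          # hypothetical noise words
    noisy, intent = inject_noise(
        "good morning george i hope you have been well", noise_vocab, 3, rng
    )
    print(noisy, intent)
```

Since the simulator records how many words it injected, the true intent is known for every sentence, which is what makes the reward computable without a human in the loop.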

Figure 5: Asymptotic performance over the last 90,000 and 10,000 interactions for MBAC and A2C: MBAC is just over 2 times better in the limit. Orange line is the median.


The simulation assumes that the agent always receives “No” as a speech input from the user, because we are interested in the interaction itself, when the agent has to take an action, and not in idle moments when the dictation is going well and all words are well recognized.

In the experiments, each algorithm was executed for 100,000 simulated interactions, with 30 runs per algorithm, each with a different random seed. The random seed affects the initialization of the neural-network parameters and the noise injected into the sentences. The average over those 30 runs was used in the performance measures of the algorithms.
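The averaging protocol can be sketched as below. Here `run_fn` is a hypothetical callable that stands in for one full training run and returns one reward per interaction; it is not part of the original work.

```python
from statistics import mean
from typing import Callable, List


def average_over_seeds(
    run_fn: Callable[[int], List[float]], seeds: List[int]
) -> List[float]:
    """Run `run_fn` once per seed and average the reward curves pointwise."""
    curves = [run_fn(seed) for seed in seeds]
    # zip(*curves) groups the values of all runs at the same interaction index.
    return [mean(step_values) for step_values in zip(*curves)]
```

With the setup described in the text, `seeds` would hold 30 distinct values and each curve would have 100,000 entries.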

We report several measures of performance: the average reward per interaction, the absolute error per step, and the distribution over actions.

Sample Efficiency

We first ask: how much can model-based RL improve the efficiency of model-free deep RL algorithms? In other words, does the MBAC algorithm perform better than the ordinary actor-critic methods, without the model and the planner?

Figure 4 shows that MBAC takes about 70 steps on average to achieve an average reward of -4. Even at 2,000 steps, A2C has not reached its asymptotic performance, while MBAC, using the same learning rate, achieves an average reward per interaction that is roughly twice as good.

Figure 6: Absolute error in number of words deleted for each action, averaged over 300 actions. MBAC has the lowest error. Note that one interaction can include a few actions.

It takes A2C over 71,900 steps to reach its asymptotic performance of -3.74, while it takes MBAC only 88 steps to reach this point. This makes MBAC over 70 times more sample-efficient. Moreover, MBAC quickly gains a higher average reward and continues to improve while A2C plateaus.
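One way to measure this kind of sample efficiency is to find the first step at which each averaged reward curve reaches a chosen threshold and then take the ratio of the two step counts. The sketch below assumes curves stored as plain lists; the helper names are hypothetical and the exact procedure used by the authors is not stated in the text.

```python
from typing import List, Optional


def steps_to_reach(avg_rewards: List[float], threshold: float) -> Optional[int]:
    """Return the first 1-based step at which the curve reaches `threshold`."""
    for step, value in enumerate(avg_rewards, start=1):
        if value >= threshold:
            return step
    return None  # the threshold was never reached


def sample_efficiency_ratio(baseline_steps: int, candidate_steps: int) -> float:
    """How many times fewer steps the candidate needs than the baseline."""
    return baseline_steps / candidate_steps
```

Using the baseline's asymptotic reward as the threshold for both curves mirrors the comparison described above.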

Because MBAC is implemented with single-threaded agents, we do not compare it to Asynchronous Advantage Actor-Critic (A3C). A3C is the asynchronous version of A2C, which uses data samples from several asynchronous environments simultaneously; we believe that a comparison of single-threaded agents is the right foundation as a starting point.

Asymptotic Performance

MBAC also achieves just over 2 times the performance of standard A2C asymptotically. Figure 5 shows performance over the last 90,000 and last 10,000 interactions. MBAC performance improves from -1.84 to -1.77, while A2C performance improves only from -3.76 to -3.74, remaining almost unchanged.
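As a quick check of the "just over 2 times" figure, the ratios of the reported average rewards can be worked out directly. Taking the ratio of magnitudes, and pairing the first value of each algorithm with the last 90,000 interactions, is our reading of the comparison, since the rewards are negative and the text does not spell out the computation.

```latex
\[
\frac{\lvert -3.76 \rvert}{\lvert -1.84 \rvert} \approx 2.04
\quad \text{(last 90,000 interactions)},
\qquad
\frac{\lvert -3.74 \rvert}{\lvert -1.77 \rvert} \approx 2.11
\quad \text{(last 10,000 interactions)}.
\]
```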

We believe MBAC’s performance is explained by improved exploration. Figure 7 demonstrates the distribution of action selection by the agent for both MBAC and A2C. A2C prefers lower numbered actions, and explores less, while MBAC’s distribution over actions is much more uniform.
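A simple way to quantify how uniform an action distribution is would be the normalized entropy of the empirical action counts. This metric is our illustration and is not reported in the text; the function name and the assumption of at least two possible actions and a non-empty action log are ours.

```python
import math
from collections import Counter
from typing import Iterable


def normalized_entropy(actions: Iterable[int], num_actions: int) -> float:
    """Entropy of the empirical action distribution, scaled to [0, 1].

    Values near 1 mean actions were chosen almost uniformly; values near 0
    mean one action dominated. Assumes num_actions >= 2 and a non-empty log.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    # Actions that never occur contribute nothing to the entropy.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_actions)
```

Under this measure, an agent that explores more evenly across actions would score closer to 1 than an agent that prefers a few low-numbered actions.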

Figure 6 shows the absolute error in the number of words, that is, how far the agent was from the intent on every action. We found it important to track this in addition to the average reward, as the latter only provides reward information per interaction. As with the average reward, MBAC’s absolute error keeps decreasing, while A2C’s absolute error remains within the interval .
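The absolute-error measure could be computed along the following lines. This sketch assumes the error is the absolute difference between the number of words the agent deleted and the number the user intended, averaged over consecutive non-overlapping windows of 300 actions; whether the figure uses non-overlapping or sliding windows is not stated, and the function name is hypothetical.

```python
from typing import List


def windowed_abs_error(
    deleted: List[int], intended: List[int], window: int = 300
) -> List[float]:
    """Mean |deleted - intended| over consecutive windows of `window` actions."""
    errors = [abs(d - i) for d, i in zip(deleted, intended)]
    averages = []
    for start in range(0, len(errors), window):
        chunk = errors[start:start + window]  # the last chunk may be shorter
        averages.append(sum(chunk) / len(chunk))
    return averages
```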

Finally, motivated by the future deployment of this solution in a real-life application, we demonstrate that the model-free portion of the MBAC architecture works on its own, allowing a deployment that is less demanding in computation and memory. In other words, can we first apply MBAC training, then detach the model and planner, so as to deploy only the model-free architecture? We do exactly this and call the resulting model-free agent MBAC-lite. For evaluation, we use a subset of the dataset of over 74 million sentences that was split into halves for training and testing. We used the test set of unseen sentences to assess performance. Figure 8 compares the performance of two model-free architectures: MBAC-lite and A2C. Two results emerged: 1) the model and planner can be easily detached so that MBAC-lite is deployed more efficiently, with performance superior to A2C; and 2) using greedy actions (selecting the action whose estimated value is the greatest, instead of sampling from a probability distribution, as is done in non-greedy settings), MBAC-lite also performs significantly better than A2C. This experiment demonstrates how we can distill model and planner knowledge into the model-free architecture.
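The difference between greedy and non-greedy action selection can be illustrated with a short sketch. It assumes the deployed policy exposes one preference score (logit) per action and that the greedy choice is the highest-scoring action; these names and that interface are assumptions for illustration, not the authors' code.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over action preferences."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits: List[float], greedy: bool, rng: random.Random) -> int:
    """Pick the highest-scoring action (greedy) or sample from the policy."""
    probs = softmax(logits)
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    # Non-greedy: draw one action index according to the policy probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Greedy selection is deterministic for a given set of scores, whereas sampling retains some exploration at deployment time.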

Figure 7: The distribution over actions shows that MBAC is less deterministic than A2C.
Figure 8: The performance of MBAC-lite vs. A2C. MBAC-lite outperforms A2C with and without applying greedy actions. The orange line is the median and the extent of the boxes represents the upper and lower quartiles.


In this work, we presented a sample-efficient model-based actor-critic algorithm. At deployment time, the model and planner can be detached to obtain an inexpensive yet high performing model-free agent. Our experiments demonstrated that MBAC learned 70 times faster and obtained 2 times the performance of classic A2C in just under 100 interactions. We believe this result is foundational in applying reinforcement learning to human-computer interactions.

We also presented a novel update to a model-free policy using a soft-planner policy. We see studying various methods of updating agents’ model-free components with models and planners as a fruitful research direction. As a long-term vision, we believe that model-based RL is a promising and more efficient alternative to model-free RL.


We would like to thank the following collaborators for their insightful conversations: J. Fernando Hernandez-Garcia, Hado van Hasselt, Kris De Asis, Matthew Schlegel, Niko Yasui, Patrick Pilarski, and Richard S. Sutton.


  • [Ades and Swinehart1986] Ades, S., and Swinehart, D. C. 1986. Voice annotation and editing in a workstation environment. XEROX Corporation, Palo Alto Research Center.
  • [Barto, Sutton, and Anderson1983] Barto, A. G.; Sutton, R. S.; and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics (5):834–846.
  • [Biermann and Long1996] Biermann, A. W., and Long, P. M. 1996. The composition of messages in speech-graphics interactive systems. In Proceedings of the 1996 International Symposium on Spoken Dialogue, 97–100.
  • [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing, 1724–173.
  • [Devlin et al.2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  • [Dhingra et al.2017] Dhingra, B.; Li, L.; Li, X.; Gao, J.; Chen, Y.-N.; Ahmed, F.; and Deng, L. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In ACL.
  • [Goyal, Niekum, and Mooney2019] Goyal, P.; Niekum, S.; and Mooney, R. J. 2019. Using natural language for reward shaping in reinforcement learning. arXiv:1903.02020.
  • [He et al.2015] He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2015. Deep reinforcement learning with a natural language action space. arXiv:1511.04636.
  • [Janner et al.2019] Janner, M.; Fu, J.; Zhang, M.; and Levine, S. 2019. When to trust your model: Model-based policy optimization. In 33rd Conference on Neural Information Processing Systems.
  • [Kaiser et al.2019] Kaiser, L.; Babaeizadeh, M.; Milos, P.; Osinski, B.; Campbell, R. H.; Czechowski, K.; Erhan, D.; Finn, C.; Kozakowski, P.; Levine, S.; et al. 2019. Model-Based Reinforcement Learning for Atari. arXiv:1903.00374.
  • [Kaplan, Sauer, and Sosa2017] Kaplan, R.; Sauer, C.; and Sosa, A. 2017. Beating Atari with natural language guided reinforcement learning. arXiv:1704.05539.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 1–13.
  • [Kiros et al.2015] Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, 3294–3302.
  • [Levin, Pieraccini, and Eckert1997] Levin, E.; Pieraccini, R.; and Eckert, W. 1997. Learning dialogue strategies within the Markov decision process framework. In IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, 72–79.
  • [Li et al.2016] Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016. Deep reinforcement learning for dialogue generation. arXiv:1606.01541.
  • [Lipton et al.2018] Lipton, Z.; Li, X.; Gao, J.; Li, L.; Ahmed, F.; and Deng, L. 2018. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [Litman et al.2000] Litman, D.; Singh, S.; Kearns, M.; and Walker, M. 2000. NJFun-a reinforcement learning spoken dialogue system. In ANLP-NAACL 2000 Workshop: Conversational Systems.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
  • [Mnih et al.2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. In Deep Learning Workshop, NIPS.
  • [Mnih et al.2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
  • [Oh et al.2015] Oh, J.; Guo, X.; Lee, H.; Lewis, R. L.; and Singh, S. 2015. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, 2863–2871.
  • [Shah, Hakkani-Tur, and Heck2016] Shah, P.; Hakkani-Tur, D.; and Heck, L. 2016. Interactive reinforcement learning for task-oriented dialogue management. In Workshop on Deep Learning for Action and Interaction, NIPS.
  • [Singh et al.2000] Singh, S. P.; Kearns, M. J.; Litman, D. J.; and Walker, M. A. 2000. Reinforcement learning for spoken dialogue systems. In Advances in Neural Information Processing Systems.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
  • [Sutton and Barto2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press.
  • [Sutton1991] Sutton, R. S. 1991. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2(4):160–163.
  • [Sutton2019] Sutton, R. S. 2019. Toward a New Approach to Model-based Reinforcement Learning. Technical report, University of Alberta.
  • [Walker and Grosz1978] Walker, D. E., and Grosz, B. J. 1978. Understanding spoken language. Elsevier Science Inc.
  • [Weisz et al.2018] Weisz, G.; Budzianowski, P.; Su, P.; and Gasic, M. 2018. Sample efficient deep reinforcement learning for dialogue systems with large action spaces. arXiv:1802.03753.
  • [Williams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
  • [Zhao and Eskenazi2016] Zhao, T., and Eskenazi, M. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL 2016 Conference, 1–10.
  • [Zhu et al.2015] Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In IEEE International Conference on Computer Vision, 19–27.