Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

by   Natasha Jaques, et al.

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.




1 Introduction

In order to scale deep reinforcement learning (RL) to safety-critical, real-world domains, two abilities are needed. First, since collecting real-world interaction data can be expensive and time-consuming, algorithms must be able to leverage off-policy data – collected from vastly different systems, far into the past – in order to learn. Second, it is often necessary to carefully test a policy before deploying it to the real world; for example, to ensure its behavior is safe and appropriate for humans. Thus, the algorithm must be able to learn offline first, from a static batch of data, without the ability to explore.

This off-policy, batch reinforcement learning (BRL) setting represents a challenging RL problem. Most deep RL algorithms fail to learn from data that is not heavily correlated with the current policy fujimoto2018off . Even models based on off-policy algorithms like $Q$-learning fail to learn when the model is not able to explore during training. This is due to the fact that such algorithms are inherently optimistic in the face of uncertainty. When value estimates are noisy, taking the maximum over estimates of future reward leads to a persistent overestimation bias. In a normal RL setting, this drives the model to explore areas of the state-action space for which the value estimates have the highest variance, thus enabling it to refine them. In a batch setting where the model cannot explore, it is instead driven to value parts of the state-action space for which it has little to no data to learn a good policy.

We propose to resolve these issues by leveraging a pre-trained generative model of the state-action space trained on known sequences of interaction data. While training with RL, we penalize divergence from this prior model with different forms of KL-control. We benchmark against a discrete adaptation of Batch Constrained Q (BCQ) fujimoto2018off , a recently proposed BRL algorithm for continuous domains, and show that KL-control achieves superior performance. Finally, we propose using dropout to obtain uncertainty estimates of the target Q values, and use this lower bound to alleviate the Q-learning overestimation bias. This provides a more efficient alternative to Clipped Double Q-Learning fujimoto2018addressing .

We apply these algorithms to a challenging, under-explored, real-world reinforcement learning problem: using implicitly expressed human reactions in chat to improve open-domain dialog systems. When a machine learning system interacts with humans, ideally we would like to learn about the humans’ preferences in order to improve the performance of the system. Yet having humans manually indicate their preferences through explicit means like pressing a button (e.g. christiano2017deep ) or submitting a feedback report does not scale. Instead, we would like to be able to use humans’ implicit reactions, such as the sentiment they express, or the length of the conversation, in order to improve the policy.

Applying off-policy batch RL to language generation is challenging because the number of potential combinations of words and sentences leads to a combinatorial explosion in the size of the state space. The action space – the set of frequent vocabulary words in the English language – is 20,000-dimensional. This compounds the overestimation problem, making BRL even more difficult. However, when learning from human interactions in the wild, it is crucial to be able to learn offline and test the policy before deploying it, lest it learn inappropriate behaviors (e.g. tay ).

To support this work, we developed an interactive online platform that allows humans to chat with deep neural network dialog models running on GPU; the BRL models trained for this study are available live at https://neural.chat/rl. Through this platform we collected human responses to a set of over 40 different dialog models over the course of several months. Using our Way Off-Policy algorithm, we are able to effectively learn from this batch of data, in spite of the fact that it was generated with a vastly different set of model architectures, which were trained on different datasets. Further, we use the batch to learn from many different reward functions designed post-hoc to extract implicit human preferences, something that is only possible with effective off-policy BRL.

2 Related Work

The approach we propose is based on KL-control, a branch of stochastic optimal control (SOC) (stengel1986stochastic) where the Kullback-Leibler (KL) divergence from some distribution is used to regularize an RL policy (e.g. abdolmaleki2018maximum ; kappen2012optimal ; rawlik2012stochastic ; todorov2006linearly ). Well-known examples, such as Trust Region Policy Optimization (TRPO) slmja-trpo-15 , use conservative, KL-regularized policy updates to restrict the RL algorithm to stay close to its own prior policy (e.g. haarnoja2018soft ; kakade2002natural ; peters2010relative ; rawlik2012stochastic ). KL-control can also be applied to entropy maximization (e.g. ziebart2008maximum ); for example, $G$-learning penalizes KL-divergence from a simple uniform distribution in order to cope with overestimation of $Q$-values fox2016taming . Soft $Q$-learning motivates using a Boltzmann distribution in the value function as a way of performing maximum entropy RL haarnoja2017reinforcement . KL-control has also been used to improve transfer learning between maximum likelihood estimation (MLE) training on data, and training with RL jaques2017sequence . To the best of our knowledge, our work is the first to propose KL-control as a way of improving off-policy learning without exploration in a BRL setting.

Other strategies to improve off-policy learning have been proposed, although many focus on scenarios where the policy is able to explore and collect more data (e.g. degris2012off ; riedmiller2005neural ). In the deep RL setting, policy gradients can be corrected to account for the difference in the distribution of states visited under the original policy and the learned off-policy algorithm liu2019off . Covariate-shift-based methods have been adapted to the off-policy deep RL setting to deal with the issue of value divergence gelada2019off . Normalized feature representations have been proposed as an alternative approach bhatt2019crossnorm . Batch Constrained Q-learning (BCQ) fujimoto2018off tackles off-policy batch learning in continuous action domains by training a generative model of the batch, $G_w = p(a|s)$, sampling from this model, and selecting the best action based on a $Q$-estimate. This approach fails to integrate information about the distribution $p(a|s)$ directly into the policy, and cannot scale to scenarios in which the state-action space is large, and the amount of available batch data is too small to train $G_w$. Many works from off-policy policy evaluation use importance sampling or model estimation to investigate the problem of estimating the performance of a policy given a batch of off-policy data (e.g. farajtabar2018more ; jiang2016doubly ; precup2000eligibility ; thomas2016data ). Effective off-policy learning gives us the ability to learn from many different rewards post-hoc, something that could potentially improve techniques which use the relabeling trick (e.g. kaelbling1993learning ; andrychowicz2017hindsight ).

We propose using dropout to approximate model uncertainty of the target $Q$-network. The idea of using dropout to estimate uncertainty in neural networks was first proposed by Gal and colleagues (2016) gal2016dropout . Different forms of uncertainty estimates have been used in RL (e.g. kahn2017uncertainty ; osband2016deep ); for example, Bayesian uncertainty estimates have been proposed as an alternative to double DQN azizzadenesheli2018efficient .

Improving dialog systems with RL has largely been restricted to task-oriented dialog systems, which have a limited number of task-specific actions (e.g. fatemi2016policy ; gavsic2011line ; liu2017iterative ; liu2018dialogue ; su2017sample ). These approaches may incorporate human input, usually through explicit, manual feedback (e.g. shah2018bootstrapping ), but sometimes with more implicit signals, such as the user interrupting the system or starting over shi2018sentiment . Attempts to expand RL to the open-domain dialog setting are less numerous. Even in this setting, authors may choose to use a highly restricted action space; for example, using RL to choose which scripted or MLE dialog model to invoke to answer a user’s query serban2017deep . Early attempts to apply deep RL to the full vocabulary-sized action space relied mainly on hand-crafted rewards that described qualities of the generated text, such as ease of answering li2016deep . This approach has been extended to use a discriminator trained to distinguish human from generated text as a reward function li2017adversarial ; li2018dialogue . While some work has incorporated implicit signals such as sentiment hancock2019learning and conversation length zhou2018design in MLE systems, the idea of using such signals as a reward for RL is relatively unexplored. Shin and colleagues use on-policy learning in conjunction with a user-sentiment approximator to improve a seq2seq model shin2019happybot , but are unable to learn directly from user feedback. To the best of our knowledge, we are the first to use batch RL to train hierarchical open-domain dialog models on implicit cues gained from real human interactions.

3 Methods

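We employ typical RL notation in which $s_t$ represents the environment state at time $t$, the agent takes action $a_t$ according to its policy $\pi(a_t|s_t)$, and receives a reward $r(s_t, a_t)$. The agent’s goal is to maximize reward over an episode trajectory $\tau$, with a discount factor of $\gamma$ applied to future rewards. $Q$-learning methods learn an action-value estimate of the total expected discounted future reward, $Q_\pi(a_t, s_t) = \mathbb{E}_\pi\left[\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a_{t'})\right]$, through iterative updates based on the Bellman equation:

$$Q_{\theta_\pi}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p(\cdot|s_t, a_t)}\left[\max_{a_{t+1}} Q_{\theta_T}(s_{t+1}, a_{t+1})\right] \tag{1}$$

In deep $Q$-learning (dqn), a $Q$-network approximates $Q_{\theta_\pi}(s_t, a_t)$ and drives the policy $\pi$. A second target $Q$-network approximates the expected reward from the next state, $Q_{\theta_T}(s_{t+1}, a_{t+1})$ – a standard practice for alleviating overestimation bias van2016deep .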

To perform batch $Q$-learning, we first pre-train a generative model of $p(a|s)$ using a set of known environment trajectories. In our case, this model is then used to generate the batch data via human interaction. The weights of the $Q$-network and target $Q$-network are initialized from the pre-trained model, which helps reduce variance in the $Q$-estimates and works to combat overestimation bias. To train $Q_{\theta_\pi}$ we sample $\langle s_t, a_t, r_t, s_{t+1} \rangle$ tuples from the batch, and update the weights of the $Q$-network to approximate Eq. 1. This forms our baseline model, which we call Batch Q.
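As a rough illustration of the Batch Q baseline, the sketch below applies the Eq. 1 update to tuples drawn from a fixed batch, with no environment interaction. It is not the authors' implementation: a small tabular $Q$-function stands in for the pre-trained networks, and the function name `batch_q_update`, the `done` flag, the learning rate, and the toy batch are all assumptions made for the example.

```python
import random
from collections import defaultdict

def batch_q_update(q, q_target, batch, actions, gamma=0.99, lr=0.1, steps=200, seed=0):
    """Tabular stand-in for Batch Q: move q towards the Bellman target of Eq. 1
    using only tuples sampled from a fixed batch (no exploration)."""
    rng = random.Random(seed)
    for _ in range(steps):
        s, a, r, s_next, done = rng.choice(batch)
        # The future value comes from the separate target table, with a hard max.
        future = 0.0 if done else max(q_target[(s_next, b)] for b in actions)
        target = r + gamma * future
        q[(s, a)] += lr * (target - q[(s, a)])
    return q

# Hypothetical toy batch of (state, action, reward, next_state, done) tuples.
actions = ["hello", "bye"]
batch = [
    ("start", "hello", 1.0, "mid", False),
    ("mid", "bye", 0.5, "end", True),
]
q = defaultdict(float)
q_target = defaultdict(float)
for _ in range(20):
    batch_q_update(q, q_target, batch, actions)
    q_target = defaultdict(float, q)  # periodically sync the target table
```

In this toy case the unseen pair `("mid", "hello")` keeps its initial value of zero, so the hard max is harmless. With noisy function approximation the same max is what produces the overestimation discussed in Section 3.1.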

3.1 Dropout for uncertainty estimation of Target $Q$-values

Overestimation of $Q$-values becomes particularly problematic in the batch setting. The $Q$ estimates for state-action pairs which are not well covered in the batch will be noisy, and this variance will lead the max operator in Eq. 1 to overestimate the value of these states. This drives the model to value regions of the state-action space for which it has no data to learn a reasonable policy, and no ability to explore to refine its estimates. Clipped Double $Q$-learning fujimoto2018addressing addresses the overestimation problem by maintaining two independent pairs of $Q$-networks, and taking the minimum of their estimates of future reward. This approach is computationally expensive and memory intensive. Further, if following a transfer learning approach where the $Q$-network is initialized from a pre-trained MLE model (as we do in this paper), it is not clear how to obtain multiple independent target $Q$-networks.

Instead, we obtain a distribution over predictions from a single target $Q$-network trained with dropout, and take the lower bound of these to reduce overestimation bias. It has been shown that dropout approximates Bayesian uncertainty for neural networks, by assuming the weights $W$ of the network are drawn from a Gaussian prior, $W \sim \mathcal{N}(0, I)$, and using variational inference to estimate the posterior distribution gal2016dropout . We perform dropout during both training and inference before each weight layer, and approximate the posterior such that the dropout distribution $q_w$ is a mixture of Gaussians, and the KL divergence between $q_w$ and the true posterior over the weights is minimized. Given the target $Q$-network $Q_{\theta_T}$, we compute $Q(a_{t+1}, s_{t+1})$ using a Monte Carlo (MC) estimate of the lower-bound of $Q_{\theta_T}(a_{t+1}, s_{t+1})$ by running $M$ stochastic forward passes of the network, each with a new dropout mask $d_i \sim q_w$:

$$Q(a_{t+1}, s_{t+1}) = \min_{i=1 \dots M}\left[Q_{\theta_T}(a_{t+1}, s_{t+1}; d_i)\right] \tag{2}$$


Using the minimum operator penalizes high variance estimates, essentially leading the algorithm to be pessimistic in the face of uncertainty, rather than optimistic. Such a bias will push the model to favour actions that lead to states well covered by the batch data fujimoto2018off . We evaluate the performance of this approach using a second baseline model, Batch Q MC.
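The lower-bound estimate can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: a single linear layer with dropout on its inputs stands in for the target $Q$-network, and the names `q_forward` and `mc_lower_bound_q`, the dropout rate, and the toy weights are hypothetical.

```python
import random

def q_forward(features, weights, drop_p, rng):
    """One stochastic forward pass: apply a fresh dropout mask to the input
    features, then a linear layer producing one Q-value per action."""
    keep = 1.0 - drop_p
    masked = [f / keep if rng.random() < keep else 0.0 for f in features]
    return [sum(w * x for w, x in zip(row, masked)) for row in weights]

def mc_lower_bound_q(features, weights, drop_p=0.2, num_passes=5, seed=0):
    """Per-action minimum of the Q-values over num_passes dropout masks,
    i.e. a pessimistic Monte Carlo estimate of the target Q-values."""
    rng = random.Random(seed)
    passes = [q_forward(features, weights, drop_p, rng) for _ in range(num_passes)]
    return [min(qs) for qs in zip(*passes)]

# Hypothetical state features and a two-action linear Q-head.
features = [1.0, 2.0, 0.5]
weights = [[0.5, 0.1, 0.2],   # action 0
           [0.3, 0.4, 0.1]]   # action 1
pessimistic_q = mc_lower_bound_q(features, weights, drop_p=0.2, num_passes=5)
```

Only one set of weights is kept in memory; the extra cost is the repeated forward passes. This is the efficiency argument made above relative to maintaining two independent pairs of $Q$-networks.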

3.2 Discrete Batch Constrained $Q$

Batch Constrained Q-learning (BCQ) fujimoto2018off proposes to address the BRL problem by constraining the actions of the $Q$-network to be close to the data contained within the batch. This is accomplished by learning a generative model of the batch, $G_w = p(a|s)$, and sampling from this model during learning and inference. Because BCQ is designed for continuous action domains, it applies a learned perturbation model $\xi_\phi(s, a; \Phi)$ which is allowed to alter the action within the range $[-\Phi, \Phi]$. BCQ learns $Q$-estimates that incorporate the perturbation model, $Q_\theta(s, a + \xi_\phi(s, a; \Phi))$. To act, $n$ possible actions are sampled from the generative model, $\{a_i \sim G_w(s)\}_{i=1}^{n}$, perturbed, and the action with the maximum $Q$-value is selected, giving the BCQ policy:

$$\pi_{BCQ}(s) = \operatorname*{arg\,max}_{a_i + \xi_\phi(s, a_i; \Phi)} Q_\theta(s, a_i + \xi_\phi(s, a_i; \Phi)) \tag{3}$$


We focus on the scenario where a model of $p(a|s)$ can be obtained through MLE training on data of known action sequences. This prior model provides a more robust estimate of $p(a|s)$ than one learned from the batch data, assuming the size of the batch is small relative to unsupervised data related to the problem (e.g. when the batch comes from human interaction data). We propose an adaptation of BCQ to discrete action spaces (DBCQ) which leverages such a strong pre-trained prior model as an improved version of $G_w$. Since BCQ relies on Clipped Double $Q$-learning fujimoto2018addressing , here we use dropout-based uncertainty estimates as in Eq. 2. Because the action space is discrete, we do not use a perturbation model to modify actions, but instead define the DBCQ policy as:

$$\pi_{DBCQ}(s) = \operatorname*{arg\,max}_{a \sim p(a|s)} Q_{\theta_T}(s, a) \tag{4}$$
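A minimal sketch of the discrete policy just described: candidate actions are sampled from the prior $p(a|s)$ and the one with the highest $Q$-value is kept. The function name `dbcq_action`, the number of samples, and the toy vocabulary are assumptions for illustration, and the $Q$-values are given as a plain dictionary rather than computed by a network.

```python
import random

def dbcq_action(prior_probs, q_values, num_samples=5, seed=0):
    """Discrete BCQ sketch: sample candidate actions from the prior p(a|s),
    then return the candidate with the highest (target) Q-value."""
    rng = random.Random(seed)
    actions = list(prior_probs)
    weights = [prior_probs[a] for a in actions]
    candidates = rng.choices(actions, weights=weights, k=num_samples)
    return max(candidates, key=lambda a: q_values[a])

# Hypothetical next-token prior and Q-values for a single state.
prior = {"the": 0.6, "a": 0.4, "zzz": 0.0}
q_vals = {"the": 0.1, "a": 0.3, "zzz": 9.0}
chosen = dbcq_action(prior, q_vals, num_samples=200)
```

Because "zzz" has zero prior probability it is never proposed, even though its (overestimated) $Q$-value is the largest. The prior only filters candidates here; it does not change the $Q$-values themselves.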


3.3 KL Control from pre-trained prior

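Rather than simply sample from the prior, we would like the $Q$-learning algorithm to directly incorporate the prior into the policy. Thus, we use KL-control to penalize divergence between the prior $p(a|s)$, and the $Q$-network policy $\pi_\theta$, while still maximizing reward. Given a trajectory of actions, $\tau = [a_1, a_2, \dots, a_{t-1}]$, let $q(\tau) = \prod_{t=1}^{T} \pi_\theta(a_t|s_t)$ be the policy of our $Q$-learning algorithm at the trajectory level. Similarly, let $p(\tau) = \prod_{t=1}^{T} p(a_t|s_t)$ be the prior distribution over the trajectory, and $r(\tau)$ be the rewards. We seek to maximize the following KL-regularized objective:

$$L(q) = \mathbb{E}_{q(\tau)}[r(\tau)]/c - D_{KL}[q(\tau)\,\|\,p(\tau)] \tag{5}$$

Since $D_{KL}[q\,\|\,p] = \sum_x q(x)(\log q(x) - \log p(x))$, we can see that this is equivalent to maximizing the following expected value function of the policy $\pi_\theta$ at the action level:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'})/c + \log p(a_{t'}|s_{t'}) - \log \pi(a_{t'}|s_{t'})\right] \tag{6}$$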
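The step from the trajectory-level objective to the action-level value function can be written out as follows. This is a sketch that assumes the trajectory reward decomposes as a sum of per-step rewards, $r(\tau) = \sum_t r(s_t, a_t)$, and that both the policy and the prior factorize over time steps as products of per-step probabilities.

```latex
\begin{align*}
L(q) &= \frac{1}{c}\,\mathbb{E}_{q(\tau)}[r(\tau)] - D_{KL}[q(\tau)\,\|\,p(\tau)] \\
     &= \mathbb{E}_{q(\tau)}\left[\frac{r(\tau)}{c} - \log q(\tau) + \log p(\tau)\right] \\
     &= \mathbb{E}_{q(\tau)}\left[\sum_{t=1}^{T} \frac{r(s_t, a_t)}{c}
        + \log p(a_t|s_t) - \log \pi_\theta(a_t|s_t)\right]
\end{align*}
```

The second line uses the definition of the KL divergence as an expectation under $q$; the third uses the fact that the logarithm of a product is a sum of logarithms.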


The two terms we have introduced in Eq. 6 have clear motivations. The $\log p(a|s)$ term rewards the model for choosing actions that have high probability under the prior, biasing the model to state-action pairs that are realistic, and likely to be in the batch. The $-\log \pi(a|s)$ term is analogous to entropy regularization. Maintaining diversity in the action space through entropy regularization is important for generative models like dialog systems, which are known to collapse to an uninteresting, small number of repeated samples li2016dialogue . Re-stating Eq. 6 as an entropy-regularized $Q$-function, we obtain:

$$Q(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'})/c + \log p(a_{t'}|s_{t'}) + \mathcal{H}(\pi(\cdot|s_{t'}))\right] \tag{7}$$


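Motivated by energy-based models of the form $\pi(a_t|s_t) \propto \exp(-\mathcal{E}(s_t, a_t))$, one can derive a soft version of the entropy-regularized $Q$-function that uses a Boltzmann distribution to estimate future reward haarnoja2017reinforcement . We refer to it as a $\Psi$-function following previous work jaques2017sequence , which derived this function as a generalization of the $\Psi$-learning proposed by rawlik2012stochastic . The optimal $\Psi$-function and policy are:

$$\Psi^*(s_t, a_t) = r(s_t, a_t)/c + \log p(a_t|s_t) + \gamma \log \sum_{a'} \exp(\Psi^*(s_{t+1}, a')) \tag{8}$$

$$\pi^*_{\Psi}(a_t|s_t) = \exp(\Psi^*(s_t, a_t)) \tag{9}$$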


Because it avoids taking a hard max over noisy estimates, $\Psi$-learning leads to less overestimation of future reward abdolmaleki2018maximum ; haarnoja2017reinforcement . This leads to more stable TD updates and aids learning. Thus, we argue it will be especially useful in the BRL setting for reducing optimism in the face of uncertainty.
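To make the soft update concrete, the sketch below computes a $\Psi$-learning style target in which the hard max over next-state values is replaced by a log-sum-exp, and the prior enters through a $\log p(a|s)$ bonus. The function names, default coefficients, and the `done` flag are assumptions for the example. Normalizing the policy with a softmax is also an added assumption, made so that the toy values yield valid probabilities.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def psi_target(reward, log_prior, next_psi, c=1.0, gamma=0.99, done=False):
    """Soft target: r/c + log p(a|s) + gamma * log sum_a' exp(Psi(s', a'))."""
    future = 0.0 if done else gamma * logsumexp(next_psi)
    return reward / c + log_prior + future

def psi_policy(psi_values):
    """Turn Psi-values for one state into action probabilities.
    Assumption: exp(Psi) is renormalized so the probabilities sum to one."""
    z = logsumexp(psi_values)
    return [math.exp(v - z) for v in psi_values]

# Hypothetical numbers: reward 1.0, prior probability 0.5 for the chosen token.
target = psi_target(1.0, math.log(0.5), next_psi=[0.0, 0.0], c=2.0, gamma=1.0)
probs = psi_policy([1.0, 0.0, -1.0])
```

A larger `c` shrinks the task reward relative to the prior term, so the policy stays closer to the pre-trained model; a smaller `c` lets the reward dominate.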

3.4 Model averaging

Finally, we explore the setting where the data in the batch may be generated from a large variety of different models $M$ with different architectures, which each learn a different estimate of $p(a|s; M)$. We use this diversity to create a more robust prior by computing a weighted average of these models based on a normalized score $S(M)$ for each model. The score could be some measure of model quality, or simply the proportion of data in the batch that was generated with that model. Thus, we define $p_{MA}(a|s)$ as the model-averaged prior: $p_{MA}(a|s) = \sum_{M} S(M)\, p(a|s; M)$.
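The model-averaged prior is a weighted mixture of per-model action distributions. The following sketch assumes each model's distribution for a given state is available as a dictionary; the function name `model_averaged_prior`, the model names, and the scores are hypothetical.

```python
def model_averaged_prior(model_probs, scores):
    """Weighted average of per-model action distributions for one state.
    Scores are normalized to sum to one before averaging."""
    total = sum(scores[m] for m in model_probs)
    vocab = set()
    for probs in model_probs.values():
        vocab.update(probs)
    return {
        a: sum(scores[m] / total * model_probs[m].get(a, 0.0) for m in model_probs)
        for a in vocab
    }

# Hypothetical example: scores are the number of batch turns each model produced.
model_probs = {
    "model_a": {"hi": 0.8, "bye": 0.2},
    "model_b": {"hi": 0.4, "bye": 0.6},
}
scores = {"model_a": 3.0, "model_b": 1.0}
p_ma = model_averaged_prior(model_probs, scores)
```

Since each input is a probability distribution and the normalized scores sum to one, the mixture is itself a valid distribution.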

4 RL for open-domain dialog generation

Figure 1: Simplified diagram of the variational hierarchical dialog model.

In this work, we employ hierarchical seq2seq dialog models ghandeharioun2019approximating ; park2018hierarchical ; serban2016building ; serban2017hierarchical , which use three recurrent networks to generate the next utterance in a conversation (see Figure 1). The encoder RNN operates on the tokens of the next input utterance and encodes them into a representation. This is fed into a context RNN, which forms the upper level of the hierarchy – it is updated only after each utterance, rather than each token. The output of the context RNN is fed into the decoder RNN, which produces the output utterance one token at a time. Note that while transformer architectures (e.g. radford2019language ) have emerged as a powerful alternative to seq2seq models, here we choose to focus on hierarchical architectures because they give us the flexibility to extend this work to use hierarchical control in the future, by learning to optimize rewards at both the utterance and conversation level. Although we trained and tested a variety of different architectures drawing from several works ghandeharioun2019approximating ; park2018hierarchical ; serban2016building ; serban2017hierarchical , we converged on the Variational Hierarchical Recurrent Encoder Decoder (VHRED) as the most promising model serban2017hierarchical . We also apply knowledge distillation to improve the model’s ability to recognize and encode the sentiment and semantics of the conversation, as proposed by ghandeharioun2019approximating .

Applying RL to dialog generation is challenging due to the large state-action space. The model attempts to construct a response utterance by iteratively choosing an action as the next token. The number of tokens in the vocabulary of our pre-trained model is 20,000, making the action space very high-dimensional, potentially compounding the problem of overestimation and making batch learning excessively difficult. However, initializing the $Q$-networks with the weights of the pre-trained language model provides a strong prior over the appropriate word to select.

Here we consider human interaction to represent the ‘environment’. The response of a human to the bot’s utterance is used to compute a reward signal to train the model. The state of the environment constitutes all of the text in the conversation uttered so far, both by the bot and the human. The state has a hierarchical structure, marking its division into utterances, which are further divided into tokens. While the bot is constructing an utterance, it is straightforward to obtain a target $Q$-estimate of future reward using the model’s estimated $Q$-values over its own next token in the utterance. However, at the last token of the bot’s utterance, the estimated future reward must include the human’s response. Therefore, we append the human response to the conversation, feed this into the target $Q$-network, and use the estimated $Q$-values for the first token of the bot’s next utterance. All of the code for our models and RL techniques is available as open source at https://github.com/natashamjaques/neural_chat/tree/master/rl.

4.1 Learning from implicit human preferences

We would like to improve a dialog model’s ability to engage in natural conversation with a human by learning from the signals implicit in the way that the human responds. Rather than having the human manually label good performance – which we show in this work does not scale – the agent should recognize informative cues within the user’s responses, like sentiment, and the amount of time they spend chatting. Essentially, we want to create an agent that is intrinsically motivated to produce positive reactions in its human conversation partner. We design several intrinsic reward functions based on the rich, interactive content of conversation, taking inspiration from the psychology of human conversation: 1) eliciting positive sentiment and transitions from negative to positive sentiment, due to the importance of emotion to creating a sense of understanding bodie2015role ; weger2010active ; 2) eliciting longer conversations and more words typed, since this is a signal of engagement sidner2004look ; zhou2018design ; 3) eliciting laughter (counting the number of ‘ha’s in the user response), because of its importance in building solidarity hay2000functions ; 4) high semantic similarity (close distance in sentence embedding space conneau2017supervised ) between the human input and agent response, because paraphrasing and style matching are important in facilitating good conversation ireland2011language ; weger2010active ; and 5) asking questions, since this is an important active listening skill bodie2012listening . The total reward given to the agent is a combination of these, with details (and coefficients) given in the supplementary material. Note that the first 4 types of rewards depend on eliciting positive responses from a human user; we call these the implicit human reward. The 5th reward is easily exploitable by the agent itself. 
These rewards represent only an initial foray into designing good metrics of human enjoyment, and further experimentation will be needed to improve them.
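As a hedged illustration of how such rewards might be combined, the sketch below computes three of the simpler signals from Section 4.1 (words typed by the user, laughter, and whether the bot asked a question) and takes a weighted sum. The weights are placeholders rather than the coefficients from the supplementary material, the function names are hypothetical, and laughter is counted only inside standalone tokens such as "haha" (an assumption, so that words like "that" are not counted). Sentiment and sentence-embedding similarity are omitted because they need external models.

```python
import re

def count_laughter(user_text):
    """Count 'ha' repetitions inside standalone laughter tokens like 'haha'."""
    tokens = re.findall(r"\b(?:ha)+\b", user_text.lower())
    return sum(len(tok) // 2 for tok in tokens)

def combined_reward(bot_text, user_text, weights=None):
    """Weighted sum of simple reward features for one conversation turn."""
    # Placeholder weights, chosen only for the example.
    weights = weights or {"words": 0.1, "laughter": 1.0, "question": 0.5}
    features = {
        "words": len(user_text.split()),            # depends on the human
        "laughter": count_laughter(user_text),      # depends on the human
        "question": 1.0 if "?" in bot_text else 0.0,  # depends only on the bot
    }
    total = sum(weights[name] * value for name, value in features.items())
    return total, features

reward, feats = combined_reward("how are you?", "haha that is great")
```

The `question` feature is computed from the bot's own output, which is why a purely reward-maximizing agent can exploit it without eliciting any change in the human's behaviour.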

5 Experiments

To collect interactive human conversation data, we built a CUDA-capable web app that can host neural network dialog models on GPU for fast, real-time inference: https://neural.chat. The code for the server is available as open source at https://github.com/asmadotgh/neural_chat_web. We trained over 40 dialog models with different architectures (e.g. serban2017hierarchical ; serban2016building ; park2018hierarchical ; ghandeharioun2019approximating ), on different datasets (movie dialogs cornell_dataset and Reddit ghandeharioun2019approximating ). Note that these models varied significantly in terms of the distribution of language they learned. We collected a batch of data containing 14,232 pairs of user input and agent response. This batch was used to train the RL models described in Section 3, which were then re-deployed to the website. We recruited 90 Mechanical Turk workers to provide a total of 718 7-point Likert scale ratings of the bots’ quality, fluency, diversity, contingency (relatedness), and empathy, after interacting with each bot for at least 3 turns. Participants also had the option to provide explicit feedback through upvoting or downvoting a particular utterance within the interface. Note that testing these models in the wild with humans represents a more meaningful test of generalization than testing an RL model in the same limited (game) environment in which it was trained, since humans are not restricted in the text they can type to the model, and are the ultimate authority on naturalistic conversation.

6 Results

| Model type | Quality | Fluent | Diverse | Related | Empathy | Total | Votes | Human reward |
|---|---|---|---|---|---|---|---|---|
| DBCQ | 1.64 ± .29 | 1.87 ± .34 | 3.13 ± .58 | 1.84 ± .34 | 2.09 ± .38 | 10.58 ± 1.55 | -.228 | -.050 |
| Batch Q | 1.87 ± .30 | 2.36 ± .42 | 2.20 ± .41 | 1.91 ± .32 | 2.58 ± .47 | 11.91 ± 1.58 | -.163 | -.005 |
| Batch Q + MC | 1.85 ± .39 | 2.46 ± .44 | 2.46 ± .52 | 1.98 ± .39 | 2.34 ± .49 | 11.07 ± 1.82 | -.068 | .005 |
| KL-control Q | 2.38 ± .39 | 3.24 ± .47 | 3.42 ± .54 | 2.38 ± .45 | 2.56 ± .43 | 13.98 ± 1.81 | .016 | .004 |
| KL-control $\Psi$ | 2.33 ± .41 | 3.73 ± .53 | 2.82 ± .50 | 2.31 ± .44 | 3.47 ± .50 | 14.67 ± 1.82 | .128 | .061 |
| KL-control $\Psi$ MA | 2.60 ± .43 | 3.47 ± .42 | 3.00 ± .49 | 2.49 ± .44 | 2.89 ± .51 | 14.44 ± 1.96 | .127 | .042 |

Table 1: Interactive human evaluation of techniques for off-policy batch RL. KL-control models strongly out-perform other techniques. Ratings are Likert scale, votes and human reward are $z$-scores.

To compare models, we not only look at human users’ ratings and votes, but also consider the automatic signals detectable from the text itself. This implicit human reward metric aggregates the measures listed in items 1-4 in Section 4.1, and measures the ability to elicit positive responses from the human. Table 1 shows the results of the evaluation. Each of the enhancements proposed (MC estimation of target $Q$-values, $\Psi$-learning, and MA) leads to performance gains in terms of human reward, manual votes, or ratings. However, the most notable difference in performance comes from KL-control. The KL-control models show substantial gains over the baseline models [1] across both ratings and human reward. We perform a one-way analysis of variance (ANOVA) comparing the KL-control models to the Batch Q baselines and DBCQ on the total human rating score, and find that the KL-control models are significantly better. This validates the hypothesis that KL-control with a strong, pre-trained prior can be used to improve batch RL.

[1] We also compare the RL models to the prior, and see performance improvements in terms of the elicited human reward, but not in terms of the quality ratings. We believe this is because the rewards proposed here do not fully cover what it means to have a high quality conversation (as we will elucidate later), and hope that other researchers will be able to use the techniques we propose to learn from improved rewards.

[User]: hey, what’s up?
[Batch Q]: where did you get??, that sounds so exciting. how do you work for you…
you want to talk to me?… you want
[User]: yes, i want to talk to you!
[Batch Q]: where did you say to me?… you want to talk to you about you…
you are so good for you… you are so good…
[User]: thanks, i appreciate that.
[Batch Q]: where did you get??, that sounds so exciting. how do you work for you…
you want to see you…
[User]: hey, what’s up?
[KL-control]: i’m not sure, but i’m glad to see you, and
i’m so happy that you’re not alone
[User]: oh thanks, that’s very nice of you!
[KL-control]: i’m sorry, but i’m glad to see what you think.
[User]: don’t be sorry, you did a good job.
[KL-control]: that’s really nice to me.
Table 2: Purely reward-maximizing methods like Batch Q trivially exploit the reward function by asking a question every turn, and using the maximum number of tokens in every sentence. In contrast, KL-control methods output plausible language by staying close to the prior, but shift to using polite, cheerful language to maximize implicit human rewards.
Figure 2: KL-divergence of the policy from the prior is lower with KL-control throughout training. Bands show standard deviation.

Without KL-regularization, the baseline RL models diverge quickly and continuously from the prior, losing information about realistic sequences – as shown in Figure 2. This figure also helps explain the poor performance of DBCQ in Table 1. The underlying $Q$-network in DBCQ does not directly integrate the prior. As $Q$-learning causes the model to diverge from the prior, the $Q$-estimates of language generated according to the prior become unrealistic, and Eq. 4 selects unrealistic actions. This results in highly ‘diverse’ (random) generated utterances. Note that since we operate in discrete action space, we could not include the perturbation model originally proposed by fujimoto2018off , which may be critical to achieving good performance with BCQ.

Figure 3: Z-scored reward. Red metrics were used in training rewards, green are post-hoc. Traditional RL methods like Batch Q exploit simple action-based rewards, like asking questions. In contrast, KL-control methods shift their distribution towards polite, supportive, and cheerful conversation, allowing them to elicit higher human reward (blue).

The pre-trained prior may be especially important in a generative domain like dialog, where the true reward function is unknown, and so purely maximizing reward may actually lead to poorer quality conversations. Table 2 shows examples of conversations with a Batch Q and a KL-control model. Because the Batch Q model has no incentive to stay close to realistic language, it learns to exploit the reward by asking a question and outputting the maximum number of tokens (30) every utterance. These sentences contain implausible phrases that do not represent realistic language (e.g. “where did you say to me?”). In contrast, the KL-control model uses realistic language, but shifts its distribution towards cheerful and polite speech, presumably because this is what led to positive human responses in the batch data. Rather than simply cherry-picking results, we invite the reader to check for themselves; all of the models tested in this study are available at: https://neural.chat/rl.

In fact, we noticed that all models trained with the implicit human rewards described in Section 4.1 learned to use more cheerful and supportive language. Therefore, we create post-hoc metrics to measure this effect (see the supplementary material for details). Figure 3 shows how these metrics, as well as the implicit rewards, differ across models. Without KL-control, baseline methods like Batch Q exploit simple rewards like asking questions at the expense of realistic language, explaining their poor quality ratings. In contrast, KL-control models learn to rely more on realistic but polite, supportive, and cheerful dialog to elicit higher total human reward.

To understand the effect of the implicit rewards, Figure 4 shows the reward trajectory over the ten best conversations obtained with models trained with different techniques. While we see that KL-control models are able to elicit significantly higher reward than baselines, we note that KL-control Q performs best overall and in terms of words elicited, even though it had lower quality ratings in Table 1. This suggests that maximizing these rewards is not a perfect proxy for human judgments of quality. Note also that eliciting laughter is an extremely rare event, and only the KL-control models are able to do so. Finally, Figure 4 (d) shows that manual votes occur even more rarely, suggesting that explicit feedback from humans is a cumbersome and sparse reward signal.

Figure 4: Comparison of top 10 conversation trajectories observed across deployed models, 90% CI of the rewards: (a) Implicit human feedback; (b) Words elicited; (c) Laughter; (d) Manual votes.

Table 3 presents the results of models trained with only a single reward function, ordered from lowest to highest quality. Notably, extracting multiple different reward functions post-hoc from a batch of data and training on these independently is only possible with an effective BRL model. Here all models are trained with KL-control, Ψ-learning, and MC targets. Investigating which rewards presented in Section 4.1 are most critical to achieving high-quality conversations with humans, we note that maximizing positive and minimizing negative sentiment in the user turns out to lead to the highest quality bot. This underscores the importance of affective signals as cues for good conversation. Bots trained on the manual upvotes and downvotes provided by users on the utterance level fail to achieve similarly high performance. Even though users were instructed to make use of the vote feature, the task is burdensome, and users did not vote frequently enough to provide a good training signal. This validates the hypothesis that implicit signals of human enjoyment (such as sentiment) are a more scalable way to learn from human preferences.

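| Reward function | Quality | Fluent | Diverse | Related | Empathy | Total | Votes | Human reward |
|---|---|---|---|---|---|---|---|---|
| Conv. len. | 2.20 ± .40 | 3.61 ± .53 | 3.02 ± .52 | 2.25 ± .46 | 2.48 ± .45 | 13.57 ± 1.84 | -.035 | -.003 |
| Semantic sim. | 1.93 ± .34 | 3.50 ± .45 | 2.37 ± .45 | 2.11 ± .45 | 2.52 ± .48 | 12.43 ± 1.75 | -.020 | .012 |
| User laughter | 1.96 ± .38 | 3.56 ± .48 | 2.33 ± .51 | 1.93 ± .42 | 3.20 ± .55 | 12.98 ± 1.60 | -.149 | -.003 |
| Words elicited | 2.11 ± .32 | 3.96 ± .44 | 3.04 ± .45 | 2.04 ± .35 | 2.55 ± .46 | 13.70 ± 1.44 | .059 | .024 |
| Manual votes | 2.14 ± .38 | 3.47 ± .45 | 2.91 ± .47 | 2.07 ± .39 | 2.42 ± .46 | 13.00 ± 1.65 | -.030 | .010 |
| Sent. trans. | 2.02 ± .31 | 3.71 ± .49 | 2.98 ± .50 | 2.04 ± .42 | 2.84 ± .48 | 13.60 ± 1.63 | .031 | .014 |
| Question | 2.29 ± .37 | 4.31 ± .50 | 3.31 ± .52 | 2.20 ± .40 | 2.60 ± .41 | 14.71 ± 1.63 | .057 | .012 |
| Sentiment | 2.47 ± .32 | 4.05 ± .45 | 3.23 ± .46 | 2.42 ± .39 | 3.23 ± .55 | 15.40 ± 1.49 | .085 | .045 |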
Table 3: Interactive human evaluation of different reward functions (models trained with KL-control)

7 Conclusion

This paper presents a series of techniques which improve performance when learning off-policy without the possibility to explore – i.e. batch RL (BRL). Most significantly, we are the first to propose using KL-control from a strong prior model pre-trained on data as a way to avoid overestimation and instability in BRL. Our results demonstrate that KL-control is critical to achieving good performance in this setting. In a generative domain such as dialog, the true reward function is not known, and trivially exploiting the rewards can actually lead to worse performance. Thus, KL-control may be particularly necessary to ensure samples remain realistic and close to the data distribution. We propose several reward functions that could allow an open-domain dialog generation model to learn from rich cues implicit in human interaction, where learning from expressed sentiment was most promising. While these rewards are far from perfect or complete, we see that maximizing implicit rewards leads to better performance than relying on explicit feedback. We hope that the techniques presented here will allow other researchers to leverage BRL for learning from human interaction data, and spur the development of even better rewards for capturing human preferences.


We would like to thank Scott Fujimoto for insightful email correspondence on this topic, approval of the DBCQ algorithm, and suggestion to apply model averaging. We also thank Max Kleiman-Weiner, Ardavan Saeedi, Sebastian Zepf, Sara Taylor, Oliver Saunders Wilder, Kyle Kastner, and Kristy Johnson for their helpful discussions about this project, and many others for helping test-drive our bots.

We thank the MIT Quest for Intelligence, the MIT Stephen A. Schwarzman College of Computing, and the Machine Learning Across Disciplines Challenge for providing computing resources, and the MIT Media Lab Consortium for supporting this research.


8 Appendix

8.1 Details about implicit metrics

8.1.1 Sentiment-based

To compute sentiment on short texts like conversation utterances, we leverage a state-of-the-art sentiment-detection model, which was trained on a massive amount of Twitter data to predict the emojis in tweets [13]. Transfer learning from this model to other tasks showed that it was able to significantly outperform prior approaches on a series of sentiment, irony, and sarcasm benchmarks. This DeepMoji model outputs a probability distribution over the 64 most frequently used emojis, as shown in Figure 5 (a). After observing the performance of the model in detecting users’ emotions in the domain of online chat, we define a set of weights over the emojis and calculate the weighted sum over an emotion embedding vector to derive a sentiment reward which is higher for positive sentiment and lower for negative sentiment. These weights are shown in Figure 5 (b). We also compute a sentiment-transition reward using the same score based on whether the peak positive sentiment occurred later in the conversation than the peak negative sentiment, reasoning that sentiment should improve over the course of the conversation.

Figure 5: (a) The 64 most frequent emojis, as predicted by [13], used for calculating emotion embeddings. (b) Assigned weights used in producing the sentiment reward from the predicted emoji values.
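The two sentiment-based rewards described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the emoji weights must be supplied by the caller (the actual weights appear only in Figure 5 (b)), and returning a boolean for the transition check is an assumption, since the text does not say how that check is converted into a reward value.

```python
from typing import Sequence


def sentiment_reward(emoji_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of predicted emoji probabilities (hypothetical helper).

    emoji_probs: probability per emoji, e.g. the 64 outputs of an emoji classifier.
    weights: one hand-assigned weight per emoji (positive for positive sentiment).
    """
    if len(emoji_probs) != len(weights):
        raise ValueError("need exactly one weight per emoji")
    return sum(p * w for p, w in zip(emoji_probs, weights))


def sentiment_improved(scores: Sequence[float]) -> bool:
    """True if the peak positive score occurs later than the peak negative score.

    scores: per-utterance sentiment rewards, in conversation order.
    """
    peak_pos = max(range(len(scores)), key=lambda i: scores[i])
    peak_neg = min(range(len(scores)), key=lambda i: scores[i])
    return peak_pos > peak_neg
```

With a toy four-emoji vocabulary, `sentiment_reward([0.7, 0.1, 0.1, 0.1], [1.0, 0.5, -0.5, -1.0])` is approximately 0.6, and `sentiment_improved([-0.4, 0.1, 0.6])` is `True` because the conversation ends on its most positive utterance.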

8.1.2 Engagement-based

Based on prior work [64], we use the number of turns in the conversation as an indicator of the quality of the bot’s performance. To distribute this reward over every utterance in the conversation, we take the total conversation length , and compute the discounted reward for utterance as . We also reward each utterance with the number of words in the user’s response, which we refer to as the words elicited.

8.1.3 Laughter

Laughter has been shown to be very important to human affiliation [46] and solidarity [24]. Therefore, we detect the number of occurrences of the string ‘ha’ in the user’s response, and use this as a reward. Interestingly, we find that bots trained to maximize user laughter learn to be extremely supportive and cheerful compared to other bots (for definitions of supportive and cheerful, see Section 8.1.7).
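A literal reading of this reward is a substring count, sketched below. The function name is hypothetical and lowercasing the response first is an assumption; the text only says that occurrences of the string ‘ha’ are counted.

```python
def laughter_reward(user_response: str) -> int:
    """Count non-overlapping occurrences of 'ha' in the user's response.

    Lowercasing is an assumption; the source does not state case handling.
    """
    return user_response.lower().count("ha")
```

Because this is a plain substring match, ordinary words also contribute: `laughter_reward("hahaha that is funny")` returns 4, three from "hahaha" and one from "that". A stricter detector would need word-boundary handling, which the text does not describe.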

8.1.4 Semantic similarity

Language style matching has been shown to be a strong predictor of relationship initiation and stability [27]. While it would be ideal if our chatbots could intelligently adapt their conversation style to a new user, in reality most baseline dialog models struggle to maintain topic coherence, even over a few utterances (for an analysis of this effect, see [20]). Therefore we reward semantic similarity between the user’s input and the bot’s response, to encourage the bot to stay on topic and produce reasonable answers. This score is computed by leveraging a state-of-the-art sentence embedding model [8], and penalizing distance in embedding space.
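One common way to score closeness in embedding space is cosine similarity, sketched below in plain Python. The choice of cosine similarity is an assumption, since the text only says that distance in embedding space is penalized, and the function name is hypothetical. The vectors stand in for sentence embeddings of the user's input and the bot's response.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as unrelated
    return dot / (norm_u * norm_v)
```

Under this assumption the reward is highest when the response embedding points in the same direction as the input embedding, and lowest when the two point in opposite directions.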

8.1.5 Questions

Asking questions is an important listening skill, and is linked to conversation management, attentiveness, and responsiveness [5]. Therefore, we give the bot a reward of 0.5 if the utterance contains a question word (how, what, where, why, when, who), and an additional 0.5 if it contains a question mark.
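The rule above can be written as a short function. This is a sketch under stated assumptions: the function name is hypothetical, and matching question words case-insensitively as whole words is an assumption, since the text does not say how the words are detected.

```python
import re

QUESTION_WORDS = {"how", "what", "where", "why", "when", "who"}


def question_reward(utterance: str) -> float:
    """0.5 for containing a question word, plus 0.5 for a question mark."""
    # Whole-word, case-insensitive matching is an assumption.
    tokens = re.findall(r"[a-z]+", utterance.lower())
    reward = 0.0
    if any(token in QUESTION_WORDS for token in tokens):
        reward += 0.5
    if "?" in utterance:
        reward += 0.5
    return reward
```

For example, `question_reward("How are you?")` returns 1.0, `question_reward("I know what you mean.")` returns 0.5, and `question_reward("Nice.")` returns 0.0.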

8.1.6 Total reward equation

The total reward used to train the bots is a combination of the above rewards, in the following proportions:

0.15682657*question + 0.13837638*semantic_coherence + 0.15313653*laughter + 0.14206642*sentiment_transition + 0.14206642*sentiment + 0.14760148*words_elicited + 0.1199262*conversation_length.
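The combination above is a weighted sum whose coefficients add up to 1. A minimal sketch follows; the weights are copied from the equation, while the function name and the dictionary layout are hypothetical. Whether each component is normalized before being combined is not stated here, so the sketch simply uses the values it is given.

```python
REWARD_WEIGHTS = {
    "question": 0.15682657,
    "semantic_coherence": 0.13837638,
    "laughter": 0.15313653,
    "sentiment_transition": 0.14206642,
    "sentiment": 0.14206642,
    "words_elicited": 0.14760148,
    "conversation_length": 0.1199262,
}


def total_reward(components: dict) -> float:
    """Weighted sum of the per-utterance reward components.

    components: maps each key in REWARD_WEIGHTS to that reward's value.
    """
    return sum(weight * components[name] for name, weight in REWARD_WEIGHTS.items())
```

If every component equals 1, the total is 1 up to floating-point rounding, which makes the weights easy to read as proportions.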

8.1.7 Post-hoc metrics

After training the bots on these rewards, we noticed a shift in the distribution of their language towards more polite, cheerful, and supportive speech. Therefore, we designed post-hoc metrics to measure these qualities, which are based on counting whether a subset of phrases is present in an utterance.

Politeness phrases: if I may; may I; please; thanks; no worries; if you don’t mind; have a great day; I’m sorry.

Supportive phrases: you’re right; you are right; you’re not alone; you are not alone; congrats; that’s a good idea; that is a good idea; you’ll be fine; you will be fine; you’ll be okay; you will be okay; it will get better; sorry you’re going through; sorry you are going through; if it makes you feel better; if it makes you feel any better; keep your head up; keep it up; I’m in a similar situation; I am in a similar situation; you’ll get it; you will get it; happy for you; I’m in the same boat; I am in the same boat; if you feel like you need to vent.

Cheerful phrases: nice to hear; happy; excited; really nice; glad; the best; great; good time; looking forward; beautiful.
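A phrase-based metric of this kind can be sketched as follows. The function name is hypothetical, and case-insensitive substring matching with normalized apostrophes is an assumption, since the text only says the metrics count whether phrases are present. Only the politeness and cheerful lists are reproduced; the supportive list would be handled the same way.

```python
POLITENESS_PHRASES = [
    "if i may", "may i", "please", "thanks", "no worries",
    "if you don't mind", "have a great day", "i'm sorry",
]
CHEERFUL_PHRASES = [
    "nice to hear", "happy", "excited", "really nice", "glad",
    "the best", "great", "good time", "looking forward", "beautiful",
]


def phrase_count(utterance: str, phrases: list) -> int:
    """Number of listed phrases that appear in the utterance."""
    # Lowercase and map the typographic apostrophe to a plain one before matching.
    text = utterance.lower().replace("\u2019", "'")
    return sum(1 for phrase in phrases if phrase in text)
```

For instance, `phrase_count("Thanks, have a great day!", POLITENESS_PHRASES)` returns 2, and the same utterance scores 1 against `CHEERFUL_PHRASES` because it contains "great".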

8.2 Training details and hyperparameters

RL models were trained for between 800 and 1000 batches of data, where the batch size was fixed at 32. Early stopping was used to determine the number of training iterations of the best checkpoint. All other hyperparameters were shared between RL models, and were as follows: discount , weight placed on RL reward vs. KL-divergence term , number of Monte Carlo samples of the Target Q-network , target network update rate , learning rate . We used a smooth loss function to approximate the Q-values, and clipped gradients at a value of .

The underlying parameters of the VHRED model were as follows: Context RNN hidden size , decoder hidden size , encoder hidden size , embedding size , gradient clip , dropout . The maximum conversation length was fixed at 5 utterances (context from more than 5 utterances ago was discarded), and the maximum sentence length was 30 tokens.

We also added layers to the Context RNN and regularized it to be able to predict the semantic content of the input utterance using a form of knowledge distillation [25] from a state-of-the-art sentence-embedding model [8]. There were 2 additional feedforward semantic prediction layers of size 128, which used ReLU activation.

8.3 Additional results

Figure 6 shows the normalized reward scores obtained by bots trained with respect to different rewards. While some bots (such as those trained to ask questions or elicit positive sentiment) effectively generalize to new users, we see that others (e.g. words elicited) are not actually able to best elicit those responses in the wild. We hypothesize this is because the relatively small size of batch data we were able to collect ( utterances) does not give these bots enough information about how to elicit long responses from users.

Figure 6: Normalized reward scores obtained by models trained with respect to different rewards. We see that the bot trained to ask questions is easily able to exploit this reward, and similarly the bot trained to elicit positive sentiment does so successfully. For the rest of the bots, the relationship is less clear. For example, the bot trained to elicit laughter becomes the most supportive and cheerful, while the bot trained to elicit more words is very polite.

8.4 Interactive bot platform details

To collect data from humans interacting with our bots, we built https://neural.chat, a platform for hosting deep neural network dialog models online on GPU for fast, real-time inference. Figure 7 shows an example of the interface, in which users are able to rate the bots after talking to them for at least three turns.

Figure 7: Interactive evaluation ratings page available at https://neural.chat.

Figure 8 is an example conversation within the platform that interactive evaluation participants see. Annotators can optionally click the up and down arrows beside each chatbot response to give feedback on the specific utterance. Once 3 or more turns of the conversation have taken place, participants may click "Close Chat and Rate" to get to the rating screen.

Figure 8: Interactive evaluation chat interface.

8.4.1 Website server setup and configuration

The server was hosted on a Google Cloud Platform virtual instance with 64GB of RAM and an NVIDIA Tesla P100 graphics card. The backend was a Django program being served by NGINX and uWSGI. For simplicity, we opted to have the Django process import the chatbots into the same Python process as Django, rather than have the two connect to each other via other means such as sockets. This configuration decreased development time and increased reliability, but it would need to be revisited if the server needed to scale several orders of magnitude past what was required for this study. The current configuration was still able to support hundreds of simultaneous users and host more than 30 bots concurrently.

The chatbots were kept in a separate project from the Django project and maintained separately from the server code. Each chatbot extended an abstract class that defined key methods for the Django program to use, and was registered to a globally accessible dictionary via a decorator. The Django project was provided the path to the Chatbots project in its PYTHONPATH, so it could import the dictionary in which all the chatbot objects had been registered and use that to dynamically determine which chatbots were available and to access them in its views.
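The registration pattern described above might look roughly like the following. This is a sketch, not the code from the actual repository: the names `CHATBOTS`, `Chatbot`, `register_chatbot`, `handle_messages`, and `EchoBot` are all hypothetical, chosen only to illustrate an abstract base class combined with a decorator that fills a module-level dictionary.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

# Module-level registry that the web server could import (hypothetical name).
CHATBOTS: Dict[str, "Chatbot"] = {}


class Chatbot(ABC):
    """Interface every bot implements so the server can call it uniformly."""

    @abstractmethod
    def handle_messages(self, messages: List[str]) -> str:
        """Return the bot's reply given the conversation so far."""


def register_chatbot(name: str) -> Callable[[Type[Chatbot]], Type[Chatbot]]:
    """Class decorator that instantiates a bot once and stores it under `name`."""

    def decorator(cls: Type[Chatbot]) -> Type[Chatbot]:
        CHATBOTS[name] = cls()  # instantiated at import time and kept in memory
        return cls

    return decorator


@register_chatbot("echo")
class EchoBot(Chatbot):
    """Toy bot used only to show registration; it repeats the last message."""

    def handle_messages(self, messages: List[str]) -> str:
        return messages[-1] if messages else ""
```

A view could then look up a bot by name, for example `CHATBOTS["echo"].handle_messages(["hi"])`, without knowing which bot classes exist. Instantiating inside the decorator mirrors the idea of paying the startup cost once when the module is imported.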

It is important to note that the chatbots used PyCUDA, and PyCUDA does not work in a multiprocessing environment. Because of this, uWSGI needed to be configured to only have one Python process and to disable any attempt at multiprocessing. Furthermore, the chatbots required substantial startup times, so all chatbots were kept in memory at all times in the Django process. In order to keep all the chatbots in memory concurrently, we needed a very high amount of RAM on our server and opted for a 64GB virtual instance, and a GPU with 16GB RAM. This combination of CUDA to run the chatbots on the GPU with a high amount of RAM to keep all bots in memory at the same time resulted in incredibly fast server response times, with effectively no increase in response time when using the bots in requests compared to requests that did not.

For further information and instructions on server configuration, please read the server documentation available at https://github.com/asmadotgh/neural_chat_web. We hope that this platform will allow others to host their own bots and evaluate them in an interactive setting.