Ensemble-Based Deep Reinforcement Learning for Chatbots

08/27/2019 ∙ by Heriberto Cuayáhuitl, et al.

Trainable chatbots that exhibit fluent and human-like conversations remain a big challenge in artificial intelligence. Deep Reinforcement Learning (DRL) is promising for addressing this challenge, but its successful application remains an open question. This article describes a novel ensemble-based approach applied to value-based DRL chatbots, which use finite action sets as a form of meaning representation. In our approach, while dialogue actions are derived from sentence clustering, the training datasets in our ensemble are derived from dialogue clustering. The latter aim to induce specialised agents that learn to interact in a particular style. In order to facilitate neural chatbot training using our proposed approach, we assume dialogue data in raw text only -- without any manually-labelled data. Experimental results using chitchat data reveal that (1) near human-like dialogue policies can be induced, (2) generalisation to unseen data is a difficult problem, and (3) training an ensemble of chatbot agents is essential for improved performance over using a single agent. In addition to evaluations using held-out data, our results are further supported by a human evaluation that rated dialogues in terms of fluency, engagingness and consistency -- which revealed that our proposed dialogue rewards strongly correlate with human judgements.




1 Introduction

Humans in general find it relatively easy to have chat-like conversations that are both coherent and engaging at the same time. While not all human chat is engaging, it is arguably coherent GroszS86 , and it can cover large vocabularies across a wide range of conversational topics. In addition, each contribution by a partner conversant may exhibit multiple sentences, such as greeting+question or acknowledgement+statement+question. The topics raised in a conversation may go back and forth without losing coherence. All of these phenomena represent big challenges for current data-driven chatbots.

We present a novel approach for chatbot training based on the reinforcement learning SuttonB2018 , unsupervised learning and deep learning LeCunBH15 paradigms. In contrast to other learning approaches for Deep Reinforcement Learning chatbots that rely on partially labelled dialogue data SerbanEtAl2018 ; LiMSJRJ16 , our approach assumes only unlabelled data. Our learning scenario is as follows: given a dataset of human-human dialogues in raw text (without any manually provided labels), an ensemble of Deep Reinforcement Learning (DRL) agents take the role of one of the two partner conversants in order to learn to select human-like sentences when exposed to both human-like and non-human-like sentences. In our learning scenario the agent-environment interactions consist of agent-data interactions – there is no user simulator as in task-oriented dialogue systems Cuayahuitl16 ; GaoGL19 . During each verbal contribution and during training, the DRL agents

  1. observe the state of the world via a recurrent neural network, which models a representation of all words raised in the conversation together with a set of candidate responses (i.e. clustered actions in our approach);

  2. select an action so that its word-based representation is sent to the environment; and

  3. receive an updated dialogue history and a numerical reward for having chosen that action, until a termination condition is met.

This process—illustrated in Figure 1—is carried out iteratively until the end of a dialogue for as many dialogues as necessary, i.e. until there is no further improvement in the agents’ performance. During each verbal contribution at test time, the agent exhibiting the highest predictive dialogue reward is selected for human-agent interactions.

Figure 1: High-level architecture of the proposed ensemble-based learning approach for chatbot training—see text for details

This article makes the following contributions to neural-based chatbots:

  1. We propose a novel approach for chatbot training using value-based Deep Reinforcement Learning, where we induce action sets automatically via unsupervised clustering. Most previous related work has used policy search methods, and value-based methods have received little attention. We identified the latter as a research gap in our literature review. Although the performance of our DRL agents drops with dialogues that the agents are not familiar with, our DRL agents indeed learn to improve their performance over time with dialogues that they get familiarised with.

  2. We propose a novel reward function due to the lack of well-embraced metrics for measuring chatbot performance. In addition, we train neural regressors for predicting dialogue rewards using a dataset of human-human dialogues that was automatically extended with noisy dialogues. While non-noisy dialogues exemplify human-like and desirable outputs, the noisy ones exemplify less desirable behaviour. This reward function is easy to implement, it strongly correlates with test human-human dialogues subject to using long dialogue histories, and it strongly correlates with human judgements.

  3. We propose a novel ensemble-based methodology for chatbot training, where each chatbot in our ensemble is trained with a set of clustered dialogues. To test our agents, we train 100 DRL chatbots with the aim of generating more context-relevant responses. Our experimental results according to automatic and human evaluations show that the ensemble of DRL agents outperforms a single DRL agent. This result is relevant for training future neural-based chatbots.

In the next two sections, 2 and 3, we review related work on neural-based chatbots and provide related background on deep reinforcement learning. Then we describe our proposed approach and methodology in section 4. This is followed by a comprehensive set of automatic and human evaluations in section 5, which use (i) a dataset of chitchat conversations, and (ii) human ratings of human-chatbot dialogues. Section 6 draws conclusions and discusses avenues for future research.

2 Related Work

Deep Reinforcement Learning Chatbots

Reinforcement Learning (RL) methods are typically based on value functions or policy search SuttonB2018 , which also applies to deep RL methods. Both types of trained agents can use the same state representations and rewards, but they differ in the representation of actions and policies. While value-based methods are typically applied to problems with discrete and finite actions, policy search methods can be applied to problems with either finite or infinite actions. In addition, while policies in value-based methods calculate numerical values (also referred to as ‘expected long-term rewards’) to model the importance of each state-action pair, policy search methods induce policies directly Li17b via the parameters of a model such as a neural network or a Gaussian process. While value-based methods have been particularly applied to task-oriented dialogue systems CasanuevaBSURTG18 ; CuayahuitlY17 ; CuayahuitlYWC17 ; WilliamsAZ17 ; PengLLGCLW17 ; HC2016iwsds , policy-based methods have been particularly applied to open-ended dialogue systems such as (chitchat) chatbots LiMSJRJ16 ; LiMSJRJ17 ; SerbanEtAl2018 . This is not surprising given that task-oriented dialogue systems use finite action sets, while chatbot systems use infinite action sets. The latter consider each sentence as an action, and consequently, the task is to induce dialogue behaviour from an infinite action set. This is extremely challenging for value-based reinforcement learning methods, which are more suitable for solving problems with finite action spaces. So far there is a preference for policy search methods for chatbots, but it is not clear whether they should be preferred because they face problems such as convergence to local rather than global optima, inefficiency, and high variance. We therefore explore the feasibility of value-based methods for chatbots with large action sets, which has not been explored before, especially not from the perspective of deriving the action sets automatically as attempted in this article.

Sequence2Sequence Chatbots

Other closely related methods to DRL include sequence to sequence models for dialogue generation VinyalsL15 ; SordoniGABJMNGD15 ; SerbanKTTZBC17 ; LiGBSGD16 ; Wang2018 ; ZhangEtAl2018 . These methods are based on Recurrent Neural Networks (RNNs) using an encoder-decoder architecture. In these methods, one RNN referred to as the ‘encoder’ computes an internal representation of the inputs, while another RNN referred to as the ‘decoder’ generates one word at a time, with all parameters (weights) trained jointly end-to-end. Given a sequence of input words $(x_1, \dots, x_T)$, an encoder-decoder computes a sequence of output words $(y_1, \dots, y_{T'})$ by iterating the following equation: $y_t = W^{yh} h_t$, where $h_t = f(W^{hx} x_t + W^{hh} h_{t-1})$, function $f$ is an activation function, and training the weights $W$ involves minimising a loss function such as categorical cross entropy SutskeverVL14 . Sequence2Sequence (Seq2Seq) methods can be combined with deep reinforcement learners by treating the policy as an encoder-decoder LiMSJRJ16 . Seq2Seq methods have also been included in ensemble-based methods together with rule-based systems and a variety of other machine learning methods SerbanEtAl2018 ; SongLNZZY18 ; PapaioannouEtAl2017 . While some of them use a single DRL agent SerbanEtAl2018 , they have not been investigated using an ensemble containing a horde of DRL-based chatbots as attempted in this article.

Reward Functions

Related work above highlights that evaluation is a difficult part, and that there is a need for better evaluation metrics.

This is further supported by LiuLSNCP16 , who found that typical metrics used to assess the quality of machine translators such as BLEU (Bilingual Evaluation Understudy) PapineniEtAl2002 and METEOR (Metric for Evaluation of Translation with Explicit ORdering) Lavie2007 amongst others do not correlate with human judgments. The dialogue rewards used by DRL agents are either specified manually depending on the application, or learnt from dialogue data. For example, LiMSJRJ16 conceives a reward function that positively rewards sentences that are easy to respond to and coherent while penalising repetitiveness. LiMSJRJ17 uses an adversarial approach, where the discriminator is trained to score human vs. non-human sentences so that the generator can use these scores during training. SerbanEtAl2018 trains a reward function from expensive and time-consuming human ratings. All these related studies are neural-based, and there is no clear best reward function to use in future (chitchat) chatbots. This motivated us to propose a new metric that is easy to implement, practical due to requiring only data in raw text, and potentially promising as described below.

This article contributes to the literature of neural-based chatbots as follows. First, our methodology for training value-based DRL agents uses only unlabelled dialogue data. Previous work requires manual extensions to the dialogue data LiMSJRJ16 or expensive and time consuming ratings for training a reward function SerbanEtAl2018 . Second, our proposed reward function strongly correlates with human judgements. Previous work has only shown moderate positive correlations between target dialogue rewards and predicted ones SerbanEtAl2018 , or rely on high-level annotations requiring external and language-dependent resources typically induced from labelled data DHaroBHL19 . Third, while previous work on DRL chatbots train a single agent SerbanEtAl2018 ; LiMSJRJ16 , our study—confirmed by automatic and human evaluations—shows that an ensemble-based approach performs better than a counterpart single agent. The remainder of this article elaborates on these contributions.

3 Background

A reinforcement learning agent induces its behaviour from interacting with an environment through trial and error, where situations (representations of sentences in a dialogue history) are mapped to actions (follow-up sentences) by maximising a long-term reward signal. Such an agent is typically characterised by: (i) a finite set of states $S=\{s_i\}$ that describe all possible situations in the environment; (ii) a finite set of actions $A=\{a_j\}$ to change the environment from one situation to another; (iii) a state transition function $T(s,a,s')$ that specifies the next state $s'$ for having taken action $a$ in the current state $s$; (iv) a reward function $R(s,a,s')$ that specifies a numerical value given to the agent for taking action $a$ in state $s$ and transitioning to state $s'$; and (v) a policy $\pi: S \rightarrow A$ that defines a mapping from states to actions SuttonB2018 ; Szepesvari:2010 .

The goal of a reinforcement learning agent is to find an optimal policy by maximising its cumulative discounted reward defined as

$$Q^*(s,a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s, a_t = a, \pi \right],$$

where function $Q^*$ represents the maximum sum of rewards $r_t$ discounted by factor $\gamma$ at each time step. While a reinforcement learning agent takes actions with probability $Pr(a|s)$ during training, it selects the best action at test time according to

$$\pi^*(s) = \arg\max_{a \in A} Q^*(s,a).$$

A deep reinforcement learning agent approximates $Q^*$ using a multi-layer neural network MnihKSRVBGRFOPB15 . The $Q$ function is parameterised as $Q(s,a;\theta)$, where $\theta$ are the parameters or weights of the neural network (recurrent neural network in our case). Estimating these weights requires a dataset of learning experiences $D=\{e_1, \dots, e_N\}$ (also referred to as ‘experience replay memory’), where every experience is described as a tuple $e_t=(s_t, a_t, r_t, s_{t+1})$. Inducing a $Q$ function consists in applying Q-learning updates over minibatches of experience drawn uniformly at random from the full dataset $D$. This process is implemented in learning algorithms using Deep Q-Networks (DQN) such as those described in MnihKSRVBGRFOPB15 ; HasseltGS16 ; WangSHHLF16 , and the following section describes a DQN-based algorithm for human-chatbot interaction.
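The Q-learning update over a minibatch can be illustrated with a minimal sketch. It assumes a replay memory of (state, action, reward, next state, done) tuples and a target network exposed as a plain function; the names `q_targets` and `toy_target_q`, the `done` flag and the toy values are hypothetical and not taken from the article.

```python
import random

def q_targets(minibatch, target_q, gamma=0.99):
    """Return the Q-learning target y for each (s, a, r, s_next, done) tuple."""
    targets = []
    for s, a, r, s_next, done in minibatch:
        if done:
            # No future reward after the last step of an episode.
            y = r
        else:
            # Bootstrapped target: reward plus discounted best next value.
            y = r + gamma * max(target_q(s_next))
        targets.append(y)
    return targets

# Toy replay memory (hypothetical values, for illustration only).
replay = [
    ("s0", 0, 1.0, "s1", False),
    ("s1", 1, -1.0, "s2", True),
]

def toy_target_q(state):
    """Stand-in for a target network: one value per action."""
    return {"s1": [0.5, 2.0], "s2": [0.0, 0.0]}[state]

# Experiences are drawn uniformly at random from the replay memory.
minibatch = random.Random(0).sample(replay, k=2)
targets = q_targets(minibatch, toy_target_q)
```

The network weights would then be adjusted to reduce the gap between these targets and the predicted values of the chosen actions.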

4 Proposed Approach

This section explains the main components of Figure 1 as follows. Motivated by CuayahuitlEtAl2019ijcnn , we first describe the ensemble of Deep Reinforcement Learning (DRL) agents, we then explain how to conceive a finite set of dialogue actions from raw text, and finally we describe how to assign dialogue rewards for training DRL-based chatbots.

4.1 Ensemble of DRL Chatbots

We assume that all deep reinforcement learning agents in our ensemble use the same neural network architecture and learning algorithm. They only differ in the portion of data used for training and consequently the weights in their trained models (see Wiering2008EAR ; ChenEtAl2018 for alternative approaches). Our agents aim to maximise their cumulative reward over time according to

$$Q^*(s,a;\theta^i) = \max_{\pi_{\theta^i}} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s, a, \pi_{\theta^i} \right],$$

where $r_t$ is the numerical reward given at time step $t$ for choosing action $a$ in state $s$, $\gamma$ is a discounting factor, and $Q^*(s,a;\theta^i)$ is the optimal action-value function using weights $\theta^i$ in the neural network of chatbot $i$. During training, a DRL agent will choose actions in a probabilistic manner in order to explore new $(s,a)$ pairs for discovering better rewards or to exploit already learnt values, with a reduced level of exploration and an increased level of exploitation over time. During testing, our ensemble-based DRL chatbot will choose the best actions $a^*$ according to

$$a^* = \arg\max_{a \in A} Q^*(s,a;\theta^{i^*}), \qquad i^* = \arg\max_{i} R(\tau^i),$$

where $\tau^i$ is a trajectory of state-action pairs of chatbot $i$, and $R(\tau^i)$ is a function that predicts the dialogue reward of chatbot $i$ as in CuayahuitlEtAl2018neurips . Given the set of trajectories for all agents, where each agent takes its own decisions and updates its environment states accordingly, the agent with the highest predicted reward is selected, i.e. the one with the least amount of errors in the interaction.
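The test-time selection among agents can be sketched as follows. This is only an illustration under stated assumptions: each agent's trajectory is a list of (state, action) pairs, and `toy_predict_reward` is a hypothetical stand-in for the trained reward predictor, which in the article is a neural regressor.

```python
def select_agent(trajectories, predict_reward):
    """Return (agent_id, predicted_reward) for the best-scoring trajectory."""
    scored = {agent_id: predict_reward(traj)
              for agent_id, traj in trajectories.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

def toy_predict_reward(trajectory):
    """Hypothetical predictor: +1 per 'human-like' action, -1 otherwise."""
    human_like = {3, 7}
    return sum(1 if action in human_like else -1 for _, action in trajectory)

# Each agent keeps its own trajectory of (state, action) pairs.
trajectories = {
    "agent_0": [("h0", 3), ("h1", 5)],
    "agent_1": [("h0", 3), ("h1", 7)],
}
best_agent, best_score = select_agent(trajectories, toy_predict_reward)
```

The response of the selected agent is the one that would be shown to the user at that turn.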

Our DRL agents implement the procedure above using a generalisation of DQN-based methods MnihKSRVBGRFOPB15 ; HasseltGS16 ; WangSHHLF16 —see Algorithm 1, explained as follows.

  • After initialising replay memory $D$ with learning experience $e$, dialogue history $H$ with sentences, action-value function $Q$ and target action-value function $\hat{Q}$, we sample a training dialogue from our data of human-human conversations (lines 1-4).

  • Once a conversation starts, it is mapped to its corresponding sentence embedding representation, i.e. ‘sentence vectors’ as described in Section 4.2 (lines 5-6).

  • Then a set of candidate responses is generated including (1) the true human response and (2) a set of randomly chosen responses (distractors). The candidate responses are clustered as described in the next section and the resulting actions are taken into account by the agent for action selection (lines 8-10).

  • Once an action is chosen, it is conveyed to the environment, a reward is observed as described at the end of this section, and the agent’s partner response is observed in order to update the dialogue history $H$ (lines 11-14).

  • In response to the update above, the new sentence embedding representation $s'$ is extracted from $H$ for updating the replay memory $D$ with experience $e=(s,a,r,s')$ (lines 15-16).

  • Then a minibatch of experiences $(s_j,a_j,r_j,s'_j)$ is sampled from $D$ for updating weights $\theta$ according to the error derived from the difference between the target value $y_j$ and the predicted value $Q(s_j,a_j;\theta)$ (see lines 18 and 20), which is based on the following weight updates:

    $$\theta \leftarrow \theta + \alpha \left( y_j - Q(s_j,a_j;\theta) \right) \nabla_{\theta} Q(s_j,a_j;\theta),$$

    where $y_j = r_j + \gamma \max_{a'} \hat{Q}(s'_j,a';\hat{\theta})$ and $\alpha$ is a learning rate hyperparameter.

  • The target action-value function $\hat{Q}$ and environment state $s$ are updated accordingly (lines 21-22), and this iterative procedure continues until convergence.

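1: Initialise Deep Q-Networks with replay memory $D$, dialogue history $H$, action-value function $Q$ with random weights $\theta$, and target action-value function $\hat{Q}$ with $\hat{\theta}=\theta$
2: Initialise clustering model from training dialogue data
3: repeat
4:     Sample a training dialogue (human-human sentences)
5:     Append first sentence to dialogue history $H$
6:     $s$ = sentence embedding representation of $H$
7:     repeat
8:         Generate noisy candidate response sentences
9:         $A$ = cluster IDs of the candidate response sentences
10:         $a$ = random action in $A$ with probability $\epsilon$, otherwise $\arg\max_{a' \in A} Q(s,a';\theta)$
11:         Execute chosen clustered action $a$
12:         Observe human-likeness dialogue reward $r$
13:         Observe environment response (agent’s partner)
14:         Append agent and environment responses to $H$
15:         $s'$ = sentence embedding representation of $H$
16:         Append learning experience $e=(s,a,r,s')$ to $D$
17:         Sample random minibatch $(s_j,a_j,r_j,s'_j)$ from $D$
18:         $y_j$ = $r_j$ if final step of episode, otherwise $r_j + \gamma \max_{a'} \hat{Q}(s'_j,a';\hat{\theta})$
19:         Set $err = (y_j - Q(s_j,a_j;\theta))^2$
20:         Gradient descent step on $err$ with respect to $\theta$
21:         Reset $\hat{Q}=Q$ every $C$ steps
22:         $s \leftarrow s'$
23:     until end of dialogue
24:     Reset dialogue history $H$
25: until convergence
Algorithm 1 ChatDQN Learning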
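The candidate generation and action selection steps of Algorithm 1 can be sketched as below. This is a minimal illustration, not the authors' implementation: the helper names `build_candidates` and `epsilon_greedy` are hypothetical, the exploration scheme is assumed to be epsilon-greedy, and Q-values are given as a plain list indexed by cluster ID.

```python
import random

def build_candidates(true_response, pool, n_candidates, rng):
    """Mix the true human response with randomly chosen distractors."""
    others = [s for s in pool if s != true_response]
    distractors = rng.sample(others, n_candidates - 1)
    candidates = [true_response] + distractors
    rng.shuffle(candidates)
    return candidates

def epsilon_greedy(q_values, available_actions, epsilon, rng):
    """Pick a cluster ID among those present in the candidate set."""
    actions = sorted(set(available_actions))
    if rng.random() < epsilon:
        # Explore: any available clustered action.
        return rng.choice(actions)
    # Exploit: the available action with the highest Q-value.
    return max(actions, key=lambda a: q_values[a])

rng = random.Random(0)
pool = ["a", "b", "c", "d"]
candidates = build_candidates("a", pool, 3, rng)
# Cluster IDs of the candidates would come from the sentence clustering model.
action = epsilon_greedy([0.1, 0.9, 0.4, 0.7], [0, 2, 3, 3], 0.0, rng)
```

Restricting the choice to cluster IDs present among the candidates keeps the agent from selecting an action for which no sentence is available at that turn.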

4.2 Sentence and Dialogue Clustering

Actions in reinforcement learning chatbots correspond to sentences, and their number is infinite assuming all possible combinations of word sequences in a given language. This is especially true in the case of open-ended conversations that make use of large vocabularies, as opposed to task-oriented conversations that make use of smaller (restricted) vocabularies. A clustered action is a group of sentences sharing a similar or related meaning via sentence vectors derived from word embeddings MikolovSCCD13 ; PenningtonSM14 . We represent sentences via their mean word vectors (similarly as in Deep Averaging Networks IyyerMBD15 ), denoted as $x^i = \frac{1}{N_i}\sum_{j=1}^{N_i} c^i_j$, where $c^i_j$ is the vector of coefficients of word $j$, $N_i$ is the number of words in sentence $i$, and $x^i$ is the embedding vector of sentence $i$. Similarly, a clustered dialogue is a group of conversations sharing a similar or related topic(s) via their clustered actions. We represent dialogues via their clustered actions. Dialogue clustering in this way can be seen as a two-stage approach, where sentences are clustered in the first step and dialogues are clustered in the second step. In our proposed approach, each DRL agent is trained on a cluster of dialogues.
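A mean word vector can be computed as in the following sketch. The toy two-dimensional embeddings are hypothetical (the experiments use pre-trained GloVe vectors), and skipping out-of-vocabulary words is an assumption made here for illustration.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the embedding vectors of the known words in a sentence."""
    words = [w for w in sentence.split() if w in embeddings]
    if not words:
        # Assumption: sentences with no known word map to the zero vector.
        return [0.0] * dim
    return [sum(embeddings[w][k] for w in words) / len(words)
            for k in range(dim)]

# Toy embeddings, for illustration only.
toy_embeddings = {
    "i": [1.0, 0.0],
    "like": [0.0, 1.0],
    "dogs": [2.0, 2.0],
}
vec = sentence_vector("i like dogs", toy_embeddings, dim=2)  # -> [1.0, 1.0]
```

A dialogue history is then a sequence of such vectors, one per sentence.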

While there are multiple ways of selecting features for clustering and also multiple clustering algorithms, the following requirements arise for chatbots: (1) unlabelled data due to human-human dialogues in raw text (this makes it difficult to evaluate the goodness of clustering features and algorithms), and (2) scalability to clustering a large set of data points (especially in the case of sentences, which are substantially different between them due to their open-ended nature).

Given a set of data points $\{x^1, \dots, x^n\}$ and a similarity metric $d(x^i, x^{i'})$, the task is to find a set of $k$ groups with a clustering algorithm. In our case each data point $x^i$ corresponds to a dialogue or a sentence. For scalability purposes, we use the K-Means++ algorithm ArthurV07 and the Euclidean distance $d(x^i, x^{i'}) = \sqrt{\sum_{j=1}^{m} (x^i_j - x^{i'}_j)^2}$ with $m$ dimensions, and consider $k$ as a hyperparameter – though other clustering algorithms and distance metrics can be used with our approach. In this way, a trained sentence clustering model assigns a cluster ID $a \in A$ to features $x^i$, where the number of actions (in a DRL agent) refers to the number of sentence clusters, i.e. $|A| = k$.
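The clustering step can be sketched in pure Python as K-Means++ seeding followed by standard Lloyd iterations with the (squared) Euclidean distance. This is a toy illustration under stated assumptions: the function names, the fixed number of iterations and the four example points are hypothetical, and a real run would cluster sentence or dialogue vectors with an optimised library.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, rng):
    """K-Means++ seeding: sample centres proportionally to squared distance."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres

def assign(point, centres):
    """Return the cluster ID (index of the nearest centre) of a point."""
    return min(range(len(centres)), key=lambda i: sq_dist(point, centres[i]))

def kmeans(points, k, rng, iterations=10):
    """Run Lloyd iterations from a K-Means++ initialisation."""
    centres = kmeans_pp_init(points, k, rng)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centres)].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [
            [sum(col) / len(g) for col in zip(*g)] if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return centres

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centres = kmeans(points, 2, random.Random(0))
cluster_ids = [assign(p, centres) for p in points]
```

The cluster ID returned by `assign` plays the role of the action when the points are sentence vectors, and of the training subset when the points represent dialogues.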

4.3 Human-Likeness Rewards

Specifying reward functions in reinforcement learning dialogue agents is often a difficult aspect. We propose to derive rewards from human-human dialogues by assigning positive values to contextualised responses seen in the data, and negative values to randomly chosen responses due to lacking coherence (also referred to as ‘non-human-like responses’) – see example in Tables 1 and 2. An episode or dialogue reward can thus be computed as $R_i = \sum_{j=1}^{T_i} r^i_j(a)$, where index $i$ refers to the dialogue in focus, index $j$ to the dialogue turn in focus, $T_i$ is the number of turns in dialogue $i$, and $r^i_j(a)$ is given according to

$$r^i_j(a) = \begin{cases} +1, & \text{if } a \text{ is the human response seen in the data at dialogue-turn } i,j \\ -1, & \text{if } a \text{ is a randomly chosen (non-human-like) response.} \end{cases}$$

Table 1 shows an example of a well rewarded dialogue (without distortions) and Table 2 shows an example of a poorly rewarded dialogue (with distortions). Other dialogues can exhibit similar dialogue rewards or something in between (ranging between $-T_i$ and $+T_i$), depending on the amount of distortions: the higher the amount of distortions the lower the dialogue reward.

Turn Verbalisation Reward
1 A: hello what are doing today ?
B: i’m good , i just got off work and tired , i have two jobs . +1
2 A: i just got done watching a horror movie
B: i rather read , i have read about 20 books this year . +1
3 A: wow ! i do love a good horror movie . loving this cooler weather
B: but a good movie is always good . +1
4 A: yes ! my son is in junior high and i just started letting him watch them too
B: i work in the movies as well . +1
5 A: neat ! ! i used to work in the human services field
B: yes it is neat , i stunt double , it is so much fun and hard work . +1
6 A: yes i bet you can get hurt . my wife works and i stay at home
B: nice , i only have one parent so now i help out my mom . +1
7 A: i bet she appreciates that very much .
B: she raised me right , i’m just like her . +1
8 A: my dad was always busy working at home depot
B: now that i am older home depot is my toy r us . +1
Table 1: Example human-human dialogue with a good numerical reward

Turn Verbalisation Reward
1 A: hello what are doing today ?
B: i’m good , i just got off work and tired , i have two jobs . +1
B: lol. she is more pop now. her old stuff was country. i like classic country . -1
2 A: i just got done watching a horror movie
B: i rather read , i have read about 20 books this year . +1
B: i am good how about you -1
3 A: wow ! i do love a good horror movie . loving this cooler weather
B: but a good movie is always good . +1
B: i do not enjoy the driving in my honda civic -1
4 A: yes ! my son is in junior high and i just started letting him watch them too
B: i work in the movies as well . +1
B: do you like dogs ? I have two . -1
5 A: neat ! ! i used to work in the human services field
B: yes it is neat , i stunt double , it is so much fun and hard work . +1
B: very cool . i had one of those . -1
6 A: yes i bet you can get hurt . my wife works and i stay at home
B: nice , i only have one parent so now i help out my mom . +1
B: that is just great . we need more educators like you in the system . -1
7 A: i bet she appreciates that very much .
B: she raised me right , i’m just like her . +1
B: is it bad that i hate pumpkins and trees ? -1
8 A: my dad was always busy working at home depot
B: now that i am older home depot is my toy r us . +1
B: why not ? maybe you do not like to travel ? -1
Table 2: Example distorted human-human dialogue with a poor numerical reward

We employ the algorithm described in CuayahuitlEtAl2018neurips for generating dialogues with varying amounts of distortions (i.e. different degrees of human-likeness), which we use for training and testing reward prediction models using supervised regression. Given our extended dataset with (noisy) dialogue histories represented with sequences of sentence vectors, the goal is to predict dialogue scores as accurately as possible.
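The relation between distortions and dialogue rewards can be sketched as follows. This is not the algorithm of CuayahuitlEtAl2018neurips, only a simplified illustration under stated assumptions: a dialogue is a list of (A sentence, B sentence) turns, a chosen number of B sentences is replaced by sentences drawn from a hypothetical pool taken from other dialogues, and each turn is labelled +1 or -1.

```python
import random

def distort(dialogue, pool, n_distortions, rng):
    """Return [(sentence_a, sentence_b, reward), ...] with some B turns replaced."""
    turns_to_distort = set(rng.sample(range(len(dialogue)), n_distortions))
    labelled = []
    for j, (a, b) in enumerate(dialogue):
        if j in turns_to_distort:
            # Randomly chosen response: treated as non-human-like.
            labelled.append((a, rng.choice(pool), -1))
        else:
            # Response seen in the data: treated as human-like.
            labelled.append((a, b, +1))
    return labelled

def dialogue_reward(labelled):
    """Sum of the per-turn rewards of a (possibly distorted) dialogue."""
    return sum(r for _, _, r in labelled)

# Toy 8-turn dialogue and distractor pool, for illustration only.
dialogue = [("a%d" % j, "b%d" % j) for j in range(8)]
pool = ["x", "y"]
clean = distort(dialogue, pool, 0, random.Random(1))
noisy = distort(dialogue, pool, 3, random.Random(1))
```

With 8 turns, the undistorted dialogue scores 8 and the dialogue with 3 distorted turns scores 2; such (dialogue, score) pairs are the kind of supervision a reward regressor can be trained on.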

Alternative and automatically derived values between -1 and +1 are also possible but considered as future work. Section 5.4 provides an evaluation of our reward function and its correlation with human judgement. We show that albeit simple, our reward function is highly correlated with our judges’ ratings.

4.4 Methodology

Our proposed approach can be summarised through the following methodology:

  1. Collect or adopt a dataset of human-human dialogues (as in 5.1)

  2. Design or adopt a suitable reward function (as in 4.3)

  3. Train a neural regressor for predicting dialogue rewards (as in CuayahuitlEtAl2018neurips )

  4. Perform sentence and dialogue clustering in order to define the action set and training datasets (as in 4.2)

  5. Train a Deep Reinforcement Learning agent per dialogue cluster (as described in 4.1)

  6. Test the ensemble of agents together with the predictor of dialogue rewards (as in 5.3 and 5.4), and iterate from Step 1 if needed

  7. Deploy your trained chatbot subject to satisfactory results in Step 6

5 Experiments and Results

5.1 Data

We used the Persona-Chat dataset MillerFBBFLPW17 , downloaded from http://parl.ai/ on 18 May 2018; its statistics are shown in Table 3.

Attribute / Value | Training Set | Test Set
Number of dialogues | 17877 | 999
Number of dialogue turns | 131438 | 7801
Number of sentences | 262862 | 15586
Number of unique sentences | 124469 | 15186
Number of unique words | 18672 | 6692
Avg. turns per dialogue | 7.35 | 7.8
Avg. words per dialogue | 165.89 | 185.86
Avg. words per sentence | 11.28 | 11.91
Table 3: Statistics of the Persona-Chat data used in our experiments for chatbot training

5.2 Experimental Setting

Our agents’ states model dialogue histories as sequences of sentence vectors, using GloVe-based PenningtonSM14 mean word vectors IyyerMBD15 with pre-trained embeddings. All our experiments use a 2-layer Gated Recurrent Unit (GRU) neural network choEtAlEMNLP2014 . Other hyperparameters include embedding batch size=128, dropout=0.2, sentence vector dimension=100, latent dimensionality=256, discount factor=0.99, size of candidate responses=20, max. number of mean sentence vectors in $H$=25, burning steps=3K, memory size=10K, target model update (C)=10K, optimiser=Adam, learning steps={100K, 500K}, test steps=100K. The number of parameters of each neural net is 2.53 million. At each time step $t$ in the dialogue history, the first hidden layer generates a hidden state $h_t$ as follows:

$$r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1}),$$
$$\bar{h}_t = \tanh(W_{\bar{h}} x_t + U_{\bar{h}} (r_t \odot h_{t-1})),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \bar{h}_t,$$

where $x_t$ refers to a set of sentence vectors of the dialogue history, $r_t$ is a reset gate that decides how much of the previous state to forget, $z_t$ is an update gate that decides how much to update its activation, $\bar{h}_t$ is an internal state, $\sigma(\cdot)$ and $\tanh(\cdot)$ are the Sigmoid and hyperbolic Tangent functions (respectively), $W_*$ and $U_*$ are learnt weights, and $\odot$ refers to the element-wise multiplication. If the equations above are summarised as $h_t = GRU(x_t, h_{t-1})$ we get the following output action taking into account both hidden layers in our neural net: $a_t = \arg\max_{a \in A} (W^a h^2_t)$, where $h^1_t = GRU(x_t, h^1_{t-1})$ and $h^2_t = GRU(h^1_t, h^2_{t-1})$.
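A single GRU update can be written out in pure Python as a sketch. It assumes no bias terms and the convention in which the update gate weights the new internal state (some formulations swap the two roles); the parameter names in the dictionary `p` are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update; p holds weight matrices W_* (input) and U_* (recurrent)."""
    # Reset gate: how much of the previous state to forget.
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    # Update gate: how much of the new internal state to take.
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Internal (candidate) state.
    h_bar = [math.tanh(a + b)
             for a, b in zip(matvec(p["W_h"], x), matvec(p["U_h"], gated))]
    # Assumed convention: h = (1 - z) * h_prev + z * h_bar.
    return [(1 - zi) * hi + zi * hbi for zi, hi, hbi in zip(z, h_prev, h_bar)]

# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new state is exactly half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_z", "U_z", "W_h", "U_h")}
h = gru_step([1.0, 1.0], [2.0, -4.0], params)  # -> [1.0, -2.0]
```

A 2-layer network applies the same step twice per time step, feeding the first layer's new hidden state as the input of the second layer.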

ID | Clustered Sentences
25 | ‘i mostly eat a fresh and raw diet , so i save on groceries .’, ‘i only eat kosher foods’, ‘i like kosher salt a lot’, ‘i prefer seafood . my dad makes awesome fish tacos .’, ‘i do a pet fish’, ‘that sounds interesting i like organic foods .’, ‘cheeseburgers are great , i try all kinds of foods everywhere i go , gotta love food .’
68 | ‘hi how are you doing ? i am okay how about you ?’, ‘i am great . what do you like to do ?’, ‘oh right how i am between jobs’, ‘i am thinking about my upcoming retirement . how about you ?’, ‘i am not sure ? how old are you ?’, ‘i am well , how are you ?’, ‘i am doing very fine this afternoon . how about you ?’
88 | ‘i have dogs and i walk them . and a cat .’, ‘yeah dogs are pretty cool’, ‘i have dogs and cats’, ‘hello , leon . my dogs and i are doing well .’, ‘sadly , no . my dogs and i are in ohio .’, ‘i have 2 dogs . i should take them walking instead of eating .’, ‘yeah dogs are cool . i kayak too . do you have pets ?’
Table 4: Example clustered sentences chosen arbitrarily

Figure 2: Example clusters of our training data using Principal Component Analysis PPCA for visualisations in 2D, where each black dot represents a training sentence or dialogue: (a) 100 clusters of training sentences; (b) 100 clusters of training dialogues

While a small number of sentence clusters may result in actions being assigned to potentially the same cluster (in which case our system would select randomly from sentences with the same cluster ID), a larger number of sentence clusters would mitigate the problem, but the larger the number of clusters the larger the computational cost, i.e. more parameters in the neural net. Table 4 shows example outputs of our sentence clustering using 100 clusters on our training data. A manual inspection showed that while clustered sentences sometimes do not seem very similar, they made a lot of sense and they produced reasonable outputs. Our human evaluation (see Section 5.4) confirms this. All our experiments use $k=100$ sentence clusters as a reasonable compromise between system performance and computational expense.

The purpose of our second clustering model is to split our original training data into a group of data subsets, one subset for each ChatDQN agent in our ensemble. We explored different numbers of clusters (20, 50, 100) and noted that the larger the number of clusters the (substantially) higher the computational expense. (Our experiments ran on a cluster of 16 GPU Tesla K80, and their implementation used the following libraries: Keras (https://github.com/keras-team/keras), OpenAI (https://github.com/openai) and Keras-RL (https://github.com/keras-rl/keras-rl).) We chose 100 clusters for our experiments due to higher average episode rewards of cluster-based agents than non-cluster-based ones. Figure 2 shows visualisations of our sentence and dialogue clustering using 100 clusters on our training data of 17.8K data points. A manual inspection was not as straightforward as analysing sentences due to the large variation of open-ended sets of sentences (see next section for further results).

5.3 Automatic Evaluation

We compared three DQN-based algorithms (DQN MnihKSRVBGRFOPB15 , Double DQN HasseltGS16 and Dueling DQN WangSHHLF16 ) in order to choose a baseline single agent and the learning algorithm for our ensemble of agents. The goal of each agent is to choose the human-generated sentences (actions) out of a set of candidate responses (20 available at each dialogue turn). Figure 3(left) shows learning curves for these three learning algorithms, where we can observe that all agents indeed improve their performance (in terms of average episode reward) over time. It can also be observed that DQN and Double DQN performed similarly, and that Dueling DQN was outperformed by its counterpart algorithms. Due to its simplicity, we thus opted for using DQN as our main algorithm for the remainder of our experiments.

(a) ChatDQN Agents – 1 Dialogue Cluster
(b) ChatDQN Agents – 100 Dialogue Clusters
Figure 3: Learning curves of ChatDQN agents

Figure 3(right) shows the performance of 100 ChatDQN agents (one per dialogue cluster), where we also observe that all agents improve their performance over time. It can be noted however that the achieved average episode reward of -1 is much greater than that of the single agent, which reaches -5.5. Additional experiments showed that the lower the number of clusters the lower the average episode reward during training. We thus opted for using 100 dialogue clusters in the remainder of our experiments.

(a) Avg. Episode Reward
(b) Avg. F1 Score
(c) Avg. Recall@1
(d) Avg. Recall@5
Figure 4: Test performance of 100 ChatDQN agents on training (blue boxes) and test data (red boxes) using 4 evaluation metrics

We analysed the performance of our agents further by using the test set of 999 dialogues, none of which were seen during training. We clustered the test set using our trained dialogue clustering model in order to assess how well each agent performs on dialogues that are similar to, but not the same as, its training dialogues. The box plots in Figure 4 report the performance of our DRL agents on training data and on test data according to the following metrics: Avg. Episode Reward, Avg. F1 Score, Avg. Recall@1, and Avg. Recall@5. One can quickly observe the striking performance gap between testing on training data and testing on test data. This can be interpreted as ChatDQN agents learning well how to select actions on training data, but failing to replicate the same behaviour on test data. This may not be surprising given that only 720 sentences (out of 263,862 training sentences and 15,586 test sentences) are shared between both sets, and it is presumably a realistic scenario given that even humans rarely use the exact same sentences in multiple conversations. On the one hand, our results suggest that our training dataset is rather modest, and that a larger dataset is needed for improved performance. On the other hand, our results raise the question ‘Can chitchat chatbots with reasonable performance be trained on modest datasets, i.e. with thousands of dialogues instead of millions?’ If so, the generalisation abilities of chatbots need to be improved in future work. If not, large (or very large) datasets should receive more attention in future work on neural-based chatbots.
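As an illustration of how such metrics can be computed for a single dialogue turn, the sketch below implements Recall@k over ranked candidate sentences and a word-overlap F1 score. The token-level definition of F1 is an assumption (this section does not spell it out), and the function names and example sentences are hypothetical.

```python
from collections import Counter
from typing import Sequence


def recall_at_k(ranked_candidates: Sequence[str], true_sentence: str,
                k: int) -> float:
    """Return 1.0 if the human sentence is among the top-k candidates."""
    return 1.0 if true_sentence in ranked_candidates[:k] else 0.0


def token_f1(predicted: str, reference: str) -> float:
    """Word-overlap F1 between a chosen sentence and the human sentence
    (assumed definition: harmonic mean of token precision and recall)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy ranking of candidates, best first (illustrative data only).
ranked = ["i like cats", "i have a dog", "nice to meet you"]
truth = "i have a dog"

r1 = recall_at_k(ranked, truth, 1)   # truth is ranked second, so a miss
r5 = recall_at_k(ranked, truth, 5)   # truth is within the top 5, so a hit
f1 = token_f1(ranked[0], truth)      # the sentences share only the word "i"

# Share of test sentences also seen in training, from the counts above.
seen_percentage = round(100 * 720 / 15586, 2)
```

Averaging these per-turn scores over all turns of all test dialogues would give figures comparable in kind to those in the box plots, under the stated assumptions.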

Finally, we compared the performance of 5 dialogue agents on 999 dialogues with 20 candidate sentences at every dialogue turn:

  • Upper Bound, which corresponds to the true human sentences in the test dataset;

  • Lower Bound, which selects a sentence randomly from other dialogues than the one in focus;

  • Ensemble, which selects a sentence using 100 agents trained on clustered dialogues as described in section 4. The agent in focus is chosen using a regressor that predicts dialogue reward; the regressor uses a neural net similar to that of the ChatDQN agents, except that the final layer has one node and Batch Normalisation IoffeS15 is used between hidden layers as in CuayahuitlEtAl2018neurips ;

  • Single Agent, which selects a sentence using a single ChatDQN agent trained on the whole training set; and

  • Seq2Seq, which selects a sentence using a 2-layer LSTM recurrent neural net with attention (https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/seq2seq) from the ParlAI framework (http://www.parl.ai) ZhangEtAl2018 , trained using the same data as the agents above.
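The selection step of the Ensemble agent in the list above can be sketched as follows. This is a simplified illustration under stated assumptions rather than our implementation: each agent is modelled as a hypothetical function returning one Q-value per candidate sentence, and a single hypothetical reward regressor scores the dialogue extended with each agent's preferred sentence.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces, introduced only for this sketch.
QFunction = Callable[[List[str], Sequence[str]], List[float]]
RewardPredictor = Callable[[List[str]], float]


def greedy_action(q_values: Sequence[float]) -> int:
    """Index of the candidate with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ensemble_respond(history: List[str], candidates: Sequence[str],
                     agents: Sequence[QFunction],
                     predict_reward: RewardPredictor) -> Tuple[int, str]:
    """Let every agent propose a sentence and keep the proposal whose
    extended dialogue obtains the highest predicted reward."""
    best_agent, best_sentence, best_score = -1, "", float("-inf")
    for agent_id, q_function in enumerate(agents):
        sentence = candidates[greedy_action(q_function(history, candidates))]
        score = predict_reward(history + [sentence])
        if score > best_score:
            best_agent, best_sentence, best_score = agent_id, sentence, score
    return best_agent, best_sentence


# Toy stand-ins for trained models (assumptions for illustration only).
def prefers_long(history: List[str], candidates: Sequence[str]) -> List[float]:
    return [float(len(c)) for c in candidates]


def prefers_short(history: List[str], candidates: Sequence[str]) -> List[float]:
    return [-float(len(c)) for c in candidates]


def toy_reward(dialogue: List[str]) -> float:
    """Counts words shared by the last two sentences of the dialogue."""
    previous, last = dialogue[-2].split(), dialogue[-1].split()
    return float(len(set(previous) & set(last)))


history = ["do you have pets ?"]
candidates = ["yes i have two pets at home", "no"]
chosen_agent, chosen_sentence = ensemble_respond(
    history, candidates, [prefers_long, prefers_short], toy_reward)
```

With 100 agents and 20 candidates per turn, the loop would evaluate 100 proposals per turn; the toy example uses two agents and two candidates to keep the behaviour easy to follow.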

Table 5 shows the results of our automatic evaluation, where the ensemble of ChatDQN agents performed substantially better than the single agent and Seq2Seq model.

Agent/Metric | Dialogue Reward | F1 Score | Recall@1
Upper Bound | 7.7800 | 1.0000 | 1.0000
Lower Bound | -7.0600 | 0.0796 | 0.0461
Ensemble | -2.8882 | 0.4606 | 0.3168
Single Agent | -6.4800 | 0.1399 | 0.0832
Seq2Seq | -5.7000 | 0.2081 | 0.1316
Table 5: Automatic evaluation of chatbots on test data

5.4 Human Evaluation

In addition to our results above, we carried out a human evaluation using 15 human judges. Each judge was given a consent form for participating in the study, and was asked to rate 500 dialogues (100 core dialogues from the test dataset, each with 5 different agent responses, presented in random order) according to the following metrics: Fluency (Is the dialogue naturally articulated as written by a human?), Engagingness (Is the dialogue interesting and pleasant to read?), and Consistency (Is the dialogue free of contradictions across sentences?). For a fair comparison, all agents responded to the same human conversants and used the same sets of candidate sentences. This resulted in 15 × 500 × 3 = 22,500 ratings from all judges. Figure 5 shows an example dialogue with ratings ranging from 1=strongly disagree to 5=strongly agree.

Figure 6 shows average ratings (and corresponding error bars) per conversational agent and per metric. Note that the candidate sentences used as distractors were chosen randomly from randomly selected dialogues, which is rather challenging for action selection; future work could consider candidate sentences from similar dialogues for potential improvements in terms of engagingness and consistency. As expected, the Upper Bound agent achieved the best scores and the Lower Bound agent the lowest scores. The ranking of our agents in Table 5 is in agreement with the human evaluation, where the Ensemble agent outperforms the Seq2Seq agent, and the latter outperforms the Single Agent. The difference in performance between the Ensemble agent and the Seq2Seq agent is statistically significant for the Fluency metric and, at a separate significance level, for the other metrics (Engagingness and Consistency), based on a two-tailed Wilcoxon Signed Rank Test.
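For readers unfamiliar with the Wilcoxon Signed Rank Test mentioned above, the sketch below shows the paired, two-tailed version using the normal approximation. It is a simplified illustration: it omits tie and continuity corrections, the ratings are invented toy data rather than our human ratings, and in practice a tested routine such as `scipy.stats.wilcoxon` would be preferable.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float],
                         y: Sequence[float]) -> Tuple[float, float]:
    """Two-tailed Wilcoxon signed-rank test (normal approximation).

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are discarded; tied absolute differences
    receive their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value
    return w, p_value


# Invented paired ratings of the same dialogues for two agents.
ratings_a = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]
ratings_b = [3, 3, 3, 2, 4, 2, 2, 3, 3, 1]
w_stat, p_value = wilcoxon_signed_rank(ratings_a, ratings_b)
```

The test is paired because both agents are rated on the same core dialogues, so each difference compares two responses to the same human conversant.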

Figure 5: Screenshot of our dialogue evaluation tool
Figure 6: Human evaluation results, the higher the better

Furthermore, we analysed the predictive power of dialogue rewards, derived from our reward function, against human ratings on test data. This analysis revealed high positive correlations between them, as shown in Figure 7. These scatter plots show data points of test dialogues (Gaussian noise was added along the X-axes for better visualisation), which obtained Pearson correlation scores between 0.90 and 0.91 for all metrics (Fluency, Engagingness and Consistency). This is in favour of our proposed reward function and supports its application to training open-ended dialogue agents.
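The correlation analysis above can be illustrated with a short sketch of the Pearson coefficient. The data points, the noise standard deviation and the choice of which variable is jittered are assumptions made for illustration; they are not our experimental values.

```python
import math
import random
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented example: predicted dialogue rewards vs. average human ratings.
predicted_rewards = [-7.0, -5.5, -3.0, 0.5, 4.0, 7.5]
average_ratings = [1.2, 1.8, 2.5, 3.1, 4.0, 4.7]
r = pearson(predicted_rewards, average_ratings)

# Jitter is applied only to the plotted X coordinates, never to the values
# used for the correlation itself (standard deviation is hypothetical).
rng = random.Random(0)
jittered_x = [x + rng.gauss(0.0, 0.3) for x in predicted_rewards]
```

Computing the coefficient on the unjittered values keeps the reported correlation independent of the visualisation noise.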

(a) Fluency
(b) Engagingness
(c) Consistency
Figure 7: Scatter plots showing strong correlations (with Pearson coefficients between 0.90 and 0.91) between predicted dialogue rewards and average human ratings as part of the human evaluation, i.e. our proposed reward function correlates with human judgements

6 Conclusions and Future Work

We present a novel approach for training Deep Reinforcement Learning (DRL) chatbots. It uses an ensemble of 100 DRL agents based on clustered dialogues, clustered actions, and rewards derived from human-human dialogues without any manual annotations. The task of the agents is to learn to choose human-like actions (sentences) out of candidate responses including human generated and randomly chosen sentences. Our ensemble trains specialised agents with particular dialogue strategies according to their dialogue clusters. At test time, the agent with the highest predicted reward is used during a dialogue. Experimental results using chitchat dialogue data report that DRL agents learn human-like dialogue policies when tested on training data, but their generalisation ability in a test set of unseen dialogues (with mostly unseen sentences, only 4.62% seen sentences to be precise) remains a key challenge for future research in this field. As part of our study, we found the following:

  1. an ensemble of DRL agents is more promising than a single DRL agent or a single Seq2Seq model—confirmed by a human evaluation;

  2. value-based DRL can be used for training chatbots—previous work mostly uses policy search methods due to infinite action sets; and

  3. our proposed reward function, albeit simple, was useful for training chatbots.

Future work can investigate further the proposed learning approach for improved generalisation in test dialogues. Some research avenues are as follows.

  • Investigating other methods of sentence embedding such as DevlinCLT19 ; CerEtAl2018 ; LeMikolov2014 , possibly with fine-tuning or domain adaptation subject to using large datasets. Other DRL algorithms such as policy search methods should also be compared or combined. In addition, other distance metrics and clustering algorithms should be investigated for better sentence clustering and dialogue clustering. Alternative dialogue rewards should be compared in order to train agents with human-like dialogue rewards across different datasets.

  • An interesting future direction is training an ensemble of ChatDQN agents using a very large dataset, much larger than attempted in this article. Our results seem to suggest that the larger the dataset, the better the generalisation on unseen data. But this requirement implies high costs in data collection and high computational expense for system training. While chatbot training using large or very large datasets is interesting, chatbot training using modest datasets is still relevant because it avoids the cost of collecting large datasets and reduces computational requirements.

  • The proposed learning approach focuses on value-based deep reinforcement learning, and it could be combined with other deep learning methods in order to investigate more effective ensembles of machine learners. For example, our ensemble of agents could include not only value-based DRL methods but also policy search methods and a variety of seq2seq methods. Although this research direction represents increased computational expense, it has the potential of showing improved performance over single agents/models.

  • Our proposed approach in this article did not include any linguistic resources. One reason for this is the practical application of DRL agents to other languages/datasets, where linguistic resources are scarce or do not exist. Another reason is that linguistic resources usually come at the expense of labelled data, and we aimed to investigate an approach and methodology assuming unlabelled data only. However, future work could improve the performance of DRL agents by including knowledge bases and natural language resources such as part-of-speech tagging, named entity recognition, coreference resolution, and syntactic parsing ManningEtAl2014 .

  • Last but not least, the proposed approach can be applied to different applications, beyond chitchat dialogue. Example applications in no particular order are as follows: combining task-oriented dialogue with open-ended dialogue Papaioannou2017hri , strategic dialogue CuayahuitlEtAl2015nips , spatially-aware dialogue Dethlefs17 , automatic (medical) diagnosis WeiLPTCHWD18 , in-car infotainment systems WengASHPH16 , and conversational robots CUAYAHUITL2019 , among others.

References


  • (1) B. J. Grosz, C. L. Sidner, Attention, intentions, and the structure of discourse, Computational Linguistics 12 (3) (1986) 175–204.
  • (2) R. S. Sutton, A. G. Barto, Reinforcement learning - an introduction, 2nd Edition, Adaptive computation and machine learning, MIT Press, 2018.
  • (3) T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, 2nd Edition, Springer series in statistics, Springer, 2009.
  • (4) Y. LeCun, Y. Bengio, G. E. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
  • (5) I. V. Serban, C. Sankar, M. Germain, S. Zhang, Z. Lin, S. Subramanian, T. Kim, M. Pieper, S. Chandar, N. R. Ke, S. Rajeswar, A. de Brébisson, J. M. R. Sotelo, D. Suhubdy, V. Michalski, A. Nguyen, J. Pineau, Y. Bengio, A deep reinforcement learning chatbot (short version), CoRR abs/1801.06700.
  • (6) J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, J. Gao, Deep reinforcement learning for dialogue generation, in: EMNLP, 2016.
  • (7) H. Cuayáhuitl, SimpleDS: A simple deep reinforcement learning dialogue system, in: Dialogues with Social Robots - Enablements, Analyses, and Evaluation, Seventh International Workshop on Spoken Dialogue Systems, IWSDS 2016, Saariselkä, Finland, January 13-16, 2016, 2016, pp. 109–118.
  • (8) J. Gao, M. Galley, L. Li, Neural approaches to conversational AI, Foundations and Trends in Information Retrieval 13 (2-3) (2019) 127–298.
  • (9) Y. Li, Deep reinforcement learning: An overview, CoRR abs/1701.07274.
  • (10) I. Casanueva, P. Budzianowski, P. Su, S. Ultes, L. M. Rojas-Barahona, B. Tseng, M. Gasic, Feudal reinforcement learning for dialogue management in large domains, in: NAACL-HLT, 2018.
  • (11) H. Cuayáhuitl, S. Yu, Deep reinforcement learning of dialogue policies with less weight updates, in: INTERSPEECH, 2017.
  • (12) H. Cuayáhuitl, S. Yu, A. Williamson, J. Carse, Scaling up deep reinforcement learning for multi-domain dialogue systems, in: IJCNN, 2017.
  • (13) J. D. Williams, K. Asadi, G. Zweig, Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning, in: ACL, 2017.
  • (14) B. Peng, X. Li, L. Li, J. Gao, A. Çelikyilmaz, S. Lee, K. Wong, Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning, in: EMNLP, 2017.
  • (15) H. Cuayáhuitl, SimpleDS: A simple deep reinforcement learning dialogue system, in: International Workshop on Spoken Dialogue Systems (IWSDS), 2016.
  • (16) J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky, Adversarial learning for neural dialogue generation, in: EMNLP, 2017.
  • (17) O. Vinyals, Q. V. Le, A neural conversational model, CoRR abs/1506.05869.
  • (18) A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, B. Dolan, A neural network approach to context-sensitive generation of conversational responses, in: HLT-NAACL, 2015.
  • (19) I. V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, A. C. Courville, Multiresolution recurrent neural networks: An application to dialogue response generation, in: AAAI, 2017.
  • (20) J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, W. B. Dolan, A persona-based neural conversation model, in: ACL, 2016.
  • (21) W. Wang, M. Huang, X.-S. Xu, F. Shen, L. Nie, Chat more: Deepening and widening the chatting topic via a deep model, in: SIGIR, ACM, 2018.
  • (22) S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, J. Weston, Personalizing dialogue agents: I have a dog, do you have pets too?, CoRR abs/1801.07243.
  • (23) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: NIPS, 2014.
  • (24) Y. Song, C. Li, J. Nie, M. Zhang, D. Zhao, R. Yan, An ensemble of retrieval-based and generation-based human-computer conversation systems, in: IJCAI, 2018.
  • (25) I. Papaioannou, A. C. Curry, J. L. Part, I. Shalyminov, X. Xu, Y. Yu, O. Dusek, V. Rieser, O. Lemon, An ensemble model with ranking for social dialogue, CoRR abs/1712.07558.
  • (26) C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, J. Pineau, How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation, in: EMNLP, 2016.
  • (27) K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A method for automatic evaluation of machine translation, in: ACL, 2002.
  • (28) A. Lavie, A. Agarwal, METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments, in: Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), 2007.
  • (29) L. F. D’Haro, R. E. Banchs, C. Hori, H. Li, Automatic evaluation of end-to-end dialog systems with adequacy-fluency metrics, Computer Speech & Language 55 (2019) 200–215.
  • (30) C. Szepesvári, Algorithms for Reinforcement Learning, Morgan and Claypool Publishers, 2010.
  • (31) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
  • (32) H. van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-learning, in: AAAI, 2016.
  • (33) Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, N. de Freitas, Dueling network architectures for deep reinforcement learning, in: ICML, 2016.
  • (34) H. Cuayáhuitl, S. Ryu, D. Lee, S. Choi, I. Hwang, J. Kim, Deep reinforcement learning for chatbots using clustered actions and human-likeness rewards, in: IJCNN, 2019.
  • (35) M. A. Wiering, H. van Hasselt, Ensemble algorithms in reinforcement learning, Trans. Sys. Man Cyber. Part B 38 (4).
  • (36) X. liang Chen, L. Cao, C.-X. Li, Z.-X. Xu, J. Lai, Ensemble network architecture for deep reinforcement learning, Mathematical Problems in Engineering Article ID 2129393.
  • (37) H. Cuayáhuitl, S. Ryu, D. Lee, J. Kim, A study of dialogue reward prediction for open-ended conversational agents, in: NeurIPS Workshop on Conversational AI: “Today’s Practice and Tomorrow’s Potential”, 2018.
  • (38) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NIPS, 2013.
  • (39) J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: EMNLP, 2014.
  • (40) M. Iyyer, V. Manjunatha, J. L. Boyd-Graber, H. Daumé III, Deep unordered composition rivals syntactic methods for text classification, in: ACL (1), 2015.
  • (41) D. Arthur, S. Vassilvitskii, K-means++: The advantages of careful seeding, in: SODA, SIAM, 2007.
  • (42) A. H. Miller, W. Feng, D. Batra, A. Bordes, A. Fisch, J. Lu, D. Parikh, J. Weston, ParlAI: A dialog research software platform, in: EMNLP, 2017.
  • (43) K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: EMNLP, 2014.
  • (44) M. E. Tipping, C. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society, Series B 61 (3) (1999) 611–622.
  • (45) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: ICML, 2015.
  • (46) J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
  • (47) D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, Universal sentence encoder, CoRR abs/1803.11175.
  • (48) Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: ICML, 2014.
  • (49) C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: ACL, 2014.
  • (50) I. Papaioannou, O. Lemon, Combining chat and task-based multimodal dialogue for more engaging HRI: A scalable method using reinforcement learning, in: International Conference on Human-Robot Interaction (HRI), 2017.
  • (51) H. Cuayáhuitl, S. Keizer, O. Lemon, Strategic dialogue management via deep reinforcement learning, in: NIPS Deep Reinforcement Learning Workshop, 2015.
  • (52) N. Dethlefs, Domain transfer for deep natural language generation from abstract meaning representations, IEEE Comp. Int. Mag. 12 (3) (2017) 18–28.
  • (53) Z. Wei, Q. Liu, B. Peng, H. Tou, T. Chen, X. Huang, K. Wong, X. Dai, Task-oriented dialogue system for automatic diagnosis, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, 2018.
  • (54) F. Weng, P. Angkititrakul, E. Shriberg, L. P. Heck, S. Peters, J. H. L. Hansen, Conversational in-vehicle dialog systems: The past, present, and future, IEEE Signal Process. Mag. 33 (6) (2016) 49–60.
  • (55) H. Cuayáhuitl, A data-efficient deep learning approach for deployable multimodal social robots, Neurocomputing, issn 0925-2312, 2019.