The RLLChatbot: a solution to the ConvAI Challenge

11/07/2018 ∙ by Nicolas A. Gontier, et al. ∙ Université de Montréal ∙ McGill University

Current conversational systems can follow simple commands and answer basic questions, but they have difficulty maintaining coherent and open-ended conversations about specific topics. Competitions like the Conversational Intelligence (ConvAI) challenge are being organized to push research towards that goal. This article presents in detail the RLLChatbot that participated in the 2017 ConvAI challenge. The goal of this research is to better understand how current deep learning and reinforcement learning tools can be used to build a robust yet flexible open-domain conversational agent. We provide a thorough description of how a dialog system can be built and trained from mostly public-domain datasets using an ensemble model. The first contribution of this work is a detailed description and analysis of different text generation models, in addition to novel message ranking and selection methods. Moreover, a new open-source conversational dataset is presented. Training on this data significantly improves the Recall@k score of the ranking and selection mechanisms compared to our baseline model responsible for selecting the message returned at each interaction.




1 Introduction

Having a conversation with computers is not a novel idea. Already in the 1950s, Alan Turing hypothesized that this would happen and proposed a test to evaluate the intelligence of such machines, the so-called “Turing test” (Turing, 1950). Not long after, Weizenbaum built the first computer program that could interact with humans using natural language (Weizenbaum, 1966). In the late 80s, computation with layered networks of artificial neurons was developed (Levin and Fleisher, 1988; Hornik et al., 1989). However, such networks required a lot of data to be trained and thus were not used for dialog systems until recently. Recent progress in computer hardware and the availability of large amounts of data changed our approach to building dialog systems. This work demonstrates the potential of deep data-driven architectures to maintain conversations with humans.

1.1 Overview of Dialog Systems

Dialog systems are defined as computer programs that are responsible for returning an output sentence to one or more input sentences. They are also denoted as Conversational Agents, or Chatbots (Weizenbaum, 1966; Colby, 1975; Epstein, 1993; Serban et al., 2017). They communicate via natural language by using speech and/or text signals. This work focuses on text-to-text interactions only. This simplifies the task slightly as going from a text-based chatbot to a spoken system can be challenging due to speech recognition errors. Dialog systems are defined in multi-agent settings where each agent is either a human or another system. Here, the setting is constrained to one human and one virtual agent in the environment.

In general, conversational agents are clustered into two distinct categories: goal-oriented systems and open-domain systems (Serban et al., 2015). In the goal-oriented setting, the systems are explicitly built to solve a particular task (McGlashan et al., 1992; Aust et al., 1995; Gorin et al., 1997). They typically operate in well-defined domains and use rule-based or modular architectures (Rudnicky et al., 1999; Raux et al., 2005). Therefore, while being accurate in their specific domain, goal-oriented systems often lack flexibility (Simpson and Fraser, 1993).

In the open-domain setting, systems may not be meant to solve specific tasks; rather, their role is to be a social companion to users (Weizenbaum, 1966; Colby, 1975; Epstein, 1993; Hutchens and Alder, 1998). The goal of these agents is to mimic as much as possible the unstructured character of human-to-human conversations, while still being coherent. Because of their unstructured setting, these systems are much more flexible and can be used to better understand how humans converse. Indeed, without any task in mind, it is harder to use logical rules to guide the generation of responses. A different criterion is needed: understanding human social interactions. It is worth mentioning that the lack of a clear task makes it hard to automatically evaluate conversational agents in this setting. Unlike in the goal-oriented case, here there is no objective; thus it is ambiguous what constitutes a successful conversation. This remains an open problem in the field, and to this day the best evaluation for open-domain chatbots is to ask humans to manually score conversations (Liu et al., 2016). While this is one major limitation of open-domain dialog agents, the organization of competitions has become helpful in their evaluation.

1.2 Our Approach

This article describes all the components of the RLLChatbot presented at the 2017 Conversational Intelligence (ConvAI) competition, organized as part of the 2017 Neural Information Processing Systems (NIPS) conference. Our approach is divided into three steps done at runtime for each interaction. First, various candidate messages are generated with an ensemble of models conditioned on the conversation state. Second, a scoring neural network ranks each of the candidate messages. Finally, a selection criterion decides which message is returned to the user. The main contribution of this work is the thorough analysis of the different components and ensemble strategies for a general-purpose dialog system. Combined with the open-source dataset provided (available upon request to the authors; link to be added at publication time), we believe that this work can be a starting point for any future researcher using state-of-the-art models to build a general-purpose chatbot.

Section 2 presents a literature review covering previous conversational systems and competitions. The 2017 ConvAI challenge is presented in Section 3. Section 4 provides a detailed description of each generative module in the system. A variety of generative-, retrieval-, and rule-based models are considered. The scoring and selection strategies are then defined in Section 5. More specifically, one supervised and one reinforcement learning strategy are introduced for the scoring mechanism. The selection mechanism can follow either a rule-based or a statistical criterion. The crowd-sourced data collection described in Section 6 provides a clear understanding of which type of dialog model is preferred according to human evaluations. Experiments described in Section 7 demonstrate that the choice of the scoring algorithm plays a crucial role in our system. Training all the presented components end-to-end is left as future work; as of now, each model is trained independently.

2 Previous Work in the Chatbot Community

2.1 History of Dialog Systems

The first chatbot was built in 1966 by MIT scientist Joseph Weizenbaum and was named ELIZA (Weizenbaum, 1966). Entirely rule-based, the ELIZA program analyzes its input sentences and mostly rephrases what the user says, or asks questions, to extend the conversation. Shortly after this, Stanford psychiatrist Kenneth Colby developed PARRY (Colby, 1975). Also rule-based, this conversational agent was built to reproduce the behavior of a paranoid schizophrenic patient.

Following the “AI winter” of the 1980s, during which progress and research slowed down in the field of Artificial Intelligence, the Jabberwacky system was built in 1988. Very different from its predecessors, this agent stores everything that everyone has ever said to it and finds the most appropriate reply based on contextual pattern matching techniques (Fryer and Carpenter, 2006). This system can be seen as the first data-driven conversational agent. Not long after that, one of the most famous chatbots was made by computer scientist Dr. Richard Wallace: the Artificial Linguistic Internet Computer Entity, also known as ALICE. Inspired by ELIZA, this 1995 system is one of the strongest rule-based agents, with more than 20,000 conversational rules. It is the first program to rely on the XML schema called Artificial Intelligence Markup Language (AIML) (Wallace, 2009). This makes ALICE a strong and flexible agent and allowed it to win the Loebner Prize three times: in 2000, 2001, and 2004.

One of the first chatbots to become a widely used consumer product was SmarterChild, developed by the company ActiveBuddy. In 2001, the dialog agent was released on the AOL Instant Messenger and Windows Live Messenger networks as a showcase for quick data access and the possibilities of fun, personalized conversation (Kay and Hoffer, 2006, 2007; Cunningham et al., 2007). The innovative aspect of this bot is that it could provide useful information via partnerships with various service providers, offering weather, stocks, movie listings, and more.

During the early 2000s, chatbots started to rely less on hand-crafted rules and more on data-driven approaches (Lester et al., 2004). This shift was primarily caused by the growing abundance of conversational data with the introduction of new communication technologies via the Internet. In general, conversational agents developed during this time follow a pipeline (or modular) architecture (Rudnicky et al., 1999; Young, 2000; Zue and Glass, 2000). User queries are first parsed and interpreted by a natural language interpreter (NLI); then a dialog state tracker (DST) and a dialog manager (DM) provide response elements, before a natural language generator (NLG) module returns a proper sentence. This pipeline is illustrated in Figure 1.

Figure 1: Pipeline framework of modular dialog systems. Composed of: an automatic speech recognizer (ASR) that translates audio signals to text, a natural language interpreter (NLI) that explains what the system heard by labeling the text, a dialog state tracker (DST) that understands what the user wants, a dialog manager (DM) that performs the required action and returns some information, a natural language generator (NLG) that makes a syntactical sentence, and a text-to-speech synthesizer (TTS) that maps text to audio signals. In addition, an external knowledge base (KB) often communicates with the DM.

In 2010, a start-up called Siri was acquired by tech giant Apple, and their product became one of the most popular chatbots to this day. Being a voice-activated system, it uses innovative speech recognition techniques to transform speech to text before analyzing user queries and providing appropriate responses in a pipeline manner. This new assistant triggered a surge of other chatbots such as Google Now in 2012 and Amazon’s Alexa and Microsoft’s Cortana in 2014. The main advantage of these systems is their strong connection to other software applications from the same parent company, allowing them to become true personal assistants.

In parallel to the success of these new personal assistants, end-to-end approaches to dialog systems started to be explored. An end-to-end system is defined as one module replacing the four components presented in Figure 1, namely the NLI, DST, DM, and NLG modules. In particular, Bengio et al. (2003) developed a neural approach to the language modelling task. Given a sequence of tokens (words), the goal is to predict the next token following that sequence. Applied recursively, this technique can produce meaningful sentences. This novel use of neural networks outperforms previous work based on n-gram features. What is missing to produce a dialog is a way of conditioning this generation process on the context of the conversation. A solution is proposed by Sutskever et al. (2014), who present an encoder-decoder architecture, also known as a sequence-to-sequence model. The same recurrent network is used to first encode the conversation history into a fixed-length vector before generating the next possible sentence, token by token, as in the language modelling task.
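As a toy illustration of this recursive next-token generation (a hand-written bigram table stands in for the trained network; vocabulary and probabilities are invented for the example):

```python
import random

# Toy autoregressive generation: a language model maps the current prefix to a
# distribution over next tokens; sampling recursively yields a full sentence.
BIGRAMS = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"dog": 0.6, "cat": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "dog": {"barks": 0.8, "</s>": 0.2},
    "cat": {"meows": 0.8, "</s>": 0.2},
    "barks": {"</s>": 1.0},
    "meows": {"</s>": 1.0},
}

def sample_next(token, rng):
    """Sample the next token from the conditional distribution p(y_t | y_{t-1})."""
    r, acc = rng.random(), 0.0
    for tok, p in BIGRAMS[token].items():
        acc += p
        if r < acc:
            return tok
    return tok  # numerical fallback

def generate(rng=None, max_len=10):
    """Apply next-token prediction recursively to produce a sentence."""
    rng = rng or random.Random(0)
    tokens, cur = [], "<s>"
    for _ in range(max_len):
        cur = sample_next(cur, rng)
        if cur == "</s>":
            break
        tokens.append(cur)
    return " ".join(tokens)
```

Conditioning this loop on an encoded conversation history, rather than only on the previous token, is exactly what the encoder-decoder architecture adds.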

Following this novel technique, researchers began to explore data-driven generation of conversational responses in the form of encoder-decoder or sequence-to-sequence models (Sordoni et al., 2015; Vinyals and Le, 2015; Serban et al., 2016). However, the generated responses are often too generic to carry meaningful information. A mutual-information-based model was proposed by Li et al. (2015) to address this issue, and was later improved by using deep reinforcement learning (Li et al., 2016). Recently, Zemlyanskiy and Sha (2018) defined a quantitative metric of discovering information about the interlocutor and showed that maximizing it yields more engaging conversations according to human evaluation. Overall, however, end-to-end systems require a large amount of domain-specific conversational data to be trained on. Since we did not have prior in-domain data for the competition, an ensemble of both generic end-to-end systems and specific rule-based systems is proposed, with deep learning scoring techniques.

2.2 History of ChatBot Competitions

To give further context for the ConvAI challenge described in Section 3, it is worth viewing it in a historical perspective.

In 1990, Hugh Loebner and the Cambridge Centre for Behavioural Studies established a competition based on implementing the Turing test. A gold medal and $100,000 were offered by Hugh Loebner as a grand prize for the first computer whose responses cannot be distinguished from a human’s. A bronze medal and an annual prize of $2,000 are pledged in every annual contest for the system that seems most human relative to the other competitors. It is the first known competition that represents a formal Turing test (Epstein, 1993). The competition has been running annually since 1991. The goal of this challenge is to design a chatbot that has the ability to pursue a conversation on any topic. The evaluation of the system is made by an interrogator who tries to guess whether they are talking to a program or a real human. After a five-minute conversation between the judge and a chatbot, and another five-minute conversation between the judge and an independent confederate, the judge has to nominate which one was the human. According to this judgment, the more human-seeming chatbot is the winner.

No chatbot has ever won the gold medal and passed the test, that is, fooled all the judges. However, every year there is a winning bot able to fool at least a few of the judges. Through the years of Loebner Prize competitions, the winning chat technologies evolved from very simple pattern matching systems towards complicated patterns in combination with knowledge bases (Bradeško and Mladenić, 2012). However, most systems are still strongly hand-crafted, and few automatic reasoning mechanisms have been proposed.

In 2017, another challenge was proposed by Amazon: the Alexa Prize. This competition was targeted towards university students to advance human-computer interaction by creating a social chatbot that could converse for 20 minutes about a wide range of topics such as current events, entertainment, sports, politics, technology, and fashion. The submitted systems were evaluated by real Amazon users who already had an Echo device in their home. At any time, users could say something like “Alexa, let’s chat about <topic>” (for example: baseball playoffs, celebrity gossip, scientific breakthroughs). In response, Alexa directed the user to an anonymous team’s chatbot to interact with. At the end of the conversation, users were asked to rate the conversational agent on a scale from 1 to 5 based on factors such as relevance, coherence, and interestingness. The team with the highest average score was the winner. Ties were broken by average conversation length, with longer conversations being better.

This challenge is quite hard since no conversation topic or user data is given to the chatbot before it starts interacting with real users. Furthermore, since Alexa is a voice-activated assistant, the chatbot relies on the accuracy of the speech recognizer provided. Many chatbots have been proposed for this challenge; overall, they all rely on modern deep learning and reinforcement learning techniques and try to be as flexible as possible by avoiding fixed conversation rules. Most notably, the MILABOT (Serban et al., 2017) follows a similar structure to our RLLChatbot by first generating candidate responses before selecting one of them.

3 Conversational Challenge Description

This section describes the Conversational Intelligence (ConvAI) challenge as well as the dataset collected during the competition.

3.1 Challenge Description

The 2017 ConvAI challenge was organized as part of the competition workshop of the 2017 Neural Information Processing Systems (NIPS) conference. It is a more contextual, text-based version of the Alexa Prize previously described: the topic of the discussion is defined at the beginning of each dialog with a random news article’s paragraph, and every conversation takes place on the text messaging platform Telegram.

Figure 2: Example of bot-to-human conversation during the ConvAI evaluation using the Telegram platform.

The challenge required the construction of a conversational system that can talk to human judges about a random wiki-news article’s paragraph. Like the Turing test, at the beginning of each conversation, human judges did not know if they were talking to a chatbot or another human. After each interaction, human users could ‘up-vote’ or ‘down-vote’ individual responses of the other participant. The two participants discussed for any number of interactions desired, keeping in mind the news paragraph given at the very beginning of the conversation. At the end of the conversation, human users gave a score between 1 and 5 for the conversation quality, breadth, and engagement (1 being ‘very bad’, 2 ‘bad’, 3 ‘medium’, 4 ‘good’, 5 ‘very good’). Submitted systems were then given the average score among all the conversations they had with random human users. After many rounds of evaluation, the organizers collected a dataset of human-to-human and human-to-bot conversations, each evaluated from 1 to 5. Figure 2 is an example of a conversation from the competition.

This scenario is an instance of the previously described open domain setting. Indeed, the task is to chat about any given news paragraph with a human. While the topic is constrained for each conversation by the random paragraph, no other information is given. Given news paragraphs can be about sports, politics, science, history, technology, fashion, economics, and many other topics. Submitted systems have to be general enough to understand and speak about all these topics. Moreover, external knowledge bases cannot be queried over the internet. Therefore, the difficulty of the task is to extract information from the article and be able to have a coherent conversation about it without any other external information. Further technical difficulties are discussed in the Appendix A.6.

3.2 Competition Dataset

The ConvAI challenge organized an early human evaluation of submitted systems before the final round. The dataset collected during this human evaluation round was released by the organizers; it is a first step towards understanding what makes a good conversation and what does not.

The data contains both human-to-human and human-to-bot conversations; thus, only a small fraction of each chatbot’s behavior is captured in the data. After removing empty conversations, one-sided conversations, and non-voted interactions, the data consists of {article, context, message, vote} tuples drawn from unique articles, so a wide variety of topics is covered in this small dataset. The vote represents the human score given to the message in that same tuple and can be either 1 (up-voted) or 0 (down-voted). Human-to-bot messages are automatically added to the context since they do not have a vote. All the other messages in a conversation (bot-to-human and human-to-human) appear in a tuple as message once, before being added to the context in the following tuples of the same conversation. Final dialog ratings are not considered in this dataset because it is only used to classify up-voted and down-voted messages.
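The tuple construction described above can be sketched as follows; field names and the input layout are illustrative rather than the exact released format:

```python
def build_tuples(article, turns):
    """Convert one conversation into {article, context, message, vote} tuples.

    `turns` is a list of (speaker, text, vote) triples, where vote is
    1 (up-voted), 0 (down-voted), or None for human-to-bot messages,
    which carry no vote and are added directly to the context.
    """
    tuples, context = [], []
    for speaker, text, vote in turns:
        if vote is None:               # human-to-bot message: context only
            context.append(text)
            continue
        tuples.append({
            "article": article,
            "context": list(context),  # snapshot of the history so far
            "message": text,
            "vote": vote,
        })
        context.append(text)           # the message joins the context afterwards
    return tuples
```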

This data is used to train the baseline message scoring mechanism described in Section 5.1.1 prior to the final round of the competition. It is split into training and validation sets of 80% and 20% respectively. Another conversational dataset (see Section 6) is used to train other message scoring models after the competition.

4 Generation of Candidate Responses

In this section a flexible ensemble system is presented as a solution to the competition. Technical details about its implementation are presented in the Appendix A. The high-level view of this system is made of three components: response generation, response scoring and response selection. A description of this procedure can be seen in Figure 3.

Figure 3: High-level view of the ensemble system. A three-step procedure is followed: 1) generation of candidate responses, 2) scoring of all candidates, 3) selection of one of them.

The objective of the response generation component is to produce multiple candidate responses for a given conversation state (defined as the randomly assigned news paragraph and the conversation history). During the final step, the system will return one of these candidate responses. It is thus important to produce various types of responses. To that end, generative sequence-to-sequence models, retrieval-based systems, and rule-based systems are all used.
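The three-step procedure can be sketched as follows; the generators, scorer, and selector here are trivial stand-ins for the trained models described in the remainder of the paper:

```python
def respond(state, generators, scorer, selector):
    """One interaction of the three-step ensemble:
    1) each model proposes a candidate response for the conversation state,
    2) a scoring function ranks every candidate,
    3) a selection criterion picks the message to return.
    """
    candidates = [g(state) for g in generators]           # step 1: generate
    scored = [(scorer(state, c), c) for c in candidates]  # step 2: score
    return selector(scored)                               # step 3: select

# Illustrative placeholders: two fixed "models", a length-based scorer,
# and a greedy (argmax) selector.
generators = [lambda s: "Hello!", lambda s: "What is the article about?"]
scorer = lambda s, c: len(c)
selector = lambda scored: max(scored)[1]
```

In the actual system the scorer is a neural network (Section 5) and the selector is either rule-based or statistical, but the control flow at each interaction is the same.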

4.1 Generative sequence-to-sequence models

Sequence-to-sequence models are fully generative systems, meaning they generate sentences word by word. The unique challenge of such systems is that they need to learn syntactic and grammatical rules in addition to knowing what to say. To do so, they are trained on {input - output} sentence pairs. The motivation to use this type of model is to produce flexible and generic responses. In this work, two distinct models under this scope are implemented: the Hierarchical Recurrent Encoder Decoder (HRED) (Serban et al., 2016) and the Neural Question Generator (NQG) (Du et al., 2017).

4.1.1 Hierarchical Recurrent Encoder Decoder

The objective of the ConvAI challenge is to hold a contextual conversation with the user. To introduce this notion of context, a model capable of reading previous messages in the conversation is required. Therefore, the commonly used neural model Hierarchical Recurrent Encoder Decoder (HRED) is chosen because its hierarchy permits tracking longer context for the conversation.

The HRED model follows an encoder-decoder architecture (Cho et al., 2014b). The encoder is made of two (hierarchical) recurrent neural networks encoding the input sentences into a high-dimensional context vector. The decoder is made of a third recurrent network decoding the context vector to output a sequence of words. For all three recurrent networks, the LSTM unit (Hochreiter and Schmidhuber, 1997) is used. The first LSTM encodes each dialog message into a vector m_i by taking word vectors w_1, ..., w_T as input at each time step. The second LSTM encodes the entire conversation history into a context vector c by taking the message vectors m_1, ..., m_t as input at each time step. Eventually, the third LSTM decodes, word by word, the next dialog message by taking as input the previously predicted word y_{t-1} and the context vector c. A pictorial description of this process can be seen in Figure 4.

Figure 4: The Hierarchical Recurrent Encoder Decoder model with three LSTM networks: one encoder at the word level, a second encoder at the utterance level, and a third decoder predicting words of the next utterance. Here the model is given two messages in the context and is predicting the third message word by word.

Two versions of this architecture are trained: one on social interactions in order to add a social component into the system; and another on contextual, more focused conversations to give some grounded knowledge to the chatbot. The first dataset is made of roughly 1 million conversations scraped from Twitter (Ritter et al., 2010; Danescu-Niculescu-Mizil et al., 2011); while the other is made of roughly 4 million conversations scraped from Reddit Politics, Reddit News, and Reddit Movies. Each conversation is made of 3 to 6 messages between two users.

Given a conversation, both models are trained to minimize the negative log likelihood of the next message. This is done by projecting the decoded vector at each time step into a vector of vocabulary size, and applying a softmax function on it to get a probability distribution over output tokens. More formally, the output vector of the decoder at time-step t is o_t = softmax(W h_t) = p(y_t | y_1, ..., y_{t-1}, c), where W is the learned projection matrix, h_t is the decoder hidden state, and y_1, ..., y_{t-1} are the tokens sampled in the previous time-steps. The next token y_t is then sampled from this probability distribution. After a week of training, the HRED models are able to have short and generic social interactions such as greetings and pretending to do some activity. However, generated responses tend to be qualitatively short, generic, and often off-topic, as shown in Table 1.

True Conversation (turn 1):
U: I met a Tibetan once, he was nice.
B: Hello. I don’t think the article is about that. What is it about?
HRED (twitter): I think it’s a good thing. I’m not sure. I think I’m a bit of a child.
HRED (reddit): I’m a man.

True Conversation (turn 2):
U: Society & Culture
B: More or less. It talks about dog!
HRED (twitter): *laughs* I’m not sure if I’m a bit of a freak. But I’m sure he’ll be fine.
HRED (reddit): I’m a little disappointed that this is a joke.

Table 1: Examples of HRED candidate responses for each turn in a specific conversation between a human user (U:) and the chatbot (B:). In all cases the generated responses are off-topic. In addition, responses coming from the model trained on Reddit tend to be shorter.
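The training objective described above can be illustrated numerically; this is a minimal sketch of the per-message negative log likelihood computation, not the authors' implementation, and the logits are supplied directly rather than produced by an LSTM:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def nll_of_message(decoder_logits, target_ids):
    """Negative log likelihood of a target message under the decoder.

    `decoder_logits[t]` is the vocabulary-sized output vector at time-step t
    (in the real model it is produced by the third LSTM from the context
    vector c and the previously predicted tokens); `target_ids[t]` is the
    index of the true token at that step.
    """
    nll = 0.0
    for logits, y in zip(decoder_logits, target_ids):
        probs = softmax(logits)
        nll -= math.log(probs[y])
    return nll
```

Minimizing this quantity over {context, next message} pairs is what both HRED variants are trained to do.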

4.1.2 Neural Question Generator

The second generative sequence-to-sequence model used in the ensemble is the Neural Question Generator (NQG) (Du et al., 2017). The motivation to use this model is to increase the interactivity between the bot and the user. One way to proactively do so is to ask the user questions about the random article. The objective of this model is to ask questions related to the article, engaging the user to read it and reason about it.

NQG is trained on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) to solve the inverse of the reading comprehension task. That is, instead of answering questions about a piece of text, it automatically generates questions regarding that piece of text (Du et al., 2017). This model is exactly what the chatbot needs to ask questions about the news paragraph given at the beginning of each conversation. The SQuAD dataset provides paragraphs of Wikipedia articles with many questions on each paragraph, and the original task is to retrieve the span of the paragraph that answers a given question. In order to form a dataset for the question generation task, the entire sentence that provides the answer (not just the answer span) is retrieved for each question in an article. Thus, {sentence - question} pairs are created to train the model.
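A minimal sketch of this pair construction, assuming SQuAD-style (question, answer_start) annotations and a naive period-based sentence splitter (the paper does not specify the preprocessing at this level of detail):

```python
def sentence_question_pairs(paragraph, qas):
    """Build {sentence, question} training pairs for question generation.

    `qas` is a list of (question, answer_start) pairs: for each question,
    the whole sentence containing the answer span (not just the span)
    becomes the source side of the pair.
    """
    # Record the character interval each sentence occupies in the paragraph.
    sentences, bounds, start = [], [], 0
    for sent in paragraph.split(". "):
        end = start + len(sent)
        sentences.append(sent)
        bounds.append((start, end))
        start = end + 2  # account for the removed ". " separator
    pairs = []
    for question, answer_start in qas:
        for sent, (lo, hi) in zip(sentences, bounds):
            if lo <= answer_start <= hi:
                pairs.append({"sentence": sent, "question": question})
                break
    return pairs
```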

Similarly to the previous generative model, NQG is a sequence-to-sequence recurrent neural network following an encoder-decoder architecture (Cho et al., 2014b). One LSTM network encodes the input sentence into a vector representation and the other LSTM network decodes that vector into a question by sampling one word at a time. Like the previous model, the NQG system is trained to minimize the negative log likelihood of the generated question, conditioned on the input sentence.

This model is run on every sentence of the incoming article at the beginning of each conversation. Generated questions are saved so that, at any time in the conversation, one of them can be asked to the user without any latency. Overall, this system generates meaningful questions, as shown in Table 2. Surprisingly, it can also generate questions that do not have an answer in the news paragraph. This is a key feature that forces users to read the news article attentively and eventually look for that information online if they want the correct answer. This can be observed in the top example of Table 2. Having such a feature provides an interesting starting point for future systems that aim at increasing users’ attention.

Sentence: The median longevity of mixed-breed dogs, taken as an average of all sizes, is one or more years longer than that of purebred dogs when all breeds are averaged.
Generated Question: What is the average estimate of all dogs?

Sentence: On 5 December 2011, Pusuke, the world’s oldest living dog recognized by Guinness Book of World Records, died aged 26 years and 9 months.
Generated Question: Who recognized the world’s oldest living dog?

Sentence: The dog widely reported to be the longest-lived is “Bluey”, who died in 1939 and was claimed to be 29.5 years old at the time of his death.
Generated Question: how many years. longest-lived longest-lived and ?

Table 2: Examples of questions generated by the NQG model for specific article sentences. The top example is a case in which the answer to the question is not in the source sentence. The middle example is a classic, easy-to-answer question. The bottom example is a case in which the model failed to generate a syntactically correct question.

4.2 Retrieval-based systems

The second category of systems used in the ensemble of response generation models is retrieval-based. Unlike generative systems, these models do not have to learn the syntactic structure of a sentence. Their objective is instead to retrieve the most relevant response conditioned on the conversation state. In this category, three distinct models are considered: the Document Reader Question Answering model (DrQA) (Chen et al., 2017), a topic classifier, and a fact retriever.

4.2.1 DrQA

The natural next model after introducing a question generator is a question answering model. This is a crucial component of the ensemble for the ConvAI challenge, as users will partially test the chatbot on its understanding of the randomly assigned news paragraph. The objective is to correctly answer as many user questions as possible, and thereby hold meaningful conversations with respect to the news paragraphs. The Document Reader Question Answering (DrQA) model (Chen et al., 2017) was introduced to answer open-domain questions using some input text document. Since the system is given a random news article in each conversation, a question-answering dataset on similar paragraphs is used: the SQuAD dataset (Rajpurkar et al., 2016). As previously mentioned, this dataset provides paragraphs of Wikipedia articles with many questions on each paragraph, and the task is to retrieve the span that answers a given question. That is, given an article and a question, the model is trained to predict the starting and ending positions in the paragraph that answer the given question. In particular, the probability of each word in the input paragraph being the starting or the ending token is computed like so:

P_start(i) ∝ exp(p_i W_s q),   P_end(i) ∝ exp(p_i W_e q),

where p_i is a vector representation of token i in the paragraph, W_s and W_e are matrices of learned parameters for the starting and ending probabilities respectively, and q is a vector representation of the question. The question vector q is a weighted sum of all the hidden units of a bi-directional LSTM network over the word vectors of the question, also known as a self-attention technique (Bahdanau et al., 2014; Vaswani et al., 2017). The paragraph token representation p_i is the output of a hierarchical bi-directional LSTM network over word vectors at time-step i. This can be seen as the concatenation of the representations of the left and right sides of token i in the paragraph. After training on the SQuAD dataset, this model’s validation accuracy is 69%.
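A tiny numerical sketch of this bilinear span scoring (toy dimensions, plain Python lists; the vectors here stand in for the LSTM outputs, and this is not the actual DrQA implementation):

```python
import math

def span_scores(paragraph_vecs, q, W_start, W_end):
    """Bilinear span scoring in the style of DrQA:
    P_start(i) ∝ exp(p_i · W_s · q) and P_end(i) ∝ exp(p_i · W_e · q),
    normalized over all paragraph positions i with a softmax.
    """
    def matvec(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]

    def softmax(scores):
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    Wq_s, Wq_e = matvec(W_start, q), matvec(W_end, q)
    start_logits = [sum(a * b for a, b in zip(p, Wq_s)) for p in paragraph_vecs]
    end_logits = [sum(a * b for a, b in zip(p, Wq_e)) for p in paragraph_vecs]
    return softmax(start_logits), softmax(end_logits)
```

The predicted answer span is then the (start, end) pair maximizing the product of the two probabilities, subject to start ≤ end.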

Overall, this model finds appropriate answers to questions about the article; an example conversation can be seen in Table 3. DrQA successfully accomplishes the objective of correctly answering questions, and by restating part of the article it also maintains conversation coherence and helps the user stay focused on the subject of the news paragraph. Having such a feature provides a good starting point for future systems that want to avoid diverging from a main topic.

Article DrQA Conversation
In addition to the above, Greece is also to start oil
and gas exploration in other locations in the Ionian
Sea, as well as the Libyan Sea, within the Greek
exclusive economic zone, south of Crete. The
Ministry of the Environment, Energy and
Climate Change announced that there was interest
from various countries (including Norway and the
United States) in exploration, and the first
results regarding the amount of oil and gas
in these locations were expected in the summer
of 2012. In November 2012, a report published by
Deutsche Bank estimated the value of natural gas
reserves south of Crete at 427 billion euros.
U: Where is Greece starting oil
and gas explorations?
B: Ionian Sea
U: Where is the Greek exclusive
economic zone?
B: south of Crete
U: Which countries are interested
in the exploration?
B: Norway and the United States
U: When are the first results about
gas and oil expected?
B: in the summer of 2012
U: How much is the estimated value
of the gas reserves?
B: 427 billion euros
U: Which agency published the study
about the estimated reserve value?
B: Deutsche Bank
U: You are so smart!!
Table 3: Example of conversation between a human user (U:) and the DrQA model (B:). The human user starts the conversation by asking a question, and DrQA answers in the next message. The conversation alternates like this between human and DrQA until the end when the user replies “You are so smart!!”.

4.2.2 Topic Classifier

While DrQA can accurately answer questions about specific facts from the article, it cannot answer a generic topical question, such as “what is this article about?”, which requires the system to understand the overall theme of the article. To solve this problem, a topic classifier is implemented. Its objective is to answer this common question and give the user the impression that the chatbot understands the high-level topic of the article.

To extract an overall topic for any article’s paragraph, a text classifier is trained using fastText (Joulin et al., 2016) on the Yahoo News Corpus (Zhang and LeCun, 2015). This dataset is made of news articles, each labeled with one topic from a list of ten: “Society & Culture”, “Science & Mathematics”, “Health”, “Education & Reference”, “Computers & Internet”, “Sports”, “Business & Finance”, “Entertainment & Music”, “Family & Relationships”, “Politics & Government”. This dataset is chosen because the number of labels is small (10), yet the topics are broad enough that most of the articles in the competition fall under one of them.

The advantage of using fastText is that it is a simple and small model that runs quickly (Joulin et al., 2016), thus minimizing the user’s wait time for a response. The article is first encoded as a bag of n-grams features. This encoding is then multiplied by learned parameter matrices before applying a softmax operation. The classification is then done by sampling from the resulting probability distribution over the possible topics. The parameters are trained to minimize the negative log-likelihood of the predicted classes:

−(1/N) Σ_{n=1}^{N} y_n log( f(B A x_n) )

where N is the number of articles in a batch, y_n is the n-th article’s label, f is the softmax operation, A and B are the weight matrices being learned, and x_n is the normalized bag of n-grams features for the n-th article. After training, the test accuracy on the held-out examples is 61%.
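The bag-of-n-grams encoding and the NLL objective can be sketched as follows; the tiny vocabulary, dimensions, and the `nll_step` helper are illustrative assumptions, not fastText’s actual code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ngrams(tokens, n_max=2):
    """All 1..n_max-grams of a token list."""
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def featurize(text, vocab):
    """Normalized bag-of-n-grams vector x_n."""
    x = np.zeros(len(vocab))
    for g in ngrams(text.lower().split()):
        if g in vocab:
            x[vocab[g]] += 1.0
    return x / max(x.sum(), 1.0)

def predict_topic(x, A, B):
    """f(B A x): two linear maps followed by a softmax over the topics."""
    return softmax(B @ (A @ x))

def nll_step(x, y, A, B, lr=0.5):
    """One SGD step on the negative log-likelihood -log f(BAx)[y]."""
    h = A @ x
    p = softmax(B @ h)
    g = p.copy()
    g[y] -= 1.0                      # gradient of the NLL w.r.t. the logits
    gB = np.outer(g, h)              # compute both gradients before updating
    gA = np.outer(B.T @ g, x)
    B -= lr * gB
    A -= lr * gA
```

In the real system the softmax is taken over the ten Yahoo News topics; here a two-class toy suffices to show the mechanics.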

This model is run once at the beginning of each conversation. The predicted topic is then stored in case the user asks for it later, thus minimizing the answer time. Whenever this model is used, a pre-defined sentence with the predicted topic replacing a generic placeholder is returned to the user. Some examples can be seen in Table 4.

Topic Sentences
This article is about <topic>
I think it’s about <topic>
It’s about <topic>
The article is related to <topic>
Table 4: List of possible sentences returned to the user when asking for the topic of the article. At runtime, one sentence is randomly picked and “<topic>” is replaced by the predicted class from Yahoo News Corpus.

4.2.3 Fact Retriever

After a few interactions with the system, the user may have no more questions about the article, or may lose interest. This inherently penalizes the system in terms of engagement, which is an important performance metric for the competition (see Section 3). To increase user engagement and to bring the focus back to the current topic, facts relevant to the current conversation are presented using a fact retrieval model. The primary goal of this model is to make the conversation interesting for the user and avoid boredom, but also to provide a fun ‘exit door’ when the system is not sure what to say.

This model retrieves the fact most relevant to the current conversation from a curated list of interesting and fun facts, including facts about animals, geography and history. The list of facts was shared with authorization from the authors of the MILABOT (Serban et al., 2017). All facts are encoded by averaging their pre-trained word vectors (Word2Vec embeddings (Mikolov et al., 2013a, b) are used). The fact that minimizes its cosine distance to the average word vector of the conversation history is returned. The fact vectors are computed only once before the challenge and saved as part of the model; only the average word vector of the conversation history must be computed after each interaction, keeping the model fast at runtime. Formally, a fact is selected like so:

d_i = cosine_distance(F_i, c),    selected fact = argmin_i d_i

where F is the matrix of all fact vectors, c is the conversation history vector, and d is the list of distances, one for each fact F_i. If the selected fact has already been returned in the conversation, the next one minimizing the cosine distance to the conversation history is returned instead. Retrieved facts are incorporated into a randomly chosen pre-defined sentence, similar to the topic classification model. A set of prefixes is also defined in case the user asks a question. Examples can be seen in Table 5.

Examples of <fact>:
  • Butterflies cannot fly if their body temperature is less than 86 degrees.
  • Neurons multiply at a rate of 250,000 neurons per minute during pregnancy.
  • The human brain is about 75% water.
  • Flies jump backwards during takeoff.
  • In every episode of Seinfeld there is a Superman somewhere.
Examples of <fact sentences>:
  • Did you know that <fact>
  • Do you know that <fact>
  • Here’s an interesting fact, <fact>
  • Here’s a fact, <fact>
Examples of prefixes:
  • I’m not sure. However, <fact sentence>
  • I’m not sure. But <fact sentence>
  • I’m not quite sure. But <fact sentence>
  • I don’t have an answer for that. But <fact sentence>
  • I don’t know. But <fact sentence>
Table 5: Examples of some facts, sentences used to include a fact, and prefixes to use when the user asks a question. At runtime, one sentence is randomly picked (with one random prefix if a question is asked) and “<fact>” is replaced by the most related fact.
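The retrieval mechanism can be sketched with NumPy; the function names (`history_vector`, `pick_fact`) and the embedding dimension are illustrative assumptions:

```python
import numpy as np

def history_vector(messages, emb, dim=300):
    """Average pre-trained word vector of the whole conversation history."""
    vecs = [emb[w] for m in messages for w in m.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pick_fact(fact_vecs, history_vec, used):
    """Index of the unused fact minimizing cosine distance to the history."""
    F = fact_vecs / np.linalg.norm(fact_vecs, axis=1, keepdims=True)
    c = history_vec / (np.linalg.norm(history_vec) or 1.0)
    d = 1.0 - F @ c                  # d_i = cosine distance between F_i and c
    for i in np.argsort(d):          # closest facts first
        if int(i) not in used:       # skip facts already said in this chat
            return int(i)
    return int(np.argmin(d))         # everything used: fall back to the closest
```

The fact matrix `fact_vecs` corresponds to the pre-computed averages saved before the challenge, so only `history_vector` needs to run at each interaction.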

4.3 Rule-based systems

Finally, the last category of models in the ensemble are rule-based models. Unlike the previous two categories, these do not require any training. They are simple yet effective models that work for specific cases. Three such models are considered: the Entity Sentences model, the Simple Answers model, and the A.L.I.C.E. bot.

4.3.1 Entity Sentences

In addition to the Neural Question Generator (NQG) model that proactively asks questions about the news paragraph, a rule-based model is used to ask questions and make statements about entities present in the article. This model is added to the ensemble in case the NQG model fails to generate a question: as previously mentioned, and shown in Table 2, NQG may generate incoherent questions. The Entity Sentences model offers an alternative way to talk about the given paragraph with the user. Overall, the objective of this model is to increase user engagement by talking about things related to the article and by asking simple questions.

Examples of Entity Sentences
Do you know what <person> did in his life ?
Have you ever used any of <orgs>’s product or services ?
What do we eat in <gpe>? I’m starving!
Have you ever been to <loc> ? I heard it’s beautiful.
Once, I bought a <product>, but then somebody stole it from me.
What do you think about <event> ?
Do you know who did <work of art> ?
Do you know how to speak <language> ?
What happened in <date> ?
I met a <norp> once, she was nice.
Table 6: Examples of entity sentences for different named entity types. In order from top to bottom: “<person>” (people, including fictional); “<orgs>” (companies, agencies, institutions, etc.); “<gpe>” (countries, cities, states); “<loc>” (mountain ranges, bodies of water); “<product>” (objects, vehicles, foods, etc.); “<event>” (named hurricanes, battles, wars, sports events, etc.); “<work of art>” (titles of books, songs, etc.); “<language>”, “<date>”, “<norp>” (nationalities or religious or political groups).

A set of 50 different questions and statements is manually defined with special entity tags in them. The sentences are chosen by the authors to support a wide range of possible entity tags. The paragraph received at the beginning of each conversation is parsed with the spaCy Named Entity Recognizer to recognize the following entities: “persons”, “organizations”, “geographical entities”, “locations”, “products”, “events”, “work of art” (books, songs), “languages”, “dates”, “nationalities, religious or political groups”. These entities are chosen because they are expected to be the most prevalent ones in a news text, allowing the model to say more than one statement or question for each random paragraph. After recognizing these entities in the article, all tags in the list of 50 sentences are replaced by the appropriate entity. A sentence from the list that has not been used before is then randomly returned to the user. Some examples can be found in Table 6.
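The template-filling step can be sketched as follows, assuming the entities have already been extracted by a NER pass (the system uses spaCy for this); the reduced template list and the `fill_templates` helper are illustrative:

```python
import random
import re

# Three illustrative templates in the style of Table 6 (the real set has 50).
TEMPLATES = [
    "Do you know what <person> did in his life ?",
    "Have you ever been to <loc> ? I heard it's beautiful.",
    "What do you think about <event> ?",
]

def fill_templates(templates, entities, used=()):
    """Return unused sentences whose tags are all covered by the article.

    `entities` maps a tag (e.g. 'person') to the strings found in the
    paragraph; in the actual system it is produced by spaCy's NER."""
    out = []
    for t in templates:
        tags = re.findall(r"<([^>]+)>", t)
        if t in used or not all(tag in entities for tag in tags):
            continue                       # skip templates we cannot fill
        s = t
        for tag in tags:
            s = s.replace(f"<{tag}>", random.choice(entities[tag]))
        out.append(s)
    return out
```

One sentence from the returned list would then be picked at random, mirroring the selection described above.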

4.3.2 Simple Answers

While generative models such as HRED provide generic chit-chat conversations, they cannot handle specific unrelated queries that are not represented in their original training data. These unrelated queries tend to be questions regarding the personality of the chatbot. To handle them, a rule-based model consisting of regular expressions is used. The Simple Answers model’s goal is to give the system a personality, and to handle unrelated questions and edge cases.

A set of regular expressions is constructed to catch generic questions that the user may ask. The regular expressions are defined by the authors to handle a wide spectrum of formulations of some common personification questions. For each question, an appropriate answer is also manually defined by the authors. Unlike all the other models previously described, Simple Answers only returns a candidate response when the previous user message matches an item in the set of regular expressions. Examples of such expressions and their pre-defined answers can be seen in Table 7.

Trigger sentences Pre-defined answers
How are you ? I am great! What about you?
What are you ? I am a chatbot.
Who made you ? I am a chatbot developed by students at McGill University.
What’s your name ? My name is RLLChatbot.
Where do you live ? I can live everywhere at anytime.
Table 7: Examples of sentences captured by our regular expressions and the pre-defined answers the model returns.
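The Simple Answers lookup can be sketched as an ordered list of (pattern, answer) pairs in the spirit of Table 7; the exact patterns below are illustrative, not the system’s actual rules:

```python
import re

# Illustrative rules: each pattern is tried in order against the user message.
RULES = [
    (re.compile(r"\bhow are you\b", re.I), "I am great! What about you?"),
    (re.compile(r"\bwho (made|created|built) you\b", re.I),
     "I am a chatbot developed by students at McGill University."),
    (re.compile(r"\bwhat('| i)?s your name\b", re.I), "My name is RLLChatbot."),
]

def simple_answer(message):
    """Return a pre-defined answer if a rule matches, else None (no candidate)."""
    for pattern, answer in RULES:
        if pattern.search(message):
            return answer
    return None
```

Returning `None` when nothing matches is what lets the rest of the ensemble take over for messages Simple Answers cannot handle.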

4.3.3 A.L.I.C.E. Bot

One weakness of generative models such as HRED is that responses tend to be short, brisk and not to the point. In addition, since HRED is trained on publicly available Reddit and Twitter datasets, the model is rife with biases and hate speech, which makes some responses unsuitable in certain contexts (Henderson et al., 2018). Moreover, the retrieval systems previously defined are motivated by specific needs and do not provide the generic responses often present in casual conversations. To mitigate these issues, a complex rule-based model is included in the ensemble: the Artificial Linguistic Internet Computer Entity (A.L.I.C.E.) (Wallace, 2009). As introduced in Section 2.1, A.L.I.C.E. is composed of roughly 20,000 conversation rules with patterns to recognize and template answers to return. The goal of this model is to increase the fluency of the conversation by discussing generic topics, in cases when other models fail to generate coherent responses.

True Conversation A.L.I.C.E. candidate responses
U: Hello chatbot! What is this article about?
A: Hi there! I am a social robot from earth. Ask me anything about news, politics or movies!
B: It’s about Science & Mathematics
U: Ok, can you tell me anything more about it?
A: Excuse me! What I want to tell you is, how much fun it is talking with you.
B: Excuse me! What I want to tell you is, how much fun it is talking with you.
U: lol ok not much apparently. What happened in 1906?
A: I’m glad you find this amusing. Aha. That is a non-committal reply. I don’t know what happened.
B: I don’t know. But here’s a fact! <fact>
U: hmmm ok good to know
A: Pardon me? are you shy? Glad to hear it.
Table 8: Examples of A.L.I.C.E. candidate responses (A:) for each turn in a specific conversation between a human user (U:) and the chatbot (B:). Note that the second bot message (“B: Excuse me!…”) is the message returned by A.L.I.C.E.; the other bot messages in the conversation came from different models.

ALICE is run after every user message to produce a possible response. Since the response only depends on the conversation history and not the article, this model is mostly used to maintain a social and coherent conversation with the user. This model provides a natural and coherent way to continue the conversation when other models fail and is thus an important fall-back. Some examples of messages can be seen in Table 8.

5 Scoring and Selection of Responses

After presenting the ensemble of models producing candidate responses, this section introduces the mechanism deciding which response is returned to the user. As illustrated in Figure 3, this process is done in two steps: each candidate response is first given a score, and the final selection is done based on the score and the conversation state.

5.1 Scoring of candidate responses

After generating several candidate responses in parallel, the system must pick exactly one response to give to the user. To help in this decision, a score is given to each possible response. Two alternative approaches are considered: the first is based on classification with supervised learning, while the other is based on value prediction with reinforcement learning.

5.1.1 Supervised Scoring

Recall that the ConvAI competition allows human participants to up-vote or down-vote responses from the other participant in a conversation. The vote gives important feedback on the quality of the responses presented to the user. This information is thus used to build a classifier that predicts the human vote for a given candidate response, conditioned on the conversation history and the article.

The competition dataset described in Section 3.2 is formatted into a collection of pairs (x, y), where x is a vector representation of the article, the conversation history, and the next response; y = 1 for up-voted responses and y = 0 for down-voted responses. Non-voted messages are ignored because the challenge gives no incentive for human users to vote on each response: non-voted messages have an equal probability of being appreciated by the user or not, and the absence of a vote reflects the laziness of a user more than the actual quality of a response. Assuming otherwise would add a lot of noise to the training signal of the model. The input vector x is fed into a fully connected feed-forward neural network (denoted f). A softmax layer is added at the output of f to get the probability ŷ of the input message vector being an up-voted response. A dropout layer (Srivastava et al., 2014) is also added before the last layer to prevent the network from overfitting on the small training set. The architecture of the network is illustrated in Figure 5. All parameters of the network are trained to minimize the cross-entropy loss between the predicted probabilities and the true vote of each message:

L = −(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

with ŷ_i the predicted probability of response i being up-voted, and y_i the true up- or down-vote label represented as 1 or 0 respectively.

Figure 5: Architecture of the feed-forward network used to predict candidate responses’ votes. Three fully connected (fc) feed-forward layers are used between the input feature vector and the output, a two-dimensional softmax over the up- and down-vote classes.

In order to represent the article, the conversation history, and the next response as a vector, a set of features inspired by Serban et al. (2017) is manually created for making that prediction. This allows the model to have fewer trainable parameters and to work with the small dataset of the competition, as mentioned in Section 3.2. The set of features computed for each (article, context, candidate) triple is listed below:

  • Average word embeddings of the candidate response, the previous messages (context), and the article. Pre-trained word2vec embeddings (Mikolov et al., 2013a) are used for each of them.

  • Similarity metrics between the candidate response & the context, and between the candidate response & the article. Average embedding cosine similarity, extrema embedding score (Forgues et al., 2014), and greedy matching score (Rus and Lintean, 2012) are used with pre-trained word2vec embeddings (Mikolov et al., 2013a).

  • Number of non stop-words, bi-grams, tri-grams, and spacy entities overlap between the candidate response & the context, and between the candidate response & the article.

  • Whether or not the candidate response is generic. A message is defined to be generic if it is only made of stop-words or words shorter than 3 characters. The same is computed for the previous user message.

  • Whether or not the candidate response has: one or more words starting by ‘wh’, one or more intensifier words (e.g. amazingly, crazy, and so on), one or more confusion words (e.g. confused, stupid, nonsense, and so on), one or more profanity words, and one or more negation words (not or n’t). The presence of each of these categories is indicated by a 1 and the absence by a 0. The same is computed for the previous user message.

  • The number of messages in the conversation so far.

  • The number of sentences in the article.

  • The number of words in the candidate message and in the previous user message.

  • The type of the candidate response. A type can be any combination of ‘greeting’, ‘question’, ‘affirmative’, ‘negative’, ‘request’, or ‘politic’. A heuristic decision is made based on word presence for each of the types. The same is computed for the previous user message.

  • Sentiment score (negative, neutral, or positive) of the candidate response. The pre-trained Vader sentiment analyzer (Gilbert and Hutto, 2014) is used. The same is computed for the previous user message.

The combination of all these features forms the input vector to the classifier.
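A few of the hand-crafted features above can be sketched in plain Python; the tiny stop-word list and helper names are illustrative simplifications:

```python
# Tiny illustrative stop-word list; the real system uses a full one.
STOPWORDS = {"the", "a", "an", "is", "it", "to", "of", "and", "or", "i", "you"}

def is_generic(message):
    """A message is generic if every token is a stop-word or shorter than 3 chars."""
    return all(t in STOPWORDS or len(t) < 3 for t in message.lower().split())

def ngram_overlap(a, b, n=2):
    """Number of shared n-grams between e.g. a candidate and the context/article."""
    def grams(toks):
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return len(grams(a.lower().split()) & grams(b.lower().split()))

def has_wh_word(message):
    """Whether the message contains a word starting with 'wh' (who, what, ...)."""
    return any(t.startswith("wh") for t in message.lower().split())
```

Each helper produces one entry (or a small group of entries) of the feature vector fed to the classifier.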

This architecture is trained before the final competition with data from the first human evaluation round of the ConvAI challenge; the data released after this early human evaluation is described in Section 3.2. A random parameter search is done to find the best parameter combination for the system submitted to the ConvAI challenge. The validation accuracy is evaluated after each training epoch, and early stopping with a patience of 20 epochs is performed. The model with the highest validation accuracy was trained with the RMSProp optimizer (Hinton et al., 2012), ReLU activation functions, and dropout. The resulting scoring model is considered the Baseline model. Further experiments with the same architecture on additional data (performed after the ConvAI challenge) are described in Section 7. At runtime, the feature vector of each candidate response is computed and passed to the neural network. The output gives the probabilities of the candidate response being up- or down-voted; the score given to each candidate response is defined to be its probability of being up-voted.

5.1.2 Q-Scoring

The second scoring mechanism was implemented after the ConvAI competition and is based on reinforcement learning. Instead of predicting the immediate reward of a candidate response (the up- or down-vote), the Q-value of a response is estimated: the expected return after returning that response. The state of the environment is defined to be the news paragraph and the conversation history, the possible actions are the candidate responses to return to the user, and the reward of taking such an action is a weighted version of the up- or down-vote signal. A down-voted response receives the lowest reward. An up-voted response receives a low reward if the end-of-conversation score is 1 (‘very bad’) or 2 (‘bad’), an intermediate reward if the end-of-conversation score is 3 (‘medium’) or 4 (‘good’), and the highest reward if the end-of-conversation score is 5 (‘very good’). This reward is arbitrarily chosen to penalize ‘very bad’ and ‘bad’ conversations because they are often incoherent, while ‘medium’, ‘good’ and ‘very good’ conversations are coherent. It may occur that at specific points in the conversation none of the candidate responses is coherent, yet one of them is up-voted by the user (in the data collection process presented in Section 6, the user is forced to up-vote one candidate response). This reward shaping protects the model from receiving the same reward for coherent and incoherent responses.
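The reward shaping can be sketched as follows; since the exact reward constants are not given here, the values `R_DOWN`, `R_LOW`, `R_MID` and `R_HIGH` below are hypothetical placeholders in [0, 1], not the paper’s numbers:

```python
# Hypothetical reward constants: rewards lie in [0, 1] as stated in the text,
# but these exact values are placeholders, not the paper's numbers.
R_DOWN, R_LOW, R_MID, R_HIGH = 0.0, 0.25, 0.75, 1.0

def shaped_reward(vote, end_score):
    """vote: +1 for an up-vote, -1 for a down-vote;
    end_score: end-of-conversation rating in 1..5."""
    if vote < 0:
        return R_DOWN
    if end_score in (1, 2):   # 'very bad' / 'bad': penalize even up-voted replies
        return R_LOW
    if end_score in (3, 4):   # 'medium' / 'good'
        return R_MID
    return R_HIGH             # 'very good'
```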

In order to predict Q-values, a large amount of bot-to-human conversational data is collected in addition to the competition dataset. The Q-value predictor is then trained to imitate the human behavior present in this data. Rather than performing classical Q-learning, where the agent interacts with the environment while training, a form of Neural Fitted Q-Iteration (Riedmiller, 2005) is implemented. The motivation is primarily practical: the only way in which the agent could be trained while interacting with its environment would be to have human users up-vote and down-vote responses while talking with the system. However, asking real users to do so is impractical; the time required for the system to learn is on the scale of days, but human users are not capable of interacting with the system continuously for several hours. This is why Neural Fitted Q-Iteration is used rather than traditional reinforcement learning: rather than collecting data from an exploratory policy, a narrow, human policy is used. Section 6 describes how conversations are collected for this task. For now, let’s assume that a collection of expert trajectories of the form (s, a, r, s′) is available, where s is the current state of the environment (article & conversation history), a is an action (a candidate response), r is the reward of taking that action (between 0 and 1, as described above), and s′ is the next state after taking action a (article & new conversation history including action a and the human response to it).

Figure 6: Architecture of the Deep Q-Network to predict candidate responses’ action value. Hierarchical GRU networks with shared weights and fully connected feed-forward layers are used. The input word vectors and GRUs output vectors are of dimension 300. The fully connected layers reduce the dimensionality at each depth and the output is a simple scalar.
inputs : D_train : a list of (s, a, r, s′) tuples
D_valid : a list of (s, a, r, s′) tuples
θ : weights for the first DQN network
θ⁻ : weights for the target DQN network
C : frequency at which to update the target net (2,000)
γ : discount factor (0.99)
E : number of maximal training episodes (10,000)
p : patience term (20)
while patience < p and episode < E do
       foreach batch of examples in D_train do
             update θ with one gradient step minimizing the Huber loss between Q(s, a; θ) and the target y = r + γ Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻);
             if step mod C = 0 then θ⁻ ← θ;
       end foreach
       compute the validation loss on D_valid with the current θ;
       if the validation loss improved then
             Save θ and reset patience;
       end if
end while
Algorithm 1 DQN training algorithm for response scoring
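The fitted-Q idea behind Algorithm 1 can be illustrated at toy scale; the sketch below substitutes a linear model fit by least squares for the deep network and its gradient steps, which is an assumption for illustration only:

```python
import numpy as np

def fitted_q_iteration(D, phi, n_actions, gamma=0.99, iters=10):
    """Fitted Q-iteration sketch: a linear model Q(s, a) = w . phi(s, a)
    stands in for the deep network, and a least-squares re-fit replaces
    gradient steps. D is a list of (s, a, r, s_next) transitions, with
    s_next=None for terminal transitions."""
    d = len(phi(D[0][0], D[0][1]))
    w = np.zeros(d)
    for _ in range(iters):
        X, y = [], []
        for s, a, r, s_next in D:
            # Bellman target computed from the frozen previous weights,
            # playing the role of the target network in Algorithm 1.
            q_next = 0.0 if s_next is None else max(
                phi(s_next, b) @ w for b in range(n_actions))
            X.append(phi(s, a))
            y.append(r + gamma * q_next)
        # Re-fit Q to the frozen targets.
        w = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
    return w
```

On a two-state chain this converges in a couple of iterations, with the value of the earlier state discounted by γ, which is the behavior the deep variant approximates with gradient steps and a target network.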

In addition to the previously introduced feed-forward network, a Deep Q-Network (DQN) is designed to predict Q-values. With a large amount of conversations (the collection is described in Section 6), a more complex architecture that automatically extracts features can be explored. Thus, recurrent neural networks are used to automatically represent the environment state (article & conversation) and the agent action (candidate message) in separate vectors. Similarly to the dueling architecture (Wang et al., 2015), this DQN splits the prediction of the Q-value into the sum of the state value V(s) and the advantage function A(s, a). This has the advantage of predicting a value for the state on its own and for each action separately, empirically yielding better results on some tasks. Since traditional recurrent networks are weak at encoding long time dependencies (Bengio et al., 1993, 1994), hierarchical Gated Recurrent Unit (GRU) networks (Cho et al., 2014c, a) with shared weights are used to encode the article, the conversation history, and the candidate response into vectors. GRUs are preferred over LSTM networks (Hochreiter and Schmidhuber, 1997) because of their similar performance with fewer parameters to train. The state and action vectors are then fed into fully connected feed-forward networks to compute a state value V(s) and an advantage value A(s, a). The final Q-value is defined as Q(s, a) = V(s) + A(s, a). Figure 6 gives a visual description of the architecture.

The entire network is trained end-to-end to minimize the Huber loss between the current estimate of the Q-value and the expected Q-value (also called the target) based on the observed reward:

L(q, y) = ½ (q − y)²  if |q − y| ≤ 1,  and  |q − y| − ½  otherwise,

with q = Q(s, a; θ) the estimated Q-value and y the target. The Huber loss is preferred over the mean squared error loss because it is less sensitive to outliers and in some cases prevents exploding gradients (Girshick, 2015). The threshold of 1 is the default value of the PyTorch library used. The Double DQN target is used in order to have a better estimate (Van Hasselt et al., 2016):

y = r + γ Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻)

where r is the immediate reward of taking action a in s, γ is a discount factor set to 0.99, s′ is the next state after taking action a in s, a′ is the next possible action in s′, and Q(·, ·; θ⁻) is a target Q-function that uses old parameters θ⁻ from earlier in training, which helps stabilize learning (Van Hasselt et al., 2016). These old parameters θ⁻ are periodically updated with the most recent ones θ. The training algorithm is described in Algorithm 1. Eventually, the Q-value of each candidate response is computed at runtime based on the current state of the conversation, and the score given to each candidate response is defined to be its Q-value. Experiments with this model are described in Section 7.
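Computing the Double DQN target for a single transition can be sketched as follows (the function name and signature are ours):

```python
import numpy as np

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99, done=False):
    """y = r + gamma * Q_target(s', argmax_a' Q_online(s', a')) for
    non-terminal s'; for a terminal transition the target is just r."""
    if done:
        return r
    a_star = int(np.argmax(q_next_online))    # action chosen by the online net
    return r + gamma * q_next_target[a_star]  # ...evaluated by the target net
```

Decoupling action selection (online network) from action evaluation (target network) is what reduces the over-estimation bias of vanilla DQN.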

5.2 Selecting one response

After scoring each of the candidate messages, the system must pick only one response for the user. At the time of the ConvAI challenge, the feed-forward classifier (described in Section 5.1.1) is used to score candidate messages. Since its validation accuracy is low (a random binary classifier can achieve 50%), a rule-based selection mechanism is built to help the system choose a response. Other scoring mechanisms explored after the competition are described in Section 7.

Figure 7: Set of rules followed during a conversation to decide which response to return to the user, based on both the conversation state and the score of each candidate response. The random article first arrives in the conversation; RLLChatbot then greets the user and asks a question. After each user response, these rules are followed.

Once a random wiki-news article’s paragraph is sent to both users, RLLChatbot follows the same pattern: it welcomes the participant with a scripted greeting message (“Hello! I hope you’re doing well. I am doing fantastic today! Let me go through the article real quick and we will start talking about it.”); and sends a message from either the Neural Question Generator (NQG) model (Section 4.1.2) or the Entity Sentences model (Section 4.3.1). One of these two models is randomly picked because they are best suited to start a conversation by asking an open question. The user is then free to answer. Based on the user’s reply, RLLChatbot generates a set of candidate responses, scores each of them, and returns a message based on the rules described below in order of specificity (also illustrated in Figure 7).

  1. The most specific module from the ensemble is the rule-based Simple Answers model (Section 4.3.2). The user’s message is thus parsed with the set of regular expressions from the model. If there is a match, the corresponding pre-defined answer is returned. If there are none, the next rule is applied.

  2. The next specific item that needs to be checked is if the user asks about the topic of the article. If the user’s last message matches the set of regular expressions designed to catch such messages, a response from the Topic Classifier model (Section 4.2.2) is returned. If not, the next rule is applied.

  3. Another important and common scenario expected to happen is when the user asks a question about the article. It is important to catch those cases because the DrQA model (Section 4.2.1) is specifically designed to answer questions. As such, if the previous user message has a common entity with the article, ends with a question mark, and has a ‘wh-’ word, a response from the DrQA model is returned. If the user’s response does not match those characteristics, the next rule is applied.

  4. In order to keep the conversation interesting for the participant, a ‘bored’ counter is introduced to see when the user could be bored and remedy that. This counter is incremented every time the user response is short (less than 3 words) or is entirely made of stop-words. As soon as this counter reaches 2, a response from the NQG model (Section 4.1.2), or the Fact Retriever model (Section 4.2.3), or the Entity Sentences model (Section 4.3.1) is sampled according to its score. The counter is then reset to 0. Only these models from the ensemble are sampled as they are best suited to re-launch the conversation and potentially start talking about something new. If the counter does not reach 2 or if the user is not considered ‘bored’, the next rule is applied.

  5. If a candidate response from one of the HRED models (Section 4.1.1) or the A.L.I.C.E. model (Section 4.3.3) has a high score (between 0.75 and 1.0), the candidate with the highest score is returned. The motivation is that if the scoring mechanism is strongly confident about a specific response, that response should be returned. These three models are only considered from the ensemble as they are the most flexible and produce generic conversations. If none of these responses have a high score, the next rule is applied.

  6. User messages are now split into two categories: they either ask a question, or they do not. If the user asked a question, the same generic models as in the previous rule are considered, with the addition of the DrQA model (Section 4.2.1), which is specifically trained to answer questions. More formally, if the user message ends with a question mark and has a ‘wh-’ word, a response from either one of the HRED models, the A.L.I.C.E. model, or the DrQA model is sampled based on its score (as long as it is greater than 0.25). These models are considered because they are all flexible in terms of the type of responses they produce and thus potentially capable of answering a broad range of questions. If the user message does not match these characteristics, the next rule is applied.

  7. When the user message does not contain a question, the same models as in the previous rule are sampled, except that the DrQA model is replaced by the NQG model (Section 4.1.2) and the Entity Sentences model (Section 4.3.1). DrQA is not considered because it is designed to answer questions, while the NQG and Entity Sentences models are designed to ask questions.

  8. Finally, if none of the above scenarios applies (i.e. most responses have a score below 0.25), a more or less related fact from the Fact Retriever model (Section 4.2.3) is returned. This rule is introduced as a safe ‘exit door’ for the system so that it always has something to say.

After selecting a response to return to the user, RLLChatbot waits again for the user to reply and selects the next response based on the same set of rules following the same order, until the participant decides to terminate the conversation.
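The selection cascade above can be sketched in code. This is an illustrative reconstruction of rules 4 to 8, not the system's actual implementation: the model names, the stop-word list, and the helper function are hypothetical stand-ins.

```python
import random

def sample_by_score(pool):
    """Sample one model from `pool` (name -> (response, score)) with
    probability proportional to its score; return its response."""
    models = list(pool)
    weights = [pool[m][1] for m in models]
    chosen = random.choices(models, weights=weights, k=1)[0]
    return pool[chosen][0]

def select_response(user_msg, scored, bored_count):
    """Sketch of the rule cascade (rules 4-8). `scored` maps an
    illustrative model name to a (response, score) pair.
    Returns (response, updated bored counter)."""
    STOPWORDS = {"the", "a", "an", "is", "it", "ok", "yes", "no"}  # placeholder list
    words = user_msg.lower().rstrip("?.!").split()

    # Rule 4: increment the 'bored' counter on short or all-stop-word replies.
    if len(words) < 3 or all(w in STOPWORDS for w in words):
        bored_count += 1
    if bored_count >= 2:
        pool = {m: scored[m] for m in ("nqg", "fact_retriever", "entity_sentences")
                if m in scored}
        return sample_by_score(pool), 0  # reset the counter

    # Rule 5: return a generic-model response if its score is high enough.
    generic = {m: scored[m] for m in ("hred_reddit", "hred_twitter", "alice")
               if m in scored}
    best = max(generic.items(), key=lambda kv: kv[1][1], default=None)
    if best and best[1][1] >= 0.75:
        return best[1][0], bored_count

    # Rules 6-7: question vs. non-question user message.
    wh_words = {"who", "what", "when", "where", "why", "which", "how"}
    is_question = user_msg.strip().endswith("?") and any(w in wh_words for w in words)
    names = ("hred_reddit", "hred_twitter", "alice") + \
        (("drqa",) if is_question else ("nqg", "entity_sentences"))
    pool = {m: scored[m] for m in names if m in scored and scored[m][1] > 0.25}
    if pool:
        return sample_by_score(pool), bored_count

    # Rule 8: the 'exit door' -- always have something to say.
    return scored["fact_retriever"][0], bored_count
```

In this sketch a single call implements one turn of the cascade; the caller is responsible for threading the bored counter through the conversation.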

Ultimately, the goal is to remove most of the above rules and let the scoring machine decide entirely which response to return to the user. All these rules can then be replaced by either a hard decision (always returning the highest-scoring response) or a soft decision (sampling a response according to its score rather than always taking the best one). This can make the system slightly more flexible and less reliant on human expertise about how conversations are supposed to go. To that end, an additional dataset of conversations is collected, as described in Section 6, and other scoring mechanisms are considered in Section 7.

6 Data Extension

This section presents the additional data independently collected after the ConvAI challenge in order to improve the scoring and selection process.

6.1 Collection Procedure

In order to expand the dataset collected during the ConvAI challenge, Facebook’s ParlAI framework was used to ask workers from Amazon Mechanical Turk to chat with the RLLChatbot. To follow the challenge structure as closely as possible, every conversation starts with a random paragraph from a random SQuAD article (Rajpurkar et al., 2016), presents the same greeting message used at the time of the competition, and opens by asking a question about the article’s paragraph generated by running both the NQG model (Section 4.1.2) and the Entity Sentences model (Section 4.3.1).

The difference from the competition is that, rather than up- or down-voting the response that the rule-based selection criteria would have chosen, the user decides which candidate response he or she prefers. After selecting one response, the user writes a reply to the chatbot message he or she just selected. Every following interaction follows the same pattern: the participant is presented a list of candidate responses (all nine models described in Section 4 are used), picks one, and writes his or her reply. A visual description of the user interface used during the data collection can be seen in Figure 8.

Figure 8: User interface of Amazon Mechanical Turk during the data collection phase. A random paragraph from a random SQuAD article and a greeting message are sent. The user then selects the bot response ‘Do you know anyone from Estonia?’ and replies. The user is again presented with a choice of potential responses and has to pick one. Detailed instructions are available on the left of the screen for the user to refer to at any time during the conversation.

After a minimum of 5 interactions, the user can decide to finish the conversation, in which case he is asked to score the entire conversation from 1 (‘very bad’) to 5 (‘very good’), with 2 (‘bad’), 3 (‘medium’), and 4 (‘good’) in between. The participant is specifically asked to ignore bot responses that were not selected during the chat when giving this final score. That way, it represents a fair evaluation of the actual conversation that the user just had.

It is important to note that the nine different models that generate candidate responses (described in Section 4) were built and trained only once before the competition and remained constant thereafter. This keeps the generative component of the system fixed during the entire data collection process and provides a stable environment in which different scoring and selection mechanisms can be compared. In addition, if a model has many possible candidate responses at a given time step in the conversation (such as the NQG model), a never-before-presented candidate response is picked at random.

Letting the user decide which response the system returns avoids boring the participant with exploratory selection behaviors from the chatbot, and most importantly, is more data efficient. Indeed, for each interaction, both selected and non-selected responses are saved. On the other hand, during the ConvAI challenge, only 1 response per interaction was saved with its corresponding vote. Here, the selected response is considered as up-voted, and all the non-selected responses as down-voted.
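Under this labeling scheme, each interaction expands into one transition tuple per candidate. A minimal sketch follows; the field names mirror the {article, context, candidate, score} format described in Section 6.2, while the function name itself is hypothetical:

```python
def interaction_to_tuples(article, context, candidates, selected_idx):
    """Expand one interaction into transition tuples of the form
    {article, context, candidate, score}: the candidate chosen by the
    user is labeled 1 (up-voted), all others 0 (down-voted)."""
    return [
        {"article": article, "context": list(context),
         "candidate": cand, "score": int(i == selected_idx)}
        for i, cand in enumerate(candidates)
    ]
```

A single interaction with eight candidates thus yields eight labeled examples (one positive, seven negative), which is what makes this collection procedure more data efficient than the one-vote-per-interaction setting of the ConvAI challenge.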

6.2 Data Analysis

Figure 9: Statistics on the entire additional dataset collected with Amazon Mechanical Turk. One interaction consists of one bot response followed by a human response. A transition tuple is of the form {article, context, candidate, score}, with the score being 1 (selected) or 0 (ignored). (a) Table of different data quantities. (b) Proportion of the number of messages per context. (c) Proportion of the number of available candidate responses per interaction.

The statistics of the data collected can be seen in Figure 9. Figure 9(b) shows that there is an equal number of data instances with context lengths of 1, 3, 5, 7, 9, and 11. That is to be expected since participants are asked to perform a minimum of 5 interactions before ending the conversation. It can also be seen that a few users continued their conversation further, which is a sign of appreciation for the RLLChatbot. The average number of interactions per chat is ~7, as reported in Figure 9(a). Moreover, every context contains an odd number of messages because the first greeting message from the chatbot is the first context message; the user then picks a candidate message and replies to it, thus adding two messages to the context. Thereafter, every interaction is made of one bot message followed by one user reply.

Figure 9(c) shows that most of the time (76.32% of the data instances) the number of candidate responses available to the user is 8. That is to be expected since, of the nine generative models, the Simple Answers model (Section 4.3.2) provides a response only when the previous user message matches a small set of regular expressions. All the other models are expected to return a response at all times. Note that in 14.64% of instances, only two candidate responses are shown to the user, which is also to be expected since at the beginning of every conversation only the NQG and the Entity Sentences models are triggered, as previously explained. However, occasionally the number of candidate responses available to the user is only 7, 6, or even 5 (8.48% of the data instances in total). This is due to some models not providing responses.

Figure 10: Statistics on the different candidate models usage. (a) Proportion of the number of times each model (except the Simple Answers model) is not available when the number of candidate responses is 5, 6, and 7. (b) Proportion of the number of times each model is in the list of available candidate responses. (c) Proportion of the number of times each model is selected when it is in the list of available candidate responses.

To better understand why this happens, Figure 10(a) presents statistics about which model is not available to the user when there are only 5, 6, or 7 candidate responses (ignoring the Simple Answers model, which is only available in rare cases). The Topic Classifier (Section 4.2.2) and the ALICE model (Section 4.3.3) are responsible for most of the cases in which the user is missing some candidate responses (~88.62% of the time).

Some investigation revealed that the fastText topic classifier (Joulin et al., 2016) runs in a separate thread from the actual Topic Classifier module. Thus, if the fastText thread is delayed by other modules, it may not yet have a label for the article’s paragraph when the Topic Classifier module is ready to propose a candidate response. In such cases, the Topic Classifier model does not wait for the fastText classifier to provide a topic and returns an empty message instead.

The ALICE model is the second most frequent model to fail to return a response. After investigating the original source code downloaded for this model, we noticed that when ALICE initially generates a response containing quotation marks, it assumes that the response contains profanity and returns an empty message instead. Some failure examples are shown in Table 9. Other models that occasionally fail to return an answer are the Entity Sentences (Section 4.3.1) and NQG (Section 4.1.2) models. The most common reason for failure in those cases is an article paragraph that is only one sentence long: short paragraphs do not always contain enough information for these models to produce a response.

Furthermore, Figure 10(b) shows some statistics about which models are available to the user over all interactions. As expected, the Entity Sentences and NQG models are more often present since they are the only ones presented at the first turn. As mentioned above, the Topic Classifier model and the ALICE model are slightly less often presented than the other candidate models. In addition, the Simple Answers model is almost never presented to the user since the set of regular expressions used to trigger it is short and not flexible.

  • User: Bertrand Russell once said that “it must have required many ages to discover that a brace of pheasants and a couple of days were both instances of the number two”.
  • User: I think it would be nice, but I don’t know a whole lot about it
    ALICE: Why do you want nice?. “it” being nice?
  • User: We don’t follow prodigies anymore
    ALICE: By “we” do you mean you and me?
  • User: I have no idea
    ALICE: “?” No idea about that?
  • User: Yes in a way. But I think it’s more an historical article.
    ALICE: Great.. In what way? What does “it” refer to?
    ALICE: Next question?. In what way? It depends on what the meaning of the word “it” is.
    ALICE: Aha.. In what way? “It” being what?
Table 9: Some user messages that fail the ALICE model: since the possible responses contain quotation marks, ALICE ignored those valid responses and returned an empty message instead.

Finally, it is important to understand which model is most often chosen by the human users talking to the RLLChatbot. To that end, Figure 10(c) presents the proportion of the number of times each model is selected when it actually appears in the list of suggested candidate responses. The model with the highest selection rate is the Simple Answers model, at 37%. This is reassuring in a sense: even though this model is rarely presented to the user, when it is, its manually defined responses are preferred. The next most chosen model is the Entity Sentences model, with a selection rate of 30.81% when available. The next two models have comparable selection rates: DrQA at 21.87% and ALICE at 21.61%. The three rule-based systems of the ensemble are thus in this top four. It is clear that, in general, the systems preferred by human users are rule-based. The remaining models, in decreasing order, are NQG with a 13.69% selection rate, Fact Retriever with 10.19%, and HRED reddit, HRED twitter, and Topic Classifier with 5.21%, 4.77%, and 4.71% selection rates respectively. This analysis shows that current fully generative sequence-to-sequence models have not yet reached the user preference level of more restrictive, rule-based, or even retrieval-based models.

7 Experimentation and Evaluation

After describing and analyzing in detail the dataset collected from Amazon Mechanical Turk, this section presents various experiments on this data and reports results on a held-out test set.

7.1 Experiments

The reported experiments aim at building a mechanism that can automatically select which response to return to the user from a set of previously generated candidate responses. In Section 5, two different neural network architectures are presented: one using hand-crafted features (Figure 5), the other using Gated Recurrent Unit networks to automatically extract features (Figure 6). Two different training algorithms are also described: one using the cross-entropy classification loss (Equation 1), the other using the Huber loss with fitted Q-iteration (Algorithm 1). Finally, three selection criteria are considered: one rule-based process (Figure 7), and two heuristics based on either taking the response with the maximum score or sampling one according to its score. These ideas are combined in the following set of experiments:

  • SmallR: The feed-forward network with hand-crafted features is trained to predict the immediate reward of a given candidate response: either 0 or 1. Equation 1 is used.

  • DeepR: The Gated Recurrent Unit network (GRU) is trained to predict the immediate reward of a given candidate response: either 0 or 1. Equation 1 is used, but with the GRU architecture (Figure 6).

  • SmallQ: The feed-forward network with hand-crafted features is trained to predict the Q-value of a given candidate response. Algorithm 1 is used, but with the architecture described in Figure 5.

  • DeepQ: The Gated Recurrent Unit network is trained to predict the Q-value of a given candidate response. Algorithm 1 is used.
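The two training objectives referenced above can be sketched as per-example losses (Equation 1 and Algorithm 1 themselves are given in Section 5 and are not reproduced here). This is a generic illustration: the discount factor and the Huber threshold are illustrative values, not the paper's actual hyperparameters.

```python
import math

def cross_entropy(p, y):
    """Classification loss (SmallR/DeepR): p is the predicted probability
    that the candidate is selected, y the 0/1 label."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def huber(q_pred, q_target, delta=1.0):
    """Regression loss on Q-values (SmallQ/DeepQ): quadratic near zero,
    linear in the tails, which limits the effect of outlier targets."""
    err = abs(q_pred - q_target)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)

def fitted_q_target(reward, next_qs, gamma=0.99, terminal=False):
    """One fitted Q-iteration target: r + gamma * max_a' Q(s', a'),
    with the bootstrap term dropped on the last turn of a conversation."""
    return reward if terminal else reward + gamma * max(next_qs)
```

The classification experiments regress toward the immediate 0/1 label, whereas the Q-value experiments regress toward a bootstrapped target that also accounts for future turns.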

All these experiments yield a scoring model that is able to score candidate responses. On top of these, three different selection mechanisms are explored:

  • Rule-Based: The hand-crafted rules as described in Section 5.2 and in Figure 7 decide which response is selected.

  • Sampled: A random candidate response is sampled without replacement (k responses are sampled when evaluating the model with Recall@k) according to the distribution given by the scores of all candidate responses.

  • Argmax: The candidate response with the highest score is selected (without replacement).
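The Argmax and Sampled mechanisms amount to two ways of ordering the candidate set for Recall@k evaluation. A minimal sketch follows; the small epsilon guarding against all-zero scores is our addition, not a detail from the paper:

```python
import random

def rank_argmax(scores):
    """Argmax selection: candidates ordered by decreasing score, so the
    top-k prefix is the Recall@k candidate set."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def rank_sampled(scores, rng=random):
    """Sampled selection: draw candidates without replacement, each draw
    proportional to the scores of the remaining candidates."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [scores[i] + 1e-8 for i in remaining]  # avoid all-zero weights
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        order.append(pick)
    return order
```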

The data described in Section 6 is split into an 80% training, 10% validation, and 10% testing set. Neither the validation nor the testing set shares any article with the training set. This gives 1,330 unique articles in the training set, 165 in the validation set, and 168 in the testing set. Further details can be found in Table 10. For instance, of the 70,761 total examples, only 10,292 have a positive reward (+1), while 60,469 have a negative reward (0). This is due to the fact that, for each interaction, both the uniquely selected candidate response (labeled as a positive example) and all other non-selected candidate responses (labeled as negative examples) are collected. Therefore, a second version of the training set is constructed by over-sampling positive examples, as one can see in the third column of Table 10. The over-sampled training set is used in all experiments regarding the classification of candidate responses (SmallR and DeepR experiments). Both the over-sampled and the regular training sets are used in the experiments estimating Q-values (SmallQ and DeepQ experiments). For all experiments, a random search over 100 parameter combinations is done. Details about the explored parameters can be found in Appendix B.1.
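The article-disjoint split and the positive over-sampling can be sketched as follows, assuming transition tuples with `article` and `score` fields as described in Section 6; the function names and the seeding are illustrative:

```python
import random

def split_by_article(examples, train=0.8, valid=0.1, seed=0):
    """Split transition tuples so that no article appears in more than
    one of the train/valid/test sets."""
    articles = sorted({ex["article"] for ex in examples})
    random.Random(seed).shuffle(articles)
    n = len(articles)
    cut1, cut2 = int(train * n), int((train + valid) * n)
    buckets = {a: ("train" if i < cut1 else "valid" if i < cut2 else "test")
               for i, a in enumerate(articles)}
    out = {"train": [], "valid": [], "test": []}
    for ex in examples:
        out[buckets[ex["article"]]].append(ex)
    return out

def oversample_positives(examples, seed=0):
    """Duplicate positive (score = 1) examples until both classes balance."""
    pos = [ex for ex in examples if ex["score"] == 1]
    neg = [ex for ex in examples if ex["score"] == 0]
    rng = random.Random(seed)
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return examples + extra
```

Splitting at the article level (rather than the example level) is what guarantees that the reported test scores measure generalization to unseen articles.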

| | Total | Training set | Over-sampled training set | Validation set | Testing set |
| unique articles | 1,663 | 1,330 | 1,330 | 165 | 168 |
| all examples | 70,761 | 56,564 | 96,659 | 7,114 | 7,083 |
| positive examples | 10,292 | 8,233 | 48,328 | 1,031 | 1,028 |
| negative examples | 60,469 | 48,331 | 48,331 | 6,083 | 6,055 |
Table 10: Statistics on the full dataset, the regular training set, the over-sampled training set, the validation set, and the testing set.

The best parameter combination for the reward classification scorers (SmallR and DeepR experiments) is searched (further details about parameter exploration can be found in Appendix B.1) by evaluating the F1 score on the validation set with early stopping and a patience of 20 epochs. Training is stopped based on the validation F1 score rather than the validation accuracy because of the class imbalance in the validation set, visible in Table 10. After running 100 SmallR experiments and another 100 DeepR experiments, the best SmallR and DeepR models were selected by validation F1 score. The parameters yielding these results can be found in Appendix B.2.

The best parameter combination for the Q-value estimation scorers (SmallQ and DeepQ experiments) is searched (further details about parameter exploration can be found in Appendix B.1) by evaluating the Huber loss on the validation set with early stopping and a patience of 20 epochs. After running 100 SmallQ experiments and another 100 DeepQ experiments, the best SmallQ and DeepQ models were selected by minimal validation loss. The parameters yielding these results can be found in Appendix B.3.
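The early-stopping procedure used in all four experiments can be sketched generically. `train_epoch` and `validate` stand in for the actual training and evaluation routines, which are not shown here; `validate` returns a metric where higher is better (the validation F1 for SmallR/DeepR, or the negated Huber loss for SmallQ/DeepQ):

```python
def train_with_early_stopping(train_epoch, validate, patience=20, max_epochs=500):
    """Keep the parameters with the best validation metric and stop
    after `patience` epochs without improvement."""
    best_metric, best_params, stale = float("-inf"), None, 0
    for epoch in range(max_epochs):
        params = train_epoch()          # one pass over the training set
        metric = validate(params)       # higher is better
        if metric > best_metric:
            best_metric, best_params, stale = metric, params, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_metric
```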

7.2 Evaluation

To automatically evaluate how the above models perform, the conversational dataset collected in Section 6 is used to measure how well each model can predict which response was chosen by the human. The held-out test set is used for evaluation. Similarly to the Next Utterance Classification task (Lowe et al., 2016), we measure Recall@k (R@k): the success rate of finding the correct response among the top k responses ranked by the scoring model. All the above experiments (SmallR, DeepR, SmallQ, DeepQ), as well as the initial baseline model, are evaluated with three different selection mechanisms: Rule-Based, Argmax, and Sampled. Results can be seen and compared in Table 11. The following sections discuss these results in detail.
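Recall@k and the average R@k reported below can be computed as follows, given each interaction's ranked candidate list and the index of the human-chosen response (the function names are ours):

```python
def recall_at_k(ranked_lists, chosen, k):
    """Fraction of interactions where the human-chosen response index
    appears among the model's top-k ranked candidates."""
    hits = sum(1 for order, c in zip(ranked_lists, chosen) if c in order[:k])
    return hits / len(ranked_lists)

def average_recall(ranked_lists, chosen, max_k):
    """Average of R@k over k = 1..max_k."""
    return sum(recall_at_k(ranked_lists, chosen, k)
               for k in range(1, max_k + 1)) / max_k
```

Since interactions have between 2 and 9 candidates, R@k saturates at 1.0 once k reaches the candidate count, which inflates the average R@k columns relative to R@1.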

| Model | R@1 Rule-Based | R@1 Argmax | R@1 Sampled | Avg. R@k Rule-Based | Avg. R@k Argmax | Avg. R@k Sampled |
| Baseline | 28.89% | 20.33% | 19.94% | 73.53% | 73.62% | 70.40% |
| SmallR | 36.87% | 39.20% | 22.76% | 78.60% | 82.44% | 71.03% |
| DeepR | 37.16% | 37.26% | 20.91% | 77.88% | 80.78% | 69.78% |
| SmallQ | 24.32% | 16.73% | 19.16% | 70.50% | 63.64% | 66.06% |
| DeepQ | 24.61% | 16.63% | 21.01% | 70.54% | 64.07% | 66.86% |
Table 11: Recall@1 and average Recall@k for all experiments and all selection mechanisms.

7.2.1 Baseline

Figure 11: Recall measurements for different selection mechanisms with the baseline model used during the ConvAI challenge. (a) Recall@k for all possible values of k. (b) Recall@1 for different context lengths. The number of candidate responses varies between 2 and 9, as described in Figure 9(c).

The baseline is defined to be the scoring network described in Section 5.1.1, trained for the ConvAI challenge on the competition dataset described in Section 3.2. Measurements of the Recall@k metric can be seen in Figure 11(a). The hand-crafted rules described in Section 5.2 considerably boost the Recall@1 score for this model, which jumps from 20.33% with Argmax selection to 28.89% with Rule-Based selection. This is a good sign for the hand-crafted rules, as it shows that the external knowledge used when designing them is useful for selecting appropriate responses. It is also interesting to see that starting at R@3, Argmax selection becomes better than Rule-Based selection.

Figure 11(b) focuses on the R@1 score at different context lengths, that is, at different time steps in a conversation. As expected, the Rule-Based selection mechanism yields better scores than the other two at all points in a conversation. Recall that for contexts of length 1 there are only two candidate responses, so the random selection process of the Rule-Based system (see Section 5.2) is expected to give a score around 50%. For contexts of length greater than 1 (around 8 candidate responses), the recall score increases as the conversation goes on, up until 13 messages in the context. This is especially true for the Rule-Based selection mechanism, which reaches its highest R@1 score in conversations with 13 context messages. This may be because, in longer conversations, the system has more information about the nature of the conversation and is thus better suited to select the most appropriate response. Longer contexts (more than 13 messages) are not frequent enough in the dataset (as one can see from Figure 9(b)) for a meaningful interpretation of the R@1 score.

7.2.2 Classifiers

Figure 12: Recall measurements for different selection mechanisms with the best SmallR and DeepR models. (a) Recall@k for all possible values of k with the best SmallR model. (b) Recall@1 for different context lengths with the best SmallR model. (c) Recall@k for all possible values of k with the best DeepR model. (d) Recall@1 for different context lengths with the best DeepR model. The number of candidate responses varies between 2 and 9, as described in Figure 9(c).

Figure 12 presents different recall measurements for the best SmallR and DeepR models, selected by maximal validation F1 score. The first thing to notice is the improvement over the baseline model presented earlier: the R@1 score goes from 28.89% to 39.20% for the best SmallR model and to 37.26% for the best DeepR model. In addition, Figures 12(a) and 12(c) show that the Argmax selection mechanism is as good if not better than the custom Rule-Based selection mechanism for all values of k in the R@k scores. This means that the system is now much more flexible, as it does not rely on human rules to select which message to return to the user. Figures 12(a) and 12(c) also show that the best SmallR and DeepR models are similar in terms of Recall@k, which grows at the same rate as k increases. However, the SmallR model is slightly better, especially with the Argmax selection mechanism, which has an average recall score (computed by averaging R@k over all values of k) of 82.44%, against 80.78% for the best DeepR model with Argmax selection, as Table 11 reports. This shows that the deeper architecture involving GRU networks captures meaningful information about the state of the conversation, but the hand-crafted features are still slightly better in these experiments. Furthermore, since the deeper architecture was designed specifically for predicting Q-values by decomposing state and action values, this added complexity may not be optimal for the current classification task.

Finally, Figures 12(b) and 12(d) report the R@1 score of both models with different selection mechanisms at different time steps in the conversation. One can see that the main advantage of Argmax selection over Rule-Based selection reported in Figures 12(a) and 12(c) actually occurs at the beginning of the conversation, when the context length is 1. When the discussion contains more messages, the Rule-Based selection mechanism is sometimes better, sometimes worse than Argmax selection. In general though, as with the baseline model, the R@1 score tends to increase with context length, up until 13 messages. Longer conversations are not frequent enough in the dataset (as one can see from Figure 9(b)) for a meaningful R@1 score interpretation.

Figure 13: Recall measurements for different selection mechanisms with the best SmallQ and DeepQ models. (a) Recall@k for all possible values of k with the best SmallQ model. (b) Recall@1 for different context lengths with the best SmallQ model. (c) Recall@k for all possible values of k with the best DeepQ model. (d) Recall@1 for different context lengths with the best DeepQ model. The number of candidate responses varies between 2 and 9, as described in Figure 9(c).

7.2.3 Q-value Predictors

Finally, Figure 13 presents different recall measurements for the best SmallQ and DeepQ models, selected by minimal validation loss. The first thing one can notice is that these models are actually worse than the baseline classifier: the best DeepQ model attains a R@1 score of 24.61%, against 28.89% for the baseline. Since the scoring mechanism is poor at evaluating candidate responses, Figures 13(a) and 13(c) show that the Rule-Based selection process is now stronger than the other two for all values of k under 7. Overall, the average Recall score of the SmallQ model is slightly lower than that of its counterpart DeepQ, especially with the Argmax selection mechanism, which has an average Recall score (computed by averaging R@k over all values of k) of 63.64%, against 64.07% for the best DeepQ model with Argmax selection, as Table 11 reports. This shows that the deeper architecture involving GRU networks is preferable for estimating Q-values. This is to be expected, as the architecture was inspired by Dueling Deep Q-Networks (Wang et al., 2015).

Another interesting result observed in Figure 13 is that, unlike in the classifier models, the Sampled selection process seems to perform better than Argmax selection in both the SmallQ and DeepQ models. This shows that being greedy with respect to the predicted Q-value may not always be the best strategy, and that allowing some stochasticity can be beneficial in those cases. It is another sign that the predicted Q-values are not informative enough to make a greedy decision. Finally, Figures 13(b) and 13(d) indicate that, just like in the previous experiments, longer contexts allow a better R@1 score with the Rule-Based selection mechanism (up until 13 messages).

Overall, the experiments conducted and described above show that the additional data manually collected is indeed informative about how humans pick their responses. This novel dataset can be used to train message scoring and message selection models, thus minimizing the need for human expertise by automatically extracting text features. In addition, those experiments show that the choice of scoring algorithm is critical, as the SmallQ and DeepQ experiments yield poor results. This may be because the space of possible states and actions in a conversational environment is enormous, if not infinite, and the collected data cannot possibly cover all of it.

7.2.4 Qualitative Evaluation

Conversation 1 (article: “<…>. Dou Wu and the Grand Tutor Chen Fan attempted a coup d’etat against the eunuchs Hou Lan, Cao Jie, and Wang Fu. <…> General Zhang Huan favored the eunuchs. <…>”):
  bot: Do you know Wu in person?
  user: No, I do not.
  bot: Did you know that a Boeing 747s wingspan is longer than the Wright brother’s first flight.?
  user: I didn’t know that.
  bot: Do you know what Chen Fan did in his life?
  user: He was arrested for an attempted coup d’etat.
  bot: I don’t have a lot of other questions about this article. Maybe you can ask me one?
  user: Who favoured the eunuchs?
  bot: General Zhang Huan

Conversation 2 (article: “<…>. Most of these companies <…> temporarily disable the 10NES chip in the NES. <…>”):
  bot: Do you think you would like to work for NES?
  user: Yes I would.
  bot: The article is related to Computer & Internet.
  user: That is interesting.
  bot: I’m a fan of the game.
  user: What game?
  bot: We are playing Turing’s imitation game.
  user: How does it go?
  bot: How should I know how it go does.

Conversation 3 (article: “<…> Margaret Thatcher travelled to Beijing to negotiate with the Chinese government <…>”):
  bot: Are you a Chinese ?
  user: I am not chinese I’m american
  bot: i like the way you think . i like it . Are you kidding?
  user: That’s just my brain
  bot: Do you know that In ancient Greece idiot meant a private citizen or layman.?
  user: Ya those silly philosophers
  bot: I’m a Romanian and I’m not even a king .
  user: Well there aren’t very many kings nowadays
  bot: I’m a king .

Conversation 4 (article: “<…>. The Los Angeles Avengers announced that they were folding the franchise. <…>”):
  bot: Have you ever used any of the Los Angeles Avengers’s products or services?
  user: No i have not
  bot: Do you think you will?
  user: Is it any good?
  bot: This article is about Sports
  user: Which one?
  bot: The article is related to Sports
  bot: I think it’s about Sports

Table 12: Examples of four bot-to-human conversations with the final RLLChatbot, using the best SmallR scorer and the Argmax selection mechanism.

The best scoring and selection mechanisms are combined according to all the previously described experiments. The final version of the RLLChatbot uses the SmallR scorer and Argmax selection, with an average R@k score of 82.44%. This version of the system was used to collect a few conversations with human users; we present four such conversations in Table 12.

The top left conversation follows a question-and-answer structure in which the RLLChatbot asks most of the questions through the Entity Sentences model. At the end of the conversation, the bot answers user questions using the DrQA model. The top right conversation is more of a social chat: the topic diverges a little from the article but stays coherent. These first two conversations are good examples of the coherent interactions we collected.

The lower left chat is an example in which the RLLChatbot changes topic with the Fact Retriever model. In addition, it contradicts itself in the last interaction with the “I’m a king.” reply. The lower right conversation is an example in which the system goes in circles and no progress is made in the chat. These two conversations are examples of incoherent interactions collected with the dialog agent.

One weakness of using a learned, score-based selection mechanism (such as Argmax over the scorer's outputs) is that we cannot explicitly check for contradictions or repetitions from the system, whereas a simple rule-based system can avoid those issues. On the other hand, using statistical models allows the system to be more flexible with new conversation topics.

8 Conclusion

8.1 Summary

Throughout this article we presented the RLLChatbot: a conversational agent capable of discussing random news paragraphs with a human user. Examining a real system in detail provides a lot of value to dialog researchers and practitioners. Using an ensemble of rule-based and statistical models, the system differentiates itself from previous conversational agents in many ways. Being non-goal-oriented, it has to be flexible enough to discuss a wide range of topics, which motivated the use of models with different levels of specificity.

Several models are used to generate up to nine distinct candidate responses at each interaction of a conversation. The final message returned to the user is selected according to a trained scoring mechanism. In contrast, typical conversational agents use at least 3 modules to produce a response: a natural language understanding machine, a dialog manager (often made of many sub-modules), and a natural language generator (Raux et al., 2005; Callejas and López-Cózar, 2005).

Another focus of this work is the presentation of a novel conversational dataset collected to train different message scoring mechanisms. Multiple bot responses are available at each interaction of a conversation, and the goal of the machine is to identify which response was chosen by the human. Four initial strategies are presented: two relying on supervised learning to perform a classification task, and two relying on reinforcement learning to perform a prediction task. Two types of architectures are also considered: one using hand-crafted features with a feed-forward neural network, the other using automatic feature extraction with Gated Recurrent Unit (GRU) networks. The difference between the deep GRU network and the feed-forward network with hand-crafted features is negligible. However, the training algorithm is a critical decision: the models trained to predict Q-values with reinforcement learning techniques are not as powerful as the baseline model according to the Recall@k metric. On the other hand, models trained to classify candidate responses in a supervised fashion significantly improve the Recall@1 score, going from 28.89% with the baseline model and Rule-Based selection to 39.20% with the more flexible Argmax selection mechanism.

8.2 Limitations

Because this work was partly motivated by an organized competition, time constraints did not allow us to pursue all the experiments we had planned.

The first set of limitations comes with the two HRED models described in Section 4.1.1. One extension could be to condition the decoder not only on the conversation history, but also on the news article's paragraph after it has been processed by a recurrent neural network. This applies especially to the Reddit HRED model, which can retrieve the online article that triggered the Reddit conversation. In addition, instead of vanilla HRED models, adding an attention mechanism (Bahdanau et al., 2014) could help the models generate less generic responses.

Another model that could have been improved is the Topic Classifier model (Section 4.2.2). As of now, a simple 10-class classifier and some predefined sentences are used to inform the user about the general topic. However, there is a lot of work in the area of text summarization that could be explored; for cases in which the news paragraph is quite long, a summarization model could be beneficial.

Regarding the different scoring techniques, one limitation is that the deep architecture involving Gated Recurrent Unit (GRU) networks was not pre-trained on large corpora of text. Training the recurrent networks to encode and decode conversational text (just like the HRED models) could be beneficial for the scoring models. Furthermore, different deep architectures may yield better results on the classification task presented in Section 5.1.1.

8.3 Future Work

We leave as future work the task of training an end-to-end version of the presented system. As previously mentioned, the nine generative systems producing candidate responses were trained once before the ConvAI competition and remained fixed thereafter. Thus, even with a Recall@1 score of 100%, the system would still be limited by the capabilities of its components.

Finally, organizing academic competitions like the Amazon Alexa Prize and the ConvAI challenge is a good way to evaluate conversational agents on various tasks. This work shows that making the data available can help drive dialog research. Future challenges should thus encourage participation so that a good amount of evaluated conversational data is collected, and release the data after the end of the competition. As described in Appendix A.6, our team encountered several engineering difficulties in deploying the system in a live environment. These difficulties often took more time than expected; such challenges can reduce the amount of innovation in a system and discourage researchers from participating. Future academic challenges should thus provide as much help as possible to deploy systems in an easy and secure fashion.


Acknowledgments

The authors gratefully acknowledge the main organizers of the Conversational Intelligence Challenge: Mikhail Burtsev and Valentin Malykh. We are also thankful to Jack Urbanek and Alexander Miller from the Facebook ParlAI team for constructive feedback and technical help. We further acknowledge financial support from the Facebook ParlAI Research Fund, NSERC, Samsung Advanced Institute of Technology (SAIT), the Pierre Arbour Foundation, the Fonds de recherche du Québec - Nature et Technologies (FRQNT), Calcul Québec, and Compute Canada. Finally, we thank all participants from McGill University who helped us evaluate our chatbot by chatting with it during our data collection phase.

Appendix A Technical Details of the proposed system

a.1 Challenge requirements

The requirements of the Conversational AI (ConvAI) competition were to submit a self-contained model in a Docker instance. The competition environment used the Telegram messaging platform to pair bots with human users. Since a ranking mechanism chooses the best response from an ensemble of models, all models need to be loaded into memory at inference time. Individual models require variable amounts of system memory, with the generative models requiring the most and the rule-based systems the least. Thus, a multiprocessing orchestrator communicating via inter-process communication (IPC) message queues is implemented (Figure 14).

a.2 Overall framework




Figure 14: Overall system framework, showing the orchestrator, the scoring module, and the message queue bus.

At first, the orchestrator receives a start-of-conversation signal from the environment (the ConvAI framework talking to Telegram), followed by a randomly assigned news article paragraph. Then, the orchestrator fires a wake-up initialization call containing the news paragraph to all its child processes, each of which hosts one model from the ensemble. The models themselves can then choose to initialize themselves with the paragraph text. For example, the Entity Sentences model (Section 4.3.1) runs the spaCy named entity recognizer to pre-select a set of entities; the Topic Classifier model (Section 4.2.2) runs the fastText classifier (Joulin et al., 2016) over the article text and saves the topic in its own dictionary. Subsequently, the orchestrator fires the question generator models (the Neural Question Generator (NQG) (Section 4.1.2) and Entity Sentences) to start proactively conversing about the given article. Thereafter, on each turn, the orchestrator shares the user response with all the child processes, which generate their individual responses and submit them to the scoring module. As soon as a candidate response is generated, the scoring module evaluates it and submits the response along with its score to the message queue bus. Eventually, the selection mechanism selects the best response based on its score.

a.3 Message response latency

One of the critical constraints of the setup is to reduce message response latency so that human judges do not have to wait more than a few seconds for a response to their query. In this setup, the wait time can potentially compound depending on individual response generation and scoring times. Therefore, the score of each response is calculated within the model's own child process. A hard wait time of 7 seconds is set for the orchestrator to listen to the message queue bus, and it rejects any responses that arrive too late for processing. When candidate responses arrive at the selection module, the scores are already computed, so the module only has to sample accordingly and return one response to the orchestrator. Furthermore, since multiple users can communicate with the system at the same time, an incoming/outgoing message queue is implemented in the orchestrator so that each model and the orchestrator can communicate asynchronously through IPC.
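The deadline logic can be sketched with a blocking queue and a monotonic clock; the 7-second default mirrors the system's hard wait time, while everything else below is illustrative:

```python
import queue
import time

def collect_candidates(bus, n_models, deadline_s=7.0):
    """Drain scored (name, response, score) tuples from the shared bus,
    stopping at the hard deadline and rejecting late arrivals."""
    candidates = []
    deadline = time.monotonic() + deadline_s
    while len(candidates) < n_models:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                       # deadline passed: stop waiting
        try:
            candidates.append(bus.get(timeout=remaining))
        except queue.Empty:
            break                       # timed out waiting for stragglers
    return candidates
```

Responses that land on the bus after the deadline are simply never read for that turn, which matches the orchestrator's behavior of rejecting late candidates.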

a.4 Choice of Inter-process communication system

ZMQ was first explored to run an individual incoming/outgoing thread within each model child process. ZMQ provides a fast IPC message bus for communication. However, due to system limitations, spawning two threads (input and output) per child process (i.e., per model) increases the complexity of the overall system and results in increased response latency. We therefore switched to Python's built-in shared message queue to reduce system complexity. However, production systems with unconstrained system requirements would benefit from using a dedicated message queue such as ZMQ or ActiveMQ.

a.5 Monitoring child processes

Since each model runs in its own child process, a fail-safe redundant process monitoring system is implemented to handle the possibility of a model crash. The orchestrator pings each model at every interaction, and if a model fails to respond within 60 seconds, the orchestrator revives the child process of that specific model. This helps ensure as many candidate responses as possible at each interaction.
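A single monitoring pass can be sketched as follows; the 60-second ping timeout is simplified to a liveness check, and the process-factory function is a hypothetical helper:

```python
import multiprocessing as mp
import time

def revive_dead(processes, make_process):
    """processes: dict mapping model name -> mp.Process.
    make_process(name) builds a fresh, unstarted process for that model.
    Restarts any child that is no longer alive and reports which ones."""
    revived = []
    for name, proc in list(processes.items()):
        if not proc.is_alive():
            fresh = make_process(name)
            fresh.start()
            processes[name] = fresh
            revived.append(name)
    return revived
```

In the real system this check runs at every interaction, so a crashed model misses at most one turn before its candidate responses reappear in the pool.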

a.6 Technical Difficulties

Part of the challenge was the engineering effort of exporting a research project into a real system with all the constraints that come with it, such as latency, concurrency, and memory. For instance, one instruction received at the end of the challenge was that the submitted systems must run under the following hardware constraints: 2 virtual Intel Xeon CPUs @ 2.40GHz, 16 GB of RAM, and 50 GB of disk space. This proved critical, as the RLLChatbot took about 50 GB of RAM to load all the models in memory. This bottleneck resulted in increased response latency due to operating-system memory swapping and context switching. Future improvements could reduce the model size by using mixed-precision arithmetic.

Appendix B Hyperparameter & Implementation details of experiments

b.1 Explored parameters

For all experiments described in Section 7.1, different combinations of the following parameters are explored by randomly sampling 100 values:


  • optimizer: ADAM (Kingma and Ba, 2014), SGD (Rumelhart et al., 1985, 1986), Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and RMSProp (Hinton et al., 2012).

  • learning rate: , , and .

  • activation function: Sigmoid, ReLU (Glorot et al., 2011), and pReLU (He et al., 2015).

  • weight initialization: He (He et al., 2015), and Glorot (Glorot and Bengio, 2010)

  • dropout rate: , , , and

These are the only flexible parameters in order to limit the number of degrees of freedom in the system. Moreover, these parameters are expected to have a direct influence on the training behavior of the system. The architecture of the networks is kept fixed because the size of the networks only influences the capacity of the models rather than their ability to learn.
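The random search over these parameters can be sketched as follows; the search space mirrors the parameter list above, but the numeric learning-rate and dropout values are illustrative placeholders (the actual grids appear in the list, with the numbers omitted in this rendering):

```python
import random

# Hypothetical search space mirroring the parameter list above.
SPACE = {
    "optimizer": ["adam", "sgd", "adagrad", "adadelta", "rmsprop"],
    "learning_rate": [1e-2, 1e-3, 1e-4],       # illustrative values
    "activation": ["sigmoid", "relu", "prelu"],
    "weight_init": ["he", "glorot"],
    "dropout": [0.0, 0.1, 0.25, 0.5],          # illustrative values
}

def sample_configs(n, seed=0):
    """Draw n random configurations by picking one value per parameter."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(n)]

configs = sample_configs(100)   # 100 random combinations, as in the paper
```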

b.2 Best {Small/Deep}_R parameters

After running 100 SmallR experiments and another 100 DeepR experiments with different random parameter combinations as described in Appendix B.1, the SmallR and DeepR models with highest F1 validation score were trained with:


  • a batch size of ,

  • the RMSProp and SGD optimizers, respectively,

  • a learning rate of and respectively,

  • pReLU activation functions with He weight initialization (He et al., 2015),

  • and a dropout rate of and respectively.

This combination of parameters gave a validation F1 score of for the best SmallR model, and a validation F1 score of for the best DeepR model.

b.3 Best {Small/Deep}_Q parameters

After running 100 SmallQ experiments and another 100 DeepQ experiments with different random parameter combinations as described in Appendix B.1, the SmallQ and DeepQ models with lowest validation loss were trained with:


  • the regular training set (i.e.: not the over-sampled one),

  • a batch size of ,

  • the SGD and ADAM optimizers respectively,

  • a learning rate of ,

  • a discount factor of ,

  • an update frequency of updates for the target DQN,

  • a hidden size of for the recurrent networks in DeepQ experiments,

  • sigmoid activation functions with Glorot weight initialization (Glorot and Bengio, 2010),

  • and a dropout rate of .

This combination of parameters gave a minimal validation loss of for the best SmallQ model, and a minimal validation loss of for the best DeepQ model.