Building an open-domain conversational AI for commercial use poses two main challenges. The first is broad coverage: modeling natural conversation across an unrestricted number of topics is still an open problem, as shown by the current concentration of dialogue research on restricted domains (e.g. bowden2017data). The second is the scarcity of clean, unbiased and comprehensive datasets of open-ended conversation, which makes it difficult to develop conversational dialogue systems and limits the viability of purely data-driven methods (e.g. neural networks).
Our approach aims to integrate rule-based with machine-learned behavior, with both grounded in data that is intelligible, reflective of what people want to talk about, and gathered automatically. This means acknowledging that people talk differently about different things at different times. The domain has to be potentially unrestricted and the model should be powerful enough to ensure that the most likely system response is in line with both the user’s most recent utterance and, ideally, the flow of the overall conversation.
We present Edina, a conversational AI agent that exploits a corpus of conversations harvested from Amazon Mechanical Turk (§2). Our innovation is to collect and use data in the form of self-dialogues, in which Workers engage in conversation with themselves on a specific topic, enabling us to gather first-hand what people discuss and how they discuss it. By running tasks periodically, we are able to identify the trending entities people talk about at a given point in time. Although our system architecture (§3) also includes a rule-based component (§3.2), our growing corpus of self-dialogues is designed to enable us to avoid having to hand-script potential conversational situations. We achieve this with an intuitive IDF (inverse document frequency)-based matching score (§3.3) that takes a user utterance and returns the most likely response, based on either conversational partner in a self-dialogue. The matching score is also able to self-censor when it is not confident that it has an intelligible reply, making it easy to integrate with other components. Although Edina currently converses on only three main topics (movies, music and sports), the approach is easily extensible to any topic. Our approach is also cost-effective: once an initial quantity of data is collected around a topic, only small updates are needed in order to stay current on trending topics. After describing our methods in detail, we perform qualitative and quantitative evaluations of our system (§4), and describe the significance of our approach and results in the discussion (§5).
2 Data collection
Our focus on data collection stems from the scarcity of publicly available corpora for training dialogue systems. During the first stage of the competition, we surveyed the corpora mentioned in DBLP:journals/corr/SerbanLCP15 and found that suitable publicly available corpora were typically too small, artificial or difficult to obtain. We instead chose to gather our own data with the primary criteria that the conversations i) represent casual human dialogue and ii) have identifiable topics. We turned to Amazon Mechanical Turk (AMT), where we initially developed a chat interface for two AMT Workers to chat about a specific topic. The interface was based on VisDial (DBLP:journals/corr/DasKGSYMPB16), an interface that prompts the Workers to converse about a particular image. However, in demanding that two workers be connected at the same time, this setup slowed down the collection task on an otherwise fast-moving platform. This led us to develop an innovative new task on AMT, self-dialogue, in which a single worker is asked to play both parts in a conversation about a given topic. A sample self-dialogue is given in Figure 1, alongside the instructions provided to the AMT Workers. A key requirement is that the conversations appear natural. For this reason, we avoided lengthy instruction, instead allowing Workers to interpret the task as they wish within the given topic.
Overall we observed the following data-related benefits of self-dialogues, as compared with the two-person conversations that are more standardly collected:
Collection speed and efficiency of set up: Self-dialogues do not require waiting for two workers to connect at the same time, speeding up the collection process and easing the development of the back-end.
Data quality: Both parties of a self-dialogue are equally expert in the same topics, making it much easier for self-dialogues to go into detail about particular topics. We also found that two-person dialogues often showed misunderstandings between speakers, and clarifications that play out over several turns, requiring undesirable complexity of the conversation agent.
Naturalness: It was striking how well people were able to have interesting and engaging conversations with themselves, adopting different perspectives and even different stances with respect to a topic, depending on which participant they were pretending to be at each point in the conversation. While one might have expected the conversations to simply resemble question-answer pairs, participants instead often contributed a pair of communicative actions, with the first part addressing the preceding conversation and the second advancing it further. Such contributions are particularly valuable as data because they have a built-in hook for further engagement.
Cost effectiveness: Because we only had to pay one Worker instead of two, we were able to halve the cost of collecting conversations.
Submissions were of surprisingly high quality. We implemented a pipeline to review submissions and identify Workers abusing the system or submitting duplicates. However, only eight of the 2,717 Workers were banned, and only 145 conversations were rejected. Initial experiments with requirements on Workers and the pay per conversation converged on the following set:
location: the United States or United Kingdom;
HIT approval rate: greater than 95%;
number of HITs approved: greater than 500;
number of conversations per worker per task: maximum 20;
pay per 10-turn conversation: US $0.70-0.80; per 5-turn conversation: US $0.35-0.40.
Workers produced 10-turn dialogues for the main topics, and 5-turn dialogues for sub-topics (except for "NFL Football"). Empirically, we observed that the 10-turn tasks afforded Workers the time to transition into a topic of interest, which was unnecessary for sub-topics.
To date, we have collected 24,283 self-dialogues through AMT, across four major topics separated into 23 tasks. These currently amount to 3,653,313 words, across 141,945 turns, from 2,717 Workers. On average, each worker has submitted 9 self-dialogues, so the corpus displays a wide variety of styles, content and ideas across a large population.
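As a quick sanity check on these figures, the averages can be recomputed directly from the totals quoted above. The variable names are ours and purely illustrative.

```python
# Totals quoted in the text above.
n_dialogues = 24_283
n_words = 3_653_313
n_turns = 141_945
n_workers = 2_717

# Derived averages (plain division, no rounding applied here).
dialogues_per_worker = n_dialogues / n_workers  # close to the reported 9
turns_per_dialogue = n_turns / n_dialogues      # between the 5- and 10-turn task lengths
words_per_turn = n_words / n_turns

print(f"{dialogues_per_worker:.2f} dialogues per Worker")
print(f"{turns_per_dialogue:.2f} turns per dialogue")
print(f"{words_per_turn:.2f} words per turn")
```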
The total paid to Workers so far is US $17,947.54. AMT’s high liquidity has enabled us to gather data fast when required. In a period of approximately 20 days (from 20/06 to 10/07) we were able to gather about 20,000 conversations, with a peak of 2,307 conversations collected in a single day. The following is an excerpt from a self-dialogue in the Movies category:
What is your absolute favorite movie?
I think Beauty and the Beast is my favorite.
The new one?
No, the cartoon. Something about it just feels magical.
It is my favorite Disney movie.
What’s your favorite movie in general?
I think my favorite is The Sound of Music.
Really? Other than cartoons and stuff I can never get into musicals.
I love musicals. I really liked Phantom of the Opera.
The data (or offline conversations) are stored by grouping them by their corresponding tasks, with each line in each conversation as a response. In Section 3.3, we explain how a context is associated with each response for use with the Matching Score component.
| Topic/subtopic | # Conversations | # Words | # Turns |
|---|---|---|---|
| Fast & Furious | 343 | 33,964 | 3,430 |
| Rap / Hip-Hop | 684 | 66,376 | 6,840 |
| Music and Movies | 216 | 37,303 | 4,320 |
3 System Architecture
Our system is a hybrid of several rule-based and data-driven components. Following an initial preprocessing step, our system exploits a priority queue of components that vary in their intended coverage and functionality. Its three main components comprise:
A rule-based component that uses a list of hand written rules and templates. It returns a response only if the user’s input is an exact match. While this component has the highest priority, due to obvious coverage limitations, it often returns nothing and defers to the matching score component.
A matching score component that selects responses from a pool of conversational data, based on how close the context of the user conversation is to the context of the response in our data. The matching score also returns a confidence score that is used to better control its interaction with the other components.
A generative neural network that always generates a response and is deployed if the other two components fail. It often gives general and vague on-topic responses compared with the more specific responses of the matching score.
Interleaved between these three main components are three minor components:
EVI, which serves as our primary IR engine but is used only to reply to user inputs that do not probe the system’s identity, personality or opinions. EVI takes priority in responding to WH-questions that require an exhaustive knowledge base. As conversations to date have rarely taken this route, EVI has rarely been called.
A likes and dislikes component whose purpose is to answer questions about Edina’s opinions on entities and definitions that the rule-based component fails to cover.
A proactive component which asks the user a question or series of questions, in order to steer the conversation back to what the matching score can handle. We limit its use in order to avoid probing the user too often, which would result in an unpleasant experience.
Although the system is guaranteed to always return something, we retain a set of interesting facts or clarification replies for the unlikely case that all components fail, for example because of major network errors or because the output from all other components is judged offensive. Output is always filtered for profanity. Table 2 shows how often each component’s output is chosen to be sent back to the Alexa device. Given the order in our priority queue, one can observe the challenges of creating a purely rule-based system that can guarantee full coverage: the rule-based component can reply to only a fraction of users’ utterances. Our agent overcomes this limitation by employing data-driven methods, i.e. the matching score (§3.3) and generative neural network (§3.6) components, which account for the majority of the system’s replies. Figure 2 presents an overview of the complete system, from the input to the Alexa device to the output generated by Edina (i.e. green arrow).
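The priority-ordered fallback described above can be sketched as a simple cascade. This is a minimal illustration under our own assumptions, not the deployed implementation: the function `respond`, the stub components and the `is_offensive` callback are hypothetical names, and the real queue also reorders components based on confidence scores.

```python
from typing import Callable, Optional, Sequence

# A component maps a user utterance to a reply, or None if it declines.
Component = Callable[[str], Optional[str]]


def respond(utterance: str,
            components: Sequence[Component],
            is_offensive: Callable[[str], bool],
            fallback: str) -> str:
    """Return the first acceptable reply, trying components in priority order."""
    for component in components:
        try:
            reply = component(utterance)
        except Exception:
            # A failing component (e.g. a network error) is simply skipped.
            reply = None
        if reply and not is_offensive(reply):
            return reply
    # Every component declined, failed, or was filtered out.
    return fallback


# Hypothetical stand-ins for the real components.
def rule_based(utterance: str) -> Optional[str]:
    return "I am an Alexa Prize socialbot." if utterance == "who are you" else None


def matching_score(utterance: str) -> Optional[str]:
    return None  # pretend the confidence was too low


def generative(utterance: str) -> Optional[str]:
    return "That sounds interesting."


PIPELINE = [rule_based, matching_score, generative]
print(respond("who are you", PIPELINE, lambda reply: False, "Here is a fun fact."))
```

Keeping each component behind the same small interface is what lets a low-coverage, high-precision component sit first in the queue without blocking the others.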
Edina’s conversations with Amazon customers as well as other information related to the conversation (e.g. topic, customer preferences, the outputs from each of our components etc.) are stored in a Postgres Amazon RDS instance. Throughout the document, we refer to dialogues with Amazon customers as online conversations in order to differentiate them from the self-dialogues collected via Amazon Mechanical Turk (offline conversations).
3.1 Preprocessing
We perform a series of preprocessing steps on the raw input received from Lambda. The input is first processed through spaCy’s pipeline (https://www.spacy.io), mainly for tokenization and Named Entity Recognition (NER). The remaining preprocessing is performed in parallel and comprises a second NER step, coreference resolution and simple user-modeling. The second NER step uses the DBpedia Spotlight API (daiber2013), which extracts more information on the named entities from Wikipedia. We perform coreference annotation (COREF) over the previous four turns (i.e. 2 turns each from the user and the bot), using the Stanford CoreNLP deterministic implementation (recasens2013). The generated coreference chain is used to modify the current input message by replacing pronouns with the entities they refer to. Finally, user-modeling is a simple rule-based implementation that catches explicit mentions of the user’s likes and dislikes, as well as when the user would like to change the topic. These preferences are matched to a list of topics and subtopics that our system can handle. The processed input and the additional information that results from the preprocessing phase are sent to each of the components that build a reply.
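The rule-based user-modeling step could look roughly like the sketch below. The regular expressions, the `KNOWN_TOPICS` set and the function `model_user` are our own illustrative assumptions; the actual rules used by the system are not given in the text.

```python
import re
from typing import Dict

# Hypothetical list of topics the system can handle.
KNOWN_TOPICS = {"movies", "music", "sports"}

# Hypothetical patterns for explicit preferences and topic changes.
PATTERNS = [
    ("dislike", re.compile(r"\bi (?:hate|dislike|don'?t like|do not like) (.+)", re.IGNORECASE)),
    ("like", re.compile(r"\bi (?:really )?(?:like|love|enjoy) (.+)", re.IGNORECASE)),
    ("topic_change", re.compile(r"\b(?:let'?s|can we) talk about (.+)", re.IGNORECASE)),
]


def model_user(utterance: str) -> Dict[str, str]:
    """Extract explicit likes, dislikes and topic changes that match a known topic."""
    preferences = {}
    for label, pattern in PATTERNS:
        match = pattern.search(utterance)
        if match is None:
            continue
        target = match.group(1).strip(" .!?").lower()
        # Keep only preferences that map onto a topic the system can handle.
        if target in KNOWN_TOPICS:
            preferences[label] = target
    return preferences


print(model_user("can we talk about music?"))
```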
3.2 Rule-based Component
The rule-based component deterministically matches a user’s input and returns a single output. Rules in the current ruleset address the following points:
the agent’s identity and preferences. Rules ensure that identity information (e.g. name and location) remains anonymous for the moment. Most preferences are personalized so as to make the conversational agent more human-like.
sensitive topics such as suicide, cancer or the death of a close person, which we carefully redirect to existing helplines when possible. Prompts containing words from a list of sensitive and/or potentially offensive terms are also handled with a polite yet firm response (e.g. ‘This kind of talk makes me uncomfortable, let’s talk about something else.’).
topic shifting, so as to recognize when the user wants to set a new topic or change the current one, or when the agent should shift away from controversial topics, such as politics, which we do not handle.
other forms of engagement. These enable the agent to make jokes, to play a small point-based game, where the user has to complete the lyrics of popular songs, or to invoke a weather API that can return information about the weather, given the user’s location.
Although the initial development of the rule-based component was based solely on intuition about what the rules should capture, we have continued to refine our rules and add to our ruleset, based on conversations between Alexa users and Edina. Finally, we also integrated the aggregated and anonymized list of frequent utterances that are common to all socialbots. The code has been implemented in RiveScript (https://www.rivescript.com).
3.3 Matching Score component
Given a user utterance, the matching score component is designed to return the most appropriate response from the bank of self-dialogues (§2). We treat this bank as a set of tuples, each containing a response and a context, where the context is either the immediately preceding response alone (a context of length one) or also includes the response at one remove (a context of length two).
In an online conversation, we additionally keep track of the response from Edina directly preceding the user utterance. This is illustrated below:
previous response from Edina: what’s your favorite movie
user utterance: sound of music
Offline conversation (from the bank of self-dialogues)
context (at one remove): What’s your favorite movie in general?
context (immediately preceding): I think my favorite is The Sound of Music.
response: Really? Other than cartoons and stuff I can never get into musicals.
(here the context consists of the two turns preceding the response)
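To make the tuple structure concrete, the sketch below turns stored self-dialogues into (context, response) entries with contexts of length one and two. The names `Entry` and `build_bank` and the dictionary layout are illustrative assumptions, not the authors' data format.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    context: Tuple[str, ...]  # preceding turns, oldest first
    response: str
    topic: str                # task (sub-dataset) the dialogue came from


def build_bank(dialogues_by_topic: Dict[str, List[List[str]]],
               max_context: int = 2) -> List[Entry]:
    """Pair every turn with the one or two turns that precede it."""
    bank = []
    for topic, dialogues in dialogues_by_topic.items():
        for turns in dialogues:
            for i in range(1, len(turns)):
                # Context lengths 1..max_context, limited by the turns available.
                for k in range(1, min(max_context, i) + 1):
                    bank.append(Entry(tuple(turns[i - k:i]), turns[i], topic))
    return bank


EXAMPLE = {
    "movies": [[
        "What's your favorite movie in general?",
        "I think my favorite is The Sound of Music.",
        "Really? Other than cartoons and stuff I can never get into musicals.",
    ]]
}
BANK = build_bank(EXAMPLE)
for entry in BANK:
    print(len(entry.context), "->", entry.response)
```

Because every turn of a self-dialogue can serve as a response, either "speaker" in the dialogue can supply Edina's reply.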
In order to retrieve the most likely response from the bank, we use a scoring function that measures the similarity between the user utterance and each response together with its context. This similarity measure is based on bag-of-words vectors which up-weight rare words using inverse document frequency (IDF); word frequencies for the IDF scores were taken from http://norvig.com/ngrams/count_1w.txt. Using bag-of-words vectors permits inverted indexing, where response IDs are stored in a look-up table of words. This removes the need to compare the user utterance directly with every entry in the bank.
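The inverted-index idea can be sketched as follows. This is a minimal illustration: the tokenizer, the log-ratio IDF formula and all function names are our assumptions, since the text does not specify how IDF values were derived from the word counts.

```python
import math
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def idf_from_counts(counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed IDF proxy from unigram counts: log(total count / word count)."""
    total = sum(counts.values())
    return {word: math.log(total / count) for word, count in counts.items()}


def build_index(contexts: List[str]) -> Dict[str, Set[int]]:
    """Map each word to the IDs of the entries whose context contains it."""
    index = defaultdict(set)
    for entry_id, context in enumerate(contexts):
        for word in set(tokenize(context)):
            index[word].add(entry_id)
    return index


def candidates(utterance: str, index: Dict[str, Set[int]]) -> Set[int]:
    """Only entries sharing at least one word with the utterance need scoring."""
    ids = set()
    for word in set(tokenize(utterance)):
        ids |= index.get(word, set())
    return ids


CONTEXTS = [
    "What's your favorite movie in general?",
    "I think my favorite is The Sound of Music.",
    "Do you follow football?",
]
INDEX = build_index(CONTEXTS)
print(candidates("sound of music", INDEX))
```

Entries that share no word with the utterance would receive a zero inner product anyway, so skipping them does not change the ranking.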
More formally, we define our final matching score as an interpolation of three different subscores plus a normalization term, with two constant interpolation weights.
The first subscore is the most important term and measures the similarity between the user utterance and the context. Here the context is limited to the one preceding response, so it has length one. Instead of normal IDF scores, we take their cube to give rarer words a higher relative weight.
Using this inner product term alone, however, would result in (1) a matching score with a strong preference for responses whose context vector has a large norm, and (2) higher scores for all matches when the norm of the user utterance vector is large. We therefore introduce a normalization term; normalizing fully by both norms would yield a cosine-similarity-based score.
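A hedged sketch of an IDF-cubed inner product with partial normalization is given below. The exact formula and constants are not recoverable from the text, so everything here is an assumption: vector entries are taken to be the cubed IDF of each word present, the exponent `alpha` interpolates between the raw inner product (`alpha = 0`) and cosine similarity (`alpha = 1`), and `default_idf` handles unseen words.

```python
import math
from typing import Dict, Iterable


def vector(tokens: Iterable[str], idf: Dict[str, float],
           power: float = 3.0, default_idf: float = 10.0) -> Dict[str, float]:
    """Binary bag of words, weighted by IDF raised to `power` (assumed form)."""
    return {word: idf.get(word, default_idf) ** power for word in set(tokens)}


def inner(a: Dict[str, float], b: Dict[str, float]) -> float:
    return sum(a[word] * b[word] for word in a.keys() & b.keys())


def norm(a: Dict[str, float]) -> float:
    return math.sqrt(sum(value * value for value in a.values()))


def normalized_score(utterance_tokens: Iterable[str],
                     context_tokens: Iterable[str],
                     idf: Dict[str, float],
                     alpha: float = 0.5) -> float:
    """Inner product divided by (|u| * |c|) ** alpha; alpha is hypothetical."""
    u = vector(utterance_tokens, idf)
    c = vector(context_tokens, idf)
    denominator = (norm(u) * norm(c)) ** alpha
    return inner(u, c) / denominator if denominator else 0.0


# Toy IDF values, invented for illustration only.
TOY_IDF = {"sound": 2.0, "of": 0.5, "music": 2.0, "my": 0.6, "favorite": 1.5}
query = ["sound", "of", "music"]
print(normalized_score(query, ["my", "favorite", "sound", "of", "music"], TOY_IDF))
```

With `alpha` strictly between 0 and 1, long contexts are penalized without completely removing the reward for sharing many rare words.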
One shortcoming of the first subscore is that we might not want to perform an exact match on a very generic context such as simply "yes". In this case, it might be better for the matching score to defer to another component rather than find a match for "yes" in the bank. It might also be desirable to consider the response when up-weighting overlapping words, because responses that contain words from the query are often more engaging and sometimes more relevant.
The second subscore addresses both issues by i) computing the inner product of the fourth power of the IDF scores, so as to create even more contrast between common and rare words, and ii) calculating the similarity between the query and the response.
It can also be useful to consider a wider context, including more than one previous utterance, when choosing a response.
The third subscore does that by considering as context the user utterance as well as the agent response preceding it. For this subscore, normal IDF scores are used.
In the case of ties or near ties, we try to select a response that is longer and contains more rare words.
Finally, the matching score keeps track of the most recent topic or subtopic. When a response is pulled from a particular sub-dataset from our AMT data, the sub-dataset topic is assumed to be the current topic. All responses from the sub-dataset covering the current topic are given a small bonus to their score to improve topic coherence. The matching score returns the highest scoring response as well as the score of that response, which we refer to as the confidence score.
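The topic bonus and the returned confidence score can be sketched as below. The bonus value of 0.05, the tuple layout and the function name `pick_response` are assumptions made for illustration; the text only states that the bonus is small.

```python
from typing import List, Optional, Tuple

# (response text, topic of its sub-dataset, raw matching score)
Scored = Tuple[str, str, float]


def pick_response(scored: List[Scored],
                  current_topic: Optional[str],
                  bonus: float = 0.05) -> Optional[Tuple[str, str, float]]:
    """Return (response, new current topic, confidence), or None if no candidates."""
    best = None
    for response, topic, score in scored:
        # Responses from the current topic's sub-dataset get a small bonus.
        adjusted = score + (bonus if topic == current_topic else 0.0)
        if best is None or adjusted > best[2]:
            best = (response, topic, adjusted)
    return best


CANDIDATES = [
    ("I have to go with November Rain.", "music", 0.60),
    ("The last one was the best in the series.", "movies", 0.62),
]
print(pick_response(CANDIDATES, current_topic="music"))
```

The topic of the winning response becomes the current topic for the next turn, which is what nudges the conversation toward staying on one subject.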
3.4 Likes and dislikes
This component is triggered only if the rule-based component fails to return a response and the matching score confidence drops below a predefined threshold. It contains a set of very specific templates that depend on the number of recent views of the entity’s Wikipedia page. These templates are limited to ‘safe’ domains, e.g. actors, singers and bands, to avoid sensitive entities. It is particularly useful for unseen entities. An example response by this component is shown below.
User: Do you like Kaizers Orchestra?
Edina: Kaizers Orchestra was a Norwegian alternative rock band formed on 1 January 2000. What do you think about them?
3.5 Proactive component
The proactive component is designed to drive the conversation towards a state that other components are better able to handle. It does so by sampling from a set of questions to ask the user. These questions are designed to either:
probe the user to mention a specific entity that the Matching Score component will likely be able to match;
ask a yes or no question to gauge the user’s interest in a particular topic, followed by a question probing the user to continue the conversation within that topic (cf. Example 3.5).
The proactive component is called when the matching score fails to return a response due to low confidence. It is in general beneficial if we can probe the user to mention specific entities related to the topics in our data, as our Matching Score component can often give interesting opinions about specific entities. The proactive component is limited to returning a response only once every 5 turns to avoid excessive use. An example interaction is shown below.
Edina: Did you enjoy the last Star Wars movie?
Edina: I was hoping for you to say yes! What did you think of it?
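The gating logic of the proactive component (low matching confidence, at most one use every 5 turns) can be sketched as follows. The class and method names are hypothetical, and using the 0.5 confidence threshold here is our reading of the thresholds reported in the evaluation.

```python
import random
from typing import List, Optional


class ProactiveComponent:
    """Asks a sampled question, but only rarely and only when matching is unsure."""

    def __init__(self, questions: List[str], min_gap: int = 5,
                 rng: Optional[random.Random] = None):
        self.questions = list(questions)
        self.min_gap = min_gap            # minimum number of turns between uses
        self.rng = rng or random.Random()
        self.last_used_turn: Optional[int] = None

    def maybe_ask(self, turn: int, confidence: float,
                  threshold: float = 0.5) -> Optional[str]:
        # Stay silent while the matching score is confident enough.
        if confidence >= threshold:
            return None
        # Stay silent if the component was used too recently.
        if self.last_used_turn is not None and turn - self.last_used_turn < self.min_gap:
            return None
        self.last_used_turn = turn
        return self.rng.choice(self.questions)


proactive = ProactiveComponent(["Who is your favorite singer?"])
for turn, confidence in [(1, 0.2), (3, 0.2), (6, 0.2), (7, 0.9)]:
    print(turn, proactive.maybe_ask(turn, confidence))
```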
3.6 Generative Neural Network component
The generative neural network is a recurrent neural network (RNN) language model that predicts a probability distribution over the next word given a history of words. For this component, we used a multiplicative LSTM (krause2017), which combines features of a multiplicative RNN (sutskever2011) and an LSTM (hochreiter1997long) to achieve stronger language modeling performance. We pretrained our network on the OpenSubtitles dataset (Lison2016) using the vocabulary from our self-dialogues. We then fine-tuned this model on our data sources. We considered several stopping points to try to obtain the model that generates the best samples. In general, some degree of overfitting (continuing to train even after the validation error starts getting worse) was beneficial to sample quality. At runtime, we simply sampled from this model word by word, conditioned on the user’s last response and the agent’s response before that. We used a temperature of 0.7 for sampling, to draw samples from higher-density regions under the model.
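Temperature sampling itself is standard and can be illustrated independently of the network. In the sketch below the logits are toy values rather than outputs of the actual model, and the function names are ours; dividing the logits by a temperature below 1 sharpens the distribution before sampling.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax_with_temperature(logits: Sequence[float],
                             temperature: float = 0.7) -> List[float]:
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_word(vocab: Sequence[str], logits: Sequence[float],
                temperature: float = 0.7,
                rng: Optional[random.Random] = None) -> str:
    """Draw one word according to the temperature-adjusted distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(list(vocab), weights=probs, k=1)[0]


VOCAB = ["music", "movies", "sports"]
TOY_LOGITS = [2.0, 1.0, 0.0]
print(softmax_with_temperature(TOY_LOGITS, temperature=1.0))
print(softmax_with_temperature(TOY_LOGITS, temperature=0.7))
print(sample_word(VOCAB, TOY_LOGITS))
```

Compared with a temperature of 1.0, the 0.7 setting moves probability mass toward the most likely words, which trades some diversity for fewer incoherent samples.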
4 System Evaluation
4.1 Internal evaluation of matching score
We evaluated our matching score component and its associated confidence scores using an internal human evaluation metric. A single trial in our evaluation process involved sampling a triple, consisting of a two-turn context and the response that followed it, from our data bank of self-dialogues. We then used this conversational context as the input to our matching score, in place of the previous agent response and the user utterance. We then found the top 4 responses under our scoring function, forcing the matching score to choose a response other than the true response given by the AMT Worker. We randomly sampled the response to be rated to be either one of these top-4 matching score responses or the response given by the AMT Worker. The response was then rated in the context of the two preceding turns according to the evaluation metric given below. The rater was always blind to whether the response was a real AMT Worker response.
5 - human-like response that is interesting or engaging
4 - human-like response that is not as interesting or engaging, or response that doesn’t quite fit conversationally but is interesting or engaging
3 - probably not a human-like response, but still somehow addresses part of the conversation context
2 - something is clearly off about response, but still could contribute to the conversation in some way
1 - response is incoherent.
We collected 181 trials of human evaluations of top-4 matching score responses and 33 trials of human evaluations of real responses from AMT workers. We found that the confidence scores of our top-4 matching score responses were positively correlated with human evaluation scores. Statistics on human evaluation of matching score responses at different confidence thresholds and of real human responses are given in Table 3.
| Response type | Mean human evaluation score |
|---|---|
| Human AMT worker | |
| Matching score top-4, confidence > 0.7 | |
| Matching score top-4, confidence > 0.5 | |
| Matching score top-4 | |

Table 3: Average human evaluation of matching score and real human responses, plus or minus standard error. A confidence threshold of 0.5 corresponds to the threshold above which the matching score component responds. A confidence threshold of 0.7 corresponds to the threshold above which the matching score component moves ahead of the likes and dislikes component in the queue.
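The "mean plus or minus standard error" statistic at a given confidence threshold can be computed as in the sketch below. The trial data are invented toy values for illustration only and are not the ratings collected in this evaluation; the function names are ours.

```python
import math
from statistics import mean, stdev
from typing import List, Optional, Tuple


def mean_and_stderr(ratings: List[float]) -> Tuple[float, float]:
    """Sample mean and standard error (sample std / sqrt(n)); needs n >= 2."""
    return mean(ratings), stdev(ratings) / math.sqrt(len(ratings))


def summarize(trials: List[Tuple[float, float]],
              threshold: Optional[float] = None) -> Tuple[float, float]:
    """trials holds (confidence, human rating) pairs; keep those above threshold."""
    ratings = [rating for confidence, rating in trials
               if threshold is None or confidence > threshold]
    return mean_and_stderr(ratings)


# Invented toy trials: (matching-score confidence, rating on the 1-5 scale).
TOY_TRIALS = [(0.9, 5), (0.8, 4), (0.6, 3), (0.3, 2)]
for threshold in (None, 0.5, 0.7):
    m, se = summarize(TOY_TRIALS, threshold)
    print(threshold, f"{m:.2f} +/- {se:.2f}")
```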
4.2 Qualitative evaluation on online customer conversations
Below, we show an excerpt from an online conversation, where the agent’s responses are annotated with the components that generated them:
Bot: Hi! This is an Alexa Prize socialbot. I feel very rock and roll today. Shall we talk about music? [StartMessage]
User: sure what do you want to talk about
Bot: Lets talk about music [matching score]
User: okay which music in particular
Bot: Who is your favorite singer? [proactive]
User: David bowie
Bot: He passed away recently, very sad. [matching score]
User: I agree do you like David bowie
Bot: I’m a major David Bowie fan! You too? [matching score]
User: sure what’s your favorite song
Bot: Hit me baby one more time by Britney Spears. Hashtag classic. [rule based]
User: what’s your favorite David bowie song
Bot: That’s a tough question. I would have to say Hunky Dory or Station to Station. [matching score]
User: what’s your favorite artist
Bot: I adore the Guns and Roses. Such a great music. [rule based]
User: what is your favorite song is hit me baby 1 more time by Britney spears
Bot: That’s an odd but respectable classic pick. [matching score]
User: what’s your favorite guns and roses song
Bot: I have to go with November Rain. [matching score]
User: why do you like guns and roses
Bot: I love Guns and Roses. [matching score]
Edina demonstrates many of the advantages of having a hybrid method. The matching score was able to return an up-to-date output, recognizing the fact that David Bowie had passed away recently. The outputs are relevant to the context; for instance, the matching score was able to name actual songs by David Bowie and Guns and Roses. The rule-based component is instead invoked correctly when the customer inquires about the preferences of our bot. Finally, it is worth noting that when the confidence score is low, the proactive component probes the user for an entity related to the current topic so as not to disrupt the flow of the conversation.
5 Discussion
Edina demonstrated the ability to engage in a wide range of subtopics during conversations with Alexa customers. This was achieved by drawing responses from a conversation dataset created with the novel technique of self-dialogues. This technique plays a significant role in enabling our agent to generate relevant, topical output closely resembling human conversations. Our internal system evaluation further demonstrated the value of using a thresholded matching score to select a response. During our human evaluation, the responses that have a high matching confidence score received a significantly higher rating compared to the other responses (see Table 3), showing that the matching score can self-censor to greatly reduce the risk of a poor response.
We believe the strength of our system lies in incorporating the advantages of both data-driven and rule-based approaches while avoiding their shortcomings. We were able to cover a broad range of topics and subtopics without the need for tailored scripts or rules, while at the same time using data that does not contain a large amount of noise and is easy to collect. Our self-conversation data collection technique and our unique approach to integrating rule-based, retrieval, and machine learning based methods should prove useful for future open domain conversational agents.