Conversational AI (conver) has been an active area of research for decades, and based on recent advances in natural language understanding (NLU) and related fields, intelligent assistants have continuously improved. However, unlike human assistants, these bots do not understand the true meanings of their generated responses. They are more vulnerable to failures, and as of now, there have been no methods reported to reliably and automatically detect and correct failures in human-machine conversations as they occur. Introducing an accurate online satisfaction prediction model would spur dramatic improvements of conversational agents. For example, automatic and timely detection of failures would allow a conversational system to gracefully handle mistakes, and potentially improve both immediate and future system responses.
Evaluating intelligent assistants is a challenging task, and has been an active area of research. For example, recent studies identified particular patterns of interactions which tend to contribute to final user satisfaction (e.g., (kiseleva2016understanding; egregious; sensitive)), and new behavioral metrics such as conversational depth and topic diversity have been proposed to systematically evaluate user experience in conversational systems (venkatesh2018evaluating; pred; conver). However, as we will show, these metrics do not directly correspond to actual subjective and immediate user satisfaction with the conversation. Furthermore, predicting conversational satisfaction is significantly different from evaluating traditional informational systems, where signals such as clickthrough, dwell time, and transactional signals could be used to evaluate web search engines (fox2005evaluating), or touch-based features for satisfaction with mobile search and assistants (pred; williams2016detecting). Especially for open-domain conversational systems, since a user may not have clearly defined goals, a successful prediction system must understand a wide range of conversational intents, topic preferences, and user behavior signals.
As an illustration, consider a sample conversation of a user with our system, shown in Figure 1. (Due to the Alexa Prize data confidentiality rules, we cannot reproduce actual user conversations, but the sample represents a typical conversation with our system.) The conversation starts well: our system successfully supports a 3-turn interaction about travel. However, our system failed to understand "Brad Pitt" due to an automatic speech recognition (ASR) failure, and suggested a local bakery. The user has a hard time understanding the system’s irrelevant response, and asks why we suggested a bakery instead of movies. Our system lost context beyond this point, and suggested recent news as a way of reclaiming the user’s interest. At this point, the user is likely dissatisfied, as indeed supported by users’ satisfaction ratings for such conversations.
Thus, evaluating immediate user satisfaction is crucial for developing adaptive conversational strategies such as failure recovery and topic switching. To address this problem, we propose a novel online conversational satisfaction prediction model (ConvSAT). ConvSAT first represents each conversation turn as a vector of carefully designed behavioral features, inspired by prior work (described next). These features aim to capture both the overall conversation state so far, and the immediate state of the conversation at the current turn. Being able to jointly model the overall (aggregated) conversation state and the immediate state after the current turn requires learning a complex interaction between the global and turn-level conversation state. We represent this interaction through a global/immediate feature matrix, and combine it with contextualized word encoders to learn conversation-specific contextual word representations conditioned on the conversation so far. Lastly, we enrich the word representations with sub-word (character) information, aiming to improve generalization on unseen and broken words from ASR output.
Empirically validating such a satisfaction prediction system is challenging for two reasons. First, without a fully functional conversational agent, it is impossible to test and gather enough information to conduct a reasonable study. Second, even with an available system, it is challenging to create an environment and recruit enough users to interact with the system in a realistic setting. To calibrate our proposed method, we first evaluate ConvSAT on the publicly available Dialogue Breakdown Detection Challenge 3 (DBDC3) dataset, generated by human users talking to different chatbots. The results show that our method outperforms recently reported state-of-the-art methods on that task. Having established the effectiveness of our prediction method under controlled conditions, we then report satisfaction prediction results over a large conversational dataset collected by conversing with real users during the Amazon Alexa Prize 2018 competition. Our system successfully conversed with thousands of Alexa users, with a large fraction of these conversations explicitly rated by users for perceived conversation quality. Together, these results establish that our proposed method not only outperforms the existing state-of-the-art methods on an established benchmark, but also generalizes to the more challenging real-world scenario of the Alexa-based open-domain conversational AI challenge. In summary, our contributions include:
A novel ConvSAT model, applicable to both conversational satisfaction and failure prediction tasks, operationalized in both online and offline settings.
A comprehensive behavioral feature matrix, designed to capture both immediate and aggregated evidence for conversational user satisfaction.
Extensive experiments, demonstrating the effectiveness of ConvSAT on an open benchmark dataset, and on real open-domain human-machine conversations.
2. Related Work
In this section, we summarize the related work on conversational AI and satisfaction prediction to put our contributions in context.
Conversational systems (conver) aim to interact with users naturally through conversations. One classic study summarized four main challenges of conversational systems as: 1) processing and understanding the noisy text from speech; 2) designing a flexible system for easy adaptation to different tasks; 3) domain and intent recognition; 4) mixed-initiative dialogue between machines and users (toward_conv). There has been dramatic recent progress in all of these areas, which enabled modern conversational systems. For example, ASR quality has improved drastically over the last few years (speech_Hinton), enabling more natural voice input. Both rule-based dialogue management (DM) systems (form-DM; ravenclaw) and end-to-end DM systems using neural networks (DM_end; luo2018learning; dhingra2016towards) have also improved in sophistication and flexibility (alquist). To enable an easier extension to new tasks, modular architectures with central DM systems have been proposed (emory; gunrock; sounding_board).
To improve the knowledge retrieval process, many recent approaches introduced novel frameworks to incorporate external knowledge into response generation, as well as to actively learn concepts through conversations (dinan2018wizard; luo2018learning; jia2017learning; ghazvininejad2018knowledge). Incorporating artificial personalities has also been studied as a way to make engagements within these systems more empathetic and personalized (sounding_board; xiaoice).
Despite significant advances in conversational AI, conducting a natural, mixed-initiative conversation remains elusive, because deciding when to lead the conversation and when to follow the user’s interest depends heavily on each user. This is especially problematic for open-domain conversations that do not have clearly defined goals. One active field related to this challenge is predicting or evaluating user satisfaction with intelligent assistants during the interaction. If a user’s satisfaction or engagement with a conversation could be predicted in real time, a conversational system could better adapt to the user’s interests or intents, or initiate graceful failure correction, among many other possible actions.
Measuring and Predicting User Satisfaction
User satisfaction can be viewed as an attitude toward an information system, which is measured by various types of beliefs about user interactions as defined in (user_satisfaction_theory1; user_satisfaction_theory2; paradise1; paradise2). More precisely, recent works on evaluating intelligent assistants focus on extracting useful features and representations to train an offline satisfaction model. For traditional IR systems such as Web search engines, previous studies showed that incorporating implicit features such as deviations from the average behavior and time on page into the ranking function could improve the search results (eugene_web). For mobile search assistants, combining implicit features with additional touch-related features dramatically increased the performance of a trained satisfaction model (pred; kiseleva2016understanding). Thus, we hypothesize that extracting contextualized behavioral features could lead to a more robust conversational satisfaction model.
To represent context, we draw on active prior work on representation learning using unsupervised feature discovery to predict user satisfaction. One recent work proposed a query representation learning technique with intent-sensitive word embeddings, and showed that modifications to improve query representation can improve overall model performance (sensitive). Another recent work introduced a model that can detect egregious conversations using textual representations, and addressed how this technique can be applied to an automated evaluation scheme (egregious). There have been studies to predict causes of query reformulation in intelligent assistants by using system, acoustic, language and additional features (sano2017predicting). Hence, we incorporated query reformulation features such as query overlap and repetition counts in our study to capture query reformulation as a potentially negative signal.
Lastly, there have been efforts in restricted domains to predict online satisfaction signals, such as using manually curated features from a flight-booking system (turn_prediction) or detecting online dialogue breakdowns (dissatisfaction) from DBDC3 challenge (dbdc3; kth). Another recent approach proposed a novel self-feeding framework to improve the quality of conversational systems (hancock2019learning) using online predictions. We will extend the proposed ideas here by introducing a much more comprehensive set of features to predict online satisfaction in non-goal oriented conversations. In summary, this is the first report of predicting satisfaction of real users on real-world open-domain conversations, for both the offline and online settings.
3. ConvSAT: Method Description
In this section, we present our proposed conversational satisfaction prediction model (ConvSAT). As illustrated in Figure 2, ConvSAT considers three complementary input dimensions: 1) contextualized word encoders (colored green); 2) contextualized character encoders (colored blue); 3) behavioral feature matrices (colored yellow).
3.1. Model Architecture
Contextualized Word Encoders
To add context history, we define a hyper-parameter called context window size (W) to control how many previous turns to condition on. To preserve the online setting, we do not incorporate any future information. Hence, given previous turns (T1 … Ti-1) and the current turn (Ti), the current utterance (Ui) and current response (Ri) are expanded with the previous W turns (we fix W=3 for illustration throughout this section):
The boundaries between the expanded utterances and responses are marked with special tokens <U-END> and <R-END>. These two expanded sequences are tokenized to obtain two word sequences (Uiw, Riw), which will be the inputs to contextualized word encoders:
To represent the utterances contextually, we chose bidirectional Long Short-Term Memory (bi-LSTM) networks, as they have shown promising performance for representing text. We use two separate encoders, one for utterances (EncoderU) and one for responses (EncoderR). This is because in human-machine conversations, the ratio of words in an utterance to words in a response is low, mainly due to limitations in open-domain conversational systems. By using two separate encoders, we aim to reduce the possible bias towards long responses. The last hidden outputs of the forward LSTM and the backward LSTM are concatenated to represent the entire word semantics in Ui and Ri. These two outputs are then concatenated to obtain the final context representation (Encoderword) at Ti:
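The context expansion described in this subsection can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `expand_with_context`, the whitespace tokenizer, and the placement of a boundary token after every turn in the window are assumptions. Only the window size W and the `<U-END>` / `<R-END>` tokens come from the text. The window is taken to include the current turn, which matches the later remark that a non-contextual model corresponds to W=1.

```python
from typing import List

U_END = "<U-END>"  # utterance boundary token named in the text
R_END = "<R-END>"  # response boundary token named in the text


def expand_with_context(turns: List[str], i: int, window: int, end_token: str) -> List[str]:
    """Build the expanded word sequence for turn i.

    Only turns up to and including i are read, so no future context is used.
    Assumption: the window counts the current turn, so window=1 keeps only
    the current turn and window=3 adds the two turns before it.
    """
    start = max(0, i - window + 1)
    tokens: List[str] = []
    for text in turns[start : i + 1]:
        tokens.extend(text.split())  # whitespace tokenizer (assumption)
        tokens.append(end_token)     # boundary marker after each turn (assumption)
    return tokens


utterances = ["let us talk about travel", "i like paris", "tell me about brad pitt"]
expanded = expand_with_context(utterances, i=2, window=3, end_token=U_END)
print(expanded)
```

The same function would be called once with the utterance history and `U_END`, and once with the response history and `R_END`, giving the two sequences that feed the separate encoders.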
Contextualized Character Encoders
Voice-based conversational systems are vulnerable to automatic speech recognition (ASR) errors. Errors are more frequent for entity names, such as people or brand names, and transcription errors in these are likely to lead to a failed conversation. We noticed that misspelled or mis-segmented words often share similar sub-word structures, because the various accents and pronunciations originate from a single root word. As an illustration, consider a short example of how ASR recognized several automobile brands for people with foreign accents:
|Actual word||ASR failures|
|Mercedes||Sadis, Cedes, Sadi’s|
|McLaren||Mac Laren, Mac Lauren, Mclaurin|
|Aston Martin||Astone Martine, Ask Tony Martin|
Without subword (character-level) information, these errors are likely to create noise in learning robust word representations. Moreover, the frequency of errors such as Sadi’s appearing in our data is low, which causes the embedding matrix to be more sparse. For the Ask Tony Martin case, it is likely that the model will understand this phrase differently from the original intent. Hence, by jointly training word-level and sub-word (character-level) models, we hypothesize that the overall semantics can be modeled better.
From the expanded word sequences Uiw, Riw in (3) and (4), we derive the character sequences Uic and Ric:
The resulting Uic and Ric are 2-dimensional matrices, with the first dimension indexing the tokenized words and the second dimension indexing the characters of each word. We flatten these matrices into two 1-dimensional character sequences. We again use bi-LSTM networks (EncoderUc, EncoderRc) to obtain the final character representation (Encoderchar), following the same process as in (5).
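A minimal sketch of the word-to-character flattening step is given below. The helper names are hypothetical, and the sketch treats every token the same way (including boundary tokens), which is an assumption since the text does not say how special tokens are handled at the character level.

```python
from typing import List


def words_to_char_matrix(words: List[str]) -> List[List[str]]:
    """One row per tokenized word, one column per character of that word."""
    return [list(word) for word in words]


def flatten_chars(char_matrix: List[List[str]]) -> List[str]:
    """Flatten the word-by-character matrix into a single character sequence."""
    return [ch for row in char_matrix for ch in row]


# An ASR variant and the intended brand name from the table above.
asr_variant = flatten_chars(words_to_char_matrix(["mac", "laren"]))
intended = flatten_chars(words_to_char_matrix(["mclaren"]))
print(asr_variant)
print(intended)
```

After flattening, the mis-segmented variant and the intended word differ by a single character, which illustrates why a character-level encoder can place them close together even when the word-level vocabulary treats them as unrelated.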
Behavioral Features with Online Scaling
Behavioral features are manually engineered to encode different aspects of user behavior. At a particular turn Ti, user behavior is represented as one feature vector (vi), which can be a concatenation of various types of features. To incorporate conversational context, we append last W feature vectors to obtain matrix Vi:
Each vn encodes local information from the beginning turn T0 up to turn Tn. For instance, the total word count at the current turn Ti is computed over T0 to Ti. Similarly, the average number of words is the total number of words from T0 to Ti divided by the current turn index i. Our proposed scaling function S(v, i) scales a feature vector (v) with respect to the current turn index (i). For online predictions, such a scaling mechanism is crucial, because the goal is to detect a relative change in user behavior as the conversation progresses. If a user engaged deeply in one topic but started to diverge in the later turns, a feature capturing topic transition rate (how likely conversational states change) will gradually increase from lower to higher values. We apply this online scaling function to each vector in Vi to obtain the scaled feature matrix:
The resulting scaled matrix is a 2-dimensional dense matrix, with each row representing a turn and each column representing a feature scaled with respect to that turn. We then feed this matrix to an attention layer to obtain a weighted sum of its row vectors. Given each vi, a similarity score si is computed based on a shared trainable matrix M, a feature context vector c, and a bias term bi. M, c, and bi are initialized randomly and jointly learned during training. Softmax activation is applied to the similarity scores to obtain the attention weights. Lastly, each v is multiplied by its learned attention weight, and the results are summed to obtain the attended output:
This is equivalent to learning how much of the previous information to attend to when modeling relative changes in user behavior, by learning the weight of each turn.
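The scaling and attention steps can be sketched in plain Python as follows. Both formulas are assumptions made for illustration, since the exact definitions are not recoverable from the text: `scale_online` takes S(v, i) to be division by the number of turns seen so far, and `attend` uses the common additive form s_i = c · tanh(M v_i + b) with a single shared bias vector. In the actual model, M, c, and the bias are learned; here they are passed in as fixed values.

```python
import math
from typing import List, Tuple

Vector = List[float]


def scale_online(v: Vector, turn_index: int) -> Vector:
    """Hypothetical S(v, i): divide cumulative counts by the turns seen so far."""
    turns_seen = turn_index + 1  # +1 because the text indexes turns from T0
    return [x / turns_seen for x in v]


def softmax(scores: Vector) -> Vector:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(V: List[Vector], M: List[Vector], c: Vector, b: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights, attended output) for the rows of V.

    Assumed scoring form: s_i = c . tanh(M v_i + b).
    """
    scores = []
    for v in V:
        hidden = [
            math.tanh(sum(m * x for m, x in zip(row, v)) + bias)
            for row, bias in zip(M, b)
        ]
        scores.append(sum(h * w for h, w in zip(hidden, c)))
    weights = softmax(scores)
    attended = [sum(w * v[k] for w, v in zip(weights, V)) for k in range(len(V[0]))]
    return weights, attended


# Cumulative topic-transition counts over six turns: the scaled rate rises
# once the user starts to switch topics in the later turns.
cumulative = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
rates = [scale_online([count], i)[0] for i, count in enumerate(cumulative)]
print(rates)
```

With all-zero parameters every turn receives the same score, so the attended output reduces to the plain average of the last W feature vectors; training moves the weights away from this uniform starting point.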
Fully Connected Layer
The outputs from contextualized word encoders, char encoders and attended feature matrix are concatenated to obtain each turn representation:
To benefit from all previous turn outputs, we have one final unidirectional LSTM that models each turn sequentially. Depending on tasks (online or offline prediction), many-to-many or many-to-one output(s) can be obtained. Each output is fed to a linear layer with dropout to enforce regularization, followed by sigmoid or softmax activation to obtain binary or multi-class distribution.
3.2. Behavioral Features
Behavioral features extracted for ConvSAT are categorized into three types: 1) general behavioral features; 2) system features; 3) topic preference features. These features are concatenated to produce one feature vector per turn.
General Behavioral Features
General behavioral features encode user behaviors along various dimensions, including lexical, semantic, and conversational. First, we define engagements as subsets of a conversation that have a conversational depth of 4 or more on the same topic. The count of engagements (F1) and the maximum length of engagements (F2) are derived respectively. Sentiment analysis using the Valence Aware Dictionary for sEntiment Reasoning (VADER) (vader) is applied to utterances to obtain positive (F3, F5) and negative (F4, F6) sentiment scores. To capture how much topic transition occurs, the state change ratio (F7) is derived by dividing the total number of transitions by the current turn index. Similarly, agreement and disagreement ratios are derived (F8, F9) based on intent classification results. To measure the repetition between (Ui-1, Ui), (Ri-1, Ri) and (Ui, Ri), counts of token overlaps are computed (F10, F11, F12). Lastly, the average and total word counts of user utterances and system responses are extracted (F13 … F18).
|Local Features||Short Description|
|F1 - NumEngagements||#Engagements|
|F2 - MaxEngagements||Max engagement in # of turns|
|F3 - UtterancePos||Positive sentiment in Ui|
|F4 - UtteranceNeg||Negative sentiment in Ui|
|F5 - AvgPos||Sum of pos sentiment counts / i|
|F6 - AvgNeg||Sum of neg sentiment counts / i|
|F7 - StateChangeRatio||#Topic Transitions / i|
|F8 - YesRatio||#Yes Responses/Agreements / i|
|F9 - NoRatio||#No Responses/Disagreements / i|
|F10 - TokenOverlapU||Token overlap in Ui, Ui-1|
|F11 - TokenOverlapR||Token overlap in Ri, Ri-1|
|F12 - TokenOverlapUR||Token overlap in Ui, Ri|
|F13 - TotalWordU||Total #Words in Ui|
|F14 - TotalWordR||Total #Words in Ri|
|F15 - AvgWordU||Average #Words in U1 … Ui|
|F16 - AvgWordR||Average #Words in R1 … Ri|
|F17 - WordU||#Words only in Ui|
|F18 - WordR||#Words only in Ri|
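A few of the general behavioral features can be sketched as below. The function names, the input format (one topic label per turn), and the whitespace tokenization are assumptions for illustration. The ratio divides by the number of turns seen so far, which assumes a 1-based turn index.

```python
from itertools import groupby
from typing import List, Tuple


def engagement_stats(topics: List[str], min_depth: int = 4) -> Tuple[int, int]:
    """F1/F2: count and maximum length of same-topic runs of depth >= min_depth."""
    runs = [len(list(group)) for _, group in groupby(topics)]
    engaged = [n for n in runs if n >= min_depth]
    return len(engaged), max(engaged, default=0)


def state_change_ratio(topics: List[str]) -> float:
    """F7: number of topic transitions divided by the number of turns so far.

    Expects at least one turn.
    """
    transitions = sum(1 for prev, cur in zip(topics, topics[1:]) if prev != cur)
    return transitions / len(topics)


def token_overlap(a: str, b: str) -> int:
    """F10-F12: number of distinct tokens shared by two texts."""
    return len(set(a.lower().split()) & set(b.lower().split()))


topics_so_far = ["Movies"] * 5 + ["Music"] * 2 + ["News"] * 4
print(engagement_stats(topics_so_far))
print(state_change_ratio(topics_so_far))
print(token_overlap("who is brad pitt", "brad pitt is an actor"))
```

Because each function only reads the turns up to the current one, calling it on a growing prefix of the conversation yields the per-turn values that make up each feature vector.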
System Features
System features are directly related to systematic aspects of our conversational agent. There are two binary session-level features that capture whether a user agreed to provide their name and whether they are a returning user (F19, F20). For latency, we define two types, system latency (F21, F22, F23) and user latency (F24, F25, F26), both measured in seconds. System latency measures how long a user had to wait to hear the system response; user latency measures how long a user had to think before issuing an utterance. Lastly, every token in a user utterance is annotated with an ASR confidence value ranging from 0.0 to 1.0. Using these values, the minimum, maximum, and average token confidence for each Ui are added (F27, F28, F29).
|Session-level Features||Short Description|
|F19 - NameProvided||Name provided or not|
|F20 - ReturningUser||Returning user or not|
|Local Features||Short Description|
|F21 - Latency||System latency on Ui|
|F22 - Latencyavg||Average system latency|
|F23 - Latencymax||Max system latency|
|F24 - UserLatency||User latency on Ri|
|F25 - UserLatencyavg||Average user latency|
|F26 - UserLatencymax||Max user latency|
|F27 - ASRmin||Min token confidence on Ui|
|F28 - ASRmax||Max token confidence on Ui|
|F29 - ASRavg||Average token confidence on Ui|
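The latency and ASR confidence features reduce to simple running statistics. The sketch below assumes per-token confidences and per-turn latencies are already available as lists of floats; the function names are hypothetical.

```python
from typing import List, Tuple


def asr_confidence_features(confidences: List[float]) -> Tuple[float, float, float]:
    """F27-F29: minimum, maximum, and mean token confidence for one utterance."""
    if not confidences:
        raise ValueError("utterance has no tokens")
    return min(confidences), max(confidences), sum(confidences) / len(confidences)


def running_latency_features(latencies: List[float]) -> Tuple[float, float, float]:
    """F21-F23 (or F24-F26): latest, running average, and running maximum latency.

    `latencies` holds one value in seconds per turn, from the first turn
    up to the current one.
    """
    if not latencies:
        raise ValueError("no turns observed yet")
    return latencies[-1], sum(latencies) / len(latencies), max(latencies)


print(asr_confidence_features([0.5, 0.25, 0.75]))
print(running_latency_features([1.0, 3.0, 2.0]))
```

A single low-confidence token, such as a misrecognized entity name, pulls the minimum down sharply while leaving the average almost unchanged, which is one reason to keep all three statistics.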
Topic Preference Features
Topic distribution features encode specific behaviors related to topic diversity, visited topics and topic distribution so far. For topic diversity, we counted the length of the visited topic set to represent topic breadth (F30). Count of accepted topics and rejected topics (F31, F32) are extracted to explore topic acceptance and rejection trade-offs. Lastly, a 15-dim topic count vector and a 3-dim special state count vector from T0 to Ti are concatenated to represent the online topic distribution (F33, … F51). The special states include Stop, Profanity and Clarification. Stop state tracks whether a user expressed stop signals, profanity state tracks if an utterance or response contained profane words, clarification state tracks if system asked a user to repeat due to low ASR confidence.
|Local Features||Short Description|
|F30 - TopicBreadth||Number of unique topics visited|
|F31 - TotalAcceptedTopics||#Accepted topics|
|F32 - TotalRejectedTopics||#Rejected topics|
|F33…51 - TopicDistribution||Vector of 18 topic counts|
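The topic breadth and topic distribution features can be sketched with a counter over the states visited so far. The `TOPICS` list below is a hypothetical four-item stand-in for the 15 supported domains, and counting breadth over topics only (not special states) is an assumption. The accepted and rejected topic counts (F31, F32) depend on intent labels and are left out.

```python
from collections import Counter
from typing import List

TOPICS = ["Movies", "Music", "Weather", "Travel"]        # hypothetical subset of the 15 domains
SPECIAL_STATES = ["Stop", "Profanity", "Clarification"]  # named in the text


def topic_features(visited: List[str]) -> List[int]:
    """Return [topic breadth] + per-topic counts + special-state counts so far."""
    counts = Counter(visited)
    breadth = sum(1 for topic in TOPICS if counts[topic] > 0)
    return [breadth] + [counts[name] for name in TOPICS + SPECIAL_STATES]


print(topic_features(["Travel", "Travel", "Movies", "Clarification"]))
```

With the full list of 15 domains, the count portion of the returned vector would have 18 entries, matching the topic distribution row in the table above.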
3.3. Additional Implementation Details
For contextualized word encoders, embedding weights are initialized with pretrained Google Word2Vec (mikolov2013distributed) of size 300 and tuned for conversational context. For contextualized char encoders, embedding weights of size 32 are randomly initialized and learned during training. We used W=3, since we observed that adding less or more context reduced performance in our experiments. A hidden dimension of 100 is used for each word LSTM and 32 for each char LSTM, resulting in a turn representation of size 528 (utterance + response) + #features. The Adam optimizer was used to minimize cross-entropy loss, with a learning rate of 1e-4. At the fully connected layer, a dropout rate of 0.5 is used. These hyper-parameters were obtained by tuning on our Alexa validation data, but can easily be tuned for different conversational tasks. Our PyTorch implementation and models are available for the research community at https://github.com/emory-irlab/ConvSAT.
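The turn representation size of 528 follows from the stated hidden sizes once the two directions of each bi-LSTM and the two encoders (utterance and response) are counted. A short check, with constant and function names chosen for illustration:

```python
WORD_HIDDEN = 100   # hidden size of each word LSTM (from the text)
CHAR_HIDDEN = 32    # hidden size of each char LSTM (from the text)
DIRECTIONS = 2      # a bi-LSTM concatenates forward and backward outputs
ENCODERS = 2        # one encoder for utterances, one for responses


def turn_representation_size(num_features: int) -> int:
    """Size of one turn vector: word encoders + char encoders + attended features."""
    word_dim = WORD_HIDDEN * DIRECTIONS * ENCODERS  # 100 * 2 * 2 = 400
    char_dim = CHAR_HIDDEN * DIRECTIONS * ENCODERS  # 32 * 2 * 2 = 128
    return word_dim + char_dim + num_features


print(turn_representation_size(0))  # 528
```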
4. Conversational Data and Tasks
We present statistics of DBDC3 dataset and our private dataset collected during Amazon Alexa Prize 2018. Then, three classification tasks are defined based on these two datasets.
4.1. Dialogue Breakdown Detection Challenge
Dialogue system technology challenges (DSTC), originally known as the dialogue state tracking challenges, were initiated in 2013 in order to promote research in conversational AI. We focus on the third track of DSTC6’17 challenge titled Dialogue Breakdown Detection Challenge 3 (DBDC3) (dbdc3), since it is closely related to online satisfaction prediction. Dialogue breakdown is defined as a situation in conversations where users cannot continue engaging with the system due to various system failures. Table 1 summarizes the DBDC3 English corpus statistics.
|Label||Train||Validation||Test|
|NB||1207 (32.3%)||126 (33.3%)||756 (37.8%)|
|PB||974 (26.1%)||114 (27.1%)||456 (22.8%)|
|B||1549 (41.5%)||180 (42.8%)||788 (39.4%)|
Each turn is labeled by 30 human annotators with one of three labels: 1) not breakdown (NB); 2) possible breakdown (PB); 3) breakdown (B). According to the task specification, turn labels are obtained by majority voting and have to be predicted without looking at future context. We use the official training and test data splits to be consistent with other models published on this data. For our model training, we further set aside 10% of the official training data for model validation.
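Deriving a turn label by majority voting can be sketched as follows. The tie-breaking rule (first label encountered wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter
from typing import List


def majority_label(votes: List[str]) -> str:
    """Return the label chosen by the most annotators.

    Ties are broken by first occurrence in `votes` (assumption).
    """
    return Counter(votes).most_common(1)[0][0]


# Hypothetical votes from 30 annotators for a single turn.
votes = ["B"] * 14 + ["PB"] * 9 + ["NB"] * 7
print(majority_label(votes))
```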
4.2. Alexa Prize Dataset
Alexa Prize Data Overview
Alexa Prize Dataset was collected during a worldwide research competition sponsored by Amazon, initiated in 2017 to advance conversational AI (conver) and continued in 2018. Our system conversed with thousands of Alexa customers during summer 2018, providing the “Alexa Prize” dataset for this paper. Customers were invited to optionally provide a rating when they were finished talking to the bot. Rated dialogues received rating scores between 1.0 and 5.0. A small subset (less than 1%) of the users who rated our system also provided free-form feedback, explaining why they chose their rating. For this study, we will only focus on conversations from one stable version of our system, with the data collected over a 2-week period in August 2018. The data used for this study contained 5,044 rated conversations, with 4,811 conversations (95.3%) from unique users. We randomly selected 93 conversations as our test set, and selected an additional 10% of the remainder as our validation set for training. Table 2 reports the statistics for Training, Validation, and Test data splits.
|Rating||Training||Validation||Test|
|Rating1||593 (13.3%)||62 (12.5%)||10 (10.7%)|
|Rating2||671 (15.0%)||74 (14.9%)||11 (11.8%)|
|Rating3||811 (18.2%)||95 (19.1%)||17 (18.2%)|
|Rating4||860 (19.3%)||96 (19.3%)||19 (20.4%)|
|Rating5||1520 (34.1%)||169 (34.0%)||36 (38.7%)|
For the entire data, the standard deviation of the number of turns per conversation is 15.81, meaning our data covers a wide range of conversations, from extremely short to very long ones, with some conversations lasting over 100 turns. Interestingly, there was no strong correlation between user rating and conversation length: the Pearson correlation coefficient is 0.095, indicating essentially no linear relationship. Lastly, our system supports conversations on 15 different domains, ranging from popular domains such as Movies and Music to generic domains such as Weather and Wikipedia. Our domain classifier, described in (ConCET), achieved a micro-averaged F1 of 0.717 on our 3,000 annotated test utterances.
User rating vs user satisfaction
User rating and user satisfaction are clearly related, but they are different metrics. In a non-goal-oriented setting, user rating is very subjective and hard to generalize, especially across the five groups defined in Table 2. To simplify and better generalize our study, we propose to find a statistical relationship between user rating and user satisfaction using user feedback. We randomly selected 20 free-form feedback comments from each of the five rating groups and asked one human annotator to label each comment as satisfied or dissatisfied. The goal is to find a threshold that best splits satisfaction (SAT) and dissatisfaction (DSAT). Our annotation results are reported in Figure 3.
For the experiments in this paper, we chose to frame the problem as a binary classification task, to predict SAT (satisfaction) vs. DSAT (dis-satisfaction). There is a long tradition in evaluation literature for this approach, e.g., (kiseleva2016understanding; hancock2019learning; pred; egregious; sensitive) in order to reduce high subjectivity and noise in user ratings. The challenge is where to choose the boundary to convert the user ratings to SAT/DSAT decisions. To set the DSAT/SAT boundary, we performed a qualitative analysis of user feedback. The qualitative results indicate that for 1.0 and 2.0 rating groups, 100% of users left negative feedback based on their interactions. For the 3.0 rating group, we see a small increase in positive feedback, but still, 80% of users were dissatisfied. For 4.0 and 5.0 rating groups, only 40% and 15% of users were dissatisfied. Hence, we conclude that setting a boundary between 3.0 and 4.0 ratings will best separate dissatisfaction from satisfaction, and we define our two user satisfaction labels as DSAT (ratings <= 3.5) and SAT (ratings > 3.5). Defining SAT to correspond to ratings of over 3.5 out of 5 has an additional benefit. One important goal of online satisfaction prediction is to provide consistent and reliable reinforcement signals for tasks such as online dialogue policy learning or model tuning. For such tasks, knowing highly satisfactory (and strongly dis-satisfactory) outcomes is valuable, while intermediate “partially” satisfied signals are not helpful.
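The boundary choice can be checked with a small calculation over the reported annotation percentages. The dissatisfied counts below are derived from those percentages with 20 comments per group. Using the misclassification count as the selection criterion is an assumption made for this sketch, as the analysis in the text is qualitative.

```python
from typing import Dict

# Dissatisfied comments per rating group, out of 20 annotated comments each
# (100%, 100%, 80%, 40%, 15% as reported in the text).
DISSATISFIED = {1: 20, 2: 20, 3: 16, 4: 8, 5: 3}
GROUP_SIZE = 20


def boundary_errors(boundary: int, dissatisfied: Dict[int, int], group_size: int) -> int:
    """Comments misclassified when rating groups <= boundary map to DSAT."""
    errors = 0
    for rating, dsat_count in dissatisfied.items():
        if rating <= boundary:
            errors += group_size - dsat_count  # satisfied users labeled DSAT
        else:
            errors += dsat_count               # dissatisfied users labeled SAT
    return errors


def rating_to_label(rating: float, threshold: float = 3.5) -> str:
    """Map a 1.0-5.0 user rating to the binary label used in this paper."""
    return "SAT" if rating > threshold else "DSAT"


best = min(range(1, 5), key=lambda k: boundary_errors(k, DISSATISFIED, GROUP_SIZE))
print(best)  # 3
print(rating_to_label(3.5), rating_to_label(4.0))
```

Placing the boundary after the rating-3 group misclassifies 15 of the 100 annotated comments, fewer than any other split, which agrees with the threshold of 3.5.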
Annotating online satisfaction labels
We defined SAT and DSAT labels based on our user feedback analysis. However, user ratings were requested after the conversation ended, and do not provide online satisfaction labels. To solve this challenge, we had to annotate each turn in a consistent and reliable way. To obtain these ground truth labels, we asked two human annotators to label our 1,959 turns using the annotation guidelines below. Only the conversation transcripts data (utterances and responses) were provided during the annotation process.
Label each turn into SAT or DSAT by considering all the previous information up to the current turn.
Factors to consider are conversational depth within the current topic, conversational coherency, domain detection rate, response quality, topic diversity, ASR and other miscellaneous errors.
For offline predictions, we use the satisfaction label derived from real ratings. Hence, the number of offline samples (93) is identical to the number of dialogues (93). For online predictions, we predict on all previous turns except for the final turn, resulting in 1,866 (1,959 - 93) samples. The final SAT class proportions of the offline and online test samples are 40.9% and 56.8%, respectively. The kappa score (kappa) between the two annotators on these 1,866 samples is 0.753, showing substantial agreement. In the case of a disagreement, the final label was randomly chosen.
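Agreement between two annotators is commonly measured with Cohen's kappa, which corrects the observed agreement for agreement expected by chance. The sketch below assumes that this is the statistic reported here; the toy labels are invented for illustration.

```python
from collections import Counter
from typing import List


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["SAT", "SAT", "DSAT", "DSAT"]
annotator_2 = ["SAT", "DSAT", "DSAT", "DSAT"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on 3 of 4 turns (0.75 observed) while chance agreement is 0.5, giving a kappa of 0.5.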
4.3. Task Definition
Based on the datasets, we define three classification tasks: 1) dialogue breakdown detection; 2) online satisfaction prediction; 3) offline satisfaction prediction.
Dialogue Breakdown Detection
Given a conversation turn (i), represented by the concatenation of Uiw, Riw, Uic, Ric, and the scaled behavioral feature matrix defined in Section 3, predict the dialogue breakdown label Bipred of each turn:
where NB, PB, and B represent “not breakdown”, “possible breakdown” and “breakdown”, respectively.
Online Satisfaction Prediction
We define two states for the dialogue: DSAT for dis-satisfied (equivalent to “breakdown”) and SAT for satisfied (equivalent to “not breakdown”). Given each Ti, conditioned on previous turns, we predict the most likely binary satisfaction label Sipred of each turn:
Offline Satisfaction Prediction
Given a session of length N turns, we predict at the end (TN) of the conversation:
Note that at the last turn of the conversation, the online- and offline- prediction tasks are equivalent.
5. Experimental Setup
For Alexa data, one remaining challenge is to create large-scale training samples for online prediction since human labeling is expensive and applied only to the test set. We introduce the labeling functions we devised to heuristically create large-scale training samples. Then, an overview of baseline models and evaluation metrics is presented.
Data Programming for Alexa Training Data
Since online satisfaction annotation is extremely time-consuming, it is not feasible to generate all the necessary labels for training. Moreover, because of privacy issues with Amazon customers, we cannot outsource the annotation task to a public service like Amazon Mechanical Turk. Given the small size of human-labeled data, training on it is unrealistic. Based on these limitations, our proposed solution is to apply data programming to generate training data by using heuristic weak supervision strategies. We combine our domain heuristics to design a set of simple rule-based labeling functions (data_pro; egregious) to generate online training labels. Once large-scale training data is generated, the goal is to compare heuristic performance with proposed models to see if models can learn beyond these simple rules. The details of our labeling process are described below.
Label SAT for each engagement of depth >= 4
Label SAT for 4+ consecutive affirmation intents
Label DSAT for 4+ consecutive negation intents
For remaining unlabeled turns, use imputation
For Alexa data, the average engagement depth across domains is 2.45, ranging from 3.20 for the most popular "Movies" domain to 2.11 for the least popular "Travel" domain. Hence, we heuristically define an engagement depth of 4 or more as a signal of success. Affirmation intents indicate users agreeing to the system’s recommendations, while negation intents indicate disagreement and topic switches, which motivates the second and third rules. Lastly, the remaining unlabeled turns are imputed based on the average rating from the beginning of the conversation to the current turn $T_i$, similar to our proposed online scaling mechanism in Section 3.3. We treat SAT labels as 5.0 ratings and DSAT labels as 1.0 ratings during imputation. The label of the last turn $T_N$ always follows the real user rating.
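The four labeling rules can be sketched in code. This is a minimal illustration under several assumptions that the text does not specify: the 3.0 threshold that maps an average rating to SAT or DSAT, the rule precedence (depth first, then intent runs), labeling only the turn at which a rule fires, including imputed labels in the running average, and falling back to the final rating when no earlier turn has a label. The function name, argument names, and intent strings are hypothetical.

```python
from typing import List, Optional

SAT, DSAT = "SAT", "DSAT"
RATING = {SAT: 5.0, DSAT: 1.0}  # rating values used during imputation


def heuristic_labels(depths, intents, final_rating,
                     min_depth=4, min_run=4, threshold=3.0):
    """Weakly label every turn of one conversation.

    depths[i]    : engagement depth reached at turn i
    intents[i]   : "affirm", "negate", or "other" (hypothetical intent names)
    final_rating : real user rating given at the end of the conversation
    threshold    : assumed cut-off between DSAT and SAT on the rating scale
    """
    n = len(depths)
    labels: List[Optional[str]] = [None] * n

    # Rules 1-3: fire on deep engagement or on runs of identical intents.
    streak_intent, streak_len = None, 0
    for i in range(n):
        if intents[i] == streak_intent:
            streak_len += 1
        else:
            streak_intent, streak_len = intents[i], 1
        if depths[i] >= min_depth:
            labels[i] = SAT
        elif streak_intent == "affirm" and streak_len >= min_run:
            labels[i] = SAT
        elif streak_intent == "negate" and streak_len >= min_run:
            labels[i] = DSAT

    # Rule 4: impute unlabeled turns from the running average rating.
    ratings: List[float] = []
    for i in range(n):
        if i == n - 1:
            # The last turn always follows the real user rating.
            labels[i] = SAT if final_rating >= threshold else DSAT
        elif labels[i] is None:
            # Assumed fallback: use the final rating if nothing is labeled yet.
            avg = sum(ratings) / len(ratings) if ratings else final_rating
            labels[i] = SAT if avg >= threshold else DSAT
        ratings.append(RATING[labels[i]])
    return labels
```

For example, `heuristic_labels([1, 2, 3, 4, 1], ["other"] * 5, final_rating=2.0)` labels the fourth turn SAT through the depth rule and the last turn DSAT through the user rating.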
To measure the agreement between heuristic labeling and the human-annotated baseline, we applied these rules to our test data and computed Fleiss' kappa. The kappa score is 0.46, indicating moderate agreement. Hence, we hypothesize that these rules are reliable heuristics for generating large-scale training data. We emphasize that heuristic labeling was used to generate training data only. The test data was manually annotated by two independent internal judges.
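As a reference for how such an agreement score is obtained, the sketch below implements the standard Fleiss' kappa formula in plain Python. Treating the heuristic and one human label sequence as two raters is an assumption made for the example; the text does not state how the raters were arranged. The helper `to_counts` is hypothetical.

```python
from typing import List, Sequence


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    that assigned item i to category j. Every item must have the same number
    of raters. Raises ZeroDivisionError if all ratings fall in one category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


def to_counts(rater_a: Sequence[str], rater_b: Sequence[str],
              categories=("SAT", "DSAT")) -> List[List[int]]:
    """Turn two aligned label sequences into a per-item count table."""
    return [[int(a == c) + int(b == c) for c in categories]
            for a, b in zip(rater_a, rater_b)]


# Toy labels for illustration only, not data from the paper.
heuristic = ["SAT", "DSAT", "SAT", "DSAT", "DSAT", "SAT"]
human = ["SAT", "DSAT", "DSAT", "DSAT", "SAT", "SAT"]
kappa = fleiss_kappa(to_counts(heuristic, human))
```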
We define our first baseline method as a non-contextual bi-LSTM model (LSTM). This model only looks at the current utterance and response, which is equivalent to setting the contextual window size W to 1. As the state-of-the-art (SOTA) baseline, we use the contextual bi-LSTM (CLSTM) introduced by Hashemi et al. (sensitive), which models word-level contextual utterance representations along with conversational history. For DBDC3 data, we additionally report the best-performing entry in this challenge (KTH Entry), which is a non-contextual LSTM model combined with bag-of-words, Doc2Vec embeddings, and manual features (kth). Additionally, a heuristic labeling (HL) baseline is reported for the online satisfaction task.
For the DBDC3 task, we stay consistent with the official evaluation metrics, which are Micro-Averaged Accuracy and Macro-Averaged F1 on the breakdown label. We additionally report Precision and Recall on breakdown labels for our implemented models. For the Alexa dataset, consistent with the DBDC3 setup, we report the Micro-Averaged Accuracy and the Macro-Averaged Precision, Recall, and F1 scores over the SAT and DSAT classes.
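For readers less familiar with these metrics, the following sketch computes accuracy and macro-averaged Precision, Recall, and F1 from label sequences. Macro F1 is taken here as the unweighted mean of per-class F1 scores, which is one common convention; whether the official DBDC3 scripts use exactly this definition is an assumption. All function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of turns whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_prf(y_true, y_pred, labels):
    """Unweighted mean of per-class precision, recall, and F1."""
    scores = [per_class_prf(y_true, y_pred, label) for label in labels]
    return tuple(sum(s[k] for s in scores) / len(labels) for k in range(3))


# Toy labels for illustration only.
y_true = ["SAT", "SAT", "DSAT", "DSAT"]
y_pred = ["SAT", "DSAT", "DSAT", "DSAT"]
acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred, ("SAT", "DSAT"))
```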
For DBDC3 data, since our behavioral features are designed for our Alexa Prize system, some of the features related to latency, ASR, and detailed topic-specific features are not available. Hence, these features are excluded when training on DBDC3 data. For word encoders, the hidden dimension was set to 64 to prevent overfitting. We used softmax activation on output layers for DBDC3 data (since it is a multi-class problem) and sigmoid activation for Alexa data (more appropriate for the binary classification problem). All the other settings, including the model architecture (described in Section 3.3) remained identical.
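The difference between the two output layers can be illustrated with a small standalone sketch. It only shows how logits are turned into labels. The mapping of the sigmoid's positive class to SAT and the 0.5 decision threshold are assumptions, and the function names are hypothetical rather than part of ConvSAT.

```python
import math

BREAKDOWN_LABELS = ("NB", "PB", "B")


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_breakdown(logits):
    """Multi-class case (DBDC3): pick the label with the highest probability."""
    probs = softmax(logits)
    return BREAKDOWN_LABELS[probs.index(max(probs))]


def predict_satisfaction(logit, threshold=0.5):
    """Binary case (Alexa): a single logit, assumed to score the SAT class."""
    return "SAT" if sigmoid(logit) >= threshold else "DSAT"
```

With three output units the softmax probabilities sum to one across NB, PB, and B, while the binary task needs only a single sigmoid unit.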
6. Main Results
In this section, we compare ConvSAT to the baselines on the three tasks defined in Section 4.3.
Dialogue Breakdown Detection Results
ConvSAT significantly outperformed all the baseline models on Accuracy, Precision, Recall and F1 for the dialogue breakdown detection task, as shown in Table 3. Compared to the KTH entry, Accuracy and F1 improve by 14.7% and 36.1%, respectively. Precision and Recall for the KTH entry are left blank because the official metrics did not include them. Similarly, ConvSAT improved on the SOTA baseline by 2.4% on Accuracy and 5.5% on F1, and these improvements are statistically significant as measured by a two-tailed Student’s t-test. To ensure stability of the results and improvements, we report the mean and standard deviation of ConvSAT performance on five random test folds of 40 conversations each. Higher deviations in Recall mostly occur between the B and PB labels, indicating that the distinction between these two labels is the most challenging. Nonetheless, it is clear that leveraging sub-word information and behavioral feature matrices is beneficial for predicting failure.
| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Impr. over KTH | 14.7% | - | - | 36.1% |
| Impr. over LSTM | 10.9% | 16.1% | 15.0% | 15.8% |
| Impr. over CLSTM | 2.4% | 6.5% | 4.1% | 5.5% |
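The fold-level reporting and significance testing described for this task can be sketched as follows. This is an illustration under assumptions: the text does not say whether the t-test was paired or unpaired, so the equal-variance two-sample form is shown, and the fold scores are placeholders rather than values from the paper.

```python
import math
import statistics


def summarize(fold_scores):
    """Mean and sample standard deviation of a metric across test folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


def student_t(a, b):
    """Equal-variance two-sample Student's t statistic and degrees of freedom.

    The two-tailed p-value is then read from the t distribution with the
    returned degrees of freedom (for example via scipy.stats.ttest_ind,
    which computes both the statistic and the p-value).
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Placeholder per-fold F1 scores for two hypothetical models.
model_a = [0.61, 0.64, 0.59, 0.63, 0.62]
model_b = [0.57, 0.60, 0.58, 0.59, 0.56]
mean_a, std_a = summarize(model_a)
t_stat, dof = student_t(model_a, model_b)
```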
We highlight that there is a significant gap between the KTH entry and our re-implemented LSTM baseline (the LSTM baseline exhibits higher performance). The reason is a seemingly minor change in utterance representation. For the KTH entry in the DBDC3 challenge, each utterance was represented by averaging pre-trained Google Word2Vec embeddings, while our implementation of the LSTM baseline considers each word separately. This is significant because averaging simplifies the training process but loses the temporal relationship between words. Moreover, the KTH entry represented each turn differently from our LSTM baseline by treating each utterance and response as separate time steps. This doubles the length of the original sequence and requires inserting a dummy label for each utterance so that the number of predictions matches the length of the input. During prediction, only the three true labels were applied to each system response, ignoring the dummy label. In contrast, our LSTM baseline avoids this complexity by using two separate networks to represent the utterance and the response. As a result, since our re-implementation of the baseline LSTM-based approach (inspired by the KTH entry) exhibits substantially higher performance on all metrics on this benchmark dataset, we use our LSTM implementation as the baseline for all subsequent Alexa experiments.
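The loss of word order under averaging can be seen in a toy example. The two-dimensional embedding values below are hypothetical and chosen only for illustration; they are not Word2Vec vectors.

```python
# Hypothetical 2-d word embeddings, for illustration only.
EMB = {
    "good": [1.0, 1.0],
    "not": [-1.0, 0.0],
    "bad": [-1.0, -1.0],
}


def as_sequence(tokens):
    """Per-word representation: one vector per token, order preserved."""
    return [EMB[t] for t in tokens]


def averaged(tokens):
    """Averaged representation: a single vector, word order discarded."""
    vecs = as_sequence(tokens)
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


first = ["good", "not", "bad"]   # "good, not bad"
second = ["bad", "not", "good"]  # "bad, not good"

same_average = averaged(first) == averaged(second)           # order is lost
different_sequence = as_sequence(first) != as_sequence(second)  # order is kept
```

Two utterances with opposite meanings collapse to the same averaged vector, whereas a sequence model that reads one vector per word can still tell them apart.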
Online Satisfaction Prediction Results
ConvSAT improved over all three baseline models on the online satisfaction prediction task, as reported in Table 4, with significant improvements on all metrics. This provides strong evidence that behavioral features and character information enable significant gains in real-world conversations. Compared to our heuristic baseline, ConvSAT showed a 7.8% improvement in both Accuracy and F1. Compared to the recent SOTA baseline, ConvSAT also improved by 2.4% and 2.2% on Accuracy and F1, respectively, with all improvements statistically significant.
| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Impr. over HL | +7.8% | +8.7% | +7.5% | +7.8% |
| Impr. over LSTM | +5.8% | +4.1% | +6.9% | +7.0% |
| Impr. over CLSTM | +2.4% | +2.9% | +2.0% | +2.2% |
ConvSAT achieved 0.786 Precision, 0.865 Recall and 0.823 F1 for the DSAT label. For the SAT label, it achieved 0.804 Precision, 0.701 Recall and 0.749 F1. The standard deviations are again computed over five random test folds. This shows that correctly predicting the SAT label is harder than correctly classifying the DSAT label. Intuitively, satisfactory conditions should be more subjective than failure conditions because people can still dislike the conversation simply because the responses are boring or lack coherence. In contrast, failures come with more explicit signals, such as low ASR confidence, profane utterances and high latency.
Offline Satisfaction Prediction Results
For offline satisfaction prediction, we noticed that the general performance is lower compared to the online prediction results. This is because offline satisfaction prediction requires more complex reasoning that spans from the beginning to the end of conversations. Since our conversations have, on average, over 16 turns, we expect the decision boundaries to be more complex.
| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Impr. over LSTM | 11.4% | 8.6% | 9.8% | 11.1% |
| Impr. over CLSTM | 3.1% | 4.5% | 4.6% | 3.4% |
Nonetheless, ConvSAT significantly outperforms both baseline models. Accuracy and F1 increase by 11.4% and 11.1%, respectively, over the non-contextual LSTM, and by 3.1% and 3.4% over the contextual LSTM baseline. ConvSAT achieved 0.864 Precision, 0.667 Recall and 0.752 F1 for DSAT. For SAT labels, ConvSAT achieved 0.612 Precision, 0.833 Recall, and 0.706 F1, which follows a similar pattern to the online satisfaction results.
7. Discussion and Error Analysis
To understand the impact of different feature groups, we conducted a feature ablation study on ConvSAT by systematically removing text representation and behavioral features. Then, we present the top 10 strongest behavioral features for the Alexa dataset, followed by error analysis.
To show the effect of behavioral features and character information, we conducted an ablation study on both datasets by systematically removing these portions from ConvSAT. Table 6 shows the feature ablation results on online satisfaction and breakdown detection tasks. We used the same evaluation metrics defined for each task.
The results show that removing either behavioral features or character information decreases Accuracy and F1 on both datasets. In general, the decrease is much greater when removing behavioral features than when removing character information. This suggests that word-level information already captures most of the textual signal, and that more advanced subword representations, such as phonetic representations, should be explored in the future.
To conclude, distributional semantics are important features since they help models learn the general context. However, we claim that they are not sufficient to model complex interactions between textual data and subjective satisfaction. For instance, the phrase “I am done” can be a strong signal of dissatisfaction after recent failures. However, after several successful engagements on multiple topics, the same phrase can represent a satisfaction or topic completion signal. Using distributional semantics alone, the model is likely to generalize to the more frequent cases without learning the conversational flow effectively. Hence, we conjecture that our model successfully captures the interaction of behavioral features with semantics, resulting in significant performance improvements over semantics alone.
Importance of Behavioral Features
Since we confirmed the importance of general behavioral signals, we now delve into specific behavioral feature importance. To understand the importance of each signal, we trained a gradient boosted decision tree (GBDT) by only using the behavioral feature matrices. We selected this tree-based model because of easy interpretability and support for categorical features. We used grid search to optimize the GBDT parameters, and used 5-fold cross validation to better generalize our model. Figure 4 reports the top 10 features learned for this task, using binary logistic loss function. We trained GBDT only on online Alexa data because we have a more comprehensive set of features, and substantially larger samples compared to the DBDC3 dataset.
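A feature-importance analysis of this kind could be set up as in the sketch below. It assumes scikit-learn is installed and uses its `GradientBoostingClassifier`, which is a stand-in: the text does not name the GBDT library used, and this scikit-learn estimator does not natively handle categorical features. The data is synthetic, the parameter grid is arbitrary, and all feature names other than UserLatencyMax are hypothetical.

```python
import random

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# UserLatencyMax appears in the text; the other names are hypothetical.
FEATURES = ["UserLatencyMax", "AsrConfidenceMin", "UtteranceWordCount", "Noise"]


def make_toy_data(n=200, seed=0):
    """Synthetic data: the label depends only on the first feature."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in FEATURES]
        X.append(row)
        y.append(1 if row[0] > 0.5 else 0)  # toy rule: high latency -> DSAT (1)
    return X, y


def rank_features(X, y):
    """Grid-search a GBDT with 5-fold CV and rank features by importance."""
    grid = {"n_estimators": [20, 50], "max_depth": [2, 3]}
    # The default loss of GradientBoostingClassifier is the binary logistic loss.
    search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=5)
    search.fit(X, y)
    importances = search.best_estimator_.feature_importances_
    return sorted(zip(FEATURES, importances), key=lambda kv: kv[1], reverse=True)


ranked = rank_features(*make_toy_data())
```

On this synthetic data the ranking recovers the only informative feature; on real behavioral features the ranking would depend on the data and the chosen grid.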
The results show that maximum user latency, UserLatencyMax (the amount of time a user spends before speaking), is a strong signal for online user (dis-)satisfaction. There are several potential reasons for high user latency, such as the user not understanding the system’s response, or simply not knowing what to say after a nonsensical response. An increase in user latency also potentially compounds ASR errors, as the user’s utterance is more likely to be cut off after a long pause.
The World Cup was one of the most popular topics that people enjoyed talking about, while the emotional support topic was invoked only when the user’s mood seemed negative. These two features show that adding a topic distribution vector can help the model learn the effect of an individual topic on user satisfaction. Sentiment features were also important, since it is natural to expect a strong correlation between user sentiment and user satisfaction. The minimum ASR confidence value is another strong indicator of potential ASR failure and contributed strongly to importance. Lastly, utterance word counts and engagement (topic depth) are found to be strongly correlated with satisfaction, and profanity (user cursing) with dis-satisfaction.
Representative Error Analysis
Since we cannot release the real Alexa utterances, our error analysis is performed only on the DBDC3 data. After comparing the prediction results between ConvSAT and the SOTA baseline, we present a representative case study where both models failed. The true label is indicated inside the bracket next to the final response.
Bot: Hello, I really like movies. How about we talk about movies?
User: Sure. What’s the most recent movie you’ve seen?
Bot: Honestly, you know, they’re also different but we were talking before about the Wolf of Wall Street.
User: I liked that movie a lot. I also liked inception.
Bot: Well, you know, I’m not sure. But anyway, I’d like to chat with you. [Not Breakdown]
For this case, the CLSTM baseline predicted "breakdown" and ConvSAT predicted "possible breakdown". This is expected since many breakdown samples contain phrases such as "I am not sure", because many bots simply avoid answering when they do not understand. However, for humans, this example is acceptable since the bot acknowledged its mistake and suggested continuing the chat. We believe that such human-like reasoning is challenging for neural networks even with a more advanced contextual representation. To be successful, the representation of behavioral signals and their combination with context needs further exploration.
8. Conclusions and Future work
Conversational agents are being used widely in information search, online bookings, and almost any setting where a human interaction could be valuable. While much prior work focused on the implementation and science behind these agents, this paper focuses on developing new, automated ways to evaluate conversational agents online using contextual and behavioral clues.
We proposed a neural architecture called ConvSAT that combines these signals: 1) contextualized utterance and response representation; 2) contextualized sub-word information of utterance and response; 3) behavioral feature matrices; 4) previous conversational history. We experimented with thousands of real open-domain conversations as well as the publicly available DBDC3 dataset to conduct a large-scale study on predicting satisfaction and dialogue breakdown. Our results are promising as ConvSAT outperformed state-of-the-art baselines in all three tasks, reaching 0.79 Accuracy and F1 on the SAT class for the online satisfaction prediction task.
Our experiments demonstrate that aggregating multiple signals derived from user behavior, topic preferences, system state, distributional semantics, and conversational context is needed when designing a successful satisfaction prediction model. In addition, we presented insights derived from feature ablation and importance for these tasks, showing that latency, topical, sentiment and ASR features are strong predictors of user (dis-)satisfaction. To conclude, our new ConvSAT model of conversational satisfaction, and our experiments in online satisfaction prediction, offer promise for adaptive conversation strategies. The predicted satisfaction could be used both for offline evaluation to improve conversational systems and as online feedback for adapting the conversation to each user, enabling a new generation of more responsive and intelligent conversational agents.