Developing intelligent conversational agents [McTear2002] is a topic of great interest in Artificial Intelligence, and there are already several large-scale conversational agents on the market, such as Amazon Alexa, Google Assistant, Apple Siri, and Microsoft Cortana, that have significant numbers of users and usage [Google2015][Sarikaya2017]. Conversational agents today employ a diverse set of metrics to capture different aspects of business and user experience. For example, there are topline business metrics to track and drive the success of the business: monthly active users, dialog count per user, and downstream impact from highly valued and negatively valued actions. However, these metrics are normally not sensitive enough to detect changes in a timely fashion; they require long experimentation times and hence result in slow experiment turnaround cycles. On the other hand, there are user experience metrics that capture unhandled user requests or dissatisfying user experiences due to system errors, incomplete service coverage, or poor response quality. User experience metrics generally correlate positively with business metrics and tend to be faster moving and more sensitive than business metrics, and hence are suitable for experimentation to make data-driven decisions. Fine-grained user experience metrics are also key to providing actionable insights to conversational skill developers by correlating metric shifts with interpretable factors such as user intent and slot.
In industry, manual annotation has been widely adopted to assess user satisfaction. Annotators are given a user utterance and the system response, which is produced via various mapping techniques including recent deep neural-net based approaches in Alexa [Kim et al.2018b][Kim et al.2018a], together with the surrounding conversation history and contextual signals, and return a user satisfaction score based on annotation guidelines. Due to its offline nature and limited bandwidth, however, manual annotation is ill-suited for online monitoring and experimentation or for providing actionable insights over a broad set of use cases. Although it is becoming common to build machine-learned models trained on manual annotations to mitigate such limitations, critical challenges remain: 1) Scalability: in Alexa today, there are tens of thousands of skills built by third-party developers, and it is prohibitive to collect sufficient amounts of human annotations for all use cases; 2) Discrepancy with actual user satisfaction: annotators do not have full visibility into the user’s goal and context, and they make their best guess according to the annotation guidelines.
To address these challenges, in this paper, we explore the use of post-experience user feedback. As shown in the example dialog below, we instrumented a feedback elicitation system to ask for the user’s feedback with pre-designed prompts such as “Did I answer your question?” and to interpret the user’s response.
Unlike conventional manual annotation, user feedback is a good proxy for user satisfaction, as users know best whether Alexa provided the experience they wanted, and the amount of user feedback can easily be scaled up to several orders of magnitude larger than that of human annotation. However, frequent feedback requests can introduce significant friction in the user experience, so we prompt users cautiously at a controlled rate. A machine-learned feedback prediction model is built to produce a user satisfaction assessment when direct user feedback is unavailable.
During the course of our early exploration, we identified a few issues in collecting user feedback that prevent us from building a holistic metric using the user feedback signal alone. Not all scenarios were applicable for collecting user feedback (e.g., when a user barges in and asks for termination, asking for post-experience feedback can lead to an unnatural experience), and not all domains/skills were able to onboard the feedback collection system at the same time. This resulted in incomplete coverage of the user feedback data. In contrast, human annotation-based approaches are unobtrusive to the user experience and applicable to all situations.
Therefore, in this work, we propose a practical hybrid approach to take the best of both worlds. At a high level, our hybrid approach fuses direct user feedback and two types of predicted user satisfaction from two machine-learned models, one trained on user feedback data and the other on human annotation data. At inference time, a waterfall policy is employed for each pair of user utterance and system response: 1) We first check whether direct user feedback is available, and respect it if so; 2) Otherwise, we check whether the feedback-based prediction model is eligible and its prediction has a high confidence score. If so, we take that feedback prediction; 3) Finally, if we could not get a prediction from the prior stages, we make a prediction with the human annotation-based model. On a large-scale Amazon Alexa test dataset, our hybrid approach achieved significant improvements in precision, recall, F1-score, and PR-AUC (Precision-Recall Area Under Curve) by , , , and , respectively. Along with this performance improvement, in terms of data volume, the hybrid dataset became less dependent on human annotation, as we were able to collect user feedback data at scale. This is another benefit of our hybrid approach.
The rest of the paper is structured as follows. In section 2, we provide a brief analysis to understand the quality and traits of user feedback. In section 3, we present our proposed hybrid approach. In section 4, we describe our experimental setup and results. In section 5, we summarize related work. Finally, in section 6, we conclude with discussion and future work.
2 User feedback analysis
This section provides an analysis to understand how user feedback correlates with human annotation. The primary dataset used contains 7,447 utterances with user feedback, and the dataset was annotated by trained annotators using the same annotation guidelines as our human annotation process. In the dataset, according to the annotation work, about 35% of utterances are mapped to categories other than YES or NO. Among these other categories, the biggest bucket is SILENCE, where users did not provide any feedback. Our findings indicate that the majority of silence feedback corresponds to a satisfying experience. As a more in-depth analysis is required to fully understand the other categories, in this work, we decided to utilize only those utterances with a YES or NO feedback, which amount to 4,729 utterances. On this subset, user feedback has an agreement rate with human annotation of 97.4% and a Cohen’s kappa of 0.7877 (‘substantial’ agreement according to the typical kappa interpretation; see Cohen’s kappa coefficient, https://en.wikipedia.org/wiki/Cohen%27s_kappa). Note that the high agreement rate between user feedback and human annotation is partly due to its bias toward satisfaction feedback, as we do not have feedback elicitation opportunities when users barge in and ask for termination, which are strongly correlated with user dissatisfaction. To compensate for this limitation, our hybrid approach supplements user feedback with human annotation when user feedback is ineligible.
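To make the two agreement statistics concrete, the following minimal sketch computes the raw agreement rate and Cohen's kappa from a 2x2 table of user feedback versus human annotation. The function name and the counts in the usage note are hypothetical and are not the paper's data; they only illustrate why a skewed label distribution can yield a very high agreement rate together with a lower kappa.

```python
def agreement_and_kappa(both_yes, fb_yes_ann_no, fb_no_ann_yes, both_no):
    """Return (agreement rate, Cohen's kappa) for a 2x2 table.

    Arguments are counts of utterances, indexed by the user feedback label
    (fb) and the human annotation label (ann). Names are illustrative.
    """
    total = both_yes + fb_yes_ann_no + fb_no_ann_yes + both_no
    # Observed agreement: fraction of utterances where both labels match.
    observed = (both_yes + both_no) / total
    # Marginal probability of a YES label for each rater.
    fb_yes = (both_yes + fb_yes_ann_no) / total
    ann_yes = (both_yes + fb_no_ann_yes) / total
    # Agreement expected by chance if the two raters were independent.
    expected = fb_yes * ann_yes + (1 - fb_yes) * (1 - ann_yes)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

For example, with hypothetical counts `agreement_and_kappa(900, 20, 10, 70)`, most utterances are YES for both raters, so chance agreement is already high and kappa comes out noticeably lower than the raw agreement rate.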
This section first describes a deep neural model that we designed to build predictive models for both user feedback and human annotation, and then provides the details of our hybrid approach to fuse several inputs of user satisfaction assessment.
3.1 Deep neural model for user satisfaction prediction
Before diving into modeling details, we first introduce some terminology. We define a pair of user utterance and Alexa response as a turn. Figure 1 shows an example dialog consisting of two turns, eliciting user feedback. We call the first turn, in which Alexa asks “Did I answer your question?”, the targeted turn, and the second turn the answering turn. Besides the user utterance and Alexa response, each turn also has some meta information, e.g., the timestamp of the turn, the conversational skill invoked to handle the turn, and whether an active screen was available when the turn happened. (Due to confidentiality reasons, we are not allowed to disclose the exact meta information features.) Thus, we represent a turn as

$$t = (u, r, m)$$

where $u$ is the user input text, $r$ is the Alexa response text, and $m$ is a list of meta information features. To capture contextual cues from the surrounding turns, such as the user’s rephrasing patterns and barge-in patterns, we consider the dialog session including other turns around the targeted and answering turns. Rigorously, we define a session as a maximal list of turns such that any two adjacent turns have a time gap less than or equal to $\tau$ minutes. (“Maximal” means that if a session is a sub-list of some longer list of turns, then that list must have two adjacent turns whose time gap is greater than $\tau$.) We denote a session as

$$s = (t_1, t_2, \ldots, t_{n_s})$$

where $n_s$ represents the number of turns in session $s$. Our dataset, consisting of a set of session and label pairs, can then be represented as

$$D = \{(s_i, y_i)\}_{i=1}^{N}$$

where $y_i$ denotes the binary user satisfaction with respect to the targeted turn in session $s_i$ and $N$ is the number of sessions. Given session information $s_i$, the model $f$ produces a prediction score

$$\hat{y}_i = f(s_i) \in [0, 1]$$

as the probability that the user is dissatisfied. Note that we treat dissatisfaction as our primary class since accurately detecting defective experiences offers greater value for us to improve downstream components. Finally, some meta features, such as device type, are shared across turns in the same session. We extract them from turns and refer to them as session-level features. Thus, at a high level, our model contains features of five types: turn-level textual features, turn-level categorical features, turn-level numerical features, session-level categorical features, and session-level numerical features.
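The session definition (a maximal run of turns whose adjacent time gaps stay within a threshold) can be sketched as a single pass over time-ordered turns. This is an illustration under stated assumptions: the `Turn` fields, the function name, and the threshold value in the usage note are hypothetical, and the real turn representation also carries meta information features that are not disclosed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    timestamp_min: float  # minutes since an arbitrary origin (assumed unit)
    user_text: str
    response_text: str


def split_into_sessions(turns: List[Turn], max_gap_min: float) -> List[List[Turn]]:
    """Group turns into maximal sessions by adjacent time gap."""
    sessions: List[List[Turn]] = []
    for turn in sorted(turns, key=lambda t: t.timestamp_min):
        # Extend the current session only if the gap to its last turn
        # does not exceed the threshold; otherwise start a new session.
        if sessions and turn.timestamp_min - sessions[-1][-1].timestamp_min <= max_gap_min:
            sessions[-1].append(turn)
        else:
            sessions.append([turn])
    return sessions
```

Because every break happens exactly where a gap exceeds the threshold, each resulting session is maximal in the sense defined above. For instance, with a hypothetical threshold of 2 minutes, turns at minutes 0, 2, 10, and 11.5 form two sessions of two turns each.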
Taking these features as input, our model employs a deep-wide style network [Cheng et al.2016] to accommodate both structured and unstructured input with a multi-layered structure to perform encoding at both turn and session level as shown in Figure 2.
For each turn, we process the user input text and the Alexa response text separately by first converting them to word embeddings and then applying a GRU layer followed by an Attention layer to encode the sequence of word embeddings into a high-dimensional sentence vector. We then represent the turn by concatenating the sentence vectors of the user input and the Alexa response, the embeddings of the turn-level categorical features, and the encoding of the turn-level numerical features. For numerical feature encoding, we pass the numerical features into a nonlinear dense layer to map them into a high-dimensional vector. Next, we encode the sequence of turn-level encodings with another GRU layer and Attention layer and concatenate the resulting vector with the encoding of the session-level numerical features and the embeddings of the session-level categorical features. Finally, we pass the session encoding through two nonlinear dense layers followed by a sigmoid layer to produce the prediction score.
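The role of the Attention layer after each GRU can be illustrated with a small pooling sketch: it turns a variable-length sequence of hidden vectors into one fixed-size vector via a softmax-weighted sum. The text does not specify the attention variant, so the dot-product scoring against a single query vector used here is an assumption, as are the function and parameter names; in a trained model the hidden states would come from the GRU and the query would be learned.

```python
import math
from typing import List, Tuple


def attention_pool(hidden_states: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Pool a sequence of vectors into one vector with softmax attention."""
    # Score each hidden state by its dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax over the sequence positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of hidden states, one output value per dimension.
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights
```

The same pooling shape applies at both levels described above: over word positions within a turn, and over turn encodings within a session.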
With our deep neural architecture, we build two user satisfaction predictors: 1) a user feedback prediction model (FP) trained on user feedback data; 2) a fallback prediction model (HP) trained on human annotation data. Note that to make the user feedback prediction model more generalizable, we remove the feedback prompt and answering turn from the session data for both training and evaluation.
3.2 Hybrid approach
The goal of our hybrid approach is to accurately predict user satisfaction by fusing different types of candidate inputs for assessing user satisfaction, as depicted in Figure 3. Broadly, there are three different candidate inputs captured in the prediction layer, namely explicit user feedback, inferred user satisfaction, and skill-provided assessment:
Explicit user feedback: Since user satisfaction is not directly observable, we ask users for post-experience feedback as the best proxy. This is the most direct method of determining whether the experience was satisfying; however, there are some gaps: 1) frequent feedback requests introduce friction, so we should use them cautiously at a controlled rate; 2) the feedback is biased toward positive feedback, as we do not have an opportunity to collect feedback when there is a barge-in or early termination, both of which are strong indicators of a negative user experience; 3) its coverage is currently limited to a set of whitelisted experiences.
Inferred user satisfaction: User feedback is not always available. A predictive model allows us to measure user satisfaction even when user feedback is not directly collected. To build accurate machine-learned predictors, in the feature layer, we consider various input features such as conversation history, contextual features, domain signals, user profile, historical features, and external knowledge. Specifically, we utilize the FP and HP models described in Section 3.1.
Skill-provided assessment: While there are several implicit indicators of user dissatisfaction that are skill agnostic, assessing positive user satisfaction often requires knowledge of the target skill and access to skill-specific signals. For example, in the media consumption domain, 30-second playback is commonly used as an implicit signal of user satisfaction. In a map/navigation application, we may declare success if we see no changes to the destination or route cancellation within a certain time interval (e.g., 15 seconds). In a ticket-booking application, we may use a booking confirmation signal, followed by the absence of cancellation within a certain time interval (e.g., 5 minutes). Incorporation of skill-provided assessment is out of scope, and we leave it for future work.
As we are at an early stage of leveraging heterogeneous prediction sources, our fusion layer follows a simple waterfall policy to determine whether Alexa’s action/response was satisfying:
1) If explicit user feedback is present and interpretable, then determine user satisfaction based on the feedback. Otherwise, go to the next step.
2) If the FP model (the feedback prediction model) is eligible for the experience, predict user satisfaction with the FP model, which is trained to predict user feedback. If the prediction confidence is high, then determine user satisfaction based on the prediction. Otherwise, go to the next step. The confidence threshold is tuned on a separate development dataset.
3) As a fallback, predict user satisfaction with the HP model (the fallback prediction model), which is trained on human annotation to cover all types of experiences.
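The waterfall policy can be sketched as a short function. This is a minimal illustration under stated assumptions, not the production fusion layer: the function and parameter names are hypothetical, model scores are taken to be dissatisfaction probabilities, "confidence" is assumed to mean the distance of the FP score from the decision boundary (`max(p, 1 - p)`), and the threshold values are placeholders rather than the tuned ones.

```python
from typing import Optional, Tuple


def fuse(explicit_feedback: Optional[bool],
         fp_eligible: bool,
         fp_score: Optional[float],
         hp_score: float,
         confidence_threshold: float = 0.9,   # placeholder, tuned on dev data in practice
         decision_threshold: float = 0.5) -> Tuple[bool, str]:
    """Return (is_dissatisfied, source) following the waterfall policy.

    explicit_feedback: True if the user answered YES (satisfied), False if
    NO, None if no interpretable feedback is available.
    fp_score / hp_score: predicted probability of dissatisfaction.
    """
    # Step 1: explicit, interpretable user feedback always wins.
    if explicit_feedback is not None:
        return (not explicit_feedback, "EFB")
    # Step 2: use the FP model only when eligible and confident.
    if fp_eligible and fp_score is not None:
        confidence = max(fp_score, 1.0 - fp_score)  # assumed confidence definition
        if confidence >= confidence_threshold:
            return (fp_score >= decision_threshold, "FP")
    # Step 3: fall back to the HP model, which covers all experiences.
    return (hp_score >= decision_threshold, "HP")
```

Returning the source alongside the decision makes it easy to measure how much traffic each stage of the waterfall actually handles.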
In this section, we describe the data sets and experimental setup, and present the experimental results of our hybrid approach.
4.1 Data sets
For our experiments, we whitelisted 43 Alexa intents and collected post-experience user feedback at a 0.01% sampling rate. As a result, we collected 1.3 million data points, which are split into training, validation, and test sets with , , and ratios. The training and validation sets were used to train the FP model, and the test set was held out and used only for reporting. For human annotation data, we took a recent chunk of historical Alexa experience annotations that amounts to 0.5 million data points. This data set covers all intents and is split into training, validation, and test sets with , , and ratios. Note that the feedback dataset is larger than the human annotation dataset, as the feedback collection process is much faster and cheaper than the human annotation process.
Due to the limitations in user feedback coverage, we designed an algorithm to build a composite ground-truth test set which weaves user feedback data together with human annotation data. The basic idea is to use collected user feedback as ground truth for traffic segments that are eligible for feedback elicitation and to use human annotation for all other traffic segments. The ineligible traffic segments include the following:
- Intents that are not whitelisted for feedback elicitation.
- Barge-in or termination requests from users.
- Unhandled requests, where Alexa says “I’m sorry…” or gives no response.
- Feedback types other than YES or NO.
We compute the proportions of each case in the live traffic and bring in human annotation data for the corresponding amounts. The detailed ground-truth test set construction algorithm is listed in Algorithm 1.
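Since Algorithm 1 is not reproduced here, the following is only a rough sketch of the idea described above: keep user feedback as ground truth for eligible traffic, and sample human-annotated items for each ineligible segment in proportion to that segment's share of live traffic. All names, the segment keys, and the traffic shares are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def is_feedback_eligible(intent: str, barge_in_or_termination: bool,
                         unhandled: bool, feedback_label: str,
                         whitelisted_intents: set) -> bool:
    """Check the four ineligibility conditions listed above."""
    return (intent in whitelisted_intents
            and not barge_in_or_termination
            and not unhandled
            and feedback_label in ("YES", "NO"))


def build_composite_test_set(feedback_items: List,
                             annotation_by_segment: Dict[str, List],
                             traffic_share: Dict[str, float],
                             seed: int = 0) -> List[Tuple[object, str]]:
    """Mix feedback and annotation items to mirror live-traffic proportions.

    traffic_share maps "eligible" and each ineligible segment name to its
    fraction of live traffic (assumed to sum to 1).
    """
    rng = random.Random(seed)
    # Size of the full composite set implied by the eligible share.
    total = len(feedback_items) / traffic_share["eligible"]
    composite = [(item, "feedback") for item in feedback_items]
    for segment, pool in annotation_by_segment.items():
        wanted = round(total * traffic_share[segment])
        # Sample without replacement, capped by the available annotations.
        sampled = rng.sample(pool, min(wanted, len(pool)))
        composite += [(item, "annotation") for item in sampled]
    return composite
```

The fixed seed keeps the sampled test set reproducible across evaluation runs.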
4.2 Experiment results
Our experiment results are presented in Figure 4. To demonstrate the effectiveness of leveraging user feedback, we compare the following three approaches: (1) HP: relying solely on the HP model; (2) EFB + HP: fusion of explicit user feedback and HP model prediction according to our hybrid approach; and (3) EFB + FP + HP: fusion of explicit user feedback, FP model prediction, and HP model prediction according to our hybrid approach. Note that whenever users provide explicit feedback, our hybrid approach takes it as output instead of making any inference (based on the fusion approach described above), meaning that whenever user feedback is explicitly given, our hybrid approach trivially makes the right prediction. Thus, evaluating our hybrid approach requires a parameter that controls the rate at which we assume users provide explicit feedback. Specifically, given a feedback collection rate, we mark user feedback instances as “given by user” in the ground-truth test data until the rate is met. Then, for the marked instances, our hybrid approach takes the associated user feedback as its prediction. In our experiments, we varied the feedback collection rate over the following values: , , and . is the feedback collection rate we chose to use for the experiment, and the other two rates are hypothetical for projective purposes.
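The marking procedure used to simulate a feedback collection rate can be sketched as follows. This is an illustration under assumptions: the dictionary keys and function names are hypothetical, and the rate is interpreted here as a fraction of the whole test set, which the text does not state explicitly.

```python
from typing import Dict, List, Optional


def mark_given_feedback(instances: List[Dict], collection_rate: float) -> List[Dict]:
    """Mark feedback-bearing instances as "given by user" until the rate is met.

    Each instance is a dict with a "user_feedback" key holding "YES", "NO",
    or None when no usable feedback exists (hypothetical schema).
    """
    target = round(collection_rate * len(instances))
    marked = 0
    result = []
    for instance in instances:
        given = instance["user_feedback"] is not None and marked < target
        if given:
            marked += 1
        # Copy the instance so the input list is left untouched.
        result.append({**instance, "given_by_user": given})
    return result


def hybrid_prediction(instance: Dict, model_prediction: str) -> Optional[str]:
    """Use the explicit feedback for marked instances, else the model output."""
    if instance["given_by_user"]:
        return instance["user_feedback"]
    return model_prediction
```

Instances that carry feedback but are left unmarked are treated as if the user had never been prompted, so the models must predict them like any other traffic.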
The micro-averaged results at feedback collection rate of clearly demonstrate a large gain that our proposed hybrid approach (i.e. EFB + FP + HP) brings in. Compared to a conventional approach (i.e. HP, a prediction model trained on human annotation data), precision, recall, F1-score, and PR-AUC (Precision-recall area under curve) metrics are improved by , , , and , respectively. Looking at the results of HP and EFB + HP, it is worth mentioning that explicit user feedback barely moves metrics at our current feedback collection rate as the small amount of collected user feedback is easily diluted by the enormous amount of traffic covered by the HP model. This, in turn, signifies the generalization power that the FP model offers, beyond the sparse user feedback samples, enabling us to make accurate predictions for those experiences that are eligible for feedback elicitation but not triggered for elicitation.
The proposed hybrid model also outperforms the other approaches in macro-averaged results, as shown in the Macro Average section of the table. Macro Average means that we perform two-stage averaging, where we first calculate micro-averaged metrics per domain and then take a simple average over all the domains. To avoid distortion in the macro-averaged results due to long tails, we selected the top 20 domains that covered of the test set. The smallest standard deviation of the proposed hybrid approach indicates that our approach predicts user satisfaction more consistently across domains than the other approaches, which is a critical property for allocating fair amounts of traffic to each domain according to its service quality.
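The difference between the micro-averaged and the two-stage macro-averaged metric, together with the cross-domain standard deviation, can be sketched for F1 as follows. The function names and the per-domain counts in the usage note are hypothetical; the positive class is dissatisfaction, as in the rest of the paper, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive (dissatisfied) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_and_macro_f1(counts_by_domain: Dict[str, Tuple[int, int, int]]):
    """Return (micro F1, macro F1, std of per-domain F1).

    counts_by_domain maps a domain name to its (tp, fp, fn) counts.
    """
    # Micro: pool the counts of all domains, then compute the metric once.
    pooled = [sum(c[i] for c in counts_by_domain.values()) for i in range(3)]
    micro_f1 = f1_from_counts(*pooled)[2]
    # Macro: compute the metric per domain, then take a simple average.
    per_domain = [f1_from_counts(*c)[2] for c in counts_by_domain.values()]
    return micro_f1, mean(per_domain), pstdev(per_domain)
```

With hypothetical counts such as `{"a": (8, 2, 2), "b": (1, 1, 3)}`, the large domain dominates the micro average, while the macro average weights both domains equally and the standard deviation exposes the gap between them.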
With two hypothetical feedback collection rates at and , one can clearly see how the increased amount of explicit user feedback impacts the accuracy of our hybrid approach, as shown in the middle and bottom tables. As expected, as we collect more feedback, our hybrid approach makes more accurate predictions and performs in a more consistent fashion across all the domains. Although a blind increase of feedback elicitation rate can cause significant friction in user experience, higher feedback elicitation rates can be safely applied to some targeted segments of traffic without the risk of incurring bad user experience.
5 Related work
One of the conventional approaches to evaluating the quality of intelligent assistant systems is to measure the relevancy of the system response using IR measures such as precision and cumulative gain measures such as NDCG [Järvelin and Kekäläinen2002, Saracevic et al.1988]. This approach, however, requires human judgement for the relevancy measures, which is generally costly and hard to scale. Moreover, such relevancy-based metrics often do not capture a holistic view of system performance such as user satisfaction. To overcome this, some prior works in the search domain have studied users’ behavioral patterns to infer their satisfaction level with respect to search results [Ageev et al.2013, Jiang et al.2015, Hassan and White2013, Kim et al.2013]. There has also been an attempt to understand the relationship between search engine effectiveness and user satisfaction [Azzah Al-Maskari and Clough2007].
In the area of spoken dialogue systems, PARADISE [Walker et al.1997] proposed a framework for evaluating spoken dialogue, specifying the relative contribution of various factors via a linear regression model. For modern intelligent assistants, there was an in-depth study of user satisfaction that classified user-system interaction patterns into several categories [Kiseleva et al.2016]. Another work proposed a research agenda on context-aware user satisfaction estimation for mobile interactions using gesture-based signals [Kiseleva and de Rijke2017]. Two further works [Bodigutla et al.2019a][Bodigutla et al.2019b] estimated conversation quality via user satisfaction estimation. However, most prior works are annotation intensive. One interesting work pointed out the issues with annotation-based approaches [Aroyo and Welty2015], which aligns with our motivation toward feedback-based user satisfaction estimation approaches, and even further with our hybrid approach. The ability to accurately predict user satisfaction enables a conversational agent to evolve in a self-learning manner. An overview article [Sarikaya2017] on personal digital assistants discussed user experience prediction using customer feedback. A recent work on Alexa showed how Alexa learns to fix speech recognition and language understanding errors by leveraging an automated user satisfaction predictor without requiring any human supervision [Ponnusamy et al.2020].
In this work, we proposed an effective hybrid approach that outperforms conventional approaches based solely on human annotation in the user satisfaction prediction problem. Starting from the limitations of approaches based on human annotation, we were motivated to utilize direct user feedback, which is not only more direct in capturing user satisfaction but also more scalable and cost-effective. Our hybrid approach fuses, via a waterfall policy, explicit user feedback and user satisfaction predictions inferred by two machine-learned models, one trained on user feedback data and the other on human annotation data, resulting in significant improvements in performance metrics. The hybrid model also achieved the most consistent performance across domains, which is another strength. Our proposed approach has been verified with Alexa, and we believe it can be extended to other conversational systems and text-based chatbot applications. In future work, we will extend the fusion layer of our hybrid approach by leveraging machine learning methods.
- [Ageev et al.2013] Mikhail Ageev, Dmitry Lagun, and Eugene Agichtein. 2013. Improving search result summaries by using search behavior data. In Proceedings of the 36th annual international ACM SIGIR conference on Research and development in information retrieval, pages 13–22.
- [Aroyo and Welty2015] Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1).
- [Azzah Al-Maskari and Clough2007] Azzah Al-Maskari, Mark Sanderson, and Paul D. Clough. 2007. The relationship between IR effectiveness measures and user satisfaction. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 773–774.
- [Bodigutla et al.2019a] Praveen K. Bodigutla, Lazaros Polymenakos, and Spyros Matsoukas. 2019a. Multi-domain conversation quality evaluation via user satisfaction estimation. In NeurIPS 3rd Conversational AI Workshop.
- [Bodigutla et al.2019b] Praveen K. Bodigutla, Longshaokan Wang, Kate Ridgeway, Joshua Levy, Swanand Joshi, Alborz Geramifard, and Spyros Matsoukas. 2019b. Domain-independent turn-level dialogue quality evaluation via user satisfaction estimation. In Proceedings of SIGDial 2019 Conference.
- [Cheng et al.2016] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, and Mustafa Ispir. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10.
- [Google2015] Google. 2015. Teens use voice search most, even in bathroom, Google’s mobile voice study finds.
- [Hassan and White2013] Ahmed Hassan and Ryen W. White. 2013. Personalized models of search satisfaction. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 2009–2018.
- [Järvelin and Kekäläinen2002] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems, 20.
- [Jiang et al.2015] Jiepu Jiang, Ahmed Hassan Awadallah, Xiaolin Shi, and Ryen W. White. 2015. Understanding and predicting graded search satisfaction. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining, pages 57–66.
- [Kim et al.2013] Youngho Kim, Ahmed Hassan, Ryen W. White, and Yi-Min Wan. 2013. Playing by the rules: mining query associations to predict search performance. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, pages 133–142.
- [Kim et al.2018a] Young-Bum Kim, Dongchan Kim, Joo-Kyung Kim, and Ruhi Sarikaya. 2018a. A scalable neural shortlisting-reranking approach for large-scale domain classification in natural language understanding. In Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) 2018, pages 2214–2224.
- [Kim et al.2018b] Young-Bum Kim, Dongchan Kim, Anjishnu Kumar, and Ruhi Sarikaya. 2018b. Efficient large-scale neural domain classification with personalized attention. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2214–2224.
- [Kiseleva and de Rijke2017] Julia Kiseleva and Maarten de Rijke. 2017. Evaluating personal assistants on mobile devices. In Proceedings of the 1st International Workshop on Conversational Approaches to Information Retrieval.
- [Kiseleva et al.2016] Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, pages 121–130.
- [McTear2002] Michael F. McTear. 2002. Spoken dialogue technology: enabling the conversational user interface. ACM Computing Surveys, 34(1).
- [Ponnusamy et al.2020] Pragaash Ponnusamy, Alireza Roshan-Ghias, Chenlei Guo, and Ruhi Sarikaya. 2020. Feedback-based self-learning in large-scale conversational ai agents. In Proceedings of the 34th AAAI Conference on Artificial Intelligence.
- [Saracevic et al.1988] Tefko Saracevic, Paul Kantor, Alice Y. Chamis, and Donna Trivison. 1988. A study of information seeking and retrieving: I. background and methodology. II. users, questions and effectiveness. III. searchers, searches, overlap. Journal of the American Society for Information Science, 39:161–176; 177–196; 197–216.
- [Sarikaya2017] Ruhi Sarikaya. 2017. The technology behind personal digital assistants: An overview of the system architecture and key components. IEEE Signal Processing Magazine, 34.
- [Walker et al.1997] Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 271–280, Madrid, Spain. Association for Computational Linguistics.