The spoken language understanding (SLU) task of a large-scale conversational agent goes through a set of components such as automatic speech recognition (ASR), natural language understanding (NLU) including domain classification, intent classification, and slot-filling [Tur2011], and skill routing (SR) [YBKim2018b, JKKim2020], which is responsible for selecting an appropriate service provider to handle the user request.
When a user interacts with the agent, the underlying systems may not be able to understand what the user actually wants if the utterance is ambiguous. Ambiguity comes from ASR when audio cannot be recognized correctly (e.g., audio quality issues can cause ASR to confuse “Five minute timer” and “Find minute timer”); it comes from NLU when the user’s request cannot be interpreted correctly (e.g., “Garage door” could mean open it or close it); it comes from SR when it is not possible to confidently select the best experience between multiple valid service providers (e.g., “Play frozen” can mean playing video, soundtrack, or game), and so on. Ignoring such ambiguities from upstream components can pass incorrect signals to the downstream and lead to an unsatisfactory user experience.
Figure 1 (U = user, S = system):

Correct recognition (clarification unnecessary):
U: Set a timer for 15 minutes
S: 15 minutes, right?
S: 15 minutes, starting now.

Incorrect recognition (clarification necessary):
U: Set a timer for 50 minutes
S: 50 minutes, right?
U: No, 15 minutes
S: 15 minutes, starting now.
Thus, when the agent is unsure due to ambiguity, it should engage in a clarifying dialog before taking actions. However, asking clarifying questions for all the detected ambiguous utterances can end up spamming users with too many redundant questions, resulting in poor user experiences as shown in Fig. 1.
There are various studies about how to compose clarifying questions (e.g., asking either reprise questions for targeted clarification or generic questions such as repeat or rephrase requests, depending on the ambiguity context) and their effectiveness in SLU systems when ambiguities exist [Gabsdil2003, liu2014, Stoyanchev2014, Kiesel2018]. Also, clarifying questions on other tasks such as Q&A [Rao2018, Xu2019, Kumar2020, Min2020] and information retrieval [Aliannejadi2019, Zamani2020, Padmakumar2021] are being actively studied. However, none of them specifically focus on initiating clarifying interactions only when necessary in the interest of preventing user experience degradation, which is crucial in large-scale SLU systems.
To address the issue of deciding whether to ask clarifying questions in large-scale SLU systems, we propose a unified neural self-attentive model that makes a global decision on whether to trigger a clarifying question considering ambiguity occurrence information and various contextual signals. We show that the self-attentive representations of the top hypothesis and the aggregated alternative hypotheses from a hypothesis reranker [Robichaud2014, Khan2015, YBKim2018b, JKKim2020] are effective for dealing with the ambiguities from SLU.
Given the fact that a large-scale conversational system supports various devices, languages, and application components, providing access to a wide variety of skills [Kumar2017, YBKim2018a, JKKim2018], it is not scalable to rely on manually annotated data to train and evaluate the model. Instead, we leverage a user satisfaction model, which has recently attracted significant attention in both academia and industry [Hashimoto2018, Hancock2019, Park2020], to generate ground-truth labels at scale. The user satisfaction model we use [Park2020] marks defective turns by examining the input utterance, the system response, and the user’s implicit/explicit feedback of their following turns. Having turn-level defect labels, our model is supervised to not trigger a clarifying question when the agent is likely to deliver satisfying experience even if ambiguities are detected from the upstream SLU components.
In this paper, we define five ambiguity types that are popular in the SLU task (see Section 2.2), conduct extensive experiments using real data from a large-scale commercial conversational system, and demonstrate significant improvements over several baseline approaches in reducing unnecessary clarifying interactions.
2 Ambiguities in SLU
Our task is to determine whether to ask clarifying questions when ambiguity signals are captured by any SLU components in the user utterance. In this section, we describe how an input utterance is interpreted with different hypotheses in SLU, the ambiguity types we are dealing with, and how ground-truths of the ambiguous utterances are assigned in our work.
2.1 Hypothesis representations
In large-scale SLU, it is common to represent various possible interpretations of an input utterance as hypotheses, each of which contains the outputs and the confidence scores of upstream components such as ASR, domain, intent, and slot-filling results [Robichaud2014, Khan2015, YBKim2018b, JKKim2020]. Given the hypothesis list as the input, a hypothesis ranker (HypRank) is used to rank the input hypotheses. Table 1 shows an example of the hypotheses from HypRank.
Then, given the HypRank output, a ranked hypothesis list, the top-ranked hypothesis is chosen as the final SLU decision for unambiguous utterances. For ambiguous utterances, since it is uncertain whether the top hypothesis is the right one, clarification interactions allow the user to choose a non-top (alternative) hypothesis instead. In this work, we focus on deciding whether clarification is necessary for ambiguous utterances.
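To make the hypothesis representation concrete, here is a minimal sketch of what a HypRank hypothesis might look like as a data structure. The field names and example values are illustrative assumptions, not the system's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    # One SLU interpretation of an utterance, as produced by upstream
    # components (field names are assumed for illustration).
    transcript: str     # ASR output
    asr_conf: float     # ASR confidence
    domain: str
    intent: str
    slots: dict         # slot name -> slot value
    hyp_conf: float     # HypRank confidence

def top_hypothesis(ranked):
    """For unambiguous utterances, the top-ranked hypothesis
    is taken as the final SLU decision."""
    return ranked[0]

ranked = [
    Hypothesis("set a timer", 0.95, "Timer", "SetTimer",
               {"duration": "15 min"}, 0.9),
    Hypothesis("set a timer", 0.95, "Timer", "CancelTimer", {}, 0.4),
]
print(top_hypothesis(ranked).intent)  # -> SetTimer
```

For ambiguous utterances, the clarification dialog effectively lets the user promote one of the non-top entries of `ranked` instead.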
2.2 Ambiguity Types
We first define five common types of ambiguities in SLU as follows:
| Ambiguity | User utterance | Potential clarifying question | User response | System response or <action> |
| ASR | Set a thirty minute timer | Do you mean thirteen or thirty? | Thirteen | Start a thirteen minute timer |
| IC | Turn on off | Do you mean turn on or turn off? | Off | <Turn off the agent> |
| HC | Get me a ride | Do you want Uber or Lyft? | Uber | Finding an Uber driver |
| SNR | Turn on the ligXXX (noisy) | Sorry, could you repeat it? | Turn on the light (clear) | <Turn on the light> |
| TRUNC | Turn on the | Sorry, turn on what? | The fan | <Turn on the fan> |
ASR: Two ASR outcomes are regarded as ambiguous when their edit distance is 1, their ASR confidences are close, and they produce different slot values. e.g., “thirteen minutes” vs “thirty minutes”.
Similar Intent Confidences (IC): The intent of an utterance is ambiguous, e.g., “turn on off” is ambiguous since both TurnOnIntent and TurnOffIntent can receive high confidence from the intent classifier.
Similar Hypothesis Confidences (HC): The final hypothesis confidences from HypRank [YBKim2018b, JKKim2020] are similar. e.g., “get me a ride” can have similar confidences for the hypotheses associated with UBER and LYFT as the service providers.
Signal to Noise Ratio (SNR): When the acoustic noise level is very high, it is not clear whether we can trust the ASR output even if the ASR confidence is sufficiently high.
Utterance Truncation (TRUNC): An utterance can be recognized with its later tokens missing due to slow speaking or ASR errors, e.g., if a user said “Music composed by Mozart” but only “Music composed by” is recognized, the missing tokens should be clarified. In this work, we regard utterances ending with articles (“a”, “an”, “the”), some possessive pronouns (e.g., “my”), or some prepositions (e.g., “by”) as truncated.
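The definitions above can be sketched as simple rule-of-thumb detectors. The thresholds, margins, and the truncation word list below are illustrative placeholders, not the values used in the deployed system:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def asr_ambiguous(h1, h2, conf_margin=0.1):
    """ASR ambiguity: edit distance 1, close ASR confidences,
    and different slot values (margin is an assumed value)."""
    dist = edit_distance(h1["text"].split(), h2["text"].split())
    close = abs(h1["conf"] - h2["conf"]) <= conf_margin
    return dist == 1 and close and h1["slots"] != h2["slots"]

def similar_confidences(scores, threshold=0.8, margin=0.05):
    """IC/HC-style ambiguity: at least two candidates are both
    confident and close to each other (illustrative thresholds)."""
    top = sorted(scores, reverse=True)
    return len(top) >= 2 and top[0] >= threshold and top[0] - top[1] <= margin

# Hypothetical word list; the paper specifies only examples of each class.
TRUNC_ENDINGS = {"a", "an", "the", "my", "by"}

def is_truncated(utterance):
    """TRUNC ambiguity: utterance ends with an article, a possessive
    pronoun, or a preposition."""
    tokens = utterance.lower().split()
    return bool(tokens) and tokens[-1] in TRUNC_ENDINGS
```

For example, `asr_ambiguous` fires on the “thirteen minutes” vs. “thirty minutes” pair, `similar_confidences` on the near-tied TurnOnIntent/TurnOffIntent scores, and `is_truncated` on “music composed by”.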
Table 2 shows the clarifying dialog examples for different ambiguity types, which demonstrates how clarifying questions can help resolve the ambiguities.
2.3 Ground-truth Labeling
It is difficult to decide whether a clarifying question would be helpful when ambiguities exist, since each ambiguity has a different occurrence condition, multiple ambiguities can co-occur, and the top prediction is often correct even when ambiguities are detected. In this work, we regard ambiguous utterances with unsatisfactory results as those needing clarification, and vice versa. The rationale is that if a user is unsatisfied with the top hypothesis predicted by HypRank when ambiguities exist, the user could have been satisfied had they been allowed to choose another hypothesis.
We use the log data from a conversational agent system, where each utterance is assigned to be either satisfactory or unsatisfactory by a user satisfaction metric [Park2020, Hashimoto2018, Hancock2019]. Specifically, we use a model-based metric described in [Park2020], which utilizes the current turn’s utterance and the response as well as the follow-up turn’s utterance to judge whether the current turn is satisfactory or not.
Our labeling method is a weak-supervision approach that assumes no clarifying questions exist in the log data. If the log data already included turns with clarifying questions, we could identify whether each question was helpful. For example, if a user selected the top hypothesis during the clarification, the question was unnecessary, since the top hypothesis could have been chosen without it. Conversely, if the user chose another hypothesis, the clarification was useful in avoiding the unwanted top hypothesis. Formally, the ground truths could be set with counterfactual learning over such logged data, but this is beyond the scope of this work.
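The two labeling rules in this subsection can be written down directly. The function names are ours; the rules follow the text above (unsatisfactory ambiguous turns are labeled "ask", and a logged clarification is deemed unnecessary when the user picked the top hypothesis):

```python
def weak_label(ambiguous, satisfactory):
    """Weak-supervision rule: an ambiguous utterance judged
    unsatisfactory by the user satisfaction model is labeled 'ask';
    a satisfactory one is 'no_ask'. Unambiguous turns are out of
    scope (None)."""
    if not ambiguous:
        return None
    return "no_ask" if satisfactory else "ask"

def clarification_label(chosen_hyp_idx):
    """Counterfactual rule for logs that already contain clarifying
    questions: picking the top hypothesis (index 0) means the question
    was unnecessary; picking any alternative means it was useful."""
    return "unnecessary" if chosen_hyp_idx == 0 else "necessary"
```

In this work only `weak_label` is applicable, since the training logs are assumed to contain no clarification turns.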
Figure 2 shows the overall architecture of the proposed model deciding whether to ask clarifying questions or not.
The input to our proposed model is a subset of the HypRank hypotheses described in Section 2.1. The top predicted hypothesis is always included in the model input, since it is what the alternative hypotheses are compared against through clarification dialogs. Then, for each detected ambiguity, we add the most confident alternative hypothesis corresponding to that ambiguity. For the example in Table 1, assume 0.8 is the confidence threshold for ASR, IC, and HC ambiguity occurrences, and that SNR and TRUNC ambiguities were not detected. Hypothesis #2 has the highest confidence among the hypotheses corresponding to the HC ambiguity, and #3 is the one corresponding to the IC ambiguity. Since the hypothesis with a different transcript has a lower ASR confidence (0.6) than the threshold (0.8), we do not use it as an alternative hypothesis. Therefore, the input sequence to the proposed model consists of hypothesis #1 as the top hypothesis, and #2 and #3 as the alternative hypotheses. (SNR and TRUNC ambiguities have no corresponding alternative hypotheses, since those ambiguities could only be resolved by generating new hypotheses based on additional information from the clarification interaction; we therefore represent the corresponding hypothesis with a zero vector for each hypothesis element.)
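The input-construction logic described above can be sketched as follows. The dictionary fields, the single `conf` score per hypothesis (in practice the relevant confidence differs by ambiguity type), and the 0.8 threshold are simplifying assumptions for illustration:

```python
def build_model_input(hyps, conf_threshold=0.8):
    """Always keep the top-ranked hypothesis; then, for each ambiguity
    type that has alternatives (ASR, IC, HC), add the most confident
    alternative hypothesis that clears the threshold. SNR and TRUNC
    contribute no alternatives."""
    top, alternatives = hyps[0], []
    for amb in ("ASR", "IC", "HC"):
        cands = [h for h in hyps[1:]
                 if h["ambiguity"] == amb and h["conf"] >= conf_threshold]
        if cands:
            alternatives.append(max(cands, key=lambda h: h["conf"]))
    return [top] + alternatives

# Mirrors the Table 1 walkthrough: #2 (HC, 0.9) and #3 (IC, 0.85) are
# kept, while #4 (ASR, 0.6) falls below the 0.8 threshold.
hyps = [
    {"id": 1, "ambiguity": "TOP", "conf": 0.95},
    {"id": 2, "ambiguity": "HC",  "conf": 0.90},
    {"id": 3, "ambiguity": "IC",  "conf": 0.85},
    {"id": 4, "ambiguity": "ASR", "conf": 0.60},
]
selected = build_model_input(hyps)
print(sorted(h["id"] for h in selected))  # -> [1, 2, 3]
```

The order of the alternatives is deliberately not meaningful, which is why the model omits positional encoding over them.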
As the model input, each hypothesis is represented as a concatenated vector of ASR output, ASR confidence, intent confidence, domain, intent, slots, and ambiguity type. (We do not include the hypothesis confidence as an input feature, since some utterances lack these scores due to rule-based or shortlister-only hypothesis decisions.) The ASR output is represented by the output summation of a single-layer standard multi-head transformer encoder [Vaswani2017] (we empirically find that 4 attention heads perform best), where the word embeddings are initialized with GloVe [Pennington2014]. ASR confidence is a scalar normalized to be between 0 and 1. The slot vector is represented as a sum of matched slot key vectors, similarly to [YBKim2018b]. Domain, intent, and ambiguity type are also vectorized with embeddings; the top hypothesis's ambiguity type is denoted as TOP to differentiate it from the alternative hypotheses' ambiguity types such as ASR, IC, and HC. The effectiveness of these features is shown in Section 4.5. On top of the hypothesis vector sequence, we obtain a contextualized vector sequence using self-attention with a transformer encoder. For this self-attention, inspired by Set Transformer [Lee2019], we do not use positional encoding, since the order of alternative hypotheses for different ambiguities is not informative for the model's decision.
From the contextualized hypotheses, we obtain the top hypothesis's representation and the sum of the alternative hypothesis representations, which are used as the inputs to the final prediction layer. Summing the alternatives is necessary because the number of alternative hypotheses (i.e., the number of detected ambiguities) varies per utterance, so they must be aggregated into a fixed-size representation. While most self-attentive models for other tasks use a single representation aggregated over all elements of the sequence, we observe that separating the top hypothesis representation from the aggregated representation of the alternative hypotheses performs better, because the top and the alternative hypotheses play different roles in the clarification decision.
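The aggregation step described above amounts to keeping the top vector separate and summing the rest. A minimal pure-Python sketch (no tensor library, list-based vectors assumed):

```python
def aggregate(contextualized):
    """Concatenate the top hypothesis vector with the element-wise sum
    of the alternative hypothesis vectors, producing the fixed-size
    input to the final prediction layer."""
    top, alts = contextualized[0], contextualized[1:]
    # Element-wise sum over a variable number of alternatives;
    # a zero vector when no alternatives exist.
    summed = [sum(vals) for vals in zip(*alts)] if alts else [0.0] * len(top)
    return top + summed  # list concatenation

# Top vector [1.0, 2.0] plus two alternatives summing to [1.0, 2.0].
print(aggregate([[1.0, 2.0], [0.5, 0.5], [0.5, 1.5]]))  # -> [1.0, 2.0, 1.0, 2.0]
```

A single pooled vector over all hypotheses (the common alternative) would lose the distinction between the top hypothesis and the rest, which is exactly what the ablation against Self-sum probes.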
In addition to the hypothesis representations, we also use other signals: SNR, which is a scalar value normalized to be between -1 and 1, ambiguity occurrence vector, which is a concatenation of binary values representing the occurred ambiguities, and a binary signal representing repetition of the previous user utterance, which is a common indicator that the same utterance was wrongly recognized or unsatisfactory previously. All these vectors are concatenated and transformed to an output vector through a feed-forward network.
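The contextual signals above can be assembled as a small feature vector. The SNR clamping range and the ambiguity-type ordering are assumed placeholders; only the normalization target ([-1, 1]), the binary occurrence vector, and the repetition flag come from the text:

```python
def context_features(snr_db, ambiguities, repeated,
                     snr_min=-20.0, snr_max=40.0,
                     all_types=("ASR", "IC", "HC", "SNR", "TRUNC")):
    """Build the contextual signal vector: SNR normalized to [-1, 1],
    a binary ambiguity-occurrence vector, and a repetition flag."""
    snr = max(snr_min, min(snr_max, snr_db))          # clamp to range
    snr_norm = 2.0 * (snr - snr_min) / (snr_max - snr_min) - 1.0
    occurrence = [1.0 if t in ambiguities else 0.0 for t in all_types]
    return [snr_norm] + occurrence + [1.0 if repeated else 0.0]

print(context_features(10.0, {"ASR", "SNR"}, True))
# -> [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
```

In the model, this vector is concatenated with the hypothesis representations and passed through the feed-forward output network.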
[Table 3 layout: Total / Ask / No ask counts for each of the train, validation, and test splits.]
To the best of our knowledge, there is no existing public dataset for asking clarifying questions that includes ASR-related features such as ASR confidences and SNR values. Based on an assessment of randomly sampled ambiguous utterances from a conversational AI system, we estimate that about 23% of ambiguous traffic should be resolved through a clarification dialog for user satisfaction. To show statistical significance in the evaluation results and to make the data split similar to real deployment scenarios, we construct a large test set by selecting the second half of the data by timestamp. We then randomly split the first half into training/validation sets with a 9:1 ratio. The detailed statistics for each ambiguity type are summarized in Table 3. For example, there are a total of 4.6M utterances with ASR ambiguity in the test set. However, only 780K of them are worth clarifying for better user satisfaction, and the remaining 3.8M do not need clarification since they are satisfactory to users even though they are ambiguous. The ratio of 'ask' labels varies across ambiguity types due to the different criteria and thresholds used in ambiguity detection.
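The split protocol above (time-based test half, then a 9:1 random train/validation split of the first half) can be sketched as follows; the `ts` record field and seed are assumed for illustration:

```python
import random

def split_by_time(records, seed=0):
    """Second half by timestamp becomes the test set; the first half
    is shuffled and split 9:1 into train/validation."""
    records = sorted(records, key=lambda r: r["ts"])
    half = len(records) // 2
    first, test = records[:half], records[half:]
    rng = random.Random(seed)   # deterministic shuffle for reproducibility
    rng.shuffle(first)
    cut = int(0.9 * len(first))
    return first[:cut], first[cut:], test

recs = [{"ts": t} for t in range(100)]
train, val, test = split_by_time(recs)
print(len(train), len(val), len(test))  # -> 45 5 50
```

Splitting the test set by time rather than at random better matches deployment, where the model is evaluated on traffic that arrives after training data collection.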
4.2 Experiment Setup
At a high level, we consider three approaches in our experiments: (i) asking questions for every ambiguity occurrence, denoted Always; (ii) using the top hypothesis and the context information as the input to the output layer, denoted No-alt in Figure 2(a); (iii) using the top and alternative hypotheses together with the context information.
In the third category, we try different types of alternative-hypothesis aggregation. The simplest uses the top hypothesis vector and a simple average over the alternative hypothesis vectors, denoted Alt-avg in Figure 2(b). To represent the contextualized information of the hypotheses more effectively, we try two attention mechanisms: (1) representing the alternative hypotheses with cross-attention given the top hypothesis as the key, denoted Crs-att in Figure 2(c); (2) full self-attention over all the hypotheses, denoted Self-att in Figure 2(d).
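The difference between the Crs-att and Self-att variants can be shown with a bare scaled dot-product attention in plain Python. This is a sketch under one plausible reading (the top hypothesis attending over the alternatives for cross-attention); the actual layer dimensions, projections, and multi-head structure are omitted:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    d = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / d for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

hyps = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # top + two alternatives
crs = attention(hyps[:1], hyps[1:], hyps[1:])  # Crs-att: top vs. alternatives
slf = attention(hyps, hyps, hyps)              # Self-att: all vs. all
```

Crs-att produces one contextualized vector conditioned on the top hypothesis, whereas Self-att contextualizes every hypothesis against every other, which is what the results in Section 4.4 favor.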
When using both the top hypothesis and the aggregated alternative hypotheses, the preceding approaches concatenate the top and alternative vectors as the input to the final layer. We also try their summation, denoted Self-sum, to check whether separately representing the top and the alternatives in the final layer is actually better.
4.3 Implementation Details
We train each model with the Adam optimizer [Kingma2015] for 20 epochs and select the best model based on performance on the validation set. The dimensionality of the hypothesis components, such as the domain, intent, slot, and utterance vectors, is set to 100. The other hyperparameters for self-attention and positional embedding are identical to the default values in [Vaswani2017].
4.4 Experiment Results
Note that precision evaluates the model's ability to avoid unnecessary clarifications, while recall measures its ability to ask for clarification when necessary. Hence, we use the F1 score, which balances the two metrics, to evaluate model performance. For a thorough evaluation, we report both the F1 for each ambiguity type and the overall F1 across all types.
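For concreteness, the metric and the relative score used in the result tables can be computed as follows; the 23%-positive example reproduces the rough baseline F1 mentioned in the results:

```python
def f1(precision, recall):
    """Harmonic mean of precision (avoiding unnecessary clarifications)
    and recall (asking when necessary)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relative_f1(f_a, f_b):
    """Relative F1 of model a over model b, in percent (assumed form
    of the confidentiality-preserving relative score)."""
    return (f_a - f_b) / f_b * 100.0

# Always asks: recall is 1.0; with about 23% positive labels,
# precision is 0.23, giving an F1 around 37%.
print(round(f1(0.23, 1.0), 3))  # -> 0.374
```

This explains why the all-ambiguity F1 of the Always baseline sits near 37% despite its perfect recall.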
The relative F1 scores over all the aforementioned models are summarized in Table 4. (Due to internal confidentiality policy, we report relative F1 scores $\Delta_{a,b} = (F_a - F_b)/F_b$, where $F_a$ and $F_b$ denote the F1 scores of model $a$ and model $b$, respectively.) Since Always asks clarifying questions for every ambiguous utterance, its F1 score is the lowest: precision is low even though recall is 100%. As noted above, about 23% of all the ambiguous utterances need clarification in our experimental setting, so the F1 score over all ambiguities is around 37%, with each individual ambiguity's F1 between 20% and 70%. Therefore, any of the tested models performs significantly better.
Compared to the No-alt model, which does not use the alternative hypothesis information, Alt-avg does not significantly improve performance. However, the other models, which use attention mechanisms to represent the alternative hypotheses, significantly outperform No-alt. This indicates that properly represented alternative hypothesis information is an important factor in the model's decision. Moreover, self-attention (Self-att) is significantly better than cross-attention with the top hypothesis as the key (Crs-att), which shows that fully contextualized hypothesis representations are helpful. We further deepen the model with two layers of self-attention, which improves performance further and confirms the effectiveness of our self-attention based model for the clarification-asking task. In addition, the relatively poor performance of Self-sum indicates that the proposed architecture, which separately represents the top hypothesis and the aggregated alternative hypotheses via concatenation, suits our task better than a single aggregated representation, which is more common in self-attentive architectures. This empirically demonstrates that the top hypothesis and the alternative hypotheses play different roles in the model's decision, so separating them is more helpful.
4.5 Ablation Study
We further explore the impact of the input features and architecture settings on our best model, Self-att2. We represent each hypothesis as the concatenation of two vectors: the hypothesis vector and the sentence vector. Among the features of the hypothesis vector, ASR confidence is expected to be closely related to the ASR and SNR ambiguities. Therefore, we conduct the following experiments: excluding all hypothesis features (i.e., all input features except the sentences), excluding all hypothesis features except ASR confidence, and excluding the sentence vectors. Another architectural decision is using a single self-attention weight over the concatenation of the hypothesis vector and the sentence vector. Since these two vectors are significantly different views of a hypothesis, we check whether a single attention weight is better than separate attention weights for the two vectors. In addition, we check the effectiveness of the repetition feature, since we hypothesize that knowing whether the current utterance is a repetition helps the model's decision.
The ablation study results are shown in Table 5. Excluding the hypothesis vector features (No-hyp) causes a large drop in overall performance, so the hypothesis features are critical to the model's decision. Using only the ASR confidence signal and the sentence vector (ASR) is slightly helpful for most ambiguities except IC, but the improvements for the ASR and SNR ambiguities are not large. This means that contextual features unrelated to acoustics also significantly influence the decision for acoustics-related ambiguities. Excluding the sentence vector (No-sent) also lowers performance, but the drop is smaller than that from excluding the hypothesis vector. This indicates that both the sentence and the hypothesis features are important, but the hypothesis features provide more information for the model's decision. Using separate self-attention weights (Diff-att) performs worse, demonstrating that holistic self-attention over the concatenation of the hypothesis vector and the sentence vector is not only simpler but also empirically more effective. One possible reason is that attention over different sentence vectors is less influential, because the sentence vectors of different hypotheses differ only when an ASR ambiguity exists. Excluding the repetition feature (No-rpt) also causes significant degradation, which reflects its effectiveness.
These findings show that the utilized signals and the architecture decisions are helpful for making proper model predictions.
In this work, we have introduced five common ambiguities in SLU, where empirically 23% of the utterances with these ambiguities need to be clarified. To decide whether asking a clarifying question would be helpful, we have proposed a scalable neural self-attentive model, where the top and alternative hypotheses, ambiguity occurrence information, and other contextual information form the input representation, and the model predicts whether to ask a clarifying question. The model is supervised by leveraging a user satisfaction model so that it asks a clarifying question only when doing so would be helpful. The proposed model, which uses self-attention over the hypothesis representations together with ambiguity-related contextual information, has shown significantly improved performance compared to various baseline approaches evaluated on user log data from a conversational agent system.
As future work, we will study how logged clarifying interactions can be utilized for the fine-tuning to further improve user satisfaction with clarification as briefly described in Section 2.3. Also, we will look at how the clarifying questions should be composed for each ambiguity type for effective and natural user engagement in the large-scale setting.