Large-scale conversational AI agents such as Alexa, Siri, and Google Assistant are becoming increasingly prevalent, expanding into new domains and taking on new tasks to help users across the globe. One key consideration in designing such systems is how they can be improved over time at that scale. Users interacting with these agents experience friction for various reasons: 1) Automatic Speech Recognition (ASR) errors, such as ”play maj and dragons” (should be ”play imagine dragons”), 2) Natural Language Understanding (NLU) errors, such as ”don’t play this song again skip” (Alexa would understand if it were formulated as ”thumbs down this song”), and 3) user errors, such as ”play bazzi angel” (it should have been ”play beautiful by bazzi”). Fixing these frictions helps users have a more seamless experience and engage more with the AI agents.
One common method to address frictions is to gather these use cases and fix them manually using rules and Finite State Transducers (FST), as is often the case in speech recognition systems [mohri]. This is a laborious technique that is 1) not scalable at Alexa scale, 2) prone to error, and 3) liable to become stale and even defective over time. Another approach could be to identify these frictions, ask annotators to come up with the correct form of the query, and then update the ASR and NLU models to solve these problems. This is also 1) not a scalable solution, since it needs a lot of annotations, and 2) expensive and time-consuming, since those models must be updated. Instead, we have taken a ”query rewriting” approach to solve customer frictions, meaning that when necessary, we reformulate a customer’s query such that it conveys the same meaning/intent, and is actionable (i.e. interpretable) by Alexa’s existing NLU systems.
In motivating our approach, consider the example utterance, ”play maj and dragons”. Without reformulation, Alexa would inevitably come up with the response, ”Sorry, I couldn’t find maj and dragons”. Some customers give up at this point, while others may try enunciating better for Alexa to understand them: ”play imagine dragons”. Also note that there might be other customers who give up and change the next query to another intent, for example: ”play pop music”. Here, frictions evidently cause dissatisfaction, with different customers reacting differently to them. However, there are clearly good rephrases by some customers among all these interactions, which raises the question: how can we identify and extract them to solve customer frictions?
We propose using a Markov-based collaborative filtering approach to identify rewrites that lead to successful customer interactions. We go on to discuss the theory and implementation of the idea, and show that this method is highly scalable and effective in significantly reducing customer frictions. We also discuss how this approach was deployed into customer-facing production, along with some of the challenges and benefits of such an approach.
Collaborative filtering has been used extensively in recommender systems. In a more general sense, collaborative filtering can be viewed as a method of mining patterns from various agents (most commonly, people), in order to help them help each other out [terveen]. Markov chains have been used previously in collaborative filtering applications to recommend course enrollment [khorasani], personalized recommender systems [sahoo], and web recommendation [fouss].
Studies have shown that Markov processes can be used to explain user web query behavior [jansen], and Markov chains have since been used successfully for web query reformulation via absorbing random walk [wang], and for modeling query utility [zhu]. We here present a new method for query reformulation using a Markov chain that is both highly scalable and interpretable due to intuitive definitions of transition probabilities. Also, to the best of the authors’ knowledge, this is the first work where a Markov chain is used for query reformulation in voice-based virtual assistants.
One important difference between web query reformulation and Alexa’s use case is that we need to seamlessly replace the user’s utterance in order to remove friction. Asking users for confirmation every time we plan to reformulate is in itself an added friction, which we try to avoid as much as possible. Another difference is how success and failure are defined for an interaction between a user and a voice-based virtual assistant system. We use implicit and explicit user feedback from interactions with Alexa to establish the absorbing states of success and failure.
The Alexa conversational AI system follows a rather well-established architectural pattern of cloud-based digital voice assistants [jianfeng], i.e. comprising an automatic speech recognition (ASR) system, a natural language understanding (NLU) system with a built-in dialog manager, and a text-to-speech (TTS) system, as visualized in Fig. 1. Conventionally, as a user interacts with their Alexa-enabled device, their voice is first recognized by ASR and decoded into plain text, which we refer to as an utterance. The utterance is then interpreted by the NLU component to surface the user’s intent, also accounting for the state of the user’s active dialog session. Thereafter, the intent and the corresponding action to execute are passed on to the TTS system to generate the appropriate response as speech back to the user via their Alexa-enabled device, thus closing the interaction loop. Also note that the metadata associated with each of the above systems are anonymized and logged asynchronously to an external database.
In deploying our self-learning system, we first intercept the utterance being passed on to the NLU system and rewrite it with our reformulation engine. We then pass the rewrite, in lieu of the original utterance, back to NLU for interpretation, thus restoring the original data flow. This is shown as the post-deployment data flow path in Fig. 1. Our reformulation engine essentially implements a rather lightweight service-oriented architecture that encapsulates access to a high-performance, low-latency database, which is queried with the original utterance to yield its corresponding rewrite candidate. This, along with the fact that the system is fundamentally stateless across users, translates to a rather scalable customer-facing system with marginal impact on the user-perceived latency of their Alexa-enabled device.
In order to discover new rewrite candidates and maintain the viability of existing rewrites, our Markov-based model ingests the anonymized Alexa log data on a daily basis to learn from users’ reformulations and subsequently updates the aforementioned online database. We discuss the nature of the dataset and how our model achieves this in later sections of this paper. This ingestion-to-update process takes place entirely offline, with the rewrites in the database updated via a low-maintenance feature-toggling (i.e. feature-flag) mechanism. Additionally, we have an offline blacklisting mechanism which evaluates the rewrites from our Markov model by independently comparing their friction rate against that of the original utterance, and filters them from being uploaded to the database should they perform worse than their no-rewrite counterpart, using a statistical significance test with a rather conservative p-value threshold. This allows us to maintain a high-precision system at runtime. It is worth mentioning that friction detection is done using a pre-trained ML model based on the user’s utterance and Alexa’s response. The details of that model are out of the scope of this paper.
As our objective is to learn the patterns from user interactions with Alexa, we pre-process 3 months of anonymized Alexa log data across millions of customers, which constitutes a highly randomized collection of time-series utterance data, to build our dataset, comprising a set of sessions, i.e.:
Here, in defining the concept of a session, we first define a construction function, parameterized by a customer, a device, and an initial timestamp, that yields a finite ordered set of successive utterances (and their associated metadata) such that the time delay between any two consecutive utterances is at most a fixed threshold. We also note that interjecting utterances, i.e. those leading to StopIntent, CancelIntent, etc., that occur before the end of the aforementioned set are removed. Then, a session is defined as follows:
such that the following properties hold true:
Intuitively speaking, a session is effectively a time-delimited snapshot of a user’s conversation history with their Alexa-enabled device. We illustrate this in Fig. 2 (a), (b), and (c), where each session is represented as a linear directed chain of successive utterances. In this paper, the value of the time-delay threshold (in seconds) was chosen based on a separate study.
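The session construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production pipeline: the `Turn` record, the `build_sessions` function, and the `max_gap_seconds` parameter are hypothetical names, and the input is assumed to be the log of a single customer-device pair. Only StopIntent and CancelIntent are named in the text; any further interjecting intents would have to be added to the set.

```python
from dataclasses import dataclass
from typing import List

# Intents treated as interjections; the text names these two as examples.
INTERJECTING_INTENTS = {"StopIntent", "CancelIntent"}


@dataclass
class Turn:
    timestamp: float  # seconds, on any monotonically increasing clock
    utterance: str
    intent: str


def build_sessions(turns: List[Turn], max_gap_seconds: float) -> List[List[Turn]]:
    """Split one customer-device stream into sessions by time gap."""
    sessions: List[List[Turn]] = []
    current: List[Turn] = []
    for turn in sorted(turns, key=lambda t: t.timestamp):
        # Start a new session when the gap to the previous turn is too large.
        if current and turn.timestamp - current[-1].timestamp > max_gap_seconds:
            sessions.append(current)
            current = []
        current.append(turn)
    if current:
        sessions.append(current)
    # Remove interjecting turns unless they are the last turn of a session.
    return [
        [
            t
            for i, t in enumerate(session)
            if i == len(session) - 1 or t.intent not in INTERJECTING_INTENTS
        ]
        for session in sessions
    ]
```

A final interjection is deliberately kept at the end of a session, since it carries the explicit feedback signal used later when the absorbing states are attached.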
Absorbing Markov Chain
In this section, we show how encoding user interaction history as paths in an absorbing Markov Chain model can be used to mine patterns for reformulating utterances. In particular, we discuss in detail the concept of the interpretation space (Section 4.1), which functions as the vertex set of the model’s transient states (Section 4.2). We then elaborate on the construction of the absorbing states (Section 4.3), the canonical solution to the model (Section 4.4), and the practical implementation of the model (Section 4.5). As the Markov Chain model is inherently a probabilistic graphical model, we can represent it as a graph whose vertex set and edge set are given as follows:
We note that from here on out, we use the terms, Graph and Markov model interchangeably.
While our definition of a session in Section 3 naturally extends towards having each ordered linear sequence of utterances as a path in our Markov model, this encoding in the utterance space, i.e. the space of all utterances, imposes a limitation on the model by creating heavily sparse connections. This is primarily due to the high degree of semantic and structural variance in the utterance space, which would ultimately result in a lower capacity for generalization.
To resolve this, we leverage the domain and intent classifier as well as the named entity recognition (NER) results from Alexa’s NLU systems to surface structured representations of utterances, and thus encapsulate a latent distribution over the utterance space. Consequently, each utterance in a session is projected into this interpretation space, which comprises the set of all interpretations, to define a latent session:
To exemplify this, consider the utterance ”play despicable me” (shown in Fig. 2), which would be mapped into the interpretation space as:
which is represented compactly in Fig. 2. As the interpretation space condenses the semantics of the utterance space, the mapping from utterances to interpretations is inherently a many-to-one relationship. However, given the stochasticity of Alexa’s NLU, the original projection itself is not entirely bijective and thus results in a many-to-one relationship in both the forward and inverse mappings, akin to a bipartite mapping. This in turn yields two conditional probability distributions, one over interpretations given an utterance and one over utterances given an interpretation, such that for a particular utterance and interpretation, they are defined as follows:
where the co-occurrence count of an utterance-interpretation pair in the dataset is the total number of times the two are mapped onto each other.
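As a rough illustration of how such count-based conditional distributions could be estimated, the sketch below normalizes the co-occurrence counts of utterance-interpretation pairs in both directions. The normalization (each count divided by the marginal total of the conditioning item) is an assumption about the form of the definition, and the function name and the interpretation strings used in the usage example are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def conditional_distributions(pairs: Iterable[Tuple[str, str]]):
    """Estimate P(interpretation | utterance) and P(utterance | interpretation).

    `pairs` holds one (utterance, interpretation) tuple per logged turn.
    """
    pair_counts = Counter(pairs)
    utterance_totals: Counter = Counter()
    interpretation_totals: Counter = Counter()
    for (utterance, interpretation), count in pair_counts.items():
        utterance_totals[utterance] += count
        interpretation_totals[interpretation] += count

    p_interp_given_utt: Dict[str, Dict[str, float]] = defaultdict(dict)
    p_utt_given_interp: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (utterance, interpretation), count in pair_counts.items():
        # Normalize the co-occurrence count by each marginal total.
        p_interp_given_utt[utterance][interpretation] = count / utterance_totals[utterance]
        p_utt_given_interp[interpretation][utterance] = (
            count / interpretation_totals[interpretation]
        )
    return dict(p_interp_given_utt), dict(p_utt_given_interp)
```

For example, if ”play despicable me” is logged three times with a video interpretation and once with a music interpretation, the first distribution assigns 0.75 and 0.25 to the two interpretations of that utterance.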
Given our transformed dataset of latent sessions, we take each such session and the interpretations within it to represent paths and transient states respectively in our Markov model, such that each successive pair of interpretations represents an edge in the Graph. In defining the transition probability distribution, we first define the total occurrence of an interpretation in the aforementioned dataset as follows:
where the co-occurrence count of a pair of interpretations is the total number of times the first is adjacent to the second, aggregated across all sessions (i.e. over 3 months and millions of customers) in the dataset:
Then, the corresponding probability that one transient state transitions to another in the Graph is given by:
Taking this in context of Fig. 2, consider the transition probability . From the sessions (a), (b), and (c), we can note that the transition state is adjacent to the states, with each of them having a co-occurrence of with . Here, refers to the failure absorbing state (defined in the following sub-section). As such, the probability as shown in (d).
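The counting scheme for transition probabilities can be illustrated with a short Python sketch. It assumes that each latent session is a list of state labels (with an absorbing state already appended as the last element) and that a transition probability is the adjacency count of a pair divided by the total number of outgoing adjacencies of its source state. The state labels and the function name are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(sessions: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next state | state) from adjacent pairs across all sessions."""
    pair_counts: Counter = Counter()
    for session in sessions:
        # Every successive pair of states contributes one edge observation.
        for source, target in zip(session, session[1:]):
            pair_counts[(source, target)] += 1

    outgoing_totals: Counter = Counter()
    for (source, _target), count in pair_counts.items():
        outgoing_totals[source] += count

    probabilities: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        probabilities[source][target] = count / outgoing_totals[source]
    return dict(probabilities)
```

With three toy sessions in which the same source state is followed once each by two other interpretations and by the failure state, every outgoing edge of that source receives an equal share of the probability mass.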
In formulating the definition of the absorbing states of the Markov model, we look towards encoding the notion of interpreted defects as perceived by the user. As we have briefly introduced earlier, this concept of defect surfaces in two key forms i.e. via explicit and implicit feedback.
Here, explicit feedback refers to the type of corrective or reinforcing feedback received from direct user engagement. This primarily includes events where users opt to interrupt Alexa by means of an interjecting utterance (as defined above in Section 3). This is illustrated in the example below:
|User:||”play a lever”|
|Alexa:||”Here’s Lever by The Mavis’s, starting now.”|
|User:||”stop”|
In contrast, implicit feedback is typically observed when users abandon a session following Alexa’s failure to handle a request, either due to an internal exception or simply because it is unable to find a match for the entities resolved. Case in point:
|User:||”play maj and dragons”|
|Alexa:||”Sorry, I can’t find the artist maj and dragons.”|
Given this, we define two absorbing states, failure and success, where success is defined as the absence of failure. These states are artificially appended to the end of all sessions, based on the implicit and explicit feedback we infer from Alexa’s response and the user’s last utterance.
To clarify this, let us walk through the examples above, assuming that they are the last utterances of their corresponding sessions. In the first example, we would drop the ”stop” turn and add a failure state. In the second example, we simply add the failure state to the end of the session. Finally, in the absence of explicit or implicit feedback, we add a success state to the end of the session. There are certain edge cases, but for the sake of brevity, we do not discuss them here. Given this, we can then define the probability that a given transient state is absorbed in much the same way as in Eq. 8, e.g.:
Note that in Fig. 2, we refer to the failure (), and success () states as and respectively.
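The three cases walked through above can be expressed as a small helper. This is only a sketch: the two boolean flags stand in for the outputs of the feedback inference, whose details (and edge cases) are not covered here, and the state labels and function name are hypothetical.

```python
from typing import List

SUCCESS = "SUCCESS"  # hypothetical label for the success absorbing state
FAILURE = "FAILURE"  # hypothetical label for the failure absorbing state


def append_absorbing_state(
    session: List[str],
    ends_with_interjection: bool,
    ends_with_error_response: bool,
) -> List[str]:
    """Close a latent session with the absorbing state implied by the feedback."""
    states = list(session)
    if ends_with_interjection:
        # Explicit feedback: drop the final "stop"-like turn, then mark failure.
        return states[:-1] + [FAILURE]
    if ends_with_error_response:
        # Implicit feedback: the session was abandoned after a failed request.
        return states + [FAILURE]
    # No negative feedback observed: success is the absence of failure.
    return states + [SUCCESS]
```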
With the distributions over both the transition and absorbing states defined above, recall that the interpretation space, is the set of all transient states in the Graph. Then, we can summarize the Markov model in its canonical form via the transition matrix, as follows:
Now, we generalize the previous notation of probabilities as i.e. the probability at depth- of the Graph, with implicitly referring to . Then, let and be given source and target transient states in the Graph respectively. We further define the probability of success of given such that is reached by in at most steps as follows:
As such, in the context of reducing defects, we consider to be a possible reformulation candidate for if it is reachable by , such that conditioned on , has a higher chance of success than on its own, i.e.:
Here, reachability of any two states implies that there exists a path between them in the Graph or mathematically speaking, there exists a non-zero value of for which . Now, consider the probability of success of given such that is reached by in exactly steps. We would then have the following:
where refers to the -entry of the matrix ( multiplied by itself times), which in turn refers to the probability of reaching from in exactly steps i.e. . Expanding this to any number of steps i.e. reachable would thus allow us to reformulate the left set of terms in the inequality of Eq. 12 using matrix notations:
Generalizing this across all , define the matrix such that its -th entry, . Then, we have:
is the diagonal matrix whose diagonal is the vector. Now, as is a square matrix of probabilities, we have and that is convergent. Then the summation above leads to a geometric series of matrices, which as given by Definition 11.3 in [grinstead], corresponds to the fundamental matrix of the Markov model, denoted by :
with referring to the identity matrix with the dimensions, . Given this, let be the -th row vector of the matrix corresponding to . As such, every non-zero entry in translates to the probability of some reachable . This vector is thus given by:
where refers to the Hadamard (element-wise) product. We then frame our objective as identifying the which maximizes the aforementioned probability for the given :
Intuitively speaking, in the event that , the model shows that there exists a reachable target interpretation that when reformulated from , has a better chance at a successful experience than not doing so. In reference to Fig. 2, we can see that reformulating to increases the likelihood of success as:
Suppose that . In which case, the source interpretation is already successful on its own and hence requires no reformulation. As such, the model is effectively able to automatically partition the vertex space, into sets of successful () and unsuccessful () interpretations. In extending this reformulation back to the utterance space, , we leverage the distributions and defined in Eq. 5 and re-define our objective as follows for a given source utterance :
The intuition described above can similarly be applied here where is the more successful reformulation of . Note that the self-partitioning feature of the model directly extends to the utterance space, , allowing it to surgically target only utterances that are likely to be defective and surface their corresponding rewrite candidates. This is the cardinal aspect of the model that drives the self-learning nature of the proposed system without requiring any human in the loop.
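To make the canonical solution concrete, the following sketch works in the interpretation space on a toy model. The notation is an assumption made for illustration: `q` is the transient-to-transient block of the transition matrix, `success` holds each transient state's one-step probability of absorption into the success state, and the fundamental matrix is the inverse of the identity minus `q`, as in the standard treatment of absorbing chains. Each target is scored by the corresponding entry of the source's row of the fundamental matrix multiplied element-wise by `success`, which is one reading of the formulation above rather than a definitive implementation. A small Gauss-Jordan routine is used so that the example needs no third-party packages.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fundamental_matrix(q: Matrix) -> Matrix:
    """Return (I - Q)^-1 for the transient block Q."""
    n = len(q)
    return invert(
        [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)] for i in range(n)]
    )


def best_rewrite(q: Matrix, success: List[float], source: int) -> Tuple[int, List[float]]:
    """Score every target for `source` and return the index of the best one."""
    row = fundamental_matrix(q)[source]
    scores = [row[j] * success[j] for j in range(len(success))]
    return max(range(len(scores)), key=lambda j: scores[j]), scores
```

When the best-scoring target is the source itself, the source is treated as already successful and no rewrite is surfaced; otherwise the best target is the reformulation candidate.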
With , constructing the matrix , let alone inverting it, poses a key challenge towards scaling out the model, particularly in its batched form. As such, we formulate an approximation in computing the vector for all source interpretations, by means of a distributed approach.
We note from our dataset that, in the event that a given source utterance is defective, users would only attempt to reformulate their query a few times before either arriving at a satisfactory experience or abandoning their session entirely. This translates to most source interpretations in the Markov model having short path lengths prior to being absorbed by an absorbing state. Consequently, given also that these reformulations are recurrent across users, most high-confidence reformulations involve visiting only a much smaller set of target interpretations, i.e.
This leads us to deduce that the matrix is highly sparse and the corresponding Graph contains many clustered (i.e. community) structures. We then leverage these facts to first collect the paths for every source interpretation, in a series of map-reduce tasks, by means of a distributed breadth-first search traversal up to a fixed depth of 5 using Apache Spark [zaharia]. Thereafter, each task receives the paths corresponding to a single and in turn uses them to construct an approximate transition matrix, . As the dimensionality of the matrix is much lower than that of , we can easily compute the approximate fundamental matrix, and the approximate vector within the same task. As a result, we have a distributed solution for parallelizing the computation of for every .
The breadth-first search traversal, which involves a series of sort-merge joins, does indeed introduce an algorithmic overhead that grows with the depth of the traversal and the number of edges in the Graph. We also note that, as this is a distributed join, the network cost incurred by data shuffles is omitted here for simplicity. That being said, these overheads are offset by the advantage of being able to scale out the model. For purposes of optimization, each successive join is only performed on the set of paths which are non-cyclic and have yet to be absorbed, while paths with vanishing probabilities are pruned off.
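The path collection step can be approximated on a single machine with a plain breadth-first expansion, shown below as a sketch. It is not the distributed Spark job: the sort-merge joins are replaced by dictionary lookups, and the function name, the `min_probability` cut-off, and the state labels are hypothetical. The depth limit of 5 and the three pruning rules (cyclic paths, absorbed paths, vanishing probabilities) follow the description above.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]


def expand_paths(
    transitions: Dict[str, Dict[str, float]],
    source: str,
    absorbing: Set[str],
    max_depth: int = 5,
    min_probability: float = 1e-4,  # hypothetical pruning threshold
) -> List[Tuple[Path, float]]:
    """Collect all pruned paths starting at `source`, level by level."""
    frontier: List[Tuple[Path, float]] = [((source,), 1.0)]
    collected: List[Tuple[Path, float]] = []
    for _ in range(max_depth):
        next_frontier: List[Tuple[Path, float]] = []
        for path, probability in frontier:
            for target, step in transitions.get(path[-1], {}).items():
                extended = probability * step
                # Prune paths with vanishing probability and cyclic paths.
                if extended < min_probability or target in path:
                    continue
                new_path = path + (target,)
                collected.append((new_path, extended))
                # Absorbed paths are recorded but never extended further.
                if target not in absorbing:
                    next_frontier.append((new_path, extended))
        frontier = next_frontier
    return collected
```

The paths returned for one source interpretation touch only a small set of states, which is what makes it feasible to build and invert a small approximate transition matrix per source.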
|No.||Original utterance||Rewrite||Quality|
|1||play maj and dragons||play imagine dragons||Good|
|2||play shadow by lady gaga||play shallow by lady gaga||Good|
|3||play rumer||play rumor by lee brice||Good|
|4||play sirius x. m. chill||play channel fifty three on sirius x. m.||Good|
|5||play a. b. c.||play the alphabet song||Good|
|6||don’t ever play that song again||thumbs down this song||Good|
|7||turn the volume to half||volume five||Good|
|8||play island ninety point five||play island ninety eight point five||Good|
|9||play swaggy playlist||shuffle my songs||Bad|
|10||play carter five by lil wayne||play carter four by lil wayne||Bad|
Baseline: Pointer-Generator Sequence-to-Sequence Model
Sequence-to-sequence (seq2seq) architectures have been the foundation for many neural machine translation and sequence learning tasks [sutskever]. As such, by formulating the task of query rewriting as an extension of sequence learning, we used a Long Short-Term Memory-based (LSTM) model as an alternative method to produce rewrites. In short, we first mined 3 months of rephrase data using a rephrase detection ML model such that the first utterance was defective and the rephrase was successful. We then used this data to train the seq2seq model, such that given the first utterance, it produces the second utterance. The model is based on the well-established encoder-decoder architecture with attention and copy mechanisms [see]. After the model was trained, we used it to rewrite the same utterances that the Graph rewrites.
In order to evaluate the quality of the rewrites we obtained, we annotated 5,679 unique utterance-rewrite pairs generated using the Graph, and estimated the accuracy and win/loss ratio to be 93.4% and 12.0, respectively. The win/loss ratio is defined as the ratio of rewrites that result in a better customer experience to rewrites that deteriorate the customer experience. We further used the seq2seq model to generate rewrites for these utterances as a baseline.
Applying the seq2seq model to this dataset resulted in an accuracy of 55.2%, significantly lower than the accuracy of the Graph. This is expected, since the Graph is 1) aggregating all three months of data (and not only rephrases), 2) taking into account the frequency of transitions, whereas the seq2seq model only has unique rephrase pairs for training, and 3) utilizing the interpretation space to further compact and aggregate the utterances. However, the seq2seq model has the benefit of higher recall (since it can rewrite any utterance), and it learns patterns, e.g. SongName → play SongName. Another important difference between the Graph and seq2seq methods is that the Graph is capable of marking an utterance as Successful, which is a signal not to rewrite it, since the utterance on its own is mostly successful. The seq2seq model lacks this capability, so it may rewrite a successful utterance and cause friction.
Table 1 shows some examples of good and bad rewrites from the Graph. It is clear from the examples that the rewrites are capable of fixing ASR (no. 1-3), NLU (no. 4-7) and even user errors (no. 8). On the other hand, there are cases where the rewrites fail (no. 9-10). One of the recurring cases of failure is when an utterance is rewritten to a generic utterance, like ”play” or ”shuffle my songs”. This usually happens when the original utterance is not successful and users try many different paths that eventually lose information and are aggregated into a generic utterance (due to Eq. 20). Another common case of failure is when the rewrite changes the intention of the original utterance by changing the song name or artist name. This happens for various reasons. For example, the data that we use for building the Graph may contain a period of time where the original utterance was not usually successful, so the users changed their mind by asking to play another similar song (like no. 10). The first type of error is easy to correct, by either applying rules or building a learning-based ranker after the Graph generation. The second type, however, is tricky to detect, since in many cases the change in the interpretation helps. We relied on an online blacklisting mechanism to remove these rewrites in the production system.
Offline Rewrite Mining
Since there are thousands of new utterances per day, and there are constant changes to the upstream and downstream systems in Alexa on a daily basis, it is important to update our rewrites regularly to remove stale and ineffective rewrites. We run daily jobs to mine the most recent rewrites in an offline fashion and serve them to users. It is noteworthy that in case of conflicts between rewrites, we pick the most recent rewrite, since it reflects the latest data. We have online alarms and metrics to monitor the daily jobs, since changes to the upstream and downstream Alexa components can sometimes impact our rewrite mining algorithm. In case of large changes in our metrics, we dive deep into the data to find the root cause.
Since the Graph is static during the period it is used, and there are many repetitive utterances per day, we opted to mine the rewrites as key-value pairs, where the original utterance is the key and the rewrite is the value. For example, we store ”play babe shark” → ”play baby shark” as one entry. We then serve these pairs from a high-performance database to meet the low-latency requirement. This allows us to decouple the offline mining process from the online serving process, for high availability and low latency.
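The online side of this design reduces to a single key-value lookup per utterance. The sketch below uses an in-memory dictionary as a stand-in for the high-performance database; the function and variable names are hypothetical, and the fall-through to the original utterance reflects the requirement that utterances without a mined rewrite pass to NLU unchanged.

```python
from typing import Dict


def rewrite_utterance(utterance: str, rewrite_table: Dict[str, str]) -> str:
    """Return the mined rewrite for `utterance`, or the utterance itself."""
    # A miss means the utterance is forwarded to NLU unchanged.
    return rewrite_table.get(utterance, utterance)


# Stand-in for the offline-mined key-value store (hypothetical contents).
REWRITES = {"play babe shark": "play baby shark"}

rewritten = rewrite_utterance("play babe shark", REWRITES)
unchanged = rewrite_utterance("play pop music", REWRITES)
```

Because the lookup keeps no per-user state, the table can be swapped for a freshly mined one without touching the serving path.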
After all the offline analysis and traffic simulations, we launched Graph rewrites in production in an A/B testing setup. We monitored the performance of our rewrites against no-rewrites for over two weeks, and we observed a reduction of more than 30% in defect rate, helping millions of users. We further measured the win/loss ratio three months after the release, by calculating the number of unique rewrites where rewriting is significantly better (win) or worse (loss) compared to the no-rewrite option (we used a Z-test to test the significance, and set a p-value threshold of 0.01). The post-launch win/loss ratio closely matched our offline estimate (11.8 online vs. 12.0 offline).
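One way to label a single rewrite as a win or a loss is a two-sided, pooled two-proportion Z-test on the defect rates with and without the rewrite. The sketch below assumes that particular form of the Z-test, which the text does not specify; the function names and the example counts are hypothetical, and only the 0.01 threshold comes from the description above.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    defects_a: int, total_a: int, defects_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) for two defect rates."""
    rate_a = defects_a / total_a
    rate_b = defects_b / total_b
    pooled = (defects_a + defects_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if std_err == 0.0:
        return 0.0, 1.0  # identical all-or-nothing rates: no evidence either way
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def classify_rewrite(
    rewrite_defects: int,
    rewrite_total: int,
    baseline_defects: int,
    baseline_total: int,
    alpha: float = 0.01,
) -> str:
    """Label a rewrite as 'win', 'loss', or 'neutral' against no-rewrite."""
    z, p_value = two_proportion_z_test(
        rewrite_defects, rewrite_total, baseline_defects, baseline_total
    )
    if p_value >= alpha:
        return "neutral"
    # A negative z means the rewrite has the lower defect rate.
    return "win" if z < 0 else "loss"
```

The win/loss ratio is then the number of rewrites labeled as wins divided by the number labeled as losses.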
We have been running this application for over 9 months in production, and it has been serving millions of users since, improving their experience on a daily basis without getting in their way. We know this for a fact since we have been monitoring customer satisfaction metrics on a weekly basis. We monitor the total number of rewrites, and the average friction rate for the rewrites, along with average friction for no-rewrites. On top of tracking online metrics, we continue doing offline evaluations on a weekly basis, where we sample our traffic, and send it for annotation. Combining the online and offline metrics in a longitudinal fashion allows us to closely follow the changes in the customer experience, which is the ultimate metric for our system.
As conversational agents become more popular and grow into new scopes, it is critical for these systems to have self-learning mechanisms that fix recurring issues continuously with minimal human intervention. In this paper, we presented a self-learning system that is able to efficiently target and rectify both systemic and customer errors at runtime by means of query reformulation. In particular, we proposed a highly scalable collaborative-filtering mechanism based on an absorbing Markov chain to surface successful utterance reformulations in conversational AI agents. Our system achieves high precision by aggregating large amounts of cross-user data in an offline fashion, without adversely impacting users’ perceived latency, since the rewrites are served through a simple look-up online. We have tested and deployed our system into production across millions of users, reducing customer frictions by more than 30% and achieving a win/loss ratio of 11.8. Our solution has been customer-facing for over 9 months now, and it has helped millions of users to have a more seamless experience with Alexa.