Virtual assistants that use Automatic Speech Recognition (ASR) rely on the accuracy of the transcript to understand the user's intent. However, even though ASR accuracy has improved, recognition errors remain common, typically caused by noise and the inherent ambiguity of language. Xiong et al. [xiong2016achieving] found the error rate of professional transcribers to be 5.9% on Switchboard, a dataset of conversational speech. ASR systems are especially error-prone on out-of-vocabulary words such as names. Recognition errors lead to misunderstandings of the user's intent, which result in unwanted behaviour. Such errors can be mitigated by detecting that an error has occurred [ogawa2017error, fayolle2010crf]; if an error is detected, the virtual assistant can initiate a recovery strategy. Typical recovery strategies include asking clarifying questions or asking the user to repeat a certain part of the request [skantze2008galatea].
If the system fails to detect a recognition error, it may respond nonsensically or take an action the user did not intend. In response, the user might try to resolve the misunderstanding, most commonly by repeating (part of) the utterance [swerts2000corrections]. We call this repetition-based recovery. Substantial work exists on detecting whether such a repetition-based correction occurred, using prosody [litman2006characterizing, hirschberg2004prosodic, stifelman1993user], n-best hypothesis overlap [kitaoka2003detection], phonetic distances [lopes2015detecting] or a combination of the above [skantze2004early].
In this paper, we focus on automatically correcting recognition errors through Query Rewrite (QR). Query Rewrite is used in information retrieval systems [musa2019answering, DBLP:journals/corr/abs-1809-02922] and smart assistants [ChenQR, RoshanGhias2020PersonalizedQR, Rastogi2019ScalingMD]. We generate a corrected transcription of the incorrectly understood utterance. A Natural Language Understanding (NLU) system can then parse this rewritten query so that the system can recover from the error. QR is a generalizable way of handling repetition-based recovery, because it makes no assumptions about a specific grammar the user should be following. This is especially important for virtual assistants that handle a wide range of requests. Previously proposed systems for repetition-based recovery [sagawa2004correction, lopes2015detecting] assume a specific grammar or task, making them hard to use for a virtual assistant.
We propose a model that can generate a rewritten query by merging the incorrect first utterance and the correction follow-up utterance. We can then use the rewritten query to re-estimate the user's intent and act upon it. The inputs to the model are Acoustic Neighbour Embeddings (ANEs) [jeon2020acoustic] of the words in the utterances. ANEs are embeddings that are trained in such a way that similar-sounding words are close to each other in the embedding space. The ANE model takes in a word as a sequence of graphemes and produces a word embedding. With these embeddings, the model can infer which words sound similar and thus could have been confused by the ASR. Relatedly, Wang et al. [DBLP:conf/interspeech/WangDLLAL20] rewrite single-turn utterances by taking phonetic inputs from the ASR model to handle entity recognition errors.
Our proposed model is an encoder-decoder [sutskever2014sequence] network. The decoder is a modified pointer network [vinyals2015pointer] that takes two input sequences, one for the incorrect first utterance and one for the correction follow-up utterance. A pointer network is an attention-based [bahdanau2014neural] model that outputs words from a variable-sized dictionary. A pointer network is necessary because the model only considers words occurring in the two input utterances as potential outputs. We modify the standard pointer network into a 2-Step Attention (2SA) pointer network, which is tailored to this problem: the decoder first attends over the first turn, then attends over the second turn, and finally selects between the two attention results.
The model is trained using synthetic data generated based on a transcribed speech dataset. Using the ASR transcription and the reference transcription, a repetition-based recovery is generated. This data generation method allows the rewrite model to be trained using existing commonly available data resources.
The contributions of this paper are threefold:
We propose QR as a generalizable way of handling repetition-based recovery.
We propose a 2-Step Attention (2SA) pointer network for this task.
We propose a method to generate training data based on transcribed speech.
The remainder of this paper is organised as follows. First, in Section 2, we describe the proposed model, and in Section 3, we explain how we generate training data to train this model. In Section 4, we describe the models we used as baselines. Next, in Section 5, we present the data we used for the experiments, the metrics we used to measure the performance and the model hyper-parameters. In Section 6, we discuss the results. Finally, in Sections 7 and 8, we formulate future work and conclusions.
Our proposed model takes two utterances as inputs: the potentially incorrect first utterance and the correction follow-up utterance. The output is a rewritten query, represented as a sequence of pointers to words in the incorrect first utterance and the second utterance. To clarify, consider the following example:
First utterance: Call Uncle LeVar
ASR transcription: Call Uncle of R
Second utterance: No, I said Uncle LeVar
Model output: 1-1 1-2 2-5
Here, 1-1 represents the first word of the first utterance (Call), 1-2 the second word of the first utterance (Uncle), and 2-5 the fifth word of the second utterance, LeVar in this example. The model thus replaces “of R” with “LeVar”, keeping the verb “call”. Some words can occur in both sentences, such as “Uncle”; in that case we select the first-turn word as the target during training. Figure 1 depicts a diagram of the proposed model. Each component is discussed in detail in the following subsections.
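To make the pointer format concrete, the following minimal sketch resolves a pointer sequence such as “1-1 1-2 2-5” against the two utterances (the helper name and whitespace tokenisation are illustrative, not part of the model):

```python
def resolve_pointers(pointers, first_turn, second_turn):
    """Map pointer tokens of the form 'u-i' (utterance u, 1-indexed word i)
    onto the words of the two input utterances."""
    turns = {1: first_turn.split(), 2: second_turn.split()}
    words = []
    for p in pointers.split():
        turn, idx = (int(x) for x in p.split("-"))
        words.append(turns[turn][idx - 1])  # pointers are 1-indexed
    return " ".join(words)

rewrite = resolve_pointers("1-1 1-2 2-5",
                           "Call Uncle of R",
                           "No, I said Uncle LeVar")
# rewrite == "Call Uncle LeVar"
```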
We encode each utterance into a sequence of context-aware word vectors. We embed the words into ANEs. The ANE model is an LSTM that takes in the sequence of graphemes that make up the word and produces an embedding vector [jeon2020acoustic]:

$\mathbf{e}_w = \text{ANE}(g_1, \dots, g_N)$

where $\mathbf{e}_w$ is the ANE of word $w$ and $(g_1, \dots, g_N)$ is its grapheme sequence. Finally, we pass the ANEs through a bidirectional LSTM (BLSTM) [hochreiter1997long] to create context-aware word representations.
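The key property of ANEs, that similar-sounding words lie close together, can be illustrated with the toy sketch below. The embedding vectors are made up for illustration; real ANEs are produced by the trained grapheme LSTM and have far higher dimensionality.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical 3-d embeddings for illustration only.
ane = {
    "levar": [0.9, 0.1, 0.3],
    "of r":  [0.8, 0.2, 0.35],   # acoustically close to "levar"
    "call":  [-0.5, 0.7, 0.1],
}

# "of r" is the nearest neighbour of "levar" in this toy space, so a model
# consuming these embeddings can infer the ASR may have confused the two.
assert cosine(ane["levar"], ane["of r"]) > cosine(ane["levar"], ane["call"])
```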
A schematic of one step of the 2SA decoder is found in Figure 2. The 2SA decoder is a modified pointer network [vinyals2015pointer] and takes the following inputs:
$\mathbf{w}_{t-1}$: The word from the previous decoder step. This vector is taken out of the word-level representation discussed in Section 2.1.
$\mathbf{s}_{t-1}$: The decoder LSTM state from the previous decoder step.
$\mathbf{c}^{(1)}_{t-1}$: The first utterance attention context from the previous decoder step.
$\mathbf{c}^{(2)}_{t-1}$: The second utterance attention context from the previous decoder step.
The output of the decoder step is the next word of the output sequence, which is chosen from the words in the two input utterances.
First, we feed all the inputs into the LSTM to obtain the updated decoder LSTM state $\mathbf{s}_t$. This updated state is used to query the first utterance using an attention mechanism [bahdanau2014neural]. The attention mechanism gives a vector $\boldsymbol{\alpha}^{(1)}_t$ that contains a weight for each word in the first utterance. These attention weights are used to compute the context vector $\mathbf{c}^{(1)}_t$, which is a weighted average of the context-aware word vectors using $\boldsymbol{\alpha}^{(1)}_t$ as weights. We then query the second utterance using the context from the first utterance and the LSTM state in the same way to get the attention weights $\boldsymbol{\alpha}^{(2)}_t$ and second utterance context vector $\mathbf{c}^{(2)}_t$.
The intuition of the above is as follows: Both attention mechanisms look into an utterance for a candidate word to output next. We start by looking into the first utterance, because the corrected transcription is expected to be similar to the first utterance, so it is relatively easy to know which part to attend to next. Once we have a candidate word from the first utterance, we look into the second utterance for a similar sounding word that could act as a replacement.
Once the attention mechanisms have found a candidate for each utterance, the selector chooses which word to output. The selector receives the LSTM state and both attention contexts as input and outputs three probabilities:
$p_1$: Probability that the candidate from the first utterance is correct
$p_2$: Probability that the candidate from the second utterance is correct
$p_{\text{end}}$: Probability of ending the sequence
We then create a probability distribution over all words in both utterances and the end-of-sequence label by multiplying each utterance probability with the respective attention weights:

$P_t = \left[\, p_1 \boldsymbol{\alpha}^{(1)}_t \,;\; p_2 \boldsymbol{\alpha}^{(2)}_t \,;\; p_{\text{end}} \,\right]$

where $P_t$ is the probability distribution. We then select the next word from this distribution.
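The full decoder step can be sketched as follows. This is a schematic, dependency-free illustration only: for brevity it uses dot-product attention and takes the selector probabilities as given, whereas the actual model uses additive (Bahdanau-style) attention and learned layers throughout.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attend(query, keys):
    """Return attention weights over `keys` and the weighted-average context."""
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(dim)]
    return weights, context

def decoder_step(state, turn1_vecs, turn2_vecs, selector_probs):
    # 1) query the first utterance with the decoder state
    a1, c1 = attend(state, turn1_vecs)
    # 2) query the second utterance with the state plus first-turn context
    query2 = [s + c for s, c in zip(state, c1)]
    a2, c2 = attend(query2, turn2_vecs)
    # 3) the selector splits probability mass between turn 1, turn 2 and
    #    end-of-sequence; the combined output is a distribution over all
    #    words in both turns plus the end label
    p1, p2, p_end = selector_probs
    return [p1 * w for w in a1] + [p2 * w for w in a2] + [p_end]
```

Because each attention weight vector sums to one and the selector probabilities sum to one, the combined output is itself a valid probability distribution.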
During training we use the ground truth output sequences as inputs to the decoder. For inference we do beam search with beam width 3 to find the most likely output sequence, where we keep generating words until an end-of-sequence token is generated. We also enforce a strict left-to-right attention policy in both attention mechanisms. This means that each attention mechanism can only attend to words later in the utterance than the last word selected from that utterance.
For a virtual assistant, the vast majority of interactions are not repetition-based recoveries. In these cases, the model should not rewrite the original query. To determine whether a rewrite is necessary, we add a classifier that classifies a rewrite as either necessary or not necessary. In this paper, we compute a phonetic weighted Levenshtein distance between the original request and the rewrite. The costs of the edits in the Levenshtein distance are based on a phonetic edit distance computed from the confusability of phones in the ASR. If this distance is lower than a threshold, the rewrite is deemed necessary.
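The classifier can be sketched as a standard weighted edit-distance dynamic program over phone sequences. The cost table and threshold below are illustrative only; in the actual system, substitution costs come from ASR phone confusability and the threshold is tuned.

```python
def weighted_levenshtein(a, b, sub_cost, ins_del_cost=1.0):
    """Dynamic-programming edit distance with per-pair substitution costs."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * ins_del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost(a[i - 1], b[j - 1])
            d[i][j] = min(d[i - 1][j] + ins_del_cost,   # deletion
                          d[i][j - 1] + ins_del_cost,   # insertion
                          d[i - 1][j - 1] + sub)        # substitution
    return d[n][m]

# Hypothetical confusability costs: acoustically close phones are cheap to swap.
CLOSE_PHONES = {("f", "v"), ("v", "f"), ("ah", "er"), ("er", "ah")}

def phone_cost(p, q):
    return 0.3 if (p, q) in CLOSE_PHONES else 1.0

def rewrite_needed(original_phones, rewrite_phones, threshold=2.0):
    """Deem the rewrite necessary if it is phonetically close to the original."""
    return weighted_levenshtein(original_phones, rewrite_phones,
                                phone_cost) < threshold
```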
3 Training data generation
To train the model, we synthetically generate repetition-based recovery interactions based on a transcribed speech dataset. We first run all the utterances in the dataset through the ASR to get ASR transcriptions. For the cases where the ASR transcription contains an error, we find the substitution the ASR made and generate a repetition-based recovery. In the example in Section 2, the ASR made the substitution “LeVar” → “of R”. Using this substitution, a repetition-based recovery is generated using the following heuristics:
A random number of words to the left and right of the error are included in the recovery. For 85% of cases, this is 0 words, for 10% this is one word and for 5% this is 2 words.
For 10% of repetition-based recoveries, a prefix is included. This prefix is randomly selected from a list of common prefixes, like “No, I said”.
With these heuristics, some of the repetition-based recoveries that can be generated for the above example are:
LeVar
Uncle LeVar
No, I said Uncle LeVar
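The generation heuristics above can be sketched as a small sampling procedure. The prefix list is an example; the context-width and prefix probabilities follow the text.

```python
import random

# Example prefixes; the actual list of common prefixes is not specified here.
PREFIXES = ["No, I said", "I said", "No,"]

def generate_recovery(ref_words, err_start, err_end, rng):
    """Generate a synthetic correction follow-up.
    ref_words[err_start:err_end] is the reference span the ASR substituted."""
    # 0 context words 85% of the time, 1 word 10%, 2 words 5%
    n_context = rng.choices([0, 1, 2], weights=[0.85, 0.10, 0.05])[0]
    left = max(0, err_start - n_context)
    right = min(len(ref_words), err_end + n_context)
    recovery = ref_words[left:right]
    # include a prefix for 10% of recoveries
    if rng.random() < 0.10:
        recovery = rng.choice(PREFIXES).split() + recovery
    return " ".join(recovery)

rng = random.Random(0)
ref = "Call Uncle LeVar".split()
# the ASR substituted "LeVar" (index 2) with "of R"
sample = generate_recovery(ref, 2, 3, rng)
```

Every generated recovery contains the misrecognized span itself, with optional context and prefix around it.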
The model is trained using the ASR transcription and generated repetition-based recovery as input and the reference transcription as the target rewrite.
We use two baselines to compare the results of our proposed model: one rule-based model based on phonetic alignments and one more standard pointer network similar to our proposed model.
The rule-based model generates a list of all possible n-gram pairs across the two utterances as potential replacements. We compute a phonetic edit distance for all pairs and generate the corrected transcription by replacing the n-gram in the first utterance with the n-gram from the correction follow-up. We compute the phonetic edit distance using a phone confusion matrix and normalise that distance by dividing by the number of phones in the first-utterance n-gram.
For the example given in Section 2, the replacement with the smallest phonetic edit distance would be for replacing “Uncle of R” with “Uncle LeVar”.
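A simplified sketch of this baseline is shown below. A plain character-level edit distance stands in for the phonetic distance (the actual baseline uses a phone confusion matrix), normalisation is by the source n-gram's character length rather than its phone count, and identical n-gram pairs are skipped here since replacing an n-gram with itself changes nothing.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

def best_rewrite(first, second, max_n=3):
    """Try every (first-turn n-gram, second-turn n-gram) replacement and
    return the rewrite with the smallest normalised distance."""
    w1, w2 = first.split(), second.split()
    best = (float("inf"), first)
    for i in range(len(w1)):
        for j in range(i + 1, min(i + max_n, len(w1)) + 1):
            src = " ".join(w1[i:j])
            for k in range(len(w2)):
                for l in range(k + 1, min(k + max_n, len(w2)) + 1):
                    cand = " ".join(w2[k:l])
                    if src == cand:
                        continue  # identity replacement changes nothing
                    score = edit_distance(src, cand) / max(len(src), 1)
                    if score < best[0]:
                        best = (score, " ".join(w1[:i] + w2[k:l] + w1[j:]))
    return best[1]
```

On a pair like “call ankle” / “call uncle”, the cheapest replacement swaps the full bigram, yielding “call uncle”. Note that with a character-level proxy distance the ranking of candidate replacements can differ from the phonetic ranking used in the paper.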
Furthermore, we implement an unmodified pointer network [vinyals2015pointer] to compare with our proposed 2SA pointer network. We use the same ANE vectors as input instead of word embeddings.
Both neural network models are trained on an internally collected dataset as described in Section 3. Using the transcribed first turn, we generate the second turn and the target rewrite. This leads to about 350k turn pairs. For evaluation, we employ a fully annotated test set with 9.2k turn pairs. The annotators transcribe both turns and indicate whether the user intended a correction in the second turn.
We compute the word error rate reduction (WERR) by comparing the WER of the original first turn utterances with the WER of the rewrites.
We measure WERR only on turn pairs that have been annotated as corrections, as a rewrite is not required for the others. We additionally compute false alarm rates (FAR) at a range of classifier thresholds. FAR is identical to the false positive rate, i.e. the proportion of unnecessary rewrites.
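The WERR computation can be sketched as follows: WER is the word-level edit distance between reference and hypothesis divided by the reference length, aggregated over the test set, and WERR is taken here as the relative reduction from the original transcriptions to the rewrites (the exact aggregation is an assumption of this sketch).

```python
def word_errors(ref, hyp):
    """Word-level edit distance and reference length for one utterance pair."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[-1][-1], len(r)

def werr(pairs_original, pairs_rewrite):
    """Relative WER reduction; each argument is a list of (ref, hyp) pairs."""
    def wer(pairs):
        errs, words = map(sum, zip(*[word_errors(r, h) for r, h in pairs]))
        return errs / words
    wer_orig = wer(pairs_original)
    return (wer_orig - wer(pairs_rewrite)) / wer_orig

ref = "call uncle levar"
original = [(ref, "call uncle of r")]     # 2 word errors out of 3 ref words
rewritten = [(ref, "call uncle levar")]   # rewrite removes all errors
# werr(original, rewritten) is 1.0, i.e. a 100% relative reduction
```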
5.3 Training setup
The baseline pointer network has one encoder and one decoder BLSTM layer with a hidden size of 128. We train it with Adam, a learning rate of 0.0003 and a batch size of 32. Our proposed model employs two encoders (one for each turn) and one decoder with the same dimensions, but with the modified 2-step attention mechanism. We train it with Adam, a learning rate of 0.0001 and a batch size of 128. We use pretrained ANEs that are kept fixed during training.
While the 2SA model has twice as many encoder parameters, each encoder sees only half of the data, as it receives only its respective turn. For the baseline pointer network, we concatenate both turns into a single input sequence to generate a rewrite.
The rule-based baseline achieves a higher maximum WERR, but at a higher FAR. Both machine-learned models reach a high WERR early on before tapering off. Comparing the pointer network with the 2SA network, the pointer network has a lower FAR for the same WERR, but a lower maximum WERR overall, showing a FAR/WERR trade-off between the two model approaches.
We note that the rule-based rewriter applies the same edit-distance-minimising rules to positive and negative pairs, while our classifier uses edit distance to score the rewrite. Given the imbalance of our evaluation dataset and of real-life traffic, this requires a low classifier threshold (implying a high FAR) to be feasible. The machine-learned models, in contrast, produce a meaningful rewrite when it makes sense to do so and output essentially random rewrites otherwise; these random rewrites have a high edit distance and are filtered out by the classifier.
The 2-Step Attention pointer network achieves a WERR of 25.80% at a FAR of 6.77%.
7 Future Work
The user can try to help correct errors by providing extra information to the system. For example, the user might describe an entity that was misrecognized. This is not captured by pure query rewrite. In future work, one could investigate modifying the task to incorporate this extra information.
The proposed system cannot correct the error if the ASR repeats the error in the correction followup. However, it is possible that the correct hypothesis is present in the n-best list of the speech recognizer. By combining information from both utterances, it might be possible to surface the correct hypothesis.
We propose a system that allows the user to correct recognition errors through repetition. The model handles this repetition-based recovery by rewriting the original query to incorporate the follow-up correction. The rewritten query is then scored with an edit distance based classifier for thresholding. We propose a 2-Step Attention pointer network and show that it outperforms both a standard pointer network and a rule-based baseline. We propose a way to generate synthetic training data using only transcribed data and an ASR system to train our machine learned models.