Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator

05/18/2022, by Guangzhi Sun, et al.

Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words. This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words obtained using external contextual information. With only a small overhead in memory use and computation cost, TCPGen structures thousands of biasing words efficiently into a symbolic prefix tree and creates a neural shortcut between the tree and the final ASR output to facilitate the recognition of the biasing words. To enhance TCPGen, we further propose a novel minimum biasing word error (MBWE) loss that directly optimises biasing word errors during training, along with a biasing-word-driven language model discounting (BLMD) method during testing. All contextual ASR systems were evaluated on the public Librispeech audiobook corpus and on data from the dialogue state tracking challenges (DSTC), with the biasing lists extracted from the dialogue-system ontology. Consistent word error rate (WER) reductions were achieved with TCPGen, which were particularly significant on the biasing words, with around 40% relative reductions in the recognition error rates. MBWE and BLMD further improved the effectiveness of TCPGen and achieved more significant WER reductions on the biasing words. TCPGen also achieved zero-shot learning of words not in the audio training set, with large WER reductions on the out-of-vocabulary words in the biasing list.


I Introduction

End-to-end ASR systems often suffer from high recognition errors on long-tailed words that are rare or not included in the training set. Contextual biasing integrates external contextual knowledge into ASR systems at test-time, which plays an increasingly important role in addressing the long-tail word problem in many applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 12, 11, 14]. Contextual knowledge is often represented as a list (referred to as a biasing list) of words or phrases (referred to as biasing words) that are likely to appear in a given context. There exist a variety of resources from which biasing lists can be extracted or organised, such as a user’s contact book or playlist, recently visited websites and the ontology of a dialogue system etc. Although biasing words occur infrequently and hence may only have a small impact on the overall word error rate (WER), they are mostly content words, such as nouns or proper nouns, and are thus particularly important to downstream tasks and highly valuable. A word is more likely to be recognised if it is incorporated in the biasing list, which makes contextual biasing critical to the correct recognition of those rare content words in an end-to-end ASR system.

As end-to-end ASR systems [18, 26] are often designed to incorporate all of the required knowledge into a single static model, it is particularly challenging for such systems to integrate contextual knowledge specific to a dynamic test-time context. Therefore, dedicated contextual biasing approaches have been proposed, including shallow fusion (SF) with a special weighted finite-state transducer (WFST) or a language model (LM) adapted for the contextual knowledge [1, 2, 3, 14, 12, 11], attention-based deep context approaches [4, 5, 6, 7, 8], as well as deep biasing (DB) with a prefix tree for improved efficiency when dealing with large biasing lists [9, 10]. More recently, contextual biasing components with a neural shortcut that directly modifies the output distribution have been proposed [63, 64], which can be jointly optimised with the ASR system.

In this paper, a tree-constrained pointer generator (TCPGen) component is proposed and developed for end-to-end contextual speech recognition, extending our work in [63]. TCPGen directly interpolates the original model distribution with an extra distribution (the TCPGen distribution) estimated from contextual knowledge, based on a dynamic interpolation weight predicted by the TCPGen component. TCPGen creates a neural shortcut between biasing lists and the final ASR output distribution to allow end-to-end training of a single neural network model. In contrast to the original work on pointer generators [15, 16, 17], which relies on the attention mechanism to attend to all biasing words, TCPGen represents biasing lists as wordpiece-level symbolic prefix trees and only attends to the valid subset of the biasing words at each time step in decoding, and can thus handle large biasing lists with high efficiency. Furthermore, TCPGen also uses wordpieces instead of whole words for pointer generators, which allows the entire system to use wordpieces to address the out-of-vocabulary (OOV) issue. Therefore, as not only a few frequent words but also OOV words exist in our biasing lists, TCPGen can be viewed as a structure that achieves zero-shot learning of previously unseen words without changing model parameters. As a result, TCPGen combines the advantages of both neural and symbolic methods and improves pointer generators for ASR applications with large biasing lists.

To further improve the effectiveness of contextual biasing with TCPGen, a minimum biasing word error (MBWE) loss is proposed to directly optimise the model performance on the biasing words. By changing the risk function of the widely used minimum word error (MWE) loss [40, 41, 42, 43, 44], the proposed MBWE loss has a greater focus on minimising the expected errors of the rare and OOV words in the biasing list associated with that utterance. An efficient beam search algorithm is proposed for MBWE training in RNN-T by limiting the number of output wordpieces at each encoder step to one [65, 66]. Moreover, to address the issue that end-to-end models often suffer from the internal LM effect that biases towards common words, a biasing-driven LM discounting (BLMD) method is proposed. The BLMD method extends the density ratio-based LM discounting method [49] to incorporate an additional discounting factor for the TCPGen distribution before interpolation.

In this paper, TCPGen, a generic component for end-to-end ASR systems, is integrated into both attention-based encoder-decoder (AED) [18, 19, 20, 21, 22, 38] and recurrent neural network transducer (RNN-T) [26, 27, 28, 29, 30] models. MBWE and BLMD methods are also applied to both types of end-to-end models in conjunction with TCPGen. Experiments were performed on two different types of data, including the Librispeech audiobook data and goal-oriented dialogue data. Improvements in word error rate (WER) were achieved by using TCPGen in both AED and RNN-T configurations across all test sets compared to the baseline and the DB method, with a significant WER reduction on the biasing words.

The remainder of this paper is organised as follows. Sec. II reviews related work. Sec. III introduces the TCPGen component. Secs. IV and V describe MBWE training and BLMD for TCPGen respectively. Secs. VI and VII present the experimental setup and results. Sec. VIII gives the conclusions.

II Related work

II-A End-to-end contextual speech recognition

Various contextual biasing algorithms have recently been developed for end-to-end ASR. One of the major streams of research in this area focuses on representing biasing lists as an extra WFST which is incorporated into a class-based LM via SF [1, 2, 3, 11]. Such methods usually rely on special context prefixes like “call” or “play”, which limits their flexibility in handling more diverse grammars in natural speech. Neural network-based deep context approaches using attention mechanisms have also been proposed. These approaches encode a biasing list into a vector to use as a part of the input to the end-to-end ASR models [4, 5, 6, 7, 8]. Although the deep context approaches address the dependency on syntactic prefixes in the SF methods, they are more memory intensive and less effective for handling large biasing lists.

Work in [9] jointly adopted the use of deep context and the SF of a WFST in an RNN-T. It also improved the efficiency by extracting the biasing vector from only a subset of wordpieces constrained by a prefix tree representing the biasing list, which is referred to as deep biasing (DB) in this paper. Work in [10] extended the prefix-tree-based method to RNN LMs used for SF to achieve further improvements on biasing words. Moreover, while previous studies only focused on industry datasets, researchers in [10] proposed and justified a simulation of the contextual biasing task on open-source data by adding a large number of distractors to the list of biasing words in an utterance. More recently, [63, 64] simultaneously proposed creating a neural shortcut between the biasing list and the final model output distribution.

II-B MWE training for end-to-end ASR models

The MWE loss, which directly minimises the expected WER [40], has become increasingly popular in training end-to-end ASR models [41, 42, 43, 44, 45, 46, 47]. MWE training using the N-best hypotheses to approximate the expected word errors was proposed in [41] for the AED model, which was then improved in [44]. More recently, work in [46] applied LM fusion and internal LM estimation during MWE training of an AED model to improve the N-best approximation. Work in [45] exploited a lattice structure in place of the N-best list to calculate the expected word errors. For RNN-T models, [42] applied the same N-best approximation as in AED to calculate the expected errors. Work in [43] further improved MWE training by relating the gradient calculation for each alignment of a hypothesis in the N-best list to the original RNN-T loss function, which enabled offline decoding of N-best lists when abundant CPU resources were available. Moreover, work in [48] first discussed the necessity of LM discounting in sequence discriminative training for HMM-based large-vocabulary continuous speech recognition, and [47] applied MWE to the hybrid auto-regressive transducer (HAT) [53].

II-C LM discounting

Various LM discounting algorithms have recently been proposed to minimise the internal LM effect of end-to-end ASR models, in particular when applying the model to a test set in a different domain from the training data. In [49], a density ratio method was introduced to estimate the score from a source-domain external LM that is to be subtracted from the target-domain LM score. Later, the HAT was proposed to preserve the modularity of a hybrid system and allowed internal LM scores to be estimated and discounted [53]. An internal LM estimation method was proposed in [54] to estimate the source-domain LM score directly from the end-to-end ASR model. This method was further improved by performing internal LM training [55] in order to better estimate the internal LM score. Moreover, [56] proposed regularising the internal LM in RNN-T training to avoid over-fitting to text priors.

III Tree-constrained pointer generator

TCPGen is a neural network-based component combining symbolic prefix-tree search with a neural pointer generator for contextual biasing, which also enables end-to-end optimisation. TCPGen represents the biasing list as a wordpiece-level prefix tree. At each output step, TCPGen calculates a distribution over all valid wordpieces constrained by the prefix tree. TCPGen also predicts a generation probability which indicates how much contextual biasing is needed at a specific output step. The final output distribution is the weighted sum of the TCPGen distribution and the original AED or RNN-T output distribution (Fig. 1).

Fig. 1: Illustration of interpolation in TCPGen with corresponding terms in Eqn. (3). $P^{\text{ptr}}(y_i)$ is the TCPGen distribution. $P^{\text{mdl}}(y_i)$ is the distribution from a standard end-to-end model. $P(y_i)$ is the final output distribution. $\hat{P}^{\text{gen}}_i$ and $P^{\text{gen}}_i$ are the scaled and unscaled generation probabilities.

The key symbolic representation of the external contextual knowledge in TCPGen is the prefix tree. For simplicity, examples and equations in this section are presented for a specific search path, which can be generalised easily to beam search with multiple paths. In the example prefix tree with three biasing words shown in Fig. 2, if the previously decoded wordpiece is Tur, the wordpieces in_ and n form the set of valid wordpieces, denoted $\mathcal{Y}^{\text{tree}}_i$.

Fig. 2: An example of prefix tree search and attention in TCPGen. With previous output Tur, in_ and n are two valid wordpieces on which attention will be performed. A word end unit is denoted by _.
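To make the prefix-tree lookup concrete, the following Python sketch builds a wordpiece-level prefix tree and returns the valid subset of wordpieces after a partially decoded biasing word. It is a minimal illustration rather than the authors' implementation; the example biasing words and their wordpiece segmentations (e.g. "Turner" split as "Tur", "n", "er_") are assumptions made for this example.

```python
# Minimal sketch of a wordpiece-level prefix tree for contextual biasing.
# The wordpiece segmentations below are illustrative assumptions, not the
# segmentation used in the paper.

class TrieNode:
    def __init__(self):
        self.children = {}      # wordpiece -> TrieNode
        self.is_word_end = False

def build_prefix_tree(biasing_words, tokenize):
    """Build a wordpiece-level prefix tree from a list of biasing words."""
    root = TrieNode()
    for word in biasing_words:
        node = root
        for wp in tokenize(word):
            node = node.children.setdefault(wp, TrieNode())
        node.is_word_end = True
    return root

def valid_wordpieces(node):
    """Return the set of wordpieces that can extend the current tree node."""
    return set(node.children.keys())

# Hypothetical segmentations for the biasing words illustrated in Fig. 2.
segmentations = {"Turner": ["Tur", "n", "er_"],
                 "Turin": ["Tur", "in_"],
                 "Vignette": ["Vig", "nette_"]}
tree = build_prefix_tree(segmentations.keys(), lambda w: segmentations[w])

# After decoding the wordpiece "Tur", the valid subset contains "in_" and "n".
node = tree.children["Tur"]
print(valid_wordpieces(node))   # e.g. {"in_", "n"} (set order may vary)
```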

Denoting $\mathbf{x}_{1:T}$ and $y_i$ as the input acoustic features and the output wordpiece, $\mathbf{q}_i$ as the query vector carrying the decoding history and acoustic information, and $\mathbf{K}$ as the matrix of key vectors, a scaled dot-product attention is performed between $\mathbf{q}_i$ and $\mathbf{K}$ to compute the TCPGen distribution $P^{\text{ptr}}(y_i)$ and an output vector $\mathbf{h}^{\text{ptr}}_i$, as shown in Eqns. (1) and (2).

$P^{\text{ptr}}(y_i\,|\,y_{1:i-1}, \mathbf{x}_{1:T}) = \text{Softmax}\big(\text{Mask}(\mathbf{q}_i\mathbf{K}^{\mathsf T}/\sqrt{d})\big)$   (1)
$\mathbf{h}^{\text{ptr}}_i = \sum_{j} P^{\text{ptr}}(y_i = j\,|\,y_{1:i-1}, \mathbf{x}_{1:T})\,\mathbf{v}_j$   (2)

where $d$ is the size of $\mathbf{k}_j$ (see [62]), Mask($\cdot$) sets the probabilities of wordpieces that are not in $\mathcal{Y}^{\text{tree}}_i$ to zero, and $\mathbf{v}_j$ is the value vector relevant to wordpiece $j$. For more flexibility, an out-of-list (OOL) token is included in $\mathcal{Y}^{\text{tree}}_i$ indicating that no suitable wordpiece can be found in the set of valid wordpieces. To ensure that the final distribution sums up to 1, the generation probability $P^{\text{gen}}_i$ is scaled as $\hat{P}^{\text{gen}}_i = P^{\text{gen}}_i\,(1 - P^{\text{ptr}}_i(\text{OOL}))$, and the final output can be calculated as shown in Eqn. (3).

$P(y_i) = P^{\text{mdl}}(y_i)\,(1 - \hat{P}^{\text{gen}}_i) + P^{\text{ptr}}(y_i)\,P^{\text{gen}}_i$   (3)

where the conditions, $y_{1:i-1}$ and $\mathbf{x}_{1:T}$, are omitted for clarity. $P^{\text{mdl}}(y_i)$ represents the output distribution from the standard end-to-end model, and $P^{\text{gen}}_i$ is the generation probability.
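As a concrete illustration of Eqns. (1)-(3), the following Python sketch computes one TCPGen interpolation step. The tensor layout, the function name, and the convention that the model distribution carries a zero entry at the OOL position are assumptions made for this sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def tcpgen_step(q, K, V, valid_mask, p_mdl, p_gen, ool_index):
    """One decoding step of TCPGen interpolation (sketch of Eqns. (1)-(3)).

    q:          (d,)      query vector for this step
    K, V:       (n_wp, d) key / value vectors for all wordpieces (incl. the OOL slot)
    valid_mask: (n_wp,)   bool, True for wordpieces in the valid subset plus OOL
    p_mdl:      (n_wp,)   standard model distribution, assumed zero at the OOL slot
    p_gen:      ()        unscaled generation probability P^gen
    ool_index:  int       position of the out-of-list (OOL) token
    """
    d = q.shape[-1]
    scores = (K @ q) / d ** 0.5                         # scaled dot-product
    scores = scores.masked_fill(~valid_mask, float("-inf"))
    p_ptr = F.softmax(scores, dim=-1)                   # TCPGen distribution, Eqn. (1)
    h_ptr = p_ptr @ V                                   # TCPGen output vector, Eqn. (2)

    p_gen_scaled = p_gen * (1.0 - p_ptr[ool_index])     # scale by the non-OOL mass
    p_ptr_words = p_ptr.clone()
    p_ptr_words[ool_index] = 0.0                        # OOL mass is covered by the model
    p_out = (1.0 - p_gen_scaled) * p_mdl + p_gen * p_ptr_words   # Eqn. (3)
    return p_out, h_ptr
```

With this combination the output sums to one: the TCPGen term contributes a mass of $P^{\text{gen}}_i(1 - P^{\text{ptr}}_i(\text{OOL})) = \hat{P}^{\text{gen}}_i$, which exactly matches the mass removed from the model distribution.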

III-A TCPGen in AED

A standard AED contains three components: an encoder, a decoder and an attention network. The encoder encodes the input, $\mathbf{x}_{1:T}$, into a sequence of high-level features, $\mathbf{h}^{\text{enc}}_{1:T}$. At each decoding step $i$, an attention mechanism is first used to combine the encoder output sequence into a single context vector, $\mathbf{c}_i$. The decoder computation is thus as follows:

$\mathbf{d}_i = \text{Decoder}(\mathbf{d}_{i-1}, \mathbf{y}^{\text{emb}}_{i-1}, \mathbf{c}_i)$   (4)

where Decoder denotes the decoder network and $\mathbf{y}^{\text{emb}}_{i-1}$ is the embedding of the preceding wordpiece. The model output can be calculated using a Softmax layer taking $\mathbf{d}_i$ as input.

To calculate the TCPGen distribution in AED, the query combines the context vector and the previously decoded token embedding as shown in Eqn. (5).

$\mathbf{q}_i = W^{q}\,\mathbf{c}_i + W^{e}\,\mathbf{y}^{\text{emb}}_{i-1}$   (5)

where $W^{q}$ and $W^{e}$ are parameter matrices. The keys and values are computed from the decoder wordpiece embedding matrix as shown in Eqn. (6).

$\mathbf{k}_j = W^{K}\,\mathbf{E}_j^{\mathsf T}, \qquad \mathbf{v}_j = W^{V}\,\mathbf{E}_j^{\mathsf T}$   (6)

where $\mathbf{E}_j$ denotes the $j$-th row of the embedding matrix, and $W^{K}$ and $W^{V}$ are key and value parameter matrices which are shared throughout this paper. The TCPGen distribution $P^{\text{ptr}}(y_i)$ and the TCPGen output $\mathbf{h}^{\text{ptr}}_i$ can be computed using Eqns. (1) and (2) respectively. The generation probability is calculated from the decoder hidden state $\mathbf{d}_i$ and the TCPGen output $\mathbf{h}^{\text{ptr}}_i$, as shown in Eqn. (7).

$P^{\text{gen}}_i = \sigma\big(W^{g}\,[\mathbf{d}_i\,;\,\mathbf{h}^{\text{ptr}}_i]\big)$   (7)

where $W^{g}$ is a parameter matrix and $\sigma(\cdot)$ is the sigmoid function. The distribution of $y_i$ can be calculated using Eqn. (3). In AED, deep biasing can be applied as shown in Eqn. (8).

$P^{\text{mdl}}(y_i\,|\,y_{1:i-1}, \mathbf{x}_{1:T}) = \text{Softmax}\big(W^{o}\,[\mathbf{d}_i\,;\,\mathbf{h}^{\text{b}}_i]\big)$   (8)

where $W^{o}$ is the output layer parameter matrix and the biasing vector $\mathbf{h}^{\text{b}}_i$ is obtained from the sum of embeddings of all wordpieces in $\mathcal{Y}^{\text{tree}}_i$, similar to [9].

III-B TCPGen in RNN-T

A standard RNN-T consists of an encoder, a predictor and a joint network. The encoder in RNN-T is similar to that in AED and outputs $\mathbf{h}^{\text{enc}}_{1:T}$. The predictor encodes all wordpieces in the history into a vector, $\mathbf{h}^{\text{pred}}_u$, analogous to an LM. Given encoder outputs and predictor outputs, the joint network determines the output distribution for each combination of encoder step $t$ and predictor step $u$ as shown in Eqns. (9) and (10).

$\mathbf{h}^{\text{joint}}_{t,u} = \tanh\big(W^{\text{enc}}\,\mathbf{h}^{\text{enc}}_t + W^{\text{pred}}\,\mathbf{h}^{\text{pred}}_u\big)$   (9)
$P^{\text{mdl}}(y_{t,u}) = \text{Softmax}\big(W^{\text{joint}}\,\mathbf{h}^{\text{joint}}_{t,u}\big)$   (10)

where $W^{\text{enc}}$, $W^{\text{pred}}$ and $W^{\text{joint}}$ are the parameter matrices of the joint network, $y_{t,u} \in \mathcal{Y} \cup \{\varnothing\}$, $\mathcal{Y}$ represents the set of all wordpieces and $\varnothing$ is the blank symbol.

In RNN-T, the query for the TCPGen distribution is computed for each combination of $t$ and $u$ as shown in Eqn. (11).

$\mathbf{q}_{t,u} = W^{q}\,\mathbf{h}^{\text{enc}}_t + W^{e}\,\mathbf{y}^{\text{emb}}_{u-1}$   (11)

where $\mathbf{y}^{\text{emb}}_{u-1}$ is the wordpiece embedding from the predictor. Key and value vectors are derived from the predictor embedding matrix. The generation probability for each $(t,u)$-pair is computed using the penultimate layer output of the joint network and the TCPGen output vector.

$P^{\text{gen}}_{t,u} = \sigma\big(W^{g}\,[\mathbf{h}^{\text{joint}}_{t,u}\,;\,\mathbf{h}^{\text{ptr}}_{t,u}]\big)$   (12)

As $\varnothing$ only exists in $P^{\text{mdl}}$, its value is directly copied to the final output distribution as shown in Eqn. (13),

$P(y_{t,u}) = \begin{cases} P^{\text{mdl}}(\varnothing\,|\,t,u) & \text{if } y_{t,u} = \varnothing \\ \big(1 - P^{\text{mdl}}(\varnothing\,|\,t,u)\big)\,\tilde{P}(y_{t,u}) & \text{otherwise} \end{cases}$   (13)

where $\tilde{P}(y_{t,u})$ is the interpolated probability for the output token from Eqn. (3), which is scaled by a factor of $\big(1 - P^{\text{mdl}}(\varnothing\,|\,t,u)\big)$ to ensure all probabilities sum up to 1. Moreover, whenever TCPGen is used in RNN-T, the biasing vector, $\mathbf{h}^{\text{b}}_{t,u}$, is always sent to the input of the joint network, which yielded the best results as discussed in [63]. As for AED, DB can be applied as described in [9].
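The blank handling in Eqn. (13) can be illustrated with a short sketch. The function name and tensor layout below are assumptions for this example, not the paper's code; the interpolated distribution is assumed to carry zero blank mass.

```python
import torch

def combine_with_blank(p_mdl, p_interp, blank_index):
    """Sketch of Eqn. (13): copy the blank probability, rescale the rest.

    p_mdl:    (n_out,) RNN-T output distribution over wordpieces plus blank
    p_interp: (n_out,) distribution interpolated as in Eqn. (3), with zero blank mass
    """
    p_blank = p_mdl[blank_index]
    p_out = (1.0 - p_blank) * p_interp      # rescale non-blank probabilities
    p_out[blank_index] = p_blank            # blank probability copied from the model
    return p_out
```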

IV MBWE training for TCPGen

The MWE loss in end-to-end ASR systems minimises the expected value of word errors across all possible output sequences of a given input. Denoting the output sequence as $\mathcal{Y}$ and the input sequence as $\mathcal{X}$ for convenience, the MWE loss can be written as Eqn. (14).

$\mathcal{L}_{\text{MWE}} = \sum_{\mathcal{Y}} P(\mathcal{Y}\,|\,\mathcal{X})\,\mathcal{R}(\mathcal{Y}, \mathcal{Y}^{*})$   (14)

where $\mathcal{Y}^{*}$ is the ground-truth output sequence and $\mathcal{R}(\mathcal{Y}, \mathcal{Y}^{*})$ is the risk function representing the number of word errors calculated using an edit-distance between each possible sequence $\mathcal{Y}$ and the ground-truth sequence $\mathcal{Y}^{*}$. The sum is performed over all possible sequences and $P(\mathcal{Y}\,|\,\mathcal{X})$ is the probability of a specific sequence calculated from the end-to-end ASR model output. As it is intractable to enumerate over all possible sequences and calculate their probabilities, a common practice which has been widely adopted in MWE training for end-to-end ASR systems [40, 41, 42, 43, 44, 45] is to use the N-best hypotheses to approximate the expected word errors, as shown in Eqn. (15).

$\mathcal{L}_{\text{MWE}} \approx \sum_{\mathcal{Y} \in \mathcal{H}} \hat{P}(\mathcal{Y}\,|\,\mathcal{X})\,\mathcal{E}(\mathcal{Y}, \mathcal{Y}^{*})$   (15)

where $\mathcal{H}$ is the set of N-best hypotheses obtained via beam search, and $\mathcal{E}(\cdot,\cdot)$ represents the edit-distance function. The probabilities of the N-best hypotheses are normalised to form a valid distribution, where each normalised probability is represented as $\hat{P}(\mathcal{Y}\,|\,\mathcal{X})$. Moreover, the average number of word errors over the N-best hypotheses is often subtracted from the number of word errors in each hypothesis as a form of variance reduction.

To apply the MWE loss to contextual ASR with TCPGen, which focuses on the correct recognition of biasing words, we propose a new risk function that adds a biasing word error term to the word error term, as shown in Eqn. (16).

$\mathcal{R}_{\text{b}}(\mathcal{Y}, \mathcal{Y}^{*}) = \alpha\,\mathcal{E}(\mathcal{Y}, \mathcal{Y}^{*}) + \beta\,\mathcal{E}\big(b(\mathcal{Y}), b(\mathcal{Y}^{*})\big)$   (16)

where $\mathcal{E}\big(b(\mathcal{Y}), b(\mathcal{Y}^{*})\big)$ is the additional biasing word error term, which is the edit-distance between the sequence of biasing words $b(\mathcal{Y})$ in $\mathcal{Y}$ and the sequence of biasing words $b(\mathcal{Y}^{*})$ in $\mathcal{Y}^{*}$. Scaling factors $\alpha$ and $\beta$ control the importance of the word error term and the new biasing word error term. As a result, if $\alpha = 1$ and $\beta = 0$, $\mathcal{R}_{\text{b}}$ reduces to the standard MWE risk. If $\alpha = 1$ and $\beta = 1$, it is equivalent to giving a weight of 2 to any rare word errors in the original word error rate. A new MBWE loss function is proposed to use $\mathcal{R}_{\text{b}}$ instead of $\mathcal{E}$. That is,

$\mathcal{L}_{\text{MBWE}} \approx \sum_{\mathcal{Y} \in \mathcal{H}} \hat{P}(\mathcal{Y}\,|\,\mathcal{X})\,\mathcal{R}_{\text{b}}(\mathcal{Y}, \mathcal{Y}^{*})$   (17)

As a generic loss for end-to-end ASR systems, MBWE can be applied to the standard AED and RNN-T models, as well as other deep context models.
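As an illustration of the risk function in Eqn. (16), a minimal Python sketch is given below. The word-level tokenisation and the helper names are assumptions made for this example; a real implementation would operate on the decoded hypotheses inside the training loop.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1]

def mbwe_risk(hyp, ref, biasing_list, alpha=1.0, beta=1.0):
    """Biasing-word-aware risk of Eqn. (16): alpha * word errors + beta * biasing word errors."""
    word_err = edit_distance(ref, hyp)
    bias_ref = [w for w in ref if w in biasing_list]
    bias_hyp = [w for w in hyp if w in biasing_list]
    bias_err = edit_distance(bias_ref, bias_hyp)
    return alpha * word_err + beta * bias_err
```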

IV-A MBWE training in AED

The MBWE loss can be applied to the AED model following a similar MWE training scheme to that proposed in [41], where the MBWE loss is interpolated with the cross-entropy (CE) loss for better training stability, as shown in Eqn. (18).

$\mathcal{L}_{\text{AED}} = \mathcal{L}_{\text{MBWE}} + \lambda\,\mathcal{L}_{\text{CE}}$   (18)

where $\mathcal{L}_{\text{AED}}$ is the total loss function to be minimised, $\mathcal{L}_{\text{MBWE}}$ is the proposed MBWE loss in Eqn. (17), $\mathcal{L}_{\text{CE}}$ is the CE loss and $\lambda$ is an interpolation coefficient. As it is hard to train a randomly initialised model with the MBWE loss, as with MWE, the MBWE loss is only applied from the epoch at which the model has been reasonably trained with the CE loss, which depends on the optimisation algorithm. Moreover, to boost the efficiency of beam search, which is the bottleneck in training time, batched beam search is implemented by parallelising the model forward computation of all beams of all utterances in the same mini-batch on a GPU.

IV-B MBWE training in RNN-T

The MBWE loss can also be applied to the RNN-T model following a similar MWE training scheme to that proposed in [42], except that the original RNN-T loss is also included for stable training, as shown in Eqn. (19).

$\mathcal{L}_{\text{RNN-T}} = \mathcal{L}_{\text{MBWE}} + \lambda\,\mathcal{L}_{\text{Transducer}}$   (19)

where $\mathcal{L}_{\text{Transducer}}$ represents the original RNN-T loss [26] and $\lambda$ is an interpolation coefficient. Although previous work [42, 43] tried to obtain N-best lists using standard beam search for RNN-T, it is infeasible to perform such a training scheme on a single GPU with a limited number of CPUs, even with batched decoding. The major obstacle that restricts the level of parallel computation is the unknown number of wordpiece tokens to output at a given encoder step, as beams requiring two or more output tokens have to be handled separately. However, as the encoder output sequence is usually longer than the number of output wordpiece tokens, cases where two or more tokens are output at a specific encoder step should be rare. To verify this conjecture, taking a standard RNN-T model trained on the Librispeech training set and decoded on its dev set as an example, the path taken and the number of output tokens at each encoder step for each 1-best hypothesis were recorded, as shown in Table I.

# Wordpiece tokens | 0   | 1   | 2+
Count              | 80% | 18% | 2%
TABLE I: Statistics of the Number of Output Wordpiece Tokens at Each Encoder Step for RNN-T 1-best Hypothesis on Librispeech Dev Set. Percentage is of the Total Number of Encoder Steps.

As shown in the table, the vast majority of encoder steps output 0 or 1 wordpiece token, where 0 means only a blank token is output. Although standard beam search always requires comparing the score of emitting a second output token at the same encoder step, this is often unnecessary for generating reasonably good N-best hypotheses, especially for MBWE training where a strong approximation using the N-best list has already been made. Therefore, a restricted beam search which only allows 0 or 1 output token at each encoder step is proposed for efficient MBWE training in RNN-T, which is similar to the one-step constrained beam search algorithm in [65] but with all neural network computations parallelised across all samples in the mini-batch. With restricted beam search, the model forward computation can be efficiently parallelised for all beams of all utterances in the same mini-batch on a single GPU.
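A simplified, single-utterance sketch of this restricted beam search is given below. The paper's version is batched on a GPU; the `joint_log_probs` interface here is a hypothetical stand-in for the RNN-T forward computation, and hypothesis merging is omitted for brevity.

```python
def restricted_beam_search(enc_outputs, joint_log_probs, beam_size=5, blank=0):
    """Restricted RNN-T beam search: at most one non-blank token per encoder step.

    enc_outputs:     iterable over encoder frames (length T)
    joint_log_probs: callable(enc_frame, history) -> dict token -> log prob,
                     covering the blank token and all wordpieces (hypothetical interface)
    """
    beams = [([], 0.0)]                               # (history, log score)
    for enc_t in enc_outputs:
        candidates = []
        for history, score in beams:
            log_p = joint_log_probs(enc_t, history)
            # Option 1: emit blank only (0 output tokens at this encoder step).
            candidates.append((history, score + log_p[blank]))
            # Option 2: emit exactly one wordpiece, then blank to move to the next frame.
            for token, lp in log_p.items():
                if token == blank:
                    continue
                new_hist = history + [token]
                lp_blank = joint_log_probs(enc_t, new_hist)[blank]
                candidates.append((new_hist, score + lp + lp_blank))
        # Keep the best hypotheses; merging identical histories is omitted for brevity.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams
```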

V BLMD for TCPGen

To incorporate an external LM, SF is often performed for both AED and RNN-T models via log-linear interpolation. Define the source-domain data as the text of the training data for the end-to-end model, and the target-domain data as the data used to train an external LM such that it generates better probability estimates for the test data. Then, the recognised sequence for LM SF can be written as Eqn. (20).

$\hat{\mathcal{Y}} = \arg\max_{\mathcal{Y}} \big\{\log P(\mathcal{Y}\,|\,\mathcal{X}) + \gamma \log P_{\text{tgt}}(\mathcal{Y})\big\}$   (20)

where $P(\mathcal{Y}\,|\,\mathcal{X})$ is the output of the end-to-end system and $P_{\text{tgt}}(\mathcal{Y})$ is the probability of the output sequence predicted by an LM trained on the target domain. The interpolation factor $\gamma$ is a hyper-parameter to be determined. After decomposing the probability of each possible sequence into a token-level sequence, the probability of each output token after SF can be written as

$\log P_{\text{SF}}(y_i) = \log P(y_i) + \gamma \log P_{\text{tgt}}(y_i)$   (21)

where the conditions on acoustic and history information were omitted for clarity. The density ratio method provides a Bayes' rule-grounded way to reduce the effect of the internal LM in the end-to-end system, especially when there is a difference between the source and target domain data. That is,

$\log P_{\text{SF}}(y_i) = \log P(y_i) - \gamma_{\text{src}} \log P_{\text{src}}(y_i) + \gamma_{\text{tgt}} \log P_{\text{tgt}}(y_i)$   (22)

where $P_{\text{src}}(y_i)$ refers to the probability of the output sequence predicted by an LM trained on the source domain. The factors $\gamma_{\text{src}}$ and $\gamma_{\text{tgt}}$ are hyper-parameters. In this way, the probabilities of commonly seen text patterns in the source domain are penalised, whereas those of text patterns specific to the target domain are boosted. Therefore, the density ratio LM discounting method can also be applied to the TCPGen component to further improve its performance on the biasing words that are rare in the source domain. As the final distribution comes from both the model and the TCPGen distribution, which use different parameters and history information, density ratio SF is performed separately for the two distributions as

$P_{\text{BLMD}}(y_i) \propto \dfrac{P^{\text{mdl}}(y_i)\,P_{\text{tgt}}(y_i)^{\gamma^{\text{mdl}}_{\text{tgt}}}}{P_{\text{src}}(y_i)^{\gamma^{\text{mdl}}_{\text{src}}}}\,\big(1 - \hat{P}^{\text{gen}}_i\big) + \dfrac{P^{\text{ptr}}(y_i)\,P_{\text{tgt}}(y_i)^{\gamma^{\text{ptr}}_{\text{tgt}}}}{P_{\text{src}}(y_i)^{\gamma^{\text{ptr}}_{\text{src}}}}\,P^{\text{gen}}_i$   (23)

where $P^{\text{ptr}}(y_i)$ is the TCPGen distribution, and the same source and target LMs are used for both distributions, but with different sets of hyper-parameters $(\gamma^{\text{mdl}}_{\text{src}}, \gamma^{\text{mdl}}_{\text{tgt}})$ and $(\gamma^{\text{ptr}}_{\text{src}}, \gamma^{\text{ptr}}_{\text{tgt}})$. To avoid a complicated hyper-parameter search, the best set of $(\gamma^{\text{mdl}}_{\text{src}}, \gamma^{\text{mdl}}_{\text{tgt}})$ obtained from the standard end-to-end system can be directly applied to the model distribution, and only $(\gamma^{\text{ptr}}_{\text{src}}, \gamma^{\text{ptr}}_{\text{tgt}})$ tuned to find the best values for the TCPGen distribution.
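A sketch of how BLMD could be applied at each decoding step is shown below. The discounting factor values and the tensor interface are placeholders, and the score combination follows the reconstruction in Eqn. (23); it is a hedged sketch rather than a verified implementation.

```python
import torch

def blmd_scores(p_mdl, p_ptr, p_gen_scaled, p_gen, p_src_lm, p_tgt_lm,
                mdl_factors=(0.3, 0.3), ptr_factors=(0.6, 0.6)):
    """Sketch of biasing-driven LM discounting per Eqn. (23); factor values are placeholders.

    All distribution arguments are (vocab,) tensors of per-token probabilities at the
    current decoding step. The source LM is discounted and the target LM boosted, with
    separate hyper-parameters for the model and the TCPGen distributions. The returned
    scores are unnormalised and intended only for beam-search ranking.
    """
    g_src_mdl, g_tgt_mdl = mdl_factors
    g_src_ptr, g_tgt_ptr = ptr_factors
    eps = 1e-10
    disc_mdl = p_mdl * (p_tgt_lm ** g_tgt_mdl) / (p_src_lm ** g_src_mdl + eps)
    disc_ptr = p_ptr * (p_tgt_lm ** g_tgt_ptr) / (p_src_lm ** g_src_ptr + eps)
    # Interpolate the two discounted distributions with the TCPGen generation
    # probabilities, mirroring Eqn. (3).
    score = (1.0 - p_gen_scaled) * disc_mdl + p_gen * disc_ptr
    return torch.log(score + eps)
```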

VI Experimental setup

VI-A Data

Experiments were performed on two different data sets, including the Librispeech audiobook corpus and the dialogue state-tracking challenge (DSTC) data. The Librispeech corpus [60] contains 960 hours of read English from audiobooks. The dev-clean and dev-other sets were held out for validation, and the test-clean and test-other sets were used for evaluation. Models trained on the Librispeech data were finetuned and tested on the DSTC data.

The DSTC data was included as a realistic application of TCPGen with a limited amount of audio training data, where the dialogue-system ontology was used to extract contextual knowledge. The DSTC data contains human-machine task-oriented dialogues, where the user-side input audio was used for recognition. The DSTC track2 train and dev sets were used as the training and validation sets, and the DSTC track3 test set was used for evaluation. The 80-dim FBANK features at a 10 ms frame rate, concatenated with 3-dim pitch features, were used as the model input. SpecAugment [57] was used, without any other data augmentation or speaker adaptation.

VI-B Biasing list selection

For Librispeech, the full rare word list containing 200k distinct words proposed in [10] was used as the collection of all biasing words, in which more than 60% were OOV words that did not appear in the Librispeech training set. Following the scheme in [10], biasing lists were organised by finding words that belong to the full rare word list in the reference transcription of each utterance and adding a certain number of distractors to them. 10.3% of the word tokens in the test sets belonged to the full rare word list and hence were covered by the biasing list during testing.

On the DSTC data, the biasing list arrangement with added distractors was used for training only, where the full rare word list was augmented with words that occurred fewer than 100 times in the DSTC training data. During evaluation, the ontology of DSTC3, which contains task-specific named entities, was used to form the biasing list by extracting distinct words from those named entities and removing common words with 200 or more occurrences in the training data. This biasing list contained 243 distinct words, and 4.7% of the word tokens in the test set belonged to this biasing list.
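A minimal sketch of the per-utterance biasing-list simulation used for training is shown below, assuming a word-level rare-word list; the example words are hypothetical.

```python
import random

def build_biasing_list(reference_words, full_rare_list, num_distractors=1000, seed=0):
    """Per-utterance biasing list: rare words from the reference plus random distractors."""
    rng = random.Random(seed)
    in_utterance = [w for w in reference_words if w in full_rare_list]
    distractor_pool = list(full_rare_list - set(in_utterance))
    distractors = rng.sample(distractor_pool, min(num_distractors, len(distractor_pool)))
    return set(in_utterance) | set(distractors)

# Hypothetical example: "vignette" is the rare word to bias towards in this utterance.
full_rare_list = {"vignette", "turner", "zurich", "quixote"}
biasing_list = build_biasing_list(["a", "small", "vignette"], full_rare_list,
                                  num_distractors=2)
```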

VI-C Model specification

Systems were built using the ESPnet toolkit [58]. A unigram wordpiece model with 600 distinct wordpieces was built on the Librispeech data and was directly applied to the DSTC data. For both the AED and RNN-T models, a Conformer [39] encoder was used, which contained 16 Conformer blocks, each with a 4-head 512-d self-attention layer. The AED additionally contained a single-layer 1024-d LSTM decoder and a 4-head 1024-d location-sensitive attention. The RNN-T additionally had a 1024-d predictor and a joint network with a single 1024-d fully-connected layer.

For BLMD, a 2-layer 2048-d LSTM-LM trained on the Librispeech 800 million-word text training corpus was used as the target domain LM for Librispeech experiments. Each source domain LM trained on the text of the audio training data used a single-layer 1024-d LSTM. Each LM had the same wordpieces as the corresponding ASR system.

VI-D Training specifications

During training, biasing lists with 1000 distractors were used for the Librispeech experiments, and 100 distractors for the DSTC data. These lists were organised by finding biasing words in the reference and adding distractors. The dropping technique described in [10] was applied for AED model training, where biasing words that were found in the reference transcription had a 30% probability of being removed from the biasing list. This was to prevent the model from being over-confident about TCPGen outputs. The Noam optimiser [62] was used for the Conformer. The MBWE loss was applied after 16 epochs. The beam size was 5 for MBWE training in all experiments and 30 for decoding. A coverage penalty [61] of 0.01 was applied to AED models.

VI-E Evaluation metrics

In addition to WER, a rare word error rate (R-WER) was used to evaluate the system performance on biasing words that were “rare” in the training data for that system. R-WER is the total number of word error tokens that belong to the biasing list, divided by the total number of word tokens in the test set that belong to the biasing list. Insertion errors were counted in R-WER if the inserted word belonged to the biasing list [10]. As WERs on the Librispeech test sets were all small, for the rest of this paper, 2 decimal places are given for Librispeech WERs whereas one decimal place is used for other results. Moreover, for small WER and R-WER reductions, a project-by-project sign test was performed for the Librispeech experiments based on the “project ID” of each utterance. The same sign test was performed dialogue-by-dialogue for the DSTC data.
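The R-WER definition can be sketched as follows. This example uses difflib for the alignment rather than a standard WER scoring tool, so the error attribution is only approximate; the function name is an assumption for this sketch.

```python
import difflib

def rare_word_error_rate(ref, hyp, biasing_list):
    """R-WER sketch: biasing-word errors divided by biasing-word tokens in the reference.

    Substitution and deletion errors are counted when the reference word is in the
    biasing list; insertion errors are counted when the inserted word is in the list.
    """
    errors = 0
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            errors += sum(1 for w in ref[i1:i2] if w in biasing_list)
        if op in ("replace", "insert"):
            # Extra hypothesis words in the biasing list count as insertion errors.
            extra = hyp[j1:j2][(i2 - i1):] if op == "replace" else hyp[j1:j2]
            errors += sum(1 for w in extra if w in biasing_list)
    total = sum(1 for w in ref if w in biasing_list)
    return errors / total if total else 0.0
```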

VII Results

Fig. 3: Plots of training (left) and dev (right) set WERs across 4 training epochs. Training set WER was calculated on 5% randomly sampled utterances from the full 960-hour training set. Dev-set combines both dev-clean and dev-other sets. MBWE parameters were defined in Eqn. (16).

VII-A Conformer AED experiments on Librispeech

The TCPGen component, together with the proposed MBWE and BLMD algorithms, was first applied to the Conformer AED model.

test-clean test-other
System MBWE params. BLMD params. %WER %R-WER %WER %R-WER
Baseline - 3.71 13.2 9.36 29.5
Baseline - 3.65 13.0 9.02 28.9
Baseline - 3.62 12.8 9.08 28.6
TCPGen - 3.23 8.6 8.43 21.3
TCPGen - 2.96 7.6 7.88 19.5
Baseline 3.33 12.3 8.04 27.6
Baseline 3.19 12.2 7.95 27.1
Baseline 3.17 11.7 7.92 27.3
TCPGen 2.79 6.9 7.40 19.5
TCPGen 2.59 6.4 7.13 18.2
TABLE II: WER and R-WER on Librispeech test-clean and test-other sets for Conformer-based AED systems trained on the Librispeech full 960-hour training set. MBWE params. include $\alpha$ and $\beta$ in Eqn. (16), and BLMD params. include $\gamma^{\text{ptr}}_{\text{src}}$ and $\gamma^{\text{ptr}}_{\text{tgt}}$ in Eqn. (23). The baseline here refers to the standard Conformer AED model.

With the Noam optimiser, the learning rate follows a smooth schedule, and preliminary experiments found that adjusting the weight of the cross-entropy loss yielded significantly worse results, as it effectively introduced an abrupt change to the learning rate. Therefore, to adjust the contribution of the MBWE loss, different values of $\alpha$ and $\beta$ were used while keeping the coefficient of the cross-entropy loss at 1. The effect of using small and large values of $\alpha$ and $\beta$ on the training and dev set WER is shown in Fig. 3. As shown in Fig. 3, applying MBWE with both small and large values had a similar effect on the training set WER, whereas using smaller values produced better results on the dev set and hence was adopted for the experiments.

Then, the best set of BLMD parameters was searched for and applied to the trained models during decoding. The search procedure is illustrated in Fig. 4. The left part of Fig. 4 shows the dev set WER for different sets of BLMD parameters for the baseline standard AED system. The best values of $\gamma^{\text{mdl}}_{\text{src}}$ and $\gamma^{\text{mdl}}_{\text{tgt}}$ found here were then fixed for the search of $\gamma^{\text{ptr}}_{\text{src}}$ and $\gamma^{\text{ptr}}_{\text{tgt}}$, as shown on the right part of Fig. 4 for the system with TCPGen. The resulting TCPGen discounting factors were larger, which indicates that a stronger LM discounting effect was needed for the TCPGen distribution.

The results for the Conformer AED model are summarised in Table II. As shown in Table II, for the baseline standard Conformer AED system, using the MBWE loss benefits the R-WER. When MBWE is applied to TCPGen, there was a 12% relative reduction in R-WER on the test-clean set and 9% on the test-other set compared to the TCPGen system without MBWE training. This increased the relative R-WER improvement by using TCPGen from 33% to 41% on the test-clean set, and from 28% to 32% on the test-other set compared to the baseline with the same training loss (i.e. comparing row 1 with row 4 and row 3 with row 5 in Table II).

Fig. 4: Illustration of tuning BLMD hyper-parameters for the baseline standard Conformer AED model and the Conformer AED model with TCPGen. Numbers in each grid are dev set WERs in percentage. Left: Tuning on the baseline model. Right: Tuning on the TCPGen model with the best set of model-distribution discounting factors found from the baseline on the left.

After applying BLMD, obvious reductions in R-WER were observed for both the baseline and TCPGen systems on both test sets. In particular, large R-WER reductions were found when different discounting factors were applied to the TCPGen distribution, which further increased the relative R-WER improvement by using TCPGen from 41% to 46% on the test-clean set and from 32% to 33% on the test-other set (comparing row 3 with row 5 and row 8 with row 10).

test-clean test-other
System MBWE params. BLMD params. %WER %R-WER %WER %R-WER
Baseline - 4.02 14.1 10.12 33.1
Baseline - 4.01 14.0 9.96 32.5
Baseline - 3.87 13.8 9.80 31.8
DB - 3.57 10.4 9.45 25.0
TCPGen - 3.40 8.9 8.79 22.2
TCPGen - 3.12 8.1 8.64 20.7
Baseline 3.55 12.5 8.90 30.4
Baseline 3.38 12.6 8.59 29.5
Baseline 3.28 12.2 8.50 28.8
TCPGen 3.02 8.0 7.49 18.6
TCPGen 2.79 7.0 7.44 18.2
TABLE III: WER and R-WER on Librispeech test-clean and test-other sets for Conformer-based RNN-T models trained on the Librispeech full 960-hour training set. MBWE params. include $\alpha$ and $\beta$ in Eqn. (16), and BLMD params. include $\gamma^{\text{ptr}}_{\text{src}}$ and $\gamma^{\text{ptr}}_{\text{tgt}}$ in Eqn. (23). The baseline here refers to the standard Conformer RNN-T model.

VII-B Conformer RNN-T experiments on Librispeech

Experiments were then performed on the Librispeech full 960-hour training data, as shown in Table III. Preliminary experiments on Librispeech found that $\beta$ in the MBWE loss should be set larger than $\alpha$ for better performance when using TCPGen in RNN-T. The baseline here is a standard Conformer-based RNN-T model. The MBWE and BLMD hyper-parameters were found in the same way as for the AED experiments. In addition to the standard baseline system, the DB method proposed in [10] was used as a biasing method for comparison. In general, consistent and significant WER and R-WER reductions were achieved using TCPGen compared to both the baseline and the DB system, as confirmed by the sign test. MBWE with a restricted beam search achieved WER and R-WER reductions for both baseline and TCPGen systems. In particular, the relative R-WER improvement increased from 37% to 41% on the test-clean set and from 33% to 37% on the test-other set compared to the baseline system (comparing row 4 to row 1 and row 6 to row 3 in Table III).

Applying BLMD achieved further WER and R-WER reductions for all systems. In particular, BLMD increased the relative R-WER reduction from 41% to 43% on test-clean and from 37% to 38% on test-other (comparing row 6 to row 3 and row 11 to row 9 in Table III). The sign test was used to compare TCPGen with and without MBWE training, as the WER and R-WER reductions were smaller than those observed for AED. All R-WER improvements after applying MBWE, including those when BLMD was applied, were statistically significant under the sign test.

Compared to the results for the AED model, the R-WER improvement was smaller in general, and to investigate this discrepancy, the generation probabilities for TCPGen in AED and RNN-T models are plotted in Fig. 5 for an example utterance from the test-clean set.

Fig. 5: Heat map of the generation probability for each wordpiece in an utterance taken from recognition results to show how each system spots where to use contextual biasing. Biasing words are vignette and Turner.

As shown in Fig. 5, the colour on biasing words was lighter for RNN-T than for AED, indicating that RNN-T had a smaller dependency on TCPGen than AED. This resulted in a smaller R-WER reduction using TCPGen in RNN-T. During RNN-T training, the probability of each alignment to be maximised contained only a tiny portion of factorised probabilities that correspond to a new wordpiece output, with a large portion corresponding to the blank symbol, $\varnothing$, and TCPGen has an effect only on the portion that outputs a new wordpiece token. Therefore, RNN-T used TCPGen much less frequently than AED during training. Moreover, when TCPGen was used, its contribution was scaled by $\big(1 - P^{\text{mdl}}(\varnothing\,|\,t,u)\big)$ to ensure a valid output distribution. To encourage RNN-T to use TCPGen, the scaling factor $\beta$ for the biasing word error term in Eqn. (16) was set to 5.0 during MBWE training.

Systems AED RNN-T
Baseline + MBWE + BLMD 75.6% 78.0%
TCPGen + MBWE + BLMD 37.8% 41.7%
TABLE IV: Zero-shot WERs on OOV words on the combined test-clean and test-other set using the baseline and TCPGen systems with MBWE and BLMD. Same biasing lists were used as those in Table II and Table III.

TCPGen also achieved zero-shot learning of OOV words incorporated in the biasing list. There were 468 OOV word tokens in the combined test-clean and test-other set that were covered by the biasing list. The OOV WER, which was measured in the same way as R-WER but for OOV words on the combined test sets, is reported separately in Table IV for the baseline and TCPGen systems using MBWE and BLMD. Note that these systems used exactly the same biasing lists with 1000 distractors as those in Table II and Table III, so they had the same WER and R-WER as the corresponding systems there. As a result, TCPGen achieved a large OOV WER reduction compared to the best baseline system, and the majority of OOV words could be correctly recognised once incorporated in the biasing list.

VII-C DSTC experiments

Finally, TCPGen and the proposed MBWE and BLMD methods were evaluated on the DSTC data, where the biasing list was extracted from the ontology. Models trained on the Librispeech 960-hour data were finetuned on the DSTC track2 training set. For the baseline and TCPGen systems without MBWE training, finetuning was performed only with the CE loss, whereas for systems trained with MBWE, the MBWE loss was also applied during finetuning. The hyper-parameters for MBWE were found in the same way as before. Moreover, as it is difficult to obtain large external task-oriented dialogue data for LM training, an LM was trained only on the DSTC2 training data to perform either shallow fusion or LM discounting. For the baseline system, this DSTC LM was found to be more effective as an SF LM, which achieved a limited WER improvement. For TCPGen, BLMD was applied with the same SF factor as the baseline, together with additional discounting factors for the TCPGen distribution, so that, in addition to SF, the internal LM effect was discounted in the TCPGen distribution. The WER and R-WER are reported in Table V, where the R-WER was measured for biasing words that appeared in the ontology.

System AED(%) RNN-T(%)
Baseline 21.31 (61.5) 21.26 (64.2)
Baseline + MBWE 20.81 (60.4) 21.15 (64.1)
Baseline + MBWE + SF 20.73 (60.4) 20.63 (64.1)
TCPGen 20.38 (45.2) 20.05 (49.2)
TCPGen + MBWE 20.00 (43.6) 19.87 (47.4)
TCPGen + MBWE + BLMD 19.74 (40.3) 19.13 (40.9)
TABLE V: WER and R-WER (in brackets) on the DSTC3 test set for Conformer AED and RNN-T models trained on Librispeech full 960-hour training set and finetuned on DSTC2 train and dev sets.

Progressive WER and R-WER reductions were achieved by applying MBWE and BLMD successively for both AED and RNN-T using TCPGen. For AED, using TCPGen achieved a 26% relative R-WER reduction compared to the baseline, which then increased to 33% with BLMD for AED. For RNN-T, TCPGen alone achieved a 23% relative R-WER reduction compared to the baseline, which increased to 36% relative R-WER reduction compared to the corresponding baseline with SF. Moreover, a dialogue-by-dialogue sign test was performed between TCPGen and TCPGen with MBWE loss. For both AED and RNN-T, R-WER improvements were significant at a p-value less than 0.05.

VIII Conclusion

This paper proposed the TCPGen component for contextual ASR. TCPGen combines a neural pointer generator with a symbolic prefix-tree search. Meanwhile, the minimum biasing word error (MBWE) loss was proposed to improve the training of TCPGen with an emphasis on biasing words, and the biasing-word-driven LM discounting (BLMD) method was proposed for decoding to account for the internal LM effect. Experiments were performed on Librispeech and DSTC dialogue data. Consistent and significant WER improvements were found using TCPGen, especially on biasing words. Applying MBWE and BLMD achieved further significant R-WER reductions beyond those obtained by TCPGen alone.

References

  • [1] I. Williams, A. Kannan, P. Aleksic, D. Rybach, T. Sainath, “Contextual speech recognition in end-to-end neural network systems using beam search”, Proc. Interspeech, Hyderabad, 2018
  • [2] Z. Chen, M. Jain, Y. Wang, M. L. Seltzer & C. Fuegen, “End-to-end contextual speech recognition using class language models and a token passing decoder”, Proc. ICASSP, Brighton, 2019.
  • [3] D. Zhao, T. Sainath, D. Rybach, P. Rondon, D. Bhatia, B. Li & R. Pang, “Shallow-fusion end-to-end contextual biasing”, Proc. Interspeech, Graz, 2019.
  • [4] G. Pundak, T. Sainath, R. Prabhavalkar, A. Kannan & D. Zhao “Deep context: End-to-end contextual speech recognition”, Proc. ICASSP, Calgary, 2018.
  • [5] Z. Chen, M. Jain, Y. Wang, M. L. Seltzer & C. Fuegen “Joint grapheme and phoneme embeddings for contextual end-to-end ASR”, Proc. Interspeech, Graz, 2019.
  • [6] M. Jain, G. Keren, J. Mahadeokar, G. Zweig, F. Metze & Y. Saraf, “Contextual RNN-T for open domain ASR”, Proc. Interspeech, Shanghai, 2020.
  • [7] U. Alon, G. Pundak & T. Sainath, “Contextual speech recognition with difficult negative training examples”, Proc. ICASSP, Brighton, 2019.
  • [8] Z. Chen, M. Jain, Y. Wang, M. Seltzer & C. Fuegen “End-to-end contextual speech recognition using class language models and a token passing decoder”, Proc. ICASSP, Brighton, 2019.
  • [9] D. Le, G. Keren, J. Chan, J. Mahadeokar, C. Fuegen & M. L. Seltzer “Deep shallow fusion for RNN-T personalization”, Proc. SLT, 2021.
  • [10] D. Le, M. Jain, G. Keren, S. Kim, Y. Shi, J. Mahadeokar, J. Chan, Y. Shangguan, C. Fuegen, O. Kalinli, Y. Saraf & M. L. Seltzer “Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion”, arXiv: 2104.02194, 2021.
  • [11] R. Huang, O. Abdel-hamid, X. Li & G. Evermann “Class LM and word mapping for contextual biasing in end-to-end ASR”, Proc. Interspeech, Shanghai, 2020.
  • [12] Y. M. Kang & Y. Zhou “Fast and robust unsupervised contextual biasing for speech recognition”, arXiv: 2005.01677, 2020.
  • [13] A. Garg, A. Gupta, D. Gowda, S. Singh & C. Kim, “Hierarchical multi-stage word-to-grapheme named entity corrector for automatic speech recognition”, Proc. Interspeech, Shanghai, 2020.
  • [14] D. Liu, C. Liu, F. Zhang, G. Synnaeve, Y. Saraf & G. Zweig, “Contextualizing ASR lattice rescoring with hybrid pointer network language model”, Proc. Interspeech, Shanghai, 2020.
  • [15] A. See, P. J. Liu & C. D. Manning “Get to the point: summarization with pointer-generator networks”, Proc. ACL, Vancouver, 2017.
  • [16] Z. Liu, A. Ng, S. Lee, A. T. Aw & N. F. Chen “Topic-aware pointer-generator networks for summarizing spoken conversations”, Proc. ASRU, Singapore, 2019.
  • [17] W. Li, R. Peng, Y. Wang & Z. Yan, “Knowledge graph based natural language generation with adapted pointer-generator networks”, Neurocomputing, vol. 328, pp. 174–187, 2020.
  • [18] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho & Y. Bengio, “Attention-based models for speech recognition”, Proc. NIPS, Montreal, 2015.
  • [19] L. Lu, X. Zhang, K. Cho & S. Renals, “A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition”, Proc. Interspeech, Dresden, 2015.
  • [20] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel & Y. Bengio, “End-to-end attention-based large vocabulary speech recognition”, Proc. ICASSP, Shanghai, 2016.
  • [21] C. C. Chiu, T. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski & M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models”, Proc. ICASSP, Calgary, 2018.
  • [22] A. Zeyer, K. Irie, R. Schlüter & H. Ney, “Improved training of end-to-end attention models for speech recognition”, Proc. Interspeech, Hyderabad, 2018.
  • [23] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio & A. Courville, “Towards end-to-end speech recognition with deep convolutional neural networks”, Proc. Interspeech, San Francisco, 2016.
  • [24] T. Hayashi, S. Watanabe, Y. Zhang, T. Toda, T. Hori, R. Astudillo & K. Takeda, “Back-translation-style data augmentation for end-to-end ASR”, Proc. SLT, Athens, 2018.
  • [25] S. Karita, A. Ogawa, M. Delcroix, & T. Nakatani “Sequence training of encoder-decoder model using policy gradient for end-to-end speech recognition”, Proc. ICASSP, Calgary, 2018.
  • [26] A. Graves, A. Mohamed & G. Hinton, “Speech recognition with deep recurrent neural networks”, Proc. ICASSP, Vancouver, 2013.
  • [27] R. Prabhavalkar, K. Rao, T. Sainath, B. Li, L. Johnson & N. Jaitly, “A comparison of sequence-to-sequence models for speech recognition”, Proc. Interspeech, Stockholm, 2017.
  • [28] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram & Z. Zhu, “Exploring neural transducers for end-to-end speech recognition”, Proc. ASRU, Okinawa, 2017.
  • [29] J. Li, R. Zhao, H. Hu, Y. Gong, “Improving RNN Transducer modeling for end-to-end speech recognition”, Proc. ASRU, Singapore, 2019.
  • [30] Y. He, T. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S. Chang, R. Rao & A. Gruenstein “Streaming end-to-end speech recognition for mobile devices”, Proc. ICASSP, Brighton, 2019.
  • [31] M. Ghodsi, X. Liu, J. Apfel, R. Cabrera & E. Weinstein “Rnn-Transducer with stateless prediction network”, Proc. ICASSP, Barcelona, 2020.
  • [32] Q. Li, C. Zhang & P. C. Woodland, “Integrating source-channel and attention-based sequence-to-sequence models for speech recognition”, Proc. ASRU, Singapore, 2019.
  • [33] A. Graves, S. Fernandez, F. Gomez, & J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”, Proc. ICML, Brighton, 2006.
  • [34] A. Zeyer, E. Beck, R. Schluter & H. Ney “CTC in the context of generalized full-sum HMM training”, Proc. Interspeech, Stockholm, 2017.
  • [35] S. Kim, T. Hori & S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning”, Proc. ICASSP, New Orleans, 2017.
  • [36] Q. Liu, Z. Chen, H. Li, M. Huang, Y. Lu & K. Yu “Modular end-to-end automatic speech recognition framework for acoustic-to-word model”, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2174-2183, 2020.
  • [37] H. Hadian, H. Sameti, D. Povey & S. Khudanpur, “End-to-end speech recognition using lattice-free MMI”, Proc. Interspeech, Hyderabad, 2018.
  • [38] W. Chan, N. Jaitly, Q. V. Le & O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition”, Proc. ICASSP, Shanghai, 2016.
  • [39] A. Gulati, J. Qin, C. C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu & R. Pang, “Conformer: convolution-augmented transformer for speech recognition”, Proc. Interspeech, Shanghai, 2020.
  • [40] D. Povey & P.C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training”, Proc. ICASSP, Orlando, 2002.
  • [41] R. Prabhavalkar, T. Sainath, Y. Wu, P. Nguyen, Z. Chen, C. C. Chiu, A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models”, Proc. ICASSP, Calgary, 2018.
  • [42] C. Weng, C. Yu, J. Cui, C. Zhang, D. Yu “Minimum bayes risk training of RNN-Transducer for end-to-end speech recognition”, Proc. Interspeech, Shanghai, 2020.
  • [43] J. Guo, G. Tiwari, J. Droppo, M. V. Segbroeck, C. W. Huang, A. Stolcke, R. Maas, “Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition”, Proc. Interspeech, Shanghai, 2020.
  • [44] C. Weng, J. Cui, G. Wang, J. Wang, C. Yu, D. Su & D. Yu, “Improving attention based sequence-to-sequence models for end-to-end english conversational speech recognition”, Proc. Interspeech, Hyderabad, 2018.
  • [45] N. P. Wynands, W. Michel, J. Rosendahl, R. Schlüter, H. Ney, “Efficient sequence training of attention models using approximative recombination”, arXiv: 2110.09245, 2021.
  • [46] Z. Meng, Y. Wu, N. Kanda, L. Lu, X. Chen, G. Ye, E. Sun, J. Li, Y. Gong, “Minimum word error rate training with language model fusion for end-to-end speech recognition”, Proc. Interspeech, Brno, 2021.
  • [47] L. Lu, Z. Meng, N. Kanda, J. Li & Y. Gong, “On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer”, Proc. Interspeech, Brno, 2021.
  • [48] P. C. Woodland & D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition”, Computer Speech and Language, vol. 16, pp. 25–47, 2002.
  • [49] S. Toshniwal, A. Kannan, C. C. Chiu, Y. Wu, T. Sainath & K. Livescu, “A density ratio approach to language model fusion in end-to-end automatic speech recognition”, Proc. ASRU, Singapore, 2019.
  • [50] A. Kannan, Y. Wu, P. Nguyen, T. Sainath, Z. Chen & R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model”, Proc. ICASSP, Calgary, 2018.
  • [51] T. Hori, J. Cho & S. Watanabe, “End-to-end speech recognition with word-based RNN language models”, Proc. SLT, Athens, 2018.
  • [52] S. Kim, Y. Shangguan, J. Mahadeokar, A. Bruguier, C. Fuegen, M. L. Seltzer & D. Le, “Improved neural language model fusion for streaming recurrent neural network transducer”, Proc. ICASSP, Toronto, 2021.
  • [53] E. Variani, D. Rybach, C. Allauzen & M. Riley “Hybrid Autoregressive Transducer (HAT)”, Proc. ICASSP, Barcelona, 2020.
  • [54] Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Lu, X. Chen, R. Zhao, J. Li & Y. Gong, “Internal language model estimation for domain-adaptive end-to-end speech recognition”, Proc. SLT, Shenzhen, 2021.
  • [55] Z. Meng, N. Kanda, Y. Gaur, S. Parthasarathy, E. Sun, L. Lu, X. Chen, J. Li & Y. Gong, “Internal language model training for domain-adaptive end-to-end speech recognition”, Proc. ICASSP, Toronto, 2021.
  • [56] C. Zhang, B. Li, Z. Lu, T.N. Sainath & S. Chang, “Improving the fusion of acoustic and text representations in RNN-T”, Proc. ICASSP, Singapore, 2022.
  • [57] D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk & Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition”, Proc. Interspeech, Graz, 2019.
  • [58] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala & T. Ochiai “ESPnet: End-to-end speech processing toolkit”, Proc. Interspeech, Hyderabad, 2018.
  • [59] K. Simonyan & A. Zisserman “Very deep convolutional networks for large-scale image recognition”, Proc. CVPR, Columbus, 2014.
  • [60] V. Panayotov, G. Chen, D. Povey & S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books”, Proc. ICASSP, South Brisbane, 2015.
  • [61] J. Chorowski & N. Jaitly “Towards better decoding and language model integration in sequence to sequence models”, Proc. Interspeech, Stockholm, 2017.
  • [62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser & I. Polosukhin “Attention is all you need”, Proc. NIPS, Long Beach, 2017.
  • [63] G. Sun, C. Zhang & P. C. Woodland “Tree-constrained pointer generator for end-to-end contextual speech recognition”, Proc. ASRU, Cartagena, 2021.
  • [64] C. Huber, J. Hussain, S. Stüker & A. Waibel “Instant one-shot word-learning for context-specific neural sequence-to-sequence speech recognition”, Proc. ASRU, Cartagena, 2021.
  • [65] J. Kim & Y. Lee, “Accelerating RNN transducer inference via one-step constrained beam search”, IEEE Signal Processing Letters, Vol. 27, pp. 2019-2023, 2020.
  • [66] A. Tripathi, H. Lu, H. Sak & H. Soltau, “Monotonic recurrent neural network transducer and decoding strategies”, Proc. ASRU, 2019.