Present-day automatic speech recognition (ASR) models using Deep Neural Networks (DNN) can be broadly classified into two frameworks: hybrid [hybrid-ctc] and end-to-end (E2E) [rnnt-graves, las, loc_attn]. A typical hybrid HMM-DNN system consists of three components trained individually: an acoustic model (AM) that estimates the posterior probabilities of Hidden Markov Model (HMM) states, a language model (LM) that estimates probabilities of word sequences, and a pronunciation model (PM) that maps phonemes to words. These models are optimized independently [hinton2012deep] and then combined using a Weighted Finite State Transducer (WFST) [mohri2002weighted] for efficient decoding. In an E2E speech recognition model such as the RNN-T [rnnt-graves], a single neural network learns to map audio to text instead of using the distinct components of a hybrid system. While this generally simplifies the overall training and inference pipelines for ASR, E2E models tend to have difficulty correctly recognizing words that do not appear frequently in paired audio-text training data [clas, dc_tom, dc_phoebe]. Since hybrid ASR systems optimize the AM, LM and PM components independently, they can address the rare word recognition issue by 1) training the LM component with large amounts of unpaired text data to model occurrences of rare words and 2) representing pronunciation in the PM.
In this work, we propose an attention-based context biasing approach to address the following underlying issues in the setting of an E2E RNN-T based ASR system:
Rare word recognition: It is common for a vanilla RNN-T ASR system to make mistakes in recognizing words that occur infrequently in the training data. Assuming that "PyTorch" is a rare word in paired ASR training data, a vanilla RNN-T model might produce the hypothesis "when you look at pie towards itself" whereas the true transcript is "when you look at PyTorch itself". Although "PyTorch" is a rare word in the training data, its appearance in the video metadata can be used to recognize it correctly.
Disambiguation between similar words: When acoustically similar words appear in similar contexts in the training data, it is difficult for the model to pick the correct word without additional biasing. For example, the names "Sean" and "Shaun" might appear with similar frequencies in a similar context, but we might want to prefer one of them if it also appears in related text metadata.
Entity names often suffer from both these issues. Further, these often carry a high degree of semantic meaning relative to other words in the transcript, so it is important for ASR systems to recognize these correctly. Therefore, we measure the effectiveness of our approach on recognition of entity names. We aim to address these problems by biasing ASR using additional context from the accompanying text metadata of the video. This metadata is unstructured and potentially irrelevant to the speech being transcribed, so the ASR system needs to learn to selectively use or ignore it.
The rest of the paper is organized as follows: We review prior work around use of contextual words in E2E ASR in Section 2. We describe the base RNN-T model in Section 3. In Section 4, we propose changes to the RNN-T model to allow it to incorporate unpaired and unstructured text context via an attention mechanism. We show experiment results in Section 5 and further analyze the effectiveness of the proposed method in Section 6 by visualizing attention weights towards the metadata. We conclude in Section 7.
2 Prior Work
Prior work has leveraged contextual words either by on-the-fly (OTF) rescoring [shallowe2e, shallowfusion2018, streaming_rnnt] or as an additional input to the DNN along with the audio. The first approach is generally referred to as Shallow Fusion and the latter as Deep Contextualization [clas]. Our work falls in the latter category. It is most closely related to Contextual Listen, Attend And Spell (CLAS) [clas], which also used context words from unpaired text to bias an E2E ASR model. The CLAS model was originally evaluated on closed domain ASR tasks like those used for virtual assistants, using entities such as contact names as context words. Further improvements to CLAS were made in [dc_phoebe] and [dc_tom] by using representations that also leverage phonetic information. In this work, unlike CLAS, we look at Deep Contextualization in the setting of an RNN-T ASR model, and evaluate our method on an open domain video ASR task using noisy text metadata from videos as context. In a closed domain use case such as making calls through an assistant, there is strong prior information about where entity names can appear in the utterance, whereas in our case the context words may appear anywhere in the conversational speech of the video. Deep contextualization of RNN-T was explored in [keyword-rnnt] for the keyword spotting use case, where the phoneme sequence of the keyword, represented as a one-hot vector, was used to attend to and recognize the target keyword. An alternate approach for using contextual metadata from videos to improve ASR is explored in [darong-paper], where lattices produced by a hybrid ASR system are rescored using metadata.
3 RNN Transducer
The framework of the RNN-T ASR system is illustrated in Fig. 1. RNN-T for ASR has three main components: the Audio Encoder, the Text Predictor and the Joiner.
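The Audio Encoder uses the audio frame at time $t$, $x_t$, to produce the audio embedding $h^{enc}_t$ (Equation 1). The Audio Encoder used in this work is a stack of bi-directional LSTM (BLSTM) layers.
$$h^{enc}_t = f^{enc}(x_t) \tag{1}$$
The Text Predictor uses the last non-blank target unit, $y_{u-1}$, to produce the text embedding $h^{pre}_u$ (Equation 2). The Text Predictor is a stack of LSTM layers in this work. We use sentence pieces as target units.
$$h^{pre}_u = f^{pre}(y_{u-1}) \tag{2}$$
The Joiner takes in the outputs of the Audio Encoder and the Text Predictor and combines them to produce an embedding $z_{t,u}$:
$$z_{t,u} = \phi\left(U h^{enc}_t + V h^{pre}_u\right) \tag{3}$$
$U$ and $V$ are matrices that are used to project the audio and text embeddings to the same dimensions.
$\phi$ is a non-linear function such as ReLU [relu] or tanh.
Finally, the joiner's output, $z_{t,u}$, is passed through a linear transformation followed by a softmax layer to produce a probability distribution over the target units ($y_i$), i.e. sentence pieces plus a special blank symbol ($\varnothing$):
$$h_{t,u} = W z_{t,u} \tag{4a}$$
$$P(y_i \mid x_t, y_{u-1}) = \mathrm{softmax}(h_{t,u})_i \tag{4b}$$
By incorporating both audio and text for producing $P(y_i \mid x_t, y_{u-1})$ (Equation 4b), RNN-T can overcome the conditional independence assumption of CTC models [graves2006connectionist]. The emission of $\varnothing$ as the output unit results in an update of the audio embedding by moving ahead along the time axis, whereas the emission of a non-$\varnothing$ unit results in a change in the text embedding. This results in various possible alignment paths, as shown in the lattice of size $T \times U$ in Figure 1 of [rnnt-graves]. The sum of the probabilities of these paths gives the probability of an output sequence, $\mathbf{y}$, given the input sequence, $\mathbf{x}$, where $\mathbf{y} = (y_1, \dots, y_U)$ is the sequence of non-$\varnothing$ output target units and $\mathbf{x} = (x_1, \dots, x_T)$ is the input sequence of audio frames.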
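The joiner and the sum over alignment paths described above can be sketched in plain Python. This is a minimal illustration and not the implementation used in this work: it assumes tanh as the non-linearity, omits bias terms, places the blank symbol at index 0, and the function names (`joiner`, `sequence_log_prob`) are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def log_softmax(h):
    """Numerically stable log-softmax over a list of scores."""
    top = max(h)
    log_z = top + math.log(sum(math.exp(x - top) for x in h))
    return [x - log_z for x in h]

def joiner(h_enc, h_pre, U, V, W):
    """Combine one audio embedding and one text embedding into log-probabilities."""
    audio = matvec(U, h_enc)                 # project audio embedding
    text = matvec(V, h_pre)                  # project text embedding to the same size
    z = [math.tanh(a + p) for a, p in zip(audio, text)]  # joint embedding (tanh assumed)
    return log_softmax(matvec(W, z))         # linear layer followed by log-softmax

def logaddexp(a, b):
    """log(exp(a) + exp(b)) that also handles -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))

def sequence_log_prob(log_probs, targets, blank=0):
    """Sum the probabilities of all alignment paths through the lattice.

    log_probs[t][u][k] is the log-probability of unit k at audio frame t
    after u non-blank units were emitted; targets holds the non-blank units.
    """
    T, U = len(log_probs), len(targets)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # blank emitted at (t-1, u): move ahead in time
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # non-blank emitted at (t, u-1): move ahead in text
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]])
    # Every path ends with a final blank emitted from the last lattice node.
    return alpha[T - 1][U] + log_probs[T - 1][U][blank]
```

In such a sketch, `log_probs[t][u]` would be filled by calling `joiner` on the audio embedding for frame `t` and the text embedding after `u` non-blank units, so the lattice has one row per audio frame and one column per predictor state.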
4 Contextual RNN-T
As in [clas], each context word, $w_k$, is first represented as a sequence of target sentence piece units, e.g. the word "Jarred" may be mapped to [Ja, r, re, d]. This sequence is then fed to a BLSTM, and the last state of the BLSTM is used as the embedding of the given context word, $e_k$ (see Figure 2).
In the vanilla RNN-T system described in Section 3, the probabilities over target units, $P(y_i \mid x_t, y_{u-1})$ (Equation (4b)), are conditionally dependent on the outputs of the Audio Encoder, $h^{enc}_t$, and the Text Predictor, $h^{pre}_u$. In contextual RNN-T, we would like to make $P(y_i \mid x_t, y_{u-1})$ conditionally dependent on the contextual metadata words as well. This dependency can be achieved by incorporating the context word embeddings, $e_k$, into any of the Audio Encoder, Text Predictor and Joiner components.
In this work, we explore incorporating the context word embeddings into the Text Predictor of the RNN-T. An Attention Module (AttModule) is used to compute attention for each word in the metadata text. AttModule uses the predictor output for the non-blank text history up to step $u$ ($h^{pre}_u$) and the word embedding, $e_k$, to compute the attention weight, $\alpha_{u,k}$, as shown in Equation (5b). We use location-aware attention, which takes into account the attention weights from the previous predictor state, $\alpha_{u-1}$, while computing alignments at the current step [loc_attn].
We next compute a context vector, $c_u$, [clas, keyword-rnnt] as the weighted sum of the word embeddings:
$$c_u = \sum_{k} \alpha_{u,k}\, e_k$$
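To make the attention step concrete, the following is a rough sketch of one location-aware attention step over the context word embeddings, followed by the weighted sum that forms the context vector. The additive scoring form follows the general recipe of [loc_attn] and is an assumption here; the parameter names (`W`, `V`, `U`, `F`, `b`, `v`) and the function `attend` are hypothetical and are not taken from this paper's equations.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [dot(row, v) for row in m]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(h_pre, embeddings, prev_alpha, W, V, U, F, b, v):
    """One attention step over K context word embeddings.

    h_pre       predictor output at the current step
    embeddings  K context word embeddings
    prev_alpha  K attention weights from the previous predictor state
    F           one scalar filter per convolution channel (kernel size 1)
    Returns the new attention weights and the context vector.
    """
    query = matvec(W, h_pre)  # projected once, shared by all words
    scores = []
    for e_k, a_prev in zip(embeddings, prev_alpha):
        # Kernel size 1: each channel only rescales the previous weight.
        location = [f * a_prev for f in F]
        key = matvec(V, e_k)
        loc = matvec(U, location)
        hidden = [math.tanh(q + k + l + c)
                  for q, k, l, c in zip(query, key, loc, b)]
        scores.append(dot(v, hidden))
    alpha = softmax(scores)
    dim = len(embeddings[0])
    # Context vector: attention-weighted sum of the word embeddings.
    context = [sum(a * e[d] for a, e in zip(alpha, embeddings))
               for d in range(dim)]
    return alpha, context
```

Because the weights come from a softmax, they sum to one over the metadata words, so the context vector is a convex combination of the word embeddings.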
5 Experiments
5.1 Data
The dataset used for our experiments was sampled from English videos shared publicly on Facebook. The data is de-identified by removing information such as the user who posted the video, and we only use the content of the video and the text metadata for training and evaluation.
We segment the data into segments of 10 seconds in duration for training and evaluation. Metadata words are obtained from the title and description text of the video, if available, after simple text cleaning and filtering such as removing words with hyperlinks in them. If a word in the text is capitalized, we also add its lowercased version as a metadata word. We do not preserve the ordering of words in the metadata in either training or evaluation. All segments of the same video share the same list of contextual words. We use about 8k hours of data for training and about 170 hours for evaluation. We further divide the evaluation set based on whether any word in the reference for a segment also appears in the metadata words for that video. The segments for which there is at least one common word are referred to below as the CommonNonZero set and the remainder as the CommonZero set. We present results on these two evaluation sets.
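The metadata preparation and the evaluation split described above can be sketched as follows. This is only an illustration under stated assumptions: the exact cleaning rules are not given in the text, so the hyperlink test and the punctuation stripping below are guesses, and the function names are hypothetical.

```python
def metadata_words(title, description):
    """Build an unordered set of context words from a video's title and description."""
    words = set()
    for token in f"{title} {description}".split():
        lowered = token.lower()
        if "http" in lowered or "www." in lowered:
            continue  # assumed filter for words containing hyperlinks
        word = token.strip(".,!?:;\"'()[]")  # assumed simple text cleaning
        if not word:
            continue
        words.add(word)
        if word[0].isupper():
            words.add(word.lower())  # also keep the lowercased version
    return words

def evaluation_set(reference, context_words):
    """Name the evaluation set a segment belongs to.

    A segment is in CommonNonZero if at least one reference word also
    appears in the metadata words of its video, otherwise in CommonZero.
    """
    if set(reference.split()) & context_words:
        return "CommonNonZero"
    return "CommonZero"
```

A set is used because word order in the metadata is not preserved, and the same set would be shared by every segment of a video.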
5.2 Experimental Setup
The architecture of the Contextual RNN-T model from Section 4 used for the experiments in this paper is as follows. The Audio Encoder is a 5-layer BLSTM with 704 dimensions. We subsample by a factor of 2 along the time dimension after the first BLSTM layer. The Text Predictor is a 2-layer LSTM with 704 dimensions. We use a token set consisting of 200 sentence pieces, trained using the sentence piece library [sentence_piece]. The Embedding Extractor is a 2-layer BLSTM of size 100. The Attention Module has the following parameters: 1) a convolution with 2 output channels and a kernel of size 1, and 2) attention computed over 64 dimensions, which sets the sizes of the projection matrices in Equations (5a), (5b) and (5c). The baseline model (Figure 1) does not use the Embedding Extractor or the Attention Module. All components of both the baseline and the Contextual RNN-T model are trained from scratch.
The input to the network consists of globally normalized 80-dimensional log Mel-filterbank features, extracted with 25ms FFT windows and 10ms frame shifts. The sentence piece encoding of each word in the metadata is appended with a special sentence piece unit. We use the Adam optimizer [kingma2014adam] with a learning rate of 0.0004, and SpecAugment [spec_aug] with policy LB during training. Both RNN-T models were trained for 30 epochs. A beam size of 10 was used during inference.
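As a rough guide to the sequence lengths these settings imply, the arithmetic below counts feature frames for a 10-second segment and the time steps left after the encoder's subsampling by 2. It assumes that no padding is applied at the segment edges and that subsampling rounds up; neither detail is stated in the text, and the function names are hypothetical.

```python
def num_frames(duration_ms, window_ms=25, shift_ms=10):
    """Number of full analysis windows that fit in the audio (no padding assumed)."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // shift_ms + 1

def encoder_steps(frames, subsampling=2):
    """Time steps remaining after subsampling along the time axis (rounding up)."""
    return -(-frames // subsampling)

frames = num_frames(10_000)      # 998 feature frames for a 10-second segment
steps = encoder_steps(frames)    # 499 steps after subsampling by 2
```

Under these assumptions each 10-second segment is an 80 x 998 feature matrix at the input, and the upper encoder layers operate on roughly half as many time steps.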
5.3 Impact on WER-NE and WER
We measure the performance of our models using word error rate (WER) and WER on named entities (WER-NE) on the two test sets described in Section 5.1. An in-house entity tagger was used to tag named entities in transcripts and metadata.
As seen in Table 1, the Contextual RNN-T model (row 2) improves WER-NE by about 12% relative to the baseline model (row 1) on the CommonNonZero evaluation set. As shown in Table 2, neither WER nor WER-NE on the CommonZero test set is significantly affected by the Contextual RNN-T model when there is no intersection between the metadata words and the reference.
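For readers less familiar with these metrics, the sketch below computes plain WER from word-level edit distance and the relative improvement between two error rates. It is not the scoring tool used in this work: restricting the computation to tagged named entities for WER-NE is not shown, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words

def relative_improvement(baseline, contextual):
    """Fraction by which the contextual error rate is lower than the baseline."""
    return (baseline - contextual) / baseline
```

Applied to the introduction's example, the hypothesis "when you look at pie towards itself" against the reference "when you look at PyTorch itself" counts one substitution and one insertion over six reference words.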
We also measure the robustness of our system using the precision and recall of the emission of context words in the model's hypotheses. A True Positive occurs when a context word from the metadata of the video is correctly output by the model as compared to the reference. A False Positive occurs if the model outputs a context word but it does not appear in the reference. We show precision and recall for the triggering of context words, aggregated over both test sets, in Table 3. Recall improves by 4.5% and precision degrades by 1.2% for the Contextual model compared to the baseline.
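The True Positive and False Positive definitions above translate into a short counting routine. The sketch below is one possible reading: it counts each distinct word once per segment, and it treats a context word that is in the reference but missing from the hypothesis as a False Negative, which the text implies for recall but does not spell out. The function names are hypothetical.

```python
def context_word_counts(reference, hypothesis, context_words):
    """Count true positives, false positives and false negatives for one segment."""
    ref_words = set(reference.split())
    hyp_words = set(hypothesis.split())
    emitted = hyp_words & context_words          # context words the model output
    true_pos = len(emitted & ref_words)          # emitted and in the reference
    false_pos = len(emitted - ref_words)         # emitted but not in the reference
    false_neg = len((ref_words & context_words) - hyp_words)  # assumed definition
    return true_pos, false_pos, false_neg

def precision_recall(segments):
    """Aggregate precision and recall over (reference, hypothesis, context_words) triples."""
    tp = fp = fn = 0
    for reference, hypothesis, context_words in segments:
        seg_tp, seg_fp, seg_fn = context_word_counts(reference, hypothesis, context_words)
        tp, fp, fn = tp + seg_tp, fp + seg_fp, fn + seg_fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Counts are summed over all segments before dividing, which matches reporting a single aggregated precision and recall rather than per-segment averages.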
6 Analysis
To better understand what the Contextual RNN-T model is doing, we visualize attention values for a few test segments where it correctly recognizes named entities on which the baseline model makes errors. These examples are shown in Table 4.
| Reference Snippet | Baseline Output | Contextualization Output | Metadata Words (truncated) |
| from the Africa Android challenge | | | |
For the example shown in row 1 of Table 4, both the Contextual and baseline models are able to recognize common entities such as Africa. However, the baseline model has difficulty recognizing entities that are not frequent in the training data set, such as Android and PyTorch. Since Android appears in the metadata, the Contextual RNN-T model is able to attend to it and transcribe it correctly. This can be seen in Figure 3, which visualizes the attention given to the words in the metadata at each output target unit of the Contextual RNN-T model.
7 Conclusion
In this work, we show that contextual metadata text, even if it is noisy, can be used to improve recognition of named entities for a challenging open domain ASR task such as social media videos within the framework of an E2E RNN-T ASR model. Directions to explore as future work include: i) using contextual embeddings from other modalities such as images from the video, ii) an in-depth study of the impact of augmenting contextual information in the Audio Encoder, Text Predictor and Joiner for different modalities, iii) exploring different architectures for the Embedding Extractor (EE), and iv) using semantic embeddings to represent the metadata.
The authors would like to thank Anuroop Sriram, Duc Le, Florian Metze and Geoffrey Zweig for many helpful discussions and suggestions.