Significant progress in automatic speech recognition (ASR) has taken place in recent years. Today, ASR systems are used in daily life across a diverse range of recognition scenarios with distinct background acoustic conditions. However, modern ASR systems still struggle to cope with noise and adverse background conditions, leading to unsatisfactory recognition results in daily use. It is suggested that this could be caused by less effective ASR acoustic modeling based on Mel-Frequency Cepstral Coefficients (MFCC) or log-mel filterbank (FBANK) energies. Such features are sensitive to noise [bhattacharjee2016statistical], and these systems are generally trained for a specific use case and are sensitive to mismatch at test time. Performance can degrade further in distant-talking situations, where signal energy is lower, reverberation is possible, and the environmental signal-to-noise ratio (SNR) is lower. As a result, a solution is needed to minimize the influence of changing background acoustic conditions.
In the past, methods have been proposed to address the problem of noisy speech recognition [gong1995speech]. Most focus on feature enhancement [yu2008minimum] or model adaptation [seltzer2010acoustic]. One proposed method is based on factor-aware training. This technique introduces factors including noise [seltzer2013investigation], speaker [tan2016speaker], and/or room characteristics [giri2015improving] into the training of deep neural networks (DNN) as auxiliary information. This auxiliary information serves as a factor-dependent bias to the DNN, which causes the output of the DNN to depend on the individual factor values. The most well-known example is the i-Vector, which was originally proposed for speaker recognition [dehak2010front] and can be applied as a speaker and channel representation in factor-aware training.
To address diversity in acoustic characteristics, we propose adding a feature that models the acoustic characteristics of the audio, such as channel distortions and environmental noise types. The goal is to make the acoustic model aware of this information, which can be summarized as a "scenario" that exists in the audio. This idea requires either several good representations, one for each classifiable factor, or a single exceptional representation that can suitably distinguish a specific acoustic context.
Past studies have explored triplet loss as a means of improving speech technology, especially for speaker ID [zhang2018text, zhang2019utd]. However, to the best of our knowledge, triplet loss has not been explored in acoustic modeling for ASR. In this study, motivated by past efforts in speaker recognition [zhang2018text, zhang2019utd], we employ a triplet-loss based representation generated by the TRIpLet Loss network (TRILL) [shor2020towards] for speech recognition. That network is trained in a self-supervised manner on the subset of AudioSet [gemmeke2017audio] that carries the speech label. Because AudioSet is a large dataset for general audio machine learning that includes speech tags, its size and scope make it well suited to this purpose. As a result, the triplet-loss based representation is expected to generalize across audio conditions. The technique developed in [jansen2018unsupervised] was used so that audio segments that are closer in time are also closer in the embedding space. Details are presented in Sec.3.2.
The proposed method is assessed using two datasets: the 100-hour challenge corpus of the CRSS-UTDallas Fearless Steps Corpus (Sec.4.1) and the CHiME-4 corpus (Sec.4.2). System development employs the Kaldi speech recognition toolkit [povey2011kaldi] and uses the same feature extraction pipeline, shown in Fig.1, for both corpora. For the CRSS-UTDallas Fearless Steps task, we utilize a factorized time delay neural network (TDNN-f) [povey2018semi] for acoustic modeling, while for CHiME-4 we focus on the 1-channel track, following the setup employed in [chen2018building].
2 Related work
Historically, many approaches have been proposed to address noise robustness in ASR systems [li2014overview, hansen2014environment]. In [seltzer2013investigation], noise-aware training, which incorporates information about the environment, was introduced into DNN training. In [qian2016neural], three extraction models for speaker, phone, and environment were considered, along with a multi-task joint training architecture. In [liang2018learning], an invariant representation learning technique was proposed, which demonstrated a significant reduction in character error rate and improved robustness in out-of-domain noise settings. In [raj2020frustratingly], a simple method was considered to extract a noise vector for acoustic model training. It is suggested that the technique could also be applied in online ASR by estimating the mean vector with frame-level maximum likelihood.
3 Proposed system
Given the robustness challenges that the CRSS-UTDallas Fearless Steps and CHiME-4 corpora pose for ASR, this section presents the formulation of our scenario-aware acoustic modeling to address environmental variability.
3.1 Scenario Aware
Factor-aware training has been shown to be effective in ASR system development [seltzer2013investigation, qian2016neural, raj2020frustratingly]. This training strategy produces a system that is more robust to factors such as noise, speaker, and room characteristics. Most earlier studies have used a separate representation for each specific distortion factor, where the extracted representations are fed into the input layer, a hidden layer, or the output layer. In our study, we use a single representation to characterize all factors and acoustic information within an audio stream, including speech, which leads to scenario-aware training of the resulting acoustic model.
The input feature for our acoustic model contains two types of vectors. The first is the commonly used MFCC together with the i-Vector, which we denote as $\mathbf{x}$. The second is the triplet-loss vector from TRILL, which we denote as $\mathbf{t}$. The total input vector $\mathbf{v}$ is the concatenation of the $m$-dimensional vector $\mathbf{x}$ and the $n$-dimensional vector $\mathbf{t}$:

$$\mathbf{v} = [\mathbf{x}^\top, \mathbf{t}^\top]^\top \in \mathbb{R}^{m+n}$$
Note that we average the triplet-loss embeddings over time for each audio input to form the triplet-loss vector, that is, one vector for the entire utterance. A flow diagram is shown in Fig.1.
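As an illustration of the input construction described above, the following minimal Python sketch builds per-frame input vectors and checks the factor-dependent-bias interpretation of the appended scenario vector. The dimensions (40, 100, 512) follow the values quoted in Sec.3.2. The function name `build_input`, the random toy data, the use of a single per-utterance i-Vector, and the single affine layer are assumptions made for illustration and are not taken from the actual Kaldi recipe.

```python
import numpy as np

# Dimensions quoted in Sec.3.2: MFCC, i-Vector, and TRILL embedding.
MFCC_DIM, IVEC_DIM, TRILL_DIM = 40, 100, 512
X_DIM = MFCC_DIM + IVEC_DIM


def build_input(mfcc, ivector, trill_frames):
    """Return one input vector per acoustic frame (hypothetical helper).

    mfcc:         (num_frames, 40) frame-level features
    ivector:      (100,) utterance-level vector (an assumption of this sketch)
    trill_frames: (num_embeddings, 512) TRILL embeddings over time
    """
    num_frames = mfcc.shape[0]
    # One triplet-loss vector per utterance: average the embeddings over time.
    t = trill_frames.mean(axis=0)
    # Utterance-level vectors are repeated on every acoustic frame.
    x = np.hstack([mfcc, np.tile(ivector, (num_frames, 1))])
    return np.hstack([x, np.tile(t, (num_frames, 1))])


# Toy data standing in for real features.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((300, MFCC_DIM))
ivector = rng.standard_normal(IVEC_DIM)
trill_frames = rng.standard_normal((12, TRILL_DIM))
v = build_input(mfcc, ivector, trill_frames)

# A single affine layer applied to the concatenated input ...
hidden_dim = 8
W = rng.standard_normal((hidden_dim, v.shape[1]))
b = rng.standard_normal(hidden_dim)
layer_out = v @ W.T + b

# ... equals the layer applied to x alone plus an utterance-specific bias.
W_x, W_t = W[:, :X_DIM], W[:, X_DIM:]
scenario_bias = W_t @ trill_frames.mean(axis=0) + b
layer_out_as_bias = v[:, :X_DIM] @ W_x.T + scenario_bias
```

Because the triplet-loss vector is constant across the utterance, its contribution to the first affine layer is identical on every frame, which is one way to see why it behaves as a scenario-dependent bias.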
3.2 Triplet-loss based Representation
The triplet-loss based representation generated by the TRILL model was introduced in [shor2020towards] and was originally used for non-semantic downstream tasks. The pre-trained model we used (https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3) was trained on the subset of AudioSet [gemmeke2017audio] training-set clips that carry the speech label, using the ResNet-50 architecture discussed in [hershey2017cnn] followed by a 512-dimensional embedding layer. Temporal proximity is used as the self-supervision signal [jansen2018unsupervised]. The idea behind using temporal proximity is that sounds in a given environment are usually restricted to a subset of sound-producing objects that are often closely related. Hence, a pair of events in the same audio clip should have a higher probability of being the same, or at least related to some degree, than any two audio clips randomly chosen from a large audio collection.
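The collection of audio used to train a triplet-loss model such as TRILL can be described as a sequence of spectrogram context windows $X = x_1 x_2 \ldots x_N$, where each $x_i \in \mathbb{R}^{F \times T}$, with $F$ and $T$ denoting the frequency and time dimensions of a window. The goal is to learn a map $g: \mathbb{R}^{F \times T} \rightarrow \mathbb{R}^{d}$ that transforms each window into a $d$-dimensional space such that $\|g(x_i) - g(x_j)\| \le \|g(x_i) - g(x_k)\|$ when $|i - j| \le |i - k|$. This is achieved by first sampling a large number of triplets $z = (x_i, x_j, x_k)$, known as the anchor, positive, and negative respectively, where $|i - j| \le \tau$ and $|i - k| > \tau$ for a reasonable time scale $\tau$. Next, we train the model with the triplet loss:

$$\mathcal{L} = \sum_{(x_i, x_j, x_k)} \left[ \|g(x_i) - g(x_j)\|_2^2 - \|g(x_i) - g(x_k)\|_2^2 + \delta \right]_+$$

where the sum runs over the sampled triplets, $\|\cdot\|_2$ is the $L^2$ norm, $\delta$ is a non-negative margin hyperparameter, and $[\cdot]_+$ is the hinge loss. The loss is exactly zero if all the training triplets satisfy the inequality:

$$\|g(x_i) - g(x_j)\|_2^2 + \delta \le \|g(x_i) - g(x_k)\|_2^2$$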
In [shor2020towards], the training task is based on a pair-wise dataset with same-clip ($|i - j| \le \tau$) versus different-clip ($|i - k| > \tau$) discrimination, obtained by setting the time scale $\tau$ to 10 seconds, which is the maximum duration of the clips in AudioSet. This makes the triplet-loss model capable of mapping audio clips with similar acoustic information to similar embeddings. Given the enormous size and label scope of AudioSet, we employ the triplet-loss based representation to model the environment scenario in the audio.
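The triplet objective and the same-clip/different-clip sampling described above can be sketched in a few lines of Python. This is a minimal illustration with NumPy on toy two-dimensional embeddings, not the TRILL training code; the function names `triplet_loss` and `sample_triplet`, the toy clip layout, and the margin value of 0.5 are assumptions chosen for the example.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss on squared L2 distances, summed over a batch of triplets."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.sum(np.maximum(d_pos - d_neg + margin, 0.0)))


def sample_triplet(clip_ids, rng):
    """Sample (anchor, positive, negative) window indices.

    The positive comes from the same clip as the anchor and the negative
    from a different clip, mirroring the same-clip/different-clip task.
    """
    indices = np.arange(len(clip_ids))
    i = int(rng.integers(len(clip_ids)))
    same_clip = indices[(clip_ids == clip_ids[i]) & (indices != i)]
    other_clip = indices[clip_ids != clip_ids[i]]
    return i, int(rng.choice(same_clip)), int(rng.choice(other_clip))


rng = np.random.default_rng(0)
clip_ids = np.repeat(np.arange(4), 5)  # 4 toy clips, 5 context windows each
i, j, k = sample_triplet(clip_ids, rng)

# Toy embeddings: the positive sits close to the anchor.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
far_negative = np.array([[2.0, 0.0]])   # margin inequality holds
near_negative = np.array([[0.5, 0.0]])  # margin inequality is violated

loss_satisfied = triplet_loss(anchor, positive, far_negative, margin=0.5)
loss_violated = triplet_loss(anchor, positive, near_negative, margin=0.5)
```

In the first case the negative is farther from the anchor than the positive by more than the margin, so the hinge clips the loss to zero; in the second case the triplet contributes a positive loss that training would push down.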
The same input feature extraction method is used for all of the following experiments. First, we extract 512-dimensional embeddings with the TRILL model and average them over time to form a triplet-loss vector. Next, this vector is combined with 40-dimensional MFCC features and a 100-dimensional i-Vector to form the input vector for acoustic model training. The feature generation pipeline is shown in Fig.1.
4.1 Fearless Steps Corpus Experiments
The Fearless Steps Corpus [hansen2018fearless] consists of 19,000 hours of actual Apollo conversational speech across 30 time-synchronized channels, with Channel#1 carrying the time-synchronized IRIG timecode. The audio contains actual communications from the Apollo-11 mission, including all Mission Specialists, Astronauts, and support staff, over the 7-day mission to the moon. The communication channel loops have distinct acoustic characteristics (e.g., noise, distortion, and background interference) introduced along the path from NASA analog cables to the SoundScriber recording platform, and the resulting channel and system noise contributes to loss in ASR system performance. The variability across channel loops is due to the extensive cabling, headsets, relays, etc. necessary to bridge 600 NASA specialists in different locations so that they could communicate and work collaboratively to achieve a successful mission. All audio was recorded on 30-track analog 1-inch reel-to-reel tapes, then digitized by CRSS-UTDallas, initially at 44.1 kHz and later down-sampled to 8 kHz, with 30 min. per data chunk for speech analysis.
For this study, we employ only the 100-hour Fearless Steps challenge corpus [joglekar2020fearless], which consists of 5 selected channels with labeled data: Network Controller (NTWK), Electrical, Environmental and Consumables Manager (EECOM), Guidance Navigation and Control (GNC), Flight Director (FD), and Mission Operations Control Room (MOCR). We use the ASR track2 data in the challenge corpus, where the audio is already segmented with utterance-level transcriptions. The training set is roughly 28 hours, the development set is 7.6 hours, and the evaluation set is 10.6 hours. The training set is used for both the acoustic and language models. We use the development set for computing perplexity in language model training, and the evaluation set is used only for testing.
4.1.2 Baseline System
For the lexicon, we employ the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) as a basic pronunciation dictionary. However, since many words in the Fearless Steps Corpus are NASA space related and not present in the CMU dictionary, we use Phonetisaurus G2P [novak2012wfst] to generate pronunciations for these out-of-vocabulary words. A speaker-adapted HMM-GMM is first trained on the training set to generate phoneme-to-audio alignments for DNN training. A TDNN-f [povey2018semi] with 15 layers of dimension 1024, each factorized with a 160-dimensional linear bottleneck, is used for acoustic modeling on the same dataset. For the language model, a basic 3-gram model is used, with pronunciation and silence probability modeling as described in [chen2015pronunciation].
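To give a rough sense of what the 160-dimensional bottleneck buys, the following Python sketch compares parameter counts for a plain 1024-to-1024 linear map and its factorized counterpart, and builds a toy semi-orthogonal projection. It is a simplified illustration only: it ignores the temporal context splicing, biases, and normalization used in real TDNN-f layers, and the choice to make the projection into the bottleneck the semi-orthogonal factor is an assumption of this sketch rather than a description of the Kaldi recipe.

```python
import numpy as np

# Values quoted for the baseline TDNN-f.
LAYER_DIM, BOTTLENECK_DIM = 1024, 160

# Parameter counts for one weight matrix, ignoring context splicing and biases.
full_params = LAYER_DIM * LAYER_DIM
factorized_params = LAYER_DIM * BOTTLENECK_DIM + BOTTLENECK_DIM * LAYER_DIM
compression_ratio = factorized_params / full_params

# Toy semi-orthogonal projection into the bottleneck: rows are orthonormal,
# so M @ M.T is the identity. The reduced QR factor has orthonormal columns.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((LAYER_DIM, BOTTLENECK_DIM)))
M = q.T                                               # shape (160, 1024)
A = rng.standard_normal((LAYER_DIM, BOTTLENECK_DIM))  # expands back to 1024

hidden = rng.standard_normal(LAYER_DIM)
layer_output = A @ (M @ hidden)  # factorized stand-in for a full matrix product
```

Under these simplifications the factorized form needs 327,680 weights instead of 1,048,576, that is, 31.25% of the full matrix.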
4.1.3 Results and Analysis
The purpose of using a triplet-loss based representation is to model the acoustic condition of each audio context. As shown in Fig.2, the triplet-loss vectors extracted from the training data fall into a few different blocks, each of which can be treated as a distinct acoustic characteristic. Since we assume that the channel number corresponding to each utterance is not known, the speaker information is used as the label instead. Most speakers have utterances spread across multiple blocks, but they do not necessarily cover every block. The triplet-loss based representation also allows analysis of the environment scenario, as shown in Fig.3 for selected channels with different characteristics. This figure uses a randomly selected 360-hour subset of the complete 9,000-hour Fearless Steps Corpus that contains channel number information for each audio stream. Each point represents a triplet-loss vector extracted with the TRILL model from a 15-second block of audio cut from the original 30 min. sequential audio chunks.
|(No.) Model|i-Vector|Dim|Dev WER (%)|Eval WER (%)|
|(3) + T-REP matrix*|yes|1536|27.07|29.64|
|(4) + T-REP|no|1536|26.68|29.30|
|(5) + T-REP|yes|1536|26.49|29.17|
|(6) + T-REP & multi-style|yes|1536|26.16|28.94|
|Gorin et al. [gorin400houston]|yes|1024|28.60|31.4|
*The embeddings are not averaged over time.
In Table 1, word error rates (WER) are shown for experiments on the Fearless Steps Corpus. The first row is the baseline system described in Sec.4.1.2. We found that increasing the layer dimension of the TDNN-f to 1536 (No.2) further reduces WER, but increasing it to larger dimensions such as 2136 (not shown here) caused a loss in performance. After adding the triplet-loss based representation, we observe 2.9% and 1.1% relative WER improvements on the development and evaluation sets (No.5 vs. No.2). Another finding is that averaging the triplet-loss embeddings over time improves WER (No.5 vs. No.3). No.4 is the only experiment without the i-Vector; it shows that MFCC with the triplet-loss based representation (No.4) is better than MFCC with the i-Vector (No.2). Overall, our best system (No.6), with the triplet-loss based representation and multi-style training, achieves 5.42% and 3.18% relative WER improvement on the development and evaluation sets, respectively. Multi-style training is accomplished by adding data augmented with room impulse responses (RIR) and the MUSAN corpus (music, speech, and noise). With the original data included, the training set is expanded to 5x its original size. Note that simply adding the triplet-loss based representation provides more improvement than multi-style training (No.5 - No.2 vs. No.6 - No.5). The last row is included as a comparison to the best system in Fearless Steps Challenge Phase II, under conditions matched to our system in acoustic model, language model, and training data.
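For readers who want to reproduce the relative figures, the short Python sketch below shows how a relative WER improvement is computed from two absolute WERs. It uses only the values listed for systems No.5, No.6, and Gorin et al. in Table 1; the helper name `relative_wer_improvement` is hypothetical.

```python
def relative_wer_improvement(reference_wer, new_wer):
    """Relative WER reduction, in percent, of new_wer with respect to reference_wer."""
    return 100.0 * (reference_wer - new_wer) / reference_wer


# WER (%) values as listed in Table 1.
no5_dev, no5_eval = 26.49, 29.17      # + T-REP
no6_dev, no6_eval = 26.16, 28.94      # + T-REP & multi-style
gorin_dev, gorin_eval = 28.60, 31.4   # Gorin et al.

# Gain from multi-style training alone (No.6 vs. No.5): roughly 1.25 and 0.79.
multi_style_dev = relative_wer_improvement(no5_dev, no6_dev)
multi_style_eval = relative_wer_improvement(no5_eval, no6_eval)

# Best system vs. the Challenge Phase II reference: roughly 8.53 and 7.83.
vs_gorin_dev = relative_wer_improvement(gorin_dev, no6_dev)
vs_gorin_eval = relative_wer_improvement(gorin_eval, no6_eval)
```

The same formula applies to any pair of rows; the figures quoted in the text against systems No.1 and No.2 depend on WER values that are not repeated here.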
4.2 CHiME-4 corpus Experiments
The CHiME-4 data [vincent2017analysis] include real data recorded in real-world noisy environments and simulated data that are artificially created by mixing clean speech with noisy background recordings. Five locations (i.e., booth (BTH), bus (BUS), cafe (CAF), pedestrian area (PED), and street junction (STR)) are chosen for real data recording. The BTH recordings are used for generating the simulated data, while the rest are used for ASR evaluation.
4.2.2 Baseline System
For a fair comparison, we focus only on the single-channel track of the CHiME-4 challenge. The baseline system follows the work in [chen2018building], which uses a TDNN trained with LF-MMI on data from all 6 channels and an LSTM language model trained with Kaldi-RNNLM [xu2018neural] on 3-fold text of the training data. The pronunciation dictionary is also based on the CMU dictionary.
|Model||Dev (%)||Test (%)|
|Model||real (%)||simu (%)|
4.2.3 Results and Analysis
In Table 2, we demonstrate the effect of adding the triplet-loss based representation. We observe 16.10% and 11.90% relative WER improvements on the real development and real test data, respectively, with only a small loss on the simulated data.
Fig.4 shows the t-SNE plot of triplet-loss vectors extracted from the training set of the CHiME-4 corpus. We can see a clear separation between locations, with only CAF and PED overlapping more than the others. This observation matches the WER improvement. We also suggest that this is one of the reasons why CHiME-4 benefits more from the triplet-loss based representation than the Fearless Steps Corpus does.
We further investigate the effectiveness of the triplet-loss based representation by breaking down results by environment in the test set. In Table 3, we show a greater improvement for real data than for simulated data. Consistent with the observation in Fig.4, the BUS and STR locations show greater improvement than the CAF and PED locations. This leads to the conclusion that the more distinct the acoustic context of the audio is from the others, the more the triplet-loss based representation helps.
4.3 Analysis on Triplet-loss Representation Performance
We note that there is a wide gap in the performance of the triplet-loss based representation between the Fearless Steps and CHiME-4 corpora (i.e., 1.1% compared to 11.90% relative WER improvement). We suggest that this difference in system improvement stems from the dissimilar acoustic makeup of the two corpora. As mentioned in Sec.4.1.1, the audio in Fearless Steps consists entirely of analog recordings, where audio cable routing and channel recording conditions cause additional background noise that becomes a distinct characteristic of each channel. However, the specific distortions of these channels, such as strong low-frequency harmonics, are relatively simple compared to the diverse general background noises in the CHiME-4 data. Another possible reason is that conversational turns in the Fearless Steps Corpus are commonly short, with a mean duration of 1.93 sec and a standard deviation of 3.28 sec, while audio duration in CHiME-4 has a mean of 7.44 sec and a standard deviation of 2.86 sec. In the Fearless Steps training set, 22% of the audio turns are shorter than 1 sec. These short durations make it hard for the triplet-loss based representation to be as meaningful or effective.
This study has considered a triplet-loss approach for scenario-aware speech recognition. To obtain the triplet-loss based representation, we utilize the TRILL model (Sec.3.2) to model all factors and acoustic information within an utterance, leading to a scenario-aware ASR system. This technique is especially beneficial for real data when compared to simulated data. Furthermore, the more distinct the background acoustic conditions are from one another, the greater the possible improvement. The system achieved 5.42% and 3.18% relative WER improvement on the development and evaluation sets of the Fearless Steps Corpus, and 11.90% relative WER improvement on the real test data of the CHiME-4 corpus.
Our future work will explore alternative representations trained for different architectures and data. Also, we will further explore the integration between neural embeddings and the resulting acoustic model.
The authors would like to express their sincere thanks for valuable discussions with Wei-Cheng Lin, Midia Yousefi, and Aditya Joglekar.