On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement

02/10/2020 ∙ by Ashutosh Pandey, et al. ∙ The Ohio State University

In recent years, supervised approaches using deep neural networks (DNNs) have become the mainstream for speech enhancement. It has been established that DNNs generalize well to untrained noises and speakers if trained using a large number of noises and speakers. However, we find that DNNs fail to generalize to new speech corpora in low signal-to-noise ratio (SNR) conditions. In this work, we establish that the lack of generalization is mainly due to the channel mismatch between the trained and untrained corpus. Additionally, we observe that traditional channel normalization techniques are not effective in improving cross-corpus generalization. Further, we evaluate publicly available datasets that are promising for generalization. We find one particular corpus to be significantly better than others. Finally, we find that using a smaller frame shift in short-time processing of speech can significantly improve cross-corpus generalization. The proposed techniques to address cross-corpus generalization include channel normalization, better training corpus, and smaller frame shift in short-time Fourier transform (STFT). These techniques together improve the objective intelligibility and quality scores on untrained corpora significantly.

I Introduction

A speech signal in a real-world environment is degraded by background noise. Such degradation can severely reduce the performance of speech-based applications such as automatic speech recognition (ASR), speaker identification, and hearing aids. Speech enhancement is concerned with improving the intelligibility and quality of a speech signal degraded by additive noise, and is commonly used as a preprocessor in speech-based applications to improve their performance in noisy environments.

In real-world environments, speech signals are varied or distorted [2]. Sources of variations include background noise, room reverberation, speaker, language, accent, and communication channel. Ideally a speech enhancement algorithm should work well in different acoustic conditions. However, developing a general algorithm that works in all conditions remains a technical challenge.

Traditional approaches to speech enhancement include spectral subtraction [3], Wiener filtering [30], statistical model-based methods [18], and nonnegative matrix factorization [19]. These approaches work well for stationary noises but have difficulty in handling nonstationary noises or a large number of speakers. In recent years, deep learning-based approaches have become the mainstream for speech enhancement (see [36] for an overview). Among the most popular deep learning approaches are fully-connected networks [38, 41], recurrent neural networks (RNNs) [39, 5] and convolutional neural networks (CNNs) [11, 32, 22].

In [6], Chen et al. demonstrated that fully connected feedforward networks trained for a single speaker, using a large number of noises, can generalize to untrained noises. However, such a network has difficulty generalizing to both untrained speakers and untrained noises, even when trained using a large number of noises and speakers [5]. In [5], an RNN with long short-term memory (LSTM) is employed to develop a speaker- and noise-independent model for speech enhancement. This was achieved by training a four-layered RNN model using utterances from 77 speakers mixed with 10000 different noises.

In the last few years, speech enhancement research has aimed to improve the performance of speaker- and noise-independent models. In [32], the authors propose a CNN with gated and dilated convolutions for magnitude-spectrum enhancement. A recent trend is to enhance the phase as well, which yields better speech enhancement than magnitude-only approaches. The two popular approaches are complex-spectrogram enhancement [40, 34, 9, 23, 7] and time-domain enhancement [10, 22, 25, 27, 28, 24].

The common practice in all the DNN based approaches is that a DNN is trained using utterances of different speakers from a single corpus and evaluated on untrained speakers from the same corpus. However, we find that when evaluated on utterances from untrained corpora, DNN performance may degrade significantly. This behavior has not been revealed and analyzed before. To be suitable for real-world applications, speech enhancement has to work on noisy utterances recorded in an unknown fashion, i.e. on any untrained corpus.

In this study, we conduct experiments to understand cross-corpus generalization of DNNs. Our key observation is that the generalization gap is severe at low SNR conditions and is mainly due to the channel mismatch between different speech corpora. We examine the effectiveness of traditional channel normalization techniques for speech enhancement in low SNR conditions.

The general behavior of traditional channel normalization methods used in ASR or speaker identification systems, such as cepstral mean subtraction (CMS) [1, 12] or RASTA filtering [14, 15], is unknown for supervised speech enhancement. In supervised approaches to speech enhancement, a noisy utterance is generated by adding a noise segment to a clean speech utterance. It is highly unlikely that the channels of clean speech and noise will be similar. This creates a channel situation that is different from those in ASR and speaker recognition, where the noise channel is not a main concern. In other words, a noisy utterance captures two kinds of channel effects, one for speech and the other for noise. This implies that the channel predicted from the noisy utterance may be inaccurate in noise-dominant segments. To verify this analysis, we have evaluated two different channel normalization methods, mean subtraction and RASTA filtering, in the log-spectrum domain. We choose the log-spectrum domain because most of the DNN based speech enhancement systems use either the spectrum or the log-spectrum as the input features. We observe improved enhancement using channel normalization; however, the improvements are indeed limited in low SNR conditions.

Further, we evaluate different corpora that are promising for cross-corpus generalization. A corpus that is recorded using many microphones or recorded in different acoustic conditions would be promising as it will expose the underlying DNN model to different channels. LibriSpeech [21] and VoxCeleb2 [8] are two such corpora. The utterances in LibriSpeech are extracted from audiobooks that are read by different volunteers across the globe. This implies that the utterances recorded by different volunteers have different channel characteristics. VoxCeleb2 utterances are extracted from the audios in YouTube videos and hence are recorded in different conditions and using different devices. We find LibriSpeech to be significantly better than VoxCeleb2 and WSJ [26], the latter commonly used in speaker-independent enhancement models.

Additionally, we investigate the use of smaller frame shifts in STFT, as smaller shifts may lead to better cross-corpus generalization because of the averaging effect in the overlap-and-add stage of inverse STFT. This turns out to be a very simple and effective technique for improving cross-corpus generalization.

Finally, we combine all the proposed techniques: channel normalization, better training corpus, and smaller frame shift. This combination substantially improves objective intelligibility and quality scores. The short-time objective intelligibility (STOI) [31] and the perceptual evaluation of speech quality (PESQ) [29] scores at dB SNR for babble noise are improved by percentage points and respectively for the utterances of a male speaker in the challenging IEEE corpus [16].

To our knowledge, this is the first systematic study on cross-corpus generalization in DNN based speech enhancement. The results of this study, we believe, represent a major step towards robust speech enhancement in real-world conditions. The rest of the paper is organized as follows. In Section II, we describe the speech enhancement framework used in this study. Section III explains corpus channel. Section IV illustrates the corpus fitting problem in speech enhancement. In Section V, we describe the techniques explored in this study to improve cross-corpus generalization. Experimental settings are given in Section VI and Section VII presents the results. Concluding remarks are given in Section VIII.

II Deep Learning Based Speech Enhancement

II-A Problem Definition

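Given a clean speech signal $\mathbf{s}$ and a noise signal $\mathbf{n}$, the noisy speech signal $\mathbf{x}$ is formed by additive mixing:

$$\mathbf{x} = \mathbf{s} + \mathbf{n} \tag{1}$$

where $\{\mathbf{x}, \mathbf{s}, \mathbf{n}\} \in \mathbb{R}^{M \times 1}$ and $M$ represents the number of samples in the signal. The goal of a speech enhancement algorithm is to get a close estimate, $\hat{\mathbf{s}}$, of $\mathbf{s}$ given $\mathbf{x}$.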

II-B Data Generation

Given a speech corpus containing training utterances {} and test utterances {}, we denote as the set of training utterances and as the set of test utterances in corpus .

The noisy utterances are generated by artificially adding noises to the utterances in and .

(2)
(3)

In general, to assess noise generalization, and are set to be either different noises or different segments of nonstationary noises. Similarly, to assess speaker generalization, speakers in and are set to be different.

In this work, we evaluate DNN based speech enhancement models for cross-corpus generalization. We train different models on corpora {, , …, } but evaluate them on utterances from untrained corpora {, , …, }. and denote the numbers of training and test corpora respectively.
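The additive mixing used for data generation can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mix_at_snr`, the use of mean power to define the SNR, and the random stand-in signals are all assumptions made for the example.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested SNR in dB.

    Assumes `noise` is at least as long as `speech`; SNR is defined here
    as the ratio of mean powers (an assumption of this sketch).
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to speech_power / 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)  # random stand-in for a clean utterance
n = rng.standard_normal(16000)  # random stand-in for a noise segment
x, n_scaled = mix_at_snr(s, n, snr_db=-5.0)
measured_snr = 10.0 * np.log10(np.mean(s ** 2) / np.mean(n_scaled ** 2))
```

Returning the scaled noise alongside the mixture keeps the clean and noise components available for computing training targets later.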

II-C Feature Extraction and Training Targets

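The signals $\{\mathbf{x}, \mathbf{s}, \mathbf{n}\}$ (noisy speech, clean speech, and noise) are transformed to the time-frequency (T-F) representation using STFT.

$$\mathbf{X} = \mathrm{STFT}(\mathbf{x}) \tag{4}$$
$$\mathbf{S} = \mathrm{STFT}(\mathbf{s}) \tag{5}$$
$$\mathbf{N} = \mathrm{STFT}(\mathbf{n}) \tag{6}$$

where $\{\mathbf{X}, \mathbf{S}, \mathbf{N}\} \in \mathbb{C}^{T \times F}$, and $T$ and $F$ represent the number of frames and the number of frequency bins. In this study, we use either the STFT magnitude, $|\mathbf{X}|$, or the logarithm of the STFT magnitude, $\log|\mathbf{X}|$, as the input feature.

There are many training targets studied in the literature such as the ideal ratio mask (IRM) [37], STFT magnitude [41], and spectral magnitude mask (SMM) [37]. We use the IRM in this study, defined as:

$$\mathrm{IRM}(t,f) = \sqrt{\frac{|S(t,f)|^{2}}{|S(t,f)|^{2} + |N(t,f)|^{2}}} \tag{7}$$

where $\mathrm{IRM}(t,f)$, $S(t,f)$ and $N(t,f)$, respectively, denote the values of $\mathbf{IRM}$, $\mathbf{S}$ and $\mathbf{N}$ at the corresponding T-F unit.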
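As an illustration of the feature extraction and training target described above, the following sketch computes a Hamming-windowed STFT and an ideal ratio mask with NumPy. The square-root power-ratio form of the IRM is the common definition from the literature and is assumed here; the frame length, hop, `eps`, and the random stand-in signals are illustrative choices, not the settings of this study.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Hamming-windowed STFT; returns a complex (frames, bins) array."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def ideal_ratio_mask(S, N, eps=1e-12):
    """Assumed IRM form: sqrt(|S|^2 / (|S|^2 + |N|^2)) for every T-F unit."""
    s_pow = np.abs(S) ** 2
    n_pow = np.abs(N) ** 2
    return np.sqrt(s_pow / (s_pow + n_pow + eps))

rng = np.random.default_rng(0)
s = rng.standard_normal(8000)        # random stand-in for clean speech
n = 0.5 * rng.standard_normal(8000)  # random stand-in for noise
S, N = stft(s), stft(n)
irm = ideal_ratio_mask(S, N)
log_feature = np.log(np.abs(stft(s + n)) + 1e-8)  # log-magnitude input feature
```

The mask is bounded between 0 and 1, which is why a sigmoidal output layer is a natural fit for predicting it.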

II-D Model Architecture

We use a 4-layer bidirectional LSTM (BLSTM) network with 512 hidden units in each direction. One fully-connected layer with 512 units is used before the BLSTM, which is followed by a fully-connected layer at the output with sigmoidal nonlinearity.

II-E Loss Function

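The BLSTM network takes as input the feature, $|\mathbf{X}|$ or $\log|\mathbf{X}|$ (the magnitude or log-magnitude STFT of the noisy utterance), and outputs the estimated IRM, $\widehat{\mathbf{IRM}}$. A mean squared error (MSE) loss is used between $\mathbf{IRM}$ and $\widehat{\mathbf{IRM}}$. The utterance level MSE loss is given below.

$$L = \frac{1}{T F} \sum_{t=1}^{T} \sum_{f=1}^{F} \left( \mathrm{IRM}(t,f) - \widehat{\mathrm{IRM}}(t,f) \right)^{2} \tag{8}$$

where $T$ and $F$ are the numbers of frames and frequency bins in the utterance.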

II-F Time Domain Reconstruction

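The trained model is used for predicting the IRM of noisy utterances in the test set. The estimated IRM, $\widehat{\mathbf{IRM}}$, is multiplied with the noisy STFT magnitude, $|\mathbf{X}|$, to obtain the enhanced STFT magnitude, $|\hat{\mathbf{S}}|$.

$$|\hat{\mathbf{S}}| = \widehat{\mathbf{IRM}} \odot |\mathbf{X}| \tag{9}$$

where $\odot$ denotes element-wise multiplication.

The estimated STFT magnitude is combined with the noisy STFT phase to obtain the estimated STFT.

$$\hat{\mathbf{S}} = |\hat{\mathbf{S}}| \odot e^{j \theta_{\mathbf{X}}} \tag{10}$$

where $\theta_{\mathbf{X}}$ represents the noisy phase. Finally, inverse STFT is used to obtain the enhanced waveform.

$$\hat{\mathbf{s}} = \mathrm{ISTFT}(\hat{\mathbf{S}}) \tag{11}$$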
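The reconstruction steps in Equations 9 to 11 can be sketched as below. This is a simplified illustration under stated assumptions: the overlap-add inverse is normalized by the summed analysis window, the frame length and hop are arbitrary, and an all-ones mask stands in for a network output so that the round trip can be checked.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Hamming-windowed STFT; returns a complex (frames, bins) array."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len=512, hop=128):
    """Overlap-add inverse STFT, normalized by the summed window."""
    window = np.hamming(frame_len)
    n_frames = spec.shape[0]
    length = frame_len + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    for i in range(n_frames):
        start = i * hop
        out[start:start + frame_len] += frames[i]
        norm[start:start + frame_len] += window
    return out / np.maximum(norm, 1e-8)

def enhance(noisy, mask, frame_len=512, hop=128):
    """Apply a T-F mask to the noisy magnitude and reuse the noisy phase."""
    X = stft(noisy, frame_len, hop)
    enhanced_mag = mask * np.abs(X)                           # Eq. 9
    enhanced_spec = enhanced_mag * np.exp(1j * np.angle(X))   # Eq. 10
    return istft(enhanced_spec, frame_len, hop)               # Eq. 11

rng = np.random.default_rng(0)
noisy = rng.standard_normal(8000)      # random stand-in for a noisy utterance
all_pass = np.ones(stft(noisy).shape)  # placeholder for an estimated mask
restored = enhance(noisy, all_pass)
```

With an all-ones mask the output equals the input over the samples covered by complete frames, which is a convenient sanity check before plugging in a trained model.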

III Corpus Channel

A speech corpus generally contains different utterances spoken by many speakers. The utterances are recorded in a controlled environment so that the recording is clean and suitable to be used for speech-based applications. The different controlled environments used for different corpora may lead to different stationary components in the utterances. For example, if recording microphones are different, a sentence spoken by the same person can be very different in quality. We refer to the stationary component of a corpus as the corpus channel.

An algorithm developed and shown to be effective for one corpus may not work when evaluated on a corpus recorded in a different condition. To illustrate this, Fig. 1 plots the log-spectrum of an utterance from the TIMIT corpus [13] that is convolved with two different microphone impulse response (MIR) functions (the two MIRs are obtained from https://www.audiothing.net/impulses/vintage-mics/). We can observe that the energy patterns in the two spectra are very different. The left spectrum has higher energy around frequency bin and lower energy around the bin compared to the right spectrum. This type of difference in distribution may cause an algorithm to degrade on untrained corpora.

Fig. 1: Differences in the energy distribution of a spectrum convolved using different MIR functions. The frequency responses of MIRs are shown in the top row.

A stationary channel can be defined as a linear time-invariant filter given in the following equation,

$$y[k] = s[k] * h[k] \tag{12}$$

where $*$ denotes the convolution operator, $y$ and $s$ are discrete signals indexed by $k$, and $h$ is a digital filter with $K$ taps. When the underlying signal, $s$, is a time-varying speech signal, Equation 12 can be transformed into the following form using STFT.

$$Y(t,f) = S(t,f)\, H(f) \tag{13}$$

where $H(f)$ is the time-invariant but frequency-dependent gain introduced by the channel. Note that $H(f)$ does not contain any time index, implying the stationarity of the channel. Taking the logarithm of the complex magnitude on both sides of Equation 13, we get

$$\log|Y(t,f)| = \log|S(t,f)| + \log|H(f)| \tag{14}$$

A straightforward method to remove a stationary channel from a speech signal is log-spectral mean subtraction (LSMS). In this method, the long-term average of a log-spectrum is subtracted from the log-spectrum to obtain a channel removed log-spectrum. Taking the average over the $T$ frames in Equation 14, we get

$$\frac{1}{T}\sum_{t=1}^{T}\log|Y(t,f)| = \frac{1}{T}\sum_{t=1}^{T}\log|S(t,f)| + \log|H(f)| \tag{15}$$

Now, we define the channel of a corpus, $H_C(f)$, using the following equation.

$$\log|H_C(f)| = \log|H(f)| + \overline{\log|S_C(f)|} \tag{16}$$

Thus the defined corpus channel consists of two components, where $\log|H(f)|$ corresponds to the recording channel and $\overline{\log|S_C(f)|}$ corresponds to the average of the log-spectrum over the corpus. It is important to note that channel differences between corpora are primarily caused by $H(f)$, as the long-term average speech spectrum is similar across different dialects of the same language and even different languages [4].

Further subtracting Equation 16 from Equation 14, we get

$$\log|Y(t,f)| - \log|H_C(f)| = \log|S(t,f)| - \overline{\log|S_C(f)|} \tag{17}$$

The above equation says that removing the defined corpus channel from an utterance of a corpus gives a normalized utterance with both channel and speech mean effects removed.
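A small sketch of corpus channel estimation and removal in the log-spectral domain is given below. It assumes the corpus channel is estimated as the mean of per-utterance time averages of log-magnitude spectra; the exact averaging convention, the function names, and the synthetic data are assumptions of this example rather than details taken from the study.

```python
import numpy as np

def corpus_channel(log_specs):
    """Estimate the log corpus channel from a list of log-magnitude spectra.

    log_specs: list of (frames, bins) arrays, one per utterance.
    Returns a (bins,) array. Averaging per utterance first, then across
    utterances, is an assumption of this sketch.
    """
    per_utterance_mean = [ls.mean(axis=0) for ls in log_specs]
    return np.mean(per_utterance_mean, axis=0)

def remove_corpus_channel(log_spec, log_channel):
    """Subtract the corpus channel from every frame of one utterance."""
    return log_spec - log_channel[np.newaxis, :]

rng = np.random.default_rng(0)
# Synthetic "clean" log-spectra shared by two hypothetical corpora.
clean = [rng.standard_normal((40, 6)) for _ in range(5)]
channel_a = rng.standard_normal(6)  # log gain of a hypothetical recording chain A
channel_b = rng.standard_normal(6)  # log gain of a hypothetical recording chain B
corpus_a = [u + channel_a for u in clean]
corpus_b = [u + channel_b for u in clean]

norm_a = remove_corpus_channel(corpus_a[0], corpus_channel(corpus_a))
norm_b = remove_corpus_channel(corpus_b[0], corpus_channel(corpus_b))
```

In this toy setup the same utterance recorded through two different stationary channels maps to the same normalized log-spectrum, which is the property the derivation above relies on.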

We will use Equation 16 to estimate the spectral magnitudes of the corpus channel of three popular corpora utilized for speech enhancement: WSJ SI-84, TIMIT, and IEEE [16]. A frame of 20 ms with a shift of 10 ms is used for STFT computation. The estimates for the channels are plotted in Fig. 2. We can observe that the channels are quite different from each other. Even though the peaks occur at nearby frequencies, the decay rates differ considerably. The decay rate is fastest for IEEE and slowest for TIMIT. TIMIT and WSJ exhibit two peaks whereas IEEE shows only one peak.

Fig. 2: The estimated spectral magnitudes of the channels of three speech corpora.

IV Corpus Fitting

In this section, we demonstrate that models trained on one corpus fail to generalize to untrained corpora. Further, we show that the corpus channel is one of the factors that reduce the performance on untrained corpora.

We evaluate three different types of models: an IRM based BLSTM model described in Section II, a complex-spectrum based model proposed in [33] and two time-domain models proposed in [22, 24]. The models are trained on the WSJ corpus and are evaluated on three different corpora, WSJ, TIMIT, and IEEE. We select one male and one female speaker from the IEEE dataset and treat them as two different corpora. They are denoted as IEEE Male and IEEE Female respectively. The evaluation results in terms of STOI (%) and PESQ, for babble noise at SNRs of -5 dB and -2 dB, are given in Table I.

One can observe that the performance on the trained corpus, WSJ, is excellent. STOI is improved by at least 17.5 percentage points for all the models. However, the improvements are much reduced on untrained corpora, TIMIT, IEEE Male and IEEE Female. For the IEEE Male speaker at -5 dB, AECNN-SM and CRN even degrade STOI compared to unprocessed mixtures. Similarly, PESQ is also degraded in many cases. The results suggest that the BLSTM model is better in terms of generalization, even though its within-corpus enhancement results are not as good as those of the more recent models. Therefore we choose this model for comparisons in the rest of the paper.

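| Metric | Model | WSJ -5 dB | WSJ -2 dB | TIMIT -5 dB | TIMIT -2 dB | IEEE Male -5 dB | IEEE Male -2 dB | IEEE Female -5 dB | IEEE Female -2 dB |
|---|---|---|---|---|---|---|---|---|---|
| STOI (%) | Mixture | 58.6 | 65.5 | 54.0 | 60.9 | 55.0 | 62.3 | 55.5 | 62.9 |
| STOI (%) | BLSTM | 77.4 | 83.0 | 64.7 | 73.3 | 60.4 | 74.0 | 62.5 | 73.5 |
| STOI (%) | CRN [33] | 80.3 | 86.8 | 59.0 | 69.6 | 52.6 | 65.5 | 51.6 | 68.0 |
| STOI (%) | AECNN-SM [22] | 81.0 | 88.3 | 60.8 | 72.0 | 51.5 | 65.2 | 61.1 | 75.8 |
| STOI (%) | TCNN [24] | 82.7 | 88.9 | 61.6 | 72.9 | 57.2 | 69.9 | 56.5 | 74.1 |
| PESQ | Mixture | 1.54 | 1.69 | 1.46 | 1.63 | 1.46 | 1.63 | 1.12 | 1.32 |
| PESQ | BLSTM | 1.97 | 2.22 | 1.70 | 2.00 | 1.52 | 1.89 | 1.26 | 1.66 |
| PESQ | CRN [33] | 2.17 | 2.50 | 1.33 | 1.73 | 1.07 | 1.50 | 0.91 | 1.50 |
| PESQ | AECNN-SM [22] | 2.19 | 2.60 | 1.40 | 1.78 | 1.13 | 1.50 | 1.28 | 1.83 |
| PESQ | TCNN [24] | 2.19 | 2.53 | 1.33 | 1.74 | 1.18 | 1.61 | 1.01 | 1.64 |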

TABLE I: STOI and PESQ comparisons between different test corpora for four deep learning based speech enhancement methods.
Fig. 3: Effects of corpus-channel on cross-corpus generalization. First row plots STOI (%) obtained using original utterances. Second row plots STOI (%) using channel-removed utterances.

Next, we illustrate the behavior of the BLSTM model for different types of noises and at different SNR conditions. The plots of STOI improvement (%) are shown in the first row of Fig. 3. We observe that for all the noises the gap between trained and untrained corpus is the largest at dB and gradually narrows with increasing SNR. This illustrates that cross-corpus generalization is a severe issue in low SNR conditions. Similarly, the generalization gap at low SNRs for different noises is in order of babble, cafeteria, factory and engine.

Finally, we design an experiment to demonstrate that the corpus channel is a major culprit for the cross-corpus generalization issue. We use Equation 17 to get corpus channel removed spectrum of utterances in a corpus. The corpus channel removed spectrum is used for time-domain reconstruction using Eqs. 10 and 11. For a given corpus , we use for the corpus channel estimation, and use it to get corpus channel removed utterances in both and . We use a frame size of 2048 and frame shift of 32 in STFT. We find that this setting introduces negligible artifacts in the modified utterances.

We show the effect of corpus channel normalization on sample utterances from different corpora in Fig. 4. One can observe that the energy distribution in different frequency bins becomes more prominent, especially in the high-frequency range where the corpus channel has a large attenuation factor.

We use corpus channel normalized utterances to generate a new training corpus on WSJ and new test corpora on WSJ, TIMIT, IEEE Male and IEEE Female. The BLSTM model is trained on the new WSJ corpus and evaluated on all the test corpora for four different noises. The improvements in STOI (%) are plotted in the second row of Fig. 3. These improvements are significantly higher than those in the first row. For example, STOI of the babble noise at dB changes from to for IEEE Male, and to for IEEE Female. In addition, STOI improves for all the noises and in all SNR conditions. This demonstrates that the corpus channel is one of the main causes for the cross-corpus generalization issue, and channel differences need to be accounted for in order to improve cross-corpus generalization.

V Improving Cross-Corpus Generalization

In this section, we describe different techniques investigated in this study to improve cross-corpus generalization.

V-A Modified Loss Function

We find that using a loss over high energy T-F units is better for cross-corpus generalization. We use loss over T-F units within the 20 dB of the maximum amplitude T-F unit. The modified utterance level loss is given as

(18)

where,

(19)
Fig. 4: Effects of channel normalization. The spectrogram of one utterance from each of the three corpora are plotted in the first column. The corresponding channel removed spectrograms are plotted in the second column.
Fig. 5: STOI and PESQ comparisons between the baseline, modified loss, LSMS and RASTA.
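The idea behind the modified loss of Section V-A, restricting the error to T-F units within 20 dB of the maximum amplitude unit, can be sketched as follows. The sketch makes several assumptions that the text does not settle: the reference magnitude is passed in explicitly (which spectrum it comes from is left open), the loss is averaged over the retained units only, and the function name `high_energy_mse` is hypothetical.

```python
import numpy as np

def high_energy_mse(target_mask, estimated_mask, magnitude, range_db=20.0):
    """MSE restricted to T-F units within `range_db` of the strongest unit.

    magnitude: (frames, bins) reference magnitude used to select units.
    Normalizing by the number of retained units is an assumption.
    """
    level_db = 20.0 * np.log10(magnitude / magnitude.max() + 1e-12)
    keep = level_db >= -range_db
    squared_error = (target_mask - estimated_mask) ** 2
    return squared_error[keep].mean()

# Toy 2x2 example: the unit with magnitude 0.05 is about 26 dB below the peak
# and is therefore excluded from the loss.
magnitude = np.array([[1.0, 0.5], [0.05, 0.2]])
target = np.zeros((2, 2))
estimate = np.array([[0.1, 0.2], [0.9, 0.3]])
loss_high_energy = high_energy_mse(target, estimate, magnitude)
loss_all_units = np.mean((target - estimate) ** 2)
```

In the toy example the large error sits in a low-energy unit, so it dominates the ordinary MSE but does not contribute to the restricted loss.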

V-B Channel Normalization

We have discussed in Section IV that removing the corpus channel can be helpful in improving cross-corpus generalization. We evaluate the following channel normalization techniques in this study.

V-B1 Log-Spectral Mean Subtraction

Given a noisy utterance with STFT $X(t,f)$ over $T$ frames, the channel can be estimated by taking the average of log-spectra over all the frames in the utterance

$$\log|\hat{H}(f)| = \frac{1}{T}\sum_{t=1}^{T}\log|X(t,f)| \tag{20}$$

The channel normalized log-spectrum is defined as

$$\log|X_{\mathrm{norm}}(t,f)| = \log|X(t,f)| - \log|\hat{H}(f)| \tag{21}$$

We use $\log|\mathbf{X}_{\mathrm{norm}}|$ as the input feature in this case. Note that estimating the channel using noisy utterances may not be as accurate as using clean utterances because noise and speech in the data are likely to be recorded in different conditions and using different kinds of devices. Nevertheless, it can give a good approximation for the frequency bins dominated by speech. We add a small positive constant before applying the logarithm operator.
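Utterance-level log-spectral mean subtraction (Equations 20 and 21) can be sketched as below. The value of the small constant `eps`, the function name `lsms`, and the synthetic magnitudes are assumptions made for illustration.

```python
import numpy as np

def lsms(magnitude, eps=1e-8):
    """Log-spectral mean subtraction for a single utterance.

    magnitude: (frames, bins) STFT magnitude of the noisy utterance.
    eps: small positive constant added before the logarithm (assumed value).
    """
    log_spec = np.log(magnitude + eps)
    channel = log_spec.mean(axis=0, keepdims=True)  # Eq. 20: per-bin time average
    return log_spec - channel                       # Eq. 21: normalized log-spectrum

rng = np.random.default_rng(0)
magnitude = rng.uniform(0.5, 2.0, size=(50, 8))  # synthetic magnitudes
gain = rng.uniform(0.1, 10.0, size=8)            # hypothetical stationary gain per bin
features = lsms(magnitude)
features_filtered = lsms(magnitude * gain)
```

Because a stationary per-bin gain becomes an additive offset in the log domain, the normalized features are (up to the effect of `eps`) unchanged by it. This invariance holds for a single channel acting on the whole mixture; it does not address the case discussed above where speech and noise pass through different channels.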

V-B2 RASTA Filter

The RASTA filter has been shown to attenuate the channel effects and improve the generalization of ASR systems [20]. The RASTA filter is applied over log-spectral magnitude and is given by

(22)

where is a parameter that is set to .
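For orientation, the sketch below applies the classic RASTA band-pass filter from the literature along the time axis of a log-magnitude spectrogram. It is not necessarily the exact filter of Equation 22: the numerator coefficients 0.1 × [2, 1, 0, −1, −2], the causal implementation, and the pole value of 0.94 are assumptions of this example.

```python
import numpy as np

def rasta_filter(log_spec, pole=0.94):
    """Filter each frequency bin over time with a RASTA-style IIR filter.

    log_spec: (frames, bins) log-magnitude spectrogram.
    pole: feedback coefficient (assumed value, not taken from the text).
    """
    numer = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    n_frames, n_bins = log_spec.shape
    out = np.zeros((n_frames, n_bins))
    for t in range(n_frames):
        acc = np.zeros(n_bins)
        # FIR part: weighted sum of the current and four previous frames.
        for k, b in enumerate(numer):
            if t - k >= 0:
                acc += b * log_spec[t - k]
        # IIR part: leaky feedback of the previous output frame.
        if t > 0:
            acc += pole * out[t - 1]
        out[t] = acc
    return out

rng = np.random.default_rng(0)
log_spec = rng.standard_normal((400, 4))
offset = np.array([3.0, -1.0, 0.5, 2.0])  # hypothetical stationary channel (log domain)
filtered = rasta_filter(log_spec)
filtered_offset = rasta_filter(log_spec + offset)
```

Since the numerator coefficients sum to zero, a constant offset in each bin is suppressed once the filter's initial transient has decayed, which is how this kind of filter attenuates stationary channel effects.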

| Corpus | WSJ | VoxCeleb2 | LibriClean | LibriOther | LibriAll |
|---|---|---|---|---|---|
| # of speakers | 77 | 5994 | 921 | 1166 | 2087 |
| # of utterances | 6385 | 1092009 | 104014 | 148688 | 252702 |

TABLE II: Different corpus sizes used in this study.

| Epoch | 1 to 0.6 $N_{\max}$ | (0.6 $N_{\max}$ + 1) to 0.9 $N_{\max}$ | (0.9 $N_{\max}$ + 1) to $N_{\max}$ |
|---|---|---|---|
| Learning rate | 0.0002 | 0.0001 | 0.00005 |

TABLE III: Learning rate schedule. $N_{\max}$ denotes the maximum number of epochs of training.

V-C Training Corpus

We evaluate the following corpora to understand cross-corpus generalization behavior.

V-C1 WSJ

We use the WSJ0-SI-84 corpus as the baseline since this corpus has been used in the past to train speaker- and noise-independent models [5, 32, 22].

V-C2 VoxCeleb2

The VoxCeleb2 corpus is promising for cross-corpus generalization for the following reasons. First, it is very large, with around 1.1 million utterances of 5994 speakers (Table II). Second, it is extracted from YouTube; therefore, it has the potential of generalizing to different channels, as the videos uploaded to YouTube are usually recorded in different conditions and using different devices.

V-C3 LibriSpeech

LibriSpeech is a corpus derived from read audiobooks from the LibriVox project. It contains around 0.25 million utterances of 2.1k speakers. It is promising for cross-corpus generalization because the English utterances are spoken by different volunteers across the globe. This implies that the utterances recorded by different volunteers are typically over different channels.

We have evaluated three different versions of LibriSpeech: LibriClean, LibriOther, and LibriAll. LibriClean contains relatively clean utterances compared to LibriOther. LibriAll is the combination of both LibriClean and LibriOther. We list different corpora in terms of their size in Table II.

V-D Frame Shift

In short-time processing of speech, a frame shift equal to half of the frame size is typically used, and overlap-and-add is used during final reconstruction in the time domain. However, when the frame shift is smaller, there will be multiple predictions (more than two) of a single T-F unit from the neighboring frames. This leads to averaging the multiple predictions of a sample in the overlap-and-add stage. We find that the simple idea of using a smaller frame shift leads to a significant improvement in cross-corpus generalization. We fix the frame size to ms and evaluate frame shifts of {16 ms, 8 ms, 4 ms, 2 ms}.
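The averaging effect can be made concrete by counting how many frames cover each sample for a given frame size and shift. The sketch below uses sample counts rather than milliseconds; the frame length of 512 samples and the signal length are illustrative values, not the settings of this study.

```python
import numpy as np

def frames_per_sample(n_samples, frame_len, hop):
    """Count how many analysis frames cover each sample of a signal."""
    counts = np.zeros(n_samples, dtype=int)
    for start in range(0, n_samples - frame_len + 1, hop):
        counts[start:start + frame_len] += 1
    return counts

frame_len = 512  # illustrative frame size in samples
overlap = {}
for hop in (256, 128, 64, 32):
    counts = frames_per_sample(8192, frame_len, hop)
    # Ignore the first and last frame_len samples, where coverage ramps up or down.
    overlap[hop] = int(counts[frame_len:-frame_len].max())
```

Away from the signal edges, each sample is covered by `frame_len / hop` frames, so halving the shift doubles the number of frame-level predictions that overlap-and-add combines for that sample.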

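| Metric | Model | WSJ -5 dB | WSJ -2 dB | TIMIT -5 dB | TIMIT -2 dB | IEEE Male -5 dB | IEEE Male -2 dB | IEEE Female -5 dB | IEEE Female -2 dB |
|---|---|---|---|---|---|---|---|---|---|
| STOI (%) | Mixture | 58.6 | 65.5 | 54.0 | 60.9 | 55.0 | 62.3 | 55.5 | 62.9 |
| STOI (%) | SMS | 77.4 | 83.0 | 64.7 | 73.3 | 60.4 | 74.0 | 62.5 | 73.5 |
| STOI (%) | SMS_MOD | 78.3 | 83.5 | 65.7 | 74.3 | 64.8 | 75.1 | 63.8 | 75.2 |
| STOI (%) | LSMS | 78.6 | 83.6 | 68.4 | 76.4 | 64.4 | 76.6 | 66.0 | 76.7 |
| STOI (%) | RASTA | 78.0 | 83.4 | 66.2 | 74.8 | 62.6 | 75.6 | 64.5 | 75.2 |
| STOI (%) | Same Corpus | - | - | 70.2 | 77.0 | 73.3 | 78.5 | 71.4 | 78.4 |
| PESQ | Mixture | 1.54 | 1.69 | 1.46 | 1.63 | 1.46 | 1.63 | 1.12 | 1.32 |
| PESQ | SMS | 1.97 | 2.22 | 1.70 | 2.00 | 1.52 | 1.89 | 1.26 | 1.66 |
| PESQ | SMS_MOD | 2.00 | 2.23 | 1.73 | 2.04 | 1.63 | 1.92 | 1.31 | 1.74 |
| PESQ | LSMS | 2.02 | 2.25 | 1.82 | 2.12 | 1.64 | 2.00 | 1.39 | 1.81 |
| PESQ | RASTA | 2.01 | 2.24 | 1.77 | 2.06 | 1.59 | 1.94 | 1.33 | 1.75 |
| PESQ | Same Corpus | - | - | 1.90 | 2.15 | 1.87 | 2.10 | 1.64 | 1.93 |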

TABLE IV: STOI and PESQ comparisons between the baseline, modified loss, LSMS and RASTA on babble noise.
Fig. 6: STOI and PESQ comparisons between different training corpora with the frame shift of ms.

VI Experimental Settings

VI-A Data Preparation

We train corpus dependent models on WSJ, TIMIT, IEEE Male, and IEEE Female corpora. Corpus independent models are trained on WSJ, VoxCeleb2, LibriClean, LibriOther, and LibriAll. For training, we use all utterances of the TIMIT corpus and random utterances out of of IEEE Male and IEEE Female. All the clean utterances are resampled to kHz. For WSJ training utterances, we remove all the frames in the beginning and end that are not within dB of the maximum frame energy.

Noisy utterances are created during the training time by randomly adding noise segments to all the utterances in a batch. For training noises, we use non-speech sounds from a sound effect library (www.sound-ideas.com) as in [6]. For each utterance, we cut a random segment of seconds if the utterance is longer than seconds. A random noise segment is added to the utterance at a random SNR in { dB, dB, dB, dB, dB, dB}. For a corpus containing less than utterances, an epoch is defined as when the model has seen around utterances. This corresponds to , and noisy utterances per clean utterance in one epoch of IEEE, TIMIT, and WSJ respectively.

The WSJ test set consists of utterances of speakers not included in WSJ training. The TIMIT test set consists of 192 utterances from the core test set. The IEEE Male and IEEE Female test sets both consist of the clean utterances not included in their training sets. A test set is generated from different noises: babble, cafeteria, factory and engine, at the SNRs of { dB, dB, dB}. The babble and cafeteria noises are from Auditec CD (available at http://www.auditec.com). Factory and engine noises are from Noisex [35].

All noisy utterance samples are normalized to the range [, ] and corresponding clean utterances are scaled accordingly to maintain an SNR. The frame size of ms with the Hamming window is used for STFT.

VI-B Training Methodology

The models trained on TIMIT and IEEE use a dropout rate of 0.5 for each layer except for the output. The models are trained for 10 epochs on TIMIT and IEEE, 100 epochs on LibriSpeech, and 20 epochs on VoxCeleb2.

The Adam optimizer [17] is used with a learning rate schedule given in Table III. A batch size of 32 utterances is used. All the utterances that are shorter than the longest utterance in a batch are padded with zeros at the end. The loss values computed over the outputs corresponding to zero-padded inputs are ignored.
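The piecewise-constant schedule of Table III can be written as a small helper. This is a sketch under the assumption that epochs are counted from 1 and that the boundaries fall at 60% and 90% of the maximum number of epochs; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, max_epochs):
    """Learning rate for a 1-based epoch index, following Table III.

    Integer arithmetic is used for the 60% and 90% boundaries to avoid
    floating-point rounding at the edges.
    """
    if 10 * epoch <= 6 * max_epochs:
        return 0.0002
    if 10 * epoch <= 9 * max_epochs:
        return 0.0001
    return 0.00005

# Example: a 10-epoch run uses 0.0002 for epochs 1-6, 0.0001 for 7-9, 0.00005 for 10.
schedule = [learning_rate(epoch, 10) for epoch in range(1, 11)]
```

In a training loop, the returned value would be assigned to the optimizer's learning rate at the start of each epoch.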

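| Metric | Training corpus | WSJ -5 dB | WSJ -2 dB | TIMIT -5 dB | TIMIT -2 dB | IEEE Male -5 dB | IEEE Male -2 dB | IEEE Female -5 dB | IEEE Female -2 dB |
|---|---|---|---|---|---|---|---|---|---|
| STOI (%) | Mixture | 58.6 | 65.5 | 54.0 | 60.9 | 55.0 | 62.3 | 55.5 | 62.9 |
| STOI (%) | WSJ | 78.6 | 83.6 | 68.4 | 76.4 | 64.4 | 76.6 | 66.0 | 76.7 |
| STOI (%) | VoxCeleb2 | 76.0 | 82.2 | 68.1 | 76.2 | 68.0 | 77.5 | 67.5 | 77.8 |
| STOI (%) | LibriClean | 77.9 | 83.5 | 70.0 | 77.6 | 68.8 | 78.3 | 67.8 | 78.4 |
| STOI (%) | LibriOther | 78.4 | 83.8 | 70.5 | 78.1 | 69.8 | 78.5 | 69.4 | 79.1 |
| STOI (%) | LibriAll | 78.5 | 83.8 | 71.4 | 78.4 | 70.7 | 79.5 | 70.2 | 79.2 |
| STOI (%) | Same Corpus | - | - | 70.2 | 77.0 | 73.3 | 78.5 | 71.4 | 78.4 |
| PESQ | Mixture | 1.54 | 1.69 | 1.46 | 1.63 | 1.46 | 1.63 | 1.12 | 1.32 |
| PESQ | WSJ | 2.02 | 2.25 | 1.82 | 2.12 | 1.64 | 2.00 | 1.39 | 1.81 |
| PESQ | VoxCeleb2 | 1.99 | 2.22 | 1.87 | 2.15 | 1.79 | 2.09 | 1.53 | 1.89 |
| PESQ | LibriClean | 2.02 | 2.25 | 1.91 | 2.19 | 1.79 | 2.10 | 1.50 | 1.89 |
| PESQ | LibriOther | 2.04 | 2.27 | 1.92 | 2.20 | 1.83 | 2.13 | 1.56 | 1.93 |
| PESQ | LibriAll | 2.04 | 2.26 | 1.95 | 2.21 | 1.86 | 2.15 | 1.59 | 1.93 |
| PESQ | Same Corpus | - | - | 1.90 | 2.15 | 1.87 | 2.10 | 1.64 | 1.93 |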

TABLE V: STOI and PESQ comparisons between different training corpora with the frame shift of ms on babble noise.
Fig. 7: STOI and PESQ comparisons between different frame shifts.

VI-C Evaluation Metrics

In our experiments, models are evaluated using STOI [31] and PESQ [29], which represent the standard metrics for speech enhancement. STOI has a typical value range from 0 to 1, which can be roughly interpreted as percent correct. PESQ values range from -0.5 to 4.5.

VI-D Baseline

For the baseline, we train the BLSTM model on WSJ using the loss function given in Equation 8. STFT magnitude is used as the feature with the mean-subtraction normalization in Equation 21, but applied to STFT magnitude instead of log magnitude. We call this model SMS, standing for spectral mean subtraction (in Fig. 5 and Table IV).

VII Results and Discussions

First, we evaluate the modified loss function (Section V.A) and two channel normalization methods (Section V.B) and compare them with the baseline model. The models are trained on the WSJ corpus with a frame shift of ms. We denote the baseline with SMS and the model with modified loss as SMS_MOD. Average STOI and PESQ over all the four test noises and at SNRs of dB, dB, and dB are plotted in Fig. 5.

We observe that SMS_MOD is consistently better than SMS. The improvement is maximum at dB for all the corpora. The maximum improvement is observed for the IEEE Male corpus. The objective scores indicate that training a model using a loss over all the T-F units leads to overfitting on the corpus. Using a loss computed over only high energy T-F units can achieve better generalization. All the following models trained in this study, except for SMS, will use the modified loss function.

The objective scores for two normalization schemes suggest that LSMS and RASTA both are better than SMS and SMS_MOD for all untrained corpora. LSMS is consistently better than RASTA for all the corpora and at all SNR conditions.

We also provide comparisons for babble noise at SNRs of -5 dB and -2 dB in Table IV. The last row of STOI and PESQ, Same Corpus (trained corpus), gives the scores obtained by training a model on the same corpus as the test corpus. Note that the results on the trained corpora, TIMIT and IEEE, represent benchmarks where the number of unique training utterances is small. IEEE corpora have only training utterances and TIMIT has utterances in which many speakers speak the same set of sentences. A good model should be able to match the scores obtained using Same Corpus.

We observe a similar performance trend for babble noise. SMS_MOD improves STOI at dB by and for IEEE Male and Female, respectively. PESQ is improved by and . SMS_MOD is consistently better for all the test corpora. LSMS and RASTA are better than SMS_MOD, and LSMS is better than RASTA for all corpora.

Even though LSMS obtains better objective scores, it is not able to improve the scores for IEEE Male and IEEE Female to an extent comparable to Same Corpus. This suggests that traditional channel normalization approaches are helpful, but cannot obtain an adequate improvement on untrained corpora.

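| Metric | Frame shift | WSJ -5 dB | WSJ -2 dB | TIMIT -5 dB | TIMIT -2 dB | IEEE Male -5 dB | IEEE Male -2 dB | IEEE Female -5 dB | IEEE Female -2 dB |
|---|---|---|---|---|---|---|---|---|---|
| STOI (%) | Mixture | 58.6 | 65.5 | 54.0 | 60.9 | 55.0 | 62.3 | 55.5 | 62.9 |
| STOI (%) | 16 ms | 78.5 | 83.8 | 71.4 | 78.4 | 70.7 | 79.5 | 70.2 | 79.2 |
| STOI (%) | 8 ms | 81.6 | 86.4 | 73.7 | 81.1 | 73.7 | 82.1 | 73.8 | 82.9 |
| STOI (%) | 4 ms | 82.4 | 87.3 | 75.1 | 82.1 | 74.3 | 83.2 | 74.8 | 84.3 |
| STOI (%) | 2 ms | 82.7 | 87.4 | 75.6 | 82.1 | 75.3 | 83.7 | 75.2 | 84.1 |
| PESQ | Mixture | 1.54 | 1.69 | 1.46 | 1.63 | 1.46 | 1.63 | 1.12 | 1.32 |
| PESQ | 16 ms | 2.04 | 2.26 | 1.95 | 2.21 | 1.86 | 2.15 | 1.59 | 1.93 |
| PESQ | 8 ms | 2.31 | 2.56 | 2.13 | 2.44 | 2.05 | 2.37 | 1.85 | 2.25 |
| PESQ | 4 ms | 2.43 | 2.70 | 2.20 | 2.52 | 2.11 | 2.47 | 1.94 | 2.41 |
| PESQ | 2 ms | 2.46 | 2.73 | 2.23 | 2.55 | 2.15 | 2.50 | 1.97 | 2.43 |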

TABLE VI: STOI and PESQ comparisons between different frame shifts on babble noise.
Fig. 8: STOI and PESQ comparisons between different training corpora with the frame shift of ms.

Next, we examine different training corpora on test noises. The models are trained using LSMS with a frame shift of ms. The average STOI and PESQ over four test noises are plotted in Fig. 6. A general trend for STOI and PESQ scores are , except for TIMIT where VoxCeleb2 is worse than WSJ.

The performance for babble noise at SNRs of dB and dB is given in Table V. LibriAll is the best among all corpora. It obtains similar or better scores compared to Same Corpus except for IEEE Male and IEEE Female at dB, where STOI is worse by and respectively.

A key observation from the corpora comparisons is that the content of a corpus, not its size, is what matters for better generalization. A corpus with multiple possible channel sources, LibriAll, is very effective for generalization. However, a similar corpus, VoxCeleb2, containing times more utterances is not as effective. This observation is further supported by the fact that no dramatic performance differences exist between LibriClean ( utterances), LibriOther ( utterances) and LibriAll ( utterances), all of which contain utterances from the LibriSpeech corpus.

Metric | Model | WSJ -5 dB | WSJ -2 dB | TIMIT -5 dB | TIMIT -2 dB | IEEE Male -5 dB | IEEE Male -2 dB | IEEE Female -5 dB | IEEE Female -2 dB
STOI (%) | Mixture | 58.6 | 65.5 | 54.0 | 60.9 | 55.0 | 62.3 | 55.5 | 62.9
STOI (%) | Baseline | 77.4 | 83.0 | 64.7 | 73.3 | 60.4 | 74.0 | 62.5 | 73.5
STOI (%) | + Modified loss | 78.3 | 83.5 | 65.7 | 74.3 | 64.8 | 75.1 | 63.8 | 75.2
STOI (%) | + LSMS | 78.6 | 83.6 | 68.4 | 76.4 | 64.4 | 76.6 | 66.0 | 76.7
STOI (%) | + frame shift 4 ms | 82.8 | 87.5 | 71.9 | 79.9 | 66.2 | 80.8 | 69.5 | 81.1
STOI (%) | + LibriAll | 82.4 | 87.3 | 75.1 | 82.1 | 74.3 | 83.2 | 74.8 | 84.3
STOI (%) | Same Corpus | - | - | 73.5 | 80.7 | 77.9 | 82.6 | 75.9 | 83.2
PESQ | Mixture | 1.54 | 1.69 | 1.46 | 1.63 | 1.46 | 1.63 | 1.12 | 1.32
PESQ | Baseline | 1.97 | 2.22 | 1.70 | 2.00 | 1.52 | 1.89 | 1.26 | 1.66
PESQ | + Modified loss | 2.00 | 2.23 | 1.73 | 2.04 | 1.63 | 1.92 | 1.31 | 1.74
PESQ | + LSMS | 2.02 | 2.25 | 1.82 | 2.12 | 1.64 | 2.00 | 1.39 | 1.81
PESQ | + frame shift 4 ms | 2.45 | 2.72 | 2.09 | 2.43 | 1.8 | 2.33 | 1.67 | 2.22
PESQ | + LibriAll | 2.43 | 2.70 | 2.20 | 2.52 | 2.11 | 2.47 | 1.94 | 2.41
PESQ | Same Corpus | - | - | 2.12 | 2.42 | 2.14 | 2.38 | 2.03 | 2.40

TABLE VII: Performance improvements on babble noise using different techniques proposed in this study.

Perhaps surprisingly, VoxCeleb2 is not able to obtain good generalization. This might be due to the types of utterances in VoxCeleb2. Most of the utterances include some sort of reverberation, cross-talk or background noise. Hence, it may not be very suitable to be employed for the enhancement of utterances from clean corpora. More research is needed to explain the cross-corpus generalization behavior of VoxCeleb2.

Further, we compare models trained with different frame shifts. We compare frame shifts from {16 ms, 8 ms, 4 ms, 2 ms}. All the models are trained on LibriAll using LSMS with a frame size of 32 ms. Average STOI and PESQ scores are plotted in Fig. 7, and comparisons for babble noise are given in Table VI. We can observe a clear improvement in the objective scores when moving from 16 ms to 8 ms, and from 8 ms to 4 ms. However, the performances for 4 ms and 2 ms are very similar, suggesting a diminishing return from reducing the frame shift further. Note that similar performance improvements are obtained using all the training corpora, suggesting that a small frame shift is an effective technique applicable to all training corpora. The performance is also improved on the trained corpus, WSJ in this case, when trained using smaller frame shifts. This is an important observation because an improvement on the trained corpus does not necessarily result in an improvement on untrained corpora, as we have reported in Table I.
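As a rough illustration of what the frame shift changes, the following Python sketch frames a one-second signal with a 32 ms frame and each of the four frame shifts listed in Table VI, then averages the Table VI STOI scores at -5 dB over the four test corpora. The 16 kHz sampling rate and the helper name `frame_signal` are assumptions made for this example; the code is not the STFT layer used in the models.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sampling rate in Hz (not stated in this section)
FRAME_MS = 32        # frame size shared by all models in this comparison


def frame_signal(x, frame_ms, shift_ms, sample_rate=SAMPLE_RATE):
    """Split a 1-D signal into overlapping frames (no padding, no window)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


# One second of synthetic noise stands in for a speech utterance.
x = np.random.default_rng(0).standard_normal(SAMPLE_RATE)

for shift_ms in (16, 8, 4, 2):
    frames = frame_signal(x, FRAME_MS, shift_ms)
    overlap = 1 - shift_ms / FRAME_MS
    print(f"shift {shift_ms:2d} ms: {frames.shape[0]:4d} frames, "
          f"{overlap:.1%} overlap between adjacent frames")

# STOI (%) at -5 dB on babble noise, copied from Table VI.
# Column order: WSJ, TIMIT, IEEE Male, IEEE Female.
STOI_M5 = {
    16: (78.5, 71.4, 70.7, 70.2),
    8: (81.6, 73.7, 73.7, 73.8),
    4: (82.4, 75.1, 74.3, 74.8),
    2: (82.7, 75.6, 75.3, 75.2),
}

mean_stoi = {shift: float(np.mean(v)) for shift, v in STOI_M5.items()}
shifts = sorted(mean_stoi, reverse=True)  # [16, 8, 4, 2]

# Average STOI gain obtained by each halving of the frame shift.
gains = [mean_stoi[b] - mean_stoi[a] for a, b in zip(shifts, shifts[1:])]
for (a, b), g in zip(zip(shifts, shifts[1:]), gains):
    print(f"{a:2d} ms -> {b} ms: average STOI gain {g:.2f} points")
```

Under these assumptions, each halving of the frame shift roughly doubles the number of frames the model has to process, while the averaged STOI gain shrinks with each halving, which is consistent with the diminishing effect noted above.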

We also compare all the training corpora using a smaller frame shift of ms in Fig. 8. We obtain the same performance trend as using the frame shift of ms. This implies that using smaller frame shift and better training corpora are two independent techniques for improving cross-corpus generalization.

Finally, we report results on babble noise when different techniques to improve channel generalization are incrementally incorporated into the baseline model. The results are given in Table VII. We observe that the most effective approach is the use of LibriAll that improves STOI at dB by on TIMIT, on IEEE Male, and on IEEE Female while obtaining similar performance on WSJ as to that obtained by training on WSJ. Similarly, smaller frame shift is also very effective as it improves STOI at dB by on TIMIT, on IEEE Male, and on IEEE Female.
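The contribution of each incremental step can be recomputed directly from Table VII. The sketch below does this for the STOI scores at -5 dB; the values are copied from the table, while the helper name `incremental_gains` and the choice of the -5 dB columns are assumptions made for this illustration.

```python
CORPORA = ("WSJ", "TIMIT", "IEEE Male", "IEEE Female")

# STOI (%) at -5 dB on babble noise, copied from Table VII.
# Each tuple follows the order given in CORPORA.
STOI_M5 = {
    "Baseline": (77.4, 64.7, 60.4, 62.5),
    "+ Modified loss": (78.3, 65.7, 64.8, 63.8),
    "+ LSMS": (78.6, 68.4, 64.4, 66.0),
    "+ frame shift 4 ms": (82.8, 71.9, 66.2, 69.5),
    "+ LibriAll": (82.4, 75.1, 74.3, 74.8),
}

# Same Corpus reference scores from Table VII (no WSJ entry is reported).
SAME_CORPUS = {"TIMIT": 73.5, "IEEE Male": 77.9, "IEEE Female": 75.9}


def incremental_gains(table):
    """Return the per-corpus change introduced by each successive stage."""
    stages = list(table)  # dicts preserve insertion order
    return {
        cur: tuple(round(c - p, 1) for p, c in zip(table[prev], table[cur]))
        for prev, cur in zip(stages, stages[1:])
    }


gains = incremental_gains(STOI_M5)
for stage, deltas in gains.items():
    row = ", ".join(f"{c}: {d:+.1f}" for c, d in zip(CORPORA, deltas))
    print(f"{stage:<20s} {row}")

# Remaining gap between the full system and a model trained on the test corpus.
final = dict(zip(CORPORA, STOI_M5["+ LibriAll"]))
gap_to_same_corpus = {c: round(final[c] - SAME_CORPUS[c], 1) for c in SAME_CORPUS}
print(gap_to_same_corpus)
```

On these columns, the LibriAll and frame shift steps account for the largest changes on the untrained corpora, while the final system still trails Same Corpus on IEEE Male and IEEE Female at -5 dB.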

VIII Concluding Remarks

This work reveals a robustness problem with deep learning based speech enhancement algorithms. We have shown that a model trained on a given corpus fails to generalize to utterances from an untrained corpus. The problem is more severe at low SNR levels, where speech enhancement is actually more needed. We have established that the cross-corpus generalization issue is mainly due to the channel mismatch between a trained and an untrained corpus.

We have examined traditional channel normalization methods and found that they improve performance on untrained corpora, but the improvement is limited; hence, other techniques need to be developed to further improve generalization.

We have proposed two effective methods to significantly improve cross-corpus generalization. The first technique is to use a corpus obtained using crowd-sourced audio recordings such as LibriSpeech and VoxCeleb. We found LibriSpeech to be significantly better than VoxCeleb. The second technique is the use of a smaller frame shift in STFT and ISTFT layers.

Further research is needed to evaluate the effectiveness of LibriSpeech and smaller frame shift for complex-domain and time-domain speech enhancement models. The behavior of VoxCeleb, which is found to be not very effective for generalization, needs to be further explored for a better understanding of cross-corpus generalization.

References

  • [1] B. S. Atal (1976) Automatic recognition of speakers from their voices. Proceedings of the IEEE 64 (4), pp. 460–475. Cited by: §I.
  • [2] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, et al. (2007) Automatic speech recognition and speech variability: a review. Speech Communication 49 (10-11), pp. 763–786. Cited by: §I.
  • [3] S. Boll (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (2), pp. 113–120. Cited by: §I.
  • [4] D. Byrne, H. Dillon, K. Tran, S. Arlinger, K. Wilbraham, R. Cox, B. Hagerman, R. Hetu, J. Kei, C. Lui, et al. (1994) An international comparison of long-term average speech spectra. The Journal of the Acoustical Society of America 96 (4), pp. 2108–2120. Cited by: §III.
  • [5] J. Chen and D. L. Wang (2017) Long short-term memory for speaker generalization in supervised speech separation. The Journal of the Acoustical Society of America 141 (6), pp. 4705–4714. Cited by: §I, §I, §V-C1.
  • [6] J. Chen, Y. Wang, S. E. Yoho, D. L. Wang, and E. W. Healy (2016) Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises. The Journal of the Acoustical Society of America 139 (5), pp. 2604–2612. Cited by: §I, §VI-A.
  • [7] H. Choi, J. Kim, J. Huh, A. Kim, J. Ha, and K. Lee (2019) Phase-aware speech enhancement with deep complex U-Net. arXiv preprint arXiv:1903.03107. Cited by: §I.
  • [8] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In INTERSPEECH, Cited by: §I.
  • [9] S. Fu, T. Hu, Y. Tsao, and X. Lu (2017) Complex spectrogram enhancement by convolutional neural network with multi-metrics learning. In International Workshop on Machine Learning for Signal Processing, pp. 1–6. Cited by: §I.
  • [10] S. Fu, Y. Tsao, X. Lu, and H. Kawai (2017) Raw waveform-based speech enhancement by fully convolutional networks. arXiv preprint arXiv:1703.02205. Cited by: §I.
  • [11] S. Fu, Y. Tsao, and X. Lu (2016) SNR-Aware convolutional neural network modeling for speech enhancement. In INTERSPEECH, pp. 3768–3772. Cited by: §I.
  • [12] S. Furui (1981) Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech, and Signal Processing 29 (2), pp. 254–272. Cited by: §I.
  • [13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett (1993) DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. nist speech disc 1-1.1. NASA STI/Recon technical report n 93. Cited by: §III.
  • [14] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn (1991) Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP). In European Conference on Speech Communication and Technology, Cited by: §I.
  • [15] H. Hermansky and N. Morgan (1994) RASTA processing of speech. IEEE Transactions on Speech and Audio Processing 2 (4), pp. 578–589. Cited by: §I.
  • [16] IEEE (1969) IEEE recommended practice for speech quality measurements. IEEE Transactions on Audio and Electroacoustics 17, pp. 225–246. Cited by: §I, §III.
  • [17] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §VI-B.
  • [18] P. C. Loizou (2013) Speech enhancement: theory and practice. 2nd edition, CRC Press, Boca Raton, FL, USA. External Links: ISBN 1466504218, 9781466504219 Cited by: §I.
  • [19] N. Mohammadiha, P. Smaragdis, and A. Leijon (2013) Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2140–2151. Cited by: §I.
  • [20] H. Murveit, J. Butzberger, and M. Weintraub (1992) Reduced channel dependence for speech recognition. In Workshop on Speech and Natural Language, pp. 280–284. Cited by: §V-B2.
  • [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, pp. 5206–5210. Cited by: §I.
  • [22] A. Pandey and D. Wang (2019) A new framework for CNN-based speech enhancement in the time domain. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 27 (7), pp. 1179–1188. Cited by: §I, §I, TABLE I, §IV, §V-C1.
  • [23] A. Pandey and D. Wang (2019) Exploring deep complex networks for complex spectrogram enhancement. In ICASSP, pp. 6885–6889. Cited by: §I.
  • [24] A. Pandey and D. Wang (2019) TCNN: temporal convolutional neural network for real-time speech enhancement in the time domain. In ICASSP, pp. 6875–6879. Cited by: §I, TABLE I, §IV.
  • [25] S. Pascual, A. Bonafonte, and J. Serrà (2017) SEGAN: speech enhancement generative adversarial network. In INTERSPEECH, pp. 3642–3646. External Links: Document Cited by: §I.
  • [26] D. B. Paul and J. M. Baker (1992) The design for the wall street journal-based CSR corpus. In Workshop on Speech and Natural Language, pp. 357–362. Cited by: §I.
  • [27] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson (2017) Speech enhancement using bayesian wavenet. In INTERSPEECH, pp. 2013–2017. Cited by: §I.
  • [28] D. Rethage, J. Pons, and X. Serra (2018) A wavenet for speech denoising. In ICASSP, pp. 5069–5073. Cited by: §I.
  • [29] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001) Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In ICASSP, pp. 749–752. Cited by: §I, §VI-C.
  • [30] P. Scalart et al. (1996) Speech enhancement based on a priori signal to noise estimation. In ICASSP, Vol. 2, pp. 629–632. Cited by: §I.
  • [31] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen (2011) An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19 (7), pp. 2125–2136. Cited by: §I, §VI-C.
  • [32] K. Tan, J. Chen, and D. Wang (2018) Gated residual networks with dilated convolutions for monaural speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (1), pp. 189–198. Cited by: §I, §I, §V-C1.
  • [33] K. Tan and D. Wang (2019) Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement. In ICASSP, pp. 6865–6869. Cited by: TABLE I, §IV.
  • [34] K. Tan and D. Wang (2019) Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 380–390. Cited by: §I.
  • [35] A. Varga and H. J. Steeneken (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12 (3), pp. 247–251. Cited by: §VI-A.
  • [36] D. L. Wang and J. Chen (2018) Supervised speech separation based on deep learning: an overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, pp. 1702–1726. Cited by: §I.
  • [37] Y. Wang, A. Narayanan, and D. L. Wang (2014) On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing 22 (12), pp. 1849–1858. Cited by: §II-C.
  • [38] Y. Wang and D. L. Wang (2013) Towards scaling up classification-based speech separation. IEEE Transactions on Audio, Speech, and Language Processing 21 (7), pp. 1381–1390. Cited by: §I.
  • [39] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pp. 91–99. Cited by: §I.
  • [40] D. S. Williamson, Y. Wang, and D. L. Wang (2016) Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing 24 (3), pp. 483–492. Cited by: §I.
  • [41] Y. Xu, J. Du, L. Dai, and C. Lee (2015) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing 23 (1), pp. 7–19. Cited by: §I, §II-C.