Joint Separation and Denoising of Noisy Multi-talker Speech using Recurrent Neural Networks and Permutation Invariant Training

08/31/2017 ∙ by Morten Kolbæk, et al. ∙ Aalborg University ∙ Tencent

In this paper we propose to use utterance-level Permutation Invariant Training (uPIT) for speaker independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) using uPIT, for single-channel speaker independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life noise signals. We focus our experiments on generalizability and noise robustness of models that rely on various types of a priori knowledge e.g. in terms of noise type and number of simultaneous speakers. We show that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can improve the Signal-to-Distortion Ratio (SDR) as well as the Extended Short-Time Objective Intelligibility (ESTOI) measure, on the speaker independent multi-talker speech separation and denoising task, for various noise types and Signal-to-Noise Ratios (SNRs). Specifically, we first show that LSTM RNNs can achieve large SDR and ESTOI improvements, when evaluated using known noise types, and that a single model is capable of handling multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures, without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to noise types not seen during training.


1 Introduction

Focusing one's auditory attention towards a single speaker in a complex acoustic environment with multiple speakers and noise sources is a task that humans are extremely good at [1]. However, achieving similar performance with machines has so far not been possible [2], although it would be highly desirable for a vast range of applications, such as mobile communications, robotics, hearing aids, speaker verification systems, etc.

Traditionally, speech denoising [3, 4, 5, 6, 7, 8] and multi-talker speech separation [9, 10, 11, 12, 13, 14, 15] have been considered as two separate tasks in the literature, although for many applications both speech separation and denoising are desired. For example, in a human-machine interface the machine must be able to identify what is being said, and by whom, before it can decide which signal to focus on, and consequently respond and act upon it.

The recent success of Deep Learning [16] has revolutionized a large number of scientific fields and is currently achieving state-of-the-art results on topics ranging from medical diagnosis [17, 18] to Automatic Speech Recognition (ASR) [19, 20]. The area of single-channel speech enhancement has also seen improvements, with deep learning algorithms reported to improve speech intelligibility for normal-hearing, hearing-impaired, and cochlear implant users [21, 8, 22, 23]. Speaker independent multi-talker speech separation, on the other hand, has so far not taken a similar leap forward, partly due to the long-standing label permutation problem (further described in Section 3), which has prevented progress on deep learning based techniques for this task.

Recently, two technical directions have been proposed for speaker independent multi-talker speech separation: a clustering based approach [11, 12, 13] and a regression based approach [10, 24]. The clustering based approaches include the Deep Clustering (DPCL) techniques [11, 12] and the DANet technique [13]. The regression based approaches include the Permutation Invariant Training (PIT) technique [10] and the utterance-level PIT (uPIT) technique [24]. The general idea behind the DPCL and DANet techniques is that the mixture signal can be represented in an embedding space, e.g. using Recurrent Neural Networks (RNNs), where the different source signals in the mixture form clusters. These clusters are then identified using a clustering technique, such as K-means. The clustering based techniques have shown impressive performance on two-speaker and three-speaker mixtures. The regression based PIT and uPIT techniques, which are described in detail in Section 3, utilize a cost function that jointly optimizes the label assignment and the regression error end-to-end, hence effectively solving the label permutation problem.

Both the clustering based and the regression based methods [11, 12, 13, 10, 24, 23] focus on ideal, noise-free training and testing conditions, i.e. situations where the mixtures contain clean speech only. For any practical application, background noise, e.g. due to interfering sound sources or non-ideal microphones, must be expected. However, it is not yet known how these techniques perform when tested in noisy conditions that reflect a realistic usage scenario.

In this paper we apply the recently proposed uPIT technique [24] for speaker independent multi-talker speech separation and denoising, simultaneously. Specifically, we train deep bi-directional Long Short-Term Memory (LSTM) RNNs using uPIT for speaker independent multi-talker speech separation in multiple noisy conditions, including both synthetic and real-life, known and unknown, noise signals at various Signal-to-Noise Ratios (SNRs).

To the authors' knowledge, this is the first attempt to perform speech separation and denoising simultaneously in a deep learning framework; hence, no competing baseline has been identified for this particular task.

2 Source Separation using Deep Learning

The goal of single-channel speech separation is to separate a mixture of multiple speakers into the individual speakers using a single microphone recording. Similarly, single-channel speech denoising aims to extract a single target speech signal from a noisy single channel recording.

Let $x_s[n]$, $s = 1, \dots, S$, be the time domain source signal of length $N$ from source $s$, and let the observed mixture signal be defined as

$y[n] = \sum_{s=1}^{S} x_s[n]$,    (1)

where $x_1[n]$ is a speech signal and $x_s[n]$, $s = 2, \dots, S$, can be either speech or additive noise signals. Furthermore, let $X_s(i,k)$ and $Y(i,k)$, $k = 1, \dots, K$, be the $K$-point Short-Time discrete Fourier Transforms (STFT) of $x_s[n]$ and $y[n]$, respectively. Also, let $\mathbf{X}_s(i)$ and $\mathbf{Y}(i)$ denote the single-sided STFT spectrum, at frame $i$, for source $s$ and the mixture signal, respectively.

We define the magnitudes of the source signals and the mixture signal as $A_s(i,k) = |X_s(i,k)|$ and $R(i,k) = |Y(i,k)|$, respectively, and their corresponding single-sided magnitude spectra as $\mathbf{A}_s(i)$ and $\mathbf{R}(i)$. For separating the mixture magnitude $\mathbf{R}(i)$ into estimated target signal magnitudes $\hat{\mathbf{A}}_s(i)$, $s = 1, \dots, S$, we adopt the approach from [24] and estimate a set of masks $\hat{\mathbf{M}}_s(i)$, $s = 1, \dots, S$, using bi-directional LSTM RNNs. Let $\mathbf{M}_s(i)$ be the ideal mask (to be defined in Sec. 2.1) for speaker $s$ at frame $i$. The ideal masks $\mathbf{M}_s(i)$, $s = 1, \dots, S$, are then used to extract the target signal magnitudes as $\mathbf{A}_s(i) = \mathbf{M}_s(i) \circ \mathbf{R}(i)$, $s = 1, \dots, S$, where $\circ$ is the element-wise product, i.e. the Hadamard product. Similarly, when the masks are estimated by a deep learning model, we arrive at the estimated signal magnitudes $\hat{\mathbf{A}}_s(i) = \hat{\mathbf{M}}_s(i) \circ \mathbf{R}(i)$, $s = 1, \dots, S$. The overlap-and-add technique and the inverse discrete Fourier transform, using the phase of the mixture signal, are used for reconstructing $\hat{x}_s[n]$, $s = 1, \dots, S$, in the time domain.
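To make the masking-and-reconstruction pipeline above concrete, the following is a minimal NumPy/SciPy sketch, assuming the 8 kHz, 256-point STFT configuration described later in Sec. 4.1; the mask_estimator argument is a placeholder for the bi-directional LSTM mask estimator and is not part of the original paper.

```python
import numpy as np
from scipy.signal import stft, istft

FS, NFFT, SHIFT = 8000, 256, 128   # 8 kHz, 32 ms Hann window, 16 ms shift (Sec. 4.1)

def separate(mixture, mask_estimator, num_sources=3):
    """Apply estimated masks to the mixture magnitude and reconstruct each
    source with the noisy mixture phase via inverse STFT / overlap-add."""
    _, _, Y = stft(mixture, fs=FS, window='hann',
                   nperseg=NFFT, noverlap=NFFT - SHIFT)
    R, phase = np.abs(Y), np.angle(Y)          # 129 x T magnitude and phase

    # mask_estimator is assumed to return one (129 x T) mask per source.
    masks = mask_estimator(R)                  # shape: (num_sources, 129, T)

    estimates = []
    for s in range(num_sources):
        A_hat = masks[s] * R                   # Hadamard product M_s(i) o R(i)
        _, x_hat = istft(A_hat * np.exp(1j * phase), fs=FS, window='hann',
                         nperseg=NFFT, noverlap=NFFT - SHIFT)
        estimates.append(x_hat)
    return estimates
```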

2.1 Mask Estimation and Loss functions

A large number of training targets and loss functions have been proposed for masking based source separation [7, 25, 23]. Since one reasonable goal is an accurate reconstruction of the target signal, a loss function based on the reconstruction error, rather than the mask estimation error, is preferable [23].

In [24], different such loss functions were investigated for speaker independent multi-talker speech separation, and the best performing one was found to be the Phase-Sensitive Approximation (PSA) loss function [7], which for frame $i$ is given as

$J_{\text{PSA}}(i) = \sum_{s=1}^{S} \left\| \hat{\mathbf{M}}_s(i) \circ \mathbf{R}(i) - \mathbf{A}_s(i) \circ \cos\big(\boldsymbol{\phi}(i) - \boldsymbol{\phi}_s(i)\big) \right\|_2^2$,    (2)

where $\boldsymbol{\phi}(i) - \boldsymbol{\phi}_s(i)$ is the element-wise phase difference between the mixture and source $s$, and $\|\cdot\|_2$ is the $\ell_2$-norm.

In contrast to the classical squared error loss function, i.e. Eq. (2) without the cosine term, the PSA loss function accounts for some of the errors introduced by the noisy phase used in the reconstruction. When the PSA loss function is used for mask estimation, the actual mask estimated is the Ideal Phase-Sensitive Filter (IPSF) [7], which due to the phase correction property is preferable over other commonly used masks such as the Ideal Ratio Mask, or the Ideal Amplitude Mask [23].
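As an illustration, a small NumPy sketch of the frame-wise PSA error in Eq. (2) for a fixed output-target pairing is given below; variable names mirror the notation of Sec. 2 and are not taken from any released implementation.

```python
import numpy as np

def psa_loss(est_masks, R, A, phase_mix, phase_src):
    """Phase-Sensitive Approximation error of Eq. (2) for one frame:
    sum over sources of || M_hat_s o R - A_s o cos(phase_mix - phase_src_s) ||^2.
    est_masks, A, phase_src: (S, F) arrays; R, phase_mix: (F,) arrays."""
    loss = 0.0
    for s in range(est_masks.shape[0]):
        target = A[s] * np.cos(phase_mix - phase_src[s])  # phase-sensitive target
        loss += np.sum((est_masks[s] * R - target) ** 2)
    return loss
```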

3 Permutation Invariant Training

Permutation Invariant Training (PIT) is a generalization of the traditional approach for training Deep Neural Networks (DNNs) for regression based source separation problems, such as speaker separation or denoising.

For training a DNN based source separation model with $S$ output masks, $\hat{\mathbf{M}}_s(i)$, $s = 1, \dots, S$, a Mean Squared Error (MSE) criterion is typically used, computed between the true source magnitudes $\mathbf{A}_s(i)$ and the estimated magnitudes $\hat{\mathbf{A}}_s(i)$, $s = 1, \dots, S$. However, with multiple outputs, it is not trivial to pair the outputs with the correct targets. The commonly used approach for pairing a given output with a certain target is to arrange the targets in an ordered list, such that output one is always paired with, e.g., target one, output two with target two, etc.

For tasks such as speech denoising with a single speaker in noise, or speech separation of known speakers [15], simply predefining the ordering of the targets works well: the DNN learns to correctly separate the sources and provides each source at the output corresponding to its target. However, for mixtures containing similar signals, such as unknown equal-energy male speakers, this standard training approach fails to converge [10, 14, 11]. Empirically, it is found that DNNs are likely to change permutation from one frame to another for highly similar sources. Hence, predefining the ordering of the targets might not be the optimal solution, and is clearly a bad solution for certain types of signals. This phenomenon, and the challenge of choosing the output-target permutation during training, is commonly known as the label permutation or ambiguity problem [14, 11, 13, 24].

In [10] a solution to the label permutation problem was proposed, where targets are provided as a set instead of an ordered list, and the output-target permutation $\theta_i$, for a given frame $i$, is defined as the permutation that minimizes the cost function in question (e.g. the squared error) over the set $\mathcal{P}$ of all possible permutations. Following this approach, combined with the PSA loss function, a permutation invariant training criterion and corresponding error $J_{\text{PIT}}(i)$, for the $i$-th frame, can be formulated as

$J_{\text{PIT}}(i) = \min_{\theta_i \in \mathcal{P}} \sum_{s=1}^{S} \left\| \hat{\mathbf{M}}_s(i) \circ \mathbf{R}(i) - \mathbf{A}_{\theta_i(s)}(i) \circ \cos\big(\boldsymbol{\phi}(i) - \boldsymbol{\phi}_{\theta_i(s)}(i)\big) \right\|_2^2$.    (3)

As shown in [10], Eq. (3) effectively solves the label permutation problem. However, since PIT as defined in Eq. (3) operates on frames, the DNN only learns to separate the input mixtures into sources at the frame level, and not at the utterance level. In practice, this means that the mixture might be correctly separated, but the frames belonging to a particular speaker are not assigned the same output index throughout the utterance; without exact knowledge about the speaker-output permutation, it is very difficult to correctly reconstruct the separated sources. In order to have the sources separated at the utterance level, so that all frames from a particular output belong to the same source, additional speaker tracing or very large input-output contexts are needed [10].
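A brute-force sketch of the frame-level PIT criterion of Eq. (3) is shown below; it follows the same PSA-style error as the sketch above and simply enumerates all S! permutations per frame, which is feasible for the two- and three-speaker cases considered here.

```python
import numpy as np
from itertools import permutations

def pit_frame_loss(est_masks, R, A, phase_mix, phase_src):
    """Frame-level PIT (Eq. (3)): for every frame, evaluate the PSA error under
    all output-target permutations and keep the smallest one.
    est_masks, A, phase_src: (S, F, T); R, phase_mix: (F, T)."""
    S, _, T = A.shape
    total = 0.0
    for i in range(T):                         # independent permutation per frame
        errors = []
        for perm in permutations(range(S)):
            err = sum(np.sum((est_masks[s, :, i] * R[:, i]
                              - A[p, :, i] * np.cos(phase_mix[:, i]
                                                    - phase_src[p, :, i])) ** 2)
                      for s, p in enumerate(perm))
            errors.append(err)
        total += min(errors)                   # best permutation for this frame
    return total
```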

3.1 Utterance-level Permutation Invariant Training

In [24] an extension to PIT, known as utterance-level PIT (uPIT), was proposed for solving the speaker-output permutation problem. In uPIT, the output-target permutation is given as the permutation that gives the minimum squared error over all possible permutations for the entire utterance, instead of only a single frame. Formally, the utterance-level permutation used for training is found as

$\theta^{\ast} = \underset{\theta \in \mathcal{P}}{\operatorname{argmin}} \sum_{s=1}^{S} \sum_{i=1}^{T} \left\| \hat{\mathbf{M}}_s(i) \circ \mathbf{R}(i) - \mathbf{A}_{\theta(s)}(i) \circ \cos\big(\boldsymbol{\phi}(i) - \boldsymbol{\phi}_{\theta(s)}(i)\big) \right\|_2^2$,    (4)

where $T$ is the number of frames in the utterance. The permutation $\theta^{\ast}$ is then used for all frames within the current utterance; hence, the utterance-level loss for the $i$-th frame in a given utterance is defined as

$J_{\text{uPIT}}(i) = \sum_{s=1}^{S} \left\| \hat{\mathbf{M}}_s(i) \circ \mathbf{R}(i) - \mathbf{A}_{\theta^{\ast}(s)}(i) \circ \cos\big(\boldsymbol{\phi}(i) - \boldsymbol{\phi}_{\theta^{\ast}(s)}(i)\big) \right\|_2^2$.    (5)

Using the same permutation for all frames of an utterance has the consequence that the smallest per-frame error will not always be used for training, as with the original PIT. Instead, the smallest per-utterance error is used, which forces the estimated sources to stay at the same DNN outputs for the entire utterance. Ideally, this means that each DNN output contains a single source. Finally, since the whole utterance is needed for computing the utterance-level permutation in Eq. (4), RNNs are a natural choice of DNN model for this loss function.
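For comparison with the frame-level version above, a sketch of the utterance-level criterion of Eqs. (4)-(5) follows; a single permutation is chosen from the error accumulated over all frames and then applied to the whole utterance. Array shapes follow the previous sketches and are illustrative only.

```python
import numpy as np
from itertools import permutations

def upit_loss(est_masks, R, A, phase_mix, phase_src):
    """Utterance-level PIT (Eqs. (4)-(5)): pick the one permutation that
    minimizes the PSA error summed over ALL frames, then use it everywhere."""
    S = A.shape[0]

    def pair_error(s, p):                      # PSA error of output s vs. target p,
        return np.sum((est_masks[s] * R        # accumulated over the whole utterance
                       - A[p] * np.cos(phase_mix - phase_src[p])) ** 2)

    errors = np.array([[pair_error(s, p) for p in range(S)] for s in range(S)])
    best = min(permutations(range(S)),
               key=lambda perm: sum(errors[s, p] for s, p in enumerate(perm)))
    return sum(errors[s, p] for s, p in enumerate(best)), best
```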

4 Experimental Design

To study the noise robustness of the uPIT technique, we have conducted several experiments with noise corrupted mixtures of multiple speakers. Since uPIT uses the noise-free source signals as training targets, a denoising capability is already present in the uPIT framework. By simply adding noise to the multi-speaker input mixture, a model trained with uPIT will not only learn to separate the sources but also to remove the noise.

4.1 Noise-free Multi-talker Speech Mixtures

We have used the noise-free two-speaker (WSJ0-2mix) and three-speaker (WSJ0-3mix) mixture datasets (available at http://www.merl.com/demos/deep-clustering) for all experiments conducted in this paper. These datasets have been used in [11, 24, 10, 12], which allows us to relate the performance of uPIT in noisy conditions to its performance in noise-free conditions. The feature representation is based on 129-dimensional STFT magnitude spectra, extracted from a 256-point STFT using a sampling frequency of 8 kHz, a 32 ms Hann window, and a 16 ms frame shift.

The WSJ0-2mix dataset was derived from the WSJ0 corpus [26]. The WSJ0-2mix training set and validation set contain two-speaker mixtures generated by randomly selecting pairs of utterances from 49 male and 51 female speakers from the WSJ0 training set entitled si_tr_s. The two utterances are then mixed with a difference in active speech level [27] uniformly chosen between 0 dB and 5 dB. The training and validation sets consist of 20000 and 5000 mixtures, respectively, which is equivalent to approximately 30 hours of training data and 5 hours of validation data. The test set was similarly generated using utterances from 16 speakers from the WSJ0 validation set si_dt_05 and evaluation set si_et_05, and consists of 5000 mixtures or approximately 5 hours of data. That is, the speakers in the test set are different from the speakers in the training and validation sets. The WSJ0-3mix dataset was generated using a similar approach but contains mixtures of speech from three speakers.

Since we want a single RNN architecture that can handle both two-speaker and three-speaker mixtures, we have chosen a model architecture with three outputs. The specific architecture is described in detail in Sec. 4.3. To ensure that the model can handle both two-speaker and three-speaker mixtures, it must be trained on both scenarios, so we have combined the WSJ0-2mix and WSJ0-3mix datasets into a larger WSJ0-2+3mix dataset. To allow this fusion, we have extended the WSJ0-2mix dataset with a third "silent" speaker, such that the combined WSJ0-2+3mix dataset consists only of three-speaker mixtures; half of the mixtures contain three real speakers, and the remaining half contain two real speakers (and a "silent" speaker). To minimize the risk of numerical issues, e.g. in computing ideal masks, the third "silent" speaker consists of white Gaussian noise with an average energy level 70 dB below the average energy of the other two speakers in the mixture.
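The "silent speaker" construction can be summarized by the short sketch below, assuming time-domain signals of equal length; the -70 dB level is relative to the average energy of the two real talkers, as described above.

```python
import numpy as np

def add_silent_speaker(s1, s2, rel_level_db=-70.0):
    """Extend a WSJ0-2mix example with a third 'silent' speaker: white Gaussian
    noise whose average energy is 70 dB below the average energy of the two
    real speakers, keeping ideal-mask computations numerically well behaved."""
    avg_energy = 0.5 * (np.mean(s1 ** 2) + np.mean(s2 ** 2))
    silent_energy = avg_energy * 10.0 ** (rel_level_db / 10.0)
    silent = np.sqrt(silent_energy) * np.random.randn(len(s1))
    return [s1, s2, silent]
```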

4.2 Noisy Multi-talker Speech Mixtures

To simulate noisy environments, we follow the common approach [3] for generating noisy mixtures with additive noise and simply add a noise signal to the noise-free WSJ0-2+3mix mixture signal. To achieve a given SNR, the noise signal is scaled based on the active speech level of the noise-free mixture signal as per ITU-T P.56 [27].
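A sketch of this mixing step is given below; the ITU-T P.56 active speech level meter is abstracted behind a speech_level callable (not implemented here), and a plain power estimate is used as a stand-in when none is supplied.

```python
import numpy as np

def mix_at_snr(clean_mix, noise, snr_db, speech_level=None):
    """Scale `noise` so that the ratio of the (active) speech level of the
    noise-free mixture to the noise power equals `snr_db`, then add."""
    speech_pow = speech_level(clean_mix) if speech_level else np.mean(clean_mix ** 2)
    noise_pow = np.mean(noise ** 2)
    gain = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return clean_mix + gain * noise[:len(clean_mix)]
```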

To evaluate the robustness of the uPIT model against a stationary noise type, we use a synthetic Speech Shaped Noise (SSN) signal. The SSN signal is constructed by filtering a Gaussian white noise sequence through an all-pole filter with coefficients found from Linear Predictive Coding (LPC) analysis of 100 randomly chosen TIMIT sentences [28].
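A possible realization of this SSN generator is sketched below, using librosa's LPC routine and SciPy's IIR filtering; since the LPC order is not stated above, the value used here is only illustrative.

```python
import numpy as np
import librosa                      # librosa.lpc (available from librosa >= 0.7)
from scipy.signal import lfilter

def make_ssn(concat_speech, num_samples, lpc_order=12):
    """Speech Shaped Noise: filter white Gaussian noise through the all-pole
    filter 1/A(z), with A(z) obtained by LPC analysis of concatenated speech
    (here: the 100 randomly chosen TIMIT sentences). lpc_order is illustrative."""
    a = librosa.lpc(concat_speech.astype(np.float64), order=lpc_order)
    white = np.random.randn(num_samples)
    ssn = lfilter([1.0], a, white)             # all-pole filtering
    return ssn / np.max(np.abs(ssn))           # simple peak normalization
```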

To evaluate the robustness against a highly non-stationary noise type, we use a synthetic 6-speaker Babble (BBL) noise. The BBL noise signal is also based on TIMIT. The corpus, which consists of a total of 6300 spoken sentences, is randomly divided into 6 groups of 1050 concatenated utterances. Each group is then normalized to unit energy and truncated to equal length, and the six groups are finally added together. This results in a BBL noise sequence with a duration of more than 50 minutes.
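The babble construction can be expressed as the following sketch, where `utterances` is a list of time-domain TIMIT sentences; the random grouping and unit-energy normalization follow the description above.

```python
import numpy as np

def make_babble(utterances, num_groups=6, seed=0):
    """6-speaker babble: randomly split the corpus into groups, concatenate each
    group, normalize every group to unit energy, truncate to the shortest
    group, and sum the groups."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(utterances))
    per_group = len(utterances) // num_groups          # 1050 for TIMIT's 6300 sentences
    streams = []
    for g in range(num_groups):
        idx = order[g * per_group:(g + 1) * per_group]
        s = np.concatenate([utterances[j] for j in idx])
        streams.append(s / np.sqrt(np.sum(s ** 2)))    # unit energy
    min_len = min(len(s) for s in streams)
    return sum(s[:min_len] for s in streams)
```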

To evaluate the robustness against realistic noise types, we use the street (STR), cafeteria (CAF), bus (BUS), and pedestrian (PED) noise signals from the CHiME3 dataset [29]. These noise signals are real-life recordings made in the respective environments.

All six noise signals are divided into a 40 min. training sequence, a 5 min. validation sequence and a 5 min. test sequence. That is, the noise signals used for training and validation are different from the sequence used for testing.

4.3 Model Architectures and Training

For evaluating uPIT in noisy environments we have trained a total of seven bi-directional LSTM RNNs [30], using the training conditions, i.e. datasets and noise types, presented in Table 1.

Model ID   Dataset + Noise type (SNR: -5 dB – 10 dB)
LSTM1      WSJ0-2+3mix + SSN
LSTM2      WSJ0-2+3mix + BBL
LSTM3      WSJ0-2+3mix + STR
LSTM4      WSJ0-2+3mix + CAF
LSTM5      WSJ0-2+3mix + SSN + BBL + STR + CAF
LSTM6      WSJ0-2mix + BBL
LSTM7      WSJ0-3mix + BBL

Table 1: Training conditions for different models.

LSTM1-5 were trained on the WSJ0-2+3mix dataset, which contains a mix of both two-speaker and three-speaker mixtures. LSTM1-4 are noise type specific in the sense that they were trained using only a single noise type. LSTM5 was trained on all four noise types. LSTM6 and LSTM7 were trained using WSJ0-2mix and WSJ0-3mix datasets, respectively, and only a single noise type. LSTM5 will show the performance degradation, if any, when less a priori knowledge about the noise types is available. Similarly, LSTM6-7 will show the potential performance improvement if the number of speakers in the mixture is known a priori. Each mixture in the dataset was corrupted with noise at a specific SNR, uniformly chosen between -5 dB and 10 dB.

Each model has three bi-directional LSTM layers and a fully-connected output layer with ReLU [16] activation functions. LSTM1-5 and LSTM7 have 1280 LSTM cells in each layer, and LSTM6 has 896 cells, to be compliant with [24]. The input dimension is 129, i.e. a single frame, and the output dimension is 3 × 129 = 387, i.e. one 129-dimensional mask per output, $s = 1, 2, 3$. We apply 50% dropout [16] between the LSTM layers, and the outputs from the forward and backward LSTMs of one layer are concatenated before they are used as input to the subsequent layer. Due to the smaller layer size, LSTM6 has fewer trainable parameters than LSTM1-5 and LSTM7. All parameters are found using stochastic gradient descent with gradients computed by backpropagation. In all experiments, the maximum number of epochs was set to 200, and the per-sample learning rate was scaled down whenever the training cost increased on the training set; training was terminated when the learning rate fell below a preset threshold. Each minibatch contains 8 randomly selected utterances. All models are implemented using the Microsoft Cognitive Toolkit (CNTK) [31] (available at https://www.cntk.ai/).
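For illustration, a PyTorch sketch of this architecture is given below; the paper's models were implemented in CNTK, so this is only a structural approximation (three bi-directional LSTM layers, 50% dropout between layers, and a fully connected ReLU output layer producing one 129-dimensional mask per output stream).

```python
import torch
import torch.nn as nn

class UPITSeparator(nn.Module):
    """Structural sketch of the separation network of Sec. 4.3 (not the
    authors' CNTK implementation)."""
    def __init__(self, num_bins=129, hidden=1280, num_sources=3):
        super().__init__()
        # Three BLSTM layers; nn.LSTM applies dropout between layers and
        # concatenates the forward/backward outputs, as described above.
        self.blstm = nn.LSTM(input_size=num_bins, hidden_size=hidden,
                             num_layers=3, dropout=0.5,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_sources * num_bins)
        self.num_sources, self.num_bins = num_sources, num_bins

    def forward(self, mag):                    # mag: (batch, frames, 129)
        h, _ = self.blstm(mag)                 # h: (batch, frames, 2 * hidden)
        masks = torch.relu(self.out(h))        # ReLU output layer
        return masks.view(mag.shape[0], mag.shape[1],
                          self.num_sources, self.num_bins)
```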

5 Experimental Results

We evaluated the noise robustness of LSTM1-7 using the Signal-to-Distortion Ratio (SDR) [32] and the Extended Short-Time Objective Intelligibility (ESTOI) measure [33]. The SDR is an often used performance metric for source separation and is measured in dB. The ESTOI measure estimates speech intelligibility and has been found to be highly correlated with human listening tests [33], especially for modulated maskers; it is defined on a bounded range, and higher values are better. When evaluating SDR and ESTOI, we choose the output-target permutation that maximizes the given performance metric. Furthermore, when evaluating two-speaker mixtures, we identify the silent speaker as the output with the least energy and then compute the performance metric based on the remaining two outputs.
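The evaluation procedure just described can be sketched as follows; `metric` stands for either SDR or ESTOI and is assumed to be provided by an external implementation.

```python
import numpy as np
from itertools import permutations

def evaluate(estimates, references, metric):
    """Keep the highest-energy outputs (discarding the 'silent speaker' output
    for two-speaker mixtures), then report the metric under the output-target
    permutation that maximizes it."""
    est = sorted(estimates, key=lambda e: np.sum(e ** 2), reverse=True)
    est = est[:len(references)]                       # drop lowest-energy output(s)
    best = -np.inf
    for perm in permutations(range(len(references))):
        score = np.mean([metric(est[s], references[p]) for s, p in enumerate(perm)])
        best = max(best, score)
    return best
```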

Tables 2, 3, 4, and 5 summarize the SDR improvements achieved by LSTM1-5 on two- and three-speaker mixtures corrupted by SSN, BBL, STR, and CAF noise, respectively. The improvements are relative to the SDR of the noisy mixture without processing ("No Proc." in the tables). Tables 8, 9, 10, and 11 summarize the ESTOI improvements achieved by the same models in the same conditions. We evaluate the models at the challenging SNR of -5 dB, as well as at 0, 5, and 20 dB. At an input SNR of -5 dB, speech intelligibility, as estimated by ESTOI, is severely degraded, primarily due to the noise component, whereas the speech intelligibility degradation at 20 dB is primarily caused by the competing talkers in the mixture itself. As a reference, we also report the IPSF performance, which uses oracle information and therefore serves as an upper performance bound on this particular task.

From Tables 2-5 and 8-11 we see that all noise type specific models, i.e. LSTM1-4, in general achieve large SDR and ESTOI improvements, with average improvements of approximately 9 dB SDR and 0.18 ESTOI for two-speaker mixtures, and approximately 7 dB and 0.13, respectively, for three-speaker mixtures. Furthermore, we see that LSTM5 performs only slightly worse than the noise type specific models, which is interesting, since LSTM5 and LSTM1-4 have all been trained with 60 hours of speech, but LSTM5 has only seen approximately 15 hours of each noise type, compared to 60 hours for LSTM1-4. We also observe that the highly non-stationary BBL noise is considerably harder than the three other noise types, which corresponds well with the existing literature [3, 34, 35].

Tables 6 and 12 summarize the performance of LSTM6 and LSTM7. We observe that both models perform approximately on par with the noise type general LSTM5. More surprisingly, we see that LSTM2 consistently outperforms both LSTM6 and LSTM7, which corresponds well with a similar observation for the noise-free case in [24]. These results are of great importance, since they show that training a model on noisy three-speaker mixtures helps the model separate noisy two-speaker mixtures, and vice versa.

Tables 7 and 13 summarize the performance of LSTM5 when evaluated on speech mixtures corrupted by the two unknown noise types, BUS and PED, i.e. noise types not included in the training set. The score with respect to the BUS noise type is given on the left of the vertical bar (|) and the PED score on the right. We see that LSTM5 achieves large SDR and ESTOI improvements for both noise types at almost all SNRs. More importantly, we observe that the scores are comparable with, and in some cases even exceed, the performance of LSTM5 when it was evaluated on known noise types, as reported in Tables 2-5 and 8-11. These results indicate that LSTM5 is relatively robust against variations in the noise distribution.

In general, we observe SDR improvements for all models that are comparable in magnitude with the noise-free case [10, 24, 11, 12]. However, neither the SDR measure nor ESTOI differentiates between distortion from competing speakers and distortion from the noise source (a distinction captured by, e.g., the Source to Interference Ratio from [32]). This means that the trade-off between speech separation and noise reduction is yet to be fully understood. We leave this topic for future research.

Table 2: SDR improvements for LSTM1 and 5 tested on SSN.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM1   LSTM5       No Proc.   IPSF   LSTM1   LSTM5
-5              -8.8     15.9     9.6     9.4        -10.3     16.6     8.0     7.8
0               -5.1     14.5     9.1     9.0         -7.0     15.2     7.6     7.4
5               -2.4     13.9     8.6     8.4         -4.8     14.6     7.0     6.9
20               0.0     14.8     8.7     8.8         -3.0     15.1     6.6     6.7
Avg.            -4.1     14.8     9.0     8.9         -6.3     15.4     7.3     7.2
Table 3: SDR improvements for LSTM2 and 5 tested on BBL.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM2   LSTM5       No Proc.   IPSF   LSTM2   LSTM5
-5              -8.9     17.2     6.0     5.4        -10.4     17.8     4.4     3.8
0               -5.1     15.4     8.1     7.6         -7.1     16.0     6.3     5.8
5               -2.4     14.5     8.5     8.1         -4.8     15.1     6.7     6.5
20               0.0     14.8     9.0     8.8         -3.0     15.2     6.8     6.7
Avg.            -4.1     15.5     7.9     7.5         -6.3     16.0     6.0     5.7
Table 4: SDR improvements for LSTM3 and 5 tested on STR.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM3   LSTM5       No Proc.   IPSF   LSTM3   LSTM5
-5              -8.9     18.2    11.5    11.5        -10.4     18.6     9.7     9.6
0               -5.2     16.2    10.2    10.2         -7.1     16.7     8.4     8.3
5               -2.4     14.9     9.2     9.1         -4.8     15.5     7.3     7.2
20               0.0     14.9     8.9     8.8         -3.0     15.2     6.6     6.7
Avg.            -4.1     16.1     9.9     9.9         -6.3     16.5     8.0     7.9
Table 5: SDR improvements for LSTM4 and 5 tested on CAF.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM4   LSTM5       No Proc.   IPSF   LSTM4   LSTM5
-5              -8.9     18.2    10.0     9.9        -10.4     18.6     8.4     8.2
0               -5.1     16.3     9.7     9.5         -7.1     16.8     7.9     7.7
5               -2.4     15.1     9.0     8.9         -4.8     15.6     7.1     6.9
20               0.0     14.8     8.8     8.8         -3.0     15.2     6.7     6.6
Avg.            -4.1     16.1     9.4     9.3         -6.3     16.6     7.5     7.3
Table 6: SDR improvements for LSTM6, 7 and 5 tested on BBL.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM6   LSTM5       No Proc.   IPSF   LSTM7   LSTM5
-5              -8.9     17.2     5.6     5.4        -10.4     17.8     4.0     3.8
0               -5.1     15.4     7.7     7.6         -7.1     16.0     5.7     5.8
5               -2.4     14.5     8.0     8.1         -4.8     15.1     6.3     6.5
20               0.0     14.9     8.4     8.8         -3.0     15.2     6.4     6.7
Avg.            -4.1     15.5     7.4     7.5         -6.3     16.0     5.6     5.7
Table 7: SDR improvements for LSTM5 tested on BUS | PED.

SNR [dB]      2-Speaker                                       3-Speaker
              No Proc.      IPSF          LSTM5               No Proc.        IPSF          LSTM5
-5            -9.0 | -8.9   19.6 | 16.7   11.7 |  7.3         -10.5 | -10.4   19.9 | 17.4    9.7 | 5.7
0             -5.2 | -5.2   17.3 | 14.9   10.7 |  7.8          -7.2 |  -7.1   17.6 | 15.7    8.5 | 6.3
5             -2.4 | -2.4   15.7 | 14.1    9.5 |  7.9          -4.8 |  -4.8   16.1 | 14.8    7.4 | 6.3
20             0.0 |  0.0   14.9 | 14.8    8.8 |  8.7          -3.0 |  -3.0   15.2 | 15.2    6.7 | 6.7
Avg.          -4.1 | -4.1   16.9 | 15.1   10.2 |  7.9          -6.4 |  -6.3   17.2 | 15.8    8.1 | 6.2
Table 8: ESTOI improvements for LSTM1 and 5 tested on SSN.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM1   LSTM5       No Proc.   IPSF   LSTM1   LSTM5
-5              0.18     0.65    0.17    0.16         0.14     0.69    0.10    0.09
0               0.29     0.58    0.23    0.22         0.22     0.63    0.15    0.14
5               0.39     0.50    0.23    0.22         0.29     0.58    0.17    0.16
20              0.54     0.39    0.17    0.18         0.38     0.53    0.15    0.15
Avg.            0.35     0.53    0.20    0.20         0.26     0.61    0.14    0.14
Table 9: ESTOI improvements for LSTM2 and 5 tested on BBL.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM2   LSTM5       No Proc.   IPSF   LSTM2   LSTM5
-5              0.19     0.66    0.09    0.06         0.14     0.70    0.04    0.02
0               0.29     0.59    0.18    0.15         0.22     0.65    0.11    0.09
5               0.39     0.51    0.21    0.20         0.29     0.60    0.15    0.14
20              0.53     0.40    0.19    0.18         0.37     0.53    0.15    0.15
Avg.            0.35     0.54    0.17    0.15         0.26     0.62    0.11    0.10
Table 10: ESTOI improvements for LSTM3 and 5 tested on STR.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM3   LSTM5       No Proc.   IPSF   LSTM3   LSTM5
-5              0.24     0.60    0.16    0.15         0.18     0.65    0.10    0.09
0               0.32     0.54    0.21    0.19         0.24     0.61    0.14    0.13
5               0.40     0.49    0.21    0.20         0.30     0.57    0.15    0.15
20              0.54     0.39    0.18    0.18         0.37     0.53    0.15    0.15
Avg.            0.38     0.51    0.19    0.18         0.27     0.59    0.14    0.13
Table 11: ESTOI improvements for LSTM4 and 5 tested on CAF.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM4   LSTM5       No Proc.   IPSF   LSTM4   LSTM5
-5              0.24     0.60    0.13    0.12         0.19     0.65    0.08    0.07
0               0.33     0.54    0.18    0.17         0.25     0.61    0.12    0.11
5               0.41     0.48    0.20    0.19         0.30     0.58    0.15    0.14
20              0.53     0.39    0.18    0.18         0.37     0.53    0.15    0.15
Avg.            0.38     0.50    0.17    0.17         0.28     0.59    0.12    0.12
Table 12: ESTOI improvements for LSTM6, 7 and 5 tested on BBL.

SNR [dB]      2-Speaker                             3-Speaker
              No Proc.   IPSF   LSTM6   LSTM5       No Proc.   IPSF   LSTM7   LSTM5
-5              0.20     0.66    0.07    0.06         0.14     0.69    0.02    0.02
0               0.30     0.59    0.16    0.16         0.22     0.65    0.08    0.09
5               0.39     0.52    0.20    0.20         0.29     0.60    0.13    0.14
20              0.54     0.40    0.17    0.19         0.38     0.53    0.14    0.15
Avg.            0.36     0.54    0.15    0.15         0.26     0.62    0.09    0.10
Table 13: ESTOI improvements for LSTM5 tested on BUS | PED.

SNR [dB]      2-Speaker                                       3-Speaker
              No Proc.      IPSF          LSTM5               No Proc.      IPSF          LSTM5
-5            0.32 | 0.18   0.55 | 0.64   0.14 | 0.08         0.24 | 0.14   0.61 | 0.68   0.08 | 0.03
0             0.39 | 0.28   0.50 | 0.58   0.18 | 0.15         0.28 | 0.21   0.58 | 0.63   0.12 | 0.09
5             0.45 | 0.37   0.46 | 0.52   0.20 | 0.20         0.32 | 0.28   0.56 | 0.59   0.14 | 0.13
20            0.55 | 0.54   0.39 | 0.40   0.18 | 0.18         0.38 | 0.37   0.53 | 0.53   0.15 | 0.15
Avg.          0.43 | 0.34   0.47 | 0.54   0.17 | 0.15         0.31 | 0.25   0.57 | 0.61   0.12 | 0.10

6 Conclusion

In this paper we have proposed utterance-level Permutation Invariant Training (uPIT) for speaker independent multi-talker speech separation and denoising. Differently from prior works, which focus only on the ideal noise-free setting, we focus on the more realistic scenario of speech separation in noisy environments. Specifically, using the uPIT technique we have trained bi-directional Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) to separate two- and three-speaker mixtures corrupted by multiple noise types at a wide range of Signal-to-Noise Ratios (SNRs).

We show that bi-directional LSTM RNNs trained with uPIT are capable of improving both the Signal-to-Distortion Ratio (SDR) and the Extended Short-Time Objective Intelligibility (ESTOI) measure for challenging noise types and SNRs. Specifically, we show that LSTM RNNs achieve large SDR and ESTOI improvements when evaluated using noise types seen during training, and that a single model is capable of handling multiple noise types with only a slight decrease in performance. Furthermore, we show that a single LSTM RNN can handle both two-speaker and three-speaker noisy mixtures without a priori knowledge about the exact number of speakers. Finally, we show that LSTM RNNs trained using uPIT generalize well to unknown noise types.

References

  • [1] A. W. Bronkhorst, “The Cocktail Party Phenomenon: A Review of Research on Speech Intelligibility in Multiple-Talker Conditions,” Acta Acust united Ac, vol. 86, no. 1, pp. 117–128, 2000.
  • [2] J. H. McDermott, “The cocktail party problem,” Current Biology, vol. 19, no. 22, pp. R1024–R1027, Dec. 2009.
  • [3] M. Kolbæk, Z. H. Tan, and J. Jensen, “Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 153–167, 2017.
  • [4] J. Chen and D. Wang, “Long Short-Term Memory for Speaker Generalization in Supervised Speech Separation,” in Proc. INTERSPEECH, 2016, pp. 3314 – 3318.
  • [5] F. Weninger, F. Eyben, and B. Schuller, “Single-channel speech separation with memory-enhanced recurrent neural networks,” in Proc. ICASSP, 2014, pp. 3709–3713.
  • [6] F. Weninger et al., “Discriminatively trained recurrent neural networks for single-channel speech separation,” in GlobalSIP, 2014, pp. 577–581.
  • [7] H. Erdogan et al., “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proc. ICASSP, 2015, pp. 708–712.
  • [8] J. Chen et al., “Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises,” J. Acoust. Soc. Am., vol. 139, no. 5, pp. 2604–2612, 2016.
  • [9] J. Du et al., “Speech separation of a target speaker based on deep neural networks,” in ICSP, 2014, pp. 473–477.
  • [10] D. Yu et al., “Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation,” in Proc. ICASSP, 2017, pp. 241–245.
  • [11] J. R. Hershey et al., “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, 2016, pp. 31–35.
  • [12] Y. Isik et al., “Single-Channel Multi-Speaker Separation Using Deep Clustering,” in Proc. INTERSPEECH, 2016, pp. 545–549.
  • [13] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Proc. ICASSP, 2017, pp. 246–250.
  • [14] C. Weng et al., “Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1670–1679, 2015.
  • [15] P.-S. Huang et al., “Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 12, pp. 2136–2147, 2015.
  • [16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016.
  • [17] V. Gulshan et al., “Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs,” JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
  • [18] A. Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
  • [19] W. Xiong et al., “Achieving Human Parity in Conversational Speech Recognition,” arXiv:1610.05256 [cs], 2016.
  • [20] G. Saon et al., “English Conversational Telephone Speech Recognition by Humans and Machines,” arXiv:1703.02136 [cs], 2017.
  • [21] E. W. Healy et al., “An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type,” J. Acoust. Soc. Am., vol. 138, no. 3, pp. 1660–1669, 2015.
  • [22] T. Goehring et al., “Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users,” Hearing Research, vol. 344, pp. 183–194, 2017.
  • [23] H. Erdogan et al., “Deep Recurrent Networks for Separation and Recognition of Single Channel Speech in Non-stationary Background Audio,” in New Era for Robust Speech Recognition: Exploiting Deep Learning.   Springer, 2017.
  • [24] M. Kolbæk et al., “Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017.
  • [25] Y. Wang, A. Narayanan, and D. Wang, “On Training Targets for Supervised Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, 2014.
  • [26] J. Garofolo et al., “CSR-I (WSJ0) Complete LDC93s6a,” Philadelphia: Linguistic Data Consortium, 1993.
  • [27] “ITU-T Rec. P.56: Objective measurement of active speech level,” 1993, https://www.itu.int/rec/T-REC-P.56/.
  • [28] J. S. Garofolo et al., “DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM,” 1993.
  • [29] J. Barker et al., “The third ’CHiME’ Speech Separation and Recognition Challenge: Dataset, task and baselines,” in Proc. ASRU, 2015.
  • [30] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [31] A. Agarwal et al., “An introduction to computational networks and the computational network toolkit,” Microsoft Technical Report MSR-TR-2014-112, 2014.
  • [32] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
  • [33] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.
  • [34] P. C. Loizou, Speech Enhancement: Theory and Practice.   CRC Press, 2013.
  • [35] J. S. Erkelens et al., “Minimum Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741–1752, 2007.