FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks

02/12/2019 ∙ by Ziqiang Shi, et al. ∙ FUJITSU

Deep dilated temporal convolutional networks (TCN) have proved to be very effective in sequence modeling. In this paper we propose several improvements of TCN for an end-to-end approach to monaural speech separation: 1) a multi-scale dynamic weighted gated dilated convolutional pyramid network (FurcaPy), 2) a gated TCN with intra-parallel convolutional components (FurcaPa), 3) a weight-shared multi-scale gated TCN (FurcaSh), and 4) a dilated TCN with a gated difference-convolutional component (FurcaSu). Each of these networks takes the mixed utterance of two speakers and maps it to two separated utterances, where each utterance contains only one speaker's voice. For the objective, we propose to train the network by directly optimizing the utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix data corpus result in an 18.1dB SDR improvement, which shows that the proposed networks can improve performance on the speaker separation task.




1 Introduction

Multi-talker monaural speech separation has a vast range of applications. For example, in a home or conference environment in which many people talk, the human auditory system can easily track and follow a target speaker's voice within the multi-talker mixture. A machine, in contrast, needs a clean speech signal of the target speaker to be separated from the mixed speech before it can complete the subsequent recognition work. Speech separation is therefore a problem that must be solved in order to achieve satisfactory performance in speech or speaker recognition tasks. There are two difficulties in this problem. The first is that, since we do not have any prior information about the user, a truly practical system must be speaker-independent. The second is that beamforming algorithms cannot be applied to a single-microphone signal. Many traditional methods, such as computational auditory scene analysis (CASA) [30, 20, 10], non-negative matrix factorization (NMF) [23, 13], and probabilistic models [29], do not solve these two difficulties well.

More recently, a large number of techniques based on deep learning have been proposed for this task. These methods can be broadly grouped into three categories. The first category is based on deep clustering (DPCL) [8, 11], which maps the time-frequency (TF) points of the spectrogram into embedding vectors; these embedding vectors are then clustered into several classes corresponding to different speakers, and finally the clusters are used as masks to inversely transform the spectrogram into the separated clean voices. The second is permutation invariant training (PIT) [12, 35], which solves the label permutation problem by minimizing the lowest error among all possible permutations of the assignment between outputs and sources. The third category is end-to-end speech separation in the time domain [17, 18, 27], which is a natural way to overcome both the upper bound on source-to-distortion ratio improvement (SDRi) in short-time Fourier transform (STFT) mask estimation based methods and the real-time processing requirements of actual use.

This paper is based on the end-to-end method [17, 18, 27], which has achieved better results than DPCL based or PIT based approaches. Most DPCL and PIT based methods use the STFT as a front-end. Specifically, the mixed speech signal is first transformed from a one-dimensional signal in the time domain to a two-dimensional spectrum in the TF domain; the mixed spectrum is then separated into spectra corresponding to the different source speeches by a deep clustering or mask estimation method; and finally each cleaned source speech signal is restored by an inverse STFT on its spectrum. This framework has several limitations. Firstly, it is unclear whether the STFT is the optimal transformation of the signal for speech separation, even assuming that the parameters it depends on (such as the size and overlap of audio frames, the window type and so on) are optimal. Secondly, most STFT based methods assume that the phase of the separated signal is equal to the mixture phase, which is generally incorrect and imposes an obvious upper bound on separation performance even when ideal masks are used. To overcome these problems, several speech separation models that operate directly on time-domain speech signals were recently proposed [17, 18, 27].

Inspired by these first results, we propose FurcaNeXt, a general name for a series of fully end-to-end time-domain separation methods, which includes: 1) a multi-scale dynamic weighted gated dilated convolutional pyramid network (FurcaPy): because of the influence of different word lengths and speech speeds, multiple branches with a variety of temporal receptive field scales are introduced to characterize speech, and the weights of the different scales are automatically determined by a “weightor” network; 2) deep gated dilated temporal convolutional networks (TCN) with intra-parallel convolutional components (FurcaPa): the two convolution-related modules in each dilated convolutional module are replaced by two intra-parallel convolutional modules, which can reduce the variance of the model. The intra-parallel convolutional modules replicate weight matrices and average the feature maps produced by those layers. This convenient technique can effectively improve separation performance; 3) a weight-shared multi-scale gated TCN (FurcaSh): a simple design that achieves the functions of FurcaPy without increasing the number of network parameters; 4) a dilated TCN with a gated difference-convolutional component (FurcaSu): inspired by the Highway network, in which two additional non-linear transformations act as gates that can dynamically pass part of the input and suppress the other part, conditioned on the input itself. The authors of [34] simplify the Highway network in multiple ways. After further simplification, we propose to use two identical transformation function branches to implement a simplified version of the Highway network module.

The remainder of this paper is organized as follows: Section 2 introduces monaural speech separation with TCN. Section 3 describes our proposed FurcaNeXt and the separation algorithm in detail. The experimental setup and results are presented in Section 4. We conclude this paper in Section 5.

2 Speech separation with TCN

In this section, we review the formal definition of the monaural speech separation task and the original TCN architecture.

The goal of monaural speech separation is to estimate the individual target signals from a linearly mixed single-microphone signal, in which the target signals overlap in the TF domain. Let $x_i(t)$, $i = 1, \dots, S$, denote the $S$ target speech signals and let $y(t)$ denote the mixed speech. If we assume the target signals are linearly mixed, which can be represented as

$$y(t) = \sum_{i=1}^{S} x_i(t),$$

then monaural speech separation aims at estimating the individual target signals from the given mixed speech $y(t)$. In this work it is assumed that the number of target signals is known.

To deal with this ill-posed problem, Luo et al. [18] introduced TCN [14, 2] to this task. TCN was proposed as an alternative to RNN in various tasks [14, 2]. Each layer in the TCN contains a 1-D convolution block with an increased dilation factor. The dilation factor is increased exponentially to ensure a sufficiently large temporal context window to take advantage of the long-range dependence of the speech signal, as shown in Figure 1. Dilated convolution has been very successful in WaveNet for audio generation [26]. Dilated convolutions with different dilations have different receptive fields. Stacked dilated convolutions provide a very large receptive field for the network with only a few layers, because the dilation grows exponentially. This allows the network to capture temporal dependence at various resolutions of the input sequences. The TCN introduces a time hierarchy: the upper layers can access longer input subsequences and learn representations on larger time scales. Local information from lower layers spreads through the hierarchy by means of residual and skip connections.

There are two important elements in the original TCN [2], as shown in Figure 1: one is the dilated convolution, and the other is the residual connection. Dilated convolutions follow the work of [26] and are defined as

$$(x *_d k)(t) = \sum_{i} k(i)\, x(t - d \cdot i),$$

where $x$ is the 1-D input signal, $k$ is the filter (aka kernel), and $d$ is the dilation factor. Dilation is therefore equivalent to introducing a fixed step size between every two adjacent filter taps. The general way to increase the receptive field of the TCN is to increase the dilation factor $d$. In this work we increase $d$ exponentially with the depth of the network, as shown in Figure 1, where the TCN has four layers of 1-D Conv modules. Each 1-D Conv module is a residual block [7], which contains one layer of dilated convolution (Depthwise conv [9]), two layers of 1×1 convolutions (1×1 Conv), two non-linear activation layers (Parametric Rectified Linear Unit, PReLU [6]), and two normalization layers (Normalization). For normalization, we applied global normalization [18] to the convolutional filters.
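The effect of the dilation factor can be made concrete with a small sketch. The code below is an illustration only, written in plain Python: it uses a causal form of the dilated convolution with zero padding on the left, and the kernel size of 3 and the dilation factors 1, 2, 4, 8 in the usage note are hypothetical values chosen for the example, not settings taken from the paper.

```python
def dilated_conv1d(x, k, d):
    """Causal 1-D dilated convolution: y[t] = sum_i k[i] * x[t - d*i].

    Samples before the start of x are treated as zero. This padding choice
    is an assumption made for illustration.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, tap in enumerate(k):
            idx = t - d * i  # adjacent filter taps are d samples apart
            if idx >= 0:
                acc += tap * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of a stack of dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With `k = [1, 1]` and `d = 2`, each output is `x[t] + x[t - 2]`, so `dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1], 2)` gives `[1.0, 2.0, 4.0, 6.0, 8.0, 10.0]`. For the hypothetical stack above, `receptive_field(3, [1, 2, 4, 8])` is 31 samples, whereas four undilated layers of the same kernel size would cover only 9.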

Figure 1: The structure of TCN.

Luo et al. proposed a TCN based speech separation method [18], which consists of three processing stages, as shown in Figure 2: an encoder (a Conv1d followed by a PReLU), a separator (consisting, in order, of a LayerNorm, a 1×1 conv, 4 TCN layers, a 1×1 conv, and a softmax operation) and a decoder (an FC layer). First, the encoder module converts short segments of the mixed waveform into their corresponding representations. Then, the representation is used to estimate a multiplicative function (mask) for each source and each encoder output at each time step. The source waveforms are then reconstructed by transforming the masked encoder features with a linear decoder module.

Figure 2: The pipeline of TCN based speech separation in [18].

3 Speech separation with FurcaNeXt

The main work of this paper is to make several improvements to the TCN (Figure 1) and the TCN based framework (Figure 2) for speech separation. First, we introduce gating operations into the TCN, as shown in Figure 3. Nonlinear gated activations have been used in prior work on sequence modeling [26, 4]; they can control the flow of information and may help the network to model more complex interactions. Two gates are added to each 1-D convolutional module in the plain TCN: one corresponds to the first 1×1 convolutional layer in the module, and the other corresponds to all the layers from the depth-wise convolutional layer to the output 1×1 convolutional layer. This gated TCN based speech separation pipeline is called FurcaPorta in this work.

Figure 3: The structure of gated TCN.

3.1 FurcaPy: Multi-scale dynamic weighted gated dilated convolutional pyramids network

In real life, utterances always exhibit temporal scale variation caused by different word lengths and the pronunciation characteristics (e.g. speed) of different people, so different temporal receptive fields may help in speech separation. The temporal receptive field is fixed in the previous network structure. To remedy the temporal scale variation problem, we propose a pipeline based on multi-scale dynamic weighted pyramids of gated TCNs, called FurcaPy, as shown in Figure 4; there are three different temporal receptive fields in this description. FurcaPy’s encoder and decoder are the same as those of FurcaPorta; the two pipelines differ only in the separator. In the separator of FurcaPy, each branch in the pyramid consists of a different number of gated TCNs, so the temporal receptive field of each branch is several times that of a single gated TCN. If the receptive field of a single gated TCN is assumed to be $l$, then the receptive fields of the branches in Figure 4 are $3l$, $4l$, and $5l$ respectively. The total output is obtained as a weighted average of the outputs of the branches corresponding to the different receptive fields. Additionally, a “weightor” module is designed to determine which temporal receptive field is most suitable for the current input utterance; that is, the weights of the different gated TCN branches are determined dynamically by the “weightor” network for each utterance. The “weightor” is a common multi-layer 1-D convolutional neural network, as shown in Figure 4, consisting of Conv1d, PReLU, LayerNorm, 3 layers of 1×1 Conv and max pooling, and Softmax.

Figure 4: The structure of FurcaPy.
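The dynamic weighting of the pyramid branches can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `softmax` and `combine_branches` are hypothetical, each branch output is represented as a plain list of numbers, and the "weightor" is reduced to a vector of per-utterance scores that a softmax turns into weights summing to one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_branches(branch_outputs, weightor_scores):
    """Weighted average of branch outputs.

    branch_outputs: one equal-length list per pyramid branch.
    weightor_scores: one score per branch for the current utterance
    (stands in for the output of the weightor network before its softmax).
    """
    weights = softmax(weightor_scores)
    length = len(branch_outputs[0])
    return [
        sum(w * out[t] for w, out in zip(weights, branch_outputs))
        for t in range(length)
    ]
```

When all scores are equal, the result reduces to a plain average of the branches; a strongly dominant score makes the output follow the corresponding branch almost exclusively.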

3.2 FurcaSh: Weight-shared multi-scale gated TCN

FurcaPy increases the number of network parameters several times over and greatly reduces the processing speed of the network. In many cases, such a network cannot meet the requirements of real-time processing. To deal with these problems, a new structure is proposed that achieves two levels of multi-scale receptive fields without increasing the number of network parameters. As shown in Figure 5 and Figure 6, two levels of multi-scale temporal receptive fields are introduced. The first is at the level of the dilated 1-D Conv module: the outputs corresponding to different dilation factors are summed and averaged to give the final output of the gated TCN. The second is at the level of the gated TCNs: since there are 4 gated TCNs in the FurcaSh pipeline, the outputs of the different gated TCNs are summed and averaged to give the output of the separator.

Figure 5: The structure of FurcaSh.
Figure 6: The structure of FurcaSh.

3.3 FurcaPa: Deep gated dilated temporal convolutional networks (TCN) with intra-parallel convolutional components

The performance of a single predictive model can usually be improved by ensembling, that is, by combining a set of independently trained networks. The most commonly used method is model averaging, which can at least help to reduce the variance of the performance. As shown in Figure 7, two identical parallel branches are added in the different layers of each 1-D convolutional module of the gated TCN in FurcaPorta. This structure is called FurcaPa. The total output of each intra-parallel convolutional component is obtained by averaging the outputs of all its branches. In each dilated 1-D convolutional module, two intra-parallel convolutional components are introduced: the first is near the input and covers the Conv1d, PReLU, and Normalization layers; the other is near the output and covers all the remaining layers, including the Depthwise conv, PReLU, Normalization and 1×1 Conv layers. We apply this ensemble in two places in order to reduce the sub-variances of each block.

Figure 7: The structure of FurcaPa.

3.4 FurcaSu: Dilated TCN with gated difference-convolutional component

The Highway network can be simplified and generalized to achieve better performance [34]. Following the work of [34], we further simplify the Highway network module. As shown in Figure 8, in each dilated 1-D convolutional module, two Highway network modules, which we call gated difference-convolutional components, are introduced: the first is near the input and covers the Conv1d, PReLU, and Normalization layers; the other is near the output and covers all the remaining layers, including the Depthwise conv, PReLU, Normalization and 1×1 Conv layers. Unlike the original design, which uses three different transformation functions, we use three identical transformation branches in order to simplify the design and improve performance: one branch acts as the attention gate, the other two are signal transformations, and the results of the latter two are subtracted and then gated.

Figure 8: The structure of FurcaSu.
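One possible reading of the gated difference-convolutional component is sketched below. It is an assumption-laden illustration, not the authors' implementation: the three branches are passed in as arbitrary callables, "identical" is taken to mean identical in structure, the gate is squashed with a sigmoid (the text does not name the gate non-linearity), and the name `gated_difference` is hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function, used here as the assumed gate non-linearity."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_difference(x, transform_a, transform_b, gate_branch):
    """Subtract two signal branches, then gate the difference.

    Each branch maps the input list x to a list of the same length.
    Output element: sigmoid(gate) * (a - b).
    """
    a = transform_a(x)
    b = transform_b(x)
    g = gate_branch(x)
    return [sigmoid(gi) * (ai - bi) for ai, bi, gi in zip(a, b, g)]
```

For example, with toy branches `lambda x: [2 * v for v in x]`, `lambda x: list(x)` and a gate branch that returns zeros, the gate value is 0.5 everywhere and the component returns half of the input.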

3.5 Perceptual metric: Utterance-level SDR objective

Since the loss functions of many STFT-based methods are not directly applicable to waveform-based end-to-end speech separation, a loss function based on a perceptual metric is tried in this work. The perception of speech is greatly affected by distortion [33, 1]. To evaluate the performance of speech separation, the BSS_Eval metrics signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR) [5, 28], as well as short-time objective intelligibility (STOI) [25], are often employed. In this work we directly use SDR, the most commonly used metric for evaluating source separation, as the training objective. SDR measures the amount of distortion introduced by the output signal and is defined as the ratio between the energy of the clean signal and the energy of the distortion.

SDR captures the overall separation quality of the algorithm. There is a subtle point here: we first concatenate the outputs of FurcaNeXt into a complete utterance and then compare it with the full target utterance to calculate the SDR at the utterance level, instead of calculating the SDR one frame at a time. These two methods differ considerably in both procedure and performance. If we denote the output of the network by $s$, which should ideally be equal to the target source $x$, then SDR can be given as [5, 28]

$$\tilde{x} = \frac{\langle x, s \rangle}{\langle x, x \rangle}\, x, \qquad e = \tilde{x} - s, \qquad \mathrm{SDR} = 10 \log_{10} \frac{\langle \tilde{x}, \tilde{x} \rangle}{\langle e, e \rangle}.$$

Our target is then to maximize the SDR, or equivalently to minimize the negative SDR as the loss function, with respect to $s$.
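An utterance-level SDR of this kind can be computed in a few lines. The sketch below assumes the projection-based form, in which the target $x$ is first scaled by $\langle x, s \rangle / \langle x, x \rangle$ and the residual against the output $s$ is treated as distortion; this form is an assumption, and the plain ratio of target energy to error energy would be another reading of the definition. The function name `sdr_db` and the small `eps` guard against division by zero are illustrative additions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


def sdr_db(target, estimate, eps=1e-8):
    """Utterance-level SDR in dB between a target and an estimate.

    Both arguments are whole utterances (lists of samples), not frames.
    eps is an illustrative guard for a perfect estimate (zero distortion).
    """
    scale = dot(target, estimate) / dot(target, target)
    projected = [scale * v for v in target]                 # scaled target
    error = [p - s for p, s in zip(projected, estimate)]    # distortion
    return 10.0 * math.log10(dot(projected, projected) / (dot(error, error) + eps))
```

For `target = [1, 0, 1, 0]` and `estimate = [1, 0.1, 1, -0.1]` the scaled target has energy 2 and the distortion has energy 0.02, so the result is approximately 20 dB. Under this form, rescaling the estimate by a constant leaves the value unchanged.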

To solve the tracing and permutation problem, the PIT training criterion [12, 35] is employed in this work. We calculate the SDRs for all the permutations, pick the maximum one, and take its negative as the loss. This is called the uSDR loss in this work.
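The permutation search described above can be sketched as follows. The code is an illustration under assumptions: the SDR is computed in the projection-based form, the score of a permutation is taken to be the mean SDR over sources (the text does not say whether a mean or a sum is used), and the names `sdr_db` and `usdr_loss` are hypothetical.

```python
import itertools
import math


def sdr_db(target, estimate, eps=1e-8):
    """Utterance-level SDR in dB (projection-based form, assumed)."""
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    scale = dot(target, estimate) / dot(target, target)
    projected = [scale * v for v in target]
    error = [p - s for p, s in zip(projected, estimate)]
    return 10.0 * math.log10(dot(projected, projected) / (dot(error, error) + eps))


def usdr_loss(targets, estimates):
    """Negative of the best mean SDR over all output-to-target assignments."""
    n = len(targets)
    best = max(
        sum(sdr_db(targets[i], estimates[perm[i]]) for i in range(n)) / n
        for perm in itertools.permutations(range(n))
    )
    return -best
```

Because every assignment is scored, the loss does not depend on the order in which the network emits the separated utterances: swapping the two entries of `estimates` yields the same value. The number of permutations grows factorially with the number of speakers, which is unproblematic for the two-speaker case considered here.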

4 Experiments

4.1 Dataset and neural network

We evaluated our system on the two-speaker speech separation problem using the WSJ0-2mix dataset [8, 11], which contains 30 hours of training and 10 hours of validation data. The mixtures are generated by randomly selecting utterances from 49 male and 51 female speakers in the Wall Street Journal (WSJ0) training set si_tr_s and mixing them at various signal-to-noise ratios (SNR) chosen uniformly between 0 dB and 5 dB. The 5-hour evaluation set is generated in the same way, using utterances from 16 unseen speakers from si_dt_05 and si_et_05 in the WSJ0 dataset.

We evaluate the systems with the SDR improvement (SDRi) [5, 28] metric used in [11, 16, 32, 3, 12]. The original SDR, that is, the average SDR of the mixed speech $y(t)$ with respect to the original target speech signals $x_1(t)$ and $x_2(t)$, is 0.15 dB. Table 1 lists the average SDRi obtained by the different structures in FurcaNeXt and almost all the results of the past two years, where IRM means the ideal ratio mask

$$M_i(t, f) = \frac{|X_i(t, f)|}{\sum_{j} |X_j(t, f)|}$$

applied to the STFT $Y(t, f)$ of $y(t)$ to obtain the separated speech, which is evaluated to show the upper bound of STFT based methods, where $X_i(t, f)$ is the STFT of $x_i(t)$.
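The ideal ratio mask can be illustrated with a short sketch. It assumes the usual magnitude-ratio definition, in which the mask of source $i$ at a TF bin is its STFT magnitude divided by the sum of the magnitudes of all sources at that bin. The function name `ideal_ratio_masks`, the flattened list-of-bins layout and the `eps` guard for silent bins are illustrative choices, not details from the paper.

```python
def ideal_ratio_masks(source_mags, eps=1e-8):
    """Ideal ratio masks from per-source STFT magnitudes.

    source_mags[i][b] is the magnitude of source i at flattened TF bin b.
    Returns masks with the same layout; at each bin they sum to about 1.
    """
    num_bins = len(source_mags[0])
    totals = [sum(src[b] for src in source_mags) + eps for b in range(num_bins)]
    return [[src[b] / totals[b] for b in range(num_bins)] for src in source_mags]
```

For two sources with magnitudes `[3, 0, 1]` and `[1, 2, 1]`, the masks are approximately `[0.75, 0, 0.5]` and `[0.25, 1, 0.5]`. Multiplying each mask with the mixture STFT and inverting with the mixture phase gives the oracle separation whose SDRi appears as the IRM row of Table 1.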

In this experiment, as baselines, we reimplemented several classical approaches, such as DPCL [8], TasNet [17] and Conv-TasNet [18]. Compared with these baselines, an average increase of nearly 2.6dB SDRi is obtained. FurcaPy achieves the most significant performance improvement over the baseline systems, and it exceeds the upper bound of STFT based methods by a large margin (nearly 6dB).

Method                 SDRi (dB)
DPCL [8]               5.9
uPIT-BLSTM [35]        10.0
cuPIT-Grid-RD [32]     10.2
DANet [3]              10.5
ADANet [16]            10.5
DPCL*                  10.7
DPCL++ [11]            10.8
CBLDNN-GAT [15]        11.0
TasNet [17]            11.2
TasNet*                11.8
Chimera++ [31]         12.0
FurcaX [21]            12.5
IRM                    12.7
FurcaNet [22]          13.3
Conv-TasNet [18]       15.0
Conv-TasNet*           15.8
FurcaPorta             17.3
FurcaSu                17.9
FurcaSh                18.0
FurcaPa                18.2
FurcaPy                18.4
Table 1: SDRi (dB) in a comparative study of different separation methods on the WSJ0-2mix dataset. * indicates our reimplementation of the corresponding method.

5 Conclusion

In this paper we investigated the effectiveness of deep dilated temporal convolutional network modeling for multi-talker monaural speech separation. We propose a series of structures under the name FurcaNeXt to perform speech separation. Benefiting from the strength of end-to-end processing, the novel gating mechanism and the dynamic improvements, the best-performing structure in FurcaNeXt achieves 18.4dB SDRi on the public WSJ0-2mix data corpus, a 16% relative improvement, which is a new state of the art on this corpus. As for future work, although SDR is widely used and can be useful, it has some weaknesses [19]. In the future, we may use SNR to evaluate our models. It would be interesting to see how consistent SDR and SNR are.

6 Acknowledgment

We would like to thank Jian Wu at Northwestern Polytechnical University, Yi Luo at Columbia University, and Zhong-Qiu Wang at Ohio State University for valuable discussions on WSJ0-2mix database, DPCL, and end-to-end speech separation.