1 Introduction
Recent progress in deep learning-based speech separation has ignited the interest of the research community in time-domain approaches [24, 29, 30, 25, 28, 3]. Compared with standard time-frequency domain methods, time-domain methods are designed to jointly model the magnitude and phase information and allow direct optimization with respect to both time- and frequency-domain differentiable criteria [7, 16, 21].
Current time-domain separation systems can be mainly categorized into adaptive front-end and direct regression approaches. The adaptive front-end approaches aim at replacing the short-time Fourier transform (STFT) with a differentiable transform, building a front-end that can be learned jointly with the separation network. Separation is applied to the front-end output, just as conventional time-frequency domain methods apply the separation process to spectrogram inputs [30, 25, 28]. Being independent of the traditional time-frequency analysis paradigm, these systems have a much more flexible choice of the window size and the number of basis functions for the front-end. On the other hand, the direct regression approaches learn a regression function from an input mixture to the underlying clean signals without an explicit front-end, typically by using some form of one-dimensional convolutional neural network (1D CNN) [29, 20, 7].

A commonality between the two categories is that they both rely on effective modeling of extremely long input sequences. The direct regression methods perform separation at the waveform sample level, where the number of samples can easily reach tens of thousands or more. The performance of the adaptive front-end methods also depends on the selection of the window size: a smaller window improves the separation performance at the cost of a significantly longer front-end representation [25, 13]. This poses an additional challenge, as conventional sequence modeling networks, including RNNs and 1D CNNs, have difficulty learning such long-term temporal dependencies [18]. Moreover, unlike RNNs, which have dynamic receptive fields, 1D CNNs with fixed receptive fields smaller than the sequence length are not able to fully utilize sequence-level dependencies [4].
In this paper, we propose a simple network architecture, which we refer to as dual-path RNN (DPRNN), that organizes any kind of RNN layers to model long sequential inputs in a very simple way. The intuition is to split the input sequence into shorter chunks and interleave two RNNs, an intra-chunk RNN and an inter-chunk RNN, for local and global modeling, respectively. In a DPRNN block, the intra-chunk RNN first processes the local chunks independently, and then the inter-chunk RNN aggregates the information from all the chunks to perform utterance-level processing. For a sequential input of length $L$, DPRNN with chunk size $K$ and chunk hop size $P$ contains $S$ chunks, where $S$ and $K$ correspond to the input lengths for the inter- and intra-chunk RNNs, respectively. When $K \approx \sqrt{2L}$, the two RNNs have a sublinear input length ($O(\sqrt{L})$) as opposed to the original input length ($O(L)$), which greatly decreases the optimization difficulty that arises when $L$ is extremely large.
Compared with other approaches for arranging local and global RNN layers, or, more generally, the hierarchical RNNs that perform sequence modeling at multiple time scales [17, 6, 26, 11, 35, 5], the stacked DPRNN blocks iteratively and alternately perform the intra- and inter-chunk operations, which can be treated as an interleaved processing of local and global inputs. Moreover, the first RNN layer in most hierarchical RNNs still receives the entire input sequence, while in stacked DPRNN each intra- or inter-chunk RNN receives the same sublinear input size across all blocks. Compared with CNN-based architectures such as temporal convolutional networks (TCNs) that only perform local modeling due to their fixed receptive fields [25, 28, 19], DPRNN is able to fully utilize global information via the inter-chunk RNNs and achieves superior performance with an even smaller model size. In Section 4 we will show that by simply replacing the TCN with DPRNN in a previously proposed time-domain separation system [25], the model achieves a 0.7 dB (4.6%) relative improvement with respect to scale-invariant signal-to-noise ratio (SI-SNR) [16] on WSJ0-2mix with a 49% smaller model size. By performing the separation at the waveform sample level, i.e., with a window size of 2 samples and a hop size of 1 sample, a new state-of-the-art performance is achieved with a 20 times smaller model than the previous best system.

2 Dual-path Recurrent Neural Network
2.1 Model design
A dual-path RNN (DPRNN) consists of three stages: segmentation, block processing, and overlap-add. The segmentation stage splits a sequential input into overlapped chunks and concatenates all the chunks into a 3D tensor. The tensor is then passed to stacked DPRNN blocks to iteratively apply local (intra-chunk) and global (inter-chunk) modeling in an alternate fashion. The output of the last block is transformed back to a sequential output with the overlap-add method. Figure 1 shows the flowchart of the model.
2.1.1 Segmentation
For a sequential input $\mathbf{W} \in \mathbb{R}^{N \times L}$, where $N$ is the feature dimension and $L$ is the number of time steps, the segmentation stage splits $\mathbf{W}$ into chunks of length $K$ with hop size $P$. The first and last chunks are zero-padded so that every sample in $\mathbf{W}$ appears in, and only appears in, $K/P$ chunks, generating $S$ equal-size chunks $\mathbf{D}_s \in \mathbb{R}^{N \times K}$, $s = 1, \ldots, S$. All chunks are then concatenated together to form a 3D tensor $\mathbf{T} = [\mathbf{D}_1, \ldots, \mathbf{D}_S] \in \mathbb{R}^{N \times K \times S}$.

2.1.2 Block processing
The segmentation output $\mathbf{T}$ is then passed to the stack of $B$ DPRNN blocks. Each block transforms an input 3D tensor into another tensor of the same shape. We denote the input tensor of block $b$ as $\mathbf{T}_b \in \mathbb{R}^{N \times K \times S}$, $b = 1, \ldots, B$, where $\mathbf{T}_1 = \mathbf{T}$. Each block contains two sub-modules corresponding to intra- and inter-chunk processing, respectively. The intra-chunk RNN is always bidirectional and is applied to the second dimension of $\mathbf{T}_b$, i.e., within each of the $S$ chunks:

$$\mathbf{U}_b = \left[\, f_b(\mathbf{T}_b[:,:,i]),\ i = 1, \ldots, S \,\right] \tag{1}$$

where $\mathbf{U}_b \in \mathbb{R}^{H \times K \times S}$ is the output of the RNN, $f_b(\cdot)$ is the mapping function defined by the RNN, and $\mathbf{T}_b[:,:,i] \in \mathbb{R}^{N \times K}$ is the sequence defined by chunk $i$. A linear fully-connected (FC) layer is then applied to transform the feature dimension of $\mathbf{U}_b$ back to that of $\mathbf{T}_b$:

$$\hat{\mathbf{U}}_b = \left[\, \mathbf{G}\mathbf{U}_b[:,:,i] + \mathbf{m},\ i = 1, \ldots, S \,\right] \tag{2}$$

where $\hat{\mathbf{U}}_b \in \mathbb{R}^{N \times K \times S}$ is the transformed feature, $\mathbf{G} \in \mathbb{R}^{N \times H}$ and $\mathbf{m} \in \mathbb{R}^{N \times 1}$ are the weight and bias of the FC layer, respectively, and $\mathbf{U}_b[:,:,i] \in \mathbb{R}^{H \times K}$ represents chunk $i$ in $\mathbf{U}_b$. Layer normalization (LN) [2] is then applied to $\hat{\mathbf{U}}_b$, which we empirically found to be important for the model to have a good generalization ability:

$$\mathrm{LN}(\hat{\mathbf{U}}_b) = \frac{\hat{\mathbf{U}}_b - \mu(\hat{\mathbf{U}}_b)}{\sigma(\hat{\mathbf{U}}_b) + \epsilon} \odot \mathbf{z} + \mathbf{r} \tag{3}$$

where $\mathbf{z}, \mathbf{r} \in \mathbb{R}^{N \times 1}$ are the rescaling factors, $\epsilon$ is a small positive number for numerical stability, and $\odot$ denotes the Hadamard product. $\mu(\cdot)$ and $\sigma(\cdot)$ are the mean and standard deviation of the 3D tensor, defined as

$$\mu(\hat{\mathbf{U}}_b) = \frac{1}{NKS} \sum_{i=1}^{N} \sum_{j=1}^{K} \sum_{k=1}^{S} \hat{\mathbf{U}}_b[i,j,k] \tag{4}$$

$$\sigma(\hat{\mathbf{U}}_b) = \sqrt{\frac{1}{NKS} \sum_{i=1}^{N} \sum_{j=1}^{K} \sum_{k=1}^{S} \left( \hat{\mathbf{U}}_b[i,j,k] - \mu(\hat{\mathbf{U}}_b) \right)^2} \tag{5}$$

A residual connection is then added between the output of the LN operation and the input $\mathbf{T}_b$:

$$\hat{\mathbf{T}}_b = \mathbf{T}_b + \mathrm{LN}(\hat{\mathbf{U}}_b) \tag{6}$$

$\hat{\mathbf{T}}_b$ then serves as the input to the inter-chunk RNN sub-module, where the RNN is applied to the last dimension, i.e., across the aligned time steps of the chunks:

$$\mathbf{V}_b = \left[\, h_b(\hat{\mathbf{T}}_b[:,i,:]),\ i = 1, \ldots, K \,\right] \tag{7}$$

where $\mathbf{V}_b \in \mathbb{R}^{H \times K \times S}$ is the output of the RNN, $h_b(\cdot)$ is the mapping function defined by the RNN, and $\hat{\mathbf{T}}_b[:,i,:] \in \mathbb{R}^{N \times S}$ is the sequence defined by the $i$-th time step in all chunks. As the intra-chunk RNN is bidirectional, each time step in $\hat{\mathbf{T}}_b$ contains the entire information of the chunk it belongs to, which allows the inter-chunk RNN to perform fully sequence-level modeling. As with the intra-chunk RNN, a linear FC layer and the LN operation are then applied to $\mathbf{V}_b$. A residual connection is also added between the output and $\hat{\mathbf{T}}_b$ to form the output $\mathbf{T}_{b+1}$ of DPRNN block $b$. For $b < B$, $\mathbf{T}_{b+1}$ serves as the input to the next block $b+1$.
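To make the tensor manipulations concrete, the following NumPy sketch runs one DPRNN block. It is an illustrative skeleton under stated assumptions: a toy unidirectional tanh recurrence stands in for the paper's BLSTM, the FC bias and the learnable LN rescaling factors are omitted, and all helper names (`toy_rnn`, `dprnn_block`) are ours rather than from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_rnn(x, W, U):
    """Minimal tanh recurrence over the first axis of x (a stand-in for an LSTM).
    x: (T, N) input sequence; returns (T, H) hidden states."""
    h = np.zeros(W.shape[0])
    out = np.empty((x.shape[0], W.shape[0]))
    for t in range(x.shape[0]):
        h = np.tanh(W @ h + U @ x[t])
        out[t] = h
    return out

def layer_norm(x, eps=1e-8):
    """Normalize by the global mean/std of the whole 3D tensor
    (learnable rescaling factors omitted in this sketch)."""
    return (x - x.mean()) / (x.std() + eps)

def dprnn_block(T_b, params):
    """One DPRNN block on a (N, K, S) tensor: intra-chunk RNN over the K axis,
    FC + LN + residual, then inter-chunk RNN over the S axis."""
    N, K, S = T_b.shape
    W1, U1, G1, W2, U2, G2 = params
    # Intra-chunk processing: one RNN pass per chunk, over its K time steps.
    U = np.stack([toy_rnn(T_b[:, :, s].T, W1, U1) for s in range(S)], axis=-1)  # (K, H, S)
    # FC back to N features, then layer norm and a residual connection.
    hat_T = T_b + layer_norm(np.einsum('nh,khs->nks', G1, U))
    # Inter-chunk processing: one RNN pass per aligned time step, across the S chunks.
    V = np.stack([toy_rnn(hat_T[:, k, :].T, W2, U2) for k in range(K)], axis=1)  # (S, K, H)
    return hat_T + layer_norm(np.einsum('nh,skh->nks', G2, V))

# Tiny demo: the block maps a (N, K, S) tensor to the same shape.
N, H, K, S = 4, 6, 5, 3
params = tuple(0.1 * rng.standard_normal(shape) for shape in
               [(H, H), (H, N), (N, H), (H, H), (H, N), (N, H)])
T2 = dprnn_block(rng.standard_normal((N, K, S)), params)
```

A real implementation would use bidirectional LSTMs for the intra-chunk pass and batch the per-chunk loops as reshapes, but the axis bookkeeping is the same.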
2.1.3 Overlap-add
Denote the output of the last DPRNN block as $\mathbf{T}_{B+1} \in \mathbb{R}^{N \times K \times S}$. To transform it back to a sequence, the overlap-add method is applied to the $S$ chunks to form the output $\mathbf{Q} \in \mathbb{R}^{N \times L}$.
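The segmentation and overlap-add stages can be sketched as follows in NumPy. This is a minimal illustration under stated assumptions: the padding scheme is one plausible choice consistent with the description (every input sample lands in exactly K/P chunks), and the function names are ours.

```python
import numpy as np

def segment(W, K, P):
    """Split W (N, L) into overlapped chunks of length K with hop size P,
    zero-padding both ends so every sample of W appears in exactly K // P
    chunks. Returns a (N, K, S) tensor and the amount of head padding."""
    N, L = W.shape
    pad_head = K - P
    rem = (pad_head + L + pad_head - K) % P
    pad_tail = pad_head + ((P - rem) % P)
    Wp = np.concatenate([np.zeros((N, pad_head)), W, np.zeros((N, pad_tail))], axis=1)
    S = (Wp.shape[1] - K) // P + 1
    chunks = np.stack([Wp[:, s * P:s * P + K] for s in range(S)], axis=-1)
    return chunks, pad_head

def overlap_add(T, P, pad_head, L):
    """Invert segment(): add the chunks back at their hop positions and crop
    the zero padding. At 50% overlap each sample is counted K / P = 2 times,
    so dividing the result by K // P recovers the input exactly."""
    N, K, S = T.shape
    out = np.zeros((N, (S - 1) * P + K))
    for s in range(S):
        out[:, s * P:s * P + K] += T[:, :, s]
    return out[:, pad_head:pad_head + L]

# Round trip at 50% overlap: overlap-add returns (K // P) copies of W.
W = np.arange(21, dtype=float).reshape(3, 7)
T, pad = segment(W, K=4, P=2)                      # shape (3, 4, 5)
R = overlap_add(T, P=2, pad_head=pad, L=7) / (4 // 2)
```

Dividing by `K // P` is our demonstration choice for an exact round trip; in the full model the chunks are transformed between segmentation and overlap-add, so no such exact inverse is expected.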
2.2 Discussion
Consider the sum of the input sequence lengths for the intra- and inter-chunk RNNs in a single block, denoted by $V = K + S$, where the hop size is set to 50% of the chunk size (i.e., $P = K/2$) as in Figure 1. It is simple to see that $S = \lceil 2L/K \rceil + 1 \approx 2L/K$, where $\lceil \cdot \rceil$ is the ceiling function. To achieve the minimum total input length $V$, $K$ should be selected such that $K \approx \sqrt{2L}$, and then $S$ also satisfies $S \approx \sqrt{2L} \approx K$. This gives us a sublinear input length ($O(\sqrt{L})$) rather than the original linear input length ($O(L)$).
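This chunk-size trade-off is easy to check numerically. The helper below (our naming) picks K near √(2L) and counts the 50%-overlap chunks:

```python
import math

def chunk_sizes(L):
    """Choose chunk size K ~ sqrt(2L) so that, with hop size P = K / 2,
    the total RNN input length K + S is approximately minimized."""
    K = round(math.sqrt(2 * L))
    S = math.ceil(2 * L / K) + 1   # number of 50%-overlapped chunks
    return K, S

# For a 30000-frame sequence, each RNN sees only a few hundred steps:
K, S = chunk_sizes(30000)          # K = 245, S = 246
```

Both RNNs thus operate on inputs of a few hundred steps instead of the original 30000.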
For tasks that require online processing, the inter-chunk RNN can be made unidirectional, scanning from the first chunk up to the current one. Later chunks can still utilize the information from all previous chunks, and the minimum system latency is thus defined by the chunk size $K$. This is unlike standard CNN-based models, which can only perform local processing due to their fixed receptive fields, or conventional RNN-based models, which perform frame-level instead of chunk-level modeling. The performance difference between the online and offline settings, however, is beyond the scope of this paper.
3 Experimental procedures
3.1 Model configurations
Although DPRNN can be applied to any system that requires long-term sequential modeling, we investigate its application to the time-domain audio separation network (TasNet) [24, 23, 25], an adaptive front-end method that achieves high speech separation performance on a benchmark dataset. TasNet contains three parts: (1) a linear 1D convolutional encoder that encapsulates the input mixture waveform into an adaptive 2D front-end representation, (2) a separator that estimates the masking matrices for the target sources, and (3) a linear 1D transposed convolutional decoder that converts the masked 2D representations back to waveforms. We use the same encoder and decoder design as in [25], with the number of filters set to 64. As for the separator, we compare the proposed deep DPRNN with the optimally configured TCN described in [25]. We use 6 DPRNN blocks with BLSTM [10] as the intra- and inter-chunk RNNs, each with 128 hidden units in each direction. The chunk size $K$ for DPRNN is defined empirically according to the length $L$ of the front-end representation such that $K \approx \sqrt{2L}$, as discussed in Section 2.2.

3.2 Dataset
We evaluate our approach on two-speaker speech separation and recognition tasks. The separation-only experiment is conducted on the widely-used WSJ0-2mix dataset [9]. WSJ0-2mix contains 30 hours of 8 kHz training data generated from the Wall Street Journal (WSJ0) si_tr_s set. It also has 10 hours of validation data and 5 hours of test data generated from the si_dt_05 and si_et_05 sets, respectively. Each mixture is artificially generated by randomly selecting utterances from different speakers in the corresponding set and mixing them at a random signal-to-noise ratio (SNR) between −5 and 5 dB.
For the speech separation and recognition experiment, we create 200 hours and 10 hours of artificially mixed noisy reverberant mixtures sampled from the Librispeech dataset [27] for training and validation, respectively. The 16 kHz signals are convolved with room impulse responses generated by the image method [1]. The length and width of the room are randomly sampled between 2 and 10 meters, and the height is randomly sampled between 2 and 5 meters. The reverberation time (T60) is randomly sampled between 0.1 and 0.5 seconds. The locations of the speakers as well as the single microphone are all randomly sampled. The two reverberated signals are rescaled to a random SNR between −5 and 5 dB, and further shifted such that the overlap ratio between the two speakers is 50% on average. The resultant mixture is further corrupted by random isotropic noise at a random SNR between 10 and 20 dB [8]. For evaluation, we generate mixtures in the same manner, sampled from Microsoft's internal gender-balanced clean speech collection consisting of 44 speakers. The target for separation is the reverberant clean speech of both speakers.
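The SNR rescaling step in this pipeline can be sketched as follows. This is a minimal illustration of the rescaling alone (room simulation, overlap shifting, and noise corruption are omitted), and the function name is ours:

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Scale s2 so that the power ratio of s1 to the rescaled s2 equals
    snr_db, then sum the two signals. Returns (mixture, rescaled s2)."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    gain = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return s1 + gain * s2, gain * s2

# Mix two random "speakers" at 5 dB; the realized SNR matches the target.
rng = np.random.default_rng(1)
s1, s2 = rng.standard_normal(16000), rng.standard_normal(16000)
mix, s2_scaled = mix_at_snr(s1, s2, snr_db=5.0)
```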
3.3 Experiment configurations
We train all models for 100 epochs on 4-second long segments. The learning rate is initialized to $10^{-3}$ and decays by a factor of 0.98 every two epochs. Early stopping is applied if no better model is found on the validation set for 10 consecutive epochs. Adam [14] is used as the optimizer. Gradient clipping with a maximum $\ell_2$ norm of 5 is applied in all experiments. All models are trained with utterance-level permutation invariant training (uPIT) [15] to maximize scale-invariant SNR (SI-SNR) [16].

The effectiveness of the systems is assessed in terms of both signal fidelity and speech recognition accuracy. The improvement in signal fidelity is measured by the signal-to-distortion ratio improvement (SDRi) [31] as well as the SI-SNR improvement (SI-SNRi). The speech recognition accuracy is measured by the word error rate (WER) over both separated speakers.
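For reference, SI-SNR and the two-speaker uPIT selection can be computed as in the sketch below. This is our implementation following the common definitions from [16] and [15]: mean removal makes the measure invariant to offsets, and the projection onto the reference makes it invariant to rescaling. The function names are ours.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB): project the zero-mean estimate onto the
    zero-mean reference, then compare target and residual energies."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10 * np.log10((s_target @ s_target + eps) / (e_noise @ e_noise + eps))

def upit_si_snr_2spk(ests, refs):
    """uPIT for two speakers: score both output-reference pairings and keep
    the better one."""
    direct = (si_snr(ests[0], refs[0]) + si_snr(ests[1], refs[1])) / 2
    swapped = (si_snr(ests[0], refs[1]) + si_snr(ests[1], refs[0])) / 2
    return max(direct, swapped)

# SI-SNR is unchanged when the estimate is rescaled:
rng = np.random.default_rng(2)
ref = rng.standard_normal(8000)
est = ref + 0.1 * rng.standard_normal(8000)
a, b = si_snr(est, ref), si_snr(2.0 * est, ref)
```

During training the negative of the uPIT score would be used as the loss; at test time the same permutation search aligns outputs with references for scoring.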
4 Results and discussions
4.1 Results on WSJ0-2mix
We first report the results on the WSJ0-2mix dataset. Table 1 compares TasNet-based systems with different separator networks. We can see that simply replacing the TCN with DPRNN improves the separation performance by 4.6% with a 49% smaller model. This shows the superiority of the proposed local-global modeling over the previous CNN-based local-only modeling. Moreover, the performance is consistently improved by further decreasing the filter length (and, as a consequence, the hop size) in the encoder and decoder. The best performance is obtained with a filter length of 2 samples, for which the encoder output contains more than 30000 frames. Such a sequence is extremely hard or even impossible for standard RNNs or CNNs to model, while the proposed DPRNN makes the use of such a short filter possible and achieves the best performance.
Table 2 compares DPRNN-TasNet with previous systems on WSJ0-2mix. We can see that DPRNN-TasNet achieves a new record on SI-SNRi with a 20 times smaller model than the previous state-of-the-art system [28]. The small model size and the superior performance of DPRNN-TasNet indicate that speech separation on the WSJ0-2mix dataset can be solved without enormous or complex models, revealing the need for more challenging and realistic datasets in future research.


Table 1: TasNet-based systems with different separator networks on WSJ0-2mix.

Separator network  Model size  Window (samples)  Chunk size (frames)  SI-SNRi (dB)  SDRi (dB)
TCN                5.1M        16                –                    15.2          15.5
DPRNN              2.6M        16                100                  15.9          16.1
DPRNN              2.6M        8                 150                  17.0          17.3
DPRNN              2.6M        4                 200                  17.9          18.1
DPRNN              2.6M        2                 250                  18.8          19.0



Table 2: Comparison with previous systems on WSJ0-2mix.

Method                    Model size  SI-SNRi (dB)  SDRi (dB)
DPCL++ [12]               13.6M       10.8          –
uPIT-BLSTM-ST [15]        92.7M       –             10.0
ADANet [22]               9.1M        10.4          10.8
WA-MISI-5 [32]            32.9M       12.6          13.1
Conv-TasNet-gLN [25]      5.1M        15.3          15.6
Sign Prediction Net [33]  55.2M       15.3          15.6
Deep CASA [19]            12.8M       17.7          18.0
FurcaNeXt [28]            51.4M       –             18.4
DPRNN-TasNet              2.6M        18.8          19.0

4.2 Speech separation and recognition results
We use a conventional hybrid system for speech recognition. Our recognition system is trained on large-scale single-speaker noisy reverberant speech collected from various sources [34]. Table 3 compares TCN- and DPRNN-based TasNet models with a 2 ms window. We can observe that DPRNN-TasNet significantly outperforms TCN-TasNet in both SI-SNRi and WER, showing the superiority of DPRNN even under challenging noisy and reverberant conditions. This further indicates that DPRNN can replace conventional sequential modeling modules across a range of tasks and scenarios.


Table 3: Speech separation and recognition results with TCN- and DPRNN-based TasNet.

Separator network              Model size  SI-SNRi (dB)  WER (%)
TCN                            5.1M        7.6           28.7
DPRNN                          2.6M        8.4           25.9
Noise-free reverberant speech  –           –             9.1

5 Conclusion
In this paper, we proposed the dual-path recurrent neural network (DPRNN), a simple yet effective way of organizing any type of RNN layers to model an extremely long sequence. DPRNN splits the sequential input into overlapping chunks and performs intra-chunk (local) and inter-chunk (global) processing with two RNNs alternately and iteratively. This design makes the length of each RNN input proportional to the square root of the original input length, enabling sublinear processing and alleviating optimization challenges. We also described an application to single-channel time-domain speech separation using the time-domain audio separation network (TasNet). By replacing the 1D CNN modules with deep DPRNN and performing sample-level separation in the TasNet framework, a new state-of-the-art performance was obtained on WSJ0-2mix with a 20 times smaller model than the previously reported best system. Experimental results on noisy reverberant speech separation and recognition were also reported, demonstrating DPRNN's effectiveness in challenging acoustic conditions. These results confirm the superiority of the proposed approach across various scenarios and tasks.
References
 [1] (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950. Cited by: §3.2.
 [2] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §2.1.2.
 [3] (2019) A comprehensive study of speech separation: spectrogram vs. waveform separation. arXiv preprint arXiv:1905.07497. Cited by: §1.
 [4] (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §1.
 [5] (2017) Dilated recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 77–87. Cited by: §1.
 [6] (2016) Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704. Cited by: §1.

 [7] (2018) End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 26 (9), pp. 1570–1584. Cited by: §1, §1.
 [8] (2007) Generating sensor signals in isotropic noise fields. The Journal of the Acoustical Society of America 122 (6), pp. 3464–3470. Cited by: §3.2.
 [9] (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 31–35. Cited by: §3.2.
 [10] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §3.1.

 [11] (2017) Chunk-based decoder for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1901–1912. Cited by: §1.
 [12] (2016) Single-channel multi-speaker separation using deep clustering. Interspeech 2016, pp. 545–549. Cited by: Table 2.
 [13] (2019) Universal sound separation. arXiv preprint arXiv:1905.03330. Cited by: §1.
 [14] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
 [15] (2017) Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1901–1913. Cited by: §3.3, Table 2.
 [16] (2019) SDR – half-baked or well done?. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630. Cited by: §1, §1, §3.3.

 [17] (2015) A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057. Cited by: §1.
 [18] (2018) Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5457–5466. Cited by: §1.
 [19] (2019) Divide and conquer: a deep CASA approach to talker-independent monaural speaker separation. arXiv preprint arXiv:1904.11148. Cited by: §1, Table 2.
 [20] (2018) End-to-end music source separation: is it possible in the waveform domain?. arXiv preprint arXiv:1810.12187. Cited by: §1.
 [21] (2019) FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. arXiv preprint arXiv:1909.13387. Cited by: §1.
 [22] (2018) Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (4), pp. 787–796. Cited by: Table 2.
 [23] (2018) Real-time single-channel dereverberation and separation with time-domain audio separation network. Proc. Interspeech 2018, pp. 342–346. Cited by: §3.1.
 [24] (2018) TasNet: time-domain audio separation network for real-time, single-channel speech separation. In Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. Cited by: §1, §3.1.
 [25] (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266. Cited by: §1, §1, §1, §1, §3.1, Table 2.
 [26] (2016) SampleRNN: an unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837. Cited by: §1.
 [27] (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §3.2.
 [28] (2019) FurcaNeXt: end-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks. arXiv preprint arXiv:1902.04891. Cited by: §1, §1, §1, §4.1, Table 2.
 [29] (2018) Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185. Cited by: §1, §1.
 [30] (2018) End-to-end source separation with adaptive front-ends. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pp. 684–688. Cited by: §1, §1.
 [31] (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469. Cited by: §3.3.
 [32] (2018) End-to-end speech separation with unfolded iterative phase reconstruction. arXiv preprint arXiv:1804.10204. Cited by: Table 2.
 [33] (2019) Deep learning based phase reconstruction for speaker separation: a trigonometric perspective. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 71–75. Cited by: Table 2.
 [34] (2019, revised July) Meeting transcription using virtual microphone arrays. Technical Report MSR-TR-2019-11, Microsoft Research. Note: available at https://arxiv.org/abs/1905.02545. Cited by: §4.2.
 [35] (2017) Chunk-based bi-scale decoder for neural machine translation. arXiv preprint arXiv:1705.01452. Cited by: §1.