The speech signal captured by distant microphones often contains reverberation, noise, and overlapping speakers, resulting in low speech intelligibility for human listeners. In such situations, obtaining the signal from the target speaker requires the ability to perform dereverberation and source separation, with noise being viewed as a special source.
Despite the great success of speech separation on clean close-talk utterances, blind separation in a reverberant environment remains challenging. Most studies perform dereverberation and separation with tandem systems, each part of which is designed for a single task. The framework in (Nakatani2020DNNsupportedMC) integrates deep learning-based speech separation, statistical model-based dereverberation and beamforming. Another study (Fan2020SimultaneousDA) cascades networks to learn different targets, which obtained better performance than direct mapping from the reverberated mixture to the clean signal. Few studies have extended separation algorithms to reverberant multi-speaker signals within a unified deep learning-based framework.
More recently, the time-domain audio separation network (TasNet) has provided a novel separation scheme that works on time-domain representations with a time-domain convolutional encoder and decoder (Luo2018TasNetSI). The subsequent Conv-TasNet (Luo2019ConvTasNetSI) and other works (Shi2019FurcaxEM; Bahmaninezhad2019ACS) have demonstrated significant separation performance that even exceeds that of the ideal T-F masks. However, the architecture and permutation invariant training limit its flexibility, i.e., it can only be deployed to separate a fixed number of speakers. On the other hand, the deep attractor network (DAN) (Luo2018SpeakerIndependentSS) predicts frequency-domain masks by first generating high-level speaker embeddings, and then forming an attractor to extract the target speaker's sound. To the best of the authors' knowledge, the time-domain deep attractor network has not been studied.
In this study, we propose a novel time-domain deep attractor network (TD-DAN) for simultaneously performing both dereverberation and separation. The designed architecture consists of 2 parallel streams, a speaker encoding stream (SES) for speaker embedding modeling and a speech decoding stream (SDS) for target speaker extraction and dereverberation. The SES is trained with the reconstruction loss and the concentration loss, resulting in speaker embeddings suitable for clustering. Meanwhile, the SDS serves as an inference module which first models the speaker information using an approach similar to that of the SES, and then interacts with the SES to output waveforms. The proposed scheme makes the following contributions:
The TD-DAN is introduced with a two-stream architecture, which performs dereverberation and separation simultaneously.
A concentration loss is employed to bridge the gap between the oracle attractor and K-Means.
The TD-DAN is explored with different waveform encoders on the 2-speaker reverberant mixtures as well as the more challenging 3-speaker reverberant signals.
We have found that the TD-DAN can perform dereverberation and separation simultaneously, exceeding the Conv-TasNet by dB and dB on the 2- and 3-speaker development/evaluation (Dev./Eval.) set, respectively.
The rest of the paper is organized as follows. In Section 2, we review techniques related to the proposed method. In Section 3, we describe the proposed time-domain deep attractor network. Section 4 presents the experimental configuration. In Section 5, we present the results; conclusions and future plans follow.
2 Related work
Previous work on far-field speech separation has focused on the following issues: dereverberation, speech separation, and unified frameworks.
Dereverberation: To address the dereverberation problem, many algorithms have been proposed, such as beamforming (Schwartz2016JointML; Kodrasi2017EVDbasedMD; Nakatani2019MaximumLC) and blind inverse filtering (Schmid2012AnEA; Yoshioka2012GeneralizationOM). Weighted prediction error (WPE) was developed under the paradigm of blind inverse filtering and rose to prominence in the REVERB challenge (Kinoshita2016ASO). It minimizes the weighted prediction error by optimizing delayed linear filters to eliminate the detrimental late reverberation (Nakatani2010SpeechDB). Deep neural networks (DNNs) have been used to learn a spectral mapping from reverberant signals to anechoic ones (Geetha2017LearningSM). In practice, mask estimation is preferred for its better performance over direct mapping (Wang2014OnTT). Moreover, complex ideal ratio masks have been proposed to overcome the drawback that real-valued masks cannot reconstruct both the magnitude and the phase of the target signal (Williamson2017TimeFrequencyMI). Some researchers combine DNNs with WPE via deep learning-based energy variance estimation, leading to a non-iterative WPE algorithm (Heymann2019JointOO).
Paradigm of speech separation: Most architectures adopt one of two paradigms: permutation invariant training (PIT) or speaker clustering-based methods. PIT (Kolbaek2017MultitalkerSS) directly optimizes the reconstruction loss over all possible permutations. The drawback of PIT is that its architecture cannot deal with a variable number of speakers. Speaker clustering methods such as deep clustering (DC) (Hershey2016DeepCD; Wang2018AlternativeOF) are trained to generate discriminative speaker embeddings in each T-F bin, and use clustering algorithms to obtain the speaker assignment during the test phase. The DAN was developed following DC, but it directly optimizes the reconstruction of the spectrogram (Luo2018SpeakerIndependentSS). DC and DAN can deal with a variable number of speakers by setting the cluster number. Speech extraction differs from speech separation in its use of prior knowledge of the target speaker; for instance, SpeakerBeam extracts the voice of the target speaker with a pre-collected adaptation utterance (molkov2019SpeakerBeamSA).
Learning objective of speech separation: Most previous approaches predict T-F masks of the mixture signal. The commonly used masks are IBMs, ideal ratio masks (IRMs) and Wiener filter-like masks (WFMs) (Wang2014OnTT). Some approaches directly approximate the spectrogram of each source (Du2016ARA). Both mask estimation and spectrum prediction recover the waveform via the inverse short-time Fourier transform (iSTFT) of the estimated magnitude spectrogram of each source, together with the original or a modified phase. Recently, TasNet introduced a novel method for separating signals from the raw waveform. It utilizes 1-D convolutional filters to perform encoding and decoding on the generated spectro-temporal representations. A speech separation module accepts the representation and predicts each source's masks. Unlike the fixed weights in the short-time Fourier transform (STFT), TasNet learns the transform weights by optimizing the scale-invariant source-to-distortion ratio (SI-SDR) between the estimated and target source signals. Yet since TasNet can only separate a fixed number of speakers, it is less flexible than DC and DAN. Thus, how to perform time-domain speaker clustering-based separation remains an open problem.
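Since SI-SDR is the training objective and evaluation metric used throughout this paper, a short numpy sketch of its standard definition may help; the zero-mean preprocessing and `eps` guard are common conventions, not specifics from this paper.

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant source-to-distortion ratio in dB.

    Both signals are zero-meaned, then the reference is rescaled by the
    optimal least-squares projection, making the metric invariant to the
    estimate's overall gain.
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    # Optimal scaling of the reference onto the estimate.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))
```

Note that rescaling the estimate by any constant leaves the score unchanged, which is exactly the "scale-invariant" property the name refers to.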
Unified frameworks: Speech separation in a reverberant environment is a difficult task, as it must simultaneously address the dereverberation and separation problems. Most systems adopt algorithms in tandem; for example, the framework in (Nakatani2020DNNsupportedMC) combines weighted power minimization distortionless response (WPD) beamforming (Nakatani2019AUC), the noisy complex Gaussian mixture model (noisyCGMM) (Ito2018NoisyCC), and CNN-based PIT. A purely deep learning-based network was introduced for denoising and dereverberation by first learning noise-free deep embeddings and then performing mask-based dereverberation (Fan2020SimultaneousDA). TasNet achieved a SI-SDR of dB on WHAMR!, a reverberant version of WHAM!, whereas it obtained dB on the clean WSJ0-2MIX dataset (Maciejewski2019WHAMRNA). A cascaded framework of separation and dereverberation improves the performance to dB, still much lower than in the clean situation. The WHAMR! results indicate that simultaneous dereverberation and separation remains a difficult problem. Moreover, the much more challenging task of separating more than 2 speakers in a reverberant environment has not been previously explored.
3 Proposed method
In this section, we first formulate the problem and introduce the baseline DAN and Conv-TasNet. Following the design of the speaker attractor and the time convolutional network, 2 types of two-stream TD-DANs are put forward, one with hybrid encoders and the other with fully time-domain waveform encoders. Additionally, a clustering loss is proposed to improve the performance of K-Means.
3.1 Problem formulation
Assume that speech signals from speakers are captured by a distant microphone in a noisy reverberant environment. The captured signal is
where is the noise, is the reverberant source signal, decomposed as representing the direct sound and early reflection, and representing the late reverberation, respectively. For simplicity, is referred to as the early reflection in the following. STFT transforms the signal from the time domain to T-F representations, reformulating Eq.(1) as
with the frame number , maximum frequency index , frame index and frequency index . The early reflection and the late part are generated by convolution,
where is the transfer function with late reverberation starting from frame and ending at frame for frequency , and is the source signal for speaker in bin . As indicated in (Bradley2003OnTI), early reflections increase the speech intelligibility scores for both impaired and non-impaired listeners. Thus, in this study, the goal of dereverberation is to eliminate the late part .
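The early/late decomposition above can be sketched in the time domain by splitting the room impulse response (RIR) at a boundary after the direct path. This is a minimal illustration; the 50 ms boundary and 8 kHz rate are common conventions, not the paper's exact values.

```python
import numpy as np

def split_reverb(src, rir, fs=8000, early_ms=50):
    """Convolve a dry source with an RIR, split into early reflections
    (direct path plus the first `early_ms` ms) and late reverberation.

    By linearity of convolution, early + late reconstructs the full
    reverberant signal exactly.
    """
    boundary = int(fs * early_ms / 1000)
    rir_early = rir[:boundary]
    rir_late = rir[boundary:]
    early = np.convolve(src, rir_early)
    # Zero-pad the late kernel so its taps keep their original delays.
    late = np.convolve(src, np.concatenate([np.zeros(boundary), rir_late]))
    full = np.convolve(src, rir)
    return early, late, full
```

The dereverberation target in this paper corresponds to `early`; the component to be suppressed is `late`.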
The ideal masks are defined on the frequency domain over the sources. The IRM for speech separation only is expressed as
where is a modulus operation. In the reverberant environment, the IRM for dereverberated source is redefined as
where the interference signal is obtained by removing the early part of source , i.e., it includes both the late reverberation of the target source and other interference signals. Similarly, WFM is formulated as
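The mask definitions above are magnitude ratios between the target and interference spectrograms. A small numpy sketch of the IRM and WFM, under the assumption that the interference term already sums the other sources and the target's late reverberation as described:

```python
import numpy as np

def ideal_masks(target_spec, interference_spec, eps=1e-8):
    """IRM and WFM for one source, given magnitude spectrograms of the
    target (early reflections) and the summed interference (other
    sources plus the target's late reverberation).

    IRM uses first-order magnitudes, WFM uses power ratios.
    """
    t = np.abs(target_spec)
    i = np.abs(interference_spec)
    irm = t / (t + i + eps)
    wfm = t**2 / (t**2 + i**2 + eps)
    return irm, wfm
```

Both masks lie in [0, 1]; the WFM is sharper than the IRM for the same target-to-interference ratio, which is consistent with its Wiener-filter interpretation.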
3.2 Baseline DAN and Conv-TasNet
Our TD-DAN is inspired by the design of the speaker attractor and the time convolutional network (TCN) originally proposed in DAN and Conv-TasNet, respectively. We briefly introduce these 2 networks in the following.
3.2.1 Deep attractor network
The attractor is a speaker embedding indicating speaker information. As shown in Fig.(1), DAN accepts log power spectrum (LPS) and generates -dimensional speaker embeddings ,
where denotes the matrix form with subscripts representing the axes, and
is the LPS feature extractor. In the training phase, the attractor vector for speaker is obtained by averaging over the T-F bins,
where denotes the presence/absence of speech calculated by a threshold of power, and is the binary speaker assignment. Here, we use source signals to calculate , since is expected to indicate the source information and can be used to perform both separation and dereverberation. During the test phase, the attractors are obtained by K-Means clustering with prior knowledge of the speaker number,
The masks are estimated with Sigmoid activation,
where is the -dimensional attractor of speaker . The DAN is trained by minimizing the reconstruction loss for both separation and dereverberation,
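The attractor averaging and sigmoid mask estimation described above can be sketched as follows. Array names and shapes are illustrative (the paper's exact symbols were lost in extraction); at test time the attractors would instead come from K-Means over the embeddings.

```python
import numpy as np

def attractor_and_mask(V, Y, W):
    """Training-time attractor computation and mask estimation.

    V: (TF, K) T-F bin embeddings; Y: (TF, C) binary speaker
    assignment; W: (TF,) speech-presence weights. Returns attractors
    of shape (C, K) and sigmoid masks of shape (TF, C).
    """
    weights = Y * W[:, None]                                  # (TF, C)
    # Weighted average of embeddings over the bins each speaker dominates.
    attractors = (weights.T @ V) / (weights.sum(0)[:, None] + 1e-8)
    logits = V @ attractors.T                                 # (TF, C)
    masks = 1.0 / (1.0 + np.exp(-logits))                     # sigmoid
    return attractors, masks
```

Bins whose embeddings point toward a speaker's attractor receive a mask near 1 for that speaker, which is exactly the behaviour the reconstruction loss encourages.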
The optimization leads to speaker embeddings in which the vectors from the same speakers get closer and those from different speakers become more discriminative. However, due to , Eq.(14) may lead to performance degradation in clustering, which can be solved by adding an extra concentration loss (Sec.3.4).
The Conv-TasNet is a fully convolutional time-domain audio separation network, composed of a convolutional encoder, a separation module and a convolutional decoder. Multiple sequential TCN blocks with various dilation factors are stacked as the separation module. The fully convolutional architectures result in a small model size and can be deployed in a causal way. As plotted in Fig.(2), the encoder encodes the input mixture signal,
where is the time convolutional kernel, is the spectro-temporal representation. We use “Free” to indicate that the kernel parameters are learnable. The TCN-based separation module is trained to predict masks,
where is the estimated mask defined on the spectro-temporal representation. The decoder decodes the masked spectro-temporal representation and outputs the enhanced waveforms,
where is the time de-convolutional kernel. The Conv-TasNet is trained to optimize the SI-SDR. Utterance-level permutation invariant training (uPIT) is deployed to address the source permutation problem in the training phase (Kolbaek2017MultitalkerSS).
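The encoder-mask-decoder pipeline of Eq.(15)-(17) can be sketched in PyTorch. This is a deliberately tiny stand-in: the real Conv-TasNet separator stacks dilated TCN blocks with normalization and skip connections, which we replace here with a single 1x1 convolution, and the hyper-parameters (N=64 filters, kernel L=16, C=2 speakers) are illustrative only.

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Minimal encoder/separator/decoder skeleton in the spirit of
    Conv-TasNet. N: encoder filters, L: kernel size, C: speakers."""

    def __init__(self, N=64, L=16, C=2):
        super().__init__()
        self.C = C
        # Learnable ("free") analysis and synthesis transforms.
        self.encoder = nn.Conv1d(1, N, L, stride=L // 2, bias=False)
        self.separator = nn.Sequential(nn.Conv1d(N, C * N, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(N, 1, L, stride=L // 2, bias=False)

    def forward(self, x):                  # x: (B, 1, T)
        w = torch.relu(self.encoder(x))    # spectro-temporal rep (B, N, T')
        m = self.separator(w)              # masks for all speakers (B, C*N, T')
        m = m.view(x.size(0), self.C, -1, m.size(-1))
        # Mask the shared representation and decode one waveform per speaker.
        return torch.stack(
            [self.decoder(w * m[:, c]) for c in range(self.C)], dim=1)
```

With stride L/2 the transposed convolution restores the input length exactly when T - L is a multiple of L/2; uPIT would then match the C output channels to the reference sources during training.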
3.3 Time-domain deep attractor network
The TD-DAN is a two-stream architecture, with a speaker encoding stream (SES) for speaker embedding modeling and a speech decoding stream (SDS) for dereverberation and speaker extraction. We separate the task into these two parts and jointly train the streams with a multi-task loss. We first describe the two-stream architecture together with the hybrid waveform encoders, and then move on to the fully time-domain encoders.
3.3.1 TD-DAN with hybrid encoders
As plotted in Fig.(3), the SES is similar to the DAN network: it accepts the log power spectrum (LPS) and calculates the masks with the speaker embeddings and attractors. The whole feed-forward procedure follows Eq.(9)-(14).
The SDS models the input signal with a convolutional encoder and stacked TCNs, which can be viewed as a dereverberation process,
where is the -dimensional high-level representation. The SES interacts with the SDS through a linear transformation of the attractor. Then the SDS accepts the transformed attractor to calculate the masks and finally outputs the dereverberated and separated signal,
where Linear is the linear transformation communicating between the SES and SDS. The model is trained end-to-end by optimizing a multi-task loss,
where is the loss balance factor.
This TD-DAN uses hybrid encoders because the SES is encoded by the STFT transform, while the SDS is encoded by a convolutional encoder with free kernels. Nevertheless, it is designated as a time-domain DAN since it is trained to predict waveforms directly.
3.3.2 TD-DAN with fully time-domain encoders
Here, we replace the waveform encoder in the TD-DAN SES with time-domain convolutional kernels. The problem is the definition of the IBMs in the spectro-temporal representations, which were originally computed based on the spectrogram (Eq.(11)). Formally, the time-domain SES encoder encodes the mixture signal into ,
By setting the magnitude of the signal as , its IBM is formulated similarly,
We introduce two time-domain kernels, the stacked time-domain STFT kernel and the free kernel :
The STFT convolutional encoder . The STFT transform is split into the real and the imaginary part with stacked convolutional kernel expressed as,
where usually equals , the columns of are convolutional kernels, is the sample index within a convolutional kernel of size , is the kernel index corresponding to the STFT frequency, and is the pre-designed analysis window. This kernel differs from the STFT in that it stacks the real and the imaginary parts of the spectrum, so it can be implemented with real-valued convolutional operations.
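The stacked real/imaginary STFT kernel can be constructed explicitly. The sketch below assumes a Hann analysis window and a 256-sample kernel, which are illustrative choices; applying the kernel to a windowed frame reproduces the real and imaginary parts of `numpy.fft.rfft`.

```python
import numpy as np

def stft_conv_kernel(win_size=256):
    """Stacked real/imaginary STFT kernel as a convolution weight of
    shape (2F, win_size), with F = win_size // 2 + 1 frequencies.

    Row k (k < F) holds cos(2*pi*k*n/N) * window, row F + k holds
    -sin(2*pi*k*n/N) * window, so a matrix-vector product with a frame
    yields [Re(STFT); Im(STFT)] for that frame.
    """
    F = win_size // 2 + 1
    n = np.arange(win_size)
    k = np.arange(F)[:, None]
    window = np.hanning(win_size)
    real = np.cos(2 * np.pi * k * n / win_size) * window
    imag = -np.sin(2 * np.pi * k * n / win_size) * window
    return np.concatenate([real, imag], axis=0)   # (2F, win_size)
```

In a network, these rows would serve as fixed 1-D convolution filters (one filter per row, stride equal to the hop size), or as the initialization that is later fine-tuned into a free kernel.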
The free convolutional encoder , whose convolutional kernel is trained together with the dereverberation and separation tasks.
The whole procedure with fully time-domain encoders follows Fig.(3): the attractor is obtained from masks defined by and calculated by Eq.(10)-(14), and the dereverberation and separation are conducted following Eq.(18)-(20). The speech presence assignment is obtained by a threshold on the magnitude of the spectro-temporal representations. The network is trained to optimize the multi-task loss (Eq.(21)).
3.4 Auxiliary clustering loss
The reconstruction loss (Eq.(14)) implies that the mask will be close to 1 if the T-F bin embeddings are close to the speaker attractor, and close to 0 otherwise. The sparsity assumption states that the observed signal contains at most one source in each T-F bin, which ensures the clustering performance of the DAN, since most embeddings are optimized to lie close to some attractor so as to achieve binary-like masks. However, the reverberant signal may not follow the sparsity assumption. The distribution of the mixture signal with early reflection and reverberation is plotted in Fig.(4). Notably, a large fraction of T-F bins have a high IRM value in the mixture of early reflections, while in the reverberant signal this percentage declines significantly. This occurs because the IRM of early reflections is the ratio of the target early part against the interfering ones, while for Eq.(6) it is the ratio of the target early reflection against the target late reverberation plus the interfering early and late reverberation. The lack of high-value T-F bin masks indicates the difficulty of embedding clustering.
To achieve better clustering performance, we introduce the clustering loss, including the concentration loss and the discriminative loss. The concentration loss is designed for all DAN-based models, expressed as,
Its gradient is
which enforces the embedding to be close to the attractor when the bin is dominated by speaker . The within-class concentration loss may conflict with Eq.(14) to some degree, i.e., the loss pushes the embeddings to concentrate around the attractors, which may lead to a suboptimal reconstruction loss. In practice, however, jointly optimizing the reconstruction and concentration losses leads to better performance, as illustrated in Sec.5.
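A concentration loss of this within-class form can be sketched as a presence-weighted squared distance between each bin's embedding and the attractor of its dominant speaker. Array names and the normalization are illustrative assumptions, since the paper's exact formula was lost in extraction.

```python
import numpy as np

def concentration_loss(V, attractors, Y, W):
    """Within-class concentration: weighted squared distance between
    each T-F embedding and its dominant speaker's attractor.

    V: (TF, K) embeddings; attractors: (C, K); Y: (TF, C) one-hot
    speaker dominance; W: (TF,) speech-presence weights.
    """
    assigned = Y @ attractors                     # (TF, K) target attractor per bin
    d2 = ((V - assigned) ** 2).sum(-1)            # (TF,) squared distances
    return (W * d2).sum() / (W.sum() + 1e-8)
```

Minimizing this term tightens each speaker's embedding cluster, which is what closes the gap between the oracle attractors and those recovered by K-Means at test time.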
Another inter-class discrimination loss maximizes the distance among different attractors,
In fact, Eq.(14) already includes the optimization of discrimination, whereby the attractor distance is enlarged if is close to . The discrimination loss here is designed for the free convolutional kernel in the SES, to avoid the degenerate point at which a small results in small as well as ambiguous attractors for different speakers. The training loss is updated to,
where and are factors for concentration and discrimination losses.
4 Experimental configuration
4.1 Dataset
The experiments were conducted on the Spatialized Multi-Speaker Wall Street Journal (SMS-WSJ)111https://github.com/fgnt/sms_wsj dataset, which artificially spatializes and mixes utterances taken from WSJ (Drude2019SMSWSJDP). It differs from the spatialized version of WSJ0-2MIX (Wang2018MultiChannelDC) in that it considers all WSJ0+1 utterances and strictly separates the speakers of the training, validation and test sets. The room impulse responses were randomly sampled with different room sizes, array centers, array rotations, and source positions. The sound decay time (T60) was sampled uniformly from to . The simulated -channel audios contained early reflections (), late reverberation (), and white noise. The detailed statistics of the dataset are listed in Table.1. Meanwhile, we also simulated a -speaker dataset as a more challenging task, which used the same RIRs and utterance split as the SMS-WSJ dataset.
In our experiments, we only used the first channel of the multi-channel signal. The networks were trained to map the reverberant multi-speaker signal to early reflections. As demonstrated in (Drude2019SMSWSJDP), the early reflections were close to the source signal in the measurement of speech intelligibility.
4.2 Training settings
The experiments were conducted with Asteroid, an audio source separation toolkit222https://github.com/mpariente/asteroid based on PyTorch (Paszke2017AutomaticDI). The baseline Conv-TasNet and DAN followed the settings in (Luo2019ConvTasNetSI) and (Luo2018SpeakerIndependentSS), respectively. We changed the DAN architecture from -layer bi-directional long short-term memory (BLSTM) to TCN blocks, which led to smaller models and allowed for a fair comparison among different frameworks.
The two-stream TD-DAN is composed of the SES and the SDS, which adopt architectures corresponding to the baseline DAN and Conv-TasNet, respectively. Following the hyper-parameter notations in (Luo2019ConvTasNetSI), we list the architectures in Table.2. The SES in Sec.3.3.1 accepts the LPS, computed using the STFT with window size and stride . The SES in Sec.3.3.2 utilized convolutional kernels with the same settings as the LPS, and was then fine-tuned to and . The power threshold was set to keep the top and the top of the mixture spectrogram bins for the LPS encoder and the time-domain encoder, respectively. The loss factor was only applied for the SES with free kernels, which was trained by fine-tuning the network with stacked STFT kernels for epoch.
We used Adam (Kingma2015AdamAM) with the learning rate starting from and then halved if no best validation model was found in epochs. The maximum number of epochs was set to . The TD-DANs were trained with -second segments and a batch size of on GPUs.
5 Results and discussion
In this section, we first present the results of the baseline and the different TD-DANs. Then we describe the experimental results on the dataset with variable numbers of speakers. Finally, we discuss the performance for further understanding and improvement.
(Table 3 columns: Model | Architecture | Waveform encoder | Concentration loss | SI-SDR (dB))
5.1 Comparison of the baseline and TD-DANs
The 1st part of Table.3 displays the results of our baseline models, Conv-TasNet and DAN. The Conv-TasNet achieved SI-SDRs of dB, dB better than those of the DAN without the concentration loss on the Dev./Eval. sets. It is clear that the concentration loss largely bridged the gap between the oracle attractor and the attractors obtained by K-Means, which was reduced from to dB.
The TD-DAN was designed following the architectures of DAN and TasNet: the SES corresponds to the DAN while the SDS follows the Conv-TasNet, allowing a fair comparison with the baselines. As listed in the 2nd part of Table.3, the SES with the LPS encoder combined with the TasNet-based SDS gave SI-SDRs of dB, exceeding those of Conv-TasNet by around dB on the Dev./Eval. sets. The concentration loss showed its efficiency with an improvement of dB. The results demonstrate the feasibility of the TD-DAN architecture, which solves the problems of dereverberation and separation with two parallel streams. Unlike tandem systems, the whole procedure requires no extra information such as anechoic multi-speaker signals.
As listed in the 3rd part of Table.3, the time-domain encoder of the SES was explored with the fixed STFT kernel and the free kernel. The fixed STFT kernel exhibited performance degradation with the same window settings as the LPS. After the window settings were adjusted, the SES encoder was able to achieve SI-SDRs of dB. By setting the STFT kernel as trainable network parameters, the TD-DAN with the free waveform encoder achieved higher SI-SDRs of dB, dB higher than those of the TD-DAN with the LPS encoder.
(Table 4 columns: Model | Speaker # | Waveform | 2 speakers | 3 speakers)
5.2 Experiments on mixtures with variable numbers of speakers
The merit of the DAN is that it can deal with mixture signals with variable numbers of speakers. To validate this feature, we further trained the TD-DAN on the datasets with both and speakers. The experimental results are listed in Table.4. The TD-DAN trained on the -speaker dataset was able to separate -speaker mixture signals, with SI-SDR gains of dB and dB for the LPS and the STFT waveform encoder on the Dev./Eval. sets, respectively. After being fine-tuned on the concatenated dataset, the TD-DAN achieved SI-SDRs of dB and dB with the LPS and the STFT waveform encoder, respectively. The TD-DAN with the free waveform encoder was also tested but showed no further improvement. On both the - and -speaker datasets, the TD-DAN employing the SES with the STFT waveform encoder achieved the best performance on all signal measurement scores.
The performance of the ideal masks is listed in the 2nd part of Table.4. IRM (Eq.(6)) outperformed the other masks in SI-SDR, while WFM (Eq.(8)) achieved the best STOI. Notably, the proposed TD-DAN exceeded IRM (Eq.(5)) and WFM (Eq.(7)) on the -speaker dataset. The SI-SDR gap of dB between IRM (Eq.(6)) and the TD-DAN on the -speaker dataset indicates that performing multi-speaker separation and dereverberation simultaneously remains a difficult task even for time-domain techniques.
5.3 Visualization and discussion
Fig.5 plots examples of the spectrograms of the mixture and the enhanced signal. The proposed TD-DAN separated and dereverberated the reverberant mixture signal so that it was closer to the early reflections. Table.5 presents the performance under different reverberation times. With larger T60 and more speakers, the performance of the TD-DAN worsens, implying that reverberation makes the separation task much more difficult.
6 Conclusions
In this paper, we explored TD-DAN frameworks for speech separation tasks in a reverberant environment with different waveform encoders, including the LPS encoder and the stacked time-domain STFT and free convolutional kernels. The experimental results showed that the TD-DAN with the fixed STFT encoder achieved the best performance, surpassing the baseline TasNet by SI-SDRs of dB and dB on the 2- and 3-speaker Dev./Eval. sets, respectively. In future work, we anticipate further exploring the free waveform encoder. Moreover, multi-channel information is expected to be utilized for better dereverberation and separation.
This work is partially supported by the Strategic Priority Research Program of Chinese Academy of Sciences (No. XDC08010300), the National Natural Science Foundation of China (Nos. 11590772, 11590774, 11590770, 11774380).