In real-world speech communication, speech signals are often corrupted by background noise or interfering speakers. It is therefore desirable to have a front-end speech processing module that removes the background noise or extracts the foreground speech for downstream tasks such as speaker verification [rao2019target] and speech recognition [zmolikova2017speaker, delcroix2018single].
Recent studies on speaker-independent blind source/speech separation (BSS) have seen major progress, for example DPCL [hershey2016deep, isik2016single, wang2018alternative], DANet [chen2017deep, luo2018speaker], PIT [yu2017permutation, kolbaek2017multitalker, xu2018single], and TasNet [luo2018real, luo2019conv]. BSS aims to separate each source from a given speech mixture of a known number of speakers. Speech separation may require an extra step to channel the right speaker to the right voice stream, which is known as the global permutation ambiguity problem [spex2020]. Speaker extraction takes a different strategy: it extracts only the target speaker's voice given a reference speech of that speaker. Speaker extraction thus avoids the global permutation ambiguity and does not require prior knowledge of the number of speakers in the mixture, but it does require a reference speech.
A common approach to speaker extraction operates in the frequency domain, for example SpeakerBeam [vzmolikova2017learning, delcroix2018single, delcroix2019compact, vzmolikova2019speakerbeam], VoiceFilter [wang2018voicefilter], and SBF-MTSAL-Concat [xu2019optimization]. Frequency-domain methods inherently suffer from the phase estimation problem during signal reconstruction. To avoid phase estimation, we recently proposed a time-domain solution called SpEx [spex2020], which exploits an MFCC-based speaker embedding obtained from a speaker encoder to form a top-down auditory attention to the target speaker. SpEx is called a time-domain solution because it takes the time-domain mixture speech as input and extracts the target speaker's voice through an encoder-extractor-decoder network. However, the top-down auditory attention in SpEx is supervised by a speaker encoder that takes spectral MFCC features as input. As a result, there is a mismatch between the latent feature spaces of the speech encoder and the speaker encoder, which limits the efficiency and effectiveness of the SpEx system.
To address this mismatch in SpEx, we propose a complete time-domain speaker extraction framework called SpEx+. We propose to share the network structure and weights between the two speech encoders. By doing so, the mixture speech input and the reference speech input are represented in a uniform latent feature space.
This paper is organized as follows. In Section 2, we motivate and design the proposed SpEx+ architecture. In Section 3, we report the experiments. Section 4 concludes the study.
2 SpEx+ Architecture
The proposed SpEx+ system consists of a speech encoder, a speaker encoder, a speaker extractor, and a speech decoder, as shown in Fig. 1, with an architecture similar to that of SpEx [spex2020]. The difference lies in the weight-shared speech encoders, also called the twin speech encoders.
2.1 Twin speech encoder
Speech has a rich temporal structure over multiple time scales that carries phonemic, prosodic and linguistic content [toledano2018multi]. It has been shown that speech analysis at multiple temporal resolutions leads to improved speech recognition performance [teng2016testing]. As shown in Fig. 1, we therefore implement multi-scale encoding in the speech encoder.
The speech encoder projects the input speech, either the mixture speech to be processed or the reference speech from the target speaker, into a common latent space. In SpEx+, the speech encoders support two processes, namely speaker encoding and speech extraction. As speaker encoding serves as the top-down voluntary focus for speech extraction, we believe it is beneficial for the two processes to share the same latent feature space. We therefore propose a weight sharing strategy between the two speech encoders, which have an identical network structure as illustrated in Fig. 1. The two encoders run separately because they process different speech inputs; we thus call them the twin speech encoders.
The twin speech encoders consist of several parallel 1-D CNNs with different filter lengths, which yield different temporal resolutions. Although the number of scales can vary, we study only three scales in this work. The multi-scale embedding coefficients $Y_1$, $Y_2$ and $Y_3$ are computed as
$$[Y_1, Y_2, Y_3] = \Theta(y; \theta),$$
where $\Theta(\cdot)$ represents a twin speech encoder and $\theta$ denotes its shared parameters. $L_1$, $L_2$ and $L_3$ are the filter lengths that capture the different temporal resolutions in the 1-D CNNs. To allow the embedding coefficients to be concatenated across scales, we use the same stride across all scales. $N$ is the number of filters in each 1-D CNN.
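The multi-scale encoding described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the class name, the small filter lengths, and the stride of $L_1/2$ are assumptions. Padding the longer-filter branches keeps the three scales frame-aligned so their coefficients can be concatenated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinSpeechEncoder(nn.Module):
    """Multi-scale 1-D CNN encoder shared between the mixture and the
    reference speech. Three filter lengths L1 < L2 < L3 give three
    temporal resolutions; a common stride (assumed L1 // 2 here) makes
    the three outputs frame-aligned for concatenation."""
    def __init__(self, n_filters=256, L1=20, L2=80, L3=160):
        super().__init__()
        self.L1, self.L2, self.L3 = L1, L2, L3
        stride = L1 // 2
        self.conv1 = nn.Conv1d(1, n_filters, L1, stride=stride)
        self.conv2 = nn.Conv1d(1, n_filters, L2, stride=stride)
        self.conv3 = nn.Conv1d(1, n_filters, L3, stride=stride)

    def forward(self, y):
        # y: (batch, samples); pad the longer-filter branches so all
        # three scales produce the same number of frames
        y = y.unsqueeze(1)
        y2 = F.pad(y, (0, self.L2 - self.L1))
        y3 = F.pad(y, (0, self.L3 - self.L1))
        Y1 = F.relu(self.conv1(y))
        Y2 = F.relu(self.conv2(y2))
        Y3 = F.relu(self.conv3(y3))
        return torch.cat([Y1, Y2, Y3], dim=1)  # (batch, 3N, frames)
```

Weight tying between the twin encoders then falls out naturally: the same module instance is simply called on both the mixture and the reference speech, while an un-tied variant would instantiate two separate copies.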
Because of its small filter sizes, the time-domain twin speech encoder uses a much smaller window than a typical short-time Fourier transform (STFT). As a result, the speech encoder generates embedding coefficients at the high frame rate needed by the speaker extractor, which is much higher than necessary for the speaker encoder. Such a high frame rate may lead to a vanishing gradient problem with a recurrent neural network (RNN). We therefore stack several residual convolutional blocks in the speaker encoder, as illustrated in Fig. 1, to obtain a suitable speaker embedding from the reference speech. In addition, each residual convolutional block applies max-pooling to discard silence.
2.2 Speaker encoder
The speaker encoder is designed to extract the speaker embedding of the target speaker from the reference speech. In practice, we apply a 1-D CNN to the embedding coefficients of the reference speech, followed by a stack of residual network (ResNet) blocks. A 1-D CNN together with a mean pooling operation then projects the representations into a fixed-dimensional utterance-level speaker embedding $v$. We have $v = g(Y^{(x)})$, where $g(\cdot)$ represents the speaker encoder and $Y^{(x)}$ denotes the embedding coefficients of the reference speech $x$.
A ResNet block, as shown in Fig. 2(b), consists of two CNNs and a 1-D max-pooling layer. A batch normalization (BN) layer and a parametric ReLU (PReLU) activation normalize and non-linearly transform the outputs of each CNN. A skip connection adds the inputs to the activations after the second BN layer. The 1-D max-pooling layer helps to discard silence; at the same time, it reduces the temporal dimension of the representations by a factor of 3.
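As a sketch, such a block might be implemented as below. The channel width and the 1x1 convolution kernel size are assumptions, since the exact values are not stated in this excerpt; the max-pooling kernel of 3 follows the factor-of-3 temporal reduction described above.

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Residual block of the speaker encoder: two 1-D convolutions,
    each followed by BatchNorm (with PReLU activations), a skip
    connection added after the second BN, then max-pooling that
    downsamples the frame axis by 3 to help discard silence."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.prelu1 = nn.PReLU()
        self.prelu2 = nn.PReLU()
        self.pool = nn.MaxPool1d(kernel_size=3)  # stride defaults to 3

    def forward(self, x):
        out = self.prelu1(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.prelu2(out + x)  # skip connection after second BN
        return self.pool(out)
```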
The training of the speaker encoder can be seen as a sub-task in the multi-task learning [spex2020] of the SpEx+ network. The speaker encoder is jointly optimized by weighting a cross-entropy loss for speaker classification and a signal reconstruction loss [wu2019time] for speaker extraction. During training, the gradients from the two loss functions are back-propagated to optimize the speaker encoder for a better speaker embedding. The details of the learning algorithm are discussed in Section 2.4.
2.3 Speaker extractor and speech decoder
The speaker extractor is designed to estimate a mask $M_i$ for the target speaker at each scale $i$, conditioned on the embedding coefficients $Y_i$ and the speaker embedding $v$. Similar to Conv-TasNet [luo2019conv], our speaker extractor repeats a stack of $B$ temporal convolutional network (TCN) blocks $R$ times. In this work, we keep $B$ and $R$ the same as in SpEx [spex2020]. Within each stack, the TCN blocks use an exponentially growing dilation factor in the dilated depth-wise separable convolution (De-CNN), as shown in Fig. 2(a). The first TCN block in each stack takes both the speaker embedding and the learned representations of the mixture speech as input; the speaker embedding is repeated along the time axis and concatenated to the feature dimension of the representations.
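A hedged sketch of one such TCN block follows; the channel sizes, the kernel size of 3, and the normalization choice are illustrative assumptions rather than the paper's exact configuration. The block keeps the frame length unchanged and, when given a speaker embedding, concatenates it along the channel dimension as described above.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One TCN block in the Conv-TasNet style: 1x1 conv -> PReLU ->
    norm -> dilated depth-wise conv (De-CNN) -> PReLU -> norm ->
    1x1 conv, with a residual connection. If spk_dim > 0, the
    utterance-level speaker embedding (repeated over time) is
    concatenated to the input channels, as in the first block of
    each stack."""
    def __init__(self, in_ch=256, hid_ch=512, kernel=3, dilation=1,
                 spk_dim=0):
        super().__init__()
        self.conv_in = nn.Conv1d(in_ch + spk_dim, hid_ch, 1)
        self.depthwise = nn.Conv1d(
            hid_ch, hid_ch, kernel, dilation=dilation,
            padding=dilation * (kernel - 1) // 2, groups=hid_ch)
        self.conv_out = nn.Conv1d(hid_ch, in_ch, 1)
        self.prelu1, self.prelu2 = nn.PReLU(), nn.PReLU()
        self.norm1 = nn.GroupNorm(1, hid_ch)
        self.norm2 = nn.GroupNorm(1, hid_ch)

    def forward(self, x, spk_emb=None):
        inp = x
        if spk_emb is not None:
            # repeat the embedding along the time axis, then concat
            spk = spk_emb.unsqueeze(-1).expand(-1, -1, x.shape[-1])
            inp = torch.cat([x, spk], dim=1)
        h = self.norm1(self.prelu1(self.conv_in(inp)))
        h = self.norm2(self.prelu2(self.depthwise(h)))
        return x + self.conv_out(h)  # residual connection
```

A stack with exponentially growing dilation would then be built as, e.g., `[TCNBlock(dilation=2**b, spk_dim=256 if b == 0 else 0) for b in range(B)]`.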
Once the masks $M_i$ at the various scales are estimated, the modulated responses are obtained by element-wise multiplication of the masks $M_i$ and the embedding coefficients $Y_i$. We then reconstruct the modulated responses into time-domain signals at multiple scales with the multi-scale speech decoder as follows:
$$s_i = d_i(M_i \otimes Y_i), \quad M_i = f_i(Y, v),$$
where $\otimes$ denotes element-wise multiplication, and $f(\cdot)$ and $d(\cdot)$ are the speaker extractor that estimates the masks and the speech decoder that reconstructs the signals, respectively.
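The mask modulation and multi-scale reconstruction can be sketched as follows. The filter lengths and the transposed-convolution decoder are illustrative assumptions; note that the three scales yield slightly different output lengths, which in practice would each be truncated to the mixture length.

```python
import torch
import torch.nn as nn

class MultiScaleDecoder(nn.Module):
    """Speech decoder sketch: one transposed 1-D convolution per
    scale maps the mask-modulated responses back to time-domain
    waveforms, mirroring the multi-scale encoder."""
    def __init__(self, n_filters=256, L1=20, L2=80, L3=160):
        super().__init__()
        stride = L1 // 2  # assumed to match the encoder stride
        self.dec1 = nn.ConvTranspose1d(n_filters, 1, L1, stride=stride)
        self.dec2 = nn.ConvTranspose1d(n_filters, 1, L2, stride=stride)
        self.dec3 = nn.ConvTranspose1d(n_filters, 1, L3, stride=stride)

    def forward(self, masks, Y):
        # masks, Y: per-scale (batch, N, frames) tensors;
        # element-wise modulation, then per-scale reconstruction
        s1 = self.dec1(masks[0] * Y[0]).squeeze(1)
        s2 = self.dec2(masks[1] * Y[1]).squeeze(1)
        s3 = self.dec3(masks[2] * Y[2]).squeeze(1)
        return s1, s2, s3
```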
2.4 Multi-task learning
The training of SpEx+ aims to achieve high-quality extracted speech and, at the same time, a highly discriminative speaker embedding for the reference speech. Such a strategy has proven to be beneficial in the SpEx study [spex2020]. We propose a multi-task learning implementation for SpEx+ training with two objectives: a multi-scale scale-invariant signal-to-distortion ratio (SI-SDR) loss for output speech quality, and a cross-entropy (CE) loss for speaker classification. The overall objective function is defined as
$$L(\Theta; y, x, s, I) = L_{\text{SI-SDR}}(\Theta; y, x, s) + \gamma L_{\text{CE}}(\Theta; y, x, I),$$
where $\Theta$ represents the model parameters, $y$ is the input mixture speech, $x$ is the reference speech of the target speaker, $s$ is the target clean speech, $I$ is a one-hot vector representing the true class label of the target speaker, and $\gamma$ is a scaling parameter.
$L_{\text{SI-SDR}}$ aims to minimize the signal reconstruction error and is defined as follows, with $\alpha$ and $\beta$ as the weights on the different scales:
$$L_{\text{SI-SDR}} = -\big[(1-\alpha-\beta)\,\rho(s_1, s) + \alpha\,\rho(s_2, s) + \beta\,\rho(s_3, s)\big],$$
where $\rho(\cdot,\cdot)$ denotes the SI-SDR, and $s_i$ and $s$ are the estimated signal at scale $i$ and the target clean signal, respectively. Their means are normalized to zero.
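The SI-SDR measure $\rho(\cdot,\cdot)$ can be sketched as below; this is a standard SI-SDR implementation with zero-mean normalization, not necessarily the authors' exact code.

```python
import torch

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB. Means are removed first; the target
    is rescaled by the optimal projection so the metric ignores gain."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # optimal scaling of the target onto the estimate
    scale = (estimate * target).sum(-1, keepdim=True) / (
        target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    noise = estimate - s_target
    ratio = s_target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)
```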
$L_{\text{CE}}$ is the cross-entropy loss for speaker classification, which is defined as
$$L_{\text{CE}} = -\sum_{i=1}^{N_s} I_i \log p_i,$$
where $N_s$ is the number of speakers in the speaker classification task, $I_i$ is the true class label for speaker $i$, $W$ represents a weight matrix, $\sigma(\cdot)$ is the softmax function, and $p_i = \sigma(W v)_i$ represents the predicted probability.
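Putting the two objectives together, the overall multi-task loss might look like this sketch. The scale weighting $(1-\alpha-\beta, \alpha, \beta)$ and the function names are assumptions, and the concrete weight values are left to the caller.

```python
import torch
import torch.nn.functional as F

def si_sdr(est, ref, eps=1e-8):
    # scale-invariant SDR (dB) with zero-mean normalization
    est = est - est.mean(-1, keepdim=True)
    ref = ref - ref.mean(-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (
        ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    return 10 * torch.log10(
        proj.pow(2).sum(-1) / ((est - proj).pow(2).sum(-1) + eps) + eps)

def spex_plus_loss(ests, target, logits, spk_id, alpha, beta, gamma):
    """Negative multi-scale SI-SDR plus a gamma-weighted cross-entropy
    speaker-classification loss (assumed scale weighting)."""
    s1, s2, s3 = ests
    sisdr = ((1 - alpha - beta) * si_sdr(s1, target)
             + alpha * si_sdr(s2, target)
             + beta * si_sdr(s3, target))
    return -sisdr.mean() + gamma * F.cross_entropy(logits, spk_id)
```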
3 Experiments and Discussion
3.1 Dataset

We simulated a two-speaker database, WSJ0-2mix-extr (https://github.com/xuchenglin28/speaker_extraction), at a sampling rate of 8 kHz based on the WSJ0 corpus. The simulated database contains 101 speakers and is divided into three sets: a training set (20,000 utterances), a development set (5,000 utterances), and a test set (3,000 utterances). Specifically, utterances from two speakers in the WSJ0 "si_tr_s" set were randomly selected to generate the training and development sets at a relative SNR between 0 and 5 dB. Similarly, the test set was generated by randomly mixing utterances from two speakers in the WSJ0 "si_dt_05" and "si_et_05" sets. Since these speakers were unseen during training, the test set is considered an open-condition evaluation.
During the data simulation, the first selected speaker was chosen as the target speaker, and the other was regarded as the interfering speaker. The required reference speech of the target speaker is a randomly selected utterance different from the target utterance in the mixture. The reference speech is used to obtain a speaker embedding that characterizes the target speaker.
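The mixing procedure can be sketched as follows; the function name and details are illustrative, and the actual simulation scripts may differ (e.g., in how unequal lengths are handled).

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Mix two utterances at a given target-to-interference SNR (dB),
    in the style of the WSJ0-2mix-extr simulation. Signals are
    truncated to the shorter length; the interference is rescaled so
    the target-to-interference power ratio matches snr_db."""
    n = min(len(target), len(interference))
    s = target[:n].astype(float)
    v = interference[:n].astype(float)
    p_s = np.mean(s ** 2)
    p_v = np.mean(v ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_v * 10 ** (snr_db / 10)))
    return s + scale * v
```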
3.2 Experimental setup
We trained all models for 100 epochs on 4-second segments. The learning rate decayed by a factor of 0.5 if the accuracy on the validation set did not improve for 2 consecutive epochs, and early stopping was applied if no better model was found on the validation set for 6 consecutive epochs. Adam was used as the optimizer. The filter lengths $L_1$, $L_2$ and $L_3$ of the convolutions in the speech encoder and decoder were set in milliseconds for speech at an 8 kHz sampling rate. The speaker extractor follows the same configuration as SpEx [spex2020]. The number of ResNet blocks in the auxiliary network was set to 3, and the speaker embedding dimension was set to 256. For the loss, the weights $\alpha$, $\beta$ and $\gamma$ were chosen to balance the training objectives.
For the TseNet [xu2019time] baseline, 400-dimensional i-vectors were extracted with the Kaldi toolkit as speaker embeddings. Its speech encoder, speaker extractor and speech decoder are the same as in SpEx+, except that a single scale is used. For the SpEx baseline, we improved the SDR from the original 15.1 dB to 17.15 dB by replacing the sigmoid activation function with ReLU for mask estimation. We also retained speech segments shorter than 4 seconds by padding, rather than discarding them as in the original SpEx system. The TseNet baseline was re-implemented with the same strategies. All systems were implemented with PyTorch (https://pytorch.org/).
To examine the benefits of the twin speech encoders, we conducted two experiments: 1) SpEx+ (un-tied), where the twin speech encoders share the same time-domain encoding network structure but their weights are trained independently; and 2) SpEx+ (tied), where the twin speech encoders share the same structure and their weights are tied and updated together. We believe the tied configuration allows SpEx+ to project both the mixture speech and the reference speech into the same latent space, which facilitates the extraction process.
3.3 Comparative study on WSJ0-2mix-extr
We compare the proposed SpEx+ network with the baseline systems in terms of SDR, SI-SDR and PESQ. From Table 1, we conclude that: 1) SpEx+ significantly outperforms the previous state-of-the-art TseNet and SpEx, with relative SI-SDR improvements of 23.6% and 9.1%, respectively. The improvements mainly come from the multi-scale time-domain speaker encoder in SpEx+. The time-domain speaker embedding characterizes the target speaker better than the i-vector and MFCC-based speaker embeddings. SpEx+ can be regarded as a complete end-to-end solution, as neither external speaker characterization (i.e., i-vector) nor frequency-domain feature extraction (i.e., MFCC) is required. 2) By sharing a common speech encoder between the mixture speech and the reference speech, SpEx+ outperforms SpEx through a unified latent feature space.
We further report the speaker extraction performance of the various systems on different-gender and same-gender mixtures separately in Table 2. The same-gender task is more challenging than the different-gender task and thus yields lower SDR and PESQ in general. SpEx+ outperforms all other systems on both tasks: it achieves 4.1% and 2.5% relative improvements over SpEx in SDR and PESQ on different-gender mixtures, and 13.9% and 5.7% on same-gender mixtures, respectively. Finally, we report the SDR distributions over the 3,000 test utterances in Fig. 3. SpEx+ extracts more utterances in the 15 dB to 25 dB range than the other systems.
3.4 Comparative Study on WSJ0-2mix
We first evaluate the effect of the duration of the reference speech on the WSJ0-2mix dataset, as reported in Table 3. We observe that a longer reference speech always leads to better performance, which confirms our intuition.
Table 3 (excerpt): SpEx+ (tied) | 7.3 s (avg.) | 17.2 | 16.9
We further compare a number of speaker extraction and speech separation techniques on WSJ0-2mix. From Table 4, we observe that SpEx+ achieves 13.7% and 4.8% relative SI-SDR improvements over Conv-TasNet [luo2019conv] for speech separation and SpEx [spex2020] for speaker extraction, respectively, both of which employ the same TCN blocks. We note that SpEx+ has an advantage over the speech separation techniques (see the BSS rows in Table 4) in dealing with an unknown number of speakers and the global permutation ambiguity. A recent study shows that replacing the TCN block with a dual-path RNN (DPRNN), as in DPRNN-TasNet [luo2020dual], improves speech separation performance. We will explore the use of DPRNN in SpEx+ in future work.
4 Conclusion

In this paper, we proposed a complete time-domain speaker extraction network. We proposed multi-scale twin speech encoders with shared weights to transform the mixture and reference speech into the same latent feature space, together with a multi-scale time-domain speaker encoder to obtain a speaker embedding that characterizes the target speaker and guides the speaker extraction from the mixture speech. Experiments showed that the proposed SpEx+ achieves significant performance improvements, especially in the same-gender mixture condition.