Recently, deep learning-based speech separation approaches have attracted increasing attention [9, 13, 12, 16]. Earlier approaches such as deep clustering and permutation invariant training (PIT) performed processing in the frequency domain and generated time-frequency masks for each source in the mixture. More recently, a convolutional time-domain audio separation network (Conv-TasNet) has been proposed and has led to large improvements in separation performance, surpassing ideal time-frequency masking [16, 17, 21]. The separation performance of TasNet has been further improved by exploiting spatial information when a microphone array is available.
Despite the great success of neural network-based speech separation, it requires knowing or estimating the number of sources in the mixture and still suffers from a global permutation ambiguity issue, i.e. an arbitrary mapping between source speakers and outputs. These limitations arguably restrict the practical use of speech separation. In contrast, target speech extraction exploits an auxiliary clue to identify the target speaker in the mixture and extracts only the speech of that speaker. Since our initial work [30, 31], research on target speech extraction has gained increasing attention [3, 7, 25, 29, 15], as it naturally avoids the global permutation ambiguity issue and does not require knowing the number of sources in the mixture.
Our previous work introduced SpeakerBeam, a target speech extraction method that exploits a speaker embedding vector derived from an adaptation or enrollment utterance of the target speaker to guide a neural network towards extracting the speech of that speaker. This is realized by combining two networks: a sequence summary network that computes the speaker embedding vector from the amplitude spectrum of the adaptation utterance, and a speech extraction network that accepts the amplitude spectrum of the speech mixture and the embedding vector as inputs and generates a time-frequency mask for extracting the target speaker. In this paper, we call this approach frequency-domain SpeakerBeam (FD-SpeakerBeam).
We have shown that FD-SpeakerBeam can achieve competitive speech extraction performance and can be used as a front-end for automatic speech recognition (ASR). However, we observed a large performance gap between same-gender and different-gender mixtures. It is indeed difficult to discriminate a target speaker in a mixture when the speakers have similar voice characteristics.
In this paper, we investigate strategies to tackle this issue. First, following the success of TasNet, we propose a time-domain implementation of SpeakerBeam (TD-SpeakerBeam), whose speech extraction network accepts the time-domain signal of the mixture and directly outputs the time-domain signal of the target speaker. We also replace the sequence summary network with a convolutional network to obtain richer speaker embedding vectors.
Moreover, to further improve speaker discrimination capability, we extend TD-SpeakerBeam to accept spatial information from microphone array recordings as additional input features. We argue that simply adding spatial features to the input of TD-SpeakerBeam may limit the potential to process spatial information. Consequently, we propose an alternative approach, called internal combination, for exploiting spatial information more effectively within the SpeakerBeam framework.
Finally, to encourage the network to learn more discriminative speaker embedding vectors, we propose training SpeakerBeam with a multi-task loss that combines a speech reconstruction loss with a speaker identification loss (SI-loss).
We performed experiments on two datasets, which show that (1) TD-SpeakerBeam greatly improves target speech extraction performance and outperforms a competitive system based on TasNet separation followed by an x-vector-based target speech selection module; (2) exploiting spatial features with the proposed internal combination helps target speaker extraction, especially for same-gender mixtures; (3) the additional SI-loss consistently improves performance when a sufficient number of speakers is included in the training data; and (4) when the number of training speakers is varied, TasNet performance does not change significantly, whereas SpeakerBeam benefits greatly from more training speakers, especially for same-gender mixtures, because more speakers help improve target speaker identification. These results confirm the effectiveness of the proposed strategies for improving the target speaker discrimination capability of SpeakerBeam.
2 Proposed time-domain SpeakerBeam
Let us first describe the implementation of TD-SpeakerBeam. Then, in section 2.2, we discuss approaches for exploiting spatial information when microphone array recordings are available. Finally, in section 2.3, we introduce a multi-task loss to improve speaker discrimination even when only a single microphone is available.
Figure 1-(a) is a block diagram of the proposed TD-SpeakerBeam. Let $y$, $a^s$, and $\hat{x}^s$ be the time-domain signals of the speech mixture, the adaptation utterance, and the estimated target speech for target speaker $s$, respectively. SpeakerBeam is composed of two networks, an extraction network and an auxiliary network. In the original FD-SpeakerBeam, these networks accept the amplitude spectra of the mixture and adaptation signals as inputs and generate a time-frequency mask. In this paper, we modify the implementation of these networks to input and output time-domain signals.
The time-domain extraction network follows a configuration similar to that of Conv-TasNet, i.e. it consists of a 1D convolution layer that accepts the mixture signal (encoder layer), several convolution blocks, and finally a 1D deconvolution layer (decoder layer) that outputs the extracted speech signal in the time domain, $\hat{x}^s$.
There are two major differences with Conv-TasNet. First, the output consists of a single signal corresponding to the target speech only. Second, we insert an adaptation layer between the first and second convolution blocks to drive the network towards extracting the target speaker (in preliminary experiments, we found that placing the adaptation layer after the first convolution block achieved the best performance). The adaptation layer accepts a speaker embedding vector of the target speaker, $e^s$, as auxiliary information. We use a multiplicative adaptation layer following our previous work, although other adaptation layers could be used [31, 29].
The target speaker embedding vector, $e^s$, is computed by the time-domain auxiliary network. In the original FD-SpeakerBeam, the auxiliary network consists of a sequence summary network, i.e. a few fully connected layers followed by a time-averaging operation. Here, we propose using a convolutional auxiliary network that accepts the time-domain signal of the adaptation utterance, $a^s$. The auxiliary network consists of an encoder layer and a single convolution block similar to those used in the extraction network.
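As a rough illustration of the multiplicative adaptation layer, the sketch below scales the hidden representation by the speaker embedding, frame by frame. It assumes that the embedding holds one weight per channel of the hidden representation and that the same weights are applied at every time frame. The function name and the list-based data layout are hypothetical and are not taken from the authors' implementation, which would operate on tensors.

```python
def multiplicative_adaptation(hidden, embedding):
    """Scale each frame of `hidden` element-wise by the speaker embedding.

    hidden:    list of T frames, each a list of B channel activations
               (assumed output of the first convolution block).
    embedding: list of B weights (assumed speaker embedding vector).
    Returns a new list of T adapted frames.
    """
    for frame in hidden:
        if len(frame) != len(embedding):
            raise ValueError("embedding size must match the number of channels")
    # The same channel-wise weights are applied at every time frame.
    return [[h * e for h, e in zip(frame, embedding)] for frame in hidden]


if __name__ == "__main__":
    hidden = [[1.0, 2.0], [3.0, 4.0]]   # T = 2 frames, B = 2 channels
    embedding = [0.5, 2.0]
    print(multiplicative_adaptation(hidden, embedding))
```

Because the embedding only rescales channels, the extraction network has to learn a representation in which different speakers occupy different channels, which is one way to read the role of the first convolution block.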
2.2 Spatial features
Spatial information extracted from multi-channel recordings can provide an alternative source of information about the mixtures that could help discriminate speakers better. Several works have shown the benefit of adding spatial features to the input of speech enhancement networks [1, 26, 2]. For example, inter-microphone phase difference (IPD) features were recently shown to improve the separation performance of TasNet in reverberant conditions. The IPD of the mixture signal between two microphones is defined as
\[ \phi_{t,f} = \angle Y^{(1)}_{t,f} - \angle Y^{(2)}_{t,f}, \]
where $Y^{(m)}_{t,f}$ is the short-time Fourier transform (STFT) coefficient of the mixture signal at microphone $m$, $t$ is the time-frame index, and $f$ is the frequency index. Here we limit our investigation to the two-microphone case. Following that work, we use the cosine and sine of the IPDs as spatial features,
\[ \mathbf{p}_t = [\cos\phi_{t,1}, \ldots, \cos\phi_{t,F}, \sin\phi_{t,1}, \ldots, \sin\phi_{t,F}], \]
where $F$ is the number of frequency bins. Note that the frame size and window shift of the STFT used to compute the IPD features may differ from the window size and shift used in the encoder of the extraction network. The IPD features are thus upsampled to match the settings of the extraction network.
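The IPD feature computation described above can be sketched as follows. The sketch takes already computed STFT coefficients of the two microphones as input and uses only the Python standard library. The nearest-neighbour upsampling is an assumption, since the text states only that the features are upsampled, and the function names and the ordering of the cosine and sine terms are hypothetical.

```python
import cmath
import math


def ipd_features(stft_mic1, stft_mic2):
    """Cosine and sine of the inter-microphone phase difference.

    stft_mic1, stft_mic2: lists of T frames, each a list of F complex
    STFT coefficients of the mixture at microphones 1 and 2.
    Returns T feature vectors of length 2 * F (cosines, then sines).
    """
    features = []
    for frame1, frame2 in zip(stft_mic1, stft_mic2):
        ipd = [cmath.phase(c1) - cmath.phase(c2)
               for c1, c2 in zip(frame1, frame2)]
        features.append([math.cos(p) for p in ipd] +
                        [math.sin(p) for p in ipd])
    return features


def upsample_nearest(frames, target_len):
    """Repeat frames so that the sequence has `target_len` entries.

    Nearest-neighbour upsampling is an assumed choice for matching the
    frame rate of the time-domain encoder.
    """
    n = len(frames)
    return [frames[min(n - 1, (i * n) // target_len)]
            for i in range(target_len)]


if __name__ == "__main__":
    mic1 = [[1j, 1 + 0j]]        # one frame, two frequency bins
    mic2 = [[1 + 0j, 1 + 0j]]
    feats = ipd_features(mic1, mic2)
    print(upsample_nearest(feats, 3))
```

Using the cosine and sine rather than the raw phase difference avoids the discontinuity at the phase wrapping point, since both functions are periodic in the IPD.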
IPD features provide spatial information related to the direction of the sources in the mixture. SpeakerBeam extracts the target speaker based on the speaker embedding vector, $e^s$, which may represent “spectral” information about the target speaker but does not include spatial information. (Strictly speaking, this is not the usual spectral information, as we use a learnable convolutional encoder layer instead of the STFT to analyze the signal.) Consequently, it is not obvious how to efficiently combine the IPD features and the target speaker embedding vector, as they represent different information. In this paper, we consider two schemes, input combination and internal combination.
The input combination is similar to that proposed for TasNet, where the IPD features (processed with a convolutional layer and upsampled) are concatenated to the output of the encoder layer of the extraction network. Input combination may force the first convolution block to combine spatial information from the IPD features and “spectral” information from the mixture signal into a “spectral” representation, which allows the adaptation layer (coming after the first convolution block) to select the target speaker by comparing this “spectral” representation with the target speaker embedding vector, $e^s$. This may reduce the ability of the upper layers of the network to fully exploit spatial information.
Figure 1-(b) shows a schematic diagram of TD-SpeakerBeam with the alternative internal IPD combination. Here, the IPD features (processed with a 1D convolutional layer, upsampling, and a convolution block) are combined with the network representation after the adaptation layer. This lets the speaker selection operate on the “spectral” information only, while the upper layers can exploit the spatial information without being obstructed by the adaptation layer.
2.3 Multi-task learning with additional SI-loss
The extraction and auxiliary networks are trained jointly from random initialization given the speech mixtures, $y$, the adaptation utterances, $a^s$, and the target speech signals, $x^s$. In our previous works [30, 31], we trained SpeakerBeam using only a target speech reconstruction loss. In this paper, we propose training TD-SpeakerBeam with a multi-task loss that combines the scale-invariant source-to-noise ratio (SiSNR) as the signal reconstruction loss with a cross-entropy-based SI-loss. The SI-loss is used to obtain more discriminative speaker embedding vectors. The multi-task loss is given by
\[ \mathcal{L}(\Theta \mid y, a^s, x^s, l^s) = -\mathrm{SiSNR}(x^s, \hat{x}^s) + \alpha \, \mathrm{CE}\big(l^s, \sigma(W e^s)\big), \]
where $\Theta$ are the model parameters, $l^s$ is a one-hot vector representing the target speaker ID, $\mathrm{SiSNR}(x^s, \hat{x}^s)$ is the SiSNR between the estimated and true target speech, $\mathrm{CE}(l^s, \sigma(W e^s))$ is the cross entropy between the speaker label and the speaker embedding vector projected onto the training speaker space, $\sigma(W e^s)$, $W$ is a projection matrix, $\sigma(\cdot)$ is a softmax function, and $\alpha$ is a scaling parameter. The SiSNR term is written with a negative sign so that minimizing the loss maximizes the SiSNR.
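To make the multi-task loss concrete, the following sketch computes it for a single training example using only the Python standard library. It assumes the common SiSNR definition (zero-mean signals, projection of the estimate onto the target) and writes the loss as a quantity to be minimized, so the SiSNR enters with a negative sign. The function names, the `eps` constant, and the value of `alpha` used in the example are hypothetical; the weight used in the paper is not given in this section.

```python
import math


def si_snr(target, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between a target and an estimate."""
    t_mean = sum(target) / len(target)
    e_mean = sum(estimate) / len(estimate)
    t = [v - t_mean for v in target]
    e = [v - e_mean for v in estimate]
    # Project the estimate onto the target to obtain the scaled target.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(a * a for a in t) + eps)
    s_target = [scale * a for a in t]
    e_noise = [a - b for a, b in zip(e, s_target)]
    num = sum(a * a for a in s_target)
    den = sum(a * a for a in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def speaker_id_loss(embedding, projection, speaker_index):
    """Cross entropy between a one-hot speaker label and softmax(W e)."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in projection]
    return -math.log(softmax(logits)[speaker_index])


def multitask_loss(target, estimate, embedding, projection,
                   speaker_index, alpha):
    """Negative SiSNR plus alpha times the speaker identification loss."""
    return (-si_snr(target, estimate)
            + alpha * speaker_id_loss(embedding, projection, speaker_index))


if __name__ == "__main__":
    target = [1.0, -1.0, 1.0, -1.0]
    estimate = [0.9, -1.1, 1.2, -0.8]
    W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 training speakers
    print(multitask_loss(target, estimate, [0.2, 0.7], W, 1, alpha=0.5))
```

The projection matrix is needed only during training to compute the SI-loss. At test time the embedding is used directly by the adaptation layer, so the target speaker does not have to be among the training speakers.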
3 Related prior work
An alternative way to perform target speech extraction consists of performing speech separation followed by target speaker selection from the separated signals. Such a scheme was proposed for the deep attractor network but, to the best of our knowledge, has not been investigated for time-domain separation approaches. In the experiments, we compare TD-SpeakerBeam with TasNet separation followed by x-vector-based speaker selection, which can be considered a strong baseline for target speech extraction.
Following previous work on multi-channel source separation [26, 2], we consider IPD features to be good candidates for improving extraction performance. Besides adding IPD features to the extraction network, an alternative approach was recently proposed, in which a set of fixed beamformers combined with an attention module on the beamformer outputs performs a rough initial target speech extraction, followed by a refinement step with FD-SpeakerBeam. Investigating such a scheme with TD-SpeakerBeam or other spatial features will be part of our future work.
We performed experiments using two datasets, multi-channel WSJ0-2mix (MC-WSJ0-2mix) and CSJ-2mix. Table 1 shows the amount of data and the number of female and male speakers in the training and test sets.
MC-WSJ0-2mix is a publicly available multi-channel version of the WSJ0-2mix corpus that consists of mixtures of WSJ0 utterances. The multi-channel recordings are generated by convolving clean speech signals with room impulse responses simulated with the image method for reverberation times of up to about 600 ms. The dataset consists of 8-channel recordings, but we use only 2 channels in our experiments. This dataset has only 101 training speakers. We therefore use it only to investigate the use of spatial features.
The second dataset consists of single-channel 2-speaker mixtures that we simulated by mixing utterances from the Corpus of Spontaneous Japanese (CSJ) at SNRs between -5 and 5 dB. This dataset has a larger number of training speakers (937 speakers) and is used to investigate the effect of the SI-loss and the impact of the number of training speakers.
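A minimal sketch of how two utterances can be mixed at a given SNR is shown below. It assumes that the SNR is defined as the power ratio between the target and the interfering speaker and that the longer signal is truncated to the length of the shorter one; neither detail is specified in the text, and the function name is hypothetical.

```python
import math


def mix_at_snr(target, interference, snr_db):
    """Mix two signals so the target-to-interference ratio equals snr_db.

    Returns the mixture and the rescaled interference. Both inputs are
    truncated to the shorter length (an assumed convention).
    """
    n = min(len(target), len(interference))
    target, interference = target[:n], interference[:n]
    p_target = sum(v * v for v in target) / n
    p_interf = sum(v * v for v in interference) / n
    if p_target == 0.0 or p_interf == 0.0:
        raise ValueError("both signals must have non-zero power")
    # Gain that brings the interference to the requested relative level.
    gain = math.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * v for v in interference]
    mixture = [a + b for a, b in zip(target, scaled)]
    return mixture, scaled


if __name__ == "__main__":
    target = [1.0, -1.0, 1.0, -1.0]
    interference = [0.5, 0.5, -0.5, -0.5, 0.25]
    mixture, scaled = mix_at_snr(target, interference, snr_db=5.0)
    print(mixture)
```

To cover the stated range, the SNR of each mixture could be drawn with, for example, `random.uniform(-5.0, 5.0)`; the uniform distribution is again an assumption.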
For both datasets, the adaptation utterance for the target speaker in a mixture was randomly selected from that speaker's utterances, excluding the utterance used in the mixture. In the MC-WSJ0-2mix experiments, the adaptation utterances did not contain reverberation, although a similar level of performance could be achieved with reverberant adaptation utterances. We used an 8 kHz sampling frequency for all our experiments.
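The selection of the adaptation utterance can be sketched as a random draw that excludes the utterance present in the mixture. The function name and the use of utterance IDs as strings are hypothetical.

```python
import random


def pick_adaptation_utterance(speaker_utterances, mixture_utterance, rng=random):
    """Pick a random utterance of the target speaker that is not in the mixture.

    speaker_utterances: list of utterance IDs of the target speaker.
    mixture_utterance:  ID of the utterance used to create the mixture.
    rng:                object with a `choice` method (defaults to `random`).
    """
    candidates = [u for u in speaker_utterances if u != mixture_utterance]
    if not candidates:
        raise ValueError("the speaker needs at least one other utterance")
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    print(pick_adaptation_utterance(["utt1", "utt2", "utt3"], "utt2", rng))
```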
4.2 Experimental settings
TD-SpeakerBeam was implemented based on an open-source Conv-TasNet implementation. Following the hyper-parameter notation of the original Conv-TasNet paper, we set the hyper-parameters to N=256, L=20, B=256, H=512, P=3, X=8, R=4. The auxiliary network consisted of an encoder layer and a single convolution block.
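For reference, the hyper-parameters above can be collected in a small configuration object. The meaning attached to each letter follows the Conv-TasNet notation; the class and method names are hypothetical, and the derived quantities are simple arithmetic on the stated values and the 8 kHz sampling frequency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvTasNetConfig:
    N: int = 256   # number of filters in the encoder/decoder
    L: int = 20    # length of the encoder filters, in samples
    B: int = 256   # channels in the bottleneck 1x1 convolutions
    H: int = 512   # channels inside the 1-D convolutional blocks
    P: int = 3     # kernel size of the 1-D convolutional blocks
    X: int = 8     # 1-D convolutional blocks per repeat
    R: int = 4     # number of repeats
    sample_rate: int = 8000

    def window_ms(self) -> float:
        """Duration of one encoder window in milliseconds."""
        return 1000.0 * self.L / self.sample_rate

    def num_1d_conv_blocks(self) -> int:
        """Total number of 1-D convolutional blocks (X per repeat, R repeats)."""
        return self.X * self.R


if __name__ == "__main__":
    cfg = ConvTasNetConfig()
    print(cfg.window_ms(), cfg.num_1d_conv_blocks())
```

With L=20 at 8 kHz, each encoder window spans 2.5 ms, which is much shorter than the 32 ms STFT window used for the IPD features and explains why the IPD features must be upsampled.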
We compare the proposed TD-SpeakerBeam with (1) TasNet with oracle target speech selection, (2) TasNet with x-vector-based target speech selection, and (3) our previous implementation of FD-SpeakerBeam. TasNet used the network configuration described in [17, 10], with hyper-parameters equivalent to those of TD-SpeakerBeam. Oracle speaker selection was performed by finding the speaker permutation that maximizes the signal-to-distortion ratio (SDR). For x-vector-based speaker selection, we selected as the target speech the output of the TasNet separation module whose x-vector had the highest cosine similarity with the x-vector of the adaptation utterance. To extract x-vectors, we used a publicly available x-vector extractor model trained on multi-condition noisy and reverberant data [20, 11].
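The x-vector-based selection rule described above can be sketched as follows. The x-vectors are assumed to have been extracted beforehand by a separate model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("x-vectors must be non-zero")
    return dot / (norm_u * norm_v)


def select_target_output(separated_xvectors, adaptation_xvector):
    """Index of the separated output closest to the adaptation utterance.

    separated_xvectors: one x-vector per output of the separation module.
    adaptation_xvector: x-vector of the adaptation utterance.
    """
    scores = [cosine_similarity(x, adaptation_xvector)
              for x in separated_xvectors]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    outputs = [[1.0, 0.0, 0.2], [0.1, 0.9, 0.3]]
    enrollment = [0.0, 1.0, 0.2]
    print(select_target_output(outputs, enrollment))
```

This selection step fails whenever the x-vectors of the two separated outputs are about equally similar to the enrollment x-vector, which is more likely when the speakers have similar voices.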
The extraction network of FD-SpeakerBeam consisted of 3 BLSTM layers followed by a sigmoid layer, and its auxiliary network consisted of 3 fully connected layers. FD-SpeakerBeam was trained with the MSE loss between the amplitude spectra of the clean target speech and the masked signals. Details of the configuration can be found in our previous work. Note that many aspects of the network configuration and the training procedure differ from those of TD-SpeakerBeam. Consequently, the results of FD-SpeakerBeam are only indicative of the level of performance achieved in our previous works. A fairer comparison of the impact of working in the time and frequency domains in the context of speech separation can be found in the literature.
For the experiments with MC-WSJ0-2mix, we extracted IPD features using an STFT window of 32 ms and a shift of 16 ms. We compare TasNet with input IPD combination and TD-SpeakerBeam with input and internal IPD combination.
4.3 Results with IPD features using MC-WSJ0-2mix
Table 2 shows the SDR for the MC-WSJ0-2mix experiments for mixtures of female-female (FF), male-male (MM), and female-male (FM) speakers. We confirmed that TasNet with oracle target speaker selection (row (2)) achieved high SDR performance. Moreover, TasNet with input combination of IPD features (row (3)) further improved performance, especially for mixtures of same-gender speakers. These results can be considered an upper bound for TasNet-based target speaker extraction. We omitted results with the internal IPD combination for TasNet, as it performed worse than using IPD features at the input.
Performance dropped greatly when using x-vector-based speaker selection (rows (4) and (5)), especially for the FF and MM cases. Although the x-vector extractor was trained on multi-condition data, it may still be affected by reverberation, which may partly contribute to the poor performance. However, reverberation is not the only reason for the performance drop, because x-vector selection also performed significantly worse than oracle selection in the CSJ experiments below, which do not include reverberation.
FD-SpeakerBeam (row (6)) performed slightly worse than TasNet (xvect). The proposed TD-SpeakerBeam (row (7)) outperformed all systems except TasNet with oracle speaker selection. In particular, the performance gap between same-gender and different-gender mixtures is smaller than with FD-SpeakerBeam. We further confirmed that TD-SpeakerBeam with internal IPD combination (row (9)) improved performance by up to 1 dB and performed better than input combination (row (8)).
4.4 Results with the SI-loss on CSJ-2mix
| (6) | TD-SpkBeam + SI-loss | 13.60 | 17.75 | 19.22 | 17.81 |
Table 3 shows the SDR for TasNet with oracle and x-vector-based speaker selection, FD-SpeakerBeam, and TD-SpeakerBeam without and with the SI-loss. These results were obtained using all 937 training speakers to train the models. TD-SpeakerBeam achieved much better performance than FD-SpeakerBeam and TasNet with or without oracle speaker selection. Moreover, the SI-loss provided a further consistent performance improvement of up to 1 dB.
Figure 2 shows the histogram of the SDR improvement for FF, MM and FM mixtures. TD-SpeakerBeam with or without SI-loss greatly reduced processing failures (SDR improvement of 0 dB or less). Moreover, the SI-loss led to better overall performance (more results with high SDR improvement).
Figure 3 shows the SDR as a function of the number of training speakers. The curves were obtained by creating 3 different training sets with 100, 500, and all 937 training speakers. In all cases, we used 50k mixtures. Interestingly, we observe that increasing the number of speakers has little effect on TasNet performance but greatly improves the performance of SpeakerBeam. This suggests that, for SpeakerBeam, separating signals is somewhat easier than identifying speakers. The SI-loss provided a consistent improvement when using more than 100 speakers, which is why we did not use the SI-loss in the MC-WSJ0-2mix experiments. Note that performance remains significantly lower for FF mixtures, partly because there are fewer female speakers in the training set (see Table 1) and also because FF mixtures appear to be more challenging to separate, as also suggested by the lower performance of TasNet in this case.
In this paper, we proposed different strategies for improving the target speech discrimination capability of SpeakerBeam. We showed that a time-domain implementation greatly improved performance. Moreover, the performance gap between same-gender and different-gender mixtures could be reduced further by exploiting spatial information, using an additional SI-loss, or by increasing the number of training speakers.
- [1] (2015) Exploring multi-channel features for denoising-autoencoder-based speech enhancement. In Proc. of ICASSP’15, pp. 116–120.
- [2] (2019) A comprehensive study of speech separation: spectrogram vs waveform separation. In Proc. of Interspeech’19, pp. 4574–4578.
- [3] (2018) Deep extractor network for target speaker recovery from single channel speech mixtures. In Proc. of Interspeech’18, pp. 307–311.
- [4] (2017) Deep attractor network for single-microphone speaker separation. In Proc. of ICASSP’17, pp. 246–250.
- [5] (2019) Compact network for SpeakerBeam target speaker extraction. In Proc. of ICASSP’19, pp. 6965–6969.
- [6] (2018) Deep attractor networks for speaker re-identification and blind source separation. In Proc. of ICASSP’18, pp. 11–15.
- [7] (2018) Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. on Graphics 37 (4), pp. 112:1–112:11.
- [8] (1993) CSR-I (WSJ0) Complete LDC93S6A. Philadelphia: Linguistic Data Consortium. https://catalog.ldc.upenn.edu/ldc93s6a
- [9] (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. of ICASSP’16, pp. 31–35.
- [10] https://github.com/funcwj/conv-tasnet
- [11] https://github.com/iiscleap/DIHARD-2019-baseline
- [12] (2018) Listening to each speaker one by one with recurrent selective hearing networks. In Proc. of ICASSP’18, pp. 5064–5068.
- [13] (2017) Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. ASLP 25 (10), pp. 1901–1913.
- [14] (2019) SDR-half-baked or well done? In Proc. of ICASSP’19, pp. 626–630.
- [15] (2019) Direction-aware speaker beam for multi-channel speaker extraction. In Proc. of Interspeech’19.
- [16] (2018) TasNet: surpassing ideal time-frequency masking for speech separation. In Proc. of ICASSP’18.
- [17] (2019) Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. ASLP 27 (8), pp. 1256–1266.
- [18] (2000) Spontaneous speech corpus of Japanese. In Proc. of LREC’00, pp. 947–952.
- [19] (2020) BEAM-TasNet: time-domain audio separation network meets frequency-domain beamformer. In Proc. of ICASSP’20 (Submitted).
- [20] (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proc. of Interspeech’18, pp. 2808–2812.
- [21] (2019) Furcax: end-to-end monaural speech separation based on deep gated (de)convolutional neural networks with adversarial example training. In Proc. of ICASSP’19, pp. 6985–6989.
- [22] (2016) Deep neural network-based speaker embeddings for end-to-end speaker verification. In Proc. of SLT’16, pp. 165–170.
- [23] (2016) Sequence summarizing neural network for speaker adaptation. In Proc. of ICASSP’16, pp. 5315–5319.
- [24] (2006) Performance measurement in blind audio source separation. IEEE Trans. ASLP 14 (4), pp. 1462–1469.
- [25] (2019) VoiceFilter: targeted voice separation by speaker-conditioned spectrogram masking. In Proc. of Interspeech’19, pp. 2728–2732.
- [26] (2018) Multi-channel deep clustering: discriminative spectral and spatial embeddings for speaker-independent speech separation. In Proc. of ICASSP’18.
- [27] (2018) Multi-channel deep clustering: discriminative spectral and spatial embeddings for speaker-independent speech separation. In Proc. of ICASSP’18, pp. 1–5.
- [28] (2019) Margin matters: towards more discriminative deep neural network embeddings for speaker recognition. In Proc. of Interspeech’19.
- [29] (2019) Optimization of speaker extraction neural network with magnitude and temporal spectrum approximation loss. In Proc. of ICASSP’19, pp. 6990–6994.
- [30] (2017) Speaker-aware neural network based beamformer for speaker extraction in speech mixtures. In Proc. of Interspeech’17, pp. 2655–2659.
- [31] (2019) SpeakerBeam: speaker aware neural network for target speaker extraction in speech mixtures. IEEE Journal of Selected Topics in Signal Processing 13 (4), pp. 800–814.