The human brain is able to focus auditory attention on a particular voice by masking out the acoustic background in the presence of multiple talkers and background noises [cherry1953some, getzmann2017switching]. This is known as the cocktail party effect, or the cocktail party problem.
Infants as young as five months have developed the ability to give special attention to their own names [conway2001cocktail]. Behavioral studies have shown that both the ability to selectively attend to relevant stimuli and the ability to effectively ignore irrelevant stimuli develop progressively with increasing age across childhood [coch2005event]. These remarkable abilities rely on accurate processing of low-level stimulus attributes, segregation of auditory information into coherent voices, and selective attention to one voice at the exclusion of others to facilitate higher-level processing [hill2009auditory].
The human ability of selective auditory attention has been clearly demonstrated using multi-electrode surface recordings from the auditory cortex [mesgarani2012selective]. Attention is not a static, one-way information distillation process. It is believed to be a modulation of focus between bottom-up sensory-driven factors, such as a loud explosion that attracts attention, and top-down task-specific goals, such as a flight announcement of one's interest in a busy airport [kaya2017modelling]. The modulation is performed rapidly, in real time, in response to the input acoustic stimulus and the top-down attention task in the cognitive process.
Recent physiological studies reveal that such attentional modulation takes place both locally, by transforming the receptive field properties of the individual neurons, and globally throughout the auditory cortex, by rapid neural adaptation, or plasticity, of the cortical circuits [kaya2017modelling]. Computationally, the selective attention to an acoustic stimulus of interest can be described by a spectro-temporal receptive field, which acts as a spectro-temporal mask $M(f,t)$. The modulated response $R(f,t)$ [kaya2017modelling] to a stimulus $S(f,t)$ can be expressed as the element-wise multiplication between the stimulus and the mask, $R(f,t) = S(f,t) \otimes M(f,t)$, where $M(f,t)$ can be seen as the modulation of the input stimulus by a top-down voluntary focus, or top-down attention.
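The masking operation described above can be illustrated with a minimal numpy sketch; the stimulus and mask values below are arbitrary toy numbers for illustration, not learned quantities.

```python
import numpy as np

# Toy spectro-temporal stimulus S(f, t): 3 frequency bins x 4 time frames.
S = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 1.0, 2.0, 3.0]])

# Spectro-temporal mask M(f, t) in [0, 1], playing the role of the top-down
# attention: pass the first bin, attenuate the second, reject the third.
M = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.2, 0.2, 0.2, 0.2],
              [0.0, 0.0, 0.0, 0.0]])

# Modulated response: element-wise multiplication of stimulus and mask.
R = S * M
print(R)
```

The first frequency bin passes unchanged while the third is fully suppressed, which is the essence of the top-down modulation that SpEx learns with trainable masks.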
The top-down attention tasks vary with the application scenario, for example, a flight announcement in a busy airport, the singing vocal in a piece of music, or the speech of a particular speaker in a multi-talker acoustic environment. In this paper, we are interested in how to pay selective attention to a target speaker, a task which we call speaker extraction. Speaker extraction is in high demand in real-world applications, such as hearing aids [wang2017deep], speech recognition [li2015robust, watanabe2017new, xiao2016study], speaker verification [rao2019target_is], speaker diarization [sell2018diarization], and voice surveillance. A speaker-independent speaker extraction system is expected to work for any target speaker unseen during training, a condition that we call the open condition.
Building on the idea of the spectro-temporal receptive field, there have been attempts to perform speaker extraction in the frequency domain through a spectro-temporal mask. The studies on computational auditory scene analysis (CASA) [lyon1983computational, meddis1991virtual, ellis1996prediction, seltzer2003harmonic, wang2006computational, hu2007auditory], non-negative matrix factorization (NMF) [hoyer2004non, cichocki2006new, schmidt2006single, smaragdis2007convolutive, virtanen2007monaural, parry2007incorporating], and factorial HMM-GMM [virtanen2006speech, kristjansson2006super, stark2011source] provide invaluable findings for solving the cocktail party problem. With the advent of deep learning, one idea was to optimize the masks of individual speakers with deep recurrent neural networks for source separation of known speakers [huang2015joint]. However, machines have yet to achieve the same attention ability as humans in the presence of background noise or interference from competing speakers. The question is how to equip a network with the ability to estimate the mask at run-time for a new speaker that is unseen by the system during training.
The studies on speaker-independent speech separation have seen major progress recently, such as deep clustering (DC) [hershey2016deep, isik2016single, wang2018alternative], deep attractor network (DANet) [chen2017deep, luo2018speaker], permutation invariant training (PIT) [yu2017permutation, kolbaek2017multitalker, xu2018single, xu2018shifted], and time-domain audio separation network (TasNet) [luo2018tasnet, luo2018real, luo2018tasnet_arvix, luo2019conv]. Speech separation approaches mimic the human bottom-up sensory-driven attention. In general, speech separation methods require knowing or estimating the number of speakers in the mixture in advance. However, the number of speakers cannot always be known in advance in real-world applications. Furthermore, speech separation methods may suffer from what is called global permutation ambiguity, where the separated voice of the same speaker may not stick to the same output stream across long pauses or utterances, because the separation is done utterance by utterance [luo2019conv].
Speaker extraction [vzmolikova2017learning, delcroix2018single, wang2018deep, xu2019optimization, xiao2019single, delcroix2019compact, ochiai2019unified, wang2019voicefilter, xu2019time] represents one of the solutions to the problems of an unknown number of speakers and global permutation ambiguity. The idea is to provide a reference speech from a new speaker that is unseen during training. The system then uses this reference speech to direct its attention to the attended speaker, which emulates the human top-down voluntary focus, as shown in Figure 1. Such a speaker extraction technique is particularly useful when the system is expected to respond to a specific target speaker, for example, in speaker verification [rao2019target_is], where the reference speech of the target speaker is available through an enrolment process. In the prior work [vzmolikova2017learning, delcroix2018single, wang2018deep, xu2019optimization, xiao2019single, delcroix2019compact, ochiai2019unified, wang2019voicefilter], a common approach is to perform speaker extraction in the frequency domain, and to reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra. Others have also studied the complex ratio mask [fu2017complex, williamson2017time, tan2019complex] in speech enhancement. The frequency-domain process relies on the short-time Fourier transform (STFT), which suffers from the windowing effect and the phase estimation problem.
Inspired by Conv-TasNet [luo2018tasnet_arvix, luo2019conv] for speech separation, we propose a novel end-to-end network architecture for speaker extraction (SpEx). SpEx is composed of four network components: a speech encoder that encodes the time-domain mixture speech into a spectrum-like feature representation that we call embedding coefficients; a speaker encoder that learns to represent the target speaker with a speaker embedding; a speaker extractor that estimates a receptive mask for the target speaker conditioned on the reference speech; and a speech decoder that reconstructs the clean speech of the target speaker by applying the receptive mask to the embedding coefficients of the mixture speech. The SpEx architecture allows the joint training of all four modules with a multi-task learning algorithm.
The proposed SpEx is different from our earlier work [xu2019time], where the speaker embedding, an i-vector [Dehak&Kenny2011], is not involved in model training. It is also different from [vzmolikova2017learning, delcroix2018single, xu2019optimization, delcroix2019compact], where the speaker embedding is only trained to optimize the signal reconstruction loss. We will further discuss the difference between SpEx and TasNet in Section II-D. This paper makes the following contributions:
We emulate the human ability of selective auditory attention by mimicking the top-down voluntary focus with a speaker encoder.
We propose a time-domain solution as an extension of Conv-TasNet from speech separation to speaker extraction, which avoids the phase estimation problem of frequency-domain approaches.
We propose a multi-task learning algorithm to jointly optimize the four network components of SpEx in a unified training process.
We propose a multi-scale encoding and decoding scheme that captures multiple temporal resolutions for improved voice quality.
II Time-domain Speaker Extraction Network
A speaker extraction network can be generally described as in Figure 2. It consists of four network components. The speaker encoder encodes the reference speech into a speaker embedding, which is the feature representation of the target speaker. The speech encoder encodes the time-domain mixture speech into a spectrum or spectrum-like feature representation. The speaker extractor estimates a mask that lets only the target speaker's voice pass through. Finally, the speech decoder reconstructs the time-domain speech signal from the masked spectrum of the mixture speech. From the viewpoint of selective auditory attention, the masked spectrum is called the modulated response [kaya2017modelling].
In a frequency-domain implementation, an STFT module serves as the speech encoder that transforms the time-domain speech signal into a spectrum, with magnitude and phase, while an inverse STFT serves as the speech decoder. Similar to TasNet [luo2018tasnet, luo2019conv], we opt for a trainable neural network to serve as the speech encoder in time-domain speaker extraction. The speech encoder is trained to convert the time-domain speech signal into a spectrum-like embedding, also called embedding coefficients. The proposed time-domain speaker extraction network (SpEx) is depicted in Figure 3 in detail.
Suppose that a signal $x(t)$ of $T$ samples is the mixture of the target speaker's voice $s(t)$ and $I$ interference voices or background noises $n_i(t)$. We have,

$x(t) = s(t) + \sum_{i=1}^{I} n_i(t)$    (1)

where $I$ might be any number of interference sources, and $n_i(t)$ might be either interference speech or background noise.
During inference at run-time, given a mixture signal $x(t)$ and a reference speech $r(t)$, the speaker extractor is expected to estimate a signal $\tilde{s}(t)$ that is close to $s(t)$ subject to an optimization criterion.
II-A SpEx Architecture
II-A1 Speaker Encoder
In text-independent speaker recognition, it is common to represent the speech with a fixed-dimensional vector, such as an i-vector [Dehak&Kenny2011], an x-vector [snyder2016deep], or other similar feature representations [huang2018angular], that characterizes the voiceprint of a speaker. The model that converts speech samples into such a feature representation is called a speaker encoder, and the resulting feature representation is called a speaker embedding.
In [wang2019voicefilter] and [xu2019time], a speaker encoder is pre-trained independently to extract a d-vector or an i-vector, respectively, for the target speaker. As such speaker encoders are pre-trained independently of the speaker extraction system, they are not optimized directly for speaker extraction purposes. Another idea is to train the speaker encoder jointly with the speaker extraction system [vzmolikova2017learning, delcroix2018single, xu2019optimization, delcroix2019compact] with the loss (e.g., mean square error) between the extracted and clean speech. Such speaker encoders are trained to optimize the signal reconstruction for speaker extraction; however, they do not directly aim to characterize or discriminate speakers.
To benefit from both the idea of a speaker encoder [wang2019voicefilter, xu2019time] and task-oriented optimization [vzmolikova2017learning, delcroix2018single, xu2019optimization, delcroix2019compact], we propose a multi-task learning algorithm that incorporates the speaker encoder as part of the SpEx network. The speaker encoder is jointly optimized by weighting a cross-entropy loss for speaker classification and a signal reconstruction loss between the extracted and clean speech for speaker extraction. In practice, we use a bidirectional long short-term memory (BLSTM) network to encode the context information of the reference speech into a speaker embedding through a mean pooling layer. In the multi-task learning process, the gradients from both the cross-entropy loss and the signal reconstruction loss are back-propagated to optimize the speaker encoder network. The details of the learning algorithm will be discussed in Sections II-B and II-C.
II-A2 Speech Encoder
There have been studies on how to address the phase estimation problem of frequency-domain methods. One approach is to optimize the real and imaginary parts separately [fu2017complex, williamson2017time, tan2019complex]; another is to compensate the phase in the training process [gunawan2010iterative, wang2018alternative, wang2018end]. Such attempts have achieved limited success due to inexact phase estimation. Similar to Conv-TasNet [luo2018tasnet_arvix, luo2019conv], we opt for a time-domain approach, which transforms the time-domain mixture signal directly into a feature representation using a convolutional network.
In a frequency-domain approach, by applying the Fourier transform, a speech signal is decomposed into an alternative representation characterized by sines and cosines. Similarly, in a time-domain approach, we can consider the filters in the convolutional layers as basis functions in analogy to the sines and cosines in the frequency domain [reju2007convolution]. The resulting feature representations are considered as the embedding coefficients. However, the time-domain encoding is different from the Fourier transform in that a) the feature representations do not handle real and imaginary parts separately; and b) the basis functions are not pre-defined as sines and cosines, but rather trained from the data.
The input mixture speech $x(t)$ can be encoded into embedding coefficients using a convolutional neural network in a similar way as other end-to-end speech processing systems [sainath2015learning, fu2018end, luo2019conv]. Inspired by [zhu2016learning, multiscale2], this paper proposes to encode the mixture speech into multi-scale speech embeddings using several parallel 1-D CNNs, each with $N$ filters, for various temporal resolutions. While the number of scales can vary, we study three different time scales in this work without loss of generality. The filters of the parallel 1-D CNNs are of different lengths, $L_1$, $L_2$ and $L_3$ samples, to cover different window sizes. The CNNs are followed by a rectified linear unit (ReLU) activation function to produce the speech embeddings.
To concatenate the embeddings across the different time scales, we align them by keeping the same stride across the scales. With the varying filter lengths, the encoder learns representations at multiple scales; for example, a short window has good temporal resolution at high frequency, and a long window has high resolution at low frequency. Without trading temporal resolution for frequency resolution as in STFT, we encode the time-domain signal into three temporal resolutions in the embedding $Y = [Y_1, Y_2, Y_3]$. The embedding coefficients at each scale consist of a sequence of $K$ vectors, which are defined as,

$Y_i = \mathrm{ReLU}(X_i V_i), \quad i \in \{1, 2, 3\}$    (2)

where $X_i$ is the matrix of overlapping segments of $x(t)$, each segment covering a window of $L_i$ samples and shifting by the common stride, and $V_i$ denotes the $N$ trainable filters at scale $i$. $V_i$ is also called the encoder basis.
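A minimal numpy sketch of this multi-scale encoding is given below. The filter lengths, filter count, stride and signal length are illustrative placeholders rather than the tuned values, and random matrices stand in for the trained encoder bases; the longer-window scales are zero-padded so that all three scales yield the same number of frames and can be concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, filters, stride):
    """One encoder scale: frame x into overlapping windows of length L
    (L = filters.shape[1]), apply the N filters, then ReLU."""
    N, L = filters.shape
    num_frames = (len(x) - L) // stride + 1
    # X: num_frames x L matrix of signal segments (the X_i in the text).
    X = np.stack([x[k * stride : k * stride + L] for k in range(num_frames)])
    return np.maximum(X @ filters.T, 0.0)          # num_frames x N

x = rng.standard_normal(8000)                      # 1 s of 8 kHz mixture (toy)
stride = 10                                        # shared stride aligns scales
scales = [20, 80, 160]                             # short/middle/long windows
N = 32                                             # filters per scale (toy)

Y = []
for L in scales:
    xp = np.concatenate([x, np.zeros(L - scales[0])])   # pad to align frame counts
    Y.append(conv1d_relu(xp, rng.standard_normal((N, L)), stride))
Y = np.concatenate(Y, axis=1)                      # frames x 3N multi-scale embedding
print(Y.shape)
```

Because all scales share one stride, the per-frame embeddings can be concatenated along the channel dimension, which is what the speaker extractor consumes.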
II-A3 Speaker Extractor
One of the earliest theories of attention is Broadbent's filter model [broadbent1958perception]. In psychoacoustic experiments, the stimuli are first processed according to their physical properties such as color, loudness, and pitch. The selective filters of listeners then allow certain stimuli to pass through for further processing while other stimuli are rejected. The selective filter can be modelled by a mask, which has been well studied in the speech separation literature, such as the ideal binary mask (IBM) [li2009optimality], ideal ratio mask (IRM) [narayanan2013ideal], ideal amplitude mask (IAM) [wang2014training], Wiener-filter-like mask (WFM) [erdogan2015phase] and phase sensitive mask (PSM) [erdogan2015phase].
In the SpEx framework, the speaker embedding describes the physical properties of the auditory source, a target speaker in this case, as the focus of the attention. The speaker extractor, as shown in Figure 3, is conditioned on the speaker embedding both during training and at run-time inference to estimate a filter mask, referred to as the receptive mask. We obtain the modulated response [kaya2017modelling] at each scale of the target speaker by applying the receptive mask $M_i$ on the embedding coefficients $Y_i$ of the mixture signal at that scale,

$S_i = M_i \otimes Y_i, \quad M_i = f(Y, g(r)), \quad i \in \{1, 2, 3\}$    (3)

where $\otimes$ is an operator for element-wise multiplication, $Y$ is the multi-scale embedding coefficients, $f(\cdot)$ and $g(\cdot)$ are the functions representing the speaker extractor and the speaker encoder, and $r$ is the reference speech of the target speaker to form an attention.
Specifically, the multi-scale embedding coefficients are first normalized by their mean and variance on the channel dimension, scaled by trainable bias and gain parameters. Then a 1-D CNN with a kernel size of 1, referred to as a 1x1 CNN, is applied. This 1x1 CNN adjusts the number of channels for the inputs and residual paths of the subsequent blocks of the temporal convolutional network (TCN), yielding the channel-adjusted multi-scale embedding coefficients. At the same time, the speaker embedding vector from the speaker encoder is concatenated repeatedly along the time dimension to these coefficients to form the input with speaker information. Similarly, the speaker embedding vector is also concatenated repeatedly with the representations along the subsequent TCN blocks, as shown in Figure 3.
Similar to Conv-TasNet, we stack the TCN blocks with exponentially increasing dilation factors to capture the long-range dependency of the speech signal. Each TCN block, as shown in Figure 4, applies a dilated depth-wise separable convolution to reduce the number of parameters. The dilated depth-wise separable convolution consists of a dilated depth-wise convolution ("d-conv" in Figure 4) followed by a 1x1 CNN. Since the number of input channels of the TCN block may differ from the number of filters of the dilated depth-wise convolution, a 1x1 CNN is applied in advance to adjust the number of channels. The dilated depth-wise convolution in the $b$-th block of a stack has a dilation factor of $2^b$, $b \in \{0, \ldots, B-1\}$, where $B$ is the number of TCN blocks in a stack. Such a stack is repeated $R$ times, as shown in the speaker extractor in Figure 3.
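The long-range dependency captured by such an exponentially dilated stack can be quantified by its receptive field, measured in encoder frames. The helper below computes it; the kernel size, block count and repeat count in the example call are hypothetical values for illustration, not the paper's tuned configuration.

```python
def tcn_receptive_field(kernel_size, num_blocks, num_repeats):
    """Receptive field (in encoder frames) of num_repeats stacks of
    num_blocks TCN blocks whose dilation doubles within each stack:
    1, 2, 4, ..., 2**(num_blocks - 1)."""
    rf = 1
    for _ in range(num_repeats):
        for b in range(num_blocks):
            rf += (kernel_size - 1) * (2 ** b)
    return rf

# Hypothetical configuration: kernel size 3, 8 blocks per stack, 4 stacks.
print(tcn_receptive_field(3, 8, 4))   # 2041 encoder frames
```

Doubling the dilation makes the receptive field grow exponentially with the number of blocks per stack while the parameter count grows only linearly, which is why the stack can model long-range speech structure cheaply.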
To apply the mask $M_i$ on $Y_i$, $M_i$ must have the same dimensions as $Y_i$. As the number of output channels from the last TCN block may differ from that of the encoded representations $Y_i$, we apply a 1x1 CNN to transform the output channels of the last TCN block to the same dimension as the encoded representations $Y_i$. The elements of the mask are estimated through a Sigmoid activation function to keep their range within $(0, 1)$. Finally, the masked embedding coefficients of the target speaker, also called the modulated responses [kaya2017modelling], are estimated by Eq. 3.
II-A4 Speech Decoder
The decoder reconstructs the time-domain speech signal from the modulated responses. The embedding coefficients at each scale lead to a modulated response. We reconstruct the multi-scale modulated responses $(S_1, S_2, S_3)$ into time-domain signals $(\tilde{s}_1, \tilde{s}_2, \tilde{s}_3)$ with the respective decoder bases through a de-convolution process. Each decoder basis consists of filters learned during training, just as a Fourier basis is composed of sine and cosine functions.
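The de-convolution can be sketched as an overlap-add of the decoder basis filters, weighted at each frame by the modulated response; it is the transpose of the encoder's framing operation. The shapes, stride and random basis below are illustrative assumptions only.

```python
import numpy as np

def deconv1d(S, basis, stride):
    """Overlap-add de-convolution: each frame of the modulated response S
    (num_frames x N) weights the N decoder filters (basis: N x L), and the
    resulting L-sample snippets are overlap-added at the encoder stride."""
    num_frames, N = S.shape
    L = basis.shape[1]
    out = np.zeros((num_frames - 1) * stride + L)
    for k in range(num_frames):
        out[k * stride : k * stride + L] += S[k] @ basis
    return out

rng = np.random.default_rng(1)
S = rng.standard_normal((100, 32))    # toy modulated response at one scale
U = rng.standard_normal((32, 40))     # toy decoder basis: 32 filters, 40 samples
s_hat = deconv1d(S, U, stride=20)
print(s_hat.shape)                    # (99 * 20 + 40,) = (2020,)
```

Because the operation is linear in the modulated response, the learned basis plays the same reconstruction role that sines and cosines play in an inverse Fourier transform.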
II-B Multi-scale Encoding and Decoding
Speech has a rich temporal structure over multiple time scales, presenting phonemic, prosodic and linguistic content [multiscale2]. It was shown that speech analysis at multiple temporal resolutions leads to improved speech recognition performance [multiscale1]. As shown in Fig. 3, we implement multi-scale encoding in the speech encoder and speaker extractor. The speech encoder first encodes the mixture signal into multi-scale embedding coefficients $Y_i$. The speaker extractor then estimates the multi-scale masks $M_i$ and generates the multi-scale modulated responses $S_i$. We finally reconstruct the multi-scale modulated responses into time-domain signals at multiple scales with the speech decoder.
During training, we calculate a multi-scale scale-invariant signal-to-distortion ratio (SI-SDR) loss, defined as $L_{\text{SI-SDR}}$, that aims to minimize the signal reconstruction error,

$L_{\text{SI-SDR}} = -[(1-\alpha-\beta)\,\rho(\tilde{s}_1, s) + \alpha\,\rho(\tilde{s}_2, s) + \beta\,\rho(\tilde{s}_3, s)]$    (4)

where $\alpha$ and $\beta$ are the weights. $\tilde{s}_1$, $\tilde{s}_2$ and $\tilde{s}_3$ are the signals reconstructed from the modulated responses $S_1$, $S_2$ and $S_3$, respectively. $s$ is the clean speech signal serving as the training target. We use the SI-SDR [le2019sdr], denoted as $\rho(\cdot,\cdot)$, as the measure of reconstruction error,

$\rho(\tilde{s}, s) = 20\log_{10}\dfrac{\|(\langle\tilde{s}, s\rangle / \langle s, s\rangle)\, s\|}{\|\tilde{s} - (\langle\tilde{s}, s\rangle / \langle s, s\rangle)\, s\|}$    (5)

where $\tilde{s}$ and $s$ are the extracted and target signals of the target speaker, respectively, and $\langle\cdot,\cdot\rangle$ is the inner product. To ensure scale invariance, the signals $\tilde{s}$ and $s$ are normalized to zero mean prior to the SI-SDR calculation.
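The SI-SDR measure above can be sketched in numpy as follows: the estimate is projected onto the target, and the power of the projected target component is compared against the residual. The toy signals are illustrative only.

```python
import numpy as np

def si_sdr(s_hat, s, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the target,
    then compare the projected target component against the residual."""
    s_hat = s_hat - s_hat.mean()          # zero-mean for scale invariance
    s = s - s.mean()
    proj = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    noise = s_hat - proj
    return 10 * np.log10(np.dot(proj, proj) / (np.dot(noise, noise) + eps))

t = np.linspace(0, 1, 8000)               # 1 s at 8 kHz
clean = np.sin(2 * np.pi * 440 * t)       # toy target signal
noisy = clean + 0.1 * np.cos(2 * np.pi * 100 * t)

# A rescaled copy of the target scores near-perfectly: SI-SDR ignores scale.
print(si_sdr(0.3 * clean, clean))
print(si_sdr(noisy, clean))               # roughly 20 dB for this toy noise
```

Note that rescaling the estimate leaves the score essentially unchanged, which is exactly the property that makes the loss robust to the arbitrary gain of the reconstructed waveform.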
The calculation of the multi-scale SI-SDR loss is required only during training and not at run-time inference. At run-time inference, we evaluate the quality of the signals reconstructed at the multiple scales individually, i.e. $\tilde{s}_1$, $\tilde{s}_2$, $\tilde{s}_3$, and collectively as a weighted summation $\tilde{s} = (1-\alpha-\beta)\tilde{s}_1 + \alpha\tilde{s}_2 + \beta\tilde{s}_3$.
II-C Multi-task Learning
We propose to train the speaker encoder together with the three other network components as a whole. The speech encoder, speaker extractor, and speech decoder are optimized to minimize the multi-scale SI-SDR loss, while the speaker encoder is optimized with two objective functions: the multi-scale SI-SDR loss and the cross-entropy loss for speaker classification.
The cross-entropy loss for speaker classification is defined as,

$L_{\text{CE}} = -\sum_{i=1}^{N_s} y_i \log(\hat{y}_i)$    (6)

where $N_s$ is the number of speakers in the speaker classification task, $y_i$ is the true class label for speaker $i$, and $\hat{y}_i$ is the predicted speaker probability.
We combine $L_{\text{CE}}$ with $L_{\text{SI-SDR}}$ to optimize the speaker encoder network in a multi-task learning manner, as the two losses represent two different optimization tasks. With multi-task learning, the speaker encoder network is trained not only to characterize the unique properties of the target speaker, but also to contribute to the overall speaker extraction objective. The total loss is a weighted sum of $L_{\text{SI-SDR}}$ and $L_{\text{CE}}$,

$L = L_{\text{SI-SDR}} + \gamma L_{\text{CE}}$    (7)

where $\gamma$ is a scaling weight.
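A sketch of the combined multi-task objective is given below; the weights alpha, beta and gamma, and the toy per-scale SI-SDR and softmax values, are illustrative assumptions rather than the tuned settings.

```python
import numpy as np

def cross_entropy(probs, true_idx, eps=1e-12):
    """Speaker classification loss: negative log of the predicted
    probability assigned to the true speaker (one-hot labels)."""
    return -np.log(probs[true_idx] + eps)

def multi_task_loss(rho_scales, alpha, beta, probs, true_idx, gamma):
    """Negative weighted multi-scale SI-SDR plus gamma-weighted
    cross-entropy, mirroring the total loss described in the text."""
    rho1, rho2, rho3 = rho_scales
    l_sisdr = -((1 - alpha - beta) * rho1 + alpha * rho2 + beta * rho3)
    return l_sisdr + gamma * cross_entropy(probs, true_idx)

# Toy numbers: SI-SDR of 10/9/8 dB at the three scales, a 4-speaker softmax.
probs = np.array([0.7, 0.1, 0.1, 0.1])
loss = multi_task_loss((10.0, 9.0, 8.0), alpha=0.1, beta=0.1,
                       probs=probs, true_idx=0, gamma=0.5)
print(loss)
```

Minimizing this combined scalar simultaneously pushes the reconstruction quality up (more negative SI-SDR term) and the speaker classification error down, which is how the gradients of both tasks reach the speaker encoder.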
II-D Relationship with TasNet
The SpEx network can be seen as an extension of Conv-TasNet [luo2018tasnet_arvix, luo2019conv] from speech separation to speaker extraction applications. A comparison with the TasNet framework will help the understanding of SpEx.
BLSTM-TasNet [luo2018tasnet, luo2018real] and Conv-TasNet [luo2018tasnet_arvix, luo2019conv] represent successful attempts to address the phase estimation problem of frequency-domain speech separation. These techniques employ an encoder-separator-decoder pipeline, and learn trainable basis functions with 1-D convolution and de-convolution instead of a Fourier series of sine and cosine functions. Speech separation is performed by estimating a mask for each speaker in the mixture, using either a BLSTM in BLSTM-TasNet or a fully convolutional neural network (CNN) in Conv-TasNet. Conv-TasNet uses a TCN architecture together with a dilated depth-wise separable convolution, which represents an effective time-domain implementation.
The idea of SpEx is similar to that of Conv-TasNet in the sense that the speaker extractor of SpEx is based on the same TCN architecture [luo2018tasnet_arvix], and the encoder-extractor-decoder pipeline is inspired by the encoder-separator-decoder pipeline of Conv-TasNet. However, SpEx is also different from Conv-TasNet in the following ways:
II-D1 Top-down voluntary focus
SpEx features a speaker encoder as the top-down voluntary focus in selective auditory attention. It learns to single out one voice from a multi-talker mixture by modulating the input stimulus with a top-down attention. However, Conv-TasNet does not employ such a mechanism. It learns to segregate two competing voices by estimating two filtering masks. Just like other speaker extraction techniques, SpEx addresses the issues of global permutation ambiguity and unknown number of speakers that we face in speech separation.
II-D2 Multi-task learning
As Conv-TasNet does not involve a speaker encoder, it is trained only to optimize the reconstruction error, equivalent to the SI-SDR loss in this paper. SpEx adopts a multi-task learning algorithm to jointly optimize all network components, with a cross-entropy loss for speaker classification and an SI-SDR loss for signal reconstruction. The speaker encoder is optimized by the total loss defined in Eq. 7. Such a total loss is different from the prior works, where the speaker extraction systems train the speaker encoder with either a speaker classification loss [wang2019voicefilter, xu2019time] or a signal reconstruction loss [vzmolikova2017learning, delcroix2018single, xu2019optimization, delcroix2019compact] as a single task.
II-D3 Multi-scale encoding and decoding
The TCN architecture in Conv-TasNet works well for single-time-scale embedding coefficients [luo2018tasnet_arvix, luo2019conv]. Multi-scale encoding has been shown effective in deep neural network approaches to speech recognition [multiscale2]. We believe that, if the TCN architecture is trained with multi-scale embedding coefficients, it learns to reconstruct the rich temporal structure of speech in greater detail. This makes an interesting study of the TCN architecture.
III Experimental Setup
We simulated a two-speaker (WSJ0-2mix-extr) and a three-speaker (WSJ0-3mix-extr) mixture database1 according to the well-known WSJ0-2mix and WSJ0-3mix [hershey2016deep]. The speech signals are sampled at 8 kHz and drawn from the WSJ0 corpus [garofolo1993csr]. Each database has three datasets: a training set, a development set, and a test set.

1 Unlike in speech separation, the speaker extraction technique requires a reference speaker to supervise the voluntary attention. We re-organize the well-known WSJ0-2mix and WSJ0-3mix with the "max" data structure by selecting the first chosen speaker as the target speaker, while keeping the mixture speech the same. We rename the simulated databases in this work to differentiate them from the original WSJ0-2mix and WSJ0-3mix databases. The simulation code and the best baseline are available at: https://github.com/xuchenglin28/speaker_extraction.
Same as in [hershey2016deep], the training and development sets are generated by randomly selecting two utterances (for the two-speaker database) or three utterances (for the three-speaker database) from the male and female speakers in the WSJ0 "si_tr_s" set, and mixing them at various signal-to-noise ratios (SNRs). The training set is used for training the network components, while the development set is used for optimizing the system configurations.
Similarly, utterances from the male and female speakers in the WSJ0 "si_dt_05" and "si_et_05" sets are randomly selected to create the test set. Since the speakers in the test set are excluded from the training and development sets, and the reference speech is not used in any of the speech mixing, the test set supports speaker-independent evaluation, also called open condition evaluation.
To include both overlapping and non-overlapping speech in the dataset, we keep the maximum length of the mixing utterances as the length of the mixture. The speaker of the first randomly selected utterance is regarded as the target speaker. At run-time, the speaker extraction process is conditioned on a reference speech from the target speaker.
As the reference speech is randomly selected, the duration of the reference speech varies in the training, development and test sets. We call this experimental condition "Random". In the test set, the average duration of the reference speech is 7.3s, with a standard deviation of 2.7s, a maximum of 19.6s, and a minimum of 1.6s. The experiments are conducted under this "Random" condition if not stated otherwise.
In the two-speaker database, we also group the reference speech of the test set into four duration groups, i.e. 7.5s, 15s, 30s and 60s, for the experiment on the duration of the reference speech, reported in Section IV-A8.
While most of the comparative experiments are conducted on the two-speaker database, we also extend the experiments beyond two-speaker mixtures. A three-speaker database is constructed in a similar way as the two-speaker database, except that the duration of the reference speech in the test set is kept at 15s and 60s. In the experiment for three-speaker mixtures, we train the SpEx network under three conditions: two-speaker mixtures only, three-speaker mixtures only, and two-speaker and three-speaker mixtures in combination. The trained SpEx systems are then evaluated on the two-speaker and three-speaker mixture test sets, respectively.
The network is optimized by the Adam algorithm [kingma2014adam]. The learning rate is halved when the loss on the development set increases for several consecutive epochs, and an early stopping scheme is applied when the loss keeps increasing on the development set for a larger number of epochs. The utterances in the training and development sets are broken into 4s segments2 for minibatch training.

2 We discard the segments shorter than 4s or containing only silence for the target speech.
III-B Speaker Encoder
The speaker encoder in Figure 2 translates the reference speech of the target speaker into a top-down voluntary focus that the speaker extractor network can act upon. In Figure 3, we propose a detailed implementation, which repeatedly concatenates the speaker embedding vector with the inputs to the TCN blocks. In this paper, we advocate incorporating the speaker encoder network as an integral part of the SpEx architecture during training and at run-time inference. As a contrastive experiment, we would like to know how such a speaker encoder performs differently from a traditional i-vector extractor [Dehak&Kenny2011]. We choose the i-vector because it has been one of the most effective techniques for text-independent speaker characterization.
III-B1 I-vector Extractor
An i-vector extractor converts a speech sample into a low-dimensional vector. We first train the universal background model (UBM) and the total variability matrix with the single-speaker (clean) speech from the training and development sets. The acoustic features include 19 MFCCs together with energy, and their first- and second-order derivatives, followed by cepstral mean normalization [Atal74] with a window size of 3 seconds. The 60-dimensional acoustic features are extracted with a window length of 25ms and a shift of 10ms. A Hamming window is applied. An energy-based voice activity detection method is used to remove silence frames. The i-vector extractor is based on a gender-independent UBM with 512 mixtures and a total variability matrix with 400 total factors.
III-B2 Speaker Encoder
We use the same acoustic features as in the training of the i-vector extractor. To leverage the temporal information of the whole reference speech, a BLSTM is used to capture the speaker information from the acoustic features. The BLSTM is followed by a non-linear layer with a ReLU activation function. Another linear layer followed by a mean pooling is applied to extract the speaker embedding vector, which has the same dimension as the i-vector for a fair comparison.
III-C Speaker Extraction Pipeline
The speaker extraction pipeline includes speech encoder, speaker extractor, and speech decoder. The parameters that are quoted in this section have been tuned empirically for the best performance on the development set.
III-C1 Speech Encoder
In the SpEx implementation detailed in Figure 3, the speech encoder encodes the mixture speech input into embedding coefficients with three parallel 1-D convolutions of $N$ filters each, followed by a ReLU activation function. To learn multi-scale embeddings with different time resolutions, the three 1-D convolutions have filter lengths of $L_1$, $L_2$ and $L_3$ samples with a common stride. The three window lengths are tuned in this work to cover short, middle and long windows.
III-C2 Speaker Extractor
As shown in Figure 3, a mean and variance normalization with trainable gain and bias parameters is applied to the embedding coefficients along the channel dimension, where is equal to . A 1x1 convolution linearly transforms the normalized embedding coefficients to representations with channels, which determines the number of channels in the input and residual paths of the subsequent CNN. The number of input channels and the kernel size of each depthwise convolution are set to and , respectively. TCN blocks are formed as a stack, which is repeated times.
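The core of one TCN block, a dilated depthwise convolution followed by a 1x1 pointwise convolution on a residual path, can be sketched as below. This is a simplification under stated assumptions: weights are random, the kernel size, channel counts, dilation schedule and repeat count are illustrative, and normalization and activation layers inside the block are omitted:

```python
import numpy as np

def depthwise_separable_conv(x, dilation, kernel=3):
    """Dilated depthwise conv (one filter per channel) followed by a 1x1
    pointwise conv that mixes channels. Causal padding keeps length T."""
    T, C = x.shape
    dw = np.random.randn(C, kernel) * 0.1          # depthwise filters
    pw = np.random.randn(C, C) * 0.1               # 1x1 pointwise filters
    pad = (kernel - 1) * dilation
    xp = np.pad(x, ((pad, 0), (0, 0)))             # causal padding
    y = np.zeros_like(x)
    for k in range(kernel):
        y += xp[k * dilation: k * dilation + T] * dw[:, k]
    return y @ pw                                   # (T, C)

# A stack of blocks with exponentially growing dilation, repeated as in SpEx
x = np.random.randn(100, 64)
for repeat in range(2):                  # number of repeats is illustrative
    for b in range(4):                   # dilations 1, 2, 4, 8
        x = x + depthwise_separable_conv(x, dilation=2 ** b)  # residual path
print(x.shape)
```

The exponentially growing dilation is what gives the stack a long receptive field with few parameters, as discussed in the benchmark section.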
III-C3 Speech Decoder
The speech decoder in Figure 3 reconstructs the time-domain speech signal (, , ) from the modulated responses (, , ) through a de-convolution process. The filter in the de-convolution has the same configuration as that in the speech encoder, where the number of filters () is equal to and the filter lengths () are tuned to be samples.
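The de-convolution can be sketched as an overlap-add transposed convolution per scale. The basis is random here rather than learned, and the frame count, filter count and filter length are illustrative placeholders, not the tuned values:

```python
import numpy as np

def deconv1d(coeffs, filter_len, stride):
    """Transposed 1-D convolution by overlap-add: each frame of embedding
    coefficients is mapped through a basis to a short waveform segment,
    which is added at its time position."""
    n_frames, n_filters = coeffs.shape
    basis = np.random.randn(n_filters, filter_len) * 0.1   # placeholder basis
    out = np.zeros((n_frames - 1) * stride + filter_len)
    for i in range(n_frames):
        out[i * stride: i * stride + filter_len] += coeffs[i] @ basis
    return out

# One decoder per scale; here the short-scale modulated response is decoded
coeffs = np.random.randn(799, 256)
y = deconv1d(coeffs, filter_len=20, stride=10)
print(y.shape)   # (8000,), i.e. 1 s at 8 kHz
```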
III-D Reference Baselines
We select systems that represent recent advances in single-channel target speaker extraction as the baselines, and implement all of them for benchmarking. The baseline systems belong to the SpeakerBeam frontend (SBF) [delcroix2018single] family, which demonstrates state-of-the-art performance of frequency-domain speaker extraction techniques on databases similar to the one used in this paper.
SBF-IBM [delcroix2018single]: This architecture adopts a speaker adaptation layer in a context-adaptive deep neural network (CADNN) [delcroix2016context] to track the target speaker in the input mixture during speaker extraction. The weights of the adaptation layer are learned from the target speaker's enrolled speech by the speaker embedding network. An ideal binary mask (IBM) is used to compute the mask approximation loss as the objective function.
SBF-MSAL [delcroix2018single]: This architecture replaces the IBM objective function in SBF-IBM with a magnitude spectrum approximation loss (MSAL) to directly minimize the signal reconstruction error. It is reported that SBF-MSAL outperforms SBF-IBM.
SBF-MTSAL [xu2019optimization]: This architecture replaces the IBM objective function in SBF-IBM with a magnitude and temporal spectrum approximation loss (MTSAL), in which a temporal constraint is incorporated to ensure the temporal continuity of the output signal. It is reported that SBF-MTSAL outperforms SBF-MSAL.
SBF-MTSAL-Concat [xu2019optimization]: This architecture adopts a BLSTM as the speaker encoder to capture long-range speaker characteristics. While the speaker encoder of SBF-MTSAL-Concat is similar to that of SpEx, it is trained only with the magnitude and temporal spectrum approximation loss, without multi-task learning; no speaker classification loss is involved. It is reported that SBF-MTSAL-Concat outperforms all three SBF variants above.
III-E Evaluation Metrics
We follow the same evaluation metrics as in the speaker extraction literature [xu2019optimization] for ease of comparison. They are the signal-to-distortion ratio (SDR) [vincent2006performance] and the perceptual evaluation of speech quality (PESQ) [rix2001perceptual]. We also include SI-SDR [le2019sdr], because SI-SDR is more suitable and robust for single-channel speech separation or extraction than SDR. Since speaker extraction aims to improve speech quality and intelligibility, a subjective A/B preference test is also conducted to evaluate the perceptual quality of the extracted speech by human listeners.
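The SI-SDR metric used throughout the experiments can be computed as follows (a standard definition from [le2019sdr], not code from the paper): the estimate is projected onto the reference, and the energy of that target component is compared with the energy of the residual.

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between an estimated and a reference signal."""
    est = est - est.mean()                 # remove DC offsets
    ref = ref - ref.mean()
    scale = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = scale * ref                   # projection of est onto ref
    noise = est - target                   # residual, orthogonal to ref
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

ref = np.random.randn(8000)
assert si_sdr(3.0 * ref, ref) > 50         # rescaling does not hurt the score
noisy = ref + 0.1 * np.random.randn(8000)
print(round(si_sdr(noisy, ref), 1))        # roughly 20 dB for 10% noise
```

The scale invariance is the reason SI-SDR is considered more robust than SDR for this task: a system cannot improve its score by simply rescaling its output.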
We report the results of 10 experiments in two groups. The first 9 experiments are carried out on the two-speaker mixture database, while the last experiment is on the three-speaker mixture database.
IV-A Experiments on Two-Speaker Mixture
IV-A1 Frequency-domain vs. Time-domain
In this experiment, we compare two processing paradigms: the frequency-domain and the proposed time-domain methods. For the frequency-domain implementation, we adopt the STFT and inverse STFT as the speech encoder and decoder in Figure 2, respectively. For the time-domain implementation, we adopt the speech encoder and decoder proposed in Section II. In both systems, we adopt the i-vector extractor as the speaker encoder. As the i-vector extractor is trained independently of the speaker extraction pipeline, this comparison focuses on the frequency-domain and time-domain speaker extraction pipelines. As the frequency-domain method uses a fixed short-time window of 256 samples, the time-domain systems are also implemented with a single short-time window, i.e., single scale as opposed to the multi-scale approach discussed in Section II-A, for a fair comparison.
We observe from Table I, that the time-domain speaker extraction systems (System 2-13) consistently outperform the frequency-domain counterpart (System 1), especially when time-domain systems have fewer than or roughly the same number of parameters as the frequency-domain system.
The results clearly show the advantage of the trainable speech encoder and decoder over the static STFT and inverse STFT of the frequency-domain method. We attribute the better performance to the use of embedding coefficients in place of magnitude and phase spectra, which avoids the need for phase estimation.
IV-A2 Single-scale vs. Multi-scale
In this experiment, we validate the idea of multi-scale speech embedding. We continue to use the i-vector extractor as the speaker encoder. From the experiments reported in Table I, we observe that systems with more parameters perform better. By varying the filter length of the convolution layer in the speech encoder from System 9 to 13, we observe that the change of time-frequency resolution of the embedding coefficients has an impact on system performance. The best SDR of 13.1dB is achieved with a filter length of 20 samples (). The best SI-SDR of 12.4dB is achieved with filter lengths of 20 samples () and 80 samples (). The best PESQ of 2.94 is achieved with a filter length of 256 samples (). This finding is similar to that in speech recognition experiments [multiscale2].
To benefit from the different time-frequency resolutions, we propose to have three 1-D CNNs with different filter lengths, short, middle, and long, in the speech encoder. The speaker extractor and speech decoder are also extended to be compatible with the multi-scale speech embedding, as shown in Figure 3. The speaker extractor estimates a mask for the target speaker at each scale. The speech decoder reconstructs a time-domain signal for each scale from the modulated response.
We explore different system configurations, summarized as System 14-24 in Table II. A comparison between System 9 and System 14 shows that the multi-scale speech encoder achieves better performance than the single-scale speech encoder. Since the speech decoder has multiple outputs with the multi-scale speech embeddings, we can optimize the SpEx network with a weighted multi-scale SI-SDR loss, as defined in Eq. 4. With the multi-scale speech encoder and decoder, the best SDR, SI-SDR and PESQ of dB, dB and are achieved when the weights and in Eq. 4 are tuned to be and . Compared with the single-scale system, the multi-scale SpEx improves the SDR by , the SI-SDR by , and the PESQ by . Comparisons between System 16 and Systems 22-24 show that the best performance is achieved by picking the output stream with the short window (high temporal resolution). By picking only this reconstructed signal instead of the weighted summation (s = (1-α-β)s1 + αs2 + βs3), the number of parameters at evaluation time is smaller than that during training.
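The weighted multi-scale loss can be sketched as below, combining the negative SI-SDR of the three decoded scales. The weight values alpha and beta are placeholders, not the tuned weights reported in Table II:

```python
import numpy as np

def neg_si_sdr(est, ref, eps=1e-8):
    """Negative SI-SDR (dB), so that lower is better as a training loss."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return -10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

def multi_scale_loss(s1, s2, s3, ref, alpha=0.1, beta=0.1):
    """Weighted multi-scale SI-SDR loss over the three decoded scales,
    in the spirit of Eq. 4 (alpha/beta values here are placeholders)."""
    return ((1 - alpha - beta) * neg_si_sdr(s1, ref)
            + alpha * neg_si_sdr(s2, ref)
            + beta * neg_si_sdr(s3, ref))

# Three mock decoder outputs with increasing residual noise
ref = np.random.randn(8000)
outs = [ref + s * np.random.randn(8000) for s in (0.1, 0.2, 0.3)]
loss = multi_scale_loss(*outs, ref)
print(round(loss, 2))   # more negative = better reconstruction
```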
IV-A3 I-vector vs. Speaker Embedding
We have observed that the i-vector is effective in speaker characterization for both single-scale and multi-scale speaker extraction networks, as reported in Tables I and II. We note that the i-vector is extracted independently of the speaker extraction network. In this experiment, we replace the i-vector extractor with the speaker encoder. The speaker encoder is trained jointly with the other components of the network using both the cross-entropy loss for speaker classification and the multi-scale SI-SDR loss for speaker extraction, as Systems 25 to 31 in Table III.
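The multi-task objective combines the extraction loss with a weighted speaker-classification cross-entropy on the speaker-encoder output. A minimal sketch, assuming a scalar weight (here called gamma, a placeholder for the tuned weight mentioned below) and toy logits over three training speakers:

```python
import numpy as np

def cross_entropy(logits, label):
    """Speaker-classification sub-loss: negative log-softmax probability
    of the true speaker label."""
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def multi_task_loss(extraction_loss, logits, label, gamma=0.5):
    """Total loss = multi-scale SI-SDR loss + gamma * cross-entropy.
    gamma is an illustrative value, not the paper's tuned weight."""
    return extraction_loss + gamma * cross_entropy(logits, label)

logits = np.array([2.0, 0.5, -1.0])   # mock scores for 3 training speakers
total = multi_task_loss(-15.0, logits, label=0)
print(round(total, 3))
```

With gamma set to zero this reduces to the extraction-only training used by the i-vector systems, which makes the ablation in Table III straightforward.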
We obtain the best SDR and SI-SDR of dB and dB when the weight for the cross-entropy sub-loss is tuned to be . Compared with the i-vector based system (System 16 in Table II), the joint optimization of the speaker encoder and the speaker extraction pipeline (System 27 in Table III) with multi-task learning achieves relative improvements of in terms of SDR, in terms of SI-SDR, and in terms of PESQ. As the SpEx network with joint optimization (Figure 3) achieves the best performance, we use this configuration hereafter.
IV-A4 Benchmark against the Baselines
We compare the SpEx network as illustrated in Figure 3 with four competitive baselines [delcroix2018single, xu2019optimization]. As can be seen in Table IV, the SpEx network shows , and relative improvements over the best baseline, SBF-MTSAL-Concat, in terms of SDR, SI-SDR and PESQ under the open condition.
The time-domain speaker extraction architecture has shown three clear advantages over its frequency-domain counterparts.
(1) Because the SpEx network doesn’t decompose the speech signal into magnitude and phase spectra, it avoids inexact phase estimation.
(2) The SpEx network benefits from the long-range dependency of the speech signal captured by the stacked dilated depth-wise separable convolution with a manageable number of parameters. Without the recurrent connection, the SpEx method can be easily parallelized for fast training and inference.
(3) The SpEx network takes advantage of multi-scale speech embedding for good coverage of time-frequency resolutions in the encoding; unlike short-time frequency analysis, it doesn’t have to trade time resolution for frequency resolution.
As an example, we illustrate the speaker extraction from a female-female mixture by the competitive baseline systems and the proposed SpEx network in Figure 5. From the log magnitude spectra, we observe that the proposed SpEx network outperforms the baseline systems in terms of the quality and purity of the recovered signal. Some listening examples are available online (https://xuchenglin28.github.io/files/taslp2019/index.html), of which the first example is the audio illustrated in Figure 5.
IV-A5 Different Gender vs. Same Gender
Generally speaking, speakers of the same gender sound more alike than speakers of different genders. We further report the results of the experiments in Table IV for different-gender and same-gender mixtures separately. We observe in Table V that the performance on different-gender mixtures is always better than that on same-gender mixtures. This has also been observed in human listening tests, as reported by Treisman [treisman1964selective] in a behavioural study, which found that a difference in voice (i.e., male versus female) allows more efficient rejection of the irrelevant signal when messages are mixed and played to both ears (i.e., diotic).
From Table V, we also observe that the proposed SpEx network achieves and relative SDR improvements, and and relative PESQ improvements, over the best baseline, SBF-MTSAL-Concat, for the different-gender and same-gender conditions, respectively.
IV-A6 Mixture with Different SNR
It is of interest to investigate how the proposed SpEx network performs on mixture speech of different SNRs, where we consider the target speech as the foreground and the interference as the background noise. We train a SpEx network on the dataset with the SNR range of [0, 5]dB as described in Section III-A. The same SpEx network has been reported in Tables IV and V.
We divide the test set into SNR groups, namely [0, 1)dB, [1, 3)dB and [3, 5]dB. The results are summarized in Table VI. Unsurprisingly, test data of higher SNR yields better performance than that of lower SNR. We also observe that the proposed SpEx network achieves , and relative SDR improvements over the best baseline system, SBF-MTSAL-Concat, for the [0, 1)dB, [1, 3)dB and [3, 5]dB SNR groups, respectively. Since the SNR of the simulated database is limited to the range of 0dB to 5dB, in future work we will investigate wider SNR ranges, e.g., from -10dB to 20dB.
IV-A7 Subjective Evaluation
Since SBF-MTSAL-Concat represents the best baseline performance in the objective evaluation, we only conducted an A/B preference test between the proposed SpEx network and the SBF-MTSAL-Concat baseline to evaluate signal quality and intelligibility in a listening test. We randomly selected pairs of listening examples, each including the original target speaker's reference and the two signals extracted for the target speaker by the proposed SpEx network and the best baseline system. We invited a group of subjects to select their preference according to quality and intelligibility. The listeners were asked to pay special attention to the amount of perceived distortion and interference from the background. In each test, the subject listened to three audio clips in a group: the reference speech was played first, followed by the extracted speech from the two systems in random order. The subjects were not told which speech came from which system.
We observe from Figure 6 that the listeners clearly favor the proposed SpEx network, with a preference score of as opposed to for the best SBF-MTSAL-Concat system. The preference for the SpEx network is statistically significant at a significance level of , owing to its lower distortion and inter-speaker interference compared to the best baseline.
IV-A8 Duration of the Reference Speech
As speaker extraction relies on the reference speech of the target speaker to develop the top-down voluntary focus, the duration of the reference speech plays a role in the process. We further look into the impact of this duration on speaker extraction performance. In the aforementioned experiments, the duration of the reference speech in the training, development and test sets is set to “Random” as described in Section III-A. We now compare the “Random” setting with different duration groups (s, s, s and s) in the test set. The experimental results are summarized in Table VII.
Since the average duration of the reference speech in the “Random” condition of the test set is s, we first evaluate the performance on the test subset with reference speech of duration s. It is noted that the results are similar between the “Random” condition and the s subset. When we increase the duration of the reference speech in the test set to s, s and s, we observe that a longer duration leads to better results in general. When we fix the duration of the reference speech at s for the training and development sets, the performance drops slightly compared with that under the “Random” condition. However, we continue to observe that a longer reference speech at test time always helps.
IV-A9 Comparisons with Speech Separation on WSJ0-2mix
Most speech separation methods conduct their experiments on the well-known WSJ0-2mix database. To compare with speech separation methods, we trained the proposed SpEx model on the WSJ0-2mix database to extract each speaker in the mixture, given a reference speech of the corresponding speaker. In addition, we re-implemented the Conv-TasNet method [luo2018tasnet_arvix] with the same optimization scheme as our proposed SpEx, as described in Section III-A.
From Table VIII, we observe that the proposed SpEx achieves performance comparable to Conv-TasNet [luo2018tasnet_arvix] with the same TCN architecture. While SpEx and Conv-TasNet are comparable in performance, SpEx, like other speaker extraction techniques, offers unique advantages over speech separation techniques in real-world applications.
As SpEx relies heavily on the quality of the speaker embeddings, it is encouraging that the proposed speaker encoder outperforms the i-vector extractor (see Table III). In future work, we will further investigate the performance of SpEx on databases with more speakers than WSJ0-2mix (101 speakers).
IV-B Experiments on Three-Speaker Mixture
The proposed SpEx network has the inherent ability to extract speech from mixtures of more than two speakers with the same network architecture. We train the SpEx system under three conditions: only two-speaker mixture data, only three-speaker mixture data, and the combination of two- and three-speaker mixture data. We then evaluate the trained SpEx systems on two-speaker and three-speaker mixed test data, respectively. From Section IV-A8, we know that a longer reference speech in the test set achieves better performance. We keep the duration of the reference speech at s and s for comparison on both the two-speaker and three-speaker mixed test data in this experiment.
From Table IX, we observe that the performance on the two-speaker mixtures is always better than on the three-speaker mixtures for the SpEx systems trained under all three conditions. This is consistent with the findings of a subjective evaluation of human performance, where both listening comprehension and auditory attention decreased significantly as the number of simultaneous audio channels increased [stifelman1994cocktail]. Table IX further confirms that a longer reference speech achieves better performance, because a longer reference speech yields a better speaker embedding.
V Discussions and Conclusions
We propose an end-to-end speaker extraction network (SpEx) that emulates humans’ ability of selective auditory attention. The SpEx network forms a top-down voluntary focus by using the reference speech of the target speaker. It is particularly useful in cases where speakers are pre-registered to the system, for example, in speaker verification [rao2019target_is] where the target speaker is known to the system through enrollment.
The SpEx network also overcomes the phase estimation issue in frequency-domain speaker extraction. The improvements are attributed to the dilated convolutional encoder-decoder framework that operates in the time domain, the multi-scale encoding and decoding, and the multi-task learning algorithm. Our experiments show that the SpEx network significantly outperforms its frequency-domain counterparts.
The human ability to detect a particular signal amid interfering speech or background noise is greatly improved with two ears [arons1992review]. Previous studies [chen2017cracking, wang2018multi] on multi-channel speech separation have shown impressive improvements, particularly in the presence of reverberation and multiple interfering speakers. Similarly, we may improve speaker extraction performance under those adverse conditions by extending the SpEx network to multi-channel inputs, which will be an extension of this work. In addition, SpEx could be extended with the dual-path RNN of DPRNN-TasNet [luo2019dual] in place of the TCN blocks for improved speech quality.
Humans tend to perceive sounds as coming from the locations of visual events [arons1992review]; for example, when we watch television, an actor's voice appears to emanate from his mouth regardless of where the loudspeaker is located. The speaker encoder mechanism in this paper allows for an easy implementation of an audio-visual speaker encoder, which would strengthen the top-down voluntary focus in selective auditory attention.
Brain-computer interfaces help connect the human brain with assistive devices, e.g., hearing aids. To assist people with hearing impairment, it would be interesting to study how SpEx can take non-invasive electro-encephalography (EEG) [o2015attentional] or invasive electro-corticography (ECoG) [han2019speaker] signals, instead of a reference speech, as input to decode the speech of the attended speaker.
In summary, the proposed SpEx network marks another step towards solving the cocktail party problem. It will potentially improve the performance of many down-stream speech processing applications, such as speaker verification [rao2019target_is] and speaker diarization.