Human interaction often takes place in complex auditory scenes that contain several speech sources from different speakers along with various noises. This complexity poses challenges for many speech technologies, because they usually assume that at most one speaker is active at any time [haeb2019speech]. Many techniques have been studied to tackle these challenging scenes.
Speech separation aims at isolating individual speakers’ voices from a recording with overlapped speech [huang2014deep, wang2016discriminative, hershey2016deep, isik2016single, yu2017permutation, chen2017deep, Drude2018Deep]. With the separation results, both speech intelligibility for human listeners and speech recognition accuracy can be improved [zeghidour2020wavesplit]. Unlike the separation task, speaker extraction makes use of additional information to distinguish a target speaker from the other participating speakers [delcroix2018single, wang2019voicefilter, xu2018modeling, xu2020spex]. In addition, speech denoising [donahue2018exploring, rethage2018wavenet] and speaker diarization [FujitaKHNW19, fujita2020end] have also been studied as ways to handle complex acoustic scenes.
Although many methods have been proposed for each of the tasks mentioned above, processing natural recordings remains challenging. Each of these tasks is designed to solve one particular problem, under assumptions that do not hold in complex speech recordings. For instance, speech separation has mostly been explored with pre-segmented audio samples of several seconds (less than 10 seconds), which makes it difficult to produce reasonable results for long recordings: most existing separation methods output only a fixed number of speech sources in an arbitrary order, so they cannot handle a variable number of speakers or keep the output order consistent across segments. Similarly, speaker diarization traditionally bypassed the overlapped parts. The recently proposed EEND approaches [FujitaKHNW19, fujita2020end] address overlapped speech to some extent. However, the diarization results remain an intermediate product, because the speech of each speaker is not extracted, especially in the overlapped parts.
To address these limitations, we believe that integrating speaker information (as used in speaker extraction and speaker diarization) into speaker-independent tasks (e.g., speech separation, speech denoising, and even speech recognition) will help broaden the application of these techniques to real scenes. Specifically, we reformulate the speech separation/extraction task with the probabilistic chain rule by introducing a conditional probability based on speaker information. In practice, our model automatically infers the speakers’ identities and then takes them as conditions to extract the speech sources. The speaker information here is a learned hidden representation related to the speaker’s identity, which also makes the model suitable for open speaker tasks. We believe this design better meets the expectations for an intelligent front-end speech processing pipeline, because users usually want not only the extracted clean speech sources but also to know who spoke what.
In this work, we propose the Speaker-Conditional Chain Model (SCCM) to separate the speech sources of different speakers from overlapped speech. The proposed method can also handle a long recording with multiple rounds of utterances spoken by different speakers. With this model, we verify its effectiveness in obtaining both the identity information of each speaker and the corresponding extracted speech sources.
The contributions of this paper span the following aspects: (1) we build a common chain model for processing speech with one or more speakers. Through the inference-to-extraction pipeline, our model handles a variable and even unknown number of speakers; (2) with the same architecture, our model shows performance comparable to the base model, while additionally offering accurate speaker identity information for downstream use; (3) we demonstrate the effectiveness of this design for both short overlapped segments and long recordings with multi-round conversations; (4) we analyze the advantages and drawbacks of this model. Our demo video and Supplementary Material are available at https://shincling.github.io/.
2 Related work
2.1 Speech separation
As the core part of the cocktail party problem [cherry1953some], speech separation has gained much attention recently. The common design of this task is to disentangle fully overlapped speech signals from a given short mixture (less than 10 seconds) with a fixed number of speakers. Under this design, speaker-agnostic separation approaches have been intensively studied, from spectrogram-based methods [hershey2016deep, isik2016single, yu2017permutation, Kolbaek2017Multitalker, Luo2018Speaker] to time-domain methods [luo2018real-time, luo2018tasnet, luo2019dual]. However, despite the steady improvement in performance, most existing approaches might overfit to fully overlapped audio data, which is far from the natural situation, where the overlap ratio in conversations is less than 20% [ccetin2006analysis]. Besides, most existing separation models need to know the number of speakers in advance and can only handle data with that same number of speakers [shi2018listen]. These constraints further limit their application to real scenes, whereas our proposed SCCM provides a solution to both the sparse overlap and the unknown speaker number issues. A similar idea, recurrent selective attention networks [kinoshita2018listening], has been proposed before to tackle the variable number of speakers in separation. However, that model operates on residual spectrograms and does not leverage time-domain methods. Moreover, its uPIT-based [Kolbaek2017Multitalker] training has difficulty processing a long recording, because of the speaker tracing problem that arises when the long recording is chunked into short segments.
2.2 Speaker extraction
Another task related to our model is speaker extraction [delcroix2018single, wang2019voicefilter, xu2018modeling, xu2020spex]. The idea of speaker extraction is to provide a reference from a speaker and then use that reference to direct the attention to the specified speaker. The reference may be taken from different characteristics bound to the specific speaker, such as voiceprint, location, onset/offset information, and even visual representation [Ephrat2018Looking]. The speaker extraction technique is particularly useful when the system is expected to respond to a specific target speaker. However, for a meeting or conversation with multiple speakers, the need for additional references makes it inconvenient. In our work, the reference is inferred directly from the original recording, which is an advantage when a complete analysis of each speaker is needed.
3 Speaker-conditional chain model
This section describes our Speaker-Conditional Chain Model (SCCM). As illustrated in Figure 1, the chain here refers to a pipeline through two sequential components: speaker inference and speech extraction. These models are integrated based on a joint probability formulation, which will be described in Section 3.1. Speaker identities play an important role in our strategy. The speaker inference module aims to predict the possible speaker identities and the corresponding embedding vectors. The speech extraction module takes each embedding from the speaker inference module as the query to disentangle the corresponding source audio from the input recording.
This design brings several advantages. First, the possible speakers are inferred by a sequence-to-sequence model with an end-of-sequence label, which easily handles variable and unknown numbers of speakers. Second, the inference part is based on a self-attention network, which utilizes the full context information in a recording to form a speaker embedding. This avoids the computational inefficiency of some clustering-based models [zeghidour2020wavesplit, hershey2016deep, isik2016single], which need an iterative k-means algorithm in each frame. Third, the information about each speaker makes our model suitable for further applications such as speaker diarization or speaker tracking.
3.1 Problem setting and formulation
Assume there is a training dataset with a set of speaker identities $\mathcal{Y}$ containing $N$ known distinct speakers in total. In a $T$-length segment of waveform observation $O \in \mathbb{R}^{T}$, there are $I$ different speakers $Y = (y^{i} \in \mathcal{Y})_{i=1}^{I}$. Each speaker $y^{i}$ has the corresponding speech source $s^{i} \in \mathbb{R}^{T}$, and together these form the set of sources $S = (s^{i})_{i=1}^{I}$. (Footnote 1: Although $y^{i} \in \mathcal{Y}$ during training, potentially $y^{i} \notin \mathcal{Y}$ during inference in the open speaker task, where the system could still provide a meaningful speaker embedding vector for downstream applications.)
The basic formulation of our strategy is to estimate the joint probability of speaker labels and corresponding sources, i.e., $p(S, Y \mid O)$. This is factorized into the speaker inference probability $p(Y \mid O)$ and the speech extraction probability $p(S \mid Y, O)$ as follows:
$$p(S, Y \mid O) = p(Y \mid O)\, p(S \mid Y, O). \tag{1}$$
We further factorize each probability distribution based on the probabilistic chain rule.
The speaker inference probability $p(Y \mid O)$ in Eq. (1) recursively predicts variable numbers of speaker identities as follows:
$$p(Y \mid O) = \prod_{i=1}^{I} p(y^{i} \mid O, y^{1}, \ldots, y^{i-1}). \tag{2}$$
The speech extraction probability $p(S \mid Y, O)$ in Eq. (1) is also factorized by using the probabilistic chain rule and the conditional independence assumption, as follows:
$$p(S \mid Y, O) = \prod_{i=1}^{I} p(s^{i} \mid s^{1}, \ldots, s^{i-1}, Y, O) \approx \prod_{i=1}^{I} p(s^{i} \mid y^{i}, O). \tag{3}$$
As illustrated in Figure 1(c), our speech extraction module takes the speaker identity $y^{i}$, which is predicted by the speaker inference module in Eq. (2), to conduct a conditional extraction. Each speaker's information serves as the condition that guides the subsequent extraction. For multi-round long recordings, the speaker information is formed as global information from the whole observation in order to track the specific speaker. The network architecture of $p(s^{i} \mid y^{i}, O)$ will be discussed in Section 3.3.
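The inference-then-extraction chain described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: `run_chain`, `infer_next_speaker`, `extract_source`, the `max_speakers` cap, and the toy stand-in functions are all hypothetical names introduced here, and the two callables stand in for the trained speaker inference and speech extraction modules.

```python
def run_chain(observation, infer_next_speaker, extract_source, max_speakers=10):
    """Run speaker inference until end-of-sequence, then extract one source per speaker.

    infer_next_speaker(observation, embeddings_so_far) -> embedding, or None for end-of-sequence
    extract_source(observation, embedding) -> estimated source for that speaker
    Both callables are hypothetical stand-ins for the two trained modules.
    """
    embeddings = []
    while len(embeddings) < max_speakers:  # safety cap, an assumption of this sketch
        embedding = infer_next_speaker(observation, embeddings)
        if embedding is None:  # end-of-sequence label predicted
            break
        embeddings.append(embedding)
    # Conditional independence: each source depends only on its own
    # speaker embedding and the observation.
    return [(emb, extract_source(observation, emb)) for emb in embeddings]


# Toy stand-ins: "infer" two speakers and then stop; "extract" by scaling the input.
def toy_infer(observation, found):
    return None if len(found) == 2 else float(len(found) + 1)


def toy_extract(observation, embedding):
    return [embedding * x for x in observation]


outputs = run_chain([0.5, -0.5], toy_infer, toy_extract)
```

Because the loop stops on the end-of-sequence signal rather than after a preset count, the same code path covers one, two, or more speakers.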
3.2 Speaker inference module
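The speaker inference module adopts a self-attention based architecture as the encoder-decoder structure. In this part, we take the observation spectrogram (Short-Time Fourier Transform (STFT) coefficients) as input. The reason we do not use the time-domain approach here is to avoid excessive computational complexity, which may consume too much GPU memory when training the model, especially with long recordings as input.
In detail, a given spectrogram $X \in \mathbb{R}^{F \times T'}$ containing $T'$ frames and $F$ frequency bins is viewed as a sequence of frames. For the encoder part, we use the Transformer Encoder as follows:
$$E_{0} = \mathrm{Linear}^{(F \to D)}(X) \in \mathbb{R}^{D \times T'}, \tag{4}$$
$$E_{j} = \mathrm{Encoder}_{j}(E_{j-1}), \quad 1 \le j \le J, \tag{5}$$
where $\mathrm{Linear}^{(F \to D)}(\cdot)$ is a linear projection that maps an $F$-dimensional vector to a $D$-dimensional vector for each column of the input matrix.
$\mathrm{Encoder}_{j}(\cdot)$ is the Transformer Encoder block that contains a multi-head self-attention layer, a position-wise feed-forward layer, and residual connections. By stacking the encoder $J$ times, $E_{J} \in \mathbb{R}^{D \times T'}$ is the output of the encoder part.
For the decoder part, the neural network outputs the probability distribution $\hat{p}^{i}$ for the $i$-th speaker, calculated as follows:
$$e^{i}_{\mathrm{pos}} = \mathrm{Linear}^{(1 \to D)}(i), \tag{6}$$
$$h^{i} = \mathrm{Decoder}(E_{J}, h^{i-1}, e^{i}_{\mathrm{pos}}), \tag{7}$$
$$\hat{p}^{i} = \mathrm{Softmax}\big(\mathrm{Linear}^{(D \to N+1)}(h^{i})\big), \tag{8}$$
where $e^{i}_{\mathrm{pos}}$ is the positional encoding at each step of speaker prediction. $\mathrm{Decoder}(\cdot)$ is the Transformer Decoder block, which takes the states from the output of the encoder and the hidden state from the previous step to output the speaker embedding $h^{i}$ at this step. Finally, a linear projection with a softmax produces an $(N+1)$-dimensional vector $\hat{p}^{i}$ as the network output, where $\hat{p}^{i}$ is the $i$-th predicted probability distribution over the union of the speaker set and the additional end-of-sequence label $\langle\mathrm{EOS}\rangle$, i.e., $\mathcal{Y} \cup \{\langle\mathrm{EOS}\rangle\}$.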
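To make the final output layer concrete, the following minimal sketch applies a softmax over the known speakers plus one end-of-sequence class and reads off the decision for a single decoding step. It assumes, purely for illustration, that the end-of-sequence label occupies the last index; `softmax` and `decode_step` are hypothetical helper names and the logits are toy values rather than model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(logits, num_known_speakers):
    """Pick the most probable label for one decoding step.

    Assumption of this sketch: indices 0..N-1 are the known speakers and
    index N is the end-of-sequence label. Returns (label, probability),
    where label is None when end-of-sequence is predicted.
    """
    if len(logits) != num_known_speakers + 1:
        raise ValueError("expected one logit per known speaker plus end-of-sequence")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = None if best == num_known_speakers else best
    return label, probs[best]


first_label, first_prob = decode_step([0.1, 2.0, 0.3, -1.0], num_known_speakers=3)
stop_label, stop_prob = decode_step([0.0, 0.0, 0.0, 5.0], num_known_speakers=3)
```

Decoding would repeat this step until the end-of-sequence label is chosen, which is what lets the number of predicted speakers vary per recording.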
3.3 Speech extraction module
For the speech extraction module, each speaker channel is processed independently, as formulated in Eq. (3). This part takes each inferred speaker embedding $h^{i}$ predicted in Eq. (7), instead of the identity $y^{i}$, together with the raw waveform $O$ as input to produce the corresponding clean signal $\hat{s}^{i}$:
$$\hat{s}^{i} = \mathrm{Extractor}(O, h^{i}), \tag{9}$$
where $\mathrm{Extractor}(\cdot)$ takes an architecture similar to the time-domain speech separation method Conv-TasNet [luo2018tasnet]. The difference is that we output one channel for each speaker embedding rather than separating several sources together. Specifically, at the end of the separator module in [luo2018tasnet], we concatenate the embedding $h^{i}$ with each frame of the output features. Then, a single-channel operation is conducted for this speaker, rather than a multi-channel one (with as many channels as speakers in the mixture). Besides this simple fusion approach, we tested several other methods of integrating the condition vector into the model, for example, concatenating it at the beginning of the separator, or using a method similar to [zeghidour2020wavesplit] with FiLM [perez2018film] in each block of TasNet’s separator. However, we found that both of these alternatives cause severe overfitting.
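The simple fusion described above, where the speaker embedding is concatenated with every frame of the separator output, can be illustrated with plain Python lists. This is a sketch under assumptions: `fuse_speaker_embedding` is a hypothetical helper, and the frame-major layout (one list of channel values per frame) is chosen here for readability and is not taken from the original implementation.

```python
def fuse_speaker_embedding(features, embedding):
    """Append the same speaker embedding to every frame of the feature sequence.

    features:  list of frames, each a list of C channel values (frame-major layout,
               an assumption of this sketch)
    embedding: list of D values for one inferred speaker
    returns:   list of frames, each with C + D values
    """
    return [list(frame) + list(embedding) for frame in features]


separator_output = [[1, 2], [3, 4], [5, 6]]   # 3 frames, 2 channels (toy values)
speaker_embedding = [9, 8]                    # 2-dimensional toy embedding
fused = fuse_speaker_embedding(separator_output, speaker_embedding)
```

Running the function once per inferred embedding gives one conditioned feature sequence per speaker, matching the single-channel-per-speaker design.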
3.4 Training targets
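Our whole model is trained end-to-end with the loss $\mathcal{L}$, which corresponds to optimizing the joint probability in Eq. (1). $\mathcal{L}$ is calculated from both the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$, which handles speaker inference (Section 3.2), and the source reconstruction loss (SI-SNR) $\mathcal{L}_{\mathrm{SI\text{-}SNR}}$, which handles speech extraction in a non-probabilistic manner (Section 3.3). One critical problem in training SCCM is deciding the order of the inferred speakers. For one possible permutation $\pi$, the speaker list and the speech sources are re-ordered synchronously as follows:
$$Y_{\pi} = (y^{\pi(1)}, \ldots, y^{\pi(I)}), \tag{10}$$
$$S_{\pi} = (s^{\pi(1)}, \ldots, s^{\pi(I)}). \tag{11}$$
Some previous works have shown that the seq2seq structure helps to improve the accuracy of the inference module when a fixed order is set in training [shi2019ones]. We compared several options, including a random fixed order and the order defined by the energy in the spectrogram (observed to work well in [weng2015deep]), but we found that the order decided by the model itself performs better in practice. Therefore, we take the best permutation, i.e., the one with the least reconstruction error in the extraction part, as the order used to train the inference part, as follows:
$$\hat{\pi} = \mathop{\arg\min}_{\pi \in \mathcal{P}} \mathcal{L}_{\mathrm{SI\text{-}SNR}}(\hat{S}, S_{\pi}), \tag{12}$$
$$\mathcal{L} = \mathcal{L}_{\mathrm{SI\text{-}SNR}}(\hat{S}, S_{\hat{\pi}}) + \alpha\, \mathcal{L}_{\mathrm{CE}}(\hat{Y}, Y_{\hat{\pi}}), \tag{13}$$
where $\mathcal{P}$ is the set of all permutations and $\alpha$ is a weighting ratio whose value is kept fixed in all our experiments.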
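The choice of the permutation with the least reconstruction error can be sketched as an exhaustive search using the negative SI-SNR as the loss. The sketch below is an illustration under assumptions rather than the training code: `si_snr` follows the commonly used zero-mean, scale-invariant definition, `best_permutation` is a hypothetical helper, and the signals are short toy sequences.

```python
import itertools
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals (zero-mean version)."""
    n = len(target)
    est_mean, tgt_mean = sum(estimate) / n, sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    projection = [dot / tgt_energy * t for t in tgt]      # part of est explained by tgt
    noise = [e - p for e, p in zip(est, projection)]      # everything else
    proj_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(proj_energy / noise_energy + eps)


def best_permutation(estimates, targets):
    """Return (perm, loss): estimates[i] is paired with targets[perm[i]]; loss is mean negative SI-SNR."""
    best_perm, best_loss = None, float("inf")
    for perm in itertools.permutations(range(len(targets))):
        loss = -sum(si_snr(estimates[i], targets[p]) for i, p in enumerate(perm)) / len(perm)
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss


# Two orthogonal toy sources; the estimates come out in swapped order with slight leakage.
source_a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
source_b = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
estimates = [
    [b + 0.1 * a for a, b in zip(source_a, source_b)],
    [a + 0.1 * b for a, b in zip(source_a, source_b)],
]
perm, loss = best_permutation(estimates, [source_a, source_b])
```

The selected permutation would then also be used to re-order the speaker labels, so that the cross-entropy target for the inference part follows the order the extraction part found easiest. The search is factorial in the number of speakers, which is acceptable for the two or three speakers considered here.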
As a generalized framework for extracting the speech sources of all speakers, SCCM was tested on different tasks. Besides the signal reconstruction quality (e.g., SDRi, SI-SNRi) used in the speech separation task, we also verified the performance on speaker identification and speech recognition. In our experiments, all data are resampled to 8 kHz. For the speaker inference module, the magnitude spectra are used as the input feature, computed from the STFT with a 32 ms window length, an 8 ms hop size, and the sine window. A more detailed configuration of the proposed architecture can be found in Section A.1 of our Supplementary Material (https://drive.google.com/open?id=1aqJy465dLHaWPdMqG-BgjAgYEg70q7as).
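As a quick worked example of the front-end settings above, the sketch below converts the 32 ms window and 8 ms hop at 8 kHz into sample counts and derives the resulting spectrogram size. Several details are assumptions made for illustration only: the FFT size is taken to equal the window length, frames are counted without padding, and the sine window uses the half-sample-offset form; `sine_window` and `num_frames` are hypothetical helpers.

```python
import math

SAMPLE_RATE = 8000   # Hz, as stated in the text
WINDOW_MS = 32       # window length in milliseconds
HOP_MS = 8           # hop size in milliseconds

window_length = SAMPLE_RATE * WINDOW_MS // 1000   # samples per window
hop_length = SAMPLE_RATE * HOP_MS // 1000         # samples between frame starts
num_freq_bins = window_length // 2 + 1            # assumes FFT size == window length


def sine_window(length):
    """Sine window with a half-sample offset (one common definition, assumed here)."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def num_frames(num_samples):
    """Number of full frames when no padding is applied (an assumption of this sketch)."""
    if num_samples < window_length:
        return 0
    return 1 + (num_samples - window_length) // hop_length


frames_in_four_seconds = num_frames(4 * SAMPLE_RATE)
```

Under these assumptions a window spans 256 samples, the hop is 64 samples, and each frame has 129 frequency bins.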
4.1 Speech separation for overlapped speech
First, we evaluated our method on fully overlapped speech mixtures from the Wall Street Journal (WSJ0) corpus. The WSJ0-2mix and 3mix datasets are the benchmarks designed for speech separation in [hershey2016deep]. For the validation set, we used the so-called Closed Condition (CC) of [hershey2016deep, isik2016single], where the speakers are all from the training set. In contrast, for the evaluation set, we used the Open Condition (OC), which contains unseen speakers. For the separation performance, we compare our results with TasNet, which is our base model described in Section 3.3, without changing any hyper-parameter. Table 1 lists the speech separation performance for the different training sets.
Table 1 shows that our SCCM performs slightly worse than the base model in OC with the same architecture and training dataset. However, unlike separation methods with a fixed number of speakers, SCCM can be trained and tested with a variable number of speakers using a single model, thanks to our speaker-conditional strategy with the sequence-to-sequence model. As expected, training with both the WSJ0-2mix and WSJ0-3mix datasets performs better in the closed condition than training with each dataset alone. Although we did not achieve an obvious improvement in the OC case, careful tuning based on the cascading technique (similar to the methods used in [Kolbaek2017Multitalker]) yields a notable improvement in separation performance, which also exceeds the base model. For the SCCM+ model, we feed the extracted speech source, along with the raw observation, into another extraction module (TasNet). With this cascading method, the details of the extracted source are further refined, which may resolve the ambiguity caused by the independence assumption in Eq. (3).
Also, since speaker inference is the first node in the chain, its ability to predict the correct speakers and to produce distinct, informative embeddings is crucial. Table 2 shows the performance of the speaker inference module discussed in Section 3.2. For the CC, micro-F1 is calculated to evaluate the correctness of the predicted speakers. For the OC, we use the speaker counting accuracy to measure the speaker inference module, since a correct speaker count is a prerequisite for the subsequent speech extraction module. The results show that the speaker inference module in SCCM can reasonably infer the correct speaker identities in CC and the correct number of speakers in OC.
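The two metrics mentioned above can be computed as in the following sketch, which treats each mixture as a set of predicted speaker labels compared against a set of reference labels. The helper names `micro_f1` and `counting_accuracy` and the toy labels are hypothetical, and the exact evaluation protocol of the paper may differ in details.

```python
def micro_f1(predicted_sets, reference_sets):
    """Micro-averaged F1 over all mixtures: pool true/false positives and false negatives."""
    tp = fp = fn = 0
    for predicted, reference in zip(predicted_sets, reference_sets):
        predicted, reference = set(predicted), set(reference)
        tp += len(predicted & reference)   # correctly predicted speakers
        fp += len(predicted - reference)   # predicted but not present
        fn += len(reference - predicted)   # present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def counting_accuracy(predicted_counts, reference_counts):
    """Fraction of mixtures for which the predicted number of speakers is correct."""
    correct = sum(p == r for p, r in zip(predicted_counts, reference_counts))
    return correct / len(reference_counts)


f1 = micro_f1([{"a", "b"}, {"c", "d"}], [{"a", "b"}, {"c", "e"}])
accuracy = counting_accuracy([2, 3, 2, 2], [2, 3, 3, 2])
```

Micro-F1 needs the reference identities and therefore fits the closed condition, while counting accuracy only needs the number of speakers and so remains meaningful when test speakers are unseen.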
It should be mentioned that the number of speakers in the training data (defined in Section 3.1) with WSJ0-2mix and 3mix is 101, much smaller than the number in a standard speaker recognition task (e.g., 1,211 in VoxCeleb1 [Nagrani17]). We infer that this small number somewhat limits the performance of the speaker inference part and the subsequent extraction module, especially in the open condition. Besides, unlike state-of-the-art speaker recognition methods, our model takes overlapped speech as input, which also adds complexity.
[Table: speech separation performance. Columns: SI-SNRi (CC), SI-SNRi (OC).]
4.2 Extraction performance for multi-round recordings
As mentioned before, natural conversations in real scenes usually contain multi-round utterances from several speakers, and the ratio of overlapped speech is generally less than 20%. For conventional speech separation methods, it is difficult to keep a consistent order of the speakers across different parts of a relatively long recording, especially when the dominant speaker changes [zeghidour2020wavesplit]. To validate this, we extend each mixture in the standard WSJ0-mix to multiple rounds. In detail (see Algorithm 1 and Section A.2 in the Supplementary Material), we take the list of the original mixtures from WSJ0-2mix and sample several additional utterances from the provided speakers. After getting the sources from the different speakers, the long recording is formed by concatenating the sources one by one. The beginning of each following source gets a random shift around the end of the previous one, which makes the result similar to a natural conversation with an overlap ratio of around 15%.
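The construction described above can be illustrated with a small simulation that places utterances one after another, shifts each start by a random offset around the end of the previous utterance, and measures the resulting overlap ratio. This is not the authors' Algorithm 1, only a sketch under assumptions: `simulate_multi_round`, the uniform shift range `max_shift`, the seed, and the definition of the overlap ratio (samples covered by at least two utterances divided by samples covered by at least one) are choices made here, and consecutive utterances are assumed to come from different speakers.

```python
import random


def simulate_multi_round(utterances, speaker_ids, max_shift, seed=0):
    """Build a multi-round mixture from a non-empty list of utterances (lists of samples).

    Returns (mixture, sources, overlap_ratio), where sources maps each speaker
    to a track as long as the mixture.
    """
    rng = random.Random(seed)
    starts, prev_end = [], 0
    for index, utterance in enumerate(utterances):
        # Negative shifts create overlap, positive shifts create a pause.
        shift = 0 if index == 0 else rng.randint(-max_shift, max_shift)
        start = max(0, prev_end + shift)
        starts.append(start)
        prev_end = start + len(utterance)
    total = max(s + len(u) for s, u in zip(starts, utterances))

    sources = {speaker: [0.0] * total for speaker in set(speaker_ids)}
    active = [0] * total   # number of utterances covering each sample
    for start, utterance, speaker in zip(starts, utterances, speaker_ids):
        for offset, sample in enumerate(utterance):
            sources[speaker][start + offset] += sample
            active[start + offset] += 1

    mixture = [sum(track[t] for track in sources.values()) for t in range(total)]
    speech = sum(1 for count in active if count >= 1)
    overlap = sum(1 for count in active if count >= 2)
    return mixture, sources, overlap / speech


toy_utterances = [[1.0] * 100, [2.0] * 100, [1.0] * 100, [2.0] * 100]
toy_speakers = ["A", "B", "A", "B"]
mixture, sources, ratio = simulate_multi_round(toy_utterances, toy_speakers, max_shift=20)
```

The sketch does not enforce a target overlap ratio; reaching a value near 15% would require choosing the shift range relative to the utterance lengths.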
Without any change to our model, we can directly train SCCM on the synthetic multi-round data. It should be mentioned that our speaker inference module takes the whole spectrogram as input. In contrast, the speech extraction module takes a random 4-second segment from the long recording, to avoid running out of memory. Table 3 shows the performance difference compared with the base model. Both the validation set and the test set are fixed at four rounds of conversation with an average length of 10 seconds. As expected, the results show that SCCM stays more stable than the baseline model on multi-round recordings. To further understand the model, we examined the attention status of the Decoder in Eq. (7). We find that the attention of the inference module reflects the speakers’ activities in different parts of a recording. More details and visualizations can be found in Section A.3 of the Supplementary Material.
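[Table: extraction performance on multi-round recordings. Columns: Valid SI-SNRi, Test SI-SNRi.]
[Table: speech recognition results. Columns: System, Overlap ratio in %.]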
4.3 Speech recognition in continuous speech separation
To further validate the downstream application, we conducted speech recognition experiments on the recently proposed continuous speech separation dataset LibriCSS [chen2020continuous]. LibriCSS is derived from LibriSpeech [panayotov2015librispeech] by concatenating the corpus utterances to simulate conversations. In line with the utterance-wise evaluation in LibriCSS, we directly use our trained model from the previous multi-round task to test the recognition performance. The original raw recordings in LibriCSS are from far-field scenes with noise and reverberation, which does not match our training condition. We therefore use the single-channel clean mixtures and convert them to 8 kHz before separation. Moreover, we use the trained model from the ESPnet [watanabe2018espnet] LibriSpeech recipe to recognize each utterance. Table 4 shows the WER results on this dataset.
We observed that (1) the results show a tendency similar to that of the baseline model provided in LibriCSS [chen2020continuous]. (2) As the overlap ratio increases, the performance on the original clean mixture becomes much worse, while our model maintains a low WER. (3) Because the training data of our model contains only multi-speaker mixtures, the performance on the non-overlapped segments becomes worse. We think this could be avoided by adding some single-speaker segments to the training set.
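For reference, the word error rate (WER) used in this evaluation is the word-level edit distance between the reference and the recognized transcript, divided by the number of reference words. The following is a minimal sketch with a hypothetical helper `word_error_rate`; it assumes whitespace tokenization and a non-empty reference, and it is not the scoring tool used in the experiments.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    # previous[j] = edit distance between the processed reference prefix and hyp_words[:j]
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


wer_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
wer_substitution = word_error_rate("a b c", "a x c")
```

Since insertions count as errors, the WER can exceed 100% when the hypothesis contains many extra words, which is typical when unseparated overlapped speech is passed directly to a recognizer.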
We introduced the Speaker-Conditional Chain Model as a common framework for processing audio recordings with multiple speakers. Our model can be applied to the separation of fully overlapped speech with a variable and unknown number of speakers. Multi-round long audio recordings in natural scenes can also be modeled and extracted effectively with this method. Experimental results showed the effectiveness and good adaptability of the proposed model. Our future work will extend this model to real scenes with noisy and reverberant multi-channel recordings. We would also like to explore factors that could improve the generalization ability of this approach, such as introducing more speakers or changing the network and training objectives.
Appendix A Supplementary Material
A.1 Model details
For the inference module, we used a self-attention based encoder-decoder architecture to predict the possible speakers. For both the encoder and the decoder, we used one block with 512 attention units and eight heads. The dimension of the keys and values is 64. We used 2048 internal units in the position-wise feed-forward layer. We used the Adam optimizer, with the learning rate decayed by a fixed factor after every 20 epochs. We tested several different configurations of the model architecture and found that a large number of layers (above 4) prevented training from converging. We also found that two of the tested configurations give similar results.
Different from the original Transformer model, we did not feed the output embeddings, offset by one position, to the next step in the decoder. Instead, the position is embedded with a linear layer (as shown in Eq. (6)) to serve as the input at each step. This ensures that the decoding process can be done without knowing the order of the true speakers; the order is decided after the subsequent extraction module by choosing the permutation with the least reconstruction loss.
For the extraction module, we used the original configuration from Conv-TasNet [luo2018tasnet]. We also noticed that updating the base model used for extraction could further improve the performance, following the same tendency as in [luo2019conv, luo2019dual]. In this paper, we mainly focus on the performance relative to the original TasNet.
For the training strategy, we set a large ratio $\alpha$ in Eq. (13) to balance the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ and the reconstruction loss $\mathcal{L}_{\mathrm{SI\text{-}SNR}}$, whose ranges differ greatly. Specifically, as training continues, the cross-entropy criterion tends to a small positive number close to zero, while the non-probabilistic $\mathcal{L}_{\mathrm{SI\text{-}SNR}}$ changes from positive to almost -20 because of the negative SI-SNR loss definition. Therefore, we chose $\alpha$ to keep a reasonable balance between these two terms. Besides, in practice, we found that the extraction module takes much more time to converge than the speaker inference module. To avoid overfitting, the speaker inference module is early-stopped based on its performance on the validation set, while the extraction module continues training until it converges.
A.2 Simulation of WSJ0-mix multi-round recordings
A.3 Attention status
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [vaswani2017attention, bahdanau2014neural, kim2017structured]. For speech-related tasks, the vocal characteristics of one specific speaker stay stable both within a short segment and across a long conversation. Based on this, we use the self-attention based model in our inference part to exploit the relation between different frames from the same speaker. The attention status can therefore be used to inspect how the model finds the possible speakers. As shown in Figure 2, we visualized one example from the WSJ0-2mix test set, with the real spectrograms of the two speakers and the corresponding attention status towards them. The attention status is taken from the multi-head self-attention block in the decoder, and we summed the weights from each head to form the attention status.
As expected, the attention status shows significant consistency with the real spectrogram. In particular, the attention tends to focus on the frames where the difference between speakers is larger. That is, if one speaker is dominant in some frames, the attention for this speaker tends to emphasize these dominant frames. Similarly, the attention for a multi-round mixture also shows consistency for one speaker over the whole audio, which could be taken as the implicit speech activity that a speaker diarization task would output.
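The aggregation described above, where the weights of all attention heads are added to form one attention status per speaker, can be sketched as follows. `attention_status` is a hypothetical helper, and the input layout (one list of per-frame weights for each head, for a single decoding step) is an assumption made for illustration.

```python
def attention_status(head_weights):
    """Sum attention weights over heads for one decoding step.

    head_weights[h][t] is the weight that head h places on frame t
    (layout assumed for this sketch). Returns one value per frame.
    """
    num_frames = len(head_weights[0])
    return [sum(head[t] for head in head_weights) for t in range(num_frames)]


# Two heads attending over two frames (toy values).
status = attention_status([[0.1, 0.9], [0.5, 0.5]])
```

Plotting this per-frame curve for each decoding step next to the speakers' spectrograms gives the kind of comparison discussed in this section.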