Speech processing for multi-talker conversational speech, such as meeting recordings, is very challenging in the real world. It differs from single-talker scenarios in two main aspects. First, it naturally contains overlapped speech from multiple speakers, so a speech separation process is often required. Second, a conversation can be of any length without any segmentation, which challenges the long-form speech processing capability of the system. There has been increasing interest in conversational speech processing, including automatic speech recognition (ASR) [1, 2, 3], speech separation [4, 5, 6, 7], and speaker diarization [8, 9]. In this paper, we specifically focus on the speech separation problem for long-form speech.
Continuous speech separation (CSS) is a framework to convert long-form unsegmented audio into overlap-free audio streams. In its representative instantiation with utterance-level permutation invariant training (uPIT) [10, 11], the input speech is first segmented using a sliding window with overlaps, and speech separation is performed independently on each segment to generate separated signals. The separated signals in adjacent segments are then aligned via a stitching algorithm. This approach has not only proven effective for speech separation of simulated long-form signals [12, 13], but has also shown large improvements in ASR [14, 15] and speaker diarization tasks in realistic conversation scenarios. However, the uPIT-CSS approach has some drawbacks: (1) It is computationally inefficient due to the large overlap between adjacent windows, which is essential for better stitching performance. (2) More importantly, it can only model the short-span relationship of utterances, e.g. 1.6s, as it assumes at most $N$ active speakers in each window, where $N$ is typically 2. When a long window is used for local separation, this assumption is likely to be broken, as more than $N$ speakers are likely to be present within a window. Therefore, its performance is limited by the lack of access to a long-span context.
A recent study [16, 17] proposed a novel method to tackle the above problems in the CSS framework. The authors show that the label assignment in long-form speech separation can be regarded as a graph coloring problem, which leads to a generalized uPIT criterion named Graph-PIT. The computational complexity of the initial work scales exponentially with the number of utterances in each segment, and was later reduced to be linear in the number of utterances via dynamic programming.
In this paper, we aim to solve the long-span speech processing problem without changing the PIT objective function. We propose Group-PIT (gPIT for short), a simple training data construction strategy that allows the separation network to directly process long-form speech in both the training and inference stages. We show that by carefully designing the data simulation procedure and arranging the long-form reference signal into utterance groups, the number of possible permutations in each long-form audio sample (e.g. 60s) can be constrained to $N!$, where $N$ is the number of output channels, regardless of the number of active speakers and utterances. This allows training of speech separation models directly on long-form speech with the same training objective as uPIT, except that it is applied to utterance groups rather than individual utterances. We also explore different long-form speech processing approaches with Group-PIT. First, we show that the straightforward extension of CSS to gPIT-CSS with long-span separation can better process long-form speech, as it benefits from direct long-span modeling. Second, we explore a two-stage gPIT-CSS approach with short-span separation and long-span tracking. This approach combines the properties of local and long-span processing and is suitable for conditions where long-form training data is difficult to obtain or simulate, e.g., realistic long-form conversational speech with spontaneous speaker interactions. The effectiveness of the proposed methods is validated on simulated meetings based on the WSJ corpus [18, 19].
2 Stitching-based uPIT-CSS
We suppose the long-form input speech mixture $\mathbf{x}$ consists of $U$ utterances and $S$ speakers in total. In the CSS framework, it is assumed that at most $N$ speakers are active at the same time, so that all utterances in $\mathbf{x}$ can be separated and placed into $N$ channels. Each output channel has the same length as the input and only contains overlap-free utterances, as shown in Fig. 1.
A typical CSS pipeline with uPIT-based speech separation is composed of three stages: segmentation, separation, and stitching. As illustrated in Fig. 2, the segmentation stage divides the long-form audio into several overlapped segments using a fixed-length sliding window. Each sliding window consists of three parts with $N_h$, $N_c$, and $N_f$ frames that represent the history, current, and future frames, respectively. The overlap length between adjacent segments is $N_h + N_f$. Speech separation is then performed on each segment independently to generate overlap-free signals. Finally, the separation outputs of all segments are merged via a stitching algorithm to obtain the meeting-level separation result. This is done by first finding, for each pair of adjacent segments, the permutation of output channels with the highest overall similarity and permuting the channels accordingly. Then, an overlap-and-average operation is performed along each channel across all segments.
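The stitching stage can be sketched as follows. This is a minimal illustration rather than the implementation used in the cited systems: the function name `stitch`, the use of an inner product on the shared frames as the similarity measure, and the array layout are all assumptions. Here `hop` plays the role of the current part of the window, and the window length minus `hop` is the overlap between adjacent segments.

```python
import itertools
import numpy as np

def stitch(segments, hop):
    """Stitch separated windows into full-length channels.

    segments: list of arrays of shape (num_channels, win), one per window,
    taken every `hop` frames so adjacent windows share win - hop > 0 frames.
    """
    num_ch, win = segments[0].shape
    overlap = win - hop
    total = hop * (len(segments) - 1) + win
    out = np.zeros((num_ch, total))
    count = np.zeros(total)
    prev = None
    for i, seg in enumerate(segments):
        if prev is not None:
            # Assumed similarity: inner product on the frames shared with
            # the previous (already aligned) window.
            def similarity(perm):
                return sum(
                    np.dot(prev[c, -overlap:], seg[perm[c], :overlap])
                    for c in range(num_ch)
                )
            best = max(itertools.permutations(range(num_ch)), key=similarity)
            seg = seg[list(best)]  # reorder channels to match the previous window
        start = i * hop
        out[:, start:start + win] += seg
        count[start:start + win] += 1
        prev = seg
    return out / count  # overlap-and-average
```

Because each window is aligned only to its predecessor, a wrong permutation decision propagates to all later windows, which is one reason a large overlap helps stitching.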
It should be noted that uPIT-based CSS assumes that the window is short enough to contain at most $N$ speakers, where $N$ is the number of output channels, so that uPIT-based speech separation models can be trained. This makes it difficult for uPIT-CSS to use a long window, in which more than $N$ speakers are likely to appear, and potentially limits the modeling capacity due to the lack of access to the long-span context.
3 Group-PIT for long-span modeling
In this section, we introduce the proposed Group-PIT approach for long-form speech separation. First, we define the data arrangement in our proposed approach. Then, we present two different approaches based on Group-PIT.
3.1 Group-PIT and the corresponding data arrangement
As illustrated in Fig. 1, in the original CSS pipeline, the placement of some separated utterances in different output channels may not be unique. For example, the last two utterances (orange and yellow) in Ch 1 can be swapped with the last utterance (blue) in Ch 2 while still satisfying the CSS constraint introduced in Section 2. This phenomenon is common and can easily happen when a relatively long silence exists in the midst of the input speech mixture. As a result, the number of possible permutations of separated utterances in the output channels can become very large. More specifically, if we define an utterance group as a consecutive segment in which all utterances in the $N$ channels only have $N!$ possible permutations, then the number of possible permutations in the CSS problem is up to $(N!)^G$, where $G$ is the number of utterance groups in the CSS output. In the two-channel example of Fig. 1 ($N = 2$), each utterance group therefore contributes a factor of $2! = 2$. Since this number increases exponentially with the number of utterance groups, it would be computationally expensive to extend CSS by using a longer window directly.
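To make the growth concrete, the bound can be evaluated directly. The helper below is only an illustration; its name and arguments (number of output channels and number of utterance groups) are chosen here and do not come from the paper.

```python
from math import factorial

def max_css_permutations(num_channels, num_groups):
    """Upper bound on channel assignments: each utterance group can be
    permuted independently across the output channels."""
    return factorial(num_channels) ** num_groups

# With two output channels, every additional utterance group doubles the count.
for groups in (1, 2, 5, 10):
    print(groups, max_css_permutations(2, groups))
```

With a single utterance group the count reduces to the number of channel permutations, which is the case that ordinary uPIT already handles.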
To remedy this issue, we propose to proactively arrange the training data so that $G = 1$, i.e. a single utterance group, is guaranteed for every long-form sample. This keeps the number of possible permutations small. In the following discussion, we adopt $N = 2$ output channels, as three-fold overlaps are rarely observed in real meetings. When simulating a long-form speech sample, we first generate two overlap-free reference signals corresponding to the two output channels by iteratively appending utterances to either of the channels based on the following rules: (i) the first utterance is appended to Ch 1; (ii) the $u$-th utterance is appended to the channel where the end time, $t_{\text{end}}$, of the last-appended utterance is earlier than the end time, $t'_{\text{end}}$, of the last-appended utterance on the other channel. When the $u$-th utterance is appended, its onset is randomly sampled from a uniform distribution. After the two overlap-free reference signals are generated, they are mixed to form the long-form audio mixture for training. This constraint guarantees that only one utterance group exists in each long-form speech sample. Given the small number of possible permutations, we can apply the conventional uPIT criterion, except that it is applied to utterance groups rather than individual utterances. We call this method Group-PIT.
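The two placement rules can be sketched in a few lines. The function name `arrange_utterances` is hypothetical, and the bounds of the uniform distribution are an assumption made here for illustration: the onset is drawn between the end of the last utterance on the chosen channel and the end of the last utterance on the other channel, which keeps each channel overlap-free and leaves no silent gap between utterances.

```python
import random

def arrange_utterances(durations, seed=0):
    """Place utterances (given by duration in seconds) on two channels.

    Returns a list of (channel, onset, end) tuples, channel 0 being Ch 1.
    """
    rng = random.Random(seed)
    ends = [0.0, 0.0]  # end time of the last utterance on Ch 1 / Ch 2
    placement = []
    for i, dur in enumerate(durations):
        if i == 0:
            ch, onset = 0, 0.0  # rule (i): first utterance goes to Ch 1
        else:
            # rule (ii): pick the channel whose last utterance ended earlier
            ch = 0 if ends[0] < ends[1] else 1
            # ASSUMPTION: onset sampled between the two channel end times
            onset = rng.uniform(ends[ch], ends[1 - ch])
        placement.append((ch, onset, onset + dur))
        ends[ch] = onset + dur
    return placement

for ch, onset, end in arrange_utterances([3.0, 2.0, 4.0, 1.5, 2.5]):
    print(f"Ch {ch + 1}: {onset:6.2f}s to {end:6.2f}s")
```

Under this assumed sampling range, every new utterance starts before the other channel has gone quiet, so the two reference channels can only be exchanged as a whole.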
Compared to our proposed method, Graph-PIT [16, 17] is a more general approach that directly extends uPIT for long-span modeling. Our proposed method, on the other hand, simplifies the permutation problem and thus reduces the computational cost during training. It can be viewed as a special solution of Graph-PIT that leverages prior knowledge of the CSS problem. Such prior knowledge can result in different behaviors of the separation network when the number of output channels $N > 2$; we focus on the case of $N = 2$ in this paper and leave such cases for future work. It should be noted that with additional constraints, Graph-PIT can also converge to the same solution as the proposed method.
Note that modeling long-form samples containing multiple utterance groups ($G > 1$) during training is potentially inefficient, because in practice it is relatively easy to detect long silence regions by applying voice activity detection (VAD) as a preprocessing step. The input mixture can then be divided into chunks without such silence, and the separation output for each chunk can still be regarded as a single utterance group. Note that in this work, utterances with a short silence in between are considered to belong to the same utterance group.
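The chunking idea can be illustrated on speech activity intervals, such as those a VAD might return. This sketch does not run a VAD on audio; the function name `split_into_groups` and the `max_pause` threshold are assumptions (the default of 0.5s mirrors the short pause mentioned in Section 4.1).

```python
def split_into_groups(intervals, max_pause=0.5):
    """Split (start, end) speech intervals into utterance groups.

    A new group starts whenever the silence since the latest end time
    seen so far exceeds `max_pause` seconds.
    """
    groups = []
    current = []
    last_end = float("-inf")
    for start, end in sorted(intervals):
        if current and start - last_end > max_pause:
            groups.append(current)
            current = []
        current.append((start, end))
        last_end = max(last_end, end)
    if current:
        groups.append(current)
    return groups

intervals = [(0.0, 3.0), (2.0, 5.0), (5.3, 7.0), (10.0, 12.0), (11.0, 13.0)]
for group in split_into_groups(intervals):
    print(group)
```

In this example the 0.3s pause keeps the third utterance in the first group, while the 3s silence before the fourth utterance opens a second group.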
3.2 gPIT-CSS with long-span separation
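One straightforward way to extend uPIT-CSS with Group-PIT is to use a longer sliding window that covers more than two utterances. In the training stage, we assume that the reference signals ($\mathbf{y}_n$, $n = 1, \dots, N$) only contain one utterance group. The training objective is then given by:
$$\mathcal{L}_{\text{gPIT}} = \min_{\pi \in \mathcal{P}} \sum_{n=1}^{N} \mathcal{L}\left(\hat{\mathbf{y}}_n, \mathbf{y}_{\pi(n)}\right), \qquad (1)$$
where $\hat{\mathbf{y}}_n$ is the $n$-th output signal from the speech separation model, $\mathcal{L}(\cdot, \cdot)$ is the signal-level loss function, $\mathcal{P}$ enumerates all possible permutations for the $N$ channels, and $\pi(n)$ denotes the permuted index for the $n$-th channel.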
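A permutation-invariant objective of this kind can be sketched in NumPy as below. This is an illustration under assumptions, not the training code of the paper: the function name `group_pit_loss` is hypothetical, the inputs are taken to be magnitude spectra of shape (channels, frames, frequency bins), and the per-channel loss is a mean squared error standing in for the L2 loss described in Section 4.2.

```python
import itertools
import numpy as np

def group_pit_loss(est, ref):
    """Minimum total loss over all channel permutations.

    est, ref: arrays of shape (num_channels, frames, bins).
    Returns (loss, perm) where est[c] is matched to ref[perm[c]].
    """
    num_ch = est.shape[0]
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(num_ch)):
        # ASSUMPTION: mean squared error as the per-channel loss
        loss = sum(np.mean((est[c] - ref[perm[c]]) ** 2) for c in range(num_ch))
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
ref = rng.random((2, 100, 257))
loss, perm = group_pit_loss(ref[::-1], ref)  # estimates with swapped channels
print(loss, perm)
```

The only difference from utterance-level training lies in the data: each reference channel holds a whole utterance group, so the search still covers just the channel permutations.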
In the inference stage, the same stitching-based process as in Section 2 is used for processing the entire meeting, except that a much longer window size can be used. It is thus possible to directly utilize the long-span audio context for better speech separation.
3.3 gPIT-CSS with short-span separation and long-span tracking
The separation approach in Section 3.2 solves the long-span separation problem in one shot. However, it usually requires matched training data to maximize its advantage in long-form modeling, which is not always available in practice. For example, it is challenging to simulate all the variety of realistic long-form conversational speech, which includes spontaneous speaker interactions. Without matched data, the long-span modeling could be sub-optimal. Therefore, we explore another approach to applying Group-PIT in the CSS pipeline, where the speech separation procedure for one long segment is decomposed into short-span separation and long-span tracking.
The overview of processing one long audio segment (e.g. 24s) is depicted in Fig. 3. In this approach, the long audio segment is further segmented by short sliding windows (e.g. 4s) with almost no overlap (a 2-frame overlap in our experiments). For each short window indexed by $w$, a short-span separation model trained with the conventional uPIT objective function is applied to generate two overlap-free signals $\hat{\mathbf{Y}}^{w}_{n}$, where $n \in \{1, 2\}$. The long-span tracking network is then applied to the separated signals from all short windows to predict the frame-wise permutation of the output channels as follows:
$$\hat{\mathbf{p}} = \mathrm{Tracker}\Big(\mathrm{Concat}_{w}\big(\mathrm{Stack}(\hat{\mathbf{Y}}^{w}_{1}, \hat{\mathbf{Y}}^{w}_{2})\big)\Big). \qquad (2)$$
Here, $\mathrm{Stack}(\cdot)$ denotes the operation that stacks each frame and its adjacent frames along the feature dimension, and $\mathrm{Concat}_{w}(\cdot)$ denotes the operation that concatenates all features from the short windows along the frame dimension. The term $\hat{\mathbf{p}} = [\hat{p}_1, \dots, \hat{p}_T]$ represents the frame-wise permutation indicator, where “1” indicates swapping the two separation results of the current frame in Ch 1 and Ch 2, while “0” means no change. $T$ refers to the length of the input sequence. According to the permutation indicator, the short-span separation result is rearranged to form the final long-span output signal for each channel. Note that the above procedure describes the speech separation of one (relatively long) audio segment, which is still shorter than the duration of an entire recording. The entire recording is processed with the same stitching algorithm as used in the gPIT-CSS with long-span separation in Section 3.2.
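The rearrangement step driven by the permutation indicator can be sketched as follows. The function name `apply_frame_permutation` and the (frames, bins) layout are assumptions for illustration; the indicator is taken to be already binary, e.g. after thresholding the tracking network output.

```python
import numpy as np

def apply_frame_permutation(ch1, ch2, swap):
    """Rearrange two separated channels frame by frame.

    ch1, ch2: arrays of shape (frames, bins) from short-span separation.
    swap: length-`frames` sequence of 0/1, where 1 swaps the two channels
    at that frame and 0 leaves them unchanged.
    """
    swap = np.asarray(swap, dtype=bool)[:, None]  # broadcast over bins
    out1 = np.where(swap, ch2, ch1)
    out2 = np.where(swap, ch1, ch2)
    return out1, out2

ch1 = np.arange(8.0).reshape(4, 2)
ch2 = -ch1
out1, out2 = apply_frame_permutation(ch1, ch2, [0, 1, 1, 0])
print(out1)
print(out2)
```

Since every frame is decided individually, a single wrong indicator value moves one frame of speech to the wrong channel, which is consistent with the sensitivity to frame-wise tracking errors discussed in Section 5.2.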
The cross-entropy loss is used to train the tracking network:
$$\mathcal{L}_{\text{track}} = \mathrm{CE}\left(\hat{\mathbf{p}}, \mathbf{p}\right), \qquad (3)$$
where $\hat{\mathbf{p}}$ is the frame-wise permutation predicted by the tracking network and $\mathbf{p}$ is the oracle frame-wise permutation. The oracle permutation label is formed by comparing the frame-wise separation result with the two-channel reference spectrum. The data arrangement from Group-PIT is applied to construct the reference signals, i.e., after tracking alignment, the final separation result should follow the CSS arrangement shown in Fig. 1. We freeze the short-span separation model when training the tracking network.
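One way to form the oracle labels and evaluate the loss is sketched below. Both functions are illustrative assumptions: the comparison uses a per-frame squared error between spectra, and the network is assumed to output a per-frame swap probability so that a binary cross-entropy applies.

```python
import numpy as np

def oracle_permutation(est1, est2, ref1, ref2):
    """Per-frame label: 1 if swapping the estimates matches the references
    better than keeping them, else 0. All inputs have shape (frames, bins)."""
    keep = ((est1 - ref1) ** 2).sum(-1) + ((est2 - ref2) ** 2).sum(-1)
    swap = ((est1 - ref2) ** 2).sum(-1) + ((est2 - ref1) ** 2).sum(-1)
    return (swap < keep).astype(int)

def binary_cross_entropy(prob, label, eps=1e-8):
    """Mean binary cross-entropy between swap probabilities and 0/1 labels."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(-np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob)))

rng = np.random.default_rng(0)
ref1, ref2 = rng.random((5, 4)), rng.random((5, 4)) + 2.0
est1, est2 = ref1.copy(), ref2.copy()
est1[2:4], est2[2:4] = ref2[2:4], ref1[2:4]  # swap the channels on two frames
labels = oracle_permutation(est1, est2, ref1, ref2)
print(labels, binary_cross_entropy(np.full(5, 0.5), labels))
```

Because the separation model is frozen, these labels can be computed once per training sample from its outputs and the Group-PIT-arranged references.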
Note that the idea of combining short-span separation and long-span tracking has already been investigated in the literature, but only for utterance-level mixtures. Therefore, it is still unclear how well this approach works in the CSS framework. Our proposed extension with Group-PIT naturally fills this gap.
4 Experimental Setup
4.1 Data description
We experimented with simulated multi-talker recordings based on the WSJ corpus [18, 19]. The training and development sets were simulated based on the WSJ1 training set, with 283 speakers in total. The evaluation set was simulated based on the si_dt_05 and si_et_05 subsets of WSJ0, with 18 speakers in total. The sampling rate of the audio was 16 kHz. The simulation of all datasets follows the description in Section 3.1. The number of speakers ranges from 2 to 5, while the meeting length is fixed to 80s. For all datasets, we simulated two types of mixtures, i.e. partially overlapped mixtures (partial) and sequential mixtures (seq.). For training, development and evaluation, we used 27000, 2992, 2999 overlapped samples and 8000, 1500, 3000 sequential samples, respectively. For overlapped mixtures, the overlap ratio ranges from 20% to 60%. For sequential mixtures, note that we considered sequential utterances separated by a short pause (0.5s) to belong to the same utterance group, and constrained them to be assigned to different channels even if they do not overlap. This property of separation networks has been shown to be important for handling quick speaker turns in real conversations.
4.2 Network Architectures
We adopt the time-frequency masking based speech separation method to examine the effectiveness of the proposed approaches. The window size and hop size for the short-time Fourier transform (STFT) are 512 and 256, respectively. The loss function in Eq. (1) is the L2 loss between the estimated and reference magnitude spectra. For the gPIT-CSS approach with long-span separation, we adopt the dual-path transformer (DP-transformer) [22, 23] architecture for its capability and efficiency in long sequence modeling. It consists of 16 encoder layers with 4 attention heads, and each layer has 128 attention dimensions and 1024 FF dimensions. For the gPIT-CSS approach with separation and tracking, we adopt a transformer model with 16 layers and a similar number of parameters for short-span separation. For the tracking network, we also adopt the DP-transformer architecture for long-span modeling, which consists of 16 encoder layers with 128 attention dimensions and 1024 FF dimensions. The chunk size and hop size in the inter- and intra-chunk processing in all DP-transformer models are 150 and 75, respectively. The batch size is 96 for training Group-PIT models with a 4s sliding window on 8 GPUs. For other window lengths, we adjust the batch size to fit approximately the same amount of data into each batch, as far as memory allows. The AdamW optimizer is used for training.
5 Experimental Results
Table 1: SI-SNR (dB) on the overlapped evaluation data (partial).
|Model||Training window size (s)||Sliding window size (s)|
|Original partial mixture||-||2.84 (all window sizes)|
|+ Oracle permutation||4||11.93||8.59||5.59||3.66|
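Table 2: Frame-wise accuracy of speaker assignment on the sequential evaluation data (seq.).
|Model||Training window size (s)||Sliding window size (s)|
|Original seq. mixture||-||2.84 (all window sizes)|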
5.1 gPIT-CSS with long-span separation
As mentioned in Section 3.2, the proposed Group-PIT allows training of speech separation models on much longer segments than uPIT. Therefore, we first compared the performance of direct long-span separation models trained with different window lengths. The best permutation of the meeting-level separation output channels is first determined, and the oracle utterance boundaries in each channel are then used to calculate the utterance-level scale-invariant signal-to-noise ratio (SI-SNR). The overlap between adjacent windows is set to 2s by default for all models.
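For reference, SI-SNR for a single utterance can be computed as in the sketch below. It follows the commonly used definition with zero-mean normalization; the function name `si_snr` and the small `eps` stabilizer are choices made here, and the selection of utterance boundaries and channel permutation described above is left out.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between 1-D signals est and ref."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component.
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

ref = np.array([1.0, -1.0, 1.0, -1.0])
est = ref + 0.1 * np.array([1.0, 1.0, -1.0, -1.0])  # error orthogonal to ref
print(si_snr(est, ref), si_snr(3.0 * est, ref))
```

Rescaling the estimate leaves the value essentially unchanged, which is the property that makes the metric insensitive to the output gain of the separation model.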
Table 1 shows the separation performance on the overlapped evaluation data (partial). When evaluated with different sliding window lengths, models trained with a longer window tend to perform better. This verifies our conjecture that a longer context can benefit the separation of long-form audio. In all conditions, the best performance is achieved when the same window length is used for both training and evaluation. In addition, we can observe that models trained with longer windows tend to approach the performance with oracle permutations (here, “oracle permutations” means using the reference signal to determine the permutation of each window when stitching adjacent separated segments), which further demonstrates the effectiveness of the proposed approach. Note that 4s training data is the setting usually adopted by uPIT-CSS systems, so the gPIT-CSS with 4s can serve as a reference for uPIT-CSS. However, it should be noted that gPIT and uPIT are not equivalent in this condition, as the gPIT training data might contain more than 2 speakers in one 4s training sample, while uPIT strictly requires no more than 2 speakers in each sample.
For the sequential evaluation data (seq.), since no overlap exists, the SI-SNRs of the separation outputs tend to be very large, which makes them inappropriate for comparison due to the nonlinear scale of SI-SNR. Instead, we compare the frame-wise accuracy of speaker assignment in each output channel in Table 2. This is obtained by calculating the percentage of speaker turns in the best frame-wise permutation based on the final meeting-level separation output. It can be seen that models trained with longer windows also show higher frame-wise accuracies on the sequential mixtures, which further shows the benefit of the proposed approach.
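Table 3: Performance of tracking-based models on the overlapped evaluation data (partial).
|Model||Processing||Tracking accuracy||SI-SNR (dB)|
|Original partial mixture||no processing||-||2.84|
|gPIT-CSS (s)||+ stitching||-||4.36|
|+ oracle tracking||100%||17.16|
|gPIT-CSS (s)||+ stitching||-||7.50|
|+ oracle tracking||100%||17.20|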
5.2 gPIT-CSS with short-span separation and long-span tracking
In this section, we evaluate the gPIT-CSS approach with short-span separation and long-span tracking. In contrast with the relatively large window length and overlap length used in Section 3.2, we only use a short sliding window (2s and 4s) with a 2-frame overlap for short-span separation. The tracking network is trained and evaluated using a 24s sliding window. The overlap between adjacent tracking windows is 12s for training and 2s for evaluation. Table 3 shows the performance of tracking-based models trained with different window lengths. Although the frame-wise tracking accuracy is not low, the overall SI-SNR is not as good as that of the direct long-span separation approaches with the best stitching window configuration in Table 1, which suggests that this approach is more sensitive to frame-wise tracking errors. However, this comparison is unfair because much longer overlaps are used to achieve the good performance in Table 1, leading to higher computational overhead. If we reduce the overlap to only 2 frames, as shown in Table 3, the performance of long-span speech separation (denoted as “stitching” in Table 3) is severely degraded. The tracking-based approach, on the other hand, can significantly improve the final separation result while enjoying a much lower computational cost (the computational cost of our tracking network is roughly one third of the cost required for uPIT-CSS with 2s overlap, so the total computational cost is lower even with the overhead of the tracking network). It is especially helpful when a shorter separation window is used, as more improvement is achieved with the 2s window than with the 4s window.
6 Conclusion
In this paper, we explored long-span speech separation approaches in the meeting scenario. A novel training scheme called Group-PIT was proposed to cope with the permutation problem in long-form speech. We showed that Group-PIT-based speech separation models can be trained directly on the arranged long-form speech with the same computational complexity as in uPIT. Moreover, we explored two different Group-PIT-based speech separation approaches for long-span speech processing, and their effectiveness was validated on simulated data based on the WSJ corpus.
References
[1] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, D. Snyder, A. S. Subramanian, J. Trmal, B. B. Yair, C. Boeddeker, Z. Ni, Y. Fujita, S. Horiguchi, N. Kanda, T. Yoshioka, and N. Ryant, “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020, pp. 1–7.
[2] X. Chang, N. Kanda, Y. Gaur, X. Wang, Z. Meng, and T. Yoshioka, “Hypothesis stitcher for end-to-end speaker-attributed ASR on long-form multi-talker recordings,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6763–6767.
[3] N. Kanda, X. Xiao, J. Wu, T. Zhou, Y. Gaur, X. Wang, Z. Meng, Z. Chen, and T. Yoshioka, “A comparative study of modular and joint approaches for speaker-attributed ASR on monaural long-form audio,” arXiv preprint arXiv:2107.02852, 2021.
[4] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva, “Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks,” in Proc. Interspeech 2018, 2018, pp. 3038–3042.
[5] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: Dataset and analysis,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7284–7288.
[6] C. Li, Y. Luo, C. Han, J. Li, T. Yoshioka, T. Zhou, M. Delcroix, K. Kinoshita, C. Boeddeker, Y. Qian et al., “Dual-path RNN for long recording speech separation,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 865–872.
[7] Z.-Q. Wang and D. Wang, “Localization based sequential grouping for continuous speech separation,” arXiv preprint arXiv:2107.06853, 2021.
[8] T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, and R. Haeb-Umbach, “All-neural online source separation, counting, and diarization for meeting analysis,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 91–95.
[9] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, S. Chen, Y. Zhao, G. Liu, Y. Wu, J. Wu et al., “Microsoft speaker diarization system for the VoxCeleb speaker recognition challenge 2020,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5824–5828.
[10] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
[11] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
[12] S. Chen, Y. Wu, Z. Chen, J. Wu, J. Li, T. Yoshioka, C. Wang, S. Liu, and M. Zhou, “Continuous speech separation with conformer,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5749–5753.
[13] S. Chen, Y. Wu, Z. Chen, T. Yoshioka, S. Liu, J. Li, and X. Yu, “Don’t shoot butterfly with rifles: Multi-channel continuous speech separation with early exit transformer,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6139–6143.
[14] D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y. Luo et al., “Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 897–904.
[15] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang et al., “Advances in online audio-visual meeting transcription,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 276–283.
[16] T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers,” in Proc. Interspeech 2021, 2021, pp. 3490–3494.
[17] T. von Neumann, C. Boeddeker, K. Kinoshita, M. Delcroix, and R. Haeb-Umbach, “Speeding up permutation invariant training for source separation,” arXiv preprint arXiv:2107.14445, 2021.
[18] J. S. Garofolo, D. Graff, D. Paul, and P. David, LDC Catalog: CSR-I (WSJ0) Complete LDC93S6A, Philadelphia: Linguistic Data Consortium, 1993.
[19] Linguistic Data Consortium and NIST Multimodal Information Group, LDC Catalog: CSR-II (WSJ1) Complete LDC94S13A, Philadelphia: Linguistic Data Consortium, 1994.
[20] Y. Liu and D. Wang, “Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2092–2102, 2019.
[21] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[22] Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 46–50.
[23] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 21–25.
[24] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.