Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

by   Fan Yu, et al.

The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and challenging scenarios for speech technologies. The M2MeT challenge sets up two tracks: speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups and baselines, and summarize the challenge results and major techniques used in the submissions.





1 Introduction

"Who spoke what and when" is the major aim of rich transcription of real-world multi-speaker meetings. Despite years of research [12, 11, 10], meeting rich transcription is still considered one of the most challenging tasks in speech processing due to free speaking styles and complex acoustic conditions, such as overlapping speech, an unknown number of speakers, far-field attenuated speech signals in large conference rooms, noise, reverberation, etc. As a result, tackling the problem requires a well-designed speech system with multiple related speech processing components, including but not limited to front-end signal processing, speaker identification, speaker diarization and automatic speech recognition (ASR).

The recent advances of deep learning have boosted a new wave of related research on meeting transcription, including speaker diarization [34, 14, 21], speech separation [51, 19, 7] and multi-speaker ASR [50, 6, 26]. The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT) was designed with the aim of providing a common evaluation platform and a sizable dataset for Mandarin meeting transcription [52]. Along with the challenge, we made the AliMeeting dataset available to the participants, which contains 120 hours of real meeting data recorded by an 8-channel directional microphone array and headset microphones. Two tracks are designed. Track 1 is speaker diarization, in which participants are tasked with addressing the “who spoke when” question by logging speaker-specific speech events on multi-speaker audio data. Track 2 focuses on transcribing multi-speaker speech that may contain overlapped segments from multiple speakers.

This paper summarizes the challenge outcomes. Specifically, we give a brief literature overview on speaker diarization and multi-speaker ASR in Section 2. Section 3 reviews the released dataset and the associated baselines. Sections 4 and 5 discuss the outcome of the challenge with major techniques and tricks used in submitted systems. Section 6 concludes the paper.

2 Related Works

Speaker diarization and multi-speaker ASR in meeting scenarios have attracted increasing attention. For speaker diarization, conventional clustering-based approaches usually contain a speaker embedding extraction step and a clustering step: the input audio stream is first converted into speaker-specific representations [42], followed by a clustering process, such as Variational Bayesian HMM clustering (VBx) [28], which aggregates the regions of each speaker into separate clusters. The clustering-based approach cannot recognize overlapped speech without additional modules, because it assumes that each speech frame corresponds to only one speaker. Therefore, resegmentation was used by [39] to handle the overlapped segments. However, overlapping speech detection (OSD) [3] is itself a challenging task in most situations. Compared with the majority of studies, which work on two-talker telephony conversations, there is a recent trend toward more challenging speaker diarization scenarios in complicated talking and acoustic environments [46, 2, 37, 38]. For multi-speaker meetings recorded at a distance with a microphone array, the scenario considered in this challenge, speaker diarization becomes even more challenging, as speaker overlaps happen more frequently and sometimes several speakers speak at the same time in a conference discussion.
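The clustering step described above can be sketched as follows. This is a minimal, illustrative agglomerative clustering over speaker embeddings using cosine similarity, not the VBx algorithm itself; the embeddings, the merging threshold and the helper names are assumptions for the example.

```python
import numpy as np

def cosine_affinity(embs):
    """Pairwise cosine similarity between L2-normalized speaker embeddings."""
    x = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return x @ x.T

def ahc(embs, threshold=0.5):
    """Toy agglomerative hierarchical clustering: repeatedly merge the two
    most similar clusters (by mean cosine similarity between their members)
    until no pair exceeds the threshold. Real systems (e.g. VBx) refine this
    with probabilistic models; this only shows the basic mechanism."""
    clusters = [[i] for i in range(len(embs))]
    sim = cosine_affinity(embs)
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([sim[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = np.empty(len(embs), dtype=int)
    for k, idx in enumerate(clusters):
        labels[idx] = k
    return labels
```

Note that each segment receives exactly one cluster label, which is precisely why this family of methods cannot attribute an overlapped frame to two speakers.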

The advances of deep learning have shed light on the problem. As a typical solution, recent end-to-end neural diarization (EEND) [14] and its variants [21] replace the individual sub-modules of the traditional speaker diarization systems mentioned above with a single neural network that directly produces overlap-aware diarization results. More promisingly, thanks to the advances in speaker embedding extraction [25, 42], target-speaker voice activity detection (TS-VAD) [33, 18] was proposed to judge each target speaker's activity for every speech frame. Since it estimates all speakers simultaneously, it offers a promising solution for handling overlapped speech.
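The key property of TS-VAD, per-frame and per-speaker decisions, can be illustrated with a toy scorer. This is not the trained TS-VAD network from [33]; the dot-product scoring, the threshold and the function name are stand-ins for a learned model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ts_vad_decisions(frame_feats, speaker_embs, threshold=0.5):
    """Toy TS-VAD-style scorer: for each frame and each enrolled target
    speaker, produce an independent activity probability. Because each
    speaker gets its own decision (rather than an argmax over speakers),
    several speakers can be marked active in the same frame, which is how
    the approach handles overlap. A real TS-VAD model replaces this dot
    product with a neural network conditioned on target embeddings."""
    logits = frame_feats @ speaker_embs.T   # (T frames, S speakers)
    probs = sigmoid(logits)
    return probs > threshold                # boolean activity matrix (T, S)
```

A frame whose features match two enrolled speakers simply yields two active entries in that row of the output matrix.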

With the development of deep learning, end-to-end neural approaches have rapidly gained prominence in the speech recognition community [29]. However, ASR in complicated scenarios such as meetings is still not a solved problem, with challenges including complex acoustic conditions, an unknown number of speakers and overlapping speech. In other words, the challenges mentioned above for speaker diarization also exist in multi-speaker ASR. Besides multi-condition training and data augmentation, speech enhancement [22] and separation [31, 30] are considered remedies for the complex acoustic conditions and multiple speakers. Speech enhancement that explicitly addresses the background noise has been widely studied [46, 20, 27], where multi-channel signals can be adopted if a microphone array is deployed for audio recording. Speech separation and joint training with ASR under the permutation invariant training (PIT) scheme were studied to achieve a high-performance ASR system for overlapped speech [8]. Designing an end-to-end system that directly outputs multi-speaker transcriptions seems a straightforward solution to multi-talker ASR, such as the multi-channel input multi-speaker output (MIMO) approach [5, 53] and the end-to-end unmixing, fixed-beamformer and extraction (E2E-UFE) system [48]. The conditional chain approach [41] was also proposed to handle the PIT problem when the number of speakers is unknown. However, the above approaches rely on complicated joint training of front- and back-end models or re-designing a complicated neural architecture. The serialized output training (SOT) method [26] does not change the original ASR network structure designed for a single speaker. Instead, it only introduces a special symbol to represent the speaker change. Moreover, speech recognition and diarization for unsegmented multi-talker recordings with speaker overlaps was discussed in the recent JSALT workshop to further promote reproducible research in this field [35].
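The SOT labeling idea can be sketched in a few lines: references from different speakers are concatenated in order of their start times, separated by a speaker-change token, so that an unmodified single-speaker attention decoder can emit them one after another. The token name `<sc>` and the input representation below are illustrative, not the exact symbols from [26].

```python
def serialize_transcripts(utterances, sc="<sc>"):
    """Sketch of serialized output training (SOT) label construction.
    `utterances` is a list of (start_time, text) pairs, one per speaker
    turn in an overlapped segment. Sorting by start time fixes the output
    order; the special change token marks the boundary between speakers,
    so the decoder needs no per-speaker output heads."""
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {sc} ".join(text for _, text in ordered)
```

The appeal noted above follows directly: nothing in the ASR architecture changes, and the number of speakers is bounded only by what the label sequence encodes.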

Different from relevant datasets that have been released before [46, 24, 32], AliMeeting released in this challenge and Aishell-4 [13] are currently the only publicly available meeting datasets in Mandarin. Specifically, AliMeeting has more speakers and meeting venues, while particularly adding multi-speaker discussions with a high speaker overlap ratio.

3 Datasets, Tracks and Baselines

As described in our challenge evaluation plan [52], AliMeeting contains 118.75 hours of speech data in total (hours are calculated on a single channel of audio). The training set (Train) and evaluation set (Eval) are first released to participants for system development, with 104.75 and 4 hours of speech, respectively, with manual transcriptions and timestamps. During the challenge ranking period, the 10-hour test set (Test) is released for scoring. Specifically, the Train, Eval and Test sets contain 212, 8 and 20 meeting sessions respectively, and each session consists of a 15- to 30-minute discussion by 2-4 participants. To highlight speaker overlap, sessions with 4 participants account for 59%, 50% and 57% of the sessions in Train, Eval and Test, respectively. For the Train and Eval sets, we provide the 8-channel audio recorded by the far-field microphone array as well as the near-field audio from each participant's headset microphone, while the Test set only contains the 8-channel far-field audio.

The challenge consists of two tracks, namely speaker diarization (track 1) and multi-speaker ASR (track 2), measured and ranked on the Test set by Diarization Error Rate (DER) and Character Error Rate (CER), respectively. For both tracks, we also set up two sub-tracks. For the constrained data sub-track, system building is restricted to AliMeeting [52], Aishell-4 [13] and CN-Celeb [9], while for the unconstrained data sub-track, participants can use any publicly available dataset.

We release baseline systems along with the Train and Eval data for a quick start and reproducible research. For the 8-channel AliMeeting data recorded by the microphone array, we select the first channel to obtain Ali-far, and apply the CDDMA beamformer [23, 55] to the 8-channel data to generate Ali-far-bf. We use the prefixes Train-*, Eval-* and Test-* to denote generated data associated with the Train, Eval and Test sets. For example, Test-Ali-far-bf means the beamformed data for the Test set.

We adopt the Kaldi-based diarization system from the CHiME-6 challenge as the baseline system for track 1. The diarization module includes a speaker embedding extractor and a clustering module. DER is scored with collar sizes of 0 and 0.25 seconds, but the challenge ranking is based on the 0.25-second collar. The speaker diarization results for the baseline system are shown in Table 1.

Testing data Collar size = 0 Collar size = 0.25
Eval-Ali-far 24.52 15.24
Eval-Ali-far-bf 24.67 15.46
Test-Ali-far 24.95 15.60
Test-Ali-far-bf 25.16 15.79

Table 1: Speaker diarization results on Eval and Test in DER (%).
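The collar-based DER scoring used for ranking can be sketched at the frame level. This is a simplified, illustrative scorer, not the official md-eval tool: it assumes the optimal speaker mapping between reference and hypothesis labels has already been resolved, and the frame size and interface are assumptions for the example.

```python
def der(ref, hyp, frame_s=0.01, collar_s=0.25, boundaries=()):
    """Toy frame-level diarization error rate. `ref` and `hyp` are lists of
    per-frame speaker-label sets (empty set = silence). Frames within
    `collar_s` seconds of a reference segment boundary are excluded from
    scoring, mirroring the challenge's forgiveness collar. Per frame, the
    error is missed speech + false alarm + speaker confusion, i.e.
    max(|ref|, |hyp|) - |ref ∩ hyp|, normalized by total reference
    speaker time."""
    collar_frames = {i for b in boundaries
                     for i in range(int((b - collar_s) / frame_s),
                                    int((b + collar_s) / frame_s) + 1)}
    err = total = 0.0
    for i, (r, h) in enumerate(zip(ref, hyp)):
        if i in collar_frames:
            continue
        total += len(r)
        err += max(len(r), len(h)) - len(r & h)
    return err / total if total else 0.0
```

Because the collar discards frames around reference boundaries, the same system scores noticeably lower DER at collar 0.25 than at collar 0, which is exactly the pattern seen in Table 1.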

We use a Conformer-based [15] ASR model as our single-speaker baseline (ConformerA), trained on the Train-Ali-near set using ESPnet [45]. We adopt Serialized Output Training (SOT) [26] to recognize speech from multiple speakers containing overlapped segments, generating the transcriptions of multiple speakers one after another. The baseline results of multi-speaker ASR are shown in Table 2. Note that SOT and SOT_bf are trained on Train-Ali-near and Train-Ali-far, respectively. Compared with the single-speaker Conformer model (ConformerA), the two SOT multi-speaker models obtain significant improvements on the Eval and Test sets, with SOT_bf achieving superior performance.

More details on the data arrangements, tracks and baseline results can be referred to the challenge evaluation plan paper [52].

Testing data ConformerA SOT SOT_bf
Eval-Ali-far 49.0 30.8 34.3
Eval-Ali-far-bf 45.6 33.2 29.7
Test-Ali-far 50.4 32.4 35.9
Test-Ali-far-bf 46.3 33.9 30.9
Table 2: Multi-speaker ASR results on Eval and Test in CER (%).
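The CER metric in Table 2 is the character-level Levenshtein distance normalized by the reference length. A minimal sketch of the core computation follows; the challenge's multi-speaker scoring additionally resolves the optimal assignment between hypothesized and reference speakers, which is omitted here.

```python
def cer(ref, hyp):
    """Character error rate: minimum number of character substitutions,
    insertions and deletions turning `ref` into `hyp`, divided by the
    reference length. Assumes a non-empty reference string."""
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(r)
```

For Mandarin, scoring at the character level avoids the word-segmentation ambiguity that a word error rate would introduce.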
Table 3: Top 8 ranking teams in terms of DER in track 1 and their major techniques.

4 Summary on Track 1 - Speaker Diarization

In total, 14 teams submitted their results to track 1, and the DER for the top 8 teams is summarized in Table 3. Examining performance by the number of speakers, we can see that, in general, DER increases with the number of speakers in meeting sessions. For most teams, there are clear performance gaps between 2- and 3-speaker sessions and between 3- and 4-speaker sessions. The winner is team A41 [44], which achieves the lowest DER of 2.98%, surpassing the official baseline (15.60%) by a large margin. Interestingly, their system works equally well on 3- and 4-speaker sessions. Two key techniques ensure their superior performance: using TS-VAD [33] to find speaker overlap and employing cross-channel self-attention [4] to further improve performance. Table 3 also summarizes the major techniques used by the top 8 teams, namely the main approach, data augmentation strategy, front-end processing and post-processing. We highlight these key techniques in the following.

4.1 Main Approach

With the assumption that each speech frame corresponds to only one speaker, a clustering-based speaker diarization system is incapable of handling overlapped speech without additional modules. Since AliMeeting has a high ratio of speaker overlap, it is beneficial to adopt effective methods to reduce the error brought by the overlapped speech. The top three teams all employ TS-VAD to find the overlap between speakers, while end-to-end approaches, e.g., EEND, are not considered. We believe this is because the single-speaker speech segments in meeting recordings can be effectively used (through clustering) to obtain speaker embeddings as the initial input for the TS-VAD model, which has proven consistently effective for handling overlapped speech in the literature. Instead of using the original TS-VAD that takes i-vectors as target-speaker embeddings, the winning team A41 [44] uses deep speaker embeddings extracted by a ResNet [16] to detect the target speaker. Moreover, with the premise that different acoustic features are complementary, the second-place team V52 [54] proposes a multi-level feature fusion mechanism for TS-VAD, and the fusion between spatial-related and speaker-related features leads to a 2% absolute DER reduction on the Eval set. Some teams adopt approaches to improve the clustering-based algorithm itself. For example, it is effective to use overlapping speech detection (OSD) to divide oracle VAD segments into single-speaker segments and overlapped segments. Moreover, estimating the direction of arrival (DOA) to distinguish speakers by their spatial information also proves beneficial. Team Q36 [43] demonstrates that re-assigning speaker labels to the overlapping segments via a speech separation method can lead to a 14.32% relative DER reduction on the Eval set (7.47% to 6.40%).

4.2 Data Augmentation

Since the size of the released training data is relatively small, data augmentation is adopted by most teams. For example, noise augmentation and reverberation simulation are generally used, which modestly improve model robustness. Simulated room impulse responses (RIRs) are convolved with the original speech to generate reverberant data. To further augment the training samples, teams A41, C16 and Q36 apply amplification and tempo perturbation (changing audio playback speed without changing pitch) to the audio signals. Moreover, as speaker overlap is salient in the data, several teams create an extra simulated dataset based on AliMeeting and CN-Celeb. In detail, utterances from different speakers are randomly selected from these data and then combined with an overlap ratio from 0 to 40%. It is also worth noting that the winning team A41 simulates data in an online manner in order to obtain more diverse data and stronger model robustness.
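The overlap simulation described above can be sketched as a simple two-utterance mixing function. This is an illustrative reconstruction of the general recipe, not any team's actual pipeline; gain normalization, more than two speakers and reverberation are all omitted.

```python
import numpy as np

def mix_with_overlap(wav_a, wav_b, overlap_ratio):
    """Toy two-speaker overlap simulation: utterance B starts before A ends
    so that `overlap_ratio` of the shorter utterance overlaps with A.
    A ratio of 0 yields back-to-back (non-overlapped) speech; 0.4 matches
    the upper end of the 0-40% range used by the participants."""
    overlap = int(overlap_ratio * min(len(wav_a), len(wav_b)))
    start_b = len(wav_a) - overlap
    out = np.zeros(start_b + len(wav_b), dtype=np.float32)
    out[:len(wav_a)] += wav_a
    out[start_b:start_b + len(wav_b)] += wav_b
    return out
```

Drawing `overlap_ratio` afresh for every training example, as in A41's online simulation, keeps the overlap statistics of the simulated data diverse across epochs.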

4.3 Front-End Processing

Front-end processing approaches, such as dereverberation, beamforming and speech enhancement, have proven effective for downstream tasks dealing with far-field speech. In the challenge, teams X22 [17], C16 and Q36 adopt weighted prediction error (WPE) dereverberation based on long-term linear prediction, leading to an absolute 0.7% DER reduction on the Eval set. Moreover, experiments from team X22 show that the offline dereverberation mode is more effective than the online mode. Interestingly, team Q36 found that multi-channel WPE is harmful to OSD while beneficial for speaker clustering and speech separation. Effective adoption of spatial information, including beamforming [1], is also widely considered by the participants. In particular, team S76 proposes a novel architecture named discriminative multi-stream neural network (DMSNet) for overlapped speech detection. Instead of adopting beamforming, the winning team A41 employs cross-channel self-attention to integrate multi-channel signals, where the non-linear spatial correlations between different channels are learned and fused.
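The cross-channel self-attention idea, learning how to combine microphone channels rather than fixing beamforming weights, can be sketched as single-head attention applied over the channel axis. This is a shape-level illustration under assumed projection matrices, not the architecture of [4] or team A41's system.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_channel_attention(feats, wq, wk, wv):
    """Minimal single-head self-attention across the channel axis.
    `feats` has shape (T frames, C channels, D features); wq/wk/wv are
    (D, D) projection matrices standing in for learned weights. For each
    frame independently, every channel attends over all C channel views,
    so the combination weights are input-dependent, unlike a fixed
    beamformer."""
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (T, C, C)
    return softmax(scores, axis=-1) @ v                        # (T, C, D)
```

The output keeps one enhanced view per channel; a real front-end would typically pool or project these before feeding the diarization back-end.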

4.4 Post-Processing

Since the challenge does not restrict computational workload or system fusion, most teams employ DOVER-Lap [36] to fuse multiple effective models. The improvement from DOVER-Lap fusion depends on the number and type of models, and the relative DER reduction ranges from 2% to 15%. Note that although conventional VBx clustering is not as good as TS-VAD, it brings extra gain after model fusion. Re-clustering is also an effective method for conventional clustering-based speaker diarization; it further refines the number of speakers by merging very similar clusters according to their cosine distances. Since our challenge provides oracle VAD, team X22 fuses the results with the oracle VAD by deleting wrong speech segments and labeling the silent segments, leading to a 22.2% relative DER reduction on the Eval set (13.04% to 10.14%).
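The re-clustering refinement just described can be sketched as merging clusters whose centroid embeddings are nearly parallel in cosine space. The threshold value and the union-find bookkeeping are assumptions for the example, not any team's reported configuration.

```python
import numpy as np

def recluster(centroids, labels, threshold=0.8):
    """Sketch of re-clustering: clusters whose centroid embeddings have
    cosine similarity above `threshold` are merged (transitively, via
    union-find), reducing an over-estimated speaker count. `labels` are
    per-segment cluster indices into `centroids`."""
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    parent = list(range(len(c)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    sim = c @ c.T
    for a in range(len(c)):
        for b in range(a + 1, len(c)):
            if sim[a, b] > threshold:
                parent[find(b)] = find(a)
    remap, new_labels = {}, []
    for l in labels:
        root = find(l)
        remap.setdefault(root, len(remap))
        new_labels.append(remap[root])
    return new_labels
```

Two clusters belonging to the same real speaker (e.g. split by a channel change) end up sharing one label, while distinct speakers stay apart.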

Table 4: Top 5 ranking teams in terms of CER in track 2 and their major techniques.

5 Summary on Track 2 - Multi-speaker ASR

Twelve teams submitted their results to track 2, and the CER for the top 5 teams is summarized in Table 4. Similar to the observation in track 1, CER sharply increases with the number of speakers in the meeting sessions, mainly due to the high speaker overlap ratio in meetings with more speakers. The winning team R62 [49] obtained the lowest average CER of 18.79%, an over 12% absolute CER reduction compared with the baseline. The superior performance comes from a SOT-based multi-speaker ASR system with large-scale data simulation. Moreover, system fusion is also beneficial, as reported by the winning team: the standard joint CTC/attention Conformer [15] and the U2++ [47] model with a bidirectional attention decoder are fused with a clear performance gain. The main approaches, data augmentation strategies, and front- and back-end processing are summarized in Table 4.

5.1 Main Approach

The top 5 teams all adopt the SOT approach [26], similar to our multi-speaker baseline system, resulting in over 15% CER reduction compared with the single-speaker ASR system on the Eval set. The SOT method has an excellent ability to model the dependencies among outputs for different speakers and no longer limits the maximum number of speakers. Unsurprisingly, the Conformer architecture [15], which models both local and global context of speech, is used by all teams. It should be noted that, besides Conformer, the winning team R62 [49] also uses the recent U2++ [47] structure, where a bidirectional attention decoder integrates information from both directions at inference. The fusion of the two models brings an 8.7% relative CER reduction on the Eval set.

5.2 Data Augmentation

Similar to track 1, various data augmentation tricks were widely adopted in track 2. Noise augmentation, reverberation simulation, speed perturbation and SpecAugment are the mainstream methods with stable performance improvements. According to the report provided by second-place team B24 [40], a relative CER reduction of 13.5% can be achieved by multi-channel multi-speaker data simulation compared with the baseline trained on Train-Ali-far. Compared with speaker diarization, data simulation for multi-speaker ASR is more complex, as it needs to consider various factors such as speaker turns and conversation duration. Thus, fine-grained data simulation is essential to ensure a consistent performance gain. For example, the simulated speaker overlap ratio should be reasonable, including coverage of extreme cases like a sudden (very brief) interruption from another speaker. It is also worth noting that the winning team R62 makes substantial efforts in data augmentation and simulation. Finally, they expand the original training data to about 18,000 hours, achieving a 9.7% absolute CER reduction compared with the baseline system.

5.3 Front-End Processing

Similar to track 1, the classical front-end processing techniques for far-field speech recognition, including beamforming, dereverberation and DOA estimation, are also adopted in track 2 with performance gains. Specifically, beamforming is used by most teams and WPE-based dereverberation is considered by two teams, while DOA estimation of the target speaker is only used by the winning team R62 among the top 5 teams. Front-end and back-end joint modeling using neural networks is also considered by the second- and third-place teams (B24 and X18). With the premise that optimizing the front-end and back-end separately leads to sub-optimal performance, joint modeling allows the whole system to be optimized under the final metric. Teams B24 and X18 both take the multi-channel signal as input to a neural front-end and then cascade the front-end with the back-end Conformer ASR model. The whole neural architecture is then jointly trained. According to the report from B24, joint modeling leads to a 13.3% relative CER reduction (from 24.0% to 20.8%) on the Eval set.

5.4 Post-Processing

As reported by several teams, the contribution from language modeling (LM), either n-gram or neural LM, is very weak. This is mainly because LM building is restricted to the transcripts of the training data, while using extra text data is prohibited by the challenge rules. Most teams employ model fusion, which brings absolute improvements ranging from 10% to 15% on the Eval set compared with the baseline. For example, the winning team R62 eventually fused 7 models via simple ROVER, including 3 Conformer models and 4 U2++ models trained with different data configurations. Other fusion tricks include LM rescoring for single-speaker and multi-speaker ASR models (team G34) and model averaging over different training stages (team B24).
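Of the fusion tricks above, checkpoint averaging is the cheapest to illustrate: parameters from several training-stage snapshots are averaged element-wise before inference. The dict-of-lists parameter representation below is a simplification for the example; real toolkits average tensors in saved checkpoint files.

```python
def average_checkpoints(checkpoints):
    """Sketch of checkpoint averaging: element-wise mean of parameters
    across several training-stage snapshots, a common low-cost ensemble
    for attention-based ASR models. Each checkpoint is a dict mapping
    parameter name -> list of floats; all checkpoints are assumed to
    share the same architecture (same names and shapes)."""
    n = len(checkpoints)
    avg = {}
    for name in checkpoints[0]:
        avg[name] = [sum(ckpt[name][i] for ckpt in checkpoints) / n
                     for i in range(len(checkpoints[0][name]))]
    return avg
```

Unlike ROVER, which combines hypothesis transcripts after decoding, this produces a single averaged model and therefore adds no decoding-time cost.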

6 Conclusions

This paper briefly describes the setup of the ICASSP 2022 multi-channel multi-party meeting transcription challenge (M2MeT) and summarizes its outcomes, highlighting the major techniques used by the top-performing teams. We conclude the paper by listing the major findings. With a limited amount of data to train systems, data augmentation and simulation are effective for both speaker diarization and multi-speaker ASR. Likewise, system fusion is another important trick with a steady performance gain if computational resources are not constrained. Front-end processing techniques are also beneficial for far-field scenarios, including meeting transcription, the task at hand. But for meetings like the AliMeeting data in particular, speaker overlap should be explicitly addressed. For speaker diarization, TS-VAD is still the superior approach to handle speaker overlap. By using the above-mentioned methods and tricks, the diarization error rate has been lowered to 3% on AliMeeting. For multi-speaker ASR, Conformer is still the state-of-the-art (single-speaker) ASR model used by most teams, and Serialized Output Training is an easy-to-use approach to explicitly consider speaker overlap. Front-end and back-end joint modeling using neural networks is also a promising solution that deserves future investigation. The best-performing system in the ASR track achieves an 18.79% character error rate given the limited training data.


  • [1] X. Anguera, C. Wooters, and J. Hernando (2007) Acoustic beamforming for speaker diarization of meetings. Proc. TASLP 15 (7), pp. 2011–2022. Cited by: §4.3.
  • [2] J. Barker, S. Watanabe, E. Vincent, and J. Trmal (2018) The fifth 'CHiME' speech separation and recognition challenge: dataset, task and baselines. In Proc. INTERSPEECH, pp. 1561–1565. Cited by: §2.
  • [3] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, et al. (2020) Pyannote. audio: neural building blocks for speaker diarization. In Proc. ICASSP, pp. 7124–7128. Cited by: §2.
  • [4] F. Chang, M. Radfar, A. Mouchtaris, and M. Omologo (2021) Multi-channel transformer transducer for speech recognition. arXiv preprint arXiv:2108.12953. Cited by: §4.
  • [5] X. Chang, W. Zhang, Y. Qian, et al. (2019) MIMO-speech: end-to-end multi-channel multi-speaker speech recognition. In Proc. ASRU, pp. 237–244. Cited by: §2.
  • [6] Z. Chen, J. Droppo, J. Li, and W. Xiong (2017) Progressive joint modeling in unsupervised single-channel overlapped speech recognition. Proc. TASLP 26 (1), pp. 184–196. Cited by: §1.
  • [7] Z. Chen, Y. Luo, and N. Mesgarani (2017) Deep attractor network for single-microphone speaker separation. In Proc. ICASSP, pp. 246–250. Cited by: §1.
  • [8] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, et al. (2020) Continuous speech separation: Dataset and analysis. In Proc. ICASSP, pp. 7284–7288. Cited by: §2.
  • [9] Y. Fan, J. Kang, L. Li, K. Li, H. Chen, S. Cheng, P. Zhang, Z. Zhou, et al. (2020) CN-CELEB: A challenging chinese speaker recognition dataset. In Proc. ICASSP, pp. 7604–7608. Cited by: §3.
  • [10] J. G. Fiscus, J. Ajot, and J. S. Garofolo (2007) The rich transcription 2007 meeting recognition evaluation. In Proc.MTPH, pp. 373–389. Cited by: §1.
  • [11] J. G. Fiscus, J. Ajot, M. Michel, and J. S. Garofolo (2006) The rich transcription 2006 spring meeting recognition evaluation. In Proc. MLMI, pp. 309–322. Cited by: §1.
  • [12] J. G. Fiscus, N. Radde, J. S. Garofolo, A. Le, J. Ajot, et al. (2005) The rich transcription 2005 spring meeting recognition evaluation. In Proc. MLMI, pp. 369–389. Cited by: §1.
  • [13] Y. Fu, L. Cheng, S. Lv, Y. Jv, Y. Kong, Z. Chen, Y. Hu, et al. (2021)

    AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario

    In Proc. INTERSPEECH, pp. 3665–3669. Cited by: §2, §3.
  • [14] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe (2019) End-to-end neural speaker diarization with self-attention. In Proc. ASRU, pp. 296–303. Cited by: §1, §2.
  • [15] A. Gulati, J. Qin, C. Chiu, N. Parmar, et al. (2020) Conformer: Convolution-augmented transformer for speech recognition. In Proc. INTERSPEECH, pp. 5036–5040. Cited by: §3, §5.1, §5.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, pp. 770–778. Cited by: §4.1.
  • [17] M. He, X. Lv, W. Zhou, et al. (2022) The ustc-ximalaya system for the icassp 2022 multi-channel multi-party meeting transcription (m2met) challenge. In arXiv preprint arXiv:2202.04855, Cited by: §4.3.
  • [18] M. He, D. Raj, Z. Huang, J. Du, Z. Chen, and S. Watanabe (2021) Target-speaker voice activity detection with improved i-vector estimation for unknown number of speaker. arXiv preprint arXiv:2108.03342. Cited by: §2.
  • [19] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: Discriminative embeddings for segmentation and separation. In Proc. ICASSP, pp. 31–35. Cited by: §1.
  • [20] J. Heymann, L. Drude, C. Boeddeker, et al. (2017) Beamnet: end-to-end training of a beamformer-supported multi-channel asr system. In Proc. ICASSP, pp. 5325–5329. Cited by: §2.
  • [21] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu (2020) End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. In Proc. INTERSPEECH, pp. 269–273. Cited by: §1, §2.
  • [22] Y. Hu, Y. Liu, S. Lv, M. Xing, et al. (2021) DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. In Proc. INTERSPEECH, pp. 2472–2476. Cited by: §2.
  • [23] W. Huang and J. Feng (2020) Differential beamforming for uniform circular array with directional microphones.. In Proc. INTERSPEECH, pp. 71–75. Cited by: §3.
  • [24] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, et al. (2003) The ICSI meeting corpus. In Proc. ICASSP, pp. 1–5. Cited by: §2.
  • [25] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, and M. Mason (2011) I-vector based speaker recognition on short utterances. In Proc. INTERSPEECH, pp. 2341–2344. Cited by: §2.
  • [26] N. Kanda, Y. Gaur, X. Wang, Z. Meng, et al. (2020) Serialized output training for end-to-end overlapped speech recognition. In Proc. INTERSPEECH, pp. 2797–2801. Cited by: §1, §2, §3, §5.1.
  • [27] Y. Koizumi, S. Karita, A. Narayanan, S. Panchapagesan, et al. (2021) SNRi target training for joint speech enhancement and recognition. arXiv preprint arXiv:2111.00764. Cited by: §2.
  • [28] F. Landini, J. Profant, M. Diez, and L. Burget (2022) Bayesian hmm clustering of x-vector sequences (vbx) in speaker diarization: theory, implementation and analysis on standard tasks. Proc. CSL 71, pp. 101254. Cited by: §2.
  • [29] J. Li (2021) Recent advances in end-to-end automatic speech recognition. arXiv preprint arXiv:2111.01690. Cited by: §2.
  • [30] Y. Liu and D. Wang (2019) Divide and conquer: a deep casa approach to talker-independent monaural speaker separation. IEEE/ACM Trans. TASLP 27, pp. 2092–2102. Cited by: §2.
  • [31] Y. Luo and N. Mesgarani (2019) Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation. In Proc. TASLP, pp. 1256–1266. Cited by: §2.
  • [32] I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, et al. (2005) The AMI meeting corpus. In Proc. ICMT, Vol. 88, pp. 100. Cited by: §2.
  • [33] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, et al. (2020) Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. In Proc. INTERSPEECH, pp. 274–278. Cited by: §2, §4.
  • [34] T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, et al. (2022) A review of speaker diarization: recent advances with deep learning. Proc. CSL 72, pp. 101317. Cited by: §1.
  • [35] D. Raj, P. Denisov, Z. Chen, H. Erdogan, et al. (2021) Integration of speech separation, diarization, and recognition for multi-speaker meetings: system description, comparison, and analysis. In Proc. SLT, pp. 897–904. Cited by: §2.
  • [36] D. Raj, L. P. Garcia-Perera, Z. Huang, S. Watanabe, D. Povey, et al. (2021) DOVER-lap: a method for combining overlap-aware diarization outputs. In Proc. SLT, pp. 881–888. Cited by: §4.4.
  • [37] N. Ryant, K. Church, C. Cieri, et al. (2019) The second dihard diarization challenge: dataset, task, and baselines. In Proc. INTERSPEECH, pp. 978–982. Cited by: §2.
  • [38] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, et al. (2020) The third dihard diarization challenge. arXiv preprint arXiv:2012.01477. Cited by: §2.
  • [39] G. Sell and D. Garcia-Romero (2015) Diarization resegmentation in the factor analysis subspace. In Proc. ICASSP, pp. 4794–4798. Cited by: §2.
  • [40] C. Shen, Y. Liu, W. Fan, et al. (2022) The volcspeech system for the icassp 2022 multi-channel multi-party meeting transcription challenge. In arXiv preprint arXiv:2202.04261, Cited by: §5.2.
  • [41] J. Shi, X. Chang, P. Guo, S. Watanabe, Y. Fujita, J. Xu, B. Xu, and L. Xie (2020) Sequence to multi-sequence learning via conditional chain mapping for mixture signals. In Proc. NIPS, Vol. 33, pp. 3735–3747. Cited by: §2.
  • [42] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In Proc. ICASSP, pp. 5329–5333. Cited by: §2, §2.
  • [43] J. Tian, X. Hu, and X. Xu (2022) Royalflush speaker diarization system for icassp 2022 multi-channel multi-party meeting transcription challenge. In arXiv preprint arXiv:2202.04814, Cited by: §4.1.
  • [44] W. Wang, X. Qin, and M. Li (2022) Cross-channel attention-based target speaker voice activity detection: experimental results for m2met challenge. In arXiv preprint arXiv:2202.02687, Cited by: §4.1, §4.
  • [45] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, et al. (2018) ESPnet: End-to-End speech processing toolkit. In Proc. INTERSPEECH, pp. 2207–2211. Cited by: §3.
  • [46] S. Watanabe, M. Mandel, J. Barker, and E. Vincent (2020) Chime-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. In Proc. 6th International Workshop on Speech Processing in Everyday Environments, Cited by: §2, §2, §2.
  • [47] D. Wu, B. Zhang, C. Yang, et al. (2021) U2++: unified two-pass bidirectional end-to-end model for speech recognition. arXiv preprint arXiv:2106.05642. Cited by: §5.1, §5.
  • [48] J. Wu, Z. Chen, J. Li, et al. (2020) An end-to-end architecture of online multi-channel speech separation. pp. 81–85. Cited by: §2.
  • [49] S. Ye, P. Wang, S. Chen, et al. (2022) The royalflush system of speech recognition for m2met challenge. In arXiv preprint arXiv:2202.01614, Cited by: §5.1, §5.
  • [50] D. Yu, X. Chang, and Y. Qian (2017) Recognizing multi-talker speech with permutation invariant training. In Proc. INTERSPEECH, pp. 2456–2460. Cited by: §1.
  • [51] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, pp. 241–245. Cited by: §1.
  • [52] F. Yu, S. Zhang, Y. Fu, L. Xie, S. Zheng, Z. Du, W. Huang, P. Guo, Z. Yan, B. Ma, et al. (2021) M2MeT: the icassp 2022 multi-channel multi-party meeting transcription challenge. arXiv preprint arXiv:2110.07393. Cited by: §1, §3, §3, §3.
  • [53] W. Zhang, C. Boeddeker, S. Watanabe, Y. Qian, et al. (2021) End-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend. In Proc. ICASSP, pp. 6898–6902. Cited by: §2.
  • [54] N. Zheng, N. Li, X. Wu, L. Meng, et al. (2022) The cuhk-tencent speaker diarization system for the icassp 2022 multi-channel multi-party meeting transcription challenge. In arXiv preprint arXiv:2202.01986, Cited by: §4.1.
  • [55] S. Zheng, W. Huang, X. Wang, H. Suo, J. Feng, and Z. Yan (2021) A real-time speaker diarization system based on spatial spectrum. In Proc. ICASSP, pp. 7208–7212. Cited by: §3.