Neural Speaker Diarization with Speaker-Wise Chain Rule

by   Yusuke Fujita, et al.

Speaker diarization is an essential step for processing multi-speaker audio. Although an end-to-end neural diarization (EEND) method achieved state-of-the-art performance, it is limited to a fixed number of speakers. In this paper, we solve this fixed number of speaker issue by a novel speaker-wise conditional inference method based on the probabilistic chain rule. In the proposed method, each speaker's speech activity is regarded as a single random variable, and is estimated sequentially conditioned on previously estimated other speakers' speech activities. Similar to other sequence-to-sequence models, the proposed method produces a variable number of speakers with a stop sequence condition. We evaluated the proposed method on multi-speaker audio recordings of a variable number of speakers. Experimental results show that the proposed method can correctly produce diarization results with a variable number of speakers and outperforms the state-of-the-art end-to-end speaker diarization methods in terms of diarization error rate.



page 1

page 2

page 3

page 4


End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

End-to-end speaker diarization for an unknown number of speakers is addr...

End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

In this paper, we present a conditional multitask learning method for en...

Speaker-Conditional Chain Model for Speech Separation and Extraction

Speech separation has been extensively explored to tackle the cocktail p...

Diarization of Legal Proceedings. Identifying and Transcribing Judicial Speech from Recorded Court Audio

United States Courts make audio recordings of oral arguments available a...

Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization

This paper investigates an end-to-end neural diarization (EEND) method f...

Voice Separation with an Unknown Number of Multiple Speakers

We present a new method for separating a mixed audio sequence, in which ...

End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

We present an end-to-end deep network model that performs meeting diariz...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Speaker diarization is the process of partitioning audio according to the speaker identity, which is an essential step for multi-speaker audio applications such as generating written minutes of meetings [1, 2]. Related techniques have been evaluated in telephone conversations (CALLHOME [3]), meetings (ICSI [4, 5], AMI [6]), and various hard scenarios (DIHARD Challenge [7, 8, 9]). Recent studies on the home-party scenario (CHiME-5 [10]

) reported that speaker diarization helped improve automatic speech recognition performance

[11, 12, 13].

One popular approach to speaker diarization is clustering of frame-level speaker embeddings [14, 15, 16, 17, 18, 19, 20]

. For instance, i-vectors

[21], d-vectors [22], and x-vectors [23] are common speaker embeddings in speaker diarization tasks. Segment-level speaker embeddings, which are learned jointly with a region proposal network, were also studied [24]

. Clustering methods commonly used for speaker diarization are agglomerative hierarchical clustering (AHC)

[15, 18, 19]

, k-means clustering

[17, 20]

, and spectral clustering

[20]. Recently, neural-network-based clustering has been explored [25]. Although clustering-based methods performed well, it is not optimized to directly minimize diarization errors because clustering is an unsupervised process. To directly minimize diarization errors in a supervised manner, clustering-free methods have been studied [26, 27, 28].

End-to-end neural diarization (EEND) [27, 28] is one of such clustering-free methods. EEND uses a single neural network that accepts multi-speaker audio and outputs the joint speech activity of multiple speakers. In contrast to the aforementioned methods except for [24], EEND handles overlapping speech without using any external module. The permutation-free training scheme [29, 30] and self-attention based network [31] play critical roles in achieving state-of-the-art performance on two-speaker telephone conversation datasets.

However, EEND is limited to a fixed number of speakers because output nodes of the neural network are comprised of multiple speakers’ speech activities. Although one can consider an application to a variable number of speakers by building a neural network that covers a sufficiently large number of speakers, an increasing number of output nodes would require impractical computational resources.

In this paper, we solve the fixed number of speaker issue by a novel speaker-wise conditional inference method. This proposed method regards each speaker’s speech activity as a single random variable, and formulates speaker diarization as an estimation of the joint distribution of multiple speech activity random variables. With the probabilistic chain rule, we can decode a speaker-wise speech activity sequentially conditioned on previously estimated speech activities like the other chain rule-based conditional inference methods, including neural language models

[32] and sequence-to-sequence models [33]. Similar to these conditional inference methods, our method produces a variable number of speakers with an appropriate stop sequence condition.

For training efficiency, we investigate teacher-forcing [34] like the other sequence-to-sequence models. The difference between general sequence modeling and our problem is that the order of the speakers is not uniquely determined in advance. Therefore, our approach also searches an appropriate speaker permutation during training to provide the unique speaker order used in teacher-forcing.

The experimental results on CALLHOME and simulated mixture datasets reveal that our proposed method achieves significant improvement over a conventional EEND method. Even in a fixed two-speaker configuration, the speaker-wise conditional inference method outperformed a conventional EEND method. In a variable speaker configuration, the ratio of improvement was more significant in the larger number of speakers. Our source code will be available online at

2 Related work

Our study is inspired by a similar speaker-wise decoding model proposed for speech separation task [35], and it was applied in speaker diarization as a down-stream task of speech separation [36]. With these methods, the model outputs a “residual speaker mask” for the next input. However, the model assumes that the speaker masks are additive and sum to one for each time-frequency bin. This assumption is not directly applicable to our diarization task without using speech separation. In this paper, to remove this assumption, we formulate our speaker-wise decoding as a conditional inference with the probabilistic chain rule.

For a variable number of speakers, AHC [15] and UIS-RNN [26] have been studied. AHC can generate a variable number of speakers by stopping the cluster merging operation according to a score threshold. UIS-RNN can detect a new speaker in an online manner by using a Bayesian nonparametric model. However, these methods fail in processing overlapping speech. In contrast, our method can handle overlapping speech of a variable number of speakers.

Encoder-decoder based attractor [37] is another recent approach to the variable number of speakers based on deep attractor networks [38]. While this method produces clusters in the embedding space, the proposed method directly produces speech activities using a similar encoder-decoder architecture.

3 Method

In this section, we describe speaker-wise conditional EEND (SC-EEND) as an extension of EEND. In the conventional EEND method, the number of speakers should be fixed, as shown in Fig. 1(a). Instead, the proposed method can produce a variable number of speakers. As shown in Fig. 1(b), the speaker-wise conditional neural network (SC-NN) produces a speech activity of one speaker conditioned by speech activities of previously estimated speakers. According to given multi-speaker audio, the model can produce a variable number of speakers iteratively.

(a) Conventional EEND method
(b) Proposed SC-EEND method
Figure 1: System diagrams of the conventional EEND method and the proposed SC-EEND method.

3.1 Neural probabilistic model of speaker diarization

Given a -length time sequence of -dimensional audio features as a matrix , speaker diarization estimates a set of speech activities , where is a vector representing a time sequence of speech activity for speaker index , and is the number of speakers.

In the proposed method, we formulate speaker diarization using a probabilistic model. Our method regards each speaker’s speech activity

as a single random variable, and models joint distribution of multiple speech activity random variables for estimating the most probable speech activity

, as follows:


In the EEND method [27], the joint distribution of multiple speech activity is factorized into speech activity of each speaker using the following conditional independence assumption:


The EEND method models the distribution using a neural network function , which maps an input into an output matrix , as follows:


, an element of , is interpreted as a posterior of speech activity of speaker at time index . Then, is converted into a binary estimate using a certain threshold. Here, the order of speakers (i.e., speaker index ) is determined during training. The training loss for the neural network output is computed as follows,


where is a binary cross-entropy function between a neural network output and a label, is a set of all possible permutations of a sequence . The optimal order of speakers is determined as the sequence

. We refer to the loss function as permutation-invariant training (PIT) loss.

3.2 Speaker-wise chain rule

Instead of using the conditional independence assumption in the EEND method, we use a fully-conditional model. With the probabilistic chain rule, the joint probability in Eq. 1 is converted into conditional probabilities without using any approximation unlike Eq. 2:


With this model, each speaker’s speech activity is sequentially decoded using previously estimated speech activities as conditions. This model is similar to other conditional inference models.

In the proposed method, a neural network outputs a vector of speaker index ,


where is a speaker-wise conditional neural network accepts an input with a speech activity vector of previous speaker index .

To generate a variable number of speakers, Eq. 7 is iteratively applied to the next speaker until no speech activity (i.e. equals to the all-zero vector) is found.

3.3 Encoder-Decoder architecture

Since the proposed speaker-wise conditional neural network generates the output for a variable number of times, the encoder-decoder type of the neural network is a suitable choice.

For the encoder part, similar to EEND [28], we use the Transformer Encoder [31] as follows:


where, is a linear projection that maps -dimensional vector to -dimensional vector for each column of the input matrix.

is the Transformer Encoder block that contains a multi-head self-attention layer, a position-wise feed-forward layer, and residual connections. By stacking the encoder

times, is an output of the encoder part.

For the decoder part, the neural network output for -th iteration is computed as follows:


where concatenates two matrices along with the first axis, is a uni-directional LSTM that maps -dimensional vector to -dimensional vector while keeping -dimensional memory cell for each column of the input matrix. Finally, a linear projection with a sigmoid activation produces a -dimensional vector as a neural network output.

3.4 Teacher-forcing during training

In Eq. 7, the neural network accepts a speech activity vector of the previous speaker index that is estimated at the previous decoder iteration. However, the estimation error at the previous iteration hurts the performance at the next iteration. To reduce the error, we use the teacher-forcing [34] technique, which boosts the performance by exploiting ground-truth labels. During training, Eq. 7 is replaced with as follows:


Here, is a ground-truth speech activity of speaker index . However, a problem arises with training loss computation in Eq. 4. As described in Sec. 3.1, the order of speakers is determined during training. One cannot determine a speaker index before computing the PIT loss, which requires estimates of all speakers. To alleviate this problem, we examine two kinds of loss computation strategies, as follows.

3.4.1 Speaker-wise greedy loss

For each decoding iteration, the optimal speaker index is selected by minimizing binary cross-entropy loss among a set of speaker indices, and the activity of the selected speaker is fed into the next decoding iteration.

3.4.2 Two-stage permutation-invariant training loss

Two-stage permutation-invariant training (PIT) loss is computed as Algorithm 1. At the first stage, the neural network outputs are computed without teacher-forcing (Eq. 7). Next, the optimal order of speakers is determined using Eq. 4. Then, at the second stage, neural network outputs are computed with teacher-forcing by using the optimal order of speakers. The final loss is computed between the second stage outputs and the ordered labels computed at the first stage. Note that the computation time of the two-stage process is reasonable since the backward computation is required only in the second stage.

Input :  ,   // Audio features and a set of speech activities
  // maximum num. of speakers
Output : 
1   // Condition for the first iteration
2 for  to  do
3       ,   // Eq. 7
4         // Threshold
6   // Optimal order of speakers in terms of PIT loss (Eq. 4)
7   // Condition for the first iteration
8 for  to  do
9         // Eq. 13
10         // The next condition
12// Loss with the optimal order
14   // No speech
Algorithm 1 Two-stage PIT loss

4 Experimental setup

4.1 Data

We prepared simulated training/test sets for both two-speaker and variable-speaker audio mixtures. We also prepared real adaptation/test sets from CALLHOME [3]. The statistics of the datasets are listed in Table 1. For the simulated dataset with a variable number of speakers (Simulated-vspk), the overlap ratio is controlled to be similar among the differing number of speakers. The simulation method is the same as [28]. For the CALLHOME-2spk and CALLHOME-vspk sets, we used the identical set of the Kaldi CALLHOME diarization v2 recipe [39]111, thereby enabling a fair comparison with the x-vector clustering-based method.

Num. Num. Avg. Overlap
spk rec dur ratio
Training sets
Simulated-2spk 2 100,000 87.6 34.4
Simulated-vspk 1-4 100,000 128.1 30.0
Adaptation sets
CALLHOME-2spk 2 155 74.0 14.0
CALLHOME-vspk 2-7 249 125.8 17.0
Test sets
Simulated-vspk 1-4 2,500 128.1 30.0
CALLHOME-2spk 2 148 72.1 13.0
CALLHOME-vspk 2-6 250 123.2 16.7
Table 1: Statistics of training/adaptation/test sets.

4.2 Model configuration

4.2.1 x-vector clustering-based (x-vector+AHC) model

We compared the proposed method with a conventional clustering-based system [7], which were created using the Kaldi CALLHOME diarization v2 recipe. The recipe uses AHC with the probabilistic linear discriminant analysis (PLDA) scoring scheme. The number of clusters was fixed to be two for the two-speaker experiments, while it was estimated using a PLDA score for the variable-speaker experiments.

4.2.2 EEND and SC-EEND models

We built self-attention-based EEND models and the proposed SC-EEND models, mostly based on the configuration described in [28]. The configurations have small differences between the two-speaker and variable-speaker experiments, as follows.

For the two-speaker experiments, we used four encoder blocks with 256 attention units containing four heads. For the variable-speaker experiments, we used four encoder blocks with 384 attention units containing six heads. We used a subsampling ratio of 20 for variable-speaker experiments, which is twice larger than that of two-speaker experiments (10). Note that conventional EEND does not handle a variable number of speakers. We trained a fixed four-speaker model with zero-padded labels for three or fewer speakers in the training data.

5 Results

We evaluated the models with the diarization error rate (DER). On the DER computation, overlapping speech and non-speech regions are also evaluated in the experiments. We used a collar tolerance of 250 ms at the start and end of each segment.

5.1 Experiments on fixed two-speaker models

DERs on the two-speaker CALLHOME are shown in Table 2. The proposed SC-EEND without teacher forcing (TF) was slightly worse than conventional EEND. With teacher-forcing, DER was significantly reduced and outperformed the conventional EEND method. For the loss computation strategy, two-stage PIT loss (PIT+TF) was slightly better than speaker-wise greedy loss (Greedy+TF).

Model Training DER
x-vector+AHC - 11.53
SC-EEND Greedy+TF 9.01
Table 2: DERs on two-speaker CALLHOME.

5.2 Experiments on a variable number of speakers

DERs on the variable-speaker simulated test set are shown in Table 3. For SC-EEND without TF, we observed no significant improvement from the conventional EEND. With TF, again, significant improvement was observed, particularly on a large number of speakers. PIT+TF was significantly better than Greedy+TF.

Num. of speakers
Model Training 1 2 3 4
EEND PIT 1.16 6.40 11.59 21.75
SC-EEND PIT 0.96 6.32 11.75 22.52
SC-EEND Greedy+TF 0.85 5.25 10.56 18.28
SC-EEND PIT+TF 0.76 4.31 8.31 12.50
Table 3: DERs on variable-speaker simulated test set.

DERs on the variable-speaker CALLHOME is shown in Table 4. Even without TF, the SC-EEND outperformed the conventional EEND and x-vector+AHC method. SC-EEND with TF boosted performance significantly. The DER was even better than 17.94%, which is the DER of x-vector+AHC with the oracle number of speakers.

Model Training DER
x-vector+AHC - 19.01
EEND PIT 20.47
SC-EEND Greedy+TF 18.07
Table 4: DERs on variable-speaker CALLHOME. Note that Greedy+TF adaptation model was evaluated at epoch, because the adaptation was not stable after the epoch.

5.3 Analysis on speaker counting

For the variable-speaker CALLHOME experiment, we analyzed the accuracy of speaker counting. The results are shown in Table 5. The proposed method achieved better speaker counting accuracy than the x-vector+AHC method, while it was still hard to handle more than four speakers.

2 3 4 5 6


2 84 62 2 0 0
3 18 51 5 0 0
4 2 12 6 0 0
5 0 4 1 0 0
6 0 1 2 0 0
(a) x-vector+AHC (Acc: 54.6%)
2 3 4 5 6
2 130 17 1 0 0
3 17 54 3 0 0
4 4 13 3 0 0
5 0 3 2 0 0
6 0 2 1 0 0
(b) SC-EEND (Acc: 74.8%)
Table 5: Speaker counting results on variable-speaker CALLHOME. SC-EEND models was trained with PIT+TF.

6 Conclusions

We proposed a speaker-wise conditional inference method as an extension to the end-to-end neural diarization method. Experimental results showed that the proposed method outperformed the conventional EEND method in variable-speaker scenarios. When estimating a larger number of speakers, the proposed method showed its advantage more significantly. The proposed method achieved better speaker counting accuracy, but it was still hard to handle more than four speakers. We will explore such hard scenarios, including DIHARD challenges for our future work.


  • [1] S. E. Tranter and D. A. Reynolds, “An overview of automatic speaker diarization systems,” IEEE Trans. on ASLP, vol. 14, no. 5, pp. 1557–1565, 2006.
  • [2] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, “Speaker diarization: A review of recent research,” IEEE Trans. on ASLP, vol. 20, no. 2, pp. 356–370, 2012.
  • [3] “2000 NIST Speaker Recognition Evaluation,”
  • [4] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The ICSI meeting corpus,” in Proc. ICASSP, vol. I, 2003, pp. 364–367.
  • [5] Ö. Çetin and E. Shriberg, “Overlap in meetings: ASR effects and analysis by dialog factors, speakers, and collection site,” in Proc. MLMI, 2006, pp. 212–224.
  • [6] S. Renals, T. Hain, and H. Bourlard, “Interpretation of multiparty meetings the AMI and Amida projects,” in 2008 Hands-Free Speech Communication and Microphone Arrays, 2008, pp. 115–118.
  • [7] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2808–2812.
  • [8] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Z̆molíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, L. Mos̆ner, and P. Matĕjka, “BUT system for DIHARD speech diarization challenge 2018,” in Proc. Interspeech, 2018, pp. 2798–2802.
  • [9] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, “Speaker diarization with enhancing speech for the first DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2793–2797.
  • [10] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Proc. Interspeech, 2018, pp. 1561–1565.
  • [11] C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach, “Front-End Processing for the CHiME-5 Dinner Party Scenario,” in Proc. CHiME-5, 2018, pp. 35–40.
  • [12] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Yalta Soplin, M. Maciejewski, S.-J. Chen, A. S. Subramanian, R. Li, Z. Wang, J. Naradowsky, L. P. Garcia-Perera, and G. Sell, “Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays,” in Proc. CHiME-5, 2018, pp. 6–10.
  • [13] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, and S. Watanabe, “Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches,” in Proc. ICASSP, 2019, pp. 6630–6634.
  • [14] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Trans. on ASLP, vol. 21, no. 10, pp. 2015–2028, 2013.
  • [15] G. Sell and D. Garcia-Romero, “Speaker diarization with PLDA i-vector scoring and unsupervised calibration,” in Proc. SLT, 2014, pp. 413–417.
  • [16] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Trans. on ASLP, vol. 22, no. 1, pp. 217–227, 2014.
  • [17] D. Dimitriadis and P. Fousek, “Developing on-line speaker diarization system,” in Proc. Interspeech, 2017, pp. 2739–2743.
  • [18] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, “Speaker diarization using deep neural network embeddings,” in Proc. ICASSP, 2017, pp. 4930–4934.
  • [19] M. Maciejewski, D. Snyder, V. Manohar, N. Dehak, and S. Khudanpur, “Characterizing performance of speaker diarization systems on far-field speech using standard methods,” in Proc. ICASSP, 2018, pp. 5244–5248.
  • [20] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, “Speaker diarization with LSTM,” in Proc. ICASSP, 2018, pp. 5239–5243.
  • [21] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. on ASLP, vol. 19, no. 4, pp. 788–798, 2011.
  • [22] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. ICASSP, 2018, pp. 4879–4883.
  • [23] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018, pp. 5329–5333.
  • [24] Z. Huang, S. Watanabe, Y. Fujita, P. Garcia, Y. Shao, D. Poveya, and S. Khudanpur, “Speaker diarization with region proposal network,” in Proc. ICASSP, 2020.
  • [25] Q. Li, F. L. Kreyssig, C. Zhang, and P. C. Woodland, “Discriminative neural clustering for speaker diarisation,” arXiv preprint arXiv:1910.09703, 2019.
  • [26] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully supervised speaker diarization,” in Proc. ICASSP, 2019, pp. 6301–6305.
  • [27] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” in Proc. Interspeech, 2019.
  • [28] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” in Proc. ASRU, 2019.
  • [29] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, 2016, pp. 31–35.
  • [30] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017, pp. 241–245.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NIPS, 2017, pp. 5998–6008.
  • [32]

    T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in

    INTERSPEECH, 2010.
  • [33] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. NIPS, 2014, pp. 3104–3112.
  • [34] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
  • [35] K. Kinoshita, L. Drude, M. Delcroix, and T. Nakatani, “Listening to each speaker one by one with recurrent selective hearing networks,” in Proc. ICASSP, 2018, pp. 5064–5068.
  • [36] T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, and R. Haeb-Umbach, “All-neural online source separation, counting, and diarization for meeting analysis,” in Proc. ICASSP, 2019.
  • [37] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, “End-to-end speaker diarization for unknown number of speakers
    with encoder-decoder based attractor,” in Proc. Interspeech, 2020 (submitted).
  • [38] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Trans. on ASLP, vol. 26, no. 4, pp. 787–796, 2018.
  • [39] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.