Speaker diarization is the process of partitioning an audio recording into homogeneous segments according to the speaker's identity. Speaker diarization has a wide range of applications, such as generating written records of meetings and a turn-taking analysis of telephone conversations [48, 1]. It also improves automatic speech recognition performance in multi-speaker conversation scenarios in meetings (ICSI [18, 5], AMI [36, 20]) and home environments (CHiME-5 [3, 12, 4, 21, 20]).
The most common approach to speaker diarization is based on clustering of speaker embeddings [30, 40, 37, 39, 10, 15, 27, 51]. For instance, i-vectors [8, 40, 37, 27], d-vectors [50, 51], and x-vectors [42, 15] are commonly used in speaker diarization tasks. These embeddings of short segments are partitioned into speaker clusters by using clustering algorithms, such as Gaussian mixture models [30, 40], agglomerative hierarchical clustering [30, 37, 15, 27], mean shift clustering, k-means clustering [10, 51], Links [29, 51], and spectral clustering. These clustering-based diarization methods have shown themselves to be effective on various datasets (see the DIHARD Challenge 2018 activities, e.g., [38, 9, 46]).
However, such clustering-based methods have a number of problems. First, they cannot be optimized to minimize diarization errors directly because the clustering is a type of unsupervised learning process. Second, they have trouble handling speaker overlaps, since the clustering algorithms implicitly assume one speaker per segment. Furthermore, they have trouble adapting their speaker embedding models to real audio recordings with speaker overlaps, because the speaker embedding model has to be optimized with single-speaker non-overlapping segments. These problems hinder speaker diarization when it is applied to real audio recordings that usually contain speaker overlaps.
To solve these problems, we propose End-to-End Neural Diarization (EEND). Different from most of the other methods, EEND does not rely on clustering. Instead, a neural network directly outputs the joint speech activities of all speakers for each time frame, given an input of a multi-speaker audio recording. Our method can naturally handle speaker overlaps during the training and inference period by exploiting a multi-label classification framework.
EEND is based on our previous studies [13, 14]. In [13], we proposed an optimal training scheme for a diarization model with a permutation-free objective function that provides minimal diarization errors. In [14], we extended that method by exploiting a self-attention-based neural network. Instead of a bidirectional long short-term memory (BLSTM) [16], we used a self-attention mechanism [26, 49], which resulted in a significant performance improvement over the BLSTM-based model.
In this paper, we reformulate speaker diarization as a simple multi-label classification, which is independent of the choice of neural network architecture: BLSTM or self-attention. Then we investigate the proposed method from various perspectives, by comparing the effects of different network architectures, visualizing latent representations, and evaluating it on multiple real datasets. The comparison of the network architectures revealed that a multi-head self-attention-based neural network is the key to achieving excellent performance. Experiments with different numbers of heads showed that this performance could be obtained by making the number of heads sufficiently larger than the number of speakers. Experiments with different numbers of self-attention-based encoder blocks revealed that the EEND model performed better when it had more encoder blocks. By visualizing the latent representation, we showed that self-attention could capture global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem. The evaluation on the real datasets showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled the speaker overlaps.
II Related work
II-A Clustering-based methods
Clustering-based methods are commonly used for speaker diarization. We used i-vector/x-vector clustering-based systems [38, 9, 41] as the baselines in our experiments. A diagram of a typical clustering-based system is depicted in Fig. 1(a).
To build the system, one has to prepare three independent models: (i) a speech activity detection (SAD) model for discriminating speech and non-speech, (ii) a speaker embedding extraction model for speaker identification, and (iii) a scoring model including the same/different speaker covariance matrices. None of these models can be trained to minimize the diarization errors directly. Optionally, a resegmentation process requires another model to refine speaker change points to produce the final diarization results.
Joint modeling methods have been studied in an effort to alleviate the complex preparation process and take into account the dependencies between these models. They include, for example, joint modeling of speaker embedding extraction and scoring [15, 32] and joint modeling of SAD and speaker embedding. However, the clustering process has remained unchanged because it is an unsupervised process.
In contrast to these methods, the EEND method uses only one neural network model, as depicted in Fig. 1(b). This method does not rely on clustering, and the model can be directly optimized with the reference diarization results of the training data.
This neural-network-based end-to-end approach, in which only one neural network model directly computes the final outputs, has been successfully applied in a variety of tasks, including neural machine translation [2, 47], automatic speech recognition [7, 6, 54], and text-to-speech [53, 44]. The proposed method is also categorized as such an approach.
II-B Clustering-free methods
The clustering process impedes model optimization aimed at minimizing diarization errors. To alleviate this problem, Zhang et al. proposed a clustering-free diarization method. This method is the first successful approach that does not cluster speaker embeddings and that is optimized with a diarization error minimization objective. The method formulates the speaker diarization problem on the basis of a factored probabilistic model, which consists of modules for determining speaker changes, speaker assignments, and feature generation. These models are jointly trained using input features and corresponding speaker labels. However, the SAD model and their speaker embedding (d-vector) extraction model have to be trained separately in their method. Moreover, their speaker-change model assumes one speaker for each segment, which hinders its application to speaker-overlapping speech.
In contrast to their method, the EEND method uses an end-to-end neural network that accepts audio features as input and outputs the joint speech activities of multiple speakers. The network is optimized using the entire recording, including non-speech and speaker overlaps, with a diarization-error-oriented objective.
II-C Self-attention mechanism
The self-attention mechanism was originally proposed for extracting sentence embeddings for text processing. Recently, the self-attention mechanism has shown superior performance in a variety of tasks, including machine translation, video classification, and image segmentation. For audio processing, a self-attention mechanism has been incorporated in acoustic modeling for ASR [45, 11], sound event detection, and speaker recognition. For speaker diarization, the self-attention mechanism has been applied to the speaker embedding extraction model and the scoring model of clustering-based methods. This study describes a self-attention mechanism for clustering-free speaker diarization.
III End-to-End Neural Diarization (EEND)
In this section, we describe a novel approach to the speaker diarization problem that exploits a multi-label classification framework with a permutation-free training scheme. We refer to the proposed method as EEND.
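III-A Speaker diarization as multi-label classification
The speaker diarization task can be formulated as a probabilistic multi-label classification problem, as follows.
Given an observation sequence of length $T$, $X = (\mathbf{x}_t \in \mathbb{R}^F \mid t = 1, \dots, T)$, from an audio signal, the speaker diarization problem is one of estimating the corresponding speaker label sequence $Y = (\mathbf{y}_t \mid t = 1, \dots, T)$. Here, $\mathbf{x}_t$ is an $F$-dimensional observation feature vector at time index $t$. Speaker label $\mathbf{y}_t = [y_{t,c} \in \{0, 1\} \mid c = 1, \dots, C]$ denotes a joint activity for multiple ($C$) speakers at time index $t$. For example, $y_{t,c} = y_{t,c'} = 1$ ($c \neq c'$) represents an overlap situation in which speakers $c$ and $c'$ are both present at time index $t$. Thus, determining $Y$ is sufficient to determine the speaker diarization information.
The most probable speaker label sequence $\hat{Y}$ is selected from among all possible speaker label sequences $\mathcal{Y}$, as follows:
$$\hat{Y} = \operatorname*{arg\,max}_{Y \in \mathcal{Y}} P(Y \mid X). \tag{1}$$
$P(Y \mid X)$ can be factorized using the conditional independence assumption as follows:
$$P(Y \mid X) = \prod_t P(\mathbf{y}_t \mid \mathbf{y}_1, \dots, \mathbf{y}_{t-1}, X) \tag{2}$$
$$\approx \prod_t P(\mathbf{y}_t \mid X) \approx \prod_t \prod_c P(y_{t,c} \mid X). \tag{3}$$
Here, we assume that the frame-wise posterior is conditioned on all inputs, and each speaker is present independently. The frame-wise posteriors $P(y_{t,c} \mid X)$ can be estimated using a neural-network-based model, as follows:
$$\mathbf{z}_t = [P(y_{t,c} = 1 \mid X) \mid c = 1, \dots, C] = \mathrm{NN}_t(\mathbf{x}_1, \dots, \mathbf{x}_T) \in (0, 1)^C, \tag{4}$$
where $\mathrm{NN}_t(\cdot)$ is a neural network which accepts a sequence of input features and outputs $\mathbf{z}_t$, a $C$-dimensional vector of the frame-wise posteriors at time index $t$.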
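As an illustration of the formulation above, the following minimal NumPy sketch scores a label sequence under the frame-wise and speaker-wise independence assumption and picks the most probable one. The function names and the toy posteriors are hypothetical and not part of the original system; the point is only that, once the posterior factorizes over frames and speakers, the arg max reduces to thresholding each posterior at 0.5.

```python
import numpy as np

def log_posterior(labels, posteriors, eps=1e-12):
    """log P(Y|X) when every (frame, speaker) label is treated as independent."""
    labels = np.asarray(labels, dtype=float)
    z = np.clip(np.asarray(posteriors, dtype=float), eps, 1.0 - eps)
    return float(np.sum(labels * np.log(z) + (1.0 - labels) * np.log(1.0 - z)))

def most_probable_labels(posteriors):
    """Arg max over all label sequences: each posterior is thresholded at 0.5."""
    return (np.asarray(posteriors) > 0.5).astype(int)

# Toy posteriors for T = 3 frames and C = 2 speakers (made-up numbers).
z = np.array([[0.9, 0.2],
              [0.7, 0.8],   # both speakers active: an overlap frame
              [0.1, 0.4]])
print(most_probable_labels(z))
```

The second frame is assigned to both speakers at once, which is exactly the overlap case a one-speaker-per-segment clustering cannot express.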
III-B Permutation-free training
The difficulty of training the model described above is that the model must deal with speaker permutations: changing the order of speakers within a correct label sequence is also regarded as correct. An example of permutations in a two-speaker case is shown in Fig. 2. In this paper, we call this problem "label ambiguity." This label ambiguity obstructs the training of the neural network when we use a standard binary cross-entropy loss function.
To cope with the label ambiguity problem, we employ the permutation-free training scheme, which considers all the permutations of the reference speaker labels. The permutation-free training scheme has been used in research on source separation [17, 56, 24]. Here, we apply a permutation-free loss function to a temporal sequence of speaker labels. The neural network is trained to minimize the permutation-free loss between the output $\mathbf{z}_t$ predicted using Eq. 4 and the reference speaker label $\mathbf{l}_t$, as follows:
$$J^{\mathrm{PF}} = \frac{1}{TC} \min_{\phi \in \mathrm{perm}(C)} \sum_t \mathrm{BCE}(\mathbf{l}_t^{\phi}, \mathbf{z}_t), \tag{5}$$
where $\mathrm{perm}(C)$ is the set of all the possible permutations of $(1, \dots, C)$, $\mathbf{l}_t^{\phi}$ is the $\phi$-th permutation of the reference speaker label, and $\mathrm{BCE}(\cdot, \cdot)$ is the binary cross entropy function between the label and the output.
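The permutation-free objective can be sketched in a few lines of NumPy. This is a brute-force illustration that enumerates every speaker permutation, which is practical only for a small number of speakers; the function names, the clipping constant, and the normalization by the number of frames times the number of speakers are assumptions made for this sketch.

```python
import itertools
import numpy as np

def bce(labels, outputs, eps=1e-7):
    """Binary cross entropy summed over all frames and speakers."""
    z = np.clip(outputs, eps, 1.0 - eps)
    return float(-(labels * np.log(z) + (1.0 - labels) * np.log(1.0 - z)).sum())

def permutation_free_loss(labels, outputs):
    """Minimum BCE over all column (speaker) permutations of the reference labels."""
    n_frames, n_speakers = labels.shape
    losses = [
        bce(labels[:, list(perm)], outputs)
        for perm in itertools.permutations(range(n_speakers))
    ]
    return min(losses) / (n_frames * n_speakers)

labels = np.array([[1, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
outputs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.6]])
# Swapping the output columns leaves the loss unchanged.
print(permutation_free_loss(labels, outputs),
      permutation_free_loss(labels, outputs[:, ::-1]))
```

Because the minimum is taken over permutations, a network that outputs the speakers in the opposite order from the reference is not penalized, which removes the label ambiguity from the training signal.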
IV Neural network architectures for EEND
In this section, we explore two different architectures of neural networks for the EEND method.
IV-A BLSTM-based neural network with Deep Clustering loss
According to Eq. 4, the neural-network-based function accepts a temporal sequence of feature vectors and outputs a vector for each time frame. Thus, this function can be modeled with bidirectional long short-term memory (BLSTM) as depicted in Fig. 3. The input features are transformed as follows:
$$\mathbf{h}_t^{(1)} = \mathrm{BLSTM}_t^{(1)}(\mathbf{x}_1, \dots, \mathbf{x}_T), \tag{6}$$
$$\mathbf{h}_t^{(p)} = \mathrm{BLSTM}_t^{(p)}(\mathbf{h}_1^{(p-1)}, \dots, \mathbf{h}_T^{(p-1)}) \quad (2 \le p \le P), \tag{7}$$
$$\mathbf{z}_t = \sigma(W_o \mathbf{h}_t^{(P)} + \mathbf{b}_o), \tag{8}$$
where $\mathrm{BLSTM}_t^{(p)}(\cdot)$ is the $p$-th BLSTM layer, which accepts an input sequence and outputs hidden activations $\mathbf{h}_t^{(p)}$ at time index $t$ (a concatenation of the forward and backward LSTM outputs). $\sigma(\cdot)$ is the element-wise sigmoid function. $W_o$ and $\mathbf{b}_o$ are a linear projection matrix and bias, respectively. We use $P$-layer stacked BLSTMs.
Assuming that the neural network extracts speaker embeddings in lower layers and then performs temporal segmentation using higher layers, the middle layer activations can be regarded as the speaker embeddings. Therefore, we place a speaker embedding training criterion on the middle layer activations.
Here, the $q$-th layer activations $\mathbf{h}_t^{(q)}$ obtained from Eq. 7 are transformed into a normalized $D$-dimensional embedding $\mathbf{v}_t$ as follows:
$$\mathbf{v}_t = \mathrm{Normalize}(\mathrm{Tanh}(W_v \mathbf{h}_t^{(q)} + \mathbf{b}_v)) \in \mathbb{R}^D, \tag{9}$$
where $W_v$ and $\mathbf{b}_v$ are a linear projection matrix and a bias, respectively. $\mathrm{Tanh}(\cdot)$ is the element-wise hyperbolic tangent function and $\mathrm{Normalize}(\cdot)$ is the L2 normalization function. We apply the Deep Clustering (DC) loss function so that the embeddings are partitioned into speaker-dependent clusters as well as overlapping and non-speech clusters. For example, in a two-speaker case, we generate four clusters (Non-speech, Speaker 1, Speaker 2, and Overlapping) as shown in Fig. 3.
The DC loss function is expressed as follows:
$$J^{\mathrm{DC}} = \left\| V V^\top - L' L'^\top \right\|_F^2, \tag{10}$$
where $V = [\mathbf{v}_1, \dots, \mathbf{v}_T]^\top$, and $L' \in \mathbb{R}^{T \times 2^C}$ is a matrix in which each row represents a one-hot vector converted from $\mathbf{l}_t$, where those elements are in the power set of speakers. $\| \cdot \|_F$ is the Frobenius norm. The loss function encourages two embeddings at different time indices to be close together if they are in the same cluster and far apart if they are in different clusters.
Next, we use multi-objective training, introducing a mixing parameter $\alpha$:
$$J^{\mathrm{MULTI}} = (1 - \alpha) J^{\mathrm{PF}} + \alpha J^{\mathrm{DC}}. \tag{11}$$
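The Deep Clustering criterion and the multi-objective mixing can be sketched as follows. This is an illustrative NumPy version, not the training code of the system: the helper names are hypothetical, the embeddings are random placeholders, and the convex combination with weight alpha is an assumption about the exact form of the mixing. The loss is computed through the usual low-rank expansion so that the T-by-T affinity matrices are never formed.

```python
import numpy as np

def powerset_one_hot(labels):
    """Map each frame's C binary labels to a one-hot vector over 2**C classes."""
    labels = np.asarray(labels, dtype=int)
    n_frames, n_speakers = labels.shape
    class_index = labels @ (2 ** np.arange(n_speakers))  # e.g. [1, 1] -> class 3
    one_hot = np.zeros((n_frames, 2 ** n_speakers))
    one_hot[np.arange(n_frames), class_index] = 1.0
    return one_hot

def deep_clustering_loss(embeddings, labels):
    """Squared Frobenius distance between embedding and label affinity matrices."""
    v = np.asarray(embeddings, dtype=float)
    l = powerset_one_hot(labels)
    # ||VV^T - LL^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T L||_F^2 + ||L^T L||_F^2
    return float(
        np.linalg.norm(v.T @ v) ** 2
        - 2.0 * np.linalg.norm(v.T @ l) ** 2
        + np.linalg.norm(l.T @ l) ** 2
    )

def multi_objective_loss(pf_loss, dc_loss, alpha=0.5):
    """Assumed mixing: convex combination of the two losses."""
    return (1.0 - alpha) * pf_loss + alpha * dc_loss

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalized rows
labels = rng.integers(0, 2, size=(20, 2))
print(deep_clustering_loss(emb, labels))
```

With two speakers the one-hot target has four columns, matching the Non-speech, Speaker 1, Speaker 2, and Overlapping clusters described in the text.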
IV-B Self-attention-based neural network
With a BLSTM, each output frame is conditioned only on its previous hidden state, subsequent hidden state, and current input feature. In contrast, by using a self-attention mechanism, each output frame is directly conditioned on all input frames by computing the pairwise similarity between all frame pairs. Here, we use a self-attention-based neural network instead of BLSTM, as depicted in Fig. 4. The input features are transformed as follows:
$$\mathbf{e}_t^{(0)} = W_0 \mathbf{x}_t + \mathbf{b}_0 \in \mathbb{R}^D, \tag{12}$$
$$\mathbf{e}_t^{(p)} = \mathrm{Encoder}_t^{(p)}(\mathbf{e}_1^{(p-1)}, \dots, \mathbf{e}_T^{(p-1)}) \quad (1 \le p \le P). \tag{13}$$
Here, $W_0 \in \mathbb{R}^{D \times F}$ and $\mathbf{b}_0 \in \mathbb{R}^D$ project an input feature into a $D$-dimensional vector. $\mathrm{Encoder}_t^{(p)}(\cdot)$ is the $p$-th encoder block, which accepts an input sequence of $D$-dimensional vectors and outputs a $D$-dimensional vector $\mathbf{e}_t^{(p)}$ at time index $t$. We use $P$ encoder blocks followed by the output layer for frame-wise posteriors.
The detailed architecture of the encoder block is depicted in Fig. 4. This configuration of the encoder block is almost the same as the one in the Speech-Transformer introduced in [11], but without positional encoding. The encoder block has two sub-layers. The first is a multi-head self-attention layer, and the second is a position-wise feed-forward layer.
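IV-B1 Multi-head self-attention layer
The multi-head self-attention layer transforms a sequence of input vectors as follows. The sequence of vectors $(\mathbf{e}_t^{(p-1)} \mid t = 1, \dots, T)$ is converted into a matrix; that is followed by layer normalization:
$$\bar{E}^{(p-1)} = \mathrm{LayerNorm}([\mathbf{e}_1^{(p-1)} \cdots \mathbf{e}_T^{(p-1)}]^\top) \in \mathbb{R}^{T \times D}. \tag{14}$$
Then, query, key, and value vectors are computed for each head $h$ ($1 \le h \le H$) by using linear transformations:
$$Q_h = \bar{E}^{(p-1)} W_h^{Q} + \mathbf{1} \mathbf{b}_h^{Q\top} \in \mathbb{R}^{T \times d}, \tag{15}$$
$$K_h = \bar{E}^{(p-1)} W_h^{K} + \mathbf{1} \mathbf{b}_h^{K\top} \in \mathbb{R}^{T \times d}, \tag{16}$$
$$V_h = \bar{E}^{(p-1)} W_h^{V} + \mathbf{1} \mathbf{b}_h^{V\top} \in \mathbb{R}^{T \times d}, \tag{17}$$
where $d = D/H$ is the dimension of each head, $H$ is the number of heads, and $W_h^{Q}, W_h^{K}, W_h^{V} \in \mathbb{R}^{D \times d}$ are query, key, and value projection matrices, respectively. $\mathbf{b}_h^{Q}, \mathbf{b}_h^{K}, \mathbf{b}_h^{V} \in \mathbb{R}^d$ are bias vectors and $\mathbf{1}$ is a $T$-dimensional all-one vector. A pairwise similarity matrix $A_h$ is computed using the dot products of the query vectors and key vectors:
$$A_h = Q_h K_h^\top \in \mathbb{R}^{T \times T}. \tag{18}$$
The pairwise similarity matrix is scaled by $1/\sqrt{d}$, and a softmax function is applied to form the attention weight matrix $\hat{A}_h$:
$$\hat{A}_h = \mathrm{Softmax}\left(\frac{A_h}{\sqrt{d}}\right) \in \mathbb{R}^{T \times T}. \tag{19}$$
Then, using the attention weight matrix, the context vectors $C_h$ are computed as a weighted sum of the value vectors:
$$C_h = \hat{A}_h V_h \in \mathbb{R}^{T \times d}. \tag{20}$$
Finally, the context vectors for all heads are concatenated and projected using an output projection matrix $W_O \in \mathbb{R}^{D \times D}$ and a bias $\mathbf{b}_O \in \mathbb{R}^D$:
$$E^{\mathrm{SA}} = [C_1 \cdots C_H] W_O + \mathbf{1} \mathbf{b}_O^\top \in \mathbb{R}^{T \times D}. \tag{21}$$
Following the self-attention layer, a residual connection and layer normalization are applied:
$$E^{\mathrm{SA,r}} = \bar{E}^{(p-1)} + E^{\mathrm{SA}}, \tag{22}$$
$$\bar{E}^{\mathrm{SA}} = \mathrm{LayerNorm}(E^{\mathrm{SA,r}}). \tag{23}$$
IV-B2 Position-wise feed-forward layer
The position-wise feed-forward layer transforms $\bar{E}^{\mathrm{SA}}$ as follows:
$$E^{\mathrm{FF}} = \mathrm{ReLU}(\bar{E}^{\mathrm{SA}} W_1 + \mathbf{1} \mathbf{b}_1^\top) W_2 + \mathbf{1} \mathbf{b}_2^\top \in \mathbb{R}^{T \times D}, \tag{24}$$
where $W_1 \in \mathbb{R}^{D \times d_{\mathrm{ff}}}$ and $\mathbf{b}_1 \in \mathbb{R}^{d_{\mathrm{ff}}}$ are the first linear projection matrix and bias, respectively, and $d_{\mathrm{ff}}$ is the number of internal units in this layer. $W_2 \in \mathbb{R}^{d_{\mathrm{ff}} \times D}$ and $\mathbf{b}_2 \in \mathbb{R}^D$ are the second linear projection matrix and bias, respectively.
Finally, the output of the encoder block $\mathbf{e}_t^{(p)}$ for each time frame is computed by applying a residual connection as follows:
$$[\mathbf{e}_1^{(p)} \cdots \mathbf{e}_T^{(p)}]^\top = \bar{E}^{\mathrm{SA}} + E^{\mathrm{FF}}. \tag{25}$$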
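To make the data flow through one encoder block concrete, here is a minimal NumPy sketch of the forward pass: layer normalization, multi-head scaled dot-product attention, the residual connections, and the position-wise feed-forward layer. Several details are assumptions for illustration only: the ReLU activation, a layer normalization without learned gain and bias, the random weight initialization, and all function and parameter names.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each frame (row) to zero mean and unit variance (no learned gain/bias)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(D, H, d_ff, rng, scale=0.1):
    """Random placeholder weights; D must be divisible by H."""
    d = D // H
    shapes = {"Wq": (H, D, d), "Wk": (H, D, d), "Wv": (H, D, d),
              "Wo": (D, D), "W1": (D, d_ff), "W2": (d_ff, D)}
    params = {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}
    params.update({"bq": np.zeros((H, d)), "bk": np.zeros((H, d)), "bv": np.zeros((H, d)),
                   "bo": np.zeros(D), "b1": np.zeros(d_ff), "b2": np.zeros(D)})
    return params

def encoder_block(E, params, residual=True):
    """E: (T, D) frame sequence. Returns the (T, D) output and (H, T, T) attention weights."""
    E_norm = layer_norm(E)
    n_heads, _, d = params["Wq"].shape
    contexts, weights = [], []
    for h in range(n_heads):
        Q = E_norm @ params["Wq"][h] + params["bq"][h]
        K = E_norm @ params["Wk"][h] + params["bk"][h]
        V = E_norm @ params["Wv"][h] + params["bv"][h]
        A = softmax(Q @ K.T / np.sqrt(d))   # every frame attends to all frames
        weights.append(A)
        contexts.append(A @ V)              # weighted sum of value vectors
    E_sa = np.concatenate(contexts, axis=1) @ params["Wo"] + params["bo"]
    if residual:
        E_sa = E_norm + E_sa
    E_sa = layer_norm(E_sa)
    # Position-wise feed-forward layer (ReLU is an assumption of this sketch).
    E_ff = np.maximum(E_sa @ params["W1"] + params["b1"], 0.0) @ params["W2"] + params["b2"]
    out = E_sa + E_ff if residual else E_ff
    return out, np.stack(weights)

rng = np.random.default_rng(0)
params = init_params(D=16, H=4, d_ff=32, rng=rng)
out, att = encoder_block(rng.normal(size=(10, 16)), params)
print(out.shape, att.shape)
```

The `residual` flag mirrors the two configurations discussed in the experiments, where most runs omit the residual connections and the deeper models include them. No positional encoding is added, so the block is invariant to how the frames are indexed apart from the attention pattern itself.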
IV-B3 Output layer for frame-wise posteriors
V Experimental setup
[Table I: Statistics of the training and test sets. Columns: Num. of, Avg. dur., Overlap.]
To verify the effectiveness of the EEND method for various overlap situations, we prepared four training sets and five test sets, including simulated and real datasets. The statistics of the training and test sets are listed in Table I. The overlap ratio is computed as the ratio of the audio time during which two or more speakers are active to the audio time during which one or more speakers are active.
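The overlap ratio defined above can be computed directly from frame-level speaker activity labels. The sketch below assumes a hypothetical T-by-C binary label matrix with equal-length frames, so that counting frames is equivalent to measuring audio time.

```python
import numpy as np

def overlap_ratio(labels):
    """Frames with two or more active speakers divided by frames with at least one."""
    active_speakers = np.asarray(labels).sum(axis=1)
    speech_frames = int((active_speakers >= 1).sum())
    if speech_frames == 0:
        raise ValueError("no speech frames in the labels")
    return float((active_speakers >= 2).sum()) / speech_frames

labels = np.array([[0, 0],   # silence
                   [1, 0],
                   [1, 1],   # overlap
                   [0, 1],
                   [1, 0]])
print(overlap_ratio(labels))  # 1 overlapped frame out of 4 speech frames
```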
Note that the training data for the EEND method are different from those for the i-vector/x-vector clustering-based method. Whereas the clustering-based methods use single-speaker segments for training their speaker embedding extraction models, the EEND method uses audio mixtures of multiple speakers. Such mixtures can be simulated infinitely with a combination of single-speaker segments. Moreover, the EEND model can be trained with not only simulated mixtures but also real audio mixtures with speaker overlaps.
V-A1 Simulated datasets
Each mixture was simulated by Algorithm 1. Unlike the mixture simulations of source separation studies, we consider a diarization-style mixture: each speech mixture should have dozens of utterances per speaker with reasonable silence intervals between utterances. The silence intervals are controlled by the average interval of $\beta$. Larger values of $\beta$ generate speech with less overlap.
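The effect of the average silence interval on overlap can be illustrated with a label-only simulation. This is not Algorithm 1: it skips the audio entirely (no room impulse responses or noise), uses a fixed utterance duration, and draws silence intervals from an exponential distribution whose mean is the average interval, all of which are assumptions made for this sketch. The function names are hypothetical.

```python
import numpy as np

def simulate_speaker_activity(n_utts, beta, rng, utt_dur=3.0, frame_shift=0.1):
    """0/1 activity track for one speaker: silence, utterance, silence, utterance, ..."""
    pieces = []
    for _ in range(n_utts):
        silence = rng.exponential(beta)  # assumed: exponential with mean beta (seconds)
        pieces.append(np.zeros(int(round(silence / frame_shift)), dtype=int))
        pieces.append(np.ones(int(round(utt_dur / frame_shift)), dtype=int))
    return np.concatenate(pieces)

def simulate_mixture_labels(n_speakers, n_utts, beta, seed=0):
    """Stack independent speaker tracks into a (T, n_speakers) label matrix."""
    rng = np.random.default_rng(seed)
    tracks = [simulate_speaker_activity(n_utts, beta, rng) for _ in range(n_speakers)]
    labels = np.zeros((max(len(t) for t in tracks), n_speakers), dtype=int)
    for c, track in enumerate(tracks):
        labels[:len(track), c] = track
    return labels

for beta in (2.0, 7.0):
    labels = simulate_mixture_labels(n_speakers=2, n_utts=500, beta=beta)
    n_active = labels.sum(axis=1)
    print(beta, (n_active >= 2).sum() / (n_active >= 1).sum())
```

Because each speaker's timeline is generated independently, longer average silences leave fewer frames where both speakers happen to talk at once, which is the trend the text describes.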
The set of utterances used in the simulation comprised the Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part 2), and NIST Speaker Recognition Evaluation datasets (2004, 2005, 2006, 2008). All recordings are telephone speech sampled at 8 kHz. There are 6,381 speakers in total. We split them into 5,743 speakers for the training set and 638 speakers for the test set. Note that the set of utterances for the training set is identical to that of the Kaldi CALLHOME diarization v2 recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization), thereby enabling a fair comparison with the x-vector clustering-based method.
Since there are no time annotations in these corpora, we extracted utterances using speech activity detection (SAD) on the basis of time-delay neural networks and statistics pooling (the SAD model: http://kaldi-asr.org/models/m4).
The set of background noises was from the MUSAN corpus. We used 37 recordings that are annotated as "background" noises. The set of 10,000 room impulse responses (RIRs) was from the Simulated Room Impulse Response Database. The SNR values were sampled from 10, 15, and 20 dB. These sets of non-speech corpora were also used for training the x-vector and SAD models in the x-vector clustering-based method.
We generated two-speaker mixtures with 10-20 utterances for each speaker. For the simulated training set, 100,000 mixtures were generated with $\beta = 2$ (SimBeta2). In addition, four sets of 100,000 mixtures with different values of $\beta$ (2, 3, 5, and 7) were combined to form 400,000 mixtures (SimLarge). For the simulated test set, 500 mixtures were generated with $\beta = 2$, 3, and 5. The overlap ratios of the simulated mixtures ranged from 19.5 to 34.4%.
V-A2 Real datasets
We used real telephone speech recordings as the real training set (Real). A set of 26,172 two-speaker recordings were extracted from the recordings of the Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part 2), and NIST Speaker Recognition Evaluation datasets. The overlap ratio of the training data was 3.7%, far less than that of the simulated mixtures.
We evaluated the proposed method on real telephone conversations in the CALLHOME dataset. We randomly split the two-speaker recordings from the CALLHOME dataset into two subsets: an adaptation set of 155 recordings and a test set of 148 recordings. The average overlap ratio of the test set was 13.0%.
In addition, we conducted an evaluation on the dialogue part of the Corpus of Spontaneous Japanese (CSJ). The CSJ contains 54 two-speaker dialogue recordings (we excluded four out of 58 recordings that contain speakers in the official speech recognition evaluation sets). They were recorded using headset microphones in separate soundproof rooms. The average overlap ratio of the CSJ test set was 20.1%, larger than that of the CALLHOME test set.
V-A3 Combined datasets
For generalizing a model to various environments, we conducted experiments using both a simulated training set (SimLarge) and the real training set (Real). We refer to the dataset as the combined training set (Comb).
V-B Model configuration
V-B1 Clustering-based systems
We compared the proposed method with two conventional clustering-based systems: the i-vector system and the x-vector system, which were created using the Kaldi CALLHOME diarization v1 and v2 recipes, respectively.
These recipes use agglomerative hierarchical clustering (AHC) with the probabilistic linear discriminant analysis (PLDA) scoring scheme. The number of clusters was fixed to 2. Though the original recipes use oracle speech/non-speech marks, we used the SAD model with the configuration described in Sec. V-A.
V-B2 BLSTM-based EEND system
We configured the BLSTM-based EEND system (BLSTM-EEND) described in Sec. IV-A. The input features were 23-dimensional log-Mel-filterbanks with a 25-ms frame length and 10-ms frame shift. Each feature was concatenated with those from the previous seven frames and subsequent seven frames. To deal with a long audio sequence in our neural networks, we subsampled the concatenated features by a factor of ten. Consequently, a (23 × 15 = 345)-dimensional input feature was fed into the neural network every 100 ms.
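The context splicing and subsampling step can be sketched as below. The log-Mel extraction itself is omitted and replaced by a random placeholder matrix; zero-padding at the recording edges and the function name are assumptions of this sketch.

```python
import numpy as np

def splice_and_subsample(feats, context=7, subsample=10):
    """Concatenate each frame with +/- `context` neighbors, then keep every `subsample`-th frame."""
    n_frames, _ = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="constant")  # assumed zero padding
    spliced = np.concatenate(
        [padded[offset:offset + n_frames] for offset in range(2 * context + 1)], axis=1
    )
    return spliced[::subsample]

# 10 seconds of 23-dimensional features at a 10-ms frame shift (random placeholder values).
feats = np.random.default_rng(0).normal(size=(1000, 23))
inputs = splice_and_subsample(feats)
print(inputs.shape)  # (100, 345): one 345-dimensional vector every 100 ms
```

Each output row still covers 15 consecutive original frames, so subsampling shortens the sequence tenfold without discarding the local context around the retained frames.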
We used a five-layer BLSTM with 256 hidden units in each layer. The second layer of the BLSTM outputs was used to form a 256-dimensional embedding; we then calculated the Deep Clustering loss in this embedding to discriminate different speakers. The mixing parameter $\alpha$ was set to 0.5. We used the Adam optimizer with a learning rate of . The batch size was 10. The number of training epochs was 20.
Because the output of the neural network is the probability of speech activity for each speaker, a threshold is required to obtain a decision on speech activity for each frame. We set the threshold to 0.5. Furthermore, we applied 11-frame median filtering to prevent production of unreasonably short segments.
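The thresholding and median filtering described above can be sketched with NumPy alone. The function name and the zero-padding at the sequence edges are assumptions; the filter runs along the time axis independently for each speaker.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def postprocess(posteriors, threshold=0.5, kernel=11):
    """Threshold frame-wise posteriors, then median-filter each speaker's decisions over time."""
    decisions = (np.asarray(posteriors) > threshold).astype(int)
    half = kernel // 2
    padded = np.pad(decisions, ((half, half), (0, 0)), mode="constant")  # assumed zero padding
    windows = sliding_window_view(padded, kernel, axis=0)  # shape (T, C, kernel)
    return np.median(windows, axis=-1).astype(int)

posteriors = np.zeros((40, 1))
posteriors[5:7] = 0.9     # a 2-frame blip, too short to survive the filter
posteriors[15:35] = 0.9   # a 20-frame segment
print(postprocess(posteriors).ravel())
```

With 100-ms frames, an 11-frame window means that isolated activity shorter than roughly half a second is suppressed, while longer segments keep their boundaries.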
For domain adaptation, the neural network was retrained using the CALLHOME adaptation set. We used the Adam optimizer with a learning rate of and ran five epochs. For the postprocessing, we adjusted the threshold to 0.6 so that the DER of the adaptation set had the minimum value.
V-B3 Self-attention-based EEND system
We configured a Self-attention-based EEND system (SA-EEND) as described in Sec. IV-B. Here, we used the same input features as were input to the BLSTM-EEND system. Note that the sequence length in the training stage was limited to 500 (50 seconds in audio time) because our system uses more memory than the BLSTM-based network does. Therefore, we split the input audio recordings into non-overlapping 50-second segments. In the inference stage, we used the entire sequence for each recording.
We used two encoder blocks with 256 attention units containing four heads ($P = 2$, $D = 256$, $H = 4$). Note that most of our experiments were performed without residual connections in Eqs. 22 and 25. As described later in Sec. VI-F, adding residual connections further improved performance.
We used 1024 internal units in a position-wise feed-forward layer ($d_{\mathrm{ff}} = 1024$). We used the Adam optimizer with a learning rate scheduler. The number of warm-up steps used in the learning rate scheduler was 25,000. The batch size was 64. The number of training epochs was 100. After 100 epochs, we used an averaged model obtained by averaging the model parameters of the last ten epochs. As with the BLSTM-EEND system, we applied 11-frame median filtering.
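The warm-up schedule and the checkpoint averaging can be sketched as follows. The schedule shown is the inverse-square-root warm-up commonly used with Transformer models; that this is the exact scheduler used here is an assumption, as are the function names and the representation of a checkpoint as a dictionary of arrays.

```python
import numpy as np

def warmup_lr(step, d_model=256, warmup_steps=25000):
    """Linear warm-up followed by inverse-square-root decay (assumed schedule)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dictionaries, e.g. from the last ten epochs."""
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in checkpoints[0]
    }

for step in (1000, 25000, 100000):
    print(step, warmup_lr(step))

# Toy checkpoints standing in for the parameters saved after each of the last ten epochs.
last_ten = [{"W": np.full((2, 2), float(epoch))} for epoch in range(91, 101)]
print(average_checkpoints(last_ten)["W"])
```

Under this schedule the learning rate peaks exactly at the last warm-up step, so raising the number of warm-up steps both delays and lowers the peak.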
For domain adaptation, the averaged model was retrained using the CALLHOME adaptation set. We used the Adam optimizer with a learning rate of and ran 100 epochs. After 100 epochs, we used an averaged model obtained by averaging the model parameters of the last ten epochs.
V-C Performance metric
We evaluated the systems with the diarization error rate (DER). Note that the DERs reported in many prior studies did not include miss or false alarm errors because they used oracle speech/non-speech labels. Overlapping speech segments were also excluded from those evaluations. For our DER computation, we evaluated all of the errors, including those in overlapping speech segments, because the proposed method includes both speech activity detection and overlapping speech detection functionality. As is typically done, we used a collar tolerance of 250 ms at the start and end of each segment.
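A simplified frame-level version of the DER illustrates how misses, false alarms, and speaker confusions are counted when overlaps are included. This sketch is not the standard scoring tool: it ignores the 250-ms collar, works on frame labels instead of time-stamped segments, and finds the speaker mapping by brute force over permutations. The function name is hypothetical.

```python
import itertools
import numpy as np

def frame_level_der(ref, hyp):
    """(miss + false alarm + confusion) / total reference speaker frames, no collar."""
    ref = np.asarray(ref, dtype=int)
    hyp = np.asarray(hyp, dtype=int)
    n_ref = ref.sum(axis=1)   # number of reference speakers active per frame
    n_sys = hyp.sum(axis=1)   # number of hypothesized speakers active per frame
    if n_ref.sum() == 0:
        raise ValueError("reference contains no speech")
    miss = np.maximum(n_ref - n_sys, 0).sum()
    false_alarm = np.maximum(n_sys - n_ref, 0).sum()
    # Confusion depends on the speaker mapping; take the best permutation.
    confusion = min(
        (np.minimum(n_ref, n_sys) - (ref & hyp[:, list(perm)]).sum(axis=1)).sum()
        for perm in itertools.permutations(range(ref.shape[1]))
    )
    return float(miss + false_alarm + confusion) / float(n_ref.sum())

ref = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
hyp = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(frame_level_der(ref, hyp))  # one miss and one false alarm over 4 reference frames
```

The overlap frame in the reference contributes two speaker frames to the denominator, so a system that outputs only one speaker there is charged a miss, which is how overlap handling shows up in the metric.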
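Table II: DERs (%) on various test sets.
| System | Simulated ($\beta = 2$) | Simulated ($\beta = 3$) | Simulated ($\beta = 5$) | CALLHOME | CSJ |
|---|---|---|---|---|---|
| BLSTM-EEND, trained with SimBeta2 | 12.28 | 14.36 | 19.69 | 26.03 | 39.33 |
| BLSTM-EEND, trained with Real | 36.23 | 37.78 | 40.34 | 23.07 | 25.37 |
| SA-EEND, trained with SimBeta2 | 7.91 | 8.51 | 9.51 | 13.66 | 22.31 |
| SA-EEND, trained with Real | 32.72 | 33.84 | 36.78 | 10.76 | 20.50 |
| SA-EEND, trained with SimLarge | 6.81 | 6.60 | 6.40 | 14.03 | 21.84 |
| SA-EEND, trained with Comb | 6.92 | 6.54 | 6.38 | 11.99 | 22.26 |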
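Table III: DERs (%) on the CALLHOME test set with and without domain adaptation.
| System | w/o adaptation | with adaptation |
|---|---|---|
| BLSTM-EEND, trained with SimBeta2 | 43.84 | 26.03 |
| BLSTM-EEND, trained with Real | 31.01 | 23.07 |
| SA-EEND, trained with SimBeta2 | 17.42 | 13.66 |
| SA-EEND, trained with SimLarge | 16.31 | 14.03 |
| SA-EEND, trained with Real | 12.66 | 10.76 |
| SA-EEND, trained with Comb | 14.50 | 11.99 |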
[Table IV: Detailed DERs on the CALLHOME test set. Columns: DER breakdown, SAD errors.]
VI-A Evaluation on simulated mixtures
DERs on various test sets are shown in Table II. The clustering-based systems performed poorly on heavily overlapping simulated mixtures. This result is within our expectations because the clustering-based systems did not consider speaker overlaps; there were more misses when the overlap ratio was high.
The BLSTM-EEND system trained with the simulated training set (SimBeta2) showed a significant DER reduction compared with the clustering-based systems on the simulated mixtures. Among the differing overlap ratios, it performed the best on the highest overlap ratio condition ($\beta = 2$). That is, the BLSTM-EEND system worked well when the overlap condition matched that of the training data.
The SA-EEND system trained with the simulated training set had significantly lower DERs than the BLSTM-EEND system on every test set. Like the BLSTM-EEND system, it showed the best performance on the highest overlap ratio condition ($\beta = 2$). However, the DER degradation under conditions with less overlap was smaller than that of the BLSTM-EEND system, which indicates that the self-attention blocks improved robustness to variable overlap conditions.
Training the SA-EEND model with various overlap ratio conditions (SimLarge) showed an improvement over the single overlap ratio condition (SimBeta2) on every test set. It was revealed that overfitting to a specific overlap ratio could be mitigated by this multi-condition training.
VI-B Evaluation on real test sets
In contrast to the excellent performance on the simulated mixtures, the BLSTM-EEND system had inferior DERs to those of the clustering-based systems evaluated on the real test sets. Although the BLSTM-EEND system showed performance improvements when the training data were switched from simulated to real data, its DERs were still higher than those of the clustering-based systems.
The SA-EEND system trained with the simulated training set (SimBeta2) showed remarkable improvements on the real test sets of CALLHOME and CSJ, which indicates the strong generalization capability of the self-attention blocks. For the CSJ, even without domain adaptation, the SA-EEND system performed better than the x-vector clustering-based method. Training the SA-EEND model with various overlap ratio conditions (SimLarge) yielded excellent generalizations to real test sets.
The SA-EEND system trained with the real training set (Real) performed better than the one trained with SimLarge on the real test sets. However, it had poor DERs on the simulated test sets. We believe that this result was due to the small number of mixtures and the low overlap ratio of the real training set. Finally, the SA-EEND system trained with the combined dataset (Comb) showed an excellent generalization capability, which was obtained by feeding it various overlap ratio conditions.
VI-C Effect of domain adaptation
The EEND models trained with the simulated training set were overfitted to the specific overlap ratio of the training set. We expected that the overfitting would be mitigated by using domain adaptation. DERs on the CALLHOME test set with and without domain adaptation are shown in Table III. As expected, the domain adaptation significantly reduced the DER; our system thus achieved even better results than those of the x-vector-based system.
A detailed DER comparison on the CALLHOME test set is shown in Table IV. The clustering-based systems had few SAD errors thanks to the robust SAD model trained with various noise-augmented data. However, they had numerous miss and confusion errors because they do not handle speaker overlaps. Compared with clustering-based systems, the proposed method produced significantly fewer confusion and miss errors. The domain adaptation reduced all error types except confusion errors.
VI-D Visualization of self-attention
To analyze the behavior of the self-attention mechanism in our diarization system, Fig. 5 visualizes the attention weight matrix at the second encoder block, corresponding to $\hat{A}_h$ in Eq. 19. Here, head 1 and head 2 have vertical lines at different positions. The vertical lines correspond to each speaker's activity. The attention weight matrix with these vertical lines transformed the input features into the weighted mean of the same speaker's frames. These heads actually captured the global speaker characteristics by computing the similarity between distant frames. Interestingly, heads 3 and 4 look like diagonal matrices, which result in local linear transforms. These heads are considered to act as speech/non-speech detectors. We conclude that the multi-head self-attention mechanism captures global speaker characteristics in addition to local speech activity dynamics, which leads to a reduction in DER.
VI-E Effect of varying number of heads in self-attention blocks
The analysis in Sec. VI-D indicated that the different heads represented different speakers. To verify the importance of multiple heads, we trained models with different numbers of heads. The loss curves for those models are shown in Fig. 6. The loss decreased as the number of heads increased, and this trend continued for a large number of epochs. Note that for the single-head ($H = 1$) experiment, we interrupted the training because the loss stayed at around 0.67 during the first 12 epochs.
The DERs for different numbers of heads are shown in Table V. Here, performance improved as the number of heads increased. These results suggest that the SA-EEND models were trained to separate speakers via the global speaker characteristics represented by different heads, that the required number of heads was at least the number of speakers, and that more heads boosted performance.
VI-F Effect of varying number of encoder blocks and warm-up steps
As noted in Sec. V-B3, most of our experiments were performed without residual connections in Eqs. 22 and 25. In this section, we examined deeper model configurations using more encoder blocks with residual connections. The loss curves for different numbers of encoder blocks and warm-up steps are shown in Fig. 7. The models with four encoder blocks reduced the validation loss compared with the one with two encoder blocks. Moreover, the validation loss was reduced by increasing the number of warm-up steps from 25,000 to 100,000. DERs for different numbers of encoder blocks are shown in Table VI. The results show that increasing the number of encoder blocks significantly improved performance.
The EEND system achieved a DER of 9.54%, whereas the x-vector clustering-based system had a DER of 11.53% on the CALLHOME dataset. Moreover, EEND had a DER of 20.39% on the CSJ dataset, while the x-vector clustering-based system had 22.96%. EEND had DERs from 4.56% to 3.85% on the simulated test set, while the x-vector clustering-based system had 19.78% to 28.77%.
We proposed End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. We formulated the speaker diarization problem as a multi-label classification problem and introduced a permutation-free objective function to minimize diarization errors directly. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method and correctly handled speaker overlaps. We explored the neural network architecture for the EEND method and found that the self-attention-based neural network was the key to achieving excellent performance. By visualizing the attention weights, we showed that self-attention captured the global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem. Experiments with different numbers of heads showed that this performance could be obtained by making the number of heads sufficiently larger than the number of speakers. Finally, experiments with different numbers of encoder blocks revealed that the EEND model performed better when it had more encoder blocks.
-  (2012) Speaker diarization: a review of recent research. IEEE Trans. on ASLP 20 (2), pp. 356–370. Cited by: §I.
-  (2015) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: §II-A.
-  (2018) The fifth ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines. In Proc. Interspeech, pp. 1561–1565. Cited by: §I.
-  (2018) Front-End Processing for the CHiME-5 Dinner Party Scenario. In Proc. CHiME-5, pp. 35–40. Cited by: §I.
-  (2006) Overlap in meetings: ASR effects and analysis by dialog factors, speakers, and collection site. In Proc. MLMI, pp. 212–224. Cited by: §I.
-  (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964. Cited by: §II-A.
-  (2015) Attention-based models for speech recognition. In Proc. NIPS, pp. 577–585. Cited by: §II-A.
-  (2011) Front-end factor analysis for speaker verification. IEEE Trans. on ASLP 19 (4), pp. 788–798. Cited by: §I.
-  (2018) BUT system for DIHARD speech diarization challenge 2018. In Proc. Interspeech, pp. 2798–2802. Cited by: §I, §II-A.
-  (2017) Developing on-line speaker diarization system. In Proc. Interspeech, pp. 2739–2743. Cited by: §I.
-  (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In Proc. ICASSP, pp. 5884–5888. Cited by: §II-C, §IV-B.
-  (2018) The USTC-iFlytek Systems for CHiME-5 Challenge. In Proc. CHiME-5, pp. 11–15. Cited by: §I.
-  (2019 (to appear)) End-to-end neural speaker diarization with permutation-free objectives. In Proc. Interspeech, Cited by: §I.
-  (2019 (submitted)) End-to-end neural speaker diarization with self-attention. In Proc. ASRU, Cited by: §I.
-  (2017) Speaker diarization using deep neural network embeddings. In Proc. ICASSP, pp. 4930–4934. Cited by: §I, §II-A.
-  (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5), pp. 602–610. Note: IJCNN 2005. Cited by: §I.
-  (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. ICASSP, pp. 31–35. Cited by: §III-B, §IV-A, §V-A1.
-  (2003) The ICSI meeting corpus. In Proc. ICASSP, Vol. I, pp. 364–367. Cited by: §I.
-  (2018) Self-attention mechanism based system for DCASE2018 challenge task1 and task4. In DCASE2018 Challenge, Cited by: §II-C.
-  (2019) Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches. In Proc. ICASSP, pp. 6630–6634. Cited by: §I.
-  (2018) Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays. In Proc. CHiME-5, pp. 6–10. Cited by: §I.
-  (2015) Adam: A Method for Stochastic Optimization. In Proc. ICLR, Cited by: §V-B2.
-  (2017) A study on data augmentation of reverberant speech for robust speech recognition. In Proc. ICASSP, pp. 5220–5224. Cited by: §V-A1.
-  (2017) Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. on ASLP 25 (10), pp. 1901–1913. Cited by: §III-B.
-  (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §IV-B1.
-  (2017) A structured self-attentive sentence embedding. In Proc. ICLR, Cited by: §I, §II-C, §IV-B.
-  (2018) Characterizing performance of speaker diarization systems on far-field speech using standard methods. In Proc. ICASSP, pp. 5244–5248. Cited by: §I.
-  (2003) Corpus of Spontaneous Japanese: its design and evaluation. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Cited by: §V-A2, TABLE I.
-  (2018) Links: a high-dimensional online clustering method. arXiv preprint arXiv:1801.10123. Cited by: §I.
-  (2010) LIUM_SPKDIARIZATION: an open source toolkit for diarization. In CMU SPUD Workshop, Cited by: §I.
-  (2018) Joint discriminative embedding learning, speech activity and overlap detection for the DIHARD speaker diarization challenge. In Proc. Interspeech, pp. 2818–2822. Cited by: §II-A.
-  (2019) Designing an effective metric learning pipeline for speaker diarization. In Proc. ICASSP, pp. 5806–5810. Cited by: §II-A, §II-C.
-  (2000) 2000 speaker recognition evaluation plan. Note: https://www.nist.gov/sites/default/files/documents/2017/09/26/spk-2000-plan-v1.0.htm_.pdf Cited by: §V-A2, TABLE I.
-  (2009) The 2009 (RT-09) rich transcription meeting recognition evaluation plan. Note: http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf Cited by: §V-C.
-  (2011) The Kaldi speech recognition toolkit. In Proc. ASRU, Cited by: §V-A1.
-  (2008) Interpretation of multiparty meetings the AMI and Amida projects. In 2008 Hands-Free Speech Communication and Microphone Arrays, pp. 115–118. Cited by: §I.
-  (2014) Speaker diarization with PLDA i-vector scoring and unsupervised calibration. In Proc. SLT, pp. 413–417. Cited by: §I.
-  (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proc. Interspeech, pp. 2808–2812. Cited by: §I, §II-A, §V-B1.
-  (2014) A study of the cosine distance-based mean shift for telephone speech diarization. IEEE/ACM Trans. on ASLP 22 (1), pp. 217–227. Cited by: §I.
-  (2013) Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. on ASLP 21 (10), pp. 2015–2028. Cited by: §I.
-  (2019) Speaker recognition for multi-speaker conversations using x-vectors. In Proc. ICASSP, pp. 5796–5800. Cited by: §II-A.
-  (2018) X-vectors: robust DNN embeddings for speaker recognition. In Proc. ICASSP, pp. 5329–5333. Cited by: §I.
-  (2015) MUSAN: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484. Cited by: §V-A1.
-  (2017) Char2Wav: end-to-end speech synthesis. In ICLR Workshop, Cited by: §II-A.
-  (2018) Self-attentional acoustic models. In Proc. Interspeech, pp. 3723–3727. Cited by: §II-C.
-  (2018) Speaker diarization with enhancing speech for the first DIHARD challenge. In Proc. Interspeech, pp. 2793–2797. Cited by: §I, §II-C.
-  (2014) Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112. Cited by: §II-A.
-  (2006) An overview of automatic speaker diarization systems. IEEE Trans. on ASLP 14 (5), pp. 1557–1565. Cited by: §I.
-  (2017) Attention is all you need. In Proc. NIPS, pp. 5998–6008. Cited by: §I, §II-C, §V-B3.
-  (2018) Generalized end-to-end loss for speaker verification. In Proc. ICASSP, pp. 4879–4883. Cited by: §I.
-  (2018) Speaker diarization with LSTM. In Proc. ICASSP, pp. 5239–5243. Cited by: §I.
-  (2018) Non-local neural networks. In Proc. CVPR, pp. 7794–7803. Cited by: §II-C.
-  (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech, pp. 4006–4010. Cited by: §II-A.
-  (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. Cited by: §II-A.
-  (2019) Cross-modal self-attention network for referring image segmentation. In Proc. CVPR, pp. 10502–10511. Cited by: §II-C.
-  (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, pp. 241–245. Cited by: §III-B.
-  (2019) Fully supervised speaker diarization. In Proc. ICASSP, pp. 6301–6305. Cited by: §II-B.
-  (2018) Self-attentive speaker embeddings for text-independent speaker verification. In Proc. Interspeech, pp. 3573–3577. Cited by: §II-C.