End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

02/24/2020
by Yusuke Fujita, et al. (IEEE)

The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems: (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting its speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled speaker overlaps. We explored the neural network architecture for the EEND method, and found that the self-attention-based neural network was the key to achieving excellent performance. In contrast to conditioning the network only on its previous and next hidden states, as is done using a bidirectional long short-term memory (BLSTM), self-attention is directly conditioned on all the frames. By visualizing the attention weights, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem.


I Introduction

Speaker diarization is the process of partitioning an audio recording into homogeneous segments according to speaker identity. Speaker diarization has a wide range of applications, such as generating written records of meetings and analyzing turn-taking in telephone conversations [48, 1]. It also improves automatic speech recognition performance in multi-speaker conversation scenarios in meetings (ICSI [18, 5], AMI [36, 20]) and home environments (CHiME-5 [3, 12, 4, 21, 20]).

The most common approach to speaker diarization is based on clustering of speaker embeddings [30, 40, 37, 39, 10, 15, 27, 51]. For instance, i-vectors [8, 40, 37, 27], d-vectors [50, 51], and x-vectors [42, 15] are commonly used in speaker diarization tasks. These embeddings of short segments are partitioned into speaker clusters by using clustering algorithms, such as Gaussian mixture models [30, 40], agglomerative hierarchical clustering [30, 37, 15, 27], mean shift clustering [39], k-means clustering [10, 51], Links [29, 51], and spectral clustering [51]. These clustering-based diarization methods have proven effective on various datasets (see the DIHARD Challenge 2018 activities, e.g., [38, 9, 46]).

However, such clustering-based methods have a number of problems. First, they cannot be optimized to minimize diarization errors directly, because clustering is an unsupervised process. Second, they have trouble handling speaker overlaps, since the clustering algorithms implicitly assume one speaker per segment. Furthermore, they have trouble adapting their speaker embedding models to real audio recordings with speaker overlaps, because the speaker embedding model has to be optimized on single-speaker, non-overlapping segments. These problems hinder speaker diarization when it is applied to real audio recordings, which usually contain speaker overlaps.

To solve these problems, we propose End-to-End Neural Diarization (EEND). Unlike most other methods, EEND does not rely on clustering. Instead, a neural network directly outputs the joint speech activities of all speakers for each time frame, given a multi-speaker audio recording as input. Our method can naturally handle speaker overlaps during training and inference by exploiting a multi-label classification framework.

EEND is based on our previous studies [13, 14]. In [13], we proposed a training scheme for a diarization model that uses a permutation-free objective function to directly minimize diarization errors. In [14], we extended that method by exploiting a self-attention-based neural network. Instead of a bidirectional long short-term memory (BLSTM) [16], we used a self-attention mechanism [26, 49], which resulted in a significant performance improvement over the BLSTM-based model.

In this paper, we reformulate speaker diarization as a simple multi-label classification problem, which is independent of the choice of neural network architecture: BLSTM or self-attention. Then we investigate the proposed method from various perspectives, by comparing the effect of different network architectures, visualizing latent representations, and evaluating it on multiple real datasets. The comparison of the network architectures revealed that a multi-head self-attention-based neural network is the key to achieving excellent performance. Experiments with different numbers of heads showed that excellent performance could be obtained by making the number of heads sufficiently larger than the number of speakers. Experiments with different numbers of self-attention-based encoder blocks revealed that the EEND model performed better when it had more encoder blocks. By visualizing the latent representation, we showed that self-attention could capture global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem. The evaluation on the real datasets showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled the speaker overlaps.

II Related Work

II-A Clustering-based methods

Fig. 1: System diagrams for speaker diarization: (a) clustering-based method; (b) EEND method. While the clustering-based method requires three different models, the EEND method requires only one.

Clustering-based methods are commonly used for speaker diarization. We used i-vector/x-vector clustering-based systems [38, 9, 41] as the baselines in our experiments. A diagram of a typical clustering-based system is depicted in Fig. 1(a).

To build the system, one has to prepare three independent models: (i) a speech activity detection (SAD) model for discriminating speech and non-speech, (ii) a speaker embedding extraction model for speaker identification, and (iii) a scoring model including the same/different speaker covariance matrices. None of these models can be trained to minimize the diarization errors directly. Optionally, a resegmentation process requires another model to refine speaker change points to produce the final diarization results.

Joint modeling methods have been studied in an effort to alleviate the complex preparation process and take into account the dependencies between these models. They include, for example, joint modeling of speaker embedding extraction and scoring [15, 32] and joint modeling of SAD and speaker embedding [31]. However, the clustering process has remained unchanged because it is an unsupervised process.

In contrast to these methods, the EEND method uses only one neural network model, as depicted in Fig. 1(b). This method does not rely on clustering, and the model can be directly optimized with the reference diarization results of the training data.

This neural-network-based end-to-end approach, in which only one neural network model directly computes the final outputs, has been successfully applied to a variety of tasks, including neural machine translation [2, 47], automatic speech recognition [7, 6, 54], and text-to-speech [53, 44]. The proposed method is also categorized as such an approach.

II-B Clustering-free methods

The clustering process impairs model optimization aimed at minimizing diarization errors. To alleviate this problem, Zhang et al. proposed a clustering-free diarization method [57]. This method is the first successful approach that does not cluster speaker embeddings and that is optimized with a diarization error minimization objective. The method formulates the speaker diarization problem on the basis of a factored probabilistic model, which consists of modules for determining speaker changes, speaker assignments, and feature generation. These modules are jointly trained using input features and corresponding speaker labels. However, the SAD model and the speaker embedding (d-vector) extraction model have to be trained separately in their method. Moreover, their speaker-change model assumes one speaker per segment, which hinders its application to speaker-overlapping speech.

In contrast to their method, the EEND method uses an end-to-end neural network that accepts audio features as input and outputs the joint speech activities of multiple speakers. The network is optimized using the entire recording, including non-speech and speaker overlaps, with a diarization-error-oriented objective.

II-C Self-attention mechanism

The self-attention mechanism was originally proposed for extracting sentence embeddings for text processing [26]. Recently, the self-attention mechanism has shown superior performance in a variety of tasks, including machine translation [49], video classification [52], and image segmentation [55]. For audio processing, a self-attention mechanism has been incorporated in acoustic modeling for ASR [45, 11], sound event detection [19], and speaker recognition [58]. For speaker diarization, the self-attention mechanism has been applied to the speaker embedding extraction model [46] and the scoring model [32] of clustering-based methods. This study describes a self-attention mechanism for clustering-free speaker diarization.

III End-to-End Neural Diarization (EEND)

In this section, we describe a novel approach to the speaker diarization problem exploiting a multi-label classification framework with a permutation-free training scheme. We refer to the proposed method as EEND.

III-A Speaker diarization as multi-label classification

The speaker diarization task can be formulated as a probabilistic multi-label classification problem, as follows.

Given an observation sequence of length $T$, $X = (\mathbf{x}_t \mid t = 1, \dots, T)$, from an audio signal, the speaker diarization problem is one of estimating the corresponding speaker label sequence $Y = (\mathbf{y}_t \mid t = 1, \dots, T)$. Here, $\mathbf{x}_t \in \mathbb{R}^F$ is an $F$-dimensional observation feature vector at time index $t$. Speaker label $\mathbf{y}_t = [y_{t,c} \in \{0,1\} \mid c = 1, \dots, C]$ denotes a joint activity for multiple ($C$) speakers at time index $t$. For example, $y_{t,c} = y_{t,c'} = 1 \ (c \neq c')$ represents an overlap situation in which speakers $c$ and $c'$ are both present at time index $t$. Thus, determining $Y$ is a sufficient condition to determine the speaker diarization information.

The most probable speaker label sequence $\hat{Y}$ is selected from among all possible speaker label sequences $\mathcal{Y}$, as follows:

  $\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y \mid X).$   (1)

$P(Y \mid X)$ can be factorized using the conditional independence assumption as follows:

  $P(Y \mid X) \approx \prod_t P(\mathbf{y}_t \mid X)$   (2)
  $\approx \prod_t \prod_c P(y_{t,c} \mid X).$   (3)

Here, we assume that the frame-wise posterior is conditioned on all inputs, and each speaker is present independently. The frame-wise posteriors can be estimated using a neural-network-based model, as follows:

  $\mathbf{z}_t = [P(y_{t,1} = 1 \mid X), \dots, P(y_{t,C} = 1 \mid X)] = \mathrm{NN}_t(\mathbf{x}_1, \dots, \mathbf{x}_T),$   (4)

where $\mathrm{NN}_t(\cdot)$ is a neural network which accepts a sequence of input features and outputs $\mathbf{z}_t$, a $C$-dimensional vector of the frame-wise posteriors at time index $t$.
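To make the multi-label formulation concrete, the following minimal NumPy sketch (ours, not from the paper) shows how a frame-wise posterior vector for C = 2 speakers encodes overlaps and silence, and how the joint label probability factorizes under the conditional-independence assumption of Eqs. 2-3:

```python
import numpy as np

# Hypothetical frame-wise posteriors z_t for C = 2 speakers over T = 5 frames,
# as would be produced by a sigmoid output layer (Eq. 4).
Z = np.array([
    [0.90, 0.10],   # only speaker 1 active
    [0.80, 0.70],   # overlap: both speakers active
    [0.10, 0.90],   # only speaker 2 active
    [0.05, 0.10],   # non-speech
    [0.60, 0.40],
])

def joint_label_prob(z_t, y_t):
    """P(y_t | X) under the conditional-independence assumption (Eqs. 2-3)."""
    return np.prod(np.where(y_t == 1, z_t, 1.0 - z_t))

print(joint_label_prob(Z[1], np.array([1, 1])))  # 0.8 * 0.7 = 0.56 for the overlap label

# Element-wise thresholding recovers the most probable joint label per frame,
# so overlapping speech and silence are handled without any clustering step.
print((Z > 0.5).astype(int))
```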

III-B Permutation-free training

The difficulty of training the model described above is that the model must deal with speaker permutations: changing the order of speakers within a correct label sequence is also regarded as correct. An example of permutations in a two-speaker case is shown in Fig. 2. In this paper, we call this problem "label ambiguity." This label ambiguity obstructs the training of the neural network when we use a standard binary cross-entropy loss function.

Fig. 2: Two-speaker EEND model trained with permutation-free loss. The binary cross entropy (BCE) loss of the frame-wise posteriors is computed with two permutations of the reference labels.

To cope with the label ambiguity problem, we employ the permutation-free training scheme, which considers all the permutations of the reference speaker labels. The permutation-free training scheme has been used in research on source separation [17, 56, 24]. Here, we apply a permutation-free loss function to a temporal sequence of speaker labels. The neural network is trained to minimize the permutation-free loss between the output $\mathbf{z}_t$ predicted using Eq. 4 and the reference speaker label $\boldsymbol{\ell}_t$, as follows:

  $J^{\mathrm{PIT}} = \frac{1}{TC} \min_{\phi \in \mathrm{perm}(C)} \sum_t \mathrm{BCE}(\boldsymbol{\ell}_t^{\phi}, \mathbf{z}_t),$   (5)

where $\mathrm{perm}(C)$ is the set of all the possible permutations of $(1, \dots, C)$, $\boldsymbol{\ell}_t^{\phi}$ is the $\phi$-th permutation of the reference speaker label, and $\mathrm{BCE}(\cdot, \cdot)$ is the binary cross entropy function between the label and the output.
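The permutation-free loss of Eq. 5 can be sketched in a few lines of NumPy; this is an illustrative reference implementation under our reading of the notation (labels and posteriors as T-by-C matrices), not the authors' code:

```python
import numpy as np
from itertools import permutations

def bce(label, prob, eps=1e-7):
    """Element-wise binary cross entropy."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(prob) + (1 - label) * np.log(1 - prob))

def pit_bce_loss(Z, L):
    """Permutation-free BCE loss of Eq. 5 for posteriors Z and reference labels L, both (T, C)."""
    T, C = Z.shape
    # Evaluate the loss for every speaker permutation and keep the smallest one.
    return min(bce(L[:, list(perm)], Z).sum() / (T * C) for perm in permutations(range(C)))

# Toy case: the network outputs the speakers in the opposite order to the reference.
Z = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
L = np.array([[0, 1], [0, 1], [1, 0]])
print(pit_bce_loss(Z, L))  # small, because the swapped permutation is selected
```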

IV Neural network architectures for EEND

In this section, we explore two different architectures of neural networks for the EEND method.

IV-A BLSTM-based neural network with Deep Clustering loss

Fig. 3: BLSTM-based EEND model with Deep Clustering (DC) loss.

According to Eq. 4, the neural-network-based function accepts a temporal sequence of feature vectors and outputs a vector for each time frame. Thus, this function can be modeled with a bi-directional long short-term memory (BLSTM), as depicted in Fig. 3. The input features are transformed as follows:

  $\mathbf{h}_t^{(1)} = \mathrm{BLSTM}_t^{(1)}(\mathbf{x}_1, \dots, \mathbf{x}_T),$   (6)
  $\mathbf{h}_t^{(p)} = \mathrm{BLSTM}_t^{(p)}(\mathbf{h}_1^{(p-1)}, \dots, \mathbf{h}_T^{(p-1)}) \quad (2 \le p \le P),$   (7)
  $\mathbf{z}_t = \sigma(W \mathbf{h}_t^{(P)} + \mathbf{b}),$   (8)

where $\mathrm{BLSTM}_t^{(p)}(\cdot)$ is the $p$-th BLSTM layer, which accepts an input sequence and outputs the hidden activations $\mathbf{h}_t^{(p)}$ at time index $t$ (a concatenated vector of the $H$-dimensional forward and backward LSTM outputs), $\sigma(\cdot)$ is the element-wise sigmoid function, and $W$ and $\mathbf{b}$ are a linear projection matrix and bias, respectively. We use $P$-layer stacked BLSTMs.

Assuming that the neural network extracts speaker embeddings in lower layers and then performs temporal segmentation using higher layers, the middle layer activations can be regarded as the speaker embeddings. Therefore, we place a speaker embedding training criterion on the middle layer activations.

Here, the $q$-th layer activations $\mathbf{h}_t^{(q)}$ obtained from Eq. 7 are transformed into a normalized $D$-dimensional embedding $\mathbf{v}_t$ as follows:

  $\mathbf{v}_t = \mathrm{Normalize}(\mathrm{Tanh}(W^{(q)} \mathbf{h}_t^{(q)} + \mathbf{b}^{(q)})),$   (9)

where $W^{(q)}$ and $\mathbf{b}^{(q)}$ are a linear projection matrix and a bias, respectively, $\mathrm{Tanh}(\cdot)$ is the element-wise hyperbolic tangent function, and $\mathrm{Normalize}(\cdot)$ is the L2 normalization function. We apply the Deep Clustering (DC) loss function [17] so that the embeddings are partitioned into speaker-dependent clusters as well as overlapping and non-speech clusters. For example, in a two-speaker case, we generate four clusters (Non-speech, Speaker 1, Speaker 2, and Overlapping), as shown in Fig. 3.

The DC loss function is expressed as follows:

  $J^{\mathrm{DC}} = \left\| V V^{\top} - L L^{\top} \right\|_F^2,$   (10)

where $V = [\mathbf{v}_1, \dots, \mathbf{v}_T]^{\top}$, and $L \in \{0,1\}^{T \times 2^C}$ is a matrix in which each row represents a one-hot vector converted from $\mathbf{y}_t$, where those elements are in the power set of speakers. $\|\cdot\|_F$ is the Frobenius norm. The loss function encourages the two embeddings at different time indices to be close together if they are in the same cluster and far away if they are in different clusters.

Next, we use multi-objective training introducing a mixing parameter $\alpha \in [0, 1]$:

  $J^{\mathrm{MULTI}} = (1 - \alpha)\, J^{\mathrm{PIT}} + \alpha\, J^{\mathrm{DC}}.$   (11)
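As an illustration, the following sketch computes the DC loss of Eq. 10 and the multi-objective loss of Eq. 11; the cluster-id encoding over the power set of speakers and the weighting convention (which term α multiplies) follow our reconstruction of the equations above and are assumptions rather than the authors' code:

```python
import numpy as np

def dc_loss(V, cluster_ids):
    """Deep Clustering loss of Eq. 10.
    V: (T, D) L2-normalized embeddings.
    cluster_ids: (T,) integers over the power set of speakers,
                 e.g. 0 = non-speech, 1 = speaker 1, 2 = speaker 2, 3 = overlap for C = 2."""
    n_clusters = int(cluster_ids.max()) + 1
    L = np.eye(n_clusters)[cluster_ids]          # (T, 2^C) one-hot cluster assignments
    diff = V @ V.T - L @ L.T
    return float(np.sum(diff ** 2))               # squared Frobenius norm

def multi_objective_loss(j_pit, j_dc, alpha=0.5):
    """Multi-objective loss of Eq. 11 with mixing parameter alpha (0.5 in Sec. V-B2)."""
    return (1.0 - alpha) * j_pit + alpha * j_dc
```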

IV-B Self-attention-based neural network

Fig. 4: Self-attention-based EEND model.

By using a BLSTM, each output frame is conditioned only on its previous hidden state, its subsequent hidden state, and the current input feature. In contrast, by using a self-attention mechanism [26], each output frame is directly conditioned on all input frames by computing the pairwise similarity between all frame pairs. Here, we use a self-attention-based neural network instead of a BLSTM, as depicted in Fig. 4. The input features are transformed as follows:

  $\mathbf{e}_t^{(0)} = W_0 \mathbf{x}_t + \mathbf{b}_0,$   (12)
  $\mathbf{e}_t^{(p)} = \mathrm{Encoder}_t^{(p)}(\mathbf{e}_1^{(p-1)}, \dots, \mathbf{e}_T^{(p-1)}) \quad (1 \le p \le P).$   (13)

Here, $W_0 \in \mathbb{R}^{D \times F}$ and $\mathbf{b}_0 \in \mathbb{R}^{D}$ project an input feature into a $D$-dimensional vector. $\mathrm{Encoder}_t^{(p)}(\cdot)$ is the $p$-th encoder block, which accepts an input sequence of $D$-dimensional vectors and outputs a $D$-dimensional vector $\mathbf{e}_t^{(p)}$ at time index $t$. We use $P$ encoder blocks followed by the output layer for the frame-wise posteriors.

The detailed architecture of the encoder block is depicted in Fig. 4. This configuration of the encoder block is almost the same as the one in the Speech-Transformer introduced in [11], but without positional encoding. The encoder block has two sub-layers. The first is a multi-head self-attention layer, and the second is a position-wise feed-forward layer.

IV-B1 Multi-head self-attention layer

The multi-head self-attention layer transforms a sequence of input vectors as follows. The sequence of vectors $(\mathbf{e}_t^{(p-1)})_{t=1}^{T}$ is converted into a matrix $E^{(p-1)} \in \mathbb{R}^{T \times D}$; that is followed by layer normalization [25]:

  $\bar{E}^{(p-1)} = \mathrm{LayerNorm}(E^{(p-1)}).$   (14)

Then, query, key and value vectors are computed for each head $h \ (1 \le h \le H)$ by using linear transformations:

  $Q_h^{(p)} = \bar{E}^{(p-1)} W_h^{Q} + \mathbf{1} (\mathbf{b}_h^{Q})^{\top},$   (15)
  $K_h^{(p)} = \bar{E}^{(p-1)} W_h^{K} + \mathbf{1} (\mathbf{b}_h^{K})^{\top},$   (16)
  $V_h^{(p)} = \bar{E}^{(p-1)} W_h^{V} + \mathbf{1} (\mathbf{b}_h^{V})^{\top},$   (17)

where $d = D / H$ is the dimension of each head, $H$ is the number of heads, $W_h^{Q}, W_h^{K}, W_h^{V} \in \mathbb{R}^{D \times d}$ are the query, key, and value projection matrices, respectively, $\mathbf{b}_h^{Q}, \mathbf{b}_h^{K}, \mathbf{b}_h^{V} \in \mathbb{R}^{d}$ are bias vectors, and $\mathbf{1} \in \mathbb{R}^{T}$ is a $T$-dimensional all-one vector. A pairwise similarity matrix is computed using the dot products of the query vectors and key vectors:

  $Q_h^{(p)} (K_h^{(p)})^{\top} \in \mathbb{R}^{T \times T}.$   (18)

The pairwise similarity matrix is scaled by $1/\sqrt{d}$, and a softmax function is applied to form the attention weight matrix $A_h^{(p)}$:

  $A_h^{(p)} = \mathrm{Softmax}\left(\frac{Q_h^{(p)} (K_h^{(p)})^{\top}}{\sqrt{d}}\right).$   (19)

Then, using the attention weight matrix, the context vectors $C_h^{(p)}$ are computed as a weighted sum of the value vectors $V_h^{(p)}$:

  $C_h^{(p)} = A_h^{(p)} V_h^{(p)}.$   (20)

Finally, the context vectors for all heads are concatenated and projected using an output projection matrix $W^{O} \in \mathbb{R}^{D \times D}$ and a bias $\mathbf{b}^{O} \in \mathbb{R}^{D}$:

  $O^{(p)} = [C_1^{(p)} \cdots C_H^{(p)}]\, W^{O} + \mathbf{1} (\mathbf{b}^{O})^{\top}.$   (21)

Following the self-attention layer, a residual connection and layer normalization are applied:

  $\bar{E}_{\mathrm{SA}}^{(p)} = \mathrm{LayerNorm}(\bar{E}^{(p-1)} + O^{(p)}).$   (22)
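The following NumPy sketch mirrors Eqs. 14-22 for the self-attention layer of one encoder block; the parameter container and the exact placement of the final layer normalization in Eq. 22 follow our reconstruction above, so treat it as illustrative rather than the authors' implementation:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    return (X - X.mean(-1, keepdims=True)) / np.sqrt(X.var(-1, keepdims=True) + eps)

def softmax(X):
    e = np.exp(X - X.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_self_attention(E, params, H):
    """Eqs. 14-22 for one encoder block; E is (T, D), params holds the projections."""
    T, D = E.shape
    d = D // H                                          # per-head dimension
    E_bar = layer_norm(E)                               # Eq. 14
    heads = []
    for h in range(H):
        Q = E_bar @ params["WQ"][h] + params["bQ"][h]   # Eq. 15
        K = E_bar @ params["WK"][h] + params["bK"][h]   # Eq. 16
        V = E_bar @ params["WV"][h] + params["bV"][h]   # Eq. 17
        A = softmax(Q @ K.T / np.sqrt(d))               # Eqs. 18-19: (T, T) attention weights
        heads.append(A @ V)                              # Eq. 20: context vectors
    O = np.concatenate(heads, axis=-1) @ params["WO"] + params["bO"]  # Eq. 21
    return layer_norm(E_bar + O)                          # Eq. 22: residual + layer norm

# Random initialization for a tiny configuration (T=10, D=8, H=2).
rng = np.random.default_rng(0)
T, D, H = 10, 8, 2
params = {
    "WQ": rng.normal(size=(H, D, D // H)), "bQ": rng.normal(size=(H, D // H)),
    "WK": rng.normal(size=(H, D, D // H)), "bK": rng.normal(size=(H, D // H)),
    "WV": rng.normal(size=(H, D, D // H)), "bV": rng.normal(size=(H, D // H)),
    "WO": rng.normal(size=(D, D)), "bO": rng.normal(size=D),
}
print(multi_head_self_attention(rng.normal(size=(T, D)), params, H).shape)  # (10, 8)
```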

IV-B2 Position-wise feed-forward layer

The position-wise feed-forward layer transforms $\bar{E}_{\mathrm{SA}}^{(p)}$ as follows:

  $F^{(p)} = \mathrm{ReLU}(\bar{E}_{\mathrm{SA}}^{(p)} W_1 + \mathbf{1} \mathbf{b}_1^{\top}),$   (23)
  $E_{\mathrm{FF}}^{(p)} = F^{(p)} W_2 + \mathbf{1} \mathbf{b}_2^{\top},$   (24)

where $W_1 \in \mathbb{R}^{D \times d_{\mathrm{ff}}}$ and $\mathbf{b}_1 \in \mathbb{R}^{d_{\mathrm{ff}}}$ are the first linear projection matrix and bias, respectively, and $\mathrm{ReLU}(\cdot)$ is the rectified linear unit activation function. $d_{\mathrm{ff}}$ is the number of internal units in this layer. $W_2 \in \mathbb{R}^{d_{\mathrm{ff}} \times D}$ and $\mathbf{b}_2 \in \mathbb{R}^{D}$ are the second linear projection matrix and bias, respectively.

Finally, the output of the encoder block for each time frame is computed by applying a residual connection as follows:

  $E^{(p)} = \bar{E}_{\mathrm{SA}}^{(p)} + E_{\mathrm{FF}}^{(p)}, \qquad E^{(p)} = [\mathbf{e}_1^{(p)} \cdots \mathbf{e}_T^{(p)}]^{\top}.$   (25)

IV-B3 Output layer for frame-wise posteriors

The frame-wise posteriors $\mathbf{z}_t$ are calculated from $\mathbf{e}_t^{(P)}$ (in Eq. 13) by using layer normalization and a fully-connected layer as follows:

  $\bar{\mathbf{e}}_t^{(P)} = \mathrm{LayerNorm}(\mathbf{e}_t^{(P)}),$   (26)
  $\mathbf{z}_t = \sigma(W_3 \bar{\mathbf{e}}_t^{(P)} + \mathbf{b}_3),$   (27)

where $W_3 \in \mathbb{R}^{C \times D}$ and $\mathbf{b}_3 \in \mathbb{R}^{C}$ are the linear projection matrix and bias, respectively, and $\sigma(\cdot)$ is the element-wise sigmoid function.
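Continuing the sketch above, the position-wise feed-forward layer (Eqs. 23-25) and the output layer (Eqs. 26-27) can be written as follows; shapes and parameter names are our own choices for illustration:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    return (X - X.mean(-1, keepdims=True)) / np.sqrt(X.var(-1, keepdims=True) + eps)

def position_wise_ff(E_sa, W1, b1, W2, b2):
    """Eqs. 23-25: ReLU feed-forward layer followed by a residual connection."""
    F = np.maximum(0.0, E_sa @ W1 + b1)   # Eq. 23
    E_ff = F @ W2 + b2                     # Eq. 24
    return E_sa + E_ff                     # Eq. 25: output E^(p) of the encoder block

def output_layer(E_last, W3, b3):
    """Eqs. 26-27: layer normalization, linear projection, and element-wise sigmoid."""
    E_bar = layer_norm(E_last)             # Eq. 26
    return 1.0 / (1.0 + np.exp(-(E_bar @ W3 + b3)))   # Eq. 27: (T, C) posteriors

# Tiny end-to-end check with random weights (T=10, D=8, d_ff=16, C=2).
rng = np.random.default_rng(0)
E_sa = rng.normal(size=(10, 8))
E_out = position_wise_ff(E_sa, rng.normal(size=(8, 16)), rng.normal(size=16),
                         rng.normal(size=(16, 8)), rng.normal(size=8))
print(output_layer(E_out, rng.normal(size=(8, 2)), rng.normal(size=2)).shape)  # (10, 2)
```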

V Experimental setup

V-A Data

                                          Num. of    Avg. dur.  Overlap
                                          mixtures   (sec)      ratio (%)
Training sets
  SimBeta2   Simulated (β = 2)            100,000    87.6       34.4
  Real       SWBD+SRE                      26,172    304.7       3.7
  SimLarge   Simulated (β = 2, 3, 5, 7)   400,000    126.4      23.4
  Comb       Real + SimLarge              426,172    137.3      20.5
Test sets
  Simulated (β = 2)                            500    87.3      34.4
  Simulated (β = 3)                            500   103.8      27.2
  Simulated (β = 5)                            500   137.1      19.5
  CALLHOME [33]                                148    72.1      13.0
  CSJ [28]                                      54   766.3      20.1
TABLE I: Statistics of training and test sets.

To verify the effectiveness of the EEND method for various overlap situations, we prepared four training sets and five test sets, including simulated and real datasets. The statistics of the training and test sets are listed in Table I. The overlap ratio is computed as the ratio of the audio time during which two or more speakers are active to the audio time during which one or more speakers are active.
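For reference, the overlap ratio defined above can be computed from frame-level speaker activity as in this small sketch (ours, for illustration only):

```python
import numpy as np

def overlap_ratio(activity):
    """activity: (T, C) binary frame-level speech activity for C speakers.
    Returns overlapped-speech time divided by total-speech time, as defined above."""
    n_active = activity.sum(axis=1)
    speech = (n_active >= 1).sum()          # frames with one or more active speakers
    overlap = (n_active >= 2).sum()         # frames with two or more active speakers
    return overlap / max(speech, 1)

# Example: 6 frames, 2 speakers; 4 speech frames, 1 of them overlapped -> 25%.
act = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 0]])
print(overlap_ratio(act))   # 0.25
```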

Note that the training data for the EEND method are different from those for the i-vector/x-vector clustering-based method. Whereas the clustering-based methods use single-speaker segments for training their speaker embedding extraction models, the EEND method uses audio mixtures of multiple speakers. Such mixtures can be simulated infinitely with a combination of single-speaker segments. Moreover, the EEND model can be trained with not only simulated mixtures but also real audio mixtures with speaker overlaps.

V-A1 Simulated datasets

Each mixture was simulated by Algorithm 1. Unlike the mixture simulations used in source separation studies [17], we consider a diarization-style mixture: each speech mixture should have dozens of utterances per speaker, with reasonable silence intervals between utterances. The silence intervals are controlled by the average interval β. Larger values of β generate speech with less overlap.

Input:  S, N, I, R           // Sets of speakers, noises, RIRs and SNRs
        {U_s}_{s∈S}          // Set of utterance lists
        N_spk                // #speakers per mixture
        N_umax, N_umin       // Max. and min. #utterances per speaker
        β                    // Average interval
Output: y                    // Mixture

 1: Sample a set S' of N_spk speakers from S
 2: X ← ∅                                   // Set of speakers' signals
 3: forall s ∈ S' do
 4:     x_s ← ∅                             // Concatenated signal
 5:     Sample an RIR i from I              // RIR
 6:     Sample N_u from [N_umin, N_umax]
 7:     for u = 1 to N_u do
 8:         Sample a silence interval δ with average length β    // Interval
 9:         x_s ← x_s ⊕ δ ⊕ (U_s[u] ∗ i)    // Append the interval, then the RIR-convolved utterance
10:     X ← X ∪ {x_s}
11: y ← Σ_{x∈X} x, zero-padded to the longest signal             // Mix the speakers' signals
12: Sample a background noise n from N      // Background noise
13: Sample an SNR r from R                  // SNR
14: Determine a mixing scale p from r and y
15: n' ← repeat n until the length of y is reached
16: y ← y + p · n'
Algorithm 1: Mixture simulation.
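A rough Python rendering of Algorithm 1 is given below. The data structures (a dict `utterances` mapping speaker id to a list of waveform arrays, lists `rirs` and `noises` of arrays, SNR values in dB), the exponential interval distribution, and the requirement that each speaker has at least N_umax utterances are our assumptions for illustration; the actual simulation code is not part of this paper:

```python
import numpy as np

def simulate_mixture(utterances, rirs, noises, snrs, n_spk=2,
                     n_utt_range=(10, 20), beta=2.0, sr=8000, rng=None):
    """Rough sketch of Algorithm 1: concatenate reverberated utterances per speaker
    with silence gaps of mean `beta` seconds, mix the speakers, and add scaled noise."""
    rng = rng or np.random.default_rng()
    speakers = rng.choice(list(utterances.keys()), size=n_spk, replace=False)
    signals = []
    for spk in speakers:
        rir = rirs[rng.integers(len(rirs))]                  # one RIR per speaker
        n_utt = rng.integers(n_utt_range[0], n_utt_range[1] + 1)
        x = np.zeros(0)
        for j in rng.choice(len(utterances[spk]), size=n_utt, replace=False):
            gap = np.zeros(int(rng.exponential(beta) * sr))  # silence interval
            x = np.concatenate([x, gap, np.convolve(utterances[spk][j], rir)])
        signals.append(x)
    length = max(len(x) for x in signals)
    y = sum(np.pad(x, (0, length - len(x))) for x in signals)  # mix all speakers
    noise = noises[rng.integers(len(noises))]
    noise = np.tile(noise, int(np.ceil(length / len(noise))))[:length]  # repeat to length
    snr = rng.choice(snrs)
    scale = np.sqrt((y ** 2).mean() / ((noise ** 2).mean() * 10 ** (snr / 10)))
    return y + scale * noise
```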

The set of utterances used in the simulation was composed of the Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part 2), and NIST Speaker Recognition Evaluation (2004, 2005, 2006, 2008) datasets. All recordings are telephone speech sampled at 8 kHz. There are 6,381 speakers in total. We split them into 5,743 speakers for the training set and 638 speakers for the test set. Note that the set of utterances for the training set is identical to that of the Kaldi CALLHOME diarization v2 recipe [35] (https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization), thereby enabling a fair comparison with the x-vector clustering-based method.

Since there are no time annotations in these corpora, we extracted utterances using speech activity detection (SAD) based on time-delay neural networks and statistics pooling (the SAD model: http://kaldi-asr.org/models/m4).

The set of background noises was from the MUSAN corpus [43]. We used 37 recordings that are annotated as “background” noises. The set of 10,000 room impulse responses (RIRs) was from the Simulated Room Impulse Response Database used in [23]. The SNR values were sampled from 10, 15, and 20 dB. These sets of non-speech corpora were also used for training the x-vector and SAD models in the x-vector clustering-based method.

We generated two-speaker mixtures with 10-20 utterances for each speaker (N_spk = 2, N_umin = 10, N_umax = 20). For the simulated training set, 100,000 mixtures were generated with β = 2 (SimBeta2). In addition, four sets of 100,000 mixtures with different values of β (2, 3, 5, and 7) were combined to form 400,000 mixtures (SimLarge). For the simulated test set, 500 mixtures were generated with β = 2, 3, and 5. The overlap ratios of the simulated mixtures ranged from 19.5% to 34.4%.

V-A2 Real datasets

We used real telephone speech recordings as the real training set (Real). A set of 26,172 two-speaker recordings were extracted from the recordings of the Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part 2), and NIST Speaker Recognition Evaluation datasets. The overlap ratio of the training data was 3.7%, far less than that of the simulated mixtures.

We evaluated the proposed method on real telephone conversations in the CALLHOME dataset [33]. We randomly split the two-speaker recordings from the CALLHOME dataset into two subsets: an adaptation set of 155 recordings and a test set of 148 recordings. The average overlap ratio of the test set was 13.0%.

In addition, we conducted an evaluation on the dialogue part of the Corpus of Spontaneous Japanese (CSJ) [28]. The CSJ contains 54 two-speaker dialogue recordings (we excluded four out of the 58 recordings that contain speakers in the official speech recognition evaluation sets). They were recorded using headset microphones in separate soundproof rooms. The average overlap ratio of the CSJ test set was 20.1%, larger than that of the CALLHOME test set.

V-A3 Combined datasets

To generalize the model to various environments, we conducted experiments using both the simulated training set (SimLarge) and the real training set (Real). We refer to this dataset as the combined training set (Comb).

V-B Model configuration

V-B1 Clustering-based systems

We compared the proposed method with two conventional clustering-based systems [38]: an i-vector system and an x-vector system created using the Kaldi CALLHOME diarization v1 and v2 recipes, respectively.

These recipes use agglomerative hierarchical clustering (AHC) with the probabilistic linear discriminant analysis (PLDA) scoring scheme. The number of clusters was fixed to 2. Though the original recipes use oracle speech/non-speech marks, we used the SAD model with the configuration described in Sec. V-A.

V-B2 BLSTM-based EEND system

We configured the BLSTM-based EEND system (BLSTM-EEND) described in Sec. IV-A. The input features were 23-dimensional log-Mel-filterbanks with a 25-ms frame length and 10-ms frame shift. Each feature was concatenated with those from the previous seven frames and subsequent seven frames. To deal with long audio sequences in our neural networks, we subsampled the concatenated features by a factor of ten. Consequently, a 345-dimensional input feature was fed into the neural network every 100 ms.
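The feature pipeline described above (23-dimensional log-Mel filterbanks with a 25-ms window and 10-ms shift, spliced with +/-7 context frames and subsampled by a factor of ten) might be sketched as follows; librosa is used here only as one convenient log-Mel frontend, and the exact FFT settings are our assumptions:

```python
import numpy as np
import librosa  # assumed here; any 23-dim log-Mel frontend would do

def eend_input_features(wav, sr=8000, context=7, subsample=10):
    """23-dim log-Mel filterbanks (25-ms window, 10-ms shift), spliced with +/-7
    context frames and subsampled by 10: one 345-dimensional vector per 100 ms."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=256, win_length=200, hop_length=80, n_mels=23)
    logmel = np.log(mel + 1e-10).T                          # (T, 23)
    padded = np.pad(logmel, ((context, context), (0, 0)), mode="edge")
    spliced = np.hstack([padded[i:i + len(logmel)] for i in range(2 * context + 1)])
    return spliced[::subsample]                              # (about T / 10, 345)

# feats = eend_input_features(np.random.randn(8000 * 10))  # ~10 s of audio -> ~(100, 345)
```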

We used a five-layer BLSTM with 256 hidden units in each layer. The second layer of the BLSTM outputs was used to form a 256-dimensional embedding; we then calculated the Deep Clustering loss on this embedding to discriminate different speakers. The mixing parameter α was set to 0.5. We used the Adam [22] optimizer. The batch size was 10. The number of training epochs was 20.

Because the output of the neural network is the probability of speech activity for each speaker, a threshold is required to obtain a decision on speech activity for each frame. We set the threshold to 0.5. Furthermore, we applied 11-frame median filtering to prevent production of unreasonably short segments.
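The post-processing (0.5 threshold followed by 11-frame median filtering) can be sketched as below; the segment-conversion details are our own, with a 100-ms frame shift assumed from the feature configuration above:

```python
import numpy as np
from scipy.signal import medfilt

def posteriors_to_segments(Z, threshold=0.5, kernel=11, frame_shift=0.1):
    """Threshold the frame-wise posteriors, smooth each speaker's decisions with an
    11-frame median filter, and convert runs of activity to (speaker, start, end) in seconds."""
    segments = []
    for c in range(Z.shape[1]):
        active = medfilt((Z[:, c] > threshold).astype(float), kernel) > 0.5
        edges = np.diff(np.concatenate(([0], active.astype(int), [0])))
        starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
        segments += [(c, s * frame_shift, e * frame_shift) for s, e in zip(starts, ends)]
    return segments
```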

For domain adaptation, the neural network was retrained using the CALLHOME adaptation set. We used the Adam optimizer and ran five epochs. For the postprocessing, we adjusted the threshold to 0.6 so that the DER on the adaptation set was minimized.

V-B3 Self-attention-based EEND system

We configured a Self-attention-based EEND system (SA-EEND) as described in Sec. IV-B. Here, we used the same input features as were input to the BLSTM-EEND system. Note that the sequence length in the training stage was limited to 500 (50 seconds in audio time) because our system uses more memory than the BLSTM-based network does. Therefore, we split the input audio recordings into non-overlapping 50-second segments. In the inference stage, we used the entire sequence for each recording.

We used two encoder blocks with 256 attention units containing four heads (P = 2, D = 256, H = 4). Note that most of our experiments were performed without the residual connections in Eqs. 22 and 25. As described later in Sec. VI-F, adding residual connections further improved performance.

We used 1024 internal units in the position-wise feed-forward layer (d_ff = 1024). We used the Adam optimizer with the learning rate scheduler described in [49]. The number of warm-up steps used in the learning rate scheduler was 25,000. The batch size was 64. The number of training epochs was 100. After 100 epochs, we used an averaged model obtained by averaging the model parameters of the last ten epochs. As with the BLSTM-EEND system, we applied 11-frame median filtering.
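The learning rate scheduler of [49] follows the warm-up/inverse-square-root schedule below; the base scale of d_model^-0.5 and the absence of any extra multiplier are carried over from [49] as assumptions rather than details stated in this paper:

```python
def transformer_lr(step, d_model=256, warmup_steps=25000):
    """Warm-up / inverse-square-root schedule from [49]: the rate grows linearly for
    `warmup_steps` updates and then decays proportionally to 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# transformer_lr(1_000) << transformer_lr(25_000); the peak is reached at step == warmup_steps.
```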

For domain adaptation, the averaged model was retrained using the CALLHOME adaptation set. We used the Adam optimizer and ran 100 epochs. After 100 epochs, we used an averaged model obtained by averaging the model parameters of the last ten epochs.

V-C Performance metric

We evaluated the systems with the diarization error rate (DER) [34]. Note that the DERs reported in many prior studies did not include miss or false alarm errors because those studies used oracle speech/non-speech labels; overlapping speech segments had also been excluded from the evaluation. For our DER computation, we evaluated all of the errors, including overlapping speech segments, because the proposed method includes both speech activity detection and overlapping speech detection functionality. As is typically done, we used a collar tolerance of 250 ms at the start and end of each segment.
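For intuition, a simplified frame-level DER (ignoring the 250-ms collar and using frame counts rather than time) can be computed as follows; this is a sketch for illustration, not the NIST scoring tool used in the paper:

```python
import numpy as np
from itertools import permutations

def frame_level_der(ref, hyp):
    """Simplified frame-level DER: (misses + false alarms + confusions) / reference speech,
    after choosing the speaker permutation that minimizes the error.
    ref, hyp: (T, C) binary speaker-activity matrices; no collar is applied."""
    ref_count, total_speech = ref.sum(axis=1), ref.sum()
    best = np.inf
    for perm in permutations(range(hyp.shape[1])):
        h = hyp[:, list(perm)]
        hyp_count = h.sum(axis=1)
        miss = np.maximum(ref_count - hyp_count, 0).sum()
        fa = np.maximum(hyp_count - ref_count, 0).sum()
        correct = np.minimum(ref, h).sum(axis=1)
        conf = (np.minimum(ref_count, hyp_count) - correct).sum()
        best = min(best, (miss + fa + conf) / total_speech)
    return best
```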

VI Results

                              Simulated                  Real
Method                        β=2      β=3      β=5      CH       CSJ
Clustering-based
  i-vector                    33.74    30.93    25.96    12.10    27.99
  x-vector                    28.77    24.46    19.78    11.53    22.96
BLSTM-EEND
  trained with SimBeta2       12.28    14.36    19.69    26.03    39.33
  trained with Real           36.23    37.78    40.34    23.07    25.37
SA-EEND
  trained with SimBeta2        7.91     8.51     9.51    13.66    22.31
  trained with Real           32.72    33.84    36.78    10.76    20.50
  trained with SimLarge        6.81     6.60     6.40    14.03    21.84
  trained with Comb            6.92     6.54     6.38    11.99    22.26
TABLE II: DERs (%) on various test sets. For the EEND systems, the CALLHOME (CH) results were obtained with domain adaptation.
Method                        w/o adaptation    with adaptation
x-vector clustering           11.53             N/A
BLSTM-EEND
  trained with SimBeta2       43.84             26.03
  trained with Real           31.01             23.07
SA-EEND
  trained with SimBeta2       17.42             13.66
  trained with SimLarge       16.31             14.03
  trained with Real           12.66             10.76
  trained with Comb           14.50             11.99
TABLE III: DERs (%) on the CALLHOME with and without domain adaptation.
                               DER breakdown            SAD errors
Method               DER      MI      FA      CF        MI      FA
i-vector             12.10    7.74    0.54    3.82      1.4     0.5
x-vector             11.53    7.74    0.54    3.25      1.4     0.5
SA-EEND (no adapt)   12.66    7.42    3.93    1.31      3.3     0.6
SA-EEND (adapted)    10.76    6.68    2.40    1.68      2.3     0.5
TABLE IV: Detailed DERs (%) evaluated on the CALLHOME. DER is composed of misses (MI), false alarms (FA), and confusion errors (CF). The SAD errors are composed of miss (MI) and false alarm (FA) errors.
Fig. 5: Attention weight matrices at the second encoder block. The input was the CALLHOME test set (recording id: iagk). The model was trained with the real training set followed by domain adaptation. The top two rows show the reference speech activity of two speakers.

VI-A Evaluation on simulated mixtures

DERs on various test sets are shown in Table II. The clustering-based systems performed poorly on heavily overlapping simulated mixtures. This result is within our expectations because the clustering-based systems did not consider speaker overlaps; there were more misses when the overlap ratio was high.

The BLSTM-EEND system trained with the simulated training set (SimBeta2) showed a significant DER reduction compared with the clustering-based systems on the simulated mixtures. Among the differing overlap ratios, it performed best on the highest overlap ratio condition (β = 2). The BLSTM-EEND system worked well when the overlapping condition matched that of the training data.

The SA-EEND system trained with the simulated training set had significantly lower DERs than the BLSTM-EEND system on every test set. Like the BLSTM-EEND system, it showed the best performance on the highest overlap ratio condition (β = 2). However, its DER degradation under less-overlapping conditions was smaller than that of the BLSTM-EEND system, which indicates that the self-attention blocks improved robustness to variable overlapping conditions.

Training the SA-EEND model with various overlap ratio conditions (SimLarge) showed an improvement over the single overlap ratio condition (SimBeta2) on every test set. It was revealed that overfitting to a specific overlap ratio could be mitigated by this multi-condition training.

VI-B Evaluation on real test sets

In contrast to the excellent performance on the simulated mixtures, the BLSTM-EEND system had inferior DERs to those of the clustering-based systems evaluated on the real test sets. Although the BLSTM-EEND system showed performance improvements when the training data were switched from simulated to real data, its DERs were still higher than those of the clustering-based systems.

The SA-EEND system trained with the simulated training set (SimBeta2) showed remarkable improvements on the real test sets of CALLHOME and CSJ, which indicates the strong generalization capability of the self-attention blocks. For the CSJ, even without domain adaptation, the SA-EEND system performed better than the x-vector clustering-based method. Training the SA-EEND model with various overlap ratio conditions (SimLarge) yielded excellent generalizations to real test sets.

The SA-EEND system trained with the real training set (Real) performed better than the one trained with SimLarge on the real test sets. However, it had poor DERs on the simulated test sets. We believe that this result was due to the small number of mixtures and the low overlap ratio of the real training set. Finally, the SA-EEND system trained with the combined dataset (Comb) showed an excellent generalization capability, which was obtained by feeding it various overlap ratio conditions.

VI-C Effect of domain adaptation

The EEND models trained with the simulated training sets were overfitted to the specific overlap ratios of those training sets. We expected that this overfitting would be mitigated by domain adaptation. DERs on the CALLHOME with and without domain adaptation are shown in Table III. As expected, the domain adaptation significantly reduced the DER; our system thus achieved even better results than the x-vector-based system.

A detailed DER comparison on the CALLHOME test set is shown in Table IV. The clustering-based systems had few SAD errors thanks to the robust SAD model trained with various noise-augmented data. However, they produced numerous miss and confusion errors due to their inability to handle speaker overlaps. Compared with the clustering-based systems, the proposed method produced significantly fewer confusion and miss errors. The domain adaptation reduced all error types except confusion errors.

VI-D Visualization of self-attention

To analyze the behavior of the self-attention mechanism in our diarization system, Fig. 5 visualizes the attention weight matrices at the second encoder block, corresponding to A_h in Eq. 19. Here, head 1 and head 2 have vertical lines at different positions. The vertical lines correspond to each speaker's activity. An attention weight matrix with these vertical lines transforms the input features into the weighted mean of frames of the same speaker. These heads actually captured the global speaker characteristics by computing the similarity between distant frames. Interestingly, heads 3 and 4 look like diagonal matrices, which result in local linear transforms. These heads are considered to act as speech/non-speech detectors. We conclude that the multi-head self-attention mechanism captures global speaker characteristics in addition to local speech activity dynamics, which leads to a reduction in DER.

Fig. 6: Loss curves on the simulated validation set (β = 2) for different numbers of heads. These models were trained with SimBeta2.

VI-E Effect of varying number of heads in self-attention blocks

Num.        Simulated                  Real
heads       β=2      β=3      β=5      CH       CSJ
2           12.60    13.42    16.12    16.49    26.05
4            7.91     8.51     9.51    13.66    22.31
8            6.84     7.06     7.85    13.44    23.58
16           7.19     7.52     7.88    13.28    24.35
TABLE V: DERs (%) with different numbers of heads. The models were trained with SimBeta2.

The analysis in Sec. VI-D indicated that the different heads represented different speakers. To verify the importance of multiple heads, we trained models with different numbers of heads. The loss curves on the simulated validation set (β = 2) for those models are shown in Fig. 6. The loss decreased as the number of heads increased, and this trend continued for a large number of epochs. Note that for the single-head (H = 1) experiment, we interrupted the training because the loss stayed at around 0.67 during the first 12 epochs.

The DERs for different numbers of heads are shown in Table V. Here, performance improved as the number of heads increased. These results suggest that the SA-EEND models were trained to separate speakers via the global speaker characteristics represented by the different heads, that the required number of heads is at least the number of speakers, and that more heads boost performance.

VI-F Effect of varying number of encoder blocks and warm-up steps

Fig. 7: Loss curves on the simulated validation set (β = 2) for different numbers of encoder blocks and warm-up steps. These models were trained with SimBeta2.
Enc.     Warm-up   Res.      Simulated                  Real
blocks   steps     con.      β=2      β=3      β=5      CH       CSJ
2        25k       N          7.91     8.51     9.51    13.66    22.31
2        25k       Y          7.36     7.59     7.78    12.50    23.38
4        25k       Y          5.66     5.39     5.01    10.16    20.39
4        50k       Y          5.01     4.64     4.10    10.25    21.50
4        100k      Y          4.56     4.50     3.85     9.54    20.48
x-vector clustering           28.77    24.46    19.78    11.53    22.96
TABLE VI: DERs (%) for different numbers of encoder blocks and warm-up steps, with and without residual connections. The models were trained with SimBeta2.

As noted in Sec. V-B3, most of our experiments were performed without residual connections in Eqs. 22 and 25. In this section, we examined deeper model configurations using more encoder blocks with residual connections. The loss curves for different numbers of encoder blocks and warm-up steps are shown in Fig. 7. The models with four encoder blocks reduced the validation loss compared with the one with two encoder blocks. Moreover, the validation loss was reduced by increasing the number of warm-up steps from 25,000 to 100,000. DERs for different numbers of encoder blocks are shown in Table VI. The results show that increasing the number of encoder blocks significantly improved performance.

The EEND system achieved a DER of 9.54% on the CALLHOME dataset, whereas the x-vector clustering-based system had a DER of 11.53%. Moreover, EEND had a DER of 20.39% on the CSJ dataset, while the x-vector clustering-based system had 22.96%. On the simulated test sets, EEND had DERs of 3.85% to 4.56%, while the x-vector clustering-based system had DERs of 19.78% to 28.77%.

VII Conclusion

We proposed End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. We formulated the speaker diarization problem as a multi-label classification problem and introduced a permutation-free objective function to minimize diarization errors directly. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method and correctly handled speaker overlaps. We explored the neural network architecture for the EEND method and found that the self-attention-based neural network was the key to achieving excellent performance. By visualizing the attention weights, we showed that self-attention captured global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem. Experiments with different numbers of heads showed that excellent performance could be obtained by making the number of heads sufficiently larger than the number of speakers. Finally, experiments with different numbers of encoder blocks revealed that the EEND model performed better when it had more encoder blocks.

References

  • [1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals (2012) Speaker diarization: a review of recent research. IEEE Trans. on ASLP 20 (2), pp. 356–370. External Links: Document, ISSN 1558-7916 Cited by: §I.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: §II-A.
  • [3] J. Barker, S. Watanabe, E. Vincent, and J. Trmal (2018) The fifth ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines. In Proc. Interspeech, pp. 1561–1565. External Links: Document Cited by: §I.
  • [4] C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach (2018) Front-End Processing for the CHiME-5 Dinner Party Scenario. In Proc. CHiME-5, pp. 35–40. Cited by: §I.
  • [5] Ö. Çetin and E. Shriberg (2006) Overlap in meetings: ASR effects and analysis by dialog factors, speakers, and collection site. In Proc. MLMI, pp. 212–224. Cited by: §I.
  • [6] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964. Cited by: §II-A.
  • [7] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Proc. NIPS, pp. 577–585. Cited by: §II-A.
  • [8] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Trans. on ASLP 19 (4), pp. 788–798. External Links: Document, ISSN 1558-7916 Cited by: §I.
  • [9] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Z̆molíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, L. Mos̆ner, and P. Matĕjka (2018) BUT system for DIHARD speech diarization challenge 2018. In Proc. Interspeech, pp. 2798–2802. Cited by: §I, §II-A.
  • [10] D. Dimitriadis and P. Fousek (2017) Developing on-line speaker diarization system. In Proc. Interspeech, pp. 2739–2743. External Links: Document Cited by: §I.
  • [11] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. Proc. ICASSP, pp. 5884–5888. Cited by: §II-C, §IV-B.
  • [12] J. Du, T. Gao, L. Sun, F. Ma, Y. Fang, D. Liu, Q. Zhang, X. Zhang, H. Wang, J. Pan, J. Gao, C. Lee, and J. Chen (2018) The USTC-iFlytek Systems for CHiME-5 Challenge. In Proc. CHiME-5, pp. 11–15. Cited by: §I.
  • [13] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe (2019 (to appear)) End-to-end neural speaker diarization with permutation-free objectives. In Proc. Interspeech, Cited by: §I.
  • [14] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe (2019 (submitted)) End-to-end neural speaker diarization with self-attention. In Proc. ASRU, Cited by: §I.
  • [15] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree (2017) Speaker diarization using deep neural network embeddings. In Proc. ICASSP, Vol. , pp. 4930–4934. External Links: Document, ISSN 2379-190X Cited by: §I, §II-A.
  • [16] A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18 (5), pp. 602 – 610. Note: IJCNN 2005 External Links: ISSN 0893-6080, Document Cited by: §I.
  • [17] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. ICASSP, Vol. , pp. 31–35. External Links: Document, ISSN 2379-190X Cited by: §III-B, §IV-A, §V-A1.
  • [18] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters (2003) The ICSI meeting corpus. In Proc. ICASSP, Vol. I, pp. 364–367. External Links: Document, ISSN 1520-6149 Cited by: §I.
  • [19] W. Jun and L. Shengchen (2018) Self-attention mechanism based system for DCASE2018 challenge task1 and task4. In DCASE2018 Challenge, Cited by: §II-C.
  • [20] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, and S. Watanabe (2019) Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches. In Proc. ICASSP, pp. 6630–6634. Cited by: §I.
  • [21] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Yalta Soplin, M. Maciejewski, S. Chen, A. S. Subramanian, R. Li, Z. Wang, J. Naradowsky, L. P. Garcia-Perera, and G. Sell (2018) Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays. In Proc. CHiME-5, pp. 6–10. Cited by: §I.
  • [22] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In Proc. ICLR, Cited by: §V-B2.
  • [23] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In Proc. ICASSP, Vol. , pp. 5220–5224. External Links: Document, ISSN 2379-190X Cited by: §V-A1.
  • [24] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen (2017) Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. on ASLP 25 (10), pp. 1901–1913. External Links: Document, ISSN 2329-9290 Cited by: §III-B.
  • [25] J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §IV-B1.
  • [26] Z. Lin, M. Feng, C. Nogueira dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In Proc. ICLR, Cited by: §I, §II-C, §IV-B.
  • [27] M. Maciejewski, D. Snyder, V. Manohar, N. Dehak, and S. Khudanpur (2018) Characterizing performance of speaker diarization systems on far-field speech using standard methods. In Proc. ICASSP, pp. 5244–5248. Cited by: §I.
  • [28] K. Maekawa (2003) Corpus of spontaneous japanese: its design and evaluation. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Cited by: §V-A2, TABLE I.
  • [29] P. A. Mansfield, Q. Wang, C. Downey, L. Wan, and I. L. Moreno (2018) Links: a high-dimensional online clustering method. arXiv preprint arXiv:1801.10123. Cited by: §I.
  • [30] S. Meignier (2010) LIUM_SPKDIARIZATION: an open source toolkit for diarization. In CMU SPUD Workshop, Cited by: §I.
  • [31] V. A. Miasato Filho, D. A. Silva, and L. G. Depra Cuozzo (2018) Joint discriminative embedding learning, speech activity and overlap detection for the dihard speaker diarization challenge. In Proc. Interspeech, pp. 2818–2822. External Links: Document Cited by: §II-A.
  • [32] V. S. Narayanaswamy, J. J. Thiagarajan, H. Song, and A. Spanias (2019) Designing an effective metric learning pipeline for speaker diarization. In Proc. ICASSP, pp. 5806–5810. External Links: Document, ISSN 2379-190X Cited by: §II-A, §II-C.
  • [33] NIST (2000) 2000 speaker recognition evaluation plan. Note: https://www.nist.gov/sites/default/files/documents/2017/09/26/spk-2000-plan-v1.0.htm_.pdf Cited by: §V-A2, TABLE I.
  • [34] NIST (2009) The 2009 (RT-09) rich transcription meeting recognition evaluation plan. Note: http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf Cited by: §V-C.
  • [35] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely (2011) The Kaldi speech recognition toolkit. In Proc. ASRU, Cited by: §V-A1.
  • [36] S. Renals, T. Hain, and H. Bourlard (2008) Interpretation of multiparty meetings the AMI and Amida projects. In 2008 Hands-Free Speech Communication and Microphone Arrays, Vol. , pp. 115–118. External Links: Document, ISSN Cited by: §I.
  • [37] G. Sell and D. Garcia-Romero (2014) Speaker diarization with PLDA i-vector scoring and unsupervised calibration. In Proc. SLT, Vol. , pp. 413–417. External Links: Document, ISSN Cited by: §I.
  • [38] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proc. Interspeech, pp. 2808–2812. External Links: Document Cited by: §I, §II-A, §V-B1.
  • [39] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel (2014) A study of the cosine distance-based mean shift for telephone speech diarization. IEEE/ACM Trans. on ASLP 22 (1), pp. 217–227. External Links: Document, ISSN 2329-9290 Cited by: §I.
  • [40] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass (2013) Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. on ASLP 21 (10), pp. 2015–2028. External Links: Document, ISSN 1558-7916 Cited by: §I.
  • [41] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur (2019) Speaker recognition for multi-speaker conversations using x-vectors. In Proc. ICASSP, pp. 5796–5800. External Links: Document, ISSN 2379-190X Cited by: §II-A.
  • [42] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In Proc. ICASSP, Vol. , pp. 5329–5333. External Links: Document, ISSN 2379-190X Cited by: §I.
  • [43] D. Snyder, G. Chen, and D. Povey (2015) MUSAN: a music, speech, and noise corpus. arXiv preprints arXiv:1510.08484. Cited by: §V-A1.
  • [44] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio (2017) Char2Wav: end-to-end speech synthesis. In ICLR Workshop, Cited by: §II-A.
  • [45] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel (2018) Self-attentional acoustic models. In Proc. Interspeech, pp. 3723–3727. Cited by: §II-C.
  • [46] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C. Lee (2018) Speaker diarization with enhancing speech for the first DIHARD challenge. In Proc. Interspeech, pp. 2793–2797. Cited by: §I, §II-C.
  • [47] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112. Cited by: §II-A.
  • [48] S. E. Tranter and D. A. Reynolds (2006) An overview of automatic speaker diarization systems. IEEE Trans. on ASLP 14 (5), pp. 1557–1565. External Links: Document, ISSN 1558-7916 Cited by: §I.
  • [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NIPS, pp. 5998–6008. Cited by: §I, §II-C, §V-B3.
  • [50] L. Wan, Q. Wang, A. Papir, and I. L. Moreno (2018) Generalized end-to-end loss for speaker verification. In Proc. ICASSP, Vol. , pp. 4879–4883. External Links: Document, ISSN 2379-190X Cited by: §I.
  • [51] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno (2018) Speaker diarization with LSTM. In Proc. ICASSP, Vol. , pp. 5239–5243. External Links: Document, ISSN 2379-190X Cited by: §I.
  • [52] X. Wang, R. B. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proc. CVPR, pp. 7794–7803. Cited by: §II-C.
  • [53] Y. Wang, R.J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech, pp. 4006–4010. Cited by: §II-A.
  • [54] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. Cited by: §II-A.
  • [55] L. Ye, M. Rochan, Z. Liu, and Y. Wang (2019) Cross-modal self-attention network for referring image segmentation. In Proc. CVPR, pp. 10502–10511. Cited by: §II-C.
  • [56] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, Vol. , pp. 241–245. External Links: Document, ISSN 2379-190X Cited by: §III-B.
  • [57] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang (2019) Fully supervised speaker diarization. In Proc. ICASSP, pp. 6301–6305. Cited by: §II-B.
  • [58] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey (2018) Self-attentive speaker embeddings for text-independent speaker verification. In Proc. Interspeech, pp. 3573–3577. External Links: Document Cited by: §II-C.