End-to-End Neural Speaker Diarization with Self-attention

09/13/2019
by   Yusuke Fujita, et al.

Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems: (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method even outperformed the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that self-attention can capture global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.

1 Introduction

Speaker diarization is the process of partitioning an audio recording into homogeneous segments according to speaker identity. Speaker diarization has a wide range of applications, such as information retrieval from broadcast news, generating minutes of meetings, and turn-taking analysis of telephone conversations [47, 1]. It also improves automatic speech recognition performance in multi-speaker conversation scenarios in meetings (ICSI [17, 5], AMI [35, 19]) and home environments (CHiME-5 [3, 12, 4, 20, 19]).

Typical speaker diarization systems are based on the clustering of speaker embeddings [29, 39, 36, 38, 10, 14, 26, 50]. For instance, i-vectors [8, 39, 36, 26], d-vectors [49, 50], and x-vectors [41, 14] are commonly used in speaker diarization tasks. These embeddings of short segments are partitioned into speaker clusters by using clustering algorithms, such as Gaussian mixture models [29, 39], agglomerative hierarchical clustering [29, 36, 14, 26], mean shift clustering [38], k-means clustering [10, 50], Links [28, 50], and spectral clustering [50]. These clustering-based diarization methods have shown themselves to be effective on various datasets (see the DIHARD Challenge 2018 activities, e.g., [37, 9, 45]).

However, such clustering-based methods have a number of problems. First, they cannot be optimized to minimize diarization errors directly, because the clustering procedure is an unsupervised process. Second, they have trouble handling speaker overlaps, since the clustering algorithms implicitly assume one speaker per segment. Furthermore, it is difficult to adapt their speaker embedding models to real audio recordings with speaker overlaps, because the speaker embedding model has to be optimized on single-speaker, non-overlapping segments. These problems hinder speaker diarization from working on real audio recordings, which usually contain overlapping segments.

To solve these problems, we propose Self-Attentive End-to-End Neural Diarization (SA-EEND). Unlike most other methods, our proposed method does not rely on clustering. Instead, a self-attention-based neural network directly outputs the joint speech activities of all speakers for each time frame, given a multi-speaker audio recording as input. Our method can naturally handle speaker overlaps during training and inference by exploiting a multi-label classification framework. The neural network is trained in an end-to-end fashion using a recently proposed permutation-free objective function that directly minimizes diarization errors [13].

This paper shows that our method achieves a significant performance improvement over end-to-end neural diarization (EEND) [13], for which promising but preliminary results were reported with a bidirectional long short-term memory (BLSTM) [15]. In particular, it shows that the self-attention mechanism [25, 48] is the key to achieving good speaker diarization performance. We demonstrate that the self-attention mechanism gives significantly better results on multiple datasets than the BLSTM-based method [13] and the state-of-the-art x-vector-based speaker diarization method. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, the self-attention layer is conditioned on all the other input frames by computing the pairwise similarity between all frame pairs. We believe that this mechanism is key for speaker diarization since it can capture global speaker characteristics in addition to local speech activity dynamics. By visualizing the learned representation, we show that some self-attention heads capture speaker-dependent global characteristics, while the remaining heads represent temporal features.

2 Related work

2.1 Clustering-based methods

(a) X-vector clustering-based method
(b) EEND method
Figure 1: System diagrams for speaker diarization

The x-vector clustering-based system is commonly used for speaker diarization [37, 9, 40]. A diagram of the system is depicted in Fig. 1(a). To build the system, one has to prepare three independent models: (i) a speech activity detection (SAD) neural network, (ii) an x-vector extraction neural network, and (iii) a PLDA model including the same/different-speaker covariance matrices. None of these models can be trained to directly minimize diarization errors. Joint modeling methods have been studied in an effort to alleviate this complex preparation process and to take into account the dependencies between these models; they include, for example, joint modeling of x-vector extraction and PLDA scoring [14, 31] and joint modeling of SAD and speaker embedding [30]. However, the clustering process has remained unchanged because it is an unsupervised process.

In contrast to these methods, the EEND method uses only one neural network model, as depicted in Fig. 1(b). This method does not rely on clustering, and the model can be directly optimized with the reference diarization results of the training data.

This neural-network-based end-to-end approach, in which only one neural network model directly computes the final outputs, has been successfully applied in a variety of tasks, including neural machine translation [2, 46], automatic speech recognition [7, 6, 53], and text-to-speech [52, 43].

2.2 Direct optimization minimizing diarization errors

A fully supervised diarization method has been proposed for optimization based on a diarization error minimization objective [56]. This is the first successful approach that does not cluster speaker embeddings. The method formulates the speaker diarization problem on the basis of a factored probabilistic model, which consists of modules for determining speaker changes, speaker assignments, and feature generation. These modules are jointly trained using input features and corresponding speaker labels. However, in their method, the SAD model and the speaker embedding (d-vector) model have to be trained separately. Moreover, their speaker-change model assumes one speaker per segment, which hinders its application to speaker-overlapping speech.

In contrast to their method, the EEND method uses an end-to-end neural network that accepts audio features as input and outputs the joint speech activities of multiple speakers. The network is optimized using the entire recording, including non-speech and speaker overlaps, with a diarization-error-oriented objective. This end-to-end model was first introduced in [13]; this paper describes an extension of the model that includes a self-attention mechanism.

2.3 Self-attention mechanism

The self-attention mechanism was originally proposed for extracting sentence embeddings for text processing [25]. Recently, the self-attention mechanism has shown superior performance in a variety of tasks, including machine translation [48], video classification [51], and image segmentation [54]. For audio processing, a self-attention mechanism has been incorporated in acoustic modeling for ASR [44, 11], sound event detection [18], and speaker recognition [57]. For speaker diarization, the self-attention mechanism has been applied to the speaker embedding extraction model [45] and the scoring model [31] of clustering-based methods. This study describes a self-attention mechanism for clustering-free speaker diarization.

3 Proposed method: Self-Attentive End-to-End Neural Diarization

3.1 End-to-end neural diarization: review

Here, we describe the EEND method proposed in [13]. The speaker diarization task can be formulated as a multi-label classification problem, as follows.

Given a T-length observation sequence X = (x_t \in \mathbb{R}^F \mid t = 1, \dots, T) from an audio signal, the speaker diarization problem tries to estimate the corresponding speaker label sequence Y = (y_t \mid t = 1, \dots, T). Here, x_t is an F-dimensional observation feature vector at time index t. Speaker label y_t = [y_{t,s} \in \{0,1\} \mid s = 1, \dots, S] denotes a joint activity for multiple (S) speakers at time index t. For example, y_{t,s} = 1 and y_{t,s'} = 1 (s \neq s') represent an overlap situation in which speakers s and s' are both present at time index t. Thus, determining Y is a sufficient condition to determine the speaker diarization information.

The most probable speaker label sequence \hat{Y} is selected from among all possible speaker label sequences \mathcal{Y}, as follows:

\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y \mid X).   (1)

P(Y \mid X) can be factorized using the conditional independence assumption as follows:

P(Y \mid X) = \prod_t P(y_t \mid y_1, \dots, y_{t-1}, X) \approx \prod_t P(y_t \mid X)   (2)

\approx \prod_t \prod_s P(y_{t,s} \mid X).   (3)

Here, we assume that the frame-wise posterior is conditioned on all inputs, and each speaker is present independently. The frame-wise posterior P(y_{t,s} \mid X) can be estimated using a neural-network-based model.
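To make the multi-label formulation concrete, the following NumPy sketch shows a toy reference label matrix and how frame-wise posteriors would be thresholded under the independence assumption of Eq. (3). The array names and toy values are ours, for illustration only, and do not come from the released implementation.

```python
import numpy as np

# Hypothetical toy example: T = 6 frames, S = 2 speakers.
# Each row y_t is the joint activity label; rows with two 1s are overlap frames.
Y = np.array([
    [1, 0],  # only speaker 1 active
    [1, 0],
    [1, 1],  # overlap: both speakers active
    [1, 1],
    [0, 1],  # only speaker 2 active
    [0, 0],  # silence
])

# Under the conditional independence assumption of Eq. (3), a model only has to
# produce frame-wise posteriors p[t, s] = P(y_{t,s} = 1 | X); diarization output
# is obtained by thresholding each entry independently, overlaps included.
p = np.random.rand(*Y.shape)        # stand-in for neural-network posteriors
decisions = (p > 0.5).astype(int)   # joint speech activities per frame
```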

3.2 Self-attention-based neural network

In [13], a BLSTM-based neural network was used for estimating the frame-wise posteriors P(y_{t,s} \mid X). In this paper, we propose self-attentive end-to-end neural diarization (SA-EEND), which uses self-attention-based encoding blocks instead of BLSTMs, as depicted in Fig. 2. The input features are transformed as follows:

e_t^{(0)} = W_0 x_t + b_0,   (4)

e_t^{(p)} = \mathrm{Encoder}_t^{(p)}(e_1^{(p-1)}, \dots, e_T^{(p-1)}) \quad (1 \le p \le P).   (5)

Here, W_0 \in \mathbb{R}^{D \times F} and b_0 \in \mathbb{R}^D project an input feature into a D-dimensional vector. \mathrm{Encoder}_t^{(p)}(\cdot) is the p-th encoder block, which accepts an input sequence of D-dimensional vectors and outputs a D-dimensional vector e_t^{(p)} at time index t. We use P encoder blocks followed by the output layer for frame-wise posteriors.

The architecture of the encoder block is depicted in Fig. 2. This configuration of the encoder block is almost the same as the one in the Speech-Transformer introduced in [11], but without positional encoding. The encoder block has two sub-layers. The first is a multi-head self-attention layer, and the second is a position-wise feed-forward layer.

3.2.1 Multi-head self-attention layer

The multi-head self-attention layer transforms a sequence of input vectors as follows. The sequence of vectors is converted into a matrix, followed by layer normalization [24]:

\bar{E}^{(p-1)} = \mathrm{LayerNorm}([e_1^{(p-1)} \cdots e_T^{(p-1)}]^\top) \in \mathbb{R}^{T \times D}.   (6)

Then, for each head, a pairwise similarity matrix is computed using the dot products of query vectors Q_h^{(p)} and key vectors K_h^{(p)}:

Q_h^{(p)} = \bar{E}^{(p-1)} W_h^{Q,(p)}, \quad K_h^{(p)} = \bar{E}^{(p-1)} W_h^{K,(p)},   (7)

where W_h^{Q,(p)}, W_h^{K,(p)} \in \mathbb{R}^{D \times d} are the query and key projection matrices for the h-th head (1 \le h \le H), respectively. d = D/H is the dimension of each head, and H is the number of heads. The pairwise similarity matrix Q_h^{(p)} K_h^{(p)\top} is scaled by 1/\sqrt{d} and a softmax function is applied to form the attention weight matrix A_h^{(p)}:

A_h^{(p)} = \mathrm{softmax}\left(\frac{Q_h^{(p)} K_h^{(p)\top}}{\sqrt{d}}\right) \in \mathbb{R}^{T \times T}.   (8)

Then, using the attention weight matrix, context vectors C_h^{(p)} are computed as a weighted sum of the value vectors V_h^{(p)} = \bar{E}^{(p-1)} W_h^{V,(p)}:

C_h^{(p)} = A_h^{(p)} V_h^{(p)},   (9)

where W_h^{V,(p)} \in \mathbb{R}^{D \times d} is the value projection matrix. Finally, the context vectors for all heads are concatenated and projected using the output projection matrix W^{O,(p)} \in \mathbb{R}^{D \times D}:

E_{\mathrm{SA}}^{(p)} = [C_1^{(p)} \cdots C_H^{(p)}] W^{O,(p)}.   (10)

Following the self-attention layer, a residual connection and layer normalization are applied:

\bar{E}_{\mathrm{SA}}^{(p)} = \mathrm{LayerNorm}(\bar{E}^{(p-1)} + E_{\mathrm{SA}}^{(p)}).   (11)
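The following NumPy sketch mirrors Eqs. (6)-(11) for one encoder block. The parameter container blk and the helper functions are our own illustrative names; this is a sketch, not the released implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (time frame) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(E_prev, blk, H=4):
    """E_prev: (T, D) input to the block; blk holds per-head projection matrices
    blk["WQ"][h], blk["WK"][h], blk["WV"][h] of shape (D, d) and blk["WO"] of shape (D, D)."""
    T, D = E_prev.shape
    d = D // H
    E_bar = layer_norm(E_prev)                          # Eq. (6)
    contexts = []
    for h in range(H):
        Q = E_bar @ blk["WQ"][h]                        # query vectors, Eq. (7)
        K = E_bar @ blk["WK"][h]                        # key vectors, Eq. (7)
        V = E_bar @ blk["WV"][h]                        # value vectors
        A = softmax(Q @ K.T / np.sqrt(d), axis=-1)      # attention weights, Eq. (8)
        contexts.append(A @ V)                          # context vectors, Eq. (9)
    E_sa = np.concatenate(contexts, axis=-1) @ blk["WO"]  # concatenate heads, Eq. (10)
    return layer_norm(E_bar + E_sa)                     # residual + layer norm, Eq. (11)
```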

3.2.2 Position-wise feed-forward layer

The position-wise feed-forward layer transforms \bar{E}_{\mathrm{SA}}^{(p)} as follows:

E_{\mathrm{FF}}^{(p)} = \mathrm{ReLU}(\bar{E}_{\mathrm{SA}}^{(p)} W_1^{(p)} + \mathbf{1} b_1^{(p)}) W_2^{(p)} + \mathbf{1} b_2^{(p)},   (12)

where W_1^{(p)} \in \mathbb{R}^{D \times d_{\mathrm{ff}}} and b_1^{(p)} \in \mathbb{R}^{d_{\mathrm{ff}}} are the first linear projection matrix and bias, respectively, \mathbf{1} \in \mathbb{R}^{T \times 1} is an all-one vector used to add the bias to every frame, and \mathrm{ReLU}(\cdot) is the rectified linear unit activation function. d_{\mathrm{ff}} is the number of internal units in this layer. W_2^{(p)} \in \mathbb{R}^{d_{\mathrm{ff}} \times D} and b_2^{(p)} \in \mathbb{R}^D are the second linear projection matrix and bias, respectively.

Finally, the output of the encoder block e_t^{(p)} for each time frame is computed by applying a residual connection as follows:

e_t^{(p)} = \bar{e}_{\mathrm{SA},t}^{(p)} + e_{\mathrm{FF},t}^{(p)},   (13)

where \bar{e}_{\mathrm{SA},t}^{(p)} and e_{\mathrm{FF},t}^{(p)} are the t-th row vectors of \bar{E}_{\mathrm{SA}}^{(p)} and E_{\mathrm{FF}}^{(p)}, respectively.
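Continuing the sketch above, Eqs. (12)-(13) amount to two frame-wise linear layers with a ReLU in between plus a residual connection; again, the function and argument names are ours.

```python
import numpy as np

def position_wise_ff(E_sa_bar, W1, b1, W2, b2):
    """E_sa_bar: (T, D) output of Eq. (11); W1: (D, d_ff), W2: (d_ff, D)."""
    hidden = np.maximum(0.0, E_sa_bar @ W1 + b1)   # ReLU of the first projection, Eq. (12)
    E_ff = hidden @ W2 + b2                        # second projection, applied per frame
    return E_sa_bar + E_ff                         # residual connection, Eq. (13)
```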

3.2.3 Output layer for frame-wise posteriors

The frame-wise posteriors z_t are calculated from e_t^{(P)} (in Eq. 5) using layer normalization and a fully-connected layer as follows:

\bar{e}_t^{(P)} = \mathrm{LayerNorm}(e_t^{(P)}),   (14)

z_t = \sigma(W_3 \bar{e}_t^{(P)} + b_3) \in (0,1)^S,   (15)

where W_3 \in \mathbb{R}^{S \times D} and b_3 \in \mathbb{R}^S are the linear projection matrix and bias, respectively, and \sigma(\cdot) is the element-wise sigmoid function.
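Putting the pieces together, a forward pass through Eqs. (4)-(15) could be sketched as follows, reusing the helper functions from the previous snippets; the params dictionary layout is a hypothetical container for the projection matrices, not the layout of the released code.

```python
import numpy as np

def sa_eend_forward(X, params, P=2, H=4):
    """X: (T, F) acoustic features; returns (T, S) frame-wise posteriors z_t."""
    E = X @ params["W0"] + params["b0"]                    # Eq. (4): project to (T, D)
    for p in range(P):                                     # P encoder blocks, Eq. (5)
        blk = params["enc"][p]
        E_sa_bar = multi_head_self_attention(E, blk, H)    # Eqs. (6)-(11)
        E = position_wise_ff(E_sa_bar, blk["W1"], blk["b1"],
                             blk["W2"], blk["b2"])         # Eqs. (12)-(13)
    E = layer_norm(E)                                      # Eq. (14)
    logits = E @ params["W3"] + params["b3"]               # (T, S) linear projection
    return 1.0 / (1.0 + np.exp(-logits))                   # Eq. (15): element-wise sigmoid
```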

3.3 Permutation-free training

The difficulty of training the model described above is that the model must deal with speaker permutations: changing the order of speakers within a correct label sequence is also regarded as correct. An example of permutations in a two-speaker case is shown in Fig. 2. In this paper, we call this problem “label ambiguity.” This label ambiguity obstructs the training of the neural network when we use a standard binary cross entropy loss function.

Figure 2: Two-speaker SA-EEND model trained with permutation-free loss.

To cope with the label ambiguity problem, the permutation-free training scheme considers all the permutations of the reference speaker labels. The permutation-free training scheme has been used in research on source separation [16, 55, 23]. Here, we apply the permutation-free loss function to a temporal sequence of speaker labels. The neural network is trained to minimize the permutation-free loss between the output z_t predicted in Eq. 15 and the reference speaker label l_t = [l_{t,s} \in \{0,1\} \mid s = 1, \dots, S], as follows:

J^{\mathrm{PIT}} = \frac{1}{TS} \min_{\phi \in \mathrm{perm}(S)} \sum_{t=1}^{T} \mathrm{BCE}(l_t^{\phi}, z_t),   (16)

where \mathrm{perm}(S) is the set of all possible permutations of (1, \dots, S), l_t^{\phi} is the \phi-th permutation of the reference speaker label, and \mathrm{BCE}(\cdot, \cdot) is the binary cross entropy function between the label and the output.
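A minimal sketch of this permutation-free loss (Eq. 16): compute the frame-wise binary cross entropy for every speaker permutation of the reference labels and keep the minimum. The helper names are ours; this is an illustration, not the training code.

```python
import itertools
import numpy as np

def bce(labels, probs, eps=1e-7):
    # Element-wise binary cross entropy between reference labels and posteriors.
    probs = np.clip(probs, eps, 1 - eps)
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

def permutation_free_loss(labels, probs):
    """labels, probs: (T, S) arrays; returns the minimum BCE over speaker permutations (Eq. 16)."""
    T, S = labels.shape
    losses = []
    for perm in itertools.permutations(range(S)):
        losses.append(bce(labels[:, perm], probs).sum())
    return min(losses) / (T * S)
```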

4 Experimental setup

4.1 Data

                      # mixtures   avg. duration (sec)   overlap ratio (%)
Training sets
  Simulated (β = 2)      100,000                  87.6                34.4
  Real (SWBD+SRE)         26,172                 304.7                 3.7
Test sets
  Simulated (β = 2)          500                  87.3                34.4
  Simulated (β = 3)          500                 103.8                27.2
  Simulated (β = 5)          500                 137.1                19.5
  CALLHOME [32]              148                  72.1                13.0
  CSJ [27]                    54                 766.3                20.1
Table 1: Statistics of training and test sets.

To verify the effectiveness of the SA-EEND method for various overlap situations, we prepared two training sets and five test sets, including simulated and real datasets. The statistics of the training and test sets are listed in Table 1. The overlap ratio is computed as the ratio of the audio time during which two or more speakers are active, to the audio time during which one or more speakers are active.
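The overlap ratio defined here can be computed directly from a frame-level speaker-activity matrix; a small frame-based sketch (an approximation of the time-based ratio, with our own function name):

```python
import numpy as np

def overlap_ratio(Y):
    """Y: (T, S) binary speaker-activity matrix.
    Returns overlapped speech time / total speech time, as defined above."""
    active = Y.sum(axis=1)        # number of active speakers per frame
    speech = (active >= 1).sum()  # frames with one or more speakers
    overlap = (active >= 2).sum() # frames with two or more speakers
    return overlap / max(speech, 1)
```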

Note that the training data for the EEND method are different from those for the x-vector clustering-based method. Whereas the x-vector clustering-based method uses single-speaker segments to train its x-vector neural network, the EEND method uses audio mixtures of multiple speakers. Such mixtures can be simulated infinitely from combinations of single-speaker segments. Moreover, the EEND model can be trained not only with simulated mixtures but also with real audio mixtures containing speaker overlaps.

4.1.1 Simulated mixtures

Each mixture was simulated by Algorithm 1. Unlike the mixture simulation used in source separation studies [16], we consider a diarization-style mixture: each speech mixture should have dozens of utterances per speaker, with reasonable silence intervals between utterances. The silence intervals are controlled by the average interval β; larger values of β generate speech with less overlap.

Input:  S, N, I, R                       // Set of speakers, noises, RIRs and SNRs
        U = {U_s}_{s ∈ S}                // Set of utterance lists
        N_spk                            // #speakers per mixture
        N_umax, N_umin                   // Max. and min. #utterances per speaker
        β                                // Average interval
Output: y                                // Mixture
  Sample a set of N_spk speakers S' from S
  X ← ∅                                  // Set of speakers' signals
  forall s ∈ S' do
      x_s ← ∅                            // Concatenated signal
      Sample i from I                    // RIR
      Sample N_u from [N_umin, N_umax]
      for u = 1 to N_u do
          Sample δ from an exponential distribution with mean β   // Interval
          x_s ← x_s ⊕ 0^(δ) ⊕ (U_s(u) ∗ i)
      X ← X ∪ {x_s}
  L_max ← max_{x ∈ X} |x|
  y ← Σ_{x ∈ X} (x ⊕ 0^(L_max − |x|))    // Mix the padded speaker signals
  Sample n from N                        // Background noise
  Sample r from R                        // SNR
  Determine a mixing scale p from r and y
  n' ← repeat n until reach the length of y
  y ← y + p · n'
Algorithm 1: Mixture simulation.
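A condensed Python sketch of the simulation loop above, heavily simplified: the RIR reverberation tail is truncated, corpus handling is reduced to in-memory arrays, each speaker is assumed to have enough utterances, and sampling the interval from an exponential distribution with mean β follows our reading of the "average interval" parameter. Function and argument names are ours, not from the released tooling.

```python
import numpy as np

def simulate_mixture(utterances, rirs, noises, snrs, n_spk=2,
                     n_utt_range=(10, 20), beta=2.0, sr=8000, rng=np.random):
    """utterances: dict speaker -> list of 1-D waveforms (8 kHz); returns one mixed waveform."""
    speakers = rng.choice(list(utterances), size=n_spk, replace=False)
    signals = []
    for s in speakers:
        rir = rirs[rng.randint(len(rirs))]                       # one RIR per speaker
        n_utt = rng.randint(n_utt_range[0], n_utt_range[1] + 1)  # #utterances for this speaker
        idx = rng.choice(len(utterances[s]), size=n_utt, replace=False)
        x = np.zeros(0)
        for u in (utterances[s][i] for i in idx):
            gap = int(rng.exponential(beta) * sr)                # silence interval before utterance
            reverbed = np.convolve(u, rir)[:len(u)]              # truncated reverberation
            x = np.concatenate([x, np.zeros(gap), reverbed])
        signals.append(x)
    length = max(len(x) for x in signals)
    mix = sum(np.pad(x, (0, length - len(x))) for x in signals)  # sum zero-padded speaker signals
    noise = np.resize(noises[rng.randint(len(noises))], length)  # repeat noise to mixture length
    snr = snrs[rng.randint(len(snrs))]
    scale = np.sqrt((mix ** 2).mean() / ((noise ** 2).mean() * 10 ** (snr / 10)))
    return mix + scale * noise                                   # add noise at the sampled SNR
```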

The set of utterances used for the simulation comprised the Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part 2), and NIST Speaker Recognition Evaluation (2004, 2005, 2006, 2008) datasets. All recordings are telephone speech sampled at 8 kHz. There are 6,381 speakers in total. We split them into 5,743 speakers for the training set and 638 speakers for the test set. Note that the set of utterances for the training set is identical to that of the Kaldi CALLHOME diarization v2 recipe [34] (https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization), which makes the comparison with the x-vector clustering-based method fair.

Since there are no time annotations in these corpora, we extracted utterances using speech activity detection (SAD) based on time-delay neural networks and statistics pooling (the SAD model: http://kaldi-asr.org/models/m4).

The set of background noises was from the MUSAN corpus [42]. We used 37 recordings that are annotated as “background” noises. The set of 10,000 room impulse responses (RIRs) was from the Simulated Room Impulse Response Database used in [22]. The SNR values were sampled from 10, 15, and 20 dB. These non-speech corpora were also used for training the x-vector and SAD models of the x-vector clustering-based method.

We generated two-speaker mixtures with 10-20 utterances per speaker (N_umin = 10, N_umax = 20). For the simulated training set, 100,000 mixtures were generated with β = 2. For the simulated test sets, 500 mixtures were generated for each of β = 2, 3, and 5. The overlap ratios of the simulated mixtures range from 19.5 to 34.4%.

4.1.2 Real datasets

We used real telephone speech recordings as the real training set. A set of 26,172 two-speaker recordings was extracted from the recordings of the Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part 2), and NIST Speaker Recognition Evaluation datasets. The overlap ratio of this training data was 3.7%, far less than that of the simulated mixtures.

We evaluated the proposed method on real telephone conversations in the CALLHOME dataset [32]. We randomly split the two-speaker recordings from the CALLHOME dataset into two subsets: an adaptation set of 155 recordings and a test set of 148 recordings. The average overlap ratio of the test set was 13.0%.

In addition, we conducted an evaluation on the dialogue part of the Corpus of Spontaneous Japanese (CSJ) [27]. The CSJ contains 54 two-speaker dialogue recordings (we excluded four out of the 58 recordings that contain speakers appearing in the official speech recognition evaluation sets). They were recorded using headset microphones in separate soundproof rooms. The average overlap ratio of the CSJ test set was 20.1%, larger than that of the CALLHOME test set.

4.2 Model configuration

4.2.1 Clustering-based systems

We compared the proposed method with two conventional clustering-based systems [37]: an i-vector system and an x-vector system, created using the Kaldi CALLHOME diarization v1 and v2 recipes, respectively. These recipes use agglomerative hierarchical clustering (AHC) with probabilistic linear discriminant analysis (PLDA) scoring. The number of clusters was fixed to 2. Although the original recipes use oracle speech/non-speech marks, we used the SAD model with the same configuration as described in Sec. 4.1.

4.2.2 BLSTM-based EEND system

We configured a BLSTM-based EEND system (BLSTM-EEND), as described in [13]. The input features were 23-dimensional log-Mel-filterbanks with a 25-ms frame length and 10-ms frame shift. Each feature was concatenated with those from the previous seven frames and subsequent seven frames. To deal with long audio sequences in our neural networks, we subsampled the concatenated features by a factor of ten. Consequently, a 345-dimensional input feature was fed into the neural network every 100 ms.
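The frame stacking and subsampling described here can be sketched as follows; the log-Mel extraction itself is assumed to come from an external front end, and the function name and padding choice are ours.

```python
import numpy as np

def splice_and_subsample(logmel, context=7, subsample=10):
    """logmel: (T, 23) log-Mel features at a 10-ms shift.
    Returns roughly (T / subsample, 23 * (2 * context + 1)) features at a 100-ms shift."""
    T, F = logmel.shape
    padded = np.pad(logmel, ((context, context), (0, 0)), mode="edge")
    # Stack the current frame with +/- context frames: 23 * 15 = 345 dimensions per frame.
    spliced = np.concatenate(
        [padded[i:i + T] for i in range(2 * context + 1)], axis=1)
    return spliced[::subsample]
```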

We used a five-layer BLSTM with 256 hidden units in each layer. The second layer of BLSTM outputs was used to form a 256-dimensional embedding, on which we calculated the deep clustering loss to discriminate different speakers. We used the Adam [21] optimizer; the batch size was 10 and the number of training epochs was 20.

Because the output of the neural network is the probability of speech activity for each speaker, a threshold is required to obtain the speech activity decision for each frame. We set the threshold to 0.5. Furthermore, we applied 11-frame median filtering to prevent the production of unreasonably short segments.
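A sketch of this post-processing step (threshold at 0.5, then 11-frame median filtering per speaker); scipy.signal.medfilt is one convenient way to do the filtering, though the original implementation may differ.

```python
import numpy as np
from scipy.signal import medfilt

def posteriors_to_decisions(posteriors, threshold=0.5, kernel=11):
    """posteriors: (T, S) sigmoid outputs; returns (T, S) binary speech activities."""
    decisions = (posteriors > threshold).astype(float)
    # Median filtering removes unreasonably short segments, speaker by speaker.
    smoothed = np.stack(
        [medfilt(decisions[:, s], kernel) for s in range(decisions.shape[1])], axis=1)
    return smoothed.astype(int)
```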

For domain adaptation, the neural network was retrained using the CALLHOME adaptation set. We used the Adam optimizer and ran 5 epochs. For the post-processing, we adjusted the threshold to 0.6 so that the DER on the adaptation set was minimized.

4.2.3 Self-attentive EEND system

Here, we used the same input features as were input to the BLSTM-EEND system. Note that the sequence length at the training stage was limited to 500 (50 seconds in audio time) because our system uses more memory than the BLSTM-based network does. Therefore, we split the input audio recordings into non-overlapping 50-second segments. At the inference stage, we used the entire sequence for each recording.

We used two encoder blocks with 256 attention units containing four heads (P = 2, D = 256, H = 4). We used 1024 internal units in the position-wise feed-forward layer (d_ff = 1024). We used the Adam optimizer with the learning rate scheduler introduced in [48]; the number of warm-up steps was 25,000. The batch size was 64. The number of training epochs was 100. After 100 epochs, we used an averaged model obtained by averaging the model parameters of the last 10 epochs. As with the BLSTM-EEND system, we applied 11-frame median filtering.
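The learning-rate scheduler of [48] scales the rate by the model dimension and the number of warm-up steps; a sketch with the values used here (D = 256, 25,000 warm-up steps), written as a standalone function with our own name:

```python
def transformer_lr(step, d_model=256, warmup=25000):
    """Learning-rate schedule from [48]: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```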

For domain adaptation, the averaged model was retrained using the CALLHOME adaptation set. We used the Adam optimizer and ran 100 epochs. After 100 epochs, we again used an averaged model obtained by averaging the model parameters of the last 10 epochs.

4.3 Performance metric

We evaluated the systems with the diarization error rate (DER) [33]. Note that the DERs reported in many prior studies do not include misses or false alarm errors, because those studies use oracle speech/non-speech labels; overlapping speech segments are also often excluded from the evaluation. For our DER computation, we evaluated all of the errors, including overlapping speech segments, because the proposed method includes both speech activity detection and overlapping speech detection functionality. As is typically done, we used a collar tolerance of 250 ms at the start and end of each segment.
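For intuition only, a simplified frame-level DER computation is sketched below; it ignores the 250-ms collar and assumes the hypothesis speakers are already optimally mapped to the reference speakers, whereas the actual scoring follows the NIST tooling [33].

```python
import numpy as np

def frame_level_der(ref, hyp):
    """ref, hyp: (T, S) binary speaker-activity matrices with a common speaker order.
    DER = (misses + false alarms + confusions) / total reference speaker time."""
    n_ref = ref.sum(axis=1)              # reference speakers per frame
    n_hyp = hyp.sum(axis=1)              # hypothesis speakers per frame
    n_correct = (ref * hyp).sum(axis=1)  # correctly attributed speakers per frame
    miss = np.maximum(n_ref - n_hyp, 0).sum()
    fa = np.maximum(n_hyp - n_ref, 0).sum()
    conf = (np.minimum(n_ref, n_hyp) - n_correct).sum()
    return (miss + fa + conf) / max(n_ref.sum(), 1)
```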

5 Results

                          Simulated                      Real
Method                    β = 2     β = 3     β = 5      CH       CSJ
Clustering-based
  i-vector                33.74     30.93     25.96      12.10    27.99
  x-vector                28.77     24.46     19.78      11.53    22.96
BLSTM-EEND
  trained with sim.       12.28     14.36     19.69      26.03    39.33
  trained with real       36.23     37.78     40.34      23.07    25.37
SA-EEND
  trained with sim.        7.91      8.51      9.51      13.66    22.31
  trained with real       32.72     33.84     36.78      10.76    20.50
Table 2: DERs (%) on various test sets. For EEND systems, the CALLHOME (CH) results are obtained with domain adaptation.
Method                    w/o adaptation    with adaptation
x-vector clustering       11.53             N/A
BLSTM-EEND
  trained with sim.       43.84             26.03
  trained with real       31.01             23.07
SA-EEND
  trained with sim.       17.42             13.66
  trained with real       12.66             10.76
Table 3: DERs (%) on the CALLHOME with and without domain adaptation.
                            DER breakdown             SAD errors
Method            DER       MI     FA     CF          MI     FA
i-vector          12.10     7.74   0.54   3.82        1.4    0.5
x-vector          11.53     7.74   0.54   3.25        1.4    0.5
SA-EEND
  no-adapt        12.66     7.42   3.93   1.31        3.3    0.6
  adapted         10.76     6.68   2.40   1.68        2.3    0.5
Table 4: Detailed DERs (%) evaluated on the CALLHOME. DER is composed of Misses (MI), False alarms (FA), and Confusion errors (CF). The SAD errors are composed of Misses (MI) and False alarms (FA).
Figure 3: Attention weight matrices at the second encoder block. The input was the CALLHOME test set (recording id: iagk). The model was trained with the real training set followed by domain adaptation. The top two rows show the reference speech activity of two speakers.

5.1 Evaluation on simulated mixtures

DERs on various test sets are shown in Table 2. The clustering-based systems performed poorly on heavily overlapped simulated mixtures. This result is within our expectations, because the clustering-based systems did not consider speaker overlaps; there are more misses when the overlap ratio is high.

The BLSTM-EEND system trained with the simulated training set showed a significant DER reduction compared with the clustering-based systems on the simulated mixtures. Among the differing overlap ratios, it showed the best performance on the highest overlap ratio condition (β = 2): the BLSTM-EEND system worked well on the overlapping condition matched with its training data.

The proposed system, SA-EEND, trained with the simulated training set, achieved significantly lower DERs than the BLSTM-EEND system on every test set. Like the BLSTM-EEND system, it showed the best performance on the highest overlap ratio condition (β = 2). However, its DER degradation on the less overlapping conditions was smaller than that of the BLSTM-EEND system, which indicates that the self-attention blocks improved robustness to variable overlapping conditions.

5.2 Evaluation on real test sets

In contrast to its good performance on the simulated mixtures, the BLSTM-EEND system had DERs inferior to those of the clustering-based systems on the real test sets. Although the BLSTM-EEND system showed performance improvements when the training data were switched from simulated to real data, its DERs were still higher than those of the clustering-based systems.

The proposed SA-EEND system trained with the simulated training set showed remarkable improvements on the real CALLHOME and CSJ datasets, which indicates the strong generalization capability of the self-attention blocks. For the CSJ, even without domain adaptation, the proposed system performed better than the x-vector clustering-based method.

The SA-EEND system trained with the real training set performed best on the real test sets; however, it had poor DERs on the simulated mixtures. We attribute this result to the small number of mixtures and the low overlap ratio of the real training set. The result would likely improve with more real data containing more speaker overlaps, or by combining real and simulated training data.

5.3 Effect of domain adaptation

The EEND models trained with the simulated training set were overfitted to the specific overlap ratio of that training set. We expected this overfitting to be mitigated by domain adaptation. DERs on the CALLHOME with and without domain adaptation are shown in Table 3. As expected, domain adaptation significantly reduced the DER; our system thus achieved even better results than the x-vector-based system.

A detailed DER comparison on the CALLHOME test set is shown in Table 4. The clustering-based systems had few SAD errors thanks to the robust SAD model trained with various noise-augmented data. However, they produced numerous misses and confusion errors because they do not handle speaker overlaps. Compared with the clustering-based systems, the proposed method produced significantly fewer confusion and miss errors. Domain adaptation reduced all error types except confusion errors.

5.4 Visualization of self-attention

To analyze the behavior of the self-attention mechanism in our diarization system, Fig. 3 visualizes the attention weight matrices at the second encoder block, corresponding to A_h^{(2)} in Eq. 8. Here, heads 1 and 2 have vertical lines at different positions. The vertical lines correspond to each speaker's activity: an attention weight matrix with such vertical lines transforms the input features into a weighted mean of the frames of the same speaker. These heads actually captured the global speaker characteristics by computing the similarity between distant frames. Interestingly, heads 3 and 4 look like identity matrices, which results in position-independent linear transforms; these heads are considered to work for speech/non-speech detection. We conclude that the multi-head self-attention mechanism captures global speaker characteristics in addition to local speech activity dynamics, which leads to a reduction in DER. Experiments on various combinations of the number of heads and the number of speakers would be interesting future work.
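Plots like Fig. 3 can be produced by imaging the T×T attention weight matrices of Eq. (8) for each head; a matplotlib sketch (extracting the matrices from a trained model is assumed to happen elsewhere, and the function name is ours):

```python
import matplotlib.pyplot as plt

def plot_attention_heads(attention, out_path="attention.png"):
    """attention: list of (T, T) weight matrices, one per head (Eq. 8)."""
    fig, axes = plt.subplots(1, len(attention), figsize=(4 * len(attention), 4))
    for h, (ax, A) in enumerate(zip(axes, attention), start=1):
        ax.imshow(A, aspect="auto", origin="lower", cmap="viridis")
        ax.set_title(f"head {h}")
        ax.set_xlabel("key frame")
        ax.set_ylabel("query frame")
    fig.tight_layout()
    fig.savefig(out_path)
```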

6 Conclusion

We incorporated a self-attention mechanism into the end-to-end neural diarization model. We evaluated our model on simulated mixtures and two real datasets. The experimental results showed that the self-attention mechanism significantly reduced DERs and generalized better than a BLSTM-based neural diarization system. The self-attention-based system even outperformed the x-vector clustering-based systems. We also showed, by visualizing the latent representation, that the self-attention blocks actually capture global speaker characteristics.

References

  • [1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals (2012) Speaker diarization: a review of recent research. IEEE Trans. on ASLP 20 (2), pp. 356–370. External Links: Document, ISSN 1558-7916 Cited by: §1.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: §2.1.
  • [3] J. Barker, S. Watanabe, E. Vincent, and J. Trmal (2018) The fifth ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines. In Proc. Interspeech, pp. 1561–1565. External Links: Document Cited by: §1.
  • [4] C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach (2018) Front-End Processing for the CHiME-5 Dinner Party Scenario. In Proc. CHiME-5, pp. 35–40. Cited by: §1.
  • [5] Ö. Çetin and E. Shriberg (2006) Overlap in meetings: ASR effects and analysis by dialog factors, speakers, and collection site. In Proc. MLMI, pp. 212–224. Cited by: §1.
  • [6] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964. Cited by: §2.1.
  • [7] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Proc. NIPS, pp. 577–585. Cited by: §2.1.
  • [8] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Trans. on ASLP 19 (4), pp. 788–798. External Links: Document, ISSN 1558-7916 Cited by: §1.
  • [9] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Z̆molíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, L. Mos̆ner, and P. Matĕjka (2018) BUT system for DIHARD speech diarization challenge 2018. In Proc. Interspeech, pp. 2798–2802. Cited by: §1, §2.1.
  • [10] D. Dimitriadis and P. Fousek (2017) Developing on-line speaker diarization system. In Proc. Interspeech, pp. 2739–2743. External Links: Document Cited by: §1.
  • [11] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. Proc. ICASSP, pp. 5884–5888. Cited by: §2.3, §3.2.
  • [12] J. Du, T. Gao, L. Sun, F. Ma, Y. Fang, D. Liu, Q. Zhang, X. Zhang, H. Wang, J. Pan, J. Gao, C. Lee, and J. Chen (2018) The USTC-iFlytek Systems for CHiME-5 Challenge. In Proc. CHiME-5, pp. 11–15. Cited by: §1.
  • [13] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe (2019) End-to-end neural speaker diarization with permutation-free objectives. In Proc. Interspeech, Cited by: §1, §1, §2.2, §3.1, §3.2, §4.2.2.
  • [14] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree (2017) Speaker diarization using deep neural network embeddings. In Proc. ICASSP, Vol. , pp. 4930–4934. External Links: Document, ISSN 2379-190X Cited by: §1, §2.1.
  • [15] A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18 (5), pp. 602 – 610. Note: IJCNN 2005 External Links: ISSN 0893-6080, Document, Link Cited by: §1.
  • [16] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. ICASSP, Vol. , pp. 31–35. External Links: Document, ISSN 2379-190X Cited by: §3.3, §4.1.1.
  • [17] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters (2003) The ICSI meeting corpus. In Proc. ICASSP, Vol. I, pp. 364–367. External Links: Document, ISSN 1520-6149 Cited by: §1.
  • [18] W. Jun and L. Shengchen (2018) SELF-attention mechanism based system for DCASE2018 challenge task1 and task4. In DCASE2018 Challenge, Cited by: §2.3.
  • [19] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, and S. Watanabe (2019) Acoustic modeling for distant multi-talker speech recognition with single- and multi-channel branches. In Proc. ICASSP, pp. 6630–6634. Cited by: §1.
  • [20] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Yalta Soplin, M. Maciejewski, S. Chen, A. S. Subramanian, R. Li, Z. Wang, J. Naradowsky, L. P. Garcia-Perera, and G. Sell (2018) Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays. In Proc. CHiME-5, pp. 6–10. Cited by: §1.
  • [21] D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In Proc. ICLR, Cited by: §4.2.2.
  • [22] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In Proc. ICASSP, Vol. , pp. 5220–5224. External Links: Document, ISSN 2379-190X Cited by: §4.1.1.
  • [23] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen (2017) Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. on ASLP 25 (10), pp. 1901–1913. External Links: Document, ISSN 2329-9290 Cited by: §3.3.
  • [24] J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.1.
  • [25] Z. Lin, M. Feng, C. Nogueira dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In Proc. ICLR, Cited by: §1, §2.3.
  • [26] M. Maciejewski, D. Snyder, V. Manohar, N. Dehak, and S. Khudanpur (2018) Characterizing performance of speaker diarization systems on far-field speech using standard methods. In Proc. ICASSP, pp. 5244–5248. Cited by: §1.
  • [27] K. Maekawa (2003) Corpus of Spontaneous Japanese: its design and evaluation. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Cited by: §4.1.2, Table 1.
  • [28] P. A. Mansfield, Q. Wang, C. Downey, L. Wan, and I. L. Moreno (2018) Links: a high-dimensional online clustering method. arXiv preprint arXiv:1801.10123. Cited by: §1.
  • [29] S. Meignier (2010) LIUM_SPKDIARIZATION: an open source toolkit for diarization. In CMU SPUD Workshop, Cited by: §1.
  • [30] V. A. Miasato Filho, D. A. Silva, and L. G. Depra Cuozzo (2018) Joint discriminative embedding learning, speech activity and overlap detection for the dihard speaker diarization challenge. In Proc. Interspeech, pp. 2818–2822. External Links: Document, Link Cited by: §2.1.
  • [31] V. S. Narayanaswamy, J. J. Thiagarajan, H. Song, and A. Spanias (2019) Designing an effective metric learning pipeline for speaker diarization. In Proc. ICASSP, pp. 5806–5810. External Links: Document, ISSN 2379-190X Cited by: §2.1, §2.3.
  • [32] NIST (2000) 2000 speaker recognition evaluation plan. Note: https://www.nist.gov/sites/default/files/documents/2017/09/26/spk-2000-plan-v1.0.htm_.pdf Cited by: §4.1.2, Table 1.
  • [33] NIST (2009) The 2009 (RT-09) rich transcription meeting recognition evaluation plan. Note: http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf Cited by: §4.3.
  • [34] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely (2011) The Kaldi speech recognition toolkit. In Proc. ASRU, Cited by: §4.1.1.
  • [35] S. Renals, T. Hain, and H. Bourlard (2008) Interpretation of multiparty meetings the AMI and Amida projects. In 2008 Hands-Free Speech Communication and Microphone Arrays, Vol. , pp. 115–118. External Links: Document, ISSN Cited by: §1.
  • [36] G. Sell and D. Garcia-Romero (2014) Speaker diarization with PLDA i-vector scoring and unsupervised calibration. In Proc. SLT, Vol. , pp. 413–417. External Links: Document, ISSN Cited by: §1.
  • [37] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proc. Interspeech, pp. 2808–2812. External Links: Document Cited by: §1, §2.1, §4.2.1.
  • [38] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel (2014) A study of the cosine distance-based mean shift for telephone speech diarization. IEEE/ACM Trans. on ASLP 22 (1), pp. 217–227. External Links: Document, ISSN 2329-9290 Cited by: §1.
  • [39] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass (2013) Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. on ASLP 21 (10), pp. 2015–2028. External Links: Document, ISSN 1558-7916 Cited by: §1.
  • [40] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur (2019) Speaker recognition for multi-speaker conversations using x-vectors. In Proc. ICASSP, pp. 5796–5800. External Links: Document, ISSN 2379-190X Cited by: §2.1.
  • [41] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In Proc. ICASSP, Vol. , pp. 5329–5333. External Links: Document, ISSN 2379-190X Cited by: §1.
  • [42] D. Snyder, G. Chen, and D. Povey (2015) MUSAN: a music, speech, and noise corpus. arXiv preprints arXiv:1510.08484. Cited by: §4.1.1.
  • [43] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio (2017) Char2Wav: end-to-end speech synthesis. In ICLR Workshop, Cited by: §2.1.
  • [44] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel (2018) Self-attentional acoustic models. In Proc. Interspeech, pp. 3723–3727. Cited by: §2.3.
  • [45] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C. Lee (2018) Speaker diarization with enhancing speech for the first DIHARD challenge. In Proc. Interspeech, pp. 2793–2797. Cited by: §1, §2.3.
  • [46] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112. Cited by: §2.1.
  • [47] S. E. Tranter and D. A. Reynolds (2006) An overview of automatic speaker diarization systems. IEEE Trans. on ASLP 14 (5), pp. 1557–1565. External Links: Document, ISSN 1558-7916 Cited by: §1.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NIPS, pp. 5998–6008. Cited by: §1, §2.3, §4.2.3.
  • [49] L. Wan, Q. Wang, A. Papir, and I. L. Moreno (2018) Generalized end-to-end loss for speaker verification. In Proc. ICASSP, Vol. , pp. 4879–4883. External Links: Document, ISSN 2379-190X Cited by: §1.
  • [50] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno (2018) Speaker diarization with LSTM. In Proc. ICASSP, Vol. , pp. 5239–5243. External Links: Document, ISSN 2379-190X Cited by: §1.
  • [51] X. Wang, R. B. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proc. CVPR, pp. 7794–7803. Cited by: §2.3.
  • [52] Y. Wang, R.J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech, pp. 4006–4010. Cited by: §2.1.
  • [53] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. Cited by: §2.1.
  • [54] L. Ye, M. Rochan, Z. Liu, and Y. Wang (2019) Cross-modal self-attention network for referring image segmentation. In Proc. CVPR, pp. 10502–10511. Cited by: §2.3.
  • [55] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, Vol. , pp. 241–245. External Links: Document, ISSN 2379-190X Cited by: §3.3.
  • [56] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang (2019) Fully supervised speaker diarization. In Proc. ICASSP, pp. 6301–6305. Cited by: §2.2.
  • [57] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey (2018) Self-attentive speaker embeddings for text-independent speaker verification. In Proc. Interspeech, pp. 3573–3577. External Links: Document, Link Cited by: §2.3.