Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

09/17/2019
by   Naoyuki Kanda, et al.

This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding. To remove this limitation, we propose an iterative method in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using very challenging dialogue recordings in which the speaker overlap ratio was over 20%, and measured both word error rate (WER) and diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from the WER of TS-ASR given oracle speaker embeddings. Moreover, our method can solve speaker diarization simultaneously as a by-product and achieved a better DER than that of the conventional clustering-based speaker diarization method based on i-vectors.


1 Introduction

Our main goal is to develop a monaural conversation transcription system that can not only perform automatic speech recognition (ASR) of multiple talkers but also determine who spoke when, a task known as speaker diarization [37, 8]. For both ASR and speaker diarization, the main difficulty comes from speaker overlaps. For example, a speaker-overlap ratio of about 15% was reported for real meeting recordings [43]. For such overlapped speech, neither conventional ASR nor conventional speaker diarization provides results with sufficient accuracy. It is known that mixing the speech of two speakers significantly degrades ASR accuracy [44, 6, 17]. In addition, most conventional speaker diarization techniques, such as clustering of speech partitions (e.g., [37, 25, 31, 33, 40]), assume that there are no speaker overlaps and work well only when that assumption holds. Due to these difficulties, performing ASR and speaker diarization on monaural recordings of conversation remains very challenging.

One solution to the speaker-overlap problem is applying a speech-separation method such as deep clustering [12] or deep attractor network [4]. However, a major drawback of such a method is that the training criteria for speech separation do not necessarily maximize the accuracy of the final target tasks. For example, if the goal is ASR, it will be better to use training criteria that directly maximize ASR accuracy.

In one line of research using ASR-based training criteria, multi-speaker ASR based on permutation invariant training (PIT) has been proposed [44, 3, 34, 30, 2]. With PIT, the label-permutation problem is solved by considering all possible permutations when calculating the loss function [45]. PIT was first proposed for speech separation [45] and soon extended to ASR loss with promising results [44, 3, 34, 30, 2]. However, a PIT-ASR model produces transcriptions for each speaker's utterance in an unordered manner, so it is no longer straightforward to resolve speaker permutations across utterances. To make things worse, a PIT model trained with ASR-based loss normally does not produce separated speech waveforms, which makes speaker tracing more difficult.

In another line of research, target-speaker (TS) ASR, which automatically extracts and transcribes only the target speaker’s utterances given a short sample of that speaker’s speech, has been proposed [47, 6]. Žmolíková et al. proposed a target-speaker neural beamformer that extracts a target speaker’s utterances given a short sample of that speaker’s speech [47]. This model was recently extended to handle ASR-based loss to maximize ASR accuracy with promising results [6]. TS-ASR can naturally solve the speaker-permutation problem across utterances. Importantly, if we can execute TS-ASR for each speaker correctly, speaker diarization is solved at the same time just by extracting the start and end time information of the TS-ASR result. However, one obvious drawback of TS-ASR is that it cannot be applied when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding.

Based on this background, we propose a speech recognition and speaker diarization method that is based on TS-ASR but can be applied without knowing the speaker information in advance. To remove the limitation of TS-ASR, we propose an iterative method in which (i) the estimation of target-speaker embeddings and (ii) TS-ASR based on the estimated embeddings are alternately executed. As an initial trial, we evaluated the proposed method by using real dialogue recordings in the Corpus of Spontaneous Japanese (CSJ). Although each recording contains the speech of only two speakers, the speaker-overlap ratio of the dialogue speech is very high (20.1%). Thus, the data are very challenging even for state-of-the-art ASR and speaker diarization. We show that the proposed method effectively reduced both word error rate (WER) and diarization error rate (DER).

Figure 1: (left) Overview of simultaneous speech recognition and speaker diarization, (right) proposed iterative maximization method.

2 Simultaneous ASR and Speaker Diarization

In this section, we first explain the problem we target and then describe the proposed method with reference to Figure 1.

2.1 Problem statement

The overview of the problem is shown in Figure 1 (left). We assume a sequence of observations X = (X_1, ..., X_U), where U is the number of observations and X_u is the u-th observation consisting of a sequence of acoustic features. Such a sequence is naturally generated when we separate a long recording into small segments based on voice activity detection, which is a standard preprocessing step for ASR that avoids generating overly large lattices. We also assume a tuple of word hypotheses W_u = (W_{u,1}, ..., W_{u,S}) for each observation X_u, where S is the number of speakers and W_{u,s} represents the speech-recognition hypothesis of the s-th speaker given observation X_u. We assume W_{u,s} contains not only word sequences but also their corresponding frame-level time alignments of phonemes and silences. Finally, we assume a tuple of speaker embeddings E = (e_1, ..., e_S), where e_s represents the D-dimensional speaker embedding of the s-th speaker.

Then, our objective is to find the most probable hypotheses W = (W_1, ..., W_U) given the sequence of observations X, as follows.

\hat{W} = \arg\max_{W} P(W \mid X)   (1)
        = \arg\max_{W} \sum_{E} P(W, E \mid X)   (2)
        \approx \arg\max_{W} \max_{E} P(W, E \mid X)   (3)

Here, the starting point (Eq. 1) is conventional maximum a posteriori decoding given X, but for multiple speakers. We then introduce the speaker embeddings E as a hidden variable (Eq. 2). Finally, we approximate the summation over E by a max operation (Eq. 3).

Our motivation for introducing E, which is constant across all observation indices u, is to explicitly enforce the order of speakers in W_u to be constant over the indices u. It should be emphasized that if we can solve this problem, speaker diarization is solved at the same time simply by extracting the start and end time information of each hypothesis in W. Also note that there are S! equivalent solutions obtained by permuting the order of speakers in W_u and E, and it is sufficient to find just one of them.

2.2 Iterative maximization

It is not easy to directly solve Eq. 3, so we propose to alternately maximize over W and E. Namely, we first fix E and find the W that maximizes P(W, E | X); we then fix W and find the E that maximizes P(W, E | X). By iterating this procedure, P(W, E | X) can be increased monotonically. Note that, by a simple application of the chain rule, finding the W that maximizes P(W, E | X) with a fixed E is equivalent to finding the W that maximizes P(W | E, X). The same holds for the estimation of E with a fixed W.

For the j-th iteration of the maximization (j = 1, 2, ...), we first find the most plausible estimate E^{(j)} of the speaker embeddings given the speech-recognition hypotheses W^{(j-1)} from the previous iteration as follows.

E^{(j)} = \arg\max_{E} P(E \mid W^{(j-1)}, X)   (4)

Here, the estimation of E^{(j)} depends on W^{(j-1)} for j >= 2. Assuming that overlapped speech corresponds to a "third person" who is different from any person in the recording, Eq. 4 can be approximated by estimating the speaker embeddings only from non-overlapped regions (upper part of Figure 1 (right)). In this study, we used the i-vector [5] as the representation of speaker embeddings and estimated an i-vector for each speaker based only on the non-overlapped regions given W^{(j-1)}. (The idea of extracting speaker embeddings from non-overlapped regions has been proposed before, e.g., [18, 24].) Note that, since we do not have an estimate of W^{(0)} for the first iteration, E^{(1)} is initialized only from X. In this study, we initialized the embeddings from the speech regions estimated by a clustering-based speaker diarization method: more precisely, we estimated an i-vector for each segment and then applied S-cluster K-means clustering. The center of each cluster was used for the initial speaker embeddings E^{(1)}. (Using cluster centers does not strictly follow Eq. 4, but we adopted them for procedural simplicity.)
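For illustration, the sketch below shows this initialization step, assuming per-segment i-vectors are already available as a NumPy array (the i-vector extractor itself is not shown) and using scikit-learn's KMeans in place of the paper's Kaldi-based pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_speaker_embeddings(segment_ivectors: np.ndarray,
                            num_speakers: int = 2,
                            seed: int = 0) -> np.ndarray:
    """Cluster per-segment i-vectors and return the S cluster centers,
    which serve as the initial speaker embeddings E^(1)."""
    km = KMeans(n_clusters=num_speakers, n_init=10, random_state=seed)
    km.fit(segment_ivectors)        # segment_ivectors: (num_segments, dim)
    return km.cluster_centers_      # shape: (num_speakers, dim)
```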

We then update the hypotheses W^{(j)} given the speaker embeddings E^{(j)}.

W^{(j)} = \arg\max_{W} P(W \mid E^{(j)}, X)   (5)
        = \arg\max_{W} \prod_{u} P(W_u \mid E^{(j)}, X_u)   (6)
        = \arg\max_{W} \prod_{u} \prod_{s} P(W_{u,s} \mid e_s^{(j)}, X_u)   (7)

Here, we estimate the most plausible hypotheses W^{(j)} given the estimated embeddings E^{(j)} and observations X (Eq. 5). We then assume that the hypotheses W_u are conditionally independent across segments u (Eq. 6). Finally, we further assume that the hypotheses W_{u,s} are conditionally independent across speakers s (Eq. 7). The final equation can be solved by applying TS-ASR for each segment and each speaker (lower part of Figure 1 (right)). We review the details of TS-ASR in the next section.
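A minimal sketch of the overall alternating procedure (Eqs. 4-7) is given below. The callables ts_asr and estimate_embedding are hypothetical stand-ins for the TS-ASR decoder of Section 3 and the i-vector extractor; a TS-ASR hypothesis is abstracted here as a pair of (words, frame-level speaker-activity mask).

```python
import numpy as np

def iterative_asr_diarization(segments, init_embeddings, ts_asr,
                              estimate_embedding, num_iters=3):
    """Alternating maximization of Section 2.2 (Eqs. 4-7).

    segments           : list of per-segment feature matrices (from VAD)
    init_embeddings    : E^(1), e.g. the K-means cluster centers from above
    ts_asr(x, e)       : hypothetical TS-ASR decoder (Section 3) returning
                         (words, frame_mask), where frame_mask[t] is True when
                         the target speaker is active in frame t
    estimate_embedding : hypothetical extractor mapping a list of
                         (features, frame_mask) pairs to one embedding
    """
    embeddings, hyps = list(init_embeddings), None
    for _ in range(num_iters):
        if hyps is not None:
            # (i) Re-estimate each speaker's embedding from non-overlapped
            #     frames only, i.e. frames where no other speaker is active (Eq. 4).
            new_embeddings = []
            for s in range(len(embeddings)):
                solo = []
                for x, seg_hyps in zip(segments, hyps):
                    masks = [m for _, m in seg_hyps]
                    others = np.any([m for i, m in enumerate(masks) if i != s], axis=0)
                    solo.append((x, masks[s] & ~others))
                new_embeddings.append(estimate_embedding(solo))
            embeddings = new_embeddings
        # (ii) TS-ASR for every segment and every speaker with fixed embeddings (Eq. 7).
        hyps = [[ts_asr(x, e) for e in embeddings] for x in segments]
    return hyps, embeddings
```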

3 TS-ASR: Review

3.1 Overview of TS-ASR

TS-ASR is a technique to extract and recognize only the speech of a target speaker given a short sample utterance of that speaker [47, 46, 6]. Originally, the sample utterance was fed into a special neural network that outputs an averaged embedding used to control the weighting of speaker-dependent blocks of the acoustic model (AM). To make the problem simpler, however, we assume that a D-dimensional speaker embedding e is extracted from the sample utterance. In this context, TS-ASR can be expressed as the problem of finding the best hypothesis W given an observation X and a speaker embedding e as follows.

\hat{W} = \arg\max_{W} P(W \mid e, X)   (8)

If we have a well-trained TS-ASR model, Eq. 7 can be solved simply by applying TS-ASR to each segment X_u for each speaker s with embedding e_s.

3.2 TS-AM with auxiliary output network

3.2.1 Overview

Although any speech recognition architecture can be used for TS-ASR, we adopted a variant of the TS-AM that was recently proposed and has promising accuracy [17]. Figure 2 describes the TS-AM that we applied for this study. This model has two input branches. One branch accepts acoustic features as a normal AM while the other branch accepts an embedding that represents the characteristics of the target speaker. In this study, we used a log Mel-filterbank (FBANK) and i-vector [5, 29] for the acoustic features and target-speaker embedding, respectively.

A unique component of the model is its output branch structure. The model has two output branches: one produces the outputs used by the loss function for the target speaker, and the other produces the outputs used by the loss function for interference speakers. The loss for the target speaker is defined to maximize the target-speaker ASR accuracy, while the loss for interference speakers is defined to maximize the interference-speaker ASR accuracy. We used lattice-free maximum mutual information (LF-MMI) [28] for both criteria.

Figure 2: Overview of target-speaker AM architecture with auxiliary interference speaker loss [17]. A number with an arrow indicates a time splicing index, which forms the basis of a time-delay neural network (TDNN) [26]. The input features were advanced by five frames, which has the same effect as reference label delay.

The original motivation for the interference-speaker output branch was to improve TS-ASR by encouraging the shared layers to learn a better representation for speaker separation. However, it was also shown that the interference-speaker output branch can be used for secondary ASR of interference speakers given the embedding of the target speaker [17]. In this paper, we found that the latter property works very well for ASR of dialogue recordings, as explained in the evaluation section.

The network is trained with a mixture of multi-speaker speech given their transcriptions. We assume that, for each training sample, (a) transcriptions of at least two speakers are given, (b) the transcription for the target speaker is marked so that we can identify the target speaker’s transcription, and (c) a sample for the target speaker can be used to extract speaker embeddings. These assumptions can be easily satisfied by artificially generating training data by mixing the speech of multiple speakers.
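As a rough illustration of this two-input, two-output structure, a PyTorch-style sketch follows. It is not the authors' Kaldi-based CNN-TDNN-LSTM recipe: the layer sizes, the simple frame-level cross-entropy stand-in for LF-MMI, and the value of the scaling factor alpha are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class TargetSpeakerAM(nn.Module):
    """Sketch of a TS-AM with an acoustic-feature branch, a speaker-embedding
    branch, shared layers, and two output branches (target / interference)."""
    def __init__(self, feat_dim=40, spk_dim=100, hidden=512, num_pdfs=3000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden)   # acoustic input branch
        self.spk_proj = nn.Linear(spk_dim, hidden)     # speaker-embedding branch
        self.shared = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out_target = nn.Linear(hidden, num_pdfs)        # main output branch
        self.out_interference = nn.Linear(hidden, num_pdfs)  # auxiliary output branch

    def forward(self, feats, spk_emb):
        # feats: (B, T, feat_dim); spk_emb: (B, spk_dim) of the target speaker
        x = self.feat_proj(feats) + self.spk_proj(spk_emb).unsqueeze(1)
        h, _ = self.shared(x)
        return self.out_target(h), self.out_interference(h)

# Combined loss (cf. Eq. 12): L = L_tgt + alpha * L_int; alpha = 0 recovers plain
# TS-ASR. The paper uses LF-MMI; a frame-level cross-entropy stands in here.
def combined_loss(y_tgt, y_int, lab_tgt, lab_int, alpha=0.5):
    ce = nn.CrossEntropyLoss()
    return ce(y_tgt.transpose(1, 2), lab_tgt) + alpha * ce(y_int.transpose(1, 2), lab_int)
```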

3.2.2 Loss function

The main loss function for the target speaker is defined as

\mathcal{L}_{\mathrm{tgt}} = \sum_{n} \log \frac{P(X_n \mid \mathbb{G}^{\mathrm{num}}_{n,\mathrm{tgt}})}{P(X_n \mid \mathbb{G}^{\mathrm{den}})}   (9)
P(X_n \mid \mathbb{G}) = \sum_{s \in \mathbb{G}} P(X_n \mid s)\, P(s)   (10)

where n here denotes the index of training samples. The term G^{num}_{n,tgt} indicates a numerator (or reference) graph that represents the set of possible correct state sequences for the utterance of the target speaker in the n-th training sample, s denotes a hypothesis state sequence for the n-th training sample, and G^{den} denotes a denominator graph, which represents the possible hypothesis space and normally consists of a 4-gram phone language model in LF-MMI training [28].

The auxiliary interference speaker loss is then defined to maximize the interference-speaker ASR accuracy, which we expect to enhance the speaker separation ability of the neural network. This loss is defined as

\mathcal{L}_{\mathrm{int}} = \sum_{n} \log \frac{P(X_n \mid \mathbb{G}^{\mathrm{num}}_{n,\mathrm{int}})}{P(X_n \mid \mathbb{G}^{\mathrm{den}})}   (11)

where G^{num}_{n,int} denotes a numerator (or reference) graph that represents the set of possible correct state sequences for the utterance of the interference speaker in the n-th training sample.

Finally, the loss function for training is defined as the combination of the target and interference losses,

\mathcal{L} = \mathcal{L}_{\mathrm{tgt}} + \alpha\, \mathcal{L}_{\mathrm{int}}   (12)

where alpha is the scaling factor for the auxiliary loss; setting alpha = 0 corresponds to normal TS-ASR without the auxiliary loss. In our evaluation, we compared models trained with and without the auxiliary loss.

4 Evaluation

4.1 Experimental settings

4.1.1 Main evaluation data: real dialogue recordings

We conducted our experiments on the CSJ [23], which is one of the most widely used evaluation sets for Japanese speech recognition. The CSJ consists of more than 600 hrs of Japanese recordings.

While most of the content consists of lecture recordings by a single speaker, CSJ also contains 11.5 hrs of 54 dialogue recordings (12.8 min per recording on average) with two speakers, which were the main target of ASR and speaker diarization in this study. (We excluded 4 of the 58 dialogue recordings whose speakers overlap with the official E1, E2, and E3 evaluation sets.) During the dialogue recordings, the two speakers sat in adjacent soundproof chambers divided by a glass window and talked with each other over a voice connection through headsets. Therefore, the speech was recorded separately for each speaker, and we generated mixed monaural recordings by mixing the corresponding recordings of the two speakers without any normalization of speech volume. Thanks to this recording procedure, we were also able to use the non-overlapped speech to evaluate oracle WERs.

It should be noted that, although the dialogue involves only two speakers, the speaker-overlap ratio of the recordings is very high due to frequent backchannels and natural turn-taking. Over all recordings, 16.7% of the total duration was overlapped speech, 66.4% was spoken by a single speaker, and the remaining 16.9% was silence. Therefore, 20.1% (= 16.7 / (16.7 + 66.4)) of the speech regions contained speaker overlap. From the viewpoint of ASR, 33.5% (= (16.7 * 2) / (16.7 * 2 + 66.4)) of the total duration to be recognized was overlapped. These values are even higher than those reported for meetings with more than two speakers [1, 43], making these dialogue recordings very challenging for both ASR and speaker diarization; correspondingly, we observed very high WERs and DERs, as discussed in the following sections.

4.1.2 Sub evaluation data: simulated 2-speaker mixture

To evaluate TS-ASR, we also used simulated two-speaker mixtures created from the three official single-speaker evaluation sets of CSJ, i.e., E1, E2, and E3 [20]. Each set consists of 10 lectures (5.6 hrs and 30 lectures in total): E1 contains 10 lectures by 10 male speakers, while E2 and E3 each contain 10 lectures by 5 female and 5 male speakers. We generated two-speaker mixed speech by adding a randomly selected utterance (the interference-speaker speech) to the original utterance (the target-speaker speech) under the constraints that the target and interference speakers were different and that each interference utterance was selected only once from the dataset. When mixing the two utterances, we set them to the same power level, and the shorter utterance was mixed into the longer one from a random starting point chosen so that the end point of the shorter utterance did not exceed that of the longer one.
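A minimal sketch of this equal-power mixing, under the assumption that "same power level" means scaling the shorter utterance to match the average power of the longer one, could look as follows (the actual data-preparation scripts are not part of the paper).

```python
import numpy as np

def mix_equal_power(target: np.ndarray, interference: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Mix two mono waveforms at equal power; the shorter one starts at a random
    offset such that it ends no later than the longer one."""
    longer, shorter = (target, interference) if len(target) >= len(interference) \
                      else (interference, target)
    # scale the shorter signal so both have the same average power
    p_long = np.mean(longer ** 2)
    p_short = np.mean(shorter ** 2) + 1e-12
    shorter = shorter * np.sqrt(p_long / p_short)
    start = rng.integers(0, len(longer) - len(shorter) + 1)
    mixed = longer.copy()
    mixed[start:start + len(shorter)] += shorter
    return mixed
```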

4.1.3 Training data and training settings

The remaining 571 hrs of lecture recordings (3,207 lectures, excluding lectures by the speakers in the evaluation sets) were used for AM and language model (LM) training. We generated two-speaker mixed speech for the training data in accordance with the following protocol.

  1. Prepare a list of speech samples (= main list).

  2. Shuffle the main list to create a second list under the constraint that the same speaker does not appear in the same line in the main and second lists.

  3. Mix the audio in the main and second lists one-by-one with a specific signal-to-interference ratio (SIR). For training data, we randomly sampled an SIR as follows.

    • With probability 1/3, sample the SIR from a uniform distribution between -10 and 10 dB.

    • With probability 1/3, sample the SIR from a uniform distribution between 10 and 60 dB. In this case, the transcription of the interference speaker was set to null.

    • With probability 1/3, sample the SIR from a uniform distribution between -60 and -10 dB. In this case, the transcription of the target speaker was set to null.

  4. Randomly change the volume of each mixed speech to enhance robustness against volume differences.

An utterance for extracting the target-speaker embedding was also randomly selected from the main list for each speech mixture. Note that the random volume perturbation was applied only to the training data, not to the evaluation data.
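The sketch below illustrates the SIR-sampling part of this protocol for a single pair of waveforms; the function name, the padding of the interference signal to the target length, and the return format are assumptions for illustration, not the authors' scripts.

```python
import numpy as np

def sample_training_mixture(main: np.ndarray, second: np.ndarray,
                            rng: np.random.Generator):
    """Sample an SIR from one of three ranges (each with probability 1/3),
    mix the two waveforms accordingly, and indicate whose transcription is kept."""
    case = rng.integers(0, 3)
    if case == 0:
        sir_db = rng.uniform(-10.0, 10.0)      # both transcriptions kept
        keep_target, keep_interference = True, True
    elif case == 1:
        sir_db = rng.uniform(10.0, 60.0)       # interference transcription -> null
        keep_target, keep_interference = True, False
    else:
        sir_db = rng.uniform(-60.0, -10.0)     # target transcription -> null
        keep_target, keep_interference = False, True

    # pad/truncate the interference signal to the target length (simplification)
    interf = np.resize(second, len(main))
    # scale the interference so that 10*log10(P_target / P_interf) == sir_db
    p_t = np.mean(main ** 2)
    p_i = np.mean(interf ** 2) + 1e-12
    interf = interf * np.sqrt(p_t / (p_i * 10 ** (sir_db / 10.0)))
    return main + interf, keep_target, keep_interference
```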

We trained a TS-AM consisting of a convolutional neural network (CNN), a time-delay NN (TDNN) [38], and long short-term memory (LSTM) layers [13], as shown in Fig. 2. The input acoustic feature for the network was a 40-dimensional FBANK without normalization. A 100-dimensional i-vector was extracted and used as the target-speaker embedding to indicate the target speaker; for extracting this i-vector, we randomly selected an utterance of the same speaker. We conducted 8 epochs of LF-MMI training, where the initial learning rate was set to 0.001 and exponentially decayed to 0.0001 by the end of training. We applied L2-regularization and CE-regularization [28] with scales of 0.00005 and 0.1, respectively. The leaky hidden Markov model coefficient was set to 0.1. A backstitch technique [41] with a backstitch scale of 1.0 and a backstitch interval of 4 was also used.
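For reference, a log Mel-filterbank front end similar to the one described above could be sketched as follows; the 25 ms window and 10 ms shift are typical ASR settings assumed here, whereas the paper uses the Kaldi front end.

```python
import numpy as np
import librosa

def log_mel_fbank(wav_path: str, sr: int = 16000, n_mels: int = 40) -> np.ndarray:
    """Compute 40-dimensional log Mel-filterbank features, shape (frames, n_mels).
    25 ms window / 10 ms shift are assumed typical settings, not taken from the paper."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=n_mels)
    return np.log(mel + 1e-10).T
```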

For comparison, we trained another TS-AM without the auxiliary loss. We also trained a “clean AM” using clean, non-speaker-mixed speech. For this clean model, we used a model architecture without the auxiliary output branch, and an i-vector was extracted every 100 msec for online speaker/environment adaptation.

In decoding, we used a 4-gram LM trained using the transcription of the training data. All our experiments were conducted on the basis of the Kaldi toolkit [27].

4.2 Preliminary experiment with simulated 2-speaker mixture

4.2.1 Evaluation of TS-ASR

We first evaluated the TS-AM with the two-speaker mixtures of the E1, E2, and E3 evaluation sets. For each test utterance, a sample of the target speaker was randomly selected from the other utterances in the test set. We used the same random seed over all experiments so that they were conducted under the same conditions.

Model | Evaluation Data | E1 | E2 | E3 | Avg.
Clean AM | 1-spk. | 8.94 | 7.31 | 7.44 | 7.90
Clean AM | 2-spk. mixed | 87.60 | 85.44 | 91.05 | 88.03
TS-AM (w/o auxiliary loss) | 2-spk. mixed | 26.01 | 18.16 | 18.16 | 20.78
TS-AM (w/ auxiliary loss) | 2-spk. mixed | 25.68 | 17.94 | 17.96 | 20.53
Table 1: WERs (%) for two-speaker-mixed evaluation sets of CSJ.

The results are listed in Table 1. Although the clean AM produced a WER of 7.90% on the original clean data, the WER degraded severely to 88.03% when two speakers were mixed. The TS-AM without the auxiliary loss recovered the WER to 20.78%. Although the improvement was not as large as that reported in [17], the auxiliary loss further improved the WER to 20.53%. Note that E1 contains only male speakers while E2 and E3 contain both female and male speakers; because of this, E1 showed a larger degradation of WER when two speakers were mixed.

Test set | Target spk. | Interference spk.
E1 (10 male) | 25.68 | 26.91
E2 (5 female, 5 male) | 17.94 | 18.46
E3 (5 female, 5 male) | 17.96 | 18.36
Avg. | 20.53 | 21.24
Table 2: WERs (%) for two-speaker-mixed evaluation sets of CSJ. The main output branch was used for target-speaker ASR and the auxiliary output branch was used for interference-speaker ASR.
# | Speaker Embeddings (Init. / Update) | AM | Evaluation Data | Different Gender Pair | Same Gender Pair | Total
1 | - / - | Clean AM | 1-spk. | 18.49 | 21.14 | 19.93 †
2 | Oracle / - | Clean AM w/ e1 & Clean AM w/ e2 | 2-spk. mixed | 94.46 | 94.01 | 94.22 †
3 | Oracle / - | TS-AM (tgt) w/ e1 & TS-AM (tgt) w/ e2 | 2-spk. mixed | 26.83 | 47.33 | 37.96 †
4 | Oracle / - | TS-AM (tgt) w/ e1 & TS-AM (int) w/ e1 | 2-spk. mixed | 25.99 | 53.80 | 41.09 †
5 | K-means / - | TS-AM (tgt) w/ e1 & TS-AM (tgt) w/ e2 | 2-spk. mixed | 40.99 | 64.97 | 54.01
6 | K-means / - | TS-AM (tgt) w/ e1 & TS-AM (int) w/ e1 | 2-spk. mixed | 30.00 | 58.61 | 45.54
7 | K-means / x1 | TS-AM (tgt) w/ e1 & TS-AM (int) w/ e1 | 2-spk. mixed | 26.45 | 53.93 | 41.37
8 | K-means / x2 | TS-AM (tgt) w/ e1 & TS-AM (int) w/ e1 | 2-spk. mixed | 25.46 | 52.82 | 40.31
9 | K-means / x3 | TS-AM (tgt) w/ e1 & TS-AM (int) w/ e1 | 2-spk. mixed | 25.20 | 52.50 | 40.03
  • † Result obtained with some oracle information such as non-overlapped evaluation data or oracle speaker embeddings.

Table 3: WERs (%) for dialogue speech in CSJ.
Method | Different Gender Pair | Same Gender Pair | Total
i-vector with K-means | 25.94 | 37.32 | 32.37
# 6 of Table 3 | 15.99 | 37.00 | 27.87
# 9 of Table 3 | 10.76 | 35.30 | 24.63
i-vector with AHC [32] † | 14.34 | 38.48 | 27.99
x-vector with AHC [32] † | 13.77 | 30.02 | 22.96
  • † Trained using a combination of the Switchboard and NIST SRE datasets.

Table 4: DERs (%) for dialogue speech.
Gender Pair | Miss | False Alarm | Confusion | DER
Different | 9.6 | 0.8 | 0.4 | 10.76
Same | 22.5 | 2.2 | 10.6 | 35.30
Total | 16.9 | 1.6 | 6.2 | 24.63
Table 5: Details of DER (%) for # 9 of Table 3.

4.2.2 Interference-speaker ASR by auxiliary output branch

Before moving to the evaluation of dialogue recordings, we also evaluated the use of the auxiliary output branch to conduct interference-speaker ASR. In this experiment, we provided the target speaker's embedding to the TS-AM and evaluated the WERs of the ASR results produced by the auxiliary output branch. The results are shown in Table 2. We confirmed that the auxiliary output branch worked very well for this secondary ASR, which clearly indicates that the shared layers of the neural network learned to separate speakers. In addition, we found that this secondary ASR can be effectively incorporated into the proposed method, as explained in the next section.

4.3 Experiment with dialogue recordings

Since we confirmed that TS-ASR worked as expected, we then conducted experiments for dialogue recordings, which were the main target of this study.

4.3.1 WER evaluation with oracle non-overlapped speech

We first evaluated the lower limit of WER for the dialogue recordings by using the original non-overlapped recordings (see Section 4.1.1 for the recording settings). We used the non-overlapped recordings with the ground-truth segmentation and conducted ASR with the clean AM. The results are shown in the first row of Table 3. We observed a WER of 19.93%, which is the lower limit for the recordings in this experiment. This WER is worse than those for the lecture recordings (E1, E2, and E3); we observed more substitution errors on backchannels in the dialogue recordings, which are very short and difficult to recognize.

4.3.2 WER evaluation using oracle speaker embeddings

We then conducted experiments on the mixed monaural dialogue recordings. For preprocessing, we separated each dialogue recording into speech segments by using simple power-based voice activity detection; note that each segment could contain the speech of both speakers. We counted a recognized word as correct only when both the word and the recognized speaker matched the reference label. Since the order of speakers in the reference label is ambiguous, we calculated the best WER among all possible permutations of speakers.
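A sketch of this permutation-minimum scoring is shown below; the precomputed error_matrix (word errors when scoring hypothesis channel c against reference speaker s) is assumed to come from an external alignment tool in a standard ASR scoring pipeline.

```python
from itertools import permutations

def permutation_min_wer(error_matrix, total_ref_words):
    """Try every assignment of hypothesis channels to reference speakers
    and return the lowest WER.

    error_matrix[s][c] : word errors (sub + del + ins) when hypothesis channel c
                         is scored against reference speaker s
    total_ref_words    : total number of reference words over all speakers
    """
    num_spk = len(error_matrix)
    best_errors = min(
        sum(error_matrix[s][perm[s]] for s in range(num_spk))
        for perm in permutations(range(num_spk))
    )
    return best_errors / total_ref_words
```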

We first conducted an experiment with oracle speaker embeddings to confirm the oracle WER for two-speaker mixed recordings. The results are shown in the second to fourth rows of Table 3. We extracted the oracle i-vector for each speaker by using only the non-overlapped region determined from the ground-truth segmentation.

When we used the clean AM with the oracle speaker embeddings, we observed a very poor WER of 94.22% (# 2 of Table 3). This was within our expectation because the clean AM was not trained to extract the target speaker’s speech.

When we used the TS-AM with embeddings e1 and e2, we observed the best oracle WER of 37.96% (# 3 of Table 3). This result is the lower limit of WER for two-speaker mixed recordings in this experiment. As another application of the TS-AM, we also used the auxiliary output branch of the TS-AM with embedding e1 to recognize the second speaker. This result, shown in the fourth row of Table 3 (# 4), gave a slightly worse WER of 41.09% than that of the TS-AM with embeddings e1 and e2. This was also within our expectation because the auxiliary branch produced slightly worse results than the main branch in Table 2.

4.3.3 WER evaluation using estimated speaker embeddings with iterative update (proposed method)

Finally, we evaluated the proposed method starting from estimated speaker embeddings. In this evaluation, we estimated an i-vector for each speech partition obtained by power-based voice activity detection and then applied K-means clustering. The number of clusters was set to 2, i.e., the number of speakers, and the center of each cluster was used as the initial speaker embedding. Note that we denote the cluster center of the larger cluster as e1 and that of the smaller cluster as e2.

Similar to the comparison of # 3 and # 4 of Table 3, we compared the two methods without and with the auxiliary output branch; the results are shown in the fifth and sixth rows. Contrary to the experiment with oracle speaker embeddings, the method using the auxiliary output branch with embedding e1 to recognize the second speaker produced a much better WER of 45.54% than the method using the main output branch with embedding e2 (54.01%). This is because the K-means-based speaker-embedding estimation was not good enough to produce two discriminative embeddings e1 and e2. Considering that embedding e2 was selected as the center of the smaller cluster, it is not as reliable as embedding e1. In such a case, using the single reliable embedding e1 with the auxiliary output branch is better than using the unreliable embedding e2 with the main output branch.

We then evaluated the proposed method of alternately applying speaker-embedding estimation and TS-ASR. The results are shown in the seventh to ninth rows of Table 3. We observed clear improvements in WER for both different-gender and same-gender pairs, and achieved a WER of 40.03%, which differed by only 2.1% from the oracle WER of 37.96% obtained with oracle speaker embeddings. Note that the proposed method even outperformed method # 4 of Table 3, although the latter used speaker embeddings obtained from the ground-truth segmentation. We believe this is because the ground-truth segmentation contains a non-negligible amount of silence frames that degrade the purity of the speaker embeddings, whereas a stricter exclusion of silence frames is achieved by using the TS-ASR results.

4.3.4 Evaluation of DERs

Table 4 lists the DERs of three methods. Note that we used a no-score collar of 0.25 sec, following convention, and calculated DER including overlapped regions. Also note that, for the proposed method, we regarded silence intervals shorter than 0.5 sec as speech regions because the silence information produced by the ASR was too strict compared with the ground-truth segmentation made by human transcribers. The first row is a naive method based on the clustering of i-vectors, which was used for embedding initialization. As expected, it produced a very poor DER of 32.37% due to the heavy speech overlaps in the recordings. Just by applying TS-ASR, the DER improved to 27.87%, especially for different-gender pairs. The proposed method further improved the DER to 24.63%.
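The 0.5 sec rule can be implemented as a simple merge of adjacent speech regions, as in the sketch below (region boundaries are assumed to come from the frame-level alignments of the TS-ASR output).

```python
def merge_short_silences(speech_regions, max_gap=0.5):
    """Treat silences shorter than `max_gap` seconds between one speaker's
    ASR-derived speech regions as speech, i.e., merge the adjacent regions.

    speech_regions : sorted list of (start_sec, end_sec) tuples for one speaker
    """
    merged = []
    for start, end in speech_regions:
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged

# e.g. merge_short_silences([(0.0, 1.2), (1.5, 2.0), (3.0, 4.0)])
#      -> [(0.0, 2.0), (3.0, 4.0)]
```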

To compare with state-of-the-art speaker diarization, we also tested the agglomerative hierarchical clustering (AHC)-based method [11, 32], the results of which are shown in the last two rows of Table 4. The i-vector and x-vector extractors were trained on about 3,000 hrs of data consisting of the Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part 2), and NIST Speaker Recognition Evaluation (2004, 2005, 2006, 2008) datasets. Speaker embeddings were extracted every 0.75 sec, and AHC with probabilistic linear discriminant analysis (PLDA) scoring was used to form speaker clusters. Although the results are not directly comparable due to the difference in training data, our method produced a reasonably good DER despite being trained on far less data: it achieved a better DER than "i-vector with AHC" and a DER close to that of "x-vector with AHC". For different-gender pairs, the proposed method even achieved a better DER than the x-vector-based method, although it used an i-vector extractor trained on much less data.

Finally, we analyze the DER of the method with speaker embeddings updated three times in more detail (Table 5). The main source of the DER is the miss error, while the confusion error and false alarm are low even for same-gender pairs. For different-gender pairs, the proposed method produced almost no confusion or false alarm. This means that TS-ASR worked very conservatively, i.e., it tended to output null when it could not find a reliable word hypothesis. From the results in Tables 4 and 5, we can expect further improvement in DER by applying more discriminative speaker embeddings such as d-vectors [40, 39] or x-vectors [11, 35], which is an important direction for our future work.

5 Conclusion

In this paper, we defined the problem of simultaneous ASR and speaker diarization and proposed an iterative method in which (i) the estimation of speaker embeddings in the recording and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using real dialogue recordings in the CSJ in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both WER and DER. Our proposed method with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from the WER of TS-ASR given oracle speaker embeddings. Furthermore, our method achieved a better DER than that of the conventional clustering-based speaker diarization method based on i-vectors.

There are many directions in which to extend this research. First, the proposed method should be examined on recordings with more than two speakers. Second, the use of more discriminative speaker embeddings, such as d-vectors [40, 39] and x-vectors [11, 35], should improve the performance of both ASR and speaker diarization. Third, the initialization of speaker embeddings should be explored with more advanced speaker diarization techniques [40, 9, 10]. Finally, advanced ASR techniques, such as data augmentation [14, 19, 21, 22], model ensembles [36, 7, 15], and improved training criteria [16, 42], will also improve overall performance. We will explore these directions in future work.

References

  • [1] Ö. Çetin and E. Shriberg (2006) Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition. In Proc. ICSLP, Cited by: §4.1.1.
  • [2] X. Chang, Y. Qian, K. Yu, and S. Watanabe (2019) End-to-end monaural multi-speaker asr system without pretraining. In Proc. ICASSP, Cited by: §1.
  • [3] Z. Chen, J. Droppo, J. Li, W. Xiong, Z. Chen, J. Droppo, J. Li, and W. Xiong (2018) Progressive joint modeling in unsupervised single-channel overlapped speech recognition. IEEE/ACM Trans. on ASLP 26 (1), pp. 184–196. Cited by: §1.
  • [4] Z. Chen, Y. Luo, and N. Mesgarani (2017) Deep attractor network for single-microphone speaker separation. In Proc. ICASSP, pp. 246–250. Cited by: §1.
  • [5] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Trans. on ASLP 19 (4), pp. 788–798. Cited by: §2.2, §3.2.1.
  • [6] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani (2018) Single channel target speaker extraction and recognition with speaker beam. In Proc. ICASSP, pp. 5554–5558. Cited by: §1, §1, §3.1.
  • [7] L. Deng and J. C. Platt (2014) Ensemble deep learning for speech recognition. In Proc. INTERSPEECH, pp. 1915–1919. Cited by: §5.
  • [8] J. G. Fiscus, J. Ajot, and J. S. Garofolo (2007) The rich transcription 2007 meeting recognition evaluation. In Multimodal Technologies for Perception of Humans, pp. 373–389. Cited by: §1.
  • [9] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe (2019) End-to-end neural speaker diarization with permutation-free objectives. In Proc. INTERSPEECH, Cited by: §5.
  • [10] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe (2019) End-to-end neural speaker diarization with self-attention. In Proc. ASRU, Note: to appear Cited by: §5.
  • [11] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree (2017) Speaker diarization using deep neural network embeddings. In Proc. ICASSP, pp. 4930–4934. Cited by: §4.3.4, §4.3.4, §5.
  • [12] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. ICASSP, pp. 31–35. Cited by: §1.
  • [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.3.
  • [14] N. Jaitly and G. E. Hinton (2013) Vocal tract length perturbation (VTLP) improves speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, Vol. 117. Cited by: §5.
  • [15] N. Kanda, Y. Fujita, and K. Nagamatsu (2017) Investigation of lattice-free maximum mutual information-based acoustic models with sequence-level Kullback-Leibler divergence. In Proc. ASRU, pp. 69–76. Cited by: §5.
  • [16] N. Kanda, Y. Fujita, and K. Nagamatsu (2018) Lattice-free state-level minimum Bayes risk training of acoustic models. In Proc. INTERSPEECH, pp. 2923–2927. Cited by: §5.
  • [17] N. Kanda, S. Horiguchi, R. Takashima, Y. Fujita, K. Nagamatsu, and S. Watanabe (2019) Auxiliary interference speaker loss for target-speaker speech recognition. arXiv preprint arXiv:1906.10876. Cited by: §1, Figure 2, §3.2.1, §3.2.1, §4.2.1.
  • [18] N. Kanda, R. Ikeshita, S. Horiguchi, Y. Fujita, K. Nagamatsu, X. Wang, V. Manohar, N. E. Y. Soplin, M. Maciejewski, S. Chen, et al. (2018) The Hitachi/JHU CHiME-5 system: advances in speech recognition for everyday home environments using multiple microphone arrays. In Proc. CHiME-5, pp. 6–10. Cited by: footnote 1.
  • [19] N. Kanda, R. Takeda, and Y. Obuchi (2013) Elastic spectral distortion for low resource speech recognition with deep neural networks. In Proc. ASRU, pp. 309–314. Cited by: §5.
  • [20] T. Kawahara, H. Nanjo, T. Shinozaki, and S. Furui (2003) Benchmark test for speech recognition using the Corpus of Spontaneous Japanese. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Cited by: §4.1.2.
  • [21] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur (2015) Audio augmentation for speech recognition. In Proc. INTERSPEECH, pp. 3586–3589. Cited by: §5.
  • [22] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In Proc. ICASSP, pp. 5220–5224. Cited by: §5.
  • [23] K. Maekawa (2003) Corpus of spontaneous japanese: its design and evaluation. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Cited by: §4.1.1.
  • [24] V. Manohar, S. Chen, Z. Wang, Y. Fujita, S. Watanabe, and S. Khudanpur (2019) Acoustic modeling for overlapping speech recognition: JHU CHiME-5 challenge system. In Proc. ICASSP, pp. 6665–6669. Cited by: footnote 1.
  • [25] S. Meignier and T. Merlin (2010) LIUM SpkDiarization: an open source toolkit for diarization. In CMU SPUD Workshop, Cited by: §1.
  • [26] V. Peddinti, D. Povey, and S. Khudanpur (2015) A time delay neural network architecture for efficient modeling of long temporal contexts. In Proc. INTERSPEECH, pp. 3214–3218. Cited by: Figure 2.
  • [27] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, et al. (2011) The Kaldi speech recognition toolkit. In Proc. ASRU, Cited by: §4.1.3.
  • [28] D. Povey, V. Peddinti, D. Galvez, P. Ghahrmani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur (2016) Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proc. INTERSPEECH, pp. 2751–2755. Cited by: §3.2.1, §3.2.2, §4.1.3.
  • [29] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny (2013) Speaker adaptation of neural network acoustic models using i-vectors. In Proc. ASRU, pp. 55–59. Cited by: §3.2.1.
  • [30] H. Seki, T. Hori, S. Watanabe, J. Le Roux, and J. R. Hershey (2018) A purely end-to-end system for multi-speaker speech recognition. In Proc. ACL, pp. 2620–2630. Cited by: §1.
  • [31] G. Sell and D. Garcia-Romero (2014) Speaker diarization with PLDA i-vector scoring and unsupervised calibration. In Proc. SLT, pp. 413–417. Cited by: §1.
  • [32] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proc. INTERSPEECH, pp. 2808–2812. Cited by: §4.3.4, Table 4.
  • [33] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel (2013) A study of the cosine distance-based mean shift for telephone speech diarization. IEEE/ACM Trans. on ASLP 22 (1), pp. 217–227. Cited by: §1.
  • [34] S. Settle, J. Le Roux, T. Hori, S. Watanabe, and J. R. Hershey (2018) End-to-end multi-speaker speech recognition. In Proc. ICASSP, pp. 4819–4823. Cited by: §1.
  • [35] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In Proc. ICASSP, pp. 5329–5333. Cited by: §4.3.4, §5.
  • [36] Y. Tachioka, S. Watanabe, J. Le Roux, and J. R. Hershey (2013) A generalized discriminative training framework for system combination. In Proc. ASRU, pp. 43–48. Cited by: §5.
  • [37] S. E. Tranter and D. A. Reynolds (2006) An overview of automatic speaker diarization systems. IEEE Trans. on ASLP 14 (5), pp. 1557–1565. Cited by: §1.
  • [38] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang (1989) Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 37 (3), pp. 328–339. Cited by: §4.1.3.
  • [39] L. Wan, Q. Wang, A. Papir, and I. L. Moreno (2018) Generalized end-to-end loss for speaker verification. In Proc. ICASSP, pp. 4879–4883. Cited by: §4.3.4, §5.
  • [40] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno (2018) Speaker diarization with LSTM. In Proc. ICASSP, pp. 5239–5243. Cited by: §1, §4.3.4, §5.
  • [41] Y. Wang, V. Peddinti, H. Xu, X. Zhang, D. Povey, and S. Khudanpur (2017) Backstitch: counteracting finite-sample bias via negative steps. In Proc. INTERSPEECH, pp. 1631–1635. Cited by: §4.1.3.
  • [42] C. Weng and D. Yu (2019) A comparison of lattice-free discriminative training criteria for purely sequence-trained neural network acoustic models. In Proc. ICASSP, pp. 6430–6434. Cited by: §5.
  • [43] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva (2018) Recognizing overlapped speech in meetings: a multichannel separation approach using neural networks. In Proc. INTERSPEECH, pp. 3038–3042. Cited by: §1, §4.1.1.
  • [44] D. Yu, X. Chang, and Y. Qian (2017) Recognizing multi-talker speech with permutation invariant training. In Proc. INTERSPEECH, pp. 2456–2460. Cited by: §1, §1.
  • [45] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, pp. 241–245. Cited by: §1.
  • [46] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani (2017) Learning speaker representation for neural network based multichannel speaker extraction. In Proc. ASRU, pp. 8–15. Cited by: §3.1.
  • [47] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani (2017) Speaker-aware neural network based beamformer for speaker extraction in speech mixtures. In Proc. INTERSPEECH, Cited by: §1, §3.1.