Speaker tracking is the process of identifying all regions uttered by a target speaker in an audio stream [ASTSBOSTDFNE]. Similarly to speaker diarization, which answers the question "who spoke when?", speaker tracking searches for those regions, but additionally assigns speaker identities to them. This process is an important pre-processing step for many multi-speaker applications such as virtual assistants and broadcast news transcription and indexing.
As shown in [garcia2019speaker], diarization and tracking are two closely related methods. Although tracking would benefit from diarization, in this research we explored the possibility of including a neural network as a robust classifier that can operate similarly to PLDA, with the goal that it can naturally provide results for both diarization and tracking. Since there are only a few studies on speaker tracking [ASTSBOSTDFNE, 8272968, sonmez1999speaker], we use diarization as the main background and inspiration for this work.
Most standard speaker diarization systems focus on offline clustering, as it uses all the contextual information to label the speech regions. Examples of such algorithms include agglomerative hierarchical clustering (AHC) [article:bsdfdsdc2019, article:sdwdsefdci], k-means [article:pal2019study, article:hdwkfscisdobn], and others [article:lbsmwscfsd, article:atscfsdunme]. These clustering methods cannot be used in real-time applications since they require the complete speech data upfront. A latency-sensitive application requires speaker labels to be generated as soon as speech segments become available to the system.
[wang2017speaker] presents an embedding-based speaker diarization system: d-vectors [article:dnnfsftdsv] were used along with LSTM-based speaker verification, in combination with spectral clustering, to successfully perform offline diarization; however, the diarization error rate almost doubles in its online modality. Another online diarization approach is introduced in [Ghahabi2019]. They propose a DNN (deep neural network) embedding suitable for online processing, referred to as the speaker-corrupted embedding. Their diarization algorithm uses cosine similarity to compare the speaker models and the embedding in order to make the labeling decisions.
In this paper, we propose an online speaker tracking pipeline that replaces the unsupervised offline clustering module of the standard diarization system with an online tracking method that uses a DNN as a robust embedding classifier. As shown in Figure 1, our speaker tracking system shares many of its components with the standard diarization pipeline [article:bsfdsdc2018, zhang2018fully, dihseallftjtitidc], the main difference being the clustering algorithm.
The experimental results on CALLHOME and DIHARD II single channel reveal that our method achieves significant improvements over the PLDA baseline. Our code is available at https://github.com/CarlosRCS9/kaldi/tree/paper-dnn-tracking.
In this section, we introduce our speaker tracking framework. Figure 1 illustrates the overall steps of our tracking pipeline:
Speech segmentation and embedding extraction.
Speaker model generation.
Speaker segment identification.
2.1 Speech segmentation and embedding extraction
The first module in our pipeline is inspired by the standard diarization system: it uses a Voice Activity Detector (VAD) to determine the speech regions in the input audio signal, excluding the non-speech regions from subsequent processing. A sliding window further divides these regions into a set of smaller, overlapping speech segments, establishing the temporal resolution of the speaker tracking results. The output of this module is a set of voiced-speech segments. We decided to use an oracle VAD as the segmentation mechanism in order to focus our efforts on checking whether the proposed architecture can track speakers accurately.
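As a concrete sketch of this module, assuming the oracle VAD supplies speech regions as (start, end) pairs in seconds, the sliding-window segmentation could look as follows (function and parameter names are illustrative; the window values 1.0 s / 0.5 s / 0.5 s are those chosen in Section 3):

```python
def segment_regions(vad_regions, width=1.0, step=0.5, min_len=0.5):
    """Slide an overlapping window over oracle-VAD speech regions,
    dropping segments shorter than min_len seconds."""
    segments = []
    for start, end in vad_regions:
        t = start
        while t < end:
            seg_end = min(t + width, end)
            if seg_end - t >= min_len:
                segments.append((t, seg_end))
            t += step
    return segments
```

A 2-second speech region thus yields four overlapping 0.5 s-stepped segments, the last one truncated to the region boundary.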
2.1.1 Embedding extraction
The next step in the pipeline is to extract an embedding from each segment. Our system was tested following the i-vector and x-vector approaches [article:sdwpisauc, article:xrdefsr]. The i-vector, introduced by Dehak et al. [article:fefafsv], is a speaker representation that reduces large-dimensional input speech data to a small-dimensional feature vector retaining most of the relevant channel and speaker information. The x-vector, introduced by Snyder et al. [article:dnnbsefetesv, article:xrdefsr], is an embedding extracted from a deep neural network trained to discriminate between speakers, mapping variable-length speech segments to fixed-length feature vectors. Nowadays, the x-vector approach provides state-of-the-art performance in many speaker recognition fields, such as speaker verification and speaker diarization [article:srfmscux, article:slrux, article:gxftixv, ryant2019second].
2.2 Speaker model generation
After the segment embeddings are extracted, a speaker model is generated for each tracked speaker. This is done by averaging the embeddings within a time window at the beginning of each target speaker's enrollment in the input audio. We define the model time as the width of the window used to generate the speaker models. With this approach, the system operates in an online fashion: given a few labeled samples of the target speakers, it can find their appearances along the complete audio.
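The averaging step above can be sketched as follows, assuming segment embeddings are NumPy vectors with known start times (the function name and arguments are hypothetical):

```python
import numpy as np

def build_speaker_model(embeddings, times, enroll_start, model_time):
    """Average the segment embeddings whose start time falls inside the
    enrollment window [enroll_start, enroll_start + model_time)."""
    window = [e for e, t in zip(embeddings, times)
              if enroll_start <= t < enroll_start + model_time]
    return np.mean(window, axis=0)
```

The model time parameter directly trades enrollment effort for model quality, which is why Section 3 evaluates several values of it.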
2.3 Speaker segment identification
The resulting segment embeddings and the speaker models are then passed through a speaker identification/verification stage. This task is performed by a speaker tracking DNN, the key component of our pipeline. With respect to run-time latency, this module follows an online tracking strategy: it produces a speaker label immediately after a segment is available, without knowledge of future segments, making it easier for the system to deal with large amounts of data.
Figure 2 illustrates the structure of the network's input and output layers during the segment labeling process. For a given utterance, the input and output sequences of the network (X, Y) are defined as follows:

The speech segmentation and embedding extraction module provides a sequence of embeddings E = (e_1, ..., e_T), where each e_t has a 1:1 correspondence to the segments obtained from the input utterance and e_t ∈ R^d, with d the dimension of every embedding.

The speaker model generation module provides the sequence M = (m_1, ..., m_S), where m_s ∈ R^d, such that each entry of the sequence is a model of one of the tracked speakers.

The input sequence of our network is defined by concatenating each element of M to every segment embedding: X = (x_1, ..., x_T), with x_t = [e_t, m_1, ..., m_S].

The output sequence Y = (y_1, ..., y_T) is given by the speaker labels of the segments, where y_t ∈ {1, ..., S}. At training time, Y is given by the ground-truth labels; at inference, Y is computed from the estimated labels.
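Assuming the segment embeddings and speaker models are NumPy vectors of a common dimension d (symbol and function names here are illustrative), the input construction can be sketched as:

```python
import numpy as np

def build_inputs(E, M):
    """Form the network inputs x_t = [e_t, m_1, ..., m_S]: each segment
    embedding is concatenated with every speaker model, so each input
    row has dimension (S + 1) * d."""
    return np.stack([np.concatenate([e] + list(M)) for e in E])
```

This makes the input dimension grow linearly with the number of tracked speakers, which is why the network has one output neuron per model slot.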
Table 1 summarizes the final DNN architecture used in this work. The first three convolutional layers of the network provide a comparison stream for each of the speaker models, where the similarity measure between the segment embedding and a single speaker model is computed with the contextual information of all the speaker models. The fully-connected feed-forward layers then use these streams to score the similarity between each target speaker model and the incoming segment. We also tested recurrent neural network architectures (bidirectional LSTM), but they were discarded: at inference time they seemed to ignore the speaker-model features and instead memorized the input sequences seen during training.
Table 1: DNN architecture (layer type, number of filters, kernel size, input/output dimensions).
During training, all possible permutations of the elements of M are computed and appended to every input, with two main goals: reducing overfitting by forcing all output neurons to score the same speaker models, and augmenting the number of training samples. This procedure ensures that the DNN scoring is independent of the ordering of the speaker-model sequence. Figure 3 shows how the training data is further augmented by adding zero padding as a non-speaker model feature. This simulates a verification task, since the network has to decide whether the current segment embedding belongs to one of the available models or not.
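A minimal sketch of the permutation and zero-padding augmentation, assuming embeddings and speaker models are NumPy vectors and labels are model indices (all names hypothetical):

```python
import itertools
import numpy as np

def augment_models(E, M, labels):
    """For every segment, emit one training example per permutation of the
    speaker models, remapping the label to the model's new position; a zero
    vector is appended as the 'no speaker' model slot."""
    zero = np.zeros_like(M[0])
    examples = []
    for e, y in zip(E, labels):
        for perm in itertools.permutations(range(len(M))):
            x = np.concatenate([e] + [M[i] for i in perm] + [zero])
            examples.append((x, perm.index(y)))
    return examples
```

With S models this multiplies the number of training samples by S!, which also explains the overfitting-reduction effect: every output neuron sees every model.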
At inference time, the input layer of the network receives the incoming segment embedding and an array of target speaker models; the length of the array equals the number of output neurons, so each score is associated with an index in the speaker model array. In an identification setup, we label the segment with the index of the highest score. If the task requires verification, a certainty threshold is used to label the segments.
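The identification/verification decision can be sketched as follows; the function name and threshold value are illustrative, not the paper's actual implementation:

```python
def label_segment(scores, threshold=None):
    """Identification: pick the highest-scoring model index.
    Verification: if a threshold is given and no score reaches it,
    return None (segment belongs to no tracked speaker)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if threshold is not None and scores[best] < threshold:
        return None
    return best
```

The same score vector thus serves both setups; only the presence of a threshold distinguishes them.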
The baseline system uses probabilistic linear discriminant analysis (PLDA) scoring as the similarity measure, since it has proven to achieve state-of-the-art performance in many speaker recognition tasks [article:apaflatisr, article:ltfpisv, ryant2019second, article:unsdsftdi2c]. PLDA scoring computes the log-likelihood ratio between two embeddings; it provides a powerful distortion-resistant mechanism to distinguish between different speakers while remaining robust to same-speaker variability.
Due to the online nature of our pipeline, the post-processing step is applied as soon as a segment label is emitted. This step refines the tracking results by merging contiguous same-speaker segments, and by adjusting labels within a window of three contiguous segments: if the two surrounding labels are equal to each other but differ from the in-between label, the in-between label is changed to match them, producing three contiguous segments with the same label.
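The three-segment relabeling rule above can be sketched (applied offline here for clarity; the online version operates on each newly emitted label) as:

```python
def smooth_labels(labels):
    """Relabel the middle of any A-B-A pattern to A-A-A; merging
    contiguous same-speaker segments then follows trivially."""
    out = list(labels)
    for i in range(1, len(out) - 1):
        if out[i - 1] == out[i + 1] != out[i]:
            out[i] = out[i - 1]
    return out
```

This removes isolated single-segment speaker flips, which are far more often labeling errors than genuine sub-second turn changes.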
This section describes our experimental setup and results. We decided on a 1 s width and 0.5 s step sliding window at the speech segmentation step, discarding segments shorter than 0.5 s to ensure sufficient speaker information. Both i- and x-vectors were extracted using Kaldi's CALLHOME diarization recipes (https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization/v1 and /v2) [dihseallftjtitidc]. For the CALLHOME x-vector experiments, a publicly available model and PLDA backend (https://kaldi-asr.org/models/m6) [article:xrdefsr] were used.
3.1 Evaluation metrics
The system performance was evaluated in terms of Equal Error Rate (EER) and minimum Detection Cost Function (minDCF), as the key component of our tracking framework follows a speaker verification approach. In addition, we report Diarization Error Rate (DER) [article:trt2smre] since our framework shares characteristics with the standard diarization system.
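As a reference, the EER can be approximated with a simple threshold sweep over the verification scores (our own sketch, not the official scoring tooling):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep the decision threshold over all observed scores and return
    the operating point where the false-rejection rate (FRR) and the
    false-acceptance rate (FAR) are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = 1.0, 1.0
    for th in thresholds:
        frr = float(np.mean(np.asarray(target_scores) < th))
        far = float(np.mean(np.asarray(nontarget_scores) >= th))
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```

The minDCF is computed analogously but weights misses and false alarms by the target prior (0.1% in our tables) instead of equalizing them.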
3.2 Datasets
We tested our system on two standard public datasets: (1) the 2000 NIST Speaker Recognition Evaluation (LDC2001S97), Disk-8, usually referred to as "CALLHOME", which contains 500 utterances distributed across six languages: Arabic, English, German, Japanese, Mandarin, and Spanish, with up to 7 speakers per utterance; (2) the DIHARD II single-channel development and evaluation subsets (LDC2019E31, LDC2019E32), focused on "hard" speaker diarization, which contain 5-10 minute English utterances selected from 11 conversational domains, each domain including approximately 2 hours of audio. Since our approach is supervised, we perform a 2-fold cross-validation on each dataset using standard partitions: callhome1 and callhome2 from Kaldi's CALLHOME diarization recipe [dihseallftjtitidc], and DIHARD II single channel's development and evaluation subsets. The partition results are then combined to report the averaged DER, EER and minDCF for each dataset.
3.3 Overlap preparation
A set of our experiments focuses on speaker overlap, so it was necessary to augment the datasets, as they have a low percentage of speaker overlap (CALLHOME 16%, DIHARD II single channel 9%). To perform this task, the non-overlapping audio segments of each speaker are extracted using the ground-truth labels, then merged into a set of single-speaker utterances for each recording. The single-speaker utterances are then pairwise overlapped to create a new set of two-speaker-overlapping utterances. Finally, the new overlapping utterances are cut into segments and inserted into their original recordings at random locations. The resulting dataset contains an additional 18% of speaker overlap in CALLHOME, and 30% in DIHARD II single channel.
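The pairwise-overlapping step can be sketched as a simple sum of two single-speaker waveforms over their common length (a simplified illustration; the real pipeline additionally cuts the result into segments and re-inserts them at random positions):

```python
import numpy as np

def make_overlap(utt_a, utt_b):
    """Mix two single-speaker waveforms (1-D sample arrays) into one
    two-speaker-overlapping utterance, truncated to the shorter one."""
    n = min(len(utt_a), len(utt_b))
    return utt_a[:n] + utt_b[:n]
```
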
The baseline system follows exactly the same procedure as our proposed tracking method. The only difference is the replacement of the DNN-based speaker segment identification module with a PLDA-based one.
In the first set of experiments we evaluate under an optimum set of conditions for speaker tracking: the number of tracked speakers is fixed to 2, the input audio signal contains only speech from them, and there are no overlapped speech instances.
| Model time | DER | EER | minDCF | DER | EER | minDCF |
|---|---|---|---|---|---|---|
| **CALLHOME i-vector** | | | | | | |
| 3.0 s | 6.90 | 17.40 | 0.91 | 5.65 | 4.53 | 0.52 |
| 5.5 s | 5.41 | 14.81 | 0.88 | 4.93 | 3.90 | 0.53 |
| 10.5 s | 4.56 | 12.98 | 0.85 | 4.54 | 2.83 | 0.37 |
| **CALLHOME x-vector** | | | | | | |
| 3.0 s | 33.95 | 41.66 | 0.99 | 11.53 | 11.23 | 0.79 |
| 5.5 s | 28.65 | 36.68 | 1.00 | 7.80 | 6.93 | 0.60 |
| 10.5 s | 24.66 | 32.57 | 1.00 | 5.63 | 4.17 | 0.50 |
| **DIHARD II i-vector** | | | | | | |
| 3.0 s | 19.01 | 36.62 | 0.99 | 18.44 | 21.53 | 0.96 |
| 5.5 s | 16.22 | 34.73 | 0.99 | 13.97 | 15.19 | 0.89 |
| 10.5 s | 13.29 | 33.70 | 0.99 | 11.03 | 11.76 | 0.82 |
| **DIHARD II x-vector** | | | | | | |
| 3.0 s | 28.49 | 38.01 | 1.00 | 20.56 | 29.01 | 0.99 |
| 5.5 s | 28.54 | 38.01 | 1.00 | 19.28 | 27.04 | 0.99 |
| 10.5 s | 27.05 | 39.69 | 1.00 | 13.36 | 17.51 | 0.97 |

Table 2: DER (%), EER (%) and minDCF (0.1% target probability) on the two datasets under the optimum conditions (left columns: PLDA baseline; right columns: proposed DNN).
In Table 2 we can see that the DNN-based tracking system significantly outperforms the PLDA baseline in EER and minDCF, which drop from 12.98% to 2.83% and from 0.85 to 0.37, respectively, on CALLHOME with speaker models generated from 10.5 s of labeled samples. We also observe that the i-vectors provide good DER performance even when fewer labeled samples are provided. Additionally, we followed the same optimum conditions with x-vectors; the advantage of our supervised approach in EER and minDCF is consistent in both datasets. A further improvement is shown in terms of DER, which drops from 24.66% to 5.63%. It should be noted that both the DNN and PLDA systems degrade when x-vectors are used; this is further illustrated in Figure 4, where the minDCF curves show a clear advantage of the i-vectors in our supervised approach. We attribute this behavior to the mismatch between the window used to train the x-vectors (1.5 s) and the very short segments used to produce the embeddings.
| Model time | DER | EER | minDCF | DER | EER | minDCF |
|---|---|---|---|---|---|---|
| **CALLHOME i-vector** | | | | | | |
| 3.0 s | 7.37 | 22.58 | 0.99 | 6.25 | 11.66 | 0.79 |
| 5.5 s | 5.69 | 22.52 | 0.99 | 4.56 | 7.77 | 0.64 |
| 10.5 s | 4.53 | 22.39 | 1.00 | 4.43 | 5.89 | 0.52 |
| **CALLHOME x-vector** | | | | | | |
| 3.0 s | 32.57 | 46.58 | 1.00 | 12.37 | 6.81 | 0.67 |
| 5.5 s | 27.61 | 41.52 | 1.00 | 8.13 | 5.26 | 0.66 |
| 10.5 s | 22.81 | 38.06 | 1.00 | 6.24 | 4.40 | 0.65 |
| **DIHARD II i-vector** | | | | | | |
| 5.5 s | 16.40 | 32.86 | 0.99 | 15.42 | 17.96 | 0.94 |
| 10.5 s | 13.05 | 32.67 | 0.99 | 11.44 | 14.67 | 0.96 |
| **DIHARD II x-vector** | | | | | | |
| 5.5 s | 29.48 | 56.65 | 1.00 | 16.84 | 15.89 | 0.98 |
| 10.5 s | 30.04 | 59.99 | 1.00 | 13.77 | 15.06 | 0.99 |

Table 3: DER (%), EER (%) and minDCF on the two datasets with non-target speaker segments mixed into the recordings (left columns: PLDA baseline; right columns: proposed DNN).
In a second set of experiments we increased the complexity of the previous conditions by mixing a set of non-target speaker segments into every recording. Such segments were built from the speaker models of other recordings within the same cross-validation fold. Here we are mainly interested in the EER and minDCF results.
As shown in Table 3, the DNN-based system continues to outperform the PLDA baseline. We also observe that the performance gap between i- and x-vectors disappears in the DNN system, since the i-vector error increased while the x-vector error remained essentially unchanged. We conclude that our system is robust when it encounters non-target speakers in a recording.
Finally, we evaluate our proposed system considering overlapped speech, as described in Section 3.3. In this set of experiments, the number of tracked speakers is fixed to 2, with the input audio signal containing both non-overlapping and overlapping speech from them. Table 4 shows promising results on both datasets, with 12% DER on CALLHOME with the additional 18% of augmented overlapped speech, and 28.04% DER on DIHARD II single channel with its 30% additional overlap.
| Model time | DER | EER | minDCF | DER | EER | minDCF |
|---|---|---|---|---|---|---|
| | **CALLHOME** | | | **DIHARD II** | | |
| 3.0 s | 20.82 | 13.20 | 0.94 | 36.94 | 30.35 | 0.99 |
| 5.5 s | 15.78 | 9.72 | 0.90 | 31.99 | 24.78 | 0.98 |
| 10.5 s | 12.64 | 7.55 | 0.83 | 28.04 | 20.63 | 0.94 |

Table 4: DER (%), EER (%) and minDCF of the proposed DNN on the overlap-augmented datasets.
In this paper, we proposed a novel embedding-based speaker tracking DNN model focused on online tracking. We demonstrated the effectiveness of our approach through several experiments on two standard public datasets: CALLHOME and DIHARD II single channel. Validation results show a promising performance improvement compared to the PLDA baseline, reducing DER, EER and minDCF under different experimental conditions, such as an increased number of non-target speakers within a recording and overlapping speakers.
For future research, we would like to extend our current DNN model to a joint online diarization and tracking system, where a recurrent neural network (RNN) will be responsible for selecting and updating the speaker models without resorting to external sources. We expect such a system to provide not only the diarization results, but also the set of speaker models generated during an adaptive diarization process.
We would like to thank Diego Nigel Joaquin Campos Sobrino and Mario Alejandro Campos Soberanis for their helpful discussions.