Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

05/15/2020
by Yanpei Shi, et al.
The University of Sheffield

Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. In this paper, a hierarchical attention network is proposed to solve a weakly labelled speaker identification problem. The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker-related information both locally and globally. Speech streams are segmented into fragments. The frame-level encoder with attention learns features and highlights the target-related frames locally, and outputs a fragment-based embedding. The segment-level encoder works with a second attention layer to emphasize the fragments most likely related to the target speakers. The global information is finally collected from the segment-level module to predict speakers via a classifier. To evaluate the effectiveness of the proposed approach, artificial datasets based on Switchboard Cellular Part 1 (SWBC) and Voxceleb1 are constructed under two conditions, in which speakers' voices are either overlapped or not overlapped. Compared to two baselines, the obtained results show that the proposed approach achieves better performance. Moreover, further experiments are conducted to evaluate the impact of utterance segmentation. The results show that a reasonable segmentation can slightly improve identification performance.


1 Introduction

Speaker identification using deep neural networks has become an active research area in recent years [variani2014deep, wang2018attention]. In traditional supervised speaker identification training, the data used for training needs hand labelling, where the segments and the corresponding speaker labels are manually annotated [karu2018weakly]. It can be expensive to process a large dataset with a large number of speakers using hand annotation [karu2018weakly, jia2019leveraging].

Instead of hand-annotating speaker labels as in supervised training, weakly supervised training only relies on the set of speaker labels that occur in the corresponding utterance [zhou2018brief]. Such weakly labelled large data collections are available online [karu2018weakly], and making use of them would be helpful for training with a large amount of data.

Weakly supervised training has been widely used in speech technology. In [karu2018weakly], Karu et al. proposed a DNN-based weakly supervised speaker identification training technique. In their work, speaker diarization is first applied, and i-vectors are then extracted for each segment. A DNN is trained to predict the set of speaker labels without the true mapping from the i-vectors to the speaker labels. In [xu2017unsupervised], Xu et al. proposed a DNN-based approach for multi-label audio tagging, in which an auto-encoder is trained to predict multiple labels from one input utterance. In [xu2018large], Xu et al. proposed the use of a gated convolutional neural network for audio classification, where the model is trained to predict one or more classes from an audio recording without time-stamp labels.

Beyond speech technology, weakly supervised learning has been widely used in other domains. In [liu2019weakly], Liu et al. proposed a weakly supervised transfer learning approach to classify multi-temporal remote-sensing images using one labelled image. In [xu2019weakly], Xu et al. proposed a weakly supervised training approach for image semantic segmentation using image-level labels.

In this work, a hierarchical attention network [yang2016hierarchical] based weakly supervised speaker identification approach is proposed. In the training and test data, each utterance contains multiple speakers and only the utterance-level labels are available. Different speakers might occur in different parts of the input utterance, and some segments might contain multiple overlapped speakers. The model is trained to predict the set of all speaker labels from one input utterance [zhang2007multi, xu2017unsupervised]. The proposed hierarchical attention network contains a frame-level encoder with attention and a segment-level encoder with attention, which capture speaker information locally and globally [shi2020h]. The frame-level encoder with attention tries to find the important frames within a segment, and the segment-level encoder tries to find the parts of the input utterance that are most important for speaker identity. Finally, the whole input utterance is compressed into a single vector and fed to a DNN classifier, and a score for each speaker is obtained using a sigmoid function. The proposed hierarchical attention network (HAN) enables the model to highlight and pay attention to the parts of the input utterance most related to the speaker identities.

The rest of the paper is organized as follows: Section 2 presents the architecture of our approach. Section 3 describes the data and the data construction process, the experimental setup, the baselines and the implementation details. Results are presented in Section 4, and conclusions are drawn in Section 5.

2 Proposed Model

Figure 1: Architecture of the proposed Hierarchical Attention Network.

Figure 1 shows the architecture of the hierarchical attention network. The network consists of several parts: a frame-level encoder and attention layer, a segment-level encoder and attention layer, and two fully connected layers acting as a classifier. Given the input acoustic frame vectors, the proposed model applies attention mechanisms locally and globally, and predicts the multiple speakers present in the input utterance. The details of each part are introduced in the following subsections.

2.1 Frame-Level Encoder and Attention

An utterance is divided into $N$ segments $\{S_1, S_2, \dots, S_N\}$ using a sliding window with length $M$ and step $H$. Each segment $S_i$ contains $M$ $D$-dimensional acoustic frame vectors $\{x_{i1}, x_{i2}, \dots, x_{iM}\}$, where $i$ denotes the $i$-th segment, $t$ denotes the $t$-th frame within a segment, and $x_{it} \in \mathbb{R}^{D}$.

In the frame-level encoder, a TDNN [peddinti2015time] is applied to each segment, followed by a bidirectional GRU [chung2014empirical] in order to capture contextual information from both directions of the acoustic frames.
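
To make this structure concrete, a minimal PyTorch sketch of the segmentation and the frame-level encoder is given below. It assumes the TDNN is realised as a 1-D convolution over time and uses the layer sizes of Table 2 (20-dimensional input frames, 256-dimensional TDNN outputs, 512-dimensional bidirectional GRU outputs); the kernel size and the helper names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

def split_into_segments(utterance, window=20, step=10):
    """Divide an utterance of shape (T, feat_dim) into overlapping segments
    of shape (num_segments, window, feat_dim) using a sliding window."""
    return utterance.unfold(0, window, step).transpose(1, 2)

class FrameLevelEncoder(nn.Module):
    """Frame-level encoder: TDNN (modelled as a 1-D convolution over time)
    followed by a bidirectional GRU, as described in Section 2.1."""
    def __init__(self, feat_dim=20, tdnn_dim=256, gru_dim=256):
        super().__init__()
        # Kernel size 5 is an assumption; Table 2 only fixes the layer widths.
        self.tdnn = nn.Conv1d(feat_dim, tdnn_dim, kernel_size=5, padding=2)
        self.gru = nn.GRU(tdnn_dim, gru_dim, batch_first=True, bidirectional=True)

    def forward(self, segments):
        # segments: (num_segments, M, feat_dim) acoustic frames
        x = torch.relu(self.tdnn(segments.transpose(1, 2)))  # (num_segments, tdnn_dim, M)
        h, _ = self.gru(x.transpose(1, 2))                   # (num_segments, M, 2 * gru_dim)
        return h
```

In this sketch each segment is treated as one item in the batch dimension, so all segments of an utterance are processed in parallel.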

The output of the frame-level encoder, $h_{it}$, contains the information of segment $S_i$ at frame $t$.

In the frame-level attention layer, a two-layer MLP is first used to convert $h_{it}$ into a scalar score $e_{it}$, from which a normalized importance weight $\alpha_{it}$ can be computed via a softmax function [yang2016hierarchical, rimer2004softprop]:

$e_{it} = v^{\top} \tanh(W h_{it} + b)$    (1)
$\alpha_{it} = \frac{\exp(e_{it})}{\sum_{t'=1}^{M} \exp(e_{it'})}$    (2)

where $e_{it}$ and $\alpha_{it}$ are the scalar score and normalized score for each time step respectively, and $W$, $b$ and $v$ are the parameters of the two-layer MLP. These parameters are shared across segments. A weighted output of the frame-level encoder is computed by

$\tilde{h}_{it} = \alpha_{it} h_{it}$    (3)

Following [snyder2018x], statistics pooling is applied on $\tilde{h}_{it}$ to compute its mean vector ($\mu_i$) and standard deviation vector ($\sigma_i$) over time. A segment vector $s_i$ is then obtained by concatenating the two vectors:

$s_i = [\mu_i ; \sigma_i]$    (4)
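
A corresponding sketch of the frame-level attention and statistics pooling of Eqs. (1)-(4) might look as follows; the 512-dimensional input follows Table 2, while the 256-dimensional attention hidden layer is an assumption.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Two-layer MLP attention (Eqs. 1-2), frame weighting (Eq. 3)
    and statistics pooling (Eq. 4)."""
    def __init__(self, in_dim=512, att_dim=256):
        super().__init__()
        self.W = nn.Linear(in_dim, att_dim)          # W, b in Eq. (1)
        self.v = nn.Linear(att_dim, 1, bias=False)   # v in Eq. (1)

    def forward(self, h):
        # h: (num_segments, M, in_dim) frame-level encoder outputs h_it
        e = self.v(torch.tanh(self.W(h)))     # scalar scores e_it, Eq. (1)
        alpha = torch.softmax(e, dim=1)       # normalized weights alpha_it, Eq. (2)
        weighted = alpha * h                  # weighted outputs, Eq. (3)
        mu = weighted.mean(dim=1)             # mean over time
        sigma = weighted.std(dim=1)           # standard deviation over time
        return torch.cat([mu, sigma], dim=1)  # segment vector s_i, Eq. (4)
```

The same module can be reused at the segment level (Section 2.2) by feeding it the TDNN-processed segment vectors instead of frame-level outputs.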

2.2 Segment Level Encoder and Attention

For the segment-level encoder and attention, the segment vector sequence $\{s_1, \dots, s_N\}$ is input to a stack of TDNN layers followed by an attention layer of the same form as described in Section 2.1. The GRU layer is omitted here, as this accelerates training when processing a large number of segments.

The output of the frame-level encoder and attention is the segment vector sequence $\{s_1, \dots, s_N\}$. After the segment-level TDNN stack, the representation of segment $i$ is denoted $g_i$, and the weight $\beta_i$ of the segment-level attention can be computed as follows [Pan2019AutomaticHA]:

$e_i = v_s^{\top} \tanh(W_s g_i + b_s), \quad \beta_i = \frac{\exp(e_i)}{\sum_{j=1}^{N} \exp(e_j)}$    (5)

where $e_i$ and $\beta_i$ are the scalar score and normalized score for each segment vector respectively, and $W_s$, $b_s$ and $v_s$ are the parameters of a two-layer MLP. An utterance-level vector $u$ is generated using statistics pooling over all weighted segment vectors $\beta_i g_i$:

$u = [\operatorname{mean}_i(\beta_i g_i) ; \operatorname{std}_i(\beta_i g_i)]$    (6)

The final speaker identity classifier is constructed using a two-layer MLP followed by a sigmoid activation function [ito1991representation] with $u$ as its input. The output is a vector $\hat{y}$ which contains a score (between 0 and 1) for each speaker. The model is trained using the binary cross-entropy loss [xu2017unsupervised]:

$\mathcal{L} = -\frac{1}{B} \sum_{n=1}^{B} \sum_{k=1}^{K} \left[ y_{nk} \log \hat{y}_{nk} + (1 - y_{nk}) \log (1 - \hat{y}_{nk}) \right]$    (7)

where $\hat{y}_{n}$ denotes the predicted score vector, $y_{n}$ denotes the reference label vector, $B$ denotes the batch size, and $K$ denotes the total number of speakers.
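
The classifier head and the loss of Eq. (7) could be sketched as follows; the 3000-dimensional input and 512-dimensional hidden layer follow Table 2, while the number of speakers shown here (254, as in SWBC) is only an example.

```python
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Two-layer MLP followed by a sigmoid, giving one score per speaker."""
    def __init__(self, in_dim=3000, hidden_dim=512, num_speakers=254):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_speakers),
        )

    def forward(self, u):
        # u: (batch, in_dim) utterance-level vector; output scores lie in (0, 1)
        return torch.sigmoid(self.net(u))

# Multi-label binary cross-entropy of Eq. (7), averaged over the batch.
criterion = nn.BCELoss()
# loss = criterion(classifier(u), multi_hot_speaker_labels)
```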

3 Experiments

3.1 Data

Figure 2: Illustration of the data construction process. (a): Concat; (b): Overlap.
Name     Original Dataset  Type       #Selected Speakers  #Utterances (Train)  #Utterances (Test)
SWBC-S   SWBC              Telephone  254                 6,000                20,000
SWBC-L   SWBC              Telephone  254                 100,000              20,000
Vox-S    Voxceleb1         Interview  1000                15,000               30,000
Vox-L    Voxceleb1         Interview  1000                150,000              30,000
Table 1: Details of the construction of the four datasets: SWBC-S, SWBC-L, Vox-S and Vox-L.

In this work, the Switchboard Cellular Part 1 (SWBC) [swb] and Voxceleb1 [nagrani2017voxceleb] datasets are used, as both of them are benchmark datasets and have been widely used in speaker identification. The SWBC dataset contains 130 hours of telephone speech from 254 speakers (129 male and 125 female) under various environmental conditions. The Voxceleb1 dataset contains 1251 speakers with more than 150,000 utterances collected in the wild. 20-dimensional MFCCs [tiwari2010mfcc] are used as the input acoustic features.
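
As an illustration only, 20-dimensional MFCCs of this kind can be extracted with a standard toolkit such as librosa; the sampling rate, frame length and hop length below are common defaults and are assumptions rather than settings reported in the paper.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=20, sr=16000):
    """Load an audio file and return MFCC frames of shape (num_frames, n_mfcc)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    # 25 ms windows with a 10 ms hop are typical defaults (an assumption here).
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T
```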

3.1.1 Data Construction

As there is no ready-made data for our task, new datasets are constructed manually using utterances from the Voxceleb1 and SWBC datasets. To conduct weakly supervised training, two scenarios are designed: Overlap and Concat. Figure 2 (a) shows an example of the Concat scenario, where the three speakers' voices are concatenated without overlap. Figure 2 (b) shows an example of the Overlap scenario, where the three speakers' voices are fully overlapped.
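
As a rough sketch of the two scenarios (not the authors' exact construction pipeline), the mixing of already-loaded waveforms could be implemented as follows; the simple gain normalisation is an assumption.

```python
import numpy as np

def make_concat(utterances):
    """Concat scenario: speakers' voices follow one another without overlap."""
    return np.concatenate(utterances)

def make_overlap(utterances):
    """Overlap scenario: speakers' voices are summed so they fully overlap."""
    min_len = min(len(u) for u in utterances)
    mixed = np.sum([u[:min_len] for u in utterances], axis=0)
    return mixed / len(utterances)  # crude gain normalisation (an assumption)
```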

Based on the two scenarios above, and in order to test the robustness of the proposed approach, four datasets are generated from SWBC and Voxceleb1 for each scenario. Table 1 shows the details of the four datasets. For the first dataset (SWBC-S, where "S" stands for small), the SWBC dataset is used and each speaker occurs 30 times on average in the training set. SWBC-L ("L" stands for large) contains more training data, with each speaker occurring 200 times on average in the training data, while the amount of test data stays the same. The small and large versions of the datasets are used to test the robustness of the proposed model with small and large amounts of training data. Similar to the configuration of the SWBC-based datasets, the datasets based on Voxceleb1 also have small and large versions. In Vox-S, 1000 speakers are randomly selected from the Voxceleb1 dataset, and each speaker occurs 30 times in the training set. In Vox-L, each speaker occurs 300 times in the training set, while the test set is the same as for Vox-S. In all eight datasets, the number of speakers in each utterance is randomly chosen from one to three.

3.2 Experiment Setup

The proposed model is compared with two baselines: X-vectors [snyder2018x] and Attentive X-vectors (Att-Xvector) [zhu2018self, okabe2018attentive, wang2018attention, rahman2018attention]. X-vectors consist of a TDNN-based frame-level feature extractor, statistics pooling and a DNN-based segment-level feature extractor. Att-Xvector adds a global attention mechanism after the TDNN-based frame-level feature extractor. The proposed approach is denoted as H-vector and is evaluated in two configurations: H-vector+sliding window and H-vector+static window. In H-vector+sliding window, the window length $M$ is set to 20 frames and the step length $H$ is set to 10 frames. In H-vector+static window, $M$ is set to 20 frames and $H$ is set equal to $M$, which means there is no overlap between local segments.
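
In terms of the hypothetical split_into_segments helper sketched in Section 2.1, the two window configurations correspond to:

```python
# utterance: a (T, 20) tensor of MFCC frames, as in the Section 2.1 sketch.

# Sliding window: 20-frame windows with a 10-frame step (adjacent segments overlap by half).
segments_sliding = split_into_segments(utterance, window=20, step=10)

# Static window: 20-frame windows with a 20-frame step (no overlap between segments).
segments_static = split_into_segments(utterance, window=20, step=20)
```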

As shown in Table 1, each of the four datasets covers two scenarios (Concat and Overlap). In the training process, for all eight datasets, the number of speakers in the generated utterances is not fixed, varying from one to three. When the number of speakers is one, the generated utterance is the same as the original utterance. When the number of speakers is two or three, the output utterance contains multiple speakers, with or without overlap.

There is no overlap between the training and test data. The length of all generated utterances is fixed at five seconds.

3.3 Evaluation Metric

In this work, the equal error rate (EER) [cheng2004method, murphy2012machine] is used as the evaluation metric, as it is widely used in multi-label audio tagging [xu2017unsupervised]. The EER is defined as the point at which the false negative rate (FNR) equals the false positive rate (FPR). The EER is computed for each individual input and averaged across the whole test set [cheng2004method].
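
One standard way of computing the EER from per-speaker scores and binary reference labels is sketched below using scikit-learn's ROC utilities; this is a generic recipe, not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false positive rate equals the false negative rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FPR is closest to FNR
    return (fpr[idx] + fnr[idx]) / 2.0
```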

3.4 Implementation

Level            Model               Input      Output
Frame-Level      TDNN                (M,20)     (M,256)
                 Bi-GRU              (M,256)    (M,512)
                 Attention           (M,512)    (M,512)
                 Statistics Pooling  (M,512)    (1,1024)
Segment-Level    TDNN1               (N,1024)   (N,512)
                 TDNN2               (N,512)    (N,512)
                 TDNN3               (N,512)    (N,1500)
                 Attention           (N,1500)   (N,1500)
                 Statistics Pooling  (N,1500)   (1,3000)
Utterance-Level  DNN (512)           (1,3000)   (1,512)
                 DNN (K)             (1,512)    (1,K)
Table 2: Architecture of the proposed hierarchical attention network, where K denotes the total number of speakers.

Table 2 shows the details of the proposed model architecture. The TDNNs in both the frame-level and segment-level encoders operate at the current time step. Batch normalization [ioffe2015batch] is added after each layer except the attention layers. The Adam optimiser [Kingma2014AdamAM] is used for all experiments.
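
As a small illustration of the batch-normalisation placement described above, one frame-level TDNN block might look like the following sketch (the kernel size and the ordering of BatchNorm and ReLU are assumptions):

```python
import torch.nn as nn

# One TDNN block with batch normalisation after the layer, per the description above.
tdnn_block = nn.Sequential(
    nn.Conv1d(20, 256, kernel_size=5, padding=2),
    nn.BatchNorm1d(256),
    nn.ReLU(),
)
```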

Figure 3: The results obtained using the four models (X-vectors, Attentive X-vectors, H-vector with static window and H-vector with sliding window) in different test conditions (1, 2, 3 or multiple speakers) on the eight designed datasets (SWBC-S, SWBC-L, Vox-S and Vox-L, each with Concat and Overlap scenarios). For all of the figures, the x-axis represents the number of speakers in each test utterance. For H-vector with a static window, the window size is 20 frames. For H-vector with a sliding window, the window size is 20 frames and the step size is 10 frames.

4 Results

Data Type  Window Size  EER (%)
                        SWBC-S  SWBC-L  Vox-S  Vox-L
Concat     10           12.56   7.15    18.29  13.69
           15           11.87   6.85    18.08  13.34
           20           11.27   6.47    17.48  13.08
           25           11.69   6.59    17.81  13.29
           30           12.11   6.92    18.21  13.66
Overlap    10           17.81   15.71   34.37  26.46
           15           16.89   15.05   33.48  25.85
           20           16.24   14.56   32.77  25.39
           25           15.99   15.58   32.26  25.94
           30           16.59   16.02   32.86  26.17
Table 3: Results of the proposed H-vector architecture using different window sizes (from 10 to 30 frames); the step size is kept at 10 frames.
Data Type  Step Size  EER (%)
                      SWBC-S  SWBC-L  Vox-S  Vox-L
Concat     5          11.95   6.74    18.01  13.65
           10         11.27   6.47    17.48  13.08
           15         11.34   6.29    17.98  12.82
           20         11.45   6.96    18.21  13.15
           25         11.86   6.84    18.56  13.42
Overlap    5          16.49   14.92   33.87  25.51
           10         16.24   14.56   32.77  25.39
           15         16.88   14.13   33.53  24.86
           20         17.22   14.82   33.92  25.46
           25         17.78   15.11   34.25  25.81
Table 4: Results of the proposed H-vector architecture using different step sizes (from 5 to 25 frames); the window size is kept at 20 frames.

Figure 3 shows the results obtained using the four models (X-vectors, Attentive X-vectors, H-vector with static window and H-vector with sliding window) in different test conditions (1, 2, 3 or multiple speakers) on the eight designed datasets (SWBC-S, SWBC-L, Vox-S and Vox-L, each with Concat and Overlap scenarios). In each figure, the x-axis represents the number of speakers in an utterance. "One", "two" and "three" denote the cases where an utterance contains only one, two or three speakers, respectively. "Multiple speakers" denotes the combination of the three cases.

H-vector+sliding window performs better in almost all conditions, and H-vector+static window performs better than the two baselines. These results show that capturing both local and global information is helpful for weakly supervised speaker identification. The X-vector results are the worst, possibly because the model treats each frame as equally important. Compared with Att-Xvector, one reason for the improvement of the proposed H-vector might be its distributed attention mechanism, as Att-Xvector only applies attention globally.

Among all of the test conditions, the best results are obtained when each utterance contains only one speaker, and the worst case is when each utterance contains three speakers. This might be due to the difficulty of the test conditions. A similar pattern appears across the two data construction scenarios (Concat and Overlap): the results obtained in the Concat scenario are better than those in the Overlap scenario. This might be because, when the speakers' voices are overlapped, it is more difficult to distinguish different speakers. Nevertheless, the proposed H-vector+sliding window performs better than the baselines across the different test conditions and data construction scenarios.

Moreover, when the training data is small, the proposed H-vector+sliding window still performs better than the baselines and H-vector+static window, reaching 11.5% and 3.4% relative improvements over X-vectors and Att-Xvectors respectively on the SWBC-S dataset in the Concat scenario. This shows the robustness of the proposed H-vector+sliding window when there is not enough training data.

In order to test the effect of the window size (M) and step size (H), Tables 3 and 4 show the results obtained using the proposed H-vector+sliding window with different window and step sizes. In the Overlap scenario, the equal error rate is more sensitive to changes in window size and step size. This might be because, in the Overlap scenario, different speakers' signals are overlapped in the time domain, so features from different speakers can interfere with each other; varying the window and step sizes allows the frame-level encoder and attention to capture more local features. Furthermore, in most cases, the best results are obtained when the window size is 20 frames and the step size is 10 frames, i.e. the step size is half the window size.

5 Conclusion and Future Work

In this work, a hierarchical attention network is proposed to solve the weakly labelled speaker identification problem. The input utterance is split into local segments using a sliding window, and the frame-level and segment-level encoders with attention capture speaker information locally and globally. Experiments are conducted with different test conditions and different amounts of training data. The obtained results show that the proposed hierarchical attention network with a sliding window performs better than the X-vector and Attentive X-vector baselines, as well as the hierarchical attention network with a static window. In future work, more complex network architectures and larger datasets such as Voxceleb2 will be investigated.

Acknowledgement

This work was in part supported by Innovate UK Grant number 104264 MAUDIE.

References