BUT VOiCES 2019 System Description

07/13/2019 ∙ by Hossein Zeinali, et al. ∙ VUT v Brně

This is a description of our effort in the VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2 relative improvement. The open condition allowed us to use external data, which we used for PLDA adaptation, achieving less than 10 improvement. In the submission to the open condition, we used three x-vector systems and one i-vector based system.




1 Introduction

This submission is a description of our effort in the VOiCES 2019 Speaker Recognition challenge [VOiCESevalplan]. Most of the systems are based on x-vectors [snyder2018x], with the exception of the i-vector subsystem for the open condition, which uses a concatenation of MFCCs and Stacked Bottleneck (SBN) features [Paja_icassp2016]. Our systems utilize different features (MFCC, PLP, Mel-filterbanks), different DNN topologies, and a Gaussian or heavy-tailed PLDA backend.

Below, we present our experimental setup and the description of individual subsystems. We list the results of individual systems together with the fusion in Table 3.

2 Experimental Setup

2.1 Training data, Augmentations

For x-vector training we used only the VoxCeleb 1 and 2 datasets, with 166 thousand audio files (distributed in 1.2 million speech segments) from 7146 speakers. We performed data augmentations based on the Kaldi recipe and created an additional 5 million segments from these augmentations.

2.1.1 Retransmitted NIST SRE10 close talk data

To perform PLDA adaptation in the open condition track, we made use of our dataset of retransmitted audio [JSTSP2019:Szoke]. Part of it has been benchmarked on the task of speaker verification in [Interspeech2018:Mosner]. A subset of the NIST 2010 Speaker Recognition Evaluation (SRE) dataset, mainly telephone recordings recorded over close-talk microphones, was replayed by an Adam Audio A7X studio monitor in numerous rooms and acoustic conditions. In each room, multiple speaker positions were considered: sitting speaker, standing speaker and non-standard positions (pointed to the ceiling, lying on the floor, etc.). In addition to naturally occurring noise such as AC, vents, or common street noise coming through windows, a noise source (radio receiver) was present in some sessions.

The corrupted audio was always simultaneously recorded by 31 microphones placed within the rooms. Synchronicity was governed by proprietary recording hardware.

The original dataset consists of 932 utterances of 30 s duration (the original files are 3 or 5 minutes long, but we took only 30 s chunks to limit the overall retransmission time). There are 459 recordings from 150 female speakers and 473 recordings from 150 male speakers. The whole set was retransmitted in 5 rooms. Changes of the loudspeaker position in some of the rooms resulted in 9 recording sessions.

2.2 Input features

We use different features for the individual systems, with the following settings:

  • Kaldi MFCC - Fsamp=16kHz, frequency limits 20-7600Hz, 25ms frame length, 40 filter banks, 30 coefficients + energy

  • HTK MFCC - Fsamp=16kHz, frequency limits 0-8kHz, 25ms frame length, 30 filter banks, 24 coefficients + energy

  • Kaldi PLP - Fsamp=16kHz, frequency limits 20-7600Hz, 25ms frame length, 40 filter banks, 30 coefficients

  • Kaldi FBank - Fsamp=16kHz, frequency limits 20-7600Hz, 25ms frame length, 40 filter banks

  • SBN - Fsamp=8kHz, 80 dimensional bottleneck features trained on Fisher English, more details in Section 2.3

The Kaldi MFCC, PLP and FBank features are processed with short-time mean normalization over a 3 s window. For HTK MFCC, short-time variance normalization is also applied.
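The short-time mean normalization described above can be illustrated with a minimal sketch. It assumes a 10 ms frame shift (so that 3 s corresponds to 300 frames), which is not stated in the text, and it truncates the window at the utterance edges; the function name `sliding_mean_norm` is hypothetical and this is not the Kaldi implementation.

```python
import numpy as np

def sliding_mean_norm(feats, win_frames=300):
    """Subtract a sliding-window mean from every frame.

    feats: array of shape (T, D), one feature vector per frame.
    win_frames: window length in frames (300 = 3 s at an ASSUMED 10 ms shift).
    """
    feats = np.asarray(feats, dtype=float)
    n_frames, dim = feats.shape
    # Prefix sums let us compute every window sum in O(1).
    csum = np.vstack([np.zeros((1, dim)), np.cumsum(feats, axis=0)])
    t = np.arange(n_frames)
    lo = np.maximum(t - win_frames // 2, 0)
    hi = np.minimum(t - win_frames // 2 + win_frames, n_frames)
    # Windows are simply truncated at the edges (a simplification).
    means = (csum[hi] - csum[lo]) / (hi - lo)[:, None]
    return feats - means
```

Variance normalization, as applied to the HTK MFCCs, would additionally divide by a standard deviation estimated over the same window.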

2.3 Stacked Bottleneck Features (SBN)

Bottleneck Neural-Network (BN-NN) refers to an NN topology in which one of the hidden layers has significantly lower dimensionality than the surrounding layers. A bottleneck feature vector is generally understood as a by-product of forwarding a primary input feature vector through the BN-NN and reading off the vector of values at the bottleneck layer. We have used a cascade of two such NNs for our experiments. The output of the first network is stacked in time, defining context-dependent input features for the second NN, hence the term Stacked Bottleneck Features.

The NN input features are 24 log Mel-scale filter bank outputs augmented with fundamental frequency features from 4 different estimators: Kaldi (http://kaldi.sourceforge.net), Snack (www.speech.kth.se/snack/), and two others according to [Laskowski:LREC:2010] and [talkin:pitch:95]. Together, we have 13 F0-related features; see [Karafiat:IS2014] for more details. Conversation-side based mean subtraction is applied to the whole feature vector. 11 frames of log filter bank outputs and fundamental frequency features are stacked together. A Hamming window followed by a DCT consisting of the 0th to 5th bases is applied to the time trajectory of each parameter, resulting in (24 + 13) × 6 = 222 coefficients at the first-stage NN input.

The first NN has a bottleneck layer of dimensionality 80, which was shown to be optimal in [Matejka:Odyssey2014], other hidden layers of dimensionality 1500, and an output layer whose size equals the number of targets. The bottleneck outputs from the first NN are sampled at five time instants around the current frame and stacked. The resulting 400-dimensional features (5 × 80) are input to the second-stage NN, which has the same topology as the first stage. The 80 bottleneck outputs from the second NN (referred to as SBN) are taken as features for the conventional GMM/UBM i-vector based SID system.
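The temporal stacking between the two networks can be sketched as follows. The five frame offsets used here are placeholders chosen only for illustration (the text fixes just the resulting dimensionality, 5 × 80 = 400), and the edge handling by index clipping and the name `stack_bottleneck` are likewise assumptions.

```python
import numpy as np

def stack_bottleneck(bn, offsets=(-10, -5, 0, 5, 10)):
    """Stack first-stage bottleneck outputs sampled at several time offsets.

    bn: array of shape (T, 80) with first-stage bottleneck outputs.
    offsets: frame offsets relative to the current frame (HYPOTHETICAL values).
    Returns an array of shape (T, 80 * len(offsets)).
    """
    bn = np.asarray(bn, dtype=float)
    n_frames = bn.shape[0]
    t = np.arange(n_frames)
    # Clip indices so that frames near the edges reuse the first/last frame.
    parts = [bn[np.clip(t + off, 0, n_frames - 1)] for off in offsets]
    return np.hstack(parts)
```

With an 80-dimensional bottleneck and five offsets, each output row has 400 dimensions, matching the input size of the second-stage NN.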

We used 8kHz SBN trained on Fisher English.

2.4 Voice Activity Detection

We used 2 VAD approaches:

VAD-Energy is the energy-based VAD from the Kaldi SRE16 recipe, used without any modification. Note that for FBank and PLP, the Kaldi VAD decisions computed on MFCC were used.

VAD-NN consists of two carefully designed parts: a neural network (NN) which produces per-frame scores, and a post-processing stage which builds the segments based on the scores.

The NN was trained on Fisher English. It has an input dimension of 288, 2 hidden layers of 400 sigmoid neurons each, and a final softmax layer with 2 outputs corresponding to the classes speech and non-speech. The NN has 277k parameters.
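As a quick sanity check of the stated model size, the parameter count of a fully connected network with these layer sizes can be computed directly. The sketch assumes every layer has a bias term; `mlp_param_count` is a hypothetical helper name.

```python
def mlp_param_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 288 inputs, two hidden layers of 400 units, 2 outputs.
n_params = mlp_param_count([288, 400, 400, 2])
print(n_params)  # 276802, i.e. roughly 277k
```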

The input features for the NN consist of 15 log-Mel filter-bank outputs and 3 Kaldi-pitch features [Ghahremani:ICASSP2014]. We apply per-speaker mean normalization estimated on the whole unsegmented recordings. Then we apply frame splicing with 31 frame-long context, where the temporal trajectory of each feature is scaled by a Hamming window and reduced to 16 dimensions by Discrete Cosine Transform. The final 288-dimensional features are globally mean and variance normalized on the NN input.
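The splicing step above (31-frame context, Hamming window, DCT to 16 coefficients per feature) can be sketched as below. With 15 + 3 = 18 base features this yields 18 × 16 = 288 dimensions. The unnormalized DCT-II basis, the edge padding and the ordering of the output coefficients are assumptions made for this illustration.

```python
import numpy as np

def dct_basis(n_coef, n_frames):
    """Unnormalized DCT-II basis, shape (n_coef, n_frames)."""
    n = np.arange(n_frames)
    k = np.arange(n_coef)[:, None]
    return np.cos(np.pi * k * (n + 0.5) / n_frames)

def splice_dct(feats, context=31, n_coef=16):
    """Compress the temporal trajectory of each feature around every frame.

    feats: array of shape (T, D). Returns an array of shape (T, D * n_coef).
    """
    feats = np.asarray(feats, dtype=float)
    n_frames, dim = feats.shape
    half = context // 2
    # Repeat the first/last frame so every frame has a full context (assumption).
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    # Hamming weighting and DCT combined into one projection matrix.
    proj = dct_basis(n_coef, context) * np.hamming(context)
    out = np.empty((n_frames, dim * n_coef))
    for t in range(n_frames):
        window = padded[t:t + context]           # (context, D)
        out[t] = (proj @ window).T.reshape(-1)   # D blocks of n_coef values
    return out
```

Global mean and variance normalization of the 288-dimensional output would follow as a separate step.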

In the post-processing, we bypass the NN output softmax function (allowing us to interpret the outputs as log-likelihoods), convert the two outputs to logit-posteriors, and smooth the score by averaging over 31 consecutive frames. In the final step, the speech segments are extracted by thresholding the logit-posterior at the value of -0.5.
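A minimal sketch of this post-processing chain is given below. It assumes that the first output column corresponds to speech and that a flat speech prior is used when converting the log-likelihoods to a logit-posterior (neither is specified in the text); the function name `vad_segments` and the edge padding are also illustrative choices.

```python
import numpy as np

def vad_segments(loglik, prior_speech=0.5, smooth=31, threshold=-0.5):
    """Turn per-frame NN outputs into speech segments.

    loglik: array (T, 2) of pre-softmax outputs, columns [speech, non-speech]
            (the column order is an ASSUMPTION).
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    loglik = np.asarray(loglik, dtype=float)
    # Log-likelihood ratio plus prior log-odds gives the logit-posterior.
    llr = loglik[:, 0] - loglik[:, 1]
    logit_post = llr + np.log(prior_speech / (1.0 - prior_speech))
    # Moving average over `smooth` frames (odd length), edges padded by repetition.
    half = smooth // 2
    padded = np.pad(logit_post, half, mode="edge")
    smoothed = np.convolve(padded, np.ones(smooth) / smooth, mode="valid")
    speech = smoothed > threshold
    # Locate the boundaries of contiguous speech runs.
    flags = np.concatenate([[False], speech, [False]])
    changes = np.flatnonzero(flags[1:] != flags[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]
```

The returned frame indices can be converted to times by multiplying with the frame shift of the feature extraction.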

3 i-vector Systems

The system is based on gender-independent i-vectors [DehakN_TASLP:2010, PLDA:kenny]. HTK MFCCs with deltas and double deltas, and SBN feature vectors, were extracted from the recordings (for SBN, the audio was downsampled to 8kHz). The final feature vector is the concatenation of both, as this combination proved to perform very well in NIST SRE [Paja_icassp2016]. This system uses VAD-NN. The universal background model (UBM) contained 2048 components and was trained on VoxCeleb 1 and 2 utterances from 7,146 speakers (450 hours). We then trained a 600-dimensional i-vector extractor. The UBM, i-vector extractor and PLDA were trained only with clean VoxCeleb data.

For the purpose of probabilistic linear discriminant analysis (PLDA) training we preprocessed all training, enroll and test data by means of single-channel weighted prediction error (WPE) dereverberation [IEEETrans2010:Nakatani] to suppress effects of room acoustic conditions.

4 x-vector Systems

All x-vector systems used VAD-Energy from the Kaldi SRE16 recipe (we did not observe a large impact on performance when using a different VAD within the x-vector paradigm, and the simple Kaldi VAD seems to perform well for x-vectors). The systems were trained in the Kaldi toolkit [povey2011kaldi] using the SRE16 recipe with the modifications described below:

  • Using different feature sets

  • Training networks with 9 epochs (instead of 3). We did not see any considerable difference with 12 epochs.

  • Using modified example generation: we used 200 frames in all training segments instead of randomizing the length between 200 and 400 frames. We also changed the generation of training examples so that it is not random and makes better use of almost all available speech from all training speakers.

  • The x-vector DNN was trained on 1.2 million speech segments from 7146 speakers plus an additional 5 million segments obtained with data augmentation. We generated around 700 archives, each containing exactly 15 training examples from each speaker (i.e. around 107K examples in each archive).

  • The architecture of the network for x-vector extraction is shown in Table 1, and that of the BIG system in Table 2.

Layer          Input        Output
frame1         5 × K        512
frame2         512          512
frame3         3 × 512      512
frame4         512          512
frame5         3 × 512      512
frame6         512          512
frame7         3 × 512      512
frame8         512          512
frame9         512          1500
stats pooling  1500         3000
segment1       3000         512
segment2       512          512
softmax        512          N
Table 1: x-vector topology proposed in [snyder2019speaker]. K in the first layer denotes the input feature dimensionality, which differs between feature sets, and N is the number of speakers.
Layer          Input        Output
frame1         5 × K        1024
frame2         1024         1024
frame3         5 × 1024     1024
frame4         1024         1024
frame5         3 × 1024     1024
frame6         1024         1024
frame7         3 × 1024     1024
frame8         1024         1024
frame9         1024         2000
stats pooling  2000         4000
segment1       4000         512
segment2       512          512
softmax        512          N
Table 2: BIG NN architecture, where K is the feature dimensionality and N is the number of speakers.
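In both tables the statistics pooling layer doubles the dimensionality (1500 to 3000, and 2000 to 4000). This is consistent with the usual x-vector pooling, which concatenates the per-dimension mean and standard deviation over all frames of a segment. A minimal sketch under that assumption (the function name is illustrative):

```python
import numpy as np

def stats_pooling(frames):
    """Pool frame-level activations into one segment-level vector.

    frames: array of shape (T, D), e.g. D = 1500 outputs of frame9.
    Returns a vector of length 2 * D: per-dimension mean, then standard deviation.
    """
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```

The pooled vector has a fixed size regardless of the number of frames, which is what allows the segment-level layers to operate on variable-length utterances.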

5 Backend

5.1 Heavy-tailed PLDA

Our i-vector system used the HT-PLDA backend [SilnovaIS18]. It was trained on the VoxCeleb 1 and 2 datasets; the training set consisted of 166 thousand audio files from 7146 speakers. Length normalization, centering, LDA reducing the dimensionality of the vectors to 300, and another length normalization were applied to all i-vectors. All i-vectors were centered using the mean computed on the training data. We fixed the size of the speaker subspace to 200. The degrees-of-freedom parameter was set to infinity at training time and to 2 at scoring time. Finally, we performed adaptive score normalization as described in Section 5.4.


5.2 Gaussian PLDA

For all x-vector based systems we trained a Gaussian PLDA backend. As in the case of HT-PLDA, we used concatenated data from VoxCeleb 1 and 2 for training. In this case, we trained the backend only on x-vectors extracted from the original utterances augmented with reverberation and noise. X-vectors extracted from the non-augmented files were not used for backend training. Centering and LDA dimensionality reduction to 250 dimensions, followed by length normalization, were applied to the x-vectors. All data were centered using the training data mean. The speaker and channel subspace size was set to 250 (i.e., full rank). As in the case of HT-PLDA, we applied adaptive score normalization described in Section 5.4.
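The preprocessing chain applied before G-PLDA (centering, LDA, length normalization) can be sketched as follows. This is a plain NumPy illustration, not the Kaldi implementation; the small ridge added to the within-speaker scatter and the function names `fit_lda` and `preprocess` are assumptions.

```python
import numpy as np

def fit_lda(X, labels, out_dim, ridge=1e-6):
    """Estimate the training mean and an LDA projection.

    X: (n, D) embeddings; labels: (n,) speaker labels.
    Returns (mean, projection) with projection of shape (D, out_dim).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mean = X.mean(axis=0)
    Xc = X - mean
    dim = X.shape[1]
    Sw = np.zeros((dim, dim))   # within-speaker scatter
    Sb = np.zeros((dim, dim))   # between-speaker scatter
    for spk in np.unique(labels):
        Xs = Xc[labels == spk]
        ms = Xs.mean(axis=0)
        Sw += (Xs - ms).T @ (Xs - ms)
        Sb += len(Xs) * np.outer(ms, ms)
    Sw += ridge * np.eye(dim)   # keeps Sw invertible (assumption)
    # Whiten Sw, then take the leading eigenvectors of the whitened Sb.
    evals, evecs = np.linalg.eigh(Sw)
    whiten = evecs / np.sqrt(evals)
    evals_b, evecs_b = np.linalg.eigh(whiten.T @ Sb @ whiten)
    top = np.argsort(evals_b)[::-1][:out_dim]
    return mean, whiten @ evecs_b[:, top]

def preprocess(X, mean, lda):
    """Center with the training mean, project with LDA, length-normalize."""
    Y = (np.asarray(X, dtype=float) - mean) @ lda
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)
```

In the setup described above, `out_dim` would be 250 for x-vectors; the same mean and projection estimated on the training data are then applied to enrollment and test vectors.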

5.3 Adaptation (ADAPT)

For the open condition, we used 280k files of BUT retransmitted data (see Section 2.1.1) to perform domain adaptation by model interpolation. That is, we trained a smaller G-PLDA model on the retransmitted data, with the size of both the speaker and channel subspaces fixed to 150. The final adapted model is derived from the two G-PLDA models so that the modeled within- and across-speaker covariance matrices are a weighted combination of the covariance matrices from the constituent models. Similarly, the model means are also interpolated. Interpolation weights are set to 0.6 for the original model and 0.4 for the adaptation one. The systems which use this adaptation are denoted ADAPT in Table 3.
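The interpolation can be written down in a few lines. The sketch below assumes each G-PLDA model is represented by a mean vector and full within- and across-speaker covariance matrices of the same dimensionality; the dictionary layout and the name `interpolate_plda` are hypothetical.

```python
import numpy as np

def interpolate_plda(orig, adapt, w_orig=0.6):
    """Interpolate two G-PLDA models parameter by parameter.

    orig, adapt: dicts with keys "mean" (D,), "within" (D, D), "across" (D, D).
    w_orig: weight of the original model; the adaptation model gets 1 - w_orig.
    """
    w_adapt = 1.0 - w_orig
    return {key: w_orig * np.asarray(orig[key], dtype=float)
                 + w_adapt * np.asarray(adapt[key], dtype=float)
            for key in ("mean", "within", "across")}
```

A convex combination of symmetric positive semi-definite matrices stays symmetric positive semi-definite, so the interpolated covariances remain valid even when the adaptation model has a lower-rank subspace.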


5.4 Score normalization

We used adaptive symmetric score normalization (adapt S-norm), which computes an average of the normalized scores from Z-norm and T-norm [PLDA:kenny, ICSLP:Matejka]. In the adaptive version [ICSLP:Matejka, ICASSP2005:Sturim, Odyssey2006:Zigel], only part of the cohort is selected to compute the mean and variance for normalization. Usually the X top-scoring or most similar files are selected; X is set to 400 for all experiments. The cohort is created from the training data and consists of approximately 15k files (two files per speaker).
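For a single trial, adaptive S-norm can be sketched as below. The sketch assumes that the cohort scores of the enrollment and test sides have already been computed and that the top-scoring cohort files are selected separately for each side; the function name is illustrative.

```python
import numpy as np

def adaptive_snorm(score, enroll_cohort, test_cohort, top_n=400):
    """Adaptive symmetric score normalization of one trial score.

    score: raw score of the trial (enrollment vs. test).
    enroll_cohort: scores of the enrollment side against all cohort files.
    test_cohort: scores of the test side against all cohort files.
    top_n: number of top-scoring cohort files used for the statistics.
    """
    def top_stats(cohort_scores):
        top = np.sort(np.asarray(cohort_scores, dtype=float))[-top_n:]
        return top.mean(), top.std()

    mu_e, sd_e = top_stats(enroll_cohort)
    mu_t, sd_t = top_stats(test_cohort)
    # Average of the two one-sided normalizations.
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)
```

With a cohort of about 15k files and `top_n=400`, only the cohort files most similar to each side of the trial contribute to the normalization statistics.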

6 Calibration & Fusion

                                                   VOiCES dev               SITW core-core
Cond.  #        System          VAD     FEA        MinDCF  PRBEP   EER      MinDCF  PRBEP  EER
fixed  1        x-vector        Kaldi   FBANK      0.141   1908.8  1.23     0.188   461.4  1.80
fixed  2        x-vector        Kaldi   PLP        0.163   2204.3  1.44     0.191   464.6  1.92
fixed  3        x-vector BIG    Kaldi   MFCC       0.163   2186.8  1.29     0.177   430.2  1.77
open   4        i-vector        VAD-NN  MFCC+SBN   0.428   5911.6  4.46     0.275   693.4  3.19
open   5        x-vector ADAPT  Kaldi   FBANK      0.146   1954.0  1.13     0.202   495.8  1.99
open   6        x-vector ADAPT  Kaldi   PLP        0.157   2123.3  1.31     0.195   481.3  2.11
fixed  1+2+3    PRIMARY 1                          0.122   1647.1  1.04     0.17    427.3  1.65
fixed  1        CONTRASTIVE 2                      0.141   1908.8  1.23     0.188   461.4  1.80
open   3+4+5+6  PRIMARY 1                          0.119   1596.1  1.00     0.17    432.1  1.73

Table 3: Development results

The submission strategy was one common fusion trained on the labeled VOiCES development data [VOiCEScorpus, VOiCESevalplan]. Each system provided log-likelihood ratio scores that could be subjected to score normalization. These scores were first pre-calibrated and then passed into the fusion. The output of the fusion was then re-calibrated.

Both calibration and fusion were trained with logistic regression optimizing the cross-entropy between the hypothesized and true labels on a development set. Our objective was to improve the error rate on the development set itself, but we were also monitoring error-rate trends on the Speakers In The Wild dataset. Results of individual systems and fusions are listed in Table 3.
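The three-stage scheme (pre-calibration, fusion, re-calibration) can be illustrated with a small NumPy sketch. It uses plain gradient descent on an unweighted cross-entropy, whereas practical calibration toolkits typically use a prior-weighted objective and a better optimizer; all function names, the learning rate and the iteration count are assumptions.

```python
import numpy as np

def train_logreg(scores, labels, lr=0.1, n_iter=2000):
    """Fit an affine map w.x + b by minimizing cross-entropy.

    scores: (n,) or (n, k) array of system scores; labels: 1 = target, 0 = non-target.
    """
    X = np.asarray(scores, dtype=float)
    X = X[:, None] if X.ndim == 1 else X
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                       # gradient of cross-entropy w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def apply_logreg(scores, w, b):
    X = np.asarray(scores, dtype=float)
    X = X[:, None] if X.ndim == 1 else X
    return X @ w + b

def fit_fusion(dev_scores, labels):
    """Train pre-calibration, fusion and re-calibration on development scores.

    dev_scores: list of (n,) arrays, one per system. Returns a scoring function.
    """
    pre = [train_logreg(s, labels) for s in dev_scores]
    stacked = np.column_stack([apply_logreg(s, *p) for s, p in zip(dev_scores, pre)])
    fuse = train_logreg(stacked, labels)
    recal = train_logreg(apply_logreg(stacked, *fuse), labels)

    def score(system_scores):
        cal = np.column_stack([apply_logreg(s, *p) for s, p in zip(system_scores, pre)])
        return apply_logreg(apply_logreg(cal, *fuse), *recal)

    return score
```

The returned `score` function can be applied to evaluation trials of the same systems, so that all parameters are estimated on the development set only.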