This submission describes our effort in the VOiCES 2019 Speaker Recognition challenge [VOiCESevalplan]. Most of the systems are based on x-vectors [snyder2018x], with the exception of the i-vector subsystem for the open condition, which uses a concatenation of MFCC and Stacked Bottleneck (SBN) features [Paja_icassp2016]. Our systems utilize different features (MFCC, PLP, Mel-filterbanks), DNN topologies, and Gaussian or heavy-tailed PLDA backends.
Below, we present our experimental setup and the description of individual subsystems. We list the results of individual systems together with the fusion in Table 3.
2 Experimental Setup
2.1 Training data, Augmentations
For x-vector training we used only the VoxCeleb 1 and 2 datasets with 166 thousand audio files (distributed in 1.2 million speech segments) from 7146 speakers. We performed data augmentations based on the Kaldi recipe and created an additional 5 million segments from these augmentations.
2.1.1 Retransmitted NIST SRE10 close talk data
In order to perform PLDA adaptation based on training data in the open condition track, we made use of our dataset of retransmitted audio [JSTSP2019:Szoke]. Part of it has been benchmarked on the task of speaker verification in [Interspeech2018:Mosner]. A subset of the NIST 2010 Speaker Recognition Evaluation (SRE) dataset (mainly telephone recordings captured over close-talk microphones) was replayed by an Adam Audio A7X studio monitor in numerous rooms and acoustic conditions. In each room, multiple speaker positions were considered: sitting speaker, standing speaker, and non-standard positions (pointed to the ceiling, lying on the floor, etc.). In addition to naturally occurring noise such as AC, vents, or common street noise coming through windows, a noise source (radio receiver) was present in some sessions.
The corrupted audio was always simultaneously recorded by 31 microphones placed within the rooms. Synchronicity was governed by proprietary recording hardware.
The original dataset consists of 932 utterances of 30-second duration (the original files are 5 or 3 minutes long, but we took only 30 s chunks to limit overall retransmission time). There are 459 recordings from 150 female speakers and 473 recordings from 150 male speakers. The whole set was retransmitted in 5 rooms; changes of the loudspeaker positions in some of the rooms resulted in 9 recording sessions.
2.2 Input features
We use different features for the individual systems, with the following settings:
Kaldi MFCC - Fsamp=16kHz, frequency limits 20-7600Hz, 25ms frame length, 40 filter banks, 30 coefficients + energy
HTK MFCC - Fsamp=16kHz, frequency limits 0-8kHz, 25ms frame length, 30 filter banks, 24 coefficients + energy
Kaldi PLP - Fsamp=16kHz, frequency limits 20-7600Hz, 25ms frame length, 40 filter banks, 30 coefficients
Kaldi FBank - Fsamp=16kHz, frequency limits 20-7600Hz, 25ms frame length, 40 filter banks
SBN - Fsamp=8kHz, 80 dimensional bottleneck features trained on Fisher English, more details in Section 2.3
The Kaldi MFCC, PLP, and FBank features are processed with short-time mean normalization over a 3 s window. For HTK MFCC, short-time variance normalization is also applied.
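As an illustration of the short-time mean normalization applied above, a minimal sketch (assuming a 10 ms frame shift, so the 3 s window corresponds to roughly 300 frames; the function name and edge handling are illustrative, not the actual Kaldi implementation):

```python
import numpy as np

def short_time_mean_norm(feats, win=300):
    """Subtract a per-coefficient mean computed over a sliding window.

    feats: (T, D) feature matrix; win: window length in frames
    (300 frames ~ 3 s at a 10 ms frame shift).  At the utterance
    edges the window is simply truncated.
    """
    T = feats.shape[0]
    half = win // 2
    out = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```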
2.3 Stacked Bottleneck Features (SBN)
Bottleneck Neural-Network (BN-NN) refers to a NN topology in which one of the hidden layers has significantly lower dimensionality than the surrounding layers. A bottleneck feature vector is generally understood as a by-product of forwarding a primary input feature vector through the BN-NN and reading off the vector of values at the bottleneck layer. We have used a cascade of two such NNs for our experiments. The output of the first network is stacked in time, defining context-dependent input features for the second NN, hence the term Stacked Bottleneck features.
The NN input features are 24 log Mel-scale filter bank outputs augmented with fundamental frequency features from 4 different estimators (Kaldi, http://kaldi.sourceforge.net; Snack, www.speech.kth.se/snack/; and two others according to [Laskowski:LREC:2010] and [talkin:pitch:95]). Together, we have 13 F0-related features; see [Karafiat:IS2014] for more details. Conversation-side based mean subtraction is applied on the whole feature vector. 11 frames of log filter bank outputs and fundamental frequency features are stacked together. A Hamming window followed by a DCT consisting of the 0th to 5th bases is applied to the time trajectory of each parameter, resulting in the coefficients on the first-stage NN input.
The configuration of the first NN is D × 1500 × 1500 × 80 × 1500 × K, where D is the input dimension and K is the number of targets. The dimensionality of the bottleneck layer was fixed to 80; this was shown to be optimal in [Matejka:Odyssey2014]. The dimensionality of the other hidden layers was set to 1500. The bottleneck outputs from the first NN are sampled at times t−10, t−5, t, t+5 and t+10, where t is the index of the current frame. The resulting 400-dimensional features are input to the second-stage NN with the same topology as the first stage. The 80 bottleneck outputs from the second NN (referred to as SBN) are taken as features for the conventional GMM/UBM i-vector based SID system.
We used 8kHz SBN trained on Fisher English.
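The stacking of first-stage bottleneck outputs into the 400-dimensional second-stage input can be sketched as follows (clamping out-of-range frames to the utterance edges is an assumption; the exact padding is not specified above):

```python
import numpy as np

def stack_bn_outputs(bn, offsets=(-10, -5, 0, 5, 10)):
    """Stack first-stage bottleneck outputs at the given frame offsets.

    bn: (T, 80) bottleneck features; returns (T, 400) second-stage
    input (5 offsets x 80 dims).  Frame indices outside the utterance
    are clamped to the first/last frame.
    """
    T = bn.shape[0]
    idx = np.arange(T)
    cols = [bn[np.clip(idx + o, 0, T - 1)] for o in offsets]
    return np.concatenate(cols, axis=1)
```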
2.4 Voice Activity Detection
We used 2 VAD approaches:
VAD-Energy is the energy-based VAD from the Kaldi SRE16 recipe, without any modification. Note that for FBank and PLP features, the Kaldi VAD computed on MFCCs was used.
VAD-NN consists of two carefully designed parts: a neural network (NN) which produces per-frame scores, and a post-processing stage which builds the segments based on the scores.
The NN was trained on Fisher English. The input dimension is 288; there are 2 hidden layers, each of 400 sigmoid neurons, and the final softmax layer has 2 outputs corresponding to the speech and non-speech classes. The NN has 277k parameters.
The input features for the NN consist of 15 log-Mel filter-bank outputs and 3 Kaldi-pitch features [Ghahremani:ICASSP2014]. We apply per-speaker mean normalization estimated on the whole unsegmented recordings. Then we apply frame splicing with 31 frame-long context, where the temporal trajectory of each feature is scaled by a Hamming window and reduced to 16 dimensions by Discrete Cosine Transform. The final 288-dimensional features are globally mean and variance normalized on the NN input.
In the post-processing, we bypass the NN output softmax function (allowing us to interpret the outputs as log-likelihoods), convert the two outputs to a logit-posterior, and smooth the score by averaging over 31 consecutive frames. In the final step, speech segments are extracted by thresholding the smoothed posterior at the value of -0.5.
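A minimal sketch of this post-processing (the assumption that the first output column is speech and the second non-speech, as well as the function name, are illustrative):

```python
import numpy as np

def vad_postprocess(pre_softmax, smooth=31, threshold=-0.5):
    """Turn per-frame NN outputs into a speech/non-speech mask.

    pre_softmax: (T, 2) pre-softmax NN outputs, interpreted as
    log-likelihoods; column 0 = speech, column 1 = non-speech
    (an assumed ordering).  Their difference is the logit-posterior
    under equal priors, which is smoothed by a moving average over
    `smooth` frames and thresholded.
    """
    logit = pre_softmax[:, 0] - pre_softmax[:, 1]
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(logit, kernel, mode="same")
    return smoothed > threshold
```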
3 i-vector Systems
The system is based on gender-independent i-vectors [DehakN_TASLP:2010, PLDA:kenny]. HTK MFCC with deltas and double deltas and SBN feature vectors were extracted from the recordings (which were downsampled to 8 kHz for SBN extraction). The final feature vector is the concatenation of both, as this combination proved to perform very well in NIST SRE [Paja_icassp2016]. This system uses VAD-NN. The universal background model (UBM) contained 2048 components and was trained on VoxCeleb 1 and 2 utterances from 7,146 speakers (450 hours). We then trained a 600-dimensional i-vector extractor. The UBM, i-vector extractor, and PLDA were trained only on clean VoxCeleb data.
For the purpose of probabilistic linear discriminant analysis (PLDA) training, we preprocessed all training, enrollment, and test data by means of single-channel weighted prediction error (WPE) dereverberation [IEEETrans2010:Nakatani] to suppress the effects of room acoustics.
4 x-vector Systems
All x-vector systems used VAD-Energy from the Kaldi SRE16 recipe (we did not find a big impact on performance from using different VADs within the x-vector paradigm; the simple Kaldi VAD seems to perform well for x-vectors). The systems were trained in the Kaldi toolkit [povey2011kaldi] using the SRE16 recipe with the modifications described below:
Using different feature sets
Training the networks for 9 epochs (instead of 3). We did not see any considerable difference with 12 epochs.
Using modified example generation: we used 200 frames for all training segments instead of randomizing the length between 200 and 400 frames. We also changed the generation of training examples so that it is not random and makes better use of almost all available speech from all training speakers.
The x-vector DNN was trained on 1.2 million speech segments from 7146 speakers plus an additional 5 million segments obtained with data augmentation. We generated around 700 archives, each of which contains exactly 15 training examples from each speaker (i.e., around 107k examples per archive).
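The per-archive count follows from 7146 speakers × 15 examples = 107,190 ≈ 107k. A hypothetical sketch of such deterministic example generation (all names and the exact cycling scheme are illustrative; the actual Kaldi-based implementation differs in detail):

```python
from collections import defaultdict

def make_archives(segments_by_spk, n_archives=700, per_spk=15, frames=200):
    """Build archives with exactly `per_spk` fixed-length (200-frame)
    examples per speaker.  A per-speaker cursor cycles through each
    speaker's segments so that, across archives, almost all available
    speech from every speaker is eventually used.
    """
    cursors = defaultdict(int)
    archives = []
    for _ in range(n_archives):
        archive = []
        for spk, segs in segments_by_spk.items():
            for _ in range(per_spk):
                seg = segs[cursors[spk] % len(segs)]
                cursors[spk] += 1
                archive.append((spk, seg, frames))
        archives.append(archive)
    return archives
```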
| Layer | Layer context | (Input) output |
| frame1 | | (5 × K) 512 |
| frame3 | | (3 × 512) 512 |
| frame5 | | (3 × 512) 512 |
| frame7 | | (3 × 512) 512 |
| stats pooling | | (1500) 3000 |
| Layer | Layer context | (Input) output |
| frame1 | | (5 × K) 1024 |
| frame3 | | (5 × 1024) 1024 |
| frame5 | | (3 × 1024) 1024 |
| frame7 | | (3 × 1024) 1024 |
| stats pooling | | (2000) 4000 |
5.1 Heavy-tailed PLDA
Our i-vector system used an HT-PLDA backend [SilnovaIS18]. It was trained on the VoxCeleb 1 and 2 datasets; the training set consisted of 166 thousand audio files from 7146 speakers. Length normalization, centering, LDA dimensionality reduction to 300, followed by another length normalization were applied to all i-vectors. All i-vectors were centered using the mean computed on the training data. We fixed the size of the speaker subspace to 200. The degrees-of-freedom parameter was set to infinity at training time and to 2 at scoring time. Finally, we performed adaptive score normalization as described in Section 5.4.
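The i-vector preprocessing chain described in Section 5.1 can be sketched as follows (assuming `mean` and the LDA projection `lda` were estimated on the training set; names are illustrative):

```python
import numpy as np

def preprocess_ivectors(ivecs, mean, lda, dim=300):
    """Length-normalize, center with the training-data mean, project
    with LDA to `dim` dimensions, then length-normalize again.

    ivecs: (N, D) i-vectors; mean: (D,); lda: (D, D') projection whose
    leading columns are the LDA directions (an assumed layout).
    """
    x = ivecs / np.linalg.norm(ivecs, axis=1, keepdims=True)
    x = x - mean
    x = x @ lda[:, :dim]
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```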
5.2 Gaussian PLDA
For all x-vector based systems, we trained a Gaussian PLDA backend. As in the case of HT-PLDA, we used concatenated data from VoxCeleb 1 and 2 for training. In this case, we trained the backend only on x-vectors extracted from the original utterances augmented with reverberation and noise; x-vectors extracted from the non-augmented files were not used for backend training. Centering, LDA dimensionality reduction to 250 dimensions, followed by length normalization were applied to the x-vectors. All data were centered using the training-data mean. The speaker and channel subspace sizes were set to 250 (i.e., full rank). As in the case of HT-PLDA, we applied the adaptive score normalization described in Section 5.4.
5.3 Adaptation (ADAPT)
For the open condition, we used 280k files of BUT retransmitted data (see Section 2.1.1) to perform domain adaptation by model interpolation. That is, we trained a smaller G-PLDA model on the retransmitted data; the sizes of both the speaker and channel subspaces were fixed to 150. The final adapted model is derived from the two G-PLDA models so that the modeled within- and across-speaker covariance matrices are a weighted combination of the covariance matrices of the constituent models. Similarly, the model means are also interpolated. The interpolation weights are set to 0.6 for the original model and 0.4 for the adaptation one. The systems which use this adaptation are denoted ADAPT in Table 3.
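The interpolation itself is a straightforward convex combination of the model parameters; a sketch (B and W denote the across- and within-speaker covariance matrices; all names are illustrative):

```python
import numpy as np

def interpolate_plda(mu1, B1, W1, mu2, B2, W2, alpha=0.6):
    """Model-level PLDA adaptation: the adapted mean, across-speaker
    covariance (B), and within-speaker covariance (W) are weighted
    combinations of two G-PLDA models, with weight `alpha` on the
    original model and (1 - alpha) on the adaptation one.
    """
    mu = alpha * mu1 + (1 - alpha) * mu2
    B = alpha * B1 + (1 - alpha) * B2
    W = alpha * W1 + (1 - alpha) * W2
    return mu, B, W
```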
5.4 Score normalization
We used adaptive symmetric score normalization (adaptive S-norm), which computes an average of normalized scores from Z-norm and T-norm [PLDA:kenny, ICSLP:Matejka]. In the adaptive version [ICSLP:Matejka, ICASSP2005:Sturim, Odyssey2006:Zigel], only part of the cohort is selected to compute the mean and variance for normalization: usually the top-scoring or most similar files are selected, and their number is set to 400 for all experiments. The cohort is created from the training data and consists of approximately 15k files (two files per speaker).
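A minimal sketch of adaptive S-norm under these settings (selecting the top-scoring cohort files on each side is one common variant; the function name and interface are illustrative):

```python
import numpy as np

def adapt_snorm(score, enroll_cohort, test_cohort, top=400):
    """Adaptive symmetric score normalization.

    score: raw trial score; enroll_cohort / test_cohort: scores of the
    enrollment (resp. test) vector against all cohort files.  Z-norm
    statistics come from the top-scoring cohort files on the enrollment
    side, T-norm statistics from the test side; the two normalized
    scores are averaged.
    """
    ez = np.sort(enroll_cohort)[-top:]
    tz = np.sort(test_cohort)[-top:]
    z = (score - ez.mean()) / ez.std()
    t = (score - tz.mean()) / tz.std()
    return 0.5 * (z + t)
```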
6 Calibration & Fusion
The submission strategy was one common fusion trained on the labeled VOiCES development data [VOiCEScorpus, VOiCESevalplan]. Each system provided log-likelihood ratio scores that could be subjected to score normalization. These scores were first pre-calibrated and then passed into the fusion, whose output was then re-calibrated again.
Both calibration and fusion were trained with logistic regression, optimizing the cross-entropy between the hypothesized and true labels on a development set. Our objective was to improve the error rate on the development set itself, but we also monitored error-rate trends on the Speakers In The Wild dataset. Results of individual systems and fusions are listed in Table 3.
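A minimal numpy sketch of such a linear score fusion trained with logistic regression (plain gradient descent stands in for whatever optimizer was actually used; names are illustrative):

```python
import numpy as np

def train_fusion(scores, labels, lr=0.1, iters=2000):
    """Train a linear fusion of per-system scores by logistic
    regression, minimizing cross-entropy between hypothesized and
    true labels on a development set.

    scores: (n_trials, n_systems) pre-calibrated scores;
    labels: (n_trials,) with 1 = target, 0 = non-target.
    Returns a function mapping score vectors to fused scores w.s + b.
    """
    X = np.hstack([scores, np.ones((len(scores), 1))])  # append bias
    w = np.zeros(X.shape[1])
    y = np.asarray(labels, dtype=float)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid
        w -= lr * X.T @ (p - y) / len(y)       # cross-entropy gradient
    return lambda s: np.atleast_2d(s) @ w[:-1] + w[-1]
```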