In recent years, there have been many attempts to take advantage of neural networks (NNs) in speaker verification (SV). They slowly found their way into the state-of-the-art systems that are based on modeling the fixed-length utterance representations, such as i-vectors (Dehak et al., 2011), by Probabilistic Linear Discriminant Analysis (PLDA) (Prince, 2007).
Most of the efforts to integrate NNs into the SV pipeline involved replacing or improving one or more components of an i-vector + PLDA system (feature extraction, calculation of sufficient statistics, i-vector extraction or PLDA classifier) with a neural network. On the front-end level, let us mention for example using NN bottleneck features (BNF) instead of conventional Mel Frequency Cepstral Coefficients (MFCCs, Lozano-Diez et al., 2016), or simply concatenating BNFs and MFCCs (Matějka et al., 2016), which greatly improves the performance and increases system robustness. Higher in the modeling pipeline, NN acoustic models can be used instead of Gaussian Mixture Models (GMMs) for the extraction of sufficient statistics (Lei et al., 2014), for complementing PLDA (Novoselov et al., 2015; Bhattacharya et al., 2016), or for replacing it (Ghahabi and Hernando, 2014).
These lines of work have logically resulted in attempts to train a larger DNN directly for the SV task, i.e., binary classification of two utterances as a target or a non-target trial (Heigold et al., 2016; Zhang et al., 2016; Snyder et al., 2016; Rohdin et al., 2018). Such architectures are known as end-to-end systems and have been proven competitive for text-dependent tasks (Heigold et al., 2016; Zhang et al., 2016) as well as text-independent tasks with short test utterances and an abundance of training data (Snyder et al., 2016). In text-independent tasks with longer utterances and a moderate amount of training data, the i-vector inspired end-to-end system (Rohdin et al., 2018) already outperforms generative baselines, but at the cost of high memory and computational demands during training.
While fully end-to-end SV systems have struggled with large training-data requirements (often not satisfiable by researchers) and high computational costs, the focus in SV has shifted back to generative modeling, but now with utterance representations obtained from a single NN. Such an NN takes the frame-level features of an utterance as input and directly produces an utterance-level representation, usually referred to as an embedding (Variani et al., 2014; Heigold et al., 2016; Zhang et al., 2016; Bhattacharya et al., 2017; Snyder et al., 2017). The embedding is obtained by means of a pooling mechanism (for example taking the mean) over the frame-wise outputs of one or more layers in the NN (Variani et al., 2014), or by the use of a recurrent NN (Heigold et al., 2016). One effective approach is to train the NN for classifying a set of training speakers, i.e., using multiclass training (Variani et al., 2014; Bhattacharya et al., 2017; Snyder et al., 2017). In order to do SV, the embeddings are extracted and used in a standard backend, e.g., PLDA. Such systems have recently been proven superior to i-vectors for both short and long utterance durations in text-independent SV (Snyder et al., 2017, 2018).
Hand in hand with the development of new modeling techniques that increase the performance of SV on particular benchmarks comes a requirement to continuously verify the stability and improve the robustness of the SV system under various scenarios and acoustic conditions. One of the most important properties of a robust system is the ability to cope with distortions caused by noise, reverberation and the transmission channel itself. In SV, one way is to tackle this problem in the late modeling stage and use multi-condition training (Martínez et al., 2014; Lei et al., 2012) of PLDA, where we introduce noise and reverberation variability into the within-class variability of speakers. This approach can be further combined with domain adaptation (Glembek et al., 2014), which requires a certain amount of usually unsupervised target data. In the very last stage of the system, SV outputs can be adjusted on a per-trial basis via various kinds of adaptive score normalization (Sturim and Reynolds, 2005; Matějka et al., 2017; Swart and Brümmer, 2017).
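For illustration, adaptive symmetric score normalization can be sketched in a few lines of numpy. This is a simplified sketch under our own assumptions (the cohort sizes and top-k selection are hypothetical), not the exact procedure of the cited works:

```python
import numpy as np

def adaptive_s_norm(score, enroll_cohort, test_cohort, top_k=200):
    """Adaptive symmetric score normalization (s-norm) sketch.

    `enroll_cohort` / `test_cohort` hold raw scores of the enrollment and
    test sides against a cohort; only the top-k highest cohort scores
    are used to estimate the normalization statistics.
    """
    e = np.sort(enroll_cohort)[-top_k:]
    t = np.sort(test_cohort)[-top_k:]
    return 0.5 * ((score - e.mean()) / (e.std() + 1e-8) +
                  (score - t.mean()) / (t.std() + 1e-8))

rng = np.random.default_rng(2)
enroll_cohort = rng.normal(0.0, 1.0, 1000)  # hypothetical cohort scores
test_cohort = rng.normal(0.0, 1.0, 1000)
normed = adaptive_s_norm(3.0, enroll_cohort, test_cohort)
```

Selecting only the most similar (top-k) cohort scores is what makes the normalization "adaptive" to the trial at hand.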
Another way to increase the robustness is to focus on the quality of the input acoustic signal and enhance it before it enters the SV system. Several techniques were introduced in the field of microphone arrays, such as active noise canceling, beamforming and filtering (Kumatani et al., 2012). For single-microphone systems, front-ends utilize signal pre-processing methods, for example Wiener filtering, adaptive voice activity detection (VAD), gain control, etc. (ETSI, 2007). Various designs of robust features (Plchot et al., 2013) can also be used in combination with normalization techniques such as cepstral mean and variance normalization or short-time Gaussianization (Pelecanos and Sridharan, 2006).
At the same time as DNNs were finding their way into the basic components of SV systems, interest in NNs also increased in the field of signal pre-processing and speech enhancement. An example of a classical approach to removing a room impulse response is proposed in Dufera and Shimamura (2009), where the filter is estimated by an NN. NNs have also been used for speech separation in Yanhui et al. (2014). An NN-based autoencoder for speech enhancement was proposed in Xu et al. (2014a), with optimization in Xu et al. (2014b), and finally, reverberant speech recognition with signal enhancement by a deep autoencoder was tested in the CHiME Challenge and presented in Mimura et al. (2014).
In this work, we focus on improving the robustness of SV via a DNN autoencoder used as an audio pre-processing front-end. The autoencoder is trained to learn a mapping from noisy and reverberated speech to clean speech. The frame-by-frame aligned examples for DNN training are artificially created by adding noise and reverberation to the Fisher speech corpus. The resulting SV systems are tested on both real and simulated data. The real data cover telephone conversations (NIST SRE2010 and SRE2016) and speech recorded over various microphones (NIST SRE2010, PRISM, Speakers In The Wild - SITW). Simulated data are created to produce challenging conditions, either by adding noise and reverberation to the clean microphone data, or by re-transmitting the clean telephone and microphone data to obtain naturally reverberated data.
After exploring the benefits of DNN-based audio pre-processing with standard generative SV systems based on i-vectors and PLDA, we attempt to improve an already stronger baseline system in which a DNN replaces the crucial i-vector extraction step. We use the architecture proposed in Snyder (2017) and Snyder et al. (2017), which already presents the x-vector (the embedding) as a robust feature for PLDA modeling and provides state-of-the-art results across various acoustic conditions (Novotný et al., 2018b). We experiment with using the denoising autoencoder as a pre-processing step either when training the x-vector extractor or only during the test stage. To further compare with the best i-vector system, we also experiment with using SBN features concatenated with MFCCs to train our x-vector extractor.
Finally, we offer experimental evidence and thorough analysis to demonstrate that DNN-based signal enhancement increases the performance of text-independent speaker verification for both i-vector- and x-vector-based systems. We further combine the proposed method with multi-condition training, which can significantly improve SV performance, and we show that we can profit from the combination of both techniques.
2 Speaker Recognition Systems (SRE)
In this work, we compare four systems, combining two feature extraction techniques (MFCCs, and Stacked Bottleneck features (SBNs) concatenated with MFCCs) and two front-end modelling techniques (i-vectors and x-vectors), defined in Matějka et al. (2014), Kenny (2010), Dehak et al. (2011) and Snyder et al. (2017). Please note that each of the modeling techniques uses a slightly different MFCC extraction; see the descriptions below for details.
After feature extraction, voice activity detection (VAD) was performed by the BUT Czech phoneme recognizer, described in Matějka et al. (2006), dropping all frames that are labeled as silence or noise. The recognizer was trained on Czech CTS data, but we have added noise with varying SNR to 30% of the database. This VAD was used both in the hyper-parameter training, as well as in the testing phase.
In all cases, the speaker verification score was produced by comparing the two i-vectors (or x-vectors) corresponding to the segments in the verification trial with Probabilistic Linear Discriminant Analysis (PLDA; Kenny, 2010).
2.1 MFCC i-vector system
In this system, we used cepstral features extracted using a 25 ms Hamming window. We used 24 Mel-filter banks and we limited the bandwidth to the 120–3800 Hz range. 19 MFCCs together with the zeroth coefficient were calculated every 10 ms. This 20-dimensional feature vector was subjected to short-time mean- and variance-normalization using a 3 s sliding window. Delta and double-delta coefficients were then calculated using a five-frame window, resulting in a 60-dimensional feature vector.
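As an illustrative numpy sketch (not the actual front-end code used here), the short-time normalization and delta computation just described could look as follows; the window sizes match the text (a 3 s window is roughly 300 frames at a 10 ms shift):

```python
import numpy as np

def short_time_mvn(feats, window=300):
    """Short-time mean/variance normalization over a sliding window.

    feats: (T, D) array of frame-level features; a 3 s window at a
    10 ms frame shift corresponds to ~300 frames.
    """
    T = feats.shape[0]
    half = window // 2
    out = np.empty_like(feats, dtype=float)
    for t in range(T):
        seg = feats[max(0, t - half):min(T, t + half + 1)]
        out[t] = (feats[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out

def deltas(feats, width=5):
    """Delta coefficients over a five-frame window (regression formula)."""
    half = width // 2  # 2 for a five-frame window
    padded = np.pad(feats, ((half, half), (0, 0)), mode='edge')
    num = sum(k * (padded[half + k:len(feats) + half + k] -
                   padded[half - k:len(feats) + half - k])
              for k in range(1, half + 1))
    return num / (2 * sum(k * k for k in range(1, half + 1)))

# 20 static MFCCs -> 60-dimensional vector with deltas and double deltas
mfcc = np.random.randn(500, 20)
norm = short_time_mvn(mfcc)
d = deltas(norm)
dd = deltas(d)
features = np.hstack([norm, d, dd])  # shape (500, 60)
```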
The acoustic modelling in this system is based on i-vectors. To train the i-vector extractor, we use 2048-component diagonal-covariance Universal Background Model (GMM-UBM), and we set the dimensionality of i-vectors to 600. We then apply LDA to reduce the dimensionality to 200. Such i-vectors are then centered around a global mean followed by length normalization (Dehak et al., 2011; Garcia-Romero and Espy-Wilson, 2011).
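The i-vector post-processing chain described above (LDA projection to 200 dimensions, centering, length normalization) can be sketched as follows; the LDA matrix and global mean are random placeholders standing in for quantities estimated on training data:

```python
import numpy as np

def postprocess_ivectors(ivecs, lda, mean):
    """LDA projection (600 -> 200), global-mean centering and length
    normalization, applied to raw i-vectors before PLDA scoring."""
    x = ivecs @ lda                                      # reduce to 200 dims
    x = x - mean                                         # center on global mean
    return x / np.linalg.norm(x, axis=1, keepdims=True)  # unit length

rng = np.random.default_rng(0)
ivecs = rng.standard_normal((10, 600))   # raw 600-dim i-vectors
lda = rng.standard_normal((600, 200))    # placeholder LDA projection
mean = np.zeros(200)                     # placeholder global mean
processed = postprocess_ivectors(ivecs, lda, mean)
```

Length normalization maps the i-vectors onto the unit sphere, which makes the Gaussian assumptions of PLDA more appropriate (Garcia-Romero and Espy-Wilson, 2011).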
2.2 SBN-MFCC i-vector system
Bottleneck Neural Network (BN-NN) refers to an NN topology in which one of the hidden layers has significantly lower dimensionality than the surrounding ones. A bottleneck feature vector is generally understood as a by-product of forwarding a primary input feature vector through the BN-NN and reading off the vector of values at the bottleneck layer. We have used a cascade of two such NNs for our experiments. The output of the first network is stacked in time, defining context-dependent input features for the second NN, hence the term Stacked Bottleneck features (Figure 1).
The NN input features are 24 log Mel-scale filter bank outputs augmented with fundamental frequency features from 4 different estimators (Kaldi: http://kaldi.sourceforge.net; Snack: http://www.speech.kth.se/snack/; and two others according to Laskowski and Edlund (2010) and Talkin (1995)). Together, we have 13 fundamental frequency related features; see Karafiát et al. (2014) for details. Conversation-side based mean subtraction is applied to the whole feature vector, and then 11 frames of log filter bank outputs and fundamental frequency features are stacked. A Hamming window and DCT projection (0th to 5th DCT base) are applied to the time trajectory of each parameter, resulting in the coefficients that form the first-stage NN input.
The configuration of the first NN is K × 1500 × 1500 × 80 × 1500 × N, where K is the dimensionality of the input and N is the number of target triphones. The dimensionality of the bottleneck layer was set to 80 for the first NN and to 30 for the second; the dimensionality of the other hidden layers was set to 1500. The bottleneck outputs from the first NN are sampled at times t−10, t−5, t, t+5 and t+10, where t is the index of the current frame. The resulting 400-dimensional features are inputs to the second-stage NN, which otherwise has the same topology as the first stage. The network was trained on the Fisher English corpus, and the data were augmented with two noisy copies.
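The temporal sampling of the first-stage bottleneck outputs can be sketched in numpy as follows (assuming an 80-dimensional first-stage bottleneck, consistent with the 400-dimensional second-stage input; edge frames are handled by clamping):

```python
import numpy as np

def stack_bottlenecks(bn, offsets=(-10, -5, 0, 5, 10)):
    """Sample first-stage bottleneck outputs at the given frame offsets
    and concatenate them (80-dim BN x 5 offsets = 400-dim input to the
    second-stage NN). Out-of-range frame indices are clamped."""
    T = bn.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets), 0, T - 1)
    return bn[idx].reshape(T, -1)

bn = np.random.randn(200, 80)      # first-stage bottleneck outputs
stacked = stack_bottlenecks(bn)    # shape (200, 400)
```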
Finally, the 30-dimensional bottleneck outputs from the second NN (referred to as SBN) were concatenated with MFCC features (as used in the previous system) and used as an input to the conventional GMM-UBM i-vector system, with 2048 components in the UBM and 600-dimensional i-vectors.
2.3 The x-vector systems
These SRE systems are based on a deep neural network (DNN) architecture for the extraction of embeddings, as described in Snyder et al. (2017) and Snyder et al. (2018). Specifically, we use the original Kaldi recipe (Snyder, 2017) and 512-dimensional embeddings extracted from the first layer after the pooling layer (embedding-a, also referred to as the x-vector), which is consistent with Snyder et al. (2018).
Input features to the DNN were MFCCs, extracted using a 25 ms Hamming window. We used 23 Mel-filter banks and we limited the bandwidth to the 20–3700 Hz range. 23 MFCCs were calculated every 10 ms. This 23-dimensional feature vector was subjected to short-time mean- and variance-normalization using a 3 s sliding window. Note the differences to the MFCC features for the i-vector system described above (mainly the number of Mel-filter banks, the bandwidth, and no delta/double-delta coefficients).
The embedding DNN can be divided into three parts. The first part operates on the frame level and begins with 5 layers with a time-delay architecture, described in Peddinti et al. (2015). The first four layers each contain 512 neurons; the last layer before statistics pooling has 1500 neurons. The subsequent pooling layer gathers mean and standard deviation statistics from all frame-level inputs. The single vector of concatenated means and standard deviations is propagated through the rest of the network, where the embeddings are extracted. This part consists of two hidden layers, each with 512 neurons, and a final output layer whose dimensionality corresponds to the number of speakers. The DNN uses Rectified Linear Units (ReLUs) as nonlinearities in the hidden layers and soft-max in the output layer, and is trained by optimizing multi-class cross-entropy.
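The statistics pooling step is simple enough to express as a sketch; here in numpy rather than the actual Kaldi nnet3 component:

```python
import numpy as np

def statistics_pooling(frame_feats):
    """Pool frame-level activations (T x 1500) into a single
    utterance-level vector of concatenated means and standard
    deviations (3000-dim), as done by the x-vector pooling layer."""
    mean = frame_feats.mean(axis=0)
    std = frame_feats.std(axis=0)
    return np.concatenate([mean, std])

h = np.random.randn(300, 1500)    # frame-level outputs of the TDNN part
pooled = statistics_pooling(h)    # shape (3000,)
```

The pooled vector is what the segment-level layers (and hence the x-vector) are computed from, making the embedding independent of the utterance length.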
In addition, we also trained an x-vector extractor on MFCC features concatenated with SBN from Section 2.2. Apart from changing the input features, we kept the architecture of the embedding DNN the same as for the MFCC system.
3 Signal Enhancement Autoencoder
For training the denoising autoencoder, we needed a fairly large amount of clean speech from which to form a parallel dataset of clean and augmented (noisy, reverberated or both) utterances. We chose the Fisher English database Parts 1 and 2, as they span a large number of speakers (11971) and the audio is relatively clean and without reverberation. Combined, these databases contain over 20,000 telephone conversation sides, or approximately 1800 hours of audio.
Our autoencoder, introduced in Plchot et al. (2016) and Novotný et al. (2018a), consists of three hidden layers with 1500 neurons in each layer. The input of the autoencoder was a central frame of a log-magnitude spectrum with a context of +/- 15 frames (31 frames in total, i.e., a 3999-dimensional input). The output is the 129-dimensional enhanced log-magnitude spectrum of the central frame; see the topology in Figure 2.
It was necessary to perform feature normalization during training and then to repeat a similar process during actual denoising. We used mean and variance normalization, with the mean and variance estimated per input utterance. At the output layer, de-normalization with parameters estimated on the clean variant of the file was used during training, while during denoising, the mean and variance were global and estimated on the cross-validation set. Using the log of the magnitude spectrum decreases the dynamic range of the features and leads to faster convergence.
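A sketch of how the autoencoder inputs could be assembled (per-utterance normalization plus +/- 15 frames of context) is given below; this is our illustration, not the exact training code:

```python
import numpy as np

def autoencoder_inputs(log_spec, context=15):
    """Build autoencoder inputs: per-utterance mean/variance
    normalization of the 129-dim log-magnitude spectrum, then stacking
    of +/-15 frames of context (31 * 129 = 3999 dims per frame)."""
    mu = log_spec.mean(axis=0)
    sigma = log_spec.std(axis=0) + 1e-8
    norm = (log_spec - mu) / sigma
    # replicate edge frames so every frame has a full context window
    padded = np.pad(norm, ((context, context), (0, 0)), mode='edge')
    T = log_spec.shape[0]
    idx = np.arange(T)[:, None] + np.arange(2 * context + 1)
    return padded[idx].reshape(T, -1)

spec = np.abs(np.random.randn(100, 129)) + 1e-3  # toy magnitude spectrum
X = autoencoder_inputs(np.log(spec))             # shape (100, 3999)
```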
As the objective function for training the autoencoder, we used the Mean Square Error (MSE) between the autoencoder outputs for the training utterances and the spectra of their clean variants. We used both clean and augmented recordings during training, as we wanted the autoencoder to retain its robustness and produce good results on relatively clean data as well.
3.1 Adding noise
We prepared a dataset of noises that consists of three different sources:
240 samples (4 minutes long) taken from the Freesound library (http://www.freesound.org): real fan, HVAC, street, city, shop, crowd, library, office and workshop noises.
5 samples (4 minutes long) of artificially generated noises: various spectral modifications of white noise + 50 and 100 Hz hum.
18 samples (4 minutes long) of babbling noises by merging speech from 100 random speakers from Fisher database using speech activity detector.
Noises were divided into two disjoint groups for training (223 files) and development (40 files).
3.2 Adding reverberation

We prepared a set of room impulse responses (RIRs) consisting of real room impulse responses from several databases: AIR (http://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/aachen-impulse-response-database/), C4DM (http://isophonics.net/content/room-impulse-response-data-set; Stewart and Sandler, 2010), MARDY (http://www.commsp.ee.ic.ac.uk/~sap/resources/mardy-multichannel-acoustic-reverberation-database-at-york-database/), OPENAIR (http://www.openairlib.net/auralizationdb), RVB 2014 (http://reverb2014.dereverberation.com/index.html) and RWCP (http://www.openslr.org/13/). Together, they cover all types of rooms (small rooms, big rooms, lecture rooms, restrooms, halls, stairs etc.). All room models have more than one impulse response per room (a different RIR was used for the source of the signal and the source of the noise, to simulate their different locations). The rooms were split into two disjoint sets, with 396 rooms for training and 40 rooms for development.
3.3 Composition of the training set
To mix the reverberation, noise and signal at a given SNR, we followed the procedure shown in Figure 3. The pipeline begins with two branches, in which speech and noise are reverberated separately. Different RIRs from the same room are used for the signal and the noise, to simulate different positions of their sources.
The next step is A-weighting, applied to simulate the sensitivity of the human ear to the added noise (Aarts, 1992). With this filtering, the listener would be able to better perceive the SNR, because most of the noise energy comes from frequencies that the human ear is sensitive to.
In the following step, we set the ratio of noise and signal energies to obtain the required SNR. The energies of the signal and the noise are computed only over the frames given by the original signal's voice activity detection (VAD). This means that the computed SNR is really present in the speech frames, which are the ones important for SV (frames without voice activity are removed during processing).
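The SNR-setting step can be sketched as follows (a simplified numpy illustration; the A-weighting of both signals before the energy computation is omitted, and the frame length/shift values are our own assumptions):

```python
import numpy as np

def mix_at_snr(speech, noise, vad_mask, snr_db, frame=200, shift=80):
    """Scale `noise` so that the speech/noise energy ratio, measured
    over speech frames only (per-frame flags in `vad_mask`), equals
    the target SNR, then add it to the signal."""
    def frame_energy(x):
        return np.array([np.sum(x[i * shift:i * shift + frame] ** 2)
                         for i in range(len(vad_mask))])

    e_speech = frame_energy(speech)[vad_mask].sum()
    e_noise = frame_energy(noise)[vad_mask].sum()
    scale = np.sqrt(e_speech / (e_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)
noise = rng.standard_normal(8000)
n_frames = 1 + (len(speech) - 200) // 80
vad = np.ones(n_frames, dtype=bool)   # pretend every frame is speech
mixed = mix_at_snr(speech, noise, vad, snr_db=10)
```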
The speech and noise are then summed at the desired SNR and filtered with a telephone channel filter (see page 9 in ITU, 1994) to compensate for the fact that our noise samples do not come from a telephone channel, while the original clean data (Fisher) do. The final output is a reverberated and noisy signal at the required SNR, which simulates a recording passing through the telephone channel (as the original signal did) in various acoustic environments. If we want to add only noise or only reverberation, just the appropriate part of the pipeline is used.
4 Experimental Setup
4.1 Training data
To train the UBM and the i-vector extractor, we used the PRISM (Ferrer et al., 2011) training dataset definition without added noise or reverberation. The PRISM set comprises Fisher 1 and 2, Switchboard phase 2 and 3 and Switchboard cellphone phases 1 and 2, along with a set of Mixer speakers. This includes the 66 held-out speakers from SRE10 (see Section III-B5 of Ferrer et al., 2011), and 965, 980, 485 and 310 speakers from SRE08, SRE06, SRE05 and SRE04, respectively. A total of 13,916 speakers are available in the Fisher data and 1,991 in the Switchboard data. Four variants of gender-independent PLDA were trained: the first variant was trained on the clean training data only, while the training sets for the other variants were augmented with an artificially added mix of different noises and reverberated data (each such portion was based on a subset of the clean training data, i.e., approximately 24k segments).
4.2 Evaluation data
We evaluated our systems on the female portions of the following NIST SRE 2010 (NIST, 2010) and PRISM conditions:
tel-tel: SRE 2010 extended telephone condition involving normal vocal effort conversational telephone speech in enrollment and test (known as “condition 5”).
int-int: SRE 2010 extended interview condition involving interview speech from different microphones in enrollment and test (known as “condition 2”).
int-mic: SRE 2010 extended interview-microphone condition involving interview enrollment speech and normal vocal effort conversational telephone test speech recorded over a room microphone channel (known as “condition 4”).
prism,noi: Clean and artificially noised waveforms from both interview and telephone conversations recorded over lavalier microphones. Noise was added at different SNR levels and recordings are tested against each other.
prism,rev: Clean and artificially reverberated waveforms from both interview and telephone conversations recorded over lavalier microphones. Reverberation was added with different RTs and recordings are tested against each other.
prism,chn: English telephone conversation with normal vocal effort recorded over different microphones from both SRE2008 and 2010 are tested against each other.
Additionally, we used the Core-Core condition from the SITW challenge – sitw-core-core. The SITW dataset (see McLaren et al., 2016) is a large collection of real-world data exhibiting speech from individuals across a wide array of challenging acoustic and environmental conditions. These audio recordings do not contain any artificially added noise, reverberation or other artifacts. The database was collected from open-source media. The sitw-core-core condition comprises audio files, each containing a continuous speech segment from a single speaker. Enrollment and test segments contain between 6 and 180 seconds of speech. We evaluated all trials (both genders).
We also tested our systems on the NIST SRE 2016, described in NIST (2016), but we split the trial set by language into Tagalog (sre16-tgl-f) and Cantonese (sre16-yue-f). We use only female trials (both single- and multi-session). Concerning the experiments with SRE’16, it is important to note that we did not use the SRE’16 unlabeled development set in any way, and we did not perform any score normalization (such as adaptive s-norm).
The speaker verification performance is evaluated in terms of the equal error rate (EER).
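For completeness, the EER can be computed from the trial scores as sketched below (a numpy illustration on synthetic scores, not the evaluation tooling used for the reported results):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection
    rate on target trials equals the false-acceptance rate on
    non-target trials."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    fr = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(fr - fa))
    return (fr[i] + fa[i]) / 2.0

rng = np.random.default_rng(1)
tgt = rng.normal(2.0, 1.0, 1000)    # synthetic target trial scores
non = rng.normal(0.0, 1.0, 5000)    # synthetic non-target trial scores
print(f"EER: {100 * eer(tgt, non):.2f} %")
```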
4.3 NIST retransmitted set (BUT-RET)
To evaluate the impact of room acoustics on the accuracy of speaker verification, a proper dataset of reverberant audio is needed. Retransmission is an alternative that fills the qualitative gap between unsatisfying simulation (despite the improvement in realism reported in Ravanelli et al., 2016) and costly, demanding recording of real speakers. To our advantage, a known dataset can be retransmitted so that the performance is readily comparable with known benchmarks. Hence, this was the method we used to obtain a new dataset.
The retransmission took place in a room with the floor plan displayed in Figure 4. The configuration fits several purposes: the loudspeaker–microphone distance rises steadily for microphones 1–6 to study deterioration as a function of distance, while microphones 7–12 form a large microphone array mainly intended for exploring beamforming (beyond the scope of this paper, but studied in Mošner et al., 2018).
For this work, a subset of NIST SRE 2010 data was retransmitted. The dataset consists of 459 female recordings with nominal durations of three and eight minutes. The total number of female speakers is 150. The files were played in sequence and recorded simultaneously by a multi-channel acquisition card that ensured sample precision synchronization.
We denote the retransmitted data as conditions BUT-RET-*: BUT-RET-orig represents the original (not retransmitted) data, and BUT-RET-merge is created by pooling scores from all fourteen microphones.
4.4 PLDA augmentation sets
For augmenting the PLDA training set, we created new artificially corrupted training sets from the PRISM training data. We used noises and RIRs described in Section 3. To mix the reverberation, noise and signal at given SNR, we followed the procedure outlined in Figure 3, but omitting the last step of applying the telephone channel. We trained the four following PLDAs (with abbreviations used further in the text):
Clean: PLDA was trained on original PRISM data, without additive augmentation.
N: PLDA was trained on i) the original PRISM data, and ii) a portion (24k segments) of the original training data corrupted by noise.
RR: PLDA was trained on i) the original PRISM data, and ii) a portion of the original training data corrupted by reverberation using real room impulse responses.
RR+N: PLDA was trained on i) original PRISM data, ii) noisy augmented data, and iii) reverberated data as described above.
Note that the sizes of all 3 augmentation sets are the same.
4.5 Augmentation sets for the embedding system
When defining the data set for training the embedding system, we tried to stay close to the recipe introduced by Snyder (2017), but we introduced modifications to the training data that allowed us to test on a larger set of benchmarks (PRISM, NIST SRE 2010). Every speaker must have at least 6 utterances after augmentation (unlike 8 in the original recipe) and every training sample must be at least 500 frames long. As a consequence of these constraints, and given the augmentation process described below, we ended up with 11383 training speakers.
In the original Kaldi recipe, the training data were augmented with reverberation, noise, music, and babble noise and combined with the original clean data. The package of all noises and room impulse responses can be downloaded from OpenSLR (http://www.openslr.org/resources/28/rirs_noises.zip; Ko et al., 2017) and includes the MUSAN noise corpus (843 noises).
For data augmentation with reverberation, the total amount of RIRs is divided into two equally distributed lists for medium and small rooms.
For augmentation with noise, we created three replicas of the original data. The first replica was modified by adding MUSAN noises at SNR levels in the range of 0–15 dB. In this case, the noise was added as a foreground noise (meaning that several non-overlapping noises can be added to the input audio). The second replica was mixed with music at SNRs ranging from 5 to 15 dB as background noise (one noise per audio file at the given SNR). The last noisy replica of the training data was created by mixing in babble noise, with SNR levels at 13–20 dB and 3–7 noises per audio file. The augmented data were pooled, and a random subset of 200k audio files was selected and combined with the clean data. The process of data augmentation is also described in Snyder et al. (2018).
Apart from the original recipe, as described in the previous paragraph, we also added our own processing: real room impulse responses and stationary noises described in Section 3. The original RIR list was extended by our list of real RIRs and we kept one reverberated replica. Our stationary noises were used to create another replica of data with SNR levels in range 0–20 dB. We combined all replicas and selected a subset of 200k files. As a result, after performing all augmentations, we obtain 5 replicas for each original utterance. The whole process of creating the x-vector extractor training set is depicted in Figure 5.
5 Experiments and Discussion
We provide a set of results in which we study the influence of DNN autoencoder signal enhancement on a variety of systems. Our autoencoder approach is also compared to multi-condition training of PLDA, which can likewise improve the performance of the system in corrupted acoustic environments. Finally, we combine the autoencoder with multi-condition training and find a better-performing combination.
We trained the autoencoders for signal enhancement simultaneously for denoising and dereverberation, which provides better robustness towards an unknown form of signal corruption compared to an autoencoder trained only on noise or only on reverberation (as studied in Novotný et al., 2018a).
We also created different multi-condition training sets for PLDA (described in Section 4.4), similarly to the autoencoder training (see Section 3). We used exactly the same noises and reverberation for segment corruption as in the autoencoder training, allowing us to compare the performance of systems using the autoencoder with systems based on multi-condition training.
Our results are listed in Table 1 for the i-vector-based systems and in Table 3 for the x-vector-based ones. The results in each table are separated into four main blocks based on the combination of features and signal enhancement: i) a system trained on MFCCs without signal enhancement, ii) a system trained on MFCCs with signal enhancement, iii) a system trained on SBN-MFCCs without enhancement, and iv) a system trained on SBN-MFCCs with signal enhancement. In each block, the first column corresponds to the system where PLDA was trained only on clean data. The next three columns represent results with different multi-condition training sets: N, RR or RR+N (as described in Section 4.4).
Finally, the rows of the table are also divided based on the type of the condition: telephone channel, microphone and artificially created conditions. The last row, denoted avg, gives the average EER over all conditions, and each value set in bold is the minimum EER in the particular condition. We did not use any type of adaptation or any other score-improving technique in the SRE16 or other conditions.
5.1 I-vector systems experiments
Let us begin by comparing systems with and without signal enhancement. In this case, we focus on PLDA trained on clean data only. In the first case, the i-vector system was trained using the MFCC features. We see mixed results: in the first set of conditions, representing a telephone channel, we see degradation. Considering that this is a reasonably clean condition, the enhancement was not expected to be very effective here.
In the second block of results (interview speech), the situation is better, except for the int-mic condition: we notice an improvement in the system with signal enhancement. An interesting result can be spotted in the prism,chn condition, where signal enhancement brings more than 40 % relative improvement.
The next block of artificially corrupted conditions from PRISM also shows improvements, as does the last set of results on our retransmitted data; in addition, there is no degradation in the original condition BUT-RET-orig.
Let us now focus on the i-vector system based on the SBN-MFCC features. In the past, SBN-MFCC features provided good robustness against noisy conditions. We verify this statement by comparing the columns MFCC ORIG and SBN-MFCC ORIG in Table 1 (systems without signal enhancement). We see that, except for the SRE 2016 and BUT-RET-merge conditions, the system trained with stacked bottleneck features yields better performance than the original MFCC system. When comparing systems with and without signal enhancement, the situation is similar to the MFCC case: we see degradation on the telephone channels and on a portion of the interview speech conditions. We obtain a 30 % relative improvement in BUT-RET-merge, where the system without enhancement is even worse than the previous i-vector system. This could indicate that the bottleneck features provide better robustness to noise than to reverberation.
In Section 4.5, we described the augmentation setup for the x-vector system in comparison to the i-vector extractor training setup. Our i-vector extractors were trained on the original clean data only. Our hypothesis is that generative i-vector extractor training does not benefit from data augmentation in the same way that x-vector training does. The comparison of our MFCC i-vector extractor trained on the original clean data versus on augmented data (the type of augmentation is the same as described in Section 3) is shown in Table 2. We see improvement in some conditions, but mostly degradation. The reason is that generative i-vector extractor training is unsupervised: when we add augmented data to the training list, the extractor is forced to reserve a portion of its parameters to represent the variability of noise and reverberation, which limits the parameters available for modeling speaker variability. In the supervised, discriminative x-vector approach, we force the extractor to do the opposite; it must distinguish between speakers, so data augmentation in training can be beneficial.
5.2 X-vector systems experiments
We also evaluated our speech enhancement autoencoder with the system based on x-vectors, which is currently considered state-of-the-art. In our experiments and system design, we deviated from the original Kaldi recipe (Snyder et al., 2018): for training the x-vector extractor, we extended the number of speakers and created more variants of augmented data. We extended the original data augmentation recipe by adding real room impulse responses and an additional set of stationary noises (the extension process is also described in Novotný et al. (2018b); the x-vector network used here is labeled Aug III. in that paper). In the PLDA backend training, we also added the augmented data for multi-condition training (see Section 4.4).
Let us point out that the denoising autoencoder was trained on a subset of the augmented data used for training the x-vector DNN. The sets of noises and real room impulse responses are therefore the same as in our extended set for training the x-vector extractor (as described in Section 3), so the autoencoder gains no advantage from possibly seeing additional augmentations. We also refer the interested reader to our analysis in Novotný et al. (2018b), where we show the benefit of such a large augmentation set for x-vector extractor training.
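The reverberant augmentations mentioned above are conventionally produced by convolving speech with a room impulse response (RIR). A minimal numpy sketch of that operation (illustrative only, not the exact recipe used here):

```python
import numpy as np

def reverberate(speech, rir):
    """Create a reverberant copy: convolve with an RIR, keep length and energy."""
    wet = np.convolve(speech, rir)[: len(speech)]
    # Rescale so the reverberant signal keeps the energy of the dry one.
    wet *= np.sqrt(np.mean(speech ** 2) / np.mean(wet ** 2))
    return wet
```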
Let us first compare the x-vector networks trained with the original MFCC and with the SBN-MFCC features. In systems based on i-vectors, bottle-neck features sometimes provided very significant improvements, but for x-vector-based systems the gains are much smaller, and the performance stays the same or even degrades for the BUT-RET-merge condition. This degradation, however, completely disappears once we use denoising in x-vector training and subsequently multi-condition training in PLDA. For the telephone data with low reverberation, we observe either steady performance on tel-tel or slightly better performance on the more challenging, non-English SRE’16 conditions. This is in contrast with i-vectors, where we see either steady performance on the easy tel-tel condition or degradation on the more challenging SRE’16. In general, the positive effect of the SBN-MFCC features on the x-vector system is small, but more stable than in the i-vector system.
When we focus on the effect of signal enhancement in the x-vector-based system, we see much larger improvements than with i-vectors. There are still several cases where the enhancement causes degradation (mostly the clean conditions: int-mic and BUT-RET-orig for MFCC; tel-tel, int-mic and BUT-RET-orig for SBN-MFCC). Otherwise, the enhancement provides a solid improvement across the rest of the conditions and across the features used for system training. At this point, it is useful to point out that unlike with i-vectors, where denoising is applied only during i-vector extraction, we apply the enhancement already to the x-vector training data. The effect of applying enhancement only during x-vector extraction, as with i-vectors, can be seen in Table 4: we still gain some improvements, but they are generally smaller than with enhancement deployed already during x-vector training (Table 3).
X-vector systems generally provide greater robustness across different signal corruptions. It was therefore natural to expect that x-vector systems would not need signal enhancement and would implicitly learn it themselves, especially in the first part of the DNN described in Section 2.3. We believe the reason why enhancement helped in our case is that denoising is not the target task of the x-vector DNN. Even though we had multiple corrupted samples per speaker in the DNN training set, we may simply not have had enough. Since x-vector training is generally known to be data-hungry, it is likely that with more corrupted samples per speaker, the DNN would be able to learn the denoising task by itself.
Let us also point out that if a single type of noise (or channel in general) appears systematically with a particular speaker, the noise becomes part of the speaker identity, and the NN therefore does not compensate for it.
So far, we have compared systems in which PLDA was trained on clean data only, and we have studied the possible gains from enhancement across several systems. Multi-condition training of PLDA, where we add a portion of augmented data into the PLDA training set, is another possible way to improve system performance and robustness.
From the results, we can see that multi-condition training can provide improvements across all conditions and systems without signal enhancement. We can also see that the ideal combination of augmented data for multi-condition PLDA training depends on the condition: in the noisy condition (prism,noi), it is more effective to use noise augmentation only, while for the reverberated conditions (prism,rev, BUT-RET-merge), the reverberated augmentation set is more beneficial than the others.
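The intuition behind multi-condition PLDA training can be illustrated on the within-speaker covariance that PLDA's channel model absorbs: pooling augmented copies of each speaker's recordings under the same label enlarges this covariance, so the backend learns to discount noise and reverberation variability. A simplified numpy sketch of that statistic (not the actual PLDA estimation):

```python
import numpy as np

def within_speaker_cov(embeddings, speakers):
    """Scatter of embeddings around their per-speaker means (PLDA's channel term)."""
    dim = embeddings.shape[1]
    scatter = np.zeros((dim, dim))
    for spk in np.unique(speakers):
        x = embeddings[speakers == spk]
        centered = x - x.mean(axis=0)
        scatter += centered.T @ centered
    return scatter / len(embeddings)
```

In multi-condition training, `embeddings` would contain both the clean and the augmented versions of each utterance with the same speaker label, inflating the within-speaker scatter that the backend compensates for.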
5.3 Final remarks
Although EER is a common metric summarizing performance, it does not cover all operating points. In this section, we therefore present the performance of various systems via DET and DCF curves, which show the behavior of the systems in more detail.
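For reference, a DET curve plots the miss (false-rejection) rate against the false-alarm rate as the decision threshold is swept over the trial scores, and the EER is the operating point where the two rates meet. A minimal numpy sketch of both (function names are ours, for illustration):

```python
import numpy as np

def det_points(tar, non):
    """Miss and false-alarm rates of target/non-target scores over all thresholds."""
    thr = np.sort(np.concatenate([tar, non]))
    miss = np.array([(tar < t).mean() for t in thr])   # targets rejected
    fa = np.array([(non >= t).mean() for t in thr])    # non-targets accepted
    return miss, fa

def eer(tar, non):
    """Equal error rate: the threshold where miss and false-alarm rates cross."""
    miss, fa = det_points(tar, non)
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2
```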
In order to summarize our observations without overwhelming the reader with too many plots, we have chosen two representative conditions that are closest to a real-world scenario: sre16-yue-f (Figure 6) and BUT-RET-merge (Figure 7). More specifically, the sre16-yue-f condition was chosen because a) it contains original noisy audio, and b) compared to the rest of the conditions, there is a high channel mismatch between the training data and the evaluation data. The BUT-RET-merge condition was chosen because it realistically reflects real reverberation.
Looking at the graphs reveals that the benefit of the studied techniques can be substantial. It is worth noting that according to the tables above, denoising may not be effective in terms of EER; however, the DET curves show that there are operating points that do benefit from denoising to a fairly large extent.
Apart from the i-vector system on the sre16-yue-f condition, the DET and DCF curves corresponding to the denoised systems are generally better than those using the original noisy data over the whole range of operating points.
In this paper, we analyzed several aspects of DNN-autoencoder enhancement for designing robust speaker verification systems. We studied the influence of the enhancement on different speaker verification paradigms (generative i-vectors vs. discriminative x-vectors) and analyzed the possible improvements with different features.
Our results indicate that DNN-autoencoder speech signal enhancement can improve system robustness against noise and reverberation. They confirm that it is a stable and universal technique for improving robustness, independently of the system. We also compared PLDA multi-condition training with audio enhancement: the two approaches are complementary, and systems can benefit from using both simultaneously.
Having observed the improvements achieved by enhancing the x-vector extractor training data, a possible direction for future work is to train the x-vector extractor in a multi-task fashion, combining speaker separation and signal enhancement objective functions, and possibly benefiting even more from the joint optimization.
The work was supported by the Czech Ministry of Interior project No. VI20152020025 “DRAPAK”, the Google Faculty Research Award program, the Czech Science Foundation under project No. GJ17-23870Y, and by the Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project “IT4Innovations excellence in science - LQ1602”.
- Aarts (1992) Aarts, R. M., 1992. A comparison of Some Loudness Measures for Loudspeaker Listening Tests. J. Audio Eng. Soc 40 (3), 142–146, http://www.extra.research.philips.com/hera/people/aarts/RMA_papers/aar92a.pdf.
- Bhattacharya et al. (2017) Bhattacharya, G., Alam, J., Kenny, P., Aug. 2017. Deep Speaker Embeddings for Short-Duration Speaker Verification. In: Interspeech 2017. pp. 1517–1521.
- Bhattacharya et al. (2016) Bhattacharya, G., Alam, J., Kenny, P., Gupta, V., 2016. Modelling speaker and channel variability using deep neural networks for robust speaker verification. In: 2016 IEEE Spoken Language Technology Workshop, SLT 2016, San Diego, CA, USA, December 13-16.
- Dehak et al. (2011) Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., May 2011. Front-End Factor Analysis For Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), 788–798.
- Dufera and Shimamura (2009) Dufera, B., Shimamura, T., Jan 2009. Reverberated speech enhancement using neural networks. In: Proc. International Symposium on Intelligent Signal Processing and Communication Systems, ISPACS 2009. pp. 441–444.
- ETSI (2007) ETSI, 2007. Speech Processing, Transmission and Quality Aspects (STQ). Tech. Rep. ETSI ES 202 050, European Telecommunications Standards Institute (ETSI).
- Ferrer et al. (2011) Ferrer, L., Bratt, H., Burget, L., Cernocky, H., Glembek, O., Graciarena, M., Lawson, A., Lei, Y., Matejka, P., Plchot, O., Scheffer, N., Dec. 2011. Promoting robustness for speaker modeling in the community: the PRISM evaluation set. In: Proceedings of SRE11 analysis workshop. Atlanta.
- Garcia-Romero and Espy-Wilson (2011) Garcia-Romero, D., Espy-Wilson, C. Y., 2011. Analysis of i-vector length normalization in Gaussian-PLDA speaker recognition systems. In: Proc. Interspeech.
- Ghahabi and Hernando (2014) Ghahabi, O., Hernando, J., May 2014. Deep belief networks for i-vector based speaker recognition. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1700–1704.
- Glembek et al. (2014) Glembek, O., Ma, J., Matějka, P., Zhang, B., Plchot, O., Burget, L., Matsoukas, S., 2014. Domain Adaptation Via Within-class Covariance Correction in I-Vector Based Speaker Recognition Systems. In: Proceedings of ICASSP 2014. IEEE Signal Processing Society, pp. 4060–4064.
- Heigold et al. (2016) Heigold, G., Moreno, I., Bengio, S., Shazeer, N., March 2016. End-to-end text-dependent speaker verification. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5115–5119.
- ITU (1994) ITU, October 1994. ITU-T O.41. https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-O.41-199410-I!!PDF-E&type=items.
- Karafiát et al. (2014) Karafiát, M., Grézl, F., Veselý, K., Hannemann, M., Szőke, I., Černocký, J., 2014. BUT 2014 Babel system: Analysis of adaptation in NN based systems. In: Interspeech 2014. pp. 3002–3006.
- Kenny (2010) Kenny, P., June 2010. Bayesian speaker verification with Heavy–Tailed Priors. keynote presentation, Proc. of Odyssey 2010.
- Ko et al. (2017) Ko, T., Peddinti, V., Povey, D., Seltzer, M. L., Khudanpur, S., March 2017. A study on data augmentation of reverberant speech for robust speech recognition. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5220–5224.
- Kumatani et al. (2012) Kumatani, K., Arakawa, T., Yamamoto, K., McDonough, J., Raj, B., Singh, R., Tashev, I., December 2012. Microphone Array Processing for Distant Speech Recognition: Towards Real-World Deployment. In: APSIPA Annual Summit and Conference. Hollywood, CA, USA.
- Laskowski and Edlund (2010) Laskowski, K., Edlund, J., may 2010. A Snack implementation and Tcl/Tk Interface to the Fundamental Frequency Variation Spectrum Algorithm. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). Valletta, Malta.
- Lei et al. (2012) Lei, Y., Burget, L., Ferrer, L., Graciarena, M., Scheffer, N., 2012. Towards Noise-Robust Speaker Recognition Using Probabilistic Linear Discriminant Analysis. In: Proceedings of ICASSP. Kyoto, JP.
- Lei et al. (2014) Lei, Y., Scheffer, N., Ferrer, L., McLaren, M., May 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1695–1699.
- Lozano-Diez et al. (2016) Lozano-Diez, A., Silnova, A., Matějka, P., Glembek, O., Plchot, O., Pešán, J., Burget, L., Gonzalez-Rodriguez, J., 2016. Analysis and Optimization of Bottleneck Features for Speaker Recognition. In: Proceedings of Odyssey 2016. Vol. 2016. International Speech Communication Association, pp. 352–357.
- Martínez et al. (2014) Martínez, D. G., Burget, L., Stafylakis, T., Lei, Y., Kenny, P., LLeida, E., 2014. Unscented Transform For Ivector-based Noisy Speaker Recognition. In: Proceedings of ICASSP 2014. Florencie, Italy.
- Matějka et al. (2016) Matějka, P., Glembek, O., Novotný, O., Plchot, O., Grézl, F., Burget, L., Černocký, J., 2016. Analysis Of DNN Approaches To Speaker Identification. In: Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016). IEEE Signal Processing Society, pp. 5100–5104.
- Matějka et al. (2017) Matějka, P., Novotný, O., Plchot, O., Burget, L., Diez, M. S., Černocký, J., 2017. Analysis of Score Normalization in Multilingual Speaker Recognition. In: Proceedings of Interspeech 2017. Vol. 2017. International Speech Communication Association, pp. 1567–1571.
- Matějka et al. (2006) Matějka, P., Burget, L., Schwarz, P., Černocký, J., 2006. Brno University of Technology System for NIST 2005 Language Recognition Evaluation. In: Proceedings of Odyssey 2006. San Juan, Puerto Rico.
- Matějka et al. (2014) Matějka, P., et al., 2014. Neural network bottleneck features for language identification. In: IEEE Odyssey: The Speaker and Language Recognition Workshop. Joensu, Finland.
- McLaren et al. (2016) McLaren, M., Ferrer, L., Castan, D., Lawson, A., 2016. The Speakers in the Wild (SITW) Speaker Recognition Database. In: Interspeech 2016.
- Mimura et al. (2014) Mimura, M., Sakai, S., Kawahara, T., 2014. Reverberant speech recognition combining deep neural networks and deep autoencoders. In: Proc. Reverb Challenge Workshop. Florence, Italy.
- Mošner et al. (2018) Mošner, L., Matějka, P., Novotný, O., Černocký, J., 2018. Dereverberation and beamforming in far-field speaker recognition. In: Proceedings of ICASSP.
- NIST (2010) NIST, 2010. The NIST year 2010 Speaker Recognition Evaluation Plan. https://www.nist.gov/sites/default/files/documents/itl/iad/mig/NIST_SRE10_evalplan-r6.pdf.
- NIST (2016) NIST, 2016. The NIST year 2016 Speaker Recognition Evaluation Plan.
- Novoselov et al. (2015) Novoselov, S., Pekhovsky, T., Kudashev, O., Mendelev, V. S., Prudnikov, A., Sept 2015. Non-linear PLDA for i-vector speaker verification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 214–218.
- Novotný et al. (2018a) Novotný, O., Matějka, P., Plchot, O., Glembek, O., 2018a. On the use of DNN autoencoder for robust speaker recognition. Tech. rep.
- Novotný et al. (2018b) Novotný, O., Plchot, O., Matějka, P., Mošner, L., Glembek, O., 2018b. On the use of x-vectors for robust speaker recognition. In: Proceedings of Odyssey 2018. Vol. 2018. International Speech Communication Association, pp. 168–175.
- Peddinti et al. (2015) Peddinti, V., Povey, D., Khudanpur, S., 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In: INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015. pp. 3214–3218.
- Pelecanos and Sridharan (2006) Pelecanos, J., Sridharan, S., 2006. Feature Warping for Robust Speaker Verification. In: Proceedings of Odyssey 2006: The Speaker and Language Recognition Workshop. Crete, Greece.
- Plchot et al. (2016) Plchot, O., Burget, L., Aronowitz, H., Matějka, P., 2016. Audio Enhancing With DNN Autoencoder For Speaker Recognition. In: Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016). IEEE Signal Processing Society.
- Plchot et al. (2013) Plchot, O., Matsoukas, S., Matějka, P., Dehak, N., Ma, J., Cumani, S., Glembek, O., Heřmanský, H., Mesgarani, N., Soufifar, M. M., Thomas, S., Zhang, B., Zhou, X., 2013. Developing A Speaker Identification System For The DARPA RATS Project. In: Proceedings of ICASSP 2013. Vancouver, CA.
- Prince (2007) Prince, S. J. D., 2007. Probabilistic linear discriminant analysis for inferences about identity. In: Proc. International Conference on Computer Vision (ICCV). Rio de Janeiro, Brazil.
- Ravanelli et al. (2016) Ravanelli, M., Svaizer, P., Omologo, M., 2016. Realistic Multi-Microphone Data Simulation for Distant Speech Recognition. In: Interspeech 2016. pp. 2786–2790.
- Rohdin et al. (2018) Rohdin, J., Silnova, A., Diez, M., Plchot, O., Matějka, P., Burget, L., 2018. End-to-end DNN based speaker recognition inspired by i-vector and PLDA. In: Proceedings of ICASSP. IEEE Signal Processing Society.
- Snyder (2017) Snyder, D., 2017. NIST SRE 2016 Xvector Recipe. https://david-ryan-snyder.github.io/2017/10/04/model_sre16_v2.html.
- Snyder et al. (2017) Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S., 2017. Deep Neural Network Embeddings for Text-Independent Speaker Verification. Proc. Interspeech 2017, 999–1003.
- Snyder et al. (2018) Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S., 2018. X-vectors: Robust DNN Embeddings for Speaker Recognition. In: Proceedings of ICASSP.
- Snyder et al. (2016) Snyder, D., Ghahremani, P., Povey, D., Garcia-Romero, D., Carmiel, Y., Khudanpur, S., Dec 2016. Deep neural network-based speaker embeddings for end-to-end speaker verification. In: 2016 IEEE Spoken Language Technology Workshop (SLT). pp. 165–170.
- Stewart and Sandler (2010) Stewart, R., Sandler, M., March 2010. Database of omnidirectional and B-format room impulse responses. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 165–168.
- Sturim and Reynolds (2005) Sturim, D. E., Reynolds, D. A., March 2005. Speaker adaptive cohort selection for Tnorm in text-independent speaker verification. In: Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005. Vol. 1. pp. I/741–I/744 Vol. 1.
- Swart and Brümmer (2017) Swart, A., Brümmer, N., 2017. A Generative Model for Score Normalization in Speaker Recognition. In: Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017. pp. 1477–1481.
- Talkin (1995) Talkin, D., 1995. A Robust Algorithm for Pitch Tracking (RAPT). In: Kleijn, W. B., Paliwal, K. (Eds.), Speech Coding and Synthesis. Elsevier, New York.
- Variani et al. (2014) Variani, E., Lei, X., McDermott, E., Moreno, I. L., Gonzalez-Dominguez, J., May 2014. Deep neural networks for small footprint text-dependent speaker verification. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4052–4056.
- Xu et al. (2014a) Xu, Y., Du, J., Dai, L.-R., Lee, C.-H., Jan. 2014a. An Experimental Study on Speech Enhancement Based on Deep Neural Networks. IEEE Signal processing letters 21 (1).
- Xu et al. (2014b) Xu, Y., Du, J., Dai, L.-R., Lee, C.-H., 2014b. Global variance equalization for improving deep neural network based speech enhancement. In: Proc. IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP). pp. 71 – 75.
- Yanhui et al. (2014) Yanhui, T., Jun, D., Yong, X., Lirong, D., Chin-Hui, L., 2014. Deep Neural Network Based Speech Separation for Robust Speech Recognition. In: Proceedings of ICSP2014. pp. 532–536.
- Zhang et al. (2016) Zhang, S. X., Chen, Z., Zhao, Y., Li, J., Gong, Y., Dec 2016. End-to-End attention based text-dependent speaker verification. In: 2016 IEEE Spoken Language Technology Workshop (SLT). pp. 171–178.