Deep Normalization for Speaker Vectors

04/07/2020 ∙ by Yunqi Cai, et al. ∙ Xi'an Jiaotong-Liverpool University ∙ Tsinghua University

Deep speaker embedding has demonstrated state-of-the-art performance in audio speaker recognition (SRE). However, one potential issue with this approach is that the speaker vectors derived from deep embedding models tend to be non-Gaussian for each individual speaker, and non-homogeneous for distributions of different speakers. These irregular distributions can seriously impact SRE performance, especially with the popular PLDA scoring method, which assumes homogeneous Gaussian distribution. In this paper, we argue that deep speaker vectors require deep normalization, and propose a deep normalization approach based on a novel discriminative normalization flow (DNF) model. We demonstrate the effectiveness of the proposed approach with experiments using the widely used SITW and CNCeleb corpora. In these experiments, the DNF-based normalization delivered substantial performance gains and also showed strong generalization capability in out-of-domain tests.




I Introduction

Speaker recognition (SRE) is widely studied, and decades of investigation have resulted in significant performance improvements and deployment in a wide range of practical applications [4, 43, 16]. Traditional SRE methods are based on statistical models, which include the popular Gaussian mixture model-universal background model (GMM-UBM) architecture [42]. In order to improve the statistical strength with limited data, various subspace models have been proposed [23], in particular the i-vector model, which was the most successful and widely used [7]. An important concept introduced by the i-vector model is speaker embedding, that is, representing speakers by fixed-length continuous vectors. With this embedding, the derived speaker vectors construct a speaker space, where SRE tasks can be conducted using either simple cosine distance scoring, or a more complex back-end model such as probabilistic linear discriminant analysis (PLDA) [20]. It should be noted that the i-vector approach essentially performs a statistical embedding as it is based on statistical models, and the derived i-vectors are therefore statistical speaker vectors.

In recent years, deep learning methods have demonstrated significant progress with regard to SRE. Variani et al. reported the results of an initial investigation on a text-dependent task [50], using a deep neural net (DNN) to produce frame-level speaker-discriminant features. They then derived utterance-level representations (called ‘d-vectors’) by average pooling. Li et al. [27] extended this with a more speech-friendly net structure, and achieved good performance on text-independent tasks, especially with short utterances. However, one key shortcoming of the d-vector approach is that the frame-level training does not match the utterance-level test. Researchers developed two architectures to solve this problem. The first approach is to use the end-to-end architecture, which accepts two utterances and produces the accept/reject decision directly [19, 57, 41]. The second approach is the speaker embedding architecture [46, 37], which instead accumulates the frame-level statistics of a variable-length utterance and converts these to a fixed-length vector with the objective of discriminating between the speakers in the training set. Although the training criterion of the end-to-end architecture is more consistent with the SRE task, the embedding architecture is easier to train [31] and the derived speaker vectors support various speaker-related tasks such as speaker-dependent synthesis [21].

The concept of embedding with deep learning methods is exactly the same as the i-vector model. To differentiate between them, in this paper, we denote DNN-based embedding by deep embedding, and the derived fixed-length vectors by deep speaker vectors. Perhaps the most popular deep embedding architecture is the x-vector model proposed by Snyder et al. [46]. It is based on a time-delayed neural net and a statistical pooling, and has achieved good performance in many applications. For simplicity, we use the term x-vector to represent deep speaker vectors derived with any net structure.

Recently, deep speaker embedding models have been significantly improved by developing a more comprehensive architecture [6, 22], improved pooling methods [37, 3, 51, 5], training criteria [32, 9, 52, 2, 14, 59], and training schemes [33, 53, 48]. As a result, the deep embedding approach has achieved state-of-the-art (SOTA) SRE performance [45].

Despite the significant progress outlined above, one potential issue with the deep embedding approach is that the training objective of the embedding models is purely discriminative, meaning that the goal of the training is simply discriminating the speakers, without considering the distribution of the derived speaker vectors. This ‘unconstrained training’ tends to produce irregulated speaker distributions (i.e. the within-speaker distributions or conditional distributions). This means that: (1) the distributions of each individual speaker may be potentially very complex and far from Gaussian (non-Gaussianality), and (2) the distributions of different speakers may be significantly different (i.e. non-homogeneity). Non-Gaussian and non-homogeneous distributions may seriously impact performance of the back-end scoring model, particularly with regard to the popular PLDA scoring method, which is based on the assumption that all speaker distributions are homogeneous and Gaussian. Fig. 1 illustrates some potential problems caused by non-Gaussian and non-homogeneous distributions. It should also be noted that this issue is not as severe for statistical speaker vectors (i-vectors) as they are derived from a constrained model with an underlying Gaussian assumption.

Fig. 1: Illustration of potential problems caused by non-Gaussian and non-homogeneous distributions. All colored regions represent the distribution of a particular speaker, and the region boundary represents the contour of the same probability. For simplicity, the classification is based on cosine distance. (A) shows two non-homogeneous distributions (although both are Gaussian). The test utterance (red star) is categorized as being the cyan speaker, although the probability that it belongs to the brown speaker is higher. (B) shows two non-Gaussian distributions (although they are homogeneous). The test utterance (red star) is categorized as the brown speaker but the probability that it belongs to the cyan speaker is higher.

A number of researchers have noticed the risk associated with data irregularity, and have presented various compensations. For instance, phone-aware training was investigated, with the aim of reducing the non-Gaussian variation caused by speech content [28, 53]. Another proposed approach was to use various data augmentation methods [46, 55]. These augmentation methods prevent the training from being over-fitted to highly curved discrimination boundaries, and hence yield a more regulated distribution for each individual speaker. Li et al. presented an approach that treats the fully connected layer before the softmax classifier as the basis of speakers, and imposes a central loss [30, 29]. This central loss is an explicit regularization that encourages the distributions of individual speakers to be more Gaussian. This concept was also introduced by other researchers [3].

Previous work by the authors presented a variational auto-encoder (VAE) based normalization for x-vectors [58, 54]. This is an independent component and dedicated to regulating x-vectors, and differs from other previous research that integrates the normalization constraint in the embedding model. Our VAE-based normalization encourages the marginal distribution to be Gaussian. However, it cannot normalize conditional distributions, i.e. the distributions of individual speakers, a more important source of data irregulation. In addition, to retain the discriminant strength of the speaker vectors, an auxiliary cross-entropy loss is required when training the VAE model, which complicates the behavior of the normalizer.

In this paper, we propose a new fully generative model to normalize deep speaker vectors. This model is based on normalization flow (NF), a simple yet powerful architecture for density estimation. With this model, a complex distribution can be transformed to a simple isotropic Gaussian (often called the prior distribution). However, directly applying the NF model is insufficient: it regulates the marginal distribution rather than conditional distributions (as with VAE), so cannot deal with individual speakers. We therefore propose a novel discriminative normalization flow (DNF). Compared with the vanilla NF model, DNF allows class-specific prior distributions, which enables it to model multiple speakers with different but homogeneous isotropic Gaussians. This paper will show that our new DNF approach is a deep and nonlinear extension of the widely used linear discriminant analysis (LDA) model.

The remainder of the paper is organized as follows. We first review the LDA and PLDA models in Section II and, through experiments, investigate the role of normalization that LDA plays when applied to x-vectors with PLDA scoring. Section III presents the DNF model, which extends the shallow and linear normalization with LDA to a deep and nonlinear normalization. Experimental results and analysis are presented in Section IV, and the paper is concluded in Section V.

II Shallow normalization by LDA

It is well known that for x-vector systems with PLDA scoring, LDA is an important pre-processing step. This is initially unexpected, as PLDA is theoretically an extension of LDA, and can self-discover the most discriminant dimensions. Previous work [58, 29] by the authors argued that the role LDA plays is distribution normalization, by making the conditional distributions of speaker vectors more Gaussian. This normalization makes the LDA-projected data fit the assumptions of PLDA, and therefore makes it more suitable for PLDA modeling. However, this argument should be investigated in more depth, with more theoretical and empirical analysis.

II-A Review of LDA and PLDA

II-A1 Dimension reduction view for LDA

LDA is a popular tool for dimension reduction, by identifying the most class-discriminant directions, along which the between-class variation is maximized and the within-class variation is minimized (Fisher criterion) [13]. These directions are aligned with the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ with large eigenvalues, where $\mathbf{S}_b$ and $\mathbf{S}_w$ are the between-class and within-class covariance matrices, respectively. An important property of these eigenvectors is that they diagonalize $\mathbf{S}_b$ and $\mathbf{S}_w$ simultaneously.

Research has shown that LDA can be performed in two steps: normalization and discrimination [17]. Normalization is a linear transform where the within-class covariance becomes an identity matrix. This can be achieved with principal component analysis (PCA) on the mean-offset data, and a re-scaling operation. The discrimination is an extra orthogonal transform that aligns the data coordinates along the directions with the largest between-class variation. This can be achieved by another PCA on the normalized class means, and dimension reduction can be achieved by selecting the leading principal components (PCs). Within this discrimination space, classification will be optimal based on Euclidean distance, as shown in Fig. 2.
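The two-step procedure can be sketched in a few lines of NumPy. This is a minimal illustration rather than the recipe used in the experiments: the function name `lda_two_step` is hypothetical, equal class priors are assumed for the between-class covariance, and the within-class covariance is assumed to be full rank.

```python
import numpy as np

def lda_two_step(X, y, n_dims):
    """Two-step LDA: whiten the within-class covariance, then PCA on class means."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Mean-offset data: subtract each sample's own class mean.
    centered = X - means[np.searchsorted(classes, y)]
    S_w = centered.T @ centered / len(X)

    # Step 1 (normalization): PCA on the mean-offset data plus re-scaling,
    # so that the within-class covariance becomes the identity.
    evals, evecs = np.linalg.eigh(S_w)
    W_norm = evecs / np.sqrt(evals)          # column j scaled by 1/sqrt(evals[j])

    # Step 2 (discrimination): PCA on the normalized class means.
    m = (means - X.mean(axis=0)) @ W_norm
    S_b = m.T @ m / len(classes)             # assumes equal class priors
    b_evals, b_evecs = np.linalg.eigh(S_b)
    order = np.argsort(b_evals)[::-1][:n_dims]
    return W_norm @ b_evecs[:, order]        # projection matrix (dim x n_dims)
```

Because the second step is an orthogonal rotation followed by column selection, the within-class covariance of the projected data `X @ P` stays an identity matrix, which is why Euclidean distance remains the natural metric there.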


II-A2 Probabilistic model view for LDA

LDA can be cast to a multi-Gaussian model [18]. Assuming all classes are Gaussian distributed and share the same covariance and the priors of all classes are equal, a probabilistic model can be established. The parameters, including the mean vectors $\{\boldsymbol{\mu}_k\}$ and the shared covariance matrix $\boldsymbol{\Sigma}$, can be optimized following the maximum-likelihood (ML) criterion. After training, optimal classification (in terms of maximum a posteriori, MAP) for a sample $\mathbf{x}$ can be obtained by choosing the class $k$ with the largest $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma})$. This is the primary form of LDA. The dimension reduction can be recovered by a constrained multi-Gaussian model where the mean vectors of all the classes are in a subspace. Fig. 2 illustrates the multi-Gaussian view of LDA and its relationship with the dimension-reduction view.
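As a short worked step, the MAP rule under equal priors and a shared covariance reduces to a nearest-mean rule. The symbols $\boldsymbol{\mu}_k$ (class means) and $\boldsymbol{\Sigma}$ (shared covariance) are notation assumed here for illustration.

```latex
\begin{aligned}
\hat{k} &= \arg\max_k \; \ln \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}) \\
        &= \arg\min_k \; (\mathbf{x} - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)
\end{aligned}
```

The normalization and log-determinant terms are identical for all classes and drop out. Once the data are transformed so that $\boldsymbol{\Sigma} = \mathbf{I}$, the rule becomes a comparison of Euclidean distances to the class means.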

Fig. 2: The multi-Gaussian view of LDA. (A) In the primary form, both classes are Gaussian and share the same covariance. (B) Normalization step: to discover the most discriminant features, the conditional distributions are first transformed to be isotropic Gaussian, so that MAP classification can be performed by Euclidean distance. (C) Linearly transform class means to a subspace with the largest between-class variation. This is an orthogonal transform, so it does not change the shape of the within-class covariance. Therefore, Euclidean distance based classification remains optimal in terms of MAP prediction.

As discussed, LDA assumes that the within-class distributions are Gaussian and they share the same covariance. We denote these as the Gaussianality condition and the homogeneity condition, respectively. If both conditions are satisfied, we call the data regulated. Regulated data are suitable for LDA dimension reduction, otherwise the features selected would not be optimal in terms of class discrimination.

II-A3 PLDA

PLDA extends LDA (probabilistic model view), by placing a Gaussian prior on the class means. A key advantage associated with this prior is that it enables dealing with new classes. More specifically, the posterior distribution of the mean of a new class can be derived, given even a single sample. According to this posterior, the probability that one or more test samples belong to the new class can be calculated by marginalizing over the class mean. Therefore, PLDA can be used to compute the likelihood ratio that two utterances belong to the same and different speakers [20], and is a theoretically sound scoring model. It is important to highlight that PLDA inherits the shared Gaussian assumption of LDA and requires regulated data. If the data are irregulated, the likelihood ratio derived by PLDA may be biased.
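The likelihood ratio can be illustrated with a simplified two-covariance form of PLDA. This is a sketch under stated assumptions, not the PLDA implementation used in the experiments: the class mean is drawn from N(0, B), the within-class residual from N(0, W), and the inputs are assumed to be globally mean-subtracted. The names `gauss_logpdf`, `plda_llr`, `B` and `W` are introduced here for illustration.

```python
import numpy as np

def gauss_logpdf(x, cov):
    """Log density of a zero-mean Gaussian N(0, cov) at x."""
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def plda_llr(x1, x2, B, W):
    """Log-likelihood ratio: same speaker vs. different speakers.

    Assumed model: x = m + e, class mean m ~ N(0, B), residual e ~ N(0, W).
    Marginalizing over m gives a joint Gaussian on the stacked pair [x1; x2].
    """
    T = B + W
    zero = np.zeros_like(B)
    pair = np.concatenate([x1, x2])
    same = np.block([[T, B], [B, T]])        # a shared class mean couples x1 and x2
    diff = np.block([[T, zero], [zero, T]])  # independent class means
    return gauss_logpdf(pair, same) - gauss_logpdf(pair, diff)
```

Both hypotheses are evaluated with Gaussian densities that share one within-class covariance `W`, which is where the regulated-data requirement enters: if individual speakers deviate from this shared Gaussian, the ratio is computed under a mismatched model.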

II-B Why LDA works for x-vectors

To have a better understanding of the role of LDA and explain its contribution to the back-end scoring model, in particular PLDA, we performed an initial experiment using a subset of the VoxCeleb dataset [6, 36]. We created both an i-vector and x-vector system using the entire VoxCeleb database, following the recipes that will be described in Section IV. We then generated i-vectors and x-vectors from a subset of 600 speakers in the training set. We use these to investigate the statistical properties of these vectors before and after LDA.

II-B1 Global properties

First of all, we show the between-speaker and within-speaker covariance matrices of i-vectors and x-vectors. These matrices reflect the global properties (i.e. on the data of all the speakers) of the distribution of the speaker vectors. As shown in Fig. 3, x-vectors exhibit more complex correlation patterns compared to i-vectors. After LDA, for both x-vectors and i-vectors, the between- and within-speaker covariances become diagonal. This diagonalization can be regarded as a global normalization. As the correlation patterns of x-vectors are more complex, this normalization is more substantial for x-vectors. To an extent, it answers the question of why LDA contributes more for x-vectors with cosine scoring: although x-vectors are derived from a discriminative model and LDA-based feature selection seems less important, the diagonal between- and within-speaker covariances are good for cosine scoring (which assumes that the feature dimensions are independent).

Fig. 3: Between-speaker (left) and within-speaker (right) covariance of i-vectors (top) and x-vectors (bottom).

II-B2 Local properties

The global normalization of LDA does not fully explain everything. In particular, it cannot explain why LDA contributes to PLDA scoring, as PLDA performs the same normalization anyway. Our hypothesis is that LDA, by discarding less discriminative dimensions, performs a local normalization at the class level. More precisely, we will show that LDA improves both homogeneity among classes and also the Gaussianality of each class.

We test this hypothesis with both i-vectors and x-vectors. The tests are conducted in three spaces: (1) the original observation space; (2) the LDA space, i.e. the subspace with leading discriminant dimensions after LDA transform; (3) the residual space, i.e. the subspace complementary to the LDA space. Since testing the homogeneity and Gaussianality in a high-dimensional space is challenging, we perform tests on the principal directions of each conditional distribution. Specifically, we select speakers with more than samples and perform PCA on the data of each speaker, and investigate the homogeneity and Gaussianality on the leading PC directions. The statistics we collected are as follows:

  • PC direction variance for homogeneity. This tests if the covariance matrices of all the speakers have the same PC directions. After PCA, the first PC (PC1) of all the speakers are selected and its mean over the speakers is computed. The cosine distance between the PC1s of individual speakers and the mean PC1 is computed. The variance of these cosine scores is used as the measure to test the PC1 direction variance. The same computation is conducted on all PCs. In this section, we report the direction variance on PC1 and PC2, and the averaged direction variance on the first 10 PCs.

  • PC shape variance for homogeneity. Using PC1 as an example, the coefficients (eigenvalues) of the covariance matrices of all the speakers on the first PC are calculated, and the variance of these coefficients over all speakers is computed. The same computation is performed on all the PCs. Since the coefficient on each PC determines the spreading of the samplings on this direction, the coefficients on all the PCs determine the shape of the speaker distribution. The variances of these coefficients over all speakers then test if the distributions of all speakers have the same shape (regardless of the directions), hence being noted as PC shape variances. We report the shape variance on PC1 and PC2, and the averaged shape variance on the first 10 PCs.

  • Average PC kurtosis for Gaussianality. On each PC direction, we compute the kurtosis for each speaker, and then compute the mean of the kurtosis over all the speakers. The averaged kurtosis over the first 10 PCs is reported.

  • Average PC skewness for Gaussianality. On each PC direction, we compute the skewness for each speaker, and then compute the mean of the skewness over all the speakers. The averaged skewness over the first 10 PCs is reported.
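The four statistics above can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: eigenvector signs are aligned by making the largest-magnitude component positive, kurtosis is reported as excess kurtosis (zero for a Gaussian), and skewness is averaged in absolute value. The function name `speaker_pc_stats` is hypothetical.

```python
import numpy as np

def speaker_pc_stats(vectors_by_speaker, n_pcs=10):
    """Per-PC homogeneity and Gaussianality statistics over speakers.

    vectors_by_speaker: list of (n_samples, dim) arrays, one per speaker.
    """
    dirs, shapes, kurts, skews = [], [], [], []
    for X in vectors_by_speaker:
        Xc = X - X.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        order = np.argsort(evals)[::-1][:n_pcs]
        evals, evecs = evals[order], evecs[:, order]
        # Assumption: fix the sign ambiguity of each eigenvector by making
        # its largest-magnitude component positive.
        top = np.abs(evecs).argmax(axis=0)
        evecs = evecs * np.sign(evecs[top, np.arange(evecs.shape[1])])
        proj = Xc @ evecs
        proj = proj / proj.std(axis=0)                  # standardized PC scores
        dirs.append(evecs.T)
        shapes.append(evals)
        kurts.append((proj ** 4).mean(axis=0) - 3.0)    # excess kurtosis
        skews.append(np.abs((proj ** 3).mean(axis=0)))  # |skewness| (assumption)
    dirs = np.stack(dirs)                               # (speakers, pcs, dim)
    mean_dir = dirs.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir, axis=1, keepdims=True)
    cos = np.einsum("spd,pd->sp", dirs, mean_dir)       # cosine to the mean PC
    return {
        "dir_var": cos.var(axis=0),                     # PC direction variance
        "shape_var": np.stack(shapes).var(axis=0),      # PC shape variance
        "kurtosis": np.stack(kurts).mean(axis=0),
        "skewness": np.stack(skews).mean(axis=0),
    }
```

Each returned array has one entry per PC, so the PC1 and PC2 values are the first two entries and the averaged values are the mean over the array.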

The results are shown in Table I, where we also report the between- and within-speaker variance, as well as the SRE performance with the cosine scoring in terms of equal error rates (EER). There are several key observations:

  1. Comparing i-vectors and x-vectors in the original observation space, it can be seen that x-vectors possess larger PC direction and PC shape variances, demonstrating that the distributions of different speakers are less homogeneous. Moreover, x-vectors show larger kurtosis and skewness, indicating that the distributions of individual speakers are less Gaussian. This confirmed our conjecture that x-vectors are less regulated, and are less suitable for PLDA scoring when compared to i-vectors.

  2. For both i-vectors and x-vectors, after LDA, homogeneity and Gaussianality are all improved, and this improvement is much more significant for x-vectors. This indicates that LDA-transformed data are more regularized, especially for x-vectors.

  3. In the residual space, LDA improves homogeneity. With regard to Gaussianality, there is a slight improvement with i-vectors. For x-vectors, however, Gaussianality becomes worse after LDA.

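| Statistic | i-vector Orig. | i-vector LDA | i-vector Res. | x-vector Orig. | x-vector LDA | x-vector Res. |
|---|---|---|---|---|---|---|
| EER% | 5.75 | 3.08 | 19.83 | 7.00 | 1.50 | 15.08 |
| PC1 dir. var | 0.064 | 0.009 | 0.004 | 0.104 | 0.006 | 0.003 |
| PC2 dir. var | 0.089 | 0.008 | 0.005 | 0.156 | 0.005 | 0.003 |
| Avg PC dir. var | 0.028 | 0.007 | 0.004 | 0.060 | 0.006 | 0.003 |
| PC1 shape var | 80.1 | 64.0 | 60.5 | 156.0 | 42.0 | 128.0 |
| PC2 shape var | 53.3 | 32.3 | 34.6 | 68.3 | 21.8 | 63.0 |
| Avg PC shape var | 30.7 | 19.8 | 20.9 | 42.0 | 13.7 | 32.5 |
| PC kurtosis | 1.579 | 0.734 | 1.268 | 2.615 | 1.686 | 31.40 |
| PC skewness | 0.311 | 0.209 | 0.309 | 0.369 | 0.275 | 1.110 |
| Between-class var | 0.269 | 1.164 | 0.163 | 0.548 | 2.332 | 0.225 |
| Within-class var | 0.753 | 0.996 | 0.991 | 0.192 | 1.030 | 1.026 |

TABLE I: Homogeneity and Gaussianality of i-vectors and x-vectors before and after LDA (Orig.: original space; LDA: LDA space; Res.: residual space).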

The final observation is of greatest interest. This suggests that LDA not only selects discriminant dimensions, but also removes non-Gaussian dimensions. To test this conjecture in a more concrete way, we divide all the dimensions sorted by LDA (according to their discriminant power) into multiple subgroups and compute the averaged PC shape variance, PC kurtosis and PC skewness, plus the between-speaker variance and the EER results with cosine scoring. The results are shown in Fig. 4. It can be clearly seen that the most indiscriminative dimensions are non-homogeneous and non-Gaussian. In comparison, Fig. 5 shows the results with i-vectors, where it can be seen that the indiscriminative dimensions of i-vectors are much more regulated.

Fig. 4: Statistics of subgroup dimensions of LDA-projected x-vectors. Top left: Averaged PC shape variance; Top right: Kurtosis; Bottom left: Skewness; Bottom right: Between-speaker variance and EER with cosine scoring.
Fig. 5: Statistics of subgroup dimensions of LDA-projected i-vectors. Top left: Averaged PC shape variance; Top right: Kurtosis; Bottom left: Skewness; Bottom right: Between-speaker variance and EER with cosine scoring.

II-B3 Summary of LDA and PLDA

Based on the findings above, a number of key conclusions can be drawn with regard to the role that LDA plays for i-vectors and x-vectors. The typical role of LDA is two-fold: it transforms the between- and within-speaker covariances to be diagonal, and it selects the most discriminant dimensions. Both functions contribute to cosine scoring. However, for PLDA scoring, these functions are performed implicitly and LDA pre-processing is therefore not usually necessary. This is the case with i-vectors, but with x-vectors, LDA plays an extra role, namely speaker vector normalization, by removing irregulated (non-homogeneous and/or non-Gaussian) dimensions. These irregulated dimensions may potentially be attributed to unwanted variance such as linguistic content and length variance, but can also be simply attributed to the unconstrained nature of deep embedding models. Removing these dimensions will make the data more regulated and hence benefit PLDA scoring.

III Deep normalization by discriminative normalization flow

The previous section demonstrated that suitable data regulation is very important for PLDA scoring. However, the linear form of LDA means that it can only normalize the global structure (within- and between-class covariances), rather than individual speakers. The within-speaker normalization is essentially achieved by dimension-trimming. However, the dimensions that are least discriminant are not necessarily consistent with the most irregulated dimensions. This can be clearly seen in Fig. 4, where the most non-Gaussian and non-homogeneous dimensions are in the first subgroup, i.e., the most discriminant dimensions. This means that dimension-trimming is not optimal for either selecting discriminant features (some discriminative features have to be removed because they are irregulated), or for selecting regulated features (some irregulated features cannot be removed as they are discriminative).

Here, we present a new deep normalization model, which is based on deep generative neural nets and is designed for normalizing distributions of individual speakers. Most importantly, we will utilize the powerful normalization flow (NF) model to perform distribution transform, and propose a novel discriminative NF (DNF) model to deal with multiple speakers. To the best knowledge of the authors, this represents a new research direction in this domain.

III-A Normalization flow

Deep generative models transform a simple distribution via a deep neural net, so that the output distribution matches the true data [34]. Typical deep generative models include generative adversarial networks (GAN) [15] and variational auto-encoders (VAE) [25]. Normalization flow (NF) is another deep generative model, which is similar to VAE but the transform is invertible, therefore it does not require an explicit encoder and the likelihood can be computed exactly [38]. In this research, we choose NF as the basic architecture of our deep normalization model.

The foundation of NF is the principle of distribution transformation for continuous variables [44]. Let a latent variable $\mathbf{z}$ and an observation variable $\mathbf{x}$ be linked by an invertible transform $\mathbf{x} = f(\mathbf{z})$; their probability densities then have the following relationship:

$$\ln p(\mathbf{x}) = \ln p(\mathbf{z}) + \ln \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right| \qquad (1)$$

where $f^{-1}$ is the inverse function of $f$. It has been shown that if $f$ is flexible enough, a simple distribution, which we assume to be a standard Gaussian, can be transformed to a very complex distribution [38]. Note that the second term on the right side of the above equation represents the volume (entropy) change during the transform.

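Usually, $f$ is implemented as a composition of a sequence of relatively simple invertible transforms, denoted by $f_1, \dots, f_K$:

$$\mathbf{x} = f(\mathbf{z}) = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$

where every $f_k$ can be a structured neural net [49]. The entire transform has the following relationship:

$$\ln p(\mathbf{x}) = \ln p(\mathbf{z}) + \sum_{k=1}^{K} \ln \left| \det \frac{\partial f_k^{-1}(\mathbf{h}_k)}{\partial \mathbf{h}_k} \right|$$

where we have defined $\mathbf{h}_k = f_k(\mathbf{h}_{k-1})$, with $\mathbf{h}_0 = \mathbf{z}$ and $\mathbf{h}_K = \mathbf{x}$. This model resembles a flow of transforms, which reshapes the simple distribution on $\mathbf{z}$ gradually, and ultimately reaches the complex distribution on $\mathbf{x}$. In the inverse direction, it normalizes the complex distribution on $\mathbf{x}$ to a simple distribution on $\mathbf{z}$, and is therefore called a normalization flow. Fig. 6 illustrates how a complex distribution is normalized to a simple distribution by an NF model.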
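The accumulation of log-determinant terms over a chain of transforms can be checked numerically with a toy flow. The sketch below assumes each layer is a plain affine map, which is far simpler than the structured neural nets used in practice; the names `std_normal_logpdf`, `flow_logpdf` and the `(A, b)` layer format are introduced here for illustration.

```python
import numpy as np

def std_normal_logpdf(z):
    """Log density of a standard Gaussian at z."""
    return -0.5 * (len(z) * np.log(2 * np.pi) + z @ z)

def flow_logpdf(x, layers):
    """ln p(x) for x = f_K(...f_1(z)), with toy affine layers f_k(h) = A_k h + b_k."""
    h, log_det = x, 0.0
    for A, b in reversed(layers):            # undo f_K first, f_1 last
        h = np.linalg.solve(A, h - b)        # apply the inverse of this layer
        log_det -= np.linalg.slogdet(A)[1]   # ln|det of the inverse Jacobian| = -ln|det A_k|
    return std_normal_logpdf(h) + log_det    # prior term + accumulated entropy term
```

For affine layers the result can be compared against the closed-form Gaussian density of `x`, since a composition of affine maps of a standard Gaussian is again Gaussian.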

Fig. 6: A complex mixture of three Gaussians is transformed to a single Gaussian by an NF model. The NF used here is a masked autoregressive flow (MAF) [39], and the distribution on $\mathbf{z}$ is a standard Gaussian.

The NF model can be trained with the maximum likelihood (ML) criterion. Note that Eq. 1 formulates a distribution density on the observation $\mathbf{x}$, where the first term is often called the prior distribution, and the second term is called the entropy term. The ML training optimizes the NF model with the following objective:

$$\max_{\theta} \sum_{i} \ln p(\mathbf{x}_i) = \max_{\theta} \sum_{i} \left\{ \ln p(\mathbf{z}_i) + \ln \left| \det \frac{\partial f_{\theta}^{-1}(\mathbf{x}_i)}{\partial \mathbf{x}_i} \right| \right\}$$

where $i$ indexes the training samples, $\mathbf{z}_i = f_{\theta}^{-1}(\mathbf{x}_i)$, and $\theta$ represents the parameters of the model. Once the model has been well trained, it can be used to (a) sample $\mathbf{x}$ by sampling $\mathbf{z}$; (b) compute $p(\mathbf{x})$ by calculating the prior and the entropy term; (c) normalize $\mathbf{x}$ by transforming it to $\mathbf{z}$, which is Gaussian distributed. In this paper, we will focus on utilizing the normalization capability of the NF model.
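The three uses of a trained flow can be illustrated with the simplest possible case, a one-layer element-wise affine flow whose ML solution is available in closed form. This toy class stands in for a deep NF only to make the roles of sampling, density evaluation and normalization concrete; `DiagAffineFlow` and its method names are hypothetical.

```python
import numpy as np

class DiagAffineFlow:
    """Toy flow x = mu + sigma * z with z ~ N(0, I)."""

    def fit(self, X):
        # For this one-layer flow the ML estimate is the sample mean and
        # the (population) standard deviation of each dimension.
        self.mu, self.sigma = X.mean(axis=0), X.std(axis=0)
        return self

    def sample(self, n, rng):
        # (a) sample x by sampling z from the prior and applying the flow
        return self.mu + self.sigma * rng.standard_normal((n, len(self.mu)))

    def normalize(self, X):
        # (c) map x to its latent code z with the inverse transform
        return (X - self.mu) / self.sigma

    def log_prob(self, X):
        # (b) ln p(x) = prior term + entropy term
        Z = self.normalize(X)
        prior = -0.5 * (Z ** 2 + np.log(2 * np.pi)).sum(axis=1)
        entropy = -np.log(self.sigma).sum()
        return prior + entropy
```

A deep flow replaces the closed-form `fit` with gradient-based maximization of the same log-likelihood, while `normalize` remains a single pass through the inverse transform.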

The key issue when designing the NF model is to identify a net structure so that the entropy term in Eq. 1 can be easily computed. Researchers have proposed various NF models based on different structures. These models can be categorized into volume preserved (VP) [10] and non-volume preserved (NVP) [11] models. While VP models do not change the volume during the flow (i.e. the entropy term is zero), NVP models do not have this constraint and so are generally more flexible.
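The difference between the two families can be seen in a pair of coupling layers, the building block commonly associated with these models. The sketch below is illustrative only: `toy_net` is a hypothetical stand-in for the neural net inside a layer, inputs are 1-D arrays of even length, and the log-determinant returned is that of the layer's input-to-output map.

```python
import numpy as np

def toy_net(h, W):
    """Stand-in for the neural net inside a coupling layer (hypothetical)."""
    return np.tanh(h @ W)

def additive_coupling(x, W_t):
    """Volume-preserving layer: only a shift is applied, so the log-determinant is zero."""
    x1, x2 = np.split(x, 2)
    y = np.concatenate([x1, x2 + toy_net(x1, W_t)])
    return y, 0.0

def affine_coupling(x, W_s, W_t):
    """Non-volume-preserving layer: a log-scale s is applied before the shift."""
    x1, x2 = np.split(x, 2)
    s, t = toy_net(x1, W_s), toy_net(x1, W_t)
    y = np.concatenate([x1, x2 * np.exp(s) + t])
    return y, s.sum()                         # ln|det dy/dx| = sum of log-scales

def affine_coupling_inverse(y, W_s, W_t):
    """Exact inverse: y1 is unchanged, so s and t can be recomputed from it."""
    y1, y2 = np.split(y, 2)
    s, t = toy_net(y1, W_s), toy_net(y1, W_t)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])
```

In both layers the Jacobian is triangular, which is what makes the entropy term cheap: it is zero for the additive layer and a simple sum for the affine one.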

III-B Discriminative normalization flow

The vanilla NF model optimizes the distribution of the training data without considering the class labels, i.e. the marginal distribution. This means that data from different classes tend to congest together in the latent space, and the distributions of individual classes are non-Gaussian, as shown in the top row of Fig. 7. This is not a good property for classification tasks like SRE. Conditional NF models [1] may take the class information as a condition variable; however, the conditioning cannot be generalized to unseen classes (e.g. unknown speakers), which makes them unsuitable for open-set tasks such as SRE.

Fig. 7: Vanilla NF (top) pulls all classes together in the latent space, while DNF (bottom) keeps data from different classes separated.

In order to normalize distributions of individual classes and keep different classes separated, we propose a discriminative normalization flow (DNF) model. The main difference is that we allow each class to have its own Gaussian prior, i.e. all the priors share the same covariance but possess different means, formulated as follows:

$$p_y(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_y, \boldsymbol{\Sigma})$$

where $y$ is the class label. By setting class-specific means, different classes will be separated from each other in the latent space, as shown in the bottom row of Fig. 7.

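Training DNF is mostly the same as the vanilla NF, following the ML criterion. The only difference is that the probability of an observation should be evaluated with the prior corresponding to its class label, formally written as:

$$\ln p(\mathbf{x}) = \ln p_{y(\mathbf{x})}(\mathbf{z}) + \ln \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|$$

where $y(\mathbf{x})$ is the class label of $\mathbf{x}$, and $\mathbf{z} = f^{-1}(\mathbf{x})$. Pooling all the training data, we obtain the objective function for DNF training:

$$\mathcal{L}(\theta) = \sum_{i} \ln p(\mathbf{x}_i)$$

where $\theta$ involves all the parameters of the model. Note that this objective is a bit over-parameterized, as the covariance $\boldsymbol{\Sigma}$ can be set to any value if the flow is flexible enough, e.g. in the case of NVP. We therefore manually set $\boldsymbol{\Sigma} = \mathbf{I}$ and let the flow handle the volume change.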
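The DNF objective differs from the vanilla NF objective only in the prior term, as the following sketch shows. It assumes the covariance is fixed to the identity and reports the mean rather than the sum over samples (which has the same maximizer). The callable `inverse_flow` and the function name `dnf_log_likelihood` are hypothetical placeholders for whatever flow implementation is used.

```python
import numpy as np

def dnf_log_likelihood(X, y, class_means, inverse_flow):
    """Average DNF log-likelihood with class-specific priors N(mu_y, I).

    inverse_flow(X) is assumed to return (Z, log_det): the latent codes and
    the per-sample log-determinant of the inverse transform's Jacobian.
    """
    Z, log_det = inverse_flow(X)
    diff = Z - class_means[y]                 # offset from each sample's own class mean
    d = Z.shape[1]
    prior = -0.5 * ((diff ** 2).sum(axis=1) + d * np.log(2 * np.pi))
    return (prior + log_det).mean()
```

Setting every row of `class_means` to zero recovers the vanilla NF objective with a single standard Gaussian prior.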

After training, the DNF model will establish a normalization space for $\mathbf{z}$, where the distribution of every class is simply a Gaussian with covariance $\mathbf{I}$. With this model, an observation $\mathbf{x}$ can be transformed to its latent code $\mathbf{z}$ by the inverse transform $f^{-1}$, without knowing its class label. In addition, the latent codes from the same class, which may be unknown, tend to be Gaussian. From this perspective, DNF is a nonlinear feature transform that is dedicated to within-class normalization.

III-C Relation to LDA

From the probabilistic model view, DNF is a nonlinear extension of LDA. Both DNF and LDA are generative models, and they share the same assumption that the distributions of all classes are homogeneous Gaussian in the latent space. However, this assumption can never be true for LDA if the data are complex, due to the limit of the linear transform between the data space and the latent space. In contrast, DNF is based on a nonlinear transform, which allows it to establish a truly homogeneous and Gaussian latent space, even for complex irregulated data.

From the dimension reduction view, DNF has a similar role as the normalization step of LDA. Both approaches normalize the distribution of data; the key difference is that the normalization step of LDA normalizes the aggregated conditional distribution of all classes to an isotropic Gaussian, while DNF normalizes all the conditional distributions to homogeneous isotropic Gaussians. Therefore, our DNF approach can deliver a more powerful normalization than the linear normalization of LDA.

However, it should be noted that unlike LDA, DNF does not normalize the between-class covariance, which may lead to performance loss with classification methods where dimension independence is assumed, such as those based on cosine distance. We can therefore combine DNF and LDA by replacing the linear normalization step of LDA with DNF, while keeping the linear discrimination step of LDA unchanged. This leads to a new model with a nonlinear normalization step and a linear discrimination step, which we will call nonlinear discriminative analysis (NDA), and we will investigate its performance in the experimental section.
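The linear discrimination step that NDA keeps from LDA can be sketched as a PCA on the class means of the latent codes. This is an illustration under assumptions: `Z` is taken to be the output of an already trained DNF, class priors are treated as equal, and the function name `nda_projection` is hypothetical.

```python
import numpy as np

def nda_projection(Z, y, n_dims):
    """Discrimination step of LDA applied to DNF latent codes Z with labels y."""
    classes = np.unique(y)
    means = np.stack([Z[y == c].mean(axis=0) for c in classes])
    centered = means - means.mean(axis=0)
    # PCA on the class means: eigenvectors of the between-class covariance.
    evals, evecs = np.linalg.eigh(centered.T @ centered / len(classes))
    order = np.argsort(evals)[::-1][:n_dims]
    return evecs[:, order]                    # orthonormal columns (dim x n_dims)
```

Because the returned columns are orthonormal, projecting with them does not change the shape of the within-class distributions that DNF has already normalized.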

IV Experiments

IV-A Datasets

Three datasets were used in our experiments: VoxCeleb [36, 6], SITW [35] and CNCeleb [12]. VoxCeleb was used for training all the models (i-vector, x-vector, LDA, PLDA and DNF models), while the other two were used for performance evaluation.

VoxCeleb: This is a large-scale audiovisual speaker database collected by the University of Oxford, UK. The entire database consists of VoxCeleb1 and VoxCeleb2. All the speech signals were collected from open-source media channels and therefore involve rich variations in channel, style, and ambient noise. This dataset, after removing the utterances shared with the SITW dataset, was used to train the i-vector, x-vector, LDA, PLDA and DNF models. The entire dataset contains hours of speech signals from speakers. Data augmentation was applied to improve robustness: the MUSAN corpus [47] was used to generate noisy utterances, and the room impulse responses (RIRS) corpus [26] was used to generate reverberant utterances.

SITW: This is a standard evaluation dataset excerpted from VoxCeleb1, which consists of speakers. In our experiments, the Eval. Core test set, which contains target trials and imposter trials, was used for evaluation. It should be noted that the acoustic condition of SITW is similar to that of the training set VoxCeleb, so this test can be regarded as an in-domain test.

CNCeleb: This is a large-scale free speaker recognition dataset collected by Tsinghua University from source media. It contains more than utterances from Chinese celebrities. It covers diverse genres, which makes speaker recognition on this dataset much more challenging than on SITW [12]. By pair-wise composition, trials are constructed, including target trials and imposter trials. It is important to note that the acoustic condition of CNCeleb is quite different from that of VoxCeleb, and this therefore represents a challenging corpus that is suitable for use as an out-of-domain test.
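One possible reading of "pair-wise composition" is sketched below: every pair of utterances forms a trial, labelled as a target trial when both utterances come from the same speaker and as an imposter trial otherwise. The function name `build_trials` and the `utt2spk` mapping are hypothetical, and the actual CNCeleb trial list may be built differently (for example, from separate enrollment and test sets).

```python
from itertools import combinations

def build_trials(utt2spk):
    """Pair every two utterances; same speaker -> target, otherwise imposter."""
    target, imposter = [], []
    for (u1, s1), (u2, s2) in combinations(sorted(utt2spk.items()), 2):
        (target if s1 == s2 else imposter).append((u1, u2))
    return target, imposter
```

With this construction the number of imposter trials grows much faster than the number of target trials as speakers are added, which is typical of speaker verification trial lists.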

IV-B Model settings

Our SRE approach consists of three components: an x-vector or i-vector frontend that produces speaker vectors, a normalization model that regularizes the distribution of the speaker vectors, and finally, a scoring model that produces pair-wise scores for making a genuine/imposter decision.

IV-B1 Frontend

  • x-vector system: The x-vector frontend was created using the Kaldi toolkit [40], following the SITW recipe. The acoustic features are -dimensional Fbanks. The main architecture contains three components. The first is the feature-learning component, which involves time-delay (TD) layers to learn frame-level speaker features. The slicing parameters for these TD layers are: {-, -, , +, +}, {-, , +}, {-, , +}, {}, {}. The second is the statistical pooling component, which computes the mean and standard deviation of the frame-level features from a speech segment. The final one is the speaker-classification component, which discriminates between different speakers. This component has fully connected (FC) layers, and the size of its output is , corresponding to the number of speakers in the training set. Once trained, the -dimensional activations of the penultimate FC layer are read out as an x-vector.

  • i-vector system: The i-vector frontend was built with the Kaldi toolkit [40], following the SITW recipe. The raw features involve -dimensional MFCCs plus the log energy, augmented by first- and second-order derivatives, resulting in a -dimensional feature vector. This feature is used by the i-vector model. The universal background model (UBM) consists of Gaussian components, and the dimensionality of the i-vectors is set to be .
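The statistical pooling component of the x-vector frontend can be illustrated with a few lines of NumPy. This is a sketch of the operation as described (mean and standard deviation over frames), not the Kaldi implementation; the function name `statistics_pooling` is hypothetical, and the population standard deviation is an assumption.

```python
import numpy as np

def statistics_pooling(frames):
    """Map (num_frames, feat_dim) frame-level features to one (2 * feat_dim,) vector."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)  # population standard deviation (ddof=0)
    return np.concatenate([mean, std])
```

The output length is independent of the number of frames, which is what lets segments of different durations share the same speaker-classification layers.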

IV-B2 Normalization models

To investigate the merits of our proposed DNF model, we compare its performance with a number of different configurations.

  • LDA: We implemented the basic LDA model, trained to maximize the Fisher criterion. We used the implementation in the Kaldi toolkit [40], which involves a small modification that specifies rather than , where is a hyperparameter that was set to 0.1 in the LDA + cosine scoring experiment, and 0.0 in the LDA + PLDA scoring experiment.

  • LDA/N: The linear normalization component of LDA. It simply normalizes the within-speaker covariance to be an identity matrix, neither diagonalizing the between-speaker covariance nor trimming any dimensions.

  • DNF: The DNF model with the MAF architecture [39]. This model has 5 NVP layers. The Adam optimizer [24] was used to train the model, with the minibatch size set to and the learning rate set to . Note that DNF is a dimensionality-preserving transform.

  • DNF-LDA: One potential issue with DNF is that it does not normalize the between-class covariance. Here, we perform an additional LDA after DNF normalization, to achieve normalization on both within- and between-class covariance. This is essentially a simple implementation of the NDA model discussed in Section III-C.
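To make the notion of an invertible, dimensionality-preserving flow layer concrete, the following toy example implements a single affine coupling layer in the style of Real NVP [11]. It is an illustration only and not the MAF-based model used in the experiments: the class name `AffineCoupling`, the linear conditioner, and the random weights are all assumptions.

```python
import numpy as np

class AffineCoupling:
    """Toy Real NVP-style coupling layer with a linear conditioner (illustrative)."""

    def __init__(self, dim, rng):
        self.half = dim // 2
        rest = dim - self.half
        # Hypothetical fixed weights; a real model would learn these.
        self.w_scale = 0.1 * rng.normal(size=(self.half, rest))
        self.w_shift = 0.1 * rng.normal(size=(self.half, rest))

    def forward(self, z):
        """Latent code -> observation; returns x and log|det dx/dz| per sample."""
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s = np.tanh(z1 @ self.w_scale)
        x2 = z2 * np.exp(log_s) + z1 @ self.w_shift
        return np.hstack([z1, x2]), log_s.sum(axis=1)

    def inverse(self, x):
        """Observation -> latent code; returns z and log|det dz/dx| per sample."""
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s = np.tanh(x1 @ self.w_scale)
        z2 = (x2 - x1 @ self.w_shift) * np.exp(-log_s)
        return np.hstack([x1, z2]), -log_s.sum(axis=1)
```

The first half of the vector passes through unchanged, so the inverse can recompute the same scale and shift, and the log-determinant is simply the sum of the log-scales. Stacking several such layers, with the halves swapped between layers, yields a more expressive transform that remains exactly invertible.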

IV-B3 Scoring model

Two commonly used scoring models were applied in this study: simple cosine scoring, which is based on the cosine distance, and the more complex PLDA scoring, which is based on PLDA [20].
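Cosine scoring amounts to a normalized inner product between the enrollment and test speaker vectors. A minimal sketch, with the hypothetical function name `cosine_score`:

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine similarity between two speaker vectors; higher means more similar."""
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))
```

The score ignores vector length and treats all dimensions equally, which is why it is sensitive to how the speaker vectors are normalized beforehand.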

IV-C Basic results

In the first experiment, we apply the four normalization models (LDA, LDA/N, DNF, DNF-LDA) to regulate the standard x-vectors derived from both SITW and CNCeleb. The results in terms of equal error rate (EER) are reported in Table II.
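For reference, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping the decision threshold over all observed scores. The function name `equal_error_rate` is hypothetical, and this simple sweep is an assumption rather than the scoring tool used to produce the reported results.

```python
import numpy as np

def equal_error_rate(target_scores, imposter_scores):
    """Approximate EER (%) by sweeping the threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    imposter = np.asarray(imposter_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, imposter]))
    # False rejection: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance: imposter trials scored at or above the threshold.
    far = np.array([(imposter >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0
```

With finite trial lists the two error rates rarely cross exactly, so the sketch reports their average at the closest threshold.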

IV-C1 X-vector in-domain results

Firstly, focusing on SITW, the in-domain test, it can be seen that all the normalization models improve performance with cosine scoring. The fact that LDA/N outperforms the baseline by a large margin (9.19 vs. 17.20) demonstrates the importance of within-speaker normalization, even though this approach is only linear. DNF performs better than LDA/N (8.53 vs. 9.19), confirming that nonlinear normalization is better than linear normalization. LDA with dimension reduction performs much better than LDA/N and DNF (e.g., 5.25 for LDA [150]), demonstrating the importance of the between-class information. Finally, DNF-LDA achieves the best performance (5.06 for DNF-LDA [150]) by combining the strengths of DNF and LDA.

For PLDA scoring, all the nonlinear normalization models (including dimension-truncating LDA, DNF and DNF-LDA) offer performance improvement. Note that an invertible linear transform (LDA/N, or LDA without dimension reduction) does not change the PLDA performance, as PLDA relies on the within-speaker and between-speaker covariances and is therefore invariant to such transforms. LDA with dimension reduction provides a reasonable performance improvement when the dimension size is carefully selected, demonstrating the importance of distribution normalization for individual speakers. Significantly, DNF obtains better performance than LDA (3.66 vs. 3.96 for the best LDA configuration), which confirms that NF is a better normalization approach for this problem. By adding LDA-based normalization, DNF-LDA achieves the best performance. These results are consistent with those obtained with cosine scoring.
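The invariance of PLDA to invertible linear transforms can be checked numerically with a simplified two-covariance PLDA model, in which a speaker mean is drawn from a Gaussian with between-speaker covariance `b` and each vector adds Gaussian noise with within-speaker covariance `w`. This is a sketch under that modelling assumption, not the Kaldi PLDA implementation; the function names `gauss_logpdf` and `plda_llr` are hypothetical.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def plda_llr(x1, x2, mu, b, w):
    """Log-likelihood ratio: same speaker vs. different speakers."""
    pair = np.concatenate([x1, x2])
    mean = np.concatenate([mu, mu])
    total = b + w
    zero = np.zeros_like(b)
    same = np.block([[total, b], [b, total]])        # shared speaker mean
    diff = np.block([[total, zero], [zero, total]])  # independent speaker means
    return gauss_logpdf(pair, mean, same) - gauss_logpdf(pair, mean, diff)

rng = np.random.default_rng(0)
d = 3
m1, m2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b, w = m1 @ m1.T, m2 @ m2.T + np.eye(d)
mu, x1, x2 = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
a = rng.normal(size=(d, d)) + 3 * np.eye(d)  # arbitrary invertible linear map
c = rng.normal(size=d)

llr_before = plda_llr(x1, x2, mu, b, w)
# Re-estimating the model on transformed data maps mu -> a mu + c, b -> a b a^T, w -> a w a^T.
llr_after = plda_llr(a @ x1 + c, a @ x2 + c, a @ mu + c, a @ b @ a.T, a @ w @ a.T)
```

The two scores agree because the Jacobian terms introduced by the linear map cancel in the ratio, so the ranking of trials, and hence the EER, is unchanged. Dimension reduction is not invertible, which is why LDA with truncated dimensions can change the PLDA results.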

IV-C2 X-vector out-of-domain results

When using CNCeleb, the out-of-domain test, the observations are very different. Looking at the cosine scoring results, we first observe that LDA/N does not offer any performance improvement over the baseline (16.36 vs. 16.32). This indicates that the global within-speaker covariances are significantly different between VoxCeleb (the training data) and CNCeleb, so the linear normalization learned to diagonalize the within-speaker covariance of VoxCeleb does not diagonalize the within-speaker covariance of CNCeleb. LDA, which applies an additional transform and dimension selection based on the between-class variance, makes things even worse (e.g., 17.67 for LDA [150]). This suggests that the between-class covariances of VoxCeleb and CNCeleb are also significantly different.

DNF, which focuses on normalizing the distributions of individual speakers, is more robust against data mismatch than the global linear normalization of LDA/N (14.22 vs. 16.36). Applying additional LDA after DNF reduces the performance, which suggests that in the DNF latent space, the between-class covariance still changes significantly from VoxCeleb to CNCeleb. The only exception is the case of dimension-preserving DNF-LDA [512], which does not perform any dimension reduction and provides a marginal gain over DNF (13.83 vs. 14.22). This indicates that in the DNF latent space, although the shape of the between-speaker covariance has changed significantly from VoxCeleb to CNCeleb, the principal directions of the covariance may not change much. This is not the case within the LDA/N latent space, as the performance with LDA [512] is worse than with LDA/N (16.87 vs. 16.36).

For PLDA scoring, similar conclusions can be drawn: LDA fails in most situations. The principal role of LDA in this scenario is removing irregular dimensions, and this removal is based on the between-speaker covariance within the latent space produced by LDA/N, which is in turn based on the within-speaker covariance. However, as we have discussed, both the between- and within-speaker covariances change significantly from VoxCeleb to CNCeleb, so it is not surprising that the LDA-based normalization fails. In contrast to LDA, DNF still works in this situation, which can be attributed to its more robust within-class normalization. However, when additional LDA is applied, the unreliable between-class information is used for dimension reduction, which leads to a significant performance reduction. This is shown in the case of DNF-LDA with reduced dimensions.

To summarize, the experimental results presented above indicate that the global properties (within- and between-speaker covariances) may change significantly at the dataset level, and any normalization methods based on these properties will suffer from the generalization problem. DNF learns how to normalize individual speakers at different locations of the speaker space, which appears to be more generalizable to unseen data. However, this generalizability seems to only be for within-class distributions: after DNF normalization, there is still significant mismatch with regard to between-class distributions, which should be further investigated.

Model             SITW Cosine   SITW PLDA   CNCeleb Cosine   CNCeleb PLDA
x-vector [512]    17.20         5.30        16.32            13.03
LDA [150]         5.25          4.07        17.67            14.37
LDA [200]         5.82          3.96        17.52            13.50
LDA [400]         7.38          4.65        17.49            12.28
LDA [512]         8.61          5.30        16.87            13.03
LDA/N [512]       9.19          5.30        16.36            13.03
DNF [512]         8.53          3.66        14.22            11.82
DNF-LDA [150]     5.06          3.61        15.42            13.85
DNF-LDA [200]     5.41          3.42        15.18            13.22
DNF-LDA [400]     7.05          3.58        14.20            11.90
DNF-LDA [512]     8.17          3.66        13.83            11.82
TABLE II: EER(%) results on SITW and CNCeleb with x-vector frontend.

IV-C3 I-vector results

For the purpose of comparison, we report the results with i-vectors in Table III. It can be seen that the normalization methods contribute very little to PLDA scoring on both the SITW and CNCeleb datasets. For cosine scoring, LDA yields performance gains on SITW. We argue that this is mainly due to the diagonalization of the between-speaker covariance. However, this contribution is largely lost on CNCeleb, indicating that the between-speaker covariance has changed significantly from SITW to CNCeleb. This observation is the same as in the x-vector experiment. DNF does not show any advantage in this experiment. This is because the within-speaker distributions of i-vectors are already well regulated (see Table I), so a dedicated normalization is not necessary.

Model             SITW Cosine   SITW PLDA   CNCeleb Cosine   CNCeleb PLDA
i-vector [400]    14.24         5.66        17.68            18.25
LDA [150]         7.11          5.36        18.18            18.49
LDA [200]         7.46          5.25        17.85            18.36
LDA [400]         9.32          5.66        16.65            18.25
LDA/N [400]       11.84         5.66        17.23            18.25
DNF [400]         12.06         5.60        18.04            18.15
DNF-LDA [150]     7.30          5.41        18.30            18.53
DNF-LDA [200]     7.52          5.30        18.02            18.36
DNF-LDA [400]     9.49          5.60        17.02            18.15
TABLE III: EER(%) results on SITW and CNCeleb with i-vector frontend.

IV-D Results on more powerful x-vectors

In this experiment, we constructed more powerful x-vector systems to investigate whether DNF normalization still contributes. We conducted extensive preliminary trials on model structures and training objectives (not reported here due to space constraints), and based on these, we chose three architectures to represent SOTA systems.

  • TDNN + Att.: The same architecture as the TDNN baseline in the previous experiment, but the statistical pooling is replaced by self-attention pooling [60].

  • ResNet-34 + Att.: The ResNet-34 architecture [6, 56] with self-attention pooling [60].

  • ResNet-34 + AAM: The ResNet-34 architecture [6, 56] with additive angular margin loss [8].

The experimental results are shown in Table IV. For LDA, we report the LDA [200] results only, as it is the best configuration for all the LDA systems. On SITW, all these ‘advanced’ systems outperform the TDNN baseline in a significant way, and DNF still achieves good performance. In most situations, DNF outperforms LDA, and more performance gains can be attained by DNF-LDA. The results on CNCeleb are even more significant. Firstly, they once again confirm the generalizability of DNF, as reported previously; and secondly, they show that the EER reduction on SITW provided by the ‘advanced’ approaches was not transferred to the results on CNCeleb. This indicates that the performance improvement obtained with some of the ‘advanced’ techniques may simply be the result of overfitting.

Frontend           Model             SITW Cosine   SITW PLDA   CNCeleb Cosine   CNCeleb PLDA
TDNN               x-vector [512]    17.20         5.30        16.32            13.03
                   LDA [200]         5.82          3.96        17.52            13.50
                   DNF [512]         8.53          3.66        14.22            11.82
                   DNF-LDA [200]     5.41          3.42        15.18            13.22
TDNN + Att.        x-vector [512]    4.37          3.66        15.08            13.05
                   LDA [200]         3.72          2.73        18.34            13.97
                   DNF [512]         5.00          2.71        14.69            12.07
                   DNF-LDA [200]     3.72          2.57        15.45            13.66
ResNet-34 + Att.   x-vector [512]    2.73          2.52        13.94            13.11
                   LDA [200]         2.60          2.00        14.90            12.58
                   DNF [512]         3.47          1.94        13.86            11.61
                   DNF-LDA [200]     2.57          1.89        14.04            12.32
ResNet-34 + AAM    x-vector [512]    5.71          2.82        15.80            14.02
                   LDA [200]         2.73          1.86        16.67            13.42
                   DNF [512]         4.89          2.32        14.66            12.80
                   DNF-LDA [200]     2.93          1.83        14.96            12.59
TABLE IV: EER(%) results on SITW and CNCeleb with a SOTA x-vector frontend.

IV-E Analysis

To better understand the behavior of DNF, we monitored the training process, and here we report the change of the statistics related to regulation and discrimination. As in Section II when analyzing LDA, we conduct the analysis with a small-scale experiment. All the data and measures are the same as in the LDA investigation, and we focus on x-vector results.

IV-E1 Regulation analysis

Fig. 8 presents four groups of measures related to data regulation: PC directional variance and PC shape variance, which reflect the homogeneity of the distributions of different speakers, and averaged kurtosis and averaged skewness, which reflect the Gaussianity of the distribution of each speaker. It can be seen that the values of all these measures are significantly reduced during training. Compared with the LDA results in Table I, DNF generally reaches lower values on all these measures, and hence is a better normalization model. Spikes are found with kurtosis, skewness, and PC1/PC2 direction variance. These spikes indicate that the model is trying to change the location of all the speakers in order to find an optimal configuration, but changing one speaker may cause unwanted changes to other speakers, due to the complex distributions of the speaker vectors. Nevertheless, the training ultimately finds a better configuration that improves the data regulation in general.

Fig. 8: Change of measures related to data homogeneity and Gaussianity during DNF training.
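The Gaussianity measures can be illustrated with the standard moment-based definitions of skewness and excess kurtosis, computed per speaker and per dimension and then averaged. This is a sketch under assumptions: the exact definitions behind the reported measures are not reproduced here, taking absolute values before averaging is a choice made for this example, and the function name `averaged_skew_kurtosis` is hypothetical.

```python
import numpy as np

def averaged_skew_kurtosis(x, labels):
    """Average |skewness| and |excess kurtosis| over speakers and dimensions."""
    skews, kurts = [], []
    for spk in np.unique(labels):
        xs = x[labels == spk]
        centered = xs - xs.mean(axis=0)
        std = centered.std(axis=0)
        skews.append(np.mean(centered ** 3, axis=0) / std ** 3)
        kurts.append(np.mean(centered ** 4, axis=0) / std ** 4 - 3.0)
    return float(np.mean(np.abs(skews))), float(np.mean(np.abs(kurts)))
```

Both quantities are close to zero when every speaker's vectors are Gaussian in each dimension, so lower values indicate better regulated within-speaker distributions.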

IV-E2 Discrimination analysis

To investigate the discriminative capability of the DNF-normalized speaker vectors, we compute several measures related to class discrimination: (1) between-class and within-class variance and their ratio; (2) EER results based on cosine scoring; (3) cross entropy, where the logit is computed based on the inner product of training samples and the class means; (4) cross entropy, where the logit is computed based on the cosine distance between training samples and the class means. Fig. 9 shows the change of these measures during model training. This shows that the data in the latent space becomes increasingly discriminative over time, as indicated by all these measures. In particular, we highlight the continuous decrease of the cross entropy based on the inner product. If we treat the inverse NF function as a regular neural net and the mean vectors of all the classes as the weights of the final layer, the whole DNF architecture is a standard classification network. Such a net is usually trained with the cross-entropy (CE) loss. In DNF, we interpreted the net in a very different way (as a generative model) and trained it with a very different loss (maximum likelihood, ML), and obtained the same CE reduction. This confirms the fundamental relation between generative and discriminative models, as discussed in Section III.

Fig. 9: Change of measures related to class discrimination during DNF training.
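Measures (3) and (4) can be sketched as follows: the class means act as the weights of a final classification layer, and the cross entropy is computed from logits given by either the inner product or the cosine similarity between each sample and every class mean. The function name `class_mean_cross_entropy` and the absence of any logit scaling are assumptions made for this illustration.

```python
import numpy as np

def class_mean_cross_entropy(z, labels, use_cosine=False):
    """Cross entropy with logits from inner products (or cosines) to class means."""
    classes, idx = np.unique(labels, return_inverse=True)
    means = np.stack([z[idx == k].mean(axis=0) for k in range(len(classes))])
    if use_cosine:
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        means = means / np.linalg.norm(means, axis=1, keepdims=True)
    logits = z @ means.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(z)), idx].mean())
```

A lower value means that samples align more strongly with their own class mean than with the means of other classes, which is the sense in which the latent space becomes more discriminative.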

V Conclusions

This paper investigated the issue of data irregularity with deep speaker vectors in SRE, and found through comprehensive experiments that deep speaker vectors require deep normalization. First, we found that the within-speaker distributions of deep speaker vectors are highly non-homogeneous and non-Gaussian, which may seriously impact the performance of SRE systems. To overcome this problem, we introduced a new deep normalization approach, based on a novel discriminative normalization flow (DNF) model. This model is a nonlinear extension of LDA, and can normalize complex and heterogeneous distributions of individual speakers. Using state-of-the-art system configurations, our experiments on two datasets demonstrated that our new DNF approach delivers consistently better performance than the baseline and outperforms the more conventional LDA-based normalization. Furthermore, in the out-of-domain test where LDA performs very poorly, DNF still delivers good performance, confirming the good generalizability and further potential of our approach. Future work will investigate the joint training of the DNF normalizer and the speaker embedding model, and will also apply DNFs to raw acoustic features directly.


  • [1] L. Ardizzone, C. Lüth, J. Kruse, C. Rother, and U. Köthe (2019) Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392. Cited by: §III-B.
  • [2] Z. Bai, X. Zhang, and J. Chen (2019) Partial auc optimization based deep speaker embeddings with class-center learning for text-independent speaker verification. arXiv preprint arXiv:1911.08077. Cited by: §I.
  • [3] W. Cai, J. Chen, and M. Li (2018) Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, pp. 74–81. Cited by: §I, §I.
  • [4] J. P. Campbell (1997) Speaker recognition: a tutorial. Proceedings of the IEEE 85 (9), pp. 1437–1462. Cited by: §I.
  • [5] N. Chen, J. Villalba, and N. Dehak (2019) Tied mixture of factor analyzers layer to combine frame level representations in neural speaker embeddings. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 2948–2952. Cited by: §I.
  • [6] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1086–1090. Cited by: §I, §II-B, 2nd item, 3rd item, §IV-A.
  • [7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798. Cited by: §I.
  • [8] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699. Cited by: 3rd item.
  • [9] W. Ding and L. He (2018) MTGAN: speaker verification through multitasking triplet generative adversarial networks. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 3633–3637. Cited by: §I.
  • [10] L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §III-A.
  • [11] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §III-A.
  • [12] Y. Fan, J. Kang, L. Li, K. Li, H. Chen, S. Cheng, P. Zhang, Z. Zhou, Y. Cai, and D. Wang (2019) CN-CELEB: a challenging Chinese speaker recognition dataset. arXiv preprint arXiv:1911.01799. Cited by: §IV-A, §IV-A.
  • [13] R. A. Fisher (1936) The use of multiple measurements in taxonomic problems. Annals of eugenics 7 (2), pp. 179–188. Cited by: §II-A1.
  • [14] Z. Gao, Y. Song, I. McLoughlin, P. Li, Y. Jiang, and L. Dai (2019) Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 361–365. Cited by: §I.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680. Cited by: §III-A.
  • [16] J. H. Hansen and T. Hasan (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal processing magazine 32 (6), pp. 74–99. Cited by: §I.
  • [17] T. Hastie, R. Tibshirani, and J. Friedman (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. Cited by: §II-A1.
  • [18] T. Hastie and R. Tibshirani (1996) Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 155–176. Cited by: §II-A2.
  • [19] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer (2016) End-to-end text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115–5119. Cited by: §I.
  • [20] S. Ioffe (2006) Probabilistic linear discriminant analysis. In European Conference on Computer Vision (ECCV), pp. 531–542. Cited by: §I, §II-A3, §IV-B3.
  • [21] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu, et al. (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems (NIPS), pp. 4480–4490. Cited by: §I.
  • [22] J. Jung, H. Heo, J. Kim, H. Shim, and H. Yu (2019) RawNet: Advanced End-to-End Deep Neural Network Using Raw Waveforms for Text-Independent Speaker Verification. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1268–1272. Cited by: §I.
  • [23] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel (2007) Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15 (4), pp. 1435–1447. Cited by: §I.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: 3rd item.
  • [25] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §III-A.
  • [26] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5220–5224. Cited by: §IV-A.
  • [27] L. Li, Y. Chen, Y. Shi, Z. Tang, and D. Wang (2017) Deep speaker feature learning for text-independent speaker verification. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1542–1546. Cited by: §I.
  • [28] L. Li, Y. Lin, Z. Zhang, and D. Wang (2015) Improved deep speaker feature learning for text-dependent speaker recognition. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 426–429. Cited by: §I.
  • [29] L. Li, Z. Tang, Y. Shi, and D. Wang (2019) Gaussian-constrained training for speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6036–6040. Cited by: §I, §II.
  • [30] L. Li, Z. Tang, D. Wang, and T. F. Zheng (2018) Full-info training for deep speaker feature learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5369–5373. Cited by: §I.
  • [31] L. Li, D. Wang, Y. Chen, Y. Shi, Z. Tang, and T. F. Zheng (2018) Deep factorization for speech signal. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5094–5098. Cited by: §I.
  • [32] L. Li, D. Wang, C. Xing, and T. F. Zheng (2016) Max-margin metric learning for speaker recognition. In 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–4. Cited by: §I.
  • [33] R. Li, N. L. D. Tuo, M. Yu, D. Su, and D. Yu (2019) Boundary discriminative large margin cosine loss for text-independent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6321–6325. Cited by: §I.
  • [34] D. J. MacKay (1995) Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 354 (1), pp. 73–80. Cited by: §III-A.
  • [35] M. McLaren, L. Ferrer, D. Castan, and A. Lawson (2016) The speakers in the wild (SITW) speaker recognition database. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 818–822. Cited by: §IV-A.
  • [36] A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §II-B, §IV-A.
  • [37] K. Okabe, T. Koshinaka, and K. Shinoda (2018) Attentive statistics pooling for deep speaker embedding. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 2252–2256. Cited by: §I, §I.
  • [38] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan (2019) Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762. Cited by: §III-A, §III-A.
  • [39] G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems (NIPS), pp. 2338–2347. Cited by: Fig. 6, 3rd item.
  • [40] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Cited by: 1st item, 2nd item, 1st item.
  • [41] F. R. rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan (2018) Attention-based models for text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5359–5363. Cited by: §I.
  • [42] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn (2000) Speaker verification using adapted Gaussian mixture models. Digital signal processing 10 (1-3), pp. 19–41. Cited by: §I.
  • [43] D. A. Reynolds (2002) An overview of automatic speaker recognition technology. In IEEE international conference on Acoustics, speech, and signal processing (ICASSP), Vol. 4, pp. IV–4072. Cited by: §I.
  • [44] W. Rudin (2006) Real and complex analysis. Tata McGraw-hill education. Cited by: §III-A.
  • [45] S. O. Sadjadi, C. Greenberg, E. Singer, D. Reynolds, L. Mason, and J. Hernandez-Cordero (2019) The 2018 NIST Speaker Recognition Evaluation. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1483–1487. Cited by: §I.
  • [46] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §I, §I, §I.
  • [47] D. Snyder, G. Chen, and D. Povey (2015) MUSAN: A Music, Speech, and Noise Corpus. Note: arXiv:1510.08484v1 External Links: 1510.08484 Cited by: §IV-A.
  • [48] T. Stafylakis, J. Rohdin, O. Plchot, P. Mizera, and L. Burget (2019) Self-Supervised Speaker Embeddings. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 2863–2867. Cited by: §I.
  • [49] E. G. Tabak and C. V. Turner (2013) A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164. Cited by: §III-A.
  • [50] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez (2014) Deep neural networks for small footprint text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056. Cited by: §I.
  • [51] J. S. C. W. Xie and A. Zisserman (2019) Utterance-level aggregation for speaker recognition in the wild. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5791–579. Cited by: §I.
  • [52] J. Wang, K. Wang, M. T. Law, F. Rudzicz, and M. Brudno (2019) Centroid-based deep metric learning for speaker recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3652–3656. Cited by: §I.
  • [53] S. Wang, J. Rohdin, L. Burget, O. Plchot, Y. Qian, K. Yu, and J. Cernocky (2019) On the Usage of Phonetic Information for Text-Independent Speaker Embedding Extraction. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1148–1152. Cited by: §I, §I.
  • [54] X. Wang, L. Li, and D. Wang (2019) VAE-based domain adaptation for speaker verification. arXiv preprint arXiv:1908.10092. Cited by: §I.
  • [55] Z. Wu, S. Wang, Y. Qian, and K. Yu (2019) Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1163–1167. Cited by: §I.
  • [56] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot (2019) BUT system description to voxceleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592. Cited by: 2nd item, 3rd item.
  • [57] S. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong (2016) End-to-end attention based text-dependent speaker verification. In Spoken Language Technology Workshop (SLT), pp. 171–178. Cited by: §I.
  • [58] Y. Zhang, L. Li, and D. Wang (2019) VAE-based regularization for deep speaker embedding. arXiv preprint arXiv:1904.03617. Cited by: §I, §II.
  • [59] J. Zhou, T. Jiang, Z. Li, L. Li, and Q. Hong (2019) Deep Speaker Embedding Extraction with Channel-Wise Feature Responses and Additive Supervision Softmax Loss Function. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 2883–2887. Cited by: §I.
  • [60] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey (2018) Self-attentive speaker embeddings for text-independent speaker verification. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 3573–3577. Cited by: 1st item, 2nd item.