I. Introduction
Speaker recognition (SRE) is widely studied, and decades of investigation have resulted in significant performance improvements and deployment in a wide range of practical applications [4, 43, 16]. Traditional SRE methods are based on statistical models, most notably the popular Gaussian mixture model-universal background model (GMM-UBM) architecture [42]. To improve statistical strength with limited data, various subspace models have been proposed [23]; in particular, the i-vector model was the most successful and widely used [7]. An important concept introduced by the i-vector model is speaker embedding, that is, representing speakers by fixed-length continuous vectors. With this embedding, the derived speaker vectors construct a speaker space, where SRE tasks can be conducted using either simple cosine distance scoring or a more complex back-end model such as probabilistic linear discriminant analysis (PLDA) [20]. It should be noted that the i-vector approach essentially performs a statistical embedding, as it is based on statistical models, and the derived i-vectors are therefore statistical speaker vectors.

In recent years, deep learning methods have demonstrated significant progress in SRE. Variani et al. reported the results of an initial investigation on a text-dependent task [50], using a deep neural net (DNN) to produce frame-level speaker-discriminant features. They then derived utterance-level representations (called 'd-vectors') by average pooling. Li et al. [27] extended this with a more speech-friendly net structure, and achieved good performance on text-independent tasks, especially with short utterances. However, one key shortcoming of the d-vector approach is that the frame-level training does not match the utterance-level test. Researchers developed two architectures to solve this problem. The first is the end-to-end architecture, which accepts two utterances and produces the accept/reject decision directly [19, 57, 41]. The second is the speaker embedding architecture [46, 37], which instead accumulates the frame-level statistics of a variable-length utterance and converts them to a fixed-length vector, with the objective of discriminating between the speakers in the training set. Although the training criterion of the end-to-end architecture is more consistent with the SRE task, the embedding architecture is easier to train [31], and the derived speaker vectors support various speaker-related tasks such as speaker-dependent synthesis [21].

The concept of embedding with deep learning methods is exactly the same as in the i-vector model. To differentiate between them, in this paper we denote DNN-based embedding by deep embedding, and the derived fixed-length vectors by deep speaker vectors. Perhaps the most popular deep embedding architecture is the x-vector model proposed by Snyder et al. [46]. It is based on a time-delay neural net and statistical pooling, and has achieved good performance in many applications. For simplicity, we use the term x-vector to represent deep speaker vectors derived with any net structure.
Recently, deep speaker embedding models have been significantly improved by more comprehensive architectures [6, 22], improved pooling methods [37, 3, 51, 5], better training criteria [32, 9, 52, 2, 14, 59], and refined training schemes [33, 53, 48]. As a result, the deep embedding approach has achieved state-of-the-art (SOTA) SRE performance [45].
Despite the significant progress outlined above, one potential issue with the deep embedding approach is that the training objective of the embedding models is purely discriminative: the goal of training is simply to discriminate between speakers, without considering the distribution of the derived speaker vectors. This 'unconstrained training' tends to produce irregulated speaker distributions (i.e. within-speaker or conditional distributions). This means that: (1) the distribution of each individual speaker may be very complex and far from Gaussian (non-Gaussianity), and (2) the distributions of different speakers may be significantly different (non-homogeneity). Non-Gaussian and non-homogeneous distributions may seriously impact the performance of the back-end scoring model, particularly the popular PLDA scoring method, which is based on the assumption that all speaker distributions are homogeneous and Gaussian. Fig. 1 illustrates some potential problems caused by non-Gaussian and non-homogeneous distributions. It should also be noted that this issue is not as severe for statistical speaker vectors (i-vectors), as they are derived from a constrained model with an underlying Gaussian assumption.
A number of researchers have noticed the risk associated with data irregularity, and have presented various compensations. For instance, phone-aware training was investigated, with the aim of reducing the non-Gaussian variation caused by speech content [28, 53]. Another proposed approach was to use various data augmentation methods [46, 55]. These augmentation methods prevent the training from overfitting to highly curved discrimination boundaries, hence yielding a more regulated distribution for each individual speaker. Li et al. presented an approach that treats the full-connection layer before the softmax classifier as the basis of speakers, and imposes a center loss [30, 29]. This center loss is an explicit regularization that encourages the distributions of individual speakers to be more Gaussian. This concept was also adopted by other researchers [3].

Previous work by the authors presented a variational autoencoder (VAE) based normalization for x-vectors [58, 54]. This is an independent component dedicated to regulating x-vectors, and differs from other research that integrates the normalization constraint into the embedding model. Our VAE-based normalization encourages the marginal distribution to be Gaussian. However, it cannot normalize conditional distributions, i.e. the distributions of individual speakers, which are a more important source of data irregularity. In addition, to retain the discriminative strength of the speaker vectors, an auxiliary cross-entropy loss is required when training the VAE model, which complicates the behavior of the normalizer.
In this paper, we propose a new, fully generative model to normalize deep speaker vectors. This model is based on normalization flow (NF), a simple yet powerful architecture for density estimation. With this model, a complex distribution can be transformed to a simple isotropic Gaussian (often called the prior distribution). However, directly applying the NF model is insufficient: it regulates the marginal distribution rather than the conditional distributions (as with VAE), and so cannot deal with individual speakers. We therefore propose a novel discriminative normalization flow (DNF). Compared with the vanilla NF model, DNF allows class-specific prior distributions, which enables it to model multiple speakers with different but homogeneous isotropic Gaussians. This paper will show that our new DNF approach is a deep and nonlinear extension of the widely used linear discriminant analysis (LDA) model.
The remainder of the paper is organized as follows. We first review the LDA and PLDA models in Section II, and through experiments investigate the normalization role that LDA plays when applied to x-vectors with PLDA scoring. Section III presents the DNF model, which extends the shallow, linear normalization of LDA to a deep, nonlinear normalization. Experimental results and analysis are presented in Section IV, and the paper is concluded in Section V.
II. Shallow normalization by LDA
It is well known that for x-vector systems with PLDA scoring, LDA is an important preprocessing step. This is initially unexpected, as PLDA is theoretically an extension of LDA, and can self-discover the most discriminant dimensions. Previous work [58, 29] by the authors argued that the role LDA plays is distribution normalization, making the conditional distributions of speaker vectors more Gaussian. This normalization makes the LDA-projected data fit the assumptions of PLDA, and therefore makes them more suitable for PLDA modeling. However, this argument should be investigated in more depth, with more theoretical and empirical analysis.
II-A. Review of LDA and PLDA
II-A.1 Dimension reduction view for LDA
LDA is a popular tool for dimension reduction. It identifies the most class-discriminant directions, along which the between-class variation is maximized and the within-class variation is minimized (the Fisher criterion) [13]. These directions are aligned with the eigenvectors of $\Sigma_w^{-1}\Sigma_b$ with large eigenvalues, where $\Sigma_b$ and $\Sigma_w$ are the between-class and within-class covariance matrices, respectively. An important property of these eigenvectors is that they diagonalize $\Sigma_b$ and $\Sigma_w$ simultaneously.

Research has shown that LDA can be performed in two steps: normalization and discrimination [17]. Normalization is a linear transform after which the within-class covariance becomes an identity matrix. This can be achieved with principal component analysis (PCA) on the mean-offset data, followed by a rescaling operation. Discrimination is an extra orthogonal transform that aligns the data coordinates along the directions with the largest between-class variation. This can be achieved by another PCA on the normalized class means, and dimension reduction can be achieved by selecting the leading principal components (PCs). Within this discrimination space, classification based on Euclidean distance will be optimal, as shown in Fig.
2.

II-A.2 Probabilistic model view for LDA
LDA can be cast as a multi-Gaussian model [18]. Assuming that all classes are Gaussian distributed, share the same covariance, and have equal priors, a probabilistic model can be established. The parameters, including the class means $\mu_k$ and the shared covariance matrix $\Sigma$, can be optimized following the maximum-likelihood (ML) criterion. After training, optimal classification (in terms of maximum a posteriori, MAP) for a sample $x$ can be obtained by choosing the class $k$ with the largest posterior $p(k|x)$. This is the primary form of LDA. The dimension reduction can be recovered by a constrained multi-Gaussian model where the mean vectors of all the classes lie in a subspace. Fig. 2 illustrates the multi-Gaussian view of LDA and its relationship with the dimension-reduction view.
As discussed, LDA assumes that the within-class distributions are Gaussian and that they share the same covariance. We denote these as the Gaussianity condition and the homogeneity condition, respectively. If both conditions are satisfied, we call the data regulated. Regulated data are suitable for LDA dimension reduction; otherwise, the selected features would not be optimal in terms of class discrimination.
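The two-step procedure above (normalization, then discrimination) can be sketched numerically. Below is a minimal NumPy illustration on synthetic regulated data; all variable names are ours, for illustration only. Whitening makes the within-class covariance the identity, and a second PCA on the class means supplies the discriminant rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regulated data: 3 classes in 2-D sharing one within-class covariance.
means = np.array([[0.0, 0.0], [4.0, 1.0], [2.0, 5.0]])
W = np.array([[2.0, 1.2], [1.2, 1.5]])            # shared within-class covariance
X = np.vstack([rng.multivariate_normal(m, W, 500) for m in means])
y = np.repeat([0, 1, 2], 500)

# Step 1 (normalization): whiten the within-class covariance to the identity,
# i.e. PCA on the mean-offset data plus a per-direction rescaling.
Sw = sum(np.cov(X[y == k].T, bias=True) for k in range(3)) / 3
evals, evecs = np.linalg.eigh(Sw)
A = evecs / np.sqrt(evals)                        # whitening transform (columns)
Xn = (X - X.mean(axis=0)) @ A

Sw_n = sum(np.cov(Xn[y == k].T, bias=True) for k in range(3)) / 3
assert np.allclose(Sw_n, np.eye(2), atol=1e-6)    # within-class cov is now I

# Step 2 (discrimination): PCA on the normalized class means; the leading
# directions carry the largest between-class variation.
Mn = np.array([Xn[y == k].mean(axis=0) for k in range(3)])
_, _, Vt = np.linalg.svd(Mn - Mn.mean(axis=0))
X_lda = Xn @ Vt.T                 # keep the leading columns to reduce dimension
```

In this whitened-and-rotated space, Euclidean (and cosine) distances become meaningful for classification, which is the property exploited later in the paper.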
II-A.3 PLDA
PLDA extends LDA (in the probabilistic model view) by placing a Gaussian prior on the class means. A key advantage of this prior is that it enables dealing with new classes. More specifically, the posterior distribution of the mean of a new class can be derived from even a single sample. According to this posterior, the probability that one or more test samples belong to the new class can be calculated by marginalizing over the class mean. PLDA can therefore be used to compute the likelihood ratio between the hypotheses that two utterances belong to the same speaker or to different speakers [20], and is a theoretically sound scoring model. It is important to highlight that PLDA inherits the shared-Gaussian assumption of LDA and requires regulated data. If the data are irregulated, the likelihood ratio derived by PLDA may be biased.
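For intuition, this likelihood ratio can be written down directly in the two-covariance formulation: the class mean is drawn from N(0, B) and samples are drawn around it with covariance W. The sketch below is our own simplification with known toy covariances, not the estimation procedure used in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, B, W):
    """Log-likelihood ratio that x1 and x2 share a class, under the
    two-covariance model: mean ~ N(0, B), sample | mean ~ N(mean, W)."""
    d = len(x1)
    joint = np.concatenate([x1, x2])
    # Same speaker: the shared latent mean correlates the two samples.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # Different speakers: independent means, block-diagonal covariance.
    cov_diff = np.block([[B + W, np.zeros((d, d))],
                         [np.zeros((d, d)), B + W]])
    return (multivariate_normal.logpdf(joint, mean=np.zeros(2 * d), cov=cov_same)
            - multivariate_normal.logpdf(joint, mean=np.zeros(2 * d), cov=cov_diff))

B, W = np.eye(2) * 4.0, np.eye(2)     # toy between/within covariances
same = plda_llr(np.array([2.0, 2.0]), np.array([2.1, 1.9]), B, W)
diff = plda_llr(np.array([2.0, 2.0]), np.array([-2.0, -2.0]), B, W)
assert same > diff                     # a same-speaker pair scores higher
```

Note that the covariances B and W are exactly the quantities that the regulated-data assumption concerns: if the per-speaker distributions are not homogeneous Gaussians with covariance W, the ratio above is computed under the wrong model.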
II-B. Why LDA works for x-vectors
To better understand the role of LDA and explain its contribution to the back-end scoring model, in particular PLDA, we performed an initial experiment using a subset of the VoxCeleb dataset [6, 36]. We created both an i-vector and an x-vector system using the entire VoxCeleb database, following the recipes that will be described in Section IV. We then generated i-vectors and x-vectors from a subset of 600 speakers in the training set, and used these to investigate the statistical properties of the vectors before and after LDA.
II-B.1 Global properties
First of all, we examine the between-speaker and within-speaker covariance matrices of i-vectors and x-vectors. These matrices reflect the global properties of the distribution of the speaker vectors (i.e. computed on the data of all the speakers). As shown in Fig. 3, x-vectors exhibit more complex correlation patterns than i-vectors. After LDA, for both x-vectors and i-vectors, the between- and within-speaker covariances become diagonal. This diagonalization can be regarded as a global normalization. As the correlation patterns of x-vectors are more complex, this normalization is more substantial for x-vectors. To an extent, this answers the question of why LDA contributes more for x-vectors with cosine scoring: although x-vectors are derived from a discriminative model and LDA-based feature selection seems less important, the diagonal between- and within-speaker covariances benefit cosine scoring (which assumes that the feature dimensions are independent).
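The simultaneous diagonalization mentioned above is easy to verify numerically: the generalized eigenvectors of the between/within covariance pair whiten the within-class covariance and diagonalize the between-class covariance at the same time. A sketch on synthetic data (all names are ours):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)

# Synthetic "speaker vectors": 20 classes in 5-D, shared within-class covariance.
d, C, n = 5, 20, 200
Wc = np.cov(rng.standard_normal((d, 200)))        # a random SPD within-class cov
mus = rng.standard_normal((C, d)) * 3
X = np.vstack([rng.multivariate_normal(m, Wc, n) for m in mus])
y = np.repeat(np.arange(C), n)

mu = X.mean(axis=0)
Sw = sum(np.cov(X[y == k].T, bias=True) for k in range(C)) / C
mk = np.array([X[y == k].mean(axis=0) for k in range(C)])
Sb = (mk - mu).T @ (mk - mu) / C

# Generalized eigenvectors of (Sb, Sw) give the LDA projection; they
# diagonalize both covariances simultaneously.
_, V = eigh(Sb, Sw)
Sb_p, Sw_p = V.T @ Sb @ V, V.T @ Sw @ V
assert np.allclose(Sw_p, np.eye(d), atol=1e-6)                 # whitened
assert np.allclose(Sb_p, np.diag(np.diag(Sb_p)), atol=1e-6)    # diagonal
```

This is exactly the property that makes the LDA-projected dimensions (approximately) independent, and hence friendly to cosine scoring.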
II-B.2 Local properties
The global normalization of LDA does not explain everything. In particular, it cannot explain why LDA contributes to PLDA scoring, as PLDA performs the same normalization anyway. Our hypothesis is that LDA, by discarding less discriminative dimensions, performs a local normalization at the class level. More precisely, we will show that LDA improves both the homogeneity among classes and the Gaussianity of each class.
We test this hypothesis with both i-vectors and x-vectors. The tests are conducted in three spaces: (1) the original observation space; (2) the LDA space, i.e. the subspace spanned by the leading discriminant dimensions after the LDA transform; (3) the residual space, i.e. the subspace complementary to the LDA space. Since testing homogeneity and Gaussianity in a high-dimensional space is challenging, we perform the tests on the principal directions of each conditional distribution. Specifically, we select speakers with a sufficient number of samples, perform PCA on the data of each speaker, and investigate homogeneity and Gaussianity along the leading PC directions. The statistics we collect are as follows:

PC direction variance, for homogeneity. This tests whether the covariance matrices of all the speakers have the same PC directions. After PCA, the first PC (PC1) of each speaker is selected and the mean PC1 over all speakers is computed. The cosine distance between the PC1 of each individual speaker and the mean PC1 is then computed, and the variance of these cosine scores is used as the measure of PC1 direction variance. The same computation is conducted on all PCs. In this section, we report the direction variance on PC1 and PC2, and the averaged direction variance over the first 10 PCs.

PC shape variance, for homogeneity. Using PC1 as an example, the coefficients (eigenvalues) of the covariance matrices of all the speakers on the first PC are calculated, and the variance of these coefficients over all speakers is computed. The same computation is performed on all the PCs. Since the coefficient on each PC determines the spread of the samples in that direction, the coefficients on all the PCs determine the shape of the speaker distribution. The variances of these coefficients over all speakers therefore test whether the distributions of all speakers have the same shape (regardless of directions), hence the name PC shape variance. We report the shape variance on PC1 and PC2, and the averaged shape variance over the first 10 PCs.

Average PC kurtosis, for Gaussianity. In each PC direction, we compute the kurtosis for each speaker, and then compute the mean of the kurtosis over all the speakers. The averaged kurtosis over the first 10 PCs is reported.

Average PC skewness, for Gaussianity. In each PC direction, we compute the skewness for each speaker, and then compute the mean of the skewness over all the speakers. The averaged skewness over the first 10 PCs is reported.
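The four statistics above can be computed per speaker as in the following sketch. Function and variable names are ours, and details such as eigenvector sign alignment are our assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def speaker_pc_stats(vectors_by_spk, n_pc=10):
    """Per-speaker PCA statistics probing homogeneity and Gaussianity.
    vectors_by_spk: list of (n_i, d) arrays, one array per speaker."""
    dirs, coefs, kurts, skews = [], [], [], []
    for Xs in vectors_by_spk:
        Xc = Xs - Xs.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(Xc.T))
        order = np.argsort(evals)[::-1][:n_pc]       # leading PCs first
        evecs, evals = evecs[:, order], evals[order]
        proj = Xc @ evecs                            # samples projected on PCs
        dirs.append(evecs)
        coefs.append(evals)                          # per-PC variances ("shape")
        kurts.append(kurtosis(proj, axis=0))         # 0 for a Gaussian
        skews.append(skew(proj, axis=0))             # 0 for a symmetric dist.
    dirs, coefs = np.array(dirs), np.array(coefs)
    # Align eigenvector signs to the first speaker before averaging.
    signs = np.sign(np.einsum('sdk,dk->sk', dirs, dirs[0]))
    dirs = dirs * signs[:, None, :]
    mean_dir = dirs.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir, axis=0, keepdims=True)
    cos = np.einsum('sdk,dk->sk', dirs, mean_dir)    # cosine to the mean PC
    return {'dir_var': cos.var(axis=0),              # homogeneity of directions
            'shape_var': coefs.var(axis=0),          # homogeneity of shapes
            'kurtosis': np.mean(kurts, axis=0),      # Gaussianity
            'skewness': np.mean(skews, axis=0)}      # Gaussianity
```

For perfectly regulated data (shared Gaussian conditionals), all four statistics approach zero up to sampling noise, which is the baseline against which the table below should be read.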
The results are shown in Table I, where we also report the between- and within-speaker variances, as well as the SRE performance with cosine scoring in terms of equal error rate (EER). There are several key observations:

Comparing i-vectors and x-vectors in the original observation space, it can be seen that x-vectors possess larger PC direction and PC shape variances, demonstrating that the distributions of different speakers are less homogeneous. Moreover, x-vectors show larger kurtosis and skewness, indicating that the distributions of individual speakers are less Gaussian. This confirms our conjecture that x-vectors are less regulated, and are thus less suitable for PLDA scoring than i-vectors.

For both i-vectors and x-vectors, homogeneity and Gaussianity are improved after LDA, and this improvement is much more significant for x-vectors. This indicates that LDA-transformed data are more regulated, especially for x-vectors.

In the residual space, LDA improves homogeneity. With regard to Gaussianity, there is a slight improvement for i-vectors; for x-vectors, however, Gaussianity becomes worse after LDA.
                            i-vector                   x-vector
                    Orig.    LDA      Res.      Orig.    LDA      Res.
EER%                5.75     3.08     19.83     7.00     1.50     15.08
PC1 dir. var        0.064    0.009    0.004     0.104    0.006    0.003
PC2 dir. var        0.089    0.008    0.005     0.156    0.005    0.003
Avg PC dir. var     0.028    0.007    0.004     0.060    0.006    0.003
PC1 shape var       80.1     64.0     60.5      156.0    42.0     128.0
PC2 shape var       53.3     32.3     34.6      68.3     21.8     63.0
Avg PC shape var    30.7     19.8     20.9      42.0     13.7     32.5
PC kurtosis         1.579    0.734    1.268     2.615    1.686    31.40
PC skewness         0.311    0.209    0.309     0.369    0.275    1.110
Between-class var   0.269    1.164    0.163     0.548    2.332    0.225
Within-class var    0.753    0.996    0.991     0.192    1.030    1.026
The final observation is of greatest interest. It suggests that LDA not only selects discriminant dimensions, but also removes non-Gaussian dimensions. To test this conjecture more concretely, we divide all the dimensions sorted by LDA (according to their discriminant power) into multiple subgroups, and compute the averaged PC shape variance, PC kurtosis and PC skewness, plus the between-speaker variance and the EER with cosine scoring. The results are shown in Fig. 4. It can be clearly seen that the least discriminative dimensions are non-homogeneous and non-Gaussian. In comparison, Fig. 5 shows the results for i-vectors, where the indiscriminative dimensions are much more regulated.
II-B.3 Summary of LDA and PLDA
Based on the findings above, a number of key conclusions can be drawn regarding the role that LDA plays for i-vectors and x-vectors. The typical role of LDA is twofold: it transforms the between- and within-class covariances to be diagonal, and it selects the most discriminant dimensions. Both functions contribute to cosine scoring. For PLDA scoring, however, these functions are performed implicitly by PLDA itself, and LDA preprocessing is therefore not usually necessary. This is the case for i-vectors; but for x-vectors, LDA plays an extra role, speaker vector normalization, by removing irregulated (non-homogeneous and/or non-Gaussian) dimensions. These irregulated dimensions may be attributed to unwanted variance such as linguistic content and length variation, but also simply to the unconstrained nature of deep embedding models. Removing these dimensions makes the data more regulated and hence benefits PLDA scoring.
III. Deep normalization by discriminative normalization flow
The previous section demonstrated that suitable data regulation is very important for PLDA scoring. However, the linear form of LDA means that it can only normalize the global structure (the within- and between-class covariances), rather than individual speakers. Within-speaker normalization is essentially achieved by dimension trimming. However, the least discriminant dimensions are not necessarily the most irregulated ones. This can be clearly seen in Fig. 4, where the most non-Gaussian and non-homogeneous dimensions are in the first subgroup, i.e., the most discriminant dimensions. This means that dimension trimming is optimal neither for selecting discriminant features (some discriminative features have to be removed because they are irregulated) nor for selecting regulated features (some irregulated features cannot be removed because they are discriminative).
Here, we present a new deep normalization model, which is based on deep generative neural nets and is designed to normalize the distributions of individual speakers. Specifically, we utilize the powerful normalization flow (NF) model to perform the distribution transform, and propose a novel discriminative NF (DNF) model to deal with multiple speakers. To the best of the authors' knowledge, this represents a new research direction in this domain.
III-A. Normalization flow
Deep generative models transform a simple distribution via a deep neural net so that the output distribution matches the true data [34]. Typical deep generative models include generative adversarial networks (GANs) [15] and variational autoencoders (VAEs) [25]. Normalization flow (NF) is another deep generative model; it is similar to the VAE, but the transform is invertible, so it does not require an explicit encoder and the likelihood can be computed exactly [38]. In this research, we choose NF as the basic architecture of our deep normalization model.
The foundation of NF is the principle of distribution transformation for continuous variables [44]. Let a latent variable $z$ and an observation variable $x$ be linked by an invertible transform $x = f(z)$. Their probability densities then have the following relationship:

$$\log p(x) = \log p(z) + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|, \qquad z = f^{-1}(x), \tag{1}$$

where $f^{-1}$ is the inverse function of $f$. It has been shown that if $f$ is flexible enough, a simple distribution $p(z)$, which we assume to be a standard Gaussian, can be transformed to a very complex distribution $p(x)$ [38]. Note that the second term on the right-hand side of the above equation represents the volume (entropy) change during the transform.

Usually, $f$ is implemented as a composition of a sequence of relatively simple invertible transforms $f_i$:

$$x = f(z) = f_N \circ f_{N-1} \circ \cdots \circ f_1(z),$$

where every $f_i$ can be a structured neural net [49]. The entire transform then satisfies:

$$\log p(x) = \log p(z_0) + \sum_{i=1}^{N} \log \left| \det \frac{\partial f_i^{-1}(z_i)}{\partial z_i} \right|,$$

where we have defined $z_0 = z$, $z_N = x$, and $z_{i-1} = f_i^{-1}(z_i)$. This model resembles a flow of transforms, which gradually reshapes the simple distribution on $z$ until it reaches the complex distribution on $x$. In the inverse direction, it normalizes the complex distribution on $x$ to a simple distribution on $z$, and is therefore called a normalization flow. Fig. 6 illustrates how a complex distribution is normalized to a simple distribution by an NF model.

The NF model can be trained with the maximum-likelihood (ML) criterion. Note that Eq. (1) formulates a density on the observation $x$, where the first term is often called the prior distribution, and the second term is called the entropy term. ML training optimizes the NF model with the following objective:

$$\mathcal{L}(\theta) = \sum_{t} \log p_\theta(x_t) = \sum_{t} \left[ \log p\big(f_\theta^{-1}(x_t)\big) + \log \left| \det \frac{\partial f_\theta^{-1}(x_t)}{\partial x_t} \right| \right],$$

where $t$ indexes the training samples and $\theta$ represents the parameters of the model. Once the model has been well trained, it can be used to (a) sample $x$ by sampling $z$; (b) compute $p(x)$ by calculating the prior and the entropy term; and (c) normalize $x$ by transforming it to $z$, which is Gaussian distributed. In this paper, we focus on the normalization capability of the NF model.
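As a concrete instance of Eq. (1), consider a one-dimensional affine "flow": the prior term plus the entropy term recovers exactly the Gaussian density that the transform induces. A minimal sketch (our own toy example, not the paper's model):

```python
import numpy as np
from scipy.stats import norm

# Eq. (1) in one dimension: an affine transform x = f(z) = a*z + b maps a
# standard Gaussian prior to N(b, a^2). The inverse is f^{-1}(x) = (x - b)/a,
# and the entropy term is log|d f^{-1}/dx| = -log a.
a, b = 2.0, 3.0

def log_p_x(x):
    z = (x - b) / a                       # normalize: complex -> simple
    return norm.logpdf(z) - np.log(a)     # prior term + entropy term

xs = np.linspace(-10.0, 16.0, 1000)
# The flow-based density matches N(b, a^2) exactly.
assert np.allclose(log_p_x(xs), norm.logpdf(xs, loc=b, scale=a))
```

A real NF stacks many such invertible maps (each a small neural net), so the induced density can be far from Gaussian while the latent code remains exactly Gaussian.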
The key issue when designing an NF model is to identify a net structure for which the entropy term in Eq. (1) can be easily computed. Researchers have proposed various NF models based on different structures. These models can be categorized into volume-preserving (VP) [10] and non-volume-preserving (NVP) [11] models. VP models do not change the volume during the flow (i.e. the entropy term is zero), while NVP models do not have this constraint and are therefore generally more flexible.
III-B. Discriminative normalization flow
The vanilla NF model optimizes the distribution of the training data without considering the class labels, i.e. the marginal distribution. This means that data from different classes tend to congest together in the latent space, and the distributions of individual classes are non-Gaussian, as shown in the top row of Fig. 7. This is not a good property for classification tasks like SRE. Conditional NF models [1] may take the class information as a conditioning variable; however, the conditioning cannot be generalized to unseen classes (e.g. unknown speakers), which makes them unsuitable for open-set tasks such as SRE.
In order to normalize the distributions of individual classes while keeping different classes separated, we propose a discriminative normalization flow (DNF) model. The main difference is that we allow each class to have its own Gaussian prior, i.e. all the priors share the same covariance but possess different means, formulated as follows:

$$p(z|s) = N(z; \mu_s, \Sigma),$$

where $s$ is the class label. By setting class-specific means $\mu_s$, different classes will be separated from each other in the latent space, as shown in the bottom row of Fig. 7.
Training DNF is mostly the same as training the vanilla NF, following the ML criterion. The only difference is that the probability of an observation is evaluated with the prior corresponding to its class label, formally written as:

$$\log p(x) = \log N\big(f^{-1}(x); \mu_{s(x)}, \Sigma\big) + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|,$$

where $s(x)$ is the class label of $x$, and $z = f^{-1}(x)$. Pooling all the training data, we obtain the objective function for DNF training:

$$\mathcal{L}(\theta) = \sum_{t} \left[ \log N\big(f_\theta^{-1}(x_t); \mu_{s(x_t)}, \Sigma\big) + \log \left| \det \frac{\partial f_\theta^{-1}(x_t)}{\partial x_t} \right| \right],$$

where $\theta$ involves all the parameters of the model. Note that this objective is somewhat over-parameterized, as the covariance $\Sigma$ can be set to any value if the flow is flexible enough, e.g. in the case of NVP. We therefore manually set $\Sigma = I$ and let the flow handle the volume change.
After training, the DNF model establishes a normalization space for $z$, where the distribution of every class is simply a Gaussian with covariance $I$. With this model, an observation $x$ can be transformed to its latent code $z$ by the inverse transform $f^{-1}$, without knowing its class label. In addition, the latent codes from the same class, which may be unknown, tend to be Gaussian distributed. From this perspective, DNF is a nonlinear feature transform that is dedicated to within-class normalization.
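The DNF objective can be illustrated with the same kind of toy flow: the only change from vanilla NF is that the prior mean depends on the class label. The sketch below (all names ours) fits a shared affine scale by grid search; a real DNF would instead train an NVP-style invertible network by gradient descent.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy data: two 1-D "speakers" with different means and a shared scale of 2.
x = np.concatenate([rng.normal(-4.0, 2.0, 300), rng.normal(6.0, 2.0, 300)])
labels = np.repeat([0, 1], 300)

def dnf_nll(a, b, x, labels):
    """Negative DNF objective for an affine flow z = (x - b) / a with
    Sigma = I: class-specific prior means plus the entropy term -log(a)."""
    z = (x - b) / a
    mus = np.array([z[labels == k].mean() for k in (0, 1)])  # ML class means
    return -np.sum(norm.logpdf(z, loc=mus[labels]) - np.log(a))

b = x.mean()
grid = np.linspace(0.5, 5.0, 100)
best_a = grid[np.argmin([dnf_nll(a, b, x, labels) for a in grid])]
assert 1.5 < best_a < 2.5    # recovers the shared within-class scale (~2)
```

Note how both classes remain unit-variance in the latent space while keeping distinct means, which is exactly the homogeneous-Gaussian structure that PLDA assumes.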
III-C. Relation to LDA
From the probabilistic model view, DNF is a nonlinear extension of LDA. Both DNF and LDA are generative models, and they share the same assumption that the distributions of all classes are homogeneous Gaussians in the latent space. For LDA, however, this assumption can never hold if the data are complex, due to the limitation of the linear transform between the data space and the latent space. DNF, in contrast, is based on a nonlinear transform, which allows it to establish a truly homogeneous and Gaussian latent space, even for complex, irregulated data.
From the dimension reduction view, DNF plays a similar role to the normalization step of LDA. Both approaches normalize the distribution of the data; the key difference is that the normalization step of LDA normalizes the aggregated conditional distribution of all classes to an isotropic Gaussian, while DNF normalizes each conditional distribution to a homogeneous isotropic Gaussian. Our DNF approach can therefore deliver a more powerful normalization than the linear normalization of LDA.
It should be noted, however, that unlike LDA, DNF does not normalize the between-class covariance, which may lead to a performance loss with classification methods that assume dimension independence, such as those based on cosine distance. We can therefore combine DNF and LDA by substituting DNF for the linear normalization step of LDA, while keeping the linear discrimination step of LDA unchanged. This leads to a new model with a nonlinear normalization step and a linear discrimination step, which we call nonlinear discriminative analysis (NDA); we will investigate its performance in the experimental section.
IV. Experiments
IV-A. Datasets
Three datasets were used in our experiments: VoxCeleb [36, 6], SITW [35] and CN-Celeb [12]. VoxCeleb was used for training all the models (the i-vector, x-vector, LDA, PLDA and DNF models), while the other two were used for performance evaluation.
VoxCeleb: This is a large-scale audio-visual speaker database collected by the University of Oxford, UK. The entire database consists of VoxCeleb1 and VoxCeleb2. All the speech signals were collected from open-source media channels and therefore involve rich variation in channel, style, and ambient noise. This dataset, after removing the utterances shared with the SITW dataset, was used to train the i-vector, x-vector, LDA, PLDA and DNF models; it contains more than 2,000 hours of speech from more than 7,000 speakers. Data augmentation was applied to improve robustness, with the MUSAN corpus [47] used to generate noisy utterances, and the room impulse response (RIRS) corpus [26] used to generate reverberant utterances.

SITW: This is a standard evaluation dataset excerpted from VoxCeleb1, consisting of 299 speakers. In our experiments, the Eval. Core test set, containing both target and imposter trials, was used for evaluation. It should be noted that the acoustic condition of SITW is similar to that of the VoxCeleb training set, so this test can be regarded as an in-domain test.

CN-Celeb: This is a large-scale free speaker recognition dataset collected by Tsinghua University from open-source media. It contains more than 130,000 utterances from 1,000 Chinese celebrities and covers diverse genres, which makes speaker recognition on this dataset much more challenging than on SITW [12]. The trials, including target and imposter trials, are constructed by pairwise composition of the utterances. It is important to note that the acoustic condition of CN-Celeb is quite different from that of VoxCeleb, making it suitable as a challenging out-of-domain test.
IV-B. Model settings
Our SRE approach consists of three components: an x-vector or i-vector front-end that produces speaker vectors, a normalization model that regulates the distribution of the speaker vectors, and finally a scoring model that produces pairwise scores for making a genuine/imposter decision.
IV-B.1 Front-end

x-vector system: The x-vector front-end was created using the Kaldi toolkit [40], following the SITW recipe. The acoustic features are 40-dimensional Fbanks. The main architecture contains three components. The first is the feature-learning component, which involves five time-delay (TD) layers to learn frame-level speaker features; the slicing parameters for these TD layers are: {t-2, t-1, t, t+1, t+2}, {t-2, t, t+2}, {t-3, t, t+3}, {t}, {t}. The second is the statistical pooling component, which computes the mean and standard deviation of the frame-level features over a speech segment. The final one is the speaker-classification component, which discriminates between different speakers. This component has two full-connection (FC) layers, and the size of its output corresponds to the number of speakers in the training set. Once trained, the 512-dimensional activations of the penultimate FC layer are read out as an x-vector.
i-vector system: The i-vector front-end was built with the Kaldi toolkit [40], following the SITW recipe. The raw features involve 19-dimensional MFCCs plus the log energy, augmented by first- and second-order derivatives, resulting in a 60-dimensional feature vector. The universal background model (UBM) consists of 2,048 Gaussian components, and the dimensionality of the i-vectors is set to 400.
IV-B.2 Normalization models
To investigate the merits of our proposed DNF model, we compare its performance with a number of different configurations.

LDA: We implemented the basic LDA model, trained to maximize the Fisher criterion. We used the implementation in the Kaldi toolkit [40], which involves a small modification that smooths the within-class covariance by interpolating it with the total covariance, controlled by a hyperparameter that was set to 0.1 in the LDA + cosine scoring experiment, and to 0.0 in the LDA + PLDA scoring experiment.
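The basic Fisher-criterion LDA above can be sketched as follows; this is a plain NumPy illustration of the textbook formulation (without the Kaldi-specific smoothing), solving the generalized eigenproblem by whitening the within-class scatter:

```python
import numpy as np

def lda_transform(X, y, dim):
    """Fisher LDA on speaker vectors.
    X: (N, D) vectors; y: (N,) integer speaker labels.
    Returns a (D, dim) projection maximizing between-class scatter
    relative to within-class scatter."""
    D = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)  # between-class scatter
    # Solve Sb v = lambda Sw v by whitening Sw first.
    w_vals, w_vecs = np.linalg.eigh(Sw)
    Sw_inv_sqrt = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T
    vals, vecs = np.linalg.eigh(Sw_inv_sqrt @ Sb @ Sw_inv_sqrt)
    order = np.argsort(vals)[::-1][:dim]       # keep top-`dim` directions
    return Sw_inv_sqrt @ vecs[:, order]
```

Dimension reduction corresponds to choosing `dim` smaller than D, as in the LDA [150]/[200]/[400] configurations reported below.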

LDA/N: The linear normalization component of LDA. It simply normalizes the within-speaker covariance to be an identity matrix, neither diagonalizing the between-speaker covariance nor trimming any dimensions.
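The LDA/N transform amounts to within-class whitening; a minimal NumPy sketch of this idea is shown below (the symmetric inverse square root of the average within-speaker covariance, with no dimension trimming):

```python
import numpy as np

def ldan_transform(X, y):
    """LDA/N: a symmetric transform T with T' Sw T = I. It whitens the
    within-speaker covariance to identity but performs no between-class
    diagonalization and no dimension trimming."""
    D = X.shape[1]
    Sw = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c] - X[y == c].mean(axis=0)
        Sw += Xc.T @ Xc
    Sw /= len(X)  # average within-speaker covariance
    vals, vecs = np.linalg.eigh(Sw)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T  # Sw^(-1/2)
```

Applying `X @ ldan_transform(X, y)` makes the within-speaker covariance of the transformed vectors exactly the identity on the data it was estimated from.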

DNF-LDA: One potential issue with DNF is that it does not normalize the between-class covariance. Here, we perform an additional LDA after DNF normalization, to achieve normalization on both the within- and between-class covariances. This is essentially a simple implementation of the NDA model discussed in Section III-C.
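The DNF training objective can be illustrated with a deliberately toy flow. The sketch below uses an element-wise affine bijection in place of the coupling/autoregressive layers a real DNF would use [11, 39], but the maximum-likelihood objective has the same form: a class-conditional Gaussian prior N(mu_c, I) in the latent space plus the log-Jacobian of the invertible map:

```python
import numpy as np

def dnf_log_likelihood(x, label, scale, shift, class_means):
    """log p(x | c) for a toy DNF: an element-wise affine bijection
    z = scale * x + shift with a class-conditional Gaussian prior
    N(mu_c, I) in the latent space.

        log p(x | c) = log N(f(x); mu_c, I) + log |det df/dx|

    Training maximizes this likelihood jointly over the flow parameters
    and the class means."""
    z = scale * x + shift                    # the invertible map f
    mu = class_means[label]
    d = len(x)
    log_prior = -0.5 * (np.sum((z - mu) ** 2) + d * np.log(2 * np.pi))
    log_det = np.sum(np.log(np.abs(scale)))  # Jacobian of the affine map
    return log_prior + log_det
```

Because each speaker shares the identity latent covariance, maximizing this likelihood pushes every speaker's distribution toward the same Gaussian shape, which is the normalization effect studied in this section.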
IV-B3 Scoring model
Two commonly used scoring models were applied in this study: simple cosine scoring, based on the cosine distance, and the more complex PLDA scoring, based on PLDA [20].
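Cosine scoring is simple enough to state directly; a minimal sketch:

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine scoring between an enrollment and a test speaker vector;
    larger scores indicate the same speaker. (PLDA scoring instead
    computes a log-likelihood ratio under a linear-Gaussian model.)"""
    enroll = np.asarray(enroll, dtype=float)
    test = np.asarray(test, dtype=float)
    return float(enroll @ test /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))
```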
IV-C Basic results
In the first experiment, we apply the four normalization models (LDA, LDA/N, DNF, DNF-LDA) to regulate the standard x-vectors derived from both SITW and CN-Celeb. The results in terms of equal error rate (EER) are reported in Table II.
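The EER metric used throughout the tables can be computed from a set of target (same-speaker) and imposter (different-speaker) trial scores; a simple threshold-sweep sketch:

```python
import numpy as np

def equal_error_rate(target_scores, imposter_scores):
    """Equal error rate: the operating point where the false-rejection
    rate (targets scored below threshold) equals the false-acceptance
    rate (imposters scored at or above it). Sweeps every observed score
    as a candidate threshold and returns the rate at the closest
    crossover."""
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(imposter_scores, dtype=float)
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([tgt, imp])):
        frr = np.mean(tgt < t)    # targets falsely rejected
        far = np.mean(imp >= t)   # imposters falsely accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```

Production toolkits interpolate between thresholds for a smoother estimate, but this discrete sweep conveys the definition.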
IV-C1 X-vector in-domain results
Firstly, focusing on SITW, the in-domain test, it can be seen that all the normalization models provide performance improvements with cosine scoring. The fact that LDA/N outperforms the baseline by a large margin (9.19 vs. 17.20) demonstrates the importance of within-speaker normalization, even though this approach is only linear. DNF performs better than LDA/N (8.53 vs. 9.19), confirming that nonlinear normalization is better than linear normalization. LDA performs much better than LDA/N and DNF, demonstrating the importance of the between-class information. Finally, DNF-LDA achieves the best performance by combining the strengths of DNF and LDA.
For PLDA scoring, all the nonlinear normalization models (including dimension-truncating LDA, DNF and DNF-LDA) offer performance improvements. Note that a linear transform (LDA/N, or LDA without dimension reduction) does not change the PLDA performance, as the within-speaker and between-speaker covariances that PLDA relies on are linearly invariant. LDA with dimension reduction provides a reasonable performance improvement when the dimension size is carefully selected, demonstrating the importance of distribution normalization for individual speakers. Significantly, DNF obtains better performance than LDA, which confirms that NF is a better normalization approach for this problem. By adding additional LDA-based normalization, DNF-LDA achieves the best performance. These results are consistent with those obtained with cosine scoring.
IV-C2 X-vector out-of-domain results
When using CN-Celeb, the out-of-domain test, the observations are very different. Looking at the cosine scoring results, we first observe that LDA/N does not offer any performance improvement over the baseline (16.36 vs. 16.32). This indicates that the global within-speaker covariances are significantly different between VoxCeleb (the training data) and CN-Celeb, so the linear normalization learned to diagonalize the within-speaker covariance of VoxCeleb cannot diagonalize the within-speaker covariance of CN-Celeb. LDA, which applies an additional transform and dimension selection based on the between-class variance, makes things even worse. This suggests that the between-class covariances of VoxCeleb and CN-Celeb are also significantly different.
DNF, which focuses on normalizing the distributions of individual speakers, is more robust against data mismatch than the global linear normalization of LDA/N (14.22 vs. 16.36). Applying additional LDA after DNF reduces the performance, which suggests that in the DNF latent space, the between-class covariance still changes significantly from VoxCeleb to CN-Celeb. The only exception is the dimension-preserving DNF-LDA [512], which does not perform any dimension reduction and provides a marginal gain over DNF (13.83 vs. 14.22). This indicates that in the DNF latent space, although the shape of the between-speaker covariance has changed significantly from VoxCeleb to CN-Celeb, the principal directions of the covariance may not change much. This is not the case in the LDA/N latent space, as the performance with LDA [512] is worse than with LDA/N (16.87 vs. 16.36).
For PLDA scoring, similar conclusions can be drawn: LDA fails in most situations. The principal role of LDA in this scenario is removing irregulated dimensions, and this removal is based on the between-speaker covariance within the latent space produced by LDA/N, which is in turn based on the within-speaker covariance. However, as discussed, both the between- and within-speaker covariances change significantly from VoxCeleb to CN-Celeb, so it is not surprising that the LDA-based normalization fails. In contrast, DNF still works in this situation, which can be attributed to its more robust within-class normalization. However, when additional LDA is applied, the unreliable between-class information is used for dimension reduction, which leads to a significant performance reduction. This is shown in the case of DNF-LDA with reduced dimensions.
To summarize, the experimental results presented above indicate that the global properties (within- and between-speaker covariances) may change significantly at the dataset level, and any normalization method based on these properties will suffer from this generalization problem. DNF learns how to normalize individual speakers at different locations of the speaker space, which appears to be more generalizable to unseen data. However, this generalizability seems to hold only for within-class distributions: after DNF normalization, there is still a significant mismatch in the between-class distributions, which should be further investigated.
Table II: EER(%) results with x-vectors on SITW and CN-Celeb.

                   SITW              CN-Celeb
                   Cosine   PLDA     Cosine   PLDA
x-vector [512]     17.20    5.30     16.32    13.03
LDA [150]           5.25    4.07     17.67    14.37
LDA [200]           5.82    3.96     17.52    13.50
LDA [400]           7.38    4.65     17.49    12.28
LDA [512]           8.61    5.30     16.87    13.03
LDA/N [512]         9.19    5.30     16.36    13.03
DNF [512]           8.53    3.66     14.22    11.82
DNF-LDA [150]       5.06    3.61     15.42    13.85
DNF-LDA [200]       5.41    3.42     15.18    13.22
DNF-LDA [400]       7.05    3.58     14.20    11.90
DNF-LDA [512]       8.17    3.66     13.83    11.82
IV-C3 I-vector results
For comparison, we report the results with i-vectors in Table III. It can be seen that the normalization methods contribute very little to PLDA scoring on both the SITW and CN-Celeb databases. For cosine scoring, LDA contributes performance gains on SITW. We argue that this is mainly due to the diagonalization of the between-speaker covariance. However, this contribution is largely lost on CN-Celeb, indicating that the between-speaker covariance has changed significantly from SITW to CN-Celeb. This observation is the same as in the x-vector experiment. DNF does not show any advantage in this experiment. This is because the within-speaker distributions of i-vectors are already well regulated (see Table I), so a dedicated normalization is not necessary.
Table III: EER(%) results with i-vectors on SITW and CN-Celeb.

                   SITW              CN-Celeb
                   Cosine   PLDA     Cosine   PLDA
i-vector [400]     14.24    5.66     17.68    18.25
LDA [150]           7.11    5.36     18.18    18.49
LDA [200]           7.46    5.25     17.85    18.36
LDA [400]           9.32    5.66     16.65    18.25
LDA/N [400]        11.84    5.66     17.23    18.25
DNF [400]          12.06    5.60     18.04    18.15
DNF-LDA [150]       7.30    5.41     18.30    18.53
DNF-LDA [200]       7.52    5.30     18.02    18.36
DNF-LDA [400]       9.49    5.60     17.02    18.15
IV-D Results on more powerful x-vectors
In this experiment, we constructed more powerful x-vector systems to investigate whether DNF normalization still contributes. We conducted extensive preliminary trials on model structures and training objectives (not reported here due to space constraints), and based on these, we chose three architectures to represent state-of-the-art (SOTA) systems.

TDNN + Att.: The same architecture as the TDNN baseline in the previous experiment, but the statistical pooling is replaced by self-attention pooling [60].

ResNet34 + Att.: A ResNet-34-based front-end with self-attention pooling.

ResNet34 + AAM: The same ResNet-34 front-end, trained with the additive angular margin (AAM) loss.
The experimental results are shown in Table IV. For LDA, we report only the LDA [200] results, as this is the best configuration for all the LDA systems. On SITW, all these 'advanced' systems significantly outperform the TDNN baseline, and DNF still achieves good performance. In most situations, DNF outperforms LDA, and further performance gains can be attained by DNF-LDA. The results on CN-Celeb are even more significant. Firstly, they once again confirm the generalizability of DNF, as reported previously; secondly, they show that the EER reduction on SITW provided by the 'advanced' approaches did not transfer to CN-Celeb. This indicates that the performance improvement obtained with some of the 'advanced' techniques may simply be the result of overfitting.
Table IV: EER(%) results with more powerful x-vector systems.

                                   SITW              CN-Celeb
                                   Cosine   PLDA     Cosine   PLDA
TDNN             x-vector [512]    17.20    5.30     16.32    13.03
                 LDA [200]          5.82    3.96     17.52    13.50
                 DNF [512]          8.53    3.66     14.22    11.82
                 DNF-LDA [200]      5.41    3.42     15.18    13.22
TDNN + Att.      x-vector [512]     4.37    3.66     15.08    13.05
                 LDA [200]          3.72    2.73     18.34    13.97
                 DNF [512]          5.00    2.71     14.69    12.07
                 DNF-LDA [200]      3.72    2.57     15.45    13.66
ResNet34 + Att.  x-vector [512]     2.73    2.52     13.94    13.11
                 LDA [200]          2.60    2.00     14.90    12.58
                 DNF [512]          3.47    1.94     13.86    11.61
                 DNF-LDA [200]      2.57    1.89     14.04    12.32
ResNet34 + AAM   x-vector [512]     5.71    2.82     15.80    14.02
                 LDA [200]          2.73    1.86     16.67    13.42
                 DNF [512]          4.89    2.32     14.66    12.80
                 DNF-LDA [200]      2.93    1.83     14.96    12.59
IV-E Analysis
To better understand the behavior of DNF, we monitored the training process, and here we report the change of the statistics related to regulation and discrimination. As in the LDA analysis in Section II, we conduct the analysis with a small-scale experiment. All the data and measures are the same as in the LDA investigation, and we focus on x-vector results.
IV-E1 Regulation analysis
Fig. 8 presents the four groups of measures related to data regulation: PC directional variance and PC shape variance, which reflect the homogeneity of the distributions of different speakers; and averaged kurtosis and averaged skewness, which reflect the Gaussianity of the distribution of each speaker. It can be seen that the values of all these measures are significantly reduced during training. Compared to the results in Table I, DNF generally reaches lower values on all these measures than LDA, and is hence a better normalization model. Spikes are found in kurtosis, skewness, and the PC1/PC2 directional variance. These spikes indicate that the model is trying to change the locations of all the speakers in order to find an optimal configuration, but changing one speaker may cause unwanted changes in other speakers, due to the complex distributions of the speaker vectors. Nevertheless, the training can ultimately find a better configuration that improves the data regulation in general.
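The Gaussianity measures above can be computed from sample moments; a minimal NumPy sketch (using the standard moment definitions of skewness and excess kurtosis, averaged per speaker and per dimension) is:

```python
import numpy as np

def gaussianity_measures(X, y):
    """Average per-speaker |skewness| and |excess kurtosis| over all
    dimensions. Both are ~0 per dimension for Gaussian data, so lower
    values indicate better-normalized within-speaker distributions."""
    skews, kurts = [], []
    for c in np.unique(y):
        Z = X[y == c]
        Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardize per dim
        skews.append(np.mean(np.abs((Z ** 3).mean(axis=0))))
        kurts.append(np.mean(np.abs((Z ** 4).mean(axis=0) - 3.0)))
    return float(np.mean(skews)), float(np.mean(kurts))
```

On Gaussian samples both measures approach zero, while heavy-tailed or asymmetric within-speaker distributions push them up, which is what the spikes and the overall downward trend in Fig. 8 track during training.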
IV-E2 Discrimination analysis
To investigate the discriminative capability of the DNF-normalized speaker vectors, we compute several measures related to class discrimination: (1) the between-class and within-class variances and their ratio; (2) EER results based on cosine scoring; (3) the cross entropy, where the logit is computed from the inner product between training samples and the class means; (4) the cross entropy, where the logit is computed from the cosine distance between training samples and the class means. Fig. 9 shows the change of these measures during model training. It shows that the data in the latent space becomes increasingly discriminative over time, as indicated by all these measures. In particular, we highlight the continuous reduction of the cross entropy based on the inner product. If we treat the inverse NF function as a regular neural net and the mean vectors of all the classes as the weights of the final layer, the whole DNF architecture is a standard classification network. Such a net is usually trained with the CE loss. In DNF, we interpreted the net in a very different way (as a generative model) and trained it with a very different loss (ML), yet obtained the same CE reduction. This confirms the fundamental relation between generative and discriminative models, as discussed in Section III.

V Conclusions
This paper investigated the issue of data irregulation with deep speaker vectors in SRE, and found through comprehensive experiments that deep speaker vectors require deep normalization. Firstly, we found that the within-speaker distributions of deep speaker vectors are highly non-homogeneous and non-Gaussian, which may seriously impact the performance of SRE systems. To overcome this problem, we introduced a new deep normalization approach based on a novel discriminative normalization flow (DNF) model. This model is a nonlinear extension of LDA, and can normalize the complex and heterogeneous distributions of individual speakers. Using state-of-the-art system configurations, our experiments on two datasets demonstrated that the DNF approach delivers consistently better performance than the baseline, and outperforms the more conventional LDA-based normalization. Furthermore, in the out-of-domain test where LDA performs very poorly, DNF still delivers good performance, confirming the good generalizability and further potential of our approach. Future work will investigate joint training of the DNF normalizer and the speaker embedding model, and will also apply DNF to raw acoustic features directly.
References

[1] (2019) Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392.
[2] (2019) Partial AUC optimization based deep speaker embeddings with class-center learning for text-independent speaker verification. arXiv preprint arXiv:1911.08077.
[3] (2018) Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, pp. 74–81.
[4] (1997) Speaker recognition: a tutorial. Proceedings of the IEEE 85 (9), pp. 1437–1462.
[5] (2019) Tied mixture of factor analyzers layer to combine frame level representations in neural speaker embeddings. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2948–2952.
[6] (2018) VoxCeleb2: deep speaker recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1086–1090.
[7] (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
[8] (2019) ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699.
[9] (2018) MTGAN: speaker verification through multitasking triplet generative adversarial networks. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3633–3637.
[10] (2014) NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
[11] (2016) Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
[12] (2019) CN-CELEB: a challenging Chinese speaker recognition dataset. arXiv preprint arXiv:1911.01799.
[13] (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), pp. 179–188.
[14] (2019) Improving aggregation and loss function for better embedding learning in end-to-end speaker verification system. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 361–365.
[15] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680.
[16] (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Processing Magazine 32 (6), pp. 74–99.
[17] (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
[18] (1996) Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 155–176.
[19] (2016) End-to-end text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115–5119.
[20] (2006) Probabilistic linear discriminant analysis. In European Conference on Computer Vision (ECCV), pp. 531–542.
[21] (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems (NIPS), pp. 4480–4490.
[22] (2019) RawNet: advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1268–1272.
[23] (2007) Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15 (4), pp. 1435–1447.
[24] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[25] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[26] (2017) A study on data augmentation of reverberant speech for robust speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5220–5224.
[27] (2017) Deep speaker feature learning for text-independent speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1542–1546.
[28] (2015) Improved deep speaker feature learning for text-dependent speaker recognition. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 426–429.
[29] (2019) Gaussian-constrained training for speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6036–6040.
[30] (2018) Full-info training for deep speaker feature learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5369–5373.
[31] (2018) Deep factorization for speech signal. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5094–5098.
[32] (2016) Max-margin metric learning for speaker recognition. In 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1–4.
[33] (2019) Boundary discriminative large margin cosine loss for text-independent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6321–6325.
[34] (1995) Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 354 (1), pp. 73–80.
[35] (2016) The speakers in the wild (SITW) speaker recognition database. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 818–822.
[36] (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
[37] (2018) Attentive statistics pooling for deep speaker embedding. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2252–2256.
[38] (2019) Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762.
[39] (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems (NIPS), pp. 2338–2347.
[40] (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
[41] (2018) Attention-based models for text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5359–5363.
[42] (2000) Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10 (1–3), pp. 19–41.
[43] (2002) An overview of automatic speaker recognition technology. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 4, pp. IV–4072.
[44] (2006) Real and complex analysis. Tata McGraw-Hill Education.
[45] (2019) The 2018 NIST speaker recognition evaluation. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1483–1487.
[46] (2018) X-vectors: robust DNN embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
[47] (2015) MUSAN: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
[48] (2019) Self-supervised speaker embeddings. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2863–2867.
[49] (2013) A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164.
[50] (2014) Deep neural networks for small footprint text-dependent speaker verification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056.
[51] (2019) Utterance-level aggregation for speaker recognition in the wild. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5791–579.
[52] (2019) Centroid-based deep metric learning for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3652–3656.
[53] (2019) On the usage of phonetic information for text-independent speaker embedding extraction. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1148–1152.
[54] (2019) VAE-based domain adaptation for speaker verification. arXiv preprint arXiv:1908.10092.
[55] (2019) Data augmentation using variational autoencoder for embedding based speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1163–1167.
[56] (2019) BUT system description to VoxCeleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592.
[57] (2016) End-to-end attention based text-dependent speaker verification. In Spoken Language Technology Workshop (SLT), pp. 171–178.
[58] (2019) VAE-based regularization for deep speaker embedding. arXiv preprint arXiv:1904.03617.
[59] (2019) Deep speaker embedding extraction with channel-wise feature responses and additive supervision softmax loss function. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2883–2887.
[60] (2018) Self-attentive speaker embeddings for text-independent speaker verification. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3573–3577.