In recent years, the variational autoencoder (VAE), a promising deep generative model, was developed [1, 2]. It has the following properties: (i) it can be made sufficiently deep to capture complex data structures; (ii) it provides fast sampling of data from the inference model; and (iii) it is computationally feasible and scalable.
This paper presents an attempt to apply the VAE model in the i-vector space for the speaker verification task. We deliberately chose these features even though they are highly Gaussian after the i-vector extractor and length normalization, and are therefore ideal for subsequent modeling with Gaussian PLDA (Probabilistic Linear Discriminant Analysis). It should thus be expected that for these features the performance of VAE will be limited by that of PLDA.
The main goal of this paper is to develop and evaluate the verification backend for VAE. It is convenient to solve this task in the i-vector space first and then to extend the solution to other features.
2 Verification system based on VAE
In this paper we confine ourselves to investigating the simplest diagonal version of VAE with a single hidden stochastic layer.
In autoencoders all hidden layers consist of only deterministic neurons, whereas for the factor analysis latent variable we need at least one hidden layer consisting of stochastic neurons (see Figure 1, where it is denoted as z).
As in classic factor analysis, we should be able to perform the following actions: (i) to make inference for the latent variable posterior and (ii) to sample observed data vectors . To meet these requirements, VAE comprises two neural nets, namely an inference net and a generative net, shown in the right and left parts of Figure 1 respectively. Both involve Gaussian assumptions and have identical structure. In addition to the input layer of size and the stochastic layer of size , this structure contains the layers and of deterministic neurons of size , shown in Figure 1 as rhombuses. These layers are responsible for the additional depth and the nonlinearity of VAE:
while parameters of the mean vectors and precision matrices of both generation and inference nets are computed with the linear connections only:
where the indices and are used for the inference and generation nets respectively. Hereinafter all vectors are treated as row vectors. In expressions (1–4) the entire set of the generative net’s parameters is denoted as θ and that of the inference net as φ, following the original paper [1], and additional notations like and are used. We consider only diagonal precision matrices and , which are therefore also treated as vectors. This is what we mean by the diagonality of VAE.
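The structure described above (one deterministic nonlinear layer followed by purely linear connections for the mean and the diagonal precision) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tanh nonlinearity, the log-precision parameterization, the layer sizes other than the 400-dimensional input, and all names (`init_net`, `forward`, `W_h`, `W_mu`, `W_lp`) are hypothetical.

```python
import numpy as np

def init_net(d_in, d_hidden, d_out, rng, scale=0.1):
    """Weights of one net: a deterministic layer followed by two linear heads."""
    return {
        "W_h": scale * rng.standard_normal((d_in, d_hidden)),
        "b_h": np.zeros(d_hidden),
        "W_mu": scale * rng.standard_normal((d_hidden, d_out)),
        "b_mu": np.zeros(d_out),
        "W_lp": scale * rng.standard_normal((d_hidden, d_out)),
        "b_lp": np.zeros(d_out),
    }

def forward(net, v):
    """Map row vectors v to the mean and diagonal log-precision of a Gaussian."""
    h = np.tanh(v @ net["W_h"] + net["b_h"])    # deterministic layer (tanh is an assumption)
    mu = h @ net["W_mu"] + net["b_mu"]          # linear connection: mean vector
    log_prec = h @ net["W_lp"] + net["b_lp"]    # linear connection: log diagonal precision (assumed)
    return mu, log_prec

rng = np.random.default_rng(0)
D, H, Z = 400, 64, 32                            # H and Z are illustrative sizes only
inference_net = init_net(D, H, Z, rng)           # x -> parameters of the posterior over z
generative_net = init_net(Z, H, D, rng)          # z -> parameters of the likelihood of x

x = rng.standard_normal((5, D))                  # five stand-in i-vectors as row vectors
mu_z, log_prec_z = forward(inference_net, x)
mu_x, log_prec_x = forward(generative_net, mu_z)
```

Both nets share the same code path, which mirrors the statement that they have identical structure; only the input and output sizes are swapped.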
2.2 Learning VAE
Let be a training set of i-vectors of the dimension . Due to the nonlinearity and depth, it is difficult to maximize the likelihood directly via an analytical EM algorithm. That is why the authors of [1, 2] had to use the VBA approximation. In terms of the analogy with factor analysis, we are comparing VAE to FA-VBA.
Following , let us separate the lower bound from the evidence :
where lower bound is
The true posterior in (5) is intractable, therefore it is approximated by the variational posterior .
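For reference, the standard form of this decomposition in the notation of [1] can be written as follows. The symbols $x$ (observed vector), $z$ (latent variable), $\theta$ and $\phi$ (generative and inference parameters) are assumed here.

```latex
\log p_\theta(x) = \mathcal{L}(\theta,\phi;x)
  + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big),
\qquad
\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big).
```

Since the first KL term is non-negative, $\mathcal{L}$ is a lower bound on the evidence, and the bound is tight when the variational posterior equals the true posterior.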
As in the conventional Gaussian FA-VBA, the hidden variable prior is assumed to be , the posteriors and are Gaussian, and the KL-divergence can be found analytically and does not depend on . As in FA-VBA, we need to maximize the lower bound to solve the optimization task for the VAE parameters . Due to nonlinearity and depth we are now unable to find this VBA solution analytically, so we have to search for a stationary point using numerical stochastic gradient ascent to update the parameters. We should also be able to sample from during the inference stage.
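The analytical KL-divergence mentioned above has a simple closed form for a diagonal Gaussian posterior. The sketch below assumes a standard normal prior, as is usual in Gaussian factor analysis, and a posterior parameterized by a mean and a diagonal precision; the function name is hypothetical.

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, prec):
    """KL( N(mu, diag(1/prec)) || N(0, I) ), summed over the last axis."""
    var = 1.0 / prec
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var), axis=-1)

# The divergence vanishes when the posterior equals the prior ...
kl_zero = kl_diag_gaussian_to_standard(np.zeros(3), np.ones(3))
# ... and equals mu^2 / 2 for a unit-variance posterior shifted by mu.
kl_shifted = kl_diag_gaussian_to_standard(np.array([1.0]), np.array([1.0]))
```

The expression involves only the inference net's outputs, so its gradient can be obtained by ordinary backpropagation.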
The computation of gradient
does not present any difficulties, so standard deterministic backpropagation can be used. However, the gradient looks more problematic. It is known that the naïve Monte Carlo approximation of the expectation in (6), which uses samples directly from the inference net
results in very high variance. In this case the training is slow because the gradients of with respect to latent variable are not used [9, 10]. In the papers [1, 2] the reparametrization trick was proposed, according to which the vectors for the Monte-Carlo estimation are not sampled from but instead generated from the deterministic transform
where are sampled from the fixed distribution . Using the reparametrization trick makes it possible to push gradient inside the expectation in (6) because it is now taken over the fixed distribution of which is independent of . As a result, the final expression for the gradient of (6) is as follows:
where is Gaussian with parameters and . Ultimately, we have the following expressions for the gradients of with respect to :
and with respect to :
We found that and a minibatch size of 100 provided the best results. Moreover, when we trained VAE with and it ceased to capture the complex structure of the data and was able to generate data only from a distribution like a single Gaussian (which corresponds to classical FA). A similar situation was observed when using a naïve Monte Carlo estimate of the expectation in (6) instead of the reparametrization trick.
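The reparametrization trick discussed in this section can be illustrated with a short sketch. It assumes a diagonal Gaussian posterior given by a mean and a diagonal precision, with noise drawn from a standard normal distribution; the function name and the numerical values are hypothetical.

```python
import numpy as np

def reparametrize(mu, prec, n_samples, rng):
    """Draw z = mu + eps / sqrt(prec), where eps ~ N(0, I) does not depend on the parameters."""
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + eps / np.sqrt(prec)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
prec = np.array([4.0, 0.25])          # standard deviations 0.5 and 2.0
z = reparametrize(mu, prec, 100_000, rng)
```

Because the randomness enters only through `eps`, each sample is a deterministic, differentiable function of `mu` and `prec`, which is what allows the gradient to be moved inside the expectation.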
2.3 RMS-prop optimizer
2.4 LLR scoring for VAE
Since VAE is a generative model, we can only use the evidence, or marginal likelihood, to obtain the speaker verification scores. Thus our verification score for the pair of i-vectors is a likelihood ratio:
where — are the hypotheses about the facts that are related to the same or different speakers respectively.
If only a single latent variable is used then we can estimate the marginal likelihood under impostor hypothesis with the help of the importance sampling which uses as a proposal distribution:
where samples are obtained from the inference net via reparametrization trick. For the target hypothesis the situation is more complicated. To make computation feasible we assume that and are conditionally independent given :
Since such an assumption is specific to training the conventional Gaussian PLDA analyzer model, it is natural for testing the PLDA model as well. This is not the case for VAE, where training vectors are fed into the model without speaker labels, in a fully unsupervised manner. Nevertheless, our experiments (see Section 3) demonstrated that this assumption is highly reasonable, so VAE performs a speaker embedding, which is discussed in Section 3.1. Using this assumption, we can apply importance sampling once again to compute the marginal likelihood:
This expression is asymmetric, because samples are taken from when i-vector is the enrollment one. However, under the target hypothesis they could be taken from as well. Therefore one can use the symmetric LR estimate, where takes both these sampling variants into account. We found no significant difference between and in our experiments on NIST-2010 (DET-5). That is why all results shown below were obtained using in the log-LR estimate, i.e. on the assumption that the enrollment vector is fed into the inference net.
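The importance-sampling LLR estimate described above can be sketched on a toy model with a known answer. This is an illustration under stated assumptions rather than the scoring code used in the experiments: the callables `enc` and `dec`, the standard normal prior, and the one-dimensional linear-Gaussian toy model are all hypothetical stand-ins for the trained inference and generative nets.

```python
import numpy as np

def log_normal_diag(v, mean, var):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (v - mean) ** 2 / var, axis=-1)

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def llr_score(x_enroll, x_test, enc, dec, n_samples, rng):
    """Monte Carlo log-LR; enc(x) -> (mean, var) of q(z|x), dec(z) -> (mean, var) of p(x|z)."""
    def sample(x):
        mu, var = enc(x)
        z = mu + np.sqrt(var) * rng.standard_normal((n_samples, mu.shape[-1]))
        # log importance weight without the likelihood term: log p(z) - log q(z|x)
        return z, log_normal_diag(z, 0.0, 1.0) - log_normal_diag(z, mu, var)

    def log_lik(x, z):
        mu, var = dec(z)
        return log_normal_diag(x, mu, var)

    # Impostor hypothesis: each marginal likelihood uses its own proposal q(z|x).
    z1, lw1 = sample(x_enroll)
    z2, lw2 = sample(x_test)
    log_p1 = log_mean_exp(log_lik(x_enroll, z1) + lw1)
    log_p2 = log_mean_exp(log_lik(x_test, z2) + lw2)

    # Target hypothesis: both vectors share z, sampled from the enrollment posterior.
    log_p12 = log_mean_exp(log_lik(x_enroll, z1) + log_lik(x_test, z1) + lw1)
    return log_p12 - log_p1 - log_p2

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), so the exact posterior is N(x / 2, 1 / 2).
enc = lambda x: (x / 2.0, np.full_like(x, 0.5))
dec = lambda z: (z, np.ones_like(z))
rng = np.random.default_rng(0)
x1, x2 = np.array([1.0]), np.array([0.8])
score = llr_score(x1, x2, enc, dec, 200_000, rng)
# Closed form for the toy model: log p(x2 | x1) - log p(x2).
exact = log_normal_diag(x2, x1 / 2.0, 1.5) - log_normal_diag(x2, 0.0, 2.0)
```

Calling `llr_score(x2, x1, ...)` gives the other sampling variant, in which the proposal is built from the second vector.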
In the recent paper [14] on β-VAE the empirical deviation from the exact lower bound was used:
The KL-divergence term in (6) can be treated as a natural regularizer of the lower bound (which follows from variational Bayes). It was observed in [14] that if VAE is trained with β > 1 (i.e. with a high penalty on the KL-divergence term) then it can disentangle factors better than with the theoretical value β = 1. In the speaker recognition domain the factors are, for example, eigenvoices in the PLDA model. In Section 3 we present the results of our experiments with β-VAE in both the “hard” (β > 1) and “soft” (β < 1) modes.
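For reference, the β-VAE objective of Higgins et al. weights the KL term of the lower bound by a coefficient β. In the notation of [1] (the symbols $x$, $z$, $\theta$, $\phi$ are assumed here) it can be written as:

```latex
\mathcal{L}_\beta(\theta,\phi;x) =
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \beta\,\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big).
```

Setting $\beta = 1$ recovers the exact variational lower bound, $\beta > 1$ strengthens the regularization towards the prior, and $\beta < 1$ weakens it.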
3 Experiments and discussion
All our experiments were carried out on two homogeneous cellular corpora, namely NIST and RusTelecom. The training part of the NIST corpus consists of 17486 sessions from 1763 male speakers taken from NIST 1998-2008. Tests were carried out on the male part of NIST 2010 (DET-5 extended protocol). The training part of the RusTelecom database consists of 116678 sessions from 6508 male speakers, and the test part consists of 235 male speakers. The details of the extraction of 400-dimensional i-vectors for the NIST corpus using an English ASR DNN and of 600-dimensional i-vectors for the RusTelecom corpus using a Russian ASR DNN are described in . For a correct comparison of VAE and PLDA, the latter should have diagonal covariances for both noise and posterior components. Here we moved from PLDA with latent variable to a simple diagonal two-covariance model. All input vectors for both PLDA and VAE experiments were centered, whitened and length-normalized in both training and testing. Hereinafter we denote the whitening matrix as . We used a full matrix and a diagonal matrix for the full-covariance PLDA and the diagonal PLDA respectively.
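The preprocessing chain of centering, whitening and length normalization can be sketched as follows. How the whitening matrix was estimated is not specified in the text, so this sketch assumes it is derived from the training covariance (eigendecomposition for the full matrix, per-dimension scaling for the diagonal one); the function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_whitener(train, diagonal=False):
    """Estimate the mean and a whitening matrix W for row vectors (applied as x @ W)."""
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    if diagonal:
        W = np.diag(1.0 / np.sqrt(np.diag(cov)))      # diagonal whitening matrix
    else:
        evals, evecs = np.linalg.eigh(cov)
        W = evecs @ np.diag(1.0 / np.sqrt(evals))     # full whitening matrix
    return mean, W

def whiten(x, mean, W):
    """Center and whiten row vectors."""
    return (x - mean) @ W

def length_normalize(y):
    """Project each row vector onto the unit sphere."""
    return y / np.linalg.norm(y, axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
train = rng.standard_normal((2000, 5)) @ A + 3.0      # correlated, non-centered stand-in data
mean, W = fit_whitener(train)
white = whiten(train, mean, W)
normed = length_normalize(white)
mean_d, W_d = fit_whitener(train, diagonal=True)
white_d = whiten(train, mean_d, W_d)
```

The same `mean` and `W` estimated on the training set would be reused unchanged for the test vectors.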
3.1 Speaker Embedding VAE
The fact that we selected i-vector features and thus limited the effectiveness of VAE by that of PLDA is very convenient. By carrying out an extensive comparison of VAE and PLDA for (see Tables 2 and 3) we can draw two conclusions at once:
The second conclusion states that VAE performs speaker embedding in the space of the latent variable . In other words, similar to PLDA, for the target hypothesis VAE is able to sample and from the likelihood conditioned on of a single speaker.
3.2 Exploring β-VAE in low-dimensional space
The second effect was found during the β-VAE experiments, when we explored the “soft” training mode (). Carrying out the experiments on synthetic data we found that when diagonal VAE model starts to behave like a VAE model with full-covariance posterior, being able to capture observed training data drawn from Gaussian clusters with non-diagonal covariance. To investigate this property in a real-life speaker verification task, we selected 11119 files of 660 male speakers having at least 10 sessions each. We used PCA projections of the 400-dimensional i-vectors so that a wide range of VAE parameter counts could be explored with a comparatively small training dataset.
Figure 2 shows two modes of β-VAE training for PCA=2. In Figure 2a the obvious capture (red points) of the training data of 660 speakers (green points) is observed for the weak regularization mode (). This differs from the standard mode () shown in Figure 2b. We found that the necessary condition for such behavior is not only weak regularization but also a sufficient number of neurons in both the stochastic and deterministic layers. For instance, we were unable to achieve this capture for the configuration ; the minimum configuration required is . Our explanation is as follows. Since the expressiveness of the VAE model depends strongly on the power of the posterior, when the number of posterior diagonal covariance elements is increased to we can expect the capabilities of the diagonal-covariance VAE to approach those of the full-covariance VAE. In any case, we can assert that making the hidden layer wider than the deterministic ones is necessary for such behavior.
In order to find out whether this effect is only a result of overfitting and whether it may be useful in speaker verification, we carried out a number of verification experiments at PCA=10 for different VAE configurations. To avoid strong overfitting, we performed verification tests on the remaining 6367 files out of the total 17486 in parallel with training, and stopped training when the EER and minDCF metrics computed on this development set started to degrade. Then we tested the obtained VAE model on the male part of NIST 2010 (DET-5 extended protocol) using for estimating the LR score (7). The results are shown in Table 1.
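The early-stopping procedure on the development set can be sketched as follows. This is a simplified illustration that tracks EER only (minDCF is omitted) and uses a brute-force threshold sweep; the function names, the `patience` parameter and the example values are hypothetical.

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate from two score arrays (a higher score means more target-like)."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    # False rejection: targets below the threshold; false acceptance: impostors at or above it.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

def stop_epoch(dev_eers, patience=1):
    """Index of the best epoch, stopping once dev EER has not improved for `patience` epochs."""
    best, best_i = np.inf, 0
    for i, e in enumerate(dev_eers):
        if e < best:
            best, best_i = e, i
        elif i - best_i >= patience:
            break
    return best_i

eer_separated = eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
eer_identical = eer([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
chosen = stop_epoch([0.10, 0.08, 0.07, 0.09, 0.06])
```

With `patience=1` the search stops at the first degradation, so the later improvement at the last epoch in the example is never reached.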
It can be seen from Table 1 that, contrary to our prior expectations, in both modes VAE is able to exceed the plateau of diagonal PLDA with respect to EER and minDCF. For the “soft” β-VAE there is a conspicuous extremum for stochastic layer sizes between and . This is not the case for the standard VAE, which provides good results starting from the minimal configuration and right up to the maximal number of parameters that is reasonable to use when training on our small training set of 11119 files.
We carried out such experiments for several PCA dimensions and in all cases observed the same behavior for the two β-VAE modes. “Soft” β-VAE is better than the standard one, and both are superior to the diagonal PLDA up to and including PCA=15. However, starting from PCA=20 the standard VAE becomes worse than diagonal PLDA with respect to EER, though comparable to it with respect to minDCF. This behavior persists up to the maximal PCA dimension (limited by the i-vector dimension). In these cases only minimal configurations like the one shown in the first line of Table 1 can be trained because of the small training set size.
3.3 β-VAE in homogeneous corpus
Experiments on the original cellular corpus NIST (17486 training i-vectors) without PCA dimensionality reduction, i.e. for , represent the extreme case of our above observations for large PCA dimensions. The results are shown in Table 2. Here, for all β-VAE modes the best results are achieved for the configuration and with RMS-prop switched off (). It took 160 iterations of VAE training to saturate. Here we also tested “hard” β-VAE () and found that its behavior does not differ significantly from that of standard VAE (). Interestingly, the LLR estimate depends only marginally on the number of samples used for both and . One might expect VAE performance to improve when increases; however, this behavior is observed only for “soft” β-VAE.
As our experiments show, such behavior of the LLR estimate is not determined by the difference in modes. The main factor here is the tightness of the lower bound during training. Indeed, a situation similar to that of the “soft” β-VAE in Table 2 is observed if the tested model is underfit.
In order to improve the training conditions for “soft” β-VAE we moved to a larger training corpus of Russian speech, the RusTelecom database. In these experiments the optimal configuration was and RMS-prop was switched off. The learning rate was piecewise-constant, starting from and decreasing once in the middle of training. The number of iterations was 220. The results in Table 3 demonstrate that we managed to slightly improve the situation for “soft” β-VAE. However, even larger training datasets would be needed for “soft” β-VAE to achieve results comparable to those of full-covariance PLDA.
A VAE-based speaker verification system in the i-vector space is proposed. An LLR estimate for VAE is developed, which proved highly effective in all our experiments with VAE. We showed that VAE performs speaker embedding during training and thus, contrary to PLDA, can be trained in a fully unsupervised manner on large unlabeled datasets. We found that β-VAE can be trained in a “soft” mode, in which its properties are close to those of the full-covariance VAE model. Finally, we demonstrated that in the i-vector space the effectiveness of the standard diagonal VAE tends to the plateau corresponding to diagonal PLDA. Therefore we conclude that applying VAE in other feature spaces is of interest.
[1] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations, Proceedings, 2014.
[2] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in International Conference on Machine Learning, Proceedings, 2014, pp. 1278–1286.
[3] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[4] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Annual Conference of the International Speech Communication Association (Interspeech), Florence, Italy, Proceedings, 2011, pp. 249–252.
[5] S. J. D. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in IEEE 11th Int. Conf. Comput. Vision, Rio de Janeiro, Brazil, Proceedings, 2007, pp. 1–8.
[6] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” The Journal of Machine Learning Research, no. 11, pp. 3371–3408, 2010.
[7] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.
[8] D. M. Blei, M. I. Jordan, and J. W. Paisley, “Variational Bayesian inference with stochastic search,” in 29th International Conference on Machine Learning (ICML-12), Proceedings, 2012, pp. 1367–1374.
[9] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, “The Helmholtz machine,” Neural Computation, no. 7, pp. 889–904, 1995.
[10] A. Mnih and K. Gregor, “Neural variational inference and learning in belief networks,” in International Conference on Machine Learning, Proceedings, 2014, pp. 1791–1799.
[11] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, no. 12, pp. 2121–2159, 2010.
[12] N. Brummer and E. de Villiers, “The speaker partitioning problem,” in Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, Proceedings, 2010.
[13] “2010 NIST Speaker Recognition Evaluation Plan,” http://www.itl.nist.gov/iad/mig/tests/sre/2010/index.html.
[14] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “β-VAE: Learning basic visual concepts with a constrained variational framework,” in International Conference on Learning Representations, Proceedings, 2017.
[15] T. Pekhovsky, S. Novoselov, A. Sholohov, and O. Kudashev, “On autoencoders in the i-vector space for speaker recognition,” in Odyssey 2016 Speaker and Language Recognition Workshop, Bilbao, Spain, Proceedings, 2016, pp. 230–241.