In recent years, generative modeling has become a very active research area with impressive achievements. The most popular generative schemes include variational auto-encoders (VAEs) [KingmaVAE], generative adversarial networks (GANs) [GAN],
and their variants. VAEs rely on the maximum likelihood principle to learn the underlying data generating distribution by considering a parametric model. Due to the intractability of the parametric model, VAEs employ approximate inference, introducing an approximate posterior to obtain a variational bound on the log-likelihood of the model distribution. Despite its elegance, this approach has the drawback of generating low-quality samples, because the approximate posterior may differ substantially from the true one. GANs, on the other hand, have proven more impressive when it comes to the visual quality of the generated samples, although their training is often unstable and requires nontrivial fine-tuning. In addition to difficult training, GANs also suffer from "mode collapse", where the generated samples are not diverse enough to capture the diversity and variability in the true data distribution [GAN].
In this work, we propose a new class of robust auto-encoders that also serve as a generative model. The main idea is to develop a 'score' function [hyvarinen, scoring] of the observed data and the postulated model, so that minimizing it is equivalent to minimizing the Fisher divergence [jie_nips19]
between the underlying data generating distribution and the postulated/modeled distribution. By doing so, we are able to leverage the potential advantages of the Fisher divergence in terms of computation and robustness. In the context of parameter estimation, minimizing the Fisher divergence has led to the Hyvärinen score [hyvarinen], which serves as a potential surrogate for the logarithmic score. The main advantage of the Hyvärinen score over the logarithmic score is its significant computational advantage for estimating probability distributions that are known only up to a multiplicative constant, e.g. those in mixture models and complex time series models [hyvarinen, scoring, jie_nips19, jieSMC]. Our work extends the use of the Fisher divergence and the Hyvärinen score to the context of variational auto-encoders.
Similar to the logarithmic score, the Hyvärinen score is also intractable to compute, due to the intractable integration over the latent variables. One way to mitigate this difficulty is to bound the Hyvärinen score and optimize the resulting variational bound instead. However, unlike for the logarithmic score, this strategy proves very complicated, and such a variational bound appears out of reach. Alternatively, it turns out that the variational bound in VAEs can be recovered by minimizing the KL divergence between the joint distribution over the data and latent variables and the modeled joint distribution, which can be easily calculated as the product of the prior and the decoder distribution [VAE_tutorial]
. Following the same principle, we propose to minimize the Fisher divergence between the two joint distributions over the model parameters. This minimization results in a loss function that shares similar properties with regular VAEs but is more powerful from an inference point of view.
It turns out that our loss function is the sum of three terms: the first is the tractable Fisher divergence between the approximate and the model posteriors, the second is similar to the reconstruction loss in VAEs, obtained by evaluating the Hyvärinen score on the decoder distribution, and the last term can be seen as a stability measure that promotes the invariance property in feature extraction in the encoder. Therefore, the new loss function differs from the variational bound of regular VAEs in the following aspects: 1) it minimizes the distance between the approximate and the model posteriors, which is difficult under the KL divergence due to the intractable normalization constant in the model posterior; 2) it produces robust features by incorporating a stability measure of the approximate posterior. Experimental results on the MNIST [mnist] and CelebA [celebA] datasets validate these aspects and demonstrate the potential of the proposed Fisher AE as compared to existing schemes such as VAEs and Wasserstein AEs. Moreover, thanks to the stability measure in the Fisher loss function, the encoder yields more stable and robust reconstructions than other schemes when the data is perturbed by noise, playing a role similar to that of denoising auto-encoders [DAE].
Main contributions. First, we develop a new type of AEs that is based on minimizing the Fisher divergence between the underlying data/latent joint distribution and the postulated model joint distribution. Our derived loss function may be decomposed as divergence between posteriors + reconstruction loss + stability measure. Second, our derived method is conceptually appealing as it is reminiscent of the classical evidence lower bound (ELBO) derived from Kullback-Leibler (KL) divergence. Third, we affirmatively address the conjecture made in some earlier work that Fisher divergence can be more robust than KL divergence in modeling complex nonlinear models [jie_nips19, siwei]
in the context of VAEs. Our results indicate that Fisher divergence may serve as a competitive learning machinery for challenging deep learning tasks.
Outline. In Section 2, we provide a brief overview on VAEs and some theoretical concepts related to the Fisher divergence and the Hyvärinen score. In Section 3, we provide the technical details related to the proposed Fisher auto-encoder. Then, in Section 4 we give both qualitative and quantitative results regarding the performance of the proposed Fisher AE. Finally, we provide some concluding remarks in Section 5.
2 Background on VAEs and Fisher divergence
2.1 Variational auto-encoders
By considering a probabilistic model of the data observations given by , the goal of variational inference is to optimize the model parameters to match the true unknown data distribution in some sense. One way to match the true data distribution is to minimize the Kullback-Leibler (KL) divergence as follows:
where are latent variables with prior distribution and is a likelihood function corresponding to the decoder modeled by the parameters
using a neural network. Unfortunately, the integration over the latent variables in (1) is usually intractable, and an upper bound on the negative marginal log-likelihood is often optimized instead. By introducing an alternative posterior over the latent variables given by and by direct application of Jensen's inequality, we have
where is an approximate posterior corresponding to the encoder parameterized by . The bound in (2) is often called the evidence lower bound (ELBO) (w.r.t the log-likelihood) and it is optimized w.r.t both model parameters and :
The common practice is to consider a Gaussian model for the posterior , i.e., where and are the output of a neural network taking as input the data sample and parameterized by . This allows us to reparameterize as , where denotes the point-wise multiplication, and permits efficiently solving (3) using stochastic gradient variational Bayes (SGVB) as in [KingmaVAE].
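As an illustration, the reparameterization trick above can be sketched in a few lines of PyTorch; the tensor shapes here are toy values, not those used in our experiments:

```python
import torch

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so that gradients
    flow through mu and log_var (the SGVB estimator of [KingmaVAE])."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

# Toy usage: encoder outputs for a batch of 4 samples with latent dimension 2.
mu = torch.zeros(4, 2, requires_grad=True)
log_var = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()  # gradients reach mu and log_var through the sample
```

Because the randomness is isolated in `eps`, the sampling step becomes a differentiable deterministic map of the encoder outputs.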
2.2 Fisher divergence and the Hyvärinen score
A standard procedure in data fitting and density estimation is to select from a parameter space , the probability distribution , that minimizes a certain divergence with respect to the unknown true data distribution . For a certain class of divergences, expanding the divergence w.r.t the true probability distribution yields: , where is a constant that depends only on the data and is a score function associated to . Clearly, the smaller the score , the better the data point fits the model . In practice, given a set of observations , one would minimize the sample average over . The most popular example of these scoring functions [scoring] is the logarithmic score given by which is obtained by minimizing the Kullback-Leibler (KL) divergence, i.e.
. In this case, the procedure of minimizing the score function is widely known as maximum likelihood (ML) estimation and has been extensively applied in statistics and machine learning. Popular instances of ML estimation include logistic regression, obtained by minimizing the cross-entropy loss under a Bernoulli model of the data, and linear regression, obtained by minimizing the squared loss under a Gaussian model of the data [bishop]
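To make the Bernoulli instance concrete, a small numerical check (with hypothetical labels and predicted probabilities) confirms that the negative logarithmic score of a Bernoulli model coincides with the binary cross-entropy loss:

```python
import torch
import torch.nn.functional as F

# For a Bernoulli model with predicted mean q, the average negative
# logarithmic score -log p(y) equals the binary cross-entropy loss.
# The labels y and probabilities q below are purely illustrative.
y = torch.tensor([1.0, 0.0, 1.0])
q = torch.tensor([0.9, 0.2, 0.6])
nll = -(y * torch.log(q) + (1 - y) * torch.log(1 - q)).mean()
bce = F.binary_cross_entropy(q, y)
assert torch.isclose(nll, bce)  # the two quantities agree
```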
. In the context of variational inference, the logarithmic score is fundamental in the construction of variational autoencoders[KingmaVAE] as we showed in the previous section.
Recently, the Hyvärinen score [hyvarinen, jie_nips19, pmlr-v48-liub16, siwei] that we denote by has been proposed as an alternative to the logarithmic score. It turns out that the Hyvärinen score can be obtained by minimizing the Fisher divergence defined as
where denotes the gradient w.r.t . Assuming the same regularity conditions as in Proposition 1 [jie_nips19], we have
for some probability density function, and denotes the Laplacian of a function w.r.t . The appeal of both the Fisher divergence and the Hyvärinen score lies in their ability to handle probability distributions that are known only up to a multiplicative constant. This interesting property allows us to consider a larger class of unnormalized distributions and therefore to better fit the data. In the next section, we provide a detailed description of how we extend the use of the Fisher divergence and the Hyvärinen score to the context of variational auto-encoders.
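As a minimal sketch (our own helper, not the paper's implementation), the Hyvärinen score — the Laplacian of the log-density plus half the squared norm of its gradient — can be evaluated with automatic differentiation, and a quick check confirms its invariance to the normalizing constant:

```python
import torch

def hyvarinen_score(log_p, x):
    """Evaluate Laplacian(log p)(x) + 0.5 * ||grad log p(x)||^2.
    Works for unnormalized log-densities: adding a constant changes nothing."""
    x = x.clone().requires_grad_(True)
    g, = torch.autograd.grad(log_p(x).sum(), x, create_graph=True)
    lap = 0.0
    for i in range(x.shape[-1]):  # trace of the Hessian, one coordinate at a time
        lap = lap + torch.autograd.grad(g[..., i].sum(), x, create_graph=True)[0][..., i]
    return lap + 0.5 * (g ** 2).sum(-1)

x = torch.randn(5, 1)
s1 = hyvarinen_score(lambda t: -0.5 * (t ** 2).sum(-1), x)        # standard normal, unnormalized
s2 = hyvarinen_score(lambda t: -0.5 * (t ** 2).sum(-1) + 7.0, x)  # same density times a constant
assert torch.allclose(s1, s2)  # the score ignores the normalizing constant
```

For the standard normal the score evaluates to -1 + x^2/2, and shifting the log-density by any constant leaves it unchanged — exactly the property exploited in the sequel.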
3 Proposed Fisher Auto-Encoder
Recall from (2) that instead of minimizing the logarithmic score , we upper bound the score and minimize . Similarly, one would look for an upper bound to the Hyvärinen score and minimize it w.r.t the model parameters and . However, this is quite non-trivial, in contrast to the logarithmic score case in (2). Fortunately, the upper bound in (2) can be recovered by minimizing the KL divergence between the following two joint distributions: and where , and are respectively the variational posterior, the prior and the decoder with parameters , and .
Following the same line of thought, we propose to minimize the Fisher divergence between and as follows:
where denotes the gradient w.r.t the augmented variable . The following theorem provides a simplified expression of the Fisher AE loss by expanding and simplifying the Fisher divergence in (8).
The minimization in (8) is equivalent to the following minimization problem:
A proof can be found in the supplementary material. ∎
The Fisher AE loss denoted by in (10) is the sum of the following three terms: 1⃝ the Fisher divergence between the two posteriors and . In traditional VAEs, the KL divergence between these two posteriors is generally intractable because the model posterior involves an intractable normalization constant. Interestingly, with the Fisher divergence this limitation is alleviated, since only the gradient of the log-density is needed, and this gradient does not depend on the normalization constant. The second term 2⃝ is the Hyvärinen score of , which is simply a reconstruction loss analogous to that in regular VAEs. When , the reconstruction loss is given by the squared loss (we omit the term coming from the Laplacian since it is constant and thus irrelevant to the minimization problem in (9)): , which is the same as in regular VAEs under the same model, where is the decoder parametrized by . The last term 3⃝ is a stability term that produces robust features, in the sense that the posterior distribution is robust against small perturbations in the input data. This is similar to contractive auto-encoders, which promote the invariance property in feature extraction [contractive_ae].
When , the Fisher AE loss becomes exactly the Hyvärinen score of the model distribution , i.e. . This is similar to traditional VAEs since we also have in this case.
When , . The proof is concluded by relying on (5). ∎
Given a data point , the Fisher AE loss can be estimated using Monte Carlo with samples from as follows:
where , . Moreover, both and can be computed using automatic differentiation tools such as Autograd in PyTorch. To solve the minimization in (9), we use stochastic gradient descent (SGD) with minibatch data of size as in [KingmaVAE]. Details of the optimization are given in Algorithm 1.
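For instance, the two gradients of the approximate posterior's log-density can be obtained from a single autograd call; the linear encoder below is purely illustrative (our experiments use the architectures detailed in the supplementary material):

```python
import torch

# Illustrative Gaussian posterior q(z|x) = N(mu(x), diag(exp(log_var(x)))).
# The linear encoder is a toy stand-in for the real architecture.
enc = torch.nn.Linear(3, 2 * 2)  # outputs [mu, log_var] for a 2-dim latent

def log_q(z, x):
    mu, log_var = enc(x).chunk(2, dim=-1)
    # log N(z; mu, sigma^2), up to an additive constant
    return (-0.5 * ((z - mu) ** 2 / log_var.exp() + log_var)).sum(-1)

x = torch.randn(4, 3, requires_grad=True)
z = torch.randn(4, 2, requires_grad=True)
# Gradients w.r.t. both the latent variable and the input in one call.
grad_z, grad_x = torch.autograd.grad(log_q(z, x).sum(), (z, x))
```

The same mechanism supplies every score term appearing in the Monte Carlo estimate, so no gradient needs to be derived by hand.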
3.1 Fisher AE with exponential family priors
As discussed earlier, employing the Fisher divergence has the advantage of handling probability distributions that are known only up to a multiplicative constant. This powerful property allows us to consider a rich family of distributions to model the prior . In this paper, we consider the exponential family, whose general form is given by:
where denotes the natural parameters, is the carrier measure and is referred to as a sufficient statistic [exp_family]
. Popular examples of the exponential family include the Bernoulli, Poisson and Gaussian distributions, to name a few [exp_family]. Note that the form given by the right hand side of (12) is not a valid PDF since it does not integrate to 1, but it is sufficient to compute the gradient of the log-density w.r.t , which is given by . Therefore, the term 1⃝ in (10) can be written as:
which can be approximated using samples , as follows:
A popular class of distributions that belongs to the exponential family is given by the factorable polynomial exponential family (FPE) [polynomial_family] in which is given by
where denotes the order of FPE family and is a set of parameters. The natural parameters, the sufficient statistic and the carrier measure in this case are given by:
With the model in (13), the gradient of w.r.t can be easily derived as
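As a hedged sketch of this gradient, assuming a per-coordinate polynomial log-density of the form log p(z_i) = sum_k theta_k z_i^k + const (the theta values below are illustrative placeholders, not trained parameters):

```python
import torch

def fpe_grad_log_p(z, theta):
    """Gradient of a factorable polynomial exponential log-density:
    d/dz sum_k theta_k z^k = sum_k k * theta_k * z^(k-1), per coordinate."""
    K = theta.shape[0]
    powers = torch.stack([k * z ** (k - 1) for k in range(1, K + 1)], dim=0)
    return torch.einsum('k,k...->...', theta, powers)

# Hypothetical order-4 FPE prior, Gaussian-like with a quartic tail.
theta = torch.tensor([0.0, -0.5, 0.0, -0.1])
z = torch.randn(4, 2)
g = fpe_grad_log_p(z, theta)  # equals -z - 0.4 * z**3 for this theta
```

Since only this polynomial gradient is needed, the intractable normalizing constant of the FPE prior never has to be computed.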
4 Experiments
In this section, we provide both qualitative and quantitative results that demonstrate the ability of our proposed Fisher AE model to produce high quality samples on real-world image datasets such as MNIST and CelebA. We compare results with both regular VAEs [KingmaVAE] and Wasserstein Auto-Encoders with GAN penalty (WAE-GAN) [wae]. In the supplementary material, we provide full details of the encoder/decoder architectures used by the different schemes for both the MNIST and CelebA datasets.
For optimization, we use Adam [adam] with a learning rate , , , and a mini-batch size of
, and train the various models for 100 epochs. For all experiments, we pick for MNIST and for CelebA, and use Gaussian and Bernoulli decoders for the Fisher AE and the regular VAE respectively. As proposed earlier, we use exponential family priors for the Fisher AE as in (13) and observed that seems to work better in all experiments, whereas Gaussian priors are used for VAE and WAE-GAN. We use Gaussian posteriors for both the Fisher AE and the VAE, such that where and are determined by the encoder architecture, whose details are postponed to the supplementary material.
Sampling with SVGD
To sample from the exponential family prior after training, we use Stein Variational Gradient Descent (SVGD) [svgd_general]. Let be the number of samples we wish to draw from the prior, denoted by . We start with and keep evolving these samples with a step-size for iterations. These parameters (step-size and number of iterations) seem to work reasonably well across all experiments.
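For concreteness, one SVGD update (RBF kernel with a median-heuristic bandwidth) can be sketched as follows; the step size, particle count, and iteration budget are illustrative placeholders rather than the settings used in our experiments:

```python
import torch

def svgd_step(z, score_fn, step=0.1):
    """One SVGD update: attraction along the target score plus a kernel
    repulsion term that keeps the particles spread out."""
    n = z.shape[0]
    d2 = torch.cdist(z, z) ** 2
    h = d2.median() / torch.log(torch.tensor(n + 1.0))      # median heuristic
    k = torch.exp(-d2 / h)                                  # (n, n) RBF kernel matrix
    # sum_j grad_{z_j} k(z_j, z_i) for the RBF kernel above
    grad_k = (k.unsqueeze(-1) * (z.unsqueeze(0) - z.unsqueeze(1))).sum(0) * (2.0 / h)
    phi = (k @ score_fn(z) + grad_k) / n
    return z + step * phi

# Toy usage: particles initialized far away drift toward a standard normal,
# whose score grad_z log p(z) is simply -z.
z = torch.randn(50, 2) + 5.0
for _ in range(500):
    z = svgd_step(z, lambda t: -t)
```

Note that `score_fn` only requires the gradient of the log-density, so the unnormalized FPE prior fits this sampler directly.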
Figure 2 exhibits a comparison between the three auto-encoders in terms of robustness, test reconstruction, and random sampling. To compare robustness, we plot the reconstructed samples of the different schemes when the test data is corrupted by an isotropic Gaussian noise with a covariance matrix . The results of this experiment are given in the first row of Figure 2. Clearly, WAE-GAN completely fails to reconstruct the test data, while the Fisher AE appears to be the most robust to noise. This result is confirmed quantitatively in Figure 1, where we plot the normalized binary cross-entropy (BCE) against the noise variance added to the test data, i.e. we feed the different trained models with and compute the BCE reconstruction loss w.r.t the true test data. In the second and third rows of Figure 2, we show both the reconstruction and generative performance of the different auto-encoders. For both test reconstruction and random sampling, the proposed Fisher AE exhibits performance comparable to WAE-GAN, which achieves the best generative performance thanks to the GAN penalty in its loss function [wae].
For the CelebA dataset, it is clear from the first row (the noisy reconstructions) of Figure 4 that the proposed Fisher AE is more robust than both VAE and WAE-GAN when the test data is corrupted with an isotropic Gaussian noise with covariance matrix . We further validate this property with different noise levels, as depicted in Figure 3, where the Fisher AE outperforms VAE and WAE-GAN in reconstruction MSE. Moreover, as shown in Figure 4, the Fisher AE generates better samples than VAE and has quality comparable to WAE-GAN. The visual quality of the samples is confirmed by the quantitative results summarized in Table 1, where the proposed Fisher AE with exponential family priors outperforms VAE in terms of the Fréchet Inception Distance (FID) and performs somewhat worse than WAE-GAN. Furthermore, sampling with the exponential family prior poses additional challenges due to the difficulty of convergence of the sampling algorithm. This may be alleviated with alternative sampling algorithms, but that remains beyond the scope of this paper.
In this paper, we introduced a new type of auto-encoders constructed by minimizing the Fisher divergence between the joint distribution over the data and latent variables and the model joint distribution. The resulting loss function has two interesting aspects: 1) it directly minimizes the tractable Fisher divergence between the approximate and the true posteriors, and 2) it includes a stability measure of the encoder that produces robust features. Experimental results demonstrated the competitive performance of the proposed Fisher auto-encoders as compared to existing schemes such as VAEs and Wasserstein AEs, and their superiority in terms of robustness. An interesting but nontrivial extension of the present work is to model the posterior distribution using exponential family priors.
Our proposed approach is expected to help the machine learning community build more robust generative models that learn the generative process of a data set and produce samples at scale. On another front, the proposed approach can also be seen as a denoising technique that reconstructs data from noisy observations. At the societal level, the proposed approach may accelerate the deployment of learning algorithms with access to large data sets. A potential negative societal consequence might come from the misuse of this approach to generate unwanted content at scale.
This work was supported by the Army Research Office grant No. W911NF-15-1-0479.
Supplementary Material for "Fisher Auto-Encoders"
In the supplementary material, we include the proofs of the theoretical results and more details on the experiments.
Appendix A Proof of Theorem 1
Let’s examine the inner-product terms:
where is obtained by an integration by parts.
where is again obtained by an integration by parts. Grouping all the terms together, we get
By noticing that is independent of the parameters , and , we conclude the proof of Theorem 1.
Appendix B Robustness to binary masking noise
We extend the experiments to examine the robustness of the proposed Fisher AEs and consider another type of noise, called binary masking noise, which consists of setting a randomly selected fraction of the input components to zero [contractive_ae].
Appendix C Further details on experiments
Here, we give the detailed architectures used in the implementation of the different auto-encoders for both the MNIST and CelebA data sets.
: Fully connected layer with input/output dimensions given by and .
: Transposed convolutional layer with input channels , output channels , kernel size , stride and padding .
: Average Pooling with kernel size, stride and padding respectively given by , and .
BN : Batch-normalization
BiI : 2D bilinear interpolation layer