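A divergence $D(p\,\|\,q)$ (see, for example, Dragomir (2005)) is a measure of the difference between two distributions $p$ and $q$ with the property
$$D(p\,\|\,q) \geq 0, \qquad D(p\,\|\,q) = 0 \iff p = q.$$
Some of our results are specific to the $f$-divergence, defined as
$$D_f(p\,\|\,q) = \mathbb{E}_{q(x)}\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right],$$
where $f$ is a convex function with $f(1) = 0$. An important special case of an $f$-divergence is the well-known Kullback-Leibler divergence $\mathrm{KL}(p\,\|\,q) = \mathbb{E}_{p(x)}\!\left[\log \frac{p(x)}{q(x)}\right]$, which is widely used to train models using maximum likelihood. We are interested in situations in which the supports of the two distributions are different, $\mathrm{supp}(p) \neq \mathrm{supp}(q)$. In this case the divergence may not be defined. For example, let $p(x)$ be an empirical data distribution on a continuous dataset $\{x_1, \ldots, x_N\}$, $p(x) = \frac{1}{N}\sum_{n=1}^{N} \delta(x - x_n)$, where $\delta(\cdot)$ is the Dirac delta function. For a model $q(x)$ with support $\mathbb{R}$, $\mathrm{KL}(p\,\|\,q)$ is then not formally defined. This is a challenge since implicit generative models of the form $q(x) = \int \delta(x - g_\theta(z))\, p(z)\, dz$ only have limited support; in this case maximum likelihood to learn the model parameter $\theta$ is not available and alternative approaches are required; see Mohamed & Lakshminarayanan (2016) for a recent survey.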
2 Spread Divergences
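The aim is, from $q(x)$ and $p(x)$, to define new distributions $\tilde{q}(y)$ and $\tilde{p}(y)$ that have the same support. (For simplicity, we use univariate $x$, with the extension to the multivariate setting being straightforward.) Using the notation $\int_x$ to denote integration $\int (\cdot)\, dx$ for continuous $x$, and $\sum_{x \in \mathcal{X}}$ for discrete $x$ with domain $\mathcal{X}$, we define a random variable $y$ with the same domain as $x$ and distributions
$$\tilde{p}(y) = \int_x p(y|x)\, p(x), \qquad \tilde{q}(y) = \int_x p(y|x)\, q(x),$$
where $p(y|x)$ is a ‘noise’ process designed to ‘spread’ the mass of $p$ and $q$ such that $\tilde{p}(y)$ and $\tilde{q}(y)$ have the same support. For example, if we use a Gaussian $p(y|x) = \mathcal{N}(y\,|\,x, \sigma^2)$, then $\tilde{p}$ and $\tilde{q}$ both have support $\mathbb{R}$. We therefore use noise with the property that, despite $D(p\,\|\,q)$ not existing, $D(\tilde{p}\,\|\,\tilde{q})$ does exist, and we define the Spread Divergence
$$\tilde{D}(p\,\|\,q) \equiv D(\tilde{p}\,\|\,\tilde{q}).$$
Note that this satisfies the divergence requirement $\tilde{D}(p\,\|\,q) \geq 0$. The second requirement, $\tilde{D}(p\,\|\,q) = 0 \iff p = q$, is guaranteed for certain ‘noise’ processes, as described in section(2.1).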
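To make the Gaussian example concrete, consider the simplest case where both distributions are point masses, at locations `a` and `b`. The ordinary KL divergence between them is undefined, but after Gaussian spreading each becomes a Gaussian of width `sigma`, and the spread KL divergence is the finite value $(a-b)^2 / (2\sigma^2)$. The following is a minimal sketch under that assumption; the function names are illustrative and not from the paper.

```python
import numpy as np

def spread_kl_two_deltas(a, b, sigma):
    """Closed form of KL( N(y|a, sigma^2) || N(y|b, sigma^2) ),
    i.e. the Gaussian-spread KL between point masses at a and b."""
    return (a - b) ** 2 / (2.0 * sigma ** 2)

def spread_kl_numeric(a, b, sigma, num=200001, width=12.0):
    """The same quantity by a Riemann sum over y, as a check on the closed form."""
    lo = min(a, b) - width * sigma
    hi = max(a, b) + width * sigma
    y = np.linspace(lo, hi, num)
    log_norm = -0.5 * np.log(2.0 * np.pi * sigma ** 2)
    log_p = -0.5 * ((y - a) / sigma) ** 2 + log_norm  # log of spread p
    log_q = -0.5 * ((y - b) / sigma) ** 2 + log_norm  # log of spread q
    dy = y[1] - y[0]
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dy)

kl_closed = spread_kl_two_deltas(0.0, 1.0, 0.5)
kl_numeric = spread_kl_numeric(0.0, 1.0, 0.5)
```

The spread divergence is zero exactly when `a == b`, and it grows without bound as `sigma` shrinks, which is consistent with the unspread divergence being undefined.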
Spread divergences have many potential applications. For example, for a model $p(x|\theta)$ with parameter $\theta$ and empirical data distribution $\hat{p}(x)$, maximum likelihood training corresponds to minimising $\mathrm{KL}(\hat{p}(x)\,\|\,p(x|\theta))$ with respect to $\theta$. For implicit models, however, the divergence $\mathrm{KL}(\hat{p}(x)\,\|\,p(x|\theta))$ does not exist. Nevertheless, if a spread divergence exists then, provided that the data is distributed according to the model $p(x|\theta_0)$ for some unknown parameter $\theta_0$, the spread divergence has a minimum at $\theta = \theta_0$. That is, for identifiable models, we can correctly learn the underlying data generating process, even when the original divergence is not defined.
2.1 Noise Requirements for a Spread Divergence
Our main interest is in using noise to define a new divergence in situations in which the original divergence $D(p\,\|\,q)$ is itself not defined. For discrete variables $x \in \{1, \ldots, n\}$, $y \in \{1, \ldots, n\}$, the noise $P_{ij} = p(y = i\,|\,x = j)$ must be a distribution, $\sum_i P_{ij} = 1$, $P_{ij} \geq 0$, and
$$\sum_j P_{ij}\, p_j = \sum_j P_{ij}\, q_j \;\; \forall i \quad \Rightarrow \quad p_j = q_j \;\; \forall j,$$
which is equivalent to the requirement that the matrix $P$ is invertible, see appendix(B). There is an additional requirement that the spread divergence exists. In the case of $f$-divergences, the spread divergence exists provided that $\tilde{p}$ and $\tilde{q}$ have the same support. This is guaranteed if
$$\sum_j P_{ij}\, p_j > 0, \qquad \sum_j P_{ij}\, q_j > 0 \quad \forall i,$$
which is satisfied if $P_{ij} > 0$. In general, therefore, there is a space of noise distributions $p(y|x)$ that define a valid spread divergence. The ‘antifreeze’ method of Furmston & Barber (2009) is a special form of spread noise to define a valid Kullback-Leibler divergence (see also Barber (2012)).
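The discrete conditions can be checked numerically. The sketch below uses a hypothetical three-state example in which the two distributions have disjoint supports, and a noise matrix formed by mixing the identity with uniform noise (an illustrative choice, not one prescribed by the text). It checks the sufficient conditions stated above: columns that are distributions, strictly positive entries, and invertibility.

```python
import numpy as np

def spread(P, p):
    """Spread a discrete distribution p through noise P, where P[i, j] = p(y=i | x=j)."""
    return P @ p

def kl(p, q):
    """KL(p || q) for discrete distributions; inf when p has mass where q has none."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def is_valid_spread_noise(P, tol=1e-12):
    """Sufficient conditions: columns sum to one, entries positive, P square and invertible."""
    P = np.asarray(P, dtype=float)
    columns_normalised = bool(np.allclose(P.sum(axis=0), 1.0))
    strictly_positive = bool(np.all(P > 0))
    invertible = P.shape[0] == P.shape[1] and abs(np.linalg.det(P)) > tol
    return columns_normalised and strictly_positive and invertible

# Hypothetical example: p and q have disjoint supports.
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 0.5, 0.5])
eps = 0.1
P = (1 - eps) * np.eye(3) + eps * np.ones((3, 3)) / 3  # identity mixed with uniform noise

kl_raw = kl(p, q)                            # not finite: supports differ
kl_spread = kl(spread(P, p), spread(P, q))   # finite and positive
```

Uniform noise alone (`np.ones((3, 3)) / 3`) has positive entries but is not invertible, so it maps every distribution to the same spreaded distribution and cannot define a valid spread divergence.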
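For continuous variables, in order that $\tilde{D}(p\,\|\,q) = 0 \Rightarrow p = q$, the noise $p(y|x)$, with $\dim(\mathcal{Y}) = \dim(\mathcal{X})$, must be a probability density and satisfy
$$\int p(y|x)\, p(x)\, dx = \int p(y|x)\, q(x)\, dx \;\; \forall y \quad \Rightarrow \quad p(x) = q(x) \;\; \forall x.$$
This is satisfied if there exists a transform $p^{-1}(x|y)$ such that $\int p^{-1}(x'|y)\, p(y|x)\, dy = \delta(x' - x)$, where $\delta(\cdot)$ is the Dirac delta function. As for the discrete case, the spread divergence exists provided that $\tilde{p}$ and $\tilde{q}$ have the same support, which is guaranteed if $p(y|x) > 0$. A well known example of such an invertible integral transform is the Weierstrass Transform $p(y|x) = \mathcal{N}(y\,|\,x, \sigma^2)$, which has an explicit representation for $p^{-1}$. In general, however, we can demonstrate the existence of a spread divergence without the need for an explicit representation of $p^{-1}$. As we will see below, the noise requirements for defining a valid spread divergence such that $\tilde{D}(p\,\|\,q) = 0 \iff p = q$ are analogous to the requirements on kernels such that the Maximum Mean Discrepancy $\mathrm{MMD}(p, q) = 0 \iff p = q$, see Sriperumbudur et al. (2011) and Sriperumbudur et al. (2012).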
3 Stationary Spread Divergences
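Consider stationary noise $p(y|x) = K(y - x)$, where $K(x)$ is a probability density function with $K(x) > 0$, $x \in \mathbb{R}$. In this case $\tilde{p}$ and $\tilde{q}$ are defined as a convolution
$$\tilde{p}(y) = \int K(y - x)\, p(x)\, dx = (K * p)(y), \qquad \tilde{q}(y) = \int K(y - x)\, q(x)\, dx = (K * q)(y).$$
Since $K > 0$, $\tilde{p}$ and $\tilde{q}$ are guaranteed to have the same support $\mathbb{R}$. A sufficient condition for the existence of the Fourier Transform $\mathcal{F}\{f\}$ of a function $f(x)$ for real $x$ is that $f$ is absolutely integrable. All distributions $p(x)$ are absolutely integrable, so that both $\mathcal{F}\{p\}$ and $\mathcal{F}\{q\}$ are guaranteed to exist. Assuming $\mathcal{F}\{K\}$ exists, we can use the convolution theorem to write
$$\mathcal{F}\{\tilde{p}\} = \mathcal{F}\{K\}\, \mathcal{F}\{p\}, \qquad \mathcal{F}\{\tilde{q}\} = \mathcal{F}\{K\}\, \mathcal{F}\{q\}.$$
Hence, we can write
$$D(\tilde{p}\,\|\,\tilde{q}) = 0 \iff \tilde{p} = \tilde{q} \iff \mathcal{F}\{K\}\, \mathcal{F}\{p\} = \mathcal{F}\{K\}\, \mathcal{F}\{q\} \iff \mathcal{F}\{p\} = \mathcal{F}\{q\} \iff p = q,$$
where we used the invertibility of the Fourier transform and assumed that $\mathcal{F}\{K\} \neq 0$, or equivalently, $\mathcal{F}\{K\} > 0$ (if $\mathcal{F}\{K\}$ can change sign then, by continuity, there must exist a point at which $\mathcal{F}\{K\} = 0$). Hence, provided that $K(x) > 0$ and $\mathcal{F}\{K\} > 0$, then $K(y - x)$ defines a valid spread divergence. Note that other transforms, including the Laplace, Mellin and Hartley transforms, have a corresponding convolution theorem and the above derivation holds, with the requirement that the corresponding transform of $K(x)$ is non-zero. As an example of such a noise process, consider Gaussian noise,
$$K(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}},$$
leading to a positive Fourier Transform (with the convention $\mathcal{F}\{K\}(\omega) = \int K(x)\, e^{-i\omega x}\, dx$):
$$\mathcal{F}\{K\}(\omega) = e^{-\frac{\sigma^2 \omega^2}{2}} > 0.$$
Similarly, for Laplace noise
$$K(x) = \frac{1}{2b}\, e^{-\frac{|x|}{b}}, \qquad \mathcal{F}\{K\}(\omega) = \frac{b^{-2}}{b^{-2} + \omega^2} > 0.$$
Since $K > 0$ and $\mathcal{F}\{K\} > 0$, this also defines a valid spread divergence over $\mathbb{R}$.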
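The positivity of the Fourier Transform for Gaussian and Laplace noise can be checked numerically. The sketch below assumes the convention $\mathcal{F}\{K\}(\omega) = \int K(x)\, e^{-i\omega x}\, dx$ and unit scale parameters ($\sigma = 1$, $b = 1$), for which the known closed forms are $e^{-\omega^2/2}$ and $1/(1+\omega^2)$. The grid sizes and function names are arbitrary illustrative choices.

```python
import numpy as np

def fourier_transform(K, omega, x):
    """Approximate F{K}(omega) = integral of K(x) exp(-i omega x) dx on a uniform grid x."""
    dx = x[1] - x[0]
    kx = K(x)
    return np.array([np.sum(kx * np.exp(-1j * w * x)) * dx for w in omega])

def gaussian(x, sigma=1.0):
    """Gaussian density with standard deviation sigma."""
    return np.exp(-x ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def laplace(x, b=1.0):
    """Laplace density with scale b."""
    return np.exp(-np.abs(x) / b) / (2 * b)

x = np.linspace(-60.0, 60.0, 240001)   # wide grid so that the tails are negligible
omega = np.linspace(-5.0, 5.0, 41)

ft_gauss = fourier_transform(gaussian, omega, x)
ft_laplace = fourier_transform(laplace, omega, x)
```

Both transforms come out real (up to numerical error) and strictly positive on the tested frequency range, in agreement with the closed forms.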
3.1 Invertible Mappings
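Consider $p(y|x) = K(y - f(x))$ for strictly monotonic $f$. Then, using the change of variables $z = f(x)$,
$$\tilde{p}(y) = \int K(y - f(x))\, p(x)\, dx = \int K(y - z)\, \frac{p(f^{-1}(z))}{|J(f^{-1}(z))|}\, dz,$$
where $J$ is the Jacobian of $f$. For distributions with bounded domain, for example $x \in (0, 1)$, we can use a logit function, $f(x) = \log\frac{x}{1 - x}$, which maps the interval $(0, 1)$ to $\mathbb{R}$. Using then, for example, Gaussian spread noise $p(y|x) = \mathcal{N}(y\,|\,f(x), \sigma^2)$, both $\tilde{p}$ and $\tilde{q}$ have support $\mathbb{R}$. If $D(\tilde{p}\,\|\,\tilde{q})$ is zero then $p = q$ on the domain $(0, 1)$.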
3.2 Maximising the Spread
From the data processing inequality (see appendix(A)), spread noise will always decrease the $f$-divergence, $D_f(\tilde{p}\,\|\,\tilde{q}) \leq D_f(p\,\|\,q)$. If we are to use a spread divergence to train a model, there is the danger that adding too much noise may make the spreaded empirical distribution and spreaded model distribution so similar that it becomes difficult to numerically distinguish them, impeding training. In general, therefore, it would be useful to add noise such that we define a valid spread divergence, but can still maximally discern the difference between the two distributions. To gain intuition, we define $p$ and $q$ to generate data in separated linear subspaces, $p(x) = \int \delta(x - a - Az)\, p(z)\, dz$, $q(x) = \int \delta(x - b - Bz)\, p(z)\, dz$, $p(z) = \mathcal{N}(z\,|\,0, I)$. Using Gaussian spread, $p(y|x) = \mathcal{N}(y\,|\,x, \Sigma)$, what is the optimal $\Sigma$ that maximises the divergence? Clearly, as $\Sigma$ tends to zero, the divergence increases to infinity, meaning that we must at least constrain the entropy of $p(y|x)$ to be finite. In this case the spreaded distributions are given by
$$\tilde{p}(y) = \mathcal{N}(y\,|\,a, AA^{\mathsf{T}} + \Sigma), \qquad \tilde{q}(y) = \mathcal{N}(y\,|\,b, BB^{\mathsf{T}} + \Sigma).$$
We define a simple Factor Analysis noise model with $\Sigma = \sigma^2 I + uu^{\mathsf{T}}$, where $\sigma^2$ is fixed and $u^{\mathsf{T}}u = 1$. The entropy of $p(y|x)$ is then fixed and independent of $u$. Also, for simplicity, we assume $A = B$. It is straightforward to show that the spread divergence $\mathrm{KL}(\tilde{p}\,\|\,\tilde{q})$ is maximised for $u$ pointing orthogonal to the vector $(AA^{\mathsf{T}} + \sigma^2 I)^{-1}(b - a)$. Then $u$ optimally points along the direction in which the support lies. The support of $p(y|x)$ must be the whole space, but to maximise the divergence the noise preferentially spreads along directions defined by $p$ and $q$, see figure(1).
4 Mercer Spread Divergence
We showed in section(3) how to define one form of spread divergence, with the result that stationary noise distributions must have strictly positive Fourier Transforms. A natural question is whether, for continuous , there are other easily definable noise distributions that are non-stationary. To examine this question, let , and be square integrable, . We define Mercer noise , where . For strictly positive definite , by Mercer’s Theorem, it admits an expansion
where the eigenfunctions form a complete orthogonal set of and all , see for example Sriperumbudur et al. (2011). Then
and is equivalent to the requirement
Multiplying both sides by and integrating over we obtain
If and are in then, from Mercer’s Theorem, they can be expressed as orthogonal expansions
Then, equation(19) is
which reduces to (using orthonormality), . Hence, provided is square integrable on and strictly positive definite, then defines valid spread noise. For example, defines a strictly positive non-stationary square integrable kernel on . Provided and are in then the spread noise defines a valid spread divergence.
5 Applications
As an application, we demonstrate the use of a spread divergence to train implicit models
$$p_\theta(x) = \int \delta(x - g_\theta(z))\, p(z)\, dz,$$
where $\theta$ are the parameters of the encoder $g$. We show that, despite the likelihood not being defined, we can nevertheless successfully train the models using an EM style algorithm, see for example Barber (2012). Finally, we discuss a link between spread divergences and privacy preservation.
5.1 Deterministic Linear Latent Model
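For observation noise $\gamma$, the Probabilistic PCA model (Tipping & Bishop, 1999) for $D$-dimensional observations $x$ and $H$-dimensional latent $z$ is
$$x = Fz + \gamma\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_D), \qquad z \sim \mathcal{N}(0, I_H).$$
When $\gamma = 0$, the generative mapping from $z$ to $x$ is deterministic, the model has support only on a subset of $\mathbb{R}^D$, and the data likelihood is in general not defined. In the following we consider general $\gamma$, setting $\gamma$ to zero at the end of the calculation. To fit the model to iid data $\{x_1, \ldots, x_N\}$ using maximum likelihood, the only information required from the dataset is the data covariance $\hat{\Sigma}$. The maximum likelihood solution for PPCA is then $F = U_H (\Lambda_H - \gamma^2 I_H)^{1/2} R$, where $\Lambda_H$, $U_H$ are the $H$ largest eigenvalues and corresponding eigenvectors of $\hat{\Sigma}$, and $R$ is an arbitrary orthogonal matrix. Using spread noise $p(y|x) = \mathcal{N}(y\,|\,x, \sigma^2 I_D)$, the spreaded distribution is a Gaussian
$$\tilde{p}(y) = \mathcal{N}\!\left(y\,|\,0, FF^{\mathsf{T}} + (\gamma^2 + \sigma^2) I_D\right).$$
Thus, $\tilde{p}(y)$ is of the same form as PPCA, albeit with an inflated covariance matrix. Adding Gaussian spread noise to the data also simply inflates the sample covariance to $\hat{\Sigma}' = \hat{\Sigma} + \sigma^2 I_D$. Since the eigenvalues of $\hat{\Sigma}'$ are simply $\Lambda' = \Lambda + \sigma^2 I_D$, with unchanged eigenvectors, the optimal deterministic ($\gamma = 0$) latent linear model has solution $F = U_H (\Lambda'_H - \sigma^2 I_H)^{1/2} R = U_H \Lambda_H^{1/2} R$. Unsurprisingly, this is the standard PCA solution; however, the derivation is non-standard since the likelihood of the deterministic latent linear model is not defined. Nevertheless, using the spread divergence, we learn a sensible model and recover the true data generating process if the data were exactly generated according to the deterministic model.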
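The argument that spread noise only shifts the eigenvalues of the sample covariance, so that the fitted loading matrix coincides with the PCA solution, can be verified numerically. The sketch below assumes the standard PPCA maximum likelihood form $F = U_H(\Lambda_H - \nu I)^{1/2}$ for a known noise variance $\nu$, with the arbitrary rotation set to the identity. The data sizes, seed, spread variance and helper name `ppca_ml_loading` are illustrative assumptions.

```python
import numpy as np

def ppca_ml_loading(S, H, noise_var):
    """Loading matrix U_H (Lambda_H - noise_var I)^{1/2} from the top-H eigenpairs of S."""
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1][:H]          # indices of the H largest eigenvalues
    lam, U = evals[order], evecs[:, order]
    return U @ np.diag(np.sqrt(np.maximum(lam - noise_var, 0.0)))

rng = np.random.default_rng(0)
D, H, N = 5, 2, 2000
F_true = rng.normal(size=(D, H))
X = rng.normal(size=(N, H)) @ F_true.T           # deterministic model: no observation noise
S = X.T @ X / N                                  # second-moment matrix of the zero-mean data

sigma2 = 0.3                                     # spread noise variance (arbitrary choice)
S_spread = S + sigma2 * np.eye(D)                # Gaussian spreading inflates the covariance
F_spread = ppca_ml_loading(S_spread, H, sigma2)  # fit the spreaded model, noise fixed to sigma2
F_pca = ppca_ml_loading(S, H, 0.0)               # plain PCA loading for comparison
```

Comparing `F_spread @ F_spread.T` with `F_pca @ F_pca.T` avoids the sign and rotation ambiguity of the eigenvectors; the two agree, and both reproduce `S` because the noiseless data has rank `H`.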
5.2 Deterministic Independent Components Analysis
ICA corresponds to the model , where the independent components follow a non-Gaussian distribution. For Gaussian noise ICA an observation is assumed to be generated by the process , where mixes the independent latent process . In standard linear ICA, where is the column on the mixing matrix . For small observation noise , the EM algorithm (Bermond & Cardoso, 1999) becomes ineffective. To see this, consider and invertible mixing matrix , . At iteration
the EM algorithm has an estimate of the mixing matrix. The M-step updates this estimate to
where, for noiseless data (),
is the moment matrix of the data. Thus, the M-step leaves the estimate unchanged and the algorithm ‘freezes’. Similarly, for low noise, progress critically slows down. Whilst over-relaxation methods (see, for example, Winther & Petersen (2007)) can help in the case of small noise, for zero noise over-relaxation is of no benefit.
, for independent zero mean unit variance Laplace components on. The elements of used to generate the data are uniform random . We use , samples and 2000 EM iterations to estimate the mixing matrix. The relative error is averaged over all
and 10 random experiments. We also plot standard errors around the mean relative error. In blue we show the error in learning the underlying parameter using the standard EM algorithm. As expected, as, the error blows up as the EM algorithm ‘freezes’. In orange we plot the error for EM using spread noise, as described in section(5.2.1); no slowing down appears as the model noise decreases. As the model noise increases, the quality of the learned model under spread noise decreases gradually. In (b) we show that, apart from very small , the error for the spread EM algorithm is lower than for the standard EM algorithm. Here , , , , with 500 EM updates used. Results are averaged over 50 runs of randomly drawn .
5.2.1 Healing Critical Slowing Down
To deal with small noise and the limiting case of a deterministic model (), we consider Gaussian spread noise to give
The empirical distribution is replaced by the spreaded empirical distribution
The M-step has the same form as equation(25) but with modified statistics
The E-step optimally sets
where is a normaliser and
We can rewrite the expectations required for the E-step of the EM algorithm as
Generally the posterior will be peaked around and writing the expectations with respect to allows for an effective sampling approximation focussed on regions of high probability. We implement this update by drawing samples from and, for each sample, we draw samples from . This scheme has the advantage over more standard variational approaches, see for example Winther & Petersen (2007), in that we obtain a consistent estimator of the M-step update for the mixing matrix. (We focus on demonstrating how the spread divergence heals critical slowing down, rather than deriving a state-of-the-art approximation of . The importance sampling approach has fast run time and works well, even for large latent dimensions, . We also implemented a variational factorised approximation of but found this to be relatively slow and ineffective. A variational Gaussian approximation of improves on the factorised approximation, but is still slow compared to the importance sampling scheme.) We show results for a toy experiment in figure(2), learning the underlying mixing matrix in a deterministic non-square setting. Note that standard algorithms such as FastICA (Hyvärinen, 1999) fail in this setting. The noise value is set to , for estimated mixing matrix of the underlying deterministic model , . The EM algorithm learns a good approximation of the unknown mixing matrix and latent components , with no critical slowing down.
5.3 Training Implicit Non-linear Models
For a deterministic non-linear implicit model, we set and parameterise
by a deep neural network. The likelihood equation(22) is in general intractable and it is natural to consider a variational approximation (Kingma & Welling, 2013),
However, since the model is deterministic, this bound is not well defined. Instead, we minimise the spread divergence between the data distribution and the model. The approach is a straightforward extension of the standard variational autoencoder and in appendix(C) we provide details of how to do this, along with higher resolution images of samples from the generative model. We dub this model and associated spread divergence training the ‘$\delta$VAE’. As a demonstration, we trained a generative network on the MNIST dataset, appendix(D). We used Gaussian spread noise for the $\delta$VAE and observation noise for the standard noisy VAE. We also trained a deep convolutional generative model on the CelebA dataset (Liu et al., 2015), see figure(3) and appendix(E). We pre-process CelebA images by first taking 140x140 centre crops and then resizing to 64x64. Pixel values were then rescaled to a fixed range. We use Gaussian spread noise for the $\delta$VAE and observation noise for the standard noisy VAE.
Spread divergences have an obvious connection to privacy preservation and Randomised Response (Warner, 1965). For example, Alice might have a collection of binary votes , and Bob might like to learn the fraction of votes that are 1. However, Alice does not want to send to Bob the raw data . Letting , Alice can find the fraction of votes that are 1 by performing maximum (log) likelihood
We can equivalently write this as finding that minimises where the model is given by and the data distribution is given by
To preserve privacy, instead of sending , Alice can send to Bob, where is sampled from the distribution so that if she draws a sample with probability and with probability . This generates a noisy data distribution . Since
this suggests that we could attempt to learn from the noisy data by . Bob can equivalently maximise the spread log likelihood
where . Using , , , the maximum of the spread log likelihood is at . This simple example recovers the standard ‘debiasing’ procedure in (Warner, 1965). However, the spread divergence suggests a general strategy to perform machine learning for any model given only noisy data, namely minimising the spread divergence between the noisy data distribution and the spreaded model. In the limit of a large number of samples, the accuracy with which the true parameter is recovered improves (although privacy decreases, since the ability to recover the underlying quantity from a set of noisy samples correspondingly increases).
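The randomised response setting can be simulated in a few lines. The sketch below assumes a Bernoulli model with parameter `theta` for the fraction of votes equal to 1, and hypothetical noise levels `p10` $= p(y{=}1|x{=}0)$ and `p11` $= p(y{=}1|x{=}1)$ chosen for illustration. Under the spread model the probability of a noisy vote being 1 is `p11 * theta + p10 * (1 - theta)`, so the maximiser of the spread log likelihood is obtained by inverting this linear map at the observed noisy fraction.

```python
import numpy as np

def debias(noisy_fraction, p10, p11):
    """Solve noisy_fraction = p11 * theta + p10 * (1 - theta) for theta, clipped to [0, 1]."""
    theta = (noisy_fraction - p10) / (p11 - p10)
    return float(np.clip(theta, 0.0, 1.0))

def spread_log_likelihood(theta, y, p10, p11):
    """Average log likelihood of the noisy votes y under the spread Bernoulli model."""
    q1 = p11 * theta + p10 * (1 - theta)      # spread model probability that y = 1
    return float(np.mean(y * np.log(q1) + (1 - y) * np.log(1 - q1)))

rng = np.random.default_rng(0)
theta_true, N = 0.3, 200_000
p10, p11 = 0.2, 0.8                           # hypothetical noise levels, not from the paper

x = rng.random(N) < theta_true                # Alice's private votes
u = rng.random(N)
y = np.where(x, u < p11, u < p10).astype(float)   # noisy votes sent to Bob

theta_hat = debias(y.mean(), p10, p11)        # closed-form maximiser of the spread likelihood
grid = np.linspace(0.01, 0.99, 99)
theta_grid = grid[np.argmax([spread_log_likelihood(t, y, p10, p11) for t in grid])]
```

The grid search over `spread_log_likelihood` is included only as a check that the closed-form debiasing and the likelihood maximisation agree; Bob never sees `x`.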
We described an approach to defining a divergence, even when two distributions do not have the same support. The method introduces a ‘noise’ variable to ‘spread’ mass from each distribution to cover the same domain. Previous approaches (Furmston & Barber, 2009; Sønderby et al., 2016) can be seen as special cases. We showed that defining divergences this way enables us to train deterministic generative models using standard ‘likelihood’ based approaches. Spread divergences have deep connections to other approaches to defining measures of disagreement between distributions. In particular, one can view the spread divergence as the probabilistic analogue of MMD, with conditions required for the existence of the spread divergence closely related to the universality requirement on MMD kernels (Micchelli et al., 2006).
Theoretically, we can learn the underlying true data generating process by the use of any valid spread divergence, for example with fixed Gaussian spread noise. In practice, however, the quality of the learned model can depend on the choice of spread noise. In this work we fixed the spread noise, but showed that if we were to learn the spread noise, it would preferentially spread mass across the manifolds defining the two distributions. In future work, we will investigate learning spread noise to maximally discriminate two distributions, which would involve a minimax model training objective, with an inner maximisation over the spread noise and an outer minimisation over the model parameters. This would bring our work much closer to adversarial training methods (Goodfellow, 2017).
- Barber (2012) D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012. ISBN 0521518148, 9780521518147.
- Bermond & Cardoso (1999) O. Bermond and J. F. Cardoso. Approximate likelihood for noisy mixtures. In Proc. ICA ’99, pp. 325–330, 1999.
- Dragomir (2005) S. S. Dragomir. Some general divergence measures for probability distributions. Acta Mathematica Hungarica, 109(4):331–345, Nov 2005. ISSN 1588-2632. doi: 10.1007/s10474-005-0251-6.
- Furmston & Barber (2009) T. Furmston and D. Barber. Solving deterministic policy (PO)MDPs using Expectation-Maximisation and Antifreeze. In First international workshop on learning and data mining for robotics (LEMIR), pp. 56–70, 2009. In conjunction with ECML/PKDD-2009.
- Gerchinovitz et al. (2018) S. Gerchinovitz, P. Ménard, and G. Stoltz. Fano’s inequality for random variables. arXiv, 2018. doi: arXiv:1702.05985v2.
- Goodfellow (2017) I. J. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. CoRR, abs/1701.00160, 2017.
- Hyvärinen (1999) A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, May 1999. ISSN 1045-9227. doi: 10.1109/72.761722.
- Ioffe & Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Kingma & Welling (2013) D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML], 2013.
- Liu et al. (2015) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep Learning Face Attributes in the Wild. Proceedings of International Conference on Computer Vision (ICCV), 2015.
- Micchelli et al. (2006) C. A. Micchelli, Y. Xu, and H. Zhang. Universal Kernels. Journal of Machine Learning Research, 6:2651–2667, 2006.
- Mohamed & Lakshminarayanan (2016) S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint, 2016. doi: arXiv:1610.03483.
- Roth et al. (2017) K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing Training of Generative Adversarial Networks through Regularization. arXiv:1705.09367, 2017.
- Sønderby et al. (2016) C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
- Sriperumbudur et al. (2012) B. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. Lanckriet. On the Empirical Estimation of Integral Probability Metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
- Sriperumbudur et al. (2011) B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, Characteristic Kernels and RKHS Embedding of Measures. J. Mach. Learn. Res., 12:2389–2410, July 2011. ISSN 1532-4435.
- Tipping & Bishop (1999) M. E. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611–622, January 1999.
- Warner (1965) S. L. Warner. Randomised response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
- Winther & Petersen (2007) O. Winther and K. B. Petersen. Bayesian independent component analysis: Variational methods and non-negative decompositions. Digital Signal Processing, 17(5):858 – 872, 2007. ISSN 1051-2004. Special Issue on Bayesian Source Separation.
- Zhang et al. (2018) M. Zhang, T. Bird, R. Habib, T. Xu, and D. Barber. Training Generative Latent Models by Variational f-Divergence Minimization. arXiv preprint, 2018.
Appendix A Spread noise makes distributions more similar
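The data processing inequality for $f$-divergences (see, for example, Gerchinovitz et al. (2018)) states that
$$D_f(\tilde{p}(y)\,\|\,\tilde{q}(y)) \leq D_f(p(x)\,\|\,q(x)).$$
For completeness, we provide here an elementary proof of this result. We consider the following joint distributions
$$q(y, x) = p(y|x)\, q(x), \qquad p(y, x) = p(y|x)\, p(x),$$
whose marginals are the spreaded distributions
$$\tilde{p}(y) = \int p(y|x)\, p(x)\, dx, \qquad \tilde{q}(y) = \int p(y|x)\, q(x)\, dx.$$
The divergence between the two joint distributions is
$$D_f(p(y, x)\,\|\,q(y, x)) = \iint q(y, x)\, f\!\left(\frac{p(y|x)\, p(x)}{p(y|x)\, q(x)}\right) dx\, dy = D_f(p(x)\,\|\,q(x)).$$
The $f$-divergence between two marginal distributions is no larger than the $f$-divergence between the joint (see also Zhang et al. (2018)). To see this, consider
$$D_f(p(u, v)\,\|\,q(u, v)) = \int q(u) \int q(v|u)\, f\!\left(\frac{p(u, v)}{q(u, v)}\right) dv\, du \geq \int q(u)\, f\!\left(\int q(v|u)\, \frac{p(u, v)}{q(u, v)}\, dv\right) du = \int q(u)\, f\!\left(\frac{p(u)}{q(u)}\right) du = D_f(p(u)\,\|\,q(u)),$$
where the inequality follows from Jensen's inequality for the convex function $f$. Intuitively, spreading two distributions increases their overlap, reducing the divergence. When $p$ and $q$ do not have the same support, $D_f(p\,\|\,q)$ can be infinite or not well-defined.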
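The inequality can be spot-checked numerically for the KL divergence (the $f$-divergence with $f(u) = u \log u$) on random discrete distributions. The sketch below draws distributions and noise matrices from a Dirichlet distribution, which is an arbitrary choice made for illustration, and records the gap between the original and spread divergences.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
n = 6
gaps = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(n))
    q = rng.dirichlet(np.ones(n))
    P = rng.dirichlet(np.ones(n), size=n).T   # column j is the noise distribution p(y | x=j)
    gaps.append(kl(p, q) - kl(P @ p, P @ q))  # non-negative if spreading reduces the divergence

min_gap = min(gaps)
```

Across all draws the gap stays non-negative up to floating point error, as the data processing inequality requires.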
Appendix B Injective Linear Mappings
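Consider an injective linear mapping $T: V \to W$ from space $V$ to $W$. From the rank nullity theorem for finite dimensional spaces,
$$\dim(\mathrm{im}(T)) + \dim(\ker(T)) = \dim(V).$$
If $T$ is injective, then $\dim(\ker(T)) = 0$. If $\dim(V) = \dim(W)$ then
$$\dim(\mathrm{im}(T)) = \dim(W).$$
Since $\mathrm{im}(T) \subseteq W$, it must be that $\mathrm{im}(T) = W$. Hence, injective linear maps between two (finite dimensional) spaces of the same dimension are surjective; equivalently, they are invertible.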
Appendix C Spread Divergence for Deterministic Deep Generative Models
Instead of maximising the likelihood, we train an implicit generative model by minimising the spread divergence
According to our general theory,
Typically, the integral over will be intractable and we resort to an unbiased sampled estimate (though see below for Gaussian ). Neglecting constants, the KL divergence estimator is
where is a noisy sample of , namely . In most cases of interest, with non-linear , the distribution is intractable. We therefore use the variational lower bound
Parameterising the variational distribution as a Gaussian,
then we can reparameterise and write
where is the entropy of a Gaussian with covariance