1 Introduction
A divergence $D(p\|q)$ (see, for example, Dragomir (2005)) is a measure of the difference between two distributions $p$ and $q$ with the property
$$D(p\|q) \geq 0, \qquad D(p\|q) = 0 \;\Leftrightarrow\; p = q. \tag{1}$$
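Some of our results are specific to the $f$-divergence, defined as
$$D_f(p\|q) = \mathbb{E}_{q(x)}\left[f\left(\frac{p(x)}{q(x)}\right)\right], \tag{2}$$
where $f$ is a convex function with $f(1) = 0$. An important special case of an $f$-divergence is the well-known Kullback-Leibler divergence $\mathrm{KL}(p\|q) = \mathbb{E}_{p(x)}\left[\log \frac{p(x)}{q(x)}\right]$,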
which is widely used to train models using maximum likelihood. We are interested in situations in which the supports of the two distributions are different, $\mathrm{supp}(p) \neq \mathrm{supp}(q)$. In this case the divergence may not be defined. For example, let $p(x)$ be an empirical data distribution on a continuous dataset $x_1, \ldots, x_N$, $p(x) = \frac{1}{N}\sum_{n=1}^{N} \delta(x - x_n)$, where $\delta(\cdot)$ is the Dirac delta function. For a model $q(x)$ with support $\mathbb{R}$, $\mathrm{KL}(p\|q)$ is then not formally defined. This is a challenge since implicit generative models of the form $q(x) = \int \delta(x - g_\theta(z))\, p(z)\, dz$ only have limited support; in this case maximum likelihood to learn the model parameter $\theta$ is not available and alternative approaches are required. See Mohamed & Lakshminarayanan (2016) for a recent survey.
2 Spread Divergences
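The aim is, from $q(x)$ and $p(x)$, to define new distributions $\tilde{q}(y)$ and $\tilde{p}(y)$ that have the same support (for simplicity, we use univariate $x$, with the extension to the multivariate setting being straightforward). Using the notation $\int_x$ to denote integration $\int (\cdot)\, dx$ for continuous $x$, and $\sum_{x \in \mathcal{X}}$ for discrete $x$ with domain $\mathcal{X}$, we define a random variable $y$ with the same domain as $x$ and distributions
$$\tilde{p}(y) = \int_x p(y|x)\, p(x), \qquad \tilde{q}(y) = \int_x p(y|x)\, q(x), \tag{3}$$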
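where $p(y|x)$ is a ‘noise’ process designed to ‘spread’ the mass of $p$ and $q$ such that $\tilde{p}(y)$ and $\tilde{q}(y)$ have the same support. For example, if we use a Gaussian $p(y|x) = \mathcal{N}(y|x, \sigma^2)$, then $\tilde{p}$ and $\tilde{q}$ both have support $\mathbb{R}$. We therefore use noise with the property that, despite $D(p\|q)$ not existing, $D(\tilde{p}\|\tilde{q})$ does exist, and we define the Spread Divergence
$$\tilde{D}(p\|q) \equiv D(\tilde{p}\|\tilde{q}). \tag{4}$$
Note that this satisfies the divergence requirement $\tilde{D}(p\|q) \geq 0$. The second requirement, $\tilde{D}(p\|q) = 0 \Leftrightarrow p = q$, is guaranteed for certain ‘noise’ processes, as described in section(2.1).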
Spread divergences have many potential applications. For example, for a model $p(x|\theta)$ with parameter $\theta$ and empirical data distribution $\hat{p}(x)$, maximum likelihood training corresponds to minimising $\mathrm{KL}(\hat{p}(x)\|p(x|\theta))$ with respect to $\theta$. For implicit models, however, this divergence does not exist. If a spread divergence exists then, provided that the data is distributed according to the model $p(x|\theta_0)$ for some unknown parameter $\theta_0$, the spread divergence has a minimum at $\theta = \theta_0$. That is (for identifiable models) we can correctly learn the underlying data generating process, even when the original divergence is not defined.
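As a small numerical illustration of this idea, the sketch below takes two point masses, for which the KL divergence is not defined, and spreads both with Gaussian noise. The locations `a` and `b`, the noise scale `sigma`, the grid settings and the function names are all assumptions made for this example and do not come from the text. Under these assumptions the spread distributions are two Gaussians with equal variance, so the spread KL divergence should match the standard closed form $(a-b)^2/(2\sigma^2)$ and vanish only when $a = b$.

```python
import numpy as np


def gaussian_log_pdf(y, mean, sigma):
    """Log density of N(y | mean, sigma^2)."""
    return -0.5 * ((y - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))


def spread_kl_point_masses(a, b, sigma, half_width=12.0, num=20001):
    """Grid estimate of KL(p~ || q~) for p = delta(x - a), q = delta(x - b).

    Gaussian spread noise p(y|x) = N(y | x, sigma^2) turns the two point
    masses into N(y | a, sigma^2) and N(y | b, sigma^2), which share support.
    """
    lo = min(a, b) - half_width * sigma
    hi = max(a, b) + half_width * sigma
    y = np.linspace(lo, hi, num)
    dy = y[1] - y[0]
    log_p = gaussian_log_pdf(y, a, sigma)
    log_q = gaussian_log_pdf(y, b, sigma)
    # Riemann sum of p~(y) * (log p~(y) - log q~(y)) over the grid
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dy)


def closed_form(a, b, sigma):
    """KL between two Gaussians with equal variance sigma^2."""
    return (a - b) ** 2 / (2.0 * sigma ** 2)


if __name__ == "__main__":
    for b in [0.0, 0.5, 1.0]:
        print(b, spread_kl_point_masses(0.0, b, 0.5), closed_form(0.0, b, 0.5))
```

In this toy setting the spread divergence is finite for every `b` and is smallest at `b = a`, which is the behaviour required for learning a model parameter.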
2.1 Noise Requirements for a Spread Divergence
Our main interest is in using noise to define a new divergence in situations in which the original divergence $D(p\|q)$ is itself not defined. For discrete variables $x \in \{1, \ldots, n\}$, $y \in \{1, \ldots, n\}$, the noise $P_{ij} = p(y = i \,|\, x = j)$ must be a distribution, $\sum_i P_{ij} = 1$, $P_{ij} \geq 0$, and
$$\sum_j P_{ij}\, p_j = \sum_j P_{ij}\, q_j \;\; \forall i \quad \Rightarrow \quad p_j = q_j \;\; \forall j, \tag{5}$$
which is equivalent to the requirement that the matrix $P$ is invertible, see appendix(B). There is an additional requirement that the spread divergence exists. In the case of $f$-divergences, the spread divergence exists provided that $\tilde{p}$ and $\tilde{q}$ have the same support. This is guaranteed if
$$\sum_j P_{ij}\, p_j > 0, \qquad \sum_j P_{ij}\, q_j > 0 \quad \forall i, \tag{6}$$
which is satisfied if $P_{ij} > 0$. In general, therefore, there is a space of noise distributions that define a valid spread divergence. The ‘antifreeze’ method of Furmston & Barber (2009) is a special form of spread noise to define a valid Kullback-Leibler divergence (see also Barber (2012)).
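The two discrete requirements (an invertible noise matrix, and strictly positive entries so that the spread distributions share support) can be checked on a toy example. The sketch below is illustrative only: the four-state distributions, the helper names `kl` and `uniform_mixing_noise`, and the mixing level `eps` are assumptions chosen for the example rather than anything specified in the text.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def uniform_mixing_noise(n, eps):
    """Noise matrix P[i, j] = p(y = i | x = j).

    With probability 1 - eps the state is kept, otherwise it is resampled
    uniformly. Columns sum to one; for 0 < eps < 1 all entries are positive
    and the matrix is invertible.
    """
    return (1.0 - eps) * np.eye(n) + eps * np.ones((n, n)) / n


# Two distributions with disjoint support: KL(p || q) is not defined.
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])

P = uniform_mixing_noise(4, eps=0.2)
spread_pq = kl(P @ p, P @ q)   # finite and strictly positive
spread_pp = kl(P @ p, P @ p)   # zero, since the two inputs coincide

# A non-invertible choice: every column is uniform, so all inputs map to
# the same output and the spread divergence cannot tell p and q apart.
P_bad = uniform_mixing_noise(4, eps=1.0)
spread_bad = kl(P_bad @ p, P_bad @ q)

if __name__ == "__main__":
    print(spread_pq, spread_pp, spread_bad)
    print(np.linalg.matrix_rank(P), np.linalg.matrix_rank(P_bad))
```

The contrast between `P` and `P_bad` shows why positivity alone is not enough: the uniform noise gives a well-defined divergence that is zero even though the two distributions differ.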
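For continuous variables, in order that $\tilde{D}(p\|q) = 0 \Rightarrow p = q$, the noise $p(y|x)$, with $\dim(\mathcal{Y}) = \dim(\mathcal{X})$, must be a probability density and satisfy
$$\int p(y|x)\, p(x)\, dx = \int p(y|x)\, q(x)\, dx \;\; \forall y \quad \Rightarrow \quad p(x) = q(x) \;\; \forall x. \tag{7}$$
This is satisfied if there exists a transform $p^{-1}(x|y)$ such that $\int p^{-1}(x'|y)\, p(y|x)\, dy = \delta(x - x')$, where $\delta(\cdot)$ is the Dirac delta function. As for the discrete case, the spread divergence exists provided that $\tilde{p}$ and $\tilde{q}$ have the same support, which is guaranteed if $p(y|x) > 0$. A well known example of such an invertible integral transform is the Weierstrass Transform, in which $p(y|x)$ is a Gaussian centred on $x$, and which has an explicit representation for $p^{-1}$. In general, however, we can demonstrate the existence of a spread divergence without the need for an explicit representation of $p^{-1}$. As we will see below, the noise requirements for defining a valid spread divergence such that $\tilde{D}(p\|q) = 0 \Rightarrow p = q$ are analogous to the requirements on kernels such that the Maximum Mean Discrepancy $\mathrm{MMD}(p, q) = 0 \Rightarrow p = q$, see Sriperumbudur et al. (2011) and Sriperumbudur et al. (2012).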
3 Stationary Spread Divergences
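Consider stationary noise $p(y|x) = K(y - x)$, where $K(x)$ is a probability density function with $K(x) > 0$, $x \in \mathbb{R}$. In this case $\tilde{p}$ and $\tilde{q}$ are defined as a convolution
$$\tilde{p}(y) = \int K(y - x)\, p(x)\, dx = (K * p)(y), \qquad \tilde{q}(y) = \int K(y - x)\, q(x)\, dx = (K * q)(y). \tag{8}$$
Since $K > 0$, $\tilde{p}$ and $\tilde{q}$ are guaranteed to have the same support $\mathbb{R}$. A sufficient condition for the existence of the Fourier Transform $\mathcal{F}\{f\}$ of a function $f(x)$ for real $x$ is that $f$ is absolutely integrable. All distributions $p(x)$ are absolutely integrable, so that both $\mathcal{F}\{p\}$ and $\mathcal{F}\{q\}$ are guaranteed to exist. Assuming $\mathcal{F}\{K\}$ exists, we can use the convolution theorem to write
$$\mathcal{F}\{\tilde{p}\} = \mathcal{F}\{K\}\, \mathcal{F}\{p\}, \qquad \mathcal{F}\{\tilde{q}\} = \mathcal{F}\{K\}\, \mathcal{F}\{q\}. \tag{9}$$
Hence, we can write
$$D(\tilde{p}\|\tilde{q}) = 0 \;\Leftrightarrow\; \tilde{p} = \tilde{q} \;\Leftrightarrow\; \mathcal{F}\{K\}\,\mathcal{F}\{p\} = \mathcal{F}\{K\}\,\mathcal{F}\{q\} \;\Leftrightarrow\; \mathcal{F}\{p\} = \mathcal{F}\{q\} \;\Leftrightarrow\; p = q, \tag{10}$$
where we used the invertibility of the Fourier transform and assumed that $\mathcal{F}\{K\} \neq 0$, or equivalently $\mathcal{F}\{K\} > 0$ (if $\mathcal{F}\{K\}$ can change sign then, by continuity, there must exist a point at which $\mathcal{F}\{K\} = 0$). Hence, provided that $K(x) > 0$ and $\mathcal{F}\{K\} > 0$, then $K(x)$ defines a valid spread divergence. Note that other transforms have a corresponding convolution theorem (this includes the Laplace, Mellin and Hartley transforms) and the above derivation holds, with the requirement that the corresponding transform of $K(x)$ is non-zero. As an example of such a noise process, consider Gaussian noise,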
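$$K(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}, \tag{11}$$
leading (with the convention $\mathcal{F}\{K\}(\omega) = \int K(x)\, e^{-i\omega x}\, dx$) to a positive Fourier Transform:
$$\mathcal{F}\{K\}(\omega) = e^{-\frac{\sigma^2 \omega^2}{2}} > 0. \tag{12}$$
Similarly, for Laplace noise
$$K(x) = \frac{1}{2b}\, e^{-\frac{|x|}{b}}, \qquad \mathcal{F}\{K\}(\omega) = \frac{1}{1 + b^2\omega^2} > 0. \tag{13}$$
Since $K > 0$ and $\mathcal{F}\{K\} > 0$, this also defines a valid spread divergence over $\mathbb{R}$.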
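The positivity of the Fourier transforms of Gaussian and Laplace noise can be checked numerically. The following is a rough sketch, assuming the unnormalised convention $\mathcal{F}\{K\}(\omega) = \int K(x) e^{-i\omega x} dx$ and a plain grid sum for the integral; the function names, the scale values 0.7 and 1.3, and the grid settings are choices made for this example only.

```python
import numpy as np


def fourier_transform_even(kernel, omegas, half_width=40.0, num=400001):
    """Grid estimate of F{K}(w) = int K(x) exp(-i w x) dx for an even kernel K.

    For even K the imaginary part cancels, leaving int K(x) cos(w x) dx.
    """
    x = np.linspace(-half_width, half_width, num)
    dx = x[1] - x[0]
    k = kernel(x)
    return np.array([np.sum(k * np.cos(w * x)) * dx for w in omegas])


def gaussian_kernel(sigma):
    return lambda x: np.exp(-0.5 * (x / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)


def laplace_kernel(b):
    return lambda x: np.exp(-np.abs(x) / b) / (2.0 * b)


omegas = np.linspace(-5.0, 5.0, 11)

sigma, b = 0.7, 1.3
ft_gauss = fourier_transform_even(gaussian_kernel(sigma), omegas)
ft_laplace = fourier_transform_even(laplace_kernel(b), omegas)

# Closed forms under the same convention
ft_gauss_exact = np.exp(-0.5 * sigma ** 2 * omegas ** 2)
ft_laplace_exact = 1.0 / (1.0 + b ** 2 * omegas ** 2)

if __name__ == "__main__":
    print(np.max(np.abs(ft_gauss - ft_gauss_exact)))
    print(np.max(np.abs(ft_laplace - ft_laplace_exact)))
```

Both numerical transforms should agree with the closed forms and stay strictly positive on the tested frequencies, which is the condition used in the argument above.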
3.1 Invertible Mappings
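Consider $p(y|x) = K(y - f(x))$ for strictly monotonic $f$. Then, using the change of variables $z = f(x)$,
$$\tilde{p}(y) = \int K(y - f(x))\, p(x)\, dx = \int K(y - z)\, \frac{p(f^{-1}(z))}{J(f^{-1}(z))}\, dz, \tag{14}$$
where $J(x) = \left|\frac{df}{dx}\right|$ is the (absolute) Jacobian of $f$. For distributions with bounded domain, for example $x \in [0, 1]$, we can use a logit function, $f(x) = \log \frac{x}{1 - x}$, which maps the interval $[0, 1]$ to $\mathbb{R}$. Using then, for example, Gaussian spread noise $p(y|x) = \mathcal{N}(y \,|\, f(x), \sigma^2)$, both $\tilde{p}$ and $\tilde{q}$ have support $\mathbb{R}$. If $\tilde{D}(p\|q)$ is zero then $p = q$ on the domain $[0, 1]$.
3.2 Maximising the Spread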
From the data processing inequality (see appendix(A)), spread noise will always decrease the divergence, $D(\tilde{p}\|\tilde{q}) \leq D(p\|q)$. If we are to use a spread divergence to train a model, there is the danger that adding too much noise may make the spreaded empirical distribution and spreaded model distribution so similar that it becomes difficult to numerically distinguish them, impeding training. In general, therefore, it would be useful to add noise such that we define a valid spread divergence, but can maximally still discern the difference between the two distributions. To gain intuition, we define $p$ and $q$ to generate data in separated linear subspaces, $x = a + Az$ for $p$ and $x = b + Bz$ for $q$, with $z \sim \mathcal{N}(0, I)$. Using Gaussian spread, $p(y|x) = \mathcal{N}(y \,|\, x, \Sigma)$, what is the optimal $\Sigma$ that maximises the divergence? Clearly, as $\Sigma$ tends to zero, the divergence increases to infinity, meaning that we must at least constrain the entropy of $p(y|x)$ to be finite. In this case the spreaded distributions are given by
$$\tilde{p}(y) = \mathcal{N}(y \,|\, a,\, AA^\top + \Sigma), \qquad \tilde{q}(y) = \mathcal{N}(y \,|\, b,\, BB^\top + \Sigma). \tag{15}$$
We define a simple Factor Analysis noise model with $\Sigma = \sigma^2 I + uu^\top$, where $\sigma^2$ is fixed and $u^\top u = 1$. The entropy of $p(y|x)$ is then fixed and independent of $u$. Also, for simplicity, we assume $A = B$. It is straightforward to show that the spread divergence is maximised for $u$ pointing orthogonal to the vector $(AA^\top + \sigma^2 I)^{-1}(b - a)$. Then $u$ optimally points along the direction in which the support lies. The support of $p(y|x)$ must be the whole space but to maximise the divergence the noise preferentially spreads along directions defined by $p$ and $q$, see figure(1).
4 Mercer Spread Divergence
We showed in section(3) how to define one form of spread divergence, with the result that stationary noise distributions must have strictly positive Fourier Transforms. A natural question is whether, for continuous , there are other easily definable noise distributions that are nonstationary. To examine this question, let , and be square integrable, . We define Mercer noise , where . For strictly positive definite , by Mercer’s Theorem, it admits an expansion
(16) 
where the eigenfunctions
form a complete orthogonal set of and all , see for example Sriperumbudur et al. (2011). Then(17) 
and is equivalent to the requirement
(18) 
Multiplying both sides by and integrating over we obtain
(19) 
If and are in then, from Mercer’s Theorem, they can be expressed as orthogonal expansions
(20) 
Then, equation(19) is
(21) 
which reduces to (using orthonormality), . Hence, provided is square integrable on and strictly positive definite, then defines valid spread noise. For example, defines a strictly positive nonstationary square integrable kernel on . Provided and are in then the spread noise defines a valid spread divergence.
5 Applications
We demonstrate using a spread divergence to train implicit models
$$p_\theta(x) = \int \delta(x - g_\theta(z))\, p(z)\, dz, \tag{22}$$
where $\theta$ are the parameters of the encoder $g$. We show that, despite the likelihood not being defined, we can nevertheless successfully train the models using an EM style algorithm, see for example Barber (2012). Finally we discuss a link between spread divergences and privacy preservation.
5.1 Deterministic Linear Latent Model
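For observation noise $\gamma$, the Probabilistic PCA model (Tipping & Bishop, 1999) for $X$-dimensional observations $x$ and $Z$-dimensional latent $z$ is
$$x = Fz + \gamma\epsilon, \quad z \sim \mathcal{N}(0, I_Z), \quad \epsilon \sim \mathcal{N}(0, I_X), \qquad p(x) = \mathcal{N}(x \,|\, 0,\, FF^\top + \gamma^2 I_X). \tag{23}$$
When $\gamma = 0$, the generative mapping from $z$ to $x$ is deterministic and the model $p(x)$ has support only on a subset of $\mathbb{R}^X$ and the data likelihood is in general not defined. In the following we consider general $\gamma$, setting $\gamma$ to zero at the end of the calculation. To fit the model to iid data $\{x_1, \ldots, x_N\}$ using maximum likelihood, the only information required from the dataset is the data covariance $\hat{\Sigma}$. The maximum likelihood solution for PPCA is then $F = U_Z (\Lambda_Z - \gamma^2 I_Z)^{1/2} R$, where $\Lambda_Z$, $U_Z$ are the $Z$ largest eigenvalues, eigenvectors of $\hat{\Sigma}$; $R$ is an arbitrary orthogonal matrix. Using spread noise $p(y|x) = \mathcal{N}(y \,|\, x, \sigma^2 I_X)$, the spreaded distribution is a Gaussian
$$\tilde{p}(y) = \mathcal{N}(y \,|\, 0,\, FF^\top + (\gamma^2 + \sigma^2) I_X). \tag{24}$$
Thus, $\tilde{p}(y)$ is of the same form as PPCA, albeit with an inflated covariance matrix. Adding Gaussian spread noise to the data also simply inflates the sample covariance to $\hat{\Sigma}' = \hat{\Sigma} + \sigma^2 I_X$. Since the eigenvalues of $\hat{\Sigma}'$ are simply $\Lambda' = \Lambda + \sigma^2 I_X$, with unchanged eigenvectors, the optimal deterministic ($\gamma = 0$) latent linear model has solution $F = U_Z (\Lambda'_Z - \sigma^2 I_Z)^{1/2} R = U_Z \Lambda_Z^{1/2} R$. Unsurprisingly, this is the standard PCA solution; however, the derivation is nonstandard since the likelihood of the deterministic latent linear model is not defined. Nevertheless, using the spread divergence, we learn a sensible model and recover the true data generating process if the data were exactly generated according to the deterministic model.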
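The argument above can be traced numerically. The sketch below is an illustration under assumptions of my own: synthetic zero-mean data generated exactly on a two-dimensional subspace of a five-dimensional space, spread variance 0.5, the rotation fixed to the identity, and a hypothetical helper named `spread_ppca`. It inflates the sample covariance by the spread variance, applies the standard PPCA eigen-solution with that variance as the total noise, and checks that the result reproduces the covariance of the noiseless data.

```python
import numpy as np


def spread_ppca(data_cov, latent_dim, spread_var):
    """Deterministic (gamma = 0) linear model fitted via Gaussian spread noise.

    The sample covariance is inflated by spread_var, then the PPCA maximum
    likelihood solution is applied with noise variance spread_var and R = I.
    """
    inflated = data_cov + spread_var * np.eye(data_cov.shape[0])
    eigvals, eigvecs = np.linalg.eigh(inflated)          # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:latent_dim]         # indices of the largest
    lam, U = eigvals[top], eigvecs[:, top]
    # Subtracting spread_var undoes the inflation of the leading eigenvalues.
    return U * np.sqrt(np.maximum(lam - spread_var, 0.0))


rng = np.random.default_rng(0)
F_true = rng.normal(size=(5, 2))       # hypothetical ground-truth loading matrix
z = rng.normal(size=(2, 2000))
x = F_true @ z                         # noiseless data: support is a 2-d subspace
cov = x @ x.T / x.shape[1]             # zero-mean sample covariance (rank 2)

F_est = spread_ppca(cov, latent_dim=2, spread_var=0.5)

if __name__ == "__main__":
    print(np.max(np.abs(F_est @ F_est.T - cov)))
```

Because the loading matrix is only identified up to a rotation, the comparison is made on `F_est @ F_est.T` rather than on `F_est` itself.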
5.2 Deterministic Independent Components Analysis
ICA corresponds to the model $p(x, z) = p(x|z) \prod_i p(z_i)$, where the independent components $z_i$ follow a non-Gaussian distribution. For Gaussian noise ICA an observation $x$ is assumed to be generated by the process $p(x|z) = \prod_j \mathcal{N}(x_j \,|\, g_j(z), \gamma^2)$, where $g_j(z)$ mixes the independent latent process $z$. In standard linear ICA, $g_j(z) = a_j^\top z$ where $a_j$ is the $j$-th column on the mixing matrix $A$. For small observation noise $\gamma^2$, the EM algorithm (Bermond & Cardoso, 1999) becomes ineffective. To see this, consider $X = Z$ and invertible mixing matrix $A$, $x = Az$. At iteration $k$ the EM algorithm has an estimate $A_k$ of the mixing matrix. The M-step updates $A_k$ to
$$A_{k+1} = \mathbb{E}\left[xz^\top\right] \mathbb{E}\left[zz^\top\right]^{-1}, \tag{25}$$
where, for noiseless data ($\gamma = 0$),
$$\mathbb{E}\left[xz^\top\right] = \hat{S} A_k^{-\top}, \qquad \mathbb{E}\left[zz^\top\right] = A_k^{-1} \hat{S} A_k^{-\top}, \tag{26}$$
where $\hat{S} = \frac{1}{N}\sum_n x_n x_n^\top$ is the moment matrix of the data. Thus, $A_{k+1} = \hat{S} A_k^{-\top} \left(A_k^{-1} \hat{S} A_k^{-\top}\right)^{-1} = A_k$ and the algorithm ‘freezes’. Similarly, for low noise $\gamma \ll 1$ progress critically slows down. Whilst overrelaxation methods, see for example Winther & Petersen (2007), can help in the case of small noise, for zero noise $\gamma = 0$, overrelaxation is of no benefit.
Figure 2: (a) Results for independent zero mean unit variance Laplace components on $z$. The elements of $A$ used to generate the data are uniform random . We use , samples and 2000 EM iterations to estimate the mixing matrix $A$. The relative error is averaged over all elements of $A$ and 10 random experiments. We also plot standard errors around the mean relative error. In blue we show the error in learning the underlying parameter using the standard EM algorithm. As expected, as $\gamma \to 0$, the error blows up as the EM algorithm ‘freezes’. In orange we plot the error for EM using spread noise, as described in section(5.2.1); no slowing down appears as the model noise decreases. As the model noise increases, the quality of the learned model under spread noise decreases gradually. In (b) we show that, apart from very small , the error for the spread EM algorithm is lower than for the standard EM algorithm. Here , , , , with 500 EM updates used. Results are averaged over 50 runs of randomly drawn $A$.
5.2.1 Healing Critical Slowing Down
To deal with small noise and the limiting case of a deterministic model ($\gamma = 0$), we consider Gaussian spread noise to give
(27) 
The empirical distribution is replaced by the spreaded empirical distribution
(28) 
The M-step has the same form as equation(25) but with modified statistics
(29) 
(30) 
The E-step optimally sets
(31) 
where is a normaliser and
(32) 
We can rewrite the expectations required for the E-step of the EM algorithm as
(33) 
(34) 
Generally the posterior will be peaked around and writing the expectations with respect to allows for an effective sampling approximation focussed on regions of high probability. We implement this update by drawing samples from and, for each sample, we draw samples from . This scheme has the advantage over more standard variational approaches, see for example Winther & Petersen (2007), in that we obtain a consistent estimator of the M-step update for $A$. (We focus on demonstrating how the spread divergence heals critical slowing down, rather than deriving a state-of-the-art approximation of the posterior. The importance sampling approach has fast run time and works well, even for large latent dimensions. We also implemented a variational factorised approximation of the posterior but found this to be relatively slow and ineffective. A variational Gaussian approximation of the posterior improves on the factorised approximation, but is still slow compared to the importance sampling scheme.) We show results for a toy experiment in figure(2), learning the underlying mixing matrix in a deterministic non-square setting. Note that standard algorithms such as FastICA (Hyvärinen, 1999) fail in this setting. The noise value is set to , for estimated mixing matrix of the underlying deterministic model , . The EM algorithm learns a good approximation of the unknown mixing matrix and latent components , with no critical slowing down.
5.3 Training Implicit Nonlinear Models
For a deterministic nonlinear implicit model, we set $p_\theta(x|z) = \delta(x - g_\theta(z))$ and parameterise $g_\theta$ by a deep neural network. The likelihood equation(22) is in general intractable and it is natural to consider a variational approximation (Kingma & Welling, 2013),
$$\log p_\theta(x) \geq -\int q_\phi(z|x)\left(\log q_\phi(z|x) - \log\left(p_\theta(x|z)\, p(z)\right)\right) dz. \tag{35}$$
However, since $p_\theta(x|z) = \delta(x - g_\theta(z))$, this bound is not well defined. Instead, we minimise the spread divergence between the spreaded data and model distributions. The approach is a straightforward extension of the standard variational autoencoder and in appendix(C) we provide details of how to do this, along with higher resolution images of samples from the generative model. We dub this model and associated spread divergence training the ‘$\delta$VAE’. As a demonstration, we trained a generative network on the MNIST dataset, appendix(D). We used Gaussian spread noise for the $\delta$VAE and observation noise for the standard noisy VAE. The network contains 8 layers, each layer with 400 units and relu activation function and latent dimension . We also trained a deep convolutional generative model on the CelebA dataset (Liu et al., 2015), see figure(3) and appendix(E). We preprocess CelebA images by first taking 140x140 centre crops and then resizing to 64x64. Pixel values were then rescaled to lie in . We use Gaussian spread noise for the $\delta$VAE and observation noise for the standard noisy VAE.
5.4 Privacy
Spread divergences have an obvious connection to privacy preservation and Randomised Response (Warner, 1965). For example, Alice might have a collection of binary votes , and Bob might like to learn the fraction of votes that are 1. However, Alice does not want to send to Bob the raw data . Letting , Alice can find the fraction of votes that are 1 by performing maximum (log) likelihood
(36) 
We can equivalently write this as finding that minimises where the model is given by and the data distribution is given by
(37) 
To preserve privacy, instead of sending , Alice can send to Bob, where is sampled from the distribution so that if she draws a sample with probability and with probability . This generates a noisy data distribution . Since
(38) 
this suggests that we could attempt to learn from the noisy data by . Bob can equivalently maximise the spread log likelihood
(39) 
where . Using , , , the maximum of the spread log likelihood is at . This simple example recovers the standard ‘debiasing’ procedure in (Warner, 1965). However, the spread divergence suggests a general strategy to perform machine learning for any model given only noisy data, namely . In the limit of a large number of samples, the accuracy with which the true parameter is recovered improves (although privacy decreases since the ability to recover the from a set of noisy samples correspondingly increases).
6 Summary
We described an approach to defining a divergence, even when two distributions do not have the same support. The method introduces a ‘noise’ variable to ‘spread’ mass from each distribution to cover the same domain. Previous approaches (Furmston & Barber, 2009; Sønderby et al., 2016) can be seen as special cases. We showed that defining divergences this way enables us to train deterministic generative models using standard ‘likelihood’ based approaches. Spread divergences have deep connections to other approaches to define measures of disagreement between distributions. In particular, one can view the spread divergence as the probabilistic analogue of MMD, with conditions required for the existence of the spread divergence closely related to the universality requirement on MMD kernels (Micchelli et al., 2006).
Theoretically, we can learn the underlying true data generating process by the use of any valid spread divergence, for example for fixed Gaussian spread noise. In practice, however, the quality of the learned model can depend on the choice of spread noise. In this work we fixed the spread noise, but showed that if we were to learn the spread noise, it would preferentially spread mass across the manifolds defining the two distributions. In future work, we will investigate learning spread noise to maximally discriminate two distributions, which would involve a minimax model training objective, with an inner maximisation over the spread noise and an outer minimisation over the model parameters. This would bring our work much closer to adversarial training methods (Goodfellow, 2017).
References
 Barber (2012) D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012. ISBN 0521518148, 9780521518147.
 Bermond & Cardoso (1999) O. Bermond and J. F. Cardoso. Approximate likelihood for noisy mixtures. In Proc. ICA ’99, pp. 325–330, 1999.

 Dragomir (2005) S. S. Dragomir. Some general divergence measures for probability distributions. Acta Mathematica Hungarica, 109(4):331–345, Nov 2005. ISSN 15882632. doi: 10.1007/s1047400502516.
 Furmston & Barber (2009) T. Furmston and D. Barber. Solving deterministic policy (PO)MDPs using Expectation-Maximisation and Antifreeze. In First international workshop on learning and data mining for robotics (LEMIR), pp. 56–70, 2009. In conjunction with ECML/PKDD2009.
 Gerchinovitz et al. (2018) S. Gerchinovitz, P. Ménard, and G. Stoltz. Fano’s inequality for random variables. arXiv, 2018. doi: arXiv:1702.05985v2.
 Goodfellow (2017) I. J. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. CoRR, abs/1701.00160, 2017.
 Hyvärinen (1999) A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, May 1999. ISSN 10459227. doi: 10.1109/72.761722.
 Ioffe & Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Kingma & Welling (2013) D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML], 2013.

 Liu et al. (2015) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Micchelli et al. (2006) C. A. Micchelli, Y. Xu, and H. Zhang. Universal Kernels. Journal of Machine Learning Research, 6:2651–2667, 2006.
 Mohamed & Lakshminarayanan (2016) S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint, 2016. doi: arXiv:1610.03483.
 Roth et al. (2017) K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing Training of Generative Adversarial Networks through Regularization . arXiv:1705.09367, 2017.
 Sønderby et al. (2016) C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
 Sriperumbudur et al. (2012) B. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. Lanckriet. On the Empirical Estimation of Integral Probability Metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
 Sriperumbudur et al. (2011) B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, Characteristic Kernels and RKHS Embedding of Measures. J. Mach. Learn. Res., 12:2389–2410, July 2011. ISSN 15324435.
 Tipping & Bishop (1999) M. E. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 21/3:611–622, January 1999.
 Warner (1965) S. L. Warner. Randomised response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
 Winther & Petersen (2007) O. Winther and K. B. Petersen. Bayesian independent component analysis: Variational methods and nonnegative decompositions. Digital Signal Processing, 17(5):858 – 872, 2007. ISSN 10512004. Special Issue on Bayesian Source Separation.
 Zhang et al. (2018) M. Zhang, T. Bird, R. Habib, T. Xu, and D. Barber. Training Generative Latent Models by Variational Divergence Minimization. arXiv preprint, 2018.
Appendix A Spread noise makes distributions more similar
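The data processing inequality for $f$-divergences (see for example Gerchinovitz et al. (2018)) states that $D_f(\tilde{p}(y)\|\tilde{q}(y)) \leq D_f(p(x)\|q(x))$. For completeness, we provide here an elementary proof of this result. We consider the following joint distributions
$$q(y, x) = p(y|x)\, q(x), \qquad p(y, x) = p(y|x)\, p(x), \tag{40}$$
whose marginals are the spreaded distributions
$$\tilde{p}(y) = \int_x p(y|x)\, p(x), \qquad \tilde{q}(y) = \int_x p(y|x)\, q(x). \tag{41}$$
The divergence between the two joint distributions is
$$D_f(p(y, x)\|q(y, x)) = \int_{x, y} q(y, x)\, f\!\left(\frac{p(y|x)\, p(x)}{p(y|x)\, q(x)}\right) = D_f(p(x)\|q(x)). \tag{42}$$
The divergence between two marginal distributions is no larger than the divergence between the joint (see also Zhang et al. (2018)). To see this, consider
$$D_f(p(u, v)\|q(u, v)) = \int q(u) \int q(v|u)\, f\!\left(\frac{p(u, v)}{q(u, v)}\right) dv\, du \;\geq\; \int q(u)\, f\!\left(\int q(v|u)\, \frac{p(u, v)}{q(v|u)\, q(u)}\, dv\right) du = \int q(u)\, f\!\left(\frac{p(u)}{q(u)}\right) du = D_f(p(u)\|q(u)),$$
where the inequality follows from Jensen's inequality for the convex function $f$. Hence,
$$D_f(\tilde{p}(y)\|\tilde{q}(y)) \leq D_f(p(y, x)\|q(y, x)) = D_f(p(x)\|q(x)). \tag{43}$$
Intuitively, spreading two distributions increases their overlap, reducing the divergence. When $p$ and $q$ do not have the same support, $D_f(p\|q)$ can be infinite or not well-defined.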
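The inequality can also be checked empirically for the KL divergence on discrete variables. The sketch below is only a sanity check under assumed settings: strictly positive random distributions on six states, random column-stochastic noise matrices, and hypothetical helper names. It is not a substitute for the proof.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def random_noise_matrix(rng, n):
    """Random column-stochastic matrix with P[i, j] = p(y = i | x = j)."""
    P = rng.random((n, n)) + 1e-3
    return P / P.sum(axis=0, keepdims=True)


def data_processing_gaps(num_trials=200, n=6, seed=0):
    """Return KL(p || q) - KL(P p || P q) over random trials."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(num_trials):
        # Mixing with the uniform distribution keeps every entry positive.
        p = 0.99 * rng.dirichlet(np.ones(n)) + 0.01 / n
        q = 0.99 * rng.dirichlet(np.ones(n)) + 0.01 / n
        P = random_noise_matrix(rng, n)
        gaps.append(kl(p, q) - kl(P @ p, P @ q))
    return np.array(gaps)


if __name__ == "__main__":
    gaps = data_processing_gaps()
    print(gaps.min(), gaps.mean())
```

Every gap should be non-negative up to floating point error, in line with equation(43).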
Appendix B Injective Linear Mappings
Consider an injective linear mapping $T: V \to W$ from space $V$ to $W$. From the rank nullity theorem for finite dimensional spaces,
$$\dim(V) = \dim(\mathrm{Im}(T)) + \dim(\mathrm{Ker}(T)). \tag{44}$$
If $T$ is injective, then $\dim(\mathrm{Ker}(T)) = 0$. If $\dim(V) = \dim(W)$ then
$$\dim(\mathrm{Im}(T)) = \dim(W). \tag{45}$$
Since $\mathrm{Im}(T) \subseteq W$, it must be that $\mathrm{Im}(T) = W$. Hence, injective linear maps between two (finite dimensional) spaces of the same dimension are surjective; equivalently, they are invertible.
In the context of spread noise, since the domains of $x$ and $y$ are equal and $\tilde{p}$ is defined through a linear transformation of $p$, the requirement in (5) that the mapping is injective is equivalent to the requirement that the mapping is invertible.
Appendix C Spread Divergence for Deterministic Deep Generative Models
Instead of maximising the likelihood, we train an implicit generative model by minimising the spread divergence
(46) 
where
(47) 
and
(48) 
According to our general theory,
(49) 
Here
(50) 
Typically, the integral over will be intractable and we resort to an unbiased sampled estimate (though see below for Gaussian ). Neglecting constants, the KL divergence estimator is
(51) 
where is a noisy sample of , namely . In most cases of interest, with nonlinear , the distribution is intractable. We therefore use the variational lower bound
(52) 
Parameterising the variational distribution as a Gaussian,
(53) 
then we can reparameterise and write
(54) 
where is the entropy of a Gaussian with covariance