Spread Divergences

11/21/2018 ∙ David Barber, et al.

For distributions p and q with different support, the divergence generally will not exist. We define a spread divergence on modified p and q and describe sufficient conditions for the existence of such a divergence. We give examples of using a spread divergence to train implicit generative models, including linear models (Principal Components Analysis and Independent Components Analysis) and non-linear models (Deep Generative Networks).


1 Introduction

A divergence (see, for example, Dragomir (2005)) is a measure of the difference between two distributions $p$ and $q$ with the property

$$ D(p,q) \ge 0, \qquad D(p,q) = 0 \iff p = q. \tag{1} $$

Some of our results are specific to the $f$-divergence, defined as

$$ D_f(p,q) = \mathbb{E}_{q(x)}\!\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right], \tag{2} $$

where $f$ is a convex function with $f(1) = 0$. An important special case of an $f$-divergence is the well-known Kullback-Leibler (KL) divergence $\mathrm{KL}(p\,\|\,q) = \mathbb{E}_{p(x)}\!\left[ \log \frac{p(x)}{q(x)} \right]$, which is widely used to train models using maximum likelihood. We are interested in situations in which the supports of the two distributions are different, $\mathrm{supp}(p) \neq \mathrm{supp}(q)$; in this case the divergence may not be defined. For example, let $p$ be the empirical data distribution on a continuous dataset $x_1, \dots, x_N$,

$$ p(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n), $$

where $\delta$ is the Dirac delta function. For a model $q$ with support $\mathbb{R}$, $\mathrm{KL}(p\,\|\,q)$ is then not formally defined. This is a challenge since implicit generative models of the form $q_\theta(x) = \int \delta(x - g_\theta(z))\, p(z)\, dz$ only have limited support; in this case maximum likelihood to learn the model parameter $\theta$ is not available and alternative approaches are required; see Mohamed & Lakshminarayanan (2016) for a recent survey.

2 Spread Divergences

The aim is, from $p(x)$ and $q(x)$, to define new distributions $\tilde p(y)$ and $\tilde q(y)$ that have the same support (for simplicity we use univariate $x$, with the extension to the multivariate setting being straightforward). Using the notation $\int_x$ to denote $\int (\cdot)\, dx$ for continuous $x$, and $\sum_{x \in \mathcal{X}}$ for discrete $x$ with domain $\mathcal{X}$, we define a random variable $y$ with the same domain as $x$ and distributions

$$ \tilde p(y) = \int_x p(y|x)\, p(x), \qquad \tilde q(y) = \int_x p(y|x)\, q(x), \tag{3} $$

where $p(y|x)$ is a 'noise' process designed to 'spread' the mass of $p$ and $q$ such that $\tilde p$ and $\tilde q$ have the same support. For example, if we use a Gaussian $p(y|x) = \mathcal{N}(y \mid x, \sigma^2)$, then $\tilde p$ and $\tilde q$ both have support $\mathbb{R}$. We therefore use noise with the property that, despite $D(p,q)$ not existing, $D(\tilde p, \tilde q)$ does exist, and we define the Spread Divergence

$$ \tilde D(p,q) \equiv D(\tilde p, \tilde q). \tag{4} $$

Note that this satisfies the divergence requirement $\tilde D(p,q) \ge 0$. The second requirement, $\tilde D(p,q) = 0 \Rightarrow p = q$, is guaranteed for certain 'noise' processes, as described in section(2.1).

Spread divergences have many potential applications. For example, for a model $p_\theta(x)$ with parameter $\theta$ and empirical data distribution $p(x)$, maximum likelihood training corresponds to minimising $\mathrm{KL}(p\,\|\,p_\theta)$ with respect to $\theta$. For implicit models, however, this divergence does not exist. If a spread divergence exists then, provided that the data is distributed according to the model, $p(x) = p_{\theta_0}(x)$ for some unknown parameter $\theta_0$, the spread divergence has a minimum at $\theta = \theta_0$. That is (for identifiable models), we can correctly learn the underlying data generating process, even when the original divergence is not defined.
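As a minimal illustration of the definition, the following sketch (our own, not code from the paper; all names and values are arbitrary choices) estimates a spread KL divergence by Monte Carlo: two sample-based distributions, each a mixture of delta functions, are spreaded with fixed Gaussian noise, and the KL divergence between the resulting Gaussian mixtures is estimated from samples.

```python
import numpy as np
from scipy.special import logsumexp

def log_spread_density(y, centres, sigma):
    """log of (1/N) sum_n N(y | centre_n, sigma^2): a delta mixture
    convolved with Gaussian 'spread' noise, evaluated at each y."""
    d2 = (y[:, None] - centres[None, :]) ** 2
    log_k = -0.5 * d2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    return logsumexp(log_k, axis=1) - np.log(len(centres))

rng = np.random.default_rng(0)
sigma = 0.5                                # spread noise standard deviation
x_data = rng.standard_normal(500)          # samples defining the empirical p
x_model = 0.3 + rng.standard_normal(500)   # samples from an implicit model

# KL(p~ || q~) estimated by sampling y ~ p~ and averaging the log ratio
y = x_data + sigma * rng.standard_normal(500)
kl = np.mean(log_spread_density(y, x_data, sigma)
             - log_spread_density(y, x_model, sigma))
print(f"estimated spread KL: {kl:.3f}")    # positive, since the means differ
```

The KL divergence between the original delta mixtures would not be defined (they share no atoms); the spreaded version is finite yet still distinguishes the two distributions.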

2.1 Noise Requirements for a Spread Divergence

Our main interest is in using noise to define a new divergence in situations in which the original divergence is itself not defined. For discrete variables $x \in \{1, \dots, n\}$, $y \in \{1, \dots, n\}$, the noise must be a distribution, $p(y|x) \ge 0$, $\sum_y p(y|x) = 1$, and satisfy

$$ \tilde p(y) = \tilde q(y) \;\; \forall y \;\Rightarrow\; p(x) = q(x) \;\; \forall x, \tag{5} $$

which is equivalent to the requirement that the matrix $P_{yx} = p(y|x)$ is invertible, see appendix(B). There is an additional requirement that the spread divergence exists. In the case of $f$-divergences, the spread divergence exists provided that $\tilde p$ and $\tilde q$ have the same support. This is guaranteed if

$$ \tilde p(y) > 0, \quad \tilde q(y) > 0 \quad \forall y, \tag{6} $$

which is satisfied if $p(y|x) > 0$ for all $x, y$. In general, therefore, there is a space of noise distributions that define a valid spread divergence. The 'antifreeze' method of Furmston & Barber (2009) is a special form of spread noise to define a valid Kullback-Leibler divergence (see also Barber (2012)).

For continuous variables, in order that $\tilde p = \tilde q \Rightarrow p = q$, the noise $p(y|x) = K(y,x)$, with

$$ \tilde p(y) = \int K(y,x)\, p(x)\, dx, $$

must be a probability density and satisfy

$$ \int K(y,x)\, p(x)\, dx = \int K(y,x)\, q(x)\, dx \;\; \forall y \;\Rightarrow\; p = q. \tag{7} $$

This is satisfied if there exists a transform $L$ such that $\int L(z,y)\, K(y,x)\, dy = \delta(z - x)$, where $\delta$ is the Dirac delta function. As for the discrete case, the spread divergence exists provided that $\tilde p$ and $\tilde q$ have the same support, which is guaranteed if $K(y,x) > 0$ for all $x, y$. A well-known example of such an invertible integral transform is the Weierstrass transform $\tilde p(y) = \int \frac{1}{\sqrt{4\pi}} e^{-(y-x)^2/4}\, p(x)\, dx$, which has an explicit representation for the inverse $L$. In general, however, we can demonstrate the existence of a spread divergence without the need for an explicit representation of the inverse. As we will see below, the noise requirements for defining a valid spread divergence such that $D(\tilde p, \tilde q) = 0 \Rightarrow p = q$ are analogous to the requirements on kernels such that the Maximum Mean Discrepancy satisfies $\mathrm{MMD}(p,q) = 0 \Rightarrow p = q$; see Sriperumbudur et al. (2011) and Sriperumbudur et al. (2012).

3 Stationary Spread Divergences

Consider stationary noise $p(y|x) = K(y - x)$, where $K$ is a probability density function with $K(x) \ge 0$, $\int K(x)\, dx = 1$. In this case $\tilde p$ and $\tilde q$ are defined as a convolution:

$$ \tilde p(y) = \int K(y - x)\, p(x)\, dx = (K * p)(y). \tag{8} $$

For $K(x) > 0$, $\tilde p$ and $\tilde q$ are guaranteed to have the same support, $\mathbb{R}$. A sufficient condition for the existence of the Fourier transform $\hat f(\omega) = \int f(x)\, e^{-i\omega x}\, dx$ of a function $f$, for real $\omega$, is that $f$ is absolutely integrable. All distributions are absolutely integrable, so that both $\hat p$ and $\hat q$ are guaranteed to exist. Assuming $\hat K$ exists, we can use the convolution theorem to write

$$ \hat{\tilde p}(\omega) = \hat K(\omega)\, \hat p(\omega). \tag{9} $$

Hence, if $\tilde p = \tilde q$, we can write

$$ p(x) = \frac{1}{2\pi} \int \frac{\hat{\tilde p}(\omega)}{\hat K(\omega)}\, e^{i\omega x}\, d\omega = \frac{1}{2\pi} \int \frac{\hat{\tilde q}(\omega)}{\hat K(\omega)}\, e^{i\omega x}\, d\omega = q(x), \tag{10} $$

where we used the invertibility of the Fourier transform and assumed that $\hat K(\omega) \neq 0$ for all $\omega$, or equivalently $\hat K(\omega) > 0$ (if $\hat K$ could change sign then, by continuity, there must exist a point at which $\hat K(\omega) = 0$). Hence, provided that $K(x) > 0$ and $\hat K(\omega) > 0$, then $p(y|x) = K(y - x)$ defines a valid spread divergence. Note that other transforms have a corresponding convolution theorem (this includes the Laplace, Mellin and Hartley transforms) and the above derivation holds, with the requirement that the corresponding transform of $K$ is non-zero. As an example of such a noise process, consider Gaussian noise,

$$ K(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}, \tag{11} $$

leading to a positive Fourier transform:

$$ \hat K(\omega) = e^{-\frac{1}{2}\sigma^2 \omega^2}. \tag{12} $$

Similarly, for Laplace noise,

$$ K(x) = \frac{1}{2b}\, e^{-\frac{|x|}{b}}, \qquad \hat K(\omega) = \frac{1}{1 + b^2 \omega^2}. \tag{13} $$

Since $K(x) > 0$ and $\hat K(\omega) > 0$, this also defines a valid spread divergence over $\mathbb{R}$.
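The positivity of these transforms can be checked numerically. The following sketch (our own illustration; the grid size and noise scale are arbitrary choices) verifies on a discrete grid that the transform of a Gaussian kernel is strictly positive, and that dividing by it in the frequency domain, as in equation(10), recovers the original density from the spreaded one.

```python
import numpy as np

n, L = 256, 40.0
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
dx = x[1] - x[0]

# a bimodal density p and a Gaussian spread kernel K
p = 0.5 * np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)
p /= p.sum() * dx
sigma = 0.3
K = np.exp(-0.5 * (x / sigma) ** 2)
K /= K.sum() * dx

K_hat = np.fft.fft(np.fft.ifftshift(K)) * dx       # transform of the kernel
assert np.all(K_hat.real > 0)                      # strictly positive, eq. (12)

p_spread = np.fft.ifft(np.fft.fft(p) * K_hat).real  # convolution theorem, eq. (9)
p_back = np.fft.ifft(np.fft.fft(p_spread) / K_hat).real  # deconvolve, eq. (10)
print(np.max(np.abs(p_back - p)))                  # tiny: no information is lost
```

In exact arithmetic no information is lost; numerically, very small values of $\hat K$ amplify rounding error, which is the practical counterpart of the requirement $\hat K(\omega) \neq 0$.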

3.1 Invertible Mappings

Consider $y = f(x)$ for strictly monotonic (and hence invertible) $f$. Then, using the change of variables,

$$ p_f(y) = p\!\left( f^{-1}(y) \right) \left| \frac{\partial f}{\partial x} \right|^{-1}_{x = f^{-1}(y)}, \tag{14} $$

where $\partial f / \partial x$ is the Jacobian of $f$. For distributions with bounded domain, for example $x \in (0,1)$, we can use a logit function, $f(x) = \log \frac{x}{1-x}$, which maps the interval $(0,1)$ to $\mathbb{R}$. Using then, for example, Gaussian spread noise on $y$, both $\tilde p_f$ and $\tilde q_f$ have support $\mathbb{R}$. If $D(\tilde p_f, \tilde q_f)$ is zero, then $p = q$ on the domain $(0,1)$.
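As a brief sketch of this construction (ours, with arbitrary parameter choices), data on a bounded interval can be pushed through the logit before spreading:

```python
import numpy as np

rng = np.random.default_rng(1)

def logit(x):
    return np.log(x) - np.log1p(-x)        # strictly monotonic: (0,1) -> R

x = rng.beta(2.0, 5.0, size=1000)          # data supported only on (0, 1)
sigma = 0.5
y = logit(x) + sigma * rng.standard_normal(1000)  # spread in the unbounded space

# y has full support on R; a model spreaded the same way shares that support,
# and matching the spreaded distributions implies matching on (0, 1).
print(y.min(), y.max())
```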

3.2 Maximising the Spread

From the data processing inequality (see appendix(A)), spread noise will always decrease the $f$-divergence: $D_f(\tilde p, \tilde q) \le D_f(p, q)$. If we are to use a spread divergence to train a model, there is the danger that adding too much noise may make the spreaded empirical distribution and spreaded model distribution so similar that it becomes difficult to numerically distinguish them, impeding training. In general, therefore, it would be useful to add noise such that we define a valid spread divergence, but can still maximally discern the difference between the two distributions. To gain intuition, we define $p$ and $q$ to generate data in separated linear subspaces. Using Gaussian spread, $p(y|x) = \mathcal{N}(y \mid x, \Sigma)$, what is the optimal covariance $\Sigma$ that maximises the divergence? Clearly, as $\Sigma$ tends to zero, the divergence increases to infinity, meaning that we must at least constrain the entropy of the noise to be finite. In this case the spreaded distributions are given by

(15)

We define a simple Factor Analysis noise model, $p(y|x) = \mathcal{N}(y \mid x, \sigma^2 I + \theta \theta^\top)$, where $\sigma$ is fixed and the length of $\theta$ is constrained. The entropy of the noise is then fixed and independent of the direction of $\theta$. Also, for simplicity, we assume small $\sigma$. It is straightforward to show that the spread divergence is maximised for $\theta$ pointing orthogonal to the vector separating the two subspaces; that is, $\theta$ optimally points along the direction in which the support lies. The support of $p(y|x)$ must be the whole space, but to maximise the divergence the noise preferentially spreads along directions defined by the supports of $p$ and $q$, see figure(1).

Figure 1: The lower dotted line denotes Gaussian distributed data $p$ with support only along a linear subspace through the origin; the upper dotted line denotes Gaussian distributed data $q$ with support different from that of $p$. Optimally, to maximise the spread divergence between the two distributions, for fixed noise entropy, we should add noise that preferentially spreads out along the directions defined by the two supports, as denoted by the ellipses.
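The following small numerical check (our own, simplifying the two subspace-supported distributions to two point masses) illustrates the claim: with Factor Analysis noise of fixed entropy, the spread KL divergence is largest when $\theta$ is orthogonal to the separation between the supports.

```python
import numpy as np

delta = np.array([0.0, 1.0])       # separation mu_q - mu_p between the supports
sigma0, theta_len = 0.1, 1.0       # fixed entropy: |theta| held constant

def spread_kl(angle):
    # For point masses, both spreaded distributions are Gaussians with the same
    # covariance Sigma, so KL(p~ || q~) = 0.5 * delta^T Sigma^{-1} delta.
    theta = theta_len * np.array([np.cos(angle), np.sin(angle)])
    Sigma = sigma0**2 * np.eye(2) + np.outer(theta, theta)
    return 0.5 * delta @ np.linalg.solve(Sigma, delta)

for a in np.linspace(0, np.pi / 2, 5):
    print(f"angle(theta, delta) = {90 - np.degrees(a):4.1f} deg, "
          f"KL = {spread_kl(a):8.2f}")
# largest KL when theta is orthogonal to delta (angle 90 deg): the noise then
# preferentially spreads along the data direction, not across the gap.
```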

4 Mercer Spread Divergence

We showed in section(3) how to define one form of spread divergence, with the result that stationary noise distributions must have strictly positive Fourier transforms. A natural question is whether, for continuous $x$, there are other easily definable noise distributions that are non-stationary. To examine this question, let $x \in [a,b]$ and let $K(x, x')$ be a square integrable kernel, $\int_a^b \int_a^b K(x, x')^2\, dx\, dx' < \infty$. We define Mercer noise $p(y|x) \propto K(y, x)$. For strictly positive definite $K$, by Mercer's theorem, it admits an expansion

$$ K(x, x') = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(x'), \tag{16} $$

where the eigenfunctions $\phi_i$ form a complete orthonormal set of $L^2[a,b]$ and all $\lambda_i > 0$; see for example Sriperumbudur et al. (2011). Then

$$ \tilde p(y) = \int K(y, x)\, p(x)\, dx = \sum_i \lambda_i\, \phi_i(y) \int \phi_i(x)\, p(x)\, dx, \tag{17} $$

and $\tilde p = \tilde q$ is equivalent to the requirement

$$ \sum_i \lambda_i\, \phi_i(y) \int \phi_i(x) \left( p(x) - q(x) \right) dx = 0 \quad \forall y. \tag{18} $$

Multiplying both sides by $\phi_j(y)$ and integrating over $y$, we obtain

$$ \lambda_j \int \phi_j(x) \left( p(x) - q(x) \right) dx = 0. \tag{19} $$

If $p$ and $q$ are in $L^2[a,b]$ then, from Mercer's theorem, they can be expressed as orthogonal expansions

$$ p(x) = \sum_i p_i\, \phi_i(x), \qquad q(x) = \sum_i q_i\, \phi_i(x). \tag{20} $$

Then, equation(19) is

$$ \lambda_j \sum_i \left( p_i - q_i \right) \int \phi_j(x)\, \phi_i(x)\, dx = 0, \tag{21} $$

which reduces to (using orthonormality) $p_j = q_j$ for all $j$. Hence, provided $K$ is square integrable on $[a,b] \times [a,b]$ and strictly positive definite, it defines valid spread noise. For example, one can construct strictly positive definite, non-stationary, square integrable kernels on $[a,b]$. Provided $p$ and $q$ are in $L^2[a,b]$, such spread noise defines a valid spread divergence.
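A discretised sanity check of this construction (our own illustration; the kernel, grid and lengthscale are arbitrary) verifies that a strictly positive definite kernel yields an injective spread, so that $\tilde p = \tilde q$ forces $p = q$:

```python
import numpy as np

a, b, n = 0.0, 1.0, 50
x = np.linspace(a, b, n)

# Laplace kernel: strictly positive definite and square integrable on [a, b]
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.1)

print("smallest kernel eigenvalue:", np.linalg.eigvalsh(K).min())  # > 0
P = K / K.sum(axis=0, keepdims=True)   # column-normalised: valid noise p(y|x)
print("noise matrix invertible:", np.linalg.matrix_rank(P) == n)   # True
```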

5 Applications

We demonstrate using a spread divergence to train implicit models of the form

$$ p_\theta(x) = \int \delta\!\left( x - g_\theta(z) \right) p(z)\, dz, \tag{22} $$

where $\theta$ are the parameters of the generator $g_\theta$. We show that, despite the likelihood not being defined, we can nevertheless successfully train the models using an EM style algorithm, see for example Barber (2012). Finally, we discuss a link between spread divergences and privacy preservation.

5.1 Deterministic Linear Latent Model

For observation noise $\sigma^2 > 0$, the Probabilistic PCA (PPCA) model (Tipping & Bishop, 1999) for $D$-dimensional observations $x$ and $H$-dimensional latents $z$ is

$$ p_\theta(x) = \int \mathcal{N}\!\left( x \mid W z, \sigma^2 I \right) \mathcal{N}(z \mid 0, I)\, dz = \mathcal{N}\!\left( x \mid 0, W W^\top + \sigma^2 I \right). \tag{23} $$

When $\sigma = 0$, the generative mapping from $z$ to $x$ is deterministic, the model has support only on a subset of $\mathbb{R}^D$, and the data likelihood is in general not defined. In the following we consider general $\sigma > 0$, setting $\sigma$ to zero at the end of the calculation. To fit the model to iid data $x_1, \dots, x_N$ using maximum likelihood, the only information required from the dataset is the data covariance $S$. The maximum likelihood solution for PPCA is then $W = U_H \left( \Lambda_H - \sigma^2 I \right)^{1/2} R$, where $\Lambda_H$, $U_H$ are the $H$ largest eigenvalues and corresponding eigenvectors of $S$; $R$ is an arbitrary orthogonal matrix. Using spread noise $p(y|x) = \mathcal{N}(y \mid x, \tilde\sigma^2 I)$, the spreaded model distribution is a Gaussian,

$$ \tilde p_\theta(y) = \mathcal{N}\!\left( y \mid 0, W W^\top + \left( \sigma^2 + \tilde\sigma^2 \right) I \right). \tag{24} $$

Thus, $\tilde p_\theta$ is of the same form as PPCA, albeit with an inflated covariance matrix. Adding Gaussian spread noise to the data also simply inflates the sample covariance to $S + \tilde\sigma^2 I$. Since the eigenvalues of $S + \tilde\sigma^2 I$ are simply $\lambda_i + \tilde\sigma^2$, with unchanged eigenvectors, the optimal deterministic ($\sigma = 0$) latent linear model has solution $W = U_H \Lambda_H^{1/2} R$. Unsurprisingly, this is the standard PCA solution; however, the derivation is non-standard since the likelihood of the deterministic latent linear model is not defined. Nevertheless, using the spread divergence, we learn a sensible model and recover the true data generating process if the data were exactly generated according to the deterministic model.
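A short numerical confirmation (ours; the dimensions and noise level are arbitrary) that inflating the covariance leaves the eigenvectors unchanged, so that the $\sigma = 0$ spread solution recovers the PCA subspace of deterministically generated data:

```python
import numpy as np

rng = np.random.default_rng(2)
D, H, N = 5, 2, 2000

W0 = rng.standard_normal((D, H))
X = rng.standard_normal((N, H)) @ W0.T        # deterministic model: x = W0 z

S = X.T @ X / N                               # rank-H sample covariance
sig_spread = 0.3
lam, U = np.linalg.eigh(S)
lam_s, U_s = np.linalg.eigh(S + sig_spread**2 * np.eye(D))

print(np.allclose(lam_s, lam + sig_spread**2))            # eigenvalues inflated
print(np.allclose(np.abs(np.sum(U_s[:, -H:] * U[:, -H:], axis=0)),
                  1.0))                       # top-H eigenvectors unchanged

W = U[:, -H:] * np.sqrt(lam[-H:])             # sigma -> 0 solution: PCA
resid = W0 - W @ np.linalg.lstsq(W, W0, rcond=None)[0]
print(np.max(np.abs(resid)))                  # ~0: true subspace recovered
```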

5.2 Deterministic Independent Components Analysis

ICA corresponds to the model $p(x) = \int p(x|s) \prod_i p(s_i)\, ds$, where the independent components $s_i$ follow a non-Gaussian distribution. For Gaussian noise ICA, an observation $x$ is assumed to be generated by the process $x = A s + \epsilon$, where the mixing matrix $A$ mixes the independent latent process $s$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. In standard linear ICA, $x = \sum_i a_i s_i + \epsilon$, where $a_i$ is the $i$th column of the mixing matrix $A$. For small observation noise $\sigma$, the EM algorithm (Bermond & Cardoso, 1999) becomes ineffective. To see this, consider zero noise, $x = A s$, and an invertible mixing matrix $A$. At iteration $t$ the EM algorithm has an estimate $A_t$ of the mixing matrix. The M-step updates $A_t$ to

$$ A_{t+1} = \left( \sum_n x_n \langle s \rangle_n^\top \right) \left( \sum_n \langle s s^\top \rangle_n \right)^{-1}, \tag{25} $$

where, for noiseless data ($\sigma = 0$), the posterior $p(s \mid x_n)$ collapses to a delta at $s = A_t^{-1} x_n$, so that

$$ \langle s \rangle_n = A_t^{-1} x_n, \qquad \langle s s^\top \rangle_n = A_t^{-1} x_n x_n^\top A_t^{-\top}, \tag{26} $$

where $M = \sum_n x_n x_n^\top$ is the moment matrix of the data. Thus, $A_{t+1} = M A_t^{-\top} \left( A_t^{-1} M A_t^{-\top} \right)^{-1} = A_t$, and the algorithm 'freezes'. Similarly, for low noise, progress critically slows down. Whilst over-relaxation methods, see for example Winther & Petersen (2007), can help in the case of small noise, for zero noise ($\sigma = 0$) over-relaxation is of no benefit.
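This freezing is easy to reproduce numerically (a minimal sketch of the argument above; the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 3, 1000

A_true = rng.standard_normal((d, d))
S = rng.laplace(size=(d, N))          # non-Gaussian independent sources
X = A_true @ S                        # noiseless observations: x = A s

A = rng.standard_normal((d, d))       # EM initialisation
S_post = np.linalg.solve(A, X)        # zero-noise E-step: <s>_n = A^{-1} x_n
A_new = (X @ S_post.T) @ np.linalg.inv(S_post @ S_post.T)   # M-step, eq. (25)
print(np.allclose(A_new, A))          # True: the update leaves A unchanged
```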

Figure 2: (a) Relative error in the estimated mixing matrix as a function of the model noise standard deviation $\sigma$. We generate datapoints from the model $x = A s + \epsilon$, for independent zero mean, unit variance Laplace components $s_i$. The elements of the mixing matrix used to generate the data are uniform random. We use 2000 EM iterations to estimate the mixing matrix; the relative error is averaged over all matrix elements and 10 random experiments, with standard errors around the mean relative error also plotted. In blue we show the error in learning the underlying mixing matrix using the standard EM algorithm. As expected, as $\sigma \to 0$, the error blows up as the EM algorithm 'freezes'. In orange we plot the error for EM using spread noise, as described in section(5.2.1); no slowing down appears as the model noise decreases. As the model noise increases, the quality of the learned model under spread noise decreases gradually. (b) Relative error as a function of the number of datapoints $N$: apart from very small $N$, the error for the spread EM algorithm is lower than for the standard EM algorithm; here 500 EM updates were used. Results are averaged over 50 runs of randomly drawn mixing matrices.

5.2.1 Healing Critical Slowing Down

To deal with small noise and the limiting case of a deterministic model ($\sigma = 0$), we consider Gaussian spread noise $p(y|x) = \mathcal{N}(y \mid x, \tilde\sigma^2 I)$, to give

$$ \tilde p_\theta(y) = \int \mathcal{N}\!\left( y \mid A s, \left( \sigma^2 + \tilde\sigma^2 \right) I \right) \prod_i p(s_i)\, ds. \tag{27} $$

The empirical distribution is replaced by the spreaded empirical distribution

$$ \tilde p(y) = \frac{1}{N} \sum_n \mathcal{N}\!\left( y \mid x_n, \tilde\sigma^2 I \right). \tag{28} $$

The M-step has the same form as equation(25), but with modified statistics

(29)
(30)

The E-step optimally sets the posterior over the latent components according to

(31)

where $Z$ is a normaliser, and

(32)

We can rewrite the expectations required for the E-step of the EM algorithm as

(33)
(34)

Generally the posterior will be sharply peaked, and writing the expectations this way allows for an effective sampling approximation focussed on regions of high probability. We implement this update by drawing samples of the spreaded data and, for each such sample, drawing samples of the latent components. This scheme has the advantage over more standard variational approaches, see for example Winther & Petersen (2007), in that we obtain a consistent estimator of the M-step update. (We focus on demonstrating how the spread divergence heals critical slowing down, rather than deriving a state-of-the-art approximation of the posterior. The importance sampling approach has fast run time and works well, even for large latent dimensions. We also implemented a variational factorised approximation of the posterior but found this to be relatively slow and ineffective; a variational Gaussian approximation improves on the factorised approximation, but is still slow compared to the importance sampling scheme.) We show results for a toy experiment in figure(2), learning the underlying mixing matrix in a deterministic non-square setting; note that standard algorithms such as FastICA (Hyvärinen, 1999) fail in this setting. The spread noise level $\tilde\sigma$ is held fixed during training. The EM algorithm learns a good approximation of the unknown mixing matrix and latent components, with no critical slowing down.

5.3 Training Implicit Non-linear Models

For a deterministic non-linear implicit model, we set $p_\theta(x|z) = \delta\!\left( x - g_\theta(z) \right)$ and parameterise $g_\theta$ by a deep neural network. The likelihood of equation(22) is in general intractable and it is natural to consider a variational approximation (Kingma & Welling, 2013),

$$ \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\!\left[ \log p_\theta(x|z) \right] - \mathrm{KL}\!\left( q_\phi(z|x) \,\|\, p(z) \right). \tag{35} $$

However, since $p_\theta(x|z)$ is deterministic, this bound is not well defined. Instead, we minimise the spread divergence $\mathrm{KL}(\tilde p \,\|\, \tilde p_\theta)$. The approach is a straightforward extension of the standard variational autoencoder, and in appendix(C) we provide details of how to do this, along with higher resolution images of samples from the generative model. We dub this model and associated spread divergence training the 'δ-VAE'. As a demonstration, we trained a generative network on the MNIST dataset, appendix(D). We used Gaussian spread noise for the δ-VAE and Gaussian observation noise for the standard noisy VAE. The network $g_\theta$ contains 8 layers, each layer with 400 units and relu activation functions. We also trained a deep convolutional generative model on the CelebA dataset (Liu et al., 2015), see figure(3) and appendix(E). We pre-process CelebA images by first taking 140x140 centre crops and then resizing to 64x64; pixel values were then rescaled to lie in $[0,1]$. Again, we use Gaussian spread noise for the δ-VAE and observation noise for the standard noisy VAE.
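The following is a condensed sketch of one training step of this objective (our own simplification of appendix(C), in PyTorch; the architecture, noise level and batch are stand-ins, not the paper's settings): the data is corrupted with Gaussian spread noise, and a standard ELBO is then optimised on the noisy data, with a Gaussian likelihood of variance $\tilde\sigma^2$ around the deterministic generator output.

```python
import torch
import torch.nn as nn

D, H, sigma_spread = 784, 20, 0.5
enc = nn.Sequential(nn.Linear(D, 400), nn.ReLU(), nn.Linear(400, 2 * H))
dec = nn.Sequential(nn.Linear(H, 400), nn.ReLU(), nn.Linear(400, D))

def spread_vae_loss(x):
    y = x + sigma_spread * torch.randn_like(x)             # y ~ p(y|x)
    mu, log_var = enc(y).chunk(2, dim=-1)                  # q(z|y)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterise
    # -log N(y | g(z), sigma~^2 I), up to constants
    recon = ((y - dec(z)) ** 2).sum(-1) / (2 * sigma_spread**2)
    kl = 0.5 * (mu**2 + log_var.exp() - log_var - 1).sum(-1)  # KL(q || N(0,I))
    return (recon + kl).mean()

x = torch.rand(32, D)                 # stand-in data batch
loss = spread_vae_loss(x)
loss.backward()
print(float(loss))
```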

Figure 3: Comparison of training approaches on the CelebA dataset: (a) δ-VAE samples; (b) noisy VAE means; (c) noisy VAE samples. All models had the same structure and were trained using the same Adam settings, as in the MNIST experiment.

5.4 Privacy

Spread divergences have an obvious connection to privacy preservation and Randomised Response (Warner, 1965). For example, Alice might have a collection of binary votes $x_1, \dots, x_N$, $x_n \in \{0, 1\}$, and Bob might like to learn the fraction of votes that are 1. However, Alice does not want to send to Bob the raw data. Writing $\theta$ for the probability that a vote is 1, Alice can find the fraction of votes that are 1 by performing maximum (log) likelihood:

$$ \theta^* = \operatorname*{argmax}_\theta \sum_n \left( x_n \log \theta + (1 - x_n) \log(1 - \theta) \right) = \frac{1}{N} \sum_n x_n. \tag{36} $$

We can equivalently write this as finding the $\theta$ that minimises $\mathrm{KL}(p \,\|\, p_\theta)$, where the model is given by $p_\theta(x) = \theta^x (1 - \theta)^{1-x}$ and the data distribution is given by the empirical distribution

$$ p(x) = \frac{1}{N} \sum_n \mathbb{I}\left[ x = x_n \right]. \tag{37} $$

To preserve privacy, instead of sending $x_n$, Alice can send $y_n$ to Bob, where $y_n$ is sampled from the distribution $p(y|x_n)$: she sends $y_n = x_n$ with probability $p$ and the flipped vote $y_n = 1 - x_n$ with probability $1 - p$. This generates a noisy data distribution $\tilde p(y) = \frac{1}{N} \sum_n p(y|x_n)$. Since

$$ \tilde p_\theta(y) = \sum_x p(y|x)\, p_\theta(x), \tag{38} $$

this suggests that we could attempt to learn $\theta$ from the noisy data by minimising $\mathrm{KL}(\tilde p \,\|\, \tilde p_\theta)$. Bob can equivalently maximise the spread log likelihood

$$ \sum_n \log \tilde p_\theta(y_n), \tag{39} $$

where $\tilde p_\theta(y = 1) = p\theta + (1 - p)(1 - \theta)$. Writing $\bar y = \frac{1}{N} \sum_n y_n$ for the fraction of noisy votes that are 1, the maximum of the spread log likelihood is at $\theta = \frac{\bar y - (1 - p)}{2p - 1}$. This simple example recovers the standard 'debiasing' procedure of Warner (1965). However, the spread divergence suggests a general strategy to perform machine learning for any model given only noisy data, namely minimising $\mathrm{KL}(\tilde p \,\|\, \tilde p_\theta)$. In the limit of a large number of samples, the accuracy with which the true parameter is recovered improves (although privacy decreases, since the ability to recover the underlying vote distribution from a set of noisy samples correspondingly increases).
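A compact simulation of this debiasing (our own; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N, theta_true, p_keep = 100_000, 0.3, 0.7

x = rng.random(N) < theta_true        # Alice's private binary votes
flip = rng.random(N) > p_keep         # flip each vote with probability 1 - p
y = np.where(flip, ~x, x)             # what Alice actually sends to Bob

# maximum of the spread likelihood: p~(y=1) = p*theta + (1-p)*(1-theta),
# giving Warner's debiasing rule
theta_hat = (y.mean() - (1 - p_keep)) / (2 * p_keep - 1)
print(f"true theta: {theta_true}, estimate: {theta_hat:.4f}")
```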

6 Summary

We described an approach to defining a divergence even when two distributions do not have the same support. The method introduces a 'noise' variable to 'spread' mass from each distribution so that the resulting distributions cover the same domain. Previous approaches (Furmston & Barber, 2009; Sønderby et al., 2016) can be seen as special cases. We showed that defining divergences this way enables us to train deterministic generative models using standard 'likelihood' based approaches. Spread divergences have deep connections to other approaches for defining measures of disagreement between distributions. In particular, one can view the spread divergence as the probabilistic analogue of MMD, with the conditions required for the existence of the spread divergence closely related to the universality requirement on MMD kernels (Micchelli et al., 2006).

Theoretically, we can learn the underlying true data generating process by the use of any valid spread divergence, for example one with fixed Gaussian spread noise. In practice, however, the quality of the learned model can depend on the choice of spread noise. In this work we fixed the spread noise, but showed that if we were to learn the spread noise, it would preferentially spread mass across the manifolds defining the two distributions. In future work, we will investigate learning spread noise to maximally discriminate two distributions, which would involve a minimax model training objective, with an inner maximisation over the spread noise and an outer minimisation over the model parameters. This would bring our work much closer to adversarial training methods (Goodfellow, 2017).

References

Appendix A Spread noise makes distributions more similar

The data processing inequality for $f$-divergences (see for example Gerchinovitz et al. (2018)) states that $D_f(\tilde p, \tilde q) \le D_f(p, q)$. For completeness, we provide here an elementary proof of this result. We consider the joint distributions

$$ p(x, y) = p(y|x)\, p(x), \qquad q(x, y) = p(y|x)\, q(x), \tag{40} $$

whose marginals are the spreaded distributions

$$ \tilde p(y) = \int_x p(y|x)\, p(x), \qquad \tilde q(y) = \int_x p(y|x)\, q(x). \tag{41} $$

The divergence between the two joint distributions is

$$ D_f\!\left( p(x,y), q(x,y) \right) = \int_{x,y} p(y|x)\, q(x)\, f\!\left( \frac{p(y|x)\, p(x)}{p(y|x)\, q(x)} \right) = \int_x q(x)\, f\!\left( \frac{p(x)}{q(x)} \right) = D_f(p, q). \tag{42} $$

The $f$-divergence between two marginal distributions is no larger than the $f$-divergence between the joints (see also Zhang et al. (2018)). To see this, consider

$$ D_f(\tilde p, \tilde q) = \int_y \tilde q(y)\, f\!\left( \int_x q(x|y)\, \frac{p(x,y)}{q(x,y)} \right) \le \int_{x,y} q(x,y)\, f\!\left( \frac{p(x,y)}{q(x,y)} \right), $$

where we used $\int_x q(x|y)\, p(x,y) / q(x,y) = \tilde p(y) / \tilde q(y)$ and, for the inequality, Jensen's inequality for the convex function $f$. Hence,

$$ D_f(\tilde p, \tilde q) \le D_f\!\left( p(x,y), q(x,y) \right) = D_f(p, q). \tag{43} $$

Intuitively, spreading two distributions increases their overlap, reducing the divergence. When $p$ and $q$ do not have the same support, $D_f(p, q)$ can be infinite or not well-defined.
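The inequality is easy to verify in closed form for two unit-variance Gaussians a distance $\mu$ apart (our own illustration), for which $\mathrm{KL} = \mu^2 / (2 \cdot \mathrm{variance})$; Gaussian spread noise adds $\sigma^2$ to both variances and so strictly shrinks the divergence:

```python
mu = 2.0
for sigma in [0.0, 0.5, 1.0, 2.0]:
    var = 1.0 + sigma**2             # both spreaded variances inflate equally
    print(f"spread sigma = {sigma}: KL = {mu**2 / (2 * var):.3f}")
# sigma = 0 gives the original KL; more noise monotonically reduces it.
```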

Appendix B Injective Linear Mappings

Consider an injective linear mapping $A : V \to W$ between finite dimensional spaces $V$ and $W$. From the rank-nullity theorem,

$$ \dim V = \dim \ker A + \dim \operatorname{im} A. \tag{44} $$

If $A$ is injective, then $\dim \ker A = 0$. If, additionally, $\dim V = \dim W$, then

$$ \dim \operatorname{im} A = \dim V = \dim W. \tag{45} $$

Since $\operatorname{im} A \subseteq W$, it must be that $\operatorname{im} A = W$. Hence, injective linear maps between two (finite dimensional) spaces of the same dimension are surjective; equivalently, they are invertible.

In the context of spread noise, since the domains of $x$ and $y$ are equal and $\tilde p(y) = \sum_x p(y|x)\, p(x)$ is defined through a linear transformation of $p$, the requirement in (5) that the mapping $p \mapsto \tilde p$ is injective is equivalent to the requirement that the matrix $P_{yx} = p(y|x)$ is invertible.

Appendix C Spread Divergence for Deterministic Deep Generative Models

Instead of maximising the likelihood, we train an implicit generative model by minimising the spread divergence

$$ \mathrm{KL}(\tilde p \,\|\, \tilde p_\theta) = -\int \tilde p(y) \log \tilde p_\theta(y)\, dy + \mathrm{const}, \tag{46} $$

where

$$ \tilde p(y) = \int p(y|x)\, p(x)\, dx = \frac{1}{N} \sum_n p(y|x_n), \tag{47} $$

and

$$ \tilde p_\theta(y) = \int p(y|x)\, p_\theta(x)\, dx = \int p\!\left( y \mid g_\theta(z) \right) p(z)\, dz. \tag{48} $$

According to our general theory,

$$ \mathrm{KL}(\tilde p \,\|\, \tilde p_\theta) = 0 \;\Rightarrow\; p = p_\theta. \tag{49} $$

Here we use Gaussian spread noise,

$$ p(y|x) = \mathcal{N}\!\left( y \mid x, \tilde\sigma^2 I \right). \tag{50} $$

Typically, the integral over $y$ will be intractable and we resort to an unbiased sampled estimate (though see below for Gaussian $p(y|x)$). Neglecting constants, the KL divergence estimator is

$$ -\frac{1}{N} \sum_n \log \tilde p_\theta(y_n), \tag{51} $$

where $y_n$ is a noisy sample of $x_n$, namely $y_n \sim p(y|x_n)$. In most cases of interest, with non-linear $g_\theta$, the distribution $\tilde p_\theta(y)$ is intractable. We therefore use the variational lower bound

$$ \log \tilde p_\theta(y) \ge \mathbb{E}_{q_\phi(z|y)}\!\left[ \log p\!\left( y \mid g_\theta(z) \right) \right] - \mathrm{KL}\!\left( q_\phi(z|y) \,\|\, p(z) \right). \tag{52} $$

Parameterising the variational distribution as a Gaussian,

$$ q_\phi(z|y) = \mathcal{N}\!\left( z \mid \mu_\phi(y), \Sigma_\phi(y) \right), \tag{53} $$

then we can reparameterise $z = \mu_\phi(y) + \Sigma_\phi^{1/2}(y)\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$, and write

$$ \log \tilde p_\theta(y) \ge \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\!\left[ \log p\!\left( y \mid g_\theta(z) \right) + \log p(z) \right]_{z = \mu_\phi(y) + \Sigma_\phi^{1/2}(y)\epsilon} + H\!\left( \Sigma_\phi(y) \right), \tag{54} $$

where $H(\Sigma)$ is the entropy of a Gaussian with covariance $\Sigma$. For Gaussian spread noise in $D$ dimensions, the likelihood term is (ignoring constants)

$$ \log p\!\left( y \mid g_\theta(z) \right) = -\frac{1}{2\tilde\sigma^2} \left\| y - g_\theta(z) \right\|^2. $$