Blindness of score-based methods to isolated components and mixing proportions

08/23/2020 · Li K. Wenliang, et al., UCL

A large family of score-based methods has been developed recently to solve unsupervised learning problems, including density estimation, statistical testing and variational inference. These methods are attractive because they exploit the derivative of the log density, which is independent of the normaliser, and are thus suitable for tasks involving unnormalised densities. Despite theoretical guarantees on their performance, we illustrate here a common practical issue suffered by these methods when the unnormalised distribution of interest has isolated components. In particular, we study the behaviour of some popular score-based methods on tasks involving 1-D mixtures of Gaussians. These methods fail to identify appropriate mixing proportions when the unnormalised distribution is multimodal. Finally, some directions for finding a remedy are discussed in light of recent successes in specific tasks. We hope to bring the attention of theoreticians and practitioners to this issue when developing new algorithms and applications.

1 Introduction and background

Define the score function of a random variable $x$, or of its density $p$, as $s_p(x) := \nabla_x \log p(x)$ (assuming it exists), where $x$ belongs to some suitable set $\mathcal X$. Note that this is different from the usual definition of the score function, where the derivative is taken with respect to the model parameters. Here, we illustrate that some theoretically guaranteed algorithms based on the score function can fail catastrophically in practice when the unnormalised distribution has isolated components (substantial probability mass separated by regions of $\mathcal X$ with negligible mass). For simplicity and visualisation, we consider the following 1-D distributions throughout:

Example 1 (Gaussian mixtures).

Define the following density functions on $\mathbb{R}$:

$p(x) = \pi_1\,\mathcal N(x; \mu_1, \sigma^2) + \pi_2\,\mathcal N(x; \mu_2, \sigma^2), \qquad q(x) = \mathcal N(x; \mu_1, \sigma^2),$

where $\pi_1, \pi_2 > 0$ are mixing proportions satisfying $\pi_1 + \pi_2 = 1$. Without loss of generality, we assume that $\mu_1 < \mu_2$. Examples of densities $p$ and $q$ are shown in Figure 1.

When $\mu_1 - \mu_2$ has sufficiently large magnitude (with a negative sign), density $p$ has isolated components. First, we show the following property of $s_p$.

Lemma 1 (Weak dependence of score on mixing proportions).

For the $p$ and $q$ defined in Example 1, the dependence of $s_p$ on the mixing proportions $(\pi_1, \pi_2)$ can be made arbitrarily weak for $x$ away from the midpoint $(\mu_1 + \mu_2)/2$ by taking $\mu_2 - \mu_1 \to \infty$.

Proof.

The score function is

$s_p(x) = -\frac{w_1(x)\,(x - \mu_1) + w_2(x)\,(x - \mu_2)}{\sigma^2}, \qquad w_i(x) = \frac{\pi_i\,\mathcal N(x; \mu_i, \sigma^2)}{p(x)}.$

As $\mu_2 - \mu_1 \to \infty$, we can see that, for any $x$ bounded away from the midpoint, the responsibility $w_i(x)$ of the nearer component tends to one, so that

$s_p(x) \to -\frac{x - \mu_1}{\sigma^2} = s_q(x) \;\text{ for } x < \tfrac{\mu_1 + \mu_2}{2}, \qquad s_p(x) \to -\frac{x - \mu_2}{\sigma^2} \;\text{ for } x > \tfrac{\mu_1 + \mu_2}{2},$

neither of which depends on the mixing proportions. ∎

Perhaps surprisingly, when $\mu_2$ is far away from $\mu_1$, the score function becomes roughly independent of the mixing proportions everywhere except near the midpoint. This is illustrated in Figure 1 (top).

Figure 1: Top, densities and score functions in Example 1. Note that the score functions of $p$ and $q$ nearly coincide where $q$ has most of its mass. The scale of the density is shown on the left vertical axis, and the scale of the score function is shown on the right vertical axis. Bottom, density $q$ and the optimal $f^*$ (rescaled for visualisation) for the SD.
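To make Lemma 1 concrete, the short sketch below (not from the paper; all parameter values and function names are illustrative) evaluates the score of the two-component mixture $p$ for two very different mixing proportions. On a fixed interval around the first component, the largest score difference shrinks rapidly as the separation between the components grows.

```python
import numpy as np

def mixture_score(x, pi1, mu1, mu2, sigma=1.0):
    """Score d/dx log p(x) of a two-component 1-D Gaussian mixture."""
    # Unnormalised component densities; the common Gaussian constant cancels in the ratio.
    n1 = pi1 * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    n2 = (1.0 - pi1) * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
    return (n1 * (-(x - mu1) / sigma**2) + n2 * (-(x - mu2) / sigma**2)) / (n1 + n2)

x = np.linspace(-4.0, 2.0, 200)      # grid around the first component (mu1 = 0)
for gap in [3.0, 6.0, 12.0]:         # separation mu2 - mu1
    s_a = mixture_score(x, pi1=0.9, mu1=0.0, mu2=gap)
    s_b = mixture_score(x, pi1=0.1, mu1=0.0, mu2=gap)
    print(f"mu2 - mu1 = {gap:4.1f}: max |score difference| between "
          f"pi1 = 0.9 and pi1 = 0.1 on [-4, 2] is {np.max(np.abs(s_a - s_b)):.3g}")
```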

In the next three sections, we discuss the consequences of Lemma 1 for three popular score-based methods:

  • score matching (SM) (Hyvärinen, 2005), which fits unnormalised energy-based models of data by minimising the Fisher divergence;

  • (kernel) Stein discrepancy (SD) (Chwialkowski et al., 2016; Liu et al., 2016), which provides a goodness-of-fit test between an unnormalised model and samples drawn from an unknown distribution;

  • Stein variational gradient descent (SVGD) (Liu and Wang, 2016), which deterministically moves a set of particles to approximate the posterior distribution in variational inference.

2 Score matching

Consider the problem of fitting an unnormalised density model $\tilde p_m$ on data drawn from $p_d$. Maximum-likelihood learning for unnormalised models is difficult due to the intractable normaliser $Z_m = \int \tilde p_m(x)\,\mathrm d x$. Instead, Hyvärinen (2005) proposed to train $\tilde p_m$ by minimising the Fisher divergence. For densities $p_d$ and $p_m$, this loss is

$F(p_d, p_m) = \tfrac{1}{2}\,\mathbb E_{x \sim p_d}\!\left[ \|\, s_d(x) - s_m(x) \,\|^2 \right],$

where $s_d$ and $s_m$ are the score functions of $p_d$ and $p_m$. Note that $s_m$ is independent of the log normaliser. Under mild conditions, the divergence can be estimated from data up to a constant that does not depend on the model $p_m$, and the model can be optimised using samples from $p_d$.

Proposition 1 (Blindness to mixing proportions).

Given $p(x)$ in Example 1 and another density $p'$ that is identical to $p$ except that $p'(x)$ has a different mixing proportion $\pi_1' \neq \pi_1$, then $F(p, p') \to 0$ as $\mu_2 - \mu_1 \to \infty$ regardless of the mixing proportions.

Proof.

By Lemma 1, $s_p$ and $s_{p'}$ converge to the same limit on the regions where $p$ has most of its mass. Also, $p$ has negligible mass around the midpoint $(\mu_1 + \mu_2)/2$, where the scores differ. ∎

If we take $p'$ as the model distribution and $p$ as the data distribution, then when the (empirical) Fisher divergence is minimised, one can obtain an almost equally good solution by changing the mixing proportions in $p'$.
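As a quick numerical check of Proposition 1 (again an illustrative sketch with made-up parameter values, not code from the paper), one can estimate the Fisher divergence between two well-separated mixtures that differ only in their mixing proportions by numerical integration on a grid; the value collapses towards zero as the separation grows.

```python
import numpy as np

def mixture_density_and_score(x, pi1, mu1, mu2, sigma=1.0):
    """Density and score of a two-component 1-D Gaussian mixture."""
    c = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    n1 = pi1 * c * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    n2 = (1.0 - pi1) * c * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
    p = n1 + n2
    score = (n1 * (-(x - mu1) / sigma**2) + n2 * (-(x - mu2) / sigma**2)) / p
    return p, score

def fisher_divergence(pi1_data, pi1_model, mu1, mu2, sigma=1.0):
    """F(p_d, p_m) = 0.5 * E_{p_d}[(s_d - s_m)^2], estimated by a Riemann sum."""
    x = np.linspace(mu1 - 10 * sigma, mu2 + 10 * sigma, 20001)
    p_d, s_d = mixture_density_and_score(x, pi1_data, mu1, mu2, sigma)
    _,   s_m = mixture_density_and_score(x, pi1_model, mu1, mu2, sigma)
    return 0.5 * np.sum(p_d * (s_d - s_m) ** 2) * (x[1] - x[0])

# Data mixture with pi1 = 0.7 versus a model mixture with pi1 = 0.3.
for gap in [4.0, 8.0, 16.0]:
    f = fisher_divergence(pi1_data=0.7, pi1_model=0.3, mu1=0.0, mu2=gap)
    print(f"mu2 - mu1 = {gap:4.1f}: Fisher divergence = {f:.3e}")
```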

Proposition 2 (Blindness to isolated components in the model).

The Fisher divergence $F(q, p)$ for $q$ and $p$ in Example 1 is close to zero for large $\mu_2 - \mu_1$, regardless of the mixing proportion $\pi_1$.

Proof.

We sketch the main idea. The Fisher divergence is an expectation under $q$. Note that $q$ has negligible mass near $\mu_2$ as $\mu_2 - \mu_1 \to \infty$. Using Lemma 1, it can be shown that on the domain where $q$ has most of its mass, $s_p(x) \approx s_q(x)$. Therefore, $F(q, p) \approx 0$. ∎

Taking $p$ as the model density and $q$ as the data density, we see that, as long as part of the model fits the data well, the model can have an arbitrary number of isolated (spurious) components supported far away from the data points while the loss is unaffected. The causes are twofold:

  1. the score functions of $p$ and $p'$ converge to that of $q$ where the data have mass (Lemma 1);

  2. the Fisher divergence is an expectation only over the data density.

A visualisation of these results is shown in Figure 1 (top). As such, applications that depend on the learned (log-)density of the trained model, such as sampling by Hamiltonian Monte Carlo, can produce bad results. The issue of isolated components implied by Proposition 2 can be partially alleviated by controlling the tail behaviour of the model (Arbel and Gretton, 2018; Wenliang et al., 2019), but the issue implied by Proposition 1 is more difficult to address in general.

Such issues do not arise in maximum-likelihood estimation, since isolated components or incorrect mixing proportions have a huge impact on the normalised density. Thus, while optimising the score function gets rid of the intractable normaliser, it also loses any control over the normalised density.

2.1 Training with noisy data for sampling

Despite the issues above, Song and Ermon (2019) demonstrated a successful score-based approach to generating samples. They adopted the denoising formulation of the score matching objective (Vincent, 2011) and trained a model to output the score function directly rather than the density. Further, they proposed to add isotropic Gaussian noise to data samples, effectively smoothing the data distribution with a Gaussian density. During training, the data distribution is transformed into multiple noisy versions by adding Gaussian noise with standard deviations $\sigma_1 > \sigma_2 > \dots > \sigma_L$. The model is trained to estimate the score function on data from all noise levels, taking the noise level as an input in addition to $x$. During sampling, a set of randomly initialised particles is updated by Langevin dynamics on the successive noisy distributions, with the noise level decreasing from $\sigma_1$ to $\sigma_L$.

In this procedure, the noise essentially bridges the isolated components together. The density corresponding to the largest noise level does not contain isolated components, so SM can correctly estimate this noisy distribution (or its score), but it is unlikely that the model learns the correct mixing proportions for the smallest noise level. Nonetheless, adding noise does not alter the mixing proportions in the data distribution, and the score function is locally correct at each component with respect to the noisy distribution at every noise level. Therefore, running Langevin dynamics according to the high-noise score function ensures that the particles are initially dispersed according to the correct mixing proportions in the data, and annealing then pushes the particles to follow the score of the low-noise data distribution.
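The sketch below illustrates this procedure in the 1-D setting (an illustrative toy, not the original implementation: it uses the exact score of the Gaussian-smoothed mixture in place of a learned network, and the noise levels, step sizes and parameter values are all made up). Even with all particles initialised at the wrong mode, the high-noise stage redistributes them roughly according to the data's mixing proportions before the low-noise stages sharpen each component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal "data" distribution: 0.7 * N(0, 1) + 0.3 * N(12, 1), components far apart.
PI1, MU1, MU2, SIGMA = 0.7, 0.0, 12.0, 1.0

def noisy_score(x, noise_std):
    """Exact score of the data distribution convolved with N(0, noise_std^2)."""
    s2 = SIGMA**2 + noise_std**2        # smoothing adds variance to each component
    n1 = PI1 * np.exp(-0.5 * (x - MU1) ** 2 / s2)
    n2 = (1.0 - PI1) * np.exp(-0.5 * (x - MU2) ** 2 / s2)
    return (n1 * (-(x - MU1) / s2) + n2 * (-(x - MU2) / s2)) / (n1 + n2)

noise_levels = [8.0, 4.0, 2.0, 1.0, 0.5, 0.1]          # annealed from large to small
particles = rng.normal(loc=MU1, scale=1.0, size=2000)  # deliberately bad initialisation
for noise_std in noise_levels:
    eps = 0.1 * (SIGMA**2 + noise_std**2)              # step size shrinks with the noise
    for _ in range(200):                               # Langevin dynamics at this level
        particles = (particles + eps * noisy_score(particles, noise_std)
                     + np.sqrt(2.0 * eps) * rng.normal(size=particles.shape))

frac = np.mean(particles > 0.5 * (MU1 + MU2))
print(f"fraction of particles near the second mode: {frac:.2f} (data proportion {1 - PI1:.2f})")
```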

It is worth noting that this model does not estimate the (unnormalised) density but only its score function at a given noise level. Although the score function is sufficient for certain applications, such as mode finding and Langevin sampling, it is not sufficient for others that additionally require the log density itself, such as Hamiltonian Monte Carlo.

3 Stein discrepancy

The Stein discrepancy (SD) can be used for a goodness-of-fit test that measures how well an unnormalised model describes samples from an unknown density (Liu et al., 2016; Chwialkowski et al., 2016). One application of SD, among many others, is to benchmark approximate inference methods applied to a latent-variable model described by the joint distribution $p(x, z)$, where $z$ is latent. The exact posterior $p(z \mid x)$ is usually intractable, but the unnormalised posterior $p(x, z)$ is trivial to evaluate. Also, popular inference methods produce sample representations of the posterior. As such, we can compare how well the posterior samples approximate $p(z \mid x)$ for each $x$. However, as we argue below, this application may fail due to the blindness associated with the score function.

SD is based on the Stein operator for a density $q$ given by

$(\mathcal A_q f)(x) = s_q(x)\,f(x) + \nabla_x f(x),$

where $f$ is in the Stein class of $q$, meaning that it is smooth and satisfies $\lim_{|x| \to \infty} f(x)\,q(x) = 0$; see (Chwialkowski et al., 2016; Liu et al., 2016) for details. Under this condition, it is easy to show that $\mathbb E_{x \sim p}[(\mathcal A_q f)(x)] = 0$ for all such $f$ if and only if $p = q$. This motivates the following definition of the SD between $p$ and $q$:

$\mathrm{SD}(p, q) = \sup_{f \in \mathcal F}\; \mathbb E_{x \sim p}\!\left[ (\mathcal A_q f)(x) \right], \qquad (1)$

where $\mathcal F$ is some function class.
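The supremum in (1) has a closed form when $\mathcal F$ is the unit ball of an RKHS, giving the kernel Stein discrepancy of Chwialkowski et al. (2016) and Liu et al. (2016). The sketch below (illustrative only; the Gaussian kernel, its bandwidth and all parameter values are assumptions) estimates the squared kernel SD of samples from $q = \mathcal N(0, 1)$ against a model with a large isolated component at $\mu_2 = 12$, and against the correct single-component model. The two estimates come out essentially identical, dominated by the small positive bias of the V-statistic, foreshadowing the blindness analysed below with a different choice of $\mathcal F$.

```python
import numpy as np

rng = np.random.default_rng(1)

def model_score(x, pi1, mu1=0.0, mu2=12.0, sigma=1.0):
    """Score of the candidate model: a mixture with a (possibly spurious) component at mu2."""
    n1 = pi1 * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    n2 = (1.0 - pi1) * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
    return (n1 * (-(x - mu1) / sigma**2) + n2 * (-(x - mu2) / sigma**2)) / (n1 + n2)

def ksd_vstat(samples, score_fn, h=1.0):
    """V-statistic estimate of the squared kernel SD with a Gaussian kernel of bandwidth h."""
    d = samples[:, None] - samples[None, :]
    k = np.exp(-0.5 * d**2 / h**2)
    dk_dx = -d / h**2 * k                    # derivative of k in its first argument
    dk_dy = d / h**2 * k                     # derivative of k in its second argument
    d2k = (1.0 / h**2 - d**2 / h**4) * k     # mixed second derivative
    s = score_fn(samples)
    u = s[:, None] * s[None, :] * k + s[:, None] * dk_dy + s[None, :] * dk_dx + d2k
    return u.mean()

data = rng.normal(loc=0.0, scale=1.0, size=500)    # samples from q = N(0, 1)
print(f"KSD^2, model with isolated component (pi1 = 0.5): "
      f"{ksd_vstat(data, lambda x: model_score(x, 0.5)):.3e}")
print(f"KSD^2, correct model (pi1 = 1.0):                 "
      f"{ksd_vstat(data, lambda x: model_score(x, 1.0)):.3e}")
```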

Proposition 3 (SD says $q \approx p$).

For $p$ and $q$ in Example 1, their Stein discrepancy $\mathrm{SD}(q, p) \to 0$ as $\mu_2 - \mu_1 \to \infty$, where $\mathcal F = \{ f : \mathbb E_{x \sim q}[f(x)^2] \le 1 \}$.

Proof.

We again provide a sketch. If $\mathcal F = \{ f : \mathbb E_{x \sim q}[f(x)^2] \le 1 \}$, we solve the Lagrangian

$\max_f\; \mathbb E_{x \sim q}\!\left[ (\mathcal A_p f)(x) \right] - \lambda\left( \mathbb E_{x \sim q}[f(x)^2] - 1 \right).$

One can check (using integration by parts, $\mathbb E_q[\nabla_x f] = -\mathbb E_q[s_q f]$) that the solution is

$f^*(x) \propto s_p(x) - s_q(x).$

Denote the constant of proportionality by $c$; then for the $p$ and $q$ in Example 1, as $\mu_2 - \mu_1 \to \infty$, $f^*(x) \to 0$ on the $\mu_1$ side of the midpoint by Lemma 1. Since $q$ has negligible mass for $x$ near $\mu_2$, the SD

$\mathrm{SD}(q, p) = c\,\mathbb E_{x \sim q}\!\left[ (s_p(x) - s_q(x))^2 \right] = \sqrt{\mathbb E_{x \sim q}\!\left[ (s_p(x) - s_q(x))^2 \right]} \to 0.$

Similarly, the solution for the density $p'$ of Proposition 1 ($p$ with different mixing proportions) in place of $p$ is

$f^*(x) \propto s_{p'}(x) - s_q(x).$

Denoting the constant of proportionality by $c'$, as $\mu_2 - \mu_1 \to \infty$,

$\mathrm{SD}(q, p') = c'\,\mathbb E_{x \sim q}\!\left[ (s_{p'}(x) - s_q(x))^2 \right].$

For $x$ where $q$ has most of its mass, we know from Lemma 1 that $s_{p'}(x) \to s_q(x)$ as $\mu_2 - \mu_1 \to \infty$, so the integrand vanishes there. Thus, the SD $\mathrm{SD}(q, p') \to 0$ as well. ∎

As shown in Figure 1 (bottom), the best $f^*$ is almost zero wherever $q$ has appreciable mass, regardless of whether the model is $p$ or $p'$. Since the SD is an expectation under $q$, which has most of its mass around $\mu_1$, it fails to detect the differences. Intuitively, by inspection of (1), we can make $f$ scale as $1/q(x)$ to counteract the vanishing tail of $q$ and expose the second component to the SD. However, such functions may not be in $\mathcal F$ or in the Stein class of $p$.

Therefore, SD cannot detect isolated components in the model that are far away from the samples of $q$, even when those components have high mixing proportions. One may hope that, by considering the variance of the estimated SD, goodness-of-fit tests can be made robust to isolated components. However, this is not the case, since the two $f^*$'s above are essentially zero w.r.t. $q$ and cannot induce a large variance. If $f$ is sufficiently unsmooth to scale up the vanishing tail of $q$, then the variance can be made large, making the test more robust.

With this issue in mind, we return to the application to benchmarking approximate inference mentioned at the beginning of this section. A flexible generative model can induce a complicated posterior that has many isolated components. If the approximate posterior only covers one of the components, especially in variational inference where the objective has mode-seeking behaviour (e.g. the exclusive KL divergence $\mathrm{KL}(q \,\|\, p)$), SD may report a good performance while being blind to the components of the exact posterior that are not covered by the samples.

4 Stein variational gradient descent (SVGD)

We now consider the impact of Lemma 1 on SVGD (Liu and Wang, 2016). Here, the goal is to find a sample representation of a distribution $q$ to approximate an unnormalised target distribution $\tilde p$. The main idea is to find a direction $\phi$ along which to push a set of particles so as to minimise $\mathrm{KL}(q \,\|\, p)$. For $\phi$ defined by a function in the reproducing kernel Hilbert space associated with a kernel $k$, the optimal direction evaluated at $x$ is given by the kernel Stein operator

$\phi^*(x) \propto \mathbb E_{x' \sim q}\!\left[ s_p(x')\,k(x', x) + \nabla_{x'} k(x', x) \right].$

In practice, we initialise a set of particles and perform a gradient step along this direction to obtain an updated set of samples, iterating until convergence.
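The following is a minimal SVGD sketch for a 1-D bimodal target (the target, initial distributions, step size and kernel bandwidth are all assumptions made for the demonstration, not the paper's experimental settings). It shows the sensitivity to initialisation discussed below: the fraction of particles that ends up in each mode largely mirrors the initial allocation rather than the target's mixing proportions.

```python
import numpy as np

rng = np.random.default_rng(2)

def target_score(x, pi1=0.5, mu1=0.0, mu2=12.0, sigma=1.0):
    """Score of a bimodal target in the spirit of Example 1 (illustrative parameters)."""
    n1 = pi1 * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    n2 = (1.0 - pi1) * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
    return (n1 * (-(x - mu1) / sigma**2) + n2 * (-(x - mu2) / sigma**2)) / (n1 + n2)

def svgd(particles, score_fn, n_steps=2000, step=0.05, h=1.0):
    """Run SVGD with a fixed-bandwidth Gaussian kernel on a 1-D particle set."""
    x = particles.copy()
    for _ in range(n_steps):
        d = x[:, None] - x[None, :]            # d[i, j] = x_i - x_j
        k = np.exp(-0.5 * d**2 / h**2)         # k(x_i, x_j)
        grad_k = -d / h**2 * k                 # gradient of k in its first argument
        # Optimal update direction at particle j, averaged over particles i.
        phi = (k * score_fn(x)[:, None] + grad_k).mean(axis=0)
        x = x + step * phi
    return x

for init_mean in [0.0, 6.0, 12.0]:
    init = rng.normal(loc=init_mean, scale=3.0, size=200)
    final = svgd(init, target_score)
    frac = np.mean(final > 6.0)
    print(f"initial mean {init_mean:4.1f}: fraction of particles in the second mode = "
          f"{frac:.2f} (target 0.50)")
```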

Since the optimisation is over particles rather than a parametric distribution, SVGD can be very flexible in approximating complicated posteriors. However, as the algorithm minimises the mode-seeking $\mathrm{KL}(q \,\|\, p)$, the particles can end up in a single component that is far away from the other components of the posterior. Initialising the particles from a broader distribution may help alleviate this problem, but it is still difficult to allocate the particles so as to avoid converging to the wrong mixing proportions.

To demonstrate the blindness of SVGD to mixing proportions, we run this algorithm to fit the bimodal $p$ in Example 1. The results are shown in Figure 2. The distribution of the particles after running SVGD is highly sensitive to the initial distribution, a Gaussian with mean $\mu_0$ and standard deviation $\sigma_0$. Even if the initial distribution is very wide with a large $\sigma_0$, the final distribution still depends strongly on $\mu_0$. Critically, in cases where $\mu_0$ is far away from the midpoint of the two modes, the particles find grossly incorrect mixing proportions.

Figure 2: Running SVGD to approximate the density $p$ in Example 1. The top three rows and the bottom three rows use two different values of the initial standard deviation $\sigma_0$. The red line is the target distribution $p$. The blue histogram is the distribution of particles at initialisation, drawn from a Gaussian with mean $\mu_0$ and standard deviation $\sigma_0$ (shown in the panel title). The orange histogram is the distribution of particles after running SVGD. The kernel used to define the SVGD update is Gaussian with unit bandwidth.

5 Discussion

We have demonstrated that three popular score-based methods fail to detect isolated components or to identify mixing proportions. It is worth noting that this happens when the different components are separated by regions of negligible probability mass, as in Example 1, but not when they overlap. Although the mixing proportions are explicitly defined for a mixture of Gaussians, one can interpret them as the probability masses of generic components, and the consequence of Lemma 1 extends to more general settings. In addition, the blindness is also caused by the expectation being taken only under the data distribution, which hides spurious isolated components in the model.

While score-based methods allow very efficient learning of flexible distributions, it is important to check whether the blindness harms performance in downstream applications. For example, SD involves an expectation of the score of the model under a different (data) distribution; this integral can be blind to isolated components in the model and give misleading test results. In SVGD, the score function is used to evaluate a local direction of particle movement, but it does not propose global moves across different isolated components, which can yield an incorrect identification of mixing proportions. On the other hand, for Bayesian neural networks, ignoring the trivial components caused by the symmetry of the parameter space should not adversely affect performance.

5.1 Estimating the gradient of entropy of implicit distributions

Interestingly, the blindness does not affect the application of the score function to entropy optimisation for implicit distributions, e.g. (Li and Turner, 2018; Shi et al., 2018). For a distribution $q_\theta$ defined by $x = g_\theta(\epsilon)$, $\epsilon \sim p_0$, where $p_0$ is some simple distribution and $g_\theta$ is a flexible function parametrised by $\theta$, the gradient of the entropy satisfies

$\nabla_\theta H(q_\theta) = -\,\mathbb E_{\epsilon \sim p_0}\!\left[ \nabla_x \log q_\theta(x)\big|_{x = g_\theta(\epsilon)}\, \nabla_\theta g_\theta(\epsilon) \right].$

There are two reasons why this application does not suffer from the blindness. First, samples from $q_\theta$ can be easily drawn from the implicit distribution; this is in contrast to using the score function to sample from an unnormalised model by HMC or Langevin dynamics. Second, the expectation above is an integral of a function of $q_\theta$ under the same density (through $g_\theta$), which cannot be blind to itself, unlike in SD where the expectation involves two different density functions $p$ and $q$.
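A minimal Monte-Carlo check of this identity is sketched below for a scale family where every quantity has a closed form (the example and names are mine; the cited estimators must additionally estimate the intractable score of the implicit distribution, which is omitted here).

```python
import numpy as np

rng = np.random.default_rng(3)

# Implicit distribution: x = g_theta(eps) = theta * eps with eps ~ N(0, 1),
# so q_theta = N(0, theta^2), H(q_theta) = log(theta) + 0.5 * log(2 * pi * e),
# and the exact entropy gradient is dH/dtheta = 1 / theta.
theta = 2.5
eps = rng.normal(size=200_000)
x = theta * eps                        # samples from q_theta

score_at_x = -x / theta**2             # d/dx log q_theta(x) for N(0, theta^2)
dg_dtheta = eps                        # d/dtheta of g_theta(eps) = theta * eps
grad_entropy_mc = -np.mean(score_at_x * dg_dtheta)

print(f"Monte-Carlo estimate: {grad_entropy_mc:.4f}   exact 1/theta: {1.0 / theta:.4f}")
```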

5.2 Possible solutions

We speculate that the blindness could be remedied by utilising global information about the unnormalised distribution. The most immediate idea is to consider annealing, inspired by Song and Ermon (2019). There is an interesting connection between annealed sampling using SM and variational inference using SVGD: the initial set of particles is drawn from an over-dispersed distribution relative to the target distribution. The annealing procedure used in score-based sampling ensures that all components are covered with the correct proportions of particles, and this global information is inherited across each decrement of the noise level. This suggests that SVGD could also benefit from an annealing procedure.

Other than adding Gaussian noise, one can consider adding an inverse temperature variable $\beta$ to the log density of the model,

$\log \tilde p_\beta(x) := \beta \log \tilde p(x).$

Though attractive at first glance, this parametrisation cannot be easily applied to the three scenarios above. For SM and its application to sampling, a small $\beta$ is similar to introducing a large amount of noise, and $\beta$ close to 1 corresponds to no noise. Thus, the temperature-adjusted model provides a principled parametrisation, or inductive bias, for consistent interpolation between the high- and low-noise regimes. However, it is not straightforward to obtain samples from the model at higher temperatures. In particular, introducing additive Gaussian noise is not equivalent to increasing the temperature. For SD, one can run Langevin dynamics on the model at a low $\beta$ with particles initialised at data samples, and then anneal the samples back to $\beta = 1$. If the data samples before and after annealing agree, then we can tell that the model is close to the data distribution. However, this complicated sampling procedure defeats the purpose of SD tests, which is to avoid sampling from the unnormalised model. For SVGD, a smaller $\beta$ scales down the length of the update vector for each particle, so an adaptive step-size schedule is required. Further, the particles may need to be perturbed by noise to fully "mix" during annealing. However, preliminary experiments suggest that these heuristics are not sufficient to address the issue in full.

Apart from increasing the dispersion of the data, we also consider other ways to incorporate global information about the unnormalised distribution. For SVGD, samples from the generative model could be used to inform the movement of the particles; this may eventually constitute some sort of wake-sleep procedure. For SM, if one can estimate some low-order moments of the model distribution and compare them with the estimated moments of the data distribution, then the presence of isolated components or incorrect mixing proportions will cause a substantial discrepancy between the two.

Alternatively, one can introduce the following loss in addition to the Fisher divergence when fitting the model by SM,

$\ell(x_i, x_j) = \lambda \left( \left[ \log \tilde p_m(x_i) - \log \tilde p_m(x_j) \right] - \left[ \log p_0(x_i) - \log p_0(x_j) \right] \right)^2,$

where $x_i$ and $x_j$ are samples from the dataset, $\lambda$ is a weighting term, and $p_0$ is a simple model of the data that preserves probability-mass information, such as a kernel density estimator or a normalising flow fit by maximum likelihood. The loss $\ell$ tries to match the log-density ratio between (a subset of) all pairs of data samples measured under the simpler maximum-likelihood model and the more complex energy-based model. Note that the log-density ratio of the model is also independent of the normaliser. Clearly, if $x_i$ and $x_j$ are sampled from two different components, then this term will be large if the model does not estimate the mixing proportions correctly. However, it is difficult to tune the strength $\lambda$: if it is too small, it will have little effect; if too large, it will bias the model towards the potentially poor fit of the simple model. Also, this cost is yet another expectation under the data distribution, and will not help eliminate isolated components in the model that are not covered by the data.
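Since the loss above is only described informally, the sketch below implements one plausible reading of it (pairwise log-ratio matching against a Gaussian kernel density estimate; the KDE bandwidth, the number of sampled pairs and all densities are illustrative assumptions, and the weighting $\lambda$ is omitted). A model mixture with the wrong mixing proportion incurs a clearly larger loss than one with the correct proportion, even though, by Proposition 1, both have nearly the same Fisher divergence to the data.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_kde(x_eval, data, bandwidth=0.5):
    """Log density of a Gaussian kernel density estimate fit to the data."""
    d = x_eval[:, None] - data[None, :]
    log_kernels = -0.5 * (d / bandwidth) ** 2 - np.log(bandwidth * np.sqrt(2.0 * np.pi))
    return np.log(np.mean(np.exp(log_kernels), axis=1))

def log_mixture_unnorm(x, pi1, mu1=0.0, mu2=12.0, sigma=1.0):
    """Unnormalised log density of the model; the normaliser cancels in ratios."""
    n1 = pi1 * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
    n2 = (1.0 - pi1) * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
    return np.log(n1 + n2)

def ratio_matching_loss(log_p_model, log_p_simple, n_pairs=2000):
    """Mean squared mismatch of pairwise log-density ratios over random pairs."""
    n = len(log_p_model)
    i, j = rng.integers(0, n, size=n_pairs), rng.integers(0, n, size=n_pairs)
    return np.mean(((log_p_model[i] - log_p_model[j])
                    - (log_p_simple[i] - log_p_simple[j])) ** 2)

# Bimodal data with true proportions 0.7 / 0.3, components at 0 and 12.
data = np.where(rng.random(2000) < 0.7, rng.normal(0.0, 1.0, 2000), rng.normal(12.0, 1.0, 2000))
log_p0 = log_kde(data, data)           # simple mass-preserving reference model

for pi1_model in [0.7, 0.3]:
    loss = ratio_matching_loss(log_mixture_unnorm(data, pi1_model), log_p0)
    print(f"model pi1 = {pi1_model:.1f}: pairwise log-ratio loss = {loss:.3f}")
```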

Acknowledgement

I thank Maneesh Sahani and Arthur Gretton for useful discussions. This research is funded by the Gatsby Charitable Foundation.