1 Introduction and background
Define the score function of a random variable $X \sim q$, or of its density $q$, as $s_q(x) := \nabla_x \log q(x)$ (assuming it exists), where $x$ belongs to some suitable set $\mathcal{X}$. Note that this is different from the usual definition of the score function, where the derivative is taken with respect to the model parameters. Here, we illustrate that some theoretically guaranteed algorithms based on the score function can fail catastrophically in practice when the unnormalised distribution has isolated components (substantial probability mass separated by regions with negligible mass). For simplicity and visualisation, we consider the following 1D distributions throughout:

Example 1 (Gaussian mixtures).
Define the following density functions on $\mathbb{R}$:
$$ q(x) = \mathcal{N}(x; 0, 1), \qquad p(x) = \pi_1\, \mathcal{N}(x; \mu, 1) + \pi_2\, \mathcal{N}(x; 0, 1), $$
where $\mathcal{N}(x; m, \sigma^2)$ denotes a Gaussian density with mean $m$ and variance $\sigma^2$, and $\pi_1, \pi_2 > 0$ are mixing proportions satisfying $\pi_1 + \pi_2 = 1$. Without loss of generality, we assume that $\mu < 0$. Examples of densities $p$ and $q$ are shown in Figure 1.
When $\mu$ has sufficiently large magnitude (with a negative sign), density $p$ has isolated components. First, we show the following property of $s_p$.
Lemma 1 (Weak dependence of score on mixing proportions).
For the $p$ and $q$ defined in Example 1, the dependence of $s_p(x)$ on the mixing proportion $\pi_1$ can be made arbitrarily weak, for $x$ away from $\mu/2$, by taking $\mu \to -\infty$.
Proof.
The score function is
$$ s_p(x) = \frac{\pi_1\, \mathcal{N}(x; \mu, 1)\,(\mu - x) + \pi_2\, \mathcal{N}(x; 0, 1)\,(-x)}{\pi_1\, \mathcal{N}(x; \mu, 1) + \pi_2\, \mathcal{N}(x; 0, 1)}. $$
As $\mu \to -\infty$, we can see that
$$ s_p(x) \to \begin{cases} \mu - x, & x < \mu/2, \\ -x, & x > \mu/2, \end{cases} $$
with neither limit depending on $\pi_1$ or $\pi_2$.
∎
Perhaps surprisingly, when $x$ is far away from the boundary point $\mu/2$, the score function becomes roughly independent of the mixing proportions. This is illustrated in Figure 1 (top).
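As a quick numerical check of this weak dependence (our own sketch; the helper `mixture_score` and the particular values of $\mu$ and $\pi_1$ below are illustrative choices, not from the original experiments), the mixture score can be evaluated stably in log-space:

```python
import numpy as np

def mixture_score(x, mu, pi1):
    """Score d/dx log p(x) for p(x) = pi1*N(x; mu, 1) + (1 - pi1)*N(x; 0, 1)."""
    # Log of each weighted component (the shared 1/sqrt(2*pi) factor cancels).
    log_a = np.log(pi1) - 0.5 * (x - mu) ** 2
    log_b = np.log(1 - pi1) - 0.5 * x ** 2
    # Posterior responsibility of the component at mu, via a log-sum-exp trick.
    m = np.maximum(log_a, log_b)
    wa = np.exp(log_a - m) / (np.exp(log_a - m) + np.exp(log_b - m))
    # The mixture score is the responsibility-weighted sum of component scores.
    return wa * (mu - x) + (1 - wa) * (-x)

# At a point near the component at 0, the score forgets pi1 as mu -> -infinity.
for mu in [-2.0, -5.0, -20.0]:
    gap = abs(mixture_score(1.0, mu, 0.1) - mixture_score(1.0, mu, 0.9))
    print(f"mu = {mu:6.1f}: |score(pi1=0.1) - score(pi1=0.9)| = {gap:.2e}")
```

For $\mu = -20$ the two scores agree to machine precision at $x = 1$, even though the two mixtures assign wildly different masses to the left component.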
In the next three sections, we discuss the consequences of Lemma 1 in three popular score-based methods:

- score matching (SM) (Hyvärinen, 2005), which fits unnormalised energy-based models of data by minimising the Fisher divergence;
- (kernel) Stein discrepancy (SD) (Chwialkowski et al., 2016; Liu et al., 2016) for a goodness-of-fit test between an unnormalised model and samples drawn from an unknown distribution;
- Stein variational gradient descent (SVGD) (Liu and Wang, 2016), which deterministically moves a set of particles to approximate the posterior distribution in variational inference.
2 Score matching
Consider the problem of fitting an unnormalised density model $p_\theta(x) \propto \tilde{p}_\theta(x)$ on data drawn from $q(x)$. Maximum-likelihood learning for unnormalised models is difficult due to the intractable normaliser $Z(\theta) = \int \tilde{p}_\theta(x)\,dx$. Instead, Hyvärinen (2005) proposed to train the model by minimising the Fisher divergence. For densities on $\mathbb{R}$, this loss is
$$ F(q \,\|\, p) = \mathbb{E}_{x \sim q}\!\left[ \tfrac{1}{2}\, \left( s_q(x) - s_p(x) \right)^2 \right]. $$
Note that $s_p$ is independent of the log normaliser $\log Z(\theta)$. Under mild conditions, the divergence can be estimated from data up to a constant that does not depend on the model $p$, and the model can be optimised using samples from $q$.
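As a concrete instance of this estimator (our own minimal sketch, not the paper's code), for a Gaussian model $\mathcal{N}(0, \sigma^2)$ the sample-based objective $\frac{1}{n}\sum_i \left[\frac{1}{2} s_\sigma(x_i)^2 + s_\sigma'(x_i)\right]$ can be evaluated and minimised with no knowledge of the normaliser:

```python
import numpy as np

def sm_objective(x, sigma):
    """Hyvarinen's sample-based score matching objective for N(0, sigma^2).

    For this model, s(x) = -x / sigma^2 and s'(x) = -1 / sigma^2, so the
    objective E_q[0.5 * s(x)^2 + s'(x)] involves no normalising constant.
    """
    s = -x / sigma ** 2           # model score evaluated at the samples
    ds = -1.0 / sigma ** 2        # derivative of the model score
    return np.mean(0.5 * s ** 2 + ds)

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # data from q = N(0, 1)

for sigma in [0.5, 1.0, 2.0]:
    print(f"sigma = {sigma}: J = {sm_objective(x, sigma):+.4f}")
```

The objective is smallest at the true scale $\sigma = 1$, consistent with Fisher-divergence minimisation up to a model-independent constant.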
Proposition 1 (Blindness to mixing proportions).
Given $p(x)$ in Example 1 and another density $p'$ that is identical to $p$ except that $p'(x)$ has a different mixing proportion $\pi_1' \neq \pi_1$, then $F(p \,\|\, p') \to 0$ as $\mu \to -\infty$ regardless of the mixing proportions.
Proof.
By Lemma 1, $s_p$ and $s_{p'}$ converge to the same limit on the support where $x$ is away from $\mu/2$. Also, $p$ has negligible mass around $\mu/2$, where the scores differ. ∎
If we take $p'$ as the model distribution and $p$ as the data distribution, then when the (empirical) Fisher divergence is minimised, one can obtain an almost equally good solution by changing the mixing proportions in $p'$.
Proposition 2 (Blindness to isolated components in the model).
The Fisher divergence $F(q \,\|\, p)$ for $p$ and $q$ in Example 1 is close to zero regardless of the mixing proportion $\pi_1$.
Proof.
We sketch the main idea. The Fisher divergence is an expectation under $q$. Note that $q$ has negligible mass for $x < \mu/2$ as $\mu \to -\infty$. Using Lemma 1, it can be shown that on the domain where $q$ has most of its mass, $s_p(x) \to -x = s_q(x)$. Therefore, $F(q \,\|\, p) \to 0$. ∎
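A small quadrature check of this blindness (our own sketch; the grid limits and parameter values are arbitrary illustrative choices):

```python
import numpy as np

def mixture_score(x, mu, pi1):
    """Score of p(x) = pi1*N(x; mu, 1) + (1 - pi1)*N(x; 0, 1), log-space stable."""
    log_a = np.log(pi1) - 0.5 * (x - mu) ** 2
    log_b = np.log(1 - pi1) - 0.5 * x ** 2
    m = np.maximum(log_a, log_b)
    wa = np.exp(log_a - m) / (np.exp(log_a - m) + np.exp(log_b - m))
    return wa * (mu - x) + (1 - wa) * (-x)

def fisher_divergence(mu, pi1, lo=-10.0, hi=10.0, n=4001):
    """F(q || p) = E_q[(s_q - s_p)^2 / 2] by simple Riemann quadrature.

    q = N(0, 1) is the data density; the grid covers where q has its mass.
    """
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    q = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    integrand = 0.5 * q * (-x - mixture_score(x, mu, pi1)) ** 2
    return integrand.sum() * dx

# With a far-away component (mu = -30) the divergence ignores pi1 entirely;
# with an overlapping component (mu = -2) it depends on pi1 as expected.
for mu in [-30.0, -2.0]:
    vals = [fisher_divergence(mu, pi1) for pi1 in (0.1, 0.5, 0.9)]
    print(mu, [f"{v:.3e}" for v in vals])
```

The divergence at $\mu = -30$ is numerically zero for every $\pi_1$, while at $\mu = -2$ it varies visibly with the proportions.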
Taking $p$ as the model density and $q$ as the data density, we see that, as long as part of the model fits the data well, the model can have an arbitrary number of isolated (spurious) components supported far away from the data points while the loss is unaffected. The causes are twofold:

- the score function of the model converges to that of $q$ for $x$ away from the isolated components (Lemma 1);
- the Fisher divergence is an expectation only over $q$.
A visualisation of these results is shown in Figure 1 (top). As such, applications that depend on the learned (log-)density of the trained model, such as sampling by Hamiltonian Monte Carlo, can produce bad results. The issue of isolated components implied by Proposition 2 can be partially alleviated by controlling the tail behaviour of the model (Arbel and Gretton, 2018; Wenliang et al., 2019), but the issue implied by Proposition 1 is more difficult to address in general.
Such issues do not arise in maximum-likelihood estimation, since isolated components or incorrect mixing proportions have a huge impact on the normalised density. Thus, while optimising the score function avoids the intractable normaliser, it also loses any control over the normalised density.
2.1 Training with noisy data for sampling
Despite the issues above, Song and Ermon (2019) demonstrated a successful approach to training energy-based models for generating samples. They adopted the denoising formulation of the score matching objective (Vincent, 2011) and trained a model to produce the score function directly instead of the density. Further, they proposed to add isotropic Gaussian noise to data samples, effectively smoothing the data distribution with a Gaussian density function. During training, the data distribution $q$ is transformed into multiple noisy versions $q_{\sigma_1}, \dots, q_{\sigma_L}$ by adding Gaussian noise with standard deviations $\sigma_1 > \sigma_2 > \dots > \sigma_L$. The model is trained to estimate the score function on data from all $q_{\sigma_i}$, taking $\sigma_i$ as an input in addition to $x$. During sampling, a set of randomly initialised particles is updated by Langevin dynamics on successive $q_{\sigma_i}$'s, with $\sigma_i$ decreasing from $\sigma_1$ to $\sigma_L$.

In this procedure, the noise essentially bridges isolated components together. The density corresponding to the largest noise level $\sigma_1$ does not contain isolated components. SM can thus correctly estimate this noisy distribution (or its score), but it is unlikely that the model can learn the correct mixing proportions for the smallest $\sigma_L$. Nonetheless, adding noise does not alter the mixing proportions in the data distribution, and the score function is correct locally at each component with respect to the noisy distribution for all noise levels. Therefore, running Langevin dynamics according to the high-noise score function ensures that the particles are initially dispersed according to the correct mixing proportions in the data, and annealing then pushes the particles to follow the score of the low-noise data distribution.
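A toy illustration of why annealing preserves mixing proportions (our own sketch: the bimodal target, noise schedule, step-size rule, and all constants are illustrative assumptions, not the original experimental setup). For a Gaussian mixture, the noisy scores are available in closed form, so annealed Langevin dynamics can be run exactly:

```python
import numpy as np

def noisy_score(x, sigma, mus=(-4.0, 4.0), pis=(0.7, 0.3)):
    """Score of the data mixture convolved with N(0, sigma^2).

    Convolving pi_k * N(x; m_k, 1) with Gaussian noise gives
    pi_k * N(x; m_k, 1 + sigma^2), so the noisy score is closed-form here.
    """
    var = 1.0 + sigma ** 2
    logw = np.stack([np.log(p) - 0.5 * (x - m) ** 2 / var for m, p in zip(mus, pis)])
    logw -= logw.max(axis=0)                      # stabilise the softmax
    w = np.exp(logw) / np.exp(logw).sum(axis=0)   # component responsibilities
    comp_scores = np.stack([(m - x) / var for m in mus])
    return (w * comp_scores).sum(axis=0)

rng = np.random.default_rng(0)
x = 5.0 * rng.standard_normal(500)                # dispersed initial particles
sigmas = np.geomspace(5.0, 0.1, 10)               # decreasing noise levels
eps = 2e-3
for sigma in sigmas:
    alpha = eps * sigma ** 2 / sigmas[-1] ** 2    # larger steps at higher noise
    for _ in range(100):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * alpha * noisy_score(x, sigma) + np.sqrt(alpha) * noise

print("fraction of particles in the left mode:", (x < 0).mean())
```

Despite the modes being far apart at the final noise level, the left-mode fraction stays near the data proportion (0.7 here), because the allocation was fixed while the noise still bridged the components.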
It is worth noting that this model does not estimate the (unnormalised) density but only estimates its score function given a noise level. Although the score function is sufficient for certain applications, such as mode finding and Langevin sampling, it is not sufficient for others that additionally require the log density itself, such as Hamiltonian Monte Carlo.
3 Stein discrepancy
The Stein discrepancy (SD) can be used for a goodness-of-fit test that measures how well an unnormalised model describes samples from an unknown density (Liu et al., 2016; Chwialkowski et al., 2016). One application of SD, among many others, is to benchmark approximate inference methods applied to a latent-variable model described by the joint distribution $p(x, z)$, where $z$ is latent. The exact posterior $p(z \mid x)$ is usually intractable, but the unnormalised posterior $p(x, z)$ is trivial. Also, popular inference methods produce sample representations of the posterior. As such, we can compare how well the posterior samples approximate $p(z \mid x)$ for each $x$. However, as we argue below, this application may fail due to the blindness associated with the score function.

SD is based on the Stein operator for $p$, given by
$$ (\mathcal{A}_p f)(x) = f(x)\, s_p(x) + \nabla_x f(x), $$
where $f$ is in the Stein class of $p$, meaning that it is smooth and satisfies
$$ \lim_{|x| \to \infty} f(x)\, p(x) = 0; $$
see (Chwialkowski et al., 2016; Liu et al., 2016) for details. Under this condition, it is easy to show that $\mathbb{E}_{x \sim q}[(\mathcal{A}_p f)(x)] = 0$ for all such $f$ if and only if $p = q$. This motivates the following definition of the SD between $p$ and $q$:
$$ S(q, p) = \sup_{f \in \mathcal{F}}\, \mathbb{E}_{x \sim q}[(\mathcal{A}_p f)(x)], \qquad (1) $$
where $\mathcal{F}$ is some function class.
Proposition 3 (SD says $p \approx q$).
For $p$ and $q$ in Example 1, their Stein discrepancy $S(q, p) \to 0$ as $\mu \to -\infty$, where $\mathcal{F} = \{ f : \mathbb{E}_{x \sim q}[f(x)^2] \leq 1 \}$.
Proof.
We again provide a sketch. For $\mathcal{F} = \{ f : \mathbb{E}_{x \sim q}[f(x)^2] \leq 1 \}$, we solve the Lagrangian
$$ \max_f\; \mathbb{E}_{x \sim q}[(\mathcal{A}_p f)(x)] - \lambda \left( \mathbb{E}_{x \sim q}[f(x)^2] - 1 \right). $$
One can check that the solution is
$$ f^*(x) \propto s_p(x) - s_q(x). $$
Denote the constant of proportionality by $c$; then for the $p$ and $q$ in Example 1, as $\mu \to -\infty$,
$$ f^*(x) = c\,\left( s_p(x) - s_q(x) \right) \to 0 \quad \text{for } x > \mu/2 $$
by Lemma 1. Since $q$ has negligible mass for $x < \mu/2$, the SD
$$ S(q, p) = \sqrt{\mathbb{E}_{x \sim q}\!\left[ \left( s_p(x) - s_q(x) \right)^2 \right]} \to 0. $$
Similarly, the solution for the model $p'$, identical to $p$ except for a different mixing proportion $\pi_1'$, is
$$ f^*(x) \propto s_{p'}(x) - s_q(x). $$
Denoting the constant of proportionality by $c'$, as $\mu \to -\infty$,
$$ S(q, p') = \sqrt{\mathbb{E}_{x \sim q}\!\left[ \left( s_{p'}(x) - s_q(x) \right)^2 \right]}. $$
For $x > \mu/2$, we know from Lemma 1 that $s_{p'}(x) \to -x = s_q(x)$ as $\mu \to -\infty$, so $f^* \to 0$. Thus, the SD $S(q, p') \to 0$. ∎
As shown in Figure 1 (bottom), the best $f^*$ is almost zero for $x > \mu/2$, regardless of whether the model is $p$ or $p'$. Since the SD is an expectation under $q$, which has most of its mass for $x > \mu/2$, it fails to detect the differences. Intuitively, by inspection of (1), we can make $f$ scale as $1/q(x)$ to counteract the vanishing tail of $q$ and expose the second component to SD. However, such functions may not be in $\mathcal{F}$ or in the Stein class of $p$.
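This failure is easy to reproduce with a minimal V-statistic estimate of the kernel Stein discrepancy (our own sketch with an RBF kernel; the bandwidth and parameter values are arbitrary choices):

```python
import numpy as np

def mixture_score(x, mu, pi1):
    """Score of the model p(x) = pi1*N(x; mu, 1) + (1 - pi1)*N(x; 0, 1)."""
    log_a = np.log(pi1) - 0.5 * (x - mu) ** 2
    log_b = np.log(1 - pi1) - 0.5 * x ** 2
    m = np.maximum(log_a, log_b)
    wa = np.exp(log_a - m) / (np.exp(log_a - m) + np.exp(log_b - m))
    return wa * (mu - x) + (1 - wa) * (-x)

def ksd2(x, mu, pi1, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy.

    Uses the RBF kernel k(a, b) = exp(-(a - b)^2 / (2 h^2)) and the standard
    Stein kernel u_p(a, b) = s(a)s(b)k + s(a) dk/db + s(b) dk/da + d2k/dadb.
    """
    s = mixture_score(x, mu, pi1)
    d = x[:, None] - x[None, :]
    k = np.exp(-0.5 * d ** 2 / h ** 2)
    dk_da = -d / h ** 2 * k                      # derivative in the first argument
    dk_db = d / h ** 2 * k                       # derivative in the second argument
    dk2 = (1.0 / h ** 2 - d ** 2 / h ** 4) * k   # mixed second derivative
    u = s[:, None] * s[None, :] * k + s[:, None] * dk_db + s[None, :] * dk_da + dk2
    return u.mean()

rng = np.random.default_rng(0)
x = rng.standard_normal(500)          # samples from the data density q = N(0, 1)

# A far-away spurious component is invisible, whatever its proportion ...
print(ksd2(x, mu=-20.0, pi1=0.1), ksd2(x, mu=-20.0, pi1=0.9))
# ... while an overlapping misfit component is detected.
print(ksd2(x, mu=-2.0, pi1=0.9))
```

The two far-component models produce numerically identical discrepancies near zero, while the overlapping misfit is clearly flagged.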
Therefore, SD cannot detect isolated components in the model that are far away from the samples of $q$, even when those components have high mixing proportions. One may hope that, by considering the variance of the estimated SD, goodness-of-fit tests can be made robust to isolated components. However, this is not the case, since the two $f^*$'s above are essentially zero w.r.t. $q$ and cannot induce a large variance. If $\mathcal{F}$ is sufficiently unsmooth to scale up the vanishing tail of $q$, then the variance can be made large, making the test more robust.

With this issue in mind, we return to the application of benchmarking approximate inference mentioned at the beginning of the section. A flexible generative model can induce a complicated posterior that has many isolated components. If the approximate posterior only covers one of the components, especially in variational inference where the objective has mode-seeking behaviour (e.g. $\mathrm{KL}(q \,\|\, p)$), SD may report good performance while being blind to the components of the posterior that are not covered by the samples.
4 Stein variational gradient descent (SVGD)
We now consider the impact of Lemma 1 on SVGD (Liu and Wang, 2016). Here, the goal is to find a sample representation of a distribution $\rho$ that approximates an unnormalised distribution $p$. The main idea is to find a direction $\phi$ along which to push a set of particles so as to minimise $\mathrm{KL}(\rho \,\|\, p)$. For $\phi$ defined by a function in the reproducing kernel Hilbert space associated with a kernel $k$, the optimal direction evaluated at $x$ is given by the kernel Stein operator
$$ \phi^*(x) \propto \mathbb{E}_{x' \sim \rho}\!\left[ k(x', x)\, s_p(x') + \nabla_{x'} k(x', x) \right]. $$
In practice, we initialise a set of particles $\{x_i\}_{i=1}^n$ and perform a gradient step along this direction to obtain an updated set of samples $x_i \leftarrow x_i + \epsilon\, \phi^*(x_i)$.
Since the optimisation is over particles rather than a parametric distribution, SVGD can be very flexible in approximating complicated posteriors. However, as the algorithm minimises the mode-seeking $\mathrm{KL}(\rho \,\|\, p)$, the particles can end up in a single component that is far away from other components of the posterior. Initialising particles from a broader distribution may help alleviate this problem, but it is still difficult to allocate the particles so as to avoid converging to the wrong mixing proportions.
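A minimal SVGD sketch (ours; the symmetric two-mode target, fixed kernel bandwidth, and step size are illustrative choices) makes this sensitivity to initialisation concrete:

```python
import numpy as np

def target_score(x):
    """Score of p(x) = 0.5*N(x; -4, 1) + 0.5*N(x; 4, 1)."""
    log_a = -0.5 * (x + 4.0) ** 2
    log_b = -0.5 * (x - 4.0) ** 2
    m = np.maximum(log_a, log_b)
    wa = np.exp(log_a - m) / (np.exp(log_a - m) + np.exp(log_b - m))
    return wa * (-4.0 - x) + (1 - wa) * (4.0 - x)

def svgd(x0, n_steps=1000, lr=0.05, h=1.0):
    """Plain SVGD with a fixed-bandwidth RBF kernel (no adaptive step size)."""
    x = x0.copy()
    for _ in range(n_steps):
        d = x[:, None] - x[None, :]
        k = np.exp(-0.5 * d ** 2 / h ** 2)       # kernel matrix k(x_j, x_i)
        grad_k = -d / h ** 2 * k                 # gradient of k in its first argument
        # phi(x_i) = mean_j [ k(x_j, x_i) s(x_j) + d/dx_j k(x_j, x_i) ]
        phi = (k * target_score(x)[:, None]).mean(axis=0) + grad_k.mean(axis=0)
        x = x + lr * phi
    return x

rng = np.random.default_rng(0)
fracs = {}
for m0 in (-4.0, 0.0, 4.0):
    xs = svgd(m0 + rng.standard_normal(100))
    fracs[m0] = (xs < 0).mean()
    print(f"init mean {m0:+.0f}: fraction of particles in left mode = {fracs[m0]:.2f}")
```

Even though the target assigns equal mass to both modes, the final allocation of particles tracks the initial mean rather than the true proportions.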
To demonstrate the blindness of SVGD to mixing proportions, we run this algorithm to fit the bimodal $p$ in Example 1. The results are shown in Figure 2. The distribution of the particles after running SVGD is highly sensitive to the initial distribution $\rho_0$, a Gaussian distribution with mean $m_0$ and standard deviation $\sigma_0$. Even if the initial distribution is very wide with a large $\sigma_0$, the final distribution still depends strongly on $m_0$. Critically, in cases where $m_0$ is far away from the midpoint of the two modes, the particles find grossly incorrect mixing proportions.

5 Discussion
We have demonstrated that three popular score-based methods fail to detect isolated components or to identify mixing proportions. It is worth noting that this happens when the different components are separated by regions of negligible probability mass, as in Example 1, but not when they overlap. Although the mixing proportions are explicitly defined for a mixture of Gaussian distributions, one can interpret them as the probability masses of generic components, and the consequence of Lemma 1 extends to more general settings. In addition, the blindness is also caused by the expectation being taken only under the data distribution, hiding spurious isolated components in the model.
While score-based methods allow very efficient learning of flexible distributions, it is important to check whether the blindness harms performance in downstream applications. For example, SD involves an expectation of the score of $p$ under a different (data) distribution $q$; this integral can be blind to isolated components in $p$ and give misleading test results. In SVGD, the score function is used to evaluate a local direction of particle movement, but it does not propose global moves across different isolated components, which can yield incorrect mixing proportions. On the other hand, for Bayesian neural networks, ignoring the trivial components caused by the symmetry of the parameter space should not adversely affect performance.
5.1 Estimating the gradient of entropy of implicit distributions
Interestingly, the blindness does not affect the application of the score function to entropy optimisation for implicit distributions, e.g. (Li and Turner, 2018; Shi et al., 2018). For a distribution $q_\theta$ defined by $x = g_\theta(z)$, $z \sim p_0(z)$, where $p_0$ is some simple distribution and $g_\theta$ is a flexible function parametrised by $\theta$, the gradient of the entropy $H(q_\theta)$ satisfies
$$ \nabla_\theta H(q_\theta) = -\,\mathbb{E}_{z \sim p_0}\!\left[ \nabla_\theta g_\theta(z)^\top s_{q_\theta}\!\left( g_\theta(z) \right) \right]. $$
There are two reasons why this application does not suffer from the blindness. First, samples from $q_\theta$ can be drawn easily from the implicit distribution; this is in contrast to using the score function to sample from $p$ by HMC or Langevin dynamics. Second, the expectation above is an integral of a function of $q_\theta$ under the same density (through $g_\theta$), which cannot be blind to itself, unlike in SD where the expectation involves two different density functions $p$ and $q$.
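A quick check of this identity on a tractable implicit model (our own construction; in practice $s_{q_\theta}$ would itself have to be estimated, e.g. by the spectral or kernel score estimators cited above): take $g_\theta(z) = \theta z$ with $z \sim \mathcal{N}(0, 1)$, so $H(q_\theta) = \frac{1}{2}\log(2\pi e\,\theta^2)$ and $\nabla_\theta H = 1/\theta$.

```python
import numpy as np

theta = 2.0
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)

# Implicit sampler x = g_theta(z) = theta * z, so q_theta = N(0, theta^2).
x = theta * z
score = -x / theta ** 2          # exact score of q_theta (estimated in practice)
dg_dtheta = z                    # pathwise derivative of g_theta(z) w.r.t. theta

# Gradient estimator: dH/dtheta = -E_z[ dg/dtheta * score(g_theta(z)) ]
grad_est = -(score * dg_dtheta).mean()
print(grad_est, "vs analytic", 1.0 / theta)
```

The Monte Carlo estimate matches the analytic gradient $1/\theta = 0.5$ to within sampling error.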
5.2 Possible solutions
We speculate that the blindness could be remedied by utilising global information about the unnormalised distribution. The most immediate idea is to consider annealing, as inspired by Song and Ermon (2019). There is an interesting connection between annealed sampling using SM and variational inference using SVGD: in both, the initial set of particles is drawn from an overdispersed distribution relative to the target distribution. The annealing procedure used in score-based sampling ensures that all components are covered with the correct proportions of particles, and this global information is inherited across each decrement of the noise level. This suggests that SVGD could also benefit from an annealing procedure.
Other than adding Gaussian noise, one can consider adding an inverse temperature variable $\beta$ to the log density of the model:
$$ \log p_\beta(x) = \beta \log \tilde{p}(x) - \log Z(\beta). $$
Though attractive at first glance, this parametrisation cannot be easily applied to the three scenarios above. For SM and its application to sampling, a small $\beta$ is similar to introducing a large amount of noise, and a $\beta$ close to 1 corresponds to no noise. Thus, the temperature-adjusted model $p_\beta$ provides a principled parametrisation, or inductive bias, for consistent interpolation between the high- and low-noise regimes. However, it is not straightforward to obtain samples from $p_\beta$ at higher temperatures. In particular, introducing additive Gaussian noise is not equivalent to increasing the temperature. For SD, one can run Langevin dynamics on the model at a low $\beta$ with particles initialised as data samples from $q$. The samples are then annealed back to $\beta = 1$. If the data samples before and after annealing agree, then we can tell that $p$ is close to $q$. However, this complicated sampling procedure defeats the purpose of SD tests, which is to avoid sampling from the unnormalised $p$. For SVGD, a smaller $\beta$ scales down the length of the update vector for each particle, so an adaptive step-size schedule is required. Further, the particles may need to be perturbed by noise to fully "mix" during annealing. However, preliminary experiments suggest that these heuristics are not sufficient to address this issue in full.
Apart from increasing the dispersion of data, we also consider other ways to incorporate global information about the unnormalised distribution. For SVGD, samples from the generative model could be used to inform the movement of the particles. This may eventually constitute some sort of wake-sleep procedure. For SM, if one can estimate some low-order moments of the model distribution $p$ and compare them with the estimated moments of the data distribution $q$, then the presence of isolated components or incorrect mixing proportions will cause a substantial discrepancy between the two.

Alternatively, one can introduce the following loss in addition to the Fisher divergence when fitting the model by SM:
$$ \ell = \lambda \sum_{(i, j)} \left( \log \frac{p_\theta(x_i)}{p_\theta(x_j)} - \log \frac{\hat{p}(x_i)}{\hat{p}(x_j)} \right)^2, $$
where $x_i$ is a sample from the dataset, $\lambda$ is a weighting term, and $\hat{p}$ is a simple model of the data that preserves probability-mass information, such as a kernel density estimator or a normalising flow fit by maximum likelihood. The loss $\ell$ tries to match the log-density ratio between (a subset of) all pairs of data samples measured under a simpler maximum-likelihood model and the more complex energy-based model. Note that the log-density ratio of the model is also independent of the normaliser. Clearly, if $x_i$ and $x_j$ are sampled from two different components, then this term will be large if $p_\theta$ does not estimate the mixing proportions correctly. However, it is difficult to tune the strength $\lambda$: if it is too small, it will have little effect; if too large, it will bias $p_\theta$ towards the potentially bad fit of the simple model $\hat{p}$. Also, this cost is yet another expectation under the data distribution, and will not help eliminate isolated components in the model that are not covered by data.

Acknowledgement
I thank Maneesh Sahani and Arthur Gretton for useful discussions. This research is funded by the Gatsby Charitable Foundation.