Gradient conjugate priors and deep neural networks

02/07/2018 · by Pavel Gurevich, et al.

The paper deals with learning the probability distribution of observed data by artificial neural networks. We suggest a so-called gradient conjugate prior (GCP) update appropriate for neural networks, which is a modification of the classical Bayesian update for conjugate priors. We establish a connection between the gradient conjugate prior update and the maximization of the log-likelihood of the predictive distribution. Unlike in Bayesian neural networks, we do not impose a prior on the weights of the networks, but rather assume that the ground truth distribution is normal with unknown mean and variance, and we let neural networks learn the parameters of a prior (the normal-gamma distribution) for these unknown mean and variance. The parameters are updated using a gradient that, at each step, points toward minimizing the Kullback–Leibler divergence from the prior to the posterior distribution (both being normal-gamma). We obtain a corresponding dynamical system for the prior's parameters and analyze its properties. In particular, we study the limiting behavior of all the prior's parameters and show how it differs from the case of the classical full Bayesian update. The results are validated on synthetic and real-world data sets.

1 Introduction

Reconstructing probability distributions of observed data by artificial neural networks is one of the most essential parts of machine learning and artificial intelligence [3, 32]. Learning probability distributions not only allows one to predict the behavior of a system under consideration, but also to quantify the uncertainty with which the predictions are made. Under the assumption that the data are normally distributed, the most well-studied way of reconstructing probability distributions is the Bayesian learning of neural networks [30]. One treats the weights of the network as normally distributed random variables, prescribes their prior distribution, and then finds the posterior distribution conditioned on the data. The main difficulty is that neither the posterior nor the resulting predictive distribution is given in closed form. As a result, different approximation methods have been developed [35, 16, 14, 45, 4, 20, 7, 12, 6, 25, 13, 27, 24, 28]. However, many of them have drawbacks related to the lack of scalability in data size or neural network complexity, and they remain a field of ongoing research. Furthermore, Bayesian neural networks often assume homoscedastic variance in the likelihood (i.e., the same for all samples) and rather learn uncertainty due to lack of data (epistemic uncertainty). Among other methods for uncertainty quantification, there are the delta method [46, 15, 44], the mean-variance estimate [36], and deep ensemble methods [22, 23]. A combination of the Bayesian approach (using the dropout variational inference) with the mean-variance estimate was used in [18], thus allowing for a simultaneous estimation of epistemic and aleatoric (due to noise in the data) uncertainty. A new method based on minimizing a joint loss for a regression network and another network quantifying uncertainty was recently proposed in [10]. We refer to [19, 41] and the recent works [33, 6, 22, 23, 10] for a comprehensive comparison of the above methods and further references to research on the Bayesian learning of neural networks.

We study an alternative approach to reconstructing the ground truth probability distribution, based on what we call a gradient conjugate prior (GCP) update. We are interested in learning conditional probability distributions of targets¹ corresponding to data samples x, using artificial neural networks (supervised learning). For brevity, we will often omit the dependence of distributions on x. Thus, assuming that the ground truth distribution of a random variable y (corresponding to the observed data) is Gaussian with unknown mean and precision, we let neural networks learn the four parameters of the normal-gamma distribution that serves as a prior for this unknown mean and precision. We emphasize that, unlike in Bayesian neural networks, the weights of the neural networks are deterministic in our approach. Given a parametrized prior, one obtains the predictive distribution in the form of a (non-standardized) Student's t-distribution whose parameters are explicitly determined by the outputs of the neural networks. For further details, we refer to Sec. 2.4, which includes a graphical model visualization in Fig. 2.1 and a comparison with Bayesian neural networks in Table 2.1.

¹Throughout this paper, we denote random variables by bold letters and the arguments of their probability distributions by the corresponding non-bold letters.

Given an observation y, the classical Bayesian update yields the posterior distribution for the mean and variance of the data. This posterior is normal-gamma as well [3]. However, one cannot update its parameters directly, because they are represented by the outputs of the neural networks. Instead, one has to update the weights of the neural networks. We suggest making a gradient descent step in the direction of minimizing the Kullback–Leibler (KL) divergence from the prior to the posterior (see the details in Sec. 2.4). This is the step that we call the GCP update. After updating the weights, one takes the next observation and repeats the above procedure. One cycles over the whole training data set until convergence of the log-likelihood of the predictive distribution

∑_{j=1}^N ln p̂(y_j | x_j, w),     (1.1)

where w denotes the weights of the neural networks, p̂ is the predictive distribution, and (x_j, y_j) are the training samples.

In the paper, we provide a detailed analysis of the dynamics given by the GCP update. Intuitively, one might expect that the GCP update, after convergence, yields the same result as the classical conjugate prior (CP) update. Surprisingly, this is not the case: the parametrized normal-gamma distribution does not converge to the Bayesian posterior (see Remark 3.4). Nevertheless, the predictive distribution does converge to the ground truth Gaussian distribution. This is explained by an observation that we prove in Sec. 2.5: the GCP update is actually equivalent to maximizing, by gradient ascent, the log-likelihood (1.1) of the predictive distribution. As the number of observations tends to infinity, the GCP update also becomes equivalent to minimizing, by gradient descent, the KL divergence from the predictive distribution to the ground truth distribution. We show that these equivalences hold in general, even if the prior is not conjugate to the likelihood function. Thus, we see that the GCP method estimates aleatoric uncertainty.

We emphasize that, although in our approach the approximating distribution is parametrized (as is the predictive distribution in the mean-variance approach [36] or the approximating distribution of the latent variables in variational autoencoders [21]), the way we parametrize and optimize it and the way we interpret the result are different, as shown in Fig. 2.1 and summarized in Table 2.1.

Now let us come back to our original assumption that the ground truth distribution is normal, so that the predictive distribution is a Student's t-distribution. The latter appears to be overparametrized (by four parameters instead of three). We keep it overparametrized in order to compare the dynamics of the parameters under the classical CP update and under the GCP update. Reformulating our results for Student's t-distribution parametrized in the standard way by three parameters is straightforward. There is a vast literature on the estimation of the parameters of Student's t-distribution; see, e.g., the overview [34] and the references therein. Note that, in the context of neural networks, different samples correspond to different inputs of the network, and hence they belong to different Student's t-distributions with different unknown parameters. Thus, maximizing the likelihood of Student's t-distribution with respect to the weights of the networks is one of the most common methods. In [43], the possibility of utilizing evolutionary algorithms for maximizing the likelihood was explored experimentally. Another natural way is to use gradient ascent with respect to the weights of the network. As we said, the latter is equivalent to using the GCP update. In the paper, we obtain a dynamical system for the prior's parameters that approximates the GCP update (as well as the gradient ascent maximizing the likelihood of Student's t-distribution). We study the dynamics of the prior's parameters in detail, in particular analyzing their convergence properties. Our approach is illustrated with synthetic data and validated on various real-world data sets in comparison with other methods for learning probability distributions based on neural networks. To the best of our knowledge, neither the dynamical systems analysis of the GCP (or of gradient ascent maximizing the likelihood of Student's t-distribution), nor a thorough comparison of the GCP with other methods has been carried out before.

As an interesting and useful consequence of our analysis, we will see how the GCP interacts with outliers in the training set (a small percentage of observations that do not come from the assumed normal distribution). The outliers prevent one of the prior's parameters (α, which is related to the number of degrees of freedom of the predictive Student's t-distribution) from going to infinity. On one hand, this is known [29, 39] to allow for a better estimate of the mean and variance, compared with directly maximizing the likelihood of a normal distribution. On the other hand, this still leads to an overestimation of the variance. To deal with this issue, we obtain an explicit formula (see (2.17)) that allows one to correct the estimate of the variance and recover the ground truth variance. To our knowledge, such a correction formula has not been derived in the literature before.

The paper is organized as follows. In Sec. 2, we provide a detailed motivation for the GCP update, explain how we approximate the parameters of the prior distribution by neural networks, establish the relation between the GCP update and the predictive distribution, and formulate the method of learning the ground truth distribution from the practical point of view. Section 3 is the mathematical core of this paper. We derive a dynamical system for the prior's parameters, induced by the GCP update, and analyze it in detail. In particular, we obtain an asymptotics for the growth rate of α and find the limits of the other parameters of the prior. In Sec. 4, we study the dynamics for a fixed α. We find the limiting values of the remaining parameters and show how one can recover the variance of the ground truth normal distribution. In Sec. 5, we clarify the role of a fixed α. Namely, we compare the sensitivity to outliers of the GCP update with that of minimizing the standard squared error loss or maximizing the log-likelihood of a normal distribution. Furthermore, we show how α controls the learning speed in clean and noisy regions. In Sec. 6, we illustrate the fit of the neural networks for synthetic and various real-world data sets. Section 7 contains a conclusion and an outline of possible directions of further research. Appendices A–D contain the proofs of auxiliary lemmas from Sec. 3. In Appendix E, we present the values of the hyperparameters of the different methods that are compared in Sec. 6.

2 Motivation

2.1 Estimating normal distributions with unknown mean and precision

Assume one wants to estimate the unknown mean and precision (the inverse of the variance) of normally distributed data y. We recall that y is conditioned on the input x, but we often omit this dependence in our notation. We analyze scalar y and refer to Sec. 7 for a discussion of multivariate data. One standard approach for estimating the mean and precision is based on the conjugate prior update. One assumes that they are random variables, μ and τ respectively, with a joint prior given by the normal-gamma distribution

p(μ, τ) = NG(μ, τ | m, λ, α, β) = (β^α √λ / (Γ(α) √(2π))) τ^(α − 1/2) e^(−βτ) e^(−λτ(μ − m)²/2),     (2.1)

where m ∈ ℝ and λ, α, β > 0. The marginal distribution for μ is a non-standardized Student's t-distribution with

2α degrees of freedom, mean m, and variance β/(λ(α − 1)) for α > 1.     (2.2)

The marginal distribution for τ is the gamma distribution with

mean α/β and variance α/β².     (2.3)

By marginalizing over μ and τ, one can get the predictive distribution for y, which appears to be a non-standardized Student's t-distribution. Its mean and variance can be used to estimate the mean and variance of y. The estimated mean and variance are given by

Ê[y] = m,  V̂ar[y] = β(λ + 1)/(λ(α − 1))  (for α > 1).     (2.4)

We refer, e.g., to [3] for further details.
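For concreteness, the estimates (2.4) are straightforward to compute from the prior's parameters. Below is a minimal sketch; the Python names (m, lam, alpha, beta) mirror the normal-gamma notation above and are our own illustration rather than code from the paper.

```python
def predictive_mean_var(m, lam, alpha, beta):
    """Mean and variance (2.4) of the predictive Student's t-distribution
    induced by the normal-gamma prior NG(m, lam, alpha, beta)."""
    if alpha <= 1.0:
        raise ValueError("the predictive variance is finite only for alpha > 1")
    return m, beta * (lam + 1.0) / (lam * (alpha - 1.0))

# Example: predictive_mean_var(0.0, 1.0, 2.0, 1.0) returns (0.0, 2.0).
```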

2.2 Conjugate prior update

Suppose one observes a new sample y. Then, by the Bayes theorem, the conditional distribution of (μ, τ) under the condition that y is observed (the posterior distribution) appears to be normal-gamma as well [3], namely,

p(μ, τ | y) = NG(μ, τ | m′, λ′, α′, β′),     (2.5)

where NG is defined in (2.1) and the parameters are updated as follows:

m′ = (λm + y)/(λ + 1),  λ′ = λ + 1,  α′ = α + 1/2,  β′ = β + λ(y − m)²/(2(λ + 1)).     (2.6)

We call (2.6) the conjugate prior (CP) update.
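In code, one CP update step is a direct transcription of (2.6); the function below is our sketch under the notation introduced in Sec. 2.1.

```python
def cp_update(m, lam, alpha, beta, y):
    """Classical conjugate prior (CP) update (2.6) after observing a sample y."""
    m_new = (lam * m + y) / (lam + 1.0)
    lam_new = lam + 1.0
    alpha_new = alpha + 0.5
    beta_new = beta + lam * (y - m) ** 2 / (2.0 * (lam + 1.0))
    return m_new, lam_new, alpha_new, beta_new
```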

2.3 Kullback–Leibler divergence

The Kullback–Leibler (KL) divergence from a continuous distribution q to a continuous distribution p is defined as follows:

D_KL(p ∥ q) = ∫ p(θ) ln( p(θ) / q(θ) ) dθ.     (2.7)

We denote by ψ the digamma function, i.e., ψ(z) = Γ′(z)/Γ(z), where Γ is the gamma function. Then, for the normal-gamma distributions (2.1) and (2.5), the KL divergence from the prior to the posterior takes the form [40]

D_KL = (1/2)[ln(λ′/λ) + λ/λ′ − 1 + λα′(m − m′)²/β′] + (α′ − α)ψ(α′) − ln Γ(α′) + ln Γ(α) + α ln(β′/β) + α′(β − β′)/β′.     (2.8)
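The following sketch transcribes the reconstructed formula (2.8); a useful sanity check is that it vanishes when the primed and non-primed parameters coincide. The expectation is taken under the posterior (primed) distribution.

```python
import numpy as np
from scipy.special import digamma, gammaln

def kl_normal_gamma(mp, lamp, alphap, betap, m, lam, alpha, beta):
    """KL divergence (2.8) between the posterior NG(mp, lamp, alphap, betap)
    and the prior NG(m, lam, alpha, beta), expectation under the posterior."""
    normal_part = 0.5 * (np.log(lamp / lam) + lam / lamp - 1.0
                         + lam * alphap * (m - mp) ** 2 / betap)
    gamma_part = ((alphap - alpha) * digamma(alphap)
                  - gammaln(alphap) + gammaln(alpha)
                  + alpha * np.log(betap / beta)
                  + alphap * (beta - betap) / betap)
    return normal_part + gamma_part
```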

2.4 Approximation of the parameters by gradient conjugate prior neural networks

Our goal is to approximate the parameters m, λ, α, β by multi-layer neural networks, i.e., to represent them as functions of the inputs and the weights: m = m(x, w), λ = λ(x, w), α = α(x, w), β = β(x, w). The corresponding graphical model is shown in Fig. 2.1. In Table 2.1, we summarize our approach and highlight its differences from Bayesian neural networks and variational inference².

²The latent variables are usually denoted by w in the Bayesian neural networks framework or by z in the variational inference framework. We use the notation θ for consistency with Sec. 2.5.

Figure 2.1: Deterministic parameters (the input x, the prior's parameters m, λ, α, β, and the weights w) are shown by solid nodes, and random variables (the mean μ, the precision τ, and the target y) by circles. The shaded circle corresponds to the observed random variable y. The box encompasses the quantities depending on x.
Aspect | Bayesian neural networks | GCP networks
Data | Inputs x, targets y | Inputs x, targets y
Ground truth | Gaussian | Gaussian
Captured uncertainty | Epistemic, homoscedastic | Aleatoric, heteroscedastic
Weights | Random | Deterministic
Latent variables | Weights, independent of x | Means and precisions, conditioned on x and w
Prior | Fixed during training | Evolves during training
Likelihood | Gaussian with constant variance | Gaussian with both mean and variance depending on x and w
Posterior | Intractable, fixed during training | Tractable (normal-gamma), evolves during training
Training | Minimize the KL divergence from a distribution parametrized by deterministic weights to the intractable fixed posterior | Gradient descent step to minimize the reverse KL divergence w.r.t. w; the posterior is tractable and recalculated after each observation based on the evolving prior
Result | The approximating distribution converges to the posterior | The prior does not converge to the posterior, but the predictive distribution maximizes the likelihood of the data
Predictive distribution | Typically evaluated by sampling | Explicit Student's t-distribution whose parameters depend on x and w
Outliers in the training set | Distorted means and overestimated variances | Robust means and variances via the correction formula (2.17)
Table 2.1: Comparison of Bayesian neural networks with variational inference and GCP networks

In our case, one cannot directly apply the update in (2.6), but has to update the weights instead. The natural way to do so is to observe a sample y, to calculate the posterior distribution (2.5), and to change the weights in the direction opposite to the gradient of the KL divergence (2.8), i.e.,

w_new = w − ε ∇̂_w D_KL,     (2.9)

where ε > 0 is a learning rate. When we compute the gradient of (2.8) with respect to w, we keep all the prime variables in (2.8) fixed and do not treat them as functions of w, while all the non-prime variables are treated as functions of w. We use the notation ∇̂_w in this case. We call (2.9) the gradient conjugate prior (GCP) update.
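With automatic differentiation, the GCP update (2.9) amounts to computing the posterior ("primed") parameters via (2.6), detaching them from the computation graph, and backpropagating the KL (2.8) through the non-primed parameters only. Below is a minimal PyTorch sketch under our reconstructed formulas; `net` stands for any module mapping an input x to the four prior parameters and is our illustrative assumption.

```python
import torch

def gcp_step(net, optimizer, x, y):
    m, lam, alpha, beta = net(x)          # prior parameters as functions of the weights
    with torch.no_grad():                 # frozen ("primed") posterior via the CP update (2.6)
        mp = (lam * m + y) / (lam + 1.0)
        lamp = lam + 1.0
        alphap = alpha + 0.5
        betap = beta + lam * (y - m) ** 2 / (2.0 * (lam + 1.0))
    # KL divergence (2.8): only the non-primed parameters carry gradients
    kl = (0.5 * (torch.log(lamp / lam) + lam / lamp - 1.0
                 + lam * alphap * (m - mp) ** 2 / betap)
          + (alphap - alpha) * torch.digamma(alphap)
          - torch.lgamma(alphap) + torch.lgamma(alpha)
          + alpha * torch.log(betap / beta)
          + alphap * (beta - betap) / betap)
    optimizer.zero_grad()
    kl.mean().backward()
    optimizer.step()
```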

As we will see below, this update induces an update of the prior's parameters that is different from the classical conjugate prior update (2.6) and yields completely different dynamics. Before analyzing these dynamics in detail, we present an alternative viewpoint on the GCP update, which provides insight into what is actually optimized by (2.9) in the general case.

2.5 GCP update and learning the predictive distribution

Suppose we want to learn the ground truth probability distribution of a random variable y (a normal distribution in our particular case). Since the ground truth distribution is a priori unknown, we conjecture that it belongs to a family of distributions p(y | θ) parametrized by θ (in our case, θ = (μ, τ) and p(y | θ) is a normal distribution with mean μ and precision τ). Since θ is a priori unknown, we assume it is a random variable with a prior distribution p(θ | w) from a family parametrized by w (in our case, p(θ | w) is the normal-gamma distribution and w are the weights of the neural networks approximating m, λ, α, β, see Sec. 2.4). We denote the predictive distribution by

p̂(y | w) = ∫ p(y | θ) p(θ | w) dθ     (2.10)

(a non-standardized Student's t-distribution in our case). Given an observation y, the Bayes rule determines the posterior distribution of θ:

p(θ | y, w) = p(y | θ) p(θ | w) / p̂(y | w).     (2.11)

In our case, the posterior is normal-gamma again, but we emphasize that, in general, it need not be from the same family as the prior.

Now we compute the gradient of the KL divergence

D_KL( p(θ | y, w) ∥ p(θ | w) ) = ∫ p(θ | y, w) ln( p(θ | y, w) / p(θ | w) ) dθ     (2.12)

(cf. (2.7)) with respect to w, assuming that the posterior p(θ | y, w) in (2.12) is frozen, i.e., we do not differentiate it. Denoting such a gradient by ∇̂_w, we obtain the following lemma.

Lemma 2.1.

∇̂_w D_KL( p(θ | y, w) ∥ p(θ | w) ) = −∇_w ln p̂(y | w).

Proof.

Freezing p(θ | y, w) in (2.12), we have

∇̂_w D_KL = −∫ p(θ | y, w) ∇_w ln p(θ | w) dθ.

Plugging in p(θ | y, w) from (2.11) and using (2.10) yields

∇̂_w D_KL = −(1/p̂(y | w)) ∫ p(y | θ) ∇_w p(θ | w) dθ = −∇_w p̂(y | w) / p̂(y | w) = −∇_w ln p̂(y | w). ∎
Lemma 2.1 shows that the GCP update (2.9) is a gradient ascent step in the direction of maximizing the log-likelihood of the predictive distribution given a new observation y. Furthermore, using Lemma 2.1, we see that, given observations y_1, …, y_N, the averaged GCP update of the parameters is given by (cf. (2.9))

w_new = w + (ε/N) ∑_{j=1}^N ∇_w ln p̂(y_j | w).     (2.13)

Further, if the observations are sampled from the ground truth distribution p*(y) and their number tends to infinity, then the GCP update (2.13) assumes the form

w_new = w − ε ∇_w D_KL( p*(y) ∥ p̂(y | w) ).     (2.14)
Remark 2.1.
  1. Formula (2.13) shows that the GCP update maximizes the likelihood of the predictive distribution for the observations y_1, …, y_N.

  2. Formula (2.14) shows that the GCP update is equivalent to a gradient descent step for the minimization of the KL divergence from the predictive distribution p̂(y | w) to the ground truth distribution p*(y). If the ground truth distribution belongs to the family {p̂(· | w)}, then the minimum equals zero and is achieved for some (not necessarily unique) w such that p̂(y | w) = p*(y); otherwise, the minimum is positive and provides the best possible approximation of the ground truth in the sense of the KL divergence.

  3. In our case, p*(y) is a normal distribution and the predictive distributions p̂(y | w) are Student's t-distributions. In accordance with item 2, we will see below that the GCP update forces the number of degrees of freedom of the predictive distribution to tend to infinity. However, due to the overparametrization of the predictive distribution (four parameters instead of three), the learned variance will be represented by a curve in the space of the prior's parameters. The limit point to which the parameters converge during the GCP update will depend on the initial condition. Interestingly, it will always be different from the limit obtained by the classical CP update (2.6) (cf. Remark 3.4).
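Lemma 2.1 can be checked numerically: under the reconstructed formulas above, the gradient of the frozen-posterior KL (2.12) must coincide with the gradient of the negative log-likelihood of the predictive Student's t-distribution (2.10). A small self-contained sketch with illustrative parameter values:

```python
import torch

m = torch.tensor(0.3, requires_grad=True)
lam = torch.tensor(1.5, requires_grad=True)
alpha = torch.tensor(2.0, requires_grad=True)
beta = torch.tensor(0.8, requires_grad=True)
y = torch.tensor(1.1)

with torch.no_grad():  # frozen ("primed") posterior via the CP update (2.6)
    mp = (lam * m + y) / (lam + 1)
    lamp, alphap = lam + 1, alpha + 0.5
    betap = beta + lam * (y - m) ** 2 / (2 * (lam + 1))

kl = (0.5 * (torch.log(lamp / lam) + lam / lamp - 1
             + lam * alphap * (m - mp) ** 2 / betap)
      + (alphap - alpha) * torch.digamma(alphap)
      - torch.lgamma(alphap) + torch.lgamma(alpha)
      + alpha * torch.log(betap / beta)
      + alphap * (beta - betap) / betap)
grad_kl = torch.autograd.grad(kl, (m, lam, alpha, beta))

# Predictive Student's t: 2*alpha degrees of freedom, location m,
# scale sqrt(beta*(lam + 1)/(lam*alpha)), cf. (2.4) and (2.10).
scale = torch.sqrt(beta * (lam + 1) / (lam * alpha))
nll = -torch.distributions.StudentT(2 * alpha, m, scale).log_prob(y)
grad_nll = torch.autograd.grad(nll, (m, lam, alpha, beta))

for g1, g2 in zip(grad_kl, grad_nll):
    print(float(g1), float(g2))  # each pair should agree up to rounding
```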

2.6 Practical approaches

Based on Remark 2.1 (items 1 and 2), we suggest the following general practical approach.

Practical approach 2.1.
  1. One approximates the parameters m, λ, α, β of the prior by neural networks:

    m = m(x, w),  λ = λ(x, w),  α = α(x, w),  β = β(x, w).     (2.15)

    We call them the GCP neural networks.

  2. One trains these four networks by the GCP update (2.9) until convergence of the log-likelihood (1.1).

  3. The resulting predictive distribution is the non-standardized Student's t-distribution with 2α degrees of freedom. The estimated mean and variance (for α > 1) are given by

    Ê[y] = m,  V̂ar[y] = β(λ + 1)/(λ(α − 1)).     (2.16)
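By Lemma 2.1, step 2 can equivalently be implemented by directly maximizing the log-likelihood (1.1) of the predictive Student's t-distribution. The end-to-end PyTorch sketch below uses this equivalence; the architecture (a shared hidden layer with four heads, softplus to keep λ, α, β positive, and α shifted above 1 so that the variance (2.16) is finite) is our illustrative choice, not one prescribed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCPNet(nn.Module):
    """The four GCP networks (2.15), sharing one hidden layer.
    Expects x of shape (N, 1); returns four tensors of shape (N,)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 4)

    def forward(self, x):
        h = self.head(self.body(x))
        m = h[:, 0]
        lam = F.softplus(h[:, 1])
        alpha = 1.0 + F.softplus(h[:, 2])   # keeps the variance (2.16) finite
        beta = F.softplus(h[:, 3])
        return m, lam, alpha, beta

def train(net, x, y, epochs=2000, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        m, lam, alpha, beta = net(x)
        scale = torch.sqrt(beta * (lam + 1) / (lam * alpha))
        nll = -torch.distributions.StudentT(2 * alpha, m, scale).log_prob(y)
        opt.zero_grad()
        nll.mean().backward()
        opt.step()

# After training, (2.16) gives the estimates:
#   mean = m,  variance = beta * (lam + 1) / (lam * (alpha - 1)).
```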

In practice, one has finitely many observations y_1, …, y_N, and the empirical distribution is a linear combination of Dirac delta functions supported at these observations. Due to Remark 2.1 (item 1), Approach 2.1 yields the predictive Student's t-distribution with maximal likelihood. However, if the observations are sampled from a normal distribution and their number tends to infinity, then α will tend to infinity due to Remark 2.1 (item 3). We will also show that the remaining parameters of the prior converge to finite limits.

Remark 2.2.

Below we will justify the fact that if α is fixed at some value, then we obtain the best approximation of the ground truth distribution by a non-standardized Student's t-distribution with the corresponding number of degrees of freedom. However, one can still recover the correct variance of the ground truth normal distribution by appropriately modifying the predictive variance in (2.16), namely, by using

(2.17)

with the function from Definition 3.1. We call (2.17) the correction formula for the variance.

Furthermore, we will see that if the data in the training set come from a normal distribution but contain a small number of outliers in a certain region of the input space, then the GCP will automatically learn finite values of α in this region. This leads to a predictive variance (2.16) that is higher than the ground truth variance of the normal distribution, and the variance estimate can be corrected by using (2.17) instead. This is illustrated in Sections 5.1, 6.3, and 6.5.

In the rest of the paper, we will rigorously justify the above approach based on the GCP update, study the dynamics of the prior's parameters under this update, and analyze how one should correct the variance for a fixed α.

3 Dynamics of the prior's parameters

3.1 Dynamical system for the prior's parameters

The GCP update (2.9) induces the following update of the prior's parameters:

(3.1)

where the increment of each parameter is induced by the gradient step (2.9) for the weights; analogous formulas hold for the remaining parameters.

Obviously, the new parameters are different from those given by the classical conjugate prior update (2.6). From now on, we replace the induced learning rates by new learning rates ε_m, ε_λ, ε_α, ε_β and analyze how the parameters change and to which values they converge under updates of the form

m_new = m − ε_m ∂D/∂m,  λ_new = λ − ε_λ ∂D/∂λ,  α_new = α − ε_α ∂D/∂α,  β_new = β − ε_β ∂D/∂β,     (3.2)

where ε_m, ε_λ, ε_α, ε_β > 0 are the learning rates and D is the KL divergence (2.8). As before, when we compute the derivatives of D, we keep all the prime variables in (2.8) fixed and do not treat them as functions of the non-prime variables. In other words, we first compute the derivatives of D with respect to m, λ, α, β and then substitute the primed parameters from (2.6). For brevity, we will simply write ∂D/∂m, etc. We call (3.2) the GCP update as well.
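The parameter-space update (3.2) can be simulated directly, which is one way to reproduce trajectories like those in Fig. 3.4. A sketch under our reconstructed formulas, with autograd supplying the derivatives of D:

```python
import torch

def gcp_parameter_step(params, y, eps=1e-3):
    """One GCP update (3.2) directly in the space (m, lam, alpha, beta);
    params is a tuple of scalar tensors, and eps plays the role of the
    common learning rate (Condition 3.1)."""
    m, lam, alpha, beta = (p.clone().requires_grad_(True) for p in params)
    with torch.no_grad():  # primed parameters substituted from (2.6)
        mp = (lam * m + y) / (lam + 1)
        lamp, alphap = lam + 1, alpha + 0.5
        betap = beta + lam * (y - m) ** 2 / (2 * (lam + 1))
    kl = (0.5 * (torch.log(lamp / lam) + lam / lamp - 1
                 + lam * alphap * (m - mp) ** 2 / betap)
          + (alphap - alpha) * torch.digamma(alphap)
          - torch.lgamma(alphap) + torch.lgamma(alpha)
          + alpha * torch.log(betap / beta)
          + alphap * (beta - betap) / betap)
    kl.backward()
    return tuple((p - eps * p.grad).detach() for p in (m, lam, alpha, beta))

# Iterating over samples y drawn from a fixed normal distribution traces the
# trajectories of the prior's parameters, as in Fig. 3.4.
```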

Setting

(3.3)

we have

(3.4)
(3.5)
(3.6)
(3.7)

In this section and in the next one, we treat the parameters as functions of time and study a dynamical system that approximates the GCP update (3.2) when the number of observations is large. We concentrate on the prototype situation where all the new learning rates are the same.

Condition 3.1.

In the GCP update (3.2), we have ε_m = ε_λ = ε_α = ε_β.

Under Condition 3.1, the approximating dynamical system takes the form

(3.8)

where, hereinafter, the expectations are taken with respect to the true distribution of y, which is treated as a normally distributed random variable with the ground truth mean and variance; see Fig. 3.1.

Remark 3.1.

Due to (2.14), system (3.8) defines a gradient flow whose potential is the KL divergence from the predictive Student's t-distribution to the ground truth normal distribution.

Remark 3.2.

If Condition 3.1 does not hold, then the respective factors appear in the right-hand sides of (3.8). The modifications one has to make in the arguments below are straightforward.

Figure 3.1: The black circle indicates the prior probability distribution (2.1) in the space of the parameters. The white circles indicate the posterior probability distributions (2.5) corresponding to different observations. The black vectors are the gradients, with respect to the non-prime variables, of the corresponding KL divergences. The blue vector is the averaged gradient. An equilibrium of system (3.8) would correspond to the case where the blue vector vanishes. Theorem 3.2 shows that this actually never happens. However, Theorem 4.1 shows that if one keeps α fixed but updates the other parameters, then one obtains a whole curve of equilibria.

3.2 Estimation of the mean

Using (3.4), we obtain the formula for the expectation

(3.9)
Theorem 3.1.

The first equation in (3.8) has a unique equilibrium, at which m equals the ground truth mean. It is stable in the sense that, for any initial condition, m converges to the equilibrium as time tends to infinity.

Proof.

Without loss of generality, assume that the ground truth mean equals zero and the ground truth variance equals one (otherwise, make a change of variables in the integral in (3.9)). Then we obtain from (3.9)

(3.10)

where the coefficients do not depend on m.

Obviously, the right-hand side in (3.10) vanishes at m = 0. Furthermore, due to (3.10), the right-hand side is negative for m > 0 and positive for m < 0. ∎

3.3 Estimation of the variance. The unbounded absorbing set

From now on, taking into account Theorem 3.1, we assume the following.

Condition 3.2.

The parameter m equals the ground truth mean.

Under Condition 3.2, we study the other three equations in (3.8), namely,

(3.11)

where (due to Condition 3.2)

(3.12)
(3.13)
(3.14)

3.3.1 Auxiliary functions

To formulate the main theorem of this section, we introduce a function that plays a central role throughout the paper.

Definition 3.1.

For each value of its argument, the function is defined as the unique root of the equation

(3.15)

with respect to the unknown, where

(3.16)

and erfc denotes the complementary error function.

Figure 3.2: The function from Definition 3.1.

The main properties of this function are given in the following lemma (see Fig. 3.2).

Lemma 3.1.
  1. Equation (3.15) has a unique root,

  2. the function is monotonically increasing,

  3. ,

  4. it satisfies the differential equation

    (3.17)
  5. it has the following asymptotics:

    (3.18)

    where the constants are explicit.

Proof.

These properties are proved in Lemmas A.1–A.4. ∎

Definition 3.2.

For each fixed α, we define the functions (see Fig. 3.3, left)

(3.19)

We recall that the function entering (3.19) is the one from Definition 3.1.

3.3.2 Estimation of the variance

The main result of this section (illustrated by Figures 3.3 and 3.4) is as follows.

Theorem 3.2.
  1. There is a smooth increasing function , , such that

    1. on the curve ,

    2. and ,

    3. for all ,

    4. for any , there exists such that

    5. the region

      (3.20)

      is forward invariant for system (3.11).

  2. For any , there exists a time moment depending on the initial condition such that for all , , , .

  3. For any , there is depending on the initial conditions such that the points for all lie on the integral curve

    (3.21)

    of the equation

    (3.22)
  4. For any , we have

    where .

Theorem 3.2 immediately implies the following corollary about the asymptotics of the variance in (2.4) for the predictive Student’s t-distribution.

Corollary 3.1.

For any , we have

In particular,

Proof.

From Theorem 3.2, item 1, we have, by the definition of the function from Definition 3.1,

Dividing these inequalities and recalling the asymptotics stated in Theorem 3.2, we obtain the desired result. ∎

Figure 3.3: Left: the curves given by (3.19), the curve from Theorem 3.2, item 1, and the region given by (3.20); the arrows indicate the directions of the vector field. Right: the green lines are the curves given by (3.21); the black lines are curves orthogonal to them.
Figure 3.4: Left: several trajectories obtained by iterating (3.2) for 2000 samples drawn from a normal distribution. Middle/Right: graphs of the prior's parameters plotted versus the number of epochs, corresponding to the lower-right/upper-left trajectory in the left figure.

Remark 3.3.

One can show that in the definition of the function can be replaced by with a sufficiently large . In particular, the asymptotics in Corollary 3.1 will assume the form

The proof would require obtaining an extra term in the asymptotics of as . However, we will not elaborate on these details.

Remark 3.4.

Suppose the number of observations tends to infinity. Then, in the standard conjugate prior update (2.6), the parameters tend to infinity, and the estimated mean and variance given by (2.4) converge to the ground truth mean and variance, while

The situation is quite different in Theorem 3.2. Although the parameter α tends to infinity, one of the remaining parameters converges to a finite positive value and another converges to zero. Nevertheless, the estimated variance in Corollary 3.1 converges to the ground truth variance, while (due to (2.2), (2.3), and (2.4))

3.4 Dynamics of the remaining parameters. Proof of Theorem 3.2

First, we show that the right-hand sides of the last two equations in (3.11) simultaneously vanish on the two-dimensional manifold

(3.23)

where the function is defined in (3.19). Note that this manifold corresponds to the curve in Fig. 3.3, left. We will also determine the signs of these right-hand sides off the manifold.

Lemma 3.2.

We have

(3.24)
(3.25)
Proof.

This lemma is proved in Appendix B. ∎

Now we show that the trajectories lie on curves that do not depend on the remaining parameters; see the green lines in Fig. 3.3 (right).

Lemma 3.3.

Let a pair of the prior's parameters satisfy the last two equations in (3.11) (for an arbitrary value of the third parameter). Then there is a constant, depending on the initial data, such that all the points of the trajectory belong to the integral curve (3.21) of the equation (3.22).

Proof.

This lemma is proved in Appendix C. ∎

Now we show that the remaining right-hand side is strictly negative on the manifold (3.23); hence, neither system (3.8) nor system (3.11) possesses an equilibrium.

Lemma 3.4.

We have

(3.26)

Moreover, for any , there exists such that