1 Introduction
Reconstructing probability distributions of observed data by artificial neural networks is one of the most essential parts of machine learning and artificial intelligence
[3, 32]. Learning probability distributions not only allows one to predict the behavior of a system under consideration, but to also quantify the uncertainty with which the predictions are done. Under the assumption that the data are normally distributed, the most well studied way of reconstructing probability distributions is the Bayesian learning of neural networks
[30]. One treats the weights of the network as normally distributed random variables, prescribes their prior distribution, and then finds the posterior distribution conditioned on the data. The main difficulty is that neither the posterior, nor the resulting predictive distributions are given in a closed form. As a result, different approximation methods have been developed
[35, 16, 14, 45, 4, 20, 7, 12, 6, 25, 13, 27, 24, 28]. However, many of them have certain drawbacks related to the lack of scalability in data size or the neural network complexity, and are still a field of ongoing research. Furthermore, Bayesian neural networks often assume homoscedastic variance in the likelihood (i.e., same for all samples) and rather learn uncertainty due to lack of data (epistemic uncertainty). Among other methods for uncertainty quantification, there are the delta method [46, 15, 44], the meanvariance estimate
[36], and deep ensemble methods [22, 23]. A combination of the Bayesian approach (using the dropout variational inference) with the meanvariance estimate was used in [18], thus allowing for a simultaneous estimation of epistemic and aleatoric (due to noise in data) uncertainty. A new method based on minimizing a joint loss for a regression network and another network quantifying uncertainty was recently proposed in [10]. We refer to [19, 41] and the recent works [33, 6, 22, 23, 10] for a comprehensive comparison of the above methods and further references to research on the Bayesian learning of neural networks.We study an alternative approach to reconstructing the ground truth probability distribution based on what we call a gradient conjugate prior (GCP) update
. We are interested in learning conditional probability distributions
of targets^{1}^{1}1Throughout this paper, we denote random variables by bold letters and the arguments of their probability distributions by the corresponding nonbold letters. corresponding to data samples , using artificial neural networks (supervised learning). For brevity, we will often omit the dependence of distributions on . Thus, assuming that the ground truth distribution of a random variable (corresponding to observed data) is Gaussian with unknown mean and precision, we let neural networks learn the four parameters of the normalgamma distribution that serves as a prior for the mean and variance of . We emphasize that, unlike for Bayesian neural networks, the weights of the neural networks are deterministic in our approach. Given a parametrized prior, one has the predictive distribution in the form of a (nonstandardized) Student’s tdistribution , whose parameters are explicitly determined by the outputs of the neural networks. For further details, we refer to Sec. 2.4, which includes a graphical model visualization in Fig. 2.1 and a comparison with Bayesian neural networks in Table 2.1.Given an observation , the classical Bayesian update yields the posterior distribution for the mean and variance of . This posterior appears normalgamma as well [3]. However, one cannot update its parameters directly because they are represented by the outputs of the neural networks. Instead, one has to update the weights of the neural networks. We suggest to make a gradient descent step in the direction of minimization of the Kullback–Leibler (KL) divergence from the prior to the posterior (see the details in Sec. 2.4). This is the step that we call the GCP update. After updating the weights, one takes the next observation and repeats the above update procedure. One cycles over the whole training data set until convergence of the loglikelihood of predictive distribution
(1.1) 
In the paper, we provide a detailed analysis of the dynamics given by the GCP update. Intuitively, one might think that the GCP update, after convergence, yields the same result as the classical CP update. Surprisingly, this is not the case: the parametrized normalgamma distribution does not converge to the Bayesian posterior (see Remark 3.4
). Nevertheless, the predictive distribution does converge to the ground truth Gaussian distribution
. This is explained by the observation, which we prove in Sec. 2.5: the GCP update is actually equivalent to maximizing by gradient ascent the loglikelihood (1.1) of the predictive distribution. As the number of observations tends to infinity the GCP update becomes also equivalent to minimizing by gradient descent the KL divergence from the predictive distribution to the ground truth distribution . We show that these equivalences hold in general, even if the prior is not conjugate to the likelihood function. Thus, we see that the GCP method estimates aleatoric uncertainty.We emphasize that, although in our approach the approximating distribution gets parametrized (as the predictive distribution in the meanvariance approach [36]
or the approximating latent variables distribution in variational autoencoders
[21]), the way we parametrize and optimize and the way we interpret the result is different, as shown in Fig. 2.1 and summarized in Table 2.1.Now let us come back to our original assumption that is a normal distribution and is a Student’s tdistribution. The latter appears to be overparametrized (by four parameters instead of three). We keep it overparametrized in order to compare the dynamics of the parameters under the classical CP update and under the GCP update. Reformulation of our results for Student’s tdistribution parameterized in the standard way by three parameters will be straightforward. There is a vast literature on the estimation of parameters of Student’s tdistribution, see, e.g., the overview [34] and the references therein. Note that, in the context of neural networks, different samples correspond to different inputs of the network, and hence they belong to different Student’s tdistributions with different unknown parameters. Thus, the maximization of the likelihood of Student’s tdistribution with respect to the weights of the networks is one of the most common methods. In [43]
, the possibility of utilizing evolutionary algorithms for maximizing the likelihood was explored experimentally. Another natural way is to use the gradient ascent with respect to the weights of the network. As we said, the latter is equivalent to the usage of the GCP update. In the paper, we obtain a dynamical system for the prior’s parameters that approximates the GCP update (as well as the gradient ascent for maximization of Student’s tdistribution). We study the dynamics of the prior’s parameters in detail, in particular analyzing their convergence properties. Our approach is illustrated with synthetic data and validated on various realworld data sets in comparison with other methods for learning probability distributions based on neural networks. To our best knowledge,
neither the dynamical systems analysis of the GCP (or gradient ascent for maximizing the likelihood of Student’s tdistribution), nor a thorough comparison of the GCP with other methods has been carried out before.As an interesting and useful consequence of our analysis, we will see how the GCP interacts with outliers in the training set (a small percentage of observations that do not come from the assumed normal distribution ). The outliers prevent one of the prior’s parameters (
, which is related to the number of degrees of freedom of
) from going to infinity. On one hand, this is known [29, 39] to allow for a better estimate of the mean and variance of , compared with directly using the maximization of the likelihood of a normal distribution. However, on the other hand, this still leads to overestimation of the variance of . To deal with this issue, we obtain an explicit formula (see (2.17)) that allows one to correct the estimate of the variance and recover the ground truth variance of . To our knowledge, such a correction formula was not derived in the literature before.The paper is organized as follows. In Sec. 2, we provide a detailed motivation for the GCP update, explain how we approximate the parameters of the prior distribution by neural networks, establish the relation between the GCP update and the predictive distribution, and formulate the method of learning the ground truth distribution from the practical point of view. Section 3 is the mathematical core of this paper. We derive a dynamical system for the prior’s parameters, induced by the GCP update, and analyze it in detail. In particular, we obtain an asymptotics for the growth rate of and find the limits of the other parameters of the prior. In Sec. 4, we study the dynamics for a fixed . We find the limiting values for the rest of the parameters and show how one can recover the variance of the ground truth normal distribution . In Sec. 5, we clarify the role of a fixed . Namely, we compare the sensitivity to outliers of the GCP update with that in minimizing the standard squared error loss or maximizing the loglikelihood of a normal distribution. Furthermore, we show how controls the learning speed in clean and noisy regions. In Sec. 6, we illustrate the fit of neural networks for synthetic and various realworld data sets. Section 7 contains a conclusion and an outline of possible directions of further research. Appendices A–D contain the proofs of auxiliary lemmas from Sec. 3. In Appendix E
, we present the values of hyperparameters of different methods that are compared in Sec.
6.2 Motivation
2.1 Estimating normal distributions with unknown mean and precision
Assume one wants to estimate unknown mean and precision (the inverse of the variance) of normally distributed data . We remind that is conditioned on , but we often omit this dependence in our notation. We will analyze scalar and refer to Sec. 7 for a discussion of multivariate data. One standard approach for estimating the mean and precision is based on the conjugate prior update. One assumes that they are random variables, and respectively, with a joint prior given by the normalgamma distribution
(2.1) 
where
The marginal distribution for is a nonstandardized Student’s tdistribution with
(2.2) 
The marginal distribution for is the Gamma distribution with
(2.3) 
By marginalizing and , one can get the predictive distribution for , which appears to be a nonstandardized Student’s tdistribution. Its mean and variance can be used to estimate the mean and variance of . The estimated mean and variance are given by
(2.4) 
We refer, e.g., to [3] for further details.
2.2 Conjugate prior update
Suppose one observes a new sample
. Then, by the Bayes theorem, the conditional distribution of
under the condition that (the posterior distribution denoted by ) appears to be normalgamma as well [3], namely,(2.5) 
where is defined in (2.1) and the parameters are updated as follows:
(2.6) 
We call (2.6) the conjugate prior (CP) update.
2.3 Kullback–Leibler divergence
2.4 Approximation of the parameters by gradient conjugate prior neural networks
Our goal is to approximate the parameters by multilayer neural networks, i.e., to represent them as functions of inputs and weights: , . The corresponding graphical model is shown in Fig. 2.1. In Table 2.1, we summarize our approach and highlight its difference from the Bayesian neural networks and variational inference^{2}^{2}2The latent variables are usually denoted in the Bayesian neural networks framework or in the variational inference framework. We use the notation to make it consistent with our notation in Sec. 2.5..
Bayesian neural networks  GCP networks  
Data  Inputs , targets  
Ground truth  Gaussian  

Epistemic, homoscedastic  Aleatoric, heteroscedastic 

Weights  Random  Deterministic  

Weights independent of 


Prior  fixed during training  evolves during training.  
Likelihood  Gaussian with constant 


Posterior  intractable and fixed during training  tractable normalGamma and evolves during training  
Training 



Result  approximates the posterior 


Predictive 





Robust means and variances via the correction formula (2.17) 
In our case, one cannot directly apply the update in (2.6), but has to update the weights instead. The natural way to do so is to observe a sample , to calculate the posterior distribution (2.5) and to change the weights in the direction of , i.e.,
(2.9) 
where is a learning rate. When we compute the gradient of with respect of , we keep all the prime variables in (2.8) fixed and do not treat them as functions of , while all the nonprime variables are treated as functions of . We still use the notation in this case. We call (2.9) the gradient conjugate prior (GCP) update.
As we will see below, this update induces the update for that is different from the classical conjugate prior update (2.6) and yields a completely different dynamics. Before we analyze this dynamics in detail, we present an alternative viewpoint on the GCP update, which provides an insight into what is actually optimized by (2.9) in the general case.
2.5 GCP update and learning the predictive distribution
Suppose we want to learn a ground truth probability distribution of a random variable (a normal distribution in our particular case). Since the ground truth distribution is a priori unknown, we conjecture that it belongs to a family of distributions parametrized by (in our case and is a normal distribution with mean and precision ). Since is a priori unknown, we assume it is a random variable with a prior distribution from a family parametrized by (in our case, is the normalgamma distribution and are the weights of neural networks approximating , see Sec. 2.4). We denote the predictive distribution by
(2.10) 
(nonstandardized Student’s tdistribution in our case). Given an observation , the Bayes rule determines the posterior distribution of :
(2.11) 
In our case, is normalgamma again, but we emphasize that, in general, it need not be from the same family as the prior is.
Now we compute the gradient of the KL divergence
(2.12) 
(cf. (2.7)) with respect to , assuming that in the posterior distribution is freezed, and we do not differentiate it. Denoting such a gradient by , we obtain the following lemma.
Lemma 2.1.
Lemma 2.1 shows that the GCP update (2.9) is the gradient ascent step in the direction of maximizing the loglikelihood of the predictive distribution given a new observation . Furthermore, using Lemma 2.1, we see that given observations , the averaged GCP update of the parameters is given by (cf. (2.9))
(2.13) 
Further, if the observations are sampled from the ground truth distribution and their number tends to infinity, then the GCP update (2.13) assumes the form
(2.14)  
Remark 2.1.

Formula (2.13) shows that the GCP update maximizes the likelihood of the predictive distribution for the observations .

Formula (2.14) shows that the GCP update is equivalent to the gradient descent step for the minimization of the KL divergence from the ground truth distribution to the predictive distribution . If the ground truth distribution belongs to the family , then the minimum equals zero and is achieved for some (not necessarily unique) such that ; otherwise the minimum is positive and provides the best possible approximation of the ground truth in the sense of the KL divergence.

In our case, is a normal distribution and are Student’s tdistributions. In accordance with item 2, we will see below that the GCP update forces the number of degrees of freedom of to tend to infinity. However, due to the overparametrization of the predictive distribution (four parameters instead of three), the learned variance of will be represented by a curve in the space . The limit point to which will converge during the GCP update, will depend on the initial condition. Interestingly, will always be different from the limit obtained by the classical CP update (2.6) (cf. Remark 3.4).
2.6 Practical approaches
Practical approach 2.1.

One approximates the parameters of the prior by neural networks:
(2.15) We call them the GCP neural networks.

One trains these four networks by the GCP update (2.9) until convergence of .

The resulting predictive distribution is the nonstandardized Student’s tdistribution . The estimated mean and variance (for ) are given by
(2.16)
In practice one has finitely many observations , , and the distribution is a linear combination of the Dirac delta functions supported at . Due to Remark 2.1 (item 1), Approach 2.1 yields the predictive Student’s tdistribution with maximal likelihood. However, if the observations are sampled from a normal distribution and their number tends to infinity, will tend to infinity due to Remark 2.1 (item 3). We will also show that , and will converge to a finite value .
Remark 2.2.
Below we will justify the fact that if is fixed and equals to some value , then we obtain the best approximation of by a nonstandardized Student’s tdistribution with degrees of freedom. However, one can still recover the correct variance of the ground truth normal distribution by appropriately modifying the predictive variance in (2.16), namely, by using
(2.17) 
with from Definition 3.1. We call it a correction formula for the variance.
Furthermore, we will see that if the data in the training set come from a normal distribution but contain a small number of outliers in a certain region of the input space , then the GCP will automatically learn finite values of in this region. This will lead to that is higher than the ground truth variance of the normal distribution, and the variance estimate can be corrected by using (2.17) instead. This is illustrated in sections 5.1, 6.3, and 6.5.
In the rest of the paper, we will rigorously justify the above approach based on the GCP update, study the dynamics of under this update, and analyze how one should correct the variance for a fixed .
3 Dynamics of
3.1 Dynamical system for
The GCP update (2.9) induces the update for as follows:
(3.1) 
where , and similarly for and , respectively.
Obviously, the new parameters are different from given by the classical conjugate prior update (2.6). From now on, we replace , etc. by new learning rates and analyze how the parameters will change and to which values they will converge under the updates of the form
(3.2) 
where are the learning rates. As before, when we compute the derivatives of , we keep all the primevariables in (2.8) fixed and do not treat them as functions of . In other words, we first compute the derivatives of with respect to and then substitute from (2.6). For brevity, we will simply write , etc. We call (3.2) the GCP update as well.
Setting
(3.3) 
we have
(3.4)  
(3.5)  
(3.6)  
(3.7) 
In this section and in the next one, we will treat the parameters as functions of time and study a dynamical system that approximates the GCP update (3.2) when the number of observations is large. We will concentrate on the prototype situation, where all new learning rates are the same.
Condition 3.1.
In the GCP update (3.2), we have .
Under Condition 3.1, the approximating dynamical system takes the form
(3.8) 
hereinafter the expectations are taken with respect to the true distribution of which is treated as a normally distributed random variable with mean and variance , see Fig. 3.1.
Remark 3.1.
3.2 Estimation of the mean
Using (3.4), we obtain the formula for the expectation
(3.9) 
Theorem 3.1.
The first equation in (3.8) has a unique equilibrium . It is stable in the sense that, for any , we have
3.3 Estimation of the variance. The unbounded absorbing set
From now on, taking into account Theorem 3.1, we assume the following.
Condition 3.2.
.
Under Condition 3.2, we study the other three equations in (3.8), namely,
(3.11) 
where (due to Condition 3.2)
(3.12)  
(3.13)  
(3.14) 
3.3.1 The functions and
To formulate the main theorem of this section, we introduce a function , which plays the central role throughout the paper.
Definition 3.1.
For each , is defined as a unique root of the equation
(3.15) 
with respect to , where
(3.16) 
and is the complementary error function.
The main properties of are given in the following lemma (see Fig. 3.2).
Lemma 3.1.

Equation (3.15) has a unique root ,

is monotonically increasing,

,

satisfies the differential equation
(3.17) 
has the following asymptotics:
(3.18) where , .
Definition 3.2.
3.3.2 Estimation of the variance
Theorem 3.2.

There is a smooth increasing function , , such that

on the curve ,

and ,

for all ,

for any , there exists such that


For any , there is depending on the initial conditions such that the points for all lie on the integral curve
(3.21) of the equation
(3.22) 
For any , we have
where .
Theorem 3.2 immediately implies the following corollary about the asymptotics of the variance in (2.4) for the predictive Student’s tdistribution.
Corollary 3.1.
For any , we have
In particular,
Proof.
Remark 3.3.
One can show that in the definition of the function can be replaced by with a sufficiently large . In particular, the asymptotics in Corollary 3.1 will assume the form
The proof would require obtaining an extra term in the asymptotics of as . However, we will not elaborate on these details.
3.4 Dynamics of . Proof of Theorem 3.2
First, we show that and simultaneously vanish on the twodimensional manifold
(3.23) 
where is defined in (3.19) Note that this manifold corresponds to the curve in Fig. 3.3, left. We will also see that and in .
Lemma 3.2.
We have
(3.24) 
(3.25)  
Proof.
This lemma is proved in Appendix B. ∎
Now we show that the trajectories lie on curves that do not depend on or , see the green lines in Fig. 3.3 (right).
Lemma 3.3.
Proof.
This lemma is proved in Appendix C. ∎
Now we show that is strictly negative on the manifold (3.23), and, hence, neither system (3.8), nor system (3.11) possesses an equilibrium.
Lemma 3.4.
We have
(3.26) 
Moreover, for any , there exists such that
Comments
There are no comments yet.