Regression with Uncertainty Quantification in Large Scale Complex Data

12/04/2019 ∙ by Nicholas Wilkins, et al. ∙ Rochester Institute of Technology

While several methods for predicting uncertainty on deep networks have been recently proposed, they do not readily translate to large and complex datasets. In this paper we utilize a simplified form of Mixture Density Networks (MDNs) to produce a one-shot approach to quantify uncertainty in regression problems. We show that our uncertainty bounds are on par with or better than those of other reported methods. When applied to standard regression benchmark datasets, we show an improvement in predictive log-likelihood and root-mean-square error when compared to existing state-of-the-art methods. We also demonstrate this method's efficacy on stochastic, highly volatile time-series data where stock prices are predicted for the next time interval. The resulting uncertainty graph summarizes significant anomalies in the stock price chart. Furthermore, we apply this method to the task of age estimation from the challenging IMDb-Wiki dataset of half a million face images. We successfully predict the uncertainties associated with the prediction and empirically analyze the underlying causes of the uncertainties. This uncertainty quantification can be used to pre-process low-quality datasets and further enable learning.




1 Introduction

When utilizing regression under deep learning, one typically attempts to learn an optimal mapping (under some loss function) from a feature space (notated here as $X$) to some target space (notated here as $Y$): $f: X \to Y$. We wish to learn that function such that some loss function is minimized. Due to the aforementioned construction, typically when utilizing a regressor, a single value is predicted. Frequently, however, the feature space does not capture enough information to perfectly predict $Y$. Additionally, there is frequently stochasticity present within the system (i.e. aleatoric uncertainty) which prevents any system from perfectly predicting onto $Y$. The main focus of this work is to construct a regressor which efficiently regresses onto a normal distribution on the target space. The choice of a Gaussian as the target distribution is in keeping with traditional statistical methods, where the measurement errors occurring in regression problems are assumed to follow a normal distribution.

Deep Neural Networks have transformed the field of machine learning by allowing advanced concepts to be learned from large-scale input data. Furthermore, these techniques have allowed for recent breakthroughs in pattern recognition that have applications in many fields such as chemistry, biology, physics, manufacturing, and medical sciences. Although promising and highly useful, many deep learning techniques only provide a point estimate and seldom provide a means to understand inherent uncertainty in the input data. Due to this, they are frequently incapable of understanding their own limitations (footnote 1). This can potentially have disastrous impacts in many important real-life scenarios.

Footnote 1: Although in classification one can determine how far training samples are from the decision boundary, this confidence is often significantly different from understanding the inherent limitations of the learning system.

For many areas of scientific study, especially in areas of critical importance such as medical image analysis for patient diagnosis, this lack of uncertainty quantification is highly problematic. The inability to understand and quantify the model’s confidence in its predicted values is a major source of potential risk and liability [10]. For example, when faced with a difficult diagnosis, the ability of a deep learning system to report large uncertainties would allow human operators to intervene and review those specific cases. If deep learning is to be widely used for critical applications in practical settings, such as making medical diagnoses from input data, a key requirement would be the ability to provide statistically meaningful uncertainty measurements with its predictions.

In a 1994 technical report, Bishop [1] introduced Mixture Density Networks (MDNs), where a neural network is used to predict a probability distribution over the target value $y$, rather than a single point estimate. MDNs are trained with a fixed number of Gaussian mixture components, using the Negative Log Likelihood (NLL) as the loss function of the network. This training scheme has the potential to address many of the issues highlighted above, but due to limitations in computing power in the 1990s, MDNs did not gain wide popularity.

In this paper we present an elegant and simplified approach to quantify uncertainty in large-scale regression problems. We propose a one-shot approach requiring no significant overhead. Additionally, we demonstrate that the uncertainty bounds produced by this system are on par with or better than those of existing methods reported in the literature. Finally, we illustrate how this can be utilized for cleaning datasets and removing erroneous data autonomously.

2 Prior Work

As uncertainty measurement is immensely useful, prior research has been conducted into quantifying uncertainty. However, many proposed methods require substantial overhead. Additionally, although many past works have made the distinction, our calculated uncertainty is capable of incorporating both aleatory and epistemic uncertainties [7] (footnote 2).

Footnote 2: Aleatory uncertainty is irreducible uncertainty due to probabilistic variability, while epistemic uncertainty is reducible and stems from uncertainty in the system model.

Separate Regressor

One popular method for quantifying uncertainty is to regress directly on the uncertainty[12]. Typically, two regressors are utilized: a value regressor and an uncertainty regressor. These work separately to predict their respective values. This method requires the uncertainty regressor to learn the specifications of the value regressor. Additionally, the training schedule must be carefully designed to ensure that both regressors learn in tandem. Furthermore, it is much easier (due to the complexity of simultaneously optimizing two systems) for this system to get stuck in a local optimum.

In our proposed approach we utilize a single network, thereby allowing for various components of the value regressor to interact with components of the uncertainty regressor (and vice versa). This reduces the computational overhead introduced by having two separate regressors. Furthermore, as we are only training a single network, our training schedule is significantly less complex.

Deep Gaussian Process

Deep Gaussian processes are a class of models utilized for regression that combine Gaussian processes (GPs) with deep architectures. These were initially introduced by [3]. A deep GP is a composition of GPs where each layer consists of GP units that connect it to the next layer. Imagine a neural network with one or more hidden nodes in which each edge connecting any two nodes is a GP. Exact inference on deep GPs is intractable, and although several variational approximation methods have been proposed, they are difficult to implement and do not extend readily to arbitrary kernels. They can be used to perform a regression with uncertainty bounds through a technique known as Kriging, but struggle with high-dimensional or large-scale data.

Because our proposed approach does not bear the significant overhead burden when training, it is capable of regressing on high-dimensional, large scale image data and provides uncertainty measures on the predictions made.

MC Dropout

MC Dropout[4], a method aimed at replicating the behavior of a Deep Gaussian process, can also be used for uncertainty quantification. However, utilizing this method requires ensembling on the network, leading to multiple evaluations of the network, which results in additional computational expense.

As we regress directly on the distribution parameters, we do not need to sample from our network thus making the network substantially faster (as we only need to make one forward pass through our network).

Ensemble Methods

Recently there has been research into utilizing Deep Ensembles [9] (an ensemble of deep learners) to create multiple hypotheses. From these hypotheses, an uncertainty can be inferred. While extremely promising, this method requires one to train multiple deep learners and evaluate multiple deep networks to generate uncertainty, resulting in a fairly computationally expensive process.

We avoid these issues by utilizing a single regressor: we only need to train, evaluate, and store one regressor.

Bayesian Regression

Drawing inspiration from statistics and probability theory, Bayesian regression assigns each parameter a prior probability distribution. Bayesian learning, via the Bayesian update rule, is utilized to update the probability distributions to best fit the data. One can extend Bayesian regression to neural networks [8][2][6] utilizing a similar methodology to produce uncertainty. As before, this requires storing and optimizing a distribution for each parameter, and thus is computationally expensive. Additionally, to determine a hypothesis and uncertainty, one must sample the network utilizing techniques such as variational inference [5] or MCMC [13].

We are performing ordinary regression on the distribution parameters; each of our weights and biases takes on a single value. Thus, we do not need to sample from our network. This enables us to use our method on large-scale imaging datasets.

3 Methodology

3.1 Framing the Problem

Suppose we have samples $(x, y) \sim P(X, Y)$, where $P(X, Y)$ is a joint probability distribution of $X$ and $Y$. For this paper, we assume that

$$P(Y \mid X = x) = \mathcal{N}\big(\mu(x), \Sigma(x)\big).$$

That is to say that each cross section of the joint probability density function (PDF) degenerates into a normal distribution. Furthermore, we assume that the output dimensions are conditionally independent of each other. Thus, for all $x$, $\Sigma(x)$ is a diagonal matrix.

We wish to learn a mapping from $X$ to the parameters of a Gaussian:

$$f_\theta : x \mapsto \big(\mu_\theta(x), \sigma_\theta(x)\big),$$

where $\theta$ denotes the parameters of the regressor, so that

$$\mathcal{N}\big(\mu_\theta(x), \sigma_\theta(x)^2\big) \approx P(Y \mid X = x)$$

(or alternatively the KL divergence is minimized). Utilizing this mapping, one can determine the uncertainty (both epistemic and aleatory) of our model. We demonstrate the capability of capturing both of these uncertainties in the experiments section.

Thus, as our target distributions are Gaussian, by producing a distribution on the target variable, one can produce confidence intervals on the target variable. If one can determine such a mapping, one can achieve the uncertainty quantification described above.
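As a minimal sketch of how a predicted Gaussian translates into an interval, the snippet below computes a central interval from a predicted mean and standard deviation. The function name `gaussian_interval`, the 95% level, and the example values are assumptions made for illustration; they do not come from the paper.

```python
from statistics import NormalDist


def gaussian_interval(mu, sigma, level=0.95):
    """Central interval of N(mu, sigma^2) that contains `level` of the mass."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Two-sided z-score of the standard normal for the requested level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mu - z * sigma, mu + z * sigma


# Hypothetical prediction: mean 30.0 with standard deviation 5.0.
low, high = gaussian_interval(mu=30.0, sigma=5.0, level=0.95)
```

The interval is symmetric around the predicted mean, and its width grows linearly with the predicted standard deviation.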

3.2 Approach

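To learn the mapping described in Section 3.1, we train a regressor to output the parameters of our target distribution with the following log-likelihood loss:

$$L(\theta) = -\mathbb{E}_{(x, y) \sim P}\big[\log Q_\theta(y \mid x)\big],$$

where $P$ is the true joint distribution on $X \times Y$ and $Q_\theta(\cdot \mid x) = \mathcal{N}\big(\mu_\theta(x), \sigma_\theta(x)^2\big)$ is a Gaussian induced with parameters from the regressor. In the appendix we demonstrate that an optimal learning scheme (with appropriate assumptions) will converge to the true distribution parameters assuming the target distribution is a Gaussian. Thus, a scheme which is optimal under this loss will also have a minimal mean squared error (or any other loss) to the target data points.

This loss under finite data degenerates into NLL loss:

$$L(\theta) = -\sum_i p_i \log Q_\theta(y_i \mid x_i),$$

where $p_i$ is the frequency with which $(x_i, y_i)$ occurs in the dataset. As our target distribution is a Gaussian (for this simplification, please note that $\log$ is base $e$),

$$L(\theta) = \sum_i p_i \left[\log \sigma_\theta(x_i) + \frac{\big(y_i - \mu_\theta(x_i)\big)^2}{2\,\sigma_\theta(x_i)^2} + \frac{1}{2}\log 2\pi\right],$$

which will reach its minimum where

$$\sum_i p_i \left[\log \sigma_\theta(x_i) + \frac{\big(y_i - \mu_\theta(x_i)\big)^2}{2\,\sigma_\theta(x_i)^2}\right]$$

reaches its minimum. This loss is preferable over a multitude of other losses (such as KL divergence) as it does not require defining an auxiliary ground-truth probability distribution.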
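The Gaussian NLL described in this section can be sketched in a few lines. This is an illustrative, framework-free version that weights every sample equally; the function name `gaussian_nll` and the example values are assumptions, and a real training loop would compute the same quantity inside an automatic-differentiation framework.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of targets y under N(mu_i, sigma_i^2)."""
    total = 0.0
    for y_i, mu_i, sigma_i in zip(y, mu, sigma):
        if sigma_i <= 0:
            raise ValueError("sigma must be positive")
        total += (math.log(sigma_i)
                  + (y_i - mu_i) ** 2 / (2.0 * sigma_i ** 2)
                  + 0.5 * math.log(2.0 * math.pi))
    return total / len(y)


# Correct means with unit standard deviation.
nll_good = gaussian_nll([1.0, 2.0], [1.0, 2.0], [1.0, 1.0])
# Wrong means reported with a very small standard deviation (overconfident).
nll_overconfident = gaussian_nll([1.0, 2.0], [3.0, 0.0], [0.1, 0.1])
```

The squared error is divided by the predicted variance, so a confident but wrong prediction is penalized far more heavily than the same error reported with a large standard deviation.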

3.3 Network Architecture

We model this regressor utilizing a multi-component neural network, which must output two values (assuming we are regressing on a single target variable): the mean and the standard deviation (which can also be interpreted as an uncertainty). Figure 1 shows an overview of the generalized architecture with the various components. When applied to high-dimensional numerical input data, the feature extractor can be implemented as a deep neural network which embeds the data into a lower-dimensional space; when applied to time-series data, the feature extractor can be implemented as some variant of a recurrent neural network (RNN); and when applied to complex natural images, a convolutional neural network can be used for feature extraction.

We typically utilize two fully connected layers as the regressor layer, where the number of nodes is determined by the complexity of the output of the feature extraction layer. The regressor network produces the mean and the standard deviation, and Softplus is applied to the standard deviation to ensure a valid probability distribution is generated. We demonstrate the efficacy of the architecture on different data types in Section 4.
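A minimal sketch of the output stage is given below, assuming a single linear layer per output for brevity (the text describes two fully connected layers). The names `softplus` and `regressor_head` and the weights are hypothetical and chosen only for illustration.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z)).

    Positive in exact arithmetic; in floating point it can underflow to 0.0
    for very negative z, so practical code often adds a small epsilon.
    """
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def regressor_head(features, w_mu, b_mu, w_sigma, b_sigma):
    """Map a feature vector to (mean, standard deviation)."""
    mu = sum(w * f for w, f in zip(w_mu, features)) + b_mu
    raw_sigma = sum(w * f for w, f in zip(w_sigma, features)) + b_sigma
    # Softplus keeps the standard deviation non-negative.
    return mu, softplus(raw_sigma)


# Hypothetical two-dimensional feature vector and weights.
mu, sigma = regressor_head([1.0, 2.0], [0.5, -0.25], 0.1, [0.0, 0.0], 0.0)
```

Both outputs share the same features, so a single forward pass yields the prediction and its uncertainty.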

Figure 1: General architecture for the regressor.

4 Experiments and Results

To evaluate the proposed method, we utilized it in a variety of experiments described in the ensuing subsections. For the first toy dataset, the goal was simply to test the regression capabilities of the model and visualize the uncertainty measures on two dimensions of the well-known caloric dataset from Kaggle. For the next set of experiments, we applied the method to the standard benchmark datasets commonly used to measure the quality of a regression algorithm and compared it with the state-of-the-art techniques described in Section 2. Next, we applied the algorithm to highly volatile stock prices, using data from 2015 to mid-2018 to predict the uncertainty of the stock from mid-2018 to date. These uncertainty periods were shown to correspond to different current-affairs events that could have impacted that particular stock.

Lastly, we applied the network to the complex, large-scale IMDb-Wiki image dataset to demonstrate its efficacy in performing high-quality age predictions (via regression) along with the uncertainty measures associated with the predictions. While many deep learning applications have been used successfully for different problems involving image data, there is very little work in the literature on estimating the associated uncertainties. In the different networks tested, we measure the quality of the uncertainties via the predictive log likelihood.

4.1 Toy Caloric Datasets

As an initial test of our framework, we perform a one dimensional regression utilizing one parameter on a toy dataset created from Kaggle (“Exercise and Calories”). This is used purely for illustrative purposes. In addition, this will provide empirical evidence that this network is capable of capturing aleatory uncertainty (at least for basic scenarios). We attempt to determine how many calories an individual burned based on body heat. Please note that we add artificial noise to discourage the network from memorizing the mean and standard deviations of each input.

In Figure 2, one can observe that this network successfully converges to perform the distribution regression, demonstrating that the network can capture aleatory uncertainty (the randomness inherent in the system). As this dataset is purely for demonstrative purposes, we do not include quality metrics.

Figure 2: Regression on calories burned from body heat data. The yellow line is the regression. The gray shaded region is the confidence interval

4.2 Numerical Datasets

Dataset     | PBP [6]      | MC-Dropout [4] | Deep Ensembles [9] | Ours
Boston      | 2.57 ± 0.09  | 2.46 ± 0.25    | 2.41 ± 0.25        | 2.23 ± 0.05
Concrete    | 3.16 ± 0.02  | 3.04 ± 0.09    | 3.06 ± 0.18        | 3.05 ± 0.04
Energy      | 2.04 ± 0.02  | 1.99 ± 0.09    | 1.38 ± 0.22        | 1.91 ± 0.02
Kin8nm      | -0.90 ± 0.01 | -0.95 ± 0.03   | -1.20 ± 0.02       | -1.18 ± 0.02
Naval-      | -3.73 ± 0.01 | -3.80 ± 0.05   | -5.63 ± 0.05       | -3.82 ± 0.09
Power plant | 2.84 ± 0.01  | 2.80 ± 0.05    | 2.79 ± 0.04        | 2.85 ± 0.01
Protein     | 2.97 ± 0.00  | 2.89 ± 0.01    | 2.83 ± 0.02        | 2.14 ± 0.01
Wine        | 0.97 ± 0.01  | 0.93 ± 0.06    | 0.94 ± 0.12        | 0.87 ± 0.02
Yacht       | 1.63 ± 0.02  | 1.55 ± 0.12    | 1.18 ± 0.21        | 4.06 ± 0.00
MSD         | 3.60 ± NA    | 3.59 ± NA      | 3.35 ± NA          | 3.40 ± NA
Table 1: Comparison of the NLL performance of different architectures on popular benchmark datasets. Measurements courtesy of the Deep Ensembles paper by Lakshminarayanan et al. [9].

As an additional test of the framework, we demonstrate that our model is on par with or superior to other popular uncertainty quantification models (specifically PBP [6], MC Dropout [4], and Deep Ensembles [9]) for regression on several benchmark datasets commonly used to measure the quality of a regression algorithm. As can be observed in Table 1, with the exception of the Yacht dataset, where our technique is below par, we performed on par with or outperformed all other approaches in terms of NLL (negative log-likelihood).

4.3 Uncertainty Measures on Stock Prices

To test for uncertainty in predictions on large, complex, stochastic, highly volatile time-series data, we applied the methodology to similar stocks from the entertainment industry (footnote 5). The family comprises stocks from 21st Century Fox, Inc. (FOX), Netflix, Inc. (NFLX), Time Warner, Inc. (TWX), Amazon.com, Inc. (AMZN), Walt Disney Co. (DIS), and Comcast Corporation (CMCSA). The stocks are classified as a family based on their sector, industry, and asset class, and on their prices being highly correlated with each other over an extended period of time.

Footnote 5: One of the authors spent his summer internship at a financial organization and specifically analyzed this family of stocks.

Suppose we are given the stock close prices for $n$ days prior to day $t$: $p_{t-n}, \dots, p_{t-1}$. We wish to predict the closing price on day $t$: $p_t$. To do this, we predict $(\mu_t, \sigma_t)$ to prescribe a distribution onto $p_t$.

Figure 3: The blue graph is the stock price chart for FOX while the red graph is the measure of uncertainty estimated by the network. Image is best viewed in color

4.3.1 Data preparation and Training Schedule

We downloaded the publicly available stock price information for the family of stocks explained above. Data for the entire family from 2015 till May 2018 was used as training data, with the goal of predicting uncertainty for only the FOX stocks from June 2018 till date (February 2019 at the time of submission).

The network shown in Figure 1 was implemented with the feature extraction layer being a gated recurrent unit (GRU) with a look-back of 10 days. The training scheme involved looking at the stock prices over a period of 10 days with the goal of predicting the price on the 11th day along with the measure of uncertainty of the prediction. The resulting uncertainty measures are shown in Figure 3.
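The 10-day look-back described above amounts to slicing the price series into overlapping windows, each paired with the following day's close. The sketch below illustrates that data preparation step only; the helper name `make_windows` and the price values are made up for the example.

```python
def make_windows(prices, look_back=10):
    """Split a price series into (input window, next-day target) pairs."""
    windows, targets = [], []
    for t in range(look_back, len(prices)):
        windows.append(prices[t - look_back:t])  # the look_back days before day t
        targets.append(prices[t])                # the closing price on day t
    return windows, targets


# Made-up closing prices for 15 consecutive trading days.
prices = [float(p) for p in range(100, 115)]
windows, targets = make_windows(prices, look_back=10)
```

A series of 15 days yields 5 training pairs here, since the first 10 days serve only as input to the first window.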


4.3.2 Analysis of Results

To analyze the uncertainties resulting from the implementation, we set a threshold of 0.5 so that days on which the uncertainty measure was above this threshold were flagged as anomalous trading days. We provide a list of FOX-related news (footnote 6) and compare it with the anomalous days predicted by our network. The results are shown in Table 2.

Footnote 6: News data was obtained from [...]. We threw away many other events, leaving those related to where our uncertainties were high in that period.

Real Date           | Network predictions | News related to 21st Century FOX
05-17-18            | 05-31-18            | Suzanne Scott named CEO of FOX News
06-13-18            | 06-15-18            | Comcast offers to buy 21st Century Fox media assets for $65B in cash
10-19 till 10-20-18 | 10-19 till 10-22-18 | Walt Disney receives unconditional approval from China for 21st Century Fox deal; Amazon/Blackstone bid for Disney’s 22 regional sports networks
11-26-18            | 11-26-18            | Disney, Fox sued in U.S. for $1B over Malaysia theme park
01-07-19            |                     | 21st Century Fox announces filing of registration statement on Form 10 for Fox
Table 2: The left column shows true dates on which major events occurred at 21st Century Fox; the second column shows the closest date estimated by our network, and the last column describes the event.
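The thresholding step used to flag anomalous trading days can be sketched as follows. The 0.5 threshold comes from the text; the function name `flag_anomalous_days`, the dates, and the uncertainty values are invented for illustration and are not results from the paper.

```python
def flag_anomalous_days(dates, sigmas, threshold=0.5):
    """Return the dates whose predicted uncertainty exceeds the threshold."""
    return [d for d, s in zip(dates, sigmas) if s > threshold]


dates = ["2018-06-12", "2018-06-13", "2018-06-14", "2018-06-15"]
sigmas = [0.21, 0.48, 0.50, 0.73]  # made-up uncertainty values
flagged = flag_anomalous_days(dates, sigmas)
```

Only days strictly above the threshold are flagged, so a value of exactly 0.5 is not treated as anomalous in this sketch.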

4.4 Age Estimation from Face Image

To test how well this architecture will work on large complex datasets, we applied it on the nontrivial problem of age estimation. Given an image of a face, the network was tasked with predicting the age of the individual in the picture. Posed as a general problem, this task is a very challenging regression problem.

We utilized the IMDb-Wiki dataset: a dataset of half a million faces scraped from both IMDb and Wikipedia (primarily IMDb), and tagged with the corresponding ages of individuals in the images. This dataset was generated by first identifying faces in images utilizing the Mathias et al. face detector [11]. The faces were then given a 40% margin around the border and cropped out. Finally, the age was automatically extracted from the document by extracting both the time of the photograph and the year of the individual’s birth. Due to the highly automated nature of the data collection, this dataset is very noisy: multiple entries in the dataset contain either no face or multiple faces. Additionally, several of the entries contain just a copyright sign. Also, in some cases, the collection year was incorrectly extracted from the webpage. See Figure 4 for examples of invalid face images.

Figure 4: Examples of invalid data in the IMDb-Wiki dataset. Images were identified algorithmically as having extremely high uncertainty and loss values

Although this dataset contained a similar number of males and females (see Figure 5, right), it contained primarily individuals between 20 and 40 years old. Additionally, because the IMDb dataset contained a random sampling of Hollywood actors, the dataset was primarily composed of young Caucasian individuals, and thus had high implicit bias. We empirically demonstrated that our method is still capable of correctly identifying underrepresented samples in spite of the imbalances in the data.

Figure 5: Statistics of the IMDb dataset, based on the image labels provided. The left panel shows the age distribution and the right panel shows the gender distribution.

4.4.1 Cleaning, Preparation and Training Schedule

We did not wish to excessively clean the data, but rather remove the clearly wrong data. We did this by removing samples which had individuals younger than three years old or older than 100 years old. Additionally, we removed images which were too small (namely smaller than 16 by 16). We then re-sized all the images to 224 by 224.

We did not remove invalid images which contained multiple or no faces, although we standardized the images to be color (if they were black and white, we replicated that channel over the R, G, and B channels). Additionally, we did not remove the mislabeled entries (those which had valid ages and valid images, but were clearly mislabeled). We did not remove these so we could test the ability of the uncertainty quantification. One would expect an appropriate uncertainty quantification algorithm to give high uncertainty for invalid or abnormal data. We exploit this later to automatically clean the dataset.
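The light cleaning rules above can be written down as a small filter plus a channel-replication helper. This is a sketch under assumptions: the names `keep_sample` and `to_rgb` are hypothetical, an image is treated as too small when either side is below 16 pixels (the text does not say how the size is compared), and images are represented as nested lists rather than through an image library.

```python
def keep_sample(age, width, height, min_age=3, max_age=100, min_side=16):
    """Apply the age and image-size filters described above."""
    if age < min_age or age > max_age:
        return False
    # Assumption: an image is too small if either side is below min_side.
    if width < min_side or height < min_side:
        return False
    return True


def to_rgb(pixel_rows):
    """Replicate a single-channel image over the R, G, and B channels."""
    return [[(v, v, v) for v in row] for row in pixel_rows]


kept = keep_sample(age=25, width=64, height=64)
rgb = to_rgb([[0, 255]])
```

Resizing to 224 by 224 is left out because it depends on the image library in use.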

This image-based network was trained utilizing a 16-layer convolutional neural network (CNN) with an Adam optimizer, a learning rate of 0.00005 and a batch size of 8 until convergence (6 epochs).

4.4.2 Analysis of results

Top row: 20/19.6/3.1, 21/22.0/3.2, 24/22.3/7.5, 23/23.8/7.8, 22/22.8/3.6
Middle row: 92/26.1/7.9, 82/30.8/6.4, 67/27.1/5.9, 70/25.6/6.6, 82/35.3/7.2
Bottom row: 36/55.3/19.0, 58/62.0/19.1, 27/69.6/21.6, 69/76.4/20.5, 88/83.9/20.8
Figure 6: The three numbers below each image correspond to (i) the actual age (as provided in the dataset)/(ii) the estimated age (as predicted by the regression network)/(iii) the uncertainty value reported by the network (the higher the value, the more uncertain the prediction). The top row shows some of the faces on which the network reported the lowest error values. The middle row shows the faces on which the highest errors were reported, and the last row shows the faces on which the network reported the highest uncertainty.

Method                                   | MAE  | NLL
CNN + Regressor                          | 7.54 |
CNN + Regressor + Uncertainty            | 7.57 | 3.63
CNN + Regressor + Uncertainty + Cleaning | 5.22 | 3.53
Table 3: The accuracy (both mean absolute error and negative log likelihood) of various approaches on age estimation.

The results of these experiments were very promising (first row of Figure 6). Even on this noisy dataset, the architecture only performed poorly when the ground truth was wrong (see the middle row of Figure 6). These results demonstrate that our model is capable of capturing epistemic uncertainty. Additionally, this architecture’s uncertainty not only expressed how confident the model was, but also how clean the data sample was. Thus, this model often reported high confidence if a sample was well represented within the dataset. Empirically we can see that difficult samples (those in a class with low representation, poor lighting, side facing faces, ambiguous individual, multiple faces) obtained high uncertainty. Images in the dataset which were incorrectly scraped along with excessively noisy or incorrect data had the highest uncertainty (see the last row of Figure 6). This architecture can therefore be used to evaluate the quality of samples, assuming a large portion of the data is of good quality.

4.4.3 Determining the overhead of uncertainty quantification

An error quantification network is often only appealing if it does not have a significant impact on performance. Thus, the discrepancy in some error metric (say RMSE) between a classical regressor and an uncertainty-aware regressor should be minimized. To this end, we train two networks (one vanilla regressor and one error quantification regressor) until convergence, utilizing the same initial configuration of parameters and the same number of parameters (except for the last layer).

Examining Table 3, we can observe that the discrepancy between the MAE of the uncertainty-agnostic regressor and the uncertainty-aware regressor is negligible. Thus, computing the uncertainty does not provide any significant additional overhead to this model. This final layer can therefore be added to any regressor to provide uncertainty metrics.

4.4.4 Automated data cleaning

As described earlier, this architecture can be utilized to determine the quality of a sample by examining the uncertainty produced. After training, we identified the samples with the top uncertainty and removed them (please note that we left the validation samples unchanged). After removing these samples from our training set, we obtained significantly better results on the validation dataset (see Table 3). Thus, this architecture is uniquely well-suited for unclean datasets to generate relatively high performing regressors.
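A minimal sketch of this cleaning step is shown below. The text does not state how many samples were removed, so `drop_fraction` is an assumed parameter; the function name `drop_most_uncertain`, the sample labels, and the uncertainty values are likewise illustrative.

```python
def drop_most_uncertain(samples, sigmas, drop_fraction=0.1):
    """Keep the samples whose predicted sigma is not in the top fraction."""
    n_drop = int(len(samples) * drop_fraction)
    # Indices sorted from most to least uncertain.
    order = sorted(range(len(samples)), key=lambda i: sigmas[i], reverse=True)
    dropped = set(order[:n_drop])
    return [s for i, s in enumerate(samples) if i not in dropped]


# Hypothetical training samples with their predicted uncertainties.
samples = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
sigmas = [3.1, 3.2, 7.5, 7.8, 3.6, 19.0, 6.4, 21.6, 5.9, 6.6]
cleaned = drop_most_uncertain(samples, sigmas, drop_fraction=0.2)
```

Only the training set is filtered this way; the validation samples stay unchanged so that results before and after cleaning remain comparable.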

5 Limitations

There are several limitations to both the uncertainty quantification method and the data cleaning process described.

Uncertainty Quantification

While the uncertainty quantification has been demonstrated (both analytically and empirically) to appropriately quantify error, applying this direct method will likely fail if the uncertainty follows any distribution other than normal (but this is also the case for confidence measurements in classical statistics).

Data Cleaning Process

While the data cleaning process was implemented successfully here and shown to benefit learning, utilizing this process can have several adverse side effects if not applied carefully. Eliminating uncertainty could increase the systematic bias of the dataset and compromise the integrity of the data.

6 Conclusion and Future Work

This method of uncertainty quantification has been demonstrated to work well with large scale image datasets. Furthermore, this method has been shown to perform better than current methods for uncertainty quantification without overhead. This uncertainty quantification aspect has been exploited to develop a data cleaning procedure which improved the accuracy on an unchanged validation set. Future work for this subject includes generalizing this to arbitrary distribution regression and investigating uncertainty for classification.


References

  • [1] C. Bishop (1994-01) Mixture density networks. Technical report.
  • [2] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. In International Conference on Machine Learning (ICML), pp. 1613–1622.
  • [3] A. Damianou and N. Lawrence (2013-05) Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 31, pp. 207–215.
  • [4] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pp. 1050–1059.
  • [5] A. Graves (2011) Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2348–2356.
  • [6] J. M. Hernández-Lobato and R. P. Adams (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv:1502.05336.
  • [7] A. Der Kiureghian and O. Ditlevsen (2009) Aleatory or epistemic? Does it matter? Structural Safety 31 (2), pp. 105–112.
  • [8] A. Korattikara Balan, V. Rathod, K. P. Murphy, and M. Welling (2015) Bayesian dark knowledge. In Advances in Neural Information Processing Systems (NIPS), pp. 3438–3446.
  • [9] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NIPS), pp. 6402–6413.
  • [10] C. Leibig, V. Allken, P. Berens, and S. Wahl (2017) Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports 7.
  • [11] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool (2014) Face detection without bells and whistles. In European Conference on Computer Vision (ECCV), pp. 720–735.
  • [12] D. A. Nix and A. S. Weigend (1994) Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN94).
  • [13] A. Vehtari, S. Sarkka, and J. Lampinen (2000) On MCMC sampling in Bayesian MLP neural networks. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000). Neural Computing: New Challenges and Perspectives for the New Millennium.

7 Appendix

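As stated in the paper, we are learning from data drawn from $P(X, Y)$.

Theorem 1.

$L = -\mathbb{E}_{(x, y) \sim P}\big[\log Q(y \mid x)\big]$ is minimized (under infinite i.i.d. data) when $Q(y \mid x) = P(y \mid x)$ if for all $x$, $P(y \mid x) = \mathcal{N}(\mu_x, \sigma_x^2)$ for some $\mu_x, \sigma_x$.

Proof.

Suppose $P(y \mid x) = \mathcal{N}(\mu_x, \sigma_x^2)$ (i.e. for a given $x$, $y$ is normally distributed). Then, for that fixed $x$,

$$L = -\int_{-\infty}^{\infty} P(y \mid x) \log Q(y \mid x)\, dy.$$

Furthermore, $Q(y \mid x) = F'(y)$ for some cumulative probability distribution $F$. Please note we notate $F(y)$ as $F$ and $P(y \mid x)$ as $P$ for brevity. Thus,

$$L = -\int_{-\infty}^{\infty} P \log F' \, dy.$$

By the Euler-Lagrange theorem, this functional reaches its minimum when

$$\frac{\partial \ell}{\partial F} - \frac{d}{dy}\frac{\partial \ell}{\partial F'} = 0,$$

where $\ell = -P \log F'$ is the integrand of $L$. Differentiating, one obtains

$$\frac{d}{dy}\left(\frac{P}{F'}\right) = 0, \quad \text{i.e.} \quad P' F' - P F'' = 0.$$

As $P = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left(-\frac{(y - \mu_x)^2}{2\sigma_x^2}\right)$,

$$P' = -\frac{y - \mu_x}{\sigma_x^2}\, P,$$

$$\frac{F''}{F'} = \frac{P'}{P} = -\frac{y - \mu_x}{\sigma_x^2}.$$

Solving this separable differential equation under the condition that $\int_{-\infty}^{\infty} F' \, dy = 1$ yields

$$F'(y) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left(-\frac{(y - \mu_x)^2}{2\sigma_x^2}\right).$$

Thus, $Q(y \mid x) = P(y \mid x)$ as desired. ∎

As an optimal learning scheme reaches the global optimum, under an optimal learning scheme, $Q$ would be learned to be $P$.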
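As a small numerical illustration of this result in the finite-sample setting (not a substitute for the proof), the sketch below checks that the sample mean and the population-style standard deviation give a lower Gaussian NLL than nearby parameter values. The data values and the helper name `mean_nll` are assumptions made for the example.

```python
import math


def mean_nll(ys, mu, sigma):
    """Mean Gaussian negative log-likelihood of ys under N(mu, sigma^2)."""
    return sum(math.log(sigma) + (y - mu) ** 2 / (2 * sigma ** 2)
               + 0.5 * math.log(2 * math.pi) for y in ys) / len(ys)


ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up targets for one input
mu_hat = sum(ys) / len(ys)
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in ys) / len(ys))
best = mean_nll(ys, mu_hat, sigma_hat)

# Perturbing either parameter in either direction increases the loss.
perturbed = [mean_nll(ys, mu_hat + dm, sigma_hat + ds)
             for dm, ds in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]]
```

The minimizing parameters coincide with the maximum-likelihood estimates of a Gaussian, which is the finite-data counterpart of the statement that the learned distribution matches the true one.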