1 Introduction
When utilizing regression under deep learning, one typically attempts to learn an optimal mapping (under some loss function) from a feature space (notated here as $\mathcal{X}$) to some target space (notated here as $\mathcal{Y}$): $f: \mathcal{X} \to \mathcal{Y}$. We wish to learn that function such that some loss function is minimized. Due to this construction, a regressor typically predicts a single value. Frequently, however, the feature space does not capture enough information to perfectly predict $y$. Additionally, there is frequently stochasticity present within the system (i.e., aleatoric uncertainty) which prevents any model from predicting $y$ perfectly. The main focus of this work is to construct a regressor which efficiently regresses onto a normal distribution on the target space. The choice of a Gaussian as the target distribution is in keeping with traditional statistical methods, where the measurement errors occurring in regression problems are assumed to follow a normal distribution.
Deep neural networks have transformed the field of machine learning by allowing advanced concepts to be learned from large-scale input data. These techniques have enabled recent breakthroughs in pattern recognition with applications in many fields, such as chemistry, biology, physics, manufacturing, and the medical sciences. Although promising and highly useful, many deep learning techniques only provide a point estimate and seldom provide a means to understand inherent uncertainty in the input data. Due to this, they are frequently incapable of understanding their own limitations.
(Although in classification one can determine how far training samples are from the decision boundary, this confidence is often significantly different from understanding the inherent limitations of the learning system.) This can potentially have disastrous impacts in many important real-life scenarios. For many areas of scientific study, especially areas of critical importance such as medical image analysis for patient diagnosis, this lack of uncertainty quantification is highly problematic. The inability to understand and quantify the model's confidence in its predicted values is a major source of potential risk and liability [10]. For example, when faced with a difficult diagnosis, the ability of a deep learning system to report large uncertainty would allow human operators to intervene and review those specific cases. If deep learning is to be widely used for critical applications in practical settings, such as making medical diagnoses from input data, a key requirement is the ability to provide statistically meaningful uncertainty measurements alongside predictions.
In his 1994 Ph.D. thesis, Bishop [1] introduced Mixture Density Networks (MDNs), in which a neural network is used to predict a probability distribution over the target value rather than a single point estimate. MDNs train with a fixed number of Gaussian mixture components over the course of the training scheme, using the Negative Log Likelihood (NLL) as the network's loss function. This training scheme has the potential to address many of the issues highlighted above, but due to limitations in computing power in the 1990s, MDNs did not gain wide popularity. In this paper we present an elegant and simplified approach to quantify uncertainty in large-scale regression problems. We propose a one-shot approach requiring no significant overhead. Additionally, we demonstrate that the uncertainty bounds produced by this system are on par with or better than currently existing methods reported in the literature. Finally, we illustrate how this can be utilized for cleaning datasets and removing erroneous data autonomously.
2 Prior Work
As uncertainty measurement is immensely useful, prior research has been conducted into quantifying uncertainty. However, many proposed methods require substantial overhead. Additionally, although many past works have made the distinction, our calculated uncertainty is capable of incorporating both aleatory and epistemic uncertainties [7]. (Aleatory uncertainty is irreducible uncertainty due to probabilistic variability, while epistemic uncertainty is reducible and stems from uncertainty in the system model.)
Separate Regressor
One popular method for quantifying uncertainty is to regress directly on the uncertainty [12]. Typically, two regressors are utilized: a value regressor and an uncertainty regressor. These work separately to predict their respective values. This method requires the uncertainty regressor to learn the specifications of the value regressor. Additionally, the training schedule must be carefully designed to ensure that both regressors learn in tandem. Furthermore, due to the complexity of simultaneously optimizing two systems, it is much easier for this setup to get stuck in a local optimum.
In our proposed approach we utilize a single network, thereby allowing for various components of the value regressor to interact with components of the uncertainty regressor (and vice versa). This reduces the computational overhead introduced by having two separate regressors. Furthermore, as we are only training a single network, our training schedule is significantly less complex.
Deep Gaussian Process
Deep Gaussian processes are a class of models for regression that combine Gaussian processes (GPs) with deep architectures; they were initially introduced by [3]. A deep GP is a composition of GPs in which each layer consists of GP units that connect it to the next layer: imagine a neural network with one or more hidden nodes in which each edge connecting two nodes is a GP. Exact inference on deep GPs is intractable, and although several variational approximation methods have been proposed, they are difficult to implement and do not extend readily to arbitrary kernels. GPs can be used to perform regression with uncertainty bounds through a technique known as kriging, but they struggle with high-dimensional or large-scale data.
Because our proposed approach does not bear this significant training overhead, it is capable of regressing on high-dimensional, large-scale image data while providing uncertainty measures on the predictions made.
MC Dropout
MC Dropout [4], a method aimed at replicating the behavior of a deep Gaussian process, can also be used for uncertainty quantification. However, this method requires ensembling over multiple stochastic evaluations of the network, which results in additional computational expense.
As we regress directly on the distribution parameters, we do not need to sample from our network thus making the network substantially faster (as we only need to make one forward pass through our network).
Ensemble Methods
Recently there has been research into utilizing Deep Ensembles [9] (an ensemble of deep learners) to create multiple hypotheses, from which an uncertainty can be inferred. While extremely promising, this method requires one to train and evaluate multiple deep networks to generate uncertainty, resulting in a fairly computationally expensive process.
Since we are only utilizing a single regressor, we only need to train, evaluate, and store one regressor.
Bayesian Regression
Drawing inspiration from statistics and probability theory, Bayesian regression assigns each parameter a prior probability distribution. Bayesian learning, via the Bayesian update rule, is utilized to update these distributions to best fit the data. One can extend Bayesian regression to neural networks [8][2][6] utilizing a similar methodology to produce uncertainty. As before, this requires storing and optimizing a distribution for each parameter and is thus computationally expensive. Additionally, to determine a hypothesis and uncertainty, one must sample the network utilizing techniques such as variational inference [5] or MCMC [13]. We perform ordinary regression on the distribution parameters; each of our weights and biases takes on a single value. Thus, we do not need to sample from our network. This enables us to use our method on large-scale imaging datasets.
3 Methodology
3.1 Framing the Problem
Suppose we have samples $(x, y) \sim \mathcal{D}$, where $\mathcal{D}$ is a joint probability distribution of $X$ and $Y$. For this paper, we assume that

(1)  $Y \mid X = x \sim \mathcal{N}(\mu(x), \Sigma(x))$

That is to say that each cross section of the joint probability density function (PDF) degenerates into a normal distribution. Furthermore, we assume that the output dimensions are conditionally independent of each other; thus, for all $x$, $\Sigma(x)$ is a diagonal matrix.
We wish to learn a mapping from $x$ to the parameters of a Gaussian:

(2)  $g: \mathcal{X} \to \mathbb{R}^{n} \times \mathbb{R}^{n}_{>0}$

where $g(x) = (\hat{\mu}(x), \hat{\sigma}(x))$ so that

(3)  $\mathcal{N}\left(\hat{\mu}(x), \operatorname{diag}(\hat{\sigma}(x)^{2})\right) \approx P(Y \mid X = x)$
(or, alternatively, so that the KL divergence is minimized). Utilizing this mapping, one can determine the uncertainty (both epistemic and aleatory) of our model. We demonstrate the capability of capturing both of these uncertainties in the experiments section.
Thus, as our target distributions are Gaussian, by producing a distribution on the target variable, one can produce confidence intervals on the target variable. If one can determine such a mapping, one can achieve the uncertainty quantification described above.
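Concretely, once the regressor emits a mean and standard deviation for an input, an interval at any desired coverage follows from standard normal quantiles. A minimal sketch (the function name and example values are ours, not the paper's):

```python
def confidence_interval(mu, sigma, z=1.96):
    """Central interval implied by a predicted Gaussian N(mu, sigma^2).

    z = 1.96 corresponds to roughly 95% coverage under normality.
    """
    return (mu - z * sigma, mu + z * sigma)

# e.g. a prediction of mu = 30.0 with sigma = 4.0 implies a ~95% interval
lo, hi = confidence_interval(30.0, 4.0)  # (22.16, 37.84)
```

Tighter or looser intervals follow by swapping in the quantile for the desired coverage level.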
3.2 Approach
To learn the mapping described in Section 3.1, we train a regressor to output the parameters of our target distribution with the following log-likelihood loss:
(4)  $\mathcal{L}[g] = -\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\log q_{g(x)}(y)\right]$

where $\mathcal{D}$ is the true joint distribution on $\mathcal{X} \times \mathcal{Y}$ and $q_{g(x)}$ is the Gaussian density induced by the parameters produced by the regressor. In the appendix we demonstrate that an optimal learning scheme (under appropriate assumptions) will converge to the true distribution parameters, assuming the target distribution is Gaussian. Thus, a scheme which is optimal under this loss will also have minimal mean squared error (or any other loss) with respect to the target data points. Under finite data, this loss degenerates into the NLL loss:
(5)  $\mathcal{L}[g] = -\sum_{(x, y)} f(x, y) \log q_{g(x)}(y)$

(6)  $\approx -\frac{1}{N} \sum_{i=1}^{N} \log q_{g(x_i)}(y_i)$
where $f(x, y)$ is the frequency with which $(x, y)$ occurs in the dataset. As our target distribution is a Gaussian (for this simplification, please note that $\log$ is base $e$),
(7)  $\mathcal{L}[g] = \frac{1}{N} \sum_{i=1}^{N} \left[\log\left(\sqrt{2\pi}\,\hat{\sigma}(x_i)\right) + \frac{(y_i - \hat{\mu}(x_i))^{2}}{2\hat{\sigma}(x_i)^{2}}\right]$

(8)  $= \frac{\log(2\pi)}{2} + \frac{1}{N} \sum_{i=1}^{N} \left[\log \hat{\sigma}(x_i) + \frac{(y_i - \hat{\mu}(x_i))^{2}}{2\hat{\sigma}(x_i)^{2}}\right]$
which will reach its minimum where
(9)  $\sum_{i=1}^{N} \left[\log \hat{\sigma}(x_i) + \frac{(y_i - \hat{\mu}(x_i))^{2}}{2\hat{\sigma}(x_i)^{2}}\right]$
reaches its minimum. This loss is preferable to a multitude of other losses (such as the KL divergence) as it does not require defining an auxiliary ground-truth probability distribution.
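For reference, the per-batch loss of Eq. (7) is only a few lines of NumPy; the function name and array shapes below are our own illustration, not the paper's implementation:

```python
import numpy as np

def gaussian_nll(mu, sigma, y):
    """Mean negative log likelihood of targets y under per-sample
    Gaussians N(mu_i, sigma_i^2); sigma must be strictly positive."""
    mu, sigma, y = (np.asarray(a, dtype=float) for a in (mu, sigma, y))
    return float(np.mean(np.log(np.sqrt(2.0 * np.pi) * sigma)
                         + (y - mu) ** 2 / (2.0 * sigma ** 2)))
```

When every target equals its predicted mean and sigma is 1, the loss reduces to the constant $\frac{1}{2}\log(2\pi)$; inflating sigma beyond what the residuals warrant increases it, which is what pushes the network toward calibrated standard deviations.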
3.3 Network Architecture
We model this regressor utilizing a multi-component neural network, which must output two values (assuming we are regressing on a single target variable): the mean and the standard deviation (which can also be interpreted as an uncertainty). Figure 1 shows an overview of the generalized architecture with its various components. When applied to high-dimensional numerical input data, the feature extractor can be implemented as a deep neural network which embeds the data into a lower-dimensional space; when applied to time series data, the feature extractor can be implemented as some variant of a recurrent neural network (RNN); and when applied to complex natural images, a convolutional neural network can be used for feature extraction.
We typically utilize two fully connected layers as the regressor layer, where the number of nodes is determined by the complexity of the output of the feature extraction layer. The regressor network produces the mean and the standard deviation, and a Softplus activation is applied to the standard deviation to ensure a valid probability distribution is generated. We demonstrate the efficacy of the architecture on different data types in Section 4.
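A minimal sketch of such a two-output head, assuming precomputed feature vectors (the names, sizes, and random placeholder weights are ours; a real implementation would use trained layers in a deep learning framework):

```python
import numpy as np

def softplus(x):
    """softplus(x) = log(1 + e^x): smooth and strictly positive."""
    return np.log1p(np.exp(x))

class GaussianHead:
    """Maps an extracted feature vector to (mean, standard deviation)."""

    def __init__(self, in_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_mu = rng.normal(scale=0.1, size=in_dim)
        self.w_sigma = rng.normal(scale=0.1, size=in_dim)

    def __call__(self, features):
        mu = float(features @ self.w_mu)
        # Softplus keeps the predicted standard deviation positive,
        # so the output always parameterizes a valid Gaussian.
        sigma = float(softplus(features @ self.w_sigma))
        return mu, sigma
```

The Softplus on the second output is the only change needed to turn an ordinary regression head into one that parameterizes a valid Gaussian.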
4 Experiments and Results
To evaluate the proposed method, we utilized it in a variety of experiments described in the ensuing subsections. For the first, toy dataset, the goal was simply to test the regression capabilities of the model and visualize the uncertainty measures in two dimensions on the well-known caloric dataset from Kaggle (https://www.kaggle.com/fmendes/exerciseandcalories). For the next set of experiments, we applied the method to the standard benchmark datasets commonly used to measure the quality of a regression algorithm and compared with the state-of-the-art techniques described in Section 2. Next, we applied the algorithm to highly volatile stock prices, using data from 2015 to mid-year 2018 to predict the uncertainty of the stock from mid-year 2018 to date. These uncertainty periods were shown to correspond to current-affairs events that could have impacted that particular stock.
Lastly, we applied the network to a complex, large-scale image dataset, the IMDb-Wiki data, to demonstrate its efficacy in performing high-quality age predictions (via regression) along with the uncertainty measures associated with those predictions. While many deep learning applications have been used successfully for different problems involving image data, there is very little work in the literature on estimating the associated uncertainties. For the different networks tested, we measure the quality of the uncertainties via the predictive log likelihood.
4.1 Toy Caloric Datasets
As an initial test of our framework, we perform a one-dimensional regression utilizing one parameter on a toy dataset from Kaggle ("Exercise and Calories"). This is used purely for illustrative purposes. In addition, it provides empirical evidence that the network is capable of capturing aleatory uncertainty (at least in basic scenarios). We attempt to determine how many calories an individual burned based on body heat. Please note that we add artificial noise to discourage the network from memorizing the mean and standard deviation of each input.
In Figure 2, one can observe that this network successfully converges to perform the distribution regression, demonstrating that the network can capture aleatory uncertainty (the randomness inherent in the system). As this dataset is purely for demonstrative purposes, we do not include quality metrics.
4.2 Numerical Datasets
Table 1: Test negative log likelihood (mean ± standard error; lower is better) on benchmark regression datasets.

Dataset           | PBP [6]      | MC Dropout [4] | Deep Ensembles [9] | Ours
Boston            | 2.57 ± 0.09  | 2.46 ± 0.25    | 2.41 ± 0.25        | 2.23 ± 0.05
Concrete          | 3.16 ± 0.02  | 3.04 ± 0.09    | 3.06 ± 0.18        | 3.05 ± 0.04
Energy            | 2.04 ± 0.02  | 1.99 ± 0.09    | 1.38 ± 0.22        | 1.91 ± 0.02
Kin8nm            | -0.90 ± 0.01 | -0.95 ± 0.03   | -1.20 ± 0.02       | -1.18 ± 0.02
Naval propulsion  | -3.73 ± 0.01 | -3.80 ± 0.05   | -5.63 ± 0.05       | -3.82 ± 0.09
Power plant       | 2.84 ± 0.01  | 2.80 ± 0.05    | 2.79 ± 0.04        | 2.85 ± 0.01
Protein           | 2.97 ± 0.00  | 2.89 ± 0.01    | 2.83 ± 0.02        | 2.14 ± 0.01
Wine              | 0.97 ± 0.01  | 0.93 ± 0.06    | 0.94 ± 0.12        | 0.87 ± 0.02
Yacht             | 1.63 ± 0.02  | 1.55 ± 0.12    | 1.18 ± 0.21        | 4.06 ± 0.00
MSD               | 3.60 (NA)    | 3.59 (NA)      | 3.35 (NA)          | 3.40 (NA)
As an additional test of the framework, we demonstrate that our model is on par with or superior to other popular uncertainty quantification models (specifically PBP [6], MC Dropout [4], and Deep Ensembles [9]) for regression on several benchmark datasets commonly used to measure the quality of a regression algorithm. As can be observed in Table 1, with the exception of the Yacht dataset, where our technique underperforms, we performed on par with or outperformed all other approaches in terms of NLL (negative log likelihood).
4.3 Uncertainty Measures on Stock Prices
To test for uncertainty in predictions on large, complex, stochastic, highly volatile time series data, we applied the methodology to a family of similar stocks from the entertainment industry. (One of the authors spent his summer internship at a financial organization and specifically analyzed this family of stocks.) The family comprises 21st Century Fox, Inc. (FOX), Netflix, Inc. (NFLX), Time Warner, Inc. (TWX), Amazon.com, Inc. (AMZN), Walt Disney Co. (DIS), and Comcast Corporation (CMCSA). The stocks are classified as a family based on their sector, industry, and asset class, and on the prices of the stocks being highly correlated with each other over an extended period of time.
Suppose we are given the stock close prices for the $n$ days prior to day $t$: $p_{t-n}, \ldots, p_{t-1}$. We wish to predict the closing price on day $t$: $p_t$. To do this, we predict $(\hat{\mu}_t, \hat{\sigma}_t)$ to prescribe a distribution onto $p_t$.
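Preparing such training pairs is a simple sliding window over the price series; a sketch under our own naming (the paper uses a 10-day lookback):

```python
def make_windows(prices, lookback=10):
    """Slice a closing-price series into (lookback-day history,
    next-day target) pairs."""
    xs, ys = [], []
    for t in range(lookback, len(prices)):
        xs.append(prices[t - lookback:t])  # p_{t-n}, ..., p_{t-1}
        ys.append(prices[t])               # p_t
    return xs, ys
```

Each history window is then fed to the feature extractor, and the target is the next day's close.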
4.3.1 Data preparation and Training Schedule
We downloaded the publicly available stock price information for the family of stocks explained above. Data for the entire family from 2015 till May 2018 was used as training data, with the goal of predicting uncertainty for only the FOX stocks from June 2018 till date (February 2019 at the time of submission).
The network shown in Figure 1 was implemented with the feature extraction layer realized as a gated recurrent unit (GRU) with a lookback of 10 days. The training scheme involved looking at the stock prices over a period of 10 days with the goal of predicting the price on the 11th day, along with a measure of uncertainty in the prediction. The resulting uncertainty measures are shown in Figure 3.
4.3.2 Analysis of Results
To analyze the uncertainties resulting from the implementation, we set a threshold of 0.5, flagging days on which the uncertainty measure was above this threshold as anomalous trading days. We compiled a list of FOX-related news (obtained from https://www.reuters.com/finance/stocks/FOX/keydevelopments; we discarded events unrelated to the periods in which our uncertainties were high) and compared it with the anomalous days predicted by our network. The results are shown in Table 2.
Table 2: Days of high predicted uncertainty versus news related to 21st Century FOX.

Real Date            | Network Prediction   | News related to 21st Century FOX
05-17-18             | 05-31-18             | Suzanne Scott named CEO of FOX News
06-13-18             | 06-15-18             | Comcast offers to buy 21st Century Fox media assets for $65B in cash
10-19-18 to 10-20-18 | 10-19-18 to 10-22-18 | Walt Disney receives unconditional approval from China for 21st Century Fox deal; Amazon/Blackstone bid for Disney's 22 regional sports networks
11-26-18             | 11-26-18             | Disney, Fox sued in U.S. for $1B over Malaysia theme park
01-07-19             | —                    | 21st Century Fox announces filing of registration statement on Form 10 for Fox
4.4 Age Estimation from Face Image
To test how well this architecture works on large, complex datasets, we applied it to the nontrivial problem of age estimation. Given an image of a face, the network was tasked with predicting the age of the individual pictured. Posed as a general problem, this is a very challenging regression task.
We utilized the IMDb-Wiki dataset: half a million face images scraped from IMDb and Wikipedia (primarily IMDb) and tagged with the corresponding ages of the individuals pictured. The dataset was generated by first detecting faces with the Mathias et al. face detector [11]. The faces were then cropped out with a 40% margin around the border. Finally, the age was automatically extracted by scraping both the date of the photograph and the year of the individual's birth. Due to the highly automated nature of the data collection, this dataset is very noisy: multiple entries contain either no face or multiple faces, several entries contain just a copyright sign, and in some cases the collection year was incorrectly extracted from the webpage. See Table 4 for examples of invalid face images.
Although this dataset contained a similar distribution of males to females (see Figure 5, left), it contained primarily individuals between 20 and 40 years old. Additionally, because the IMDb portion contained a random sampling of Hollywood actors, the dataset was composed primarily of young Caucasian individuals, and thus carries high implicit bias. We empirically demonstrate that our method is still capable of correctly identifying underrepresented samples in spite of these imbalances in the data.
4.4.1 Cleaning, Preparation and Training Schedule
We did not wish to excessively clean the data, but rather to remove only the clearly wrong entries. We did this by removing samples with individuals younger than three or older than 100 years. Additionally, we removed images that were too small (namely, smaller than 16 by 16 pixels). We then resized all images to 224 by 224.
We did not remove invalid images containing multiple faces or no face, although we standardized the images to be color (if an image was black and white, we replicated its single channel over the R, G, and B channels). Additionally, we did not remove mislabeled entries (those with valid ages and valid images that were nonetheless clearly mislabeled). We kept these so we could test the ability of the uncertainty quantification: one would expect an appropriate uncertainty quantification algorithm to give high uncertainty for invalid or abnormal data. We exploit this later to automatically clean the dataset.
This imagebased network was trained utilizing a 16layer convolutional neural network (CNN) with an Adam optimizer, a learning rate of 0.00005 and a batch size of 8 until convergence (6 epochs).
4.4.2 Analysis of results
[Figure 6: Sample predictions, reported as true age / predicted age / predicted uncertainty. Top row: typical samples predicted accurately with low uncertainty; middle row: mislabeled samples; bottom row: difficult or invalid samples with high uncertainty.]
Table 3: Age estimation performance on the IMDb-Wiki validation set.

Method                                    | MAE  | NLL
CNN + Regressor                           | 7.54 | —
CNN + Regressor + Uncertainty             | 7.57 | 3.63
CNN + Regressor + Uncertainty + Cleaning  | 5.22 | 3.53
The results of these experiments were very promising (first row of Figure 6). Even on this noisy dataset, the architecture performed poorly only when the ground truth was wrong (see the middle row of Figure 6), demonstrating that our model is capable of capturing epistemic uncertainty. Additionally, the architecture's uncertainty expressed not only how confident the model was, but also how clean the data sample was: the model often reported high confidence when a sample was well represented within the dataset. Empirically, difficult samples (those in a class with low representation, poor lighting, side-facing faces, ambiguous individuals, multiple faces) obtained high uncertainty, and images that were incorrectly scraped, excessively noisy, or incorrect had the highest uncertainty (see the last row of Figure 6). This architecture can therefore be used to evaluate the quality of samples, assuming a large portion of the data is of good quality.
4.4.3 Determining the overhead of uncertainty quantification
An error quantification network is often only appealing if it does not significantly impact performance. Thus, the discrepancy in some error metric (say, RMSE) between a classical regressor with parameters $\theta$ and an uncertainty-aware regressor with parameters $\theta'$ should be minimal. To this end, we trained two networks with the same initial configuration and the same number of parameters (except for the last layer) until convergence: one vanilla regressor and one error quantification regressor.
Examining Table 3, we observe that the discrepancy between the MAE of the uncertainty-agnostic regressor and that of the uncertainty-aware regressor is negligible. Thus, computing the uncertainty does not add any significant overhead to the model, and this final layer can therefore be added to any regressor to provide uncertainty metrics.
4.4.4 Automated data cleaning
As described earlier, this architecture can be utilized to determine the quality of a sample by examining the uncertainty produced. After training, we identified the samples with the highest uncertainty and removed them (note that we left the validation samples unchanged). After removing these samples from our training set, we obtained significantly better results on the validation dataset (see Table 3). Thus, this architecture is uniquely well suited to generating relatively high-performing regressors from unclean datasets.
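The cleaning step amounts to ranking training samples by predicted standard deviation and discarding the most uncertain tail. A sketch (the drop fraction and names are illustrative assumptions; the paper does not specify an exact cutoff):

```python
def drop_most_uncertain(samples, sigmas, drop_frac=0.05):
    """Return the samples remaining after discarding the drop_frac
    fraction with the highest predicted standard deviation."""
    order = sorted(range(len(samples)), key=lambda i: sigmas[i])
    keep = sorted(order[:round(len(samples) * (1 - drop_frac))])
    return [samples[i] for i in keep]
```

Because high predicted sigma flags both genuinely hard samples and corrupted ones, the drop fraction should be chosen conservatively to avoid discarding valid but underrepresented data.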
5 Limitations
There are several limitations to both the uncertainty quantification method and the data cleaning process described.
Uncertainty Quantification
While the uncertainty quantification has been demonstrated (both analytically and empirically) to appropriately quantify error, this direct method will likely fail if the uncertainty follows any distribution other than the normal (though this is also the case for confidence measurements in classical statistics).
Data Cleaning Process
While the data cleaning process was implemented successfully here and shown to benefit learning, it can have several adverse side effects if not applied carefully: eliminating high-uncertainty samples could increase the systematic bias of the dataset and compromise the integrity of the data.
6 Conclusion and Future Work
This method of uncertainty quantification has been demonstrated to work well with large-scale image datasets. Furthermore, it has been shown to perform better than current methods for uncertainty quantification without additional overhead. The uncertainty quantification has been exploited to develop a data cleaning procedure which improved the accuracy on an unchanged validation set. Future work includes generalizing this approach to arbitrary distribution regression and investigating uncertainty for classification.
References
[1] (1994) Mixture density networks. Technical report.
[2] (2015) Weight uncertainty in neural networks. In International Conference on Machine Learning (ICML), pp. 1613–1622.
[3] (2013) Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 31, pp. 207–215.
[4] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pp. 1050–1059.
[5] (2011) Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pp. 2348–2356.
[6] (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv:1502.05336.
[7] (2009) Aleatory or epistemic? Does it matter? Structural Safety 31 (2), pp. 105–112.
[8] (2015) Bayesian dark knowledge. In Advances in Neural Information Processing Systems (NIPS), pp. 3438–3446.
[9] (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NIPS), pp. 6402–6413.
[10] (2017) Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports 7.
[11] (2014) Face detection without bells and whistles. In European Conference on Computer Vision (ECCV), pp. 720–735.
[12] (1994) Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN '94).
[13] (2000) On MCMC sampling in Bayesian MLP neural networks. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000).
7 Appendix
As stated in the paper, we are learning from data drawn from $\mathcal{D}$.
Theorem 1.

(10)  $\mathcal{L}[\hat{\mu}, \hat{\sigma}] = -\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\log q_{\hat{\mu}(x), \hat{\sigma}(x)}(y)\right]$

is minimized (under infinite i.i.d. data) when $\hat{\mu} = \mu$ and $\hat{\sigma} = \sigma$ if, for all $x$, $Y \mid X = x \sim \mathcal{N}(\mu(x), \sigma(x)^{2})$ for some $\mu$ and $\sigma$.
Proof.
Suppose $Y \mid X = x \sim \mathcal{N}(\mu(x), \sigma(x)^{2})$ (i.e., for a given $x$, the target $y$ is normally distributed). Then,

(11)  $\mathcal{L}[\hat{\mu}, \hat{\sigma}] = -\mathbb{E}_{x}\left[\mathbb{E}_{y \mid x}\left[\log q_{\hat{\mu}(x), \hat{\sigma}(x)}(y)\right]\right]$

(12)  $= \mathbb{E}_{x}\left[\mathbb{E}_{y \mid x}\left[\log\left(\sqrt{2\pi}\,\hat{\sigma}(x)\right) + \frac{(y - \hat{\mu}(x))^{2}}{2\hat{\sigma}(x)^{2}}\right]\right]$

Furthermore, $\mathbb{E}_{x}[\,\cdot\,] = \int (\,\cdot\,)\, dF(x)$ for some cumulative probability distribution $F$. Please note that we notate $\hat{\mu}(x)$ as $\hat{\mu}$ and $\hat{\sigma}(x)$ as $\hat{\sigma}$ for brevity. Thus,

(13)  $\mathcal{L}[\hat{\mu}, \hat{\sigma}] = \int \mathbb{E}_{y \mid x}\left[\log\left(\sqrt{2\pi}\,\hat{\sigma}\right) + \frac{(y - \hat{\mu})^{2}}{2\hat{\sigma}^{2}}\right] dF(x)$
By the Euler–Lagrange theorem, this functional reaches its minimum when

(14)  $\frac{\partial L}{\partial \hat{\mu}} = 0$

(15)  $\frac{\partial L}{\partial \hat{\sigma}} = 0$

(16)  $L = \mathbb{E}_{y \mid x}\left[\log\left(\sqrt{2\pi}\,\hat{\sigma}\right) + \frac{(y - \hat{\mu})^{2}}{2\hat{\sigma}^{2}}\right]$

where $L$ is the integrand of (13). Differentiating with respect to $\hat{\mu}$, one obtains

(17)  $\frac{\partial L}{\partial \hat{\mu}} = \mathbb{E}_{y \mid x}\left[\frac{\hat{\mu} - y}{\hat{\sigma}^{2}}\right] = \frac{\hat{\mu} - \mathbb{E}_{y \mid x}[y]}{\hat{\sigma}^{2}}$
As $\mathbb{E}_{y \mid x}[y] = \mu$, setting (17) to zero gives

(18)  $\hat{\mu} = \mu$

Thus, differentiating $L$ with respect to $\hat{\sigma}$ at $\hat{\mu} = \mu$,

(19)  $\frac{\partial L}{\partial \hat{\sigma}} = \frac{1}{\hat{\sigma}} - \frac{\mathbb{E}_{y \mid x}\left[(y - \mu)^{2}\right]}{\hat{\sigma}^{3}} = \frac{1}{\hat{\sigma}} - \frac{\sigma^{2}}{\hat{\sigma}^{3}}$

Setting this to zero and solving under the condition that $\hat{\sigma} > 0$ yields

(20)  $\hat{\sigma} = \sigma$

Thus, $(\hat{\mu}, \hat{\sigma}) = (\mu, \sigma)$, as desired. ∎
As an optimal learning scheme reaches the global optimum, under such a scheme $g(x)$ would be learned to be $(\mu(x), \sigma(x))$.