1 Introduction
Nowadays, Neural Networks (NNs) are arguably the most popular machine learning tool in the Artificial Intelligence (AI) community. Researchers and practitioners have applied NNs to a wide variety of fields, including manufacturing Bergmann2014 , bioinformatics LeCun2015 , physics Baldi2014 , finance Niaki2013 , chemistry Anjos2015 , and healthcare Shahid2019 . Although standard NNs are good at making a point prediction (a single outcome) for supervised learning tasks, they are unable to provide uncertainty information about their predictions. For real-world decision makers, representing model uncertainty is of crucial importance Gal2016 ; Ghahramani2015 ; Krzywinski2013 . For example, in the case of regression, providing a confidence interval for the prediction allows the decision maker to anticipate the possible outcomes with explicit probability. In contrast, returning only a single point prediction increases the risk of the decision making, e.g., a predictively good but actually risky medical treatment may be interpreted overconfidently when uncertainty information is absent.

Conventional Bayesian models such as Gaussian Processes (GPs) Rasmussen2006 offer a mathematically grounded approach to reasoning about predictive uncertainty, but usually come with a prohibitive computational cost and lower performance compared to NNs Gal2016 . As a potential solution, considerable research has been devoted to combining Bayesian models and NNs (see the "Related Work" section for a detailed review of these approaches), aiming to overcome the downsides of both. However, from the classical Bayesian Neural Network Neal1996 , in which a distribution over weights is learned, to the most recent Neural Processes Garnelo2018 ; Marta2018 ; Kim2019 , in which a distribution over functions is defined, all of these methods require significant modifications to the model infrastructure and training pipeline. Compared to standard (non-Bayesian) NNs, these new models are often computationally slower to train and harder to implement Balaji2017 ; Gal2016 , creating tremendous difficulty for practical use. In Gal2016 , a theoretical tool is derived to extract uncertainty information from dropout training; however, the method can only be applied to dropout models, and it also requires changing the internal inference pipeline of dropout NNs. Quantifying point-prediction uncertainty in standard NNs, which are overwhelmingly popular in practical applications, thus remains a challenging problem with significant potential impact.
To circumvent the above issues, this paper presents a new framework that can quantitatively estimate the point-prediction uncertainty of standard NNs without any modifications to the model structure or training pipeline. The proposed framework can be directly applied to any pre-trained NN without retraining it. The main idea is to estimate the prediction residuals of the NN using a modified GP, which introduces a new composite kernel that makes use of both the inputs and outputs of the NN. The framework is referred to as RIO (for Residual Input/Output), and the new kernel as an I/O kernel. In addition to uncertainty estimation, RIO has an interesting and unexpected side-effect: it also provides a way to reduce the error of the NN predictions. Moreover, with the help of recent sparse GP models, RIO scales well to large datasets. Since classification problems can also be treated as regression on class labels lee2018 , this paper focuses on regression tasks. A theoretical foundation is provided to prove the effectiveness of both residual estimation and the I/O kernel. Twelve real-world datasets are tested in empirical studies, in which RIO exhibits reliable uncertainty estimates, more accurate point predictions, and better scalability than existing approaches.
2 Method
In this section, the general problem statement is given, and the RIO framework is developed and justified theoretically, focusing on the two main contributions: estimating the residuals with a GP and using an I/O kernel. For background on NNs, GPs, and their more efficient approximation, Stochastic Variational Gaussian Processes (SVGPs) Hensman2013 ; Hensman2015 , see Section S1 in the appendix.
2.1 Problem Statement
Consider a training dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, and a pre-trained standard NN that outputs a point prediction $\hat{y}_i$ given $\mathbf{x}_i$. The problem is twofold: (1) quantify the uncertainty in the predictions of the NN, and (2) calibrate the point predictions of the NN (i.e., make them more accurate).
2.2 Framework Overview
The main idea of the proposed framework, RIO (for Residual estimation with an I/O kernel), is to predict the residuals between observed outcomes and NN predictions using a GP with a composite kernel. The framework consists of two phases: a training phase and a deployment phase.
In the training phase, the residuals between observed outcomes and NN predictions on the training dataset are calculated as

$r_i = y_i - \hat{y}_i, \quad i = 1, 2, \dots, n.$ (1)
Let $\mathbf{r} = [r_1, r_2, \dots, r_n]^\top$ denote the vector of all residuals and $\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n]^\top$ denote the vector of all NN predictions. A GP with a composite kernel is trained assuming $\mathbf{r} \sim \mathcal{N}(\mathbf{0}, K_c(X, X) + \sigma_n^2 I)$, where $K_c(X, X)$ denotes the $n \times n$ covariance matrix at all pairs of training points based on a composite kernel

$k_c((\mathbf{x}, \hat{y}), (\mathbf{x}', \hat{y}')) = k_{\mathrm{in}}(\mathbf{x}, \mathbf{x}') + k_{\mathrm{out}}(\hat{y}, \hat{y}').$ (2)
Suppose a radial basis function (RBF) kernel is used for both $k_{\mathrm{in}}$ and $k_{\mathrm{out}}$. Then,

$k_c((\mathbf{x}, \hat{y}), (\mathbf{x}', \hat{y}')) = \sigma_{\mathrm{in}}^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 l_{\mathrm{in}}^2}\right) + \sigma_{\mathrm{out}}^2 \exp\left(-\frac{(\hat{y} - \hat{y}')^2}{2 l_{\mathrm{out}}^2}\right).$ (3)
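For concreteness, the composite kernel of Equation 3 can be computed directly in NumPy. The sketch below is illustrative (the function and hyperparameter names are not from the paper's implementation):

```python
import numpy as np

def rbf(A, B, signal_var, length_scale):
    """RBF (squared-exponential) kernel matrix between row sets A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def io_kernel(X, y_hat, X2, y_hat2, theta):
    """Composite I/O kernel (Eq. 3): RBF on NN inputs plus RBF on NN outputs."""
    k_in = rbf(X, X2, theta["sig_in"], theta["len_in"])
    k_out = rbf(np.reshape(y_hat, (-1, 1)), np.reshape(y_hat2, (-1, 1)),
                theta["sig_out"], theta["len_out"])
    return k_in + k_out
```

Since each summand is itself a valid covariance function, the sum is one as well, so the resulting matrix is symmetric positive semi-definite.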
The training process of the GP learns the hyperparameters $\sigma_{\mathrm{in}}$, $l_{\mathrm{in}}$, $\sigma_{\mathrm{out}}$, $l_{\mathrm{out}}$, and $\sigma_n$ by maximizing the log marginal likelihood given by

$\log p(\mathbf{r} \mid X, \hat{\mathbf{y}}) = -\tfrac{1}{2} \mathbf{r}^\top (K_c + \sigma_n^2 I)^{-1} \mathbf{r} - \tfrac{1}{2} \log |K_c + \sigma_n^2 I| - \tfrac{n}{2} \log 2\pi.$ (4)
In the deployment phase, a test point $\mathbf{x}_*$ is input to the NN to get an output $\hat{y}_*$. The trained GP predicts the distribution of the residual as $\hat{r}_* \sim \mathcal{N}(\bar{r}_*, \operatorname{var}(\hat{r}_*))$, where

$\bar{r}_* = \mathbf{k}_*^\top (K_c + \sigma_n^2 I)^{-1} \mathbf{r},$ (5)

$\operatorname{var}(\hat{r}_*) = k_c((\mathbf{x}_*, \hat{y}_*), (\mathbf{x}_*, \hat{y}_*)) - \mathbf{k}_*^\top (K_c + \sigma_n^2 I)^{-1} \mathbf{k}_*,$ (6)

where $\mathbf{k}_*$ denotes the vector of kernel-based covariances (i.e., $[\mathbf{k}_*]_i = k_c((\mathbf{x}_*, \hat{y}_*), (\mathbf{x}_i, \hat{y}_i))$) between $(\mathbf{x}_*, \hat{y}_*)$ and the training points.
Interestingly, note that the predicted residuals can also be used to calibrate the point predictions of the NN. Finally, the calibrated prediction with uncertainty information is given by

$\hat{y}_*^{\mathrm{cal}} \sim \mathcal{N}(\hat{y}_* + \bar{r}_*, \operatorname{var}(\hat{r}_*)).$ (7)
In other words, RIO not only adds uncertainty estimation to a standard NN; it also makes its output more accurate, without any modification of its architecture or training. Figure 1 shows the overall procedure when applying the proposed framework in real-world applications.
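The two phases can be sketched end to end in plain NumPy. The following is a minimal illustration with fixed kernel hyperparameters (the paper fits them by maximizing the log marginal likelihood of Equation 4, and uses SVGP for scalability); the class and parameter names are illustrative:

```python
import numpy as np

def rbf(A, B, signal_var, length_scale):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

class RIOSketch:
    """GP on NN residuals with an input+output RBF kernel (fixed hyperparameters)."""
    def __init__(self, sig_in=1.0, len_in=1.0, sig_out=1.0, len_out=1.0, noise=0.1):
        self.sig_in, self.len_in = sig_in, len_in
        self.sig_out, self.len_out = sig_out, len_out
        self.noise = noise

    def _kernel(self, X1, h1, X2, h2):
        # I/O kernel: RBF on inputs plus RBF on NN outputs (Eq. 3)
        return (rbf(X1, X2, self.sig_in, self.len_in)
                + rbf(h1[:, None], h2[:, None], self.sig_out, self.len_out))

    def fit(self, X, y, nn_pred):
        r = y - nn_pred                                   # residuals (Eq. 1)
        K = self._kernel(X, nn_pred, X, nn_pred)
        K[np.diag_indices_from(K)] += self.noise          # K_c + sigma_n^2 I
        self.X, self.h = X, nn_pred
        self.K_inv = np.linalg.inv(K)
        self.alpha = self.K_inv @ r
        return self

    def predict(self, X_test, nn_pred_test):
        Ks = self._kernel(X_test, nn_pred_test, self.X, self.h)
        res_mean = Ks @ self.alpha                        # residual mean (Eq. 5)
        res_var = (self.sig_in + self.sig_out
                   - np.sum((Ks @ self.K_inv) * Ks, axis=1))  # residual var (Eq. 6)
        # Calibrated prediction with uncertainty (Eq. 7):
        return nn_pred_test + res_mean, np.sqrt(np.maximum(res_var, 1e-12))
```

Given any pre-trained regressor, `fit` consumes only its training-set predictions, and `predict` returns a calibrated mean together with a predictive standard deviation; the NN's internals are never accessed.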
2.3 Underlying Rationale of Residual Prediction
This subsection gives a theoretical justification for why residual prediction helps both error correction and uncertainty estimation of an NN. Specifically, such prediction (1) leads to output that is more accurate than that of a GP alone, (2) leads to output that is more accurate than that of the NN alone, and (3) leads to uncertainty estimates that are positively correlated with the variance of NN residuals.
Consider a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, with $x_i$ drawn i.i.d. from a distribution, and $y_i = f(x_i) + g(x_i) + \epsilon_i$, where $f$ is drawn from a GP, $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$, and $g$ is a function with mean zero and variance $\sigma_g^2$. Without loss of generality, $f$ represents the component in the target function that can be modeled by a GP, $\epsilon$ represents the observation noise, and $g$ represents all the remaining components. With mean squared error as loss, the optimal predictor for $y$ given $x$ is $f(x) + g(x)$.
Suppose that, from the perspective of the GP, $g$ is indistinguishable from noise, but that there is a neural network $h$ that has successfully learned some of its structure. Consider the residuals $r_i = y_i - h(x_i)$. The GP-modelable component $f$ remains in the residuals, while the portion of $g$ not captured by $h$ is still indistinguishable from noise, but with its variance reduced by some nonnegative amount relative to $\sigma_g^2$. Aside from these indistinguishable functions, assume GP hyperparameters can be estimated optimally.
Let $\bar{y}_{\mathrm{GP}}$ be the posterior mean of the GP trained directly on $y$, and $\bar{r}$ be that of a GP trained to fit the residuals, which yields the final predictor $\bar{y}_{\mathrm{RIO}} = h + \bar{r}$. Let $E_{\mathrm{GP}}$, $E_{\mathrm{NN}}$, and $E_{\mathrm{RIO}}$ be the expected generalization errors of $\bar{y}_{\mathrm{GP}}$, $h$, and $\bar{y}_{\mathrm{RIO}}$, respectively.
First, following a standard approach Rasmussen2006 , consider the eigenfunction expansions $k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$ and $f(x) = \sum_i f_i \phi_i(x)$. Let $\Lambda$ be the diagonal matrix of the eigenvalues $\lambda_i$, and $\Phi$ be the design matrix, i.e., $\Phi_{ij} = \phi_j(x_i)$. The following series of results capture the improvement due to residual estimation (proofs in Section S3.1 of the appendix).

Lemma 2.1.
Lemma 2.2.
Lemma 2.3.
Theorem 2.4. $E_{\mathrm{RIO}} \le E_{\mathrm{GP}}$ and $E_{\mathrm{RIO}} \le E_{\mathrm{NN}}$.
In particular, the improvement with respect to the GP is greater than the reduction in apparent noise. Importantly, this improvement in error corresponds to a predictive variance that is closer to the optimum for this problem, so uncertainty estimates are improved as well. Experiments on real-world data confirm that when predicting the residuals of an NN, the estimated noise level of the GP is indeed lower and correlates with the reduction in generalization error (see S2.3 in the appendix). This reduction is possible because the NN is able to extract higher-level features not available to the GP.
These results also lead to a key practical property, which is illustrated in Figure 2.
Proposition 1.
The variance of NN residuals is positively correlated with the uncertainty of the residual prediction $\hat{r}_*$.
Proof.
Increases in the observation noise lead to increases in the variance of the residuals, and increases in the structure learned by the NN lead to decreases in it. So, increases in either the observation noise or the unlearned structure lead to increases of the estimated noise variance, and hence of the expected predictive variance of $\hat{r}_*$. ∎
This property matches the intuition that the GP’s variance should generally be higher for bad NNs than for good NNs. Such a property is crucial to accurately measuring the confidence of NN predictions.
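This property can be probed numerically: fit only the GP noise variance by maximizing the log marginal likelihood (Equation 4) on the residuals of a good and a bad predictor, and the bad predictor's residuals should yield a larger estimated noise, hence higher uncertainty. The following toy sketch (all settings illustrative, not from the paper's experiments) does this with a grid search:

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def log_marginal(r, K, noise_var):
    """Log marginal likelihood of residuals r under GP(0, K + noise_var * I)."""
    n = len(r)
    Kn = K + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(Kn)
    return (-0.5 * r @ np.linalg.solve(Kn, r)
            - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi))

def fitted_noise(r, K, grid):
    """Pick the noise variance that maximizes the log marginal likelihood."""
    return max(grid, key=lambda s2: log_marginal(r, K, s2))
```

Residuals of a bad NN contain leftover structure the kernel cannot explain, so the marginal-likelihood fit attributes it to noise, inflating the predictive variance exactly as Proposition 1 suggests.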
2.4 Robustness of I/O Kernel
This subsection provides a justification for why a GP using the proposed I/O kernel is more robust than the standard GP, i.e., one using the input kernel alone. The reasoning closely matches that in Section 2.3.
Consider the setup in Section 2.3, but now with the GP-modelable component split across the two nontrivial RBF kernels $k_{\mathrm{in}}$ and $k_{\mathrm{out}}$ (as in Equation 3). Again, suppose that, due to the high expressivity of NNs Csaji2001 , $g$ is indistinguishable from noise from the perspective of the GP.
Let $E_{\mathrm{I}}$, $E_{\mathrm{O}}$, and $E_{\mathrm{IO}}$ be the expected generalization errors of the standard GP (input kernel only), the GP with the output kernel only, and the GP with the I/O kernel, respectively. Then, the expected result follows (proof in S3.2 of the appendix).
Theorem 2.5. $E_{\mathrm{IO}} \le E_{\mathrm{I}}$ and $E_{\mathrm{IO}} \le E_{\mathrm{O}}$.
The optimizer associated with the GP simultaneously optimizes the hyperparameters of both kernels, so the less useful kernel usually receives a smaller signal variance. As a result, the I/O kernel is resilient to failures of either kernel. In particular, the GP using the I/O kernel improves performance even when the problem is so complex that Euclidean distance in the input space provides no useful correlation information, or when the input space contains noisy features. Conversely, when the NN is a bad predictor and its output is simply noise, the standard GP with the input kernel alone is recovered. In other words, the I/O kernel is never worse than using the input kernel alone, and in practice it is often better. This conclusion is confirmed in experiments, as described next.
2.5 Scalability
RIO is scalable to large datasets by applying sparse GP methods, e.g., SVGP Hensman2013 ; Hensman2015 . All the conclusions in the previous sections remain valid, since a sparse GP is simply an approximation of the original GP. In the case of applying SVGP with a traditional optimizer, e.g., L-BFGS-B Byrd1995 ; Zhu1997 , the computational complexity is $\mathcal{O}(nm^2)$ and the space complexity is $\mathcal{O}(nm)$, where $n$ is the number of data points and $m$ is the number of inducing variables. Experiments show that the computational cost of this implementation is significantly cheaper than other state-of-the-art approaches.
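The cost saving from inducing points can be illustrated with the simpler subset-of-regressors approximation (SVGP additionally uses variational inference and minibatching, which this sketch omits; all names and settings are illustrative):

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def sor_predict_mean(X, y, Z, X_test, noise_var=0.1):
    """Subset-of-regressors GP mean with m inducing points Z.
    Only m x m systems are solved: O(n m^2) time and O(n m) memory,
    versus O(n^3) time and O(n^2) memory for the exact GP."""
    Kzx = rbf(Z, X)                        # m x n
    Kzz = rbf(Z, Z)                        # m x m
    A = noise_var * Kzz + Kzx @ Kzx.T      # m x m system matrix
    w = np.linalg.solve(A, Kzx @ y)
    return rbf(X_test, Z) @ w
```

With a handful of well-placed inducing points, the approximate posterior mean stays close to the exact one while the dominant cost drops from cubic to linear in $n$.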
3 Empirical Evaluation
Experiments in this section compare nine algorithms on 12 real-world datasets. The algorithms include a standard NN, the proposed RIO framework, four ablated variants of RIO, and three state-of-the-art models that provide predictive uncertainty, namely SVGP Hensman2013 , Neural Network Gaussian Process (NNGP) lee2018 , and Attentive Neural Processes (ANP) Kim2019 . In naming the RIO variants, "R" means estimating NN residuals and then correcting NN outputs, "Y" means directly estimating outcomes, "I" means using the input kernel only, "O" means using the output kernel only, and "IO" means using the I/O kernel. For all RIO variants (including full RIO), SVGP is used as the GP component, with the appropriate kernel and prediction target. Therefore, "Y+I" amounts to the original SVGP, and it is denoted as "SVGP" in all the experimental results. All 12 datasets are real-world regression problems from the UCI Machine Learning Repository Dua2017 , and cover a wide variety of dataset sizes and feature dimensionalities. Except for the "MSD" dataset, all datasets are tested over 100 independent runs. In each run, the dataset is randomly split into training, validation, and test sets, and all algorithms are trained on the same split. All RIO variants that involve an output kernel or residual estimation are based on the NN trained in the same run. For "MSD", since the dataset split is strictly predefined by the provider, only 10 independent runs are conducted. NNGP and ANP are only tested on the four smallest datasets (based on the product of dataset size and feature dimensionality) because they do not scale to larger datasets within the available computational time. Notably, no extensive hyperparameter tuning was conducted for any of the RIO variants; the same default setup is used in all experiments, i.e., an RBF kernel and 50 inducing points. See the appendix for details of the experimental setup. Table 1 summarizes the numerical results of these experiments.
The main conclusions in terms of pointprediction error, uncertainty estimation, computational requirements, and ablation studies are summarized below.
Point-Prediction Error
The errors between the point predictions of the models and the true outcomes of the test points are measured using Root Mean Square Error (RMSE), and the mean and standard deviation of RMSEs over multiple experimental runs are shown in the "Prediction RMSE" column of Table 1. For models that return a probabilistic distribution, the mean of the distribution is taken as the point prediction. RIO performs best or equals the best (based on statistical tests) in 11 of the 12 datasets, and ranks second in the remaining one. Compared to the standard NN, the probabilistic models that predict the outcomes directly generally perform worse. This result demonstrates the advantage of standard NNs in making point predictions, and makes it compelling to extract uncertainty information from NN predictions.
Table 1: Prediction RMSE (mean±std), CI estimation RMSEs, and computation time for all algorithms and datasets ("-" means not applicable; blank dataset cells continue the block above).

Dataset (n, d)        Method  Prediction RMSE  95%CI RMSE  90%CI RMSE  68%CI RMSE  Time (sec)
yacht (252, 6)        NN      3.76±1.86        -           -           -           2.41
                      RIO     3.06±1.37        0.658       0.489       0.413       3.00
                      R+I     3.67±1.79        0.673       0.466       0.325       2.73
                      R+O     3.15±1.37        0.667       0.512       0.376       3.37
                      Y+O     12.41±6.78       1.228       1.139       0.883       5.26
                      Y+IO    11.35±6.86       1.298       1.112       0.946       5.41
                      SVGP    14.75±3.74       0.956       0.757       0.444       4.74
                      NNGP    10.55±1.54       1.947       1.634       0.989       13562
                      ANP     6.89±3.22        1.117       0.825       0.408       125.7
ENB/h (768, 8)        NN      0.94±0.37        -           -           -           6.48
                      RIO     0.81±0.36        0.359       0.255       0.118       3.37
                      R+I     0.84±0.37        0.370       0.264       0.134       3.06
                      R+O     0.83±0.36        0.333       0.234       0.135       3.72
                      Y+O     1.67±1.58        1.926       1.847       1.470       6.41
                      Y+IO    1.05±0.55        1.340       1.273       1.003       6.91
                      SVGP    3.04±0.67        2.233       1.960       0.616       5.51
                      NNGP    4.97±0.29        1.929       1.618       0.978       14175
                      ANP     4.32±2.47        0.625       0.260       0.157       566.8
ENB/c (768, 8)        NN      1.89±0.40        -           -           -           6.25
                      RIO     1.76±0.39        0.378       0.248       0.146       3.41
                      R+I     1.80±0.39        0.365       0.237       0.146       3.09
                      R+O     1.79±0.39        0.377       0.245       0.148       3.75
                      Y+O     2.38±0.92        1.558       1.515       0.841       6.21
                      Y+IO    2.06±0.71        1.227       1.194       0.752       6.51
                      SVGP    3.64±0.70        0.387       0.442       0.724       5.49
                      NNGP    4.91±0.32        1.899       1.594       0.965       13406
                      ANP     4.33±1.93        0.567       0.262       0.148       565.7
airfoil (1505, 5)     NN      4.82±0.47        -           -           -           5.05
                      RIO     4.01±0.27        0.301       0.204       0.106       3.83
                      R+I     4.14±0.30        0.327       0.240       0.129       3.52
                      R+O     4.21±0.29        0.341       0.251       0.134       4.15
                      Y+O     5.34±2.04        1.535       1.165       0.743       6.92
                      Y+IO    4.75±1.19        1.106       0.787       0.679       7.14
                      SVGP    5.89±1.04        1.183       0.795       0.420       6.22
                      NNGP    6.54±0.23        1.930       1.621       0.979       17024
                      ANP     24.8±33.4        1.363       1.444       1.105       1657
CCS (1030, 8)         NN      6.28±0.53        -           -           -           6.71
                      RIO     6.19±0.50        0.556       0.436       0.233       3.55
                      R+I     6.20±0.51        0.574       0.455       0.242       3.23
                      R+O     6.23±0.52        0.579       0.455       0.243       3.88
                      Y+O     7.81±3.01        0.316       0.252       0.198       6.05
                      Y+IO    7.28±2.41        0.355       0.291       0.202       6.26
                      SVGP    11.87±2.06       0.853       0.723       0.431       5.59
wine/r (1599, 11)     NN      0.692±0.04       -           -           -           3.57
                      RIO     0.678±0.04       0.352       0.260       0.131       3.85
                      R+I     0.690±0.04       0.365       0.267       0.136       3.41
                      R+O     0.679±0.04       0.337       0.246       0.126       4.01
                      Y+O     0.691±0.06       1.023       0.976       0.538       6.72
                      Y+IO    0.676±0.04       0.430       0.299       0.243       5.92
                      SVGP    0.883±0.17       1.502       1.342       0.985       6.37
wine/w (4898, 11)     NN      0.723±0.02       -           -           -           8.27
                      RIO     0.710±0.02       0.224       0.142       0.065       5.92
                      R+I     0.722±0.02       0.212       0.142       0.066       4.82
                      R+O     0.711±0.02       0.205       0.130       0.062       5.66
                      Y+O     0.726±0.04       0.587       0.395       0.309       10.20
                      Y+IO    0.717±0.02       0.297       0.290       0.227       10.69
                      SVGP    0.873±0.12       1.041       1.044       0.700       11.05
CCPP (9568, 4)        NN      5.04±0.62        -           -           -           12.6
                      RIO     4.25±0.13        0.141       0.120       0.083       9.21
                      R+I     4.26±0.13        0.160       0.138       0.096       8.69
                      R+O     4.37±0.16        0.133       0.113       0.079       9.38
                      Y+O     8.86±12.6        1.446       1.485       1.421       22.53
                      Y+IO    7.60±4.84        1.078       1.121       1.004       22.04
                      SVGP    6.10±2.59        1.386       1.441       1.166       20.37
protein (45730, 9)    NN      4.21±0.08        -           -           -           135.6
                      RIO     4.14±0.06        0.182       0.100       0.073       34.9
                      R+I     4.16±0.06        0.156       0.081       0.060       32.4
                      R+O     4.16±0.07        0.186       0.104       0.070       31.1
                      Y+O     4.19±0.12        0.272       0.175       0.050       38.8
                      Y+IO    4.12±0.06        0.238       0.146       0.050       45.8
                      SVGP    5.22±0.07        0.184       0.142       0.073       49.1
SC (21263, 80)        NN      12.27±0.73       -           -           -           146.3
                      RIO     11.47±0.40       0.284       0.131       0.166       29.8
                      R+I     11.49±0.42       0.281       0.129       0.167       29.2
                      R+O     11.70±0.42       0.321       0.172       0.157       22.7
                      Y+O     12.20±1.39       0.396       0.241       0.114       27.7
                      Y+IO    11.72±0.53       0.320       0.156       0.132       38.5
                      SVGP    18.29±1.51       0.614       0.473       0.210       41.9
CT (53500, 384)       NN      1.20±0.38        -           -           -           632.4
                      RIO     1.01±0.29        0.324       0.327       0.301       121.4
                      R+I     1.20±0.38        0.245       0.235       0.195       73.1
                      R+O     0.99±0.29        0.259       0.238       0.210       37.7
                      Y+O     1.99±1.07        1.409       1.497       1.426       110.7
                      Y+IO    1.89±0.65        1.340       1.389       1.272       292.1
                      SVGP    52.1±0.19        2.338       0.172       0.156       213.6
MSD (515345, 90)      NN      12.53±0.82       -           -           -           1040
                      RIO     10.02±0.23       0.091       0.081       0.253       988.4
                      R+I     12.53±0.82       0.080       0.072       0.137       1265
                      R+O     10.09±0.25       0.137       0.114       0.254       1364
                      Y+O     18.96±11.5       0.789       0.850       0.846       1564
                      Y+IO    20.72±12.6       0.609       0.669       0.701       2441
                      SVGP    14.40±1.78       0.171       0.237       0.414       6942
Two symbols indicate that the difference between the marked entry and RIO is statistically significant at the 5% significance level using the paired t-test and the Wilcoxon test, respectively. The best entries that are significantly better than all the others under at least one statistical test are marked in boldface (ties are allowed).
Figure 3 shows a comparison among NN, RIO, and SVGP in terms of prediction RMSE. RIO improves the NN predictions consistently, regardless of how the dataset is split and how well the NN is trained. Conversely, SVGP performs significantly worse than the NN on all the datasets. For the CT dataset, which has 384 input features, SVGP fails severely because the input kernel cannot capture the implicit correlation information. ANP shows extremely unstable performance on the airfoil dataset due to its poor scalability with dataset size. To conclude, applying RIO to NNs not only provides additional uncertainty information, but also reduces the point-prediction error.
Uncertainty Estimation
A confidence interval (CI) is a useful tool for estimating the distribution of outcomes with explicit probabilities. To measure the quality of uncertainty estimation quantitatively, the percentages of test outcomes that fall within the 95%/90%/68% CIs estimated by each algorithm are calculated. These percentages should be as close to the nominal confidence levels as possible; e.g., a perfect uncertainty estimator would have exactly 95% of test outcomes within its estimated 95% CIs. The differences between the estimated CIs and the true CIs are quantified by their standardized errors with respect to the Z-score (see the appendix for details). The RMSEs based on these errors are shown in Table 1 for the estimated 95%, 90%, and 68% CIs. RIO provides the best or equals the best uncertainty estimates in 7, 6, and 5 of the 12 datasets for the 95%, 90%, and 68% CIs, respectively. In the remaining datasets, the differences between RIO and the best entries are also small (at most 0.145 with respect to the Z-score). One interesting observation is that on the CT dataset SVGP shows very poor prediction RMSE and 95% CI RMSE, but achieves the best 90% and 68% CI RMSE. After investigation, this turned out to happen by chance rather than being due to accurate CI estimation by SVGP (see S2.4 in the appendix for details).

To provide a more direct comparison, Figure 4 shows the distribution of the percentages of test outcomes that fall within the estimated 68% CIs over 100 independent runs. Notably, an accurate coverage percentage for the 68% CI is usually more difficult to achieve than for the 95% and 90% CIs, because the true distribution of observed outcomes is denser near the 68% CI boundary than in the tails. RIO makes reliable CI estimates in most cases, although it occasionally returns erroneous CIs on the yacht dataset. ANP provides reasonable CIs on two datasets but performs unstably on the airfoil dataset. NNGP always returns overconfident CIs for these real-world datasets due to the lack of noise estimation in its original implementation. Boxplots of coverage percentages for all the CIs over all the datasets are presented in the appendix. The conclusion is that, compared to all the other tested models, RIO provides reasonable CI estimates in most cases.
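The coverage computation is simple to reproduce. The sketch below measures empirical CI coverage under Gaussian predictive distributions and a Z-score-style standardized error; the paper's exact standardization is defined in its appendix, so `ci_zscore_error` here is an illustrative stand-in:

```python
import numpy as np
from statistics import NormalDist

def ci_coverage(y, mean, std, level):
    """Fraction of outcomes inside the central `level` CI of N(mean, std^2)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # e.g. 1.96 for level=0.95
    return float(np.mean(np.abs(y - mean) <= z * std))

def ci_zscore_error(y, mean, std, level):
    """Map empirical coverage back to a z-score and compare with the nominal one."""
    cov = min(max(ci_coverage(y, mean, std, level), 1e-9), 1 - 1e-9)
    return (NormalDist().inv_cdf(0.5 + cov / 2.0)
            - NormalDist().inv_cdf(0.5 + level / 2.0))
```

A perfectly calibrated predictor yields coverage equal to the nominal level and a standardized error near zero; overconfident models yield negative errors and underconfident ones positive errors.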
Computation Time
The "time" column in Table 1 shows the computation time of each algorithm, averaged over all the independent runs. All algorithms are implemented using TensorFlow under the same running environment (see the appendix for implementation details). The RIO variants scale well with increasing dataset sizes and feature dimensionalities. In contrast, ANP's computation time increases significantly with the scale of the dataset, and NNGP always requires a very large computational budget due to its costly grid search over hyperparameters. Thus, RIO scales better than the other approaches to large datasets.
These figures show the distribution of the percentages of test outcomes that fall within the estimated 68% CIs over 100 independent runs. As usual, each box extends from the 25th to the 75th percentile of the data (each data point represents an independent experimental run), with a line at the median. The whiskers extend from the box to show the range of the data. Flier points are those past the ends of the whiskers, indicating outliers.
Ablation Study
The RIO variants with residual estimation generally perform better than their counterparts in both point-prediction error and uncertainty estimation. This result confirms the effectiveness of residual estimation, as suggested by the theoretical analysis in Section 2.3. Another important result is that Y+IO outperforms both Y+I and Y+O in most cases across all performance metrics, and RIO generally provides more robust performance than R+I and R+O in all aspects. This, in turn, confirms that the I/O kernel is robust, as suggested by the analysis in Section 2.4. In sum, both residual estimation and the I/O kernel contribute substantially to the good performance of RIO.
4 Discussion and Future Directions
In addition to reliable uncertainty estimation, accurate point prediction, and good scalability, RIO also provides benefits in other aspects.
RIO can be directly applied to any standard NN without modifications to the model architecture or training pipeline. Moreover, neither retraining of the NN nor changes to its inference process are required. The framework simply requires the outputs of an NN; it does not need access to any internal structure. This makes the framework accessible to practitioners in real-world applications; e.g., data scientists can train NNs using traditional pipelines, then directly apply RIO to the trained NNs.
RIO also provides robustness to a type of adversarial attack. Consider a worst-case scenario, in which an adversary can arbitrarily alter the output of the NN with minuscule changes to the input. It is well-known that there are NNs for which this is possible Goodfellow2015 . In this case, with the help of the I/O kernel, the model becomes highly uncertain with respect to the output kernel. A confident prediction then requires both input and output to be reasonable. In the real world, a high degree of uncertainty may meet a threshold for disqualifying the prediction as outside the scope of the model's ability.
There are several promising future directions for extending RIO. First, applying RIO to reinforcement learning (RL) algorithms, which usually use standard NNs for reward prediction, allows uncertainty estimation of future rewards. Agents can then employ more principled, sample-efficient exploration strategies, e.g., bandit algorithms Thompson1933 , rather than traditional epsilon-greedy methods. Second, applying RIO to Bayesian optimization (BO) Mockus1975 makes it possible to use standard NNs in surrogate modeling. This approach can potentially improve the expressivity of the surrogate model and the scalability of BO. Third, since RIO only requires access to the inputs and outputs of NNs, it can be directly applied to any existing prediction model, including hybrid and ensemble models. This makes RIO a general tool for real-world practitioners.

5 Related Work
There has been significant interest in combining NNs with probabilistic Bayesian models. An early approach was Bayesian Neural Networks, in which a prior distribution is defined over the weights and biases of an NN, and a posterior distribution is then inferred from the training data MacKay1992 ; Neal1996 . Traditional variational inference techniques have been applied to the learning procedure of Bayesian NNs, but with limited success Barber1998 ; Graves2011 ; Hinton1993 . By using a more advanced variational inference method, new approximations for Bayesian NNs were achieved that provided performance similar to dropout NNs Blundell2015 . However, the main drawbacks of Bayesian NNs remain: prohibitive computational cost and a difficult implementation procedure compared to standard NNs.
Alternatives to Bayesian NNs have been developed recently. One approach introduces a training pipeline that incorporates ensembles of NNs and adversarial training Balaji2017 . Another approach, NNGP, exploits a theoretical connection between NNs and GPs to approximate the Bayesian inference process of wide deep neural networks lee2018 . Deep kernel learning (DKL) combines NNs with GPs by using a deep NN embedding as the input to the GP kernel Andrew2016 . Conditional Neural Processes (CNPs) combine the benefits of NNs and GPs by defining conditional distributions over functions given data and parameterizing this dependence with an NN Garnelo2018 . Neural Processes (NPs) generalize deterministic CNPs by incorporating a latent variable, strengthening the connection to approximate Bayesian and latent-variable approaches Marta2018 . Attentive Neural Processes (ANPs) further extend NPs by incorporating attention to overcome underfitting issues Kim2019 . The above models all require significant modifications to the original NN model and training pipeline. Compared to standard NNs, they are also less computationally efficient and more difficult for practitioners to implement. In the approach that shares the most motivation with RIO, Monte Carlo dropout was used to estimate the predictive uncertainty of dropout NNs Gal2016 . However, this method is restricted to dropout NNs, and it also requires modifications to the NN inference process.

6 Conclusion
The RIO framework both provides estimates of the predictive uncertainty of neural networks and reduces their point-prediction errors. The approach is based on residual estimation and a composite I/O kernel, which are theoretically well founded and perform well on several real-world problems. By utilizing a sparse approximation of GP, RIO scales well to large datasets. Remarkably, it can be applied directly to any pre-trained NN without modifications to the model architecture or training pipeline. Thus, RIO can make NN regression practical and powerful in many real-world applications.
References
[1] O. Anjos, C. Iglesias, F. Peres, J. Martínez, A. Garcia, and J. Taboada. Neural networks applied to discriminate botanical origin of honeys. Food Chemistry, 175:128–136, 05 2015.

[2] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 07 2014.
[3] D. Barber and C. Bishop. Ensemble learning in Bayesian neural networks. In Generalization in Neural Networks and Machine Learning, pages 215–237. Springer Verlag, January 1998.
[4] S. Bergmann, S. Stelzer, and S. Strassburger. On the use of artificial neural networks in simulation-based manufacturing control. Journal of Simulation, 8(1):76–90, Feb 2014.
[5] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1613–1622. JMLR.org, 2015.
 [6] R. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
[7] B. Csaji. Approximation with artificial neural networks. M.S. thesis, Eötvös Loránd University, Budapest, Hungary, 2001.
[8] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641–668, 04 2002.
 [9] D. Dua and C. Graff. UCI machine learning repository, 2017.
[10] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 1050–1059. JMLR.org, 2016.
 [11] M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. M. A. Eslami. Conditional neural processes. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1704–1713. PMLR, 10–15 Jul 2018.
 [12] M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. M. A. Eslami, and Y. W. Teh. Neural processes. CoRR, abs/1807.01622, 2018.
[13] Z. Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521:452–459, 05 2015.
 [14] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 [15] A. Graves. Practical variational inference for neural networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2348–2356, USA, 2011. Curran Associates Inc.
[16] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI'13, pages 282–290, Arlington, Virginia, United States, 2013. AUAI Press.
 [17] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360, San Diego, California, USA, 09–12 May 2015. PMLR.

[18]
G. E. Hinton and D. van Camp.
Keeping the neural networks simple by minimizing the description
length of the weights.
In
Proceedings of the Sixth Annual Conference on Computational Learning Theory
, COLT ’93, pages 5–13, New York, NY, USA, 1993. ACM.  [19] H. Kim, A. Mnih, J. Schwarz, M. Garnelo, S. M. A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh. Attentive neural processes. CoRR, abs/1901.05761, 2019.
 [20] M. Krzywinski and N. Altman. Importance of being uncertain. Nature Methods, 10:809–810, 08 2013.
 [21] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
 [22] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 265–272, USA, 2011. Omnipress.
 [23] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 05 2015.
 [24] J. Lee, Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as Gaussian processes. International Conference on Learning Representations, 2018.

 [25] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Comput., 4(3):448–472, May 1992.
 [26] J. Močkus. On Bayesian methods for seeking the extremum. In G. I. Marchuk, editor, Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974, pages 400–404, Berlin, Heidelberg, 1975. Springer Berlin Heidelberg.
 [27] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996.
 [28] S. T. A. Niaki and S. Hoseinzade. Forecasting S&P 500 index using artificial neural networks and design of experiments. Journal of Industrial Engineering International, 9(1):1, Feb 2013.
 [29] M. Opper and F. Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 302–308, Cambridge, MA, USA, 1999. MIT Press.
 [30] J. Quiñonero Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, Dec. 2005.
 [31] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Jan. 2006.
 [32] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics 9, 2003.
 [33] N. Shahid, T. Rappon, and W. Berta. Applications of artificial neural networks in health care organizational decision-making: A scoping review. PLOS ONE, 14(2):1–22, 02 2019.
 [34] P. Sollich. Learning curves for Gaussian processes. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 344–350, Cambridge, MA, USA, 1999. MIT Press.
 [35] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 [36] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics 12, pages 567–574, 2009.
 [37] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 370–378, Cadiz, Spain, 09–11 May 2016. PMLR.
 [38] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550–560, Dec. 1997.
Appendix S1 Background
This section reviews notation for Neural Networks, Gaussian Processes, and a more scalable approximation of the latter, Stochastic Variational Gaussian Processes. The RIO method, introduced in Section 2 of the main paper, uses Gaussian Processes to estimate the uncertainty in neural network predictions and to reduce their point-prediction errors.
s1.1 Neural Networks
Neural Networks (NNs) learn a nonlinear transformation from input to output space based on a number of training examples. Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ denote the training dataset with size $n$, where $\mathbf{x}_i$ and $y_i$ denote the inputs and outputs (i.e., targets). A fully-connected feedforward neural network with $L$ hidden layers of width $H_l$ (for layer $l = 1, \ldots, L$) performs the following computations. Let $\hat{y}_j^{l}(\mathbf{x})$ denote the output value of the $j$th node in the $l$th hidden layer given input $\mathbf{x}$; then $\hat{y}_j^{1}(\mathbf{x}) = \phi\big(\sum_{k} w_{kj}^{1} x_k + b_j^{1}\big)$ and $\hat{y}_j^{l}(\mathbf{x}) = \phi\big(\sum_{k} w_{kj}^{l}\, \hat{y}_k^{l-1}(\mathbf{x}) + b_j^{l}\big)$ for $l = 2, \ldots, L$, where $w_{kj}^{l}$ denotes the weight on the connection from the $k$th node in the previous layer to the $j$th node in the $l$th hidden layer, $b_j^{l}$ denotes the bias of the $j$th node in the $l$th hidden layer, and $\phi$ is a nonlinear activation function. The output value of the $j$th node in the output layer is then given by $\hat{y}_j(\mathbf{x}) = \sum_{k} w_{kj}^{L+1}\, \hat{y}_k^{L}(\mathbf{x}) + b_j^{L+1}$, where $w_{kj}^{L+1}$ denotes the weight on the connection from the $k$th node in the last hidden layer to the $j$th node in the output layer, and $b_j^{L+1}$ denotes the bias of the $j$th node in the output layer. A gradient-based optimizer is usually used to learn the weights and biases given a predefined loss function, e.g., the squared loss $\mathcal{L} = \sum_{i=1}^{n} \big(y_i - \hat{y}(\mathbf{x}_i)\big)^2$. For a standard NN, the learned parameters are fixed, so the NN output $\hat{y}(\mathbf{x})$ is also a fixed point. For a Bayesian NN, a distribution over the parameters is learned, so the NN output is a distribution over $\hat{y}(\mathbf{x})$. A pretrained standard NN therefore needs to be augmented, e.g., with a Gaussian Process, to achieve the same result.
s1.2 Gaussian Processes
A Gaussian Process (GP) is a collection of random variables, such that any finite collection of these variables follows a joint multivariate Gaussian distribution [31]. Given a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i = f(\mathbf{x}_i) + \varepsilon_i$, where $\varepsilon_i$ denotes additive independent identically distributed Gaussian noise, the first step for the GP is to fit itself to these training data assuming $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$, where $\mathcal{N}(\mathbf{0}, \Sigma)$ denotes a multivariate Gaussian distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma$. Here $K$ denotes the kernel-based covariance matrix at all pairs of training points, with each entry $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, and $\sigma_n^2$ denotes the noise variance of the observations. One commonly used kernel is the radial basis function (RBF) kernel, defined as $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\big(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2 l^2}\big)$. The signal variance $\sigma_f^2$, length scale $l$, and noise variance $\sigma_n^2$ are trainable hyperparameters. The hyperparameters of the covariance function are optimized during the learning process to maximize the log marginal likelihood $\log p(\mathbf{y} \mid X)$. After the fitting phase, the GP is utilized to predict the distribution of the label $y_*$ given a test point $\mathbf{x}_*$. This prediction is given by $y_* \sim \mathcal{N}(\bar{y}_*, \mathrm{var}(y_*))$ with $\bar{y}_* = \mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y}$ and $\mathrm{var}(y_*) = k(\mathbf{x}_*, \mathbf{x}_*) + \sigma_n^2 - \mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$, where $\mathbf{k}_*$ denotes the vector of kernel-based covariances between $\mathbf{x}_*$ and all the training points (i.e., $(\mathbf{k}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i)$), and $\mathbf{y}$ denotes the vector of all training labels. Unlike with an NN, the uncertainty of a GP's prediction is therefore explicitly quantified.
s1.3 Stochastic Variational Gaussian Processes
The main limitation of the standard GP, as defined above, is that it is excessively expensive in both computation and storage. For a dataset with $n$ data points, inference in a standard GP has time complexity $\mathcal{O}(n^3)$ and space complexity $\mathcal{O}(n^2)$. To circumvent this issue, sparse GP methods were developed to approximate the original GP by introducing inducing variables [8, 30, 32, 36]. These approximation approaches lead to a computational complexity of $\mathcal{O}(nm^2)$ and space complexity of $\mathcal{O}(nm)$, where $m$ is the number of inducing variables. Following this line of work, the Stochastic Variational Gaussian Process (SVGP) [16, 17] further improves the scalability of the approach by applying the Stochastic Variational Inference (SVI) technique, as follows:
Consider the same training dataset and GP as in Section S1.2, and assume a set of $m$ inducing variables with values $\mathbf{u}$ at inducing points $Z$ (both $Z$ and $\mathbf{u}$ are unknown a priori). SVGP learns a variational distribution $q(\mathbf{u})$ by maximizing a lower bound of $\log p(\mathbf{y})$, where $p$ denotes the probability density under the original GP. Trainable hyperparameters during the learning process include the inducing point locations $Z$, the parameters of $q(\mathbf{u})$, and the hyperparameters of the covariance function of the original GP. Given a test point $\mathbf{x}_*$, the predictive distribution of $y_*$ is obtained by marginalizing over $q(\mathbf{u})$, and still follows a Gaussian distribution. One advantage of SVGP is that minibatch training methods [22] can be applied in the case of a very large dataset. Suppose the minibatch size is $\hat{n}$ with $\hat{n} \ll n$; then for each training step/iteration, the computational complexity is $\mathcal{O}(\hat{n}m^2 + m^3)$, and the space complexity is $\mathcal{O}(\hat{n}m + m^2)$. For full details about SVGP, see [16]. Since NNs are typically trained on relatively large datasets, SVGP makes it practical to implement uncertainty estimates for NNs.
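The computational benefit of inducing points can be seen in a minimal numpy sketch. This is a Subset-of-Regressors-style approximation rather than the full SVI objective of SVGP, so it illustrates only the $\mathcal{O}(nm^2)$ idea; the `rbf` kernel and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, sigma_f=1.0, length=1.0):
    # RBF kernel matrix between the rows of A and B
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2)

def inducing_point_mean(X, y, Z, X_star, noise_var=0.1):
    # Predictive mean using m inducing points Z: the only matrix factored is
    # m x m, so the cost is O(n m^2 + m^3) instead of the O(n^3) of exact GP.
    Kzz = rbf(Z, Z)                     # (m, m)
    Kzx = rbf(Z, X)                     # (m, n)
    Ksz = rbf(X_star, Z)                # (n*, m)
    A = noise_var * Kzz + Kzx @ Kzx.T   # (m, m)
    return Ksz @ np.linalg.solve(A, Kzx @ y)
```

When $Z$ equals the full training set, this reduces algebraically to the exact GP predictive mean; with $m \ll n$ it trades a small approximation error for scalability.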
Appendix S2 Empirical Study
s2.1 Experimental Setups
Dataset Description
In total, 12 real-world regression datasets from the UCI machine learning repository [9] are tested. Table 1 summarizes the basic information of these datasets. For all the datasets except MSD, 20% of the whole dataset is used as the test dataset and 80% as the training dataset, and this split is randomly regenerated in each independent run. For MSD, the first 463715 samples are used as the training dataset and the last 51630 samples as the test dataset, following the provider's guideline. All the datasets except MSD are tested over 100 independent runs, and MSD is tested over 10 independent runs. For each independent run, the same random dataset split is used by all the tested algorithms to ensure fair comparisons.
abbreviation  full name in UCI ML repository  dataset size  dimension  note 

yacht  Yacht Hydrodynamics Data Set  252  6   
ENB/h  Energy efficiency  768  8  Heating Load as target 
ENB/c  Energy efficiency  768  8  Cooling Load as target 
airfoil  Airfoil SelfNoise  1505  5   
CCS  Concrete Compressive Strength  1030  8   
wine/r  Wine Quality  1599  11  only use winequalityred data 
wine/w  Wine Quality  4898  11  only use winequalitywhite data 
CCPP  Combined Cycle Power Plant  9568  4   
CASP  Physicochemical Properties of Protein Tertiary Structure  54730  9   
SC  Superconductivty Data  21263  80   
CT  Relative location of CT slices on axial axis  53500  384   
MSD  YearPredictionMSD  515345  90  train: first 463715, test: last 51630 
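The per-run split protocol described above can be sketched as follows (the function name and seed handling are illustrative; in the experiments each run's split is shared by all algorithms):

```python
import numpy as np

def split_indices(n, test_frac=0.2, seed=0):
    # One random 80/20 split per independent run; passing the same seed to
    # every algorithm in that run ensures they see identical splits.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]   # train indices, test indices
```

For MSD the split is fixed instead: the first 463715 samples for training and the last 51630 for testing.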
Parametric Setup for Algorithms

NN: For the SC dataset, a fully-connected feedforward NN with 2 hidden layers, each with 128 hidden neurons, is used. For the CT dataset, a fully-connected feedforward NN with 2 hidden layers, each with 256 hidden neurons, is used. For all the remaining datasets, a fully-connected feedforward NN with 2 hidden layers, each with 64 hidden neurons, is used. The inputs to the NN are normalized to have mean 0 and standard deviation 1. The activation function is ReLU for all the hidden layers. The maximum number of epochs for training is 1000. 20% of the training data is used as validation data, with a new random split in each independent run. Early stopping is triggered if the loss on the validation data has not improved for 10 epochs. The optimizer is RMSprop with learning rate 0.001, and the loss function is mean squared error (MSE).
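The NN setup above corresponds to a standard Keras configuration; the sketch below is an assumed reconstruction (the function name is ours), shown for the 64-unit case:

```python
from tensorflow import keras

def build_nn(input_dim, width=64):
    # Fully-connected feedforward NN: 2 hidden ReLU layers, linear output,
    # RMSprop (learning rate 0.001) with MSE loss, as in the setup above.
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        keras.layers.Dense(width, activation="relu"),
        keras.layers.Dense(width, activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
                  loss="mse")
    return model

# Early stopping on validation loss with patience 10, up to 1000 epochs:
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
# model.fit(x_train, y_train, epochs=1000, validation_split=0.2,
#           callbacks=[early_stop])
```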

RIO, RIO variants and SVGP [16]: SVGP is used as an approximator to the original GP in RIO and all the RIO variants. For RIO, the RIO variants and SVGP, the number of inducing points is 50 for all the experiments. The RBF kernel is used for both the input and the output kernel. For RIO, the RIO variants and SVGP, the signal variances and length scales of all the kernels plus the noise variance are the trainable hyperparameters. The optimizer is L-BFGS-B with default parameters as in the scipy.optimize documentation (https://docs.scipy.org/doc/scipy/reference/optimize.minimize-lbfgsb.html): 'maxcor': 10, 'ftol': 2.220446049250313e-09, 'gtol': 1e-05, 'eps': 1e-08, 'maxfun': 15000, 'maxiter': 15000, 'iprint': -1, 'maxls': 20. The training process runs until the L-BFGS-B optimizer decides to stop.
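For reference, invoking L-BFGS-B with exactly these options through scipy looks as follows; the objective here is a toy stand-in for the actual negative log marginal likelihood:

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(log_theta):
    # Stand-in for the negative log marginal likelihood over
    # log-hyperparameters (signal variances, length scales, noise variance).
    return float(np.sum((np.exp(log_theta) - 1.0) ** 2))

result = minimize(
    toy_objective, x0=np.zeros(3), method="L-BFGS-B",
    options={"maxcor": 10, "ftol": 2.220446049250313e-09, "gtol": 1e-05,
             "eps": 1e-08, "maxfun": 15000, "maxiter": 15000,
             "iprint": -1, "maxls": 20})
# Training runs until the optimizer's own stopping criteria are met.
```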

NNGP [24]: For the NNGP kernel, the depth is 2, and the activation function is ReLU. Following the learning process in the original paper, a grid search is performed to find the best values of the weight variance $\sigma_w^2$ and bias variance $\sigma_b^2$. As in the original paper, a grid of 30 points evenly spaced from 0.1 to 5.0 (for $\sigma_w^2$) and 30 points evenly spaced from 0 to 2.0 (for $\sigma_b^2$) is evaluated. The noise variance is fixed at 0.01. The grid search process stops when Cholesky decomposition fails or all 900 points have been evaluated. The best values found during the grid search are used in the experiments. No precomputed lookup tables are utilized.

ANP [19]: The parametric setup of ANP follows the recommendations in the original paper. The attention type is multihead, the hidden size is 64, the max number of context points is 50, the context ratio is 0.8, and the random kernel hyperparameters option is on. The sizes of the latent encoder, deterministic encoder, and decoder follow the original paper, the number of latents is 64, and the deterministic path option is on. The Adam optimizer is used, and the maximum number of training iterations is 2000.
Performance Metrics

To measure the point-prediction error, the Root Mean Square Error (RMSE) between a method's predictions and the true outcomes on the test dataset is calculated for each independent experimental run. The mean and standard deviation of these RMSEs over all runs are then used to measure the performance of the algorithms.

To quantitatively measure the quality of uncertainty estimation, the estimated 95%/90%/68% confidence intervals (CIs) are calculated for all the algorithms except standard NNs. A theoretically optimal way to measure the quality of these CIs would be to calculate the difference between the estimated CIs and the true CIs. This requires that the true distribution of the outcomes be known for each test point. However, for real-world datasets, only one observed outcome is available for each test point. To overcome this limitation, we develop a performance metric that quantitatively compares the quality of estimated CIs among algorithms. First, for each independent experimental run, the percentage of test outcomes within the estimated 95%/90%/68% CIs of each algorithm is calculated. The raw residual between this coverage percentage and the target 95%/90%/68% is still biased, e.g., a meaninglessly wide estimated 95% CI that covers 100% of the test outcomes has only a 5% residual. To reduce this bias, the Z-score (under the assumption that the true test outcome follows a Gaussian distribution) is calculated for both the coverage percentage and the targeted 95%/90%/68%. The difference between the two Z-scores is used as a standardized error that approximates the true Z-score error between the estimated and target CIs. For coverage percentages of 100%, the Z-score of 99.999% is used. Each independent experimental run thus yields a standardized error on the test set, and the RMSE of these errors over all the independent runs is used as the final metric to compare algorithms.
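The standardized error computation can be made concrete as below; the helper name is ours, and the two-sided mapping from coverage $p$ to $\Phi^{-1}((1+p)/2)$ reflects the Gaussian assumption stated above:

```python
import numpy as np
from scipy.stats import norm

def ci_zscore_error(coverage, target):
    # Map empirical and target central-CI coverages to Gaussian z-scores
    # and return their difference (the standardized error for one run).
    coverage = min(coverage, 0.99999)        # 100% coverage is clipped
    z_emp = norm.ppf((1.0 + coverage) / 2.0)
    z_target = norm.ppf((1.0 + target) / 2.0)
    return z_emp - z_target

# Final metric: RMSE of these errors over all independent runs, e.g.
# np.sqrt(np.mean([ci_zscore_error(c, 0.95) ** 2 for c in coverages]))
```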

To compare the computation time of the algorithms, the training time (wall-clock time) of NN, RIO, all the RIO variants, SVGP and ANP is averaged over all the independent runs. For NNGP, the wall-clock time of the grid search is used. In case the grid search stops early due to Cholesky decomposition failures, the computation time of NNGP is estimated as the average running time of the successful evaluations multiplied by 900, the intended number of evaluations. All the algorithms are implemented using Tensorflow, and tested in the exact same Python environment. All the experiments are run on a MacBook Pro machine with a 2.5 GHz Intel Core i7 processor and an AMD Radeon R9 M370X 2048 MB graphics card.
s2.2 Supplementary Figures
Figures S1 and S2 show the distribution of the percentage of test outcomes that fall within the estimated 95%/90%/68% CIs over all the independent runs, for all the datasets and algorithms. In these figures, each box extends from the 25th to the 75th percentile of the data (each data point represents an independent experimental run), with a line at the median. The whiskers extend from the box to show the range of the data. Flier points are those past the end of the whiskers, indicating outliers.
s2.3 Correlation between Noise Variance and Prediction RMSE
Dataset  Method  Prediction RMSE  Noise Variance  Dataset  Method  Prediction RMSE  Noise Variance 

yacht  SVGP  14.75  19.25  ENB/h  SVGP  3.04  25.56 
Y+IO  11.35  18.30  Y+IO  1.05  3.49  
RIO  3.06  4.65  RIO  0.81  0.57  
ENB/c  SVGP  3.64  27.45  airfoil  SVGP  5.89  60.11 
Y+IO  2.06  11.04  Y+IO  4.75  49.69  
RIO  1.76  2.45  RIO  4.01  11.43  
CCS  SVGP  11.87  45.59  wine/r  SVGP  0.883  2.12 
Y+IO  7.28  39.23  Y+IO  0.676  0.46  
RIO  6.19  19.07  RIO  0.678  0.28  
wine/w  SVGP  0.873  2.02  CCPP  SVGP  6.10  141.95 
Y+IO  0.717  0.53  Y+IO  7.60  116.28  
RIO  0.710  0.38  RIO  4.25  14.91  
CASP  SVGP  5.22  27.17  SC  SVGP  18.29  152.57 
Y+IO  4.12  14.40  Y+IO  11.72  82.88  
RIO  4.14  15.67  RIO  11.47  88.37  
CT  SVGP  52.1  2588.32  MSD  SVGP  14.40  247.89 
Y+IO  1.89  18.00  Y+IO  20.72  409.70  
RIO  1.01  0.95  RIO  10.02  97.56 
Table S2 shows the mean prediction RMSE and noise variance of RIO, Y+IO and SVGP over all the independent runs for all the datasets. A clear positive correlation between prediction RMSE and noise variance can be observed. In most cases, RIO has a lower prediction RMSE and a lower noise variance than Y+IO, which in turn has lower values in both metrics than the original SVGP. These results are in accordance with the theoretical analysis in Section 2.3 of the main paper, and demonstrate the effectiveness of both residual estimation and the I/O kernel.
s2.4 Additional Discussions
One interesting observation is that on the CT dataset SVGP shows very poor prediction RMSE and 95% CI RMSE, but achieves the best 90% and 68% CI RMSE. After investigation, this turns out to happen only by chance, and is not due to accurate CI estimations by SVGP. Since SVGP is not able to extract any useful information from the high-dimensional input space, it treats all the outcomes simply as noise. As a result, SVGP shows a very large RMSE compared to other algorithms, and the mean of its predicted outcome distribution is always around 0. Since SVGP treats everything as noise, the estimated noise variance is very high, and the estimated 95% CI based on this noise variance is overly wide and covers all the test outcomes in most cases. This leads to a high Z-score RMSE, as expected. However, when the estimated 90% CI is evaluated, the big error in the mean estimation and the big error in the variance estimation largely cancel each other out by chance, i.e., the estimated 90% CI is mistakenly shifted by the erroneous mean, but the overly wide variance fortunately covers slightly more than 90% of the test outcomes. A similar thing happens with the estimated 68% CI, except that because of the overly high estimated variance, the coverage percentages are now below 68%, yet still yield decent Z-score errors. This phenomenon shows that the performance metric we use may introduce outliers if the prediction errors are too high. For comparisons among algorithms that give reasonable prediction errors, which is the case in most of the experiments, the current performance metric still works.
Appendix S3 Proofs
s3.1 Proofs for Section 2.3
Theorem 2.4 follows from a series of three lemmas. First, following a standard approach [31], consider the eigenfunction expansion of the kernel, $k(\mathbf{x}, \mathbf{x}') = \sum_{i} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{x}')$, with eigenfunctions satisfying $\int k(\mathbf{x}, \mathbf{x}')\, \phi_i(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \lambda_i \phi_i(\mathbf{x}')$. Let $\Lambda$ be the diagonal matrix of the eigenvalues $\lambda_i$, and $\Phi$ be the design matrix, i.e., $\Phi_{ij} = \phi_j(\mathbf{x}_i)$.
Lemma 2.1.
Proof.
Since, from the perspective of GP, is indistinguishable from noise, the optimal hyperparameter setting for the GP predictor is mean zero, kernel and noise variance . The expected generalization error of the GP predictor with posterior mean is
where the last step makes use of the fact that the expectation of the product of independent zero-mean random variables is zero. Plugging in a well-known result for the generalization error of Gaussian processes [29, 31, 34] yields the intended result
∎
Lemma 2.2.
Proof.
Making use of the fact that ,
Now, , and . So, ∎
Lemma 2.3.
Proof.
Here, the goal of the GP is to fit the residuals . Since and is indistinguishable from noise, the optimal hyperparameter setting for the GP predictor is mean zero, kernel , and noise variance . Denote the posterior mean of the GP residual predictor by . Then, the final predictor for is . The expected generalization error of is then