# Multivariate Deep Evidential Regression

There is significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, yet several important gaps in the theory and implementation of these networks remain. We discuss three issues with a proposed solution to extract aleatoric and epistemic uncertainties from regression-based neural networks. The approach derives a technique by placing evidential priors over the original Gaussian likelihood function and training the NN to infer the hyperparameters of the evidential distribution. Doing so allows for the simultaneous extraction of both uncertainties without sampling or utilization of out-of-distribution data for univariate regression tasks. We describe the outstanding issues in detail, provide a possible solution, and generalize the deep evidential regression technique for multivariate cases.


## 1 Introduction

Using neural networks (NNs) for regression tasks is one of the main applications of modern machine learning. Given a dataset of pairs $(\vec{x}_i, \vec{y}_i)$, the typical objective is to train a NN $f$ w.r.t. its parameters $w$ such that a given loss $\mathcal{L}_i$ becomes minimal for each pair. Traditional regression-based NNs are designed to output the regression target, i.e., the prediction for $\vec{y}_i$, directly, which allows a subsequent minimization, for example of the sum of squares:

$$\min_w \sum_i \mathcal{L}(\vec{x}_i, \vec{y}_i) = \min_w \sum_i \underbrace{\left(\vec{y}_i - f(\vec{x}_i|w)\right)^2}_{\mathcal{L}_i(w)}. \tag{1}$$

Technically, this is nothing but a fit of a model $f$, parameterized with $w$, to data. As with any fit, the model has to find a balance between being too specific (over-fitting) and being too general (under-fitting). In machine learning this balance is typically evaluated by analyzing the trained model on a separate part of the given data which was not seen during training. In practice, no model will be able to describe this evaluation sample perfectly, and deviations can be categorized into two groups: aleatoric and epistemic uncertainties kendall17. The former quantifies system stochasticity such as observation and process noise, and the latter is model-based or subjective uncertainty due to limited data.

In the following we will describe and analyze an approach to reliably estimate these kinds of uncertainties for NNs by modifying the architecture and introducing an appropriate loss function. The structure of this paper is as follows: First, we will briefly discuss aleatoric and epistemic uncertainties using a pseudo example. We then give an overview of the proposed solution of Amini et al. amini20. In Sec. 2 we describe several issues with the prior work, and follow with a possible solution in Sec. 3. Finally, in Sec. 4 we summarize our multivariate generalization approach extending the prior work, which we use throughout the text.

### 1.1 Aleatoric and epistemic uncertainty

In Fig. 1 we show data located at discrete values of $x$, where for each value of $x$ multiple measurements, $y$, were taken. We generated these data by sampling from a normal distribution centered at the dashed line referred to as the ground truth (GT). The model (solid line) represents the prediction for different values of $x$. The spread of the data is small at some values of $x$ and large at others, leading to a low aleatoric uncertainty at the former points and a high aleatoric uncertainty at the latter, where there is high variance in the observed data. Similarly, the epistemic uncertainty is low where predictions are close to the observed data, and large where the model poorly fits the observed data. In general, aleatoric uncertainty is related to the noise level in the data and, typically, does not depend on the sample size; only the shape of this uncertainty becomes sharper with increasing sample size. In contrast, epistemic uncertainty does scale with the sample size: more data either pull the model towards the observed distribution in a poorly fitted region, if only the sample size in this region is increased, or allow the fit of a more complex model and thus decrease under-fitting in general.

Pivotal for this work are the values of $x$ at which no data were observed. Having no data also corresponds to an epistemic uncertainty since it decreases, in theory, if more data are drawn, assuming a conclusive data sample. In contrast to the large epistemic uncertainty at poorly fitted points, this uncertainty is hard to detect by evaluating a trained model, but at the same time it can be crucial for models to communicate this type of uncertainty in real-world applications such as autonomous driving geiger12; bojarski16; godard17, where models can easily be confronted with out-of-distribution data that was underrepresented during training, leading to dangerous and expensive failures.

### 1.2 Deep Evidential Regression

Different approaches have been developed to enable models to estimate either aleatoric or epistemic uncertainty, where the latter often requires out-of-distribution data or compute-intensive sampling, limiting the application of such approaches malinin18; gal16; lakshminarayanan17; jain21. Recently, Amini et al. adopted a technique from the classification realm and attacked this problem by changing the interpretation of the parameters of the NN amini20; sensoy18: the number of output neurons of a NN for a univariate regression task has to be increased from one to four. The outputs of these neurons are interpreted as $\mu_0$, $\kappa$, $\alpha$, and $\beta$. (In amini20 the authors refer to them as $\gamma$, $\upsilon$, $\alpha$, and $\beta$, respectively.) These are the parameters of a normal-inverse-gamma distribution, $\mathrm{NIG}(\mu_0, \kappa; \alpha, \beta)$, and are used to estimate the prediction and both uncertainties as:

 E[μ]=μ0predictionE[σ2]=β/(α−1)aleatoricvar[μ]=E[σ2]/κepistemic (2)

The authors derive these relations by taking the normal-inverse-gamma distribution (NIG) as the conjugate prior of a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$. Further, the authors show that by using Bayesian inference the loss function for these four parameters becomes the negative logarithm of a scaled Student's $t$-distribution with $2\alpha$ degrees of freedom (DoF), parametrized as:

$$\mathcal{L}^{\mathrm{NLL}}_i = -\log \mathrm{St}_{2\alpha}\!\left(y_i \,\middle|\, \mu_0, \frac{\beta(1+\kappa)}{\kappa\alpha}\right). \tag{3}$$

For reasons we will discuss later, they combine it with a second loss function, referred to as the evidence regularizer, using the total evidence $\Phi$, yielding the total loss:

$$\mathcal{L}_i(w) = \mathcal{L}^{\mathrm{NLL}}_i(w) + \lambda \times |y_i - \mu_0|\,\Phi, \tag{4}$$

where the coupling, $\lambda$, is a hyperparameter of the model. Note that, following the notation of amini20, we have dropped indices for all parameters for the sake of brevity; see Appendix A for a more detailed discussion.
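The total loss of Eq. (4) can be sketched in a few lines. This is an illustration under stated assumptions rather than the reference implementation of amini20: the second argument of the Student's $t$-distribution in Eq. (3) is read as a squared scale, the total evidence is taken as $\Phi = 2\kappa + \alpha$ (the definition of amini20 quoted in Sec. 2.1), and the function names are hypothetical.

```python
import math


def student_t_logpdf(y, loc, scale2, dof):
    """Log-density of a scaled Student's t-distribution with squared scale `scale2`."""
    z2 = (y - loc) ** 2 / scale2
    return (math.lgamma((dof + 1.0) / 2.0) - math.lgamma(dof / 2.0)
            - 0.5 * math.log(dof * math.pi * scale2)
            - (dof + 1.0) / 2.0 * math.log1p(z2 / dof))


def total_loss(y, mu0, kappa, alpha, beta, lam):
    """Eq. (4): negative log-likelihood of Eq. (3) plus the evidence regularizer."""
    scale2 = beta * (1.0 + kappa) / (kappa * alpha)
    nll = -student_t_logpdf(y, mu0, scale2, 2.0 * alpha)
    phi = 2.0 * kappa + alpha  # total evidence as defined in amini20
    return nll + lam * abs(y - mu0) * phi


# Hypothetical values for a single sample.
loss = total_loss(y=1.3, mu0=1.0, kappa=2.0, alpha=3.0, beta=1.5, lam=0.01)
print(loss)
```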

## 2 Addressing issues in the prior art

In this section we discuss three issues with the prior work on Deep Evidential Regression amini20. We also describe new multivariate formulations, which will be detailed later in Sec. 4.

### 2.1 Definition of total evidence

In Bayesian inference a normal-inverse-Wishart distribution (NIW) is a conjugate prior for i.i.d. drawn events from a multivariate normal distribution with unknown mean $\vec{\mu}$ and unknown variance $\Sigma$ gelman04. In the univariate case, $n = 1$, a NIW distribution becomes a NIG distribution and we slightly change our notation and use $\mu$ and $\sigma^2$ for the (unknown) mean and the (unknown) variance, respectively. These prior densities,

$$p(\mu, \sigma^2) = \mathrm{NIG}(\mu_0, \kappa; \alpha, \beta) \qquad p(\vec{\mu}, \Sigma) = \mathrm{NIW}(\vec{\mu}_0, \kappa; \Psi, \nu) \tag{5}$$

(as alluded to previously, we suppress indices for the sake of brevity) correspond to the assumption that each pair $(\mu, \sigma^2)$ or $(\vec{\mu}, \Sigma)$ is sampled from a normal distribution, $\mathcal{N}$, and an inverse gamma distribution, $\Gamma^{-1}$, or inverse Wishart distribution, $\mathcal{W}^{-1}$,

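$$\begin{aligned} \sigma^2 &\sim \Gamma^{-1}(\alpha, \beta) & \Sigma &\sim \mathcal{W}^{-1}(\Psi, \nu) \equiv \mathcal{W}^{-1}(\nu\Sigma_0, \nu) && \text{(6a)} \\ \mu|\sigma^2 &\sim \mathcal{N}(\mu_0, \sigma^2/\kappa) & \vec{\mu}|\Sigma &\sim \mathcal{N}(\vec{\mu}_0, \Sigma/\kappa) && \text{(6b)} \end{aligned}$$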

where only the sampling of the variance is i.i.d., since the variance enters via the scaling parameter in the likelihood of the mean. Using $\Sigma_0$ rather than $\Psi$ corresponds, in the univariate case, to parametrizing the distribution of $\sigma^2$ with a scaled inverse $\chi^2$- rather than a $\Gamma^{-1}$-distribution, which has the advantage of a clearer interpretation of the parameters.

In Appendix B we derive that taking a NIG (NIW) distribution as a conjugate prior corresponds to assuming prior knowledge about the mean and the variance extracted from $\kappa$ virtual measurements of the former and $2\alpha$ virtual measurements ($\nu$ virtual measurements) for the latter. Therefore, it appears natural to define the total evidence of the prior as the sum of the number of virtual measurements:

$$\Phi' := \kappa + 2\alpha \qquad \Phi' := \kappa + \nu \tag{7}$$

where the former (latter) refers to the univariate (multivariate) case. In amini20 the authors define the total evidence of the univariate case as (in the notation of amini20 $\kappa$ becomes $\upsilon$ and the total evidence reads $\Phi = 2\upsilon + \alpha$)

$$\Phi := 2\kappa + \alpha. \tag{8}$$

For consistency we therefore propose to change this and to adopt our definition of $\Phi'$. We will revisit this definition in the next section and discuss it in the context of the evidence regularizer.

### 2.2 Ambiguity of shape parameters

We follow the approach in amini20 and use the posterior predictive or model evidence of a NIW distribution for finding a proper loss function $\mathcal{L}_i$. From Bayesian probability theory the model evidence is a marginal likelihood and, as such, defined as the likelihood of an observation, $\vec{y}_i$, given the evidential distribution parameters, $m = (\vec{\mu}_0, \kappa, \Psi, \nu)$, and is computed by marginalizing over the likelihood parameters (a.k.a. the nuisance parameters), $\theta = (\vec{\mu}, \Sigma)$, where $\Sigma$ and $\Psi$ are positive definite matrices. In our case of placing a NIW evidential prior on a multivariate Gaussian likelihood function an analytical solution exists and can be parametrized with a multivariate $t$-distribution with $\nu - n + 1$ DoF (see Appendix C.2 and C.3 for more details):

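$$p(\vec{y}_i|m) = \int \mathrm{d}\theta\, \mathcal{N}(\vec{y}_i|\theta)\, \mathrm{NIW}(\theta|m) = t_{\nu-n+1}\!\left(\vec{y}_i \,\middle|\, \vec{\mu}_0, \frac{1}{\nu-n+1}\frac{1+\kappa}{\kappa}\Psi\right). \tag{9}$$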

Using this result we can compute the negative log-likelihood loss for sample $i$ as:

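$$\begin{aligned} \mathcal{L}^{\mathrm{NLL}}_i = -\log p(\vec{y}_i|m) ={}& \log\Gamma\!\left(\frac{\nu-n+1}{2}\right) - \log\Gamma\!\left(\frac{\nu+1}{2}\right) \\ &+ \frac{n}{2}\log\!\left(\pi\frac{1+\kappa}{\kappa}\right) - \frac{\nu}{2}\log|\Psi| \\ &+ \frac{\nu+1}{2}\log\left|\Psi + \frac{\kappa}{1+\kappa}(\vec{y}_i-\vec{\mu}_0)(\vec{y}_i-\vec{\mu}_0)^\top\right|. \end{aligned} \tag{10}$$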

From the compact notation in Eq. (9) it is obvious that $\mathcal{L}^{\mathrm{NLL}}$ on its own is not capable of defining $m$ unambiguously. In particular, a fitting approach could be used to find $\vec{\mu}_0$, $\nu$, and the product $\frac{1+\kappa}{\kappa}\Psi$ from data. However, in order to disentangle the latter, additional constraints have to be set, e.g., via an additional regularization of one of its two factors.

The higher-order evidential distribution is projected by integrating out the nuisance parameters $\mu$ and $\sigma^2$ and, in the univariate case, the four DoF of $m$ collapse into three DoF of a scaled Student's $t$-distribution. Fitting this reduced set of DoF is not sufficient to recover all DoF of the evidential distribution. The impact of this observation is that fitting the width of the $t$-distribution will not help to unfold $\kappa$ and $\beta$, and it is possible to find manifolds with different values of $\kappa$ and $\beta$ but with the same value for the loss function $\mathcal{L}^{\mathrm{NLL}}$. In fact, for any given value of one of these two parameters the other can be tuned such that the loss remains unchanged. Therefore, $\mathcal{L}^{\mathrm{NLL}}$ on its own is not sufficient to learn the parameters $m$. (See Appendix D for more details.)
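The degeneracy can be made concrete with a short numerical sketch for the univariate case. It uses the loss of Eq. (3), reading the second argument of the $t$-distribution as a squared scale; the helper `beta_on_manifold` and all numerical values are hypothetical. Several pairs $(\kappa, \beta)$ with the same product $\beta(1+\kappa)/\kappa$ yield the same loss, while the epistemic uncertainty of Eq. (2) differs between them.

```python
import math


def nll(y, mu0, kappa, alpha, beta):
    """Univariate loss: -log St_{2 alpha}(y | mu0, beta (1 + kappa) / (kappa alpha))."""
    dof = 2.0 * alpha
    scale2 = beta * (1.0 + kappa) / (kappa * alpha)
    z2 = (y - mu0) ** 2 / scale2
    return (math.lgamma(dof / 2.0) - math.lgamma((dof + 1.0) / 2.0)
            + 0.5 * math.log(dof * math.pi * scale2)
            + (dof + 1.0) / 2.0 * math.log1p(z2 / dof))


def beta_on_manifold(kappa, width):
    """Return beta such that beta * (1 + kappa) / kappa equals the fixed `width`."""
    return width * kappa / (1.0 + kappa)


y, mu0, alpha, width = 1.3, 1.0, 2.0, 2.0
points = [(k, beta_on_manifold(k, width)) for k in (0.1, 1.0, 10.0)]
losses = [nll(y, mu0, k, alpha, b) for k, b in points]
# Epistemic uncertainty var[mu] of Eq. (2) for each point on the manifold.
epistemic = [b / ((alpha - 1.0) * k) for k, b in points]

for (k, b), loss, epi in zip(points, losses, epistemic):
    print(f"kappa={k:6.2f} beta={b:.4f} loss={loss:.6f} var[mu]={epi:.4f}")
```

The loss column is identical for all rows, whereas the last column changes by an order of magnitude, which is exactly why the fit alone cannot fix the epistemic uncertainty.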

In amini20 this degeneracy of the loss function is broken by introducing the evidence regularizer,

$$\mathcal{L}^{\mathrm{R}}_i = |y_i - \mu_0|\,\Phi, \tag{11}$$

however, it is unclear how the NN could learn the parameters $m$ when the evidence regularizer is disabled by setting $\lambda = 0$. More importantly, although $\mathcal{L}^{\mathrm{R}}$ breaks the degeneracy, using $\Phi$, the total loss can easily be minimized (in theory) for $\lambda > 0$ by simply sending $\kappa \to 0$, since any impact on $\mathcal{L}^{\mathrm{NLL}}$ can be compensated by adjusting $\beta$ without changing the value of $\mathcal{L}^{\mathrm{R}}$. Sending $\kappa$ to zero drives the ratio of aleatoric and epistemic uncertainty to zero as well, cf. Eqs. (2), thus making their values useless. (In practice, $\kappa \to 0$ is numerically unstable and minimizers will fail to converge towards this point.)
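The following sketch illustrates this argument numerically for the univariate case. It assumes the loss of Eq. (3) with a squared scale, the total evidence $\Phi = 2\kappa + \alpha$ of amini20, and hypothetical values for all parameters including the coupling `lam`. Along a path on which $\beta$ compensates every change of $\kappa$, the negative log-likelihood stays constant while the regularizer, and hence the total loss, keeps decreasing as $\kappa \to 0$.

```python
import math


def nll(y, mu0, kappa, alpha, beta):
    """Univariate loss: -log St_{2 alpha}(y | mu0, beta (1 + kappa) / (kappa alpha))."""
    dof = 2.0 * alpha
    scale2 = beta * (1.0 + kappa) / (kappa * alpha)
    z2 = (y - mu0) ** 2 / scale2
    return (math.lgamma(dof / 2.0) - math.lgamma((dof + 1.0) / 2.0)
            + 0.5 * math.log(dof * math.pi * scale2)
            + (dof + 1.0) / 2.0 * math.log1p(z2 / dof))


def regularizer(y, mu0, kappa, alpha):
    """Evidence regularizer |y - mu0| * Phi with Phi = 2 kappa + alpha."""
    return abs(y - mu0) * (2.0 * kappa + alpha)


y, mu0, alpha, width, lam = 1.3, 1.0, 2.0, 2.0, 0.1
totals, ratios, kappas = [], [], [1.0, 0.1, 0.01, 0.001]
for kappa in kappas:
    beta = width * kappa / (1.0 + kappa)   # keeps beta (1 + kappa) / kappa fixed
    totals.append(nll(y, mu0, kappa, alpha, beta) + lam * regularizer(y, mu0, kappa, alpha))
    aleatoric = beta / (alpha - 1.0)       # E[sigma^2], Eq. (2)
    epistemic = aleatoric / kappa          # var[mu], Eq. (2)
    ratios.append(aleatoric / epistemic)   # equals kappa

for kappa, total, ratio in zip(kappas, totals, ratios):
    print(f"kappa={kappa:7.3f} total loss={total:.6f} aleatoric/epistemic={ratio:.3f}")
```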

In summary: The loss $\mathcal{L}^{\mathrm{NLL}}$ is degenerate and requires regularization of either $\kappa$ or $\Psi$. The proposed regularizer of amini20 does not lead to correct uncertainty estimates and has a numerically unstable minimum.

### 2.3 Challenging extraction of shape parameters

We argued previously that minimizing Eq. (10) is nothing but the fit of a $t$-distribution to data. In Appendix A we further elaborate that technically only a single data point is fitted per distribution, although correlations between neighboring points will enter in practice. It is therefore difficult to estimate the number of data points used per fit. Still, we argue here that a large data sample is necessary to reliably extract the parameter $\nu$, which plays a crucial role in estimating the epistemic uncertainty.

In general, small values for $\nu$ will raise the tails of the distribution but only slightly affect the shape of the core of the function where most measurements will be found, assuming that these indeed follow a $t$-distribution. Extraction of $\nu$ using a fit thus needs a large data sample which properly describes the tails. In reality, however, the assumption of normally distributed events often breaks down, especially in the tails of a distribution, which makes a fit of $\nu$ even more ambitious. Assuming one only has decent statistics near the core, the DoF parameter and the scale parameter of a scaled Student's $t$-distribution (for the sake of brevity we overload the notation for these parameters here) are highly correlated, as shown in Fig. 1(b), where we fit a $t$-distribution with one number of DoF to a $t$-distribution with a different number of DoF on a finite interval.
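A quick numerical comparison illustrates why the tails carry most of the information on $\nu$. The sketch below compares a standard Student's $t$-distribution with the standard normal distribution near the core and in the tail; the choice of three DoF and the evaluation points are hypothetical and not taken from the experiment in the text.

```python
import math


def t_pdf(x, dof):
    """Density of a standard Student's t-distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1.0) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1.0) / 2.0 * math.log1p(x * x / dof))


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Ratio of densities near the core (x = 0.5) and in the tail (x = 5).
core_ratio = t_pdf(0.5, 3.0) / normal_pdf(0.5)
tail_ratio = t_pdf(5.0, 3.0) / normal_pdf(5.0)
print(f"core ratio: {core_ratio:.3f}, tail ratio: {tail_ratio:.1f}")
```

Near the core both densities agree to within a few tens of percent, a difference that a change of the scale parameter can largely absorb, whereas in the tail they differ by orders of magnitude.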

To study possible biases of the fitted values of these two parameters we conduct a pseudo experiment where we generate data by drawing them i.i.d. from Student's $t$-distributions with fixed shape parameters and fit them on different sample sizes. (See Appendix E for details.) For each sample size we evaluate 200 fits and find a bias for both parameters at small sample sizes, which decreases as the sample size increases.

In summary: Extracting the parameter $\nu$ of a $t$-distribution requires a sufficiently large data sample. Due to correlations the fit result is biased, which can be significant if the sample size is too small. We showed this for the univariate case (where $\nu$ corresponds to $2\alpha$), but the same holds for the multivariate case as well, where an even larger data set is needed.

## 3 A possible solution

In this section we describe a possible solution for two of the aforementioned issues. The issue regarding the correlation of the shape parameters of a $t$-distribution is not affected by our proposed solution, and biases have to be studied using data.

In amini20 the authors did not introduce the term $\mathcal{L}^{\mathrm{R}}$ as a way to lift the degeneracy of $\mathcal{L}^{\mathrm{NLL}}$ but motivated it as an evidential regularizer, similar to sensoy18. The idea of combining the $L_1$-norm of the prediction error with the total evidence is to force the NN to learn that large errors in the prediction are acceptable as long as this is reflected in a small total evidence, and vice versa. In other words, in the absence of data which would have the potential to squeeze the prediction error, the prior information should approach an uninformed prior, and $\mathcal{L}^{\mathrm{R}}$, as proposed by the authors, is one possible metric to measure the distance to it, but it does not lead to a meaningful minimum as described previously. Finding a suitable metric which breaks the degeneracy of $\mathcal{L}^{\mathrm{NLL}}$ but also pushes the distribution towards an uninformed prior is not straightforward. For example, instances of the $f$-divergence family, the differential entropy, or the peak height of the function as a measure to meet the second requirement do not break the degeneracy. This is because any metric of the $t$-distribution is, by construction, ignorant of internal dependencies of the shape parameters, whereas in order to break the degeneracy $\kappa$ and $\Psi$ have to be unfolded from their product.

We therefore propose to acknowledge the loss of one DoF and couple the parameters $\nu$ and $\kappa$ with a constant hyperparameter $r$, i.e., using again the index notation:

$$\nu_i = r\kappa_i. \tag{12}$$

Including an evidential regularizer such as $\mathcal{L}^{\mathrm{R}}$ in the loss function is therefore no longer necessary and minimizing $\mathcal{L}^{\mathrm{NLL}}$ is sufficient. We motivate this ansatz by considering it unnatural to have prior information from $\kappa$ virtual measurements for the mean and $\nu$ virtual measurements for the variance where the ratio $\nu/\kappa$ significantly fluctuates throughout the data sample or even differs largely in scale. Another way of seeing the implicit coupling of both variables is the case of vanishing epistemic uncertainty, i.e., $\kappa \to \infty$. The model should then become a normal distribution as described in Appendix A, which corresponds to $\nu \to \infty$. (This is $\alpha \to \infty$ in the univariate case.) Coupling $\nu$ and $\kappa$, as proposed in Eq. (12), enforces this behavior.

In summary: For the univariate case we propose to couple the parameters $\alpha$ and $\kappa$ via a hyperparameter $r$ that is kept constant for all instances. Similarly, in the multivariate case one should couple $\nu$ and $\kappa$ with $r$. In the next section we summarize how this changes the loss function.

## 4 Multivariate generalization

In this section we summarize our multivariate generalization, combine it with our proposed solution from Sec. 3 and benchmark it with a multivariate experiment. In general, using our proposed multivariate generalization it is possible to not just learn uncertainties of each regression target of a multivariate dataset individually, but also to learn their correlations which is, by design, impossible to achieve with chained univariate regressions.

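Taking a NIW distribution as the conjugate prior we found that minimizing the loss

$$\begin{aligned} \mathcal{L}_i \equiv \mathcal{L}^{\mathrm{NLL}}_i ={}& \log\Gamma\!\left(\frac{\nu_i-n+1}{2}\right) - \log\Gamma\!\left(\frac{\nu_i+1}{2}\right) \\ &+ \frac{n}{2}\log(r+\nu_i) - \nu_i\sum_j \ell^{(i)}_j \\ &+ \frac{\nu_i+1}{2}\log\left|L_iL_i^\top + \frac{1}{r+\nu_i}(\vec{y}_i-\vec{\mu}_{0,i})(\vec{y}_i-\vec{\mu}_{0,i})^\top\right| + \mathrm{const.} \end{aligned} \tag{13}$$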

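allows the estimation of the prediction and both types of uncertainties as (see Appendix C.1 for a derivation):

$$\underbrace{\mathbb{E}[\vec{\mu}] = \vec{\mu}_{0,i}}_{\text{prediction}} \qquad \underbrace{\mathbb{E}[\Sigma] \propto \frac{\nu_i}{\nu_i-n-1}L_iL_i^\top}_{\text{aleatoric}} \qquad \underbrace{\mathrm{var}[\vec{\mu}] \propto \mathbb{E}[\Sigma]/\nu_i}_{\text{epistemic}} \tag{14}$$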

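where we rewrote the positive (semi-)definite matrix $\Sigma_0 = L L^\top$ with the lower triangular matrix $L$ and enforce positive diagonal elements by parametrizing $L_i$ with $\ell^{(i)}$:

$$(L_i)_{jk} = \begin{cases} \exp\{\ell^{(i)}_j\} & \text{if } j = k, \\ \ell^{(i)}_{jk} & \text{if } j > k, \\ 0 & \text{else.} \end{cases} \tag{15}$$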

We note that regardless of this rewriting the limit $\nu_i \to n+1$ is numerically unstable and cut-offs have to be placed in practice.
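The loss of Eq. (13) together with the parametrization of Eq. (15) can be sketched for $n = 2$ as follows. This is an illustration and not the authors' implementation: it assumes $\Psi_i = \nu_i L_i L_i^\top$ and $\kappa_i = \nu_i / r$, keeps the two $\log\Gamma$ terms that follow from Eq. (10) under these substitutions, and drops the constant $\frac{n}{2}\log\pi$. All function names and numerical values are hypothetical. The second function evaluates Eq. (10) directly, so that both expressions can be compared.

```python
import math


def build_L(l1, l2, l21):
    """Lower-triangular matrix of Eq. (15) for n = 2 (exponentiated diagonal)."""
    return [[math.exp(l1), 0.0], [l21, math.exp(l2)]]


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def llt_plus_outer(L, d, c):
    """Return L L^T + c * d d^T for a 2x2 matrix L and a 2-vector d."""
    return [[sum(L[i][k] * L[j][k] for k in range(2)) + c * d[i] * d[j]
             for j in range(2)] for i in range(2)]


def loss_eq13(y, mu0, l1, l2, l21, nu, r):
    """Eq. (13) for n = 2, without the additive constant."""
    n = 2
    d = [y[0] - mu0[0], y[1] - mu0[1]]
    m = llt_plus_outer(build_L(l1, l2, l21), d, 1.0 / (r + nu))
    return (math.lgamma((nu - n + 1.0) / 2.0) - math.lgamma((nu + 1.0) / 2.0)
            + n / 2.0 * math.log(r + nu) - nu * (l1 + l2)
            + (nu + 1.0) / 2.0 * math.log(det2(m)))


def loss_eq10(y, mu0, psi, kappa, nu):
    """Eq. (10) for n = 2 with explicit Psi and kappa."""
    n = 2
    d = [y[0] - mu0[0], y[1] - mu0[1]]
    c = kappa / (1.0 + kappa)
    m = [[psi[i][j] + c * d[i] * d[j] for j in range(2)] for i in range(2)]
    return (math.lgamma((nu - n + 1.0) / 2.0) - math.lgamma((nu + 1.0) / 2.0)
            + n / 2.0 * math.log(math.pi * (1.0 + kappa) / kappa)
            - nu / 2.0 * math.log(det2(psi))
            + (nu + 1.0) / 2.0 * math.log(det2(m)))


y, mu0 = (0.3, -0.2), (0.1, 0.0)
l1, l2, l21, nu, r = -0.5, 0.2, 0.3, 6.0, 2.0
llt = llt_plus_outer(build_L(l1, l2, l21), [0.0, 0.0], 0.0)
psi = [[nu * v for v in row] for row in llt]   # assumed relation Psi = nu L L^T
a = loss_eq13(y, mu0, l1, l2, l21, nu, r)
b = loss_eq10(y, mu0, psi, nu / r, nu)
print(a, b, b - a)   # the difference is the dropped constant (n / 2) log(pi)
```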

In order to learn the parameters $m_i$ a NN has to have $n(n+3)/2 + 1$ output neurons ($n$ for $\vec{\mu}_{0,i}$, $n(n+1)/2$ for $L_i$, and one for $\nu_i$) and one hyperparameter $r$. By using Eq. (12) we acknowledge the loss of one DoF due to the projection of the higher-order NIW distribution. This reduction comes at the cost that we lose the predictive power on the global scale of the aleatoric and the epistemic uncertainties, which we assume to be constant by taking $r$ as a hyperparameter. In practice this means that if a global scale is of interest, one has to rescale the predicted uncertainties (which is viable with either Bayesian or Frequentist techniques).

Finally, we conduct a simple experiment with $n = 2$ and a fixed value of $r$ to benchmark our multivariate generalization. Our experiment is built upon the univariate experiment described by Amini et al. amini20 with a critical difference: Rather than training and evaluating the NN on partially disjoint data samples (in amini20 the NN is trained on a restricted range of the input but evaluated on a wider one), where the NN has no chance to identify an increasing epistemic uncertainty, we generate data in the $xy$-plane with varying density. We overload our notation, with $x$ and $y$ now being the features of our data sample given the input $t$,

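$$x = (1+\epsilon)\cos t \qquad y = (1+\epsilon)\sin t, \tag{16a}$$

where the distribution of $t$ is not flat but has a $\vee$ shape,

$$t \sim \begin{cases} 1 - \frac{\zeta}{\pi} & \text{if } \zeta \in [0, \pi], \\ \frac{\zeta}{\pi} - 1 & \text{if } \zeta \in (\pi, 2\pi], \\ 0 & \text{else,} \end{cases} \tag{16b}$$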

and $\epsilon$ is drawn from a normal distribution. We draw 300 data points in total (see Fig. 2(a)) and fit the distribution with a small, fully connected NN with a single input neuron, two hidden layers of 32 neurons using Rectified Linear Unit activation functions, and six output neurons.
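A minimal sketch of how such a data sample could be generated with the Python standard library is given below. It is not the code of the accompanying repository. The width of the normal distribution of $\epsilon$ is not recoverable from the text, so `eps_std=0.1` is a hypothetical choice, as are the function names and the seed; the same $\epsilon$ is used for $x$ and $y$, following the literal reading of Eq. (16a). The $\vee$-shaped density of Eq. (16b) is proportional to $|1 - \zeta/\pi|$ on $[0, 2\pi]$ and is sampled by rejection.

```python
import math
import random


def sample_t(rng):
    """Rejection-sample t from the V-shaped density |1 - zeta / pi| on [0, 2 pi]."""
    while True:
        zeta = rng.uniform(0.0, 2.0 * math.pi)
        if rng.random() < abs(1.0 - zeta / math.pi):
            return zeta


def generate(n_points, eps_std, seed=0):
    """Return a list of (t, x, y) tuples following Eq. (16)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        t = sample_t(rng)
        eps = rng.gauss(0.0, eps_std)   # radial noise; eps_std is an assumption
        data.append((t, (1.0 + eps) * math.cos(t), (1.0 + eps) * math.sin(t)))
    return data


data = generate(n_points=300, eps_std=0.1)
# The region around t = pi is sparsely populated by construction.
n_center = sum(1 for t, _, _ in data if 0.5 * math.pi <= t <= 1.5 * math.pi)
print(len(data), n_center)
```

Integrating the density shows that, on average, only a quarter of the points fall into the central half of the $t$ range, which is the sparse region discussed in the remainder of this section.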

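The output of the last layer, $p_1, \dots, p_6$, is transformed and subsequently taken as the parameters of the evidential distribution:

$$\vec{\mu} = \begin{pmatrix} p_1 \\ p_2 \end{pmatrix}, \qquad L = \begin{pmatrix} \exp\{p_3\} & 0 \\ p_4 & \exp\{p_5\} \end{pmatrix}, \qquad \nu = 8 + 5\tanh p_6, \tag{17}$$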

where an exponential function is used to constrain the diagonal elements of $L$ to be strictly positive, and the transformation of $p_6$ confines $\nu$ to the interval $(3, 13)$, which corresponds to the required lower bound (see Appendix C.1 for details) and a cut-off; empirically, $t$-distributions with many DoF are almost indistinguishable from genuine normal distributions.
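The transformation of Eq. (17) and the resulting estimates of Eq. (14) can be sketched as follows. The function names and the example outputs are hypothetical, and the uncertainties are only defined up to a global scale, as discussed above.

```python
import math


def to_evidential_params(p):
    """Map six raw outputs p = (p1, ..., p6) to (mu, L, nu) as in Eq. (17)."""
    p1, p2, p3, p4, p5, p6 = p
    mu = (p1, p2)
    L = ((math.exp(p3), 0.0), (p4, math.exp(p5)))
    nu = 8.0 + 5.0 * math.tanh(p6)   # nu lies in the interval (3, 13)
    return mu, L, nu


def uncertainties(L, nu, n=2):
    """Aleatoric and epistemic covariance of Eq. (14), up to a global scale."""
    llt = [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
           for i in range(n)]
    aleatoric = [[nu / (nu - n - 1.0) * v for v in row] for row in llt]
    epistemic = [[v / nu for v in row] for row in aleatoric]
    return aleatoric, epistemic


# Hypothetical raw network outputs for a single input.
mu, L, nu = to_evidential_params((0.1, -0.2, 0.0, 0.5, -1.0, 0.3))
aleatoric, epistemic = uncertainties(L, nu)
print(mu, nu)
print(aleatoric)
print(epistemic)
```

In floating-point arithmetic `math.tanh` saturates at exactly $\pm 1$ for large arguments, in which case $\nu$ reaches the bound $n + 1 = 3$ and the prefactor in Eq. (14) diverges, so an additional clipping of $p_6$ or $\nu$ would be needed in practice.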

Constraining $\nu$ onto the interval $(3, 13)$ makes this parameter effectively a gate. Being closed, $\nu \to 3$, corresponds to the situation that the additional DoF of a $t$-distribution helps to better fit the data, whereas being open, $\nu \to 13$, indicates that the genuine distribution function actually yields the better fit result. That is, even though the data are drawn from a normal distribution, a fit with a more flexible function will more likely find a better minimum for sparsely sampled data. This extra flexibility of the $t$-distribution w.r.t. a normal distribution, coming from the extra DoF, becomes less important when the data distribution becomes denser and, on average, better resembles its genuine distribution function.

In total we train 100 NNs from scratch on the synthetic data sample and overlay their predictions for the mean and the covariance in Fig. 2(b). More important for this work is the behavior of the parameter $\nu$, which we show in Fig. 4.

We find a statistically significant drop in $\nu$ towards the center of the $t$ range, which corresponds to an enhancement of the epistemic uncertainty. It is exactly this area where the data distribution becomes sparse, and the result therefore meets our expectations. However, not all NNs converged to this solution and fluctuations are present. In fact, we find the shown behavior to be strongly correlated with the sample size of the synthetic data sample: reducing the sample size causes serious over-fitting of the model, whereas increasing it quickly opens the gate for all values of $t$. We find that in the presence of a sufficient amount of data not only is the gate open for all values of $t$, but the predicted values of the mean and the covariance also resemble the genuine distribution well, including in the sparsely sampled region (we already see this behavior indicated in Fig. 2(b)), which is again in agreement with our expectations.

The outlined technique therefore helps to detect variations of the epistemic uncertainty throughout the data landscape. See Appendix F for more details.

## 5 Related Work

Our work builds specifically on the prior art amini20 for uncertainty estimation with evidential neural networks, and more generally on the advancing area of uncertainty reasoning in deep learning.

The probabilistic perspective in machine learning (ML) frames learning as inferring plausible models to explain observed data. Observed data can be consistent with many models, and therefore which model is appropriate given the data is uncertain ghahramani15. Probabilistic (or Bayesian) ML methods are rooted in probability theory and thus provide a framework for modeling uncertainties. Traditional probabilistic methods include Gaussian processes rasmussen06, latent variable models bishop98, and probabilistic graphical models koller09. In recent years there have been many explorations into Bayesian approaches to deep learning kendall17; neal96; guo17; wilson15; hafner18; ovadia19; izmailov19; seedat19. The key observation is that neural networks are typically underspecified by the data, thus different settings of the parameters correspond to a diverse variety of compelling explanations for the data, i.e., a deep learning posterior consists of high performing models which make meaningfully different predictions on test data, as demonstrated in izmailov19; garipov18; zolna19. This underspecification of NNs makes Bayesian inference, and by corollary uncertainty estimation, particularly compelling for deep learning. Bayesian deep learning aims to compute a distribution over the model parameters during training in order to quantify uncertainties, such that the posterior is available for uncertainty estimation and model calibration guo17.

With Bayesian NNs that have thousands or millions of parameters this posterior is intractable, so implementations largely focus on several approximate methods for Bayesian inference. First, Markov Chain Monte Carlo (MCMC) methods iteratively draw samples from the unknown posterior distribution, and efficient MCMC methods make use of gradient information rather than performing random walks. In particular, stochastic gradient MCMC has been applied to Bayesian NNs welling11; li16; park18; maddox19, with a main drawback being the inability to capture complex distributions in the parameter space without increasing the computational overhead. Second, variational inference (VI) performs Bayesian inference by using a computationally tractable variational distribution to approximate the posterior. One approach by Graves et al. graves13 is to use a Gaussian variational posterior to approximate the distribution of the weights in a network, but the capacity of the uncertainty representation is limited by the variational distribution. In general we see that MCMC has a higher variance and lower bias in the estimate, while VI has a higher bias but lower variance mattei19. The preeminent Bayesian deep learning approach by Gal and Ghahramani gal16 showed that variational inference can be approximated without modifying the network. This is achieved through a method of approximate variational inference called Monte Carlo Dropout (MCD), whereby dropout is performed during inference, using multiple dropout masks.

Better understanding of the integration of deep learning with probabilistic ML such as Gaussian processes (GP) is also a fruitful direction, namely with Deep Kernel Learning wilson15 ; lavin21 and deep GP duvenaud14 ; dutordoir21 .

As an alternative to the prior-over-weights approach of Bayesian NNs, one can view deep learning as an evidence acquisition process. Different from the Bayesian modeling nomenclature, evidence here is a measure of the amount of support collected from data in favor of a sample to be classified into a certain class, and uncertainty is inversely proportional to the total evidence sensoy18. Samples during training each add support to a learned higher-order, evidential distribution, which yields epistemic and aleatoric uncertainties without the need for sampling. Several recent works develop this approach to deep learning and uncertainty estimation and put it into practice with prior networks that place priors directly over the likelihood function amini20; malinin18. These approaches largely struggle with regularization sensoy18, generalization (particularly without using out-of-distribution training data) malinin18; hafner18, capturing aleatoric uncertainty gurevich19, and the issues we have addressed above with the prior art Deep Evidential Regression amini20.

There are also the frequentist approaches of bootstrapping and ensembling, which can be used to estimate NN uncertainty without the Bayesian computational overhead as well as being easily parallelizable – for instance Deep Ensembles, where multiple randomly initialized NNs are trained and at test time the output variance from the ensemble of models is used as an estimate of uncertainty lakshminarayanan17 .

## 6 Conclusion

We discussed the recent developments towards uncertainty-aware neural networks, Deep Evidential Regression amini20 , identifying several outstanding issues with the approach and providing detailed solutions grounded in theory and experimental results. In addition to correcting the prior art, we extend it for multivariate scenarios. The solutions and new approach we presented here would benefit from future studies towards empirical validation.

Neural networks have already had significant impacts in many applications – from medical imaging to dialogue systems to autonomous vehicles – and will continue to do so for years to come. Yet there are important shortcomings in our understanding and confidence in this class of machine learning, notably in the ability to estimate uncertainties and calibrate models. This problem becomes more complex, and potentially dangerous, when NNs are built within larger systems that combine data, software, hardware, and people in dynamic, complex ways. Reliable and systematic methods of uncertainty quantification with NNs are needed, especially considering deployments in safety critical domains such as medicine and autonomous vehicles. In addition to the practical utility of reliably quantifying uncertainty in NNs, there is a significant need to build confidence in the models and establish trust with the end-users. Work towards quantifying uncertainties in machine learning is essential such that these models and systems “know when they don’t know”, and are thus more trusted and usable in real-world scenarios.

## Reproducibility

The source code for reproducing our experiments – implementation of the NN and multivariate methods, and algorithms to generate the data – is available on github.com/avitase/mder/ (MIT License). The model and experiments are lightweight, running locally on a 4-core MacBook Pro in under an hour. We use the Python programming language as well as various libraries, most notably PyTorch, NumPy, Jupyter and matplotlib.

## Appendix A Maximum likelihood estimation using Gaussians

A simple approach to estimate uncertainties of a regression-based NN is to assume that data are drawn i.i.d. from normal distributions, i.e.,

$$y_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \tag{18}$$

for each pair of the data sample at hand. Instead of using Bayesian inference one could simple seek for the maximum of the combined likelihood

 maxwL(w)=∏iN(yi∣∣μi,σ2i), (19)

or, alternatively, the minimum of the negative log-likelihood

$$\min_w \mathcal{L}(w) = \min_w \sum_i \underbrace{\left(\frac{1}{2}\log(2\pi\sigma_i^2) + \frac{(y_i-\mu_i)^2}{2\sigma_i^2}\right)}_{\mathcal{L}_i(w)} \tag{20}$$

and let the NN itself estimate $\mu_i$ and $\sigma_i$ by adding one extra neuron to the output layer and interpreting the output values of these two neurons as $\mu_i$ and $\sigma_i$.

We note that this corresponds to fitting Gaussian functions to single data points, and it is the objective of the supervisor of the training process of the NN to ensure that $\mu_i$ and $\sigma_i$ do not over-fit the data as shown on the right side of Fig. 5.
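The per-sample loss of Eq. (20) is short enough to write down directly. The sketch below, with hypothetical function names and values, also illustrates the over-fitting risk mentioned above: for a single data point matched exactly by $\mu_i$, the loss can be lowered without bound by shrinking $\sigma_i$.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Per-sample loss L_i of Eq. (20)."""
    var = sigma * sigma
    return 0.5 * math.log(2.0 * math.pi * var) + (y - mu) ** 2 / (2.0 * var)


def total_nll(ys, mus, sigmas):
    """Sum of the per-sample losses over the data sample."""
    return sum(gaussian_nll(y, m, s) for y, m, s in zip(ys, mus, sigmas))


# A perfectly matched point: shrinking sigma keeps lowering the loss.
for sigma in (1.0, 0.1, 0.001):
    print(sigma, gaussian_nll(y=1.0, mu=1.0, sigma=sigma))
```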

Here, the model is obviously unable to differentiate between aleatoric and epistemic uncertainty and will merge both components into $\sigma_i$ if the model is under-fitting. In case of missing data the model will likely interpolate between regions where data are available and thus underestimate the epistemic uncertainty.

For the sake of brevity we drop the index notation of the parameters of the likelihood in the main text but stress the tight coupling of each pair $(\mu_i, \sigma_i)$ to its individual pair $(x_i, y_i)$. In reality, however, given a sufficiently large data sample, a well-behaved NN will likely produce similar values for $(\mu, \sigma)$ if two pairs $(x_i, y_i)$ and $(x_j, y_j)$ are close. Technically, this approach is equivalent to fitting independent functions to each data point, but the NN will correlate adjacent points.

## Appendix B Interpretation of the shape parameters of NIG and NIW distributions

An interpretation of the shape parameters of a NIG or NIW distribution can be found by analyzing the joint posterior density after taking measurements,

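$$\vec{y} \in \mathbb{R}^m \quad\text{or}\quad Y = (y_{ij}) = \begin{pmatrix} \vec{y}_1^\top \\ \vdots \\ \vec{y}_n^\top \end{pmatrix} \in \mathbb{R}^{n \times m}, \tag{21}$$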

i.e., multiplying the prior density by the normal likelihood yields the posterior density

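$$p(\mu, \sigma^2 \mid \vec{y}) = \mathrm{NIG}(\mu_0', \kappa'; \alpha', \beta') \qquad p(\vec{\mu}, \Sigma \mid Y) = \mathrm{NIW}(\vec{\mu}_0', \kappa'; \Psi', \nu') \tag{22}$$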

with

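$$
\begin{aligned}
\kappa' &= \kappa + m & \qquad \kappa' &= \kappa + m && \text{(23a)} \\
2\alpha' &= 2\alpha + m & \qquad \nu' &= \nu + m && \text{(23b)} \\
\mu_0' &= \frac{1}{\kappa + m} \begin{pmatrix} \kappa & m \end{pmatrix} \begin{pmatrix} \mu_0 \\ \langle\vec{y}\rangle \end{pmatrix} & \qquad \vec{\mu}_0' &= \frac{1}{\kappa + m} \begin{pmatrix} \kappa & m \end{pmatrix} \begin{pmatrix} \vec{\mu}_0 \\ \langle Y \rangle \end{pmatrix} && \text{(23c)} \\
2\beta' &= 2\beta + \tilde{s} + \frac{m\kappa}{\kappa'} \left(\mu_0 - \langle\vec{y}\rangle\right)^2 & \qquad \Psi' &= \Psi + \tilde{S} + \frac{m\kappa}{\kappa'} \left(\vec{\mu}_0 - \langle Y \rangle\right)\left(\vec{\mu}_0 - \langle Y \rangle\right)^\top && \text{(23d)}
\end{aligned}
$$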

where we introduced the sum of squared residuals (up to a scaling constant these are estimators of the sample variance)

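$$\tilde{s} = \sum_{i=1}^{m} \left(y_i - \langle\vec{y}\rangle\right)^2 \qquad \tilde{S} = \sum_{i=1}^{m} \left(\vec{y}_i - \langle Y \rangle\right)\left(\vec{y}_i - \langle Y \rangle\right)^\top \tag{24}$$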

and the expectation value

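$$\langle\vec{y}\rangle = \frac{1}{n} \sum_{i}^{n} y_i \in \mathbb{R} \qquad \langle Y \rangle = \frac{1}{n} \sum_{i,j=1}^{m,n} y_{ij}\, \vec{e}_i \in \mathbb{R}^m. \tag{25}$$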

These relations can easily be interpreted as the combination of prior information and the information contained in the data. In particular, Eq. (23c) reads as the weighted sum of two measurement outcomes of the mean, where the weights correspond to (virtual) measurements encoded in the prior and the (actual) measurements, i.e., the prior distribution can be thought of as providing the information equivalent to $\kappa$ observations with sample mean $\mu_0$. Similarly, using Eq. (23d), the prior distribution can be thought of as providing the information equivalent to $2\alpha$ ($\nu$) observations with average squared deviation $\beta/\alpha$ ($\Psi/\nu$).
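The univariate update rules of Eqs. (23a) to (23d) can be sketched as a small Python function. This is an illustration under stated assumptions, not code from the paper: the name `nig_posterior` is hypothetical, and the sample mean is normalized by the number of measurements $m$ that also enters Eq. (23).

```python
def nig_posterior(mu0, kappa, alpha, beta, ys):
    """Update NIG parameters with measurements ys, following Eqs. (23a)-(23d)."""
    m = len(ys)
    mean = sum(ys) / m                                 # sample mean (assumed 1/m normalization)
    s = sum((y - mean) ** 2 for y in ys)               # sum of squared residuals, Eq. (24)
    kappa_post = kappa + m                             # Eq. (23a)
    alpha_post = alpha + 0.5 * m                       # Eq. (23b)
    mu0_post = (kappa * mu0 + m * mean) / (kappa + m)  # Eq. (23c): weighted mean
    beta_post = beta + 0.5 * (s + m * kappa / kappa_post * (mu0 - mean) ** 2)  # Eq. (23d)
    return mu0_post, kappa_post, alpha_post, beta_post

print(nig_posterior(mu0=0.0, kappa=1.0, alpha=1.0, beta=1.0, ys=[1.0, 2.0, 3.0]))
# (1.5, 4.0, 2.5, 3.5)
```

With $\kappa = 1$ the prior mean counts as a single virtual observation, so three actual measurements with mean 2 pull the posterior mean to $(1 \cdot 0 + 3 \cdot 2)/4 = 1.5$.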

## Appendix C Derivation of multivariate generalization in detail

### C.1 Moments

We assume our data were drawn from a multivariate Gaussian with unknown mean $\vec{\mu}$ and covariance $\Sigma$. We probabilistically model these parameters according to:

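$$\Sigma \sim \mathcal{W}^{-1}(\Psi, \nu) \equiv \mathcal{W}^{-1}(\nu\Sigma_0, \nu), \qquad \vec{\mu} \mid \Sigma \sim \mathcal{N}(\vec{\mu}_0, \Sigma/\kappa),$$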

where $\mathcal{N}$ is a multivariate normal distribution

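$$
\begin{aligned}
\mathcal{N}(\vec{x} \mid \vec{\mu}, \Sigma) &= \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left\{-\frac{1}{2} (\vec{x} - \vec{\mu})^\top \Sigma^{-1} (\vec{x} - \vec{\mu})\right\} && \text{(26a)} \\
&= \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left\{-\frac{1}{2} \operatorname{tr}\left((\vec{x} - \vec{\mu})(\vec{x} - \vec{\mu})^\top \Sigma^{-1}\right)\right\} && \text{(26b)}
\end{aligned}
$$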

and $\mathcal{W}^{-1}$ is an inverse-Wishart distribution

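$$\mathcal{W}^{-1}(\Sigma \mid \Psi, \nu) = \frac{1}{\Gamma_n(\nu/2)} \sqrt{\frac{1}{2^{\nu n}} \frac{|\Psi|^{\nu}}{|\Sigma|^{\nu + n + 1}}} \exp\left\{-\frac{1}{2} \operatorname{tr}\left(\Psi \Sigma^{-1}\right)\right\}.$$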

Using $\Sigma_0$ rather than $\Psi$ corresponds to parametrizing the distribution of $\sigma^2$ with an inverse $\chi^2$- rather than a $\Gamma^{-1}$-distribution in the univariate case. This has the advantage of a clearer interpretation of $\Sigma_0$, i.e., the prior distribution can be thought of as providing the information equivalent to $\nu$ observations with average squared deviation $\Sigma_0$. Similarly, the prior distribution can be thought of as providing the information equivalent to $\kappa$ observations with sample mean $\vec{\mu}_0$, cf. Eqs. (23) and gelman04.
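The univariate correspondence can be checked numerically. The sketch below, using only the Python standard library, evaluates the inverse-Wishart density for $n = 1$ and compares it with an inverse-gamma density. The identification $\alpha = \nu/2$, $\beta = \Psi/2$ is an assumption read off from Eqs. (23b) and (23d), and both function names are hypothetical.

```python
import math

def inv_wishart_pdf_1d(sigma2, psi, nu):
    """Inverse-Wishart density specialised to n = 1 (scalar psi and sigma2)."""
    n = 1
    root = math.sqrt(psi ** nu / (2.0 ** (nu * n) * sigma2 ** (nu + n + 1)))
    # For n = 1 the multivariate gamma function reduces to the ordinary one.
    return root / math.gamma(nu / 2.0) * math.exp(-0.5 * psi / sigma2)

def inv_gamma_pdf(x, alpha, beta):
    """Density of the inverse-gamma distribution with shape alpha and scale beta."""
    return beta ** alpha / math.gamma(alpha) * x ** (-alpha - 1.0) * math.exp(-beta / x)

nu, sigma0_sq = 5.0, 0.7
psi = nu * sigma0_sq  # Psi = nu * Sigma_0
for s2 in (0.2, 1.0, 3.5):
    # Both columns should agree up to floating-point rounding.
    print(s2, inv_wishart_pdf_1d(s2, psi, nu), inv_gamma_pdf(s2, nu / 2.0, psi / 2.0))
```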

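The prior joint distribution (a NIW distribution) factorizes and can be written as:

$$p(\vec{\mu}, \Sigma) = p(\vec{\mu}) \times p(\Sigma) = \underbrace{\mathcal{N}(\vec{\mu} \mid \vec{\mu}_0, \Sigma/\kappa) \times \mathcal{W}^{-1}(\Sigma \mid \Psi, \nu)}_{\mathrm{NIW}}.$$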

The first order moments are then given by

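$$
\begin{aligned}
\langle\vec{\mu}\rangle_{\mathrm{NIW}} &= \langle\vec{\mu}\rangle_{\mathcal{N}} = \vec{\mu}_0, && \text{(27a)} \\
\langle\Sigma\rangle_{\mathrm{NIW}} &= \langle\Sigma\rangle_{\mathcal{W}^{-1}} = \frac{1}{\nu - n - 1} \Psi = \frac{\nu}{\nu - n - 1} \Sigma_0 \quad (\text{for } \nu > n + 1). && \text{(27b)}
\end{aligned}
$$

Using these and $\langle\mu_i \mu_j\rangle_{\mathrm{NIW}} = \langle\mu_i \mu_j\rangle_{\mathcal{N}} = \Sigma_{ij}/\kappa + \mu_{0,i}\,\mu_{0,j}$ we find the variance of $\vec{\mu}$ to be:

$$
\begin{aligned}
\operatorname{var}(\vec{\mu})_{\mathrm{NIW}} &= \begin{pmatrix} & \vdots & \\ \dots & \langle\mu_i \mu_j\rangle_{\mathrm{NIW}} & \dots \\ & \vdots & \end{pmatrix} - \begin{pmatrix} & \vdots & \\ \dots & \langle\mu_i\rangle_{\mathrm{NIW}} \langle\mu_j\rangle_{\mathrm{NIW}} & \dots \\ & \vdots & \end{pmatrix} \\
&= \frac{1}{\kappa(\nu - n - 1)} \Psi = \frac{\nu}{\kappa(\nu - n - 1)} \Sigma_0 \quad (\text{for } \nu > n + 1). && \text{(27c)}
\end{aligned}
$$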

We note that in the univariate case, $n = 1$, these relations become

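$$
\begin{aligned}
\langle\mu\rangle &= \mu_0, && \text{(28a)} \\
\langle\sigma^2\rangle &= \beta/(\alpha - 1), && \text{(28b)} \\
\operatorname{var}(\mu) &= \beta/(\kappa(\alpha - 1)) && \text{(28c)}
\end{aligned}
$$

for

$$\sigma^2 \sim \Gamma^{-1}(\alpha, \beta), \qquad \mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2/\kappa),$$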

as expected.
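The reduction can be made explicit in one line per moment. Assuming the identification $\nu = 2\alpha$ and $\Psi = 2\beta$ suggested by Eqs. (23b) and (23d), setting $n = 1$ in Eqs. (27b) and (27c) gives:

```latex
\begin{aligned}
\langle\sigma^2\rangle
  &= \left.\frac{\Psi}{\nu - n - 1}\right|_{n=1}
   = \frac{2\beta}{2\alpha - 2}
   = \frac{\beta}{\alpha - 1}, \\
\operatorname{var}(\mu)
  &= \left.\frac{\Psi}{\kappa(\nu - n - 1)}\right|_{n=1}
   = \frac{2\beta}{\kappa(2\alpha - 2)}
   = \frac{\beta}{\kappa(\alpha - 1)}.
\end{aligned}
```

The condition $\nu > n + 1$ translates into $\alpha > 1$ under the same identification.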

### C.2 Model evidence

Here we derive the posterior predictive or model evidence of a NIW distribution,

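$$\mathrm{NIW}(\vec{\mu}, \Sigma \mid \vec{\mu}_0, \Psi, \kappa, \nu) = \mathcal{N}(\vec{\mu} \mid \vec{\mu}_0, \Sigma/\kappa) \times \mathcal{W}^{-1}(\Sigma \mid \Psi, \nu), \tag{29}$$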

where $\mathcal{N}$ as well as $\mathcal{W}^{-1}$ are properly normalized in $\vec{\mu}$ and $\Sigma$, respectively, such that

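$$
\begin{aligned}
\int \mathrm{d}\vec{x}\, \mathcal{N}(\vec{x} \mid \vec{\mu}, \Sigma) = \int \mathrm{d}\vec{\mu}\, \mathcal{N}(\vec{x} \mid \vec{\mu}, \Sigma) = \int \mathrm{d}\vec{\mu}\, \mathcal{N}(\vec{\mu} \mid \vec{x}, \Sigma) &= 1, && \text{(30a)} \\
\int \mathrm{d}\Sigma\, \mathcal{W}^{-1}(\Sigma \mid \Psi, \nu) &= 1. && \text{(30b)}
\end{aligned}
$$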

From Bayesian probability theory the model evidence is a marginal likelihood and, as such, defined as the likelihood of an observation $\vec{y}_i$ given the evidential distribution parameters $m = (\vec{\mu}_0, \kappa, \Psi, \nu)$ and is computed by marginalizing over the likelihood parameters $\theta = (\vec{\mu}, \Sigma)$, where $\Sigma$ and $\Psi$ are positive definite matrices:

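$$p(\vec{y}_i \mid m) = \frac{p(\vec{y}_i \mid \theta, m)\, p(\theta \mid m)}{p(\theta \mid \vec{y}_i, m)} = \int \mathrm{d}\theta\, p(\vec{y}_i \mid \theta)\, p(\theta \mid m). \tag{31}$$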

In our case of placing a NIW evidential prior on a multivariate Gaussian likelihood function, i.e.,

$$p(\vec{y}_i \mid m) = \int \mathrm{d}\theta\, \mathcal{N}(\vec{y}_i \mid \theta)\, \mathrm{NIW}(\theta \mid m),$$

an analytical solution exists and can be parametrized with a multivariate $t$-distribution with $\nu - n + 1$ DoF, cf. Sec. C.3:

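$$
\begin{aligned}
p(\vec{y}_i \mid m) &= t_{\nu - n + 1}\left(\vec{y}_i \,\middle|\, \vec{\mu}_0, \frac{1}{\nu - n + 1} \frac{1 + \kappa}{\kappa} \Psi\right) && \text{(32a)} \\
&= \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu - n + 1}{2}\right)} \sqrt{\frac{\kappa^n}{(1 + \kappa)^n} \frac{1}{\pi^n |\Psi|}} \times \left(1 + \frac{\kappa}{1 + \kappa} (\vec{y}_i - \vec{\mu}_0)^\top \Psi^{-1} (\vec{y}_i - \vec{\mu}_0)\right)^{-\frac{\nu + 1}{2}} \\
&= \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu - n + 1}{2}\right)} \sqrt{\frac{\kappa^n}{(1 + \kappa)^n} \frac{1}{\pi^n |\Psi|}} \times \left(\frac{\left|\Psi + \frac{\kappa}{1 + \kappa} (\vec{y}_i - \vec{\mu}_0)(\vec{y}_i - \vec{\mu}_0)^\top\right|}{|\Psi|}\right)^{-\frac{\nu + 1}{2}}, && \text{(32b)}
\end{aligned}
$$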

where we used Sylvester’s determinant theorem,

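$$\left|\Psi + c\,(\vec{\mu} - \vec{\mu}_0)(\vec{\mu} - \vec{\mu}_0)^\top\right| = |\Psi| \left(1 + c\,(\vec{\mu} - \vec{\mu}_0)^\top \Psi^{-1} (\vec{\mu} - \vec{\mu}_0)\right) \tag{33}$$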

with $c = \kappa/(1 + \kappa)$, to derive Eq. (32b). Obviously, $p(\vec{y}_i \mid m)$ on its own is not capable of defining $m$ unambiguously. In particular, a fitting approach could be used to find $\vec{\mu}_0$, $\nu$ and the product $\frac{1 + \kappa}{\kappa} \Psi$ from data. However, in order to disentangle the latter, additional constraints have to be set, e.g., via an additional regularization term.
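Eq. (33) is easy to verify numerically. The following sketch does so for a $2 \times 2$ example in plain Python. All names are hypothetical, `delta` stands in for the difference vector in Eq. (33), and $c = \kappa/(1 + \kappa)$ is the value that matches the prefactor in Eq. (32b).

```python
def det2(a):
    """Determinant of a 2x2 matrix given as nested lists."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inv2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def sylvester_sides(psi, delta, c):
    """Return (lhs, rhs) of Eq. (33) for a 2x2 psi and a 2-vector delta."""
    shifted = [[psi[i][j] + c * delta[i] * delta[j] for j in range(2)] for i in range(2)]
    psi_inv = inv2(psi)
    quad = sum(delta[i] * psi_inv[i][j] * delta[j] for i in range(2) for j in range(2))
    return det2(shifted), det2(psi) * (1.0 + c * quad)

kappa = 2.0
lhs, rhs = sylvester_sides(psi=[[2.0, 0.5], [0.5, 1.0]], delta=[0.3, -1.2],
                           c=kappa / (1.0 + kappa))
print(lhs, rhs)  # the two numbers should agree up to rounding
```

The determinant form avoids an explicit matrix inverse, which is one reason to prefer it when evaluating the loss.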

Using this result we can compute the negative log-likelihood loss for sample $i$ as:

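$$
\begin{aligned}
\mathcal{L}^{\mathrm{NLL}}_i &= -\log p(\vec{y}_i \mid m) \\
&= \log\Gamma\left(\frac{\nu - n + 1}{2}\right) - \log\Gamma\left(\frac{\nu + 1}{2}\right) \\
&\quad + \frac{n}{2} \log\left(\pi \frac{1 + \kappa}{\kappa}\right) - \frac{\nu}{2} \log|\Psi| \\
&\quad + \frac{\nu + 1}{2} \log\left|\Psi + \frac{\kappa}{1 + \kappa} (\vec{y}_i - \vec{\mu}_0)(\vec{y}_i - \vec{\mu}_0)^\top\right|, && \text{(34a)}
\end{aligned}
$$

or, alternatively, using a slightly different parametrization of $m$ with $\Psi \equiv \nu\Sigma_0$:

$$
\begin{aligned}
\mathcal{L}^{\mathrm{NLL}}_i &= \log\Gamma\left(\frac{\nu - n + 1}{2}\right) - \log\Gamma\left(\frac{\nu + 1}{2}\right) \\
&\quad + \frac{n}{2} \log\left(\nu\pi \frac{1 + \kappa}{\kappa}\right) - \frac{\nu}{2} \log|\Sigma_0| \\
&\quad + \frac{\nu + 1}{2} \log\left|\Sigma_0 + \frac{\kappa}{1 + \kappa} \frac{1}{\nu} (\vec{y}_i - \vec{\mu}_0)(\vec{y}_i - \vec{\mu}_0)^\top\right|. && \text{(34b)}
\end{aligned}
$$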
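A dependency-free sketch of the loss in Eq. (34a) is given below. It is meant only to make the formula concrete and is not taken from the authors' repository: the names `det` and `niw_nll` are hypothetical, and the hand-written determinant would normally be replaced by a numerically robust log-determinant routine from a linear-algebra library.

```python
import math

def det(a):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]  # work on a copy
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def niw_nll(y, mu0, kappa, nu, psi):
    """Per-sample negative log-likelihood of Eq. (34a)."""
    n = len(y)
    if nu <= n - 1:
        raise ValueError("nu must exceed n - 1")
    delta = [yi - mi for yi, mi in zip(y, mu0)]
    c = kappa / (1.0 + kappa)
    # psi + c * delta delta^T, the matrix inside the last determinant
    shifted = [[psi[i][j] + c * delta[i] * delta[j] for j in range(n)] for i in range(n)]
    return (math.lgamma((nu - n + 1) / 2.0) - math.lgamma((nu + 1) / 2.0)
            + 0.5 * n * math.log(math.pi * (1.0 + kappa) / kappa)
            - 0.5 * nu * math.log(det(psi))
            + 0.5 * (nu + 1) * math.log(det(shifted)))

print(niw_nll(y=[0.2, -0.1], mu0=[0.0, 0.0], kappa=1.0, nu=4.0,
              psi=[[2.0, 0.3], [0.3, 1.0]]))
```

In a training setting `mu0`, `kappa`, `nu` and `psi` would be outputs of the NN, with `psi` constrained to be positive definite; how that constraint is enforced is not specified in this sketch.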

We note that in the univariate case, $n = 1$, Eq. (32a) becomes a non-standardized Student's $t$-distribution with $\nu$ DoF:

$$p(y_i \mid m) = \mathrm{St}_{\nu}\left(y_i \,\middle|\, \mu_0, \frac{1 + \kappa}{\kappa} \sigma_0^2\right)$$

and Eq. (34a) reduces to

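$$
\begin{aligned}
\mathcal{L}^{\mathrm{NLL}}_i &= -\log p(y_i \mid m) \\
&= \log\Gamma\left(\frac{\nu}{2}\right) - \log\Gamma\left(\frac{\nu + 1}{2}\right) \\
&\quad + \frac{1}{2} \log(\pi/\kappa) - \frac{\nu}{2} \log\left(\nu\sigma_0^2 (1 + \kappa)\right) \\
&\quad + \frac{\nu + 1}{2} \log\left(\nu\sigma_0^2 (1 + \kappa) + \kappa (y_i - \mu_0)^2\right)
\end{aligned}
$$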

and thus reproduces the findings of amini20 with $\nu = 2\alpha$ and $\sigma_0^2 = \beta/\alpha$. Similar to the multivariate case, $\mathcal{L}^{\mathrm{NLL}}_i$ on its own is not sufficient to define $m$ unambiguously.
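The ambiguity can be illustrated with a short numerical sketch (hypothetical function names, standard library only). Two different pairs $(\kappa, \sigma_0^2)$ that share the same product $\frac{1 + \kappa}{\kappa} \sigma_0^2$ yield the same univariate model evidence, so the likelihood alone cannot tell them apart.

```python
import math

def student_t_pdf(y, nu, loc, scale_sq):
    """Density of a non-standardized Student's t-distribution with nu DoF."""
    z_sq = (y - loc) ** 2 / scale_sq
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi * scale_sq))
    return math.exp(log_norm - 0.5 * (nu + 1.0) * math.log1p(z_sq / nu))

def evidence(y, mu0, kappa, nu, sigma0_sq):
    """Univariate model evidence St_nu(y | mu0, (1 + kappa) / kappa * sigma0_sq)."""
    return student_t_pdf(y, nu, mu0, (1.0 + kappa) / kappa * sigma0_sq)

# (kappa, sigma0_sq) = (1, 1) and (3, 1.5) both give (1 + kappa) / kappa * sigma0_sq = 2.
for y in (-1.0, 0.0, 2.5):
    print(y, evidence(y, 0.0, 1.0, 4.0, 1.0), evidence(y, 0.0, 3.0, 4.0, 1.5))
```

Any extra constraint or regularization that breaks this tie has to come from outside the likelihood itself.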

### C.3 Analytical derivation of the model evidence

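One way to derive Eq. (32a) is, first, to use the fact that arguments can be shifted within the product of the two multivariate normal distributions $\mathcal{N}(\vec{y}_i \mid \vec{\mu}, \Sigma)$ and $\mathcal{N}(\vec{\mu} \mid \vec{\mu}_0, \Sigma/\kappa)$,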

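$$
\begin{aligned}
\mathcal{N}(\vec{y}_i \mid \vec{\mu}, \Sigma) \times \mathcal{N}(\vec{\mu} \mid \vec{\mu}_0, \Sigma/\kappa) &= \mathcal{N}(\vec{\mu} \mid \vec{y}_i, \Sigma) \times \mathcal{N}(\vec{\mu} \mid \vec{\mu}_0, \Sigma/\kappa) \\
&= \mathcal{N}\left(\vec{y}_i \,\middle|\, \vec{\mu}_0, \frac{1 + \kappa}{\kappa} \Sigma\right) \times \mathcal{N}\left(\vec{\mu} \,\middle|\, \frac{\vec{y}_i + \kappa\vec{\mu}_0}{1 + \kappa}, \frac{1}{1 + \kappa} \Sigma\right),
\end{aligned}
\tag{35}
$$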

which we use to separate the integration parameter