Fundamental Issues Regarding Uncertainties in Artificial Neural Networks

02/25/2020
by   Neil A. Thacker, et al.

Artificial Neural Networks (ANNs) implement a specific form of multi-variate extrapolation and will generate an output for any input pattern, even when there is no similar training pattern. Extrapolations are not necessarily to be trusted, and in order to support safety-critical systems, we require such systems to give an indication of the training-sample-related uncertainty associated with their output. Some readers may think that this is a well-known issue which is already covered by the basic principles of pattern recognition. We explain below how this is not the case and how the conventional (Likelihood estimate of) conditional probability of classification does not correctly assess this uncertainty. We provide a discussion of the standard interpretations of this problem and show how a quantitative approach based upon long-standing methods can be applied in practice. The methods are illustrated on the task of early diagnosis of dementing diseases using Magnetic Resonance Imaging.


1 Introduction

Machine learning, and in particular artificial neural networks (ANNs), has been applied successfully in a number of areas with state-of-the-art performance [2]. A key research challenge, identified by The Royal Society [51], is verification and robustness, especially for safety-critical applications, where the quality of decisions and predictions must be verifiable to a high standard. This high standard of robustness must be maintained not only in the large-scale/big-data scenario, but also in applications where only smaller amounts of labelled data are available.

Conventional descriptions of pattern recognition systems relate the outputs of ANNs to conditional probabilities of classification [49]. Although it would be convenient to assume that this output tells us something useful about uncertainty, in reality it does not. The problem arises from the density of training samples in the vicinity of the input pattern. When there is only one pattern from which to determine the output, the estimated conditional probability will be driven to a value of 0 or 1, whereas from a statistical perspective the sample size is simply not large enough to justify such confidence. In order to really understand our output we need to know not only the conditional probability estimate but also the total sample density which gave rise to it. This makes it possible to check that output data will support decision making at a level which meets performance specifications [27]. One way to explain this is to say that the system output is the maximum Likelihood estimate of the conditional probability, whereas what we need to know is its expectation value. We will illustrate this difference below for Binomial statistics, the simplest sample-based probability estimate.
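As a minimal illustration of this distinction, the Python sketch below contrasts the maximum Likelihood estimate of a locally sampled probability with its expectation value under a Beta prior; the Jeffreys Beta(1/2, 1/2) choice and the function names are our own illustrative assumptions.

```python
def binomial_mle(k, n):
    """Maximum Likelihood estimate of a probability from k positives in n."""
    return k / n

def binomial_expectation(k, n, a=0.5, b=0.5):
    """Expectation of the probability under a Beta(a, b) prior.

    With a single training pattern (n = 1, k = 1) the MLE is driven to 1.0,
    while the expectation remains well away from certainty.
    """
    return (k + a) / (n + a + b)

for n, k in [(1, 1), (4, 4), (100, 100)]:
    print(n, k, binomial_mle(k, n), round(binomial_expectation(k, n), 3))
```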

Some have tackled this problem using a “what if” approach, where an effort is made to identify the specific pieces of information which have most influence [29]. Bishop [8] considered the problem of validating outputs from a multi-layer perceptron by explicitly modelling the density of the input space using a Parzen window [45] based approach. For areas of the input space with low density, the outputs are flagged as unreliable. In a similar approach, based on radial basis function networks, Leonard et al. [37, 38] use the hidden nodes of the network as the model of the input space density. As well as flagging unreliable outputs due to low input density, the method attempts to place 95% confidence intervals on the outputs. This is based on Student’s t-statistics, at 95% confidence, of the cross-validation error, with the number of degrees of freedom given by the number of input vectors that significantly activate the contributing hidden nodes.

Uncertainties arising from artificial decision systems can be grouped into three categories: statistical uncertainties (due to perturbations in input data), systematic uncertainties (due to the uncertainties associated with training) and bias (due to use of a mis-specified functional model). Ideally, to understand the reliability of an output we need all of these. Kendall and Gal [35] refer to processes of aleatoric and epistemic uncertainty, of which the former is the statistical error and the latter is at least the systematic error. In this paper we will assume that epistemic error is systematic error and discuss bias as a separate issue.

The statistical uncertainty can be obtained in a relatively straightforward manner, by perturbing the input and observing the consequent variation in output. Numerical approaches based upon error propagation can even make use of the derivatives used during training to make such assessments. In the earlier work of Gal [22] statistical uncertainty is modelled via a subjective belief in the smoothness of the output function. This prior assumption has a constraining effect similar to the covariance function used in Gaussian processes. In the more recent work [35], this term has been replaced by the amount of noise inherent in the observed output data, which is either tuned (for homoscedastic noise) or learned as a function of the input data (for heteroscedastic noise).
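A minimal sketch of the input-perturbation approach to statistical (aleatoric) uncertainty is given below; the callable `net`, the toy logistic unit and the assumed input covariance are illustrative stand-ins, not the system described later in this paper.

```python
import numpy as np

def output_statistical_uncertainty(net, x, input_cov, n_draws=1000, rng=None):
    """Propagate input measurement noise through a trained predictor.

    `net` is any callable mapping an input vector to an output (a
    hypothetical trained ANN); `input_cov` is the measurement covariance
    of the input features.  Both are stand-ins for illustration.
    """
    rng = np.random.default_rng(rng)
    draws = rng.multivariate_normal(x, input_cov, size=n_draws)
    outputs = np.array([net(xi) for xi in draws])
    return outputs.mean(axis=0), outputs.std(axis=0)

# Toy 'network': a fixed logistic unit on two inputs.
w, b = np.array([1.5, -0.7]), 0.2
toy_net = lambda xi: 1.0 / (1.0 + np.exp(-(xi @ w + b)))
mean, std = output_statistical_uncertainty(toy_net, np.array([0.3, 1.1]),
                                            0.05 * np.eye(2), rng=0)
print(mean, std)
```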

In order to estimate the systematic (epistemic) uncertainty associated with a predictor we need first either a quantitative description of the possible variations in training data, or a description of the uncertainty on the associated parameters (consistent with the former). Variations over data may be computationally intensive to assess. If it is possible to obtain the uncertainties in parameters then any subsequent estimation of output uncertainty is likely to be more efficient, as the number of trained parameters should always be less than the number of training patterns.

A recent approach to this has been suggested by Gal and Ghahramani [22, 21], using drop-out [53] in a Monte Carlo style approach to estimate uncertainty on the outputs. It is shown that training a neural network with drop-out is mathematically equivalent to a Gaussian process [48]. Drop-out is used as a computationally efficient approximation to variational inference [25, 10, 28] without increasing the number of model parameters. However, the equivalence only holds for large numbers of hidden nodes and is thus not fully scalable or universally valid.
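The following toy numpy sketch illustrates the Monte Carlo drop-out idea (drop-out left active at test time, with repeated stochastic forward passes treated as samples); the tiny random network and the drop-out rate are our own assumptions, not the implementation of the cited work.

```python
import numpy as np

def mc_dropout_forward(x, W1, b1, W2, b2, p_drop=0.5, rng=None):
    """One stochastic forward pass with drop-out left active at test time."""
    rng = np.random.default_rng(rng)
    h = np.tanh(x @ W1 + b1)
    mask = rng.random(h.shape) > p_drop      # drop hidden units at random
    h = h * mask / (1.0 - p_drop)            # inverted drop-out scaling
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 20)), np.zeros(20)
W2, b2 = rng.normal(size=20), 0.0
x = rng.normal(size=5)

# Repeated stochastic passes are treated as samples from the predictive
# distribution; their spread is the (approximate) epistemic uncertainty.
samples = np.array([mc_dropout_forward(x, W1, b1, W2, b2, rng=s)
                    for s in range(200)])
print(samples.mean(), samples.std())
```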

Predictive uncertainty, due to systematic uncertainty on the weights, is obtained, using drop-out as a Monte Carlo integration, by estimating the first two raw moments under the assumption that the joint density of the outputs is diagonal multivariate normal. In this case, the number of terms in the integration is limited by the number of weights, and the approach is only appropriate for large-scale networks. Even with large-scale networks, the multivariate joint density of the weights is not fully sampled, and correlations between parameters, which may be significant [6], are not accounted for. In more recent work [23], the discrete form of drop-out is relaxed to a continuous Concrete distribution [41], with improved uncertainty estimates when compared to synthetic data with known uncertainties.

We wish to make explicit the amount of epistemic uncertainty associated with any decision system and the consequent uncertainty then arising during use. We will tackle the problem using approaches more closely related to formal statistics than the work cited above (see for example [43]), but will explain below how earlier methods were restricted both by numerical practicalities and unrealistic approximations. In our opinion, a quantitative statistical approach for the assessment of epistemic uncertainty would involve sampling of weights from their expected uncertainty distribution whilst maintaining a high standard of validation, even for cases involving small data and/or networks.

The correct interpretation of training cost functions is as Likelihoods, and Likelihood maximisation is the de facto method for parameter estimation. However, even when taking steps to ensure that the Likelihood construct is a valid statistical description of data uncertainty (i.e. honest [14]), the variation of the Likelihood function over the parameters has the wrong properties to allow it to be interpreted directly as a parameter density. Regardless of the choice of probability framework (Bayesian or Frequentist), it is accepted that a Likelihood function needs to be multiplied by something which is a function of the parameters in order to construct a consistent (Bayesian) or quantitatively valid (Frequentist) description of uncertainty. Therefore, our first task is to identify the principles which specify how this should be done.

1.1 Background Theory & Notation

We introduce here the theory associated with estimation of parameter uncertainty when using Likelihood, in order to identify the relationships between competing methodologies and associated approximations. Suppose we have a dataset of $n$ i.i.d. observations $X = \{x_1, \dots, x_n\}$, with an associated family of parametric conditional pdf(s) $p(x|\theta)$. We can hence write the pdf for the entire dataset as $p(X|\theta) = \prod_i p(x_i|\theta)$. The Likelihood and log-likelihood for the observations are then:

$L(\theta) = p(X|\theta), \qquad \ell(\theta) = \ln L(\theta)$   (1)

where $\hat{\theta} = \arg\max_\theta L(\theta)$ is the maximum-likelihood estimator for the parameter. The standard Bayesian approach is to construct a posterior distribution over the space of parameters conditioned on the data by means of a prior distribution $\pi(\theta)$ on parameter space, thus (see Jeffreys [32], Theorem 10):

$P(\theta|X) = \dfrac{\pi(\theta)\, L(\theta)}{\int \pi(\theta')\, L(\theta')\, d\theta'}$   (2)

Ideally, in the absence of any other meaningful estimate of the priors, we need an uninformative prior. Unfortunately, this is not as simple as making $\pi(\theta)$ uniform (e.g. as implicitly assumed in [43, 3]), as this makes a fundamental (and unfortunately common) error regarding the correct use of probabilities and probability densities (footnote: an uninformative prior probability is flat, an uninformative density is not, and for arbitrary parameters, see below, it is the latter that is required. This difference is hard to comprehend if no distinction is made between the two). Alternatively, a Jeffreys prior can also be constructed by using his ‘general rule’, and requiring that the functional form of the prior is invariant under arbitrary transformations of the parameter space [31].

However, we here instead give a naïve derivation for the case of a single parameter, in order to make clear the link between the Jeffreys priors under the Bayesian approach, and the approach taken by Welch and by Peers [58, 46], which leads to approaches that seek to achieve a Bayesian-Frequentist synthesis [24, 34]. We will then exploit this interpretation to develop a practical method for use with highly non-linear systems, i.e. ANNs.

1.1.1 Interpretation of Jeffreys Priors.

We show below how the Jeffreys prior is related to the process of re-mapping parameters to achieve Gaussian Likelihood functions. We start with the 1-parameter log-likelihood (or Likelihood) for $n$ data points, and hence the posterior given by equation (2).

We will consider the case where $n$ is large, and we will assume that the Likelihood is then strongly peaked about the maximum likelihood estimate (MLE) of the parameter, $\hat{\theta}$. We then first shift and scale to define a new parameter, $t$, where:

$t = \sqrt{n \hat{I}}\,(\theta - \hat{\theta})$   (3)

We have here introduced the empirical Fisher Information function per data point, which has been replaced by its value at the optimum, $\hat{I}$. We see that the new variable is centred on the peak of the Likelihood, and the width of the peak has been scaled by the second derivative at the peak (the curvature at the optimum) [20].

We now expand the posterior about its value at the optimum, by noting that $t$ is of order unity (i.e. the width of the peak of the Likelihood about the optimum scales as $1/\sqrt{n}$). We now write [7]:

where the normalisation term is included explicitly. The Taylor expansion for the log-likelihood is:

(4)

where we have used prime notation to denote multiple derivatives, and remembering that, since we are expanding about the optimum, the first derivative vanishes. Note also that, since the quadratic term is trivially of order unity, we have to include the third powers to capture the first non-trivial correction term. We also expand the prior:

which is just the first, linear, correction to a constant prior.

Substituting for the original parameter in terms of the new variable $t$, after some algebra we find:

(5)

We note that this expression is in agreement with the related expressions given by Welch and Peers [58], although their method is more general, since they use the full moment generating function. The zeroth-order piece of the posterior is a centred, unit Gaussian, which shows that we correctly scaled and shifted the parameter. The first-order corrections are both odd polynomials in $t$, hence the normalisation is just the usual term for a Gaussian, with no corrections required at this order (footnote: the pi in the normalisation should not be confused with the function used for the prior). The full result says that the first correction to the zeroth-order Gaussian comes from two terms: the third derivative of the Likelihood at the optimum (which gives the amount by which the Likelihood is not symmetric about that optimum), and the term which shows to what extent the prior is not symmetric about the optimum (that is, if its first derivative there is non-zero).

We might want to try getting these two terms to cancel, and hence have a posterior that is Gaussian to first order. But we cannot do this algebraically in $t$, since the expansion in powers of $1/\sqrt{n}$ mixes powers of $t$, so that here we have both linear and cubic terms in $t$ of the same order in $1/\sqrt{n}$. However, given the statements above about the symmetry of the Likelihood and the symmetry of the prior, we can instead require that the posterior should also be ‘symmetric’, or at least centred. That is, we require that the expectation value of $t$ under the posterior vanishes to first order:

(6)

The integrals can be computed, since we just need the expectation values of powers of $t$ under a unit Gaussian, which are given by the standard formulae for the moments of a Gaussian distribution of known mean and variance. We hence find:

(7)

The posterior is hence centred to first-order if:

Using the definition of the scaled parameter $t$ from (3), this can be rearranged to give the differential equation:

(8)

This differential equation corresponds to equations (29) & (30) of Welch and Peers [58], and to the equation for a first-order matching prior from the probability-matching priors literature (e.g., see Ghosh [24], Eqn. (4.3)). The solution of this equation is $\pi(\theta) \propto \sqrt{I(\theta)}$, which is just the Jeffreys General Rule prior [31]. For a vector of parameters $\boldsymbol{\theta}$, the analogous Jeffreys prior (ignoring scaling and shifts) can be taken as

$\pi(\boldsymbol{\theta}) \propto \sqrt{\det I(\boldsymbol{\theta})}$   (9)

with the elements of the Fisher Information matrix being defined as

$I_{jk}(\boldsymbol{\theta}) = -E\left[ \dfrac{\partial^2 \ln L(\boldsymbol{\theta})}{\partial \theta_j \, \partial \theta_k} \right]$   (10)

These terms are otherwise known as the elements of the inverse parameter covariance matrix [47] and can be recognised as the standard approach for the representation and estimation of parameter uncertainty.

1.1.2 An Alternative Way to Understand Parameter Uncertainty

Use of equation (10) assumes we are working in a regime where the quantity of data can be taken to be large, so that the uncertainty associated with the parameters converges to a multivariate Gaussian (Normal) distribution. This is generally not the case for ANNs. Difficulties of tractability also arise when applying equation (9) to evaluate equation (2), both with the practicalities of computing second derivatives and also with the use of an expectation value (generally ignored). Although it has been suggested that second derivatives should be computable on an ANN via a simple extension to the back-propagation training algorithm [9], the authors still know of no general method. However, the naïve derivation of the simple Jeffreys prior above gives us the link between the Bayesian definition of Jeffreys invariant priors and the Frequentist literature on probability-matching priors (see [24] and [13]), and indicates how we can progress.

It is important to note that the degrees of freedom we are manipulating here, by defining a prior over the parameter(s), correspond to our freedom to reparameterise our original family of model pdfs. A Likelihood or an integrated Likelihood is not a pdf or a probability. Under a redefinition of the parameter, we have that:

(11)

where the left-hand side is our new Likelihood function under the reparameterisation. In particular, this means that the ordering of Likelihood values is preserved, and hence the optimum Likelihood remains the optimum.

We hence see that the derivative of the reparameterisation function takes the place of the Bayesian prior, and the mapped Likelihood function generated from the original Likelihood function replaces the posterior [58] as generated from the Likelihood function. The requirement that a prior pdf is non-negative becomes the requirement that our reparameterisation function is monotonically non-decreasing (that is, it has a non-negative derivative). The use of a Jeffreys prior can be considered as only the start of an iterative process, for which the Gaussian mapped parameter is ultimately the result (see Appendix A) (footnote: an iterative sequence of invariant priors, starting from the Jeffreys prior, was also investigated by Dowe; see [17], §7.1, page 953 for an example involving the multinomial distribution).

Rather than applying the Jeffreys prior/mapping itself, singly or iteratively, we instead choose to map a suitable portion of the Likelihood function directly to a Gaussian. We hence require that the log-likelihood, expressed in terms of the mapped parameter $g$, is exactly quadratic. Therefore, if we centre the mapping so that $g(\hat{\theta}) = 0$, then:

$g(\theta) = \mathrm{sign}(\theta - \hat{\theta}) \sqrt{2\,[\ln L(\hat{\theta}) - \ln L(\theta)]}$   (12)

which is just the signed square-root of the log-likelihood ratio statistic [33, 16]. It has been shown previously that hypothesis tests constructed for this statistic are rectangular [58] and honest [14], as required for practical use. The idea can also be seen to be consistent with the work of Cramér and Rao regarding the minimum variance bound (MVB), in the sense that the MVB is saturated when the Likelihood function is exactly Gaussian with a known mean, since efficient estimators for the variance exist in this case (e.g., see Cramér [12], Chapter 32).
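For a concrete check, the sketch below evaluates the signed square-root mapping for a Binomial Likelihood and verifies that the Likelihood ratio is exactly a unit Gaussian in the mapped variable; the sign convention (positive above the MLE) and sample values are our assumptions.

```python
import numpy as np

def log_lik_binomial(p, k, n):
    """Binomial log-likelihood (up to a constant) for k successes in n trials."""
    return k * np.log(p) + (n - k) * np.log(1.0 - p)

def signed_root(p, k, n):
    """Signed square root of the log-likelihood ratio, g(p).

    Sign convention assumed here: positive for p above the MLE.
    """
    p_hat = k / n
    llr = 2.0 * (log_lik_binomial(p_hat, k, n) - log_lik_binomial(p, k, n))
    return np.sign(p - p_hat) * np.sqrt(np.maximum(llr, 0.0))

k, n = 3, 10
p = np.linspace(0.01, 0.99, 99)
g = signed_root(p, k, n)
# The likelihood ratio is exactly a unit Gaussian in the mapped variable g.
lik_ratio = np.exp(log_lik_binomial(p, k, n) - log_lik_binomial(k / n, k, n))
print(np.allclose(lik_ratio, np.exp(-0.5 * g ** 2)))   # True
```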

This general approach was known about at least as far back as Anscombe in 1964 [5, 30] (footnote: who also said [5], “Typically it is the evidence from a small body of data (often corresponding to a non-normal Likelihood function) that is difficult to grasp precisely.”). It can also be related to the more familiar case of the Fisher z-transformation (footnote: for the specific case of sample correlation coefficients, the highly-skewed nature of the sampling distribution, even for large sample sizes, made the standard correlation coefficient unsuitable when it came to assessing the accuracy of observed correlations. Fisher showed that a simple transformation based on the hyperbolic tangent reduced these curves to close approximations to the normal distribution, with a variance that is stable over different values of the true correlation [18, 19]), which Winterbottom [59] shows can be derived by first requiring that it reduces skewness, and that, after bias correction, it is both normalising and variance-stabilising.

In conclusion, the Jeffreys prior can be derived as an approximation to the process of mapping the original Likelihood function onto a Gaussian. Frequentists would claim that this is what the Jeffreys prior is doing, whilst Bayesians might claim that equation (9) defines the underlying principle. What we explain below is that it does not matter which of these explanations you personally prefer; the consequences will turn out to be the same.

1.1.3 Theory Summary

Jeffreys derived his priors in order to obtain consistency under parameter transformation, which can be considered a scientific requirement. In a Frequentist sense, the Jeffreys prior can be derived as a density scaling which approximates the mapping of a parameter to achieve a Gaussian Likelihood function (see Eqns. (3) to (9)). It should be noted that such a mapping is also consistent under parameter transformation.

Generally, when performing Likelihood estimation of parameters, we rely to some extent on the central limit theorem to ensure that for sufficiently large quantities of data the Likelihood function around the optimum will be approximately multi-dimensional Gaussian. Under these circumstances the derivative terms, and the correlations between them, can be modelled using a covariance matrix determined from the Minimum Variance Bound (MVB) in the usual way (10). However, for highly asymmetric likelihood functions proportionately more data will be needed. For highly non-linear systems, and in circumstances where the curse of dimensionality reduces the effective data quantity, we cannot expect this justification to hold. We will show below that this is the circumstance which we encounter for ANNs, but first we start with two simpler problems in statistical estimation which illustrate this.

Rather than using Jeffreys’ approximation, if we directly map a parameter with (12), the Jeffreys prior density for the mapped parameter is not only exact but uniform, due to the origins of Eqn. (9). In a Bayesian sense, you may not accept this origin for Jeffreys priors and would prefer to simply accept Eqn. (9) as already exact. Either way, for this special definition of a parameter the Likelihood function can be directly interpreted as a parameter density (2). Under this scheme the Bayesian and Frequentist approaches are directly comparable and we achieve a form of synthesis, in all respects except the interpretation of the result as a probability (footnote: Bayesians have already noted that Jeffreys priors are often not consistent with Kolmogorov’s axioms and are therefore “improper”).

Now that we have this understanding of the relationship between the approaches, new analysis options become available. We can exploit either the observation that the uninformative prior for the original parameter is the derivative of the mapping function (see Approach I below), or the fact that the prior for the mapped parameter is uniform (see Approach II below). Both insights allow us to avoid the need to evaluate equation (9) while still computing equation (2).

1.2 Parameter Uncertainty: Approach I, Binomial and Chi-square

In what follows, we use the signed square-root of the log-likelihood ratio defined above as our mapping function. We then estimate the distribution over a parameter for use in the assessment of future computational uncertainty (systematic or ‘epistemic’ errors). That is:

(13)

where we now use a distinct notation, rather than that of the posterior, to make it clear that we are using a specific mapping of parameter space rather than an explicit Bayesian prior on parameter space, or a general mapped Likelihood function. We emphasise at this point that, following the basic argument above, this expression is expected to be exact under either a Frequentist or Bayesian interpretation and applicable to arbitrary likelihood functions, whilst the use of equation (9) is not.

1.2.1 Example: The Estimation of a Variance from a Sample with Known Mean

We consider a very small sample of Gaussian i.i.d. data $x_1, \dots, x_n$, where the mean $\mu$ is known, and the model is parameterised by the variance $v$ thus:

$p(x|v) = \dfrac{1}{\sqrt{2\pi v}} \exp\left( -\dfrac{(x-\mu)^2}{2v} \right)$

The variance parameter has been chosen as an example since it gives a highly skewed Likelihood function, and also to illustrate the non-unique nature of Jeffreys’ various priors [34]. The Likelihood function is $L(v) = \prod_{i=1}^{n} p(x_i|v)$. We are asked to determine the uncertainty associated with the MLE of the variance, $\hat{v} = \frac{1}{n}\sum_i (x_i - \mu)^2$ (footnote: note that this is an unbiased estimate since we are using the known mean, rather than the sample mean), and hence the corresponding log-likelihood function is:

$\ln L(v) = -\dfrac{n}{2} \ln v - \dfrac{n \hat{v}}{2 v} + \text{const}$

What we can observe for this system is that once $\hat{v}$ is specified, the Likelihood can be written in a scale-invariant form, i.e. as a function of $v/\hat{v}$ only. We can therefore, without loss of generality, restrict ourselves to consideration of the uncertainty for a nominal value of the variance estimate. We can now compare the theoretical predictions from the Bayesian and Frequentist approaches. For this particular example there are two Jeffreys priors in the literature [34] (see Table 1). The first is the non-location or scale-invariant prior (see Jeffreys [32], §3.1), whereas the second is the Jeffreys General Rule prior [31] (footnote: to be precise, this is the General Rule prior when you take the model to be that with two parameters, where you compute the determinant of the Fisher Information matrix to obtain the prior. The fact that the General Rule itself gives a different answer if you fix one parameter and compute just the Fisher Information function is the specific example considered by Jeffreys in the 1961 edition of [32]; see §3.10, page 182).

Table 1: Summary of the variance and binomial probability examples using uninformative priors. For the chi-square variance example, the prior and resulting conditional density are compared for the Jeffreys General Rule, the Jeffreys non-location rule and the Frequentist mapping; for the binomial example, the Jeffreys General Rule and the Frequentist mapping are compared. Note that the frequentist theory interprets the Bayesian prior as a product of two terms: a uniform density over the mapped variable (consistent with the Fisher information being constant), which can be seen as the true “prior”, and the differential term needed to conserve probability mass under the variable transformation back to the original parameter.

The Frequentist approach, though a different solution to either of the previous two, is also invariant under monotonic non-linear remapping of the parameter. It uses the mapping generated by Eqn. (12). In this case it is straightforward to check that the Likelihood becomes a Gaussian when rewritten in terms of the mapped variable. The derivative term, which would be interpreted as a prior under a Bayesian framework, is instead the term needed to conserve probability mass under the parameter transformation, as we move away from a Gaussian Likelihood function. These definitions lead to the distributions over the variance shown in Figure 1. Note that although the Likelihood plotted against the mapped variable is a Gaussian by definition, for comparison with the other approaches we instead have to consider the density plotted as a function of the variance (see Eqn. (11) for details).

Note that for a Jeffreys prior of power-law form, the peak of the plots will lie below the MLE. For small samples, this will not be close to the theoretical expectation value. For the frequentist plots, note that although the peak of the density over the mapped variable lies exactly at the optimum, the same is not true for the density plotted against the variance.

The general observation from these curves is that they are all quite similar, certainly at the level of the differences observed between the different Jeffreys rules. The Frequentist Monte-Carlo distribution is shown for comparison. DiCiccio and Martin [16] observe that this approach gives “near perfect coverage” in hypothesis tests, i.e. the estimated distribution is quantitatively valid, as required for a valid Frequentist theory.
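A numerical sketch of this comparison is given below, assuming power-law forms for the two candidate priors and using the mapped-Likelihood density for the Frequentist curve; the grid, sample size and assumed prior exponents are illustrative only.

```python
import numpy as np

# Log-likelihood (up to a constant) for the variance v of a Gaussian sample
# of size n with known mean, where v_hat is the usual MLE of the variance.
def log_lik_variance(v, v_hat, n):
    return -0.5 * n * (np.log(v) + v_hat / v)

n, v_hat = 2, 1.0
v = np.linspace(0.05, 10.0, 2000)
dv = v[1] - v[0]
ll = log_lik_variance(v, v_hat, n)
ll_hat = log_lik_variance(v_hat, v_hat, n)
lik = np.exp(ll - ll_hat)

# Bayesian posteriors under two candidate power-law priors (assumed forms).
for name, prior in [("prior 1/v", 1.0 / v), ("prior 1/v^1.5", v ** -1.5)]:
    post = lik * prior
    post /= post.sum() * dv
    print(name, "peak at v =", round(float(v[post.argmax()]), 2))

# Frequentist mapped density: unit Gaussian in g(v), times |dg/dv|.
g = np.sign(v - v_hat) * np.sqrt(np.maximum(2.0 * (ll_hat - ll), 0.0))
rho = np.exp(-0.5 * g ** 2) / np.sqrt(2 * np.pi) * np.abs(np.gradient(g, v))
rho /= rho.sum() * dv
print("mapped-likelihood peak at v =", round(float(v[rho.argmax()]), 2))
```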

Figure 1: Theoretical predictions of the distributions of variance estimation uncertainty for (a) a sample of n=2 and (b) a sample of n=4. Note that for Jeffreys priors of power-law form, the peak occurs below the MLE.

1.2.2 Example: Estimation of Binomial Probability

Binomial statistics can be used to model the estimation of a probability using sample data, i.e. from the number of positive outcomes observed in a set of samples. When using ANNs, the approximation of outputs as conditional probabilities is normally explained as a local estimation of the proportional ratio of classes observed in the sample [49]. This can be achieved using either least-squares or cross-entropy cost functions, although the correct Likelihood function for this is still better understood in statistical terms as Binomial statistics. However, the potential for large-dimensional input vectors leads to the well-known “curse of dimensionality”, and in some places in the data space outputs have to be estimated with very small sample sizes, perhaps even only one. In these situations the Likelihood estimate of the probability may be high, but the uncertainty in this calculation is not reflected in this value. Under these circumstances it is better to think in terms of the expectation of the output, as this quantifies the uncertainties due to sample statistics.

For this example we therefore examine the task of estimating a probability given very small sample sizes, and compute the distribution over this value using Bayesian and Frequentist approaches. The data we have are the number of positive outcomes, $n$, observed in $N$ trials, and the relevant parameter to be determined is the related probability $p$. This then gives the Likelihood for this system as:

$L(p) \propto p^{\,n} (1-p)^{N-n}$   (14)

where, as expected, the MLE of $p$ is given by $\hat{p} = n/N$. The Jeffreys General Rule prior is $\pi(p) \propto p^{-1/2}(1-p)^{-1/2}$, and the (Frequentist) Gaussian mapping function is:

(15)

where the sign is determined by the sign of $(p - \hat{p})$. Note that the mapping is defined so that it is zero at the MLE. The mathematical summary is given in Table 1, and the calculated uncertainties for $N=2$ and $N=10$ are shown in Figure 2. We can see that the computed distributions from the two theoretical approaches are nearly identical.
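The comparison can be reproduced numerically in a few lines; the sample values below are illustrative and are not necessarily those used for Figure 2.

```python
import numpy as np
from scipy.stats import beta

# Jeffreys General Rule prior for a binomial probability is Beta(1/2, 1/2),
# so the posterior after n successes in N trials is Beta(n + 1/2, N - n + 1/2).
N, n_obs = 2, 1          # illustrative values only
p = np.linspace(0.001, 0.999, 999)
bayes = beta.pdf(p, n_obs + 0.5, N - n_obs + 0.5)

# Frequentist mapped density: unit Gaussian in g(p), times |dg/dp|.
p_hat = n_obs / N
ll = n_obs * np.log(p) + (N - n_obs) * np.log(1.0 - p)
ll_hat = n_obs * np.log(p_hat) + (N - n_obs) * np.log(1.0 - p_hat)
g = np.sign(p - p_hat) * np.sqrt(np.maximum(2.0 * (ll_hat - ll), 0.0))
freq = np.exp(-0.5 * g ** 2) / np.sqrt(2 * np.pi) * np.abs(np.gradient(g, p))

# The two densities track each other closely across (0, 1), cf. Figure 2.
print(np.abs(bayes - freq).max())
```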

1.2.3 Summary

The Frequentist and Bayesian approaches to the calculation of uncertainty over parameters determined using Likelihood are very similar, even for small-sample situations where the difference might be expected to be at its largest. Given the aims of this work, to summarise the uncertainty associated with Likelihood estimation, it is the Frequentist theory, based upon making quantitative distributions “honest” for use in hypothesis test construction, which most directly addresses our needs. The Bayesian approach omits the Frequentist axiom by choice (i.e. any need to conform to data samples); it is an approximation which is derived on the basis of more general mathematical considerations. It therefore seems to us very natural to use the Frequentist approach as the basis for the modelling of uncertainty in ANNs.

Figure 2: Theoretical predictions for the distributions of binomial probability estimation uncertainty for (a) a sample of N=2 and (b) a sample of N=10.

1.3 Parameter Uncertainty: Approach II, ANNs

Having shown how the theory deals with standard low-sample statistical problems, we now wish to apply it to parameter uncertainty in ANNs. Approach I is not entirely practical for this. However, based upon the principle outlined above, by replacing the original parameter with the remapped variable we replace the prior with a constant, and the likelihood function then also describes the (Gaussian) uncertainty over the parameter. However, there is still a problem: strictly, the mapping needs to be applied to the multi-dimensional parameter space, i.e. it is a vector function. What we propose in this work is that we apply a separate transform to each of the network parameters. In effect we assume that the “prior” can be written as

$\pi(\boldsymbol{\omega}) \approx \prod_i \left| \dfrac{d g_i(\omega_i)}{d \omega_i} \right|$   (16)

so that, by replacing each parameter $\omega_i$ with $g_i(\omega_i)$, the joint parameter distribution approximates a multi-dimensional Gaussian. If nothing else we would hope that this will help accelerate the effect of the “central limit” process (see section 1.1.3). Then the original likelihood function can also be approximated using

(17)

and, as the “prior” terms are constant,

(18)

We can then use the inverse covariance ($C^{-1}$) to approximate the uncertainty in the parameters arising due to the training sample. The second part of this paper now details how this approach was implemented and tested for a real-world clinical decision support system, and the insights gained.

2 Methods

2.1 Data Acquisition and Preparation

The dataset chosen to illustrate the estimation of ANN output uncertainty is the clinical task of early diagnosis of dementing diseases, where a safe interpretation would be considered essential for ethically responsible patient management.

Diagnosis        Norm.         Alz.          F.T.D.        Vas.D.
Age (sd)         64.2 (7.7)    61.3 (6.4)    60.6 (0.2)    67.6 (5.9)
Duration (sd)    -             3.4 (1.6)     3.6 (3.1)     2.3 (2.1)

Table 2: Demographic make-up of the sample.

The subjects comprised 19 patients with frontotemporal dementia, 18 with Alzheimer’s disease, 11 with vascular dementia and 9 normal controls. Their age distributions and durations of illness are shown in Table 2. All patients were referrals to a specialist diagnostic dementia clinic, and had undergone comprehensive neurological and neuropsychological assessments as part of their diagnostic evaluation. Patients with frontotemporal dementia and Alzheimer’s disease fulfilled currently accepted clinical diagnostic criteria for those conditions [40, 44, 42] and were free from significant risk factors for cerebrovascular disease (Hachinski scale ≤ 4) [26]. Patients with vascular dementia all had high risk factors for vascular disease, with Hachinski scale scores ≥ 7. Patients exhibited the characteristic pattern of dementia associated with their clinical diagnosis [52]. All patients had been followed up for a period of years, and the clinical diagnosis was therefore confirmed by the evolution of the illness. Individuals were excluded if the diagnosis of the form of dementia was equivocal or if the clinical pattern suggested a mixed aetiology.

All subjects were scanned using a Philips 1.5 Tesla ACS-NT scanner with a PowerTrack 6000 gradient subsystem. The patients were scanned using a birdcage head-coil receiver. CSF segmentation was performed on coronal fast spin-echo inversion recovery images (TR 6850 ms, TE 18 ms, TI 300 ms, echo train length 9). Contiguous 3 mm slices were obtained throughout the brain with an in-plane resolution of 0.89 mm² (matrix 256 × 204, field of view 230 mm × 184 mm). The details of the image analysis are given in [56]. The CSF volume measurements were obtained twice for each subject, giving a total of 118 samples.

The purpose of using an ANN in this work is to map twelve volume measurements of cerebrospinal fluid (CSF) to four diagnostic categories: normal brain, Alzheimer’s disease, frontotemporal dementia and vascular dementia. CSF was selected for this work because it is easier to segment from MR images than brain tissue, and it gives a direct quantitative assessment of the change in brain volume over time, as the skull size remains fixed after early adulthood.

However, some additional input variables are also needed to correctly interpret CSF volume measurements, as total brain volumes vary in size between individuals and the normal degree of atrophy varies as a function of age. Including age and total volume in the list of inputs gives a total of 14, along with four output variables. If we tried to train an ANN directly on the 118 examples available there would not be enough data to constrain the free parameters. In previous work [56], this issue was addressed via the application of prior knowledge (see Discussion below) and the use of a k-nearest-neighbour classifier (kNN). Here the kNN is replaced by an ANN.

The initial twelve volume measurements were normalised to the volume of a rectangular bounding box of the CSF. The normal subjects were used to determine a simple (and therefore stable, low-parameter) correction, which adjusted each CSF volume of the normal subjects back to a nominal age of around 40. These same corrections were then applied to the diseased data. The typical expected behaviour of dementing diseases was used to further reduce the number of dimensions from 12 down to 5 (see Discussion). The final five variables can be represented as a five-dimensional vector which is used as input to an ANN for the purposes of estimating the conditional classification probabilities (see Figure 3).

Figure 3: (a) Scatter plot of one pair of the reduced input variables, including the data points A and B referred to in the text. (b) Scatter plot of a second pair of the reduced variables.

2.2 ANN Architectures and Training

A selection of ANNs was generated with five inputs and four outputs, with binary training targets and variable numbers of hidden units in either one or two hidden layers. These networks were trained using a combination of Resilient Propagation [50] and Conjugate Gradient optimisation [47], using purpose-written software (ANSI C). It was important for later stages of the experiments that the cost function had well-located optima. This was generally achieved after 20,000 cost function evaluations for RPROP, followed by a further 1,000 evaluations using conjugate gradient optimisation. This completed on a DELL Precision T7500 workstation, using only one of the co-processors, in under one minute.

The cost function used was the cross entropy, defined as

(19)

which can be considered a Binomial Likelihood function evaluated on a continuum of input samples. The factor of 2 allows a direct relationship to be made between this Likelihood and Chi-squared statistics, and is also needed when re-mapping onto a Gaussian (12). We believe this to be the quantitatively valid Likelihood to use for this classification task. It is important that the cost function is selected in this way, as all subsequent theory regarding output uncertainty estimation is only correct if the Likelihood function is an honest model of the training data uncertainty.
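One plausible reading of this cost function, written as a runnable sketch, is given below; the indexing over patterns and outputs and the clipping used to protect the logarithms are our own assumptions.

```python
import numpy as np

def cross_entropy_cost(targets, outputs, eps=1e-12):
    """Cross-entropy cost with the factor of 2 (one reading of Eqn. (19)).

    targets : (n_patterns, n_outputs) array of binary training targets
    outputs : (n_patterns, n_outputs) array of network outputs in (0, 1)
    The factor of 2 puts the cost on the scale of a chi-squared statistic,
    as required for the Gaussian re-mapping of Eqn. (12); the clipping is
    our own addition for numerical safety.
    """
    y = np.clip(outputs, eps, 1.0 - eps)
    return -2.0 * np.sum(targets * np.log(y) + (1.0 - targets) * np.log(1.0 - y))

t = np.array([[1, 0, 0, 0], [0, 1, 0, 0]])
y = np.array([[0.8, 0.1, 0.05, 0.05], [0.2, 0.6, 0.1, 0.1]])
print(cross_entropy_cost(t, y))
```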

The alternative architectures were evaluated on the basis of their generalisation performance using both leave-one-out cross-validation and calculation of the Akaike Information Criterion (AIC) [1]. For simplicity the cross-validation score was computed as the least-squares difference between the network output and the training target. This involved training each tested architecture 118 times, with the reported value defined as the final average. Following this, two architectures were chosen for the uncertainty estimation tests.

The Gaussian re-mapping of parameters was achieved using simple inspection of the cost function. Given a selected ANN, the weight parameters were first assessed to determine the typical scale which generated a change of one unit in the symmetrised cost function. This was done using an iterative process requiring no more than 10 evaluations of the cost function. This allowed us to generate graphs of the asymmetry in the cost function over a sensible change in each weight prior to remapping. It also allowed us to assess the effect of asymmetrical parameter re-mapping on the parameter covariance.

Starting from the optimum, the change in the cost function was assessed for each parameter to determine three points on either side of the minimum which spanned a change in the cost function of around 3. This could typically be achieved with fewer than 20 evaluations of the cost function. These points were then used to define a cubic spline approximation to the inverse of Eqn. (12), which allows the value of the weight to be computed which gives rise to a specific value of the mapped parameter, such that

(20)

The adequacy of this mapping was checked by testing whether target values of the mapped parameter could be achieved to an accuracy of 0.1 over the sampled range, requiring a further 20 cost function evaluations. The upper and lower limits of this valid range were stored for later use.
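A sketch of this construction for a single weight is shown below, using scipy's CubicSpline; the toy cost function and the equally spaced sampling strategy are assumptions and do not reproduce the exact iterative scheme described above.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def build_inverse_mapping(cost, w_hat, scale, n_side=3):
    """Cubic-spline approximation to the inverse mapping g -> weight.

    `cost` is the chi-squared-like cost (-2 ln L) as a function of one
    weight, with all other weights held at their optima; `w_hat` is the
    optimum of that weight and `scale` a step that changes the cost by
    roughly one unit.  The equally spaced sampling used here is an
    assumption, not the exact scheme described in the text.
    """
    c0 = cost(w_hat)
    ws = [w_hat + i * scale for i in range(-n_side, n_side + 1)]
    gs = [np.sign(w - w_hat) * np.sqrt(max(cost(w) - c0, 0.0)) for w in ws]
    return CubicSpline(gs, ws)        # evaluate at a target g to get a weight

# Toy example: an asymmetric cost with its minimum at w = 1.
cost = lambda w: (w - 1.0) ** 2 + 0.5 * (w - 1.0) ** 3
inv = build_inverse_mapping(cost, w_hat=1.0, scale=0.4)
for g_target in (-0.7, 0.0, 1.0):
    w = float(inv(g_target))
    print(g_target, round(w, 3), round(cost(w), 3))   # cost(w) ~ g_target**2
```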

The remapped parameters define a log-likelihood function which is quadratic, at least for changes in individual parameters. In order to avoid the need for an analytic calculation of the second derivatives of the cost function, and to get a better approximation to the general shape of the cost function, the off-diagonal inverse covariance terms for pairs of parameters were then computed via inspection, using the relationship

(21)

at the expense of 4 more cost function evaluations. Here, the step sizes are chosen to be the maximum allowable absolute values determined by the valid limits stored above. It is worth noting that this calculation is the least-squares estimate of the parameters of a quadratic model using four symmetrically placed evaluations, and is relatively insensitive to the exact location of the optima. The diagonal terms of this matrix are all 1, due to the definition of the mapping.

The cost function could therefore be remapped and the inverse covariance estimated for 30 parameters with typically 2,000 evaluations of the cost function, i.e. equivalent to around 10% of the original network optimisation.
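Since Eqn. (21) is not reproduced here, the sketch below shows one consistent reading: a four-point finite-difference estimate of the mixed second derivative of the cost, expressed in the re-mapped parameters, with the factor of two accounting for the quadratic (Mahalanobis) form of the cost. The function and variable names are ours.

```python
import numpy as np

def offdiag_inverse_cov(cost_in_g, g0, j, k, delta_j, delta_k):
    """Four-point estimate of the (j, k) element of the inverse covariance.

    cost_in_g(g) evaluates the original cost (-2 ln L) at the weights
    implied by the re-mapped parameter vector g (g0 is the optimum,
    normally the zero vector).  With the cost locally approximated by the
    quadratic form g^T C^{-1} g, the mixed second derivative equals
    2 * C^{-1}_{jk}, hence the factor of 8 below.  This finite-difference
    reading of Eqn. (21) is an assumption.
    """
    def at(sj, sk):
        g = g0.astype(float).copy()
        g[j] += sj * delta_j
        g[k] += sk * delta_k
        return cost_in_g(g)

    cross = at(+1, +1) - at(+1, -1) - at(-1, +1) + at(-1, -1)
    return cross / (8.0 * delta_j * delta_k)

# Toy check against a known quadratic cost g^T C^{-1} g.
C_inv = np.array([[1.0, 0.3], [0.3, 1.0]])
cost_in_g = lambda g: g @ C_inv @ g
print(offdiag_inverse_cov(cost_in_g, np.zeros(2), 0, 1, 0.5, 0.5))  # ~0.3
```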

2.3 Evaluation of Output Uncertainty

The uncertainty in the ANN output for an individual input sample was evaluated using Markov-Chain Monte-Carlo (MCMC). Gaussian-distributed i.i.d. random variables were used to generate random steps in the re-mapped parameters. These steps were then evaluated using Eqn. (18), and every 50th update was output as a sample in order to reduce sample correlation. One hundred of these samples were used to generate instances of network parameters using the vector of cubic spline approximations. Values generated outside of the valid range of the mapping were rejected by setting the corresponding density estimate in the MCMC step to zero. Accepted variations in the weights were used to build up the distribution over the output for the selected input sample.

Finally, in order to confirm the adequacy of the Gaussian mapping, the computed Mahalanobis distance was stored along with the true ANN cost function for later comparison. The combined process typically involved 10,000 cost function evaluations, but this could be reduced to as few as 100 if the Gaussian mapping comparison was not needed.
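A compact sketch of such a sampler is given below; the Metropolis random-walk update, step size and default thinning constant are our assumptions (the text above specifies only Gaussian IID steps, evaluation against Eqn. (18), output of every 50th update and rejection outside the validated range), and `net_output` is a hypothetical callable standing in for the trained ANN evaluated at one fixed input pattern.

```python
import numpy as np

def sample_outputs(net_output, inv_maps, C_inv, g_lo, g_hi,
                   n_keep=100, thin=50, step=0.3, rng=None):
    """Metropolis sketch of the epistemic output distribution.

    The target density over the re-mapped parameters g is the Gaussian
    exp(-0.5 * g^T C_inv g), set to zero outside the validated range
    [g_lo, g_hi].  Accepted g vectors are pushed through the per-parameter
    inverse mappings `inv_maps` (e.g. the cubic splines above) to give
    weight vectors, and then through `net_output`.
    """
    rng = np.random.default_rng(rng)
    dim = C_inv.shape[0]

    def log_density(g):
        if np.any(g < g_lo) or np.any(g > g_hi):
            return -np.inf                      # outside the validated mapping
        return -0.5 * g @ C_inv @ g

    g = np.zeros(dim)
    lp = log_density(g)
    outputs = []
    while len(outputs) < n_keep:
        for _ in range(thin):                   # keep every 50th update
            prop = g + step * rng.standard_normal(dim)
            lp_prop = log_density(prop)
            if np.log(rng.random()) < lp_prop - lp:
                g, lp = prop, lp_prop
        weights = np.array([m(gi) for m, gi in zip(inv_maps, g)])
        outputs.append(net_output(weights))
    return np.array(outputs)

# Toy usage: identity inverse mappings, two weights, a logistic 'network'.
inv_maps = [lambda x: x, lambda x: x]
C_inv = np.array([[1.0, 0.2], [0.2, 1.0]])
net = lambda w: 1.0 / (1.0 + np.exp(-w.sum()))
samples = sample_outputs(net, inv_maps, C_inv, -3.0, 3.0, rng=0)
print(samples.mean(), samples.std())
```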

3 Results

The cross-validation and AIC evaluations of the selected neural architectures are shown in Figure 4. The entire dataset was run twice in each case from different random number seeds in order to get a feel for the stability of these estimates. The LOO estimate is absolute but less precise than the AIC. The AIC has the disadvantage of requiring an estimate of the number of linearly independent parameters, which was taken here to be the number of weights, for simplicity. Training a network with more than 55 weights was considered untenable given only 118 samples. The two methods broadly agree that there is no significant benefit in the selection of any one specific architecture for this dataset. Consequently, we chose the two extremes, 2 hidden nodes and two layers of 3 hidden nodes, to illustrate the range of behaviours in the tests. These also happen to be the architectures which were marginally better on one of these two evaluation criteria.

Figure 4: (a) Squared-difference leave-one-out value plotted as a function of the number of network parameters. (b) Akaike Information Criterion plotted against the number of free network parameters (for 2, 3, 4, 5, 2x2 and 2x3 hidden nodes).

The re-mapping of the weight parameters is shown in Figure 5. Prior to remapping, the log-likelihood function is highly asymmetrical (Figure 5(a) and (b)). The network with two hidden nodes has all parameters constrained, with a likelihood function which allows mapping on both sides of the optimum (Figure 5(c)). In contrast, the network with two layers of three hidden nodes has parameters which do not map to the full range (Figure 5(d)). These are situations where changing the parameter exhibits a plateau followed by a steep rise on one side of the optimum. This is likely due to the nature of the transfer functions, which limit the potential increase in a node's output to a fixed value. The addition of weights and nodes seems to increase the prevalence of this effect. It is corrected for during uncertainty assessment by eliminating the non-mapped ranges, as described above.

Figure 5: (a) The log-likelihood function variation for 20 parameters from a network with 2 hidden nodes. (b) The log-likelihood function variation for 20 parameters from a network with 2x3 hidden nodes. (c) The log-likelihood function variation for 20 parameters from a network with 2 hidden nodes, mapped to a quadratic. (d) The log-likelihood function variation for 20 parameters from a network with 2x3 hidden nodes, mapped to a quadratic. The steep lines mark the maximum and minimum valid parameter range.

The distribution of Mahalanobis distance against the original ANN cost function is shown in Figure 6. In theory these plots should be consistent with a line of unit slope. We can interpret spreading around this line as an equivalent random error on the estimate of each mapped parameter. A Monte-Carlo simulation for the network with two hidden nodes suggests that the amount of variance around the expected line is consistent with a random error of 0.1 on each mapped parameter, i.e. it is statistically negligible given the expected accuracy of this parameter (i.e. 1). The network with two layers of three hidden nodes exhibits both a greater variation along the expected line (due to the change in the number of degrees of freedom) and a greater spread, now consistent with a random error of 0.2. Although this is still considered negligible, it provides an indication that, as the complexity of the network is increased, the degree of non-linearity increases and the available data are less able to constrain parameter variation. As a consequence, the degree of conformity of the re-mapped parameters to a Gaussian Likelihood is reduced.

Figure 6: (a) The distribution of Mahalanobis distance estimates compared against the original ANN cost function for a network with 2 hidden nodes. (b) The distribution of Mahalanobis distance estimates compared against the original ANN cost function for a network with 2 layers of 3 hidden nodes.

The distribution over expected outputs for two patterns (A,B) identified in figure 3 is shown in Figure 7. We can note that the uncertainty distribution for A reflects its position among the distributions seen in Figure 3(a). We will discuss the result for B below.

Figure 7: (a) The frequency distribution of 100 classification outputs from a network with 2 hidden nodes computed by MCMC for data point A in Figure 3(a). (b) The frequency distribution of 100 classification outputs from a network with 2 hidden nodes computed by MCMC for data point B in Figure 3(a).

4 Discussion

4.1 Numerical Considerations

The noisy nature of the inverse covariance terms computed using equation (21) (typically a few percent) required a limit to be placed on the absolute values of the off-diagonal terms, to keep them consistent with the mathematical limit of unity. A suitable truncation was found to give good stability without restricting the description of genuine parameter covariance.

The calculation of the Mahalanobis distance in Eqn. (18) required special consideration. The often singular nature of the inverse covariance matrix required the use of Singular Value Decomposition (SVD), and the calculation of the Mahalanobis distance from its eigenvectors and eigenvalues, i.e.

(22)

for use in Eqn. (17). A relative minimum condition limit was also used, in order to avoid large numerical instabilities; eigenvalues below this limit were set to the minimum limit value.
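A sketch of this conditioning strategy is given below; the eigen-decomposition of the symmetric inverse covariance stands in for the SVD, and the relative limit used as a default is only a placeholder for the (unstated) value used in practice.

```python
import numpy as np

def mahalanobis_sq(g, C_inv, rel_limit=1e-6):
    """Mahalanobis distance squared from an eigen-decomposition of the
    (possibly near-singular) inverse covariance.

    Eigenvalues below rel_limit * max(eigenvalue) are floored at that
    limit, following the conditioning strategy described in the text; the
    default rel_limit is a placeholder, not the paper's value.  For a
    symmetric matrix the eigen-decomposition coincides with the SVD.
    """
    lam, vec = np.linalg.eigh(C_inv)       # eigenvalues ascending, vectors as columns
    floor = rel_limit * lam.max()
    lam = np.maximum(lam, floor)
    proj = vec.T @ g                       # components along the eigenvectors
    return float(np.sum(lam * proj ** 2))

C_inv = np.array([[1.0, 0.999], [0.999, 1.0]])   # nearly singular example
print(mahalanobis_sq(np.array([1.0, -1.0]), C_inv))
```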

Finally, in order to restrict the samples to a more realistic approximation of the inverse parameter covariance, a maximum value was placed on the true ANN cost function (equivalent to a cut on the chi-squared) as each sample was generated, in order to eliminate implausible instances of the weights.

4.2 Data Representation and the Use of Prior Knowledge.

Although the original published work  [56] might be useful as a bench-mark, it was not the intention here to perform a shoot-out between ANNs and kNN. Given the small sample size we are highly restricted in our choice of ANN architectures and it could not be done with any statistical power. Rather, the purpose of the current work is to illustrate how the epistemic (systematic) uncertainties in ANN outputs, due to the original uncertainties in training data samples, can be estimated for safety critical applications.

However, this work does identify some important issues regarding the use of prior knowledge and low parameter pattern recognition solutions to mitigate against problems with “black box” approaches  [57]. It must be stated here that for scientific and clinical studies requiring novel imaging there will always be difficulty in generating unlimited data quantities, due to the need for specialised ethics approval and funding.

Regarding the use of prior knowledge, volume normalisation was performed on the basis that the proportional structure of brains is expected to be independent of overall size, and specific diseases tend to affect specific anatomical regions. It is also known that normal ageing causes a gradual monotonic decrease in brain tissue volume after the age of about 40. Age corrections were performed with the intention of making the distributions of normal and disease groups less ambiguous.

Dimensional reduction was achieved by summing the individual (12) measured volumes into variables which might be expected to correlate due to proximity: front (4), middle (4), back (4), left (6), right (6), top (6) and bottom (6). From them some diagnostically relevant, mathematically orthogonal and homoscedastic combinations were constructed (see Appendix B). These combinations were identified by a subjective observation of graphed variables (e.g. Figure 3), in order to identify those which illustrated some obvious degree of class separability. Importantly, we did not set disambiguation as a quantitative target, as we wanted to reduce the chances of over-fitting, although the general effect of ambiguity reduction was confirmed graphically.

We have not enforced a Bayesian re-normalisation of the output probability estimates: despite attempts to identify subjects with unique diagnoses, we decided that in use the classifications would not be mutually exclusive, making Bayes’ Theorem inappropriate. However, we would otherwise expect that exploitation of this constraint would improve the effective DOF in the training data.

In the original work  [56], the final five homoscedastic variables () were re-scaled, using the available repeated measurement information, to approximate measurement precision. This last step was considered particularly important when using the variables in a kNN, as it imbued the Euclidean distance with a nominal statistical scaling. We omit this scaling in this work, as the extra parameters in an ANN can be adjusted to achieve this. However, the conventional ANN training does not exploit our knowledge regarding repeat precision and generates more parameter complexity than the corresponding kNN.

Whilst the corrections for volume and age, along with dimensional reduction to independent homoscedastic variables, and even Bayesian re-normalisation, could have been attempted by adding extra layers to the ANN (e.g. deep learning), the extra degrees of freedom and associated non-linearities could not have been exactly replicated with standard transfer functions and network architectures. Also the specific details of how to leverage the subsets of data (i.e. normals and repeated measurements for these processes) to achieve stable corrections, could not have been determined in a bottom-up manner without having access to exponentially more quantities of data (a 14 dimensional pattern space density rather than 5).

In summary, ignoring the prior knowledge, and attempting to replicate the pre-processing using trained neural network calculations (extra layers), would have needlessly put us close to the regime of deep learning and big data. As we have already stated above, there are reasons why such data is not going to be available in these kind of studies. Even if the data is available there are still benefits to good choice of input data representation. A low dimensional space can be mapped more accurately than a higher one containing the same information  [36] and we can guarantee that the calculations performed (such as scale invariance, variable normalisation and age correction) are appropriate and unconditionally stable across the data space. The final representation variables also support clinical interpretation by being related to simple structural biology. Such simple transparency was considered an important requirement for future clinical integration  [39].

4.3 Implications for Assessment of ANN Uncertainty

ANNs pose a particular challenge for the assessment of parameter uncertainty. The conventional approach, based upon second derivatives [43], is unlikely to generate useful estimates of the inverse covariance matrix, due to the strong non-linearities used in ANN calculations. Re-mapping the parameters to achieve a Gaussian likelihood function improves the inverse covariance approximation to the original cost function, and this alone may well be a good enough reason to take the approach, but it is not the only reason for doing so. As we have explained in the introduction, making the Likelihood function Gaussian is the Frequentist equivalent of the Bayesian use of uninformative “priors”, the multiplicative terms needed to convert the likelihood into a quantitatively valid density estimate. The plots in Figure 6 are therefore as much a test of the probability theory as they are a test of the accuracy of the inverse parameter covariance. This also suggests that we cannot improve the conventional methods by using higher-order statistics to better model the likelihood function, as the estimates of parameter density arising from the theory are simply wrong unless the likelihood is Gaussian, i.e. we cannot treat a likelihood function as though it were already a density over parameters.

At first glance the estimate of uncertainty shown for pattern B in Figure 7(b) might seem completely reasonable when we take into account its location in Figure 3(a). However, this location corresponds to a very low sample density, much like the data for the binomial example shown in Figure 2(a), yet it does not show the same degree of variation. It is more tightly grouped because the ANN is a parametric fit.

Sigmoid functions have no choice but to be stable at the extremes of the input variables, where their outputs saturate. Unfortunately, this behaviour may not necessarily be justifiable. For example, if we consider the very common assumption that the underlying class distributions are Gaussian, then for two one-dimensional densities with unequal widths, the conditional probability at the extremes converges on outputs favouring the broader distribution (Figure 8).

Figure 8: The conditional probability of classification (yellow) is computed for two classes (A, blue, and B, red) with Gaussian distributions of unequal width. An ANN trained with small numbers of samples from these distributions is likely to model the decision boundary using a sigmoid (green). Future test samples in the affected region will then be classified in the same incorrect way, regardless of any expected uncertainty in the location or scale of the sigmoid. The estimated epistemic error for such data (a few percent of all samples) would approach zero, even though the decision is always categorically wrong.
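The effect in Figure 8 can be reproduced numerically; the class means, widths and the illustrative fitted sigmoid below are our own choices, not the values used for the figure.

```python
import numpy as np
from scipy.stats import norm

# Two one-dimensional class densities with unequal widths (equal priors).
x = np.linspace(-10.0, 10.0, 2001)
p_A = norm.pdf(x, loc=-1.0, scale=1.0)    # narrow class A
p_B = norm.pdf(x, loc=+1.0, scale=3.0)    # broad class B
post_A = p_A / (p_A + p_B)                # true conditional probability P(A|x)

# The true posterior for the narrow class falls back towards zero far from
# its mean, because the broad class dominates both tails, whereas a fitted
# sigmoid saturates and keeps assigning class A on one side forever.
sigmoid = 1.0 / (1.0 + np.exp(1.5 * x))   # an illustrative fitted sigmoid
print(post_A[x <= -8.0].max())            # small: the broad class wins here
print(sigmoid[x <= -8.0].min())           # but the sigmoid stays near 1
```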

These issues are not unique to ANNs; they also occur in familiar statistical methods, for example both linear and logit regression are capable of generating nonsense if misused. We may be satisfied to make these assumptions where we know something about the input data properties. Biological factors regarding changes in brain structure due to disease (loss of tissue in specific anatomical regions) would lead us to expect the diagnostic classification to behave as modelled here. However, if this were a “black box” application, with the same distribution of training data, we would have no right simply to expect such assumptions to be true (footnote: this example appears to us to contradict the assertion that uncertainties for mis-specified models can be accommodated when using NIC [3]).

At this point it is instructive to consider a simpler pattern recognition system, called Linear Poisson Modelling (LPM) [54], to emphasise the importance of model selection and quality control. LPMs approximate probability densities non-parametrically (i.e. with a minimum of functional restrictions) for systems which are known to be modellable as linear combinations of probability mass functions. The associated “white box” analysis [57] supports estimation of both epistemic and aleatoric parameter uncertainties. To be applied correctly, the theory makes it clear that training data must take the form of histogram bins with independent Poisson samples. The uncertainties on model coefficients are computed using error propagation, assuming that sufficiently large Poisson sample perturbations can be approximated with a Gaussian covariance matrix. These assumptions can be, and often are, violated in real data. Consequently, whilst histograms appear ubiquitously within science, it has proven challenging to apply LPMs ‘out-of-the-box’. A mass spectrum, for example, is a type of histogram for ion counts. However, what is recorded are noisy step-changes in voltages subjected to non-Poisson instrumentation noise. LPMs have been successfully applied to such data [15], but only with significant pre-processing and calibration [55]. Tools such as Bland-Altman analysis are needed to confirm noise distributions; pull-distributions check that predicted uncertainties match those observed; and simulation and ground-truth datasets are used to corroborate all model assumptions.

Without a linear data generator and statistical conformity, predicted uncertainties on LPM derived measurements can be orders of magnitude away from reality. The current trend within ANN research has been to present raw data and tune parameters until an “acceptable” empirical result has been achieved on a test dataset. We should not expect that a “black box” ANN would reliably identify and solve complex problems within data, especially when a comparatively simple method requires such careful application.

Fundamentally, our statistical analysis of uncertainty is based upon the assumption that the implied non-linear ANN model is valid, when in fact it is only an interpolating approximation of the training data. If we cannot trust the non-linear model, then the computed uncertainties away from the training data represent a best case, i.e. a statistical variation around a biased output. In safety critical situations we might be better off using less stable parametric (or non-parametric) models, so that uncertainty estimates reflect the limiting information content of the training data. The requirement to understand uncertainty in safety critical tasks therefore has implications for how we might choose ANN architectures.

General Conclusions

Regarding the general theory of uncertainty, standard results for Jeffreys priors give near identical results to the Frequentist approach of mapping the likelihood function onto a Gaussian. The main difference, however, is that whilst general solutions to the construction of Jeffreys priors are highly complicated and impractical for complex non-linear systems, we have shown in this work that an approximation to a general solution for the Frequentist approach is numerically feasible. It also embodies a pragmatic technique for the better approximation of uncertainties using covariance matrices.

Now that we have a working system for the estimation of epistemic ANN output uncertainty we can deduce several fundamental issues regarding the feasibility of obtaining a successful uncertainty assessment:

  • In order to apply the parameter-remapping technique to the estimation of parameter uncertainty, the likelihood function needs to be fully optimised. Although it is possible to deal with a cost function which has a plateau on one side of the optimum, a partially optimised function can provide no information regarding the constraint (estimation error) on the associated parameter. Such issues have been reported previously  [3]. This has implications for training methods which employ early stopping, i.e. it will be logically impossible to make any meaningful prediction regarding uncertainty.

  • In order to compute numerically stable estimates of the Mahalanobis distance, it is necessary to apply SVD for the extraction of eigenvectors from the re-mapped parameter covariance (a sketch of this step follows the list). It may be impossible to estimate the Mahalanobis distance to sufficient precision to work with even hundreds of network parameters unless some other way can be found either to deal with ill-conditioned matrices or to remove parameter correlation from the ANN design.

  • The mapping process also works best if the cost function is smooth, with continuous derivatives, as discontinuities could pose numerical problems. This has implications for common processing acceleration approaches which use ReLU transfer functions.

  • The mapping approach also has more difficulty as the degree of non-linearity is increased, for example when additional non-linear hidden layers are added to the ANN. Also, as architectural complexity increases, the additional degrees of freedom reduce the average statistical information available to constrain parameters (as described in  [43]). This in turn accentuates any non-linearities, by allowing the expected parameter variations to span a larger range of values. As a consequence, large dimensional input spaces and small datasets, which lead to over-fitting, will reduce the accuracy of any Gaussian approximation. This has implications for the assessment of uncertainty when using deep learning.

  • Any statistical assessment of uncertainty is based upon the implicit non-linear ANN model being valid. As this cannot generally be expected to be true in regions away from the training data, the estimated degree of variation is strictly only a lower bound.
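The following is a minimal sketch (with hypothetical data) of the SVD-based computation referred to in the second item above: parameter-space directions of the re-mapped covariance with negligible singular values are discarded so that an ill-conditioned inverse does not destabilise the Mahalanobis distance.

```python
import numpy as np

def mahalanobis_svd(delta, cov, rel_tol=1e-8):
    """Mahalanobis distance of a parameter offset `delta` under `cov`,
    restricted to the well-conditioned subspace of the covariance."""
    U, s, _ = np.linalg.svd(cov, hermitian=True)
    keep = s > rel_tol * s[0]            # drop ill-conditioned directions
    z = U[:, keep].T @ delta             # rotate into the retained eigen-basis
    return np.sqrt(np.sum(z ** 2 / s[keep]))

rng = np.random.default_rng(1)
P = rng.normal(size=(6, 4))
cov = P @ P.T + 1e-12 * np.eye(6)        # nearly rank-deficient covariance
delta = rng.normal(size=6)               # offset of parameters from the optimum
print(mahalanobis_svd(delta, cov))
```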

The above issues have obvious implications for the application of deep learning to safety critical tasks. However, they can be mitigated to some extent by choosing appropriate input data representations, rather than simply using raw data, thereby reducing the need for large networks, much as was common practice  [36, 56, 11] before the advent of Big Data.

References

  • [1] H. Akaike (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control 19 (6), pp. 716–723. Cited by: §2.2.
  • [2] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics 8 (3), pp. 292. Cited by: §1.
  • [3] U. Anders and O. Korn (1999) Model selection in neural networks. Neural Networks 12, pp. 309–323. Cited by: §1.1, 1st item, footnote 9.
  • [4] F. J. Anscombe (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika 35 (3–4), pp. 246–254. Cited by: Appendix B.
  • [5] F. J. Anscombe (1964) Normal likelihood functions. Annals of the Institute of Statistical Mathematics 16 (1), pp. 1–19. Cited by: §1.1.2, footnote 4.
  • [6] D. Barber and C. M. Bishop (1998) Ensemble learning in Bayesian neural networks. Nato ASI Series F Computer and Systems Sciences 168, pp. 215–238. Cited by: §1.
  • [7] O. E. Barndorff-Nielsen (1986) Inference on full or partial parameters based on the standardized signed log likelihood ratio. Biometrika 73 (2), pp. 307–322. Cited by: §1.1.1.
  • [8] C. M. Bishop (1994) Novelty detection and neural network validation. IEE Proceedings - Vision, Image and Signal processing 141 (4), pp. 217–222. Cited by: §1.
  • [9] C. M. Bishop (1995) Neural networks for pattern recognition. Oxford University Press. Cited by: §1.1.2.
  • [10] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol. 37, pp. 1613–1622. Cited by: §1.
  • [11] P. A. Bromiley, N. Thacker, and P. Courtney (2001) Colour image segmentation by non-parametric density estimation in colour space. In Proceedings of the British Machine Vision Conference, pp. 283–292. Cited by: §4.
  • [12] H. Cramér (1946) Mathematical methods of statistics. Princeton University Press. Cited by: §1.1.2.
  • [13] G. S. Datta and R. Mukerjee (2012) Probability matching priors: higher order asymptotics. Lecture Notes in Statistics, Vol. 178, Springer Science & Business Media. Cited by: §1.1.2.
  • [14] A.P. Dawid (1986) Probability forecasting. In Encyclopedia of Statistical Sciences, Vol. 7, pp. 210–218. Cited by: §1.1.2, §1.
  • [15] S. Deepaisarn, P. Tar, N. A. Thacker, A. Seepujak, and A. McMahon (2017) Quantifying biological samples using linear Poisson independent component analysis for MALDI-ToF mass spectra. Bioinformatics 34 (6), pp. 1001–1008. Cited by: §4.3.
  • [16] T. J. DiCiccio and M. A. Martin (1993) Simple modifications for signed roots of likelihood ratio statistics. Journal of the Royal Statistical Society: Series B (Methodological) 55 (1), pp. 305–316. Cited by: §1.1.2, §1.2.1.
  • [17] D. L. Dowe (2011) MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness. In Philosophy of Statistics, pp. 901–982. Cited by: footnote 3.
  • [18] R. A. Fisher (1915) Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10 (4), pp. 507–521. Cited by: footnote 5.
  • [19] R. A. Fisher (1921) On the ‘probable error’ of a coefficient of correlation deduced from a small sample. Metron 1, pp. 1–32. Cited by: footnote 5.
  • [20] R. A. Fisher (1956) Statistical methods and scientific inference.. Hafner Publishing Co.. Cited by: §1.1.1.
  • [21] Y. Gal and Z. Ghahramani (2016) Bayesian convolutional neural networks with Bernoulli approximate variational inference. In Proceedings of the International Conference on Learning Representations (ICLR), Workshop Track, arXiv preprint arXiv:1506.02158. Cited by: §1.
  • [22] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research (PMLR), Vol. 48, pp. 1050–1059. Cited by: §1, §1.
  • [23] Y. Gal, J. Hron, and A. Kendall (2017) Concrete dropout. In Advances in neural information processing systems 30 (NIPS 2017), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3581–3590. Cited by: §1.
  • [24] M. Ghosh (2011) Objective priors: an introduction for frequentists. Statistical Science 26 (2), pp. 187–202. Cited by: §1.1.1, §1.1.2, §1.1.
  • [25] A. Graves (2011) Practical variational inference for neural networks. In Advances in neural information processing systems 24 (NIPS 2011), J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2348–2356. Cited by: §1.
  • [26] V. C. Hachinski, L. D. Iliff, E. Zilhka, G. H. Du Boulay, V. L. McAllister, J. Marshall, R. W. Ross Russell, and L. Symon (1975) Cerebral blood flow in dementia. Archives of neurology 32 (9), pp. 632–637. Cited by: §2.1.
  • [27] R. M. Haralick (1989) Performance assessment of near-perfect machines. Machine vision and applications 2 (1), pp. 1–16. Cited by: §1.
  • [28] G. E. Hinton and D. Van Camp (1993) Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory, pp. 5–13. Cited by: §1.
  • [29] E. Horel and K. Giesecke (2019) Towards explainable AI: significance tests for neural networks. arXiv preprint arXiv:1902.06021. Cited by: §1.
  • [30] P. Hougaard (1982) Parametrizations of non-linear models. Journal of the Royal Statistical Society: Series B (Methodological) 44 (2), pp. 244–252. Cited by: §1.1.2.
  • [31] H. Jeffreys (1946) An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 186 (1007), pp. 453–461. Cited by: §1.1.1, §1.1, §1.2.1.
  • [32] H. Jeffreys (1961) The theory of probability. 3rd edition, Oxford Classic Texts in the Physical Sciences, Oxford University Press. Cited by: §1.1, §1.2.1, footnote 8.
  • [33] J. L. Jensen (1986) Similar tests and the standardized log likelihood ratio statistic. Biometrika 73 (3), pp. 567–572. Cited by: §1.1.2.
  • [34] R. E. Kass and L. Wasserman (1996) The selection of prior distributions by formal rules. Journal of the American Statistical Association 91 (435), pp. 1343–1370. Cited by: §1.1, §1.2.1.
  • [35] A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30 (NIPS 2017), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5574–5584. Cited by: §1, §1.
  • [36] A. J. Lacey, N. A. Thacker, and N. L. Seed (1995) Smart feature detection using an invariance network architecture. In Proceedings of the British Machine Vision Conference, pp. 327–336. Cited by: §4.2, §4.
  • [37] J. A. Leonard, M. A. Kramer, and L. H. Ungar (1992) A neural network architecture that computes its own reliability. Computers & chemical engineering 16 (9), pp. 819–835. Cited by: §1.
  • [38] J. A. Leonard, M. A. Kramer, and L. H. Ungar (1992) Using radial basis functions to approximate a function and its error bounds. IEEE transactions on neural networks 3 (4), pp. 624–627. Cited by: §1.
  • [39] D. Leslie (2019) Understanding artificial intelligence ethics and safety: a guide for the responsible design and implementation of AI systems in the public sector. https://doi.org/10.5281/zenodo.3240529. Cited by: §4.2.
  • [40] The Lund and Manchester Groups (1994) Clinical and neuropathological criteria for frontotemporal dementia. Journal of Neurology, Neurosurgery & Psychiatry 57 (4), pp. 416–418. External Links: https://jnnp.bmj.com/content/57/4/416.full.pdf Cited by: §2.1.
  • [41] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Cited by: §1.
  • [42] G. McKhann, D. Drachman, M. Folstein, R. Katzman, D. Price, and E. M. Stadlan (1984) Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA work group under the auspices of the health and human services task force on Alzheimer’s disease. Neurology 34 (7), pp. 939–944. External Links: Document Cited by: §2.1.
  • [43] N. Murata, S. Yoshizawa, and S. Amari (1994) Network information criterion - determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks 5 (6), pp. 865–872. Cited by: §1.1, §1, 4th item, §4.3.
  • [44] D. Neary, S. Snowden, L. Gustafson, U. Passant, D. Stuss, S. Black, M. Freedman, A. Kertesz, P. H. Robert, M. Albert, K. Boone, B. L. Miller, J. Cummings, and D. F. Benson (1998) Frontotemporal lobar degeneration: a consensus on clinical diagnostic criteria. Neurology 51 (6), pp. 1546–1554. Cited by: §2.1.
  • [45] E. Parzen (1962) On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33 (3), pp. 1065–1076. Cited by: §1.
  • [46] H. W. Peers (1965) On confidence points and Bayesian probability points in the case of several parameters. Journal of the Royal Statistical Society: Series B (Methodological) 27 (1), pp. 9–16. Cited by: §1.1.
  • [47] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992) Numerical recipes in c (2nd ed.): the art of scientific computing. Cambridge University Press. Cited by: §1.1.1, §2.2.
  • [48] C. E. Rasmussen and C. K. Williams (2006) Gaussian processes for machine learning. MIT press Cambridge, MA. Cited by: §1.
  • [49] M. D. Richard and R. P. Lippmann (1991) Neural network classifiers estimate Bayesian a posteriori probabilities. Neural computation 3 (4), pp. 461–483. Cited by: §1.2.2, §1.
  • [50] M. Riedmiller and H. Braun (1993) A direct adaptive method for faster back-propagation learning: the RPROP algorithm. In Proceedings of the IEEE international conference on neural networks, Vol. 1993, pp. 586–591. Cited by: §2.2.
  • [51] Royal Society Working Group (2017) Machine learning: the power and promise of computers that learn by example. Technical report Royal Society. Cited by: §1.
  • [52] J. Snowden, D. Neary, and D. Mann (1996) Fronto-temporal lobar degeneration : fronto-temporal dementia, progressive aphasia, semantic dementia. Clinical neurology and neurosurgery monographs, New York ; Edinburgh: Churchill Livingstone. Cited by: §2.1.
  • [53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §1.
  • [54] P. Tar and N. Thacker (2014) Linear Poisson models: a pattern recognition solution to the histogram composition problem. Annals of the BMVA 2014 (1), pp. 1–22. Cited by: §4.3.
  • [55] N. A. Thacker, P. D. Tar, A. P. Seepujak, and J. D. Gilmour (2018) The statistical properties of raw and preprocessed ToF mass spectra. International Journal of Mass Spectrometry 428, pp. 62–70. Cited by: §4.3.
  • [56] N. A. Thacker, A. R. Varma, D. Bathgate, S. Stivaros, J. S. Snowden, D. Neary, and A. Jackson (2002) Dementing disorders: volumetric measurement of cerebrospinal fluid to distinguish normal from pathologic findings - feasibility study. Radiology 224 (1), pp. 278–285. Cited by: §2.1, §2.1, §4.2, §4.2, §4.
  • [57] N. Thacker, A. Clark, J. Barron, R. Beveridge, P. Courtney, W. Crum, V. Ramesh, and C. Clark (2008) Performance characterisation in computer vision: a guide to best practices. Computer Vision and Image Understanding 109, pp. 305–334. Cited by: §4.2, §4.3.
  • [58] B. Welch and H. Peers (1963) On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society: Series B (Methodological) 25 (2), pp. 318–329. Cited by: §1.1.1, §1.1.1, §1.1.2, §1.1.2, §1.1.
  • [59] A. Winterbottom (1979) A note on the derivation of Fisher’s transformation of the correlation coefficient. The American Statistician 33 (3), pp. 142–143. Cited by: §1.1.2.

Appendix A Iteration of Jeffreys Priors

In order to obtain the matching equation for the prior, we assumed that we could manipulate the prior so as to obtain a better fit, near the optimum, between the posterior and a unit Gaussian. The result shows that the Jeffreys prior gives the first-order approximate answer to mapping the Likelihood to a Gaussian. The Jeffreys prior can hence be used to define a mapping of (log-)likelihood functions for some range of parameter values about an optimum thus:

where we now consider the total (log) Likelihood and the total Fisher Information function. This Jeffreys-prior-based mapping procedure gives us an order-preserving and optimum-preserving mapping between Likelihood functions. The derivation above showed us that one application of this Jeffreys mapping gave us a better approximation to a Gaussian Likelihood. It is straightforward to see that a unit Gaussian is a fixed point of this Jeffreys mapping, since:

If we take a Gaussian with a different scale:

(23)

so we see that it maps to a unit Gaussian after one iteration.
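As a concrete sketch of this step, assuming the mapping takes the standard Fisher-information (variance-stabilising) form, a Gaussian likelihood of width $\sigma$ gives:

```latex
% Sketch only: assumes the Jeffreys mapping is the standard
% Fisher-information (variance-stabilising) reparameterisation.
\begin{align*}
  -\log L(\theta) &= \frac{(\theta-\hat{\theta})^{2}}{2\sigma^{2}}, &
  I(\theta) &= -\frac{\partial^{2}\log L}{\partial\theta^{2}} = \frac{1}{\sigma^{2}}, \\
  \psi(\theta) &= \int_{\hat{\theta}}^{\theta}\sqrt{I(\theta')}\,\mathrm{d}\theta'
              = \frac{\theta-\hat{\theta}}{\sigma}, &
  -\log L(\psi) &= \frac{\psi^{2}}{2}.
\end{align*}
```

The rescaled likelihood is therefore a unit Gaussian, consistent with the statement above.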

Let us consider a perturbation to a Gaussian. In terms of the log-likelihood, we consider a cubic perturbation to the quadratic Gaussian log-likelihood. After some algebra, we find that:

(24)

This means that, in terms of the log-likelihood function, the magnitude of the coefficient of the perturbation term has been halved (a worked version of this algebra is sketched below). We hence see that the Gaussian fixed point for the Jeffreys mapping has a non-empty basin of attraction.
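A minimal reconstruction of that algebra, assuming the same Fisher-information mapping as above and writing the perturbed log-likelihood as $-\log L(\theta)=\theta^{2}/2+\epsilon\,\theta^{3}$ (with the unperturbed Gaussian taken to have unit width), is:

```latex
% Sketch only: correct to first order in the perturbation coefficient epsilon.
\begin{align*}
  I(\theta) &= 1 + 6\epsilon\theta, &
  \psi(\theta) &= \int_{0}^{\theta}\sqrt{1+6\epsilon\theta'}\,\mathrm{d}\theta'
              \approx \theta + \tfrac{3}{2}\epsilon\theta^{2}, \\
  \theta(\psi) &\approx \psi - \tfrac{3}{2}\epsilon\psi^{2}, &
  -\log L(\psi) &\approx \frac{\psi^{2}}{2} - \frac{\epsilon}{2}\psi^{3}
               + \mathcal{O}(\epsilon^{2}),
\end{align*}
```

so one application of the mapping reduces the magnitude of the cubic coefficient from $\epsilon$ to $\epsilon/2$.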

Appendix B Dimensionality Reduction for Volumetric MR Data

It was assumed that the normal development of atrophy in any normalised and pooled variables would be of the form

Thus we can construct a new variable representing the proportion of atrophy relative to that expected at a particular age.

Variables which combined Anscombe transforms  [4] and partial orthonormality were selected as follows (an illustrative sketch of this construction is given after the list):

  • The age corrected relative degree of atrophy between the middle and front of the CSF space.

  • The age corrected relative degree of atrophy between the middle and back of the CSF space.

  • The age corrected relative degree of total atrophy.

  • The age corrected relative degree of atrophy between the left and right sides of the CSF space.

  • The age corrected relative degree of atrophy between the top and bottom of the CSF space.
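The following is a purely illustrative sketch of how one such variable might be assembled from Anscombe-transformed regional CSF volumes and a linear age model; the measurement names, the age model, and its coefficients are hypothetical and are not those used in the study.

```python
import numpy as np

def anscombe(counts):
    """Variance-stabilising Anscombe transform for count-like data."""
    return 2.0 * np.sqrt(np.asarray(counts, dtype=float) + 3.0 / 8.0)

def age_corrected_relative_atrophy(front_csf, middle_csf, age,
                                   slope=0.05, intercept=1.0):
    """Relative atrophy between two CSF regions, expressed as a proportion
    of the atrophy expected at the subject's age (hypothetical linear model)."""
    relative = anscombe(middle_csf) - anscombe(front_csf)
    expected_at_age = intercept + slope * age
    return relative / expected_at_age

# Hypothetical regional CSF volumes (arbitrary units) for a 70 year old.
print(age_corrected_relative_atrophy(front_csf=1200, middle_csf=1500, age=70))
```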