1 Introduction
Alzheimer’s disease (AD) is the most common form of dementia and highly prevalent among elderly individuals. As societies around the world are ageing, the number of patients affected is rapidly increasing. AD is associated with an enormous disease burden in terms of morbidity, mortality, and financial expense. It impairs basic bodily functions such as walking and swallowing, has a devastating impact on memory and cognition, and is ultimately fatal. The costs of care for individuals with AD or other dementias are already enormous Alzheimer’s Association [2017], dementia being one of the costliest conditions to society Hurd et al. [2013].
Since AD progresses over time, early accurate diagnosis and precise clinical monitoring are essential. Yet, in daily clinical routine, AD assessment usually relies on subjective clinical interpretation, and typically only once there is a strong prior that the disease has already progressed.
To date, no low-cost, non-invasive biomarkers have been established to make diagnosis and disease-progression assessment more objective. For wide use, such markers should neither rely on expensive equipment (e.g., scanners for magnetic resonance imaging (MRI) or positron emission tomography (PET)) nor require invasive procedures (injection of a positron emitter for PET, or lumbar puncture for analysis of cerebrospinal fluid). Properties like non-invasiveness, cost-effectiveness, and wide availability make electroencephalography (EEG) a candidate modality for such markers. Various studies have shown associations between AD and quantitative EEG (QEEG) markers of slowing, complexity, and functional connectivity (for reviews, see Dauwels et al. [2010], Jeong [2004]).
Here, we utilize Bayesian neural networks Neal [2012] to develop a multivariate predictor of AD severity, as measured by the Mini-Mental State Examination (MMSE). We use 820 distinct QEEG markers as features, all belonging to the aforementioned categories of slowing, complexity, and functional connectivity.
Bayesian techniques allow predictions to be adjusted through varying priors. Therefore, potential variations in diagnostic settings (distinct locations) or in disease prevalence (varying age groups) can be accounted for. Furthermore, Bayesian approaches provide probabilistic predictions that quantify predictive uncertainty, which is of utmost importance for medical practitioners. Finally, the Bayesian treatment of neural networks prevents overfitting to sparse data sets, as typically found in clinical neuroscience.
To demonstrate the potential of Bayesian deep learning techniques, we investigate the performance of two different Bayesian approaches, i.e., Monte Carlo dropout Gal and Ghahramani [2016] (MC dropout BNN) and Hamiltonian Monte Carlo Neal et al. [2011] (HMC BNN), and compare their results to those of traditional (non-Bayesian) neural networks (NN). All models use identical QEEG features from one of the largest prospective AD EEG trials ever conducted.
2 Materials and Methods
2.1 Experimental data
One hundred and eighty-eight AD patients (133 with probable and 55 with possible AD diagnosis according to NINCDS-ADRDA criteria; 100 females; mean age 74.86 ± 8.06 years standard deviation (SD); mean MMSE score 22.84 ± 3.70 SD; mean years of education 11.14 ± 3.04 SD; mean duration of illness 26.88 ± 24.11 months SD; 115 (61.17%) with anti-dementia medication) were considered for this investigation. They were recruited prospectively at the tertiary-referral memory clinics of the neurological departments at the Medical Universities of Graz, Innsbruck and Vienna, as well as Linz General Hospital, as part of the PRODEM (Prospective Dementia Database Austria) cohort study. Disease severity was assessed using the MMSE Folstein et al. [1975]. Continuous EEG (alpha-trace EEG recorder, 10–20 electrode placement) was analyzed for an eyes-closed (180 sec) and an eyes-open (180 sec) resting condition. For details on the entire PRODEM experimental protocol and preprocessing pipeline, see Waser et al. [2016], Garn et al. [2014], Fruehwirt et al. [2017].
2.2 Feature generation
In total, 820 QEEG markers were computed, covering various measures of slowing (absolute band power, relative band power, center frequency), complexity (auto-mutual information, Shannon entropy, Tsallis entropy), and functional connectivity (coherence, partial coherence, phase coherence, canonical correlation, dynamic canonical correlation, Granger causality, conditional Granger causality, cross-mutual information), as well as two resting-state conditions (eyes-closed, eyes-open), various brain regions Garn et al. [2015], Dauwels et al. [2010], and multiple frequency bands Garn et al. [2015].
Features were normalized and projected into a lower-dimensional space (five components) using principal component analysis before being passed to the models.
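The normalization and projection step can be sketched as follows, assuming z-score normalization and an SVD-based PCA (the function name and random data are illustrative, not from the study):

```python
import numpy as np

def pca_project(features, n_components=5):
    """Standardize features, then project onto the leading
    principal components computed via SVD (illustrative sketch)."""
    mu = features.mean(axis=0)
    sd = features.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features
    z = (features - mu) / sd
    # rows of vt are the principal axes of the standardized data
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:n_components].T

rng = np.random.default_rng(0)
x = rng.normal(size=(188, 820))   # 188 patients x 820 QEEG markers
scores = pca_project(x)
print(scores.shape)               # (188, 5)
```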
2.3 Bayesian neural networks
In this paper we use Bayesian neural networks (BNNs) as our model for regression MacKay [1992], Neal [2012]. These are neural networks where we define the prior over the weights of each layer as a product of multivariate normal distributions,

p(W) = \prod_i \mathcal{N}(W_i; 0, l^{-2} I)

(where l is the prior length-scale), and the Gaussian likelihood as p(y \mid x, W) = \mathcal{N}(y; f^W(x), \sigma^2), where \sigma^2 is the noise variance. Assigning probability distributions over the weights leads to a function approximation that is itself a distribution. Therefore, we are also provided with important uncertainty estimates for our predictions.
In the presence of data X, Y, we can perform inference to get the posterior distribution over the weights,

p(W \mid X, Y) = \frac{p(Y \mid X, W)\, p(W)}{p(Y \mid X)},   (1)

which requires the normalizing distribution, otherwise known as the marginal likelihood,

p(Y \mid X) = \int p(Y \mid X, W)\, p(W)\, dW.   (2)

The inferred posterior distribution is then used to form the predictive distribution over a new test point x^*,

p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, W)\, p(W \mid X, Y)\, dW.   (3)
The integrals in Equations (2) and (3) are analytically intractable, which is the main bottleneck in BNNs. There are two common solutions for performing these integrations. One is to replace the true posterior with a variational approximation that is either cheap to sample from Gal [2016] or conjugate to the likelihood MacKay [1992]. The other is to use a Markov chain Monte Carlo technique such as Hamiltonian Monte Carlo Neal et al. [2011] to sample from the posterior.
In this work, we make a direct comparison between two BNN implementations:
- MC dropout Gal and Ghahramani [2016], a variational approximation.
- Hamiltonian Monte Carlo Neal et al. [2011], a Markov chain Monte Carlo technique that utilises Hamiltonian dynamics to explore the parameter space of networks.
In addition, we also implement a standard (nonBayesian) neural network as a useful baseline.
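Both Bayesian approaches ultimately replace the intractable integral in Equation (3) with a Monte Carlo average over weight samples; as a sketch (where S and the sampling distribution depend on the method: posterior samples for HMC, dropout-masked weights for MC dropout):

```latex
% Monte Carlo estimate of the predictive distribution in Eq. (3),
% using S weight samples W_s from the (approximate) posterior:
p(y^* \mid x^*, X, Y)
  \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^* \mid x^*, W_s),
  \qquad W_s \sim p(W \mid X, Y).
```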
MC dropout is a variational approximation that applies dropout at test time as well as during training. A dropout mask is drawn from Bernoulli-distributed random variables and sets a proportion of the weights in the network to zero. Drawing a new dropout mask for each forward pass through the network at test time yields samples that represent the predictive distribution.
Hamiltonian Monte Carlo can be used to infer the predictive distribution in Equation (3) by introducing a Hamiltonian,

H(W, r) = U(W) + K(r),   (4)

that defines the total energy. This introduces an additional momentum variable r, which is used to form the kinetic energy K(r). However, our interest lies in the potential energy, as

U(W) = -\log p(Y \mid X, W) - \log p(W).   (5)

Given this Hamiltonian, we define a distribution over the phase space, p(W, r) \propto \exp(-H(W, r)). Therefore, by using Hamiltonian dynamics to sample from this distribution and ignoring the momentum parameters, we can retrieve the posterior distribution over W. For further details, we refer to Neal [2012].
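The dynamics above can be sketched with a basic leapfrog integrator and Metropolis correction; this is a generic HMC sketch on a toy one-dimensional target, not the study's actual sampler, and all function names are illustrative:

```python
import numpy as np

def hmc_sample(grad_log_post, log_post, w0, n_samples=500,
               eps=0.1, n_leapfrog=20, seed=0):
    """Basic HMC: simulate Hamiltonian dynamics with leapfrog steps,
    then accept/reject each trajectory with a Metropolis correction."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    samples = []
    for _ in range(n_samples):
        r = rng.normal(size=w.shape)              # resample momentum
        w_new, r_new = w.copy(), r.copy()
        # leapfrog: dW/dt = r, dr/dt = -dU/dW = grad log posterior
        r_new += 0.5 * eps * grad_log_post(w_new)
        for _ in range(n_leapfrog - 1):
            w_new += eps * r_new
            r_new += eps * grad_log_post(w_new)
        w_new += eps * r_new
        r_new += 0.5 * eps * grad_log_post(w_new)
        # Metropolis step on the total energy H = U + K
        h_old = -log_post(w) + 0.5 * float(r @ r)
        h_new = -log_post(w_new) + 0.5 * float(r_new @ r_new)
        if np.log(rng.uniform()) < h_old - h_new:
            w = w_new
        samples.append(w.copy())
    return np.array(samples)

# Toy target: standard normal "posterior" over a single weight
log_post = lambda w: -0.5 * float(w @ w)
grad = lambda w: -w
draws = hmc_sample(grad, log_post, w0=np.zeros(1))
print(draws.shape)   # (500, 1)
```

In practice the gradient of the log posterior would come from backpropagation through the network, which is feasible here because the data set is small.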
2.4 Metrics for comparison
To compare results across the models we use both the mean squared error (MSE) and the standardized mean squared error (SMSE) Rasmussen and Williams [2006], where for N^* test points we define

MSE = \frac{1}{N^*} \sum_{n=1}^{N^*} (y_n - \mu_n)^2, \qquad SMSE = \frac{1}{N^*} \sum_{n=1}^{N^*} \frac{(y_n - \mu_n)^2}{\sigma_n^2},   (6)

where \mu_n is the predictive mean and \sigma_n^2 the predictive variance for test point x_n. The SMSE is thus the MSE with each squared error normalised by the predictive variance. This is an appropriate metric for regression tasks where we are also interested in the model's predictive variance.
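The two metrics can be computed directly from the predictive means and variances; the values below are made-up toy numbers, not study results:

```python
import numpy as np

def mse(y, mu):
    """Mean squared error over the test points."""
    return float(np.mean((y - mu) ** 2))

def smse(y, mu, var):
    """MSE with each squared error normalised by the model's
    predictive variance at that test point."""
    return float(np.mean((y - mu) ** 2 / var))

y = np.array([22.0, 25.0, 19.0])    # true MMSE scores (toy)
mu = np.array([21.0, 26.0, 18.0])   # predictive means (toy)
var = np.array([4.0, 4.0, 4.0])     # predictive variances (toy)
print(mse(y, mu))        # 1.0
print(smse(y, mu, var))  # 0.25
```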
2.5 Nested crossvalidation and grid search
A two-level nested cross-validation (five outer and five inner folds) was used to determine generalization performance. The nested loop performs a grid search to select the hyperparameter settings with minimal expected generalization error. For the HMC BNN, we optimized the prior weight precision (see Neal [2012] for an intuition behind the weight precision), and for the MC dropout network we optimized the dropout rate, which is directly related to the weight precision (see Gal [2016] for details of the relationship). We also used the nested loop to optimize the dropout rate and early stopping for the standard NN. In addition to these hyperparameters, our grid search covered a range of small neural network architectures, from a single hidden layer up to two hidden layers.
3 Results and Discussion
For each of the five outerfold test sets, we first selected the optimal hyperparameters for each type of network (HMC BNN, MC dropout BNN and NN) based on the performance in the five inner folds of the nested crossvalidation loop. Then, we used the corresponding settings to compute the results in the outer folds, as reported in Table 1.
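The selection procedure can be sketched as a generic nested cross-validation loop; the helper names, the ridge-regression stand-in model, and the toy grid are illustrative assumptions, not the networks or hyperparameters used in the study:

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle indices and split them into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def nested_cv(x, y, fit_eval, grid, k_outer=5, k_inner=5, seed=0):
    """Outer folds estimate generalization error; inner folds pick the
    hyperparameter with the lowest mean inner-fold error.
    `fit_eval(x_tr, y_tr, x_te, y_te, hp)` returns a scalar error."""
    rng = np.random.default_rng(seed)
    outer_errors = []
    for test_idx in kfold_indices(len(y), k_outer, rng):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        inner_folds = kfold_indices(len(train_idx), k_inner, rng)

        def inner_score(hp):
            # grid search uses the outer training set only
            errs = []
            for val in inner_folds:
                tr = np.setdiff1d(np.arange(len(train_idx)), val)
                errs.append(fit_eval(x[train_idx[tr]], y[train_idx[tr]],
                                     x[train_idx[val]], y[train_idx[val]], hp))
            return np.mean(errs)

        best_hp = min(grid, key=inner_score)
        outer_errors.append(fit_eval(x[train_idx], y[train_idx],
                                     x[test_idx], y[test_idx], best_hp))
    return outer_errors

# Toy stand-in model: ridge regression with grid over regularization
def ridge(x_tr, y_tr, x_te, y_te, lam):
    w = np.linalg.solve(x_tr.T @ x_tr + lam * np.eye(x_tr.shape[1]),
                        x_tr.T @ y_tr)
    return float(np.mean((x_te @ w - y_te) ** 2))

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 5))
y = x @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
errs = nested_cv(x, y, ridge, grid=[0.01, 0.1, 1.0])
print(len(errs))  # 5
```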
Overall, HMC BNNs significantly (p < 0.05) outperformed MC dropout BNNs and standard NNs. Differences in model performance were assessed by statistically comparing the squared errors of test-set predictions.
All Bayesian models selected for the outer folds had two hidden layers. The BNNs showed a lower average MSE than the best standard NN across the test sets. This highlights the importance of combining the function-approximating power of neural networks with Bayesian techniques, especially for sparse data sets. In the present study, the data set consisted of 188 patients, which allowed us to directly compute the derivatives in the Hamiltonian. As such, we saw the HMC BNN outperform the MC dropout BNN, as is often seen in practice Gal and Smith [2018]. The comparably high SMSE values for the MC dropout model are due to its underestimation of the predictive uncertainty. This is a common limitation of variational inference techniques, as the approximate posterior is optimised through a mode-seeking objective and can fail to model probability mass far away from the mode. In larger data sets, however, approximate inference (such as MC dropout) is currently the most practical way to implement BNNs.
Overall, we have shown how BNNs can be used to facilitate the development of lowcost, noninvasive markers of AD severity. To the best of our knowledge, this is the first article reporting a BNN approach for building QEEG markers of AD severity.
Table 1: MSE and SMSE per outer-fold test set; the last row (1–5) gives the mean across test sets.

              HMC BNN           MC dropout BNN       NN
Test set    MSE     SMSE      MSE      SMSE       MSE     SMSE
1          12.42    15.21    17.29    1198.36    23.56    n.a.
2          14.95   185.41    13.94    1027.51    24.94    n.a.
3          11.49   144.14    13.40    1030.63    20.39    n.a.
4          14.36   173.73    19.81     708.82    81.21    n.a.
5           7.65   101.62    18.91     775.49    18.45    n.a.
1–5        12.17   124.02    16.67     948.16    33.71    n.a.
Acknowledgements
The PRODEM cohort study has been supported by the Austrian Research Promotion Agency FFG, project no. 827462.
References
 Association [2017] Alzheimer’s Association. 2017 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia, 13(4):325–373, 2017. ISSN 1552-5260. doi: 10.1016/j.jalz.2017.02.001.
 Hurd et al. [2013] M. D. Hurd, P. Martorell, A. Delavande, K. J. Mullen, and K. M. Langa. Monetary costs of dementia in the United States. New England Journal of Medicine, 368(14):1326–1334, 2013. doi: 10.1056/NEJMsa1204629.
 Dauwels et al. [2010] J. Dauwels, F. Vialatte, and A. Cichocki. Diagnosis of Alzheimer’s disease from EEG signals: where are we standing? Current Alzheimer Research, 7(6):487–505, 2010. ISSN 1875-5828.
 Jeong [2004] J. Jeong. EEG dynamics in patients with Alzheimer’s disease. Clinical Neurophysiology, 115(7):1490–1505, 2004. ISSN 1388-2457. doi: 10.1016/j.clinph.2004.01.001.
 Neal [2012] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
 Neal et al. [2011] Radford M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
 Folstein et al. [1975] M. F. Folstein, S. E. Folstein, and P. R. McHugh. Mini-mental state. Journal of Psychiatric Research, 12(3):189–198, 1975. ISSN 0022-3956. doi: 10.1016/0022-3956(75)90026-6.
 Waser et al. [2016] M. Waser, H. Garn, R. Schmidt, T. Benke, P. Dal-Bianco, G. Ransmayr, H. Schmidt, S. Seiler, G. Sanin, F. Mayer, G. Caravias, D. Grossegger, W. Fruhwirt, and M. Deistler. Quantifying synchrony patterns in the EEG of Alzheimer’s patients with linear and nonlinear connectivity markers. Journal of Neural Transmission, 123(3):297–316, 2016. ISSN 1435-1463 (Electronic) 0300-9564 (Linking). doi: 10.1007/s00702-015-1461-x.
 Garn et al. [2014] H. Garn, M. Waser, M. Deistler, R. Schmidt, P. Dal-Bianco, G. Ransmayr, J. Zeitlhofer, H. Schmidt, S. Seiler, G. Sanin, G. Caravias, P. Santer, D. Grossegger, W. Fruehwirt, and T. Benke. Quantitative EEG in Alzheimer’s disease: cognitive state, resting state and association with disease severity. International Journal of Psychophysiology, 93(3):390–397, 2014. ISSN 1872-7697 (Electronic) 0167-8760 (Linking). doi: 10.1016/j.ijpsycho.2014.06.003.
 Fruehwirt et al. [2017] W. Fruehwirt, P. Zhang, M. Gerstgrasser, D. Grossegger, R. Schmidt, T. Benke, P. Dal-Bianco, G. Ransmayr, L. Weydemann, H. Garn, M. Waser, M. Osborne, and G. Dorffner. Bayesian Gaussian Process Classification from Event-Related Brain Potentials in Alzheimer’s Disease, pages 65–75. Springer International Publishing, Cham, 2017. ISBN 978-3-319-59758-4. doi: 10.1007/978-3-319-59758-4_7.
 Garn et al. [2015] H. Garn, M. Waser, M. Deistler, T. Benke, P. Dal-Bianco, G. Ransmayr, H. Schmidt, G. Sanin, P. Santer, G. Caravias, S. Seiler, D. Grossegger, W. Fruehwirt, and R. Schmidt. Quantitative EEG markers relate to Alzheimer’s disease severity in the Prospective Dementia Registry Austria (PRODEM). Clinical Neurophysiology, 126(3):505–513, 2015. ISSN 1388-2457. doi: 10.1016/j.clinph.2014.07.005.
 MacKay [1992] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
 Gal [2016] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
 Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
 Gal and Smith [2018] Yarin Gal and Lewis Smith. Idealised Bayesian Neural Networks Cannot Have Adversarial Examples: Theoretical and Empirical Study. arXiv preprint arXiv:1806.00667, 2018.