Bayesian deep neural networks for low-cost neurophysiological markers of Alzheimer's disease severity

12/12/2018 ∙ by Wolfgang Fruehwirt, et al.

As societies around the world are ageing, the number of Alzheimer's disease (AD) patients is rapidly increasing. To date, no low-cost, non-invasive biomarkers have been established to advance the objectivization of AD diagnosis and progression assessment. Here, we utilize Bayesian neural networks to develop a multivariate predictor for AD severity using a wide range of quantitative EEG (QEEG) markers. The Bayesian treatment of neural networks both automatically controls model complexity and provides a predictive distribution over the target function, giving uncertainty bounds for our regression task. It is therefore well suited to clinical neuroscience, where data sets are typically sparse and practitioners require a precise assessment of the predictive uncertainty. We use data from one of the largest prospective AD EEG trials ever conducted to demonstrate the potential of Bayesian deep learning in this domain, while comparing two distinct Bayesian neural network approaches, i.e., Monte Carlo dropout and Hamiltonian Monte Carlo.




1 Introduction

Alzheimer’s disease (AD) is the most common form of dementia and highly prevalent among elderly individuals. As societies around the world are ageing, the number of patients affected is rapidly increasing. AD is associated with an enormous disease burden regarding morbidity, mortality, and financial expenses. It impairs basic bodily functions such as walking and swallowing, has a devastating impact on memory and cognition, and is ultimately fatal. The costs of care for individuals with AD or other dementias are already enormous Association [2017], dementia being one of the costliest conditions to society Hurd et al. [2013].

Since AD progresses over time, early, accurate diagnosis and precise clinical monitoring are essential. Yet, in daily clinical routine, AD assessment usually relies on subjective clinical interpretation, often once there is already strong evidence that the disease has reached an advanced stage.

To date, no low-cost, non-invasive biomarkers have been established to advance the objectivization of diagnosis and disease progression assessment. For a wide use, such markers should not rely on expensive equipment (e.g., scanners for magnetic resonance imaging (MRI), or positron emission tomography (PET)), or require invasive procedures (injection of a positron emitter (PET), or lumbar puncture (analysis of cerebrospinal fluid)). Properties like non-invasiveness, cost-effectiveness and a wide availability make electroencephalography (EEG) a potential candidate modality for such markers. Various studies have shown associations between quantitative EEG (QEEG) markers of slowing, complexity, and functional connectivity and AD (for reviews, see Dauwels et al. [2010], Jeong [2004]).

Here, we utilize Bayesian neural networks Neal [2012] to develop a multivariate predictor for AD severity, as measured by the Mini-Mental State Examination (MMSE). We use 820 distinct QEEG markers as features, all belonging to the aforementioned categories of slowing, complexity, and functional connectivity.

Bayesian techniques allow for an adjustment of predictions through varying priors. Therefore, potential variations in diagnostic settings (e.g., distinct locations) or in disease prevalence (e.g., varying age groups) can be accounted for. Furthermore, Bayesian approaches provide probabilistic predictions quantifying predictive uncertainty, which is of utmost importance for medical practitioners. Finally, the Bayesian treatment of neural networks prevents over-fitting to sparse data sets as typically found in clinical neuroscience.

To demonstrate the potential of Bayesian deep learning techniques, we investigate the performance of two different Bayesian approaches, i.e., Monte Carlo dropout Gal and Ghahramani [2016] (MC dropout BNN) and Hamiltonian Monte Carlo Neal et al. [2011] (HMC BNN), and compare their results to traditional (non-Bayesian) neural networks (NN). All models use identical QEEG features from one of the largest prospective AD EEG trials ever conducted.

2 Materials and Methods

2.1 Experimental data

One hundred and eighty-eight AD patients (133 with probable, 55 with possible AD diagnosis according to NINCDS-ADRDA criteria; 100 females; mean age 74.86 ± 8.06 (SD) years; mean MMSE score 22.84 ± 3.70 (SD); mean years of education 11.14 ± 3.04 (SD); mean duration of illness 26.88 ± 24.11 (SD) months; 115 (61.17%) with anti-dementia medication) were considered for this investigation. They were recruited prospectively at the tertiary-referral memory clinics of the neurological departments at the Medical Universities of Graz, Innsbruck and Vienna, as well as Linz General Hospital, as part of the PRODEM (Prospective Dementia Database Austria) cohort study. Assessment of disease severity was done using the MMSE Folstein et al. [1975]. Continuous EEG (alpha trace EEG recorder, 10-20 electrode placement) was analyzed for an eyes-closed (180 sec) and an eyes-open resting condition (180 sec). For details on the entire PRODEM experimental protocol and preprocessing pipeline, see Waser et al. [2016], Garn et al. [2014], Fruehwirt et al. [2017].

2.2 Feature generation

In total, 820 QEEG markers were computed considering various measures of slowing (absolute band power, relative band power, center frequency), complexity (auto-mutual information, Shannon entropy, Tsallis entropy), and functional connectivity (coherence, partial coherence, phase coherence, canonical correlation, dynamic canonical correlation, Granger causality, conditional Granger causality, cross-mutual information), as well as two resting state conditions (eyes-closed, eyes-open), various brain regions Garn et al. [2015], Dauwels et al. [2010], and multiple frequency bands Garn et al. [2015].

Features were normalized and projected into a lower-dimensional space (five components) using principal components analysis, before passing the data to the models.
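This preprocessing step can be sketched as follows. The feature matrix and variable names here are illustrative placeholders (the original QEEG data are not public); only the dimensions, 188 patients by 820 markers reduced to five components, follow the text above.

```python
# Sketch of the preprocessing described above: standardize the 820 QEEG
# markers, then project onto five principal components with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(188, 820))            # placeholder: 188 patients x 820 QEEG markers

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per marker
X_low = PCA(n_components=5).fit_transform(X_std)

print(X_low.shape)  # (188, 5)
```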

2.3 Bayesian neural networks

In this paper we use Bayesian neural networks (BNNs) as our model for regression MacKay [1992], Neal [2012]. These are neural networks where we define the prior over the weights $\mathbf{w} = \{\mathbf{W}_i\}_{i=1}^{L}$ for each layer as a product of multivariate normal distributions, $p(\mathbf{w}) = \prod_{i=1}^{L} \mathcal{N}(\mathbf{W}_i; \mathbf{0}, l^{-2}\mathbf{I})$ (where $l$ is the prior length-scale), and the Gaussian likelihood as $p(y \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(y; f^{\mathbf{w}}(\mathbf{x}), \sigma^2)$, where $\sigma^2$ is the noise variance. Assigning probability distributions over the weights leads to a function approximation $f^{\mathbf{w}}(\mathbf{x})$ that is also a distribution. Therefore, we are also provided with important uncertainty estimates of our predictions.

In the presence of data $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, we can perform inference to get the posterior distribution over the weights,

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}, \tag{1}$$

which requires the normalizing distribution, otherwise known as the marginal likelihood,

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})\, \mathrm{d}\mathbf{w}. \tag{2}$$

The inferred posterior distribution is then used to form the predictive distribution over a new test point $\mathbf{x}^*$,

$$p(y^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(y^* \mid \mathbf{x}^*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, \mathrm{d}\mathbf{w}. \tag{3}$$
The integrations in Equations (2) and (3) are analytically intractable, which is the bottleneck in BNNs. To perform these integrations there are two common solutions. One is to replace the true posterior with a variational approximation that is either cheaper to sample from Gal [2016] or conjugate to the likelihood MacKay [1992]. The other is to use a Markov chain Monte Carlo technique such as Hamiltonian Monte Carlo Neal et al. [2011] to sample from the posterior.

In this work, we make a direct comparison between two BNN implementations:

  1. MC dropout Gal and Ghahramani [2016], which is a variational approximation.

  2. Hamiltonian Monte Carlo Neal et al. [2011], which is a Markov chain Monte Carlo technique that utilises Hamiltonian dynamics to explore the parameter space of networks.

In addition, we also implement a standard (non-Bayesian) neural network as a useful baseline.

MC dropout is a variational approximation which applies dropout at test time of a neural network, as well as during training. A dropout mask is drawn from Bernoulli-distributed random variables and sets a proportion of the weights in the network to zero. Drawing a new dropout mask for each forward pass through the network at test time yields samples that together represent an approximation to the predictive distribution.
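The MC dropout prediction step can be sketched in a few lines. The toy two-layer network and its random weights below are assumptions for illustration, not the paper's trained model; what matters is that the Bernoulli mask stays active at test time and the T stochastic forward passes give a predictive mean and variance.

```python
# Minimal MC dropout sketch: dropout stays on at test time; repeated
# stochastic forward passes approximate the predictive distribution.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 32)), np.zeros(32)  # toy input->hidden weights
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)   # toy hidden->output weights
p_drop = 0.5                                     # dropout probability

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)             # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop         # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)                # inverted-dropout scaling
    return h @ W2 + b2

x = rng.normal(size=(1, 5))                      # one five-component feature vector
samples = np.array([forward(x) for _ in range(200)])  # T = 200 stochastic passes
pred_mean, pred_var = samples.mean(axis=0), samples.var(axis=0)
```

The sample mean serves as the point prediction and the sample variance as the model's predictive uncertainty at that input.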

Hamiltonian Monte Carlo can be used to infer the predictive distribution in Equation (3) by introducing a Hamiltonian,

$$H(\mathbf{w}, \mathbf{p}) = U(\mathbf{w}) + K(\mathbf{p}), \tag{4}$$

that defines the total energy. This introduces an additional momentum variable $\mathbf{p}$, which is used to form the kinetic energy $K(\mathbf{p}) = \frac{1}{2}\mathbf{p}^{\top}\mathbf{M}^{-1}\mathbf{p}$. However, our interest lies in the potential energy, as

$$U(\mathbf{w}) = -\log\left[p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})\right]. \tag{5}$$

Given this Hamiltonian, we define a distribution over the phase space $(\mathbf{w}, \mathbf{p})$, with $p(\mathbf{w}, \mathbf{p}) \propto \exp(-H(\mathbf{w}, \mathbf{p}))$. Therefore, by using the Hamiltonian dynamics to sample from this phase-space distribution and ignoring the momentum parameters, we can retrieve the posterior distribution over $\mathbf{w}$. For further details, we refer to Neal [2012].
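A toy HMC sampler illustrates the mechanics. As an assumption for illustration, the potential energy here is a standard Gaussian stand-in rather than the negative log-posterior of a network; the leapfrog integration and Metropolis accept step are the same in either case.

```python
# Toy HMC sketch: sample from p(w) ∝ exp(-U(w)) via leapfrog dynamics plus a
# Metropolis correction. U is a stand-in Gaussian potential, not the paper's
# network posterior.
import numpy as np

def U(w):      return 0.5 * np.sum(w ** 2)       # potential energy, -log p(w)
def grad_U(w): return w                          # its gradient

def hmc_step(w, rng, eps=0.1, n_leapfrog=20):
    p = rng.normal(size=w.shape)                 # resample momentum
    w_new = w.copy()
    p_new = p - 0.5 * eps * grad_U(w)            # initial half kick
    for _ in range(n_leapfrog):                  # leapfrog integration
        w_new = w_new + eps * p_new
        p_new = p_new - eps * grad_U(w_new)
    p_new = p_new + 0.5 * eps * grad_U(w_new)    # undo half of the last kick
    h_old = U(w) + 0.5 * np.sum(p ** 2)          # total energy before
    h_new = U(w_new) + 0.5 * np.sum(p_new ** 2)  # total energy after
    return w_new if rng.random() < np.exp(h_old - h_new) else w

rng = np.random.default_rng(2)
w = np.zeros(3)
samples = []
for _ in range(2000):
    w = hmc_step(w, rng)
    samples.append(w)
samples = np.array(samples)
```

Discarding a burn-in prefix, the retained samples approximate draws from the target distribution over w; in the paper's setting that target would be the weight posterior of Equation (1).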

2.4 Metrics for comparison

To compare results across the models we use both the mean squared error (MSE) and the standardized mean squared error (SMSE) Rasmussen and Williams [2006], where for $N^*$ test points, we define

$$\mathrm{MSE} = \frac{1}{N^*} \sum_{n=1}^{N^*} \left(y_n - \bar{f}(\mathbf{x}_n)\right)^2, \qquad \mathrm{SMSE} = \frac{1}{N^*} \sum_{n=1}^{N^*} \frac{\left(y_n - \bar{f}(\mathbf{x}_n)\right)^2}{\sigma_n^2}. \tag{6}$$

The SMSE is the MSE normalised by the predictive variance $\sigma_n^2$, and we define the predictive mean for a given data point as $\bar{f}(\mathbf{x}_n)$. This is an appropriate metric for regression tasks, where we are also interested in the model's predictive variance.
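The two metrics can be computed directly from a model's predictive means and variances. The four test points below are made-up illustrative values, not results from the study.

```python
# Sketch of the two metrics: MSE of the predictive mean, and SMSE with each
# squared error normalised by the model's predictive variance at that point.
import numpy as np

y_true    = np.array([22.0, 25.0, 19.0, 28.0])   # illustrative test MMSE scores
pred_mean = np.array([21.0, 24.0, 21.0, 27.0])   # predictive means
pred_var  = np.array([4.0, 2.0, 5.0, 3.0])       # predictive variances

mse  = np.mean((y_true - pred_mean) ** 2)
smse = np.mean((y_true - pred_mean) ** 2 / pred_var)

print(mse)   # 1.75
```

An overconfident model (too-small predictive variances) inflates the SMSE even when its MSE is moderate, which is exactly the pattern discussed for MC dropout in the results below.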

2.5 Nested cross-validation and grid search

A two-level nested cross-validation (five outer and five inner folds) was used to determine generalization performance. We included a nested loop to perform a grid search selecting the hyperparameter settings with minimal expected generalization error. For the HMC BNN, we optimized the prior weight precision (see Neal [2012] for an intuition behind the weight precision), and for the MC dropout network, we optimized the dropout rate, which is directly related to the weight precision (see Gal [2016] for details of the relationship). We also used the nested loop to optimize the dropout rate and early stopping for the standard NN. In addition to selecting these hyperparameters in our grid search, we also searched through a range of small neural network architectures, from the smallest with a single hidden layer to the largest with two hidden layers.
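The nested cross-validation scheme can be sketched as follows. As assumptions for illustration, scikit-learn's MLPRegressor and the small hyperparameter grid stand in for the paper's three network types and their actual grids, and the data are random placeholders.

```python
# Hedged sketch of nested cross-validation: the outer 5-fold loop estimates
# generalization error; the inner 5-fold grid search picks hyperparameters.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(188, 5))                  # five PCA components per patient
y = rng.normal(loc=23.0, scale=4.0, size=188)  # placeholder MMSE targets

outer = KFold(n_splits=5, shuffle=True, random_state=0)
grid = {"hidden_layer_sizes": [(8,), (8, 8)],  # illustrative architecture range
        "alpha": [1e-3, 1e-1]}                 # illustrative weight penalty

outer_mse = []
for train_idx, test_idx in outer.split(X):
    search = GridSearchCV(MLPRegressor(max_iter=500), grid, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(X[train_idx], y[train_idx])     # inner 5-fold grid search
    pred = search.predict(X[test_idx])
    outer_mse.append(np.mean((y[test_idx] - pred) ** 2))

print(np.mean(outer_mse))                      # expected generalization error
```

Because hyperparameters are chosen only on inner-fold data, the outer-fold errors remain unbiased estimates of generalization performance.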

3 Results and Discussion

For each of the five outer-fold test sets, we first selected the optimal hyperparameters for each type of network (HMC BNN, MC dropout BNN and NN) based on the performance in the five inner folds of the nested cross-validation loop. Then, we used the corresponding settings to compute the results in the outer folds, as reported in Table 1.

Overall, HMC BNNs significantly (p < 0.05) outperformed MC dropout BNNs and standard NNs. Differences between model performances were assessed by statistically comparing the squared errors of the test set predictions.

All Bayesian models selected for the outer folds had two hidden layers. The BNNs showed a lower average MSE than the best standard NN across the test sets. This highlights the importance of combining the function-approximating power of neural networks with Bayesian techniques, especially for sparse data sets. In the present study, the data set consisted of 188 patients, which allowed us to directly compute the derivatives in the Hamiltonian. As such, we saw the HMC BNN outperforming the MC dropout BNN, which is often seen in practice Gal and Smith [2018]. The comparably high SMSE values for the MC dropout model are due to an underestimation of the predictive uncertainty. This is a common limitation of variational inference techniques, as the approximate posterior is optimised through a mode-seeking objective and can fail to model probability mass far away from the mode. On larger data sets, however, approximate inference (such as MC dropout) is currently the most practical way to implement BNNs.

Overall, we have shown how BNNs can be used to facilitate the development of low-cost, non-invasive markers of AD severity. To the best of our knowledge, this is the first article reporting a BNN approach for building QEEG markers of AD severity.

Fold | HMC BNN MSE | HMC BNN SMSE | MC dropout BNN MSE | MC dropout BNN SMSE | NN MSE | NN SMSE
1    | 12.42       | 15.21        | 17.29              | 1198.36             | 23.56  | n.a.
2    | 14.95       | 185.41       | 13.94              | 1027.51             | 24.94  | n.a.
3    | 11.49       | 144.14       | 13.40              | 1030.63             | 20.39  | n.a.
4    | 14.36       | 173.73       | 19.81              | 708.82              | 81.21  | n.a.
5    | 7.65        | 101.62       | 18.91              | 775.49              | 18.45  | n.a.
1-5  | 12.17       | 124.02       | 16.67              | 948.16              | 33.71  | n.a.
Table 1: Mean squared error (MSE) and standardized mean squared error (SMSE) on the outer-loop test sets (nested cross-validation) for predicting AD severity, as measured by the Mini-Mental State Examination (MMSE), for the HMC Bayesian neural network (HMC BNN), the MC dropout Bayesian neural network (MC dropout BNN), and the standard neural network (NN). The final row gives the averages over the five folds.


Acknowledgments

The PRODEM cohort study has been supported by the Austrian Research Promotion Agency FFG, project no. 827462.


  • Association [2017] Alzheimer’s Association. 2017 Alzheimer’s disease facts and figures. Alzheimer’s & Dementia, 13(4):325–373, 2017. ISSN 1552-5260.
  • Hurd et al. [2013] M. D. Hurd, P. Martorell, A. Delavande, K. J. Mullen, and K. M. Langa. Monetary costs of dementia in the united states. New England Journal of Medicine, 368(14):1326–1334, 2013. doi: 10.1056/NEJMsa1204629.
  • Dauwels et al. [2010] J. Dauwels, F. Vialatte, and A. Cichocki. Diagnosis of Alzheimer’s disease from EEG signals: where are we standing? Current Alzheimer Research, 7(6):487–505, 2010. ISSN 1875-5828.
  • Jeong [2004] J. Jeong. EEG dynamics in patients with Alzheimer’s disease. Clinical Neurophysiology, 115(7):1490–1505, 2004. ISSN 1388-2457. doi: 10.1016/j.clinph.2004.01.001.
  • Neal [2012] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
  • Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • Neal et al. [2011] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  • Folstein et al. [1975] M. F. Folstein, S. E. Folstein, and P. R. McHugh. Mini-mental state. Journal of Psychiatric Research, 12(3):189–198, 1975. ISSN 0022-3956. doi: 10.1016/0022-3956(75)90026-6.
  • Waser et al. [2016] M. Waser, H. Garn, R. Schmidt, T. Benke, P. Dal-Bianco, G. Ransmayr, H. Schmidt, S. Seiler, G. Sanin, F. Mayer, G. Caravias, D. Grossegger, W. Fruehwirt, and M. Deistler. Quantifying synchrony patterns in the EEG of Alzheimer’s patients with linear and non-linear connectivity markers. Journal of Neural Transmission, 123(3):297–316, 2016. ISSN 1435-1463 (Electronic), 0300-9564 (Linking). doi: 10.1007/s00702-015-1461-x.
  • Garn et al. [2014] H. Garn, M. Waser, M. Deistler, R. Schmidt, P. Dal-Bianco, G. Ransmayr, J. Zeitlhofer, H. Schmidt, S. Seiler, G. Sanin, G. Caravias, P. Santer, D. Grossegger, W. Fruehwirt, and T. Benke. Quantitative EEG in Alzheimer’s disease: cognitive state, resting state and association with disease severity. International Journal of Psychophysiology, 93(3):390–7, 2014. ISSN 1872-7697 (Electronic) 0167-8760 (Linking). doi: 10.1016/j.ijpsycho.2014.06.003.
  • Fruehwirt et al. [2017] W. Fruehwirt, P. Zhang, M. Gerstgrasser, D. Grossegger, R. Schmidt, T. Benke, P. Dal-Bianco, G. Ransmayr, L. Weydemann, H. Garn, M. Waser, M. Osborne, and G. Dorffner. Bayesian Gaussian Process Classification from Event-Related Brain Potentials in Alzheimer’s Disease, pages 65–75. Springer International Publishing, Cham, 2017. ISBN 978-3-319-59758-4. doi: 10.1007/978-3-319-59758-4_7.
  • Garn et al. [2015] H. Garn, M. Waser, M. Deistler, T. Benke, P. Dal-Bianco, G. Ransmayr, H. Schmidt, G. Sanin, P. Santer, G. Caravias, S. Seiler, D. Grossegger, W. Fruehwirt, and R. Schmidt. Quantitative EEG markers relate to Alzheimer’s disease severity in the Prospective Dementia Registry Austria (PRODEM). Clinical Neurophysiology, 126(3):505–513, 2015. ISSN 1388-2457.
  • MacKay [1992] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
  • Gal [2016] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 2016.
  • Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Gal and Smith [2018] Yarin Gal and Lewis Smith. Idealised Bayesian Neural Networks Cannot Have Adversarial Examples: Theoretical and Empirical Study. arXiv preprint arXiv:1806.00667, 2018.