Modeling Disease Progression in Mild Cognitive Impairment and Alzheimer's Disease with Digital Twins

by Daniele Bertolini, et al.

Alzheimer's Disease (AD) is a neurodegenerative disease that affects subjects in a broad range of severity and is assessed in clinical trials with multiple cognitive and functional instruments. As clinical trials in AD increasingly focus on earlier stages of the disease, especially Mild Cognitive Impairment (MCI), the ability to model subject outcomes across the disease spectrum is extremely important. We use unsupervised machine learning models called Conditional Restricted Boltzmann Machines (CRBMs) to create Digital Twins of AD subjects. Digital Twins are simulated clinical records that share baseline data with actual subjects and comprehensively model their outcomes under standard-of-care. The CRBMs are trained on a large set of records from subjects in observational studies and the placebo arms of clinical trials across the AD spectrum. These data exhibit a challenging, but common, patchwork of measured and missing observations across subjects in the dataset, and we present a novel model architecture designed to learn effectively from it. We evaluate performance against a held-out test dataset and show how Digital Twins simultaneously capture the progression of a number of key endpoints in clinical trials across a broad spectrum of disease severity, including MCI and mild-to-moderate AD.




I Introduction

The development of computational models that comprehensively and accurately forecast a patient’s prognosis under the current standard-of-care has the potential to revolutionize medical research. Recall that a typical clinical trial compares the safety and efficacy of an investigational therapy to an established therapy, often in combination with a dummy treatment (i.e., placebo). Therefore, generative models trained to represent disease progression with existing treatments can be used to generate synthetic clinical records, which we call ‘Digital Subjects’, for use in clinical trial design or as external control groups [27, 7]. Similarly, conditional generative models can be used to generate distributions of potential control outcomes for individual patients ([27, 7, 31, 26]), which we call ‘Digital Twins’, that can be incorporated into clinical trials to increase statistical power or reduce required sample sizes [19, 26].

Alzheimer’s Disease (AD) is one indication with a particularly dire need for the development of new technologies that could increase the probability of finding an effective treatment. Despite decades of research, there is still no approved disease-modifying treatment for AD, and it often takes years to enroll enough subjects to achieve sufficient statistical power for a new clinical trial [4, 3]. Therefore, the ability to decrease required sample sizes by augmenting AD clinical trials with data from generative models would make it possible to search for new AD treatments more quickly.

During a typical AD clinical trial, each subject reports to a clinical site at regular intervals (e.g., every 3 or 6 months) and is assessed with a variety of metrics to measure cognitive function and overall health. Usually, cognitive function is assessed through composite questionnaires such as the Alzheimer’s Disease Assessment Scale - Cognitive Subscale (ADAS-Cog) [18], the Clinical Dementia Rating (CDR) [11], and the Mini-Mental State Exam (MMSE) [8]. In addition, vital signs, blood tests, and biomarkers derived from magnetic resonance imaging (MRI) or positron emission tomography (PET) may be measured. That is, data collected in clinical trials typically correspond to a type of panel data [30] – a multidimensional array that describes how every subject’s health data changes over the course of the trial.

Previously, Fisher et al. [7] demonstrated that a type of generative model called a Conditional Restricted Boltzmann Machine (CRBM) can be trained using an adversary to generate panel data representative of control groups of AD clinical trials. In addition, they showed that CRBMs can also be used as conditional generative models to forecast potential outcomes for individual subjects. Here, we aim to extend this work by incorporating additional training data that includes patient populations with earlier stage disease, additional cognitive measures such as CDR, as well as additional biomarkers. These improvements are crucial for addressing the needs of a changing AD clinical trials landscape that is increasingly focused on earlier stages of the disease including subjects with Mild Cognitive Impairment (MCI) [21].

Most clinical trials focus on relatively homogeneous populations, and trials in AD are no exception. Few individual studies span the spectrum from MCI to moderate AD. In addition, an average clinical trial in AD only includes around 400 subjects [4, 3]. Therefore, it’s necessary to integrate data from multiple interventional and observational studies in order to collect a sufficiently broad dataset. Indeed, we use clinical records from nearly 7,000 subjects – covering close to 35,000 subject-visits – integrated from 21 different studies. Unfortunately, these different studies did not all collect the same measurements or have subjects visit clinical sites with the same frequency. As a result, integrating these different studies solves the sample size problem, but creates a missing data problem. For example, the components of ADAS-Cog were typically measured in 3-month intervals, whereas the components of CDR were typically measured in 6-month intervals. To overcome this problem, we introduce a new approach that combines two CRBMs, each trained to represent a specific timescale, into a single generative model for creating Digital Subjects and Digital Twins for AD.

Using analyses on a held-out test set, we demonstrate that a generative model composed of two CRBMs accurately simulates subjects with MCI and AD under the current standard-of-care. That is, statistical properties computed from Digital Twins agree with statistical properties computed from actual subjects in the control arms of previous AD clinical trials, which is a necessary condition for using Digital Twins to increase the statistical power of future AD trials. Moreover, we demonstrate that the updated model represents a substantial improvement over previously published work, particularly for subjects with MCI.

The manuscript is organized as follows. Section II provides a brief description of our dataset, modeling approach, and how we test the model and compare it to previous work. Section III presents various comparisons of Digital Twins generated by the model to actual subject records in order to assess the quality of the generative model. Finally, Section IV discusses implications of these results and future directions for research. Appendices A, B, and C give additional details on the methods used, and Appendices D and E give additional results beyond those discussed in the main text.

II Methods

II.1 Data

Previously, Fisher et al. [7] used a dataset with 1,909 subjects covering 44 clinical variables to train and evaluate a generative model for AD clinical trajectories. Here, we improve on this previously published work by training a generative model using a much larger dataset composed of 6,919 subjects covering 64 clinical variables across a broader set of domains, obtained by integrating data from 21 different studies.

Data were obtained from the C-Path Online Data Repository for Alzheimer’s Disease (CODR-AD) [17, 16] and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [15]. The former correspond to data from the control arms of previously completed AD clinical trials, whereas the latter is a large observational study of MCI and AD subjects. These two datasets differ in a number of ways. Data from CODR-AD is primarily from mild to moderate AD studies, with a visit interval of at most 3 months. While CODR-AD data is rich in ADAS-Cog and MMSE measurements as well as safety data such as laboratory tests, it lacks some measurements prevalent in many current AD trials, particularly CDR and certain biomarker data. ADNI data has a visit interval of 6 months or more and contains data for subjects across the entire disease spectrum, with a substantial fraction of data for MCI subjects (ADNI also includes cognitively normal cohorts, which are not used in our datasets). ADNI includes a more diverse set of measurements with both CDR and biomarker data but without, for example, laboratory tests.

We combined these two sources and built an integrated dataset with a more complete set of variables than either individual dataset can provide. This integrated dataset is comprised of panel data for 6,919 subjects with 34,224 subject-visits in which approximately 25% of subjects have MCI and the remainder have AD. Subject-matter experts were consulted to determine variables of clinical significance and, based on data availability, 64 variables were included. Most of the variables were measured in regular 3- or 6-month intervals. A few variables measured only at the first (baseline) visit were treated as background variables, i.e., properties of the clinical state of subjects at the start of the study. The variables are listed in Table 1. Additional details on the variables and the construction of the dataset are provided in Appendix A.1.

Before training, the dataset was split in the ratio into training, validation, and test datasets, stratified by study. The test dataset was held out until the end of model development and used for all analyses shown in this work unless otherwise noted.
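As an illustration of a split that is stratified by study, the following minimal Python sketch partitions subjects into training, validation, and test sets separately within each study. The function name `stratified_split`, the toy subject identifiers, and in particular the `fractions` values are assumptions made for this example; they are not the ratio used for the dataset described here.

```python
import random
from collections import defaultdict


def stratified_split(subjects, fractions=(0.5, 0.2, 0.3), seed=0):
    """Split (subject_id, study) pairs into train/validation/test within each study.

    `fractions` holds placeholder values (train, validation, test); the test
    set receives whatever remains after the first two fractions are assigned.
    """
    rng = random.Random(seed)
    by_study = defaultdict(list)
    for subject_id, study in subjects:
        by_study[study].append(subject_id)

    splits = {"train": [], "validation": [], "test": []}
    for study in sorted(by_study):
        ids = by_study[study]
        rng.shuffle(ids)  # random assignment within the study
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["validation"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]
    return splits


# Toy cohort: 60 hypothetical subjects spread evenly over 3 hypothetical studies.
subjects = [(f"s{i}", f"study{i % 3}") for i in range(60)]
splits = stratified_split(subjects)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting within each study keeps every study represented in all three partitions, so that a study with an unusual population or visit schedule cannot end up entirely in the test set.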

| Variable Name | Group | Variable Name | Group |
|---|---|---|---|
| ADAS cancellation | 3 | Alanine aminotransferase | 3 |
| ADAS commands | 3, 6 | Alkaline phosphatase | 3 |
| ADAS comprehension | 3, 6 | Aspartate aminotransferase | 3 |
| ADAS construction | 3, 6 | Cholesterol | 3 |
| ADAS delayed word recall | 3, 6 | Creatine kinase | 3 |
| ADAS ideational | 3, 6 | Creatinine | 3 |
| ADAS naming | 3, 6 | Eosinophils | 3 |
| ADAS orientation | 3, 6 | Gamma-glutamyl transferase | 3 |
| ADAS remember instructions | 3, 6 | Glucose | 3 |
| ADAS spoken language | 3, 6 | Hematocrit | 3 |
| ADAS word finding | 3, 6 | Hemoglobin | 3 |
| ADAS word recall | 3, 6 | Hemoglobin A1C | 3 |
| ADAS word recognition | 3, 6 | Indirect Bilirubin | 3 |
| MMSE attention | 3, 6 | Lymphocytes | 3 |
| MMSE language | 3, 6 | Monocytes | 3 |
| MMSE orientation | 3, 6 | Platelet | 3 |
| MMSE recall | 3, 6 | Potassium | 3 |
| MMSE registration | 3, 6 | Sodium | 3 |
| CDR community | 6 | Triglycerides | 3 |
| CDR home & hobbies | 6 | Heart rate | 3 |
| CDR judgement | 6 | Diastolic blood pressure | 3 |
| CDR memory | 6 | Systolic blood pressure | 3 |
| CDR orientation | 6 | Weight | 3 |
| CDR personal care | 6 | Serious adverse events | 3 |
| *Baseline Age* | 3, 6 | *AChEI/Memantine use* | 3, 6 |
| *Sex* | 3, 6 | *History of hypertension* | 3, 6 |
| *Years of education* | 3, 6 | *History of type-2 diabetes* | 3, 6 |
| *ApoE ε4 allele count* | 3, 6 | *Amyloid status* | 3, 6 |
| *Region (Europe)* | 3 | *CSF phosphorylated tau 181* | 3, 6 |
| *Region (Northern America)* | 3 | *CSF total tau* | 3, 6 |
| *Region (Other)* | 3 | *Vitamin B12* | 3, 6 |
| *Height* | 3, 6 | *RCT or Observational Study* | 3, 6 |

Table 1: Variables used in modeling AD progression. Background variables are denoted in italics. The group(s), 3-month and/or 6-month, that each variable is used with in modeling, based on the visit interval of the data and the relationship to other variables, are given.

II.2 Modeling Approach

Modeling the integrated panel dataset presents two challenges: variables are observed with two characteristic timescales (3 months and 6 months), and only a subset of the variables is ever observed simultaneously.

There are at least three different ways to solve the multiple timescale problem. First, one could downsample all of the data to a 6-month visit interval. However, this makes the data unsuitable for many clinical trial applications, which typically record observations every 3 months. Alternatively, one could directly train a model with a 3-month visit frequency on the combined dataset. Training such a model is extremely challenging as some key variables, such as the components of CDR, would be missing in a large fraction of observations. Therefore, we adopt a third approach in which we explicitly construct a generative model with two timescales.

We handle the two-timescale character of the data explicitly by training a composite model consisting of two CRBMs of the form of Walsh et al. [27]: 3-month and 6-month CRBMs that model variables with those respective time intervals in the panel data. Variables available at 3-month intervals are modeled by the 3-month CRBM, and variables only available at 6-month intervals as well as key variables available at 3-month intervals are modeled by the 6-month CRBM. Hence, there is a set of variables included in both CRBMs. In our case, these shared variables are components of the ADAS-Cog and MMSE assessments as well as background variables. The components of CDR are modeled by the 6-month CRBM alone. Table 2 gives the number of subjects from each dataset used in the 3-month and 6-month CRBMs. A summary of CRBM models is presented in Appendix C.

| Dataset | 3-month CRBM (number of subjects) | 6-month CRBM (number of subjects) |
|---|---|---|
| CODR-AD | 5331 | 208 |
| ADNI | 1588 | 1588 |

Table 2: The number of subjects from each dataset used for each CRBM. The 3-month CRBM uses all 6,919 subjects in the integrated dataset.

The 3-month and 6-month CRBMs are naturally integrated into a single model. Each can be separately trained, but when they are used to generate Digital Subjects or Digital Twins the 3-month CRBM is used first, followed by the 6-month CRBM. Any shared variables are generated by the 3-month CRBM and conditioned upon (held fixed) when generating data from the 6-month CRBM. This structure is natural, as the 3-month CRBM predicts a given subject’s clinical trajectory for all variables except CDR. Subsequently, key data variables generated by the 3-month CRBM are used by the 6-month CRBM to better inform the predictions of CDR on a 6-month timescale.
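The order of operations in this composite model can be sketched in a few lines of Python. The sketch below is only an illustration of the control flow: the two samplers are simple random-walk stand-ins rather than CRBMs, and the variable names (`adas_word_recall`, `mmse_orientation`, `cdr_memory`), function names, and noise scales are all hypothetical.

```python
import random

SHARED = ("adas_word_recall", "mmse_orientation")  # hypothetical subset of shared variables
CDR_ONLY = ("cdr_memory",)                         # hypothetical variable modeled at 6 months only


def sample_3_month(baseline, months, rng):
    """Stand-in for the 3-month model: a random walk from baseline."""
    trajectory = {0: {k: baseline[k] for k in SHARED}}
    for prev, month in zip(months, months[1:]):
        trajectory[month] = {k: trajectory[prev][k] + rng.gauss(0.0, 1.0) for k in SHARED}
    return trajectory


def sample_6_month(baseline, shared_trajectory, months, rng):
    """Stand-in for the 6-month model: shared variables are held fixed."""
    trajectory = {}
    for month in months:
        visit = dict(shared_trajectory[month])  # conditioned on, never resampled
        drift = 0.01 * (visit["adas_word_recall"] - baseline["adas_word_recall"])
        for k in CDR_ONLY:
            # Baseline is observed; later visits are generated.
            visit[k] = baseline[k] if month == 0 else baseline[k] + drift + rng.gauss(0.0, 0.1)
        trajectory[month] = visit
    return trajectory


def sample_digital_twin(baseline, seed=0):
    rng = random.Random(seed)
    months_3 = [0, 3, 6, 9, 12, 15, 18]
    months_6 = [0, 6, 12, 18]
    fast = sample_3_month(baseline, months_3, rng)        # step 1: 3-month model
    slow = sample_6_month(baseline, fast, months_6, rng)  # step 2: 6-month model
    twin = {m: dict(v) for m, v in fast.items()}
    for m in months_6:
        twin[m].update(slow[m])  # adds the CDR variable at 6-month visits
    return fast, twin


baseline = {"adas_word_recall": 5.0, "mmse_orientation": 8.0, "cdr_memory": 1.0}
fast, twin = sample_digital_twin(baseline)
print(sorted(twin[6]))
```

Because the second stage copies the shared variables instead of resampling them, the merged trajectory is consistent by construction: the values at 6-month visits are exactly those produced by the first stage.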

The 3-month and 6-month CRBMs are trained independently, each following a hyperparameter selection procedure presented in Walsh et al. [27] and reviewed here. Using the training component of the dataset, a set of CRBMs are trained via stochastic gradient descent, each with different hyperparameters. The validation dataset is used to compute metrics that judge model performance, and optimal hyperparameters are selected. Finally, the CRBM is retrained using these optimal hyperparameters on both the training and validation components of the dataset. This results in 3-month and 6-month CRBMs that can be integrated into the final model.
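The three steps of this procedure (train a grid, select on validation data, retrain on the combined data) can be sketched as follows. This is a toy illustration under stated assumptions: the "model" is a shrunken mean with a single hyperparameter and a single squared-error metric, standing in for a CRBM and for the set of metrics used in practice, and all function names are hypothetical.

```python
def fit(data, shrinkage):
    """Toy 'model': a shrunken mean. Stands in for training one CRBM."""
    return sum(data) / (len(data) + shrinkage)


def validation_loss(model, data):
    """Toy metric: mean squared error of the constant prediction."""
    return sum((x - model) ** 2 for x in data) / len(data)


def select_and_retrain(train, validation, grid):
    # 1. Train one candidate per hyperparameter setting on the training data.
    candidates = {h: fit(train, h) for h in grid}
    # 2. Score each candidate on the validation data and keep the best setting.
    best = min(grid, key=lambda h: validation_loss(candidates[h], validation))
    # 3. Retrain with the selected setting on training plus validation data.
    return best, fit(train + validation, best)


train = [1.0, 2.0, 3.0, 4.0]
validation = [2.0, 2.5, 3.0]
best, final_model = select_and_retrain(train, validation, grid=[0.0, 1.0, 4.0])
print(best, final_model)
```

The test data plays no role in any of the three steps, which is what allows it to serve as an unbiased check of the final model.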

Figure 1: An overview of the data curation, CRBM training, and model sampling. In (A), the data used to build the conditional generative model is curated from two sources, CODR-AD and ADNI. The 3-month and 6-month CRBMs are built by training models in separate grids and using a selection process to determine optimal hyperparameters. After retraining using these selected hyperparameters, the CRBMs are composed into the final model. In (B), Digital Twins are built from baseline data by first sampling from the 3-month model then sampling from the 6-month model. When sampling from the 6-month model, the variables shared with the 3-month model are conditioned upon using the samples generated from the 3-month model.

Data curation, CRBM training, and model sampling procedures are outlined in Figure 1. For an extended description of the methods described here, see Appendix A.

II.3 Model Evaluation

The goal of our modeling approach is to be able to generate Digital Twins whose panel data are well-calibrated, meaning they accurately predict the probability distribution of actual subject data. This is practically useful because it provides an unbiased prediction for subject outcomes along with an unbiased estimate for their variability. Digital Twins go beyond simple predictions of mean outcomes – for each subject, a set of Digital Twins predicts the distribution of possible outcomes. Well-calibrated Digital Twins can be used to reliably predict subject-level or study-level outcomes and their expected variability.

We use the test dataset, which the constituent CRBMs have not seen, and compare test subjects to their Digital Twins created from the baseline data. A number of different approaches are used to compare these two groups of data to assess variable-level, subject-level, and population-level measures. As most outcomes of interest for clinical applications are linear combinations of the variables modeled, such as the ADAS-Cog total score, MMSE total score, or CDR Sum-of-Boxes (CDR-SB) score, we are particularly interested in the performance in modeling linear combinations of variables.
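The emphasis on linear combinations has a simple consequence: the mean and variance of a composite score are fully determined by the means and covariances of its components. The short Python sketch below checks this numerically for CDR-SB, which is the unweighted sum of the six CDR boxes. The four subjects and their box scores are made-up values for illustration, and the helper names are hypothetical.

```python
CDR_BOXES = ("community", "home_hobbies", "judgement",
             "memory", "orientation", "personal_care")
WEIGHTS = [1.0] * len(CDR_BOXES)  # CDR-SB is the unweighted sum of the six boxes


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Made-up box scores for four hypothetical subjects (one row per subject).
cohort = [
    [0.5, 0.5, 0.5, 1.0, 0.5, 0.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 0.5],
    [0.0, 0.5, 0.5, 0.5, 0.0, 0.0],
    [2.0, 2.0, 1.0, 2.0, 1.0, 1.0],
]

# Composite score computed directly for each subject.
scores = [sum(w * v for w, v in zip(WEIGHTS, subject)) for subject in cohort]

# The same mean and variance recovered from component-level moments only.
columns = list(zip(*cohort))
mean_from_moments = sum(w * mean(col) for w, col in zip(WEIGHTS, columns))
var_from_moments = sum(
    wi * wj * covariance(ci, cj)
    for wi, ci in zip(WEIGHTS, columns)
    for wj, cj in zip(WEIGHTS, columns)
)
print(scores, mean_from_moments, var_from_moments)
```

A model that reproduces the component means, variances, and pairwise correlations therefore also reproduces the first two moments of any such composite score.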

III Results

III.1 Evaluating the Performance of the Model in Generating Digital Twins

We evaluate the ability of the trained conditional generative model to create well-calibrated forecasts for variables in three ways. First, for each variable, we ask if observed values tend to fall in the bulk of the distribution predicted by the model, conditioned on the observed data at time zero (the baseline data). Next, we ask if the population average means, variances, and pairwise (auto-)correlations predicted by the model are consistent with those estimated from the data. Finally, we train a linear classifier to distinguish between actual subject records and Digital Twins sampled from the model. All analyses are performed on the held-out test dataset not used by the model during training. Additional details on the methods used in this section can be found in the appendices, and additional results on the statistical agreement between moments and correlations in disease progression measures for actual control subjects and their Digital Twins are presented in Appendices D and E.

Figure 2: Time-dependent marginal distributions computed from the model are well-calibrated. For each variable, at each visit, we compute a statistic that has a standard normal distribution if observations from actual subjects are consistent with the distribution defined by the Digital Twins. We plot the mean as a point and standard deviation as an error bar, and apply a Kolmogorov-Smirnov test to check if the distributions of this statistic are consistent with the standard normal distribution. Red marks any deviations that remain statistically significant after a Bonferroni correction. Means above (below) 0 indicate a biased model distribution with lower (higher) mean than the mean of the data; similarly, variances smaller (larger) than 1 indicate a model variance larger (smaller) than variance of the data.

Our first analysis focuses on the calibration of the time-dependent marginal distributions, conditioned on observed baseline data for each subject. We generate 100 Digital Twins for each subject and, for each variable at each visit for each subject, verify the consistency of the distribution over these Digital Twins with the observed value from the actual subject. In particular, for each variable at each visit, we compute the probability that a Digital Twin sampled from the conditional generative model has a larger value of that variable than what was observed for a given subject. Then, we normalize this ‘p-value’ with a probit transformation, i.e., by applying the inverse of the cumulative distribution function of the standard normal distribution. Finally, we perform a statistical test to determine if the distribution of the transformed values is consistent with a standard normal distribution, which would be expected if the Digital Twins and actual subject data are drawn from the same distribution. Figure 2 shows the mean and standard deviation of the transformed values for all variables and time points. For the vast majority of variables, differences between observed and expected distributions are not statistically significant after correcting for multiple testing. The only differences of note are that the variance for weight predicted by the model is too large, and the mean values for hemoglobin A1C at 15 and 18 months and ADAS cancellation at 15 months are slightly biased. Therefore, the time-dependent marginal distributions are generally well-calibrated.
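A minimal Python sketch of this calibration check is given below, using synthetic Gaussian data in place of subject records. Several details are assumptions made for the example rather than the exact procedure: the tail probability is smoothed so that it never equals 0 or 1, the sign convention (applying the inverse normal CDF to one minus the tail probability) is chosen so that a model mean below the data mean yields a positive statistic, in line with the Figure 2 caption, and the function names are hypothetical. The Kolmogorov-Smirnov test itself is omitted.

```python
import random
from statistics import NormalDist, mean, stdev

STANDARD_NORMAL = NormalDist()


def calibration_z(observed, twin_values):
    """Probit-transformed tail probability of one observation under its twins."""
    larger = sum(v > observed for v in twin_values)
    # Fraction of twins exceeding the observation, smoothed to stay inside (0, 1).
    p = (larger + 0.5) / (len(twin_values) + 1.0)
    # Sign convention is an assumption: positive when twins tend to fall below the data.
    return STANDARD_NORMAL.inv_cdf(1.0 - p)


def simulate(twin_shift, n_subjects=2000, n_twins=100, seed=0):
    """Mean and standard deviation of the statistic on synthetic data."""
    rng = random.Random(seed)
    zs = []
    for _ in range(n_subjects):
        observed = rng.gauss(0.0, 1.0)
        twins = [rng.gauss(twin_shift, 1.0) for _ in range(n_twins)]
        zs.append(calibration_z(observed, twins))
    return mean(zs), stdev(zs)


calibrated = simulate(twin_shift=0.0)   # twins drawn from the data distribution
biased_low = simulate(twin_shift=-1.0)  # twins systematically below the data
print(calibrated, biased_low)
```

In the calibrated case the statistic should have mean near 0 and standard deviation near 1, while the shifted case should produce a clearly positive mean.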

Next, we examine the ability of the conditional generative model to forecast the population-level statistics that are most important for linear combinations of variables such as the ADAS-Cog, CDR-SB, or MMSE composite scores. We compare population averaged means, variances, pairwise correlations, and the 3-, 6-, and 9-month lagged pairwise autocorrelations computed from Digital Twins and actual subjects. Specifically, we sampled a single Digital Twin for each subject in the test dataset to create a Digital Twin cohort. Then, we computed the above-mentioned statistics from the Digital Twin cohort and from the test dataset. Figure 3 shows the values of these statistics from the test dataset plotted against the corresponding values computed from the Digital Twin cohort, with the goodness-of-fit assessed across statistics of the same type using linear regressions. In the regressions, points receive a weight proportional to the fraction of data present when computing that statistic to account for the impact of missing data. Theil-Sen regression is used to quantify goodness-of-fit for the means and standard deviations to mitigate outlier dependence due to different units (and scales) across variables. We find that all slopes are close to 1 and all intercepts are close to 0, illustrating that our model captures the leading statistical moments that are relevant for linear combinations of variables.
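To illustrate the two kinds of regression mentioned here, the following Python sketch implements a basic (unweighted) Theil-Sen estimator and a weighted least-squares line, and applies both to made-up statistics containing one outlier. It is a simplified sketch: how the weights enter the Theil-Sen fit is not reproduced, and the function names, data, and weights are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Theil-Sen line: median of pairwise slopes, then median residual intercept."""
    slopes = [
        (yj - yi) / (xj - xi)
        for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)
        if xj != xi
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def weighted_least_squares(x, y, w):
    """Straight-line fit in which each point contributes in proportion to its weight."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up statistics: the model matches the data except for one outlying value.
model_stats = [0.0, 1.0, 2.0, 3.0, 4.0]
data_stats = [0.0, 1.0, 2.0, 3.0, 40.0]
# Hypothetical weights: the outlier is given zero weight, as for a statistic
# computed from no available data.
weights = [1.0, 1.0, 1.0, 1.0, 0.0]

ts_slope, ts_intercept = theil_sen(model_stats, data_stats)
wls_slope, wls_intercept = weighted_least_squares(model_stats, data_stats, weights)
print(ts_slope, ts_intercept, wls_slope, wls_intercept)
```

Both fits recover a slope of 1 and an intercept of 0 on this toy example, the first through robustness to the outlier and the second because the outlier carries no weight.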


Figure 3: The model captures the leading statistical moments of the data. We compare the leading statistical moments of the distribution of all variables from the test data and from the model. A single Digital Twin is randomly generated for each subject and the moments are compared to the data for each 3-month follow-up visit up to 18 months. We report slope and intercept from a regression of the data against the model predictions, weighted by the fraction of data present at each particular point (indicated by the color bar on the right). For correlations and autocorrelations we also report the coefficient of determination, R².

Finally, we test the ability of a linear classifier to distinguish between actual subjects and their Digital Twins. In a sense, this is directly testing if Digital Twins are statistically indistinguishable from actual subjects, and is closely related to the adversarial methods used while training the model. For each time point, we train a logistic regression to distinguish between each subject and their Digital Twins. In addition, we train a logistic regression to distinguish between actual subjects and their Digital Twins using the difference in panel data between two consecutive time points. Figure 4 shows that the linear classifiers only obtain accuracies that are consistent, or nearly consistent, with random chance. This demonstrates that individual subjects are statistically indistinguishable from their Digital Twins using linear combinations of variables.
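The logic of this test can be sketched with a small logistic regression written in plain Python and applied to synthetic two-feature data. Everything in the sketch is an assumption made for illustration: the Gaussian features, the gradient-descent settings, the sample sizes, and the function names. It fits one classifier on a single synthetic "visit" rather than one per time point, and includes a deliberately biased case as a positive control.

```python
import math
import random


def train_logistic(X, y, lr=0.5, epochs=300):
    """Logistic regression fitted by full-batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-score)) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b


def accuracy(w, b, X, y):
    correct = 0
    for xi, yi in zip(X, y):
        score = b + sum(wj * xj for wj, xj in zip(w, xi))
        correct += int((score > 0.0) == (yi == 1))
    return correct / len(X)


def real_vs_twin_accuracy(twin_shift, seed=0, n=600):
    """Held-out accuracy for separating 'actual' (label 1) from 'twin' (label 0) records."""
    rng = random.Random(seed)

    def draw(shift):
        return [rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)]

    def dataset():
        X = [draw(0.0) for _ in range(n // 2)] + [draw(twin_shift) for _ in range(n // 2)]
        y = [1] * (n // 2) + [0] * (n // 2)
        return X, y

    X_train, y_train = dataset()
    X_test, y_test = dataset()
    w, b = train_logistic(X_train, y_train)
    return accuracy(w, b, X_test, y_test)


indistinguishable = real_vs_twin_accuracy(twin_shift=0.0)  # same distribution
biased = real_vs_twin_accuracy(twin_shift=2.0)             # positive control
print(indistinguishable, biased)
```

When the two groups share a distribution, held-out accuracy should stay near 0.5; the shifted control confirms that the classifier is able to detect a real difference when one exists.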

Figure 4: Linear classifiers are unable to distinguish actual subjects from their Digital Twins.

We train a logistic regression model to differentiate actual subjects from their Digital Twins. Top panel: we compare two models, one in which we use data at each visit, and one in which we use differences between two consecutive visits. On the right, we report the accuracy of both models. An accuracy of 0.5 is consistent with random chance. Error bars indicate 95% confidence interval errors. Bottom panel: boxplots for the distribution of logistic regression coefficients for the single visit model, where a weight equal to 0 corresponds to an irrelevant feature.

III.2 Progression of Common Clinical Endpoints

The components of the cognitive questionnaires are particularly important for clinical trial applications because ADAS-Cog, CDR, and MMSE are frequently used as inclusion criteria and endpoints in AD clinical trials. We model 13 components of the ADAS-Cog score, 6 components of the CDR score, and 5 components of the MMSE score, which are listed in Tables 4 and 3 (see Appendix B for more detail). We compared the mean change-from-baseline in ADAS-Cog11, CDR-SB, and MMSE predicted from a Digital Twin cohort sampled from the conditional generative model to the corresponding changes estimated from the test dataset. We present the results stratified by disease severity at baseline. Figure 5 shows that we consistently predict the mean progression across all time points, scores, and cohorts, indicating that the conditional generative model is well-calibrated for these important clinical outcomes. Further, Appendix D shows that predicted changes-from-baseline for these scores in individual subjects are highly correlated with the observed changes-from-baseline in actual subjects. Taken together, these results demonstrate that the model can be used to accurately forecast the prognosis of subjects with MCI and AD.

Figure 5: Progression of clinical endpoints stratified by disease severity at baseline.

We compare the predicted mean change-from-baseline for ADAS-Cog11, CDR-SB, and MMSE to the mean change-from-baseline in the test dataset. We stratified the population by ADAS-Cog11 measured at baseline, with lower values corresponding to milder cases. For each time point, the mean change-from-baseline, and its standard error, were estimated by drawing 100 Digital Twins for each subject and averaging over the population. We consistently predict the mean progression across all time points, scores, and cohorts. Error bars represent 95% confidence interval errors. No CDR test data was available at 18 months for the most severe group.

III.3 Comparison to an Earlier AD Progression Model

Previously, Fisher et al. [7] developed a machine learning model to simulate AD progression using a smaller dataset than that used in this manuscript. In addition, the prior model did not include the components of CDR, or other key biomarker and safety data, and mostly modeled subjects with mild to moderate AD. In this paper, we combined multiple data sources to create a larger and more comprehensive dataset. In addition, a much larger fraction of subjects in the new dataset had MCI, which resulted in a dramatic improvement in predictions for this cohort of subjects. We compare the predictions for ADAS-Cog11 progression from the two models on AD and MCI cohorts in Figure 6. In addition, Tables 8 and 9 in Appendix E compare the mean and standard deviation of the marginal distributions of each variable computed from the two models to those computed from the test dataset. The performance of the current model on subjects with AD is comparable to the previous model, but it substantially outperforms the previous model for subjects with MCI.

Figure 6: Mean change-from-baseline in ADAS-Cog11 for subjects with AD or MCI. We compare the predictions for mean ADAS-Cog11 progression from the CRBM model of Fisher et al. [7] and the model of this paper on two different cohorts: subjects with baseline diagnosis of AD or MCI. The model presented here outperforms the previous model on the MCI cohort. Mean progressions are calculated as in Figure 5 and error bars represent 95% confidence interval errors.

IV Discussion

A typical clinical trial compares the safety and efficacy of an investigational therapy to a placebo and standard-of-care. Therefore, there are many uses for models that can generate accurate forecasts for disease progression under standard-of-care, ranging from clinical trial design to statistical analysis plans that directly incorporate prognostic forecasts to improve statistical power [19, 26]. Here, we built on previous work by training a generative model to forecast disease progression for subjects with AD, aiming to obtain well-calibrated forecasts across the spectrum of severity from early to severe disease.

A subject’s prognosis can be forecast by using the model to generate Digital Twins – longitudinal trajectories sampled from a multivariate probability distribution that describes how that subject’s characteristics are likely to change after baseline. Each Digital Twin has 16 background variables that are only measured at baseline, and 48 time-dependent variables that are measured in 3- or 6-month intervals. In total, these 64 variables include cognitive questionnaires such as ADAS-Cog, CDR, and MMSE that are commonly used to assess disease progression in clinical trials, as well as lab tests and other biomarkers.

Modeling this comprehensive set of variables required integrating two large datasets of AD clinical trials and observational studies. However, the different studies in the integrated dataset typically observed patients in either 3- or 6-month intervals, and they did not always measure the same set of variables. Therefore, we introduced a new architecture comprised of two Conditional Restricted Boltzmann Machines (CRBMs), one to represent the 3-month timescale, and the other to represent the 6-month timescale. The two CRBMs can be combined to generate trajectories that have the variables measured in 3-month intervals and 6-month intervals. The challenge of modeling panel data with multiple timescales is commonly encountered when aggregating data from multiple studies, and the approach described could be easily extended to more general cases. Therefore, we expect this approach may be effective in modeling disease progression for many diseases.

We showed that, at the single variable level, actual subjects’ observed measurements are consistent with the distribution of their Digital Twins at different visits. At the population level, we showed that Digital Twins capture the leading statistical moments of the data, including means, standard deviations, correlations, and lagged autocorrelations of all variables. Finally, we have shown that logistic regression cannot distinguish actual subjects from their Digital Twins, with an accuracy consistent or nearly consistent with random chance. We also showed that the model accurately predicts the progression of ADAS-Cog11, CDR-SB, and MMSE, across multiple severity cohorts. Therefore, this generative model provides reliable forecasts for disease progression of the majority of patient characteristics that are of interest in AD clinical trials.

In comparison to a previous model used to forecast disease progression in subjects with AD presented by Fisher et al. [7], the model described here was based on a dataset with more than 3 times as many subjects, as well as additional clinical variables. In addition to providing more comprehensive forecasts, the new model provides more accurate forecasts – particularly for patients with early stage disease.

Generative models have a number of potential applications when applied to health data. For example, a generative model can be used to create datasets that have desired statistical properties but do not contain any private health information. This application could be useful for training discriminative models while protecting patient privacy or the intellectual property rights of data owners. Similarly, cohorts of Digital Subjects can be used to simulate clinical trials with different inclusion criteria to facilitate clinical trial design.

Conditional generative models likewise have a number of potential applications in clinical trials, and even in clinical care. For example, Digital Twins can be integrated into clinical trials as prognostic scores to increase their power or reduce the number of subjects required to achieve a desired power [19, 26]. Further into the future, conditional generative models that can create comprehensive forecasts of likely outcomes for individual patients could form the basis of clinical decision support systems.

Here, we have demonstrated that a particular type of generative model (i.e., CRBMs) can be used to accurately model disease progression for patients with MCI or AD. CRBMs are particularly useful for this task because they effectively model panel data, can easily handle binary, continuous, or ordinal data types, can impute missing data at train or test time, and can be used as a generative model (to create Digital Subjects) or as a conditional generative model (to create Digital Twins). Future work could explore the advantages and disadvantages of different types of generative models for these types of data.

V Data Availability

Certain data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For up-to-date information, see the ADNI website.

Certain data used in the preparation of this article were obtained from the Critical Path for Alzheimer’s Disease (CPAD) database. In 2008, Critical Path Institute, in collaboration with the Engelberg Center for Health Care Reform at the Brookings Institution, formed the Coalition Against Major Diseases (CAMD), which was then renamed to CPAD in 2018. The Coalition brings together patient groups, biopharmaceutical companies, and scientists from academia, the U.S. Food and Drug Administration (FDA), the European Medicines Agency (EMA), the National Institute of Neurological Disorders and Stroke (NINDS), and the National Institute on Aging (NIA). CPAD currently includes over 200 scientists, drug development and regulatory agency professionals, from member and non-member organizations. The data available in the CPAD database has been volunteered by CPAD member companies and non-member organizations.

VI Acknowledgements

Data collection and sharing for this project was funded in part by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

We would like to thank Pankaj Mehta and Alejandro Schuler for helpful comments while preparing the manuscript.


  • Ackley et al. [1985] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • Cummings et al. [2018] Jeffrey Cummings, Carl Reiber, and Parvesh Kumar. The price of progress: Funding and financing Alzheimer’s disease drug development. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 4:330–343, 2018.
  • Cummings et al. [2014] Jeffrey L Cummings, Travis Morstorf, and Kate Zhong. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimer’s research & therapy, 6(4):1–7, 2014.
  • Fischer and Igel [2012] Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. In Iberoamerican congress on pattern recognition, pages 14–36. Springer, 2012.
  • Fisher et al. [2018] Charles K Fisher, Aaron M Smith, and Jonathan R Walsh. Boltzmann encoded adversarial machines. arXiv preprint arXiv:1804.08682, 2018.
  • Fisher et al. [2019] Charles K Fisher, Aaron M Smith, and Jonathan R Walsh. Machine learning for comprehensive forecasting of Alzheimer’s disease progression. Scientific reports, 9(1):1–14, 2019.
  • Folstein et al. [1975] Marshal F Folstein, Susan E Folstein, and Paul R McHugh. “mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of psychiatric research, 12(3):189–198, 1975.
  • Hinton [2010] Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.
  • Hudson et al. [2018] Lynn D. Hudson, Rebecca D. Kush, Eileen Navarro Almario, Nathalie Seigneuret, Tammy Jackson, Barbara Jauregui, David Jordan, Ronald Fitzmartin, F. Liz Zhou, James K. Malone, Jose Galvez, and Lauren B. Becnel. Global Standards to Expedite Learning From Medical Research Data. Clinical and Translational Science, 11(4):342–344, July 2018. doi: 10.1111/cts.12556.
  • Hughes et al. [1982] CP Hughes, L Berg, WL Danziger, LA Coben, and RL Martin. A new clinical scale for the staging of dementia. The British journal of psychiatry, 140:566–572, 1982.
  • Kubick et al. [2007] Wayne R Kubick, Stephen Ruberg, and Edward Helton. Toward a comprehensive CDISC submission data standard. Drug information journal, 41(3):373–382, 2007.
  • LeCun et al. [2006] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
  • Mnih et al. [2012] Volodymyr Mnih, Hugo Larochelle, and Geoffrey E. Hinton. Conditional restricted Boltzmann machines for structured output prediction. arXiv preprint arXiv:1202.3748, 2012.
  • Mueller et al. [2005] Susanne G Mueller, Michael W Weiner, Leon J Thal, Ronald C Petersen, Clifford R Jack, William Jagust, John Q Trojanowski, Arthur W Toga, and Laurel Beckett. Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s & dementia: the journal of the Alzheimer’s Association, 1(1):55–66, 2005.
  • Neville et al. [2015] Jon Neville, Steve Kopko, Steve Broadbent, Enrique Avilés, Robert Stafford, Christine M Solinsky, Lisa J Bain, Martin Cisneroz, Klaus Romero, and Diane Stephenson. Development of a unified clinical trial database for Alzheimer’s disease. Alzheimer’s & dementia: the journal of the Alzheimer’s Association, 11(10):1212–1221, 2015.
  • Romero et al. [2009] K Romero, M De Mars, D Frank, M Anthony, J Neville, L Kirby, K Smith, and RL Woosley. The Coalition Against Major Diseases: developing tools for an integrated drug development process for Alzheimer’s and Parkinson’s diseases. Clinical Pharmacology & Therapeutics, 86(4):365–367, 2009.
  • Rosen et al. [1984] Wilma G Rosen, Richard C Mohs, and Kenneth L Davis. A new rating scale for Alzheimer’s disease. The American journal of psychiatry, 1984.
  • Schuler et al. [2020] Alejandro Schuler, David Walsh, Diana Hall, Jon Walsh, and Charles Fisher. Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score. arXiv preprint arXiv:2012.09935, 2020.
  • Smolensky [1987] Paul Smolensky. Information processing in dynamical systems foundations of harmony theory. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition Foundations, pages 194–281. MIT Press, 1987.
  • Sperling et al. [2011] Reisa A Sperling, Paul S Aisen, Laurel A Beckett, David A Bennett, Suzanne Craft, Anne M Fagan, Takeshi Iwatsubo, Clifford R Jack Jr, Jeffrey Kaye, Thomas J Montine, et al. Toward defining the preclinical stages of Alzheimer’s disease: Recommendations from the national institute on aging-Alzheimer’s association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s & dementia, 7(3):280–292, 2011.
  • Sutton and McCallum [2007] Charles Sutton and Andrew McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. In Proceedings of the 24th international conference on Machine learning, pages 863–870, 2007.
  • Taylor et al. [2007] Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. Modeling human motion using binary latent variables. In Advances in neural information processing systems, pages 1345–1352, 2007.
  • Theis et al. [2016] L Theis, A van den Oord, and M Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pages 1–10, 2016.
  • Tieleman [2008] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
  • [26] David Walsh, Alejandro Schuler, Diana Hall, Jonathan Walsh, and Charles Fisher. Using prognostic scores for more efficient randomized trial estimates through Bayesian linear regression. To appear.
  • Walsh et al. [2020] Jonathan R Walsh, Aaron M Smith, Yannick Pouliot, David Li-Bland, Anton Loukianov, and Charles K Fisher. Generating digital twins with Multiple Sclerosis using probabilistic neural networks. arXiv preprint arXiv:2002.02779, 2020.
  • Welling et al. [2004] Max Welling, Michal Rosen-Zvi, and Geoffrey E Hinton. Exponential family harmoniums with an application to information retrieval. Advances in neural information processing systems, 17:1481–1488, 2004.
  • Wickham et al. [2014] Hadley Wickham et al. Tidy data. Journal of Statistical Software, 59(10):1–23, 2014.
  • Wooldridge [2010] Jeffrey M Wooldridge. Econometric analysis of cross section and panel data. MIT press, 2010.
  • Yoon et al. [2018] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. Ganite: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations, 2018.

Appendix A Expanded Methods

A.1 Data

Name Group Category Type Longitudinal Present [%]
ADAS cancellation 3 Cognitive Ordinal Yes 35.5
Serious adverse events 3 Adverse events Ordinal Yes 58.9
Heart rate 3 Clinical Continuous Yes 98.5
Diastolic blood pressure 3 Clinical Continuous Yes 99.3
Systolic blood pressure 3 Clinical Continuous Yes 99.3
Weight 3 Clinical Continuous Yes 83.3
Alanine aminotransferase 3 Laboratory Continuous Yes 52.7
Alkaline phosphatase 3 Laboratory Continuous Yes 52.7
Aspartate aminotransferase 3 Laboratory Continuous Yes 52.7
Cholesterol 3 Laboratory Continuous Yes 41.8
Creatine kinase 3 Laboratory Continuous Yes 47.5
Creatinine 3 Laboratory Continuous Yes 52.7
Eosinophils 3 Laboratory Continuous Yes 42.4
Gamma-Glutamyl Transferase 3 Laboratory Continuous Yes 39.6
Glucose 3 Laboratory Continuous Yes 50.8
Hematocrit 3 Laboratory Continuous Yes 45.9
Hemoglobin 3 Laboratory Continuous Yes 52.6
Hemoglobin A1C 3 Laboratory Continuous Yes 36.7
Indirect Bilirubin 3 Laboratory Continuous Yes 36.4
Lymphocytes 3 Laboratory Continuous Yes 44.2
Monocytes 3 Laboratory Continuous Yes 44.2
Platelet 3 Laboratory Continuous Yes 52.4
Potassium 3 Laboratory Continuous Yes 50.7
Sodium 3 Laboratory Continuous Yes 46.1
Triglycerides 3 Laboratory Continuous Yes 35.4
Region Europe 3 Background Binary No 95.7
Region Northern America 3 Background Binary No 95.7
Region Other 3 Background Binary No 95.7
Table 3: Variables in the 3-month group, modeled by the 3-month component only. See the discussion in A.2 for a description of the 3-month and 6-month components.
Name Group Category Type Longitudinal Present [%]
ADAS commands 3,6 Cognitive Ordinal Yes 99.7
ADAS comprehension 3,6 Cognitive Ordinal Yes 99.7
ADAS construction 3,6 Cognitive Ordinal Yes 99.6
ADAS delayed word recall 3,6 Cognitive Ordinal Yes 60.2
ADAS ideational 3,6 Cognitive Ordinal Yes 99.6
ADAS naming 3,6 Cognitive Ordinal Yes 99.6
ADAS orientation 3,6 Cognitive Ordinal Yes 99.6
ADAS remember instructions 3,6 Cognitive Ordinal Yes 99.6
ADAS spoken language 3,6 Cognitive Ordinal Yes 99.7
ADAS word finding 3,6 Cognitive Ordinal Yes 99.7
ADAS word recall 3,6 Cognitive Ordinal Yes 99.6
ADAS word recognition 3,6 Cognitive Ordinal Yes 99.5
MMSE attention 3,6 Cognitive Ordinal Yes 60.8
MMSE language 3,6 Cognitive Ordinal Yes 57.9
MMSE orientation 3,6 Cognitive Ordinal Yes 60.8
MMSE recall 3,6 Cognitive Ordinal Yes 60.8
MMSE registration 3,6 Cognitive Ordinal Yes 60.8
CDR community 6 Cognitive Ordinal Yes 25.8
CDR home hobbies 6 Cognitive Ordinal Yes 25.8
CDR judgement 6 Cognitive Ordinal Yes 25.8
CDR memory 6 Cognitive Ordinal Yes 25.8
CDR orientation 6 Cognitive Ordinal Yes 25.8
CDR personal care 6 Cognitive Ordinal Yes 25.8
Sex Female 3,6 Background Binary No 100.0
Years of education 3,6 Background Ordinal No 23.0
Age 3,6 Background Continuous No 99.4
Height 3,6 Background Continuous No 80.0
AChEI/Memantine use 3,6 Background Binary No 100.0
History of hypertension 3,6 Background Binary No 99.8
History of type-2 diabetes 3,6 Background Binary No 99.8
Clinical trial 3,6 Background Binary No 100.0
Amyloid status 3,6 Background Binary No 10.3
Vitamin B12 3,6 Background Continuous No 50.0
ApoE 4 allele count 3,6 Background Ordinal No 57.1
CSF phosphorylated tau 181 3,6 Background Continuous No 2.3
CSF total tau 3,6 Background Continuous No 3.5
Table 4: Variables belonging to the 6-month group, modeled by the 6-month model. See the discussion in A.2 for a description of the 3-month and 6-month components.

Data for training and evaluating the model come from two different sources that provide complementary views of AD progression. One source is the C-Path Online Data Repository for Alzheimer’s Disease (CODR-AD), a database provided by the Critical Path for Alzheimer’s Disease (CPAD) consortium [17, 16], that consists of the control arms of 29 mostly mild-to-moderate AD clinical trials with more than 7000 subjects. The other source is the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [15], a collection of four long-running observational studies enrolling subjects across the AD spectrum since 2004, focusing primarily on MCI and cognitively normal subjects. These data have varying duration, visit interval, inclusion criteria, and measured features. Notably, the ADNI studies typically have a 6-month cadence and an extremely broad set of observations that include imaging, biomarker data, and important disease severity measures; in contrast, CODR-AD trials have a cadence of at most 3 months but a narrower set of measurements. The CODR-AD data are encoded according to the Study Data Tabulation Model (SDTM) format, a structured record format for clinical data designed to facilitate review by regulatory authorities [12, 10]. The ADNI data are encoded in a custom format that accommodates the diverse array of measurements made in the ADNI studies. These formats cannot be used for machine learning directly because they have complicated, non-unique feature encoding schemes and contain extraneous information. We reprocessed the databases to extract measured features and encoded them into a consistent wide-form ("tidy") tabular format [29].

Sixty-four features were selected for inclusion in the model as variables based on clinical significance determined by recommendations from subject-matter experts and on the availability of data. Tables 3 and 4 give basic properties of each of these variables. Of primary interest for clinical trial applications are the components of ADAS-Cog, CDR, and MMSE, which are often used as inclusion criteria or clinical endpoints in trials (see Appendix B for a summary of these endpoints). In addition, variables from the demographics, vital signs, medical history, questionnaires, laboratory measurements, adverse events, and biomarker domains were also included. Each was classified as a background (measured at baseline) or longitudinal (measured through time) variable and encoded as binary, ordinal, categorical or continuous. Additionally, two binary features were added, one indicating the baseline visit and another to denote whether the subject is enrolled in a clinical trial or an observational study.

Studies were selected for inclusion in the processed dataset based on three criteria: the study must record measurements of ADAS-Cog (it is one of the most commonly used endpoints in AD clinical trials and it is crucial that we include it in our model), it must have a visit cadence of no more than 6 months (including studies with a longer cadence would generate records with large missing portions, corresponding to the unrecorded visits), and it must contain a disease population (we aim to model the progression of subjects with AD or MCI). These criteria eliminate a number of CODR-AD studies that do not record ADAS-Cog data, the ADNI3 study that has a visit cadence of 12 months, and cognitively normal cohorts from the ADNI studies. To differentiate cohorts with different disease severity in ADNI, the four ADNI studies were divided into smaller studies of cognitively normal, MCI, and AD subjects.
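The three inclusion criteria can be expressed as a simple filter. The sketch below is illustrative only: the study records and field names (`records_adas_cog`, `visit_cadence_months`, `population`) are hypothetical and do not reflect the actual CODR-AD or ADNI metadata.

```python
def eligible(study):
    """Return True if a study meets the three inclusion criteria."""
    return (
        study["records_adas_cog"]                # ADAS-Cog must be measured
        and study["visit_cadence_months"] <= 6   # visits at most 6 months apart
        and study["population"] in {"MCI", "AD"} # disease population only
    )


# Hypothetical study metadata, for illustration only.
studies = [
    {"name": "trial_a", "records_adas_cog": True, "visit_cadence_months": 3, "population": "AD"},
    {"name": "trial_b", "records_adas_cog": False, "visit_cadence_months": 3, "population": "AD"},
    {"name": "obs_12m", "records_adas_cog": True, "visit_cadence_months": 12, "population": "MCI"},
    {"name": "obs_cn", "records_adas_cog": True, "visit_cadence_months": 6, "population": "CN"},
    {"name": "obs_mci", "records_adas_cog": True, "visit_cadence_months": 6, "population": "MCI"},
]

selected = [s["name"] for s in studies if eligible(s)]
```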

The resulting processed dataset consists of 6,919 subjects and 34,224 subject-visits split across 21 studies, with approximately 25% of subjects in MCI stage and the remainder with AD.

Before training, the dataset was split in a fixed ratio into training, validation, and test datasets, stratified by study. The test dataset was held out until the end of model development and used for all analyses shown in this work unless otherwise noted.
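A split stratified by study can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the seed, and in particular the default `fractions` are placeholders, since the ratio actually used is not given in the text above.

```python
import random
from collections import defaultdict


def stratified_split(subject_to_study, fractions=(0.5, 0.2, 0.3), seed=0):
    """Split subjects into train/validation/test within each study.

    `fractions` is a placeholder ratio chosen for illustration only.
    """
    rng = random.Random(seed)
    by_study = defaultdict(list)
    for subject, study in sorted(subject_to_study.items()):
        by_study[study].append(subject)

    split = {"train": [], "validation": [], "test": []}
    for study in sorted(by_study):
        subjects = by_study[study]
        rng.shuffle(subjects)  # randomize order within the study
        n_train = round(fractions[0] * len(subjects))
        n_val = round(fractions[1] * len(subjects))
        split["train"] += subjects[:n_train]
        split["validation"] += subjects[n_train:n_train + n_val]
        split["test"] += subjects[n_train + n_val:]  # remainder
    return split
```

Splitting within each study keeps every study represented in all three partitions, so that held-out performance is not dominated by the largest studies.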

A.2 Problem Statement

We denote the data as a set of vectors $\{\mathbf{x}_i\}_{i=1}^{N}$, where each measurement $\mathbf{x}_i$ corresponds to a clinical record of the $i$th subject, and the dataset contains $N$ subjects. Variables corresponding to the observed measurement are classified according to domain (continuous, binary, ordinal, categorical), which controls model encoding, and according to whether they represent a static or time-dependent (longitudinal) observation [27, 7]. In the following, we denote the static portion of a trajectory as $\mathbf{x}^{s}$ and the longitudinal portion as $\{\mathbf{x}_t\}_{t=1}^{T}$, with trajectory length $T$. Any combination of observations in $\mathbf{x}$ may be missing, and missing observations are encoded by a special value.

We aim to train a model $p(\mathbf{x})$ that represents the data distribution to support conditional generation of synthetic clinical records. To simplify this task, we assume a causal Markov structure for $p$ with a lag $k$,

$$p(\mathbf{x}) = p_0(\mathbf{x}^{s}, \mathbf{x}_1) \prod_{t=2}^{T} p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \ldots, \mathbf{x}_{\max(1,\,t-k)}, \mathbf{x}^{s}),$$

which naturally splits this model into a baseline component $p_0$ and an autoregressive component $p(\mathbf{x}_t \mid \cdot)$. Following Walsh et al. [27], we call a sample of a complete clinical record a Digital Subject; when conditioned on baseline measurements from an actual subject, $(\mathbf{x}^{s}, \mathbf{x}_1)$, this clinical record is called a Digital Twin.

The training data contains some variables that are measured at a 3-month cadence and others that are measured at a 6-month cadence. We exploit this feature of the dataset by training one model, $p_3$, for the 3-month variables $\mathbf{x}^{(3)}$ and another, $p_6$, for the 6-month variables $\mathbf{x}^{(6)}$. Note that certain variables such as the ADAS-Cog endpoints are modeled by both components. Suppressing the time-dependence for the moment, the full model assumes a hierarchical form,

$$p(\mathbf{x}) = p_6\big(\mathbf{x}^{(6 \setminus 3)} \mid \mathbf{x}^{(3 \cap 6)}\big)\, p_3\big(\mathbf{x}^{(3)}\big),$$

where $\mathbf{x}^{(3 \cap 6)}$ denotes the overlap variables that belong to both groups and $\mathbf{x}^{(6 \setminus 3)}$ denotes the variables that belong to the 6-month group only. Here the 6-month model explains variables that are observed only at 6-month intervals by conditioning on the overlap variables $\mathbf{x}^{(3 \cap 6)}$. The 3-month model generates the remainder of the variables with a 3-month cadence. This simplifying assumption corresponds to neglecting the contribution from unobserved 3-month measurements to a subject’s 6-month-only variables.

We require that the trained model supports the ability to generate Digital Twins. Using the hierarchical decomposition above, this involves the following steps:

  1. Baseline measurements are taken directly from actual subjects that are to be simulated.

  2. The 3-month model is used to generate a partial trajectory of variables in the 3-month group by autoregressive sampling using the Markov factorization above, conditioned on the baseline.

  3. The 6-month model is used to complete the partial trajectory by autoregressive sampling of the 6-month variables, conditioned on the 3-month variables and the baseline variables.

The component models must therefore support a variety of variable types, be tractable to train in the presence of missing data, and support conditional sequence generation.
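The three generation steps can be sketched as a two-stage sampling loop. In this minimal illustration the trained CRBMs are replaced by toy random-walk samplers; the function names, variable names (`adas_cog11`, `cdr_sb`), and the toy dynamics are all assumptions made for the example and are not part of the model described here.

```python
import random


def generate_digital_twin(baseline_3m, baseline_6m, sample_3m, sample_6m,
                          n_steps, lag=2):
    """Sketch of two-stage autoregressive sampling for one Digital Twin.

    sample_3m(history): hypothetical sampler for the next 3-month visit,
        given up to `lag` previous visits.
    sample_6m(previous, overlap): hypothetical sampler for the 6-month-only
        variables, given their previous values and the 3-month-group
        variables at the same visit.
    """
    # Step 1: baseline measurements come directly from the actual subject.
    traj_3m = [dict(baseline_3m)]

    # Step 2: autoregressive sampling on the 3-month grid.
    for _ in range(n_steps):
        traj_3m.append(sample_3m(traj_3m[-lag:]))

    # Step 3: complete the trajectory at every second visit (6-month grid).
    traj_6m = {0: dict(baseline_6m)}
    previous = traj_6m[0]
    for t in range(2, n_steps + 1, 2):
        previous = sample_6m(previous, traj_3m[t])
        traj_6m[t] = previous
    return traj_3m, traj_6m


# Toy stand-ins for the trained models (random walks, purely illustrative).
rng = random.Random(0)


def toy_sample_3m(history):
    return {"adas_cog11": history[-1]["adas_cog11"] + rng.gauss(0.5, 1.0)}


def toy_sample_6m(previous, overlap):
    drift = 0.05 * overlap["adas_cog11"]
    return {"cdr_sb": previous["cdr_sb"] + rng.gauss(drift, 0.5)}


twin_3m, twin_6m = generate_digital_twin(
    {"adas_cog11": 20.0}, {"cdr_sb": 4.0},
    toy_sample_3m, toy_sample_6m, n_steps=6,
)
```

Visit indices count 3-month steps, so index 2 corresponds to month 6. Repeating the call with the same baseline yields a different Digital Twin each time, which is how a distribution of outcomes for one subject would be obtained.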

We choose CRBMs to represent both the 3-month and 6-month models because they meet these requirements and have been shown to be effective in modeling disease progression in AD [7] and MS [27]. Appendix C contains a summary of CRBMs. For the 3-month model, a lag-2 CRBM with a cadence of 3 months is used. For the 6-month model, a lag-1 CRBM with a cadence of 6 months is used instead.

A.3 Training Scheme and Hyperparameter Sweep

We train the composite model implicitly by maximizing the performance of the 3-month and 6-month CRBMs independently. This is equivalent to maximizing a composite pseudolikelihood [22] approximation to the joint hierarchical model of Section A.2.

The 3-month component is a lag-2 model with a 3-month visit cadence that is trained on 3-month variables from the ADNI and CODR-AD datasets. Training samples for this model consist of the baseline variables and longitudinal variables of three consecutive visits with a spacing of 3 months. In the ADNI data (which have a visit cadence of 6 months), these samples are of two types: either the first and last visits are missing (type I), or the middle visit is missing (type II). Instead of relying on the data imputation capabilities of the CRBM, we trained an auxiliary lag-1 CRBM imputation model with a visit cadence of 3 months on the CODR-AD data, which is used to impute the type II samples in ADNI. Samples of type I, which would consist mostly of imputed data, were not used during training to avoid introducing too much bias from the simpler imputation model.
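The distinction between type I and type II samples follows directly from which of the three visits in a window fall on the observed grid. A minimal sketch, in which the function name and labels are hypothetical:

```python
def classify_window(start_month, observed_months, step=3):
    """Classify a window of three consecutive 3-month visits.

    observed_months: set of months at which the subject was actually seen.
    """
    months = [start_month + i * step for i in range(3)]
    present = tuple(m in observed_months for m in months)
    if present == (True, True, True):
        return "complete"
    if present == (False, True, False):
        return "type I"   # first and last visits missing
    if present == (True, False, True):
        return "type II"  # middle visit missing
    return "other"


# A subject observed every 6 months, as is typical in ADNI.
six_month_visits = {0, 6, 12, 18}
labels = [classify_window(m, six_month_visits) for m in (0, 3, 6, 9)]
```

With a 6-month cadence, windows starting at an observed visit are type II (one value to impute) and windows starting between visits are type I (two of three visits to impute), which is why only the former were retained for training.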

The 6-month component is a lag-1 model with a 6-month cadence that is trained on the 6-month variables from ADNI and a single study from CODR-AD (the one that records CDR). As this is a simpler model, no special training procedures were required.

The training procedure for each model is an elaboration of that described in Section IIC and Appendix C of Walsh et al. [27], which we summarize here. We trained the 3-month and 6-month models by performing a grid search over hyperparameters in combination with minibatch stochastic gradient descent using the ADAM optimizer to obtain the parameters. We did not optimize the hyperparameters of the imputer model. We report details of the grids used for each model in Table 5. For each grid point, we trained each model on the training portion of the dataset and evaluated it on the validation portion. We performed model selection with a two-step minimax procedure similar to the one outlined in Walsh et al. [27]. First, models from a grid search are ranked on the statistical and performance metrics of Table 6 and assigned a score equal to the worst rank. Models are then re-ranked from minimum to maximum score and the bottom 75% are rejected. Second, we applied the same minimax procedure to the remaining 25%, based on performance metrics only. Top-ranked models are selected as best and retrained on the joint training and validation splits. In Figure 7 and Figure 8 we show the distributions of selection metrics over all trained models and the values for the selected 3-month and 6-month models. We aimed to choose models that performed well across all metrics, even if they were not the best-performing on any single metric.

After the optimal 3-month and 6-month models were selected, each was retrained using its optimal hyperparameters on both the training and validation portions of the dataset.
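The two-step minimax selection can be sketched as follows. This is an illustrative reading of the procedure described above, not the authors' implementation; the function names, the tie-handling, and the toy metric values are assumptions.

```python
def rank(values, lower_is_better=True):
    """Rank values so that 1 is best; ties share the best available rank."""
    ordered = sorted(values, reverse=not lower_is_better)
    return [ordered.index(v) + 1 for v in values]


def minimax_scores(models, metrics):
    """Score each model by its worst rank over the given metrics.

    metrics: dict mapping metric name -> True if lower values are better.
    """
    ranks = [rank([m[name] for m in models], lower)
             for name, lower in metrics.items()]
    return [max(r[i] for r in ranks) for i in range(len(models))]


def select_model(models, statistical, performance, keep_fraction=0.25):
    """Two-step selection; returns the index of the chosen model."""
    # Step 1: worst rank over all metrics, keep the top fraction.
    scores = minimax_scores(models, {**statistical, **performance})
    order = sorted(range(len(models)), key=lambda i: scores[i])
    kept = order[:max(1, int(len(models) * keep_fraction))]
    # Step 2: repeat on the survivors using performance metrics only.
    final = minimax_scores([models[i] for i in kept], performance)
    return kept[min(range(len(kept)), key=lambda i: final[i])]


# Toy sweep results, for illustration only.
sweep = [
    {"r2": 0.90, "rms": 5.0}, {"r2": 0.95, "rms": 4.0},
    {"r2": 0.99, "rms": 9.0}, {"r2": 0.80, "rms": 3.0},
    {"r2": 0.94, "rms": 3.5}, {"r2": 0.70, "rms": 8.0},
    {"r2": 0.60, "rms": 7.0}, {"r2": 0.85, "rms": 6.0},
]
best = select_model(sweep, statistical={"r2": False}, performance={"rms": True})
```

In this toy sweep the selected model is not the best on either metric individually, which illustrates the stated aim of choosing models that perform well across all metrics.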

Hyperparameter 3-month 6-month Imputer
Batch size
Number of epochs 1000 1000 1000
Learning rate
Beta std
Weight penalty 0.001 0.001 0.001
Monte Carlo steps 25 25 25
Adversary weight
Number of hidden units
Table 5: Hyperparameter grids used in model selection. The values selected by the minimax procedure are denoted in bold.
Model: 3-month
Statistical metrics: $R^2$ of 3-month autocorrelations; $R^2$ of 6-month autocorrelations; $R^2$ of 9-month autocorrelations
Performance metrics: RMS ADAS-Cog11 progression at 6 months; RMS ADAS-Cog11 progression at 12 months; RMS ADAS-Cog11 progression at 18 months
Model: 6-month
Statistical metrics: $R^2$ of 6-month autocorrelations; $R^2$ of 12-month autocorrelations
Performance metrics: RMS ADAS-Cog11 progression at 6 months; RMS ADAS-Cog11 progression at 12 months; RMS ADAS-Cog11 progression at 18 months; RMS CDR-SB progression at 6 months; RMS CDR-SB progression at 12 months; RMS CDR-SB progression at 18 months
Table 6: Selection metrics for the 3-month and 6-month models. Statistical metrics measure how well the model reproduces correlations across variables, while performance metrics measure how well the model predicts the progression for key endpoints, i.e., ADAS-Cog11 and CDR-SB. $R^2$ is the coefficient of determination between data and predicted correlations or lagged autocorrelations. RMS is the root-mean-square error. We apply a two-step minimax selection process. First, models are ranked across statistical and performance metrics and assigned a score equal to the worst rank. Models are then re-ranked from minimum to maximum score and the top 25% are selected. Second, we apply the same minimax procedure to the pre-selected models, based on performance metrics only.

A.4 Model Evaluation Methods

In this section we provide details on the model evaluation described in Section III.1 and summarized in Figures 2, 3, and 4.

In Figure 2, for each variable and each visit, we compare the values from a given subject to the distribution obtained from 100 Digital Twins of that subject. First, for a given variable $j$, for a subject $i$, and for visit $t$, we compute the p-value of the observation $x_{ijt}$ under the Digital Twins distribution,

$$p_{ijt} = \hat{F}_{ijt}(x_{ijt}),$$

where $\hat{F}_{ijt}$ is the Digital Twins empirical cumulative distribution. If observed values are consistent with the model distributions, one expects a uniform p-value distribution. Instead of testing this hypothesis directly, we define a derived statistic that is more interpretable. From a p-value we calculate a statistic from the inverse normal distribution,

$$z_{ijt} = \Phi^{-1}(p_{ijt}),$$

where $\Phi$ is the standard normal cumulative distribution function. The values $z_{ijt}$ are expected to be normally distributed across subjects with zero mean and unit standard deviation if the observations are consistent with the model distributions. Means above (below) 0 indicate a biased model distribution with lower (higher) mean than the mean of the data. Variances smaller (larger) than 1 indicate a model variance larger (smaller) than the variance of the data. In Figure 2 we plot the mean and standard deviation of $z_{ijt}$ across subjects for all variables and visits. We compute a 1-sample Kolmogorov-Smirnov test comparing the distribution of $z_{ijt}$ across subjects with the standard normal distribution $\mathcal{N}(0, 1)$. The significance of the test is adjusted for multiple comparisons with a Bonferroni correction. The significance level is rescaled by the total number of comparisons, which is given by the number of variables multiplied by the number of visits. This leads to a corrected significance level of $\alpha / (n_{\text{variables}}\, n_{\text{visits}})$, where $\alpha$ is the uncorrected level.
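The chain from Digital Twin samples to a p-value, a z-statistic, and a Kolmogorov-Smirnov statistic can be sketched with the Python standard library. The function names are hypothetical, and the clipping of the p-value away from 0 and 1 is an assumption added here so that the inverse normal CDF stays finite; the text does not say how such boundary cases were handled.

```python
from statistics import NormalDist


def twin_z_score(observed, twin_samples):
    """Map an observation to a z-value via the twins' empirical CDF."""
    n = len(twin_samples)
    p = sum(s <= observed for s in twin_samples) / n
    # Assumption: clip p so that inv_cdf is defined at the boundaries.
    p = min(max(p, 0.5 / n), 1 - 0.5 / n)
    return NormalDist().inv_cdf(p)


def ks_statistic(z_values):
    """One-sample Kolmogorov-Smirnov statistic against N(0, 1)."""
    zs = sorted(z_values)
    n = len(zs)
    cdf = NormalDist().cdf
    return max(
        max((i + 1) / n - cdf(z), cdf(z) - i / n) for i, z in enumerate(zs)
    )


def bonferroni_level(alpha, n_variables, n_visits):
    """Significance level corrected for variables x visits comparisons."""
    return alpha / (n_variables * n_visits)
```

An observation lying above most of its twins gives a positive z-value, so a positive mean of z across subjects indicates that the model distribution sits below the data, in line with the interpretation given above.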

In Figure 3 we compare summary statistics from test subjects and their Digital Twins. Let $x^{(a)}_{j,t}$, for $a = \text{data}$ or $a = \text{DT}$, indicate the value of variable $j$ at visit $t$ for either test subjects or their Digital Twins, respectively. We measure means $\mu^{(a)}_{j,t}$ for each variable at each visit, standard deviations $\sigma^{(a)}_{j,t}$ for each variable at each visit, and correlations over all times, $\rho^{(a)}_{jk}(\tau) = \mathrm{corr}\big(x^{(a)}_{j,t}, x^{(a)}_{k,t+\tau}\big)$, for $\tau = 0, 3, 6, 9$ months, which correspond to equal-time correlations and 3-, 6-, and 9-month lagged autocorrelations, respectively. Each statistic is calculated over test subjects and their corresponding Digital Twins, where a single Digital Twin is assigned to each test subject. We compute a regression between the values obtained from test subjects and the values obtained from Digital Twins, weighted by the fraction of data present for each particular point. Means and standard deviations span multiple orders of magnitude and ordinary least squares regression is particularly sensitive to outliers, so we use Theil-Sen regression. For correlations and autocorrelations we use ordinary least squares regression and we also report the corresponding coefficient of determination.
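The robustness of Theil-Sen regression to outliers comes from taking medians of pairwise slopes rather than minimizing squared error. A minimal unweighted sketch (the analysis above weights points by the fraction of data present, which is omitted here for brevity; the function name is hypothetical):

```python
from statistics import median


def theil_sen(x, y):
    """Unweighted Theil-Sen fit: median pairwise slope and median intercept."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if x[j] != x[i]
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Four points on the line y = x plus one gross outlier.
slope, intercept = theil_sen([1, 2, 3, 4, 5], [1, 2, 3, 4, 50])
```

Here the single outlier leaves the fitted line unchanged, whereas an ordinary least squares fit would be pulled strongly toward it.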

In Figure 4 we train a classifier to distinguish actual subjects from Digital Twins, based on the complete set of longitudinal variables. We compare subjects with their Digital Twins at each visit, and we also consider a different classifier where the features are the differences of variables between two consecutive visits. Missing data is a potential source of bias, since actual subjects have missing values and their Digital Twins do not. We remove the potential bias from missing data by mean-imputing all missing values and by assigning the same imputed value to the corresponding Digital Twins. Classifiers are evaluated using 5-fold cross validation. For each fold, a model is trained on 4 folds and evaluated on the remaining one. Performance is evaluated as the average over the 5 folds. We generate 100 Digital Twins for each subject, and a separate classifier model is trained and evaluated for each set of twins. In the top panel of Figure 4, for any given visit, we report the average performance and the corresponding 95% confidence interval error over the 100 trials. In the bottom panel, we report boxplots for the distribution of the logistic regression coefficients. The distribution includes all weights estimated from the 100 Digital Twins, from the 6 visits, and from the 5 folds of cross validation. Weights are related to the relative importance of features, where a weight equal to 0 corresponds to an irrelevant feature.
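The imputation step that removes the missingness signal can be sketched as below: wherever an actual subject has a missing value, both the subject and the corresponding Digital Twin receive the same column mean, so the classifier cannot use the presence of missing data as a cue. The function name and the use of `None` for missing entries are assumptions for this example.

```python
def impute_pairwise(actual_rows, twin_rows):
    """Mean-impute actual data and copy the imputed values to the twins.

    actual_rows: list of feature lists, with None marking missing entries.
    twin_rows: list of feature lists for the matching Digital Twins.
    """
    n_cols = len(actual_rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in actual_rows if row[j] is not None]
        col_means.append(sum(observed) / len(observed) if observed else 0.0)

    actual_out, twin_out = [], []
    for a_row, t_row in zip(actual_rows, twin_rows):
        a_new, t_new = [], []
        for j, (a, t) in enumerate(zip(a_row, t_row)):
            if a is None:
                # Same imputed value for the subject and the twin.
                a_new.append(col_means[j])
                t_new.append(col_means[j])
            else:
                a_new.append(a)
                t_new.append(t)
        actual_out.append(a_new)
        twin_out.append(t_new)
    return actual_out, twin_out
```

The imputed rows would then be labeled (actual versus twin) and passed to a logistic regression evaluated with 5-fold cross validation, as described above.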

Appendix B ADAS-Cog, CDR, and MMSE variables

The ADAS-Cog test is widely used in clinical trials to assess cognitive decline in subjects with Alzheimer’s Disease. ADAS-Cog most commonly includes between 11 and 14 tasks, which are scored independently and then combined into composite scores (e.g., ADAS-Cog11 or ADAS-Cog14). The tasks include both subject-completed tests and observer-based assessments, and evaluate the domains of memory, language, and praxis.

ADAS-Cog is less suited to detecting changes at milder stages of dementia, and more sensitive tests have been introduced for this purpose. A widely used test is CDR. It employs an interview format to collect detailed information on the subject’s ability to function in various domains. It consists of 6 components, which are commonly combined into the composite CDR Sum-of-Boxes (CDR-SB). CDR-SB is the gold standard for staging dementia in subjects who eventually develop AD, and this is reflected in the early-stage AD trials that have been run in the last decade.

A commonly used quick assessment of cognitive impairment is the MMSE questionnaire. MMSE has been a staple for inclusion and exclusion criteria in clinical trials for over 20 years and is used in nearly 100% of trials for treatments from early to late-stage AD.

Figure 7: Sweep distributions for the 3-month model. The gray histograms represent the distributions of selection metrics (as reported in Table 6) over the hyperparameter sweep. The model selected from this sweep is shown as a dashed line at its value for each metric. For each metric, an arrow indicates whether smaller or larger values are better.
Figure 8: Sweep distributions for the 6-month model. Same as Figure 7 but for the 6-month model.

Appendix C Summary of CRBMs

Restricted Boltzmann Machines [1, 20, 28] are a well-known type of generative latent-variable model that possesses a number of qualities that make it suitable for modeling clinical data. An RBM is defined by an energy function $E(\mathbf{v}, \mathbf{h})$ [13] that measures the compatibility between a set of observed variables $\mathbf{v}$ and a set of latent variables $\mathbf{h}$ introduced to model the complex dependencies in the data. The data distribution is then the marginal of a joint distribution defined by this energy function,

$$p(\mathbf{v}) = \int d\mathbf{h}\, p(\mathbf{v}, \mathbf{h}), \qquad p(\mathbf{v}, \mathbf{h}) = Z^{-1} e^{-E(\mathbf{v}, \mathbf{h})},$$

with $Z$ the normalizing factor. This model makes few assumptions on the domain of the variables $\mathbf{v}$ and $\mathbf{h}$, which allows flexible modeling of the data by choosing an appropriate energy function. By construction, RBMs can efficiently generate samples conditioned on observed data through the use of MCMC methods, and can therefore be used for data imputation. They are also efficient to train, and naturally support training in the presence of missing observations. See recent reviews [5, 9] for methods of training and sampling from RBMs.
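As a concrete illustration of the marginal and of imputation by conditioning, the sketch below builds a tiny RBM with binary units and evaluates everything by exact enumeration. Both choices are simplifying assumptions made for illustration: the models in this paper are not restricted to binary units, and realistic models rely on MCMC sampling rather than enumeration. The parameter values are hypothetical.

```python
from itertools import product
from math import exp


def energy(v, h, a, b, W):
    """Energy of a binary RBM: E(v, h) = -a.v - b.h - v^T W h."""
    e = -sum(ai * vi for ai, vi in zip(a, v))
    e -= sum(bm * hm for bm, hm in zip(b, h))
    e -= sum(W[i][m] * v[i] * h[m]
             for i in range(len(v)) for m in range(len(h)))
    return e


def marginal(a, b, W):
    """Return {v: p(v)} by exact enumeration (feasible only for tiny models)."""
    n_v, n_h = len(a), len(b)
    unnorm = {
        v: sum(exp(-energy(v, h, a, b, W))
               for h in product((0, 1), repeat=n_h))
        for v in product((0, 1), repeat=n_v)
    }
    z = sum(unnorm.values())  # normalizing factor
    return {v: w / z for v, w in unnorm.items()}


def impute(p_v, observed):
    """Distribution over complete v given observed entries {index: value}."""
    match = {v: p for v, p in p_v.items()
             if all(v[i] == x for i, x in observed.items())}
    total = sum(match.values())
    return {v: p / total for v, p in match.items()}


# Hypothetical parameters: 3 visible units, 2 hidden units.
a = [0.1, -0.2, 0.0]
b = [0.3, -0.1]
W = [[1.0, -0.5], [0.5, 0.5], [-1.0, 0.2]]

p_v = marginal(a, b, W)
p_given = impute(p_v, {0: 1})  # first variable observed as 1, others missing
```

Here `p_given` is the distribution over the two missing variables once the first has been observed, which is the quantity a conditional sampler would draw from.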

A Conditional Restricted Boltzmann Machine (CRBM) is an extension of an RBM to support sequence data in the form of the Markovian factorization (Eq. eq:markovian-factorization). In the original development [23, 14], the RBM was extended to model the conditional distribution of later observations given earlier ones. Although sufficient for autoregressively extending sequences into the future, such a model cannot handle missing data in the conditioning set during training, and is hard to use for imputing backwards in time. We use instead the CRBM architecture of Fisher et al. [7], which differs from that in [23, 14] because it is bidirectional and it allows for some of the visible variables (such as sex or race) to be time-independent. To make the model bidirectional, we define a single-lag joint distribution that models correlations between consecutive longitudinal measurements and the baseline variables as an RBM. This leads to a joint probability distribution over two consecutive sets of longitudinal measurements, the time-independent variables, and the latent variables, defined, as for an RBM, by an energy function.

In this energy function, each variable of the visible and latent layers has bias parameters determined by the choice of a bias function, as well as a scale parameter. The connection between the layers is parameterized by weight matrices. Understood as a conventional RBM, our CRBM contains the visible units for multiple time points coupled to a standard hidden layer. Clinical trajectories may be obtained from a CRBM model by sampling according to the Markov decomposition (Eq. eq:markovian-factorization). This is efficient because the joint RBM model is easy to sample conditionally.
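For orientation, the description above is consistent with an energy function of the following general shape. This is a sketch under assumed notation, not necessarily the exact expression used for the model: the symbols $\mathbf{x}_{t+\tau}$, $\mathbf{x}_t$, and $\mathbf{x}_s$ (two consecutive visits and the time-independent variables), the lag $\tau$, the block index $k$, and the names $a$, $b$, $\sigma$, $\epsilon$, and $W$ for bias functions, scales, and weights are introduced here for illustration, and the way the scale parameters enter follows a common RBM parameterization.

```latex
p(\mathbf{x}_{t+\tau}, \mathbf{x}_t, \mathbf{x}_s, \mathbf{h})
  = Z^{-1} \exp\!\big[-E(\mathbf{x}_{t+\tau}, \mathbf{x}_t, \mathbf{x}_s, \mathbf{h})\big],
\qquad
E = \sum_{k \in \{t+\tau,\, t,\, s\}} \sum_i a^{(k)}_i\!\big(x_{k,i}\big)
  + \sum_\mu b_\mu(h_\mu)
  - \sum_{k \in \{t+\tau,\, t,\, s\}} \sum_{i,\mu}
      \frac{x_{k,i}}{\big(\sigma^{(k)}_i\big)^2}\, W^{(k)}_{i\mu}\, \frac{h_\mu}{\epsilon_\mu^2}
```

Read as a conventional RBM, the three visible blocks are simply concatenated into one visible layer, with one weight matrix per block coupling it to the shared hidden layer.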

To optimize the parameters of a CRBM, we replace the maximum likelihood objective (Eq. eq:markovian-factorization) with a piecewise pseudolikelihood approximation [22]. This yields a piecewise loss in log space that is averaged over adjacent windows of trajectories, which we call shingles. This objective is then optimized with the persistent contrastive divergence algorithm.
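The idea of averaging a loss over shingles can be sketched as follows. This is a minimal illustration, assuming a shingle spans two adjacent visits; the per-window loss used here is a placeholder and does not reproduce the pseudolikelihood term of the model. All names and numbers are hypothetical.

```python
def shingles(trajectory, width=2):
    """Return all windows of `width` adjacent visits from a trajectory."""
    return [
        tuple(trajectory[i:i + width])
        for i in range(len(trajectory) - width + 1)
    ]


def mean_shingle_loss(trajectories, window_loss, width=2):
    """Average a per-window loss over every shingle of every trajectory."""
    windows = [w for traj in trajectories for w in shingles(traj, width)]
    return sum(window_loss(w) for w in windows) / len(windows)


# Hypothetical visits at 0, 3, 6, and 9 months; one score per visit.
visits = [20.0, 21.5, 23.0, 26.0]
pairs = shingles(visits)

# Placeholder loss (squared change between the two visits of a window),
# standing in for the model's per-window pseudolikelihood term.
avg = mean_shingle_loss([visits], lambda w: (w[1] - w[0]) ** 2)
```

Because each window is scored on its own, a trajectory with many visits simply contributes more windows to the average.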


Training directly with this objective can result in poor sample quality. This is because maximum likelihood objectives allow the model to generate arbitrarily poor out-of-distribution data [24, 6]. We mitigate this problem by augmenting the loss with a weighted adversarial (contrastive) loss term that acts as a regularizer [6] (see Chen et al. [2] for a similar application in computer vision). We call the resulting training objective BEAM; the weight of the adversarial term is a regularization strength hyperparameter. Although this procedure is superficially similar to generative adversarial network (GAN) training, we emphasize that the BEAM objective and the model properties of CRBMs are fundamentally different from those of GANs. Indeed, we treat the weight of the regularization term as a hyperparameter which can be set to zero, whereas a GAN cannot be trained without an adversary.

The computational cost of the BEAM procedure is similar to conventional RBM training and allows reusing existing RBM implementations. We refer the reader to Section IIC of Walsh et al. [27] for details on how CRBMs are trained on sequence data under the Markov assumption, and to Fisher et al. [6] for a motivation for the BEAM objective.

Appendix D Additional Results on Clinical Endpoints Progression

In Table 7 we report results from a linear regression of the progression of test subjects against the model predictions for various endpoints. We report slope, intercept, and Pearson correlation for ADAS-Cog11, MMSE, and CDR-SB for two representative visits at 12 months and 18 months from baseline. High correlations support the conclusion that our model also predicts progression well at the subject level.

Score progression   Slope         Intercept      Pearson correlation
12 months
  ADAS-Cog11        0.70 (0.05)   0.68 (0.22)    0.36
  MMSE              0.72 (0.06)   -0.27 (0.15)   0.41
  CDR-SB            0.42 (0.08)   0.45 (0.10)    0.26
18 months
  ADAS-Cog11        0.91 (0.06)   0.46 (0.40)    0.49
  MMSE              0.79 (0.08)   -0.46 (0.27)   0.45
  CDR-SB            0.76 (0.13)   0.31 (0.23)    0.47
Table 7: Fit parameters for subject-level score progression predictions. We fit a linear regression where subject progressions are regressed on the model predictions, and report the slope and intercept with their standard errors in parentheses. The model prediction for each subject is averaged over 100 Digital Twins generated for that subject. We also report the Pearson correlation coefficient between test subject progressions and the predicted progressions.
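The fit summarized in Table 7 can be sketched as follows. This is a minimal illustration with hypothetical numbers and only three twins per subject; it computes the slope, intercept, and Pearson correlation, and omits the standard errors.

```python
from math import sqrt


def fit_line(pred, obs):
    """Regress observed progression on predicted progression.

    Returns (slope, intercept, pearson_r).
    """
    n = len(pred)
    mp, mo = sum(pred) / n, sum(obs) / n
    spp = sum((p - mp) ** 2 for p in pred)
    soo = sum((o - mo) ** 2 for o in obs)
    spo = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    slope = spo / spp
    return slope, mo - slope * mp, spo / sqrt(spp * soo)


# Hypothetical change-from-baseline scores: four subjects, three twins each.
twin_progressions = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0],
]
observed = [1.0, 5.0, 5.0, 9.0]

# The prediction for each subject is the mean over that subject's twins.
predicted = [sum(t) / len(t) for t in twin_progressions]
slope, intercept, r = fit_line(predicted, observed)
```

A slope near 1 and an intercept near 0 would indicate that the twin-averaged prediction is neither compressed nor offset relative to the observed progression.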

Appendix E Additional Results on Marginal Distributions

In Tables 8 and 9 we calculate means and standard deviations of marginal distributions for all longitudinal variables and compare test data with the model of this paper and the model of Fisher et al. [7]. We also report significant deviations of either model from the data, using a t-test to compare means and Levene’s test to compare standard deviations. We report 3 representative visits, but the test is performed for all visits from 3 months to 18 months in steps of 3 months, and significance is corrected for multiple comparisons for both models independently. We observe a larger number of significant deviations for MMSE marginals in the model of Fisher et al., showing that the model described in this paper improves predictions of MMSE. Both models show consistency for the remaining variables.
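The multiple-comparison step can be sketched as follows, assuming a Bonferroni correction in which the significance level is divided by the number of comparisons. The p-values below are hypothetical, and the choice of which comparisons are pooled into one family is an assumption of this sketch.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that remain significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical p-values for one variable at the 6 visits from 3 to 18 months.
p_by_visit = [0.004, 0.030, 0.200, 0.0005, 0.049, 0.600]
flags = bonferroni_significant(p_by_visit)
```

With six comparisons the per-test threshold drops to roughly 0.0083, so values such as 0.030 or 0.049 that would pass an uncorrected 0.05 level are no longer flagged.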

Table 8: Moments of marginal distributions (cognitive tests). For each variable and visit we show the mean (first row) and standard deviation (second row) of the marginal distributions. For each row, the first entry is from test subject data, the second entry is from the model presented in this paper, and the third entry is from the model of Fisher et al. [7]. We compare model means to data with a t-test, and standard deviations with Levene’s test, and report p-values in parentheses (0.00 indicates values < 0.005). Bold values are significant after a Bonferroni correction, with significance level 0.05.
Table 9: Moments of marginal distributions (laboratory measurements). Same as Table 8 for laboratory measurements.