I Introduction
The development of computational models that comprehensively and accurately forecast a patient’s prognosis under the current standard-of-care has the potential to revolutionize medical research. Recall that a typical clinical trial compares the safety and efficacy of an investigational therapy to an established therapy, often in combination with a dummy treatment (i.e., a placebo). Therefore, generative models trained to represent disease progression under existing treatments can be used to generate synthetic clinical records, which we call ‘Digital Subjects’, for use in clinical trial design or as external control groups [27, 7]. Similarly, conditional generative models can be used to generate distributions of potential control outcomes for individual patients [27, 7, 31, 26], which we call ‘Digital Twins’, that can be incorporated into clinical trials to increase statistical power or reduce required sample sizes [19, 26].
Alzheimer’s Disease (AD) is one indication with a particularly dire need for new technologies that could increase the probability of finding an effective treatment. Despite decades of research, there is still no approved disease-modifying treatment for AD, and it often takes years to enroll enough subjects to achieve sufficient statistical power for a new clinical trial [4, 3]. Therefore, the ability to decrease required sample sizes by augmenting AD clinical trials with data from generative models would make it possible to search for new AD treatments more quickly.

During a typical AD clinical trial, each subject reports to a clinical site at regular intervals (e.g., every 3 or 6 months) and is assessed with a variety of metrics that measure cognitive function and overall health. Usually, cognitive function is assessed through composite questionnaires such as the Alzheimer’s Disease Assessment Scale – Cognitive Subscale (ADAS-Cog) [18], the Clinical Dementia Rating (CDR) [11], and the Mini-Mental State Exam (MMSE) [8]. In addition, vital signs, blood tests, and biomarkers derived from magnetic resonance imaging (MRI) or positron emission tomography (PET) may be measured. That is, data collected in clinical trials typically correspond to a type of panel data [30] – a multidimensional array that describes how every subject’s health data change over the course of the trial.
Previously, Fisher et al. [7] demonstrated that a type of generative model called a Conditional Restricted Boltzmann Machine (CRBM) can be trained using an adversary to generate panel data representative of the control groups of AD clinical trials. In addition, they showed that CRBMs can also be used as conditional generative models to forecast potential outcomes for individual subjects. Here, we aim to extend this work by incorporating additional training data that include patient populations with earlier-stage disease, additional cognitive measures such as CDR, and additional biomarkers. These improvements are crucial for addressing the needs of a changing AD clinical trials landscape that is increasingly focused on earlier stages of the disease, including subjects with Mild Cognitive Impairment (MCI) [21].
Most clinical trials focus on relatively homogeneous populations, and trials in AD are no exception. Few individual studies span the spectrum from MCI to moderate AD. In addition, an average clinical trial in AD includes only around 400 subjects [4, 3]. Therefore, it is necessary to integrate data from multiple interventional and observational studies in order to assemble a sufficiently broad dataset. Indeed, we use clinical records from nearly 7,000 subjects – covering close to 35,000 subject-visits – integrated from 21 different studies. Unfortunately, these studies did not all collect the same measurements or have subjects visit clinical sites with the same frequency. As a result, integrating these different studies solves the sample size problem but creates a missing data problem. For example, the components of ADAS-Cog were typically measured at 3-month intervals, whereas the components of CDR were typically measured at 6-month intervals. To overcome this problem, we introduce a new approach that combines two CRBMs, each trained to represent a specific timescale, into a single generative model for creating Digital Subjects and Digital Twins for AD.
Using analyses on a held-out test set, we demonstrate that a generative model composed of two CRBMs accurately simulates subjects with MCI and AD under the current standard-of-care. That is, statistical properties computed from Digital Twins agree with statistical properties computed from actual subjects in the control arms of previous AD clinical trials, which is a necessary condition for using Digital Twins to increase the statistical power of future AD trials. Moreover, we demonstrate that the updated model represents a substantial improvement over previously published work, particularly for subjects with MCI.
The manuscript is organized as follows. Section II provides a brief description of our dataset, modeling approach, and how we test the model and compare it to previous work. Section III presents various comparisons of Digital Twins generated by the model to actual subject records in order to assess the quality of the generative model. Finally, Section IV discusses implications of these results and future directions for research. Appendices A, B, and C give additional details on the methods used, and Appendices D and E give additional results beyond those discussed in the main text.
II Methods
II.1 Data
Previously, Fisher et al. [7] used a dataset with 1,909 subjects covering 44 clinical variables to train and evaluate a generative model for AD clinical trajectories. Here, we improve on this previously published work by training a generative model using a much larger dataset composed of 6,919 subjects covering 64 clinical variables across a broader set of domains, obtained by integrating data from 21 different studies.
Data were obtained from the CPath Online Data Repository for Alzheimer’s Disease (CODR-AD) [17, 16] and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [15]. The former corresponds to data from the control arms of previously completed AD clinical trials, whereas the latter is a large observational study of MCI and AD subjects. These two datasets differ in a number of ways. Data from CODR-AD are primarily from mild to moderate AD studies, with a visit interval of at most 3 months. While CODR-AD data are rich in ADAS-Cog and MMSE measurements as well as safety data such as laboratory tests, they lack some measurements prevalent in many current AD trials, particularly CDR and certain biomarker data. ADNI data have a visit interval of 6 months or more and contain data for subjects across the entire disease spectrum, with a substantial fraction of data for MCI subjects (ADNI also includes cognitively normal cohorts, which are not used in our datasets). ADNI includes a more diverse set of measurements, with both CDR and biomarker data but without, for example, laboratory tests.
We combined these two sources and built an integrated dataset with a more complete set of variables than either individual dataset can provide. This integrated dataset comprises panel data for 6,919 subjects with 34,224 subject-visits, in which approximately 25% of subjects have MCI and the remainder have AD. Subject-matter experts were consulted to determine variables of clinical significance and, based on data availability, 64 variables were included. Most of the variables were measured at regular 3- or 6-month intervals. A few variables measured only at the first (baseline) visit were treated as background variables, i.e., properties of the clinical state of subjects at the start of the study. The variables are listed in Table 1. Additional details on the variables and the construction of the dataset are provided in Appendix A.1.
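For concreteness, the panel-data layout described above can be sketched as a long-format table with one row per subject-visit; the subject IDs, variable names, and values below are purely illustrative (pandas is assumed):

```python
import pandas as pd

# Hypothetical long-format panel: one row per subject-visit.
# Background variables (e.g. baseline age) are constant within a subject;
# time-dependent variables are recorded at 3- or 6-month visit intervals.
panel = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "month":      [0, 3, 6, 0, 6],
    "baseline_age": [74, 74, 74, 68, 68],            # background variable
    "adas_word_recall": [4.0, 5.0, 5.0, 2.0, None],  # 3-month variable
    "cdr_memory": [1.0, None, 1.0, 0.5, 0.5],        # 6-month variable
})

# Background variables can be separated out once per subject...
background = panel.groupby("subject_id")[["baseline_age"]].first()

# ...while time-dependent variables remain a (subject, visit) array,
# with NaN marking visits at which a variable was not measured.
longitudinal = panel.set_index(["subject_id", "month"]).drop(columns="baseline_age")
print(background.shape, longitudinal.shape)
```

This separation mirrors how the model treats the two kinds of variables: background variables condition the trajectory, while longitudinal variables are what the model forecasts.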
Before training, the dataset was split into training, validation, and test datasets, stratified by study. The test dataset was held out until the end of model development and used for all analyses shown in this work unless otherwise noted.
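A split of this kind can be sketched with scikit-learn's `train_test_split` applied twice, stratifying on study labels. The exact split ratio is not stated here, so the 60/20/20 ratio and the synthetic study labels below are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
subjects = np.arange(n)
study = rng.integers(0, 21, size=n)  # 21 source studies (illustrative labels)

# First split off the held-out test set, then split the remainder into
# training and validation sets; both splits are stratified by study.
trainval, test = train_test_split(
    subjects, test_size=0.2, stratify=study, random_state=0)
train, val = train_test_split(
    trainval, test_size=0.25, stratify=study[trainval], random_state=0)

assert len(train) + len(val) + len(test) == n
```

Stratifying on the study label keeps each study's visit schedule and variable coverage represented in all three splits, which matters when the studies differ as much as they do here.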
Variable Name  Group  Variable Name  Group 
ADAS cancellation  3  Alanine aminotransferase  3 
ADAS commands  3, 6  Alkaline phosphatase  3 
ADAS comprehension  3, 6  Aspartate aminotransferase  3 
ADAS construction  3, 6  Cholesterol  3 
ADAS delayed word recall  3, 6  Creatine kinase  3 
ADAS ideational  3, 6  Creatinine  3 
ADAS naming  3, 6  Eosinophils  3 
ADAS orientation  3, 6  Gamma-glutamyl transferase  3 
ADAS remember instructions  3, 6  Glucose  3 
ADAS spoken language  3, 6  Hematocrit  3 
ADAS word finding  3, 6  Hemoglobin  3 
ADAS word recall  3, 6  Hemoglobin A1C  3 
ADAS word recognition  3, 6  Indirect Bilirubin  3 
MMSE attention  3, 6  Lymphocytes  3 
MMSE language  3, 6  Monocytes  3 
MMSE orientation  3, 6  Platelet  3 
MMSE recall  3, 6  Potassium  3 
MMSE registration  3, 6  Sodium  3 
CDR community  6  Triglycerides  3 
CDR home & hobbies  6  Heart rate  3 
CDR judgement  6  Diastolic blood pressure  3 
CDR memory  6  Systolic blood pressure  3 
CDR orientation  6  Weight  3 
CDR personal care  6  Serious adverse events  3 
Baseline Age  3, 6  AChEI/Memantine use  3, 6 
Sex  3, 6  History of hypertension  3, 6 
Years of education  3, 6  History of type-2 diabetes  3, 6 
ApoE ε4 allele count  3, 6  Amyloid status  3, 6 
Region (Europe)  3  CSF phosphorylated tau 181  3, 6 
Region (Northern America)  3  CSF total tau  3, 6 
Region (Other)  3  Vitamin B12  3, 6 
Height  3, 6  RCT or Observational Study  3, 6 
II.2 Modeling Approach
Modeling the integrated panel dataset presents two challenges: variables are observed with two characteristic timescales (3 months and 6 months), and only a subset of the features are ever observed simultaneously.
There are at least three ways to solve the multiple-timescale problem. First, one could downsample all of the data to a 6-month visit interval. However, this makes the data unsuitable for many clinical trial applications, which typically record observations every 3 months. Alternatively, one could directly train a model with a 3-month visit frequency on the combined dataset. Training such a model is extremely challenging because some key variables, such as the components of CDR, would be missing in a large fraction of observations. Therefore, we adopt a third approach in which we explicitly construct a generative model with two timescales.
We handle the two-timescale character of the data explicitly by training a composite model consisting of two CRBMs of the form of Walsh et al. [27]: 3-month and 6-month CRBMs that model variables with those respective time intervals in the panel data. Variables available at 3-month intervals are modeled by the 3-month CRBM, and variables only available at 6-month intervals, as well as key variables available at 3-month intervals, are modeled by the 6-month CRBM. Hence, there is a set of variables included in both CRBMs. In our case, these shared variables are components of the ADAS-Cog and MMSE assessments as well as background variables. The components of CDR are modeled by the 6-month CRBM alone. Table 2 gives the number of subjects from each dataset used to train the 3-month and 6-month CRBMs. A summary of CRBM models is presented in Appendix C.
Dataset  Number of subjects (3-month CRBM)  Number of subjects (6-month CRBM) 
CODR-AD  5331  208 
ADNI  1588  1588 
The 3-month and 6-month CRBMs are naturally integrated into a single model. Each can be trained separately, but when they are used to generate Digital Subjects or Digital Twins, the 3-month CRBM is used first, followed by the 6-month CRBM. Any shared variables are generated by the 3-month CRBM and conditioned upon (held fixed) when generating data from the 6-month CRBM. This structure is natural, as the 3-month CRBM predicts a given subject’s clinical trajectory for all variables except CDR. Subsequently, key variables generated by the 3-month CRBM are used by the 6-month CRBM to better inform the predictions of CDR on a 6-month timescale.
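The two-stage sampling scheme can be sketched as follows. The CRBM samplers themselves are stood in for by hypothetical placeholder functions (the real models are described in Appendix C); only the conditioning flow between the two timescales is illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_3month_crbm(baseline):
    """Placeholder for the 3-month CRBM: generates the shared variables
    (e.g. ADAS-Cog and MMSE components) at months 3, 6, 9, and 12."""
    return {"adas": baseline["adas"] + rng.normal(0, 1, size=4),
            "mmse": baseline["mmse"] + rng.normal(0, 1, size=4)}

def sample_6month_crbm(baseline, shared):
    """Placeholder for the 6-month CRBM: generates CDR components at months
    6 and 12, conditioned on the shared variables generated in step 1."""
    adas_6mo = shared["adas"][1::2]  # keep the 6- and 12-month visits only
    return {"cdr": baseline["cdr"] + 0.1 * (adas_6mo - baseline["adas"])}

baseline = {"adas": 20.0, "mmse": 24.0, "cdr": 1.0}
shared = sample_3month_crbm(baseline)                       # step 1: 3-month model
twin = {**shared, **sample_6month_crbm(baseline, shared)}   # step 2: 6-month model,
                                                            # shared variables held fixed
```

The essential point is that the shared variables are sampled exactly once, by the 3-month model, and then treated as observed inputs by the 6-month model.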
The 3-month and 6-month CRBMs are trained independently, each following the hyperparameter selection procedure presented in Walsh et al. [27] and reviewed here. Using the training component of the dataset, a set of CRBMs is trained via stochastic gradient descent, each with different hyperparameters. The validation dataset is used to compute metrics that judge model performance, and optimal hyperparameters are selected. Finally, each CRBM is retrained using these optimal hyperparameters on both the training and validation components of the dataset. This results in 3-month and 6-month CRBMs that can be integrated into the final model.
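This train/validate/retrain loop is a standard grid-search pattern. A minimal sketch, with scikit-learn's Ridge standing in for a CRBM and mean squared error standing in for the validation metrics (both are illustrative substitutions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=300)
X_tr, y_tr = X[:200], y[:200]   # training split
X_va, y_va = X[200:], y[200:]   # validation split

# Train one model per hyperparameter setting on the training split...
candidates = {alpha: Ridge(alpha=alpha).fit(X_tr, y_tr)
              for alpha in [0.01, 0.1, 1.0, 10.0]}

# ...score each on the validation split and keep the best setting...
best_alpha = min(candidates,
                 key=lambda a: mean_squared_error(y_va, candidates[a].predict(X_va)))

# ...then retrain with the winning hyperparameters on train + validation.
final_model = Ridge(alpha=best_alpha).fit(X, y)
```

The final refit on train plus validation is the step that distinguishes this procedure from plain validation-set selection: it uses all non-test data for the model that is ultimately evaluated.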
II.3 Model Evaluation
The goal of our modeling approach is to generate Digital Twins whose panel data are well-calibrated, meaning they accurately predict the probability distribution of actual subject data. This is practically useful because it provides an unbiased prediction of subject outcomes along with an unbiased estimate of their variability. Digital Twins go beyond simple predictions of mean outcomes – for each subject, a set of Digital Twins predicts the distribution of possible outcomes. Well-calibrated Digital Twins can be used to reliably predict subject-level or study-level outcomes and their expected variability.
We use the test dataset, which the constituent CRBMs have not seen, and compare test subjects to their Digital Twins created from the baseline data. A number of different approaches are used to compare these two groups of data through variable-level, subject-level, and population-level measures. As most outcomes of interest for clinical applications are linear combinations of the modeled variables, such as the ADAS-Cog total score, MMSE total score, or CDR Sum-of-Boxes (CDR-SB) score, we are particularly interested in performance in modeling linear combinations of variables.
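Because these composite endpoints are linear combinations of modeled components, evaluating them reduces to evaluating weighted sums. A minimal sketch with illustrative component values (for ADAS-Cog-style totals the weights are all 1):

```python
import numpy as np

# Rows: subjects. Columns: three illustrative cognitive-score components.
components = np.array([[4.0, 2.0, 1.0],
                       [6.0, 3.0, 2.0]])

# A composite score is a linear combination of components; for the totals
# discussed here the weights are all 1 (a plain sum), so calibration on
# linear combinations of variables covers the composite scores directly.
weights = np.ones(components.shape[1])
total = components @ weights
print(total)  # → [ 7. 11.]
```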
III Results
III.1 Evaluating the Performance of the Model in Generating Digital Twins
We evaluate the ability of the trained conditional generative model to create well-calibrated forecasts for variables in three ways. First, for each variable, we ask if observed values tend to fall in the bulk of the distribution predicted by the model, conditioned on the observed data at time zero (the baseline data). Next, we ask if the population average means, variances, and pairwise (auto)correlations predicted by the model are consistent with those estimated from the data. Finally, we train a linear classifier to distinguish between actual subject records and Digital Twins sampled from the model. All analyses are performed on the held-out test dataset not used by the model during training. Additional details on the methods used in this section can be found in Appendix A.4, and additional results on the statistical agreement between moments and correlations in disease progression measures for actual control subjects and their Digital Twins are presented in Appendices D and E.

Our first analysis focuses on the calibration of the time-dependent marginal distributions, conditioned on observed baseline data for each subject. We generate 100 Digital Twins for each subject and, for each variable at each visit for each subject, verify the consistency of the distribution over these Digital Twins with the observed value from the actual subject. In particular, for each variable at each visit, we compute the probability p that a Digital Twin sampled from the conditional generative model has a larger value of that variable than what was observed for a given subject. Then, we normalize this ‘p-value’ by computing the transformation z = Φ⁻¹(p), in which Φ is the cumulative distribution function of the standard normal distribution. Finally, we perform a statistical test to determine if the distribution of z is consistent with a standard normal distribution, N(0, 1), which would be expected if the Digital Twins and actual subject data are drawn from the same distribution. Figure 2 shows the mean and standard deviation of z for all variables and time points. For the vast majority of variables, differences between observed and expected distributions are not statistically significant after correcting for multiple testing. The only differences of note are that the variance for weight predicted by the model is too large, and the mean values for hemoglobin A1C at 15 and 18 months and ADAS cancellation at 15 months are slightly biased. Therefore, the time-dependent marginal distributions are generally well-calibrated.

Next, we examine the ability of the conditional generative model to forecast the population-level statistics that are most important for linear combinations of variables such as the ADAS-Cog, CDR-SB, or MMSE composite scores. We compare population averaged means, variances, pairwise correlations, and the 3-, 6-, and 9-month lagged pairwise autocorrelations computed from Digital Twins and actual subjects. Specifically, we sampled a single Digital Twin for each subject in the test dataset to create a Digital Twin cohort. Then, we computed the above-mentioned statistics from the Digital Twin cohort and from the test dataset. Figure 3 shows the values of these statistics from the test dataset plotted against the corresponding values computed from the Digital Twin cohort, with the goodness-of-fit assessed across statistics of the same type using linear regressions. In the regressions, points receive a weight proportional to the fraction of data present when computing that statistic to account for the impact of missing data. Theil-Sen regression is used to quantify goodness-of-fit for the means and standard deviations to mitigate outlier dependence due to different units (and scales) across variables. We find that all slopes are close to 1 and all intercepts are close to 0, illustrating that our model captures the leading statistical moments that are relevant for linear combinations of variables.
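The probit-based calibration check described earlier in this section can be sketched directly: estimate p as the fraction of Digital Twins exceeding the observed value, transform z = Φ⁻¹(p), and test whether z is consistent with N(0, 1). The simulated data below stand in for real twins and observations (scipy is assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_twins = 200, 100

# Simulated setting in which the twins and the observations are drawn from
# the same distribution, so the calibration check should pass.
observed = rng.normal(size=n_subjects)
twins = rng.normal(size=(n_subjects, n_twins))  # 100 Digital Twins per subject

# p per subject: fraction of twins exceeding the observed value.
# A small continuity correction keeps p away from exactly 0 or 1.
p = (np.sum(twins > observed[:, None], axis=1) + 0.5) / (n_twins + 1.0)

# Probit transform: if twins and observations share a distribution,
# z should follow a standard normal distribution N(0, 1).
z = stats.norm.ppf(p)

# One possible consistency test: Kolmogorov-Smirnov against N(0, 1).
stat, pval = stats.kstest(z, "norm")
print(f"KS p-value: {pval:.3f}")  # a large p-value is consistent with N(0, 1)
```

A miscalibrated model shows up as a shifted mean of z (bias) or a standard deviation away from 1 (over- or under-dispersion), which is exactly what the per-variable means and standard deviations in Figure 2 summarize.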
Finally, we test the ability of a linear classifier to distinguish between actual subjects and their Digital Twins. In a sense, this directly tests whether Digital Twins are statistically indistinguishable from actual subjects, and is closely related to the adversarial methods used while training the model. For each time point, we train a logistic regression to distinguish between each subject and their Digital Twins. In addition, we train a logistic regression to distinguish between actual subjects and their Digital Twins using the difference in panel data between two consecutive time points. Figure 4 shows that the linear classifiers only obtain accuracies that are consistent, or nearly consistent, with random chance. This demonstrates that individual subjects are statistically indistinguishable from their Digital Twins using linear combinations of variables.

Figure 4: We train a logistic regression model to differentiate actual subjects from their Digital Twins. Top panel: we compare two models, one in which we use data at each visit, and one in which we use differences between two consecutive visits. On the right, we report the accuracy of both models. An accuracy of 0.5 is consistent with random chance. Error bars indicate 95% confidence intervals. Bottom panel: boxplots of the distribution of logistic regression coefficients for the single-visit model, where a weight equal to 0 corresponds to an irrelevant feature.
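The linear-classifier test can be sketched as follows: label actual records and twin records, fit a logistic regression, and check that cross-validated accuracy is near 0.5. Here both groups are simulated from the same distribution, mimicking the null case the paper tests for:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 500, 10

# Simulated actual-subject and Digital Twin records drawn from the
# same distribution (the indistinguishable case).
actual = rng.normal(size=(n, d))
twins = rng.normal(size=(n, d))

X = np.vstack([actual, twins])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = actual, 1 = twin

# Cross-validated accuracy near 0.5 means the linear classifier
# cannot separate the two groups.
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="accuracy").mean()
print(round(acc, 3))
```

If the twins were systematically biased in any linear combination of variables, the classifier's accuracy would rise above chance, which is why this test complements the per-variable calibration checks.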
III.2 Progression of Common Clinical Endpoints
The components of the cognitive questionnaires are particularly important for clinical trial applications because ADAS-Cog, CDR, and MMSE are frequently used as inclusion criteria and endpoints in AD clinical trials. We model 13 components of the ADAS-Cog score, 6 components of the CDR score, and 5 components of the MMSE score, which are listed in Tables 4 and 3 (see Appendix B for more detail). We compared the mean change-from-baseline in ADAS-Cog11, CDR-SB, and MMSE predicted from a Digital Twin cohort sampled from the conditional generative model to the corresponding changes estimated from the test dataset. We present the results stratified by disease severity at baseline. Figure 5 shows that we consistently predict the mean progression across all time points, scores, and cohorts, indicating that the conditional generative model is well-calibrated for these important clinical outcomes. Further, Appendix D shows that predicted changes-from-baseline for these scores in individual subjects are highly correlated with the observed changes-from-baseline for actual subjects. Taken together, these results demonstrate that the model can be used to accurately forecast the prognosis of subjects with MCI and AD.
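Mean change-from-baseline curves of this kind are straightforward to compute from panel data; the subjects, cohorts, and scores below are illustrative (pandas is assumed):

```python
import pandas as pd

# Illustrative panel: a total score per subject per visit, plus the
# subject's baseline severity cohort.
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "month":   [0, 6, 12, 0, 6, 12],
    "cohort":  ["MCI"] * 3 + ["AD"] * 3,
    "adas_total": [12.0, 13.0, 15.0, 24.0, 27.0, 31.0],
})

# Change from baseline: subtract each subject's month-0 value.
baseline = df[df["month"] == 0].set_index("subject")["adas_total"]
df["change"] = df["adas_total"] - df["subject"].map(baseline)

# Mean change per cohort per visit: the quantity compared between
# Digital Twin cohorts and actual subjects.
mean_change = df.groupby(["cohort", "month"])["change"].mean()
print(mean_change)
```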
III.3 Comparison to an Earlier AD Progression Model
Previously, Fisher et al. [7] developed a machine learning model to simulate AD progression using a smaller dataset than that used in this manuscript. In addition, the prior model did not include the components of CDR or other key biomarker and safety data, and mostly modeled subjects with mild to moderate AD. In this paper, we combined multiple data sources to create a larger and more comprehensive dataset. In addition, a much larger fraction of subjects in the new dataset had MCI, which resulted in a dramatic improvement in predictions for this cohort of subjects. We compare the predictions for ADAS-Cog11 progression from the two models on AD and MCI cohorts in Figure 6. In addition, Tables 8 and 9 in Appendix E compare the mean and standard deviation of the marginal distributions of each variable computed from the two models to those computed from the test dataset. The performance of the current model on subjects with AD is comparable to the previous model, but it substantially outperforms the previous model for subjects with MCI.
IV Discussion
A typical clinical trial compares the safety and efficacy of an investigational therapy to a placebo and standard-of-care. Therefore, there are many uses for models that can generate accurate forecasts of disease progression under standard-of-care, ranging from clinical trial design to statistical analysis plans that directly incorporate prognostic forecasts to improve statistical power [19, 26]. Here, we built on previous work by training a generative model to forecast disease progression for subjects with AD, aiming to obtain well-calibrated forecasts across the spectrum of severity from early to severe disease.
A subject’s prognosis can be forecast by using the model to generate Digital Twins – longitudinal trajectories sampled from a multivariate probability distribution that describes how that subject’s characteristics are likely to change after baseline. Each Digital Twin has 16 background variables that are only measured at baseline, and 48 time-dependent variables that are measured at 3- or 6-month intervals. In total, these 64 variables include cognitive questionnaires such as ADAS-Cog, CDR, and MMSE that are commonly used to assess disease progression in clinical trials, as well as lab tests and other biomarkers.
Modeling this comprehensive set of variables required integrating two large datasets of AD clinical trials and observational studies. However, the different studies in the integrated dataset typically observed patients at either 3- or 6-month intervals, and they did not always measure the same set of variables. Therefore, we introduced a new architecture comprising two Conditional Restricted Boltzmann Machines (CRBMs), one to represent the 3-month timescale and the other to represent the 6-month timescale. The two CRBMs can be combined to generate trajectories that contain both the variables measured at 3-month intervals and those measured at 6-month intervals. The challenge of modeling panel data with multiple timescales is commonly encountered when aggregating data from multiple studies, and the approach described here could easily be extended to more general cases. Therefore, we expect this approach may be effective in modeling disease progression for many diseases.
We showed that, at the single-variable level, actual subjects’ observed measurements are consistent with the distributions of their Digital Twins at different visits. At the population level, we showed that Digital Twins capture the leading statistical moments of the data, including means, standard deviations, correlations, and lagged autocorrelations of all variables. Finally, we showed that logistic regression cannot distinguish actual subjects from their Digital Twins, achieving an accuracy consistent, or nearly consistent, with random chance. We also showed that the model accurately predicts the progression of ADAS-Cog11, CDR-SB, and MMSE across multiple severity cohorts. Therefore, this generative model provides reliable forecasts for disease progression of the majority of patient characteristics that are of interest in AD clinical trials.
In comparison to a previous model used to forecast disease progression in subjects with AD presented by Fisher et al. [7], the model described here was trained on a dataset with more than 3 times as many subjects, as well as additional clinical variables. In addition to providing more comprehensive forecasts, the new model provides more accurate forecasts – particularly for patients with early-stage disease.
Generative models have a number of potential applications when applied to health data. For example, a generative model can be used to create datasets that have desired statistical properties but do not contain any private health information. This application could be useful for training discriminative models while protecting patient privacy or the intellectual property rights of data owners. Similarly, cohorts of Digital Subjects can be used to simulate clinical trials with different inclusion criteria to facilitate clinical trial design.
Similarly, conditional generative models have a number of potential applications in clinical trials, and even clinical care. For example, Digital Twins can be integrated into clinical trials as prognostic scores to increase their power or reduce the number of subjects required to achieve a desired power [19, 26]. Further into the future, conditional generative models that can create comprehensive forecasts of likely outcomes for individual patients could form the basis of clinical decision support systems.
Here, we have demonstrated that a particular type of generative model (i.e., CRBMs) can be used to accurately model disease progression for patients with MCI or AD. CRBMs are particularly useful for this task because they effectively model panel data, can easily handle binary, continuous, or ordinal data types, can impute missing data at train or test time, and can be used as a generative model (to create Digital Subjects) or as a conditional generative model (to create Digital Twins). Future work could explore the advantages and disadvantages of different types of generative models for these types of data.
V Data Availability
Certain data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For up-to-date information, see www.adni-info.org.
Certain data used in the preparation of this article were obtained from the Critical Path for Alzheimer’s Disease (CPAD) database. In 2008, Critical Path Institute, in collaboration with the Engelberg Center for Health Care Reform at the Brookings Institution, formed the Coalition Against Major Diseases (CAMD), which was renamed CPAD in 2018. The Coalition brings together patient groups, biopharmaceutical companies, and scientists from academia, the U.S. Food and Drug Administration (FDA), the European Medicines Agency (EMA), the National Institute of Neurological Disorders and Stroke (NINDS), and the National Institute on Aging (NIA). CPAD currently includes over 200 scientists and drug development and regulatory agency professionals from member and non-member organizations. The data available in the CPAD database have been volunteered by CPAD member companies and non-member organizations.
VI Acknowledgements
Data collection and sharing for this project was funded in part by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
We would like to thank Pankaj Mehta and Alejandro Schuler for helpful comments while preparing the manuscript.
References
 Ackley et al. [1985] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
 Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
 Cummings et al. [2018] Jeffrey Cummings, Carl Reiber, and Parvesh Kumar. The price of progress: Funding and financing Alzheimer’s disease drug development. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 4:330–343, 2018.
 Cummings et al. [2014] Jeffrey L Cummings, Travis Morstorf, and Kate Zhong. Alzheimer’s disease drugdevelopment pipeline: few candidates, frequent failures. Alzheimer’s research & therapy, 6(4):1–7, 2014.

 Fischer and Igel [2012] Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. In Iberoamerican congress on pattern recognition, pages 14–36. Springer, 2012.
 Fisher et al. [2018] Charles K Fisher, Aaron M Smith, and Jonathan R Walsh. Boltzmann encoded adversarial machines. arXiv preprint arXiv:1804.08682, 2018.
 Fisher et al. [2019] Charles K Fisher, Aaron M Smith, and Jonathan R Walsh. Machine learning for comprehensive forecasting of Alzheimer’s disease progression. Scientific reports, 9(1):1–14, 2019.
 Folstein et al. [1975] Marshal F Folstein, Susan E Folstein, and Paul R McHugh. “minimental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of psychiatric research, 12(3):189–198, 1975.
 Hinton [2010] Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.
 Hudson et al. [2018] Lynn D. Hudson, Rebecca D. Kush, Eileen Navarro Almario, Nathalie Seigneuret, Tammy Jackson, Barbara Jauregui, David Jordan, Ronald Fitzmartin, F. Liz Zhou, James K. Malone, Jose Galvez, and Lauren B. Becnel. Global Standards to Expedite Learning From Medical Research Data. Clinical and Translational Science, 11(4):342–344, July 2018. doi: 10.1111/cts.12556. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6039196/.
 Hughes et al. [1982] CP Hughes, L Berg, WL Danziger, LA Coben, and RL Martin. A new clinical scale for the staging of dementia. The British journal of psychiatry, 140:566–572, 1982.
 Kubick et al. [2007] Wayne R Kubick, Stephen Ruberg, and Edward Helton. Toward a comprehensive CDISC submission data standard. Drug information journal, 41(3):373–382, 2007.
 LeCun et al. [2006] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
 Mnih et al. [2012] Volodymyr Mnih, Hugo Larochelle, and Geoffrey E. Hinton. Conditional restricted Boltzmann machines for structured output prediction. arXiv preprint arXiv:1202.3748, 2012.
 Mueller et al. [2005] Susanne G Mueller, Michael W Weiner, Leon J Thal, Ronald C Petersen, Clifford R Jack, William Jagust, John Q Trojanowski, Arthur W Toga, and Laurel Beckett. Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s & dementia: the journal of the Alzheimer’s Association, 1(1):55–66, 2005.
 Neville et al. [2015] Jon Neville, Steve Kopko, Steve Broadbent, Enrique Avilés, Robert Stafford, Christine M Solinsky, Lisa J Bain, Martin Cisneroz, Klaus Romero, and Diane Stephenson. Development of a unified clinical trial database for Alzheimer’s disease. Alzheimer’s & dementia: the journal of the Alzheimer’s Association, 11(10):1212–1221, 2015.
 Romero et al. [2009] K Romero, M De Mars, D Frank, M Anthony, J Neville, L Kirby, K Smith, and RL Woosley. The Coalition Against Major Diseases: developing tools for an integrated drug development process for Alzheimer’s and Parkinson’s diseases. Clinical Pharmacology & Therapeutics, 86(4):365–367, 2009.
 Rosen et al. [1984] Wilma G Rosen, Richard C Mohs, and Kenneth L Davis. A new rating scale for Alzheimer’s disease. The American journal of psychiatry, 1984.
 Schuler et al. [2020] Alejandro Schuler, David Walsh, Diana Hall, Jon Walsh, and Charles Fisher. Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score. arXiv preprint arXiv:2012.09935, 2020.
 Smolensky [1987] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 194–281. MIT Press, 1987.
 Sperling et al. [2011] Reisa A Sperling, Paul S Aisen, Laurel A Beckett, David A Bennett, Suzanne Craft, Anne M Fagan, Takeshi Iwatsubo, Clifford R Jack Jr, Jeffrey Kaye, Thomas J Montine, et al. Toward defining the preclinical stages of Alzheimer’s disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s & dementia, 7(3):280–292, 2011.
 Sutton and McCallum [2007] Charles Sutton and Andrew McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. In Proceedings of the 24th international conference on Machine learning, pages 863–870, 2007.
 Taylor et al. [2007] Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. Modeling human motion using binary latent variables. In Advances in neural information processing systems, pages 1345–1352, 2007.
 Theis et al. [2016] L Theis, A van den Oord, and M Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pages 1–10, 2016.
 Tieleman [2008] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
 Walsh et al. [to appear] David Walsh, Alejandro Schuler, Diana Hall, Jonathan Walsh, and Charles Fisher. Using prognostic scores for more efficient randomized trial estimates through Bayesian linear regression. To appear.
 Walsh et al. [2020] Jonathan R Walsh, Aaron M Smith, Yannick Pouliot, David LiBland, Anton Loukianov, and Charles K Fisher. Generating digital twins with Multiple Sclerosis using probabilistic neural networks. arXiv preprint arXiv:2002.02779, 2020.
 Welling et al. [2004] Max Welling, Michal Rosen-Zvi, and Geoffrey E Hinton. Exponential family harmoniums with an application to information retrieval. Advances in neural information processing systems, 17:1481–1488, 2004.
 Wickham et al. [2014] Hadley Wickham et al. Tidy data. Journal of Statistical Software, 59(10):1–23, 2014.
 Wooldridge [2010] Jeffrey M Wooldridge. Econometric analysis of cross section and panel data. MIT press, 2010.
 Yoon et al. [2018] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. Ganite: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations, 2018.
Appendix A Expanded Methods
A.1 Data
Name  Group  Category  Type  Longitudinal  Present [%] 
ADAS cancellation  3  Cognitive  Ordinal  Yes  35.5 
Serious adverse events  3  Adverse events  Ordinal  Yes  58.9 
Heart rate  3  Clinical  Continuous  Yes  98.5 
Diastolic blood pressure  3  Clinical  Continuous  Yes  99.3 
Systolic blood pressure  3  Clinical  Continuous  Yes  99.3 
Weight  3  Clinical  Continuous  Yes  83.3 
Alanine aminotransferase  3  Laboratory  Continuous  Yes  52.7 
Alkaline phosphatase  3  Laboratory  Continuous  Yes  52.7 
Aspartate aminotransferase  3  Laboratory  Continuous  Yes  52.7 
Cholesterol  3  Laboratory  Continuous  Yes  41.8 
Creatine kinase  3  Laboratory  Continuous  Yes  47.5 
Creatinine  3  Laboratory  Continuous  Yes  52.7 
Eosinophils  3  Laboratory  Continuous  Yes  42.4 
Gamma-glutamyl transferase  3  Laboratory  Continuous  Yes  39.6 
Glucose  3  Laboratory  Continuous  Yes  50.8 
Hematocrit  3  Laboratory  Continuous  Yes  45.9 
Hemoglobin  3  Laboratory  Continuous  Yes  52.6 
Hemoglobin A1C  3  Laboratory  Continuous  Yes  36.7 
Indirect Bilirubin  3  Laboratory  Continuous  Yes  36.4 
Lymphocytes  3  Laboratory  Continuous  Yes  44.2 
Monocytes  3  Laboratory  Continuous  Yes  44.2 
Platelet  3  Laboratory  Continuous  Yes  52.4 
Potassium  3  Laboratory  Continuous  Yes  50.7 
Sodium  3  Laboratory  Continuous  Yes  46.1 
Triglycerides  3  Laboratory  Continuous  Yes  35.4 
Region Europe  3  Background  Binary  No  95.7 
Region Northern America  3  Background  Binary  No  95.7 
Region Other  3  Background  Binary  No  95.7 
Name  Group  Category  Type  Longitudinal  Present [%] 
ADAS commands  3,6  Cognitive  Ordinal  Yes  99.7 
ADAS comprehension  3,6  Cognitive  Ordinal  Yes  99.7 
ADAS construction  3,6  Cognitive  Ordinal  Yes  99.6 
ADAS delayed word recall  3,6  Cognitive  Ordinal  Yes  60.2 
ADAS ideational  3,6  Cognitive  Ordinal  Yes  99.6 
ADAS naming  3,6  Cognitive  Ordinal  Yes  99.6 
ADAS orientation  3,6  Cognitive  Ordinal  Yes  99.6 
ADAS remember instructions  3,6  Cognitive  Ordinal  Yes  99.6 
ADAS spoken language  3,6  Cognitive  Ordinal  Yes  99.7 
ADAS word finding  3,6  Cognitive  Ordinal  Yes  99.7 
ADAS word recall  3,6  Cognitive  Ordinal  Yes  99.6 
ADAS word recognition  3,6  Cognitive  Ordinal  Yes  99.5 
MMSE attention  3,6  Cognitive  Ordinal  Yes  60.8 
MMSE language  3,6  Cognitive  Ordinal  Yes  57.9 
MMSE orientation  3,6  Cognitive  Ordinal  Yes  60.8 
MMSE recall  3,6  Cognitive  Ordinal  Yes  60.8 
MMSE registration  3,6  Cognitive  Ordinal  Yes  60.8 
CDR community  6  Cognitive  Ordinal  Yes  25.8 
CDR home hobbies  6  Cognitive  Ordinal  Yes  25.8 
CDR judgement  6  Cognitive  Ordinal  Yes  25.8 
CDR memory  6  Cognitive  Ordinal  Yes  25.8 
CDR orientation  6  Cognitive  Ordinal  Yes  25.8 
CDR personal care  6  Cognitive  Ordinal  Yes  25.8 
Sex Female  3,6  Background  Binary  No  100.0 
Years of education  3,6  Background  Ordinal  No  23.0 
Age  3,6  Background  Continuous  No  99.4 
Height  3,6  Background  Continuous  No  80.0 
AChEI/Memantine use  3,6  Background  Binary  No  100.0 
History of hypertension  3,6  Background  Binary  No  99.8 
History of type 2 diabetes  3,6  Background  Binary  No  99.8 
Clinical trial  3,6  Background  Binary  No  100.0 
Amyloid status  3,6  Background  Binary  No  10.3 
Vitamin B12  3,6  Background  Continuous  No  50.0 
ApoE ε4 allele count  3,6  Background  Ordinal  No  57.1 
CSF phosphorylated tau 181  3,6  Background  Continuous  No  2.3 
CSF total tau  3,6  Background  Continuous  No  3.5 
Data for training and evaluating the model come from two different sources that provide complementary views of AD disease progression. One source is the C-Path Online Data Repository for Alzheimer’s Disease (CODR-AD), a database provided by the Critical Path for Alzheimer’s Disease (CPAD) consortium [17, 16], which consists of the control arms of 29 mostly mild-to-moderate AD clinical trials with more than 7000 subjects. The other source is the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [15], a collection of four long-running observational studies enrolling subjects across the AD disease spectrum since 2004, focusing primarily on MCI and cognitively normal subjects. These data have varying durations, visit intervals, inclusion criteria, and measured features. Notably, the ADNI studies typically have a 6-month cadence and an extremely broad set of observations that includes imaging, biomarker data, and important disease severity measures; in contrast, CODR-AD trials have a cadence of at most 3 months but a narrower set of measurements. The CODR-AD data are encoded according to the Study Data Tabulation Model (SDTM) format, a structured record format for clinical data designed to facilitate review by regulatory authorities [12, 10]. The ADNI data are encoded in a custom format that accommodates the diverse array of measurements made in the ADNI studies. Neither format can be used for machine learning directly, because both have complicated, non-unique feature encoding schemes and contain extraneous information. We therefore reprocessed the databases to extract the measured features and encoded them in a consistent wide-form ("tidy") tabular format [29].
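As a concrete illustration of the last step, the reshaping from long-form clinical records to a wide ("tidy") table can be sketched with pandas. The column names and values below are invented for illustration and do not reflect the actual SDTM or ADNI schemas.

```python
import pandas as pd

# Hypothetical long-form records: one row per (subject, visit, measurement),
# loosely mimicking an SDTM-style layout (names are illustrative only).
long_form = pd.DataFrame({
    "subject_id": ["S1", "S1", "S1", "S2", "S2"],
    "visit_month": [0, 0, 3, 0, 3],
    "feature": ["MMSE_total", "Weight", "MMSE_total", "MMSE_total", "Weight"],
    "value": [24.0, 71.2, 23.0, 27.0, 65.0],
})

# Pivot to a wide ("tidy") table: one row per subject-visit, one column per
# feature. Measurements absent at a visit simply become NaN, which downstream
# code can re-encode with a special missing-data value.
wide = long_form.pivot_table(
    index=["subject_id", "visit_month"],
    columns="feature",
    values="value",
    aggfunc="first",
).reset_index()
```

Each row of `wide` is then a subject-visit and each column a feature, which is the layout assumed by the modeling described below.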
Sixty-four features were selected for inclusion in the model based on clinical significance, as determined by recommendations from subject-matter experts, and on the availability of data. Tables 3 and 4 give basic properties of each of these variables. Of primary interest for clinical trial applications are the components of ADAS-Cog, CDR, and MMSE, which are often used as inclusion criteria or clinical endpoints in trials (see Appendix B for a summary of these endpoints). In addition, variables from the demographics, vital signs, medical history, questionnaires, laboratory measurements, adverse events, and biomarker domains were included. Each variable was classified as background (measured at baseline) or longitudinal (measured through time) and encoded as binary, ordinal, categorical, or continuous. Additionally, two binary features were added: one indicating the baseline visit, and another denoting whether the subject is enrolled in a clinical trial or an observational study.
Studies were selected for inclusion in the processed dataset based on three criteria: the study must record measurements of ADAS-Cog (it is one of the most commonly used endpoints in AD clinical trials, and it is crucial that we include it in our model); it must have a visit cadence of no more than 6 months (including studies with a longer cadence would generate records with large missing portions corresponding to the unrecorded visits); and it must contain a disease population (we aim to model the progression of subjects with AD or MCI). These criteria eliminate a number of CODR-AD studies that do not record ADAS-Cog data, the ADNI3 study with its visit cadence of 12 months, and the cognitively normal cohorts from the ADNI studies. To differentiate cohorts with different disease severity in ADNI, the four ADNI studies were divided into smaller studies of cognitively normal, MCI, and AD subjects.
The resulting processed dataset consists of 6,919 subjects and 34,224 subject-visits split across 21 studies, with approximately 25% of subjects in the MCI stage and the remainder with AD.
Before training, the dataset was split into training, validation, and test datasets, stratified by study. The test dataset was held out until the end of model development and was used for all analyses shown in this work unless otherwise noted.
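A study-stratified split of this kind can be sketched with scikit-learn's `train_test_split`. The 60/20/20 proportions, subject IDs, and study labels below are assumptions for illustration; the appendix does not state the actual split ratio.

```python
from sklearn.model_selection import train_test_split

# Toy subject list with a study label per subject (illustrative only).
subjects = [f"subj_{i}" for i in range(100)]
studies = ["study_A" if i % 2 == 0 else "study_B" for i in range(100)]

# Stratifying by study keeps each study proportionally represented in every
# split. First carve out the training set, then split the rest in half.
train_subj, rest_subj, train_study, rest_study = train_test_split(
    subjects, studies, test_size=0.4, stratify=studies, random_state=0)
val_subj, test_subj = train_test_split(
    rest_subj, test_size=0.5, stratify=rest_study, random_state=0)
```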
A.2 Problem Statement
We denote the data as a set of vectors {x^{(i)}}_{i=1}^{N}, where each x^{(i)} corresponds to the clinical record of the i-th subject and the dataset contains N subjects. Variables in each observed record are classified according to domain (continuous, binary, ordinal, categorical), which controls the model encoding, and according to whether they represent a static or time-dependent (longitudinal) observation [27, 7]. In the following, we denote the static portion of a trajectory as s and the longitudinal portion as x_0, …, x_T, with trajectory length T. Any combination of observations may be missing; missing values are encoded by a special value. We aim to train a model that represents the data distribution to support conditional generation of synthetic clinical records. To simplify this task, we assume a causal Markov structure with a lag L,
p(x_0, \ldots, x_T, s) = p(x_0, s) \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_{t-L}, s),    (1)
which naturally splits the model into a baseline component p(x_0, s) and an autoregressive component p(x_t | x_{t-1}, …, x_{t-L}, s). Following Walsh et al. [27], we call a sample of a complete clinical record from this model a Digital Subject; when the sample is conditioned on the baseline measurements (x_0, s) of an actual subject, the generated clinical record is called a Digital Twin.
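A minimal sketch of sampling under this Markov factorization, with placeholder samplers standing in for the trained baseline and transition models (the lag-2 window, feature counts, and dynamics are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_baseline(rng):
    # Placeholder for the baseline component p(x_0, s); a trained model would
    # be sampled here. Two static and four longitudinal features are
    # illustrative choices.
    s = rng.normal(size=2)
    x0 = rng.normal(size=4)
    return s, x0

def sample_transition(history, s, rng):
    # Placeholder for the autoregressive component p(x_t | x_{t-1}, x_{t-2}, s)
    # with lag L = 2: a noisy average of the last two visits (toy dynamics).
    lagged = np.mean(history[-2:], axis=0)
    return lagged + 0.1 * rng.normal(size=lagged.shape)

def generate_trajectory(T, rng, baseline=None):
    # Digital Subject: sample everything from the model.
    # Digital Twin: pass in an actual subject's baseline (s, x0) and only
    # roll the dynamics forward.
    s, x0 = sample_baseline(rng) if baseline is None else baseline
    history = [x0, x0]  # pad so the lag-2 window is defined at the first step
    for _ in range(T):
        history.append(sample_transition(history, s, rng))
    return s, np.stack(history[1:])

s, traj = generate_trajectory(T=5, rng=rng)
```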
The training data contain some variables that are measured at a 3-month cadence and others that are measured at a 6-month cadence. We exploit this feature of the dataset by training one model p_3 for the 3-month variables x^{(3)} and another model p_6 for the 6-month variables x^{(6)}. Note that certain variables, such as the ADAS-Cog endpoints, are modeled by both components. Suppressing the time dependence for the moment, the full model assumes a hierarchical form,

p(x^{(3)}, x^{(6)}) = p_6(x^{(6)} \mid x^{(o)}) \, p_3(x^{(3)}).    (2)

Here the 6-month model p_6 explains the variables that are observed only at 6-month intervals by conditioning on the overlap variables x^{(o)}, i.e., those modeled by both components. The 3-month model p_3 generates the remainder of the variables with a 3-month cadence. This simplifying assumption corresponds to neglecting the contribution from unobserved 3-month measurements to a subject’s 6-month-only variables.
We require that the trained model supports the ability to generate Digital Twins. Using the hierarchical decomposition in Eq. (2), this involves the following steps:

1. Baseline measurements are taken directly from the actual subjects to be simulated.

2. The 3-month model is used to generate a partial trajectory of the variables in the 3-month group by autoregressive sampling according to Eq. (1), conditioned on the baseline.

3. The 6-month model is used to complete the partial trajectory by autoregressive sampling of the 6-month variables, conditioned on the 3-month variables and the baseline variables.
The component models must therefore support a variety of variable types, be tractable to train in the presence of missing data, and support conditional sequence generation.
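The three steps above can be sketched as follows, with toy samplers in place of the trained 3-month and 6-month CRBMs (visit counts, feature dimensions, and dynamics are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_3m(baseline, n_visits, rng):
    # Placeholder for the 3-month model p_3: autoregressively extends the
    # baseline with one vector of 3-month variables per visit (toy dynamics).
    visits = [baseline]
    for _ in range(n_visits):
        visits.append(visits[-1] + 0.1 * rng.normal(size=baseline.shape))
    return np.stack(visits[1:])

def sample_6m(overlap_3m, rng):
    # Placeholder for the 6-month model p_6: generates one 6-month-only
    # variable at every other visit, conditioned on the overlapping 3-month
    # variables (conditioning on the baseline is elided in this toy version).
    six_month_visits = overlap_3m[1::2]  # visits at months 6, 12, 18, ...
    noise = 0.1 * rng.normal(size=(six_month_visits.shape[0], 1))
    return six_month_visits.mean(axis=1, keepdims=True) + noise

# Step 1: the baseline is taken from an actual subject (random here).
baseline = rng.normal(size=3)
# Step 2: generate the partial trajectory of 3-month variables.
traj_3m = sample_3m(baseline, n_visits=6, rng=rng)
# Step 3: complete the trajectory with the 6-month-only variables.
traj_6m = sample_6m(traj_3m, rng=rng)
```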
We choose CRBMs to represent both the 3-month and 6-month models because they meet these requirements and have been shown to be effective in modeling disease progression in AD [7] and MS [27]. Appendix C contains a summary of CRBMs. For the 3-month model, a lag-2 CRBM with a cadence of 3 months is used; for the 6-month model, a lag-1 CRBM is used instead.
A.3 Training Scheme and Hyperparameter Sweep
We train the composite model implicitly by maximizing the performance of the 3-month and 6-month CRBMs independently. This is equivalent to maximizing a composite pseudolikelihood [22] approximation to the joint model in Eq. (2).
The 3-month component is a lag-2 model with a 3-month visit cadence that is trained on the 3-month variables from the ADNI and CODR-AD datasets. Training samples for this model consist of the baseline variables and the longitudinal variables of three consecutive visits spaced 3 months apart. In the ADNI data (which have a visit cadence of 6 months), these samples are of two types: either the first and last visits are missing (type I), or the middle visit is missing (type II). Instead of relying on the data imputation capabilities of the CRBM, we trained an auxiliary lag-1 CRBM imputation model with a visit cadence of 3 months on the CODR-AD data, which is used to impute the type II samples in ADNI. Type I samples, which would consist mostly of imputed data, were not used during training to avoid introducing too much bias from the simpler imputation model.
The 6-month component is a lag-1 model with a 6-month cadence that is trained on the 6-month variables from ADNI and a single study from CODR-AD (one that records CDR). As this is a simpler model, no special training procedures were required.
The training procedure for each model is an elaboration of that described in Section II.C and Appendix C of Walsh et al. [27], which we summarize here. We trained the 3-month and 6-month models by performing a grid search over hyperparameters in combination with minibatch stochastic gradient descent using the Adam optimizer to obtain the parameters. We did not optimize the hyperparameters of the imputer model. We report details of the grids used for each model in Table 5. For each grid point, we trained each model on the training portion of the dataset and evaluated it on the validation portion. We performed model selection with a two-step minimax procedure similar to the one outlined in Walsh et al. [27]. First, models from a grid search are ranked on the statistical and performance metrics of Table 6 and assigned a score equal to their worst rank. Models are then re-ranked from minimum to maximum score and the bottom 75% are rejected. Second, we applied the same minimax procedure to the remaining 25%, based on performance metrics only. Top-ranked models are selected as best and retrained on the joint training and validation splits. In Figure 7 and Figure 8 we show the distributions of the selection metrics over all trained models and the values for the selected 3-month and 6-month models. We aimed to choose models that performed well across all metrics, even if they were not the best-performing on any single metric.
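The worst-rank ("minimax") scoring step of this selection procedure can be sketched as follows; the rank matrix and the 25% retention fraction follow the description above, while the helper name and toy values are our own:

```python
import numpy as np

def minimax_select(metric_ranks, keep_fraction=0.25):
    """Rank models by their worst metric rank and keep the best fraction.

    metric_ranks: array of shape (n_models, n_metrics), where entry (m, k) is
    model m's rank on metric k (lower is better). Illustrative sketch of the
    worst-rank scoring described in the text.
    """
    scores = metric_ranks.max(axis=1)   # worst rank per model
    order = np.argsort(scores)          # best (lowest) score first
    n_keep = max(1, int(len(order) * keep_fraction))
    return order[:n_keep]

# Toy example: 4 models ranked on 3 metrics.
ranks = np.array([
    [1, 4, 2],   # worst rank 4
    [2, 1, 1],   # worst rank 2  <- survives the cut
    [3, 2, 4],   # worst rank 4
    [4, 3, 3],   # worst rank 4
])
best = minimax_select(ranks)
```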
After the optimal 3-month and 6-month models were selected, each was retrained using these optimal hyperparameters on both the training and validation portions of the dataset.
Hyperparameter          3-month  6-month  Imputer
Batch size                                500
Number of epochs        1000     1000     1000
Learning rate                             0.02
Beta std                                  0.15
Weight penalty          0.001    0.001    0.001
Monte Carlo steps       25       25       25
Adversary weight                          0.30
Number of hidden units                    32
Model     Statistical Metrics  Performance Metrics
3-month
6-month
A.4 Model Evaluation Methods
In this section we provide details on the model evaluation described in Section III.1 and summarized in Figures 2, 3, and 4.
In Figure 2, for each variable and each visit, we compare the values from a given subject to the distribution obtained from 100 Digital Twins of that subject. First, for a given variable v, subject i, and visit t, we compute the p-value of the observation x_{v,i,t} under the Digital Twin distribution,

p_{v,i,t} = \hat{F}_{v,i,t}(x_{v,i,t}),    (3)

where \hat{F}_{v,i,t} is the empirical cumulative distribution of the Digital Twins. If the observed values are consistent with the model distributions, one expects a uniform p-value distribution. Instead of testing this hypothesis directly, we define a derived statistic that is more interpretable. From each p-value we calculate a statistic using the inverse normal distribution,

z_{v,i,t} = \Phi^{-1}(p_{v,i,t}).    (4)

If the observations are consistent with the model distributions, the z values are expected to be normally distributed across subjects with zero mean and unit standard deviation. Means above (below) 0 indicate a biased model distribution whose mean is lower (higher) than the mean of the data. Variances smaller (larger) than 1 indicate a model variance larger (smaller) than the variance of the data. In Figure 2 we plot the mean and standard deviation of z for all variables and visits. We compute a one-sample Kolmogorov-Smirnov test comparing the empirical distribution of z to the standard normal distribution. The significance of the test is adjusted for multiple comparisons with a Bonferroni correction: the significance level is rescaled by the total number of comparisons, given by the number of variables multiplied by the number of visits.
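A sketch of this per-observation check with SciPy, for a single variable and visit. The simulated observations and twins are illustrative, and the small offset in the empirical p-value is an assumption we add to keep the inverse-normal transform finite:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy setup: each of 200 subjects has one observed value and 100 Digital
# Twin values; here both are drawn from the same standard normal, so the
# model is well specified by construction.
n_subjects, n_twins = 200, 100
observations = rng.normal(size=n_subjects)
twins = rng.normal(size=(n_subjects, n_twins))

# p-value under the Digital Twin empirical CDF, then the inverse-normal
# transform (the +0.5 / +1 offset keeps Phi^{-1} away from 0 and 1).
p = (np.sum(twins < observations[:, None], axis=1) + 0.5) / (n_twins + 1.0)
z = stats.norm.ppf(p)

# If observations match the twin distributions, z should be ~ N(0, 1); a
# one-sample Kolmogorov-Smirnov test quantifies the agreement.
ks_stat, ks_p = stats.kstest(z, "norm")
```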
In Figure 3 we compare summary statistics from test subjects and their Digital Twins, computing each statistic once from the test subject data and once from the Digital Twin data. We measure the mean of each variable at each visit, the standard deviation of each variable at each visit, and correlations between variables at time lags of 0, 3, 6, and 9 months, corresponding to equal-time correlations and 3-, 6-, and 9-month lagged autocorrelations, respectively. Each statistic is calculated over the test subjects and over their corresponding Digital Twins, where a single Digital Twin is assigned to each test subject. We compute a regression between the values obtained from test subjects and the values obtained from Digital Twins, weighted by the fraction of data present for each particular point. Means and standard deviations span multiple orders of magnitude, and ordinary least squares regression is particularly sensitive to outliers, so we use Theil-Sen regression for them. For correlations and autocorrelations we use ordinary least squares regression and also report the corresponding coefficient of determination.
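The robust regression step can be sketched with `scipy.stats.theilslopes`; the simulated means and the injected outlier are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Toy per-variable means from test subjects and from Digital Twins, spanning
# a wide range of magnitudes, with one gross outlier to show why a robust
# fit is preferred over ordinary least squares here.
subject_means = rng.uniform(0.1, 100.0, size=50)
twin_means = subject_means * 1.02 + rng.normal(scale=0.5, size=50)
twin_means[0] = 1000.0  # single gross outlier

# Theil-Sen: median of pairwise slopes, resistant to the outlier.
slope, intercept, lo, hi = stats.theilslopes(twin_means, subject_means)

# For correlations one would use ordinary least squares instead; here we
# just compute Pearson's r on the clean points for comparison.
r = np.corrcoef(np.delete(subject_means, 0), np.delete(twin_means, 0))[0, 1]
```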
In Figure 4 we train a classifier to distinguish actual subjects from Digital Twins based on the complete set of longitudinal variables. We compare subjects with their Digital Twins at each visit, and we also consider a different classifier whose features are the differences of variables between two consecutive visits. Missing data are a potential source of bias, since actual subjects have missing values while their Digital Twins do not. We remove this potential bias by mean-imputing all missing values and assigning the same imputed value to the corresponding Digital Twins. Classifiers are evaluated using 5-fold cross validation: for each fold, a model is trained on the other 4 folds and evaluated on the remaining one, and performance is averaged over the 5 folds. We generate 100 Digital Twins for each subject, and a separate classifier is trained and evaluated for each set of twins. In the top panel of Figure 4, for each visit, we report the average performance and the corresponding 95% confidence interval over the 100 trials. In the bottom panel, we report boxplots of the distribution of the logistic regression coefficients. The distribution includes all weights estimated from the 100 Digital Twin sets, the 6 visits, and the 5 folds of cross validation. The weights are related to the relative importance of the features, with a weight equal to 0 corresponding to an irrelevant feature.
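A sketch of the real-versus-twin classification test with scikit-learn. Here both groups are drawn from the same distribution, so the cross-validated performance should sit near chance; the data, feature count, and use of AUC as the performance measure are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# Label actual subjects 1 and their Digital Twins 0, then see whether a
# linear classifier can tell them apart. Both groups come from the same
# distribution in this toy setup, so accuracy should hover near chance.
n, d = 300, 10
X = np.vstack([rng.normal(size=(n, d)), rng.normal(size=(n, d))])
y = np.concatenate([np.ones(n), np.zeros(n)])

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
mean_auc = scores.mean()
```

An AUC (or accuracy) near 0.5 indicates the classifier cannot distinguish the two groups, i.e., the Digital Twins are statistically indistinguishable from real subjects with respect to these features.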
Appendix B ADAS-Cog, CDR, and MMSE variables
The ADAS-Cog test is widely used in clinical trials to assess cognitive decline in subjects with Alzheimer’s Disease. ADAS-Cog most commonly includes between 11 and 14 tasks, which are scored independently and then combined into composite scores (e.g., ADAS-Cog 11 or ADAS-Cog 14). The tasks include both subject-completed tests and observer-based assessments, and evaluate the domains of memory, language, and praxis.
ADAS-Cog is less suitable for detecting changes at milder stages of dementia, and more sensitive tests have been introduced for this purpose. A widely used test is the CDR, which employs an interview format to collect detailed information on the subject’s ability to function in various domains. It consists of 6 components that are commonly combined into the composite CDR Sum-of-Boxes (CDR-SB). CDR-SB is the gold standard for staging dementia in subjects who eventually develop AD, and this is reflected in the early-stage AD trials that have been run in the last decade.
A commonly used quick assessment of cognitive impairment is the MMSE questionnaire. The MMSE has been a staple of inclusion and exclusion criteria in clinical trials for over 20 years and is used in nearly 100% of trials of treatments for early- to late-stage AD.
Appendix C Summary of CRBMs
Restricted Boltzmann Machines (RBMs) [1, 20, 28] are a well-known type of generative latent-variable model that possesses a number of qualities making it suitable for modeling clinical data. An RBM is defined by an energy function [13] that measures the compatibility between a set of observed (visible) variables v and a set of latent (hidden) variables h introduced to model the complex dependencies in the data. The data distribution is then the marginal of a joint distribution defined by this energy function, p(v) = Z^{-1} \sum_h e^{-E(v, h)}, with Z the normalizing factor. This model makes few assumptions on the domain of the variables, which allows flexible modeling of the data by choosing an appropriate energy function. By construction, RBMs can efficiently generate samples conditioned on observed data through the use of MCMC methods, and can therefore be used for data imputation. They are also efficient to train, and naturally support training in the presence of missing observations. See recent reviews [5, 9] for methods of training and sampling from RBMs.
A Conditional Restricted Boltzmann Machine (CRBM) is an extension of an RBM that supports sequence data of the form in Eq. (1). In the original development [23, 14], the RBM was extended to model conditional distributions of the form p(x_t | x_{t-1}, …, x_{t-L}). Although sufficient for autoregressively extending sequences into the future, such a model cannot handle missing data in the conditioning set during training, and is hard to use for imputing backwards in time. We instead use the CRBM architecture of Fisher et al. [7], which differs from that of [23, 14] in that it is bidirectional and it allows some of the visible variables (such as sex or race) to be time-independent. To make the model bidirectional, we define a single-lag joint distribution that models the correlations between consecutive longitudinal measurements and the baseline variables as an RBM. This leads to a joint probability distribution given by
p(x_t, x_{t-1}, s) = \frac{1}{Z} \sum_h e^{-E(x_t, x_{t-1}, s, h)},    (5)
in which the energy function takes the form,
E(x_t, x_{t-1}, s, h) = -\sum_i a_i(v_i) - \sum_\mu b_\mu(h_\mu) - \sum_{i,\mu} \frac{v_i}{\sigma_i} W_{i\mu} \frac{h_\mu}{\varepsilon_\mu}, \quad v = (x_t, x_{t-1}, s).    (6)
Each variable of the visible and latent layers has bias parameters determined by the choice of the functions a_i and b_\mu, as well as scale parameters \sigma_i and \varepsilon_\mu. The connection between the layers is parameterized by the weight matrices W. Understood as a conventional RBM, our CRBM contains the visible units for multiple time points coupled to a standard hidden layer. Clinical trajectories may be obtained from a CRBM by sampling according to the Markov decomposition in Eq. (1). This is efficient because the joint RBM model is easy to sample conditionally, e.g., from p(x_t | x_{t-1}, s).
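Conditional sampling of this kind can be sketched with a toy binary RBM: the units corresponding to observed data (e.g., x_{t-1} and s) are clamped while the remaining units are Gibbs-sampled. The sizes, random weights, and binary-only units are illustrative simplifications of the CRBM described above:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal binary RBM with random parameters (illustrative sizes).
n_vis, n_hid = 8, 4
W = 0.1 * rng.normal(size=(n_vis, n_hid))
a = np.zeros(n_vis)   # visible biases
b = np.zeros(n_hid)   # hidden biases

def gibbs_conditional(v, clamped, n_steps, rng):
    # Alternate hidden and visible updates, but keep the clamped (observed)
    # visible units fixed at their observed values throughout.
    v = v.copy()
    for _ in range(n_steps):
        h = (rng.random(n_hid) < sigmoid(b + v @ W)).astype(float)
        v_new = (rng.random(n_vis) < sigmoid(a + W @ h)).astype(float)
        v = np.where(clamped, v, v_new)
    return v

v0 = (rng.random(n_vis) < 0.5).astype(float)
clamped = np.arange(n_vis) < 4   # first half observed, second half sampled
sample = gibbs_conditional(v0, clamped, n_steps=20, rng=rng)
```

The same clamping idea supports data imputation: any subset of visible units can be observed, and the rest are filled in by the Markov chain.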
To optimize the parameters of a CRBM, we replace the maximum likelihood objective (Eq. (1)) with a piecewise pseudolikelihood approximation [22], obtaining a piecewise loss in log space,

\mathcal{L}_{PL} = -\sum_{t=1}^{T} \log p(x_t, x_{t-1}, s),    (7)

which is averaged over adjacent windows of trajectories that we call shingles. This objective is optimized with the persistent contrastive divergence algorithm
[25]. Training directly with this objective can result in poor sample quality, because maximum likelihood objectives allow the model to generate arbitrarily poor out-of-distribution data [24, 6]. We mitigate this problem by augmenting the loss with a weighted adversarial (contrastive) loss term that acts as a regularizer [6] (see Chen et al. [2] for a similar application in computer vision). We call the resulting training objective BEAM [6],

\mathcal{L}_{BEAM} = \mathcal{L}_{PL} + \gamma \mathcal{L}_{adv},    (8)

where \gamma is a regularization strength hyperparameter. Although this procedure is superficially similar to generative adversarial network (GAN) training, we emphasize that the BEAM objective and the model properties of CRBMs are fundamentally different from those of GANs. Indeed, we treat the weight of the regularization term as a hyperparameter that can be set to zero, whereas a GAN cannot be trained without an adversary.
The computational cost of the BEAM procedure is similar to that of conventional RBM training, and it allows reusing existing RBM implementations. We refer the reader to Section II.C of Walsh et al. [27] for details on how CRBMs are trained on sequence data under the Markov assumption, and to Fisher et al. [6] for the motivation behind the BEAM objective.
Appendix D Additional Results on Clinical Endpoints Progression
In Table 7 we report results from a linear regression of the progression of test subjects against the model predictions for various endpoints. We report the slope, intercept, and Pearson correlation for ADAS-Cog 11, MMSE, and CDR-SB at two representative visits, 12 months and 18 months from baseline. The high correlations support the conclusion that our model also predicts progression well at the subject level.
Score progression   Slope        Intercept    Pearson r
12 months
ADAS-Cog 11         0.70 (0.05)  0.68 (0.22)  0.36
MMSE                0.72 (0.06)  0.27 (0.15)  0.41
CDR-SB              0.42 (0.08)  0.45 (0.10)  0.26
18 months
ADAS-Cog 11         0.91 (0.06)  0.46 (0.40)  0.49
MMSE                0.79 (0.08)  0.46 (0.27)  0.45
CDR-SB              0.76 (0.13)  0.31 (0.23)  0.47
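The subject-level regression reported in Table 7 can be sketched with `scipy.stats.linregress`; the simulated predicted and observed progressions below are illustrative and not the paper's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Toy setup: for each of 120 test subjects, regress the observed endpoint
# change from baseline on the change predicted by the model (e.g., the mean
# over that subject's Digital Twins). Numbers are invented for illustration.
predicted = rng.normal(4.0, 3.0, size=120)
observed = 0.8 * predicted + rng.normal(scale=2.5, size=120)

fit = stats.linregress(predicted, observed)
slope, intercept, pearson_r = fit.slope, fit.intercept, fit.rvalue
```

A slope near 1 with a high Pearson r indicates the model's predicted progression tracks the observed progression at the individual-subject level.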
Appendix E Additional Results on Marginal Distributions
In Tables 8 and 9 we report the means and standard deviations of the marginal distributions of all longitudinal variables, comparing the test data with the model of this paper and with the model of Fisher et al. [7]. We also report significant deviations of either model from the data, using a t-test to compare means and Levene’s test to compare standard deviations. We show 3 representative visits, but the tests are performed for all visits from 3 months to 18 months in steps of 3 months, and significance is corrected for multiple comparisons for each model independently. We observe a larger number of significant deviations in the MMSE marginals for the model of Fisher et al., showing that the model described in this paper improves the predictions of MMSE. The two models are consistent with the data for the remaining variables.
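The marginal-distribution comparison can be sketched with SciPy's t-test and Levene's test under a Bonferroni correction; the sample sizes and the number of comparisons are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# For one variable at one visit, compare test data against model samples
# with a t-test (means) and Levene's test (standard deviations). The number
# of comparisons sets the Bonferroni-corrected significance level.
n_comparisons = 20
alpha = 0.05 / n_comparisons

data = rng.normal(0.0, 1.0, size=500)
model = rng.normal(0.0, 1.0, size=500)

t_stat, t_p = stats.ttest_ind(data, model, equal_var=False)
lev_stat, lev_p = stats.levene(data, model)

mean_deviates = t_p < alpha    # significant deviation in the mean?
std_deviates = lev_p < alpha   # significant deviation in the spread?
```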