In Alzheimer’s disease—the annual cost of which exceeds $800 billion globally Marinescu et al. (2018)—the effectiveness of therapeutic treatments is often limited by the challenge of identifying patients at early enough stages of progression for treatments to be of use. As a result, accurate and personalized prognosis in earlier stages of cognitive decline is critical for effective intervention and subject selection in clinical trials. Conventional statistical methods in survival analysis often begin by choosing explicit functions to model the underlying stochastic process Doksum and Hbyland (1992); Kleinbaum and Klein (2005); Rodríguez (2005); Lee and Whitmore (2010); Fernández et al. (2016); Alaa and van der Schaar (2017). However, the constraints—such as linearity and proportionality Singh and Mukhopadhyay (2011); Cox (1992) in the popular Cox model—may not be valid or verifiable in practice.
In spite of active research, a conclusive understanding of Alzheimer’s disease progression remains elusive, owing to heterogeneous biological pathways Beckett et al. (2015); Hardy and Selkoe (2002), complex temporal patterns Jedynak et al. (2012); Donohue et al. (2014), and diverse interactions Frölich et al. (2017); Pascoal et al. (2017)
. Hence Alzheimer’s data is a prime venue for leveraging the potential advantages of deep learning models for survival. Neural networks offer versatile alternatives by virtue of their capacity as general-purpose function approximators, able to learn—without restrictive assumptions—the latent structure between an individual’s prognostic factors and odds of survival.
Contributions. Our goal is to establish a novel convolutional model for survival prediction, using Alzheimer’s disease as a case study for experimental validation. Primary contributions are threefold: First, we formulate a generalized framework for longitudinal survival prediction, laying the foundation for effective cross-model comparison. Second, our proposal is uniquely designed to capitalize on longitudinal data to issue dynamically updated survival predictions, accommodating potentially informative patterns of data missingness, and combining human input with model predictions. Third, we propose methods for deriving clinically meaningful insight into the model’s inference process. Finally, we demonstrate state-of-the-art results in comparison with comprehensive benchmarks.
2 Problem formulation
Let there be patients in a study, indexed . Time is treated as a discrete dimension of fixed resolution . Each longitudinal datum consists of the tuple , where
is the vector of covariates recorded at time, and
is the binary survival indicator. Let random variabledenote the time-to-event, the time of right-censoring, and . Per convention, we assume that censoring is not correlated with the eventual outcome Tsiatis and Davidian (2004); Zheng and Heagerty (2005); van Houwelingen and Putter (2011). Let the complete longitudinal dataset be given by . Then we can define
to be the set of observations for patient extending from time into a width- window of the past, where parameter depends on the model under consideration. Given longitudinal measurements in , our goal is to issue risk predictions corresponding to length- horizons into the future. Formally, given a backward-looking historical window , we are interested in the failure function for forward-looking prediction intervals
of event occurrence within each prediction interval. Observe that parameterizing the width of the historical window results in a generalized framework—for instance, a Cox model only utilizes the most recent measurement; that is, . At the other extreme, recurrent models may consume the entire history; that is, . As we shall see, the best performance is in fact obtained via a flexible intermediate approach—that is, by learning the optimal width of a sliding window of history.
We propose Match-Net: a Missingness-Aware Temporal Convolutional Hitting-time Network, innovating on current approaches in three respects. Dynamic prediction: Existing deep learning models issue prognoses on the basis of information from a single time point Faraggi and Simon (1995); Buckley and James (1979); Katzman et al. (2016); Xiang et al. (2000); Mariani et al. (1997); Liestbl et al. (1994); Biganzoli et al. (2009); Yu et al. (2011); Luck et al. (2017); Fotso (2018); Lee et al. (2018), potentially discarding valuable information in the presence of longitudinal data. We investigate temporal convolutions in capturing heterogeneous representations of temporal dependencies among observed covariates, enabling truly dynamic predictions on the basis of historical information. Informative missingness: Current survival methods rely on the common assumption that the timing and frequency of covariate measurements is uninformative van Houwelingen and Putter (2011); Rizopoulos (2012). By contrast, our model is missingness-aware—that is, we explicitly account for informative missingness by learning correlations between patterns of missingness and disease progression. Human input: Instead of issuing predictions solely on the basis of quantitative clinical measurements, we optionally incorporate clinicians’ most recent diagnoses of patient state into model estimates to examine the incremental informativeness of subjective input. Each innovation is a source of gain in performance (see Section 4; further details in the appendix).
Match-Net accepts as input a sliding-window of observed covariates , as well as a parallel binary mask of missing-value indicators taking on values of one to denote missing covariate measurements. In addition, the network optionally accepts a one-hot vector describing the most recent clinician diagnosis of Alzheimer’s disease progression. Starting from the base of the network, the convolutional block first learns representations of longitudinal covariate trajectories by extracting local features from temporal patterns. After each layer, filter activations from the auxiliary branch are concatenated with those in the main branch. Then, the fully-connected block captures more global relationships by combining local information. Finally, the output layers produce failure estimates
for pre-specified prediction intervals, where is the maximal horizon desired. This convolutional dual-stream architecture explicitly captures representations of temporal dependencies within each stream, as well as between covariate trajectories and missingness patterns in association with disease progression. This accounts for the potential informativeness of both irregular sampling (i.e. intervals between consecutive clinical visits may vary) and asynchronous sampling (i.e. not all features are measured at the same time) Rubin (1976); Bang et al.
. This also encourages the network to distinguish between actual measurements and imputed values, reducing sensitivity to specific imputation methods chosen. The final architecture uses convolutions of length, with more filters per layer in the main branch.
Loss function. With the preceding notation, the negative log-likelihood of a single empirical result and model estimate in relation to some input window is given by
where denotes the parameters of the network. The total loss function is then computed to simultaneously take into account the quality of predictions for all prediction horizons , all times available along each patient’s longitudinal trajectory, and all patients in the survival dataset:
where accounts for failure or right-censoring. This is a natural generalization of the log-likelihood in Liestbl et al. (1994) to accommodate longitudinal survival. Weight function allows trading off the relative importance of different patients, time steps, and prediction horizons. First, this allows standardizing patient contributions with , thereby counteracting the bias against patients with shorter survival durations. Second, in the context of heavily imbalanced classes, this allows up-weighting positive instances—that is, input windows that correspond to actual failure.
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) study data is a longitudinal survival dataset of per-visit measurements for 1,737 patients Marinescu et al. (2018). The data tracks disease progression through clinical measurements at -year intervals, including quantitative biomarkers, cognitive tests, demographics, and risk factors. Our objective is to predict the first stable occurrence of Alzheimer’s disease for each patient. Further information on the dataset, preparation, and training can be found in the appendix.
Benchmarks. We evaluate Match
-Net against both traditional longitudinal methods in survival analysis and recent deep learning approaches; the former includes Cox landmarking and joint modeling methods, and the latter includes static and dynamic multilayer perceptrons and recurrent neural network models. Performance is evaluated on the basis of the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), both computed with respect to prediction horizons. Five-fold cross validation metrics are reported in Table 1.
Performance. Match-Net produces state-of-the-art results, consistently outperforming both conventional statistical and neural network benchmarks. Gains are especially apparent in AUPRC scores—improving on the MLP by an average of 15% and on joint models by 16% across all horizons, and by 27% and 26% for one-step-ahead predictions. To understand the sources of improvement, we observe a 4% gain in AUPRC from introducing the sliding window mechanism (MLP to S-MLP), a 9% gain from incorporating temporal convolutions (S-MLP to S-TCN), and a further 2% gain from accommodating informative missingness (S-TCN to Match-Net). In addition, including the most recent clinician diagnosis results in a further 17% gain (further details located in the appendix).
Visualization. From the preceding, we observe the largest gains by introducing convolutions. While this is consistent with our motivating hypothesis that convolutions are better able to capture temporal patterns, clinicians often desire a degree of transparency into the prediction process Ribeiro et al. (2016); Avati et al. (2017). We adopt the partial dependence approach in Friedman (2001) to understand the input-output relationship, as well as examining the utility of convolutions. For each observed covariate , we want to approximate how the estimated failure function varies based on the value of . We define the dependence
where . By evaluating Equation 6 on the values present in the data, the influence of each covariate can be measured by estimating its slope. For a global picture of what impact each feature and time step has on the model’s predictions, we compute the influence for all features to produce an average saliency map Zeiler and Fergus (2014); Simonyan et al. (2013) highlighting the effect of convolutional layers. All else equal, absent convolutions we see that having worse covariate values any time step almost invariably has an upward impact on risk (i.e. negative impact on survival). On the other hand, with temporal convolutions we see that having worse covariate values at earlier time steps may result in a downward impact on risk (i.e. positive impact on survival), suggesting that convolutions may better facilitate modeling relative movements (e.g. sudden declines) than simply paying attention to levels. Further visualizations for added perspective on input-output relationships are found in the appendix.
- Marinescu et al.  Razvan V Marinescu, Neil P Oxtoby, Alexandra L Young, et al. Tadpole challenge: Prediction of longitudinal evolution in alzheimer’s disease. arXiv preprint arXiv:1805.03909, 2018.
Doksum and Hbyland 
Kjell A Doksum and Arnljot Hbyland.
Models for variable-stress accelerated life testing experiments based on wiener processes and the inverse gaussian distribution.Technometrics, 34(1):74–82, 1992.
- Kleinbaum and Klein  D Kleinbaum and M Klein. Survival analysis statistics for biology and health. Survival, 510:91665, 2005.
- Rodríguez  Germán Rodríguez. Parametric survival models. Lectures Notes, Princeton University, 2005.
- Lee and Whitmore  Mei-Ling Ting Lee and GA Whitmore. Proportional hazards and threshold regression: their theoretical and practical connections. Lifetime data analysis, 16(2):196–214, 2010.
- Fernández et al.  Tamara Fernández, Nicolás Rivera, and Yee Whye Teh. Gaussian processes for survival analysis. In Advances in Neural Information Processing Systems, pages 5021–5029, 2016.
- Alaa and van der Schaar  Ahmed M Alaa and Mihaela van der Schaar. Deep multi-task gaussian processes for survival analysis with competing risks. In Proceedings of the 30th Conference on Neural Information Processing Systems, 2017.
- Singh and Mukhopadhyay  Ritesh Singh and Keshab Mukhopadhyay. Survival analysis in clinical trials: Basics and must know areas. Perspectives in clinical research, 2(4):145, 2011.
- Cox  David R Cox. Regression models & life-tables. In Breakthroughs in statistics. Springer, 1992.
- Beckett et al.  Laurel A Beckett, Michael C Donohue, Cathy Wang, et al. The alzheimer’s disease neuroimaging initiative phase 2: Increasing the length, breadth, and depth of our understanding. Alzheimer’s & Dementia, 11(7):823–831, 2015.
- Hardy and Selkoe  John Hardy and Dennis J Selkoe. The amyloid hypothesis of alzheimer’s disease: progress and problems on the road to therapeutics. science, 297(5580):353–356, 2002.
- Jedynak et al.  Bruno M Jedynak, Andrew Lang, Bo Liu, et al. A computational neurodegenerative disease progression score: method and results with the alzheimer’s disease neuroimaging initiative cohort. Neuroimage, 63(3):1478–1486, 2012.
- Donohue et al.  Michael C Donohue, Hélène Jacqmin-Gadda, Mélanie Le Goff, et al. Estimating long-term multivariate progression from short-term data. Alzheimer’s & Dementia, 10(5):S400–S410, 2014.
- Frölich et al.  Lutz Frölich, Oliver Peters, Piotr Lewczuk, et al. Incremental value of biomarker combinations to predict progression of mild cognitive impairment to alzheimer’s dementia. Alzheimer’s research & therapy, 9(1):84, 2017.
- Pascoal et al.  Tharick A Pascoal, Sulantha Mathotaarachchi, Monica Shin, et al. Synergistic interaction between amyloid and tau predicts the progression to dementia. Alzheimer’s & Dementia, 13(6):644–653, 2017.
- Tsiatis and Davidian  Anastasios A Tsiatis and Marie Davidian. Joint modeling of longitudinal and time-to-event data: an overview. Statistica Sinica, pages 809–834, 2004.
- Zheng and Heagerty  Yingye Zheng and Patrick J Heagerty. Partly conditional survival models for longitudinal data. Biometrics, 61(2):379–391, 2005.
- van Houwelingen and Putter  Hans van Houwelingen and Hein Putter. Dynamic prediction in clinical survival analysis. CRC, 2011.
- Faraggi and Simon  David Faraggi and Richard Simon. A neural network model for survival data. Statistics in medicine, 14(1):73–82, 1995.
- Buckley and James  Jonathan Buckley and Ian James. Linear regression with censored data. Biometrika, 66(3):429–436, 1979.
- Katzman et al.  Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deep survival: A deep cox proportional hazards network. stat, 1050:2, 2016.
- Xiang et al.  Anny Xiang, Pablo Lapuerta, Alex Ryutov, Jonathan Buckley, and Stanley Azen. Comparison of the performance of neural network methods and cox regression for censored survival data. Computational statistics & data analysis, 34(2):243–257, 2000.
- Mariani et al.  L Mariani, D Coradini, E Biganzoli, et al. Prognostic factors for metachronous contralateral breast cancer: a comparison of the linear cox regression model and its artificial neural network extension. Breast cancer research and treatment, 44(2):167–178, 1997.
- Liestbl et al.  Knut Liestbl, Per Kragh Andersen, and Ulrich Andersen. Survival analysis and neural nets. Statistics in medicine, 13(12):1189–1200, 1994.
- Biganzoli et al.  Elia M Biganzoli, Federico Ambrogi, and Patrizia Boracchi. Partial logistic artificial neural networks (plann) for flexible modeling of censored survival data. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 340–346. IEEE, 2009.
- Yu et al.  Chun-Nam Yu, Russell Greiner, Hsiu-Chin Lin, and Vickie Baracos. Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In Advances in Neural Information Processing Systems, pages 1845–1853, 2011.
- Luck et al.  Margaux Luck, Tristan Sylvain, Héloïse Cardinal, Andrea Lodi, and Yoshua Bengio. Deep learning for patient-specific kidney graft survival analysis. arXiv preprint arXiv:1705.10245, 2017.
- Fotso  Stephane Fotso. Deep neural networks for survival analysis based on a multi-task framework. arXiv preprint arXiv:1801.05512, 2018.
- Lee et al.  Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. AAAI, 2018.
- Rizopoulos  Dimitris Rizopoulos. Joint models for longitudinal and time-to-event data: With applications in R. Chapman and Hall/CRC, 2012.
- Rubin  Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
-  Seo-Jin Bang, Yuchuan Wang, and Yang Yang. Phased-lstm based predictive model for longitudinal ehr data with missing values.
- Long et al.  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In , pages 3431–3440, 2015.
- Lim and van der Schaar  Bryan Lim and Mihaela van der Schaar. Forecasting disease trajectories in alzheimer’s disease using deep learning. arXiv preprint arXiv:1807.03159, 2018.
Ribeiro et al. 
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin.
Why should i trust you?: Explaining the predictions of any classifier.In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
- Avati et al.  Anand Avati, Kenneth Jung, Stephanie Harman, Lance Downing, Andrew Ng, and Nigam H Shah. Improving palliative care with deep learning. In Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference on, pages 311–316. IEEE, 2017.
Jerome H Friedman.
Greedy function approximation: a gradient boosting machine.Annals of statistics, pages 1189–1232, 2001.
- Zeiler and Fergus  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
- Simonyan et al.  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Parisot et al.  Sarah Parisot, Sofia Ira Ktena, Enzo Ferrante, et al. Disease prediction using graph convolutional networks: Application to autism spectrum disorder and alzheimer’s disease. Medical image analysis, 2018.
- Li and Luo  Kan Li and Sheng Luo. Functional joint model for longitudinal and time-to-event data: an application to alzheimer’s disease. Statistics in medicine, 36(22):3560–3572, 2017.
Liu et al. 
Ke Liu, Kewei Chen, et al.
Prediction of mild cognitive impairment conversion using a combination of independent component analysis and the cox model.Frontiers in human neuroscience, 11:33, 2017.
- Vos et al.  Stephanie JB Vos, Frans Verhey, Lutz Frölich, et al. Prevalence and prognosis of alzheimer’s disease at the mild cognitive impairment stage. Brain, 138(5):1327–1338, 2015.
- Therneau  Terry M Therneau. A package for survival analysis in s. R package version 2.42, 2018.
Stepwise variable selection procedures for regression analysis.R package version 0.1.0, 2018.
- Wu  Lang Wu. Mixed effects models for complex data. Chapman and Hall/CRC, 2009.
Gal and Ghahramani 
Yarin Gal and Zoubin Ghahramani.
Dropout as a bayesian approximation: Representing model uncertainty
in deep learning.
international conference on machine learning, pages 1050–1059, 2016.
- Saito and Rehmsmeier  Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015.
While various settings may benefit from Match-Net as a matter of clinical decision support, we illustrate one possible application within the context of personalized screening. Figure 3 shows the historical risk trajectory and forward risk estimates for a randomly selected ADNI patient. During the first seven years of bi-annual visits, the patient exhibits cognitively normal behavior, and the Match-Net risk estimates—computed via measured biomarkers and neuropsychological tests—reflect this steady clinical state (see historical trajectory in blue). In fact, as of precisely seven years of follow-up, the predicted 30-month forward risk remains less than 10% (see predicted trajectory in orange).
However, two clinical visits later, when the patient returns for regular checkup and clinical measurements, the projected 30-month forward risk jumps to over 50% (see predicted trajectory in red). In this situation, the clinician is immediately alerted to the sudden increase in risk of dementia, and may decide to advise more frequent checkups, or to administer a wider range of tests and biomarker measurements in the immediate term to better assess the overall risk in light of the recent downturn. In fact, as it turns out in this case, the patient is indeed diagnosed with Alzheimer’s disease at years, shedding light on Match-Net’s potential as an early warning and subject selection system.
|Faraggi & Simon||(1995)||✓||✗||✗||✗||✗|
|Katzman et al.||(2016)||✓||✓||✗||✗||✗|
|Luck et al.||(2017)||✓||✓||✓||✗||✗|
|Lee et al.||(2018)||✓||✓||✓||✓||✗|
The first study to investigate neural networks formally in the context of time-to-event analysis was done by Faraggi and Simon . By swapping out the linear functional in the Cox model for the topology of a hidden layer, their nonlinear proportional hazards approach was extended to other models for censored data, such as Rodríguez  and Buckley and James . In 2016, Katzman et al. 
were the first to apply modern techniques in deep learning to survival, in particular without prior feature selection or domain expertise. While previous studies followingFaraggi and Simon ’s model generally produced mixed results Xiang et al. , Mariani et al. , Katzman et al.  demonstrated comparable or superior performance of multilayer perceptrons in relation to conventional statistical methods.
Instead of predicting the hazard function as an intermediate objective, Liestbl et al.  first proposed—and Biganzoli et al.  further developed—an alternative approach to predict survival directly for grouped time intervals. In 2017, Luck et al.  combined the use of the Cox partial likelihood with the goal of predicting probabilities for pre-specified time intervals. Inspired by the work of Yu et al. 
on multi-task logistic regression models for survival, they generalized the idea to deep learning via multi-task multilayer perceptrons.
Recently, Lee et al.  and Fotso  in 2018 proposed learning the distribution of survival times directly, making no assumptions regarding the underlying stochastic processes—in particular with respect to the time-invariance of hazards. By being process-agnostic, Lee et al.  demonstrated significant improvements over existing statistical, machine learning, and neural network survival models on multiple real and synthetic datasets. In the context of Alzheimer’s disease, Parisot et al.  studied the use of medical image sequences for classifying patient progression. More pertinently, Lim and van der Schaar 
used recurrent networks to forecast disease trajectories with ADNI data, but relied on the explicit assumption of exponential distributions. Finally, building on these developments, one of the main contributions of this study is the use of temporal convolutions for dynamic survival prediction—while making no assumptions, and allowing the associations between covariates and risks to evolve over time (see Table2).
Details on dataset
We are primarily interested in the clinical status for each patient at any given time. An official diagnosis is recorded at each patient’s visit, and consists of two attributes. First, each diagnosis may be either stable or transitive. The former consists of stable diagnoses of normal brain functioning (“NL”), mild cognitive impairment (“MCI”), or Alzheimer’s disease (“AD”), and the latter consists of preliminary diagnoses indicating transitions between these categories, which may take the form of either conversions or reversions. Conversions indicate a forward progression in the disease trajectory, and reversions indicate a regression back towards an earlier stage of the disease.
Patients may remain in stable or transition diagnosis states for any duration at a time. The average patient who receives a transition diagnosis is observed to persist in that state for one year, while some patients do not exit this state until almost 5 years have elapsed. Patients who receive a transition diagnosis may not actually be confirmed with a subsequent stable diagnosis; in fact, less than half of the transition diagnoses for dementia were confirmed by a stable diagnosis at the next step, and almost one quarter are never followed by a stable diagnosis at any point until right-censoring. In addition, patients often actually undergo reversion transitions back towards earlier stages of the disease; in fact, over 5% of the study population receive reversion diagnoses at some point in time.
|CRD Sum of Boxes||Numeric||0.0E+00||1.8E+01||2.2E+00||2.8E+00||29.7%|
|Mini Mental State||Numeric||0.0E+00||3.0E+01||2.7E+01||4.0E+00||29.9%|
Note on class imbalance. The per-patient failure rate is 14% (243 patients out of the total 1,737). However, given the online nature of the sliding window mechanism in training and testing, the effective fraction of observations with positive event labels for any prediction horizon is around 2%.
Since the ADNI dataset is an amalgamation from multiple related studies, most features are sparsely populated. Features with less than half of the entries missing are retained, leaving 18 numeric and 4 categorical features (see Table 3
); the latter are represented by one-hot encoding, resulting in 16 binary features. Consistent with existing Alzheimer’s studies, patients are aligned according to time elapsed since baseline measurementsLi and Luo , Liu et al. , Vos et al. . Timestamps are mapped onto a discrete axis with a fixed resolution of -year intervals; where multiple measurements qualify for the same destination, the most recent measurement per feature takes precedence. Original measurements were made at roughly -year intervals, so we observe that the average absolute deviation between original values and final timestamps amounts to an insignificant 4 days (i.e. less than 2% of each interval).
provided in parallel to the network. Finally, to improve numerical conditioning, all features are normalized with empirical means and standard deviations from the training set data.
Training begins with input tuple , where , , and , and terminates with a set of calibrated network weights
. The network is trained until convergence, up to a maximum of 50 epochs. Training loss is only computed for event labels corresponding to actual recorded clinical visits (i.e. timestamps with recorded covariate values); neither imputed nor forward-filled labels are included. Analogous to our total loss function definition, the convergence metric is the sum of performance scores across all prediction tasks,
where , optionally allow trading off the relative importance between the two measures and different horizons. In this study we simply use the unweighted sum, although any convex combination would be valid. Empirically, results are not meaningfully improved by favoring one metric over the other. Elastic net regularization is used, and validation performance is computed every 10 iterations. For early stopping, validation scores serve as proxies for generalization error. Positive instances are oversampled to counteract class imbalance. To augment training data, artificial labels for dummy events of the form are generated, and positive labels are filled forward for all horizons where corresponds to the first failure for that example. Patients diagnosed with Alzheimer’s at baseline (20%) are excluded from testing and validation, since survival is undefined in those cases.
|Connected Layers||1, 2, 3, 4, 5|
|Convolutional Layers||1, 2, 3, 4, 5|
|Dropout Rate||0.1, 0.2, 0.3, 0.4, 0.5|
|Epochs for Convergence||1, 2, 3, 4, 5, 6, 7, 8, 9, 10|
|Learning Rate||1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2|
|L1-Regularlisation||None, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1|
|L2-Regularization||None, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1|
|Minibatch Size||32, 64, 128, 256, 512|
|Number of Filters (Covariates)||32, 64, 128, 256, 512|
|Number of Filters (Masks)||8, 16, 32, 64, 128|
|Oversample Ratio||None, 1, 2, 3, 5, 10|
|Recurrent Unit State Size||1, 2, 3, 4, 5|
|Width of Connected Layers||32, 64, 128, 256, 512|
|Width of Convolutional Filters||3, 4, 5, 6, 7, 8, 9, 10|
|Width of Sliding Window||3, 4, 5, 6, 7, 8, 9, 10|
For Cox landmarking, we use the implementation in Therneau  for interval-censored data, fitting a sequence of proportional hazards regression models for observation groups. Optimal groupings are determined by exhaustive search in -year increments. Preliminary feature-selection is performed by stepwise regression for the best candidate model, using the implementation of Hu . For joint modeling, we adopt the common two-stage method in Wu , first fitting linear mixed effects sub-models for significant variables, then fitting landmarking models based on mean estimates from the sub-models. In both cases, consistent with literature, time is defined as years since initial follow-up Li and Luo , Liu et al. , Vos et al. .
For all neural network models, hyperparameter optimization is carried out via 100 iterations of random search(see Table 4
). In addition, activation functions are also searched over for gating units in recurrent network cells. Model selection is performed on the basis of final composite scores—as in Equation7—for each candidate. We use 5-fold cross validation to evaluate performance, stratified at the patient level—that is, patients are randomly selected into datasets for training (60%), validation (20%), and testing (20%), with the ratio of positive patients (i.e. those for which at least one sliding window contains at least one outcome of failure) to negative patients is kept uniform across folds.
In addition, we utilize output scatters to aid our understanding of the input-output relationship. We use MC dropout to generate average responses by explicitly varying one input feature at a time Gal and Ghahramani  in equation 6. This is done by either (1) varying only the final value of the measurement within the sliding window, or by (2) varying all values of the measurement the same time. While the latter gives insight into how the response changes with respect to a difference in levels, the former sheds light on how the response is affected by a sudden change instead. Interestingly in the case of temporal convolutions, for multiple features (Figure 5 shows the MMSE and CDRSB features as examples) the response is actually stronger to final value changes. This effect is not present in any feature for multilayer perceptrons. In light of the performance of convolutional models, this is consistent with our hypothesis that they appear able to learn more complex temporal patterns than multilayer perceptrons; in this case, these results suggest that a patient who experiences a sudden decline in certain features is expected to fare worse than one who has always scored just as poorly all along.
Sources of gain
While the advantage of using multilayer perceptrons over traditional statistical survival models has been studied (see Lee et al.  for example), here we account for the additional sources of gain from our design choices. Table 5 (a) shows the initial benefit from incorporating longitudinal histories of covariate measurements. Recurrent networks are used as a reasonable starting point; improvements—where positive—are marginal at best. (b) However, experiments on differenced data—capturing relative movements without absolute levels—indicate substantial informativeness (see df-RNN trained on differences vs. joint models trained on levels). (c) Utilizing limited sliding windows instead of recurrent cells shows promising improvements, boosting average AUPRC by 4% over both RNN and MLP models. (d) The addition of temporal convolutions produces incremental AUPRC gains of 9%. (e) Accommodating informative missingness by introducing the dual-stream architecture results in further incremental AUPRC gains of 2%. Compared with the joint modeling baseline, Match-Net achieves average AUPRC improvements of 15% and one-step-ahead improvements of 26%. (f) Furthermore, incorporating the most recent clinician input (Match-Net
) improves AUPRC by an additional 17%. While the distribution of gains skews near-term with our default choices of, , and (i.e. unweighted by ), alternate distributions may be obtained via appropriate weight functions depending on deployment context. (g) Finally, using Match-Net as an example, we observe individual and cumulative benefits due to miscellaneous design choices applied to all models.