1 Introduction
COVID-19 is the most significant global health emergency in recent memory, with hundreds of thousands dead and widespread economic disruption. There is growing evidence that imaging is useful for the diagnosis and management of COVID-19 [1, 2]. Clinicians use radiology imaging to assess structural information that cannot be obtained from laboratory tests or physical examination. In COVID-19, chest imaging adds a high-dimensional assessment of the degree of pulmonary involvement of the disease. It allows clinicians to rule out other conditions that might contribute to the patient's presentation, such as lobular pneumonia and pneumothorax, and to assess the patient for comorbidities such as heart failure, emphysema, and coronary artery disease. Some researchers have already found that imaging features predict mortality in COVID-19 [3].
In this paper we address the challenge of predicting the time course of COVID-19 patient outcomes; for example, the probability that a specific patient will need an ICU bed in the next few days following hospital admission. Classical statistical techniques for time-to-event analysis (sometimes referred to as survival analysis) are widely used, but struggle to incorporate images due to their high dimensionality.
We begin with an overview of time-to-event analysis and a discussion of the challenges that images and COVID-19 present. Our deep learning approach is presented in section 3, followed by a review of related work in section 4. We describe our clinical dataset and some implementation details, including the baselines, in section 5. Experimental results are given in section 6, with additional data and analysis in the supplemental material.
2 Time-to-event analysis
Time-to-event analysis techniques [4] predict the probability of an outcome event occurring before a specific time, while accounting for right-censored (incompletely observed) data. Right-censoring happens when the event under study may not be observed within the relevant time period. In the clinical setting, these methods can predict a patient's probability of undergoing an event in a particular time interval as a function of their features. In our dataset, for instance, when predicting whether a hospitalized COVID-19 patient will be admitted to the ICU, right-censoring happens when, as of today, the patient has not been admitted.
Time-to-event analysis focuses on three interrelated quantities: (1) the hazard function λ(t), the rate of the event at time t given that the event did not occur before time t, which is not affected by right-censoring [5]; (2) the cumulative hazard function Λ(t), the integral of the hazard function between 0 and time t; and (3) the survival function S(t) = exp(−Λ(t)), a decreasing function giving the probability that a patient has not experienced the event of interest by time t.
While the hazard function λ(t) is not a probability, λ(t)·dt can be viewed as the probability of the event occurring in a small interval around t, given that the event did not occur before t. For clinical purposes, once we have estimated the hazard function we can compute the probability of an event occurring during a specific time interval, e.g. ICU admission in the 72 hours after hospitalization.
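The relationships between these quantities can be sketched in a few lines of NumPy; the daily hazard values below are illustrative placeholders, not values from our data:

```python
import numpy as np

# Discrete-time sketch: from a daily hazard sequence, derive the cumulative
# hazard and the survival function, then the probability of an event in a window.
hazard = np.array([0.02, 0.03, 0.05, 0.04, 0.03])  # hazard per day (illustrative)
cum_hazard = np.cumsum(hazard)                     # Lambda(t): cumulative hazard
survival = np.exp(-cum_hazard)                     # S(t) = exp(-Lambda(t))

# Probability of the event (e.g. ICU admission) within the first 3 days:
p_within_3_days = 1.0 - survival[2]
```

Because the survival function is the exponential of a negative, non-decreasing quantity, it decreases over time, and interval probabilities fall out as differences of survival values.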
2.1 Cox proportional hazards model
We model the hazard function using the most popular model, the Cox model [6], defined as

λ(t | x) = λ₀(t) · exp(g(x))    (1)

Here t is the time, x is the set of features, λ₀(t) is the baseline hazard, the hazard of the specific event under study shared by all patients at time t, and g(x) is the risk function, which describes the relationship between a patient's features and the hazard of experiencing an event. Note that λ₀(t) depends only on time and not on features.
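A small sketch makes the proportional-hazards structure of Eq. (1) concrete. Here we use a linear risk score βᵀx, as in the classical Cox model (the baseline hazard and coefficients are made up for illustration, not fitted to any data):

```python
import numpy as np

def cox_hazard(t, x, baseline_hazard, beta):
    """Cox model hazard: baseline hazard at time t, scaled by exp of the risk score."""
    return baseline_hazard(t) * np.exp(np.dot(beta, x))

# Illustrative baseline hazard and coefficients (assumptions, not fitted values).
baseline_hazard = lambda t: 0.1 + 0.05 * t
beta = np.array([0.7, -0.2])

x_a = np.array([1.0, 0.0])   # features of a hypothetical patient A
x_b = np.array([0.0, 1.0])   # features of a hypothetical patient B

# The hazard ratio between two patients is constant over time, because the
# time-varying baseline cancels in the ratio.
ratios = [cox_hazard(t, x_a, baseline_hazard, beta) /
          cox_hazard(t, x_b, baseline_hazard, beta) for t in (1.0, 5.0, 10.0)]
```

The constant ratio exp(β·x_a − β·x_b) is exactly the proportional hazards property discussed below.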
The Cox model has several advantages. First, it makes no distributional assumptions, so its results will closely approximate those of the correct model [7]. Second, even though the baseline hazard is assumed to be "unknown" and left unspecified, under minimal assumptions the hazard, the cumulative hazard and the survival functions can be directly determined. These can then be used to predict the probability of an event occurring before the observing time t.¹
¹ Another statistical technique used to predict the probability of a binary event is logistic regression [8]. Although widely used, logistic regression ignores the information about the time to an event and the censoring time, while time-to-event techniques fully exploit this important information [7].

The Cox model assumes that time-fixed features (features that do not change over time) have a linear multiplicative effect on the hazard function and that the hazard ratio is constant over time. This is known as the proportional hazards (PH) assumption [6]. In our task this means, for example, that patients with a low income have a higher (or lower) hazard of dying than patients with a high income, and that this ratio is constant over time. Note that this assumption is only needed for time-fixed features, and successful strategies to detect and overcome its violation are easily implemented. Examples include the graphical approach of log-log plots for detection, and adding to the Cox model an interaction between the non-proportional time-fixed feature and time to overcome a violation [9, 10].
2.2 Images present challenges
The Cox model is a mainstay of time-to-event analysis, and has been extended to deal with complex scenarios [11, 12, 13, 14, 15, 16]. However, two features of our task require us to go beyond the state of the art. First, images pose a significant challenge due to their high dimensionality. Second, the time course of COVID-19 involves multiple interrelated events that cannot be predicted independently.
While there is compelling evidence that imaging studies are helpful in the diagnosis and management of COVID-19 [1, 2, 3], images present significant challenges. The amount of data in a single imaging study is orders of magnitude larger than the data available from other sources; a single medical image can easily be hundreds of megabytes.² However, the Cox proportional hazards model cannot directly handle images as features due to their high dimensionality: as [17] reports, such inputs lead to degenerate behavior. It would of course be possible to learn a single feature from an imaging study, for example a rating of disease severity on a 3-point scale. Such an approach, however, would severely and unnecessarily limit what can be learned from the images, a particularly poor choice for a novel disease.

² While CT is becoming increasingly available, early in the pandemic imaging was primarily chest X-ray (CXR).
As mentioned, the COVID-19 disease process involves competing and interrelated events. A straightforward application of time-to-event analysis would predict these events independently. This could easily lead to incoherent and self-contradictory predictions (for example, predicting that ICU discharge will almost certainly happen before admission to the ICU).
3 Our approach
Our main goal is to predict the probability of experiencing death, ICU admission, ICU discharge, hospital admission, and hospital discharge before the observing time t, as a function of the patient's features. To do so, we assume non-linear proportional hazards [13] for the time-fixed features.³ This assumption relaxes the stricter linear proportional hazards assumption of the classical Cox model. In our analyses it means, for instance, that the hazard of dying among older patients at baseline increases non-linearly compared with younger patients at baseline, and that this ratio is constant over time [7]. The assumption has already been used in a variety of state-of-the-art deep learning time-to-event techniques [11, 18, 13]. We also make the common assumption of non-informative censoring [19], which states that, after conditioning on observable features, censoring and the risk for an event of interest are independent, i.e., the censoring mechanism does not depend on the unobserved features.

³ This assumption is only needed for time-fixed features; time-dependent features already depend on time, making the hazard depend on time as well.
To estimate the hazard function introduced in Eq. (1) we compute two components: the baseline hazard λ₀(t), which depends only on time, and the risk function g(x), which depends only on the features x. Once the hazard function is estimated, the cumulative hazard function and the survival function can be easily derived [4]; these are then used to predict the patient's probability of undergoing an event before time t. In other words, for each patient's set of features x, we can predict the probability of an event happening before the observing time t. Because the baseline hazard does not depend on the features, it can be easily computed with classical estimators; we used the one presented in Eq. 4.34 of [19]. The risk function, however, depends on time-fixed and time-dependent image and non-image features. To estimate the risk function while taking these challenging feature types into account, we developed novel deep learning techniques.
3.1 Architecture
To incorporate time-dependent imaging studies, time-dependent non-image data, and time-fixed variables, our proposed architecture has three components. First, we use a convolutional LSTM (ConvLSTM) [20] and an RNN-LSTM to extract time-dependent image features and time-dependent non-image features, respectively. We then concatenate the features extracted from these networks with the time-fixed variables mapped to their corresponding embedding spaces, and pass the concatenated vector through a set of fully connected layers (FC layers) to predict the risk function (Risk). The architecture is shown in figure 1.

3.2 Loss function
Computed on the hazard function, the Cox partial likelihood loss has been successfully used in recent state-of-the-art deep learning techniques [18, 13]. This likelihood, however, is unsuitable here, since it applies only to continuous-time data where no two events occur at the same time. With discrete-time data, ties may occur, and all possible orderings of these tied events should be considered. We therefore adopt Efron's approximation for handling ties, a computationally efficient estimate of the original Cox partial likelihood in the presence of ties [21].
Specifically, the loss function is as follows:

L = − Σ_j [ Σ_{i∈D_j} ĝ_i − Σ_{ℓ=0}^{d_j−1} log( Σ_{i∈R_j} e^{ĝ_i} − (ℓ/d_j) Σ_{i∈D_j} e^{ĝ_i} ) ]    (2)

Here t_j denotes a unique event time and T_i is the follow-up time for patient i. We define the event indicator δ_i, where δ_i = 1 if the patient experiences an event at follow-up time T_i, and δ_i = 0 if censored. The risk estimate ĝ_i is the output of our architecture for patient i. R_j = {i : T_i ≥ t_j} is the risk set at t_j, D_j = {i : T_i = t_j, δ_i = 1} is the set of patients with tied events at t_j, and d_j = |D_j|. It is worth noticing that, by construction, this loss function does not contain the baseline hazard λ₀(t), making its computation easier.
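A minimal NumPy sketch of this loss, under the notation above (a reference implementation for clarity, not our training code):

```python
import numpy as np

def efron_negative_log_likelihood(times, events, risks):
    """Negative Cox partial log-likelihood with Efron's correction for ties.

    times:  follow-up time T_i per patient
    events: 1 if the event was observed at T_i, 0 if censored
    risks:  risk estimates g_i (model outputs)
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risks = np.asarray(risks, dtype=float)
    exp_risks = np.exp(risks)
    log_lik = 0.0
    for t in np.unique(times[events]):        # unique event times t_j
        tied = events & (times == t)          # D_j: events tied at t_j
        at_risk = times >= t                  # R_j: risk set at t_j
        d = int(tied.sum())
        risk_sum = exp_risks[at_risk].sum()
        tied_sum = exp_risks[tied].sum()
        log_lik += risks[tied].sum()
        for ell in range(d):
            log_lik -= np.log(risk_sum - (ell / d) * tied_sum)
    return -log_lik
```

When no ties are present, the inner loop runs once per event time and the expression reduces to the ordinary Cox partial likelihood.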
3.3 Evaluation and inference
We use the concordance error [22] to compare the performance of different models. This is the number of incorrectly ordered event pairs divided by the total number of evaluable pairs.
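A straightforward pairwise implementation of this metric (an O(n²) sketch; ties in risk are counted as half an error, a common convention):

```python
import numpy as np

def concordance_error(times, events, risks):
    """Fraction of incorrectly ordered pairs among all evaluable pairs.

    A pair (i, j) with T_i < T_j is evaluable when patient i had an event;
    it is correctly ordered when the earlier event has the higher risk.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risks = np.asarray(risks, dtype=float)
    pairs, wrong = 0, 0.0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                pairs += 1
                if risks[i] < risks[j]:
                    wrong += 1.0
                elif risks[i] == risks[j]:
                    wrong += 0.5          # risk ties count as half an error
    return wrong / pairs
```

A perfect risk ordering gives an error of 0; a completely reversed ordering gives 1.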
Similar to the loss function, at inference time our model estimates the baseline hazard and the cumulative baseline hazard function using the Efron estimator for handling ties:

λ̂₀(t_j) = Σ_{ℓ=0}^{d_j−1} 1 / ( Σ_{i∈R_j} e^{ĝ_i} − (ℓ/d_j) Σ_{i∈D_j} e^{ĝ_i} )    (3)

Λ̂₀(t) = Σ_{t_j ≤ t} λ̂₀(t_j)    (4)
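Eqs. (3) and (4) translate directly into code; the following NumPy sketch mirrors the notation of the loss function above:

```python
import numpy as np

def efron_baseline_hazard(times, events, risks):
    """Baseline hazard increments at each unique event time (Efron estimator, Eq. 3)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    exp_risks = np.exp(np.asarray(risks, dtype=float))
    event_times = np.unique(times[events])
    hazards = []
    for t in event_times:
        tied = events & (times == t)          # D_j
        at_risk = times >= t                  # R_j
        d = int(tied.sum())
        risk_sum = exp_risks[at_risk].sum()
        tied_sum = exp_risks[tied].sum()
        hazards.append(sum(1.0 / (risk_sum - (ell / d) * tied_sum)
                           for ell in range(d)))
    return event_times, np.array(hazards)

def cumulative_baseline_hazard(times, events, risks, t):
    """Cumulative baseline hazard: sum of increments at event times <= t (Eq. 4)."""
    event_times, hazards = efron_baseline_hazard(times, events, risks)
    return float(hazards[event_times <= t].sum())
```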
3.4 Minibatch SGD and stratified sampling
In the Cox partial likelihood function, the formulation involves the risk predictions of all patients whose follow-up time is at least t_j. It is computationally costly, and almost infeasible, to optimize this loss when models use time-dependent images. Instead, we use minibatch stochastic gradient descent: at each iteration, we sample a subset of patients and compute the loss on that subset. To closely mimic the data distribution of the original patient group, and to keep the loss function in a stable range (the range of the loss correlates with the number of events in the patient group), we use stratified sampling to maintain the same ratio of non-censored and censored patients in each minibatch.
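The stratified sampler can be sketched as follows (the cohort size, event fraction and batch size are illustrative assumptions):

```python
import numpy as np

def stratified_minibatch(events, batch_size, rng):
    """Sample patient indices, preserving the cohort's event/censored ratio."""
    events = np.asarray(events, dtype=bool)
    n_event = max(1, round(batch_size * events.mean()))
    n_censored = batch_size - n_event
    event_idx = rng.choice(np.flatnonzero(events), n_event, replace=False)
    censored_idx = rng.choice(np.flatnonzero(~events), n_censored, replace=False)
    return np.concatenate([event_idx, censored_idx])

rng = np.random.default_rng(0)
events = np.array([1] * 20 + [0] * 80)    # 20% events, as an example cohort
batch = stratified_minibatch(events, 10, rng)
```

Each batch then contains the same proportion of events as the full cohort, which keeps the partial-likelihood loss in a comparable range across iterations.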
4 Related work
Time-to-event analysis is an important tool for many applications and plays a crucial role in healthcare. A wide variety of techniques have been developed [4], notably including non-parametric methods such as Kaplan-Meier, parametric methods such as Weibull and Gompertz, and semi-parametric models such as the Cox model. Powerful machine learning algorithms have also been adapted to this task; popular examples include Survival Support Vector Machines [23] and Random Survival Forests [24]. The Cox model, which directly estimates the hazard function, stands out as perhaps the most popular time-to-event analysis method.

Deep Survival Analysis. In recent years, deep learning has been used to extend the Cox model. DeepSurv [13] was one of the first to use deep learning to model the risk function in the hazard function, demonstrating strong performance in both linear and non-linear settings. There have also been attempts to overcome DeepSurv's limitation to structured data: using convolutional neural networks, [18, 25, 26] model hazards on unstructured features such as images, which are much harder to incorporate due to their high dimensionality. Our architecture further extends the Cox model so that it takes both structured and unstructured longitudinal input.

Unstructured Longitudinal Data. Incorporating longitudinal data into time-to-event prediction has received increasing attention, typically using deep learning techniques. [14] proposes DeepHit, a model that allows the relationship between covariates and risks to change over time. [27, 28, 29] have demonstrated the effectiveness of recurrent neural networks on longitudinal structured datasets. We appear to be the first to apply recurrent neural networks to longitudinal unstructured medical data, and the recurrent nature of our model successfully captures time-dependency relationships within the data.
End-to-End Training. As mentioned in section 3.4, the negative partial likelihood loss is computationally expensive, and almost infeasible (in terms of GPU memory) to compute, when a large patient group has unstructured image data. [30] adopted an alternative two-step training strategy that first trains a feature-extraction network on image data with expert labelling, then uses the extracted features as covariates of the Cox model. Using minibatch sampling, our training process is end-to-end and requires no expert labelling.
5 Experimental setup and clinical dataset
Our techniques are designed for a clinical setting where a combination of time-dependent and time-independent patient features is available, including imaging. There are no publicly available COVID-19 datasets containing this information, and patient-privacy considerations make it unlikely that such data will be available anytime soon. The closest existing COVID-19 datasets focus primarily on images and generally do not contain significant additional information. The recent BIMCV-COVID19+ dataset [31] is an exception, containing a limited amount of information such as demographics and antibody test results, but it falls far short of the detailed clinical information our methods are designed to exploit. Notably, it contains neither patient outcomes nor lab values.
5.1 Clinical dataset
                     | Admit                 | ICU Admit     | ICU Discharge | Discharge     | Death
Event #              | 137                   | 287           | 251           | 1171          | 290
Censored #           | 99                    | 224           | 150           | 219           | 427
Event Removed #      | 1395                  | 138           | 1             | 26            | 6
Censored Removed #   | 263                   | 1245          | 1492          | 478           | 1171
Baseline Date        | First available X-ray | Admit         | ICU Admit     | Admit         | Admit
Start Date           | Same as Baseline      | 7 Days Before | 7 Days Before | 7 Days Before | 7 Days Before
Cut-Off Day          | 30 Days               | 10 Days       | 10 Days       | 10 Days       | 10 Days
                 | Age        | Smoking            | Pregnancy | Cancer
Data Breakdown   | IQR: 23    | Active Smoker: 70  | Yes: 7    | Solid: 71
                 | Median: 64 | Former Smoker: 382 | No: 1705  | Liquid: 42
                 |            | Non-Smoker: 1257   |           | No: 1257
Missing Report   | 29.8%      | 9.8%               | 9.6%      | 27.6%
We are fortunate to have IRB-approved access to a large clinical dataset of 1,894 COVID-19 patients, longitudinally collected over 90 consecutive days early in the pandemic. The dataset includes patient demographics, hourly-recorded vital signs, treatment regimes and clinical notes. Crucially for our purposes, a significant number of imaging studies are also available: over 14,000 studies, primarily chest X-ray (CXR) but also an increasing number of computed tomography (CT) exams. Outcomes are also available, including such important events as hospital admission and discharge, ICU admission and discharge, and mortality. All patients tested COVID-19 positive by PCR. Multiple X-rays, 4 types of time-fixed features and 29 types of time-dependent non-image features of patients during their stay are also available. The time-fixed features are smoking and pregnancy status, active cancer, and age at admission.
The data was provided to us by a study research coordinator. Due to possible inconsistencies in the ways that events are recorded at their institution, there may be a small number of patients whose dates of events were not included. In such cases the patient's events occurred but were unobserved, which is right-censored data in a time-to-event framework. Our approach to censored outcomes, discussed below, handles this situation straightforwardly.
We process the available data with the following pipeline:

1. For each type of event, we select a baseline date for the Cox model based on advice from clinicians, and generate a time interval for each patient. The start date and end (follow-up) date of the interval depend on the baseline date and the type of event. We filter out any patient without an appropriate baseline date.

2. We discretize the time interval generated for each qualified patient by day, and associate each time-dependent feature in this interval with its corresponding day. We remove patients who do not have any X-rays taken during this interval.

3. A cut-off day is selected for each type of event, and we remove any data collected beyond it.
Table 2 summarizes our choices of baseline date, start date and cut-off day for each type of event. For subjects with event outcomes, the follow-up date is exactly the date when the event occurred. For censored outcomes (including unrecorded deaths), we use the date of the most recent X-ray as the follow-up date, where the study terminates without reaching a definite conclusion. Statistics on time-fixed features are provided in Table 2. For completeness, the set of time-dependent features can be found in the supplement.
In our dataset, longitudinal information is available until the patient reaches an absorbing state. This is not a realistic setting, since predictions are most likely needed at earlier stages, when only data close to the baseline date are available. However, our cut-off-day procedure prevents simple cheating such as counting the number of days for which data is available. The cut-off also avoids other forms of cheating: far enough beyond the baseline date, for example, some lab results may start to give obvious clues.
Note that our model is recurrent and capable of analyzing input without any X-rays. However, we want to study the impact of images on time-to-event analysis, and wish to compare our model with another image-based survival analysis method, DeepConvSurv [18]. As a result, we have filtered out patients who did not have any X-rays taken during the corresponding intervals.
After obtaining the desired patient population for each event, we follow a standard train, validation, test split (60:20:20), dividing the population so that the subsets share the same ratio of event to censored patients.
5.2 Implementation details
Non-image time-dependent data is represented by a matrix of size d × 2k, where d is the cut-off day, i.e. the maximum number of days of data the model can use for this event prediction, and k is the total number of types of lab results available. In each length-2k row of this matrix, the first k values are the lab results rescaled to the [0, 1] range, and the remaining k values are availability indicators: an indicator is 1 if the corresponding lab measurement is available on that day, and 0 otherwise.
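This encoding can be sketched as follows; the encoder function and the tiny example are illustrative, with values assumed to be already rescaled to [0, 1]:

```python
import numpy as np

def encode_labs(daily_labs, d, k):
    """Build a d x 2k matrix: k rescaled lab values plus k availability flags.

    daily_labs: dict mapping day -> {lab_index: value rescaled to [0, 1]}
    """
    m = np.zeros((d, 2 * k))
    for day, labs in daily_labs.items():
        for lab_index, value in labs.items():
            m[day, lab_index] = value          # measured value
            m[day, k + lab_index] = 1.0        # indicator: available that day
    return m

# Example: one lab (index 1) measured on day 0 only, with d = 3 days, k = 2 labs.
m = encode_labs({0: {1: 0.5}}, d=3, k=2)
```

The indicator columns let the model distinguish a genuinely measured value of zero from a missing measurement.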
X-ray images of the same patient are downsized and stacked, resulting in a tensor whose leading dimension is the number of patient X-rays available for this event. To predict the risk function for a patient, the image data is fed recurrently to a ConvLSTM with two hidden layers to extract image features. The extracted features are further downsized by a 2D adaptive average pooling layer followed by a fully connected layer (size 64). Non-image time-dependent inputs are first mapped to an embedding space (embedding size 15), then fed recurrently to an LSTM to extract the respective features. Patient demographics, which are all discrete categorical values, are first mapped to their corresponding embedding spaces (embedding size 2), then concatenated with the extracted features from the ConvLSTM and LSTM branches. The concatenated vector goes through three fully connected layers (sizes 32, 16 and 1) to predict the final outcome.
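The fusion head can be sketched with a plain NumPy forward pass; the branch feature sizes here are assumptions standing in for the real branch outputs (only the FC sizes 32, 16 and 1 come from the text), and the weights are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in feature vectors for the three branches (sizes are assumptions).
image_features = rng.standard_normal(64)   # ConvLSTM branch after pooling + FC
lab_features = rng.standard_normal(15)     # LSTM branch over lab embeddings
demographic_emb = rng.standard_normal(8)   # concatenated time-fixed embeddings

fused = np.concatenate([image_features, lab_features, demographic_emb])

def fully_connected(x, n_out, rng, last=False):
    """One FC layer with random weights; ReLU except on the final scalar output."""
    w = rng.standard_normal((n_out, x.size)) * 0.1
    y = w @ x
    return float(y[0]) if last else np.maximum(y, 0.0)

hidden = fully_connected(fused, 32, rng)
hidden = fully_connected(hidden, 16, rng)
risk = fully_connected(hidden, 1, rng, last=True)
```

The final scalar plays the role of the risk estimate ĝ_i fed into the partial-likelihood loss.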
During training, we use Adam [32] with a batch size of 40. We divide the training phase into 30 epochs, each consisting of 20 randomly sampled minibatches. With limited GPU memory available, and to reduce computational cost, we further subsample the X-ray data, limiting the number of X-rays per patient to a fixed budget and using zero padding when patients have fewer valid X-rays. In the validation phase, for each epoch we choose the minibatch on which the model has the lowest concordance error to estimate the baseline hazard, and compute the concordance error on the validation set. The epoch with the lowest concordance error, with its best-performing minibatch on the validation set, is then used to compete against the baseline models on the test set.
5.3 Baseline details
We compare our proposed architecture with 8 different models, including parametric and semi-parametric models, standard and non-linear CoxPH models, and other popular machine learning survival methods:

- Image input models: Deep Convolutional Neural Network for Survival Analysis (DeepConvSurv) [18]
Recall that for our model, the non-image time-dependent input is formatted as a matrix of size d × 2k. For the non-image baselines, which take 1D input, we concatenate the rows of the matrix along the day axis, and further combine the result with the time-fixed data to obtain a long vector representation of the patient data.
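This flattening step is a one-liner; the example matrix and time-fixed vector below are illustrative:

```python
import numpy as np

def flatten_for_baseline(lab_matrix, time_fixed):
    """Concatenate the day rows of the lab matrix, then append time-fixed data."""
    return np.concatenate([lab_matrix.ravel(),
                           np.asarray(time_fixed, dtype=float)])

# Example: a 3-day x 2-column lab matrix plus two time-fixed values.
v = flatten_for_baseline(np.arange(6.0).reshape(3, 2), [1.0, 0.0])
```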
For DeepSurv [13], we used a model with 2 fully connected layers (sizes 128 and 64) with ReLU activation. For DeepConvSurv [18], we implemented a network very similar to the original, consisting of three convolutional layers (conv1: stride 3, 32 channels; conv2: stride 2, 32 channels; conv3: stride 2, 32 channels) followed by a single FC layer. We also added max pooling and 2D dropout to the network to improve model performance. As with our own architecture, we train DeepConvSurv using randomly selected images for each patient, and experiment both with the loss computed over the entire training set and with minibatch gradient descent.
To investigate the impact of images on prediction, we provide two baselines of our own: 1) a model with the ConvLSTM branch removed, which predicts only from non-image data, and 2) a "complementary" model with the LSTM branch removed, which predicts solely from image data. In both baseline models we also use minibatch gradient descent with a batch size of 40.
6 Results
6.1 Performance comparisons
We evaluate model performance using the time-dependent concordance error discussed in section 3.3. Results of the comparisons are in Table 3. Additional experiments and ablation studies are included in the supplemental material.
Method                   | Admit | ICU Admit | ICU Discharge | Discharge | Death
CoxPH                    | 0.434 | 0.371     | 0.486         | 0.240     | 0.358
Weibull                  | 0.319 | 0.372     | 0.478         | 0.437     | 0.477
Gompertz                 | 0.382 | 0.362     | 0.413         | 0.272     | 0.360
DeepSurv                 | 0.316 | 0.358     | 0.454         | 0.398     | 0.373
Survival SVM             | 0.333 | 0.365     | 0.464         | 0.347     | 0.316
Random Survival Forest   | 0.337 | 0.309     | 0.435         | 0.427     | 0.282
DeepConvSurv (GD)        | 0.403 | 0.417     | 0.468         | 0.356     | 0.351
DeepConvSurv (minibatch) | 0.419 | 0.415     | 0.478         | 0.419     | 0.435
Image-Only (ours)        | 0.238 | 0.408     | 0.459         | 0.350     | 0.401
Non-Image (ours)         | 0.247 | 0.265     | 0.439         | 0.262     | 0.278
Image+Non-Image (ours)   | 0.198 | 0.241     | 0.385         | 0.229     | 0.246
In our experiments, non-linear CoxPH models (DeepSurv and ours) and the Random Survival Forest almost always outperform the parametric models. This is expected, since Cox models make no distributional assumptions about the data. The performance of DeepConvSurv and of our Image-Only baseline is also noteworthy: it demonstrates that images alone (without any lab values or demographics) provide useful information for prediction. Using multiple images further improves performance.
Our recurrent baseline model, which captures time-dependency relations, achieves competitive performance across all 5 events even without images. Adding the ConvLSTM branch that processes time-dependent images further improves the predictions on every event, with the largest gains on Hospitalization (Admit) and ICU Discharge.
To better illustrate the comparison, figure 3 shows the time-dependent Brier score [36] for our method and two standard time-to-event prediction techniques. This extends the Brier score [37] to a specific time horizon.
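For reference, a simplified version of the horizon-specific Brier score can be written as follows. This sketch drops the inverse-probability-of-censoring weights (IPCW) used in the full time-dependent version, simply excluding patients censored before the horizon:

```python
import numpy as np

def brier_score_at(horizon, times, events, predicted_survival):
    """Brier score at a fixed horizon, ignoring IPCW censoring weights.

    predicted_survival[i] is the model's predicted S_i(horizon). Patients
    censored before the horizon are dropped: their status there is unknown.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    predicted_survival = np.asarray(predicted_survival, dtype=float)
    known = (times > horizon) | (events & (times <= horizon))
    event_free = (times[known] > horizon).astype(float)
    return float(np.mean((event_free - predicted_survival[known]) ** 2))
```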
Our experiments demonstrate the effectiveness of our recurrent architecture on non-linear, longitudinal data. We also show that incorporating multiple, time-dependent imaging studies significantly improves time-to-event predictions.
6.2 Understanding the dataset and model
We conduct several experiments to provide insights into what our models learned.
Feature Importance Test: While our imaging studies were anonymized, fields describing individual scanners were preserved. This provides a way to check whether our model is simply learning an association between certain scanners and disease severity (for example, sicker patients might be in certain areas of the hospital and be routed to the nearest X-ray machine). To test this, we conducted permutation-based feature importance tests (using the survival random forest), measuring how much performance drops when the relationship between a feature and survival time is destroyed by random shuffling. Specifically, we added the scanner IDs of patient X-rays as a feature for event prediction and measured the average increase in concordance error after shuffling. Across the Hospital Admission, ICU Admission, ICU Discharge, Discharge and Death events, the increase for Scanner ID was small compared with that of features such as Age, suggesting little correlation between scanner types and events.
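The permutation-importance procedure itself is simple; the following sketch takes any error function (e.g. concordance error under a fitted model) and reports the average error increase when one column is shuffled:

```python
import numpy as np

def permutation_importance(error_fn, features, column, n_repeats=10, seed=0):
    """Average increase in error when one feature column is randomly shuffled."""
    rng = np.random.default_rng(seed)
    base_error = error_fn(features)
    increases = []
    for _ in range(n_repeats):
        shuffled = features.copy()
        rng.shuffle(shuffled[:, column])   # break the feature/outcome link
        increases.append(error_fn(shuffled) - base_error)
    return float(np.mean(increases))
```

As a sanity check, an error function that ignores within-column order (such as one depending only on a column mean) should show an importance of zero.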
Black Box Test: Recent papers, such as [38], have identified flaws in the evaluation of deep learning models for COVID-19: deep learning models were able to use non-medical information in the X-rays, such as artifacts or watermarks, to differentiate between patients with and without COVID-19. To demonstrate that our dataset and model do not exhibit this flaw, we performed experiments similar to [38], obscuring various portions of the images with black boxes and retraining the model. This should determine which parts of the images are important to the model. If our technique relied on incidental properties of the images, such as identifiable artifacts from individual scanners, we would expect an image-and-labs model to outperform a labs-only model even when most of the image is obscured.
Event         | Image + labs (ours) | 80% obscured | 90% obscured | 95% cols obscured | 100% obscured | Labs only (ours)
Admit         | 0.198               | 0.250        | 0.257        | 0.287             | 0.268         | 0.247
ICU Admit     | 0.241               | 0.262        | 0.271        | 0.262             | 0.264         | 0.265
ICU Discharge | 0.335               | 0.413        | 0.437        | 0.431             | 0.437         | 0.439
Discharge     | 0.229               | 0.272        | 0.268        | 0.276             | 0.264         | 0.262
Death         | 0.254               | 0.296        | 0.274        | 0.271             | 0.278         | 0.278
Results are shown in table 4. The first and last columns give our results with images included (first column) and with non-image data only (last column). We used several different box sizes and zeroed out the portions of the image within them. As a sanity check, we obscured the entire image ("100% obscured"). We also obscured large portions of the image ("80% obscured", "90% obscured"; the percentages indicate how much of the image was obscured, starting at the center). Note that clues concerning illness severity (e.g. tubes) are sometimes located at the top, and thus remain visible in the 80% and 90% cases. We therefore also included a "95% cols obscured" case, where the central 95% of the columns are removed. The boxes are shown overlaid on a publicly available chest X-ray above table 4.
The experimental results suggest that our models are not learning coincidental features such as scanner artifacts or watermarks. We also observe some small improvements when our models use 80%-obscured images, and there is a natural explanation: unless the image is almost entirely obscured, various lines and sensors remain easily discernible outside the lungs, and these may provide an indication of disease severity.
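The obscuring operations used in this test are straightforward to sketch with NumPy; the functions below are illustrative reimplementations, not our exact preprocessing code:

```python
import numpy as np

def obscure_center(image, fraction):
    """Zero out a centered box covering `fraction` of the image area."""
    h, w = image.shape
    side = np.sqrt(fraction)                 # box side as a fraction of each axis
    box_h, box_w = int(round(h * side)), int(round(w * side))
    top, left = (h - box_h) // 2, (w - box_w) // 2
    out = image.copy()
    out[top:top + box_h, left:left + box_w] = 0.0
    return out

def obscure_center_columns(image, fraction):
    """Zero out the centered `fraction` of the columns at full height."""
    h, w = image.shape
    box_w = int(round(w * fraction))
    left = (w - box_w) // 2
    out = image.copy()
    out[:, left:left + box_w] = 0.0
    return out
```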
Number of Input Images Ablation: We also perform an ablation on the number of images used when training and evaluating the model. We find that the model improves as the number of images increases for the majority of tasks (admission, ICU discharge, and discharge), with larger image budgets providing a 10-20% improvement in concordance error over smaller ones. The remaining tasks show little to no improvement from a larger number of images. However, an analysis of the dataset statistics shows that these tasks tend to have fewer images per patient; this means our training set also has fewer examples of patients with more data, which may lead to problems with generalization. We include more details on this ablation in the supplemental material.
Minibatch Size Ablation: We also investigated the effect of the minibatch size, and observed that it does not affect model performance once it is sufficiently large. Particularly small minibatches contain very few events, which results in performance drops; beyond that threshold, the models tend to converge to the same point. In figure 4, we show an example of the validation error curves for discharge events with respect to the number of examples seen. From the figure, we see that the size of the minibatch does not seem to affect convergence.
7 Conclusions
We describe a deep learning approach that incorporates time-fixed data, longitudinal non-image data and longitudinal images into time-to-event analysis. Our technique accurately predicts the probability of experiencing an event in the presence of right-censored data.
We used a large COVID-19 dataset containing longitudinal imaging and non-imaging information. While this dataset contains valuable information for predicting the occurrence of clinical events, there is some risk of selection bias. For instance, we only included in our analysis patients with multiple imaging studies over time. Multiple images are usually taken when a patient is sicker, for example to confirm central line placement. This selection could lead to a sample that is not representative of the hospitalized COVID-19 population. Selection bias is a common problem in machine learning [39], statistics [40], and epidemiology [41]; as a result, a number of techniques have been developed to correct for it [39].
We have demonstrated that neural networks can explicitly model transitions between competing events and predict the transition-specific risk (e.g., the risk of transition from hospitalization to ICU admission) for a particular set of patient features. While our focus is on COVID-19, the techniques we propose should be generally applicable to a wide class of serious diseases where imaging can improve the prediction of patient outcomes.
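One way such a transition-specific head can be realized, sketched here under our own assumptions rather than as the paper's exact architecture, is a discrete-time competing-risks output in the spirit of DeepHit [14]: one logit per (time bin, transition) pair plus a no-event logit, normalized jointly by a softmax.

```python
import math

def competing_risk_probs(logits):
    """logits: dict mapping (time_bin, transition) -> score, plus a
    ('none',) key for the no-event-within-horizon outcome.
    Returns a jointly normalized probability for every outcome."""
    exps = {k: math.exp(v) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def transition_risk(probs, transition, horizon):
    """Cumulative incidence of one transition within `horizon` time bins."""
    return sum(p for k, p in probs.items()
               if k != ('none',) and k[1] == transition and k[0] < horizon)

# Hypothetical logits for two time bins and two competing transitions.
example = {(0, 'icu'): 1.0, (1, 'icu'): 0.0,
           (0, 'discharge'): 0.5, ('none',): 2.0}
icu_risk_2bins = transition_risk(competing_risk_probs(example), 'icu', 2)
```

Because the softmax normalizes over all transitions jointly, probability assigned to one competing event (e.g., discharge) necessarily reduces the predicted risk of the others, which is the behavior competing-risks analysis requires [5].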
Acknowledgements We thank Joshua Geleris MD and George Shih MD for their help understanding various clinical issues. We received considerable help acquiring the dataset from Martin Prince MD PhD (PI on the IRB), Benjamin Cobb, Sadjad Riyahi MD, and Evan Sholle. This research was supported by a gift from SenseTime.
References
 [1] Damiano Caruso, Marta Zerunian, Michela Polici, Francesco Pucciarelli, Tiziano Polidori, Carlotta Rucci, Gisella Guido, Benedetta Bracci, Chiara de Dominicis, and Andrea Laghi. Chest CT Features of COVID-19 in Rome, Italy. Radiology, page 201237, April 2020.
 [2] Sakiko Tabata, Kazuo Imai, Shuichi Kawano, Mayu Ikeda, Tatsuya Kodama, Kazuyasu Miyoshi, Hirofumi Obinata, Satoshi Mimura, Tsutomu Kodera, Manabu Kitagaki, Michiya Sato, Satoshi Suzuki, Toshimitsu Ito, Yasuhide Uwabe, and Kaku Tamura. Clinical characteristics of COVID-19 in 104 people with SARS-CoV-2 infection on the Diamond Princess cruise ship: A retrospective analysis. The Lancet Infectious Diseases, June 2020.
 [3] Danielle Toussie, Nicholas Voutsinas, Mark Finkelstein, Mario A Cedillo, Sayan Manna, Samuel Z Maron, Adam Jacobi, Michael Chung, Adam Bernheim, Corey Eber, Jose Concepcion, Zahi Fayad, and Yogesh Sean Gupta. Clinical and Chest Radiography Features Determine Patient Outcomes in Young and Middle-age Adults with COVID-19. Radiology, page 201754, May 2020.
 [4] David G Kleinbaum and Mitchel Klein. Survival Analysis, volume 3. Springer, 2010.
 [5] Jan Beyersmann, Arthur Allignol, and Martin Schumacher. Competing Risks and Multistate Models with R. Springer Science & Business Media, 2011.
 [6] David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972.
 [7] David G Kleinbaum and Mitchel Klein. The Cox proportional hazards model and its characteristics. In Survival Analysis, pages 97–159. Springer, 2012.
 [8] David G Kleinbaum, K Dietz, M Gail, Mitchel Klein, and Mitchell Klein. Logistic Regression. Springer, 2002.
 [9] David G Kleinbaum and Mitchel Klein. Extension of the Cox proportional hazards model for time-dependent variables. In Survival Analysis, pages 241–288. Springer, 2012.
 [10] David G Kleinbaum and Mitchel Klein. Evaluating the proportional hazards assumption. In Survival Analysis, pages 161–200. Springer, 2012.
 [11] David Faraggi and Richard Simon. A neural network model for survival data. Statistics in medicine, 14(1):73–82, 1995.
 [12] Robert Tibshirani. The lasso method for variable selection in the Cox model. Statistics in medicine, 16(4):385–395, 1997.
 [13] Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC medical research methodology, 18(1):24, 2018.
 [14] Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. DeepHit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [15] Jie Hao, Youngsoon Kim, Tejaswini Mallavarapu, Jung Hun Oh, and Mingon Kang. Cox-PASNet: Pathway-based sparse deep neural network for survival analysis. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 381–386. IEEE, 2018.
 [16] Changhee Lee, Jinsung Yoon, and Mihaela van der Schaar. Dynamic-DeepHit: A Deep Learning Approach for Dynamic Survival Analysis With Competing Risks Based on Longitudinal Data. IEEE transactions on biomedical engineering, 67(1):122–133, January 2020.
 [17] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of statistical software, 39(5):1, 2011.
 [18] Xinliang Zhu, Jiawen Yao, and Junzhou Huang. Deep convolutional neural network for survival analysis with pathological images. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 544–547. IEEE, 2016.
 [19] John D Kalbfleisch and Ross L Prentice. The Statistical Analysis of Failure Time Data, volume 360. John Wiley & Sons, 2011.
 [20] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 802–810. Curran Associates, Inc., 2015.
 [21] Bradley Efron. The efficiency of Cox’s likelihood function for censored data. Journal of the American statistical Association, 72(359):557–565, 1977.
 [22] Hajime Uno, Tianxi Cai, Michael J. Pencina, Ralph B. D’Agostino, and L. J. Wei. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine, 30(10):1105–1117, May 2011.
 [23] Faisal M Khan and Valentina Bayer Zubek. Support vector regression for censored data (SVRc): A novel tool for survival analysis. In 2008 Eighth IEEE International Conference on Data Mining, pages 863–868. IEEE, 2008.
 [24] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 [25] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
 [26] Alice Zheng and Amanda Casari. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O’Reilly Media, Inc., 2018.
 [27] Jiarui Jin, Yuchen Fang, Weinan Zhang, Kan Ren, Guorui Zhou, Jian Xu, Yong Yu, Jun Wang, Xiaoqiang Zhu, and Kun Gai. A deep recurrent survival model for unbiased ranking. arXiv preprint arXiv:2004.14714, 2020.
 [28] Eleonora Giunchiglia, Anton Nemchenko, and Mihaela van der Schaar. RNN-SURV: A deep recurrent model for survival analysis. In International Conference on Artificial Neural Networks, pages 23–32. Springer, 2018.
 [29] Kan Ren, Jiarui Qin, Lei Zheng, Zhengyu Yang, Weinan Zhang, Lin Qiu, and Yong Yu. Deep recurrent survival analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4798–4805, 2019.
 [30] Yifan Peng, Tiarnan D Keenan, Qingyu Chen, Elvira Agrón, Alexis Allot, Wai T Wong, Emily Y Chew, and Zhiyong Lu. Predicting risk of late age-related macular degeneration using deep learning. NPJ digital medicine, 3(1):1–10, 2020.
 [31] Maria de la Iglesia Vaya, Jose Manuel Saborit, Joaquim Angel Montell, Antonio Pertusa, Aurelia Bustos, Miguel Cazorla, Joaquin Galant, Xavier Barber, Domingo Orozco-Beltran, Francisco Garcia-Garcia, Marisa Caparros, German Gonzalez, and Jose Maria Salinas. BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients, 2020. See: https://bimcv.cipf.es/bimcvprojects/bimcvcovid19/.
 [32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 [33] David G Kleinbaum and Mitchel Klein. Parametric survival models. In Survival Analysis, pages 289–361. Springer, 2012.
 [34] Hemant Ishwaran, Udaya B Kogalur, Eugene H Blackstone, Michael S Lauer, et al. Random survival forests. The annals of applied statistics, 2(3):841–860, 2008.
 [35] Vanya Van Belle, Kristiaan Pelckmans, JAK Suykens, and Sabine Van Huffel. Support vector machines for survival analysis. In Proceedings of the Third International Conference on Computational Intelligence in Medicine and Healthcare (CIMED2007), pages 1–8, 2007.
 [36] R Schoop, E Graf, and M Schumacher. Quantifying the predictive performance of prognostic models for censored survival data with timedependent covariates. Biometrics, 64(2):603–610, 2008.
 [37] Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in medicine, 18(17-18):2529–2545, 1999.
 [38] Gianluca Maguolo and Loris Nanni. A critic evaluation of methods for COVID-19 automatic detection from X-ray images. arXiv preprint arXiv:2004.12823, 2020.
 [39] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International conference on algorithmic learning theory, pages 38–53. Springer, 2008.
 [40] Alice S Whittemore. Collapsibility of multidimensional contingency tables. Journal of the Royal Statistical Society: Series B (Methodological), 40(3):328–340, 1978.
 [41] James M Robins. Data, design, and background knowledge in etiologic inference. Epidemiology, pages 313–320, 2001.