1 Introduction
Understanding the effect of a treatment on a response for an individual patient with characteristics (covariates)
, is a central problem of data analysis in various fields. In medical sciences, it often arises when physicians are working on a treatment to cure a certain disease. Our collaborators at the Children Oncology Group (COG) provided us with a real data set of children with Hodgkin Lymphoma. Our goal is to construct an interpretable classifier that selects the right treatment for the right patient. Since the human body is complicated and there might exist some extremely complex interactions between the characteristics, the treatment and the response of the patient, we need a model that allows for interactions of high order between variables. Since it is impossible to truly know the degree to which the patient is sick, the patient characteristics
serve to estimate the true patient health status. Since the treatments were selected by doctors based on the observed covariates, the data suffers from treatment selection bias
stukel07 ; Wallis18 that needs to be accounted for.To accomplish this, we use a deeplatent variable model inspired by Louizos et al. Louizos17 . To account for treatment selection bias, the true patient status is represented by a set of hidden variables that affects both the observed characteristics and the survival chances of the patient. Although learning the exact posterior distribution is intractable in many cases Bishop07 ; Kingma17 , variational inference allows for inference on latent variable models Kingma13 ; Rezende14 . The Variational AutoEncoder (VAE) framework proposed by Kingma Kingma13 ; Kingma17
offers interesting properties: the probabilistic modelling allows for interpretable results and useful statistical properties; the neural network parameterization allows for complex relationships between variables as needed; and the latent variables allow for more flexible observed data distributions. Everything is combined in a system that can be jointly optimized using variational inference.
2 Data set challenges
Friedman et al. Friedman14
previously introduced the data set we received. It is a small data set by machine learning standards, the response variable is rightcensored for many observations and it contains observations with missing values. VAEs were not designed to handle such data sets and therefore we had to adapt our VAE model in order to face those challenges.
First, since obtaining medical data is expensive, our data set is small. In that situation, underfitting is a concern, but as reproducibility is a major concern in the medical community we must take precautions so that we don’t overfit either. To prevent overfitting we relied on L2 regularization, also known as weight decay, and we performed thorough model checking in order to ensure our model is not overfitting.
Second, as it is the case in many survival data sets, the response is rightcensored for many observations. Therefore, we established a model that is inspired by survival analysis theory and we used classic survival analysis techniques to account for the rightcensored observations as will be explained in section 3.1.
Finally, the data set contains many missing values. Our collaborators indicate that missing values are assumed to be missing at random (MAR) and thus we performed missing values imputation
vanBuuren11 ; Azur11 during the data processing phase.3 The model
The main purpose of latent variables is to allow for more flexible and complicated distributions for the observed variables Bishop07 ; Koller09 . Here is a graphical representation of the model we suggest :
The representation illustrated in figure 1
induces a natural factorization of the joint distribution. As said earlier, from a practical perspective, the latent variables
are incorporated to allow for a more flexible observed data distribution but the graphical model also leads to an intuitive description. Here, the set of latent variables represents the true patient health status and directly affects the survival chances of the patient and the covariates gathered as proxy of the true health status. The treatment is considered a special covariate as it is selected by a physician based upon the observed covariates . Finally, the distribution for the response is based upon the patient health status and the treatment selected . Neural network parameterizations along edges of the graphical representation insure that the model includes high order of interactions between the variables which is important to us but difficult to do with the Cox PH model.3.1 Model distributions
This model illustrated in figure 1 suggest the following factorization :
(1) 
The prior
is a multivariate Normal distribution with mean 0 and identity variance. As the graphical representation suggests, the observed covariates
are conditionally independent given the latent variables . The conditional distributions can take multiple forms such as Normal, Poisson, Bernoulli, etc. We want to understand the effect of additional treatments such has intensive chemotherapy or radiotherapy. Within the collected data set, these treatments are either received or not by the patients and therefore we’ve established them as a set of Bernoulli variables.Finally, the response is modelled as Weibull. In order to account for censored data the loglikelihood for the survival distribution will be computed as follows :
(2) 
where if is observed and 0 if is censored, is the density and is the survival function. More information about distributions and parameterizations is located in the appendices.
3.2 Fitting the parameters
The parameters of the various neural networks will require training. The Evidence Lower BOund (ELBO) is a lower bound for the loglikelihood of the observed data and will be the objective function to maximize during training. Using the factorization (1) explained in section 3.1 we have :
(3) 
Since the observed data parameters and the variational distribution parameters are obtained through various neural network functions, we will attempt to maximize the ELBO with respect to the neural network functions parameters. This will require the use of backpropagation combined with a gradientbased optimizer.
3.3 Prediction
The ultimate goal of this analysis is to provide tools to physicians to allow them to make a decision about the treatment needed for a patient. With our probabilistic approach we aim at giving physicians a wide range of information which they can utilize however they see fit. Our model produces a predicted distribution for the eventfree survival time of a patient based upon its characteristics and the selected treatment. Having such distributions for every possible treatment gives flexibility regarding decisionmaking as physicians can look at various properties of the predicted distributions such as the expected value or the survival function. Using importance sampling our model let us produce the following predicted distribution for a new patient with characteristics for a given treatment :
(4) 
which resembles a mixture of Weibull where is the number of components, and is the component weight. The appendices contain more details about the importance sampling formulation.
4 Results
We used the architecture presented in section 3 and optimized the parameters using the ADAM optimizer Kingma14 ; Kingma17 . Model selection and calibration is performed using a validation set. Since the loglikelihood estimates on a heldout set are a better estimates of the loglikelihood under the true data generation distribution Shai14 selecting the tuning parameters using the validation set contributes towards preventing overfitting. Similarly the gap between the loglikelihood on the training set and the validation set is small in proportion to the loglikelihood estimates on the validation set in figure 2. This indicates that we have managed to prevent overfitting which was one of the challenges we established in section 2.
Loglikelihood lower bound (ELBO) through out epochs. We stopped the optimization procedure when the training set ELBO stabilized and before the validation set ELBO decreased.
Because of the censored observations, the accuracy of our model cannot be assessed with the meansquared error. The concordance index (cindex) Harrell96 is one of the most popular performance measures for censored data Chen12 ; Steck07 . It is a rankbased measure that computes the proportion of all usable observation pairs in which the predictions and true outcomes are concordant Harrell96 . A cindex of 0.5 is equivalent to random ordering and 1 is perfect ordering.
The concordance index on the training set is 0.682 for the Cox PH model against 0.649 for our model, but on the validation set this index is equal to 0.522 for the predicted hazards using the Cox PH model and 0.574 for the VAE we propose. This indicates that our model suffers less from overfitting than Cox PH and outperforms it on heldout data according to this performance measure.
Finally, let’s quickly introduce an example of decisionmaking. With our model, the predicted survival function can be obtained quite simply :
(5) 
The predicted survival function of equation 5 can be used to estimate the increase in survival chances caused by selecting treatment instead of treatment by computing : . For example, our collaborators’ suggested decisionmaking was to intensify the treatment using either intensive chemotherapy or radiation therapy if it increases the 4 year eventfree survival chances by at least 7%. Our model would allow to make such decisions.
5 Conclusion
We have adapted a new machine learning technique to a typical medical framework and compared the results to a common technique. Three challenges (small data set, censored data, and missing values) were identified and we managed to adapt our model in order to face them. The result is a VAE adapted for survival analysis that allows for complex interactions between the variables, that accounts for treatment selections bias, that can generate a predicted distribution for a new patient allowing for insightful decisionmaking and that outperforms the popular Cox PH model in terms of prediction accuracy.
As future improvements we are looking at alternatives to face the challenges identified in section 2. We would like to establish a recognition model that accounts for missing data such as recommended by Nazabal et al. Nazabal18 . We would also like to utilize other regularization techniques to prevent overfitting, such as dropout Hinton12 ; Srivastava14 and variational dropout Kingma17 ; Kingma15 . Finally, we would like to test the accuracy of our model with other performance measures and to compare our model with a wider range of classic survival models.
Acknowledgement
The authors are grateful to Qinglin Pei and the rest of the Children Oncology Group (COG) staff for collecting, handling and providing us with this data set. The authors are thankful to David Duvenaud and Chris Cremer for their insightful guidance. Finally, the authors would like to acknowledge the financial support of the NSERC of Canada.
References
 [1] Stukel T. A., Fisher E. S., Wennberg D. E., Alter D. A., Gottlieb D. J., and Vermeulen M. J. Analysis of observational studies in the presence of treatment selection bias: Effects of invasive cardiac management on ami survival using propensity score and instrumental variable methods. JAMA, 297(3):278–285, 2007.
 [2] M. J. Azur, E. Stuart, C. Frangakis, and P. Leaf. Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1):40–49, 3 2011.
 [3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
 [4] HungChia Chen, Ralph L. Kodell, Kuang Fu Cheng, and James J. Chen. Assessment of performance of survival prediction models for cancer prognosis. BMC Medical Research Methodology, 12(1):102, Jul 2012.
 [5] Debra L. Friedman, Lu Chen, Suzanne Wolden, Allen Buxton, Kathleen McCarten, Thomas J. FitzGerald, Sandra Kessel, Pedro A. De Alarcon, Allen R. Chen, Nathan Kobrinsky, Peter Ehrlich, Robert E. Hutchison, Louis S. Constine, and Cindy L. Schwartz. Doseintensive responsebased chemotherapy and radiation therapy for children and adolescents with newly diagnosed intermediaterisk hodgkin lymphoma: A report from the children’s oncology group study ahod0031. Journal of Clinical Oncology, 32(32):3651–3658, 2014. PMID: 25311218.
 [6] Frank E. Harrell, Kerry L. Lee, and Daniel B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387.
 [7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. ArXiv eprints, July 2012.

[8]
D. Jimenez Rezende, S. Mohamed, and D. Wierstra.
Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
ArXiv eprints, January 2014. 
[9]
D. P. Kingma.
Variational Inference & Deep Learning : A New Synthesis
. PhD thesis, Universiteit van Armsterdam, 10 2017.  [10] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv eprints, December 2014.
 [11] D. P. Kingma, T. Salimans, and M. Welling. Variational Dropout and the Local Reparameterization Trick. ArXiv eprints, June 2015.
 [12] D. P Kingma and M. Welling. AutoEncoding Variational Bayes. ArXiv eprints, December 2013.
 [13] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques  Adaptive Computation and Machine Learning. The MIT Press, 2009.
 [14] C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal Effect Inference with Deep LatentVariable Models. ArXiv eprints, May 2017.
 [15] A. Nazabal, P. M. Olmos, Z. Ghahramani, and I. Valera. Handling Incomplete Heterogeneous Data using VAEs. ArXiv eprints, July 2018.
 [16] Shai S. S. and Shai B. D. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
 [17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 [18] Harald Steck, Balaji Krishnapuram, Cary Dehingoberije, Philippe Lambin, and Vikas C Raykar. On ranking in survival analysis: Bounds on the concordance index. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1209–1216. Curran Associates, Inc., 2008.
 [19] S. v. Buuren and K. GroothuisOudshoorn. mice: Multivariate imputation by chained equations in r. Journal of Statistical Software, Articles, 45(3):1–67, 2011.
 [20] Christopher Wallis, Gerard Morton, Sender Herschorn, Ronald Kodama, Girish Kulkarni, Sree Apuu, Bobby Shayegan, Roger Buckley, Arthur Grabowski, Steven Narod, and Robert Nam. Pd2008 the effect of selection and referral biases for the treatment of localized prostate cancer with surgery or radiation. The Journal of Urology, 199(4, Supplement):e404 – e405, 2018. 2018 Annual Meeting Program Abstracts.
Comments
There are no comments yet.