I Introduction
After stunning successes in cognitive domains, deep learning is expected to transform healthcare [21]. The most remarkable results in health so far have been in diagnostic imaging [7, 9], a natural step given record-breaking results in computer vision. However, diagnostic imaging is only a small part of the story. A fully intelligent medical system should be able to reason about the past (historical illness), present (diagnosis) and future (prognosis). Here we adopt the notion of reasoning as “algebraically manipulating previously acquired knowledge in order to answer a new question” [1]. To that end, we learn to embed discrete medical objects into continuous vectors, which lend themselves to a wide range of powerful algebraic and statistical operators [3, 22]. For example, with diseases represented as vectors, computing disease-disease correlation is simply a cosine similarity between the two corresponding vectors. Illness, recorded as a bag of discrete diseases, can then be a function of a set of vectors. The same holds for care. Importantly, if diseases, treatments (or even doctors) are embedded in the same space, then recommending treatments (or doctors) for a given disease is as simple as finding the nearest vectors.
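To make the algebraic view concrete, the following is a minimal sketch of cosine similarity and nearest-vector recommendation. The three-dimensional embeddings, the object names and the helper `rank_by_similarity` are hypothetical and chosen by hand for illustration; they are not learned vectors from this work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Names in `candidates`, sorted from most to least similar to `query`."""
    return sorted(candidates,
                  key=lambda name: cosine(query, candidates[name]),
                  reverse=True)

# Hypothetical embeddings in a shared space (illustration only).
diseases = {
    "diabetes": [0.9, 0.1, 0.0],
    "retinopathy": [0.8, 0.3, 0.1],
    "depression": [0.0, 0.2, 0.9],
}
treatments = {
    "insulin": [0.85, 0.15, 0.05],
    "counselling": [0.05, 0.1, 0.95],
}

# Disease-disease correlation as a cosine similarity.
related = cosine(diseases["diabetes"], diseases["retinopathy"])
unrelated = cosine(diseases["diabetes"], diseases["depression"])

# Treatment recommendation as a nearest-vector search.
ranked = rank_by_similarity(diseases["diabetes"], treatments)
```

With these toy vectors, the treatment closest to the diabetes embedding is ranked first, which is all that "recommendation as nearest-vector search" requires once the embeddings share a space.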
The algebraic view makes it easy to adapt powerful tools from recent deep learning advances [15] to healthcare. In particular, we can build end-to-end models for risk prediction without manual feature engineering [2, 16, 17, 19]. As the models are fully differentiable, credit assignment to distant risk factors can be carried out [16], making the models more transparent than commonly thought. Learning from recorded medical data necessitates modeling the dynamic interaction between three processes: the illness, the care and the data recording [17]. For the purpose of this paper, we assume that a clinic visit at a given time manifests through a set of discrete diseases and treatments. A healthcare trajectory is therefore a sequence of time-stamped records. This necessitates a set-theoretic treatment of each visit and a dynamic treatment of the entire trajectory.
We introduce a recurrent model of a healthcare trajectory as a sequence of sets. Although neural modeling of unordered sets has recently been studied [24, 26], sequences of sets have not been formally investigated, to the best of our knowledge. In our model, features for a set are computed through a multivalued set function, which is permutation invariant. There are two set functions, one for diseases and the other for treatments. A dual-input function of these two set functions encodes the multi-disease–multi-treatment interaction at the visit level. Finally, visits are connected through an LSTM to model the temporal dynamics between visits.
With this design, the model addresses an important aspect of healthcare: the dynamic interaction between illness and care. Although care is supposed to lessen the illness, it is often designed through highly controlled trials where one treatment is targeted at one disease, on a specific cohort, at a specific time [12]. Much less is known about the effect of multiple treatments on multiple diseases, in general hospitalized patients, over time. A recent model known as DeepCare [17] partly addresses this problem by considering the moderation effect of treatments on illness state transition between visits. However, unlike our model, DeepCare does not address the multi-disease–multi-treatment interaction within visits.
We evaluate the model on predicting important medical outcomes such as unplanned readmission or death at discharge, as well as on treatment recommendation and prediction of future diseases. We focus on chronic diseases (diabetes and mental health) as they are highly complex with multiple causes, are often associated with multiple comorbidities, and have treatments that are not always effective. Results on over one hundred thousand visits to a large regional hospital demonstrate the efficacy of the model.
To summarize, we claim the following contributions: 1) A novel representation of a time-stamped healthcare trajectory as a sequence of sets. 2) A novel deep learning architecture for sequences of sets which, when applied to healthcare, uncovers the structure of the disease/treatment space and predicts future outcomes. 3) An evaluation of these claimed capabilities on real patient data comprising over a hundred thousand hospital visits, on three tasks: readmission prediction, treatment recommendation and disease progression.
II Related Work
II-A Deep Learning for Healthcare
The past few years have witnessed intense interest in applying recent deep learning advances to healthcare [19]. The most mature area is perhaps medical imaging [8]. Thanks to the record-breaking successes of convolutional nets in computer vision, we can now achieve diagnostic accuracy comparable with experts in certain sub-areas such as skin cancer [7]. However, it remains largely open whether deep learning succeeds in other areas where data are less well-structured and of lower quality, such as electronic medical records (EMR) [17].
Within EMRs, three sets of techniques have been employed. The first finds distributed representations of medical objects such as diseases, treatments and visits [3, 22]. These techniques are not strictly deep but they offer a compelling algebraic view of healthcare. The second group involves 1D convolutional nets, which are designed for detecting short translation-invariant motifs over time [2, 16]. The third group, to which this paper belongs, employs recurrent neural nets to capture the temporal structure of care [4, 17]. For a more comprehensive review of this highly dynamic research area, we refer to [20].
Predictive healthcare raises the question of modeling treatment effects over time. This has traditionally been the realm of randomized controlled trials. Our work here is, on the other hand, based entirely on observational administrative data stored in Electronic Medical Records.
II-B Neural Nets for Sets
Sets are fundamental to mathematics, and are pervasive in many learning tasks such as clustering (set membership assignment), feature selection (subset of features), multi-label learning (subset of labels) and multi-instance learning (set classification). Due to its permutation invariance and variable size, a set does not lend itself naturally to traditional neural networks. In [24], mapping set to set is framed as mapping sequence to sequence, in which a set is “pretended” to be a pseudo-sequence. This does not address the permutation-invariant property of sets. A more systematic investigation is [26], where conditions for set functions are specified. Sets have also been studied indirectly, e.g., in pooling operations in CNNs, in attention mechanisms (e.g., see [23]), which are essentially functions over sets, and in deep multi-X learning [18]. Predicting sets has also been studied in [10]. In the existing literature, neural models of sequences of sets seem to be missing.
III Methods
In this section, we present our main modeling contribution, a recurrent model for sequences of sets, in the context of its primary application in healthcare.
III-A Set Function
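Let us start with a set, an unordered collection of elements. Let $S = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a set of vectors in $\mathbb{R}^d$. A set function $f$ is a mapping that is invariant to permutation of the set elements, that is, $f(\pi(S)) = f(S)$ for any permutation operator $\pi$. For simplicity, we are interested in a function that receives a set of vectors and returns another vector in the same real space $\mathbb{R}^d$. We use the following normalized set function:
$$f(S) = \frac{\bar{\mathbf{s}}}{\epsilon + \|\bar{\mathbf{s}}\|}, \qquad \bar{\mathbf{s}} = \max\Big(\mathbf{0}, \sum_{i=1}^{n} \mathbf{x}_i\Big) \tag{1}$$
where $\epsilon > 0$ is a smoothing factor. This is essentially a linear rectifier of the sum, approximately normalized to a unit vector. The factor $\epsilon$ lets $\|f(S)\| \to 0$ when $\|\bar{\mathbf{s}}\| \to 0$, but $\|f(S)\| \to 1$ when $\|\bar{\mathbf{s}}\| \to \infty$.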
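The normalized set function can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: it takes Eq. (1) to be the rectified element-wise sum divided by the smoothing factor plus its norm, and it assumes the norm is Euclidean. The function name `set_function` and the default `eps` are hypothetical.

```python
import math

def set_function(vectors, eps=1e-3):
    """Rectified sum of the input vectors, approximately normalized to unit length.

    Assumes a non-empty list of equal-length vectors and a Euclidean norm.
    """
    dim = len(vectors[0])
    summed = [sum(v[k] for v in vectors) for k in range(dim)]
    rectified = [max(0.0, s) for s in summed]           # linear rectifier of the sum
    norm = math.sqrt(sum(r * r for r in rectified))
    return [r / (eps + norm) for r in rectified]        # eps keeps the ratio finite

bag = [[1.0, -2.0], [0.5, 0.5], [2.0, 1.0]]
forward = set_function(bag)
backward = set_function(list(reversed(bag)))            # same set, different order
length = math.sqrt(sum(x * x for x in forward))
```

Because the elements are only ever summed, reordering the bag leaves the output unchanged up to floating-point rounding, and the output length stays just below one for any bag whose rectified sum is large relative to `eps`.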
III-B Clinic Visit as Set-Set Interaction
Each medical record contains information about the history of clinic visits by a patient. For simplicity, we consider a visit record as a bag of diseases deemed relevant for care at the time of the visit, and a bag of treatments administered to the patient. Among older cohorts, non-singleton bags are prevalent, reflecting the comorbidity picture of modern healthcare: an elderly patient typically suffers from multiple co-occurring conditions. As a result, treatments must be carefully administered to work with, or at least not cancel out, each other. This also calls for a sensible way to model the complexity of multi-disease–multi-treatment interaction. Most existing biostatistics methods, however, are designed for a simplified treatment effect against just one condition in a controlled experimental setting, and are thus inadequate in this operational setting. Our solution is based on set-set interaction, which we detail subsequently.
We first use vector representations of diseases and treatments, following recent practice in NLP (e.g., see [5]). Let $\mathbf{e}_d$ be the embedding of disease $d$ and $\mathbf{e}_p$ the representation of treatment $p$, with all vectors embedded in a common space. The bags of diseases and bags of treatments are also represented as vectors in the same space.
Denote by $D_t$ the bag of diseases and $P_t$ the bag of treatments recorded for the visit at time $t$. Let $\mathbf{d}_t$ be the set representation of the disease bag, and $\mathbf{p}_t$ the set representation of the treatment bag, as given in Eq. (1). Let $\mathbf{v}_t = g(\mathbf{d}_t, \mathbf{p}_t)$ denote the interaction function between diseases, as encapsulated in $\mathbf{d}_t$, and treatments, as coded in $\mathbf{p}_t$. A popular method is to use a bilinear function parameterized by a matrix $W$ and followed by an element-wise nonlinear transformation $\rho$, but this results in a large number of parameters.
One intuitive function is as follows:
$$\mathbf{v}_t = \rho(\mathbf{d}_t - \mathbf{p}_t) \tag{2}$$
which we dub the subtractive interaction. The difference reflects the intuition that treatments are supposed to lessen the illness. We found that a nonlinear $\rho$ works well, suggesting that the disease-treatment interaction is nonlinear. This warrants a deeper investigation in future work, e.g., with the interaction modeled as a neural network itself. We also experimented with other interaction forms: implicit, additive and multiplicative. See Section IV for empirical results.
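As a rough illustration of the visit-level interaction, the sketch below combines a disease-bag vector and a treatment-bag vector element-wise and passes the result through a nonlinearity. Several things here are assumptions rather than details from the text: the additive and multiplicative forms are taken to be element-wise sum and product, `math.tanh` is only a stand-in for the nonlinearity, and the implicit form is omitted because its definition is not recoverable here. The name `interact` is hypothetical.

```python
import math

def interact(d, p, mode="subtractive", rho=math.tanh):
    """Element-wise interaction between a disease vector d and a treatment vector p.

    `rho` is a placeholder nonlinearity (an assumption, not the paper's choice).
    """
    combine = {
        "subtractive": lambda a, b: a - b,     # treatments lessen the illness
        "additive": lambda a, b: a + b,        # assumed element-wise sum
        "multiplicative": lambda a, b: a * b,  # assumed element-wise product
    }[mode]
    return [rho(combine(a, b)) for a, b in zip(d, p)]

d_t = [0.5, 0.2]   # hypothetical set representation of the disease bag
p_t = [0.5, 0.1]   # hypothetical set representation of the treatment bag

v_sub = interact(d_t, p_t)
v_add = interact(d_t, p_t, mode="additive")
v_mul = interact(d_t, p_t, mode="multiplicative")
```

In the subtractive form, a coordinate where treatment fully matches disease is driven to zero, which mirrors the stated intuition that care offsets illness.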
III-C Healthcare Trajectory as Sequence of Sets
While we might expect that the disease subset together with the treatment subset reflects the illness state at the time of discharge, this is not necessarily the case, for several reasons. First, the coding of diseases and treatments is often optimized for billing purposes, so not all diseases are included. Second, coding errors do occur. Third, treatments usually take time to reach their full intended effect.
For this reason, it is better to include historical visits to assess the current state as well as to predict future risk. An efficient way is to model the visit sequence with a Recurrent Neural Network (RNN) [6]. In this paper, we choose Long Short-Term Memory (LSTM) due to its capability to remember distant events [11]. Since each visit is a set (or, precisely, an interaction of two sets), a health trajectory can be modeled as a sequence of sets. The model's name stands for Recurrent Sequence of Sets. Figure 1 depicts a graphical illustration of the model.
Given a sequence of input vectors $\mathbf{v}_1, \ldots, \mathbf{v}_T$ (one per visit), the LSTM reads one input $\mathbf{v}_t$ at a time and estimates the illness state $\mathbf{h}_t$. To connect to the past, the LSTM maintains an internal short-term memory $\mathbf{c}_t$, which is updated after seeing the input. Let $\tilde{\mathbf{c}}_t$ be the new candidate memory update after seeing $\mathbf{v}_t$; the memory is updated over time as:
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$
where $\mathbf{f}_t$ is the forget gate determining how much of the past memory to keep, and $\mathbf{i}_t$ is the input gate controlling the amount of new information to add into the present memory. The input gate is particularly useful when some recorded information is irrelevant to the final prediction tasks.
The memory gives rise to the state as follows:
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$
where $\mathbf{o}_t$ is the output gate, determining how much external information can be extracted from the internal memory. The candidate memory and the three gates are parametric functions of $(\mathbf{v}_t, \mathbf{h}_{t-1})$.
With this long short-term memory system in place, information from the distant past is not entirely forgotten, and credit can be assigned to it. In addition, partially recorded information can be integrated across visits to offer a better picture of the current illness.
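The memory and state updates described above follow the standard LSTM recurrences, which the following pure-Python sketch spells out for a single time step. The parameter layout (a dictionary of `(W, U, b)` triples keyed by gate) and the names `affine` and `lstm_step` are illustrative assumptions; the experiments in this paper use Knet.jl, not this code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    """Compute W x + U h + b, with matrices given as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(u * hi for u, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step: returns the new state h and memory c."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # candidate memory
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy check with all-zero parameters: every gate equals 0.5, the candidate is 0.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])        # W: 1x2, U: 1x1, b: length 1
params = {gate: zero for gate in "fiog"}
h, c = lstm_step(params, x=[1.0, 2.0], h_prev=[0.0], c_prev=[1.0])
```

With zero parameters the forget gate keeps exactly half of the previous memory and nothing new is written, so the step is easy to verify by hand.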
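III-C1 Regularizing state transitions
For chronic diseases, it might be beneficial to regularize the state transition. We consider adding the following regularizer to the loss function:
$$\frac{\beta}{T} \sum_{t=1}^{T} \big( \|\mathbf{h}_t\|_2 - \|\mathbf{h}_{t-1}\|_2 \big)^2$$
as suggested in [14], where $\beta \ge 0$ is the regularization weight. This asks the amount of information available at each time step, encapsulated in the norm $\|\mathbf{h}_t\|_2$, to be stable over time. This is less aggressive than maintaining state coherence, i.e., by minimizing $\|\mathbf{h}_t - \mathbf{h}_{t-1}\|_2$.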
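The difference between stabilizing the state norm and enforcing state coherence can be seen on a toy example. The sketch below assumes the norm-stabilizing penalty of [14], i.e., the squared difference of consecutive Euclidean norms, averaged over transitions and scaled by a weight `beta`; the averaging convention and the function names are assumptions made for illustration.

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def norm_stabilizer(states, beta=1.0):
    """Penalize changes in the norm of consecutive states."""
    terms = [(l2(b) - l2(a)) ** 2 for a, b in zip(states, states[1:])]
    return beta * sum(terms) / len(terms) if terms else 0.0

def state_coherence(states, beta=1.0):
    """Penalize changes in the states themselves (the more aggressive option)."""
    terms = [sum((y - x) ** 2 for x, y in zip(a, b))
             for a, b in zip(states, states[1:])]
    return beta * sum(terms) / len(terms) if terms else 0.0

# The state rotates but keeps its length.
rotating = [[1.0, 0.0], [0.0, 1.0]]
stab = norm_stabilizer(rotating)
coh = state_coherence(rotating)

# The state collapses from length 5 to length 0.
collapsing = [[3.0, 4.0], [0.0, 0.0]]
stab_collapse = norm_stabilizer(collapsing)
```

A state that changes direction while keeping its length incurs no norm penalty but a positive coherence penalty, which is the sense in which the norm-based regularizer is less aggressive.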
III-D Predictions
Once the LSTM is specified, its states are pooled for prediction at each admission or discharge, i.e., $\bar{\mathbf{h}} = \mathrm{pool}(\mathbf{h}_1, \ldots, \mathbf{h}_T)$. The pooling function can be as simple as mean() (i.e., $\bar{\mathbf{h}} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{h}_t$) or last() (i.e., $\bar{\mathbf{h}} = \mathbf{h}_T$). We also experimented with exponential smoothing, i.e., $\bar{\mathbf{h}}_t = \alpha \bar{\mathbf{h}}_{t-1} + (1 - \alpha)\mathbf{h}_t$ for $t = 2, \ldots, T$ (with $\bar{\mathbf{h}}_1 = \mathbf{h}_1$) and $\alpha \in [0, 1]$. A small $\alpha$ would mean the recent visits have more influence on future outcomes. Next, a differentiable classifier (e.g., a feedforward neural net) is placed on top of the pooled state to classify the medical records (e.g., those in population stratification) or to predict the outcome. The loss function is typically the negative log-likelihood of the outcome given the historical observations, e.g., $-\log P(y \mid \mathbf{v}_{1:T})$. We emphasize that the system is end-to-end differentiable, from the disease and treatment lookup tables at the bottom to the final classifier at the top. No feature engineering is needed.
There are many prediction tasks in healthcare. For example, at discharge we can predict unplanned readmission, mortality, or future length of stay within 12 months. These are single-outcome predictions. In what follows, we discuss two classes that predict multiple outcomes: disease prediction and treatment recommendation.
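The three pooling options mentioned in this section (mean, last and exponential smoothing) can be sketched as follows. The exact smoothing recurrence is an assumption: the sketch uses a running average in which the previous pooled value is weighted by `alpha` and the current state by `1 - alpha`, which is consistent with the remark that a small smoothing parameter favours recent visits. Function names are hypothetical.

```python
def pool_mean(states):
    """Average of all states."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

def pool_last(states):
    """State of the most recent visit."""
    return list(states[-1])

def pool_exp(states, alpha=0.1):
    """Exponential smoothing; smaller alpha puts more weight on recent states."""
    pooled = list(states[0])
    for h in states[1:]:
        pooled = [alpha * s + (1.0 - alpha) * x for s, x in zip(pooled, h)]
    return pooled

states = [[0.0], [1.0], [2.0]]       # toy one-dimensional states for three visits
mean_pooled = pool_mean(states)
last_pooled = pool_last(states)
half_pooled = pool_exp(states, alpha=0.5)
zero_pooled = pool_exp(states, alpha=0.0)
```

Under this assumed recurrence, setting the smoothing parameter to zero recovers last-state pooling, so the parameter interpolates between using only the latest visit and retaining the whole history.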
III-D1 Disease Prediction
Disease prediction is an important task in healthcare. Presented with a sequence of admissions, the model learns to predict the most probable diseases that the patient will have in the next admission. This model is similar to the risk prediction model at the recurrent level; however, the top layers are rebuilt to allow multi-label output. The pooled output from the recurrent layer is used as the feature input to the disease prediction network. Finally, the output of this network is the set of top predicted diseases.
Contrary to the single-outcome model, we need to output multiple diseases instead of a binary value as in risk prediction. Hence we employ a multi-label output approach to train the model. There are two ways to define the loss function in this approach. The first is to let the network output the probability of every disease using a sigmoid activation function. The loss function is then simply the binary cross-entropy log loss. However, this method suffers from the imbalance between the small number of diseases a patient has and the large number of all diseases: it amplifies the loss of the non-occurring diseases by an amount proportional to the non-occurring ratio, degrading network performance. The second method is to backpropagate only the loss of not picking up the right diseases in the next admission. The network then outputs the probabilities of all diseases normalized by a softmax function, instead of per-disease probabilities from sigmoid activations.
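The imbalance argument can be illustrated numerically. The sketch below contrasts a summed sigmoid cross-entropy over all diseases with a softmax negative log-likelihood taken only over the diseases that occur; the exact form of the second loss (a plain sum of log-probabilities of the occurring diseases) is an assumption, as are the function names and the toy logits.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid_bce(logits, positives):
    """First option: binary cross-entropy summed over every disease."""
    return sum(softplus(-z) if j in positives else softplus(z)
               for j, z in enumerate(logits))

def softmax_nll(logits, positives):
    """Second option: negative log softmax probability of occurring diseases only."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return sum(log_norm - logits[j] for j in positives)

# Toy case: 100 candidate diseases, 2 of which occur, uninformative logits.
logits = [0.0] * 100
positives = {3, 17}
bce = sigmoid_bce(logits, positives)
nll = softmax_nll(logits, positives)
negative_share = 98 * softplus(0.0) / bce   # fraction of BCE due to absent diseases
```

In this toy case 98% of the sigmoid loss comes from diseases the patient does not have, whereas the softmax loss contains terms for the two occurring diseases only.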
In this paper, we report the top-$k$ predicted diseases for $k = 1, 2, 3$, and use precision at $k$ (P@$k$) as the performance measure.
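For reference, precision at k is the fraction of the k highest-scored items that are actually present. A minimal sketch follows; the function name, the scores and the use of ICD-10-style codes as keys are purely illustrative.

```python
def precision_at_k(scores, true_items, k):
    """Fraction of the k highest-scored items that appear in `true_items`."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for item in top_k if item in true_items) / k

# Hypothetical predicted scores and the diseases recorded at the next admission.
scores = {"E11": 0.6, "I10": 0.3, "F32": 0.1}
next_admission = {"E11", "F32"}

p1 = precision_at_k(scores, next_admission, 1)
p2 = precision_at_k(scores, next_admission, 2)
p3 = precision_at_k(scores, next_admission, 3)
```

Here the top-ranked code is correct, the second is not and the third is, giving a precision of 1 at k = 1, one half at k = 2 and two thirds at k = 3.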
III-D2 Treatment Recommendation
Treatment recommendation is a very important task in healthcare. It promises to reduce time and costs for doctors as well as patients, and offers unbiased access to care. This model is the same as the disease prediction model, but the top-layer network is trained using the hospital's treatment data instead of disease data.
IV Experimental Results
IV-A Data
We chose to study the data previously reported in [17], which consists of two chronic cohorts: diabetes and mental health. These two conditions are among the most prevalent, and have caused great economic and societal burdens. As the two conditions are very different in nature (one is physical, the other is mental; one is typical among the old, the other is typical among the young), consistent findings will demonstrate the versatility of the models.
The data was collected between 2002 and 2013 from a large regional Australian hospital. Each record contains at least 2 hospital visits. Diseases and treatments are coded using the ICD-10 coding scheme. In ICD-10, diseases are arranged in a tree, where the leaves represent the most detailed subtype classification. We use only the first two levels of the ICD-10 tree to allow for sufficient statistics at each node. Data statistics are summarized in Table I. Overall, there are over a hundred thousand visits for both cohorts combined.
Table I: Data statistics.
| Statistics | Diabetes | Mental health |
|---|---|---|
| # patients | 7,191 | 6,109 |
| # visits | 53,208 | 52,049 |
| % male | 55.5 | 49.4 |
| median age | 73 | 37 |
| # diseases | 243 | 247 |
| # treatments | 1,118 | 1,071 |
IV-B Implementation
Models are implemented in Julia using the Knet.jl package [25]. The optimizer is Adam [13] with a learning rate of 0.01 and other parameters left at their defaults. The recurrent layer hidden size is 32 and the embedding size is 32. The number of admissions is limited to the last 10. The ReLU activation is applied at the input to the recurrent layer. Dropout is used in training with dropout probability 0.5. The chosen exponential smoothing parameter after training is 0.1. Three baselines are implemented. The first is bag-of-words trained using regularized logistic regression (BoW+LR), where diseases and treatments are considered as words, and the medical history as a document. No temporal information is modeled. Although this is a simplistic treatment, prior research has indicated that BoW works surprisingly well [16, 17].
The second is a recent model known as Deepr [16], which is based on a convolutional net for sequence classification. Unlike in BoW, where words are unordered, in Deepr words are sequenced by their temporal order. Words of the same visit are randomly sequenced. Interaction between diseases and treatments within a short period of time is partially modelled through convolutional kernels. However, Deepr does not model the temporal transition between illness states but rather seeks the most risky states over the history. The Deepr model uses embedding size 32, filter sizes 5, 10 and 15, and 60 filters in total (20 for each size). The Deepr input sequence length is limited to the last 100 words. The last baseline is an LSTM, which runs on the same data as Deepr.
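For readers reproducing the setup, the reported hyperparameters can be collected in one place. The dictionary layout and key names below are hypothetical conveniences; only the values are taken from the text, and the actual experiments were run in Julia with Knet.jl rather than Python.

```python
# Settings reported for the proposed recurrent model (key names are illustrative).
proposed = {
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "hidden_size": 32,
    "embedding_size": 32,
    "max_admissions": 10,         # only the last 10 admissions are used
    "input_activation": "relu",   # applied at the input to the recurrent layer
    "dropout": 0.5,
    "exp_smoothing": 0.1,         # value chosen after training
}

# Settings reported for the Deepr baseline.
deepr = {
    "embedding_size": 32,
    "filter_sizes": (5, 10, 15),
    "filters_per_size": 20,
    "max_words": 100,             # only the last 100 words are used
}

total_filters = len(deepr["filter_sizes"]) * deepr["filters_per_size"]
```

The derived filter count agrees with the 60 filters stated for Deepr.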
IV-C Predicting Unplanned Readmission
Table II reports the Area Under the ROC Curve (AUC) for all methods in predicting unplanned readmission. The proposed methods show competitive performance against the baselines. The proposed model matches or exceeds the AUC of the methods without the set formulation (BoW, Deepr and LSTM). This suggests that a proper modeling of care over time is needed, not only for understanding the underlying processes, but also to achieve competitive predictive performance.
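Table II: AUC for predicting unplanned readmission.
| Method | Diabetes | Mental health |
|---|---|---|
| BoW+LR | 0.673 | 0.705 |
| Deepr [16] | 0.680 | 0.714 |
| LSTM | 0.701 | 0.725 |
| Proposed model, implicit interaction | 0.710 | 0.726 |
| Proposed model, subtractive interaction | 0.718 | 0.726 |
| Proposed model, subtractive interaction + exponential smoothing | 0.701 | 0.730 |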
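As a reminder of what the AUC measures, it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative and unrelated to the reported results.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted readmission risks and outcomes (1 = readmitted).
risk = [0.9, 0.8, 0.3, 0.1]
readmitted = [1, 0, 1, 0]
value = auc(risk, readmitted)
```

Three of the four positive-negative pairs are ordered correctly in this toy example, giving an AUC of 0.75.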
IV-D Treatment Recommendation
Table III reports the precision at $k$ (P@$k$) scores for different methods in predicting the treatments for the diseases at the current time step. The top two scores in each performance measure are shown in bold. In this task, no treatment at the current admission is input to the model; only diseases are input. The table shows that the proposed methods frequently perform better than the baselines. The additive and implicit interaction models show better prediction rates than the others for the diabetes cohort, while the subtractive and additive models outperform the rest on the mental health data. The multiplicative model performs similarly to the baselines on average, which suggests that multiplicative interaction is too strong an assumption. Exponential smoothing does not improve treatment recommendation for the mental health data.
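Table III: Precision at k for treatment recommendation.
| Method | Diabetes P@1 | Diabetes P@2 | Diabetes P@3 | Mental health P@1 | Mental health P@2 | Mental health P@3 |
|---|---|---|---|---|---|---|
| BoW+LR | 0.608 | 0.481 | 0.419 | 0.516 | 0.4382 | 0.395 |
| Deepr | 0.634 | 0.463 | 0.395 | 0.615 | 0.532 | 0.466 |
| LSTM | 0.694 | 0.535 | 0.446 | 0.614 | 0.507 | 0.427 |
| Proposed model, implicit interaction | 0.738 | 0.564 | 0.492 | 0.692 | 0.582 | 0.498 |
| Proposed model, additive interaction | 0.74 | 0.567 | 0.486 | 0.708 | 0.588 | 0.496 |
| Proposed model, subtractive interaction | 0.704 | 0.553 | 0.48 | 0.7 | 0.591 | 0.51 |
| Proposed model, multiplicative interaction | 0.65 | 0.484 | 0.401 | 0.553 | 0.511 | 0.428 |
| Proposed model, additive interaction with exponential smoothing | 0.726 | 0.564 | 0.465 | 0.654 | 0.537 | 0.458 |
| Proposed model, subtractive interaction with exponential smoothing | 0.730 | 0.561 | 0.465 | 0.641 | 0.528 | 0.452 |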
IV-E Disease Prediction
Table IV reports the precision at $k$ (P@$k$) scores for predicting diseases in the next admission. The proposed methods again frequently perform better than the baselines. For this task, the subtractive and implicit interaction models show better prediction rates than the others. Exponential smoothing clearly improves the prediction rate for the diabetes data.
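Table IV: Precision at k for disease prediction in the next admission.
| Method | Diabetes P@1 | Diabetes P@2 | Diabetes P@3 | Mental health P@1 | Mental health P@2 | Mental health P@3 |
|---|---|---|---|---|---|---|
| BoW+LR | 0.508 | 0.441 | 0.393 | 0.396 | 0.350 | 0.323 |
| Deepr | 0.496 | 0.42 | 0.397 | 0.424 | 0.392 | 0.346 |
| LSTM | 0.541 | 0.476 | 0.417 | 0.466 | 0.430 | 0.372 |
| Proposed model, implicit interaction | 0.530 | 0.478 | 0.438 | 0.504 | 0.471 | 0.406 |
| Proposed model, additive interaction | 0.528 | 0.496 | 0.449 | 0.488 | 0.448 | 0.392 |
| Proposed model, subtractive interaction | 0.533 | 0.491 | 0.444 | 0.494 | 0.469 | 0.41 |
| Proposed model, multiplicative interaction | 0.496 | 0.44 | 0.401 | 0.453 | 0.406 | 0.362 |
| Proposed model, additive interaction with exponential smoothing | 0.563 | 0.513 | 0.459 | 0.468 | 0.429 | 0.373 |
| Proposed model, subtractive interaction with exponential smoothing | 0.567 | 0.516 | 0.46 | 0.47 | 0.43 | 0.376 |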
IV-F Visualization
Visualization is of paramount importance in healthcare because of the demand for transparency. The progression of the illness state and the probability of readmission over time are visualized in Fig. 2 for two typical patients. The high-risk case is shown in Fig. 2(a), where the illness appears to get worse over time. In contrast, the low-risk case is depicted in Fig. 2(b), where the illness is rather stable over time.
Fig. 2: (a) Worsening progression; (b) Improving progression.
Code embedding also reveals the space of diseases, as visualized in Fig. 3.
Fig. 3: (a) Diabetes-related diseases; (b) Mental-health-related diseases.
V Discussion
We have taken an algebraic view of healthcare in that medical artifacts are represented as algebraic objects such as vectors and tensors. The continuous representation of diseases makes it easy to study the disease space, that is, which diseases are related and may be interacting. The same holds for the treatments and the clinic visits. The view also allows natural modelling of the evolution of illness as a result of the interaction between multiple diseases and multiple treatments over time. More specifically, we have argued for representing a healthcare trajectory as a sequence of (interactions of) sets, which is then realized by our new model. The model employs a simple multivalued set function for diseases and for treatments. Multi-disease–multi-treatment interaction per visit is a dual-input function of the two set functions. A healthcare trajectory is then modelled using an LSTM for its capability of memorizing distant events. Importantly, the entire system is end-to-end: the model reads the medical record and predicts future risks without any manual feature engineering. Results on over a hundred thousand visits by patients suffering from chronic conditions, diabetes and mental health, demonstrate the usefulness of the model.
Future work will refine the model in the healthcare context to address the irregular timing of visits, more comprehensive set functions (e.g., with self-attention), interaction functions, and predicting sequences of sets. We wish to emphasize that the model can be tailored to other problems involving sequences of sets. For example, a video is a sequence of shots, each of which is a set of objects and actions.
References

 [1] Léon Bottou. From machine learning to machine reasoning. Machine Learning, 94(2):133–149, 2014.
 [2] Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk prediction with electronic health records: A deep learning approach. In SIAM International Conference on Data Mining (SDM 2016). SIAM, 2016.
 [3] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, and Jimeng Sun. Multilayer representation learning for medical concepts. KDD, 2016.
 [4] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
 [5] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
 [6] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
 [7] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
 [8] Hayit Greenspan, Bram van Ginneken, and Ronald M Summers. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5):1153–1159, 2016.
 [9] Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama, 316(22):2402–2410, 2016.

 [10] S Hamid Rezatofighi, BG Kumar, Anton Milan, Ehsan Abbasnejad, Anthony Dick, Ian Reid, et al. Deepsetnet: Predicting sets with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5247–5256, 2017.
 [11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [12] David M Kent and Rodney A Hayward. Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification. Jama, 298(10):1209–1212, 2007.
 [13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [14] David Krueger and Roland Memisevic. Regularizing RNNs by Stabilizing Activations. arXiv preprint arXiv:1511.08400, 2015.
 [15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [16] Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. Deepr: A Convolutional Net for Medical Records. Journal of Biomedical and Health Informatics, 21(1), 2017.
 [17] Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. Predicting healthcare trajectories from medical records: A deep learning approach. Journal of biomedical informatics, 69:218–229, 2017.
 [18] Trang Pham, Truyen Tran, and Svetha Venkatesh. One size fits many: Column bundle for multi-x learning. arXiv preprint arXiv:1702.07021, 2017.
 [19] Daniele Ravì, Charence Wong, Fani Deligianni, Melissa Berthelot, Javier Andreu-Perez, Benny Lo, and Guang-Zhong Yang. Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics, 21(1):4–21, 2017.
 [20] Benjamin Shickel, Patrick James Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, 2017.
 [21] Truyen Tran. Deep learning for biomedicine: A tutorial. ACML, Seoul, Korea. URL: https://truyentran.github.io/acml17tute.html, 2017.

 [22] Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of biomedical informatics, 54:96–105, 2015.
 [23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS, 2017.
 [24] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. ICLR, 2016.
 [25] Deniz Yuret. Knet: beginning deep learning with 100 lines of julia. In Machine Learning Systems Workshop at NIPS 2016, 2016.
 [26] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets. NIPS, 2017.