In recent years, massive amounts of medical data have been accumulated in hospital information systems, which provides a great opportunity to develop machine learning methods that discover valuable information from the data. The performance of these methods depends heavily on the quality of the data representation [Bengio et al.2013]. With representation learning (RL), high-quality distributed representations for various medical concepts, such as diagnoses, medical activities (drugs and procedures), hospital visits and patients’ journeys, can be extracted end-to-end without human intervention [Choi et al.2016d, Choi et al.2016a, Choi et al.2016c, Liu et al.2018, Cai et al.2018, Choi et al.2017]. Building on this ability, a variety of studies have achieved excellent performance on different clinical tasks [Choi et al.2016b, Cheng et al.2016, Zhu et al.2016, Baytas et al.2017, Ma et al.2017]. Fig. (a) shows a typical form of the data to which these methods are applied. The medical activities in a visit are unordered, while the patient’s visits are ordered. These methods use the co-occurrence information and temporal relations in such data to construct deep neural networks for RL, following the key principle of word2vec [Mikolov et al.2013] that similar words (medical concepts) share similar contexts.
However, inpatient data (Intensive-Care-Unit (ICU) patients are not considered in this paper), one of the important categories of medical data, have a distinct form and distinct analysis goals compared to the data above. Because the condition of inpatients is commonly more severe than that of outpatients, RL for inpatients is essential for various tasks, such as predicting next day activities, in-hospital mortality and length-of-stay (LOS). As illustrated in Fig. (b), an inpatient visit is composed of several temporally related days, and the medical activities in a day are unordered. For inpatient data, existing medical RL methods face the following challenges:
(1) Temporal relations
Temporal features play a vital role in RL for medical data. For inpatients, the temporal relation is reflected at the day level and is stronger than the visit-level relation of outpatients. For example, consider two consecutive visits of an outpatient, where the first visit is for a common cold and the second is for a fracture. Even if the time interval between them is short (e.g. several days), the temporal relation between the two visits is weak. In contrast, the treatment of most inpatients proceeds on a daily basis, so the days in sequence are closely related. Therefore, these stronger temporal relations should be taken into account for inpatient RL.
(2) Importance of diagnosis
In most previous RL methods for medical data, a diagnosis is treated as a kind of medical activity [Choi et al.2016a, Cai et al.2018], because a patient’s multiple visits may correspond to different diagnoses. This means that diagnoses are mapped to the same representation space as medical activities. For inpatients, however, the diagnosis plays a guidance role for all the days [Xu et al.2018]. In this work, we highlight the importance of the first diagnosis of each inpatient visit for RL.
(3) Unordered medical activity set
As mentioned before, medical data contain medical activities with the same time-stamp. This differs from natural language, where the words in a sentence always form a sequence. To solve this problem, the popular strategy is to apply a pooling operation to the unordered set, such as sum, average or maximum, to generate the medical activity representation [Cai et al.2018, Choi et al.2016d] (see Fig. (a)) or the visit representation [Choi et al.2016a] (see Fig. (b)). However, pooling makes every medical activity in the unordered set contribute equally to the RL, which does not conform to clinical practice.
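As a minimal illustration of the pooling strategy discussed above, the following sketch applies element-wise sum, average and maximum pooling to a set of activity vectors. The function name `pool` and the toy vectors are hypothetical and are not taken from the cited methods. The result does not depend on the order of the set, and no activity receives a learned, activity-specific weight.

```python
def pool(vectors, mode="sum"):
    """Pool a collection of equal-length vectors element-wise."""
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "average":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Hypothetical 2-dimensional activity vectors for one visit or day.
day = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
shuffled = [day[2], day[0], day[1]]  # same set, different order
```

Calling `pool(day, "sum")` and `pool(shuffled, "sum")` gives the same vector, which is why pooling suits unordered sets but cannot express that some activities matter more than others.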
To tackle the above three challenges, this paper proposes Inpatient2Vec, a novel medical RL approach for inpatients. We aim to learn three kinds of representations: (medical) activity, (hospital) day and diagnosis, which not only cover the core data characteristics of inpatients but also satisfy the requirements of various analysis applications for inpatients. Inpatient2Vec is an extension of Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al.2018] that contains two learning tasks. The first is masked activity prediction, which uses a percentage of the activities in a day to predict the other activities in that day. This task exploits co-occurrence in the unordered set. The second is next day activity prediction, which uses the previous days of an inpatient to predict the activities in the next day with a bi-directional LSTM [Hochreiter and Schmidhuber1997]. This task highlights the temporal relations between ordered days. We design a Transformer-based network [Vaswani et al.2017], which combines activity, day and diagnosis representations, for the two learning tasks. In this network, the diagnosis plays a guidance role for the days through a time-aware mechanism, and the contributions of different activities in a day are distinguished by an attention mechanism.
The main contributions of this study are summarized below:
- Inpatient2Vec, an effective RL approach for inpatients, is proposed to cope with three challenges: temporal relations, the importance of diagnosis and the unordered medical activity set.
- A Transformer-based network is presented to capture the correlations between activity, day and diagnosis representations, and two learning tasks are designed as the training objective.
- We conduct experiments on real-world data to evaluate the quality of the learned representations from two aspects: semantic similarity and performance on prediction tasks. The results show that Inpatient2Vec outperforms the competitive baselines.
In this section, we first give the related notations and definitions. Then we introduce our proposed method, Inpatient2Vec, in detail, including the input sequence construction and the two unsupervised training tasks for RL.
We denote the set of inpatient visits as $\mathcal{V} = \{v_1, \dots, v_N\}$, where $N$ is the number of visits in our dataset. Each inpatient visit may have more than one diagnosis, but in this paper we only consider the first diagnosis, which largely determines the treatment strategy during the visit. The set of diagnosis codes is represented as $\mathcal{D}$ with size $|\mathcal{D}|$. For an inpatient visit $v_n$, its diagnosis is denoted as $d_n \in \mathcal{D}$. Each inpatient visit $v_n = (x_1, \dots, x_{T_n})$ is composed of several days, where $T_n$ is the number of days in $v_n$. Each day $x_t$ contains a set of medical activities (e.g. drugs, procedures and nursing care), and we define the set of all unique activities in our dataset as $\mathcal{A}$ with size $|\mathcal{A}|$.
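To make the notation concrete, the following sketch shows one possible in-memory layout for visits, days and activities. The class name `Visit`, the helper `vocabularies` and the toy codes are hypothetical and only illustrate the definitions above: each visit has one (first) diagnosis and an ordered list of days, and each day is an unordered set of activities.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Visit:
    """One inpatient visit: a first diagnosis plus an ordered list of days."""
    diagnosis: str                # first diagnosis code of the visit
    days: List[FrozenSet[str]]    # ordered days; each day is an unordered activity set

    @property
    def num_days(self) -> int:
        return len(self.days)


def vocabularies(visits):
    """Collect the unique diagnosis codes and unique activities of a dataset."""
    diagnoses = sorted({v.diagnosis for v in visits})
    activities = sorted({a for v in visits for day in v.days for a in day})
    return diagnoses, activities


# Hypothetical toy dataset with two visits.
visits = [
    Visit("J18", [frozenset({"antibiotic", "chest_xray"}), frozenset({"antibiotic"})]),
    Visit("I10", [frozenset({"blood_pressure_check"})]),
]
diagnoses, activities = vocabularies(visits)
```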
Inpatient2Vec is an extension of BERT, which achieves state-of-the-art performance on a series of natural language processing tasks. BERT consists of two parts: pre-training, which considers both left and right context, and fine-tuning. In our model, there are three kinds of representations: activity, day and diagnosis. The core component of Inpatient2Vec is the Transformer, which acts as a day-level feature extractor. We construct a sequence composed of the activities and the diagnosis in a day as the input to the Transformer. Two domain-related tasks are designed to learn the representations.
2.2.1 Input sequence construction
We first establish a concept mapping between natural language and inpatient data. In natural language, a document is composed of a set of sentences, and each sentence contains several words. By analogy, we treat an inpatient visit, a day and an activity as a document, a sentence and a word, respectively. Similar to the construction of the input sequence for the Transformer in BERT, we combine the representations of the activities and the diagnosis of a day as our input sequence (see the blue rectangles in Fig.3). The order of the activities in the sequence can be arbitrary, because the Transformer is insensitive to order information. This matches the fact that the activities in a day form an unordered set. Note that the first item ([CLS]) of the sequence is a special day-element that is used to generate the day representation through the Transformer.
Activity representation. We map each medical activity $a$ to a low-dimensional representation of size $m$ (see the yellow rectangles in Fig.3), which is denoted as $e_a \in \mathbb{R}^{m}$. In particular, we also create an activity representation for [CLS].
Diagnosis representation. For an inpatient visit, the diagnosis plays a guidance role in all the days. Considering that the treatment for a diagnosis is usually organized by day, we map each diagnosis $d$ to a matrix $E_d \in \mathbb{R}^{L_d \times m}$, where $L_d$ is the maximum LOS of inpatient visits with diagnosis $d$ and $m$ is the representation size, the same as for activities (see the orange rectangles in Fig.3). There are two advantages to adopting this three-dimensional diagnosis representation. First, the core treatment information of each day of a diagnosis is preserved in a vector of size $m$. Second, actual time information is incorporated into the RL architecture, similar to the positional embedding in BERT [Devlin et al.2018].
Therefore, each item in the input sequence is the aggregation of the corresponding activity and diagnosis representations.
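The input sequence construction described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the aggregation operator, so element-wise addition (as BERT does for token and position embeddings) is assumed here, and the function name, dictionary layout and toy values are hypothetical.

```python
def build_input_sequence(day_activities, day_index, activity_emb, diagnosis_emb):
    """Return the tokens of one day and one combined vector per token.

    activity_emb:  dict mapping a token (activity or "[CLS]") to a vector of size m
    diagnosis_emb: list of vectors of size m, one per day of stay, for the visit's diagnosis
    """
    day_vector = diagnosis_emb[day_index]          # diagnosis vector for this day
    tokens = ["[CLS]"] + sorted(day_activities)    # any order is valid; sorted for reproducibility
    vectors = [
        # ASSUMPTION: aggregation = element-wise sum of activity and diagnosis vectors
        [a + d for a, d in zip(activity_emb[token], day_vector)]
        for token in tokens
    ]
    return tokens, vectors


# Hypothetical embeddings with m = 2 and a diagnosis with a maximum LOS of 2 days.
activity_emb = {"[CLS]": [0.0, 0.0], "antibiotic": [1.0, 0.5], "chest_xray": [0.25, 2.0]}
diagnosis_emb = [[0.5, 0.5], [1.0, -1.0]]
tokens, vectors = build_input_sequence({"chest_xray", "antibiotic"}, 1, activity_emb, diagnosis_emb)
```

Because the diagnosis vector is selected by the day index, the same activity receives a different input vector on different days of the stay, which is how time information enters the sequence.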
2.2.2 Unsupervised training tasks
In this part, we introduce the two unsupervised training tasks used to learn the inpatient representations.
Masked activity prediction. In this task, we use the co-occurrence among medical activities in a day to train the RL model. Activities that frequently occur together may serve similar clinical functions, such as drugs for anti-inflammation and analgesia. We randomly mask 15% of the activities in all days. A special activity representation, similar to the one created for [CLS], replaces the masked activities. Then, for each masked activity in a day, we use the other activities in that day to predict it.
Next day activity prediction. The previous task only considers the context activities within a day and does not use the temporal relations between days. For an inpatient visit, the treatment on a given day strongly depends on the previous days, so we propose the next day activity prediction task to capture this temporality. Specifically, given the previous days of an inpatient visit, the task is to predict the activities that are most likely to be used in the next day.
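The data preparation for the two tasks above can be sketched as follows. The helper names and the `[MASK]` token string are hypothetical. The text states that 15% of the activities in all days are masked; masking per day with at least one masked activity is an assumption made here to keep the example small.

```python
import random

MASK = "[MASK]"  # hypothetical name for the special masked-activity token


def mask_activities(day, rate=0.15, rng=None):
    """Replace a fraction of a day's activities with MASK; return inputs and targets."""
    rng = rng or random.Random()
    items = sorted(day)                            # fix an order only for indexing
    n_mask = max(1, round(len(items) * rate))      # ASSUMPTION: at least one mask per day
    positions = set(rng.sample(range(len(items)), n_mask))
    inputs = [MASK if i in positions else a for i, a in enumerate(items)]
    targets = {i: items[i] for i in positions}     # position -> activity to predict
    return inputs, targets


def next_day_samples(days):
    """Pair the previous days 1..t-1 with day t as the prediction target."""
    return [(days[:t], days[t]) for t in range(1, len(days))]


# Hypothetical day with 20 activities, and a visit with three days.
day = {f"activity_{i:02d}" for i in range(20)}
inputs, targets = mask_activities(day, rng=random.Random(0))
samples = next_day_samples([{"a", "b"}, {"b"}, {"c"}])
```

A visit with T days yields T-1 next-day samples, since the first day has no preceding context.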
2.2.3 Model architecture
Fig.4 shows the architecture of our model. The bottom rectangles are the input sequence generated from a day. We feed the input sequence into an N-layer Transformer, which has proved to be an outstanding feature extractor. Each layer consists of a multi-head attention sub-layer, a feed-forward sub-layer and two normalization layers. Equation (1), which is composed of three matrices, Query ($Q$), Key ($K$) and Value ($V$), is one component of multi-head attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1)$$

where $d_k$ is the dimension of the keys [Vaswani et al.2017].
In our model, Q, K and V are set to be equal, which is called self-attention. It computes the attention among activities, which represents their different importance for RL.
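As a minimal sketch of the self-attention step, the following pure-Python code computes scaled dot-product attention [Vaswani et al.2017] with Q, K and V all set to the same input, as stated above. It is a simplification for illustration: it uses a single head and omits the learned projection matrices, the feed-forward sub-layer and normalization, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x.

    x is a list of item vectors (one per element of the input sequence).
    Returns the attention weights and the attended output vectors.
    """
    d_k = len(x[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in x]
        for q in x
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[j] for w, v in zip(weight_row, x)) for j in range(d_k)]
        for weight_row in weights
    ]
    return weights, outputs


# Hypothetical sequence of two orthogonal item vectors.
weights, outputs = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Each row of `weights` sums to one and gives a different weight to each item, in contrast to pooling, which treats all items uniformly.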
Through the Transformer, we obtain the day representations described below.
Day representation. A day representation is generated from the day-element ([CLS]). With the help of the Transformer, the day representation takes into account its relations with the activities in the day.
In addition, besides the day representation, the other outputs of the Transformer (top rectangles in Fig.4) can be regarded as day-based activity representations. These representations reflect the ambiguity of an activity across different days. The two unsupervised training tasks take the day-based activity representations and the day representation as input, respectively.
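For the masked activity prediction task, the prediction result $\hat{y}_i$ for the $i$-th masked activity is a probability distribution over all activities, computed from the day-based activity representation at the masked position.
We calculate the loss as the cross entropy between $\hat{y}_i$ and the true label $y_i$, which is the one-hot vector of the masked activity:

$$\mathcal{L}_{mask} = -\frac{1}{M} \sum_{i=1}^{M} y_i^{\top} \log \hat{y}_i$$

where $M$ is the number of masked activities in all visits.
For next day activity prediction, we feed the day representations into a single-layer bi-directional LSTM. The previous $t-1$ days (from day $1$ to day $t-1$) are used to predict the activities that may be used in the next day $t$. The hidden state of the bi-directional LSTM is used to compute the prediction $\hat{y}_t$.
Similar to masked activity prediction, the loss of this task, denoted as $\mathcal{L}_{next}$, is also the cross entropy between $\hat{y}_t$ and the true label $y_t$.
We add the masked activity prediction loss $\mathcal{L}_{mask}$ and the next day activity prediction loss $\mathcal{L}_{next}$ to obtain the loss function used to train the model.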
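The training objective described above (a cross-entropy loss for each task, with the two losses added) can be sketched as follows. The function names are hypothetical, and averaging the per-example losses is an assumption of this sketch rather than something the text confirms.

```python
import math


def cross_entropy(predicted, target, eps=1e-12):
    """Cross entropy between a predicted distribution and a target label vector."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))


def mean_cross_entropy(predictions, targets):
    """ASSUMPTION: per-example losses are averaged over all examples."""
    losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


def joint_loss(mask_loss, next_day_loss):
    """The two task losses are simply added."""
    return mask_loss + next_day_loss


# Hypothetical predictions over a vocabulary of two activities.
mask_loss = mean_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [[1, 0], [0, 1]])
next_loss = mean_cross_entropy([[0.5, 0.5]], [[0, 1]])
total = joint_loss(mask_loss, next_loss)
```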
In this section, we conduct experiments to demonstrate the effectiveness of the learned representations. We first describe the dataset and the experimental setting, including baseline methods, evaluation and implementation details. Then we analyze the experimental results.
We use a real-world insurance claims dataset from a Chinese city. We filter out data according to the following criteria: (1) visits whose number of days is less than 2 or more than 50; (2) diagnoses whose number of visits is outside the range of 100 to 3000. Each visit corresponds to a diagnosis code that follows ICD-10. Table 1 lists the details of the dataset.
| Statistic | Value |
|---|---|
| # of visits | 226,420 |
| # of days | 1,869,294 |
| # of diagnosis codes (ICD-10) | 479 |
| # of medical codes | 3,952 |
| Avg. # of activities per day | 13.97 |
| Avg. length of stay | 9.26 |
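The two filtering criteria used to build the dataset can be sketched as follows. The function name and the tuple layout are hypothetical. Two details are assumptions of this sketch, since the text does not state them: the range bounds are treated as inclusive, and visits are filtered by length before diagnoses are counted.

```python
from collections import Counter


def filter_visits(visits, min_days=2, max_days=50, min_visits=100, max_visits=3000):
    """Keep visits that satisfy both criteria.

    visits is a list of (diagnosis_code, number_of_days) pairs.
    """
    # Criterion (1): drop visits that are too short or too long.
    kept = [(d, n) for d, n in visits if min_days <= n <= max_days]
    # Criterion (2): drop diagnoses with too few or too many remaining visits.
    counts = Counter(d for d, _ in kept)
    return [(d, n) for d, n in kept if min_visits <= counts[d] <= max_visits]


# Hypothetical toy data with small thresholds so the effect is visible.
toy = [("A", 1), ("A", 3), ("A", 4), ("B", 5), ("C", 60)]
filtered = filter_visits(toy, min_visits=2, max_visits=3)
```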
3.2 Experiment Setup
Through Inpatient2Vec, we obtain three kinds of representations: activity, day and diagnosis representations. To comprehensively evaluate the learned representations, we designed two categories of experiments. The first focuses on semantic similarity, including the activity intrusion task for activity representations and a clustering task for diagnosis representations (we do not evaluate the semantic similarity of day representations because it is hard to find a corresponding ground truth). The second contains two prediction tasks, next day activity prediction and remaining length-of-stay prediction (LOS refers to the number of days an inpatient stays in hospital from admission to discharge), which are used to verify the applicability of the learned representations. Furthermore, we conducted an ablation study to evaluate the importance of different parts of Inpatient2Vec.
3.2.1 Semantic similarity measurement
Inspired by the word intrusion task [Murphy et al.2012, Luo et al.2015], which is widely used to evaluate the semantic quality of word representations, we designed the activity intrusion task for activity representations. In this task, we first calculated the Euclidean distance between every pair of activity representations. Then, given an activity $a$, we constructed a set of 6 activities, in which 5 are the nearest activities to $a$ and the remaining one (called the "intrusion") comes from the farthest 50% of activities according to the Euclidean distance. Lastly, we invited three doctors to pick out the "intrusion" activity and used the precision of correct picks as the measurement.
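The construction of one intrusion set can be sketched as follows. The function name and toy embeddings are hypothetical, and drawing the intrusion uniformly at random from the farthest half is an assumption of this sketch.

```python
import math
import random


def intrusion_set(target, embeddings, n_near=5, rng=None):
    """Build a candidate set of n_near nearest activities plus one intruder.

    embeddings maps an activity name to its representation vector.
    Returns the shuffled candidate list and the intruder.
    """
    rng = rng or random.Random()
    # Rank all other activities by Euclidean distance to the target.
    others = sorted(
        (a for a in embeddings if a != target),
        key=lambda a: math.dist(embeddings[target], embeddings[a]),
    )
    nearest = others[:n_near]
    far_half = others[len(others) // 2:]   # farthest 50% of activities
    intruder = rng.choice(far_half)        # ASSUMPTION: uniform random choice
    candidates = nearest + [intruder]
    rng.shuffle(candidates)                # hide the intruder's position from annotators
    return candidates, intruder


# Hypothetical 1-dimensional embeddings: a0 at 0.0, a1 at 1.0, ..., a11 at 11.0.
embeddings = {f"a{i}": [float(i)] for i in range(12)}
candidates, intruder = intrusion_set("a0", embeddings, rng=random.Random(0))
```

The sketch assumes the vocabulary is large enough that the nearest activities and the farthest half do not overlap, which holds for the 3,952 activity codes of the dataset.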
For diagnosis representations, we used the hierarchy of ICD-10 as the clustering ground truth, in which the 479 diagnoses are grouped into 131 categories (by keeping the first 3 characters of each code). K-means (implemented in scikit-learn 0.19.0) was used as the clustering method, and normalized mutual information (NMI) was used as the measurement.
We compared Inpatient2Vec against 3 state-of-the-art models, i.e., CBOW (as shown in Fig.2), Med2Vec (based on skip-gram) [Choi et al.2016a] and RoMCP (an extension of CBOW for inpatients that considers diagnosis information) [Xu et al.2018]. Methods that do not generate day representations, such as those in [Cai et al.2018, Choi et al.2016d], are not considered in the comparison. It is worth mentioning that a diagnosis is treated as a kind of activity in CBOW and Med2Vec, so diagnoses and activities are mapped to the same representation space.
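The evaluation measure for diagnosis clustering can be sketched in pure Python as follows. The ground-truth category of a code is its first three characters, as described above. The NMI here is normalized by the geometric mean of the two entropies, which is one common definition and is an assumption of this sketch. The experiments themselves use scikit-learn, which is not reproduced here.

```python
import math
from collections import Counter


def icd10_category(code):
    """Ground-truth cluster of a diagnosis: the first three characters of its code."""
    return code[:3]


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, predicted):
    """Normalized mutual information between two labelings of the same items."""
    n = len(truth)
    joint = Counter(zip(truth, predicted))
    count_t, count_p = Counter(truth), Counter(predicted)
    mutual_info = sum(
        c / n * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    # ASSUMPTION: normalize by the geometric mean of the two entropies.
    denom = math.sqrt(entropy(truth) * entropy(predicted))
    return mutual_info / denom if denom > 0 else 0.0


# Hypothetical codes and two hypothetical clusterings.
codes = ["J18.0", "J18.1", "I10.0", "I10.9"]
truth = [icd10_category(c) for c in codes]
perfect = nmi(truth, [0, 0, 1, 1])
unrelated = nmi(truth, [0, 1, 0, 1])
```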
3.2.2 Inpatient Prediction Tasks
The core goal of RL is to improve the performance of different analysis tasks. In this part, we selected two typical inpatient prediction tasks for evaluation. One is next day activity prediction, which is the same as the second training task of Inpatient2Vec. The other is remaining LOS prediction, which estimates the number of days from each time-stamp to discharge. Three state-of-the-art prediction approaches were selected as the basic models. We evaluated whether these approaches benefit from the pre-trained representations of Med2Vec, RoMCP and Inpatient2Vec on the two tasks through a fine-tuning procedure (CBOW is not considered here because its architecture is similar to that of Med2Vec).
The three basic models are as follows: 1) Retain [Choi et al.2016b] is an interpretable predictive model with two kinds of reverse time attention mechanisms, which focus on the visit level and the day level. 2) Dipole [Ma et al.2017] is a bi-directional LSTM network with three kinds of attention. We use the location-based attention, which performs best in our prediction tasks. 3) T-LSTM [Baytas et al.2017] focuses on handling irregular time intervals in longitudinal patient records.
On the one hand, we evaluated the performance of the three models with the original inputs (diagnoses and activities represented by one-hot vectors). On the other hand, we fed the three pre-trained representations into the models with fine-tuning for comparison.
For next day activity prediction, we calculate RECALL@k, based on the correctly predicted medical activities among the top $k$ values of the prediction $\hat{y}_t$, as the measurement:

$$\mathrm{RECALL@}k = \frac{|\mathrm{top}_k(\hat{y}_t) \cap x_t|}{|x_t|}$$

where the numerator is the number of activities in the intersection between the actual day $x_t$ and the top $k$ of $\hat{y}_t$, and $|x_t|$ is the number of activities that actually occurred in $x_t$. Besides, a variant of RECALL@k with adaptive $k$ is also used as a measurement.
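RECALL@k for a single predicted day can be computed as in the following sketch. The function name, the score dictionary and the toy values are hypothetical, and only the fixed-k version is shown.

```python
def recall_at_k(scores, actual, k):
    """Fraction of the actual activities that appear among the k highest-scored ones.

    scores maps each activity to its predicted score for the next day.
    actual is the set of activities that actually occurred in that day.
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(top_k) & actual) / len(actual)


# Hypothetical prediction over four activities; two of them actually occurred.
scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}
actual = {"a", "c"}
```

Here `recall_at_k(scores, actual, 2)` is 0.5 because only `a` is in the top two, while `recall_at_k(scores, actual, 3)` is 1.0.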
For remaining LOS prediction, the root mean squared error (RMSE) between the actual and predicted remaining LOS is used as the measurement.
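The remaining LOS labels and the RMSE measurement can be sketched as follows. The function names are hypothetical, and counting the remaining LOS at day t as the number of days after day t is an assumption about how the labels are derived.

```python
import math


def remaining_los(num_days):
    """Remaining days until discharge at each day t = 1..num_days (assumed num_days - t)."""
    return [num_days - t for t in range(1, num_days + 1)]


def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical visit with three days and hypothetical model outputs.
labels = remaining_los(3)
error = rmse(labels, [3.0, 1.0, 1.0])
```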
3.2.3 Implementation Details
All approaches are implemented in TensorFlow 1.12.0. We randomly divided the dataset into training, validation and testing sets in a 0.75:0.1:0.15 ratio. The validation set is used to determine the values of the hyper-parameters. For the pre-training model, we use Adam with a learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and an L2 weight decay of 0.01. The diagnosis and activity representation size is 384, the number of attention heads is 6, and the number of Transformer layers is 6. The hidden layer size of the bi-directional LSTM is 200. For the two prediction tasks, we use the Adadelta optimizer to train our model, with a mini-batch of 128 patients. The hidden layer size of Retain, Dipole and T-LSTM is 200. We run 10 epochs and report the best performance of each approach on the above two tasks.
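The random 0.75:0.1:0.15 split can be sketched as follows. The function name and the fixed seed are hypothetical, and splitting at the level of individual items (for example visits) is an assumption, since the text does not state the unit of the split.

```python
import random


def split_dataset(items, ratios=(0.75, 0.10, 0.15), seed=0):
    """Shuffle the items and split them into training, validation and testing sets."""
    items = list(items)
    random.Random(seed).shuffle(items)     # fixed seed only for reproducibility
    n_train = round(len(items) * ratios[0])
    n_valid = round(len(items) * ratios[1])
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]       # the remainder goes to the testing set
    return train, valid, test


# Hypothetical dataset of 200 visit identifiers.
train, valid, test = split_dataset(range(200))
```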
3.3 Results Analysis
The precision of the activity intrusion task for the different RL methods is shown on the left side of Fig.5. We observe that CBOW and Med2Vec, which adopt similar architectures, obtain similarly poor performance. RoMCP achieves an improvement of nearly 40% over CBOW and Med2Vec. The significant difference between them is that RoMCP considers the strict temporal relations in an inpatient visit by concatenating the day representations for prediction, whereas in CBOW the context days are simply aggregated to predict the center day, and in Med2Vec the center day is used to predict all the context days. Inpatient2Vec outperforms the other methods. Compared to RoMCP, the improvement may stem from two aspects. One is that the bi-directional LSTM can better capture long-distance temporal relations. This further confirms that the stronger temporal relations play an important role in inpatient RL. The other is that the Transformer calculates attention weights between the activities in a day. In contrast to the pooling operation, our method distinguishes the contributions of different activities.
The right side of Fig.5 illustrates the NMI results for diagnosis clustering. Inpatient2Vec and RoMCP perform better than CBOW and Med2Vec. The reason is that the latter two methods map diagnoses and activities into the same representation space, without considering the guidance role of the diagnosis for inpatients. In the former two methods, the diagnosis is regarded as an independent representation, which preserves the most important factors of the diagnosis. Inpatient2Vec shows a slight improvement over RoMCP. The main reason is that the diagnosis representation in Inpatient2Vec is day-based, so that each day of the diagnosis corresponds to a vector. This conforms to the clinical practice that the treatment of inpatients proceeds on a daily basis.
Next day activity prediction
Table 2 shows the results for next day activity prediction. Among the three approaches with the original input, Dipole and T-LSTM achieve better performance than Retain, which makes a trade-off between prediction accuracy and interpretability. When the pre-trained representations are fed into the approaches with a fine-tuning procedure, the prediction performance changes to different degrees. For Med2Vec, the change is small. This is because Retain, Dipole and T-LSTM adopt the same pooling strategy as Med2Vec for processing the original one-hot input, which means that the fine-tuning procedure is the same as the end-to-end training procedure of the three prediction models. Therefore, once the models converge, the different initializations have limited impact on the final performance. In contrast, the architectures of RoMCP and Inpatient2Vec can extract representations that are better suited to prediction with the help of the fine-tuning procedure. Of the two, Inpatient2Vec contributes more to the prediction models, because the Transformer is able to handle various dependencies in inpatient data.
Remaining LOS prediction
The performance of remaining LOS prediction is shown in Table 2. We observe that the approaches with Inpatient2Vec outperform the others, and that Med2Vec contributes the least to the prediction models. The reasons are similar to those for the next day activity prediction task. It is worth mentioning that even the best performance, achieved by Inpatient2Vec+T-LSTM, is an RMSE of only 3.3084, from which we infer that remaining LOS prediction is a difficult task. More information, such as lab tests and medical notes, should be introduced.
To evaluate the effectiveness of the different components of Inpatient2Vec, we designed two comparative approaches. One removes our diagnosis representation and treats each diagnosis as an activity. It is used to verify the importance of the proposed diagnosis representation for inpatient RL. The other replaces the second training task (next day activity prediction) with a pair-wise day prediction task that, given any two days, decides whether they are consecutive. It is similar to the original training task (next sentence prediction) in BERT. This variant tests the contribution of temporal relations to Inpatient2Vec. We select next day activity prediction as the evaluation task and Dipole as the basic model for evaluation. Fig.6 shows the comparison results. We conclude that the two removed components are of great importance for inpatient RL.
In this paper, we propose a novel model, Inpatient2Vec, to learn representations for inpatients. Based on the distinctive data characteristics of inpatients, three kinds of representations, namely activity, day and diagnosis, are combined by a Transformer-based network. The guidance role of the diagnosis and the dependency between unordered activities are modeled in the network by a self-attention mechanism. We present two tasks, which focus on activity co-occurrence and day temporality respectively, to train the network. On a real-world dataset, semantic similarity measurement and inpatient clinical prediction are used as the evaluation tasks. The former demonstrates that the learned activity and diagnosis representations capture clinical semantic information. The latter shows the applicability of our method to prediction tasks through a fine-tuning procedure. One important direction for future work is to integrate inpatient and outpatient data for a more comprehensive RL.
- [Baytas et al.2017] Inci M Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K Jain, and Jiayu Zhou. Patient subtyping via time-aware lstm networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 65–74. ACM, 2017.
- [Bengio et al.2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- [Cai et al.2018] Xiangrui Cai, Jinyang Gao, Kee Yuan Ngiam, Beng Chin Ooi, Ying Zhang, and Xiaojie Yuan. Medical concept embedding with time-aware attention. arXiv preprint arXiv:1806.02873, 2018.
- [Cheng et al.2016] Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 432–440. SIAM, 2016.
- [Choi et al.2016a] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016.
- [Choi et al.2016b] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
- [Choi et al.2016c] Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv preprint arXiv:1602.03686, 2016.
- [Choi et al.2016d] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings, 2016:41, 2016.
- [Choi et al.2017] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. Gram: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
- [Devlin et al.2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [Liu et al.2018] Luchen Liu, Jianhao Shen, Ming Zhang, Zichang Wang, and Jian Tang. Learning the joint representation of heterogeneous temporal events for clinical endpoint prediction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [Luo et al.2015] Hongyin Luo, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Online learning of interpretable word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1687–1692, 2015.
- [Ma et al.2017] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proceedings of KDD, pages 1903–1911. ACM, 2017.
- [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- [Miotto et al.2017] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19(6):1236–1246, 2017.
- [Murphy et al.2012] Brian Murphy, Partha Talukdar, and Tom Mitchell. Learning effective and interpretable semantic models using non-negative sparse embedding. Proceedings of COLING 2012, pages 1933–1950, 2012.
- [Ravì et al.2017] Daniele Ravì, Charence Wong, Fani Deligianni, Melissa Berthelot, Javier Andreu-Perez, Benny Lo, and Guang-Zhong Yang. Deep learning for health informatics. IEEE journal of biomedical and health informatics, 21(1):4–21, 2017.
- [Shickel et al.2017] Benjamin Shickel, Patrick James Tighe, Azra Bihorac, and Parisa Rashidi. Deep ehr: A survey of recent advances in deep learning techniques for electronic health record (ehr) analysis. IEEE Journal of Biomedical and Health Informatics, 2017.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [Xu et al.2018] Xiao Xu, Ying Wang, Tao Jin, and Jianmin Wang. Learning the representation of medical features for clinical pathway analysis. In International Conference on Database Systems for Advanced Applications, pages 37–52. Springer, 2018.
- [Zhu et al.2016] Zihao Zhu, Changchang Yin, Buyue Qian, Yu Cheng, Jishang Wei, and Fei Wang. Measuring patient similarities via a deep architecture with medical concept embedding. In Proceedings of ICDM, pages 749–758. IEEE, 2016.