1 Introduction
The International Classification of Diseases (ICD) is provided by the World Health Organization (WHO), and contains codes for diseases, procedures, and external causes of injury or disease. ICD codes play an important role in patient electrical health records (EHRs). For example, the hospital admission of a patient is often summarized as a set of disease ICD codes and a set of procedure ICD codes. The disease ICD codes represent the diagnosis provided by doctors, and the procedure ICD codes indicate the treatments applied to the patient.
A significant and interesting problem is predicting procedures given diagnosed diseases, which can be useful for improving the effectiveness and the efficiency of hospital admission. From the viewpoint of machine learning, this problem can be formulated as an ICD code embedding task. Specifically, given a patient admission record, we aim to represent the ICD codes of the diseases and procedures appearing in the record as embedding vectors. Accordingly, one may anticipate a procedure for a given disease when the two have similar embedding vectors.
Unfortunately, most existing embedding methods may not be suitable for the proposed problem because of the special properties of ICD codes. For each admission, the corresponding disease ICD codes and procedure ICD codes are ranked according to a manuallydefined priority, rather than their real clinical relationships. Additionally, a disease often leads to multiple procedures and a procedure may correspond to multiple diseases. In other words, the diseases and the procedures in an admission often have complicated clinical relationships, but the mapping between these two code sets is unavailable. Such uncertainties in the admission records are challenging for existing embedding methods, because they require wellstructured observations, , sequential data like words in sentences (Mikolov et al., 2013; Pennington et al., 2014), pairwise interactions like useritem pairs in recommender systems (Rendle et al., 2009; Chen et al., 2018a), and node interactions in graphs (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016). As a result, applying existing methods to embed ICD codes directly suffers from a high risk of model misspecification. Moreover, like other clinical data analysis tasks (Choi et al., 2016a; Mullenbach et al., 2018; Mahmood et al., 2018), the proposed embeddings of ICD codes should be interpretable, , it is desirable that the clinical relationships between diseases and procedures can be explicitly captured by the distance/similarity between their embedding vectors. Although some existing embedding methods can capture simple pairwise relationships between their embedded entities, it is still difficult to describe more complicated relationships among multiple entities.
Focusing on the challenges of ICD code embeddings, we propose an interpretable embedding method with novel self and mutualattention mechanisms. As illustrated in Figure 1, our method contains a selfattention network to fuse observed disease embeddings together. Accordingly, the similarity between the fused disease embedding and each procedure embedding is calculated, and procedures with high similarity are predicted to fit the observed procedures. When learning the embeddings and the selfattention network, we take advantage of the optimaltransport distance (Villani, 2008)
between observed diseases and procedures as the regularizer of our model, achieving a mutualattention map between them. The self and the mutualattention mechanisms are connected via the estimated probabilities of diseases. The proposed method has the advantages of interpretability of learned embeddings. Specifically, for each admission, the selfattention network estimates the probabilities of observed diseases, which can be interpreted as the significance of the disease in the admission. Additionally, the mutualattention regularizer estimates the optimal transport between the observed diseases and procedures, which estimates their clinical relationships.
The proposed method can be used to recommend suitable procedures according to diagnosed diseases, which can be used to improve the efficiency of hospital admission, , for some typical diseases, clinicians, especially those junior and with limited clinical experience, can query suitable procedures quickly. Additionally, the recommended procedure codes can help to doublecheck the codes entered by clinicians.For other clinical data analysis tasks like ICD code assignment (Baumel et al., 2017; Huang et al., 2018), which requires ICD code embeddings as the input of downstream models and applications, the proposed embeddings generated by our method can either provide highquality inputs or improve the training of their ICD code embeddings via good initialization.
2 Target Data Set and Problem Statement
The proposed work employs the publiclyavailable MIMICIII dataset (Johnson et al., 2016), developed by the MIT Lab for Computational Physiology. It comprises over 58,000 hospital admissions recorded from June 2001 to October 2012 for 38,645 adults and 7,875 neonates. For each admission, its ICD codes are generated for billing purposes at the end of the hospital stay, which includes a set of disease ICD codes and a set of procedure ICD codes. The ICD codes employ the ICD9 format.^{1}^{1}1https://www.cdc.gov/nchs/icd/icd9.htm In the MIMICIII dataset, 14,567 disease ICD codes and 3,882 procedure ICD codes are observed. Each admission contains 1 to 41 diseases and 1 to 40 procedures.
We consider three subsets of the MIMICIII dataset. For each subset, we select the admissions having nonempty diseases and procedures. The small dataset contains 28,315 admissions with 247 diseases and 75 procedures, and each ICD code appears at least 500 times. The medium dataset contains 30,555 admissions with 874 diseases and 258 procedures, and each ICD code appears at least 100 times. The large dataset contains 31,213 admissions with 2,765 diseases and 819 procedures, and each ICD code appears at least 10 times. For these three subsets, the histograms of admissions with respect to the number of diseases and that of procedures are shown in Figure 2
. We find that the histograms of admissions corresponding to the number of procedure ICD code per admission are consistent across different subsets, implying that the empirical distribution of procedures is stable and may yield an exponential distribution. On the other hand, the histograms of admissions with respect to the number of diseases per admission have changes with the increase of data size.
We denote the set of disease and procedure ICD codes as and , respectively, where the sizes of the two sets are and .^{2}^{2}2Here, represents the cardinality of the associated set, counting the number of elements. Each element represents a specific disease and each represents a specific procedure. Suppose that we observed hospital admissions. For the th admission, , the diseases and the procedures associated with the admission are denoted and , respectively. Here, contains the “positive” procedures corresponding to . Accordingly, we can define/generate “negative” procedures that are potentially irrelevant to , , , used in the following learning algorithm.
Given such observations, we aim to ) embed ICD codes of diseases and procedures; ) predict reasonable procedures from diagnosed diseases according to proposed embeddings; and ) it is desired that the prediction is clinicallyinterpretable. In the following sections, we denote the embeddings of diseases and procedures as and , respectively, where is the embedding dimension.
3 Proposed Model and Learning Algorithm
3.1 Predicting procedures from diseases
Our embedding method learns a predictive model of procedures. For the th admission, we predict the set of procedures from the set of diseases . Specifically, for each procedure , we represent the probability of conditioned on
via the following parametric model:
(1) 
where is the embedding of the procedure , contains the columns of corresponding to , and is a function fusing columns of to a single vector. For convenience, we denote the parameters of the predictive model as .
For the positive procedure , it is desirable that the proposed model makes approach . By contrast, for the negative procedure , the proposed model should suppress to . It should be noted that for each admission the number of negative procedures is much larger than that of positive procedures (, ). Therefore, in practice we must apply the negative sampling strategy used in other embedding methods (Rendle et al., 2009; Mikolov et al., 2013), randomly selecting a subset of negative procedures, , and ensuring .
Accordingly, we can learn the model via maximum likelihood estimation (MLE): for the th admission, its predictive loss is
(2) 
By minimizing , we maximize the loglikelihood of the observed procedures given the corresponding diagnosed diseases, and suppress the loglikelihood of the irrelevant procedures.
3.2 Fusing disease embeddings with a selfattention mechanism
The key of the above predictive model is the fusion function , which has a large influence on the interpretability of the model and the final performance on prediction. The simplest fusion strategy is average pooling, ,
, which (questionably) assumes that the different diseases have the same influence on the prediction of procedures. Another strategy is maxpooling, which keeps the maximum value for each feature dimension while ignoring the contributions from other diseases. To overcome the problems of the two strategies above and improve the performance of our predictive model, we propose a novel selfattention network as the fusion function, achieving an adaptive fusion strategy for disease embeddings.
The proposed selfattention network is inspired by the multihead attention architecture in (Vaswani et al., 2017) and the selfattentive embedding structure in (Lin et al., 2017). As shown in Figure 3, our selfattention network is a twolayer architecture. The first layer contains heads, each of which is a selfattention function that generates a weight vector from :^{3}^{3}3 represents the dimensional simplex.
(3) 
where and are the parameters of the th head. The second layer is a single selfattention function, which takes the concatenation of as input and generates the final weight vector. Denote , the final weight is derived via
(4) 
where and is the parameter of this layer. For convenience, we represent the whole process as , and accordingly .
From a probabilistic viewpoint, may be interpreted as a distribution of the diseases in : the higher probability a disease has, the more significant influence the disease has on predicting procedures.
3.3 Learning with an optimal transportbased mutualattention mechanism
Besides the predictive loss in (2), we further design a regularizer based on the optimaltransport distance between observed diseases and procedures, achieving a mutualattention mechanism in the training phase. As mentioned above, a disease may lead to multiple procedures and a procedure can be shared by different diseases. Therefore, for the th admission, there exists a complicated map from to . We estimate this map explicitly via minimizing the optimaltransport distance (Villani, 2008) between and , and the estimated map is used to regularize the learning of embeddings.
Denote the distributions of diseases and corresponding procedures in the th admission as and , respectively. Here, is the weighted vector derived via our selfattention networks, and the
is assumed to be a uniform distribution. As illustrated in Figure
3, given the embeddings of the diseases and the procedures, , and , we calculate the distance matrix between them, denoted . Because the probability in (1) is calculated based on the inner product between disease embedding and procedure embedding, we calculate the elements of the distance matrix based on the cosine similarity (normalized inner product) between the embeddings:
(5) 
Based on the distance matrix, the optimaltransport distance provides a measure of the dissimilarity between and , defined as
(6) 
where
is the set of all possible joint distributions with
and as marginals; represents the inner product between matrices. The optimal transport, , , is the joint distribution that minimizes the distance between and . A detailed introduction to the optimaltransport distance is found in Appendix A and within (Villani, 2008; Cuturi, 2013; Benamou et al., 2015).With the help of the optimaltransport distance, we achieve a mutualattention mechanism for embedding ICD codes: the optimal transport explicitly represents the clinical relationships between and – if disease mainly yields procedure , element in will ideally have a corresponding large value.
3.4 Proposed learning method
Given a set of admissions, , we jointly consider the predictive loss in (2) and the optimaltransport distance in (6), and learn the proposed model by solving the following optimization problem:
(7) 
This is a nested minimization problem, because the second term involves the optimization of optimal transport. We solve this problem via alternating optimization, which involves the following two steps:

Update optimal transport. For each admission, given current embeddings and , and the estimated distribution of , we update the optimal transport by solving (6), , . This optimization problem can be solved effectively via the proximal gradient method (Xie et al., 2018), with linear convergence. The detailed introduction of the proximal gradient method is given in Appendix A.

Update embeddings and selfattention network. Given updated optimal transport matrices , we plug them into the second term of (7), and update the ICD code embeddings and the selfattention network via minibatch gradient descent. Specifically, we solve the following problem:
(8) where represents a batch of admissions.
The learning algorithm is summarized in Algorithm 1. Note that the updating of optimal transport for different admissions can be done in parallel, and the gradient of (8
) can be calculated efficiently via backpropagation on graphical processing units (GPUs).
4 Related Work
4.1 Embedding techniques
Embeddings have been widely used to represent and analyze realworld entities. For useritem pairs in recommendation systems, lowrank factorization models are applied to estimate user and item embeddings from their observed interactions (Herlocker et al., 1999; Rendle et al., 2009). When side information is available, such as reviews of items provided by users or the images of items, the embeddings can be further parametrized via neural networks (Kang and McAuley, 2018; Chen et al., 2018a, b)
. For natural language processing, word embeddings have been widely used to represent words in sentences. Typical word embedding techniques include Word2Vec
(Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which maximize the coherency of the words and their contexts in sentences. Following the same strategy, many node embedding methods have been proposed to represent nodes in graphs, , DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and Node2Vec (Grover and Leskovec, 2016). Most of these embedding methods employ the same framework – first generate sequential observations through random walks in the graph, and then apply word embeddings by maximizing the coherency of adjacent nodes on the random walks. Focusing on clinical data analysis, embedding techniques have been applied to many tasks, like ICD code assignment (Shi et al., 2017; Baumel et al., 2017; Mullenbach et al., 2018; Huang et al., 2018), clinical data clustering, and treatment prediction (Bajor et al., 2018; Harutyunyan et al., 2017; Choi et al., 2016a). Recently, the work in (Xu et al., 2018, 2019) makes efforts to learn ICD code embeddings directly through admission records based on optimal transportbased methods. However, the methods don’t scale well, and therefore it is hard to apply in practical systems.4.2 Attention models
Attention models have proven useful for many machine learning tasks. Such models are typically applied to a set or sequence of vectors, and the associated weights on the vectors characterize their importance for a given task (Vinyals et al., 2015). The weights highlight the important parts of the sequence of vectors, and effectively provide an adaptive pooling strategy to obtain a global representation of all observations (Lin et al., 2017). A successful example of attention models is its application to natural language processing (NLP) tasks, like questionanswering and document comprehension (Devlin et al., 2018; Choi et al., 2017b) – the attention model has been core to advanced NLP modules like the “transformer” (Vaswani et al., 2017). Leading language models like GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) also rely on various attention mechanisms. Besides NLP, attention models have recently been applied to other tasks, including recommendation systems (Kang and McAuley, 2018)
(Kool and Welling, 2018), and multiinstance learning (Ilse et al., 2018). Attention models are also applicable for healthcare, a, , ECG rhythm classification (Goodfellow et al., 2018), admission prediction (Choi et al., 2016b), and heart failure prediction (Choi et al., 2017a).4.3 Optimal transportbased learning
Learning based on the optimaltransport distance has recently attracted much attention, such as in distribution estimation (Boissard et al., 2015) and clustering (Ye et al., 2017)
. It can also be used as a loss function when learning deep generative models
(Courty et al., 2017; Arjovsky et al., 2017). The main bottleneck of the application of optimaltransport distance is its high computational complexity. This problem has been eased greatly by the SinkhornKnopp algorithm (Sinkhorn and Knopp, 1967). Specifically, by adding an entropy regularizer (Cuturi, 2013), the optimaltransport distance can be approximated via iterative Bregman projection (Benamou et al., 2015). The algorithm achieves nearlineartime complexity (Altschuler et al., 2017), and its convergence and stability can be further improved via the inexact proximal point method (Xie et al., 2018). However, the application of optimal transport to problems in healthcare has not been widely investigated.5 Experiments
5.1 Comparisons on procedure recommendation
To demonstrate the effectiveness of the proposed embedding method, we test it on the three datasets introduced in Section 2 and compare it with its variants and existing embedding methods. Specifically, we denote our Embedding method with Self and MutualAttention mechanisms as E+SA+MA. Its variants include: 1) setting the fusion function to a maxpooling function, and learning without optimal transportbased mutualattention regularizer (, E+Pooling); 2) using the proposed selfattention network as the fusion function, but learning without the mutualattention regularizer (, E+SA); and 3) using maxpooling as the fusion function but considering the mutualattention regularizer (, E+MA
). These variants provide understanding of the significance of the self and mutualattention mechanisms. The key hyperparameters of the proposed method include the dimension of the embedding vector
, the weight of the mutualattention regularizer and the number of heads in the selfattention network . In all models we set , , and . The robustness of our method to these hyperparameters is analyzed in Section 5.2. All the methods learn ICD code embeddings via Adam (Kingma and Ba, 2014), setting the learning rate to , batch size as, and the number of epochs as
.We also consider the following methods as baselines: 1) the Word2Vec method (Mikolov et al., 2013), which enumerates all possible diseaseprocedure pairs in each admission, and learns the ICD code embeddings to maximize the loglikelihood of the pairs; 2) the classical Bayesian personalized ranking method (BPR) (Rendle et al., 2009), which maximizes the coherency between the averaged disease embedding and the procedure embeddings in each admission; 3) the distilled Wasserstein method (DWL) (Xu et al., 2018), that learns the ICD code embeddings via a hierarchical optimal transportbased method; and 4) the GromovWasserstein learning method (GWL) (Xu et al., 2019), that learns the ICD code embeddings with structural regularization on the embeddings within the domain of diseases and that of procedures. Similar to our method, these baselines also learn ICD code embeddings with dimension .
Dataset  Method  Top1 (%)  Top3 (%)  Top5 (%)  Top10 (%)  

R  P  F1  R  P  F1  R  P  F1  R  P  F1  
Word2Vec  19.5  47.8  24.7  35.4  34.9  30.8  47.1  29.6  32.0  62.3  21.1  28.5  
28,315  DWL  19.7  48.2  25.0  35.9  35.2  31.3  47.5  30.3  32.4  63.0  20.9  28.7 
Admissions  GWL  13.1  44.9  18.0  19.2  25.2  18.8  23.9  21.6  19.5  40.1  20.0  24.3 
247  BPR  23.5  57.6  29.8  44.8  43.5  38.7  56.8  35.7  38.8  73.1  24.8  33.6 
Diseases  E+Pooling  24.5  58.5  30.9  46.2  44.0  39.6  58.0  35.7  39.1  74.5  25.1  34.0 
75  E+SA  24.4  58.4  30.8  45.8  43.7  39.2  57.2  35.7  38.9  73.6  24.9  33.7 
Procedures  E+MA  23.7  57.4  30.0  45.2  43.2  38.8  57.4  35.6  38.8  74.2  25.0  33.8 
E+SA+MA  24.8  59.7  31.3  46.4  44.1  39.7  58.7  36.3  39.7  74.9  25.3  34.2  
Word2Vec  7.8  27.6  10.9  27.7  30.5  25.1  38.3  26.9  27.7  52.8  20.1  26.1  
30,535  DWL  8.0  27.5  11.1  27.9  30.8  25.2  39.5  27.0  27.9  53.9  20.9  27.4 
Admissions  GWL  9.2  36.9  13.2  11.8  15.9  11.6  12.7  10.9  10.0  15.8  7.8  8.9 
874  BPR  10.2  35.8  14.9  38.6  40.2  34.3  49.3  33.3  34.9  65.2  23.8  31.4 
Diseases  E+Pooling  10.1  35.4  14.3  38.0  39.7  33.9  50.0  33.6  35.3  65.6  24.1  31.7 
258  E+SA  18.2  50.0  23.6  36.6  39.1  33.0  48.7  33.1  34.6  66.0  24.1  31.8 
Procedures  E+MA  12.4  32.4  15.7  38.7  39.3  34.0  50.9  33.9  35.7  67.6  24.7  32.5 
E+SA+MA  20.3  53.1  26.1  40.7  42.3  36.2  53.0  35.4  37.2  68.9  25.1  33.1  
Word2Vec  5.3  22.9  8.7  14.6  21.1  15.3  24.8  21.0  20.1  41.1  17.7  22.2  
31,213  DWL  5.6  23.0  9.0  14.9  21.3  15.6  24.8  21.4  20.5  42.0  18.2  23.0 
Admissions  GWL  5.5  19.0  7.5  5.6  6.6  5.1  5.7  4.2  4.1  6.0  2.3  2.9 
2,765  BPR  7.3  26.7  10.2  23.0  27.1  21.2  38.4  27.6  27.9  56.6  21.7  28.0 
Diseases  E+Pooling  7.4  27.3  10.4  16.5  23.3  17.1  38.7  27.3  27.8  58.3  22.0  28.6 
819  E+SA  8.0  28.2  11.1  20.0  26.0  19.7  36.5  26.5  26.6  56.6  21.7  28.0 
Procedures  E+MA  6.9  25.7  9.7  18.5  22.7  17.5  33.2  23.4  23.8  56.5  21.5  27.8 
E+SA+MA  8.5  27.9  11.4  23.1  27.2  21.4  39.1  27.9  28.4  60.0  22.8  29.6 
For each dataset, we use 80% of the admissions for training and 20% for testing. In the testing phase, we evaluate each embedding method for the task of recommending procedures according to diagnosed diseases. Similar to other works of recommendation systems, we evaluate the performance of each method in the testing phase via top precision, recall, and F1score. For the th testing admission, , each method recommends a set of procedures with length , denoted . Given the groundtruth set , we calculate the top precision, recall and F1score as follows:
For each method, the embedding model is learned on the training admissions with 10fold cross validation. In particular, the model is learned on 9fold of training admission, validated on the remaining fold, and tested on the testing admission. The averaged top measurements of each model, with , are recorded in Table 1.^{4}^{4}4
The 90%confidence intervals of all the measurements are calculated as well. However, we find that the range of the confidence interval for all the methods and measurements are smaller than
, which means that our method and the alternatives have stable performance on our datasets. Therefore, here we just show averaged measurements in Table 1. We find that the proposed E+SA+MA method outperforms the alternatives consistently across different datasets and for all measurements. Specifically, among existing methods, only BPR is comparable to that of E+Pooling (the simplest variant of our method). Considering selfattention or mutualattention mechanism indeed boosts the recommendation result in most situations, as shown in the rows of E+SA and E+MA. Accordingly, combining these two mechanisms jointly (, our E+SA+MA) can achieve the best performance.5.2 Robustness to hyperparameters
As discussed above, the dimension of the embedding vectors, the weight of mutualattention regularizer , and the number of selfattention heads are important for our method, having significant influence on the recommendation results. Figure 4 illustrates the influence of these hyperparameters on the F1scores derived by our method on the small dataset. In Figure 4, we find that when the dimension of embedding falls in the range from to , the performance of our method is relatively stable. Setting achieves slightly better performance. When the dimension is too small, , or , the embeddings are not representative enough, which leads to underfitting. In Figure 4 we set . The proposed method obtains comparable F1scores when falls in the range to . When is too large, , , the proposed regularizer is too strong and becomes dominant in the loss function. As a result, the model suffers from serious model misspecification and the recommendation results degrade accordingly. In Figure 4, we set . When , the selfattention network only contains one attention layer with one head. With the increase of , the number of parameters in the selfattention network becomes larger and the model becomes more representative. When , the best performance is achieved. If we further increase , our model will have too many parameters and suffer from overfitting. The illustrations on the medium and the large datasets are similar to Figure 4. Based on the analysis above, we empirically set , and .
5.3 Interpretability
The embeddings we learn are found to have good interpretability. Specifically, given a set of diseases, our selfattention network estimates the significance of the diseases. Given predicted procedures, we can further calculate the optimal transport between the observed diseases and the predicted procedures explicitly via the mutualattention regularizer. Figure 5 shows typical procedure recommendation results. Each row corresponds to the result of an admission. For each admission, we observe some diagnosed diseases and recommended top5 procedures. Given the diagnosed diseases, their significance (, ) and their optimal transport (, ) to the recommended procedures are shown. We find that the significance reflects the seriousness of diseases: in Figure 5, the urgent disease “Cardiac arrest” is assigned with the highest significance; in Figure 5, dangerous diseases like “Acute kidney failure” are assigned with high significance. Additionally, the optimal transport estimates reasonable clinical relationships between diseases and procedures. For example, in Figure 5, we find that the disease “End stage renal disease” will transport to its regular procedure “Hemodialysis”, and the disease “Cardiac arrest” will transport to the suitable procedure “Cardiopulmonary resuscitation”. In Figure 5, the disease “Acute and chronic respiratory failure” may yield procedure “Continuous invasive mechanical ventilation”. In Figure 5, the disease “Other pulmonary insufficiency” may lead to “Insertion of endotracheal tube”. More examples can be found in Appendix B.
6 Conclusions and Future Work
A novel embedding method has been proposed for ICD codes, which has self and mutualattention mechanisms and outperforms existing embedding methods in procedure recommendation. The selfattention network in the proposed method achieves an adaptive fusion strategy of disease embeddings, which estimates the significance of various diseases in different admissions. An optimal transportbased mutualattention regularizer is considered in the training phase of our model, estimating the clinical relationship between the diseases and the procedures appearing in the same admissions. These two mechanisms enhance the interpretability of the learned embeddings and improve the recommendation accuracy. In the future, we plan to design moreeffective attention mechanisms and further improve the representation power of embeddings. Additionally, beyond the MIMICIII, we will explore the performance of our method in realworld largescale data.
References
 Altschuler et al. (2017) Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Nearlinear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634, 2017.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017.
 Bajor et al. (2018) Jacek M Bajor, Diego A Mesa, Travis J Osterman, and Thomas A Lasko. Embedding complexity in the data representation instead of in the model: A case study using heterogeneous medical data. arXiv preprint arXiv:1802.04233, 2018.
 Baumel et al. (2017) Tal Baumel, Jumana NassourKassis, Michael Elhadad, and Noemie Elhadad. Multilabel classification of patient notes a case study on ICD code assignment. arXiv preprint arXiv:1709.09587, 2017.
 Benamou et al. (2015) JeanDavid Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
 Boissard et al. (2015) Emmanuel Boissard, Thibaut Le Gouic, JeanMichel Loubes, et al. Distribution’s template estimate with wasserstein metrics. Bernoulli, 21(2):740–759, 2015.
 Chen et al. (2018a) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. Sequential recommendation with user memory networks. In WSDM, pages 108–116. ACM, 2018a.
 Chen et al. (2018b) Xu Chen, Yongfeng Zhang, Hongteng Xu, Zheng Qin, and Hongyuan Zha. Adversarial distillation for efficient recommendation with external knowledge. ACM Transactions on Information Systems (TOIS), 37(1):12, 2018b.
 Choi et al. (2016a) Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier TejedorSojo, and Jimeng Sun. Multilayer representation learning for medical concepts. In KDD, 2016a.
 Choi et al. (2016b) Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In NIPS, pages 3504–3512, 2016b.
 Choi et al. (2017a) Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. Gram: graphbased attention model for healthcare representation learning. In KDD, pages 787–795. ACM, 2017a.
 Choi et al. (2017b) Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarsetofine question answering for long documents. In ACL, volume 1, pages 209–220, 2017b.
 Courty et al. (2017) Nicolas Courty, Rémi Flamary, and Mélanie Ducoffe. Learning Wasserstein embeddings. arXiv preprint arXiv:1710.07457, 2017.
 Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
 Devlin et al. (2018) Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Goodfellow et al. (2018)
Sebastian D Goodfellow, Andrew Goodwin, Robert Greer, Peter C Laussen, Mjaye
Mazwi, and Danny Eytan.
Towards understanding ecg rhythm classification using convolutional neural networks and attention mappings.
In Machine Learning for Healthcare Conference, pages 83–101, 2018.  Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.
 Harutyunyan et al. (2017) Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
 Herlocker et al. (1999) Jonathan L Herlocker, Joseph A Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, pages 230–237. Association for Computing Machinery, Inc, 1999.
 Huang et al. (2018) Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. An empirical evaluation of deep learning for ICD9 code assignment using MIMICIII clinical notes. arXiv preprint arXiv:1802.02311, 2018.
 Ilse et al. (2018) Maximilian Ilse, Jakub M Tomczak, and Max Welling. Attentionbased deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.
 Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Liwei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMICIII, a freely accessible critical care database. Scientific data, 3:160035, 2016.
 Kang and McAuley (2018) WangCheng Kang and Julian McAuley. Selfattentive sequential recommendation. arXiv preprint arXiv:1808.09781, 2018.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kool and Welling (2018) WWM Kool and M Welling. Attention solves your tsp. arXiv preprint arXiv:1803.08475, 2018.
 Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured selfattentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
 Mahmood et al. (2018) Rafid Mahmood, Aaron Babier, Andrea McNiven, Adam Diamant, and Timothy CY Chan. Automated treatment planning in radiation therapy using generative adversarial networks. In Machine Learning for Healthcare Conference, pages 484–499, 2018.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 Mullenbach et al. (2018) James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695, 2018.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
 Perozzi et al. (2014) Bryan Perozzi, Rami AlRfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD, pages 701–710, 2014.
 Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pretraining. https://blog.openai.com/languageunsupervised/, 2018.

Rendle et al. (2009)
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars SchmidtThieme.
Bpr: Bayesian personalized ranking from implicit feedback.
In
Proceedings of the twentyfifth conference on uncertainty in artificial intelligence
, pages 452–461. AUAI Press, 2009.  Shi et al. (2017) Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P Xing. Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075, 2017.
 Sinkhorn and Knopp (1967) Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
 Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Largescale information network embedding. In WWW, pages 1067–1077, 2015.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
 Villani (2008) Cédric Villani. Optimal transport: Old and new, volume 338. Springer Science & Business Media, 2008.
 Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, pages 2692–2700, 2015.
 Xie et al. (2018) Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.
 Xu et al. (2018) Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. Distilled wasserstein learning for word embedding and topic modeling. In NIPS, pages 1723–1732, 2018.
 Xu et al. (2019) Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin. Gromovwasserstein learning for graph matching and node embedding. arXiv preprint arXiv:1901.06003, 2019.
 Ye et al. (2017) Jianbo Ye, Panruo Wu, James Z Wang, and Jia Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017.
Appendix A Proximal Gradient Method for Optimal Transport
Mathematically, the optimaltransport distance can be defined as follows (Villani, 2008): Let be an arbitrary space with metric and the set of Borel probability measures on . For probability measures and in , their optimaltransport distance is
(9) 
where is the set of all probability measures on with and as marginals. When the metric is Euclidean, the optimal transport distace corresponds to wellknown Wasserstein distance (Villani, 2008; Cuturi, 2013). When is not a valid metric, the optimaltransport distance corresponds to the classcial Monge–Kantorovich transportation problem.
For the discrete case in our work, given a set of diseases , a set of procedures , and their distributions and , the definition in (9) can be reformulated as (6), where the cost matrix is calculated via (5).
We solve (6) via the proximal gradient method proposed in (Xie et al., 2018). In particular, this method solves (6) iteratively. In the th iteration, a proximal term is added to original problem as
(10) 
where is the optimal transport learned in previous iteration, and the proximal term is .
The optimization problem in (10) can be rewritten as
(11) 
where the second term is the entropy regularizer used in (Cuturi, 2013). Accordingly, SinkhornKnopp iteration can be applied to solve (11) effectively. In summary, the proposed proximal gradient method is shown in Algorithm 2. This method has linear convergence and has good numerical stability. More detailed analysis can be found in (Xie et al., 2018; Xu et al., 2019).
Appendix B Typical Examples of Learning Results
In Figures 6 and 7, we visualize the learning results of 6 admissions. For each admission, the significance of its diagnosed diseases (, ) and the optimal transport between the diseases and the recommended top5 procedures (, ) are shown. We can find that the significance learned by our method often indicates the main diseases or the most serious diseases in the admissions. The nonzero elements in the optimal transports often correspond to the pairs of the diseases and their related procedures.