Introduction
Multi-task learning aims to jointly solve different learning tasks, while leveraging appropriate information sharing across all tasks [thrun1996learning, caruana1997multitask]. It has been shown that learning under a multi-task setting usually yields enhanced performance relative to separately building single-task models [sermanet2013overfeat, hashimoto2016joint, ruder2017overview]. However, multi-task learning has primarily been considered for homogeneous tasks that share the same objective (e.g., the same set of labels) [baxter1997bayesian, bakker2003task, yu2005learning, luo2015multi]. Real-world tasks are often heterogeneous [jin2014multi], meaning that each task potentially has a different objective and relies on complicated, often unobserved, interactions. Examples of tasks with different objectives include classification, regression, and recommendation.
From the perspective of generative models, heterogeneous tasks often correspond to distinct generative processes. This implies that traditional generative multi-task learning methods [baxter1997bayesian, bakker2003task, yu2005learning, zhang2008flexible], which typically generalize a single class of generative model to multiple tasks, are not appropriate. Under these circumstances, a new mechanism is required to leverage relationships across entities from different tasks.
To overcome the aforementioned challenges, we propose a graph-driven generative model to learn heterogeneous tasks in a unified framework. Taking advantage of the graph structure that commonly appears in real-world data, the proposed model treats feature views, entities and their relationships as nodes and edges in a graph, and formulates learning heterogeneous tasks as instantiating different subgraphs of the global data graph. Specifically, a subgraph contains the feature views and the entities related to a task, together with their interactions. Both the feature views and the interactions can be reused across all tasks, while the representations of the entities are specialized for each task. We combine a shared graph convolutional network (GCN) [kipf2016semi] with multiple variational autoencoders (VAEs) [kingma2013auto]. The GCN serves as a generator of latent representations for the subgraphs, while the VAEs are specified to address the different tasks. The model is then optimized jointly over the objectives of all tasks, encouraging the GCN to produce representations that can be used simultaneously by all of them.
In health care, our motivating example, ICD (International Statistical Classification of Diseases) codes for diseases and procedures can be used as a source of information for multiple tasks,
e.g., modeling the clinical topics of admissions, recommending procedures according to diseases, and predicting admission types. These three tasks require capturing the clinical relationships among ICD codes and admissions. A given admission is associated with a set of disease and procedure codes (i.e., feature views). However, the admission has to be organized with different views (i.e., specialized entities) for tasks with different objectives. For instance, topic modeling is an unsupervised task that needs both procedures and diseases, admission-type prediction is a supervised task also using both procedures and diseases, and procedure recommendation is a supervised task that only uses disease codes. In the context of our work, ICD codes and hospital admissions constitute a graph, as shown in Figure 1. The edges between ICD codes, and those between ICD codes and admissions, are weighted according to their coherency. The ICD codes are embedded during training, and these embeddings are used to specialize the embeddings of admissions for different tasks. At test time, the GCN is used to represent subgraphs, i.e., collections of shared ICD codes, specialized admissions and their interactions, which feed into different task-specific VAEs. We test our model on the three tasks described above. Experimental results show that the jointly learned representation of the admission graph indeed improves the performance of all tasks relative to individual single-task models.
Proposed Model
To solve heterogeneous multi-task learning from a generative-model perspective, a natural solution is to model multiple generative processes, one for each task. In particular, given $T$ tasks, the $t$-th task is associated with training data $\{(x_t, y_t)\}$, where $y_t$ represents the target variable and $x_t$ represents the observed variable associated with $y_t$. We propose using $T$ VAEs [kingma2013auto] for modeling $\{p(y_t)\}_{t=1}^T$ in terms of latent variables $\{z_t\}_{t=1}^T$, where each $z_t$ is inferred from $x_t$ using a task-specific inference network. Note that here the term VAE is used loosely, in the sense that $x_t$ and $y_t$ need not be the same. The generative processes are defined as

$$z_t \sim p(z_t), \qquad y_t \sim p_{\theta_t}(y_t \mid z_t), \qquad t = 1, \dots, T, \tag{1}$$

with corresponding inference networks specified as

$$q_{\phi_t}(z_t \mid x_t) = q_{\phi_t}\big(z_t \mid f(x_t)\big), \qquad t = 1, \dots, T. \tag{2}$$

For the $t$-th task, $p_{\theta_t}(y_t \mid z_t)$ represents a generative model (i.e., a stochastic decoder) with parameters $\theta_t$, and $p(z_t)$ is the prior distribution for latent code $z_t$. The corresponding inference network for $z_t$ consists of two parts: ($i$) a deterministic encoder $f(\cdot)$, shared across all tasks, that encodes each $x_t$ into $h_t = f(x_t)$ independently; and ($ii$) a stochastic encoder with parameters $\phi_t$ that stochastically maps $h_t$ into latent code $z_t$. The distribution $q_{\phi_t}(z_t \mid x_t)$ serves as an approximation to the unknown true posterior. Note that since the $\{y_t\}$ are in general associated with heterogeneous tasks, they may represent different types of information; for example, they can be labels for classification or bags-of-words for topic modeling. Motivated by the intuition that real-world tasks are likely to be latently related to each other, using a shared representation can be beneficial as a means to consolidate information in a way that allows tasks to leverage information from each other.
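The shared deterministic encoder feeding several task-specific stochastic encoders can be sketched in a few lines. The following is a toy NumPy illustration, with made-up layer sizes and single-layer encoders, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(x, W):
    """Deterministic encoder f shared by all tasks (here a single tanh layer)."""
    return np.tanh(x @ W)

def task_encoder(h, W_mu, W_ls):
    """Task-specific stochastic encoder: draws a sample of the Gaussian
    q(z | x) from the shared features h via the reparameterization trick."""
    mu, log_sigma = h @ W_mu, h @ W_ls
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps, mu, log_sigma

# Toy sizes: batch of 4 inputs (dim 8), shared features (dim 6), latents (dim 3), 2 tasks.
x = rng.standard_normal((4, 8))
W_shared = rng.standard_normal((8, 6)) * 0.1
task_heads = [(rng.standard_normal((6, 3)), rng.standard_normal((6, 3))) for _ in range(2)]

h = shared_encoder(x, W_shared)                         # computed once, reused by all tasks
latents = [task_encoder(h, Wm, Ws)[0] for Wm, Ws in task_heads]
```

The key point is that `h` is computed once and reused, while each task keeps its own stochastic head.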
In likelihood-based learning, the goal of heterogeneous multi-task learning is to maximize the empirical expectation of the log-likelihoods $\{\log p_{\theta_t}(y_t)\}_{t=1}^T$ with respect to the data provided for each task. Since the marginal likelihood rarely has a closed-form expression, the VAE instead maximizes the following evidence lower bound (ELBO), which bounds each marginal log-likelihood from below:

$$\log p_{\theta_t}(y_t) \ \ge\ \mathbb{E}_{q_{\phi_t}(z_t \mid x_t)}\big[\log p_{\theta_t}(y_t \mid z_t)\big] \;-\; \mathrm{KL}\big(q_{\phi_t}(z_t \mid x_t)\,\big\|\,p(z_t)\big). \tag{3}$$
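For a Gaussian approximate posterior and a standard normal prior, the KL term of the ELBO has a closed form, so the bound can be estimated as below. This is a minimal sketch in which the reconstruction term is passed in as a precomputed Monte-Carlo estimate:

```python
import numpy as np

def gaussian_kl(mu, log_sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent
    dimensions and averaged over the batch."""
    return float(np.mean(np.sum(
        0.5 * (np.exp(2 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma), axis=1)))

def elbo(expected_log_lik, mu, log_sigma):
    """ELBO estimate: E_q[log p(y|z)] - KL(q(z|x) || p(z))."""
    return expected_log_lik - gaussian_kl(mu, log_sigma)

mu = np.zeros((2, 3))
log_sigma = np.zeros((2, 3))
# When q equals the prior N(0, I), the KL term vanishes and the ELBO reduces
# to the expected reconstruction log-likelihood alone.
bound = elbo(-5.0, mu, log_sigma)
```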
However, for heterogeneous tasks, features are often organized in different views, and the interactions between observed entities can likewise differ. As a result, it is challenging to find a common encoder $f(\cdot)$ for observations $\{x_t\}_{t=1}^T$ with incompatible formats, or even residing in incomparable data spaces.
Fortunately, such data can often be modeled as a data graph, whose nodes correspond to the entities appearing in different tasks and whose edges capture their complex interactions. Accordingly, different tasks reorganize the graph and leverage its information from shared but different views. Specifically, we represent a data graph as $G = \{V, A, X\}$, where $V$ is the set of nodes corresponding to the observed entities, $A$ is the adjacency matrix of the graph, and $X = \cup_{t=1}^T X_t$ is a union of (trainable) feature sets. For the $t$-th task, $X_t$ is its feature set, defined over $V_t \subseteq V$, the nodes related to the task, with one feature per node. Based on $G$, the observations of the $t$-th task correspond to a subgraph of $G$, i.e., $G_t = \{V_t, A_t, X_t\}$, where $A_t$ selects the rows and columns of $A$ indexed by $V_t$. In such a situation, instead of finding a unified inference network for each individual observation in different tasks, we define an inference network for the subgraphs based on a graph convolutional network (GCN) [kipf2016semi], i.e., implementing $f(\cdot)$ in (2) as a GCN with parameters $\eta$, so that $h_t = \mathrm{GCN}_\eta(A_t, X_t)$; hence a large portion of the parameters of the inference network is shared among tasks.
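Instantiating a task subgraph amounts to index selection on the adjacency and feature matrices; a minimal sketch:

```python
import numpy as np

def task_subgraph(A, X, node_idx):
    """Instantiate the subgraph for one task: keep the rows/columns of the
    adjacency matrix and the feature rows indexed by the task's nodes."""
    idx = np.asarray(node_idx)
    return A[np.ix_(idx, idx)], X[idx]

# Toy global graph with 4 nodes; a task that only uses nodes 0 and 2.
A = np.arange(16, dtype=float).reshape(4, 4)
X = np.arange(8, dtype=float).reshape(4, 2)
A_t, X_t = task_subgraph(A, X, [0, 2])
```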
The independent generative processes with a shared GCN-based inference network match the nature of heterogeneous tasks. In particular, the subgraphs of different tasks are derived from the same data graph with partially shared nodes and edges, enabling joint learning of latent variables through the shared inference network. The inferred latent variables then pass through different generative models under the guidance of the different tasks. In the next section, we show that this model is suitable for challenging healthcare tasks.
Typical Specification for Healthcare Tasks
Observations, tasks, and proposed data graph
To demonstrate the feasibility of our model, we describe a specification that solves tasks associated with hospital admissions. Let $\mathcal{C}_d$ and $\mathcal{C}_p$ denote the sets of disease and procedure ICD codes, respectively, i.e., each $c \in \mathcal{C}_d$ represents a specific disease and each $c \in \mathcal{C}_p$ represents a specific procedure. Suppose we observe $N$ hospital admissions, denoted $\{a_n\}_{n=1}^N$. Each $a_n$ is associated with some ICD codes and a label representing its type, i.e., $a_n = (\mathcal{C}_n, y_n)$ for $n = 1, \dots, N$, where $\mathcal{C}_n \subset \mathcal{C}_d \cup \mathcal{C}_p$ and $y_n$ is an element of the set of admission types $\mathcal{Y}$. Based on these observations, we consider three healthcare tasks: 1) clinically-interpretable topic modeling of admissions; 2) procedure recommendation; and 3) admission-type prediction.
As illustrated in Figure 1, the observations above can be represented as an admission graph $G = \{V, A, X\}$, where the node set $V = \mathcal{C}_d \cup \mathcal{C}_p \cup \{a_n\}_{n=1}^N$ and $A$ is the adjacency matrix. The union of feature sets $X = X_c \cup X_a$, where:

$X_c$ contains trainable vector embeddings of the ICD codes for diseases and procedures. These embeddings are reused across the different tasks.

$X_a$ contains the embeddings of admissions for the different tasks. Specifically, for each admission $a_n$, its embedding in the $t$-th task is derived from the aggregation of the embeddings of its ICD codes in $\mathcal{C}_n^t$, the set of ICD codes associated with task $t$. For topic modeling and admission-type prediction, $\mathcal{C}_n^t = \mathcal{C}_n$, while for procedure recommendation, in which the procedure codes are unavailable, $\mathcal{C}_n^t = \mathcal{C}_n \cap \mathcal{C}_d$. Given this admission graph $G$, the three healthcare tasks correspond to different subgraphs $\{G_t\}$, which yields a typical heterogeneous scenario. Table 1 highlights their differences in target variables and subgraphs. Although the subgraphs specialize the information of the admission nodes, they reuse the representations of the ICD code nodes and the edges in $A$.

Construction of edges
Inspired by existing research [matveeva2006document, chen2013alternative, rekabsaz2017toward, yao2018graph], we enrich the representational power of our model with meaningful population statistics, considering two types of edges in the adjacency matrix.
Edges between ICD codes. ICD codes appear coherently in many admissions, e.g., diabetes and its comorbidities such as cardiovascular disease. Accordingly, edges between ICD codes with high coherency should be weighted heavily. Based on this principle, we apply pointwise mutual information (PMI), a commonly-used similarity measure in various NLP tasks [levy2014neural, arora2016latent, newman2010automatic, mimno2011optimizing, ogura2013text], as the weight between each pair of ICD codes. Formally, for each pair of ICD codes $(c_i, c_j)$, we evaluate their PMI as

$$\mathrm{PMI}(c_i, c_j) = \log \frac{p(c_i, c_j)}{p(c_i)\,p(c_j)}, \tag{4}$$

where $p(c_i, c_j)$ is the fraction of admissions containing both $c_i$ and $c_j$, and $p(c_i)$ is the fraction of admissions containing $c_i$. A positive PMI value indicates that the ICD codes in the pair are highly correlated with each other, while a negative PMI value implies weak or no correlation. Therefore, we only use positive PMI values as edge weights.
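Estimating these PMI weights from a corpus of admissions reduces to counting code occurrences and co-occurrences. A small self-contained sketch (function and variable names are made up):

```python
import math
from collections import Counter
from itertools import combinations

def positive_pmi_edges(admissions):
    """Edge weights between ICD codes: pointwise mutual information estimated
    from co-occurrence across admissions, keeping only positive values.
    p(i) = fraction of admissions containing code i; p(i, j) = fraction
    containing both; PMI(i, j) = log p(i, j) / (p(i) p(j))."""
    n = len(admissions)
    single, pair = Counter(), Counter()
    for codes in admissions:
        codes = sorted(set(codes))
        single.update(codes)
        pair.update(combinations(codes, 2))
    edges = {}
    for (i, j), c in pair.items():
        pmi = math.log(c * n / (single[i] * single[j]))
        if pmi > 0:                      # negative PMI -> no edge
            edges[(i, j)] = pmi
    return edges

adm = [{"d1", "d2"}, {"d1", "d2"}, {"d3"}, {"d1", "d3"}]
edges = positive_pmi_edges(adm)
```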
Edges between ICD codes and admissions. Analogous to the relationship between words and documents, we weight the edge between an ICD code and an admission using the term frequency-inverse document frequency (TF-IDF, https://en.wikipedia.org/wiki/Tf–idf). The term frequency (TF) is the normalized number of times an ICD code appears in an admission, and the inverse document frequency (IDF) is the log-scaled inverse fraction of the number of admissions that contain the ICD code. The TF-IDF is the element-wise multiplication of TF and IDF, which quantifies how important an ICD code is to an admission [onan2016ensemble, shen2018baseline].
Summarizing the above, the elements of the adjacency matrix are

$$A_{ij} = \begin{cases} \max\big(\mathrm{PMI}(c_i, c_j),\, 0\big), & \text{if } c_i \text{ and } c_j \text{ are both ICD codes}, \\ \mathrm{TFIDF}(c_i, a_j), & \text{if } c_i \text{ is an ICD code and } a_j \text{ is an admission}, \\ 1, & \text{if } i = j, \\ 0, & \text{otherwise}. \end{cases} \tag{5}$$
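The code-admission weights can be computed analogously to the PMI weights; the sketch below implements the TF-IDF weighting described above (these weights, together with the positive-PMI code-code weights, populate the off-diagonal blocks of the adjacency matrix):

```python
import math
from collections import Counter

def tfidf_edges(admissions):
    """Edge weights between ICD codes and admissions. TF is the within-admission
    count of a code normalized by the admission's length; IDF is the log-scaled
    inverse fraction of admissions containing the code."""
    n = len(admissions)
    df = Counter()
    for codes in admissions:
        df.update(set(codes))
    edges = {}
    for a_id, codes in enumerate(admissions):
        counts = Counter(codes)
        for code, c in counts.items():
            tf = c / len(codes)
            idf = math.log(n / df[code])
            edges[(code, a_id)] = tf * idf
    return edges

# Two toy admissions (lists of ICD codes, with possible repetitions).
edges = tfidf_edges([["d1", "d1", "p1"], ["p1"]])
```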
Graph-driven VAEs for different tasks
Focusing on the three tasks mentioned above, we specify our model as a graph-driven variational autoencoder (GDVAE). Specifically, GDVAE consists of: 1) a GCN-based inference network that is shared by all three tasks; and 2) three specialized generative networks that account for the different sets of observations corresponding to the three tasks.
Table 1: Target variables of the three tasks.

Task  Target variable
Topic Modeling  Biterm ICD codes
Procedure Recommendation  List of procedures
Admission-type Prediction  Admission type
Topic modeling of admissions In the context of topic modeling, each ICD code can be considered a word or token, while each admission corresponds to a document, i.e., a collection of ICD codes. However, patient admissions exhibit extreme sparsity, in the sense that only a very small set of codes is associated with each admission. Classic topic models, such as LDA [blei2003latent] and the Neural Topic Model [miao2017discovering], are therefore not appropriate in this case. To circumvent this problem, inspired by [yan2013biterm], instead of modeling a bag-of-ICD-codes for a single admission, we model biterm collections, aggregating all the unordered ICD code pairs (biterms) from several admissions together as one document. The generative process of our proposed Neural Biterm Topic Model (NBTM) is described as follows:

$$\theta \sim \mathrm{Dirichlet}(\alpha), \qquad c_1, c_2 \overset{iid}{\sim} \mathrm{Multinomial}(\beta\theta), \tag{6}$$

where $b = (c_1, c_2)$ is the biterm variable, with $c_1, c_2$ two ICD codes; $\theta$ is the topic distribution; $\alpha$ is the hyperparameter of the Dirichlet prior, a vector of length $K$, where $K$ is the number of topics; and $\beta = [\beta_1, \dots, \beta_K]$ are trainable parameters, each $\beta_k$ representing a learned topic, i.e., a distribution over ICD codes. The marginal likelihood of the entire admission corpus $\mathcal{B}$ can be written as

$$p(\mathcal{B}) = \prod_{(c_1, c_2) \in \mathcal{B}} \int p(\theta)\, p(c_1 \mid \theta, \beta)\, p(c_2 \mid \theta, \beta)\, d\theta. \tag{7}$$
The Dirichlet prior is known to be essential for generating interpretable topics [wallach2009rethinking]. However, it can rarely be applied within a VAE directly, since no effective reparameterization trick exists for the Dirichlet distribution. Fortunately, the Dirichlet distribution can be approximated by a logistic normal with a softmax formulation via the Laplace approximation [hennig2012kernel]. When the number of topics $K$ is large, the Dirichlet distribution can be approximated by a multivariate logistic normal [srivastava2017autoencoding] whose $k$-th mean element and diagonal covariance element are

$$\mu_k = \log \alpha_k - \frac{1}{K}\sum_{i=1}^K \log \alpha_i, \qquad \Sigma_{kk} = \frac{1}{\alpha_k}\Big(1 - \frac{2}{K}\Big) + \frac{1}{K^2}\sum_{i=1}^K \frac{1}{\alpha_i}. \tag{8}$$

Under this approximation, a topic distribution can readily be inferred by applying the reparameterization trick: sampling $\epsilon \sim \mathcal{N}(0, I)$ and computing $\theta = \mathrm{softmax}(\mu + \Sigma^{1/2}\epsilon)$.
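The Laplace approximation and the subsequent reparameterized draw of a topic distribution can be sketched directly from (8); a NumPy sketch, using a symmetric prior only for the demonstration at the end:

```python
import numpy as np

def laplace_approx(alpha):
    """Logistic-normal (Laplace) approximation to Dirichlet(alpha):
    mu_k     = log(alpha_k) - (1/K) sum_i log(alpha_i)
    Sigma_kk = (1 - 2/K) / alpha_k + (1/K^2) sum_i 1/alpha_i   (diagonal)."""
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    log_a = np.log(alpha)
    mu = log_a - log_a.mean()
    var = (1.0 - 2.0 / K) / alpha + np.sum(1.0 / alpha) / K**2
    return mu, var

def sample_theta(mu, var, rng):
    """Reparameterized topic distribution: theta = softmax(mu + sigma * eps)."""
    logits = mu + np.sqrt(var) * rng.standard_normal(mu.shape)
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(1)
mu, var = laplace_approx(np.full(50, 0.02))   # symmetric prior, K = 50 topics
theta = sample_theta(mu, var, rng)
```

For a symmetric prior the approximate mean is the zero vector, and every sampled `theta` lies on the simplex.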
Procedure recommendation In this task, for a given admission, we aim to predict the set of procedures from the set of diseases. Inspired by [liang2018variational], we consider the following generative process for modeling admission procedures:

$$z_R \sim \mathcal{N}(0, I), \qquad y_R \sim \mathrm{Multinomial}\big(M, \pi(z_R)\big), \qquad \pi(z_R) = \mathrm{softmax}\big(g_R(z_R)\big), \tag{9}$$

where $y_R$ is a $|\mathcal{C}_p|$-dimensional variable whose instance is a list of recommended procedures, and $g_R(\cdot)$ is a multi-layer perceptron (MLP). The output of this function is normalized into a probability distribution over procedures, i.e., $\pi(z_R) \in \Delta^{|\mathcal{C}_p|-1}$, where $\Delta$ denotes a simplex. We then derive procedures for the given admission by sampling $M$ times from a multinomial distribution with parameter $\pi(z_R)$.

Admission-type prediction Given an admission, the goal is to predict the admission type from both its diseases and procedures. We consider the following generative process for modeling admission types:
$$z_P \sim \mathcal{N}(0, I), \qquad y_P \sim \mathrm{Multinomial}\big(1, \pi(z_P)\big), \qquad \pi(z_P) = \mathrm{softmax}\big(g_P(z_P)\big), \tag{10}$$

where $y_P$ is a variable whose instance corresponds to an admission type in the set $\mathcal{Y}$, and $g_P(\cdot)$ is another MLP, whose output is normalized into a distribution over admission types, i.e., $\pi(z_P) \in \Delta^{|\mathcal{Y}|-1}$. Finally, the instance of $y_P$ (the type of the given admission) is sampled once from a multinomial distribution with parameter $\pi(z_P)$.
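Both decoders share the same multinomial decoding pattern (a softmax over an MLP output, then multinomial sampling). A minimal sketch in which a single linear map stands in for the MLP:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decode(z, W, n_draws, rng):
    """Decode a latent code into a categorical distribution over outputs and
    sample from the corresponding multinomial: n_draws > 1 yields a set of
    procedures, n_draws = 1 a single admission type. The linear map W is a
    stand-in for the MLP used in the model."""
    pi = softmax(z @ W)
    return pi, rng.multinomial(n_draws, pi)

rng = np.random.default_rng(2)
z = rng.standard_normal(3)
W = rng.standard_normal((3, 5))
pi, counts = decode(z, W, n_draws=4, rng=rng)
```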
Inference with a shared GCN The proposed model unifies the three tasks by sharing a common GCN-based inference network. Specifically, the posteriors of the three latent variables are

$$q_{\phi_t}(z_t \mid G_t) = \mathcal{N}\big(z_t \mid \mu_t,\ \mathrm{diag}(\sigma_t^2)\big), \qquad t \in \{T, R, P\}, \tag{11}$$

where $[\mu_t, \sigma_t]$ is produced by a task-specific stochastic encoder on top of the shared representation $H_t = \mathrm{GCN}_\eta(A_t, X_t)$, $\mathrm{diag}(\cdot)$ constructs a diagonal matrix, and $t \in \{T, R, P\}$ indexes topic modeling, procedure recommendation and admission-type prediction, respectively.
Let $\{\theta_t\}$ denote the parameters of the generative networks for topic modeling, procedure recommendation and admission-type prediction. In summary, all parameters are optimized jointly by maximizing the sum of the ELBOs in (3) over the three tasks.
Related Work
Multi-task learning Early multi-task learning methods learn a shared latent representation [thrun1996learning, caruana1997multitask, baxter1997bayesian], or impose structural constraints on the features shared across tasks [ando2005framework, chen2009convex]. The work in [he2011graphbased] proposed a graph-based framework leveraging information across multiple tasks and multiple feature views. Following this work, the methods in [zhang2012inductive, jin2013shared] applied structural regularizers across different feature views and tasks, ensuring that the learned predictive models are similar across tasks. However, these methods require that the tasks directly share some label-dependent information with each other, which is only applicable to homogeneous tasks. Focusing on heterogeneous tasks, many discriminative methods have been proposed, which map the original heterogeneous features to a shared latent space through linear or nonlinear functions [zhang2011multi, jin2014multi, liu2018learning] or sparsity-driven feature selection [yang2009heterogeneous, jin2015heterogeneous], and solve heterogeneous tasks jointly in the framework of discriminant analysis. Generative models have achieved remarkable success in the past few years [wang2017topic, wang2018zero, wang2019improving]. However, to our knowledge, generative solutions to heterogeneous multi-task learning have not been fully investigated.

ICD code embedding and analysis of healthcare data Machine learning techniques have shown potential in many healthcare problems, e.g., ICD code assignment [shi2017towards, baumel2017multi, mullenbach2018explainable, huang2018empirical], admission prediction [ma2017dipole, liu2018early, xu2017patient], mortality prediction [harutyunyan2017multitask, xu2018distilled], procedure recommendation [mao2019medgcn], and medical topic modeling [choi2017gram, suo2018deep]. Although these tasks have different objectives, they often share the same electronic health record data, e.g., admission records. To learn multiple healthcare tasks jointly, various multi-task learning methods have been proposed [wang2014multi, alaa2017bayesian, suo2017multi, harutyunyan2017multitask, mao2019medgcn]. Traditional multi-task learning methods imposed structural regularizers on the features shared by different tasks [argyriou2007multi]. The work in [mao2019medgcn] applied GCNs [kipf2016semi] to extract features and jointly train models for medication recommendation and lab test imputation, which constitutes an attempt to apply GCNs to multi-task learning. However, introducing GCNs into the framework of generative heterogeneous multi-task learning remains unexplored, a gap that this paper seeks to address.
Experiments
We test our method (GDVAE) on the MIMIC-III dataset [johnson2016mimic], which contains more than 58,000 hospital admissions involving 14,567 disease ICD codes and 3,882 procedure ICD codes. Each admission consists of a set of disease and procedure ICD codes. Three subsets of the MIMIC-III dataset are considered, with summary statistics in Table 2. The subsets are generated by thresholding the frequency of ICD codes, i.e., the ICD codes appearing at least 500/100/50 times, together with the corresponding non-empty admissions, constitute the small/median/large subset, respectively.
To demonstrate the effectiveness of our method, we compare GDVAE with state-of-the-art approaches on each of the healthcare tasks mentioned above. Specifically, (1) for topic modeling, we compare with LDA [blei2003latent], AVITM [srivastava2017autoencoding] and BTM [yan2013biterm]. (2) For procedure recommendation, we compare with Bayesian Personalized Ranking (BPR) [rendle2009bpr], Distilled Wasserstein Learning (DWL) [xu2018distilled], and a VAE model designed for collaborative filtering (VAECF) [liang2018variational]. We also compare with a baseline based on Word2Vec [mikolov2013efficient], which enumerates all possible disease-procedure pairs in each admission, and then recommends procedures according to the similarity between their embeddings and those of the diseases. (3) For admission-type prediction, we consider the following baselines: TFIDF (combined with a linear classifier), Word2Vec (learning ICD code embeddings with Word2Vec [mikolov2013efficient], and using the mean of the learned embeddings to predict the label), FastText [joulin2016bag], SWEM [shen2018baseline] and LEAM [wang2018joint]. We use "T", "R" and "P" to denote topic modeling, procedure recommendation and admission-type prediction, respectively. GDVAE learns the three tasks jointly. To further verify the benefits of multi-task learning, we consider variations of our method that learn only one or two tasks, e.g., GDVAE (T) means only learning a topic model, and GDVAE (TR) indicates joint learning of topic modeling and procedure recommendation.

Method  Small  Median  Large
T=10  T=30  T=50  T=10  T=30  T=50  T=10  T=30  T=50  
LDA [blei2003latent]  0.110  0.106  0.098  0.123  0.102  0.107  0.101  0.106  0.103 
AVITM [srivastava2017autoencoding]  0.132  0.125  0.121  0.135  0.110  0.107  0.123  0.116  0.108 
BTM [yan2013biterm]  0.117  0.109  0.105  0.127  0.108  0.105  0.104  0.110  0.107 
GDVAE (T)  0.142  0.141  0.135  0.140  0.137  0.132  0.128  0.129  0.123 
GDVAE (TP)  0.142  0.138  0.136  0.143  0.137  0.134  0.129  0.127  0.125 
GDVAE (TR)  0.147  0.147  0.144  0.146  0.141  0.137  0.136  0.133  0.127 
GDVAE  0.151  0.149  0.145  0.148  0.144  0.140  0.136  0.137  0.131 

The standard deviation for GDVAE and its variants is around 0.003.
Dataset  Method  Top1 (%)  Top3 (%)  Top5 (%)  Top10 (%)  

R  P  F1  R  P  F1  R  P  F1  R  P  F1  
Word2Vec [mikolov2013efficient]  19.5  47.8  24.7  35.4  34.9  30.8  47.1  29.6  32.0  62.3  21.1  28.5  
DWL [xu2018distilled]  19.7  48.2  25.0  35.9  35.2  31.3  47.5  30.3  32.4  63.0  20.9  28.7  
BPR [rendle2009bpr]  23.5  57.6  29.8  44.8  43.5  38.7  56.8  35.7  38.8  73.1  24.8  33.6  
Small  VAECF [liang2018variational]  24.0  57.8  30.7  46.0  43.5  39.3  57.8  35.2  39.1  74.0  24.2  33.8 
GDVAE (R)  24.8  58.2  31.1  46.5  43.4  39.5  58.1  35.3  39.2  74.5  24.4  34.0  
GDVAE (RP)  25.0  58.3  31.3  46.8  43.5  39.5  58.2  35.4  39.2  74.7  24.5  34.1  
GDVAE (RT)  25.4  58.3  31.6  47.0  43.6  39.7  58.5  35.9  39.4  75.2  24.8  34.3  
GDVAE  25.6  58.6  31.8  47.0  43.8  39.8  58.7  36.2  39.6  75.9  25.1  34.5  
Word2Vec [mikolov2013efficient]  7.8  27.6  10.9  27.7  30.5  25.1  38.3  26.9  27.7  52.8  20.1  26.1  
DWL [xu2018distilled]  8.0  27.5  11.1  27.9  30.8  25.2  39.5  27.0  27.9  53.9  20.9  27.4  
BPR [rendle2009bpr]  10.2  35.8  14.9  38.6  40.2  34.3  49.3  33.3  34.9  65.2  23.8  31.4  
Median  VAECF [liang2018variational]  21.2  52.9  26.2  41.2  42.0  36.0  53.4  35.3  37.3  68.2  24.9  32.9 
GDVAE (R)  22.0  55.1  27.9  42.3  41.2  37.2  54.0  35.7  37.8  69.3  25.2  33.1  
GDVAE (RP)  22.3  55.1  28.0  42.7  41.5  37.4  53.7  35.5  37.6  69.6  25.1  33.4  
GDVAE (RT)  22.8  57.8  29.3  43.0  43.5  38.1  54.2  35.9  38.1  70.1  25.2  33.6  
GDVAE  23.2  57.9  29.6  43.2  43.9  38.2  54.6  36.0  38.4  70.4  25.3  33.7  
Word2Vec [mikolov2013efficient]  5.3  22.9  8.7  14.6  21.1  15.3  24.8  21.0  20.1  41.1  17.7  22.2  
DWL [xu2018distilled]  5.6  23.0  9.0  14.9  21.3  15.6  24.8  21.4  20.5  42.0  18.2  23.0  
BPR [rendle2009bpr]  7.3  26.7  10.2  23.0  27.1  21.2  38.4  27.6  27.9  56.6  21.7  28.0  
Large  VAECF [liang2018variational]  17.8  50.1  23.5  35.2  37.9  33.4  47.9  32.4  34.6  63.0  21.7  30.2 
GDVAE (R)  20.1  53.4  25.8  37.2  40.1  35.5  49.1  32.5  35.2  64.6  23.7  31.0  
GDVAE (RP)  20.4  53.3  26.1  37.9  39.7  35.9  49.9  32.7  35.5  65.1  24.0  31.2  
GDVAE (RT)  20.9  56.2  27.2  41.0  42.2  36.5  50.9  35.1  36.6  66.0  24.7  32.5  
GDVAE  21.2  56.4  27.4  40.9  43.0  36.7  51.4  35.2  36.8  66.5  24.9  32.7 

The standard deviation for GDVAE and its variants is less than 0.2.
Configurations of Our Method
We test various methods in 10 trials and record the mean value and standard deviation of the experimental results. In each trial, we split the data into train, validation and test sets with a ratio of 0.6, 0.2 and 0.2, respectively. For the network architecture, we fix the embedding space to be for ICD codes and admissions, and a twolayer GCN [kipf2016semi]
with residual connection is considered for the inference network. In terms of the dimension of latent variable,
is identical to the number of topics for topic modeling and for the other two tasks, and . In the aspect of the generative network, a linear layer is employed for both topic modeling and admission type prediction. For the procedure recommendation, a onehidden layer MLP withas the nonlinear activation function is used. As for the hyperparameters, we merge 10 randomly sampled admissions to generate a topic admission for our NBTM, such that
is not too sparse, and samples are generated so as to train the model. Following [srivastava2017autoencoding], the prior is a vector with constant value 0.02.Topic modeling
Topic coherence [mimno2011optimizing] is used to evaluate the performance of topic modeling methods. This metric is computed from the normalized pointwise mutual information (NPMI), which has been shown to match well with human judgment [lau2014machine]. Table 3 compares the methods on the mean NPMI over the top 5/10/15/20 topic words. We find that LDA [blei2003latent] performs worse than the neural topic models (including ours), which demonstrates the benefit of powerful inference networks for inferring latent topics. In terms of GCN-based methods, GDVAE and its variants capture the global statistics between ICD codes, and between ICD codes and admissions, thus outperforming the three baselines by substantial margins.
Compared with performing topic modeling alone, i.e., GDVAE (T), considering more tasks brings improvements, and the full GDVAE achieves the best performance. In terms of leveraging knowledge across tasks, we find that the improvements are largely contributed by procedure recommendation, and only marginally by admission-type prediction. This is because procedure recommendation accounts for the co-occurrence of disease and procedure codes within an admission, while the topic model considers the co-occurrence of codes across admissions. Both models capture the co-occurrence of ICD codes from different views, and thus naturally enhance each other.
To further verify the quality of the learned topics, we visualize the top-5 ICD codes of several learned topics in the Supplementary Material. We find that the topic words are clinically correlated. For example, the ICD codes related to surgery and those related to urology are concentrated in two respective topics. Additionally, each topic contains both disease and procedure codes, e.g., "d85306" and "p7817" are an orthopedic-surgery-related disease and procedure, respectively, showing that diseases and procedures can be closely correlated, which also suggests potential benefits for procedure recommendation.
Data  Small  Median  Large  
Method  P  R  F1  P  R  F1  P  R  F1 
TFIDF  84.26  87.19  85.18  86.12  88.61  87.22  88.45  89.10  87.76 
Word2Vec [mikolov2013efficient]  85.08  87.89  86.23  86.60  88.87  87.71  87.11  89.16  88.12 
FastText [joulin2016bag]  84.21  87.15  85.29  86.66  88.65  87.39  88.06  89.23  88.00 
SWEM [shen2018baseline]  85.56  88.10  86.77  87.01  89.28  88.12  87.55  89.88  88.67 
LEAM [wang2018joint]  85.34  88.03  86.55  87.03  89.29  88.14  87.61  89.94  88.73 
GDVAE (P)  86.01  88.13  86.91  87.76  89.31  88.51  88.23  90.41  89.30 
GDVAE (TP)  86.18  88.52  87.22  87.82  89.21  88.52  88.31  90.56  89.41 
GDVAE (RP)  86.87  89.38  87.93  88.08  89.57  88.82  89.07  90.98  90.00 
GDVAE  87.00  89.60  88.01  88.19  89.70  88.94  89.14  91.01  90.05 

The standard deviation for GDVAE and its variants is around 0.05 on F1 score.
Procedure recommendation
Similar to [chen2018sequential, xu2018distilled], we use top-$k$ precision, recall and F1-score to evaluate the performance of procedure recommendation. For the $n$-th admission, we denote $\mathcal{R}_n^k$ and $\mathcal{G}_n$ as the top-$k$ list of recommended procedures and the ground-truth procedures, respectively. The top-$k$ precision, recall and F1-score are calculated as

$$P@k = \frac{|\mathcal{R}_n^k \cap \mathcal{G}_n|}{k}, \qquad R@k = \frac{|\mathcal{R}_n^k \cap \mathcal{G}_n|}{|\mathcal{G}_n|}, \qquad F1@k = \frac{2 \cdot P@k \cdot R@k}{P@k + R@k}.$$

Results are provided in Table 4. GDVAE (R) is comparable to previous state-of-the-art algorithms. With additional knowledge learned from topic modeling and admission-type prediction, the results are further improved. Similar to the observation in the previous section, topic modeling contributes more to procedure recommendation than admission-type prediction does, since both topic modeling and procedure recommendation explore the underlying relationship between diseases and procedures.
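These per-admission metrics can be computed directly from the ranked recommendation list and the ground-truth set; a minimal sketch:

```python
def topk_metrics(ranked, truth, k):
    """Top-k precision, recall and F1 for one admission: `ranked` is the model's
    ranked list of procedure codes, `truth` the set of ground-truth procedures."""
    hits = len(set(ranked[:k]) & set(truth))
    precision = hits / k
    recall = hits / len(truth)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# One hit ("p1") in the top-2 list, two ground-truth procedures.
p, r, f1 = topk_metrics(["p1", "p2", "p3"], {"p1", "p4"}, k=2)
```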
Admission-type prediction
As with procedure recommendation, we use precision, recall and F1-score to evaluate the performance of admission-type prediction. Results in Table 5 show that GDVAE outperforms its competitors. It is interesting that, compared with topic modeling, procedure recommendation is more helpful for boosting admission-type prediction. One possible explanation is that the admission type is more relevant to the set of procedures, hence an embedding jointly learned with procedure recommendation can better guide the model toward an accurate prediction; e.g., a surgical procedure is likely to be observed in an urgent admission. Additionally, to better understand the representation learned by GDVAE, we visualize the inferred latent codes with t-SNE [maaten2008visualizing], as shown in Figure 2(a). The visualization shows different admission types clustered into distinct groups.
Rationality of graph construction
To explore the rationality of our graph construction, we compare the proposed admission graph with its variants. In particular, the proposed admission graph considers the PMI between ICD codes and the TF-IDF between ICD codes and admissions (PMI + TFIDF). Its variants include: (1) a simple graph with binary edges between admissions and ICD codes (Binary); (2) a graph considering only the TF-IDF between admissions and ICD codes (TFIDF); and (3) a graph considering the PMI between ICD codes and binary edges between admissions and ICD codes (PMI + Binary). Results for the different graphs are provided in Figures 2(b) through 2(d), which demonstrate that both the PMI edges and the TF-IDF edges contribute significantly to the performance of the proposed GDVAE.
Conclusions
We have proposed a graph-driven variational autoencoder (GDVAE) to learn multiple heterogeneous tasks within a unified framework. This is achieved by formulating entities under different tasks as different types of nodes, and using a shared GCN-based inference network to leverage knowledge across all tasks. Our model is general, in that it can easily be extended to new tasks by specifying the corresponding generative processes. Comprehensive experiments on real-world healthcare datasets demonstrate that GDVAE can better leverage information across tasks, achieving state-of-the-art results on clinical topic modeling, procedure recommendation, and admission-type prediction.
References
GCN-based Inference Network
The graph convolutional network (GCN) [kipf2016semi] has attracted much attention for learning informative representations of nodes and edges, and is promising for tasks with complex relational information [kipf2016semi, hamilton2017inductive, yao2018graph].
Given a graph $G = \{V, A, X\}$, a graph convolution layer derives the node embeddings via

$$H = \sigma\big(\hat{A} X W\big) \in \mathbb{R}^{|V| \times d},$$

where $d$ is the dimension of the feature space after the GCN, $\hat{A} = D^{-1/2} A D^{-1/2}$ is the normalized version of the adjacency matrix, $D$ is a diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$, $W$ is a matrix of trainable graph convolution parameters, and $\sigma(\cdot)$ is a nonlinear activation. The GCN aggregates node information in local neighborhoods to extract local substructure information. In order to incorporate higher-order neighborhoods, we can stack multiple graph convolution layers as

$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big),$$

where $H^{(0)} = X$ and $H^{(l)}$ is the output of the $l$-th graph convolutional layer. However, the GCN can be interpreted as Laplacian smoothing, and repeatedly applying Laplacian smoothing may fuse the features over vertices and make them indistinguishable [li2018deeper]. Inspired by [he2016deep], we alleviate this problem in our inference network by adding shortcut connections between different layers.
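A single graph convolution layer with symmetric normalization and a shortcut connection can be sketched as follows. The added self-loops and the ReLU activation are illustrative choices from the standard GCN formulation, not details confirmed by the paper:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One graph convolution with a ReLU nonlinearity and a shortcut connection
    (applied when input/output shapes match) to mitigate over-smoothing."""
    out = np.maximum(A_norm @ H @ W, 0.0)
    return out + H if out.shape == H.shape else out

# Toy two-node graph with a single edge and identity features/weights.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
H0 = np.eye(2)
A_norm = normalize_adj(A)
H1 = gcn_layer(A_norm, H0, np.eye(2))
```

Stacking `gcn_layer` calls yields the multi-layer form above, with the shortcut playing the role of the residual connections.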
Description of topic ICD codes
A full description of the top-5 topic ICD codes is shown in Table 6.