Multi-task learning aims to jointly solve different learning tasks, while leveraging appropriate information sharing across all tasks [thrun1996learning, caruana1997multitask]. It has been shown that learning under a multi-task setting usually yields enhanced performance relative to separately building single-task models [sermanet2013overfeat, hashimoto2016joint, ruder2017overview]. However, multi-task learning has primarily been considered for homogeneous tasks that share the same objective (e.g., the same set of labels) [baxter1997bayesian, bakker2003task, yu2005learning, luo2015multi]. Real-world tasks are often heterogeneous [jin2014multi], meaning that each task potentially has a different objective and relies on complicated, often unobserved interactions. Examples of tasks with different objectives include classification, regression, and recommendation.
From the perspective of generative models, heterogeneous tasks often correspond to distinct generative processes. This implies that traditional generative multi-task learning methods [baxter1997bayesian, bakker2003task, yu2005learning, zhang2008flexible], which often generalize a single class of generative model to multiple tasks, are not appropriate. Under these circumstances, a new mechanism is required to leverage relationships across the entities from different tasks.
To overcome the aforementioned challenges, we propose a graph-driven generative model to learn heterogeneous tasks in a unified framework. Taking advantage of the graph structure that commonly appears in real-world data, the proposed model treats feature views, entities and their relationships as nodes and edges in a graph, and formulates learning heterogeneous tasks as instantiating different sub-graphs from the global data graph. Specifically, a sub-graph contains the feature views and the entities related to a task, together with their interactions. Both the feature views and the interactions can be reused across all tasks, while the representations of the entities are specialized for each task. We combine a shared graph convolutional network (GCN) [kipf2016semi] with multiple variational autoencoders (VAEs) [kingma2013auto]. The GCN serves as a generator of latent representations for the sub-graphs, while the VAEs are specified to address the different tasks. The model is then optimized jointly over the objectives of all tasks, encouraging the GCN to produce representations that can be used simultaneously by all of them.
In health care, our motivating example, ICD (International Statistical Classification of Diseases) codes for diseases and procedures can be used as a source of information for multiple tasks, e.g., modeling clinical topics of admissions, recommending procedures according to diseases, and predicting admission types. These three tasks require capturing the clinical relationships among ICD codes and admissions. A given admission is associated with a set of disease and procedure codes (i.e., feature views).
However, the admission has to be organized with different views (i.e., specialized entities) for tasks with different objectives. For instance, topic modeling is an unsupervised task needing both procedures and diseases, admission-type prediction is a supervised task also using both procedures and diseases, and procedure recommendation is a supervised task that only uses disease codes. In the context of our work, ICD codes and hospital admissions constitute a graph, as shown in Figure 1. The edges between ICD codes, and those between ICD codes and admissions, are weighted according to their coherency. The ICD code embeddings are learned during training and are used to specialize the embeddings of admissions for different tasks. At test time, the GCN is used to represent sub-graphs, i.e., collections of shared ICD codes, specialized admissions and their interactions, which feed into different task-specific VAEs. We test our model on the three tasks described above. Experimental results show that the jointly learned representation of the admission graph indeed improves the performance of all tasks relative to individually-trained single-task models.
To solve heterogeneous multi-task learning from a generative-model perspective, a natural solution is to model multiple generative processes, one for each task. In particular, given $T$ tasks, each task $t$ is associated with training data $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{N_t}$, where $y_i^t$ represents the target variable, and $x_i^t$ represents the observed variable associated with $y_i^t$. We propose using $T$ sets of VAEs [kingma2013auto] for modeling $\{y^t\}_{t=1}^T$ in terms of latent variables $\{z^t\}_{t=1}^T$, where each $z^t$ is inferred from $x^t$ using a task-specific inference network. Note that here the term VAE is used loosely, in the sense that $x^t$ and $y^t$ need not be the same. The generative processes are defined as
$$ z^t \sim p(z^t), \qquad y^t \sim p_{\theta_t}(y^t \mid z^t), \qquad t = 1, \dots, T, $$
with corresponding inference networks specified as
$$ q_{\phi_t}(z^t \mid x^t) = q_{\phi_t}(z^t \mid h^t), \qquad h^t = f(x^t), \qquad t = 1, \dots, T. $$
For the $t$-th task, $p_{\theta_t}(y^t \mid z^t)$ represents a generative model (i.e., a stochastic decoder) with parameters $\theta_t$, and $p(z^t)$ is the prior distribution for latent code $z^t$. The corresponding inference network for $z^t$ consists of two parts: ($i$) a deterministic encoder $f(\cdot)$, shared across all tasks, to encode each $x^t$ into $h^t$ independently; and ($ii$) a stochastic encoder $q_{\phi_t}(z^t \mid h^t)$ with parameters $\phi_t$ to stochastically map $h^t$ into latent code $z^t$. The distribution $q_{\phi_t}(z^t \mid h^t)$ serves as an approximation to the unknown true posterior. Note that since the $\{y^t\}_{t=1}^T$ are in general associated with heterogeneous tasks, they may represent different types of information. For example, they can be labels for classification or bags-of-words for topic modeling. Motivated by the intuition that real-world tasks are likely to be latently related with each other, using a shared representation can be beneficial as a means to consolidate information in a way that allows tasks to leverage information from each other.
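To make the shared/specialized split concrete, the following sketch shows a deterministic encoder $f(\cdot)$ shared by two tasks, each with its own re-parameterized stochastic encoder. All shapes and the linear+ReLU layers are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(x, W):
    """Deterministic encoder f(.) shared across all tasks (linear + ReLU)."""
    return np.maximum(x @ W, 0.0)

def stochastic_encoder(h, W_mu, W_logvar):
    """Task-specific stochastic encoder q_phi(z | h): a re-parameterized
    diagonal Gaussian, z = mu + sigma * eps."""
    mu, logvar = h @ W_mu, h @ W_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical dimensions: two tasks share f(.), each owns its q_phi.
d_in, d_h, d_z = 8, 16, 4
W_shared = rng.standard_normal((d_in, d_h)) * 0.1
for task in range(2):
    x = rng.standard_normal((5, d_in))        # a batch of observations x^t
    h = shared_encoder(x, W_shared)           # shared representation h^t
    z = stochastic_encoder(h,
                           rng.standard_normal((d_h, d_z)) * 0.1,
                           rng.standard_normal((d_h, d_z)) * 0.1)
    # z^t then feeds the task-specific decoder p_theta(y^t | z^t)
```

The key design point is that the parameters of `shared_encoder` are reused by every task, while each task carries its own stochastic-encoder parameters.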
In likelihood-based learning, the goal for heterogeneous multi-task learning is to maximize the empirical expectation of the log-likelihood $\sum_{t=1}^T \log p(y^t)$ with respect to the data provided for each task. Since the marginal likelihood rarely has a closed-form expression, the VAE instead seeks to maximize the following evidence lower bound (ELBO), which bounds the marginal log-likelihood from below:
$$ \mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{q_{\phi_t}(z^t \mid h^t)}\big[ \log p_{\theta_t}(y^t \mid z^t) \big] - \mathrm{KL}\big( q_{\phi_t}(z^t \mid h^t) \,\|\, p(z^t) \big). $$
However, for heterogeneous tasks, features are often organized in different views, and the interactions between observed entities can differ as well. As a result, it is challenging to find a common encoder $f(\cdot)$ for observations $\{x^t\}_{t=1}^T$ with incompatible formats, or even in incomparable data spaces.
Fortunately, such data can often be modeled as a data graph, whose nodes correspond to the entities appearing in different tasks and whose edges capture their complex interactions. Accordingly, different tasks re-organize the graph and leverage its information from shared but different views. Specifically, we represent a data graph as $G = (V, A, X)$, where $V$ is the set of nodes corresponding to the observed entities, $A$ is the adjacency matrix of the graph, and $X = \bigcup_{t=1}^T X^t$ is a union of (trainable) feature sets. For the $t$-th task, $X^t = \{x_v^t\}_{v \in V^t}$ is its feature set, where $V^t \subseteq V$ contains the nodes related to the task and $x_v^t$ is the feature of node $v$ in task $t$. Based on $G$, the observations of the $t$-th task correspond to a sub-graph from $G$, i.e., $G^t = (V^t, A^t, X^t)$, where $A^t = A(V^t, V^t)$ selects rows and columns from $A$. In such a situation, instead of finding a unified inference network for each individual observation in different tasks, for the sub-graphs we define an inference network based on a graph convolutional network (GCN) [kipf2016semi], i.e., implementing $f(\cdot)$ in the inference network above as a GCN shared across tasks, so that $h^t = \mathrm{GCN}(G^t)$; hence a large portion of the parameters of the inference network are shared among tasks.
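The sub-graph extraction $G^t = (V^t, A^t, X^t)$ amounts to index selection on the global graph; a minimal numpy sketch (our helper, for illustration only):

```python
import numpy as np

def subgraph(A, X, node_idx):
    """Extract a task's sub-graph: A^t selects the rows and columns of A
    indexed by the task's nodes V^t, and X^t the corresponding features."""
    idx = np.asarray(node_idx)
    A_t = A[np.ix_(idx, idx)]   # A(V^t, V^t)
    X_t = X[idx]                # features of the selected nodes
    return A_t, X_t
```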
The independent generative processes with a shared GCN-based inference network match the nature of heterogeneous tasks. In particular, the sub-graphs in different tasks are derived from the same data graph with partially shared nodes and edges, enabling joint learning of latent variables through the shared inference network. The inferred latent variables then pass through different generative models under the guidance of different tasks. In the next section, we show that this model is suitable for challenging healthcare tasks.
Typical Specification for Healthcare Tasks
Observations, tasks, and proposed data graph
To demonstrate the feasibility of our model, we describe a specification to solve tasks associated with hospital admissions. Let $\mathcal{C}_d$ and $\mathcal{C}_p$ denote the sets of disease and procedure ICD codes, respectively, i.e., each $c \in \mathcal{C}_d$ represents a specific disease and each $c \in \mathcal{C}_p$ represents a specific procedure. Suppose we observe $N$ hospital admissions, denoted as $\{a_n\}_{n=1}^N$. Each $a_n$ is associated with some ICD codes and a label representing its type, i.e., $a_n = (\mathcal{D}_n, \mathcal{P}_n, y_n)$ for $n = 1, \dots, N$, where $\mathcal{D}_n \subseteq \mathcal{C}_d$, $\mathcal{P}_n \subseteq \mathcal{C}_p$, and $y_n$ is an element in the set of admission types $\mathcal{Y}$. Based on these observations, we consider three healthcare tasks: ($i$) clinically-interpretable topic modeling of admissions; ($ii$) procedure recommendation; and ($iii$) admission-type prediction.
As illustrated in Figure 1, the observations above can be represented as an admission graph $G = (V, A, X)$, where the node set $V = \mathcal{C}_d \cup \mathcal{C}_p \cup \{a_n\}_{n=1}^N$ and $A$ is the adjacency matrix. The union of feature sets is $X = X_d \cup X_p \cup X_a$, where $X_d$ and $X_p$ contain trainable vector embeddings of the ICD codes for diseases and procedures, respectively. These embeddings are reused across tasks, while $X_a = \bigcup_t X_a^t$ contains the embeddings of admissions, specialized for the different tasks. Specifically, for each admission $a_n$, its embedding in the $t$-th task is derived by aggregating the embeddings of the codes in $\mathcal{C}_n^t$, where $\mathcal{C}_n^t$ is the set of the ICD codes associated with task $t$. For topic modeling and admission-type prediction, $\mathcal{C}_n^t = \mathcal{D}_n \cup \mathcal{P}_n$, while for procedure recommendation, in which the procedure codes are unavailable, $\mathcal{C}_n^t = \mathcal{D}_n$. Given this admission graph $G$, the three healthcare tasks correspond to different sub-graphs $\{G^t\}_{t=1}^3$, which yields a typical heterogeneous scenario. Table 1 highlights their differences in target variables and sub-graphs. Although the sub-graphs specialize the information of admission nodes, they reuse the representations of ICD code nodes and the edges in $A$.
Construction of edges
Inspired by existing research [matveeva2006document, chen2013alternative, rekabsaz2017toward, yao2018graph], we enrich the representation power of our model with meaningful population statistics, considering two types of edges in the adjacency matrix.
Edges between ICD codes. ICD codes appear coherently in many admissions, e.g., diabetes and its comorbidities like cardiovascular disease. Accordingly, edges between ICD codes with high coherency should be weighted heavily. Based on this principle, we apply point-wise mutual information (PMI), a commonly-used similarity measurement in various NLP tasks [levy2014neural, arora2016latent, newman2010automatic, mimno2011optimizing, ogura2013text], as the weight between each pair of ICD codes. Formally, for each pair of ICD codes $(c_i, c_j)$, we evaluate their PMI as
$$ \mathrm{PMI}(c_i, c_j) = \log \frac{p(c_i, c_j)}{p(c_i)\, p(c_j)}, $$
where $p(c_i, c_j) = \frac{\#(c_i, c_j)}{N}$ is the fraction of admissions containing both codes and $p(c_i) = \frac{\#(c_i)}{N}$ is the fraction containing code $c_i$. Positive PMI values indicate that the ICD codes in the pair are highly correlated with each other; conversely, negative PMI values imply weak correlation. Therefore, we only retain positive PMI values as the weights of edges.
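The positive-PMI edge construction can be sketched as follows, with admissions represented as sets of codes (function and variable names are ours, for illustration):

```python
import numpy as np
from itertools import combinations

def pmi_edges(admissions):
    """Compute positive-PMI weights between ICD codes.

    `admissions` is a list of sets of ICD codes (one set per admission).
    Returns a dict mapping (code_i, code_j) -> positive PMI weight.
    """
    n = len(admissions)
    count = {}       # single-code occurrence counts #(c)
    pair_count = {}  # unordered pair co-occurrence counts #(c_i, c_j)
    for codes in admissions:
        for c in codes:
            count[c] = count.get(c, 0) + 1
        for ci, cj in combinations(sorted(codes), 2):
            pair_count[(ci, cj)] = pair_count.get((ci, cj), 0) + 1
    edges = {}
    for (ci, cj), n_ij in pair_count.items():
        # log( p(i,j) / (p(i) p(j)) ) = log( n_ij * n / (n_i * n_j) )
        pmi = np.log(n_ij * n / (count[ci] * count[cj]))
        if pmi > 0:  # keep only positive PMI, as in the text
            edges[(ci, cj)] = pmi
    return edges
```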
Edges between ICD codes and admissions. Analogous to the relationship between words and documents, we weight the edges between ICD codes and admissions with the term frequency-inverse document frequency (TF-IDF; see https://en.wikipedia.org/wiki/Tf–idf). The term frequency (TF) is the normalized number of times an ICD code appears in an admission, and the inverse document frequency (IDF) is the log-scaled inverse fraction of the admissions that contain the ICD code. The TF-IDF is the element-wise multiplication of TF and IDF, which quantifies how important an ICD code is to an admission [onan2016ensemble, shen2018baseline].
Summarizing the above, the elements of the adjacency matrix $A$ are
$$ A_{ij} = \begin{cases} \mathrm{PMI}(i, j), & i, j \text{ are ICD codes and } \mathrm{PMI}(i, j) > 0; \\ \text{TF-IDF}(i, j), & i \text{ is an ICD code and } j \text{ is an admission}; \\ 1, & i = j; \\ 0, & \text{otherwise}. \end{cases} $$
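The code-admission TF-IDF weights can be computed as below (a minimal sketch over admissions stored as code lists; names are ours):

```python
import numpy as np

def tfidf_edges(admissions):
    """TF-IDF weights between ICD codes and admissions.

    `admissions` is a list of lists of ICD codes (repetitions allowed).
    Returns a dict mapping (code, admission_index) -> TF-IDF weight.
    """
    n = len(admissions)
    df = {}  # number of admissions containing each code
    for codes in admissions:
        for c in set(codes):
            df[c] = df.get(c, 0) + 1
    edges = {}
    for j, codes in enumerate(admissions):
        total = len(codes)
        for c in set(codes):
            tf = codes.count(c) / total  # normalized term frequency
            idf = np.log(n / df[c])      # log-scaled inverse document frequency
            edges[(c, j)] = tf * idf
    return edges
```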
Graph-driven VAEs for different tasks
Focusing on the three tasks mentioned above, we specify our model as a graph-driven variational autoencoder (GD-VAE). Specifically, GD-VAE consists of: ($i$) a GCN-based inference network that is shared by all three tasks, and ($ii$) three specialized generative networks that account for the different sets of observations corresponding to the three tasks.
| Task | Target variable |
| --- | --- |
| Topic Modeling | Bi-term ICD codes |
| Procedure Recommendation | List of procedures |
| Admission-type Prediction | Admission type |
Topic modeling of admissions In the context of topic modeling, each ICD code can be considered as a word or token, while each admission corresponds to a document, i.e., a collection of ICD codes. However, patient admissions exhibit an extreme-sparsity issue, in the sense that only a very small set of codes is associated with each admission. Classic topic models, such as LDA [blei2003latent] and the Neural Topic Model [miao2017discovering], are therefore not appropriate in this case. To circumvent this problem, inspired by [yan2013biterm], instead of modeling a bag-of-ICD-codes for a single admission, we model bi-term collections: we aggregate all the unordered ICD code pairs (bi-terms) from several admissions together as one document. The generative process of our proposed Neural Bi-term Topic Model (NBTM) is described as follows:
$$ \theta \sim \mathrm{Dirichlet}(\alpha); \quad \text{for each bi-term } b: \; z \sim \mathrm{Discrete}(\theta), \;\; c_1, c_2 \sim \mathrm{Discrete}(\beta_z), $$
where $b$ is the bi-term variable and its instance is $(c_1, c_2)$, in which $c_1$ and $c_2$ are two ICD codes; $\theta$ is the topic distribution; $\alpha$ is the hyper-parameter of the Dirichlet prior, a vector of length $K$, where $K$ is the number of topics; and $\{\beta_k\}_{k=1}^K$ are trainable parameters, each representing a learned topic, i.e., a distribution over ICD codes. The marginal likelihood for the entire admission corpus $B$ can be written as
$$ p(B \mid \alpha, \beta) = \prod_{(c_1, c_2) \in B} \int_{\theta} p(\theta \mid \alpha) \sum_{k=1}^{K} \theta_k \, \beta_{k, c_1} \, \beta_{k, c_2} \, d\theta. $$
The Dirichlet prior is known to be essential for generating interpretable topics [wallach2009rethinking]. However, it can rarely be applied within a VAE directly, since no effective re-parameterization trick can be adopted for the Dirichlet distribution. Fortunately, the Dirichlet distribution can be approximated with a logistic normal followed by a softmax, via a Laplace approximation [hennig2012kernel]. When the number of topics $K$ is large, the Dirichlet distribution can be approximated with a multivariate logistic normal [srivastava2017autoencoding], with the $k$-th element of its mean $\mu$ and diagonal covariance matrix $\Sigma$ given by
$$ \mu_k = \log \alpha_k - \frac{1}{K} \sum_{i=1}^{K} \log \alpha_i, \qquad \Sigma_{kk} = \frac{1}{\alpha_k}\Big(1 - \frac{2}{K}\Big) + \frac{1}{K^2} \sum_{i=1}^{K} \frac{1}{\alpha_i}. $$
Under such an approximation, a topic distribution can be readily inferred by applying the re-parameterization trick: sampling $\epsilon \sim \mathcal{N}(0, I)$ and inferring $\theta$ via $\theta = \mathrm{softmax}(\mu + \Sigma^{1/2} \epsilon)$.
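The Laplace approximation above and the re-parameterized sampling step can be sketched numerically as follows (a standalone illustration, not the training code):

```python
import numpy as np

def dirichlet_to_logistic_normal(alpha):
    """Laplace approximation of Dirichlet(alpha) by a logistic normal:
    returns the mean vector and the diagonal of the covariance."""
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    log_a = np.log(alpha)
    mu = log_a - log_a.mean()
    sigma2 = (1.0 / alpha) * (1 - 2.0 / K) + (1.0 / K**2) * (1.0 / alpha).sum()
    return mu, sigma2

def sample_topic_distribution(mu, sigma2, rng):
    """Re-parameterized sample: theta = softmax(mu + sigma * eps)."""
    eps = rng.standard_normal(mu.size)
    h = mu + np.sqrt(sigma2) * eps
    e = np.exp(h - h.max())  # stable softmax
    return e / e.sum()
```

With a symmetric prior (e.g., all entries of `alpha` equal to 0.02, as used in the experiments), the approximating mean is the zero vector and every sample lies on the simplex.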
Procedure recommendation In this task, for an admission, we aim to predict the set of procedures given its set of diseases. Inspired by [liang2018variational], we consider the following generative process for modeling admission procedures:
$$ z \sim \mathcal{N}(0, I), \qquad \pi = \mathrm{softmax}\big(\mathrm{MLP}(z)\big), \qquad y \sim \mathrm{Multinomial}(M, \pi), $$
where $y$ is a $|\mathcal{C}_p|$-dimensional variable whose instance is a list of recommended procedures, and $\pi$ lies on a simplex. We then derive procedures for the given admission by sampling $M$ times from a multinomial distribution with parameter $\pi$.
Admission-type prediction Given an admission, the goal is to predict the admission type from both its diseases and procedures. We consider the following generative process for modeling admission types:
$$ z \sim \mathcal{N}(0, I), \qquad \pi = \mathrm{softmax}\big(\mathrm{MLP}(z)\big), \qquad y \sim \mathrm{Multinomial}(1, \pi), $$
where $y$ is a variable whose instance corresponds to an admission type in the set $\mathcal{Y}$, and the MLP here is another MLP, whose output is normalized to be a distribution $\pi$ over admission types. Finally, the instance of $y$ (the type of the given admission) is sampled once from a multinomial distribution with parameter $\pi$.
Inference with a shared GCN The proposed model unifies the three tasks via a shared GCN-based inference network. Specifically, the posteriors of the three latent variables are
$$ q_{\phi_t}(z^t \mid h^t) = \mathcal{N}\big(z^t;\, \mu_t(h^t),\, \mathrm{diag}(\sigma_t^2(h^t))\big), \qquad t = 1, 2, 3, $$
where $h^t = \mathrm{GCN}(G^t)$ is the shared GCN representation of the $t$-th sub-graph, $\mathrm{diag}(\cdot)$ represents a diagonal matrix, and $\mu_t(\cdot)$ and $\sigma_t(\cdot)$ are task-specific networks for $t = 1, 2, 3$.
Let $\theta_1$, $\theta_2$ and $\theta_3$ denote the parameters of the generative networks for topic modeling, procedure recommendation and admission-type prediction, respectively. In summary, all the parameters are optimized jointly by maximizing the ELBO introduced in the Proposed Model section.
Multi-task learning Early multi-task learning methods learn a shared latent representation [thrun1996learning, caruana1997multitask, baxter1997bayesian], or impose structural constraints on the shared features for different tasks [ando2005framework, chen2009convex]. The work in [he2011graphbased] proposed a graph-based framework leveraging information across multiple tasks and multiple feature views. Following this work, the methods in [zhang2012inductive, jin2013shared] applied structural regularizers across different feature views and tasks, and ensured the learned predictive models are similar for different tasks. However, these methods require multiple tasks directly sharing some label-dependent information with each other, which is only applicable to homogeneous tasks. Focusing on heterogeneous tasks, many discriminative methods have been proposed, which map original heterogeneous features to a shared latent space through linear or nonlinear functions [zhang2011multi, jin2014multi, liu2018learning]
or sparsity-driven feature selection [yang2009heterogeneous, jin2015heterogeneous], and solve heterogeneous tasks jointly in the framework of discriminant analysis. Generative models have achieved remarkable success in the past few years [wang2017topic, wang2018zero, wang2019improving]. However, to our knowledge, generative solutions to heterogeneous multi-task learning have not been fully investigated.
ICD code embedding and analysis of healthcare data
Machine learning techniques have shown potential in many healthcare problems, e.g., ICD code assignment [shi2017towards, baumel2017multi, mullenbach2018explainable, huang2018empirical], admission prediction [ma2017dipole, liu2018early, xu2017patient], mortality prediction [harutyunyan2017multitask, xu2018distilled], procedure recommendation [mao2019medgcn], and medical topic modeling [choi2017gram, suo2018deep]. Although these tasks have different objectives, they often share the same electronic health records, e.g., admission records. To learn multiple healthcare tasks jointly, various multi-task learning methods have been proposed [wang2014multi, alaa2017bayesian, suo2017multi, harutyunyan2017multitask, mao2019medgcn]. Traditional multi-task learning methods impose structural regularizers on the features shared by different tasks [argyriou2007multi]. The work in [mao2019medgcn] applied GCNs [kipf2016semi] to extract features and jointly train models for medication recommendation and lab test imputation, which constitutes an attempt to apply GCNs to multi-task learning. However, introducing GCNs into the framework of generative heterogeneous multi-task learning remains unexplored, a gap that this paper seeks to address.
We test our method (GD-VAE) on the MIMIC-III dataset [johnson2016mimic], which contains more than 58,000 hospital admissions with 14,567 disease ICD codes and 3,882 procedure ICD codes. Each admission consists of a set of disease and procedure ICD codes. Three subsets of the MIMIC-III dataset are considered, with summary statistics in Table 2. The subsets are generated by thresholding the frequency of ICD codes, i.e., the ICD codes appearing at least 500/100/50 times and the corresponding non-empty admissions constitute the small/medium/large subset, respectively.
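A sketch of this subset construction (our helper, not the released preprocessing code):

```python
from collections import Counter

def filter_by_frequency(admissions, min_count):
    """Keep only ICD codes appearing at least `min_count` times overall,
    then drop admissions left empty by the filtering. Thresholds of
    500/100/50 yield the small/medium/large subsets described above."""
    freq = Counter(c for codes in admissions for c in codes)
    kept = []
    for codes in admissions:
        filtered = [c for c in codes if freq[c] >= min_count]
        if filtered:
            kept.append(filtered)
    return kept
```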
To demonstrate the effectiveness of our method, we compare GD-VAE with state-of-the-art approaches on each of the healthcare tasks mentioned above. Specifically, ($i$) for topic modeling, we compare with LDA [blei2003latent], AVITM [srivastava2017autoencoding] and BTM [yan2013biterm]. ($ii$) For procedure recommendation, we compare with Bayesian Personalized Ranking (BPR) [rendle2009bpr], Distilled Wasserstein Learning (DWL) [xu2018distilled], and a VAE model designed for collaborative filtering (VAE-CF) [liang2018variational]. We also compare with a baseline method based on Word2Vec [mikolov2013efficient], which enumerates all possible disease-procedure pairs in each admission, and then recommends procedures according to the similarity between their embeddings and those of the diseases. ($iii$) For admission-type prediction, we consider the following baselines: TF-IDF (combined with a linear classifier), Word2Vec (learning ICD code embeddings with Word2Vec [mikolov2013efficient], and using the mean of the learned embeddings to predict the label), FastText [joulin2016bag], SWEM [shen2018baseline] and LEAM [wang2018joint]. We use “T”, “R” and “P” to denote topic modeling, procedure recommendation and admission-type prediction, respectively. GD-VAE learns the three tasks jointly. To further verify the benefits of multi-task learning, we consider variations of our method that only learn one or two tasks, e.g., GD-VAE (T) means only learning a topic model, and GD-VAE (TR) indicates the joint learning of topic modeling and procedure recommendation.
The standard deviation for GD-VAE and its variants is around 0.003.
| Dataset | Method | Top-1 (%) | Top-3 (%) | Top-5 (%) | Top-10 (%) |
The standard deviation for GD-VAE and its variants is less than 0.2.
Configurations of Our Method
We test various methods in 10 trials and record the mean and standard deviation of the experimental results. In each trial, we split the data into train, validation and test sets with ratios of 0.6, 0.2 and 0.2, respectively. For the network architecture, ICD codes and admissions share a common embedding space, and a two-layer GCN [kipf2016semi] with residual connections is used for the inference network. The dimension of the latent variable is identical to the number of topics for topic modeling, and is fixed for the other two tasks. For the generative networks, a linear layer is employed for both topic modeling and admission-type prediction; for procedure recommendation, a one-hidden-layer MLP with a nonlinear activation function is used. As for the hyper-parameters, we merge 10 randomly sampled admissions to generate one aggregated document for our NBTM, such that each document is not too sparse and enough samples are generated to train the model. Following [srivastava2017autoencoding], the Dirichlet prior hyper-parameter $\alpha$ is a vector with constant value 0.02.
Topic coherence [mimno2011optimizing] is used to evaluate the performance of topic modeling methods. This metric is computed based on the normalized point-wise mutual information (NPMI), which has been proven to match well with human judgment [lau2014machine]. Table 3 compares different methods on the mean of NPMI over the top 5/10/15/20 topic words. We find that LDA [blei2003latent] performs worse than neural topic models (including ours), which demonstrates the necessity of introducing powerful inference networks to infer the latent topics. In terms of the GCN-based methods, GD-VAE and its variants capture global statistics between ICD codes and those between ICD codes and admissions, thus outperforming the three baselines by substantial margins.
Compared with only performing topic modeling, i.e., GD-VAE (T), considering more tasks brings improvements, and the full GD-VAE achieves the best performance. In terms of leveraging knowledge across tasks, we find that the improvements are largely attributable to procedure recommendation, and only marginally to admission-type prediction. This is because procedure recommendation accounts for the co-occurrence of disease codes and procedure codes within an admission, while the topic model considers the co-occurrence of codes across different admissions. Both models capture the co-occurrence of ICD codes from different views, and thus naturally enhance each other.
To further verify the quality of the learned topics, we visualize the top-5 ICD codes of some learned topics in the Supplementary Material. We find that the topic words are clinically correlated. For example, the ICD codes related to surgery and those related to urology are concentrated in two respective topics. Additionally, each topic contains both disease and procedure codes, e.g., “d85306” and “p7817” are an orthopedic-surgery-related disease and procedure, respectively, showing that diseases and procedures can be closely correlated, which also suggests the potential benefits brought to procedure recommendation.
The standard deviation for GD-VAE and its variants is around 0.05 on F1 score.
Similar to [chen2018sequential, xu2018distilled], we use top-$K$ precision, recall and F1-score to evaluate the performance of procedure recommendation. For the $n$-th admission, we denote $R_n^K$ and $T_n$ as the top-$K$ list of recommended procedures and the ground-truth procedures, respectively. The top-$K$ precision, recall and F1-score are calculated as
$$ \mathrm{Precision@}K = \frac{|R_n^K \cap T_n|}{K}, \qquad \mathrm{Recall@}K = \frac{|R_n^K \cap T_n|}{|T_n|}, \qquad \mathrm{F1@}K = \frac{2 \cdot \mathrm{Precision@}K \cdot \mathrm{Recall@}K}{\mathrm{Precision@}K + \mathrm{Recall@}K}. $$
Results are provided in Table 4. GD-VAE (R) is comparable to previous state-of-the-art algorithms. With additional knowledge learned from topic modeling and admission-type prediction, the results can be further improved. Similar to the observation in the previous section, topic modeling contributes more to procedure recommendation than admission-type prediction, since both topic modeling and procedure recommendation explore the underlying relationship between diseases and procedures.
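The per-admission metrics above translate directly into code (a minimal sketch; averaging over admissions is left to the caller):

```python
def topk_metrics(recommended, ground_truth, k):
    """Top-k precision, recall, and F1 for one admission.

    `recommended` is a ranked list of procedure codes;
    `ground_truth` is the set of true procedure codes."""
    top_k = set(recommended[:k])
    hits = len(top_k & set(ground_truth))
    precision = hits / k
    recall = hits / len(ground_truth)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```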
Similar to procedure recommendation, we use precision, recall and F1-score to evaluate the performance of admission-type prediction. Results in Table 5 show that GD-VAE outperforms its competitors. It is interesting to find that, compared with topic modeling, procedure recommendation is more helpful for boosting admission-type prediction. One possible explanation is that the admission type is more relevant to the set of procedures, hence the embedding jointly learned with procedure recommendation can better guide the model towards an accurate prediction; e.g., a surgery procedure is likely to be observed in an urgent admission. Additionally, to better understand the representation learned by GD-VAE, we visualize the inferred latent codes with $t$-SNE [maaten2008visualizing], as shown in Figure 2(a). The visualization shows that different admission types are clustered into distinct groups.
Rationality of graph construction
To explore the rationality of our graph construction, we compare the proposed admission graph with its variants. In particular, the proposed admission graph considers the PMI between ICD codes and the TF-IDF between ICD codes and admissions (i.e., PMI + TF-IDF). Its variants include: ($i$) a simple graph with binary edges between admissions and ICD codes (Binary); ($ii$) a graph only considering the TF-IDF between admissions and ICD codes (TF-IDF); and ($iii$) a graph considering the PMI between ICD codes and binary edges between admissions and ICD codes (PMI + Binary). Results using the different graphs are provided in Figures 2(b) through 2(d), which demonstrate that both the PMI edges and the TF-IDF edges make significant contributions to the performance of the proposed GD-VAE.
We have proposed a graph-driven variational autoencoder (GD-VAE) to learn multiple heterogeneous tasks within a unified framework. This is achieved by formulating entities under different tasks as different types of nodes, and using a shared GCN-based inference network to leverage knowledge across all tasks. Our model is general in that it can be easily extended to new tasks by specifying the corresponding generative processes. Comprehensive experiments on real-world healthcare datasets demonstrate that GD-VAE can better leverage information across tasks, and achieve state-of-the-art results on clinical topic modeling, procedure recommendation, and admission-type prediction.
GCN-based Inference Network
Graph convolutional networks (GCNs) [kipf2016semi] have attracted much attention for learning informative representations of nodes and edges, and are promising for tasks with complex relational information [kipf2016semi, hamilton2017inductive, yao2018graph].
Given a graph $G = (V, A, X)$, a graph convolution layer derives the node embeddings via
$$ H^{(1)} = \sigma\big( \tilde{A} X W^{(0)} \big), $$
where $H^{(1)} \in \mathbb{R}^{|V| \times d_1}$ with $d_1$ the dimension of the feature space after the convolution, $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized version of the adjacency matrix, $D$ is a diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$, $W^{(0)}$ is a matrix of trainable graph convolution parameters, and $\sigma(\cdot)$ is a nonlinear activation function. The GCN aggregates node information in local neighborhoods to extract local substructure information. In order to incorporate high-order neighborhoods, we can stack multiple graph convolution layers as
$$ H^{(l+1)} = \sigma\big( \tilde{A} H^{(l)} W^{(l)} \big), $$
where $H^{(0)} = X$ and $H^{(l)}$ is the output of the $l$-th graph convolutional layer. However, graph convolution can be interpreted as Laplacian smoothing; repeatedly applying Laplacian smoothing may fuse the features over vertices and make them indistinguishable [li2018deeper]. Inspired by [he2016deep], we alleviate this problem in our inference network by adding shortcut connections between different layers.
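A stacked GCN with shortcut connections can be sketched as follows (a numpy illustration with self-loops added before normalization, a common convention; the shapes and the ReLU choice are assumptions):

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, with self-loops."""
    A = A + np.eye(A.shape[0])
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, weights):
    """Stacked graph convolutions H^{l+1} = relu(A_hat H^l W^l), with a
    shortcut (residual) connection whenever layer shapes allow it."""
    A_hat = normalize_adjacency(A)
    H = X
    for i, W in enumerate(weights):
        Z = A_hat @ H @ W
        if Z.shape == H.shape:  # shortcut connection between layers
            Z = Z + H
        # ReLU between layers; the last layer is left linear
        H = np.maximum(Z, 0.0) if i < len(weights) - 1 else Z
    return H
```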
Description of topic ICD codes
A full description of the top-5 topic ICD codes is shown in Table 6.