1 Introduction
Distributed representations, also known as embeddings, have brought immense success in numerous Natural Language Processing (NLP) and Computer Vision [8] tasks. Recent works applying Deep Learning to healthcare data such as electronic health records (EHR) [6, 19] have adopted the concept of embedding to learn vector representations of medical concepts, so that similar concepts form natural clusters and relationships in the vector space [17]. Other works generate visit representations from the learned medical concept embeddings [3, 4]. Considering each patient as a sequence of these visits, the patient representation is then learned by optimizing a supervised learning task. However, large amounts of labeled healthcare data for predictive modeling are rarely available in practice, and manual labeling is laborious and expensive. To reduce the annotation effort, we propose an unsupervised model that relies only on large unannotated EHR data to generate patient representations. We call this model the Mixed Pooling Multi-View Attention Autoencoder (MPVAA).
In healthcare data (e.g., EHR), patient records may be available as heterogeneous data (e.g., demographics, laboratory results, clinical notes) that can provide an added dimension for learning personalized patient representations. For example, a patient with “diabetes” will have different attributes from a patient with “condition of heart failure”, and attributes can further vary among patients with different types of “diabetes” (e.g., type I, type II). Prior works focused on learning representations from predominantly one data type (e.g., unstructured notes or structured clinical events), which excludes relevant information available in other modalities. For example, a patient’s symptoms for a disease can be mentioned in physician notes but be missing from the structured clinical event data recorded as medical codes. Therefore, to make better use of heterogeneous data, in this work we treat the different data modalities in EHR as separate views (i.e., inner views) and first learn patient-specific medical concept embeddings with a graph autoencoder. Information from the embedding spaces of the different views is then fused through an attention mechanism to learn a unified patient representation.
Our attention autoencoder model, MPVAA, follows the encoder-decoder architecture. On the encoder side, a multi-layer Transformer encoder [20] is first applied to the input patient vector, followed by a mixed pooling strategy that combines mean pooling and max pooling in a stochastic manner. As mean pooling and max pooling each have their own advantages and drawbacks [21], randomly weighting their importance yields a latent representation that encapsulates the most salient features of the patient vector while also capturing its general features. The decoder then reconstructs the input vector from the encoded representation through a multi-view attention mechanism that reconciles the interactions among the heterogeneous features.
Our contributions in this paper can be summarized as follows:

We propose a new architecture, MPVAA, for learning patient representations, in which the locally and globally relevant features present in the heterogeneous information associated with the patient are integrated by facilitating interactions among cross-modal features. A multi-view graph is constructed for each patient to promote personalization of the learned representation.

MPVAA is unsupervised and can be easily generalized to other domains with heterogeneous data.

We evaluate MPVAA on a publicly available EHR dataset on two different tasks. The results demonstrate the effectiveness of the MPVAA model compared to other state-of-the-art models.
2 Mixed Pooling Multi-View Attention Autoencoder
Our proposed model MPVAA, shown in Figure 1, is an instance of a sequence autoencoder based on sequence-to-sequence learning [18]. The sequence autoencoder works by first using an encoder to read the input sequence step by step into a hidden representation, which is then passed through a decoder to recreate the input sequence. However, unlike traditional RNN autoencoders, MPVAA relies completely on self-attention to model the input and output sequences, without using RNNs or convolution. In particular, MPVAA employs a multi-head self-attention mechanism that extracts different aspects of the patient sequence into multiple vector representations. Augmenting this multi-head self-attention mechanism with a mixed pooling multi-view strategy further helps the model associate heterogeneous medical information with each patient and generate a comprehensive representation specific to that patient.
Each input patient sequence is represented as a sequence of visits, and each visit is in turn a sequence of the medical concepts occurring during that visit. Each medical concept is drawn from the set of unique medical concepts. To encapsulate the patient’s heterogeneous information in the learned representation, we consider three different views: dem, lab and notes. In the following sections, we first discuss how the inner-view embeddings are obtained, and then elaborate on the MPVAA architecture used to generate patient representations.
2.1 Inner-View Embeddings
For each patient, we use a graph autoencoder to obtain the relevant embeddings of the medical concepts in each view.
2.1.1 Preliminary: Graph Autoencoder
The use of graph-based neural network models such as the graph convolutional network (GCN) [12] and the graph autoencoder (GAE) [13] is becoming prevalent for learning robust representations in applications ranging from social network analysis and bioinformatics to computer vision [23]. A GCN learns representations of the nodes of a graph with N nodes and generates an N × F output matrix, where F is the number of output features (latent dimensions) of each node. An N × M feature matrix, with M input features per node, and an N × N adjacency matrix are fed as inputs to the GCN so that both characteristics-based and structure-based node information is included. Each layer of a GCN can be summarized as:
(1)
where f is a nonlinear activation function, for a total of L layers. The function f and the layer-wise weight matrices aggregate locality information to learn the representation of each node. The graph autoencoder (GAE) is an unsupervised extension of the graph convolutional network that uses a GCN encoder and an inner product decoder.
2.1.2 Inner-View Graph Embeddings Using GAE
A graph is constructed for each of three types of EHR data, namely demographic information (dem), laboratory results (lab) and clinical notes (notes), with each type treated as a separate view. Each medical concept is a node in this inner-view network, and the similarity between two nodes forms an edge. We consider medical concepts from three different categories (i.e., disease, medication and procedure), which are extracted from the structured clinical codes.
Formally, each view has its own graph, defined by its nodes, an adjacency matrix and a feature matrix. The number of nodes equals the total number of unique medical concepts, N, and the feature matrix is set to be N × N, so that each node feature with respect to a view is the similarity between two medical concepts in that view. The similarity is computed with the Dice coefficient (DSC) as,
(2)
where the two arguments are the raw feature vectors of the respective medical concepts in that view. The more features two medical concepts share in a view, the closer their Dice coefficient is to 1.
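As a minimal sketch of the similarity described above, the following function computes the Dice coefficient between two binary feature vectors. The function name `dice_coefficient`, the example vectors and the convention of returning 0 for two all-zero vectors are illustrative assumptions, not details taken from the paper.

```python
def dice_coefficient(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    # Number of features that are active in both vectors.
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    # Total number of active features across the two vectors.
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Assumption: two vectors with no active features have similarity 0.
    return 2.0 * overlap / total if total else 0.0


# Two hypothetical medical concepts that share two of their three active features.
m_i = [1, 1, 0, 1, 0]
m_j = [1, 0, 0, 1, 1]
print(dice_coefficient(m_i, m_j))
```

Identical non-empty vectors give a value of 1, and vectors with no shared active feature give 0, which matches the behaviour described in the text.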
The graph autoencoder for each view employs a two-layer GCN, whose propagation rule is applied to obtain the output feature matrix. It is defined as follows (to avoid clutter, we describe it for a single view):
(3)
Here the adjacency matrix is normalized using the diagonal degree matrix of the adjacency matrix with added self-connections, that is, the adjacency matrix plus the identity matrix. The two trainable weight matrices belong to the first and second layers respectively. In our case the adjacency matrix is a binary co-occurrence matrix of dimensions N × N, in which an entry of “1” indicates that two medical concepts appeared within the same visit of the patient.
2.1.3 Graph Construction Illustration for Each View
To embed the heterogeneous hospital encounter information specific to a patient, inner-view graphs are constructed for each patient. N unique medical concepts, comprising diagnosis, procedure and drug codes, are first extracted from the structured clinical events in the patient’s EHR records. These medical concepts are modeled as the nodes of the inner-view specific graphs for the dem, lab and notes views, each of which has N nodes.
We describe the graph construction process for each view as follows:
dem: To obtain the feature vector of each medical concept m with respect to the dem view, the age, weight and gender of the patient are considered. The values of each feature are discretized into categorical bins: age {old, adult, neonate, middle}, weight {healthy, overweight, underweight} and gender {male, female}. Whenever m occurs in a visit of the patient, the entries for the demographic features found in that visit record are set to 1 in an intermediate feature vector whose dimensions correspond to the features {old, adult, neonate, middle, healthy, overweight, underweight, male, female}. The DSC between the intermediate feature vector of m and that of each other medical concept is then computed to fill the corresponding entry in the feature vector of m.
lab: To obtain the feature vector of each medical concept m with respect to the lab view, (lab item, value) tuples are considered as features. As for the dem view, whenever m occurs in a visit of the patient, the entries for the laboratory result features found in that visit are set to 1 in an intermediate feature vector of length g, where g is the total number of (lab item, value) tuples. The DSC between the intermediate feature vector of m and that of each other medical concept is then computed to fill the corresponding entry in the feature vector of m.
notes: To obtain the feature vector of each medical concept m with respect to the notes view, the UMLS (https://www.nlm.nih.gov/research/umls/) Concept Unique Identifiers (CUIs) of the contextual words of m within a window, w, in the notes are considered as features. The notes of the patient from any visit in which m occurs are first extracted, and the entries for the CUIs of the contextual words of m appearing within w in the notes are set to 1 in an intermediate feature vector of length h, where h is the size of the vocabulary of CUIs for the words in the clinical notes. The DSC between the intermediate feature vector of m and that of each other medical concept is then computed to fill the corresponding entry in the feature vector of m.
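Once a view's feature matrix has been filled as described above, the two-layer GCN encoder and inner product decoder of Section 2.1.2 can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the symmetric normalization and the ReLU hidden activation follow the common formulation in [12, 13], the weights are random rather than trained, the toy matrices are made up, and the names `normalize_adjacency` and `gcn_encoder` are hypothetical.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    A_self = A + np.eye(A.shape[0])          # add self-connections
    deg = A_self.sum(axis=1)                 # node degrees (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_self * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_encoder(A, X, W0, W1):
    """Two-layer GCN: A_hat @ ReLU(A_hat @ X @ W0) @ W1 (ReLU is an assumption)."""
    A_hat = normalize_adjacency(A)
    hidden = np.maximum(A_hat @ X @ W0, 0.0)
    return A_hat @ hidden @ W1


rng = np.random.default_rng(0)
N, hidden_dim, F = 4, 8, 3                   # illustrative sizes

# Toy binary co-occurrence matrix: 1 if two concepts share a visit.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
# Stand-in for the N x N matrix of Dice similarities of one view.
X = rng.random((N, N))

W0 = 0.1 * rng.normal(size=(N, hidden_dim))  # untrained first-layer weights
W1 = 0.1 * rng.normal(size=(hidden_dim, F))  # untrained second-layer weights

Z = gcn_encoder(A, X, W0, W1)                # N x F concept embeddings
A_rec = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))     # inner product decoder with sigmoid
print(Z.shape, A_rec.shape)
```

In a trained graph autoencoder, the weights would be fitted so that the reconstructed adjacency approximates the observed co-occurrence matrix; the rows of `Z` then serve as the inner-view embeddings.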
2.2 Architecture
To fuse the inner-view embeddings, each medical concept in the patient sequence is first embedded as a fixed-dimensional vector, given by the corresponding row of the medical concept feature matrix learned by the graph autoencoder for the respective view. Unlike [20, 22], however, we do not add positional embeddings to the input patient embedding, as the medical concepts within a visit form an unordered set.
MPVAA has an encoder-decoder framework and captures the cross-modal features by exploiting three types of attention: Encoder Multi-Head Self-Attention, Encoder-Decoder Multi-View Attention and Decoder Multi-Head Self-Attention. With the Encoder/Decoder Multi-Head Self-Attention, the internal structure of the patient representation with respect to the dem view is captured by learning the dependencies of the medical concepts within its visits. As the dem view includes the general features of a patient and is relatively more static than the other two views, we feed its embeddings as the inputs to the encoder and decoder. The Encoder-Decoder Multi-View Attention, in turn, facilitates the interactions among the dem, lab and notes views to generate the patient representation from a comprehensive view.
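2.2.1 Preliminary: Multi-Head Self-Attention
The attention mechanism maps a query and a set of key-value pairs to an output [20]. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key. For self-attention, we use the Scaled Dot-Product Attention [20]:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{4}$$
With Multi-Head Attention, this attention mechanism is run multiple times in parallel. It is defined as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \tag{5}$$
where,
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{6}$$
Here $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are parameter matrices to be learned, and $d_k$ is the dimension of the keys.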
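The Scaled Dot-Product Attention and its multi-head combination described above, following [20], can be sketched in NumPy as below. The sizes, the random untrained projection matrices and the function names are illustrative assumptions; only the choice of 5 heads mirrors the setting reported later in the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """Apply one attention per head to projections of X, concatenate, then project."""
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)[0]
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
n_concepts, d_model, n_heads = 6, 10, 5       # illustrative sizes
d_head = d_model // n_heads

X = rng.normal(size=(n_concepts, d_model))    # stand-in for a patient embedding
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (6, 10)
```

Each head attends over all medical concepts in the sequence, so every output row is a mixture of the value vectors weighted by query-key similarity.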
2.2.2 Mixed Pooling Attention Encoder
The encoder, shown in Figure 1, converts the patient sequence embedding into a hidden representation using two sublayers. The first sublayer uses the Multi-Head Self-Attention mechanism to attend jointly to information at different positions of the input. The multiple hops of self-attention enable it to learn multiple vector representations of the patient, each focusing on a different part of the patient’s visit sequence. For example, each part can be a component capturing related medical concepts within the visits, which reflects a semantic aspect of the patient’s hospital profile. The overall semantics can thus be represented by the multiple vector representations computed by the multi-head self-attention. It is defined as
(7)
where the projection matrices are parameters to be learned.
The second sublayer is a fully connected feed-forward network that applies a nonlinear activation (i.e., ReLU) to the linearly transformed outputs of the first sublayer, followed by another linear transformation. We apply residual connections after each sublayer. The hidden representation is thus produced with the following equations:
(8)
(9)
(10)
where the weights of the two linear transformations are parameter matrices with corresponding bias vectors, LayerNorm denotes layer normalization and ReLU is the activation function.
To aggregate the hidden representations into a fixed-dimensional vector, a mixed pooling strategy is applied that produces the mixed representation. It does so through a random weighting technique that balances the importance of the mean representation and the max representation. This generalized pooling approach enriches the expressiveness of the attention mechanism.
In max pooling, the max operation is applied to each vector of the hidden representation to extract the most salient feature pertaining to that time step, while mean pooling takes all the features into consideration and summarizes them into a global representation:
(11) 
(12) 
The mixed pooling strategy is defined as
(13) 
where the mixing weight is a random value between 0 and 1 that sets the relative contribution of the mean pooling and max pooling methods to the final representation.
From the patient perspective, the mixed pooling operation simultaneously encodes the most activated dimension in the embedding space of each medical concept occurring in the visit sequence and provides a comprehensive summary of the semantics across all the medical concept vectors. This can also ease interpretability of the learned embeddings.
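The mixed pooling step described above can be sketched as follows. This is a minimal sketch, assuming that both poolings reduce over the sequence axis so that the result has a fixed dimension, and that the random weight multiplies the max representation while its complement multiplies the mean representation; the paper's exact assignment of the weight is not recoverable from the text, and the names are hypothetical.

```python
import numpy as np


def mixed_pooling(H, rng):
    """Blend max and mean pooling over the sequence axis with a random weight."""
    lam = rng.uniform(0.0, 1.0)              # random mixing weight in [0, 1)
    z_max = H.max(axis=0)                    # most salient value per dimension
    z_mean = H.mean(axis=0)                  # global summary per dimension
    # Assumption: lam weights the max part, (1 - lam) weights the mean part.
    z_mix = lam * z_max + (1.0 - lam) * z_mean
    return z_mix, lam


rng = np.random.default_rng(42)
H = rng.normal(size=(7, 4))                  # 7 positions, hidden size 4 (toy values)
z_mix, lam = mixed_pooling(H, rng)
print(z_mix.shape, round(float(lam), 3))
```

Because the mixed vector is a convex combination, each of its entries lies between the mean and the max of the corresponding hidden dimension, which is why it retains both the salient and the general features.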
2.2.3 Multi-View Attention Decoder for Patient Representation
As we intend to learn general-purpose patient representations that exhibit comparable performance across downstream tasks in healthcare, it is important to integrate a holistic view of the heterogeneous data associated with each patient. Hence, given the mixed patient representation for the dem view, the purpose of the decoder is to reconstruct the input sequence in a way that also incorporates the relevant attributes from the lab and notes views. By connecting the encoder and decoder with a multi-view attention module, it is possible to fuse the embeddings from the different views into the hidden representation of the patient sequence.
As shown in Figure 1, the input patient embedding is first shifted right to obtain the decoder input. Equations (7) and (8) are then employed in the decoder Multi-Head Self-Attention layer to obtain its output.
Following the self-attention layer, multi-view attention is computed using the cross-view representation as,
(14) 
(15) 
(16) 
(17) 
where the weights are parameter matrices and the offsets are bias vectors.
The cross-view representation incorporates the interactions among all the views. To calculate it, the weighted mixed representations of the other two views, lab and notes, are first applied to the representation of the input view with a nonlinear activation. Mathematically it is denoted as,
(18)
where the three view-specific weights are parameter matrices and the nonlinear activation function is tanh.
The result is then fed into a softmax layer, and finally the softmax output is combined with the hidden representation via element-wise multiplication:
(19)
Essentially, generating the cross-view representation through this weighted mean, where the weight indicates the relevance of each view to the patient, captures the contributions of the heterogeneous data types in the final patient representation.
The probability of generating the whole patient sequence is then calculated as:
(20) 
The objective is the sum of the log-probabilities for the input sequence itself:
(21)
MPVAA learns to reconstruct the input sequence by optimizing the objective in Equation (21).
3 Experiments
3.1 Settings
We evaluate our model on the publicly available MIMIC-III dataset [10]. This database contains de-identified clinical records for 40K patients admitted to critical care units over 11 years. It contains a wide range of heterogeneous healthcare information such as demographics, laboratory test results, procedures, medications, diagnosis codes, and nurse and physician notes, among others. ICD-9 codes for diagnoses and procedures and NDC codes for medications were extracted from patients with at least two visits to construct each graph. Tables 1 and 2 outline other statistics about the data.
Table 1
| MIMIC-III | Value |
| --- | --- |
| # of patients | 7,499 |
| # of visits | 19,911 |
| avg. # of visits per patient | 2.66 |
| # of unique ICD-9 codes | 4,893 |
| avg. # of codes per visit | 13.1 |
| max # of codes per visit | 39 |

Table 2
| MIMIC-III | Value |
| --- | --- |
| # of unique clinical codes | 11,135 |
| # of unique ICD-9 diagnosis codes | 4,894 |
| # of unique ICD-9 procedure codes | 2,032 |
| # of unique NDC medication codes | 4,209 |
3.2 Implementation and Training Details
For each medical concept, the feature matrices for the dem and lab views in the graph autoencoder are constructed from, respectively, general information about the patient such as age, weight and gender from the ADMISSIONS table and laboratory measurement results from the LABEVENTS table in MIMIC-III that occurred with the concept. For the notes view, Unified Medical Language System (UMLS) concepts, obtained with MetaMap, for the context within a window of the medical concept mention are considered as the features.
For the dem and lab views, there were no instances with missing features, and each medical concept node in the graph had all feature-value pairs. In the notes view, if the name of the medical concept was not found in the notes, the contexts of other co-occurring concepts were used as its features. We used only “discharge summary” notes, and the window size, w, was set to 2.
We used 5 parallel attention heads and set the final patient representation dimension to 500 for fair comparison against the baselines. The proposed model is trained using Adam [11] in minibatches of 64 with a learning rate of 0.001. is shared among all the views.
3.3 Evaluation Tasks and Metrics
The learned patient representations are evaluated on the following two extrinsic medical tasks:
Outcome Prediction: This is a binary prediction task that predicts whether the patient is at risk of developing a disease in a future visit, given the visit embedding sequence up to the current visit. We focus on patients with heart failure (HF). We therefore examine only patients with at least two visits and check whether heart failure occurs in the visit to be predicted. These patients are the instances of the positive class (HF). As the average number of visits per patient is 2 in MIMIC-III, the prediction in our case is made from the first visit. A binary logistic regression classifier is trained and tested on this prediction task. The train/test/validation split for the positive instances is (1113/186/186), and the same split is applied to an equal number of negative instances, which are formed from patients who do not have an HF code in their record up to that visit.
Sequential Disease Prediction: The goal is to predict all the diagnosis codes of the next visit at every time step, having trained on the visit sequence up to the current visit. It can be considered a multi-label classification task.
We report performance on the Heart Failure Prediction task in terms of AUC-ROC, AUC-PR and Accuracy, while the Sequential Disease Prediction task is evaluated in terms of Normalized Discounted Cumulative Gain (NDCG). To evaluate the quality of the predicted diagnosis codes, they are first sorted according to their prediction values, and NDCG is measured at different cutoff values, k = 5, 15 and 25, against the true diagnosis codes.
NDCG is computed as the ratio of DCG to Ideal DCG (IDCG), defined by the following equations:
(22)
(23)
where the rank of each element of x is taken relative to the sorted predictions (y), and true and pred refer to the lists of true and predicted diagnosis codes respectively.
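The ratio of DCG to IDCG at a cutoff k, as described above, can be sketched in plain Python. This sketch assumes binary relevance (a predicted code is relevant if it appears among the true codes) and the common logarithmic discount 1 / log2(rank + 1); the paper's exact equations are not recoverable from the text, and the function names, codes and scores below are made-up illustrations.

```python
import math


def dcg_at_k(ranked_codes, true_codes, k):
    """DCG with binary relevance and a 1 / log2(rank + 1) discount."""
    true_set = set(true_codes)
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, code in enumerate(ranked_codes[:k], start=1)
        if code in true_set
    )


def ndcg_at_k(scores, true_codes, k):
    """Ratio of DCG to ideal DCG for codes sorted by predicted score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # The ideal ranking places all true codes at the top positions.
    ideal_hits = min(k, len(set(true_codes)))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg_at_k(ranked, true_codes, k) / idcg if idcg else 0.0


# Hypothetical prediction scores for four diagnosis codes of the next visit.
scores = {"428.0": 0.9, "401.9": 0.7, "250.00": 0.4, "584.9": 0.1}
true_codes = ["428.0", "250.00"]
print(ndcg_at_k(scores, true_codes, k=3))
```

In this toy case the two true codes are ranked first and third, so the score is below 1; a ranking that places all true codes at the top yields exactly 1.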
3.4 Baseline Models
We compare MPVAA against three categories of baseline models (the models in each category are listed in parentheses): 1) aggregation methods used to fuse the embeddings from the different views (CONCAT, AVG, SVD, DAEME, CAEME, AAEME), 2) state-of-the-art embedding models (M2V, GRAM, GV, W2V, LIN, NLOne, NLMLP), and 3) the use of RNN units (LSTMAE).
For the baselines, each visit of the patient sequence is represented as a list of medical concepts, and the baselines differ in how the embedding of each medical concept is learned. The vocabulary consists of all unique medical codes (i.e., ICD-9, NDC) present across all the patients’ records. For the baseline models in category 1), the learned inner-view embeddings of the medical concepts are used, as we intend to focus on the effect of the embedding aggregation method on performance.
CONCAT: The embeddings for the three views are simply concatenated to get the final embedding of each medical concept. We normalize each embedding before concatenation to ensure that each embedding contributes equally.
AVG: The embeddings for the three views are averaged to represent the final embedding of each medical concept. Normalization is similarly performed.
SVD: Consider a matrix C with one row per medical concept and k columns, where k is the dimension of the embedding that results from concatenating the normalized embeddings of each medical concept. Singular Value Decomposition (SVD) [9] is applied to C to get the decomposition C = UΣV^T. For each concept, the corresponding vector in U is taken as the SVD embedding.
DAEME: Decoupled Autoencoded Meta-Embedding [1] uses a separate encoder-decoder pair for each view. The encoded representations are concatenated, and the individual components are then reconstructed by the corresponding decoders. Each autoencoder is implemented as a single-layer neural network.
CAEME: Concatenated Autoencoded Meta-Embedding [1] is similar to DAEME, except that the reconstruction is done from the concatenation of the encoded representations.
AAEME: Averaged Autoencoded Meta-Embedding [1] is similar to CAEME, except that the reconstruction is done from the average of the encoded representations.
M2V: Med2Vec [3] is a two-layer neural network for learning lower-dimensional representations of medical concepts.
GRAM: [4] is a graph-based attention model augmented with knowledge from a medical ontology.
GV: GloVe [16] is an unsupervised approach to learning word embeddings based on a word co-occurrence matrix. In our case, the embedding matrix has one row per medical code and as many columns as the dimension of the GloVe embeddings.
W2V: The Skip-gram model [14] learns word representations based on the co-occurrence of words within a context window of a predefined size. In our case, the embedding matrix has one row per medical code and as many columns as the dimension of the Word2Vec embeddings.
LIN: A visit is represented with a binary, multi-hot encoding in which only the dimensions corresponding to the codes in the visit are set to 1. This vector is linearly transformed with an embedding matrix W of embedding dimension d. W is initialized with GloVe [16].
NLOne: It adds nonlinearity to the LIN visit representation by passing it through a nonlinear activation. We used the ReLU activation function.
NLMLP: It is the same as NLOne but has one more layer to increase expressivity.
LSTMAE: The patient sequence representation is learned through an autoencoder for each view. The encoder and decoder are implemented with RNN units (i.e., LSTM). The final patient representation is considered as the aggregation of the learned representations from all the views.
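The three simple aggregation baselines listed above (CONCAT, AVG and SVD) can be sketched in NumPy as follows. This is an illustrative sketch rather than the evaluated implementation: it assumes L2 normalization of each concept embedding, truncation of U to a chosen number of columns for the SVD variant, and random stand-in embeddings; all function names are hypothetical.

```python
import numpy as np


def l2_normalize(E):
    """Scale each row (concept embedding) to unit L2 norm (assumed normalization)."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, 1e-12)


def fuse_concat(views):
    """CONCAT: concatenate the normalized view embeddings of each concept."""
    return np.concatenate([l2_normalize(E) for E in views], axis=1)


def fuse_avg(views):
    """AVG: average the normalized view embeddings of each concept."""
    return np.mean([l2_normalize(E) for E in views], axis=0)


def fuse_svd(views, dim):
    """SVD: decompose the concatenation C = U S V^T and keep rows of U."""
    C = fuse_concat(views)
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    # Assumption: keep the first `dim` left singular vectors as the embedding.
    return U[:, :dim]


rng = np.random.default_rng(0)
n_concepts, d = 6, 4                         # illustrative sizes
# Random stand-ins for the dem, lab and notes inner-view embeddings.
views = [rng.normal(size=(n_concepts, d)) for _ in range(3)]

print(fuse_concat(views).shape, fuse_avg(views).shape, fuse_svd(views, d).shape)
```

CONCAT grows the embedding dimension with the number of views, whereas AVG and the truncated SVD keep it fixed, which is one practical difference among these baselines.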
We further compare the performance of the proposed MPVAA against ablated versions of the model to exhibit contributions from each component.
MMAA: The Mean-Max Attention Autoencoder [22], which uses mean-max attention during decoding.
MMVAA: The proposed model with a mean-max pooling operation on the encoded representation.
VAA: The proposed model without the mixed pooling operation on the encoded representation.
MPVAASin: The proposed model with positional encodings added to the input embeddings.
3.5 Evaluation Results
The experimental results for the heart failure prediction and sequential disease prediction tasks are shown in Figures 3 and 3 respectively. We compare the performance of the proposed model against the three categories of baseline representation learning models (i.e., 1), 2) and 3)). A key observation is that the proposed MPVAA consistently outperforms all baselines in all three categories for both tasks across all the metrics. We first evaluate how effective the different fusion approaches are in aggregating the embeddings from the different views. Among the general ensemble methods (i.e., CONCAT, AVG and SVD), performing a global projection on the concatenated embeddings through SVD gives better results than the other two across most metrics for both tasks. Among the baselines that implement an autoencoder (i.e., DAEME, CAEME and AAEME), CAEME and AAEME do better than their simple ensemble counterparts, CONCAT and AVG respectively. We attribute this to the dimensionality reduction they enforce through the reconstruction from the hidden representation, which embeds the key features of the input into the learned representation and establishes autoencoders as a competitive base model for representation learning. This justifies our choice of basing MPVAA on the autoencoder framework.
Against the second category of baselines, MPVAA shows performance gains of 13% and 3.5% over M2V and GRAM in the HF prediction and Sequential Disease Prediction tasks respectively. The good results of M2V and GRAM compared to the other baselines in this category can be attributed to the fact that they are specifically designed for learning representations from EHR. This suggests that training on domain-specific data (e.g., EHR) strengthens the generalizability of the representations compared to training on general data.
MPVAA relies completely on the attention mechanism to model patient representations. Its better performance than LSTMAE on both tasks supports this design choice over the use of recurrent units (i.e., LSTM). One reason for this could be that RNNs put emphasis on information towards the end of the sequence and hence are not able to connect information from past visits.
3.6 Ablation Study
To quantitatively evaluate the effect of the various components of MPVAA on model performance, we compare MPVAA against its variants (i.e., MMAA, MMVAA, VAA and MPVAASin), as presented in Figures 5 and 5 for the Heart Failure Prediction and Sequential Disease Prediction tasks respectively. All four variants perform worse than MPVAA on both tasks. VAA excludes the mixed pooling operation on the encoded representation and causes the largest decline in performance compared to the full model, MPVAA. MMVAA uses deterministic mean-max pooling instead of the stochastic approach used in MPVAA and is similarly outperformed. This indicates that the stochastic mixed pooling component is an integral part of the proposed MPVAA. We believe that, since MMAA does not incorporate the interactions among the patient information from different views while decoding, as opposed to the multi-view attention used in MPVAA, it is not able to embed the dependencies among the different views in the encoded embeddings. As mentioned earlier, the medical codes within each visit are unordered, so adding positional embeddings in MPVAASin does not improve performance.
4 Related Works
As healthcare data grows with the increasing hospital use of EHR, research on EHR has become an active area for mining useful associations between clinical variables. Such associations reveal information critical to fulfilling each patient’s medical needs and facilitate different real-world clinical predictive tasks. The cornerstone of effective use of EHR lies in robust representation learning that holistically captures semantic relations among heterogeneous medical entities. Feeding raw EHR data to predictive models as hand-engineered feature statistics (e.g., counts) is labor intensive, while binary one-hot vectors fail to capture hierarchical latent relationships; both limitations have paved the way for learning distributed vectors through deep learning. A regularized nonnegative Restricted Boltzmann Machine is formulated in [19] to embed the medical events in EHR in a low-dimensional space, while a stack of denoising autoencoders (AE) and a multi-layer perceptron (MLP) are used in [15] and [3] respectively to obtain patient vectors. More common has been the use of Recurrent Neural Networks (RNN) to capture the sequential nature of EHR records for different predictive tasks [2, 7]. [5] adds a two-level attention mechanism to an RNN to obtain a more interpretable representation. A graph-based attention model combined with an RNN is used in [4] to address data insufficiency by incorporating knowledge from a medical ontology. Unlike all the aforementioned works, MPVAA first learns graph-based inner-view embeddings from the different types of data in EHR, and then captures their cross-modal interactions with multi-view attention in a holistic representation, which has shown superior performance.
5 Conclusion
In this paper, we present an unsupervised framework, Mixed Pooling Multi-View Attention Autoencoder (MPVAA), for learning patient representations. To make the learning specific to each patient, inner-view graphs are constructed per patient. This facilitates better modeling by exploiting the complementary information from multiple data modalities in EHR. A combination of attention mechanisms (i.e., self-attention and multi-view attention) is employed to capture the interactions among the cross-modal features. Our experiments demonstrate that the proposed MPVAA model outperforms the state-of-the-art baselines on the Heart Failure Prediction and Sequential Disease Prediction tasks.
References
[1] D. Bollegala and C. Bao. Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661, 2018.
[2] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318, 2016.
[3] E. Choi, M. T. Bahadori, E. Searles, C. Coffey, M. Thompson, J. Bost, J. Tejedor-Sojo, and J. Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016.
[4] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun. GRAM: Graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
[5] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
[6] E. Choi, A. Schuetz, W. F. Stewart, and J. Sun. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv preprint arXiv:1602.03686, 2016.
[7] C. Esteban, O. Staeck, S. Baier, Y. Yang, and V. Tresp. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In 2016 IEEE International Conference on Healthcare Informatics (ICHI), pages 93–101. IEEE, 2016.
[8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
[9] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. In Linear Algebra, pages 134–151. Springer, 1971.
[10] A. E. Johnson, T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[13] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[15] R. Miotto, L. Li, B. A. Kidd, and J. T. Dudley. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6:26094, 2016.
[16] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[17] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, 2017.
[18] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[19] T. Tran, T. D. Nguyen, D. Phung, and S. Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of Biomedical Informatics, 54:96–105, 2015.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[21] D. Yu, H. Wang, P. Chen, and Z. Wei. Mixed pooling for convolutional neural networks. In International Conference on Rough Sets and Knowledge Technology, pages 364–375. Springer, 2014.
[22] M. Zhang, Y. Wu, W. Li, and W. Li. Learning universal sentence representations with mean-max attention autoencoder. arXiv preprint arXiv:1809.06590, 2018.
[23] S. Zhang, H. Tong, J. Xu, and R. Maciejewski. Graph convolutional networks: Algorithms, applications and open challenges. In International Conference on Computational Social Networks, pages 79–91. Springer, 2018.