Mixed Pooling Multi-View Attention Autoencoder for Representation Learning in Healthcare

by Shaika Chowdhury, et al.

Distributed representations have been used to support downstream tasks in healthcare recently. Healthcare data (e.g., electronic health records) contain multiple modalities of data from heterogeneous sources that can provide complementary information, alongside an added dimension to learning personalized patient representations. To this end, in this paper we propose a novel unsupervised encoder-decoder model, namely Mixed Pooling Multi-View Attention Autoencoder (MPVAA), that generates patient representations encapsulating a holistic view of their medical profile. Specifically, by first learning personalized graph embeddings pertaining to each patient's heterogeneous healthcare data, it then integrates the non-linear relationships among them into a unified representation through multi-view attention mechanism. Additionally, a mixed pooling strategy is incorporated in the encoding step to learn diverse information specific to each data modality. Experiments conducted for multiple tasks demonstrate the effectiveness of the proposed model over the state-of-the-art representation learning methods in healthcare.



1 Introduction

Distributed representations, also known as embeddings, have brought immense success to numerous Natural Language Processing (NLP) and Computer Vision tasks. Recent works in deep learning for healthcare data (e.g., EHR) [6, 19] emulated the concept of embedding by learning vector representations of medical concepts, ensuring that similar concepts form natural clusters and relationships in vector space [17]. Other works generated visit representations from the learned medical concept embeddings [3, 4]. Considering each patient as a sequence of these visits, a patient representation is then learned by optimizing a supervised learning task. However, large amounts of labeled healthcare data for predictive modeling are not available in practice, and manual labeling is laborious and expensive. To alleviate annotation effort, we propose an unsupervised model that relies only on large unannotated EHR data to generate patient representations. We call this model the Mixed Pooling Multi-View Attention Autoencoder (MPVAA).

In healthcare data (e.g., EHR), patient records may be available as heterogeneous data (e.g., demographics, laboratory results, clinical notes) that can provide an added dimension to learning personalized patient representations. For example, a patient with "diabetes" will have different attributes from a patient with "condition of heart failure", and attributes can further vary among patients with different types of "diabetes" (e.g., type I, type II). Prior works focused on learning representations from predominantly one data type (e.g., unstructured notes or structured clinical events), which excludes relevant information available in other modalities. For example, a patient's symptoms for a disease can be mentioned in physician notes, but may be missing from their structured clinical event data specified as medical codes. Therefore, to further improve information usage with heterogeneous data, in this work we treat the different data modalities in EHR as separate views (i.e., inner views) and first learn patient-specific medical concept embeddings with a graph autoencoder. Information from the embedding spaces of the different views is then fused through an attention mechanism to learn a unified patient representation.

Our attention autoencoder model, MPVAA, follows the encoder-decoder architecture. On the encoder side, a multi-layer Transformer encoder [20] is first applied to the input patient vector, followed by a mixed pooling strategy that combines mean and max pooling in a stochastic manner. As mean pooling and max pooling each have their own advantages and drawbacks [21], randomly weighting their importance yields a latent representation that encapsulates the most salient features of the patient vector while also capturing its general features. The decoder then reconstructs the input vector from the encoded representation through a multi-view attention mechanism that reconciles the interactions among the heterogeneous features.

Our contributions in this paper can be summarized as follows:

  • We propose a new architecture, MPVAA, for learning patient representations, in which the locally and globally relevant features present in the heterogeneous information associated with a patient are seamlessly integrated by facilitating interactions among cross-modal features. A multi-view graph is constructed for each patient to promote personalization of the learned representation.

  • MPVAA is unsupervised and can be easily generalized to other domains with heterogeneous data.

  • We evaluate MPVAA on a publicly available EHR dataset on two different tasks. The results demonstrate the effectiveness of the MPVAA model compared to the other state-of-the-art models.

2 Mixed Pooling Multi-View Attention Autoencoder

Our proposed model MPVAA, shown in Figure 1, is an instance of a sequence autoencoder based on sequence-to-sequence learning [18]. The sequence autoencoder first uses an encoder to read the input sequence step by step into a hidden representation, which is then passed through a decoder to recreate the input sequence. However, unlike traditional RNN autoencoders, MPVAA relies entirely on self-attention to model input and output sequences, without using RNNs or convolution. In particular, MPVAA employs a multi-head self-attention mechanism that extracts different aspects of the patient sequence into multiple vector representations. Augmenting this multi-head self-attention with a mixed pooling multi-view strategy further helps the model associate heterogeneous medical information with each patient to generate a comprehensive representation specific to that patient.

Each input patient sequence is represented as a sequence of visits, such that each visit is in turn a sequence of the medical concepts occurring during that visit. Each medical concept is drawn from a vocabulary of N unique medical concepts. In order to encapsulate the patient's heterogeneous information in the learned representation, we consider three different views: dem, lab and notes. In the following sections, we first discuss how the inner-view embeddings are obtained, and then elaborate on the MPVAA architecture used to generate patient representations.

Figure 1: Proposed MPVAA Model

2.1 Inner-View Embeddings

For each patient, we use graph autoencoder to get the relevant embeddings of the medical concepts in each view.

2.1.1 Preliminary: Graph Autoencoder

Use of graph-based neural network models like the graph convolutional network (GCN) [12] and the graph autoencoder (GAE) [13] is becoming prevalent for learning robust representations in applications ranging from social network analysis and bioinformatics to computer vision [23]. A GCN learns representations of the nodes of a graph with N nodes, generating an output matrix Z of dimensions N x F, where F is the number of output features (latent dimensions) per node. A feature matrix X of dimensions N x M, with M input features per node, and an adjacency matrix A of dimensions N x N are fed as inputs into the GCN so that both characteristics-based and structure-based node information is included. Each layer of a GCN can be summarized as

H^{(l+1)} = f(A H^{(l)} W^{(l)}),

where f is a non-linear activation function, H^{(0)} = X and H^{(L)} = Z, for a total of L layers. The function f and the weight matrices W^{(l)} aggregate locality information to learn the representation of each node. The graph autoencoder (GAE) is an unsupervised extension of the graph convolutional network that uses a GCN encoder and an inner-product decoder.
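The layer-wise propagation rule above can be sketched in a few lines of NumPy. This is a hedged illustration: `gcn_layer` and the `tanh` default are our naming and choices, not the authors' implementation.

```python
import numpy as np

def gcn_layer(H, A, W, activation=np.tanh):
    """One GCN propagation step: H_next = f(A @ H @ W).

    H: node representations (N x M), A: adjacency (N x N),
    W: trainable weights (M x F), f: non-linear activation.
    """
    return activation(A @ H @ W)
```

Stacking L such layers, with H^{(0)} = X, yields the output node matrix Z.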

2.1.2 Inner-View Graph Embeddings Using GAE

A graph is constructed for each of three types of EHR data, namely demographic information (dem), laboratory results (lab) and clinical notes (notes), with each considered a separate view. Each medical concept is a node in this inner-view network, with the similarity between two nodes forming an edge. We consider medical concepts from three different categories (i.e., disease, medication and procedure), which are extracted from the structured clinical codes.

Formally, for view i, the respective graph is G_i = (V_i, E_i) with N nodes, adjacency matrix A_i and feature matrix X_i. N is the total number of unique medical concepts, and we set M = N so that each node feature with respect to a view is defined as the similarity between two medical concepts m and m' in that view. The similarity is computed with the Dice coefficient as

DSC(m, m') = \frac{2\,|x_m \cap x_{m'}|}{|x_m| + |x_{m'}|},

where x_m and x_{m'} are the raw binary feature vectors of the respective medical concepts in that view. The more features two medical concepts share in that view, the closer their Dice coefficient is to 1.
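As a concrete sketch, the Dice similarity between two binary feature vectors can be computed as follows. `dice_coefficient` is our hypothetical name, and the convention of returning 0 for two empty vectors is our assumption.

```python
import numpy as np

def dice_coefficient(x, y):
    """Dice similarity between two binary feature vectors:
    DSC = 2 * |X intersect Y| / (|X| + |Y|).
    Returns 0.0 when both vectors are all-zero (our convention)."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    total = x.sum() + y.sum()
    if total == 0:
        return 0.0
    return 2.0 * np.logical_and(x, y).sum() / total
```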

The graph autoencoder for each view i employs a two-layer GCN, whose propagation rule is applied to obtain the output feature matrix Z. It is defined as (to avoid clutter, we describe a single view):

Z = \hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)})\, W^{(1)}

Here \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} is used for normalization, where \tilde{A} = A + I_N is the adjacency matrix with added self-connections, I_N is the identity matrix and \tilde{D} is the diagonal degree matrix of \tilde{A}. W^{(0)} and W^{(1)} are the trainable weight matrices of the first and second layers respectively. A in our case is a binary co-occurrence matrix of dimensions N x N, in which an entry of 1 indicates that the two medical concepts appeared within the same visit of the patient.
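The normalized two-layer propagation can be sketched in NumPy as below. This is a minimal illustration with hypothetical names; in practice W0 and W1 would be learned by backpropagation through the GAE objective rather than supplied.

```python
import numpy as np

def normalized_adjacency(A):
    """A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])        # add self-connections
    d = A_tilde.sum(axis=1)                 # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def two_layer_gcn(X, A, W0, W1):
    """Z = A_hat @ ReLU(A_hat @ X @ W0) @ W1 (the per-view GAE encoder)."""
    A_hat = normalized_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)     # ReLU
    return A_hat @ H @ W1
```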

2.1.3 Graph Construction Illustration for each View

To embed the heterogeneous hospital-encounter information specific to a patient, inner-view graphs are constructed per patient. N unique medical concepts, comprising diagnosis, procedure and drug codes, are first extracted from the structured clinical events in the patient's EHR records. These medical concepts are modeled as the nodes of the view-specific graphs G_dem, G_lab and G_notes, where |V| = N.

We describe the graph construction process for each view as:

dem: To get the feature vector for each medical concept m with respect to the dem view, the age, weight and gender features of the patient are considered. The values of each feature are discretized into categorical bins: age into {neonate, adult, middle, old}, gender into {male, female} and weight into {healthy, overweight, underweight}. For an occurrence of m in any visit of the patient, the entries for the corresponding demographic features found in the patient's visit record are set to 1 in an intermediate binary feature vector over the features {old, adult, neonate, middle, healthy, overweight, underweight, male, female}. Then DSC(m, m') is computed between the intermediate feature vectors of every pair of medical concepts to fill the corresponding entries of the feature vector.

lab: To get the feature vector for each medical concept m with respect to the lab view, (lab item, value) pair tuples are considered as features. As in the dem view, it is checked whether m occurred in any visit of the patient, and the entries for the corresponding laboratory-result features found in the patient's visits are set to 1 in an intermediate binary feature vector of length g, where g is the total number of (lab item, value) tuples. Then DSC(m, m') is computed between the intermediate feature vectors of every pair of medical concepts to fill the corresponding entries of the feature vector.

notes: To get the feature vector for each medical concept m with respect to the notes view, UMLS (https://www.nlm.nih.gov/research/umls/) Concept Unique Identifiers (CUIs) of the contextual words of m within a window w in the notes are considered as features. Notes of the patient containing an occurrence of m in any visit are first extracted, and the entries for the CUIs of contextual words of m appearing within w are set to 1 in an intermediate binary feature vector of length h, where h is the size of the CUI vocabulary for the words in the clinical notes. Then DSC(m, m') is computed between the intermediate feature vectors of every pair of medical concepts to fill the corresponding entries of the feature vector.
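Putting the three view constructions together, the per-view node feature matrix can be sketched as pairwise Dice coefficients over the intermediate binary vectors. This is a hedged sketch: `view_feature_matrix` and the input layout (one binary row per medical concept) are our assumptions.

```python
import numpy as np

def view_feature_matrix(binary_feats):
    """Given one intermediate binary feature vector per medical concept
    (the rows of `binary_feats`), fill the N x N view feature matrix with
    pairwise Dice coefficients, as described for the dem/lab/notes views."""
    F = np.asarray(binary_feats, dtype=bool)
    n = F.shape[0]
    X = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            total = F[i].sum() + F[j].sum()
            X[i, j] = 0.0 if total == 0 else 2.0 * (F[i] & F[j]).sum() / total
    return X
```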

2.2 Architecture

To fuse the inner-view embeddings, each medical concept in the patient sequence is first embedded into a d-dimensional vector taken from the medical concept feature matrix learned by the graph autoencoder for the corresponding view (i.e., the row of that matrix indexed by the concept). Unlike [20, 22], however, we do not add positional embeddings to the input patient embedding, as the medical concepts within a visit form an unordered set.

MPVAA has an encoder-decoder framework and captures the cross-modal features by exploiting three types of attention: encoder multi-head self-attention, encoder-decoder multi-view attention and decoder multi-head self-attention. With the encoder/decoder multi-head self-attention, the internal structure of the patient representation with respect to the dem view is captured by learning the dependencies of the medical concepts within its visits. As the dem view includes the general features of a patient and is relatively more static than the other two views, we feed its embeddings as the inputs into the encoder and decoder. The encoder-decoder multi-view attention, in turn, facilitates interactions among the dem, lab and notes views to generate the patient representation from a comprehensive view.

2.2.1 Preliminary: Multi-Head Self-Attention

The attention mechanism maps a query and a set of key-value pairs to an output [20]. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key. For self-attention, we use the scaled dot-product attention [20]:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V

With multi-head attention, this attention mechanism is run multiple times in parallel. It is defined as:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})

Here W_i^{Q}, W_i^{K}, W_i^{V} and W^{O} are parameter matrices to be learned.
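These standard Transformer equations [20] can be sketched in NumPy as follows. The function names are ours, and the per-head weight matrices would be learned in practice rather than supplied.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, WQ, WK, WV, WO):
    """Run the heads in parallel, concatenate them and project with WO.
    WQ, WK, WV are lists holding one projection matrix per head."""
    heads = [scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO
```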

2.2.2 Mixed Pooling Attention Encoder

The encoder, shown in Figure 1, converts the patient sequence embedding into a hidden representation of vectors using two sub-layers. The first sub-layer uses multi-head self-attention to attend jointly to information from different positions. The multiple hops of self-attention enable the model to learn multiple vector representations of the patient, each focusing on a different part of the patient's visit sequence. For example, each part can be a component capturing the related medical concepts within the visits, reflecting one semantic aspect of the patient's hospital profile; the overall semantics are then represented by the multiple vectors computed by the multi-head self-attention with its parameter matrices W_i^{Q}, W_i^{K} and W_i^{V}.

The second sub-layer is a fully connected feed-forward network that applies a non-linear activation (i.e., ReLU) to the linearly transformed outputs of the first sub-layer, followed by another linear transformation. We apply residual connections after each sub-layer, so the hidden representation H is produced with the following equations:

H' = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X, X, X))    (7)
H = \mathrm{LayerNorm}(H' + W^{(2)}\,\mathrm{ReLU}(W^{(1)} H' + b^{(1)}) + b^{(2)})    (8)

where W^{(1)} and W^{(2)} are parameter matrices; b^{(1)} and b^{(2)} are bias vectors; LayerNorm denotes layer normalization and ReLU is the activation function.

To aggregate the hidden representations into a fixed-dimensional vector, a mixed pooling strategy is performed to produce the mixed representation. It does so through a random weighting technique that measures the importance of the mean representation and the max representation. This generalized pooling approach enriches the expressiveness of the attention mechanism.

In max pooling, a max operation is applied over each dimension of the hidden representation vectors to extract the most salient feature at each time step, while mean pooling considers all the features and summarizes them into a global representation:

z^{max} = \max_{t=1,\ldots,T} H_t, \qquad z^{mean} = \frac{1}{T}\sum_{t=1}^{T} H_t

The mixed pooling strategy is defined as

z^{mix} = \lambda\, z^{mean} + (1 - \lambda)\, z^{max},

where \lambda is a random value between 0 and 1 indicating the weighted contributions of the mean pooling and max pooling methods in the final representation.
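A minimal sketch of the stochastic mixed pooling follows. The function name and the choice to sample the mixing weight uniformly per call are our reading of the text.

```python
import numpy as np

def mixed_pooling(H, rng):
    """z = lam * mean(H) + (1 - lam) * max(H), lam ~ Uniform(0, 1).

    H holds one hidden vector per time step (rows)."""
    lam = rng.uniform(0.0, 1.0)
    return lam * H.mean(axis=0) + (1.0 - lam) * H.max(axis=0)
```

Since the mixing weight lies in [0, 1], each coordinate of the result lies between the mean and the max of that coordinate across time steps.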

From the patient perspective, the mixed pooling operation simultaneously encodes the most activated dimension in the embedding space of each medical concept occurring in the visit sequence and provides a comprehensive capture of semantics across all the medical concept vectors. This can also greatly ease interpretability of the learned embeddings.

2.2.3 Multi-View Attention Decoder for Patient Representation

As we intend to learn general-purpose patient representations that should exhibit comparable performance across downstream healthcare tasks, it is important to integrate a holistic view of the heterogeneous data associated with each patient. Hence, given the mixed patient representation for the dem view, the purpose of the decoder is to reconstruct the input sequence in a way that also incorporates the relevant attributes of the lab and notes views. By connecting the encoder and decoder with a multi-view attention module, the embeddings from the different views are fused into the hidden representation of the patient sequence.

As shown in Figure 1, the input patient embedding is first shifted right to form the decoder input. Equations (7) and (8) are then employed in the decoder multi-head self-attention layer to obtain the decoder hidden representation.

Following the self-attention layer, multi-view attention is computed using the cross-view representation, parameterized by its own weight matrices and bias vectors.

The cross-view representation incorporates the interactions among all the views. To compute it, the weighted mixed representations of the other two views, lab and notes, are first applied to the decoder hidden representation with a non-linear activation (i.e., tanh), using view-specific parameter matrices. The result is then fed into a softmax layer, and the softmax output is combined with the hidden representation via element-wise multiplication. Essentially, generating the patient representation through this weighted mean, where each weight indicates the relevance of a view to the patient, captures the contributions of the heterogeneous data types in the final patient representation.

The probability of generating the whole patient sequence is calculated as the product of the per-step probabilities of its medical concepts, and the training objective is the sum of the log-probabilities of the input sequence itself:

\mathcal{L} = \sum_{t} \log P(x_t \mid x_{<t}, z)

where x_t is the t-th medical concept of the input sequence and z is the encoded patient representation. MPVAA learns to reconstruct the input sequence by optimizing this objective.

3 Experiments

3.1 Settings

We evaluate our model on the publicly available MIMIC-III dataset [10]. This database contains de-identified clinical records for 40K patients admitted to critical care units over 11 years. It contains a wide range of heterogeneous healthcare information such as demographics, laboratory test results, procedures, medications, diagnosis codes, nurse and physician notes, among others. ICD-9 codes for diagnosis/procedures and NDC codes for medications were extracted from patients with at least two visits to construct each graph. Tables 1 and 2 outline other statistics about the data.

# of patients 7,499
# of visits 19,911
avg. # of visits per patient 2.66
# of unique ICD9 codes 4,893
avg. # of codes per visit 13.1
max # of codes per visit 39
Table 1: Data Statistics Summary
# unique clinical codes 11,135
# of unique ICD-9 diagnoses codes 4,894
# of unique ICD-9 procedure codes 2,032
# of unique NDC medication codes 4,209
Table 2: Summary of Clinical Codes used

3.2 Implementation and Training Details

For the dem and lab views, the feature matrix for each medical concept in the graph autoencoder is constructed from general patient information (age, weight, gender from the ADMISSIONS table) and laboratory measurement results (from the LABEVENTS table) in MIMIC-III, respectively, that occurred with the concept. For the notes view, Unified Medical Language System (UMLS) concepts, obtained with MetaMap, for the context within a window of the medical concept mention are considered as the features.

For the dem and lab views, there were no instances with missing features, and each medical concept node in the graph had all feature-value pairs. In the case of the notes view, if the name of the medical concept was not found in the notes, the context of other co-occurring concepts was considered as its features. We used only "discharge summary" notes, and the window w was set to 2.

We used 5 parallel attention heads and set the final patient representation dimension to 500 for fair comparison against the baselines. The proposed model is trained using Adam [11] in minibatches of 64 with a learning rate of 0.001. is shared among all the views.

3.3 Evaluation Tasks and Metrics

The learned patient representations are evaluated on the following two extrinsic medical tasks:

Outcome Prediction: This is a binary prediction task that predicts whether the patient is at risk of developing a disease in the next visit, trained on the visit embedding sequence up to the current visit. We focus on patients with heart failure (HF). We therefore examine only patients with at least two visits and check whether their record contains an occurrence of heart failure; these are the instances of the positive class (HF). As the average number of visits per patient in MIMIC-III is about 2, the prediction horizon in our case is one visit. A binary logistic regression classifier is trained and tested on this task. The train/test/validation split for positive instances is 1113/186/186, and the same split is applied to an equal number of negative instances, formed from patients who have no HF code in their record up to that visit.


Sequential Disease Prediction: The goal is to predict all the diagnosis codes of the next visit at every time step, having trained on the visit sequence up to the current visit. It can be considered a multi-label classification task.

We report performance on the Heart Failure Prediction task using AUC-ROC, AUC-PR and accuracy, while the Sequential Disease Prediction task is evaluated in terms of Normalized Discounted Cumulative Gain (NDCG). To evaluate the quality of the predicted diagnosis codes, they are first sorted by their prediction scores, and NDCG is measured at cutoffs k = 5, 15 and 25 against the true diagnosis codes.

NDCG is computed as the ratio of DCG to the ideal DCG (IDCG), defined by the following equations:

\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}

where rel_i is the relevance of the i-th element of the predictions sorted by score, judged against the list of true diagnosis codes, and IDCG@k is the DCG@k of the ideal ranking of the true codes.
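With binary relevance (a predicted code is relevant iff it occurs among the true next-visit codes), NDCG@k can be sketched as below. `ndcg_at_k` and the argument layout are our assumptions, not the paper's exact evaluation code.

```python
import numpy as np

def ndcg_at_k(true_codes, scores, code_vocab, k):
    """NDCG@k with binary relevance: code_vocab[i] gets score scores[i];
    a predicted code is relevant iff it appears in true_codes."""
    order = np.argsort(scores)[::-1][:k]                 # top-k by score
    rel = [1.0 if code_vocab[i] in true_codes else 0.0 for i in order]
    dcg = sum(r / np.log2(i + 2) for i, r in enumerate(rel))
    ideal = [1.0] * min(len(true_codes), k)              # best possible ranking
    idcg = sum(r / np.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```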

Figure 2: Performance on Heart Failure (HF) Prediction task in AUC-ROC, Accuracy and AUC-PR
Figure 3: Performance on Sequential Disease Prediction task in NDCG@k=25,15,5

3.4 Baseline Models

We compare MPVAA against three categories of baseline models (listed inside parentheses): 1) aggregation methods used to fuse the embeddings from the different views (CONCAT, AVG, SVD, DAEME, CAEME, AAEME); 2) state-of-the-art embedding models (M2V, GRAM, GV, W2V, LIN, NL-One, NL-MLP); 3) use of RNN units (LSTM-AE).

For the baselines, each visit of the patient sequence is represented as a list of medical concepts, and the baselines differ in how the embedding of each medical concept is learned. We take N to be the total number of unique medical codes (i.e., ICD-9, NDC) present across all patient records. For the baseline models in category 1), the learned inner-view embeddings of the medical concepts are used, as we intend to isolate the effect of the embedding aggregation method on performance.

CONCAT: The embeddings for the three views are simply concatenated to get the final embedding of each medical concept. We normalize each embedding before concatenation to ensure that each embedding contributes equally.

AVG: The embeddings of the three views are averaged to represent the final embedding of each medical concept; the same normalization is performed.

SVD: Consider a matrix C of dimension N x k, where k is the dimension of the embedding resulting from the concatenation of the normalized embeddings of each medical concept. Singular Value Decomposition (SVD) [9] is applied to C to obtain the decomposition C = U \Sigma V^{T}. For each concept, the corresponding row vector in U is taken as the SVD embedding.
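A sketch of this SVD meta-embedding baseline: the helper name and the use of `numpy.linalg.svd` are ours, and the per-view l2 normalization before concatenation follows the description above.

```python
import numpy as np

def svd_meta_embedding(view_embeddings, dim):
    """Concatenate l2-normalized per-view embeddings into C (N x k)
    and take the first `dim` columns of U from C = U S V^T as the
    per-concept SVD embeddings."""
    normed = [E / np.linalg.norm(E, axis=1, keepdims=True)
              for E in view_embeddings]
    C = np.concatenate(normed, axis=1)
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :dim]
```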

DAEME: Decoupled Autoencoded Meta-Embedding [1] uses a separate encoder-decoder pair for each view. The encoded representations are concatenated, and the individual components are then reconstructed by the corresponding decoders. Each autoencoder is implemented as a single-layer neural network.

CAEME: Concatenated Autoencoded Meta-Embedding [1] is similar to DAEME, except that the reconstruction is done from the concatenation of the encoded representations.

AAEME: Averaged Autoencoded Meta-Embedding [1] is similar to CAEME, except that the reconstruction is done from the average of the encoded representations.

M2V: Med2Vec [3] is a two-layer neural network for learning lower dimensional representations for medical concepts.

GRAM: [4]

is a graph-based attention model augmented with knowledge from medical ontology.

GV: GloVe [16] is an unsupervised approach to learning word embeddings from a word co-occurrence matrix. In our case, the embedding matrix is of dimension N x d, where d is the dimension of the GloVe embeddings.

W2V: The Skip-gram model [14] learns word representations from the co-occurrence of words within a context window of predefined size. In our case, the embedding matrix is of dimension N x d, where d is the dimension of the Word2Vec embeddings.

LIN: A visit is represented with a binary multi-hot encoding; only the dimensions corresponding to the codes occurring in the visit are set to 1. This vector is linearly transformed with an embedding matrix W of embedding dimension d. The embeddings in W are initialized with GloVe [16].

NL-One: It adds non-linearity to the visit representation by passing it through a non-linear activation; we used the ReLU activation function.

NL-MLP: It is the same as NL-One but has one more layer to increase expressivity.

LSTM-AE: The patient sequence representation is learned through an autoencoder for each view. The encoder and decoder are implemented with RNN units (i.e., LSTM). The final patient representation is considered as the aggregation of the learned representations from all the views.

We further compare the performance of the proposed MPVAA against ablated versions of the model to exhibit contributions from each component.

MMAA: The Mean-Max Attention Autoencoder [22], which uses mean-max attention during decoding.

MMVAA: The proposed model with a deterministic mean-max pooling operation on the encoded representation.

VAA: The proposed model without the mixed pooling operation on the encoded representation.

MPVAA-Sin: The proposed model with sinusoidal positional encodings added to the input embeddings.

3.5 Evaluation Results

The experimental results for the heart failure prediction and sequential disease prediction tasks are shown in Figures 2 and 3 respectively. We compare the performance of the proposed model and the three categories of baseline representation learning models. A key observation is that the proposed MPVAA consistently outperforms all baselines in all three categories for both tasks across all metrics. We first evaluate how effective the different fusion approaches are at aggregating the embeddings from the different views. Among the general ensemble methods (i.e., CONCAT, AVG and SVD), performing a global projection on the concatenated embeddings through SVD gives better results across most metrics than the other two for both tasks. Comparing the autoencoder-based baselines (i.e., DAEME, CAEME and AAEME), they do better than their simple ensemble counterparts, CONCAT and AVG. We attribute this to the dimensionality reduction they enforce through reconstruction of the hidden representation, which embeds the key features of the input into the learned representation and establishes the autoencoder as a strong base model for representation learning. This justifies our choice of basing MPVAA on the autoencoder framework.

Compared against the second category of baselines, MPVAA shows 13% and 3.5% performance gains over M2V and GRAM on the HF prediction and sequential disease prediction tasks respectively. The good results of M2V and GRAM relative to the other baselines in this category can be attributed to the fact that they are specifically designed for learning representations from EHRs; training on domain-specific data (e.g., EHR) strengthens the generalizability of the representations compared to training on general data.

MPVAA relies entirely on the attention mechanism to model patient representations. Its better performance on both tasks than LSTM-AE supports this choice over the use of recurrent units (i.e., LSTM) in LSTM-AE. One reason could be that RNNs emphasize information towards the end of the sequence and are hence unable to connect information from past visits.

Figure 4: Ablation Performance of MPVAA on Heart Failure (HF) Prediction in AUC-ROC, Accuracy and AUC-PR
Figure 5: Ablation Performance of MPVAA on Sequential Disease Prediction in NDCG@k=25,15,5

3.6 Ablation Study

To quantitatively evaluate the effect of the various components of MPVAA on model performance, we compare MPVAA against its variants (i.e., MMAA, MMVAA, VAA and MPVAA-Sin), as presented in Figures 4 and 5 for the Heart Failure Prediction and Sequential Disease Prediction tasks respectively. All four variants perform worse than MPVAA on both tasks. VAA, which excludes the mixed pooling operation on the encoded representation, causes the largest decline in performance relative to the full model, while MMVAA, which uses deterministic mean-max pooling instead of the stochastic approach of MPVAA, is similarly outperformed. This establishes the stochastic mixed pooling component as an integral part of the proposed model. We believe that, since MMAA does not incorporate interactions among the patient information from different views during decoding, in contrast to the multi-view attention of MPVAA, it is unable to embed the dependencies among the views in the encoded embeddings. As mentioned earlier, the medical codes within each visit are unordered, so adding positional embeddings does not improve performance with MPVAA-Sin.

4 Related Works

As healthcare data grows with the increasing hospital adoption of EHR systems, mining EHR for useful associations among clinical variables has become an active research area, both to uncover information critical to each patient's medical needs and to facilitate real-world clinical prediction tasks. The cornerstone of effective use of EHR lies in robust representation learning that holistically captures semantic relations among heterogeneous medical entities. Feeding raw EHR data into predictive models, whether as hand-engineered feature statistics (e.g., counts) or as one-hot binary vectors, is labor intensive in the former case and fails to capture hierarchical latent relationships in the latter, paving the way for learning distributed vectors through deep learning. A regularized nonnegative Restricted Boltzmann Machine is formulated in [19] that embeds the medical events in EHR into a low-dimensional space, while a stack of denoising autoencoders (AE) and a multi-layer perceptron (MLP) are used in [15] and [3], respectively, to obtain patient vectors. More common has been the use of Recurrent Neural Networks (RNN) to capture the sequential nature of EHR records for different predictive tasks [2, 7]. [5] adds a two-level attention mechanism to an RNN for more interpretable representations. A graph-based attention model combined with an RNN is used in [4] to address the data-insufficiency issue by augmenting knowledge from a medical ontology. Unlike all of the aforementioned works, MPVAA first learns graph-based inner-view embeddings from the different types of data in EHR, and then captures their cross-modal interactions through multi-view attention into a holistic representation, which has shown superior performance.
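The cross-modal fusion via multi-view attention described above can be illustrated in miniature: per-view embeddings are scored, the scores are normalized with a softmax, and the views are combined as a weighted sum. This is a minimal sketch assuming a single learned scoring vector; MPVAA's actual attention parameterization may differ, and the function names here are illustrative only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_view_attention(views, w):
    """Fuse per-view embeddings into one patient vector.

    views: (n_views, dim) matrix, one embedding per data modality
    w: (dim,) learned scoring vector (hypothetical parameterization)
    Returns a (dim,) vector: the attention-weighted sum of the views.
    """
    scores = views @ w        # one relevance score per view
    alpha = softmax(scores)   # attention weights sum to 1 across views
    return alpha @ views      # weighted combination of view embeddings
```

With a zero scoring vector the weights are uniform and the fusion reduces to a plain average of the views, which makes the role of the learned scores easy to see.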

5 Conclusion

In this paper, we present an unsupervised framework, Mixed Pooling Multi-View Attention Autoencoder (MPVAA), for learning patient representations. To personalize learning to each patient, inner-view graphs are constructed; this facilitates better modeling by exploiting the complementary information across the multiple data modalities in EHR. A combination of powerful attention mechanisms (i.e., self-attention and multi-view attention) is employed to capture the interactions among cross-modal features. Comprehensive experiments demonstrate that the proposed MPVAA model outperforms state-of-the-art baselines on the Heart Failure Prediction and Sequential Disease Prediction tasks.


  • [1] D. Bollegala and C. Bao. Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661, 2018.
  • [2] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun. Doctor ai: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318, 2016.
  • [3] E. Choi, M. T. Bahadori, E. Searles, C. Coffey, M. Thompson, J. Bost, J. Tejedor-Sojo, and J. Sun. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1504. ACM, 2016.
  • [4] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun. Gram: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
  • [5] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
  • [6] E. Choi, A. Schuetz, W. F. Stewart, and J. Sun. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv preprint arXiv:1602.03686, 2016.
  • [7] C. Esteban, O. Staeck, S. Baier, Y. Yang, and V. Tresp. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In Healthcare Informatics (ICHI), 2016 IEEE International Conference on, pages 93–101. IEEE, 2016.
  • [8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
  • [9] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. In Linear Algebra, pages 134–151. Springer, 1971.
  • [10] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.
  • [11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [13] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • [14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [15] R. Miotto, L. Li, B. A. Kidd, and J. T. Dudley. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific reports, 6:26094, 2016.
  • [16] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [17] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi. Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis. IEEE journal of biomedical and health informatics, 22(5):1589–1604, 2017.
  • [18] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • [19] T. Tran, T. D. Nguyen, D. Phung, and S. Venkatesh. Learning vector representation of medical objects via emr-driven nonnegative restricted boltzmann machines (enrbm). Journal of biomedical informatics, 54:96–105, 2015.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [21] D. Yu, H. Wang, P. Chen, and Z. Wei. Mixed pooling for convolutional neural networks. In International Conference on Rough Sets and Knowledge Technology, pages 364–375. Springer, 2014.
  • [22] M. Zhang, Y. Wu, W. Li, and W. Li. Learning universal sentence representations with mean-max attention autoencoder. arXiv preprint arXiv:1809.06590, 2018.
  • [23] S. Zhang, H. Tong, J. Xu, and R. Maciejewski. Graph convolutional networks: Algorithms, applications and open challenges. In International Conference on Computational Social Networks, pages 79–91. Springer, 2018.