1 Related Work
In this section, we survey the relevant works in (1) visual analytics of healthcare data and similar data records and (2) methods to measure the similarity of medical records.
1.1 Visual Analytics of Healthcare and Similar Data Records
Recognizing the growing availability of healthcare data from EMRs and Electronic Health Records (EHRs), researchers have developed various visual analytics methods. These methods have focused on a wide range of topics, such as clinical decision support systems (CDSS) [20, 3], interpretable machine learning for medical decisions [27, 19, 34, 29], and exploration of disease internal progressions [12, 26].
Among the works above, temporal event-sequence visualizations of healthcare data are most closely related to our work. These works focus on revealing frequent patterns of disease progression by summarizing EMR/EHR data into flow-based representations [37, 35]. Furthermore, to clearly display a sequence pattern, Guo et al. [13, 12] segmented sequences into latent stages to infer the disease progression. Also, the methods by Monroe et al. and Gotz et al. both highlight key events and filter out unimportant ones to reduce visual complexity. These works can help clinicians understand the transition patterns of a certain disease. Instead of analyzing the temporal event sequence as it is, other works have tried to extract meaningful information from the sequence. For example, OutFlow [48, 49] connects a sequence with the subsequent outcome (e.g., death or survival; a better or worse condition) of the corresponding patient. CarePre predicts the next medical events (e.g., potential diseases a patient may have) from the sequence and provides interactive visualizations to help understand the prediction with the patient’s historical medical records. Also, similar to ours, to fulfill the clinicians’ practical demands identified through the interview, CarePre supports a comparison of medical records between one selected patient and similar patients.
When including applications that focus on a more general domain, there are several works visualizing similar historical records of people to help them make decisions in their lives (e.g., which actions Ph.D. students should take to be a professor ). For example, from multiple sets of records, EventAction  finds similar sub-sequences by using a fixed size of the sliding window. Then, EventAction provides a functionality of comparing the current user’s records with the sub-sequences related to the desired outcome (e.g., becoming a professor) with what-if analysis. In PeerFinder  and LikeMeDonuts , instead of fully automatic selection of similar records, Du et al. [8, 9] provided the selection of similarity criteria which can be interactively changed, and they visualized similar records with tables  or sunburst diagrams .
Among the existing works above, the works in [7, 8, 9, 20] are most closely related to our work in terms of providing visual analytics methods for identifying similar records. However, the works [7, 8, 9] did not specifically target medical records and their similarity calculations do not consider the data characteristics of EMRs (i.e., irregularity and sparsity). On the other hand, while CarePre  is designed for medical records, their similarity calculation uses dynamic-time warping (DTW). As described in Sec. 3, using DTW has a limited capacity to handle the irregularity. Another major difference is that our visual analytics system provides effective functionalities focusing on analyzing similar medical records (refer to Sec. 4).
1.2 Similarity Calculation for Medical Records
Here, we describe methods that can be used for calculating the similarity of temporal sequences with different lengths, including medical records. These methods can be categorized into two types: elastic measures using dynamic programming (DP) and methods utilizing machine learning (ML).
The methods in the first category find the best alignment between two different sequences and calculate the similarity between the sequences by aggregating the distances between the aligned event pairs. Dynamic time warping (DTW), longest common subsequence (LCSS), edit distance with real penalty (ERP), and edit distance on real sequences (EDR) are in this category. DTW aligns two temporal sequences with the best matching and uses the sum-of-pairs distance for the aligned series to denote sequence similarity. As DTW is vulnerable to noise events (i.e., noise may lead to a large distance between two similar sequences), LCSS and EDR address this issue by skipping noise events. While DTW, LCSS, and EDR are non-metric distance functions, ERP aims to provide a metric distance function by introducing a constant reference event for similarity calculation. However, as discussed for DTW in prior work, the methods above do not show strong advantages in handling sequences of different lengths.
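As a minimal sketch of the first category, the following Python function computes a DTW distance with the sum-of-pairs aggregation described above. The function name `dtw_distance` and the absolute-difference event distance are illustrative assumptions, not the implementation used in the cited works.

```python
import math


def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Sum-of-pairs DTW distance between two sequences (hypothetical helper)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated event is absorbed by the alignment, whereas a single noise event, as in `[1, 9, 2, 3]`, dominates the distance. This is the sensitivity that LCSS and EDR aim to reduce.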
To address the above problem, ML-based methods have been developed. Most ML-based methods utilize seq2seq as their basis. seq2seq consists of two major parts, namely an encoder and a decoder. The encoder embeds an input sequence into a fixed-size latent vector using one recurrent neural network (RNN), and then the decoder, using another RNN, generates the target time series based on this vector. seq2seq obtains the best representation (i.e., the latent vector) by minimizing the error between the inputs and outputs. Then, with the latent vector for each input sequence, the similarity of each pair of sequences can be calculated with a certain distance metric (e.g., Euclidean or cosine distance).
However, both types of methods above only consider the order of events within a sequence. For example, these methods do not consider the time interval between events. Medical records of different patients often have different time intervals. Also, records of one patient often have varied time intervals. Another example is that these methods do not account for the correlations between events within the same patient’s records. Two different events that happened at different times may have a high correlation (e.g., a patient may have recurring symptoms after a certain period of time). Based on these observations, we extend the LSTM autoencoder framework by integrating the self-attention mechanism  to better capture the medical records’ temporal patterns.
2 Background and Analysis Targets
This section describes the characteristics of EMR data and analytic targets, which dictate the requirements for our model and visualization system.
2.1 General Characteristics of EMR Data
As described in , the general characteristics of EMR data make data analysis challenging. We describe three critical aspects we address in this work.
- High dimensionality
EMR data typically consists of a large number of medical features, such as multiple medical tests (e.g., serum calcium concentration), medications, diagnoses, and procedures.
- Irregularity in time
The irregularity of EMR data is caused by the fact that each patient’s medical features are recorded only when they visit the hospital. As a result, each patient’s records, which can be represented as a temporal sequence, have different intervals between each pair of events and also often have different lengths.
- High proportion of missing data & data sparsity
EMR data often suffers from a high proportion of missing data. This can be caused by either data collection problems (i.e., patients are only checked for certain medical considerations) or documentation problems (i.e., machine breakdowns or human data-entry errors). Apart from a high proportion of missing data, data sparsity is another general characteristic of EMR data. Sparsity is unavoidable since most patients visit the hospital only a few times and usually take only a small subset of medical examinations and treatments.
2.1.1 Description of Data Used in Our Study
The dataset used in our study contains medical test records for 854 neonates in the Neonatal Intensive Care Unit (NICU) of an academic regional medical center in the United States. Their records were collected during the years 2015 and 2016. Each neonate’s records consist of laboratory testing results obtained during their multiple hospital visits and NICU stays. While 239 different medical tests are recorded, each neonate took only a subset of these medical tests per hospital visit (45 medical tests on average). Therefore, this data shares the general characteristics of EMR data (i.e., 239 dimensions, different time spans and lengths, and high sparsity). We use this dataset as an example of EMR data throughout our work. However, we should note that our model and visual analytics system are designed to be applicable to other EMR datasets as well.
2.2 Target Analysis Tasks
Our work’s general target is to support the analysis of similar patients’ medical records. As mentioned, analyzing similar records is particularly important when the clinicians make decisions for their patients . Under this general target, we set several detailed analysis targets below.
T1: Find similar medical records. The most fundamental task we need to support is finding similar medical records. Based on the user-selected focal patient’s records, the system should provide the most similar ones (e.g., top-3 similar records) from other patients’ records.
T2: Find (dis)similar time points among similar records. Even though the overall similarities are high between the focal patient’s and similar patients’ records, the similarities might vary along with their clinical progressions. Especially, finding dissimilar time points is important for medical decisions. For example, the focal patient may have had a sudden increase in blood pressure and could not be treated with specific medicines used to treat the other similar patients.
T3: Compare medical feature values at a certain time point. This task is to supplement T2. As mentioned in the example in T2, the user would want to know the reason why the patients are similar or different at some specific time points.
T4: Understand general tendencies in similar records’ medical feature values. Another important task is to understand which medical features are generally (dis)similar between the similar records across time points, and how. For example, through this task, the user could find that all of these patients have relatively high lymphocyte counts while only the focal patient has low red cell counts in his/her blood.
T5: Compare specific medical feature’s values across time. Once the user finds a medical feature of which the focal patient has a different tendency (e.g., low red cell counts) through T3 or T4, the user would want to know more details about the corresponding feature (e.g., whether the patient’s red cell counts are constantly low or rapidly decreased at some time point).
3 Similarity Calculation with Sequence Embedding
We present our methodology for similarity calculation of medical records with different lengths in this section. An overview of the flow of our similarity calculation is shown in Figure 1. We use two-step embeddings to convert medical records to fixed-length vectors. The first step is (1) event embedding, where each event consisting of multiple medical feature values is converted into a vector. Then, we apply (2) sequence embedding, where each sequence consisting of multiple embedded vectors obtained in (1) is converted into a vector. In order to deal with the high-dimensionality and sparsity of the data (refer to Sec. 2), we use the first step to learn lower-dimensional latent representations of the original features. The second step is used to handle the irregularity of the data. Lastly, after obtaining fixed-length vectors with these two steps, we compute the similarity of each pair of these vectors with a certain distance metric, such as the Euclidean distance.
While the existing works for analyzing medical records [12, 20] also embedded each event into a latent vector, there are two major differences from our model. First, they handled only medical events (e.g., diagnosing insulin) without additional information (e.g., how much insulin is diagnosed). This information is important for analyzing many medical records. For example, in our dataset, while the neonates took a subset of tests at each recorded time, we should consider not only which tests they took (e.g., measuring the amount of lymphocytes in blood) but also the test results (e.g., 1,000 lymphocytes per ).
Second, Jin et al. calculated the similarity of two medical records with DTW after the event embedding instead of going through our second embedding step. As discussed in Sec. 1.2, DP-based distance measures, including DTW, have a limited capacity to deal with the irregularity of medical records.
3.1 Data Preprocessing
EMR data often has a high proportion of missing values and/or high sparsity. Therefore, before applying the embeddings, we need to fill in the missing values. We adopt mean imputation, which is the most widely used method. By learning low-dimensional latent representations from the imputed results with the embedding described in Sec. 3.2, the influence of the mean imputation on the similarity calculation of medical records can be moderated.
However, we should note that mean imputation still may bring a problem of underestimated variability. In our case, this problem causes the possibility that our method judges two neonates who have taken a quite different set of tests to have high similarity. Rather than designing a complex algorithm for this problem, we provide an interface for interactive filtering based on the similarity of tests taken, as described in Sec. 4.2.
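To make the preprocessing step concrete, the sketch below shows one plausible form of mean imputation over a table of events, where each row is an event and each column is a medical test. The function name `mean_impute`, the use of `None` for missing values, and the fallback for never-observed columns are assumptions made for illustration only.

```python
def mean_impute(rows):
    """Replace None entries with the per-column mean of the observed values."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # Assumption: a column that was never observed falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    # Build a new table so the original records are left untouched.
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]


# Three events, two tests; None marks a test that was not taken.
events = [[1.0, None], [3.0, 4.0], [None, 8.0]]
imputed = mean_impute(events)
```

Because every missing entry receives the same column mean, two records that share few observed tests can end up with many identical values, which is the underestimated-variability issue noted above.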
3.2 Event Embedding
For the first embedding, we use a basic autoencoder, a neural network consisting of three layers (i.e., input, hidden, and output layers). We use this embedding to compress each event’s high-dimensional medical features to a lower-dimensional representation: given the full set of medical features stored in the medical records, the event embedding produces a smaller number of latent features for each event (i.e., each time step). This is essentially similar to applying principal component analysis (PCA) to reduce the total number of features and avoid the ‘curse of dimensionality’. In line with the original purpose of the autoencoder, it performs nonlinear PCA on a dataset, which is more suitable for EMR data due to its high dimensionality and complexity.
In EMR data, for example, each event corresponds to a unique hospital visit by one patient and consists of multiple medical test results. After the embedding, each event of a patient is represented by a fixed-length latent vector, paired with the time at which the event occurred. The original sequence of a patient is thus converted to a sequence of event vectors whose length equals the number of events recorded for that patient.
For our dataset of the neonates’ medical test records, 239 different types of medical tests are recorded in total. By embedding these events, we obtained latent vectors of size 32 for each time point.
3.3 Sequence Embedding
For the second embedding, we use the seq2seq framework. As explained in Sec. 1.2, the seq2seq framework has been developed to generate an output sequence from an input sequence. For example, an English sentence and a translated French sentence could be input and output sequences, respectively. One major strength of the seq2seq framework is that the input (and/or output) sequences can have different lengths because it uses RNNs as both encoder and decoder. Specifically, we use RNNs with LSTM units to handle long-range temporal dependencies. We call the seq2seq with LSTM units the LSTM autoencoder in the ensuing description. In our case, with the LSTM autoencoder, we obtain a fixed-length latent vector from inputs of different lengths (i.e., medical records).
Additionally, we incorporate the self-attention mechanism [2, 32] into the LSTM autoencoder. The self-attention mechanism is known for improving the learning performance of seq2seq. More importantly, we use the self-attention mechanism to model the time intervals between events and the correlations among events. To achieve this, we enhance the self-attention mechanism by providing control over the attention weight based on the time interval. We call this enhanced self-attention mechanism the temporal self-attention.
3.3.1 LSTM Autoencoder with the Self-Attention Mechanism
Before describing our temporal self-attention, we briefly introduce the original self-attention mechanism used in the LSTM autoencoder, shown in Figure 2a. Figure 2a depicts the input sequences, output sequences, and embedded sequences. Specifically, the input sequence $x$ is the medical time sequence after the first embedding, without time information; the output sequence $y$ is the reconstruction of $x$; and the embedded sequence is the fixed-length vector representation of the input sequence that we want to obtain.
To predict the $i$-th vector $y_i$ of the output, the model refers to $y_{i-1}$, the LSTM autoencoder’s $i$-th hidden state $s_i$, and the fixed-length latent vector $c_i$, called the context vector. With the self-attention mechanism, $c_i$ is computed with a sequence of “annotations” $(h_1, \dots, h_T)$. Each annotation $h_j$ is a hidden state computed from the input sequence and contains information about $x$ with a strong focus on the parts surrounding $x_j$. $c_i$ can be computed as a weighted sum of the annotations $h_j$:
$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j, \qquad (2)$$
where $\alpha_{ij}$ is the weight of annotation $h_j$. Specifically, $\alpha_{ij}$ is computed by
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}, \qquad (3)$$
$$e_{ij} = a(s_{i-1}, h_j). \qquad (4)$$
Here, $a$ is a neural network generating $e_{ij}$ from hidden state $s_{i-1}$ and annotation $h_j$. Essentially, $\alpha_{ij}$ shows how much attention the model should pay to the information around $x_j$ when it predicts the $i$-th output vector $y_i$. Refer to the cited work for more details about their model.
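The weighted sum described above can be sketched in a few lines of plain Python: alignment scores for one output step are normalized with a softmax to obtain attention weights, which are then used to average the annotations into a context vector. The function name `attention_context` and the list-based vectors are illustrative assumptions; in the actual model the scores would come from a learned neural network rather than being supplied directly.

```python
import math


def attention_context(scores, annotations):
    """Softmax the scores for one output step and return (weights, context)."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(annotations[0])
    # Context vector = attention-weighted sum of the annotation vectors.
    context = [sum(w * h[d] for w, h in zip(weights, annotations))
               for d in range(dim)]
    return weights, context


weights, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With equal scores, both annotations receive the same weight, so the context vector is simply their average.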
3.3.2 Temporal Self-Attention
Figure 2b shows the LSTM autoencoder with the temporal self-attention. A fundamental difference from the original model is the term used in Eq. 3: our model computes the weight with consideration of the time interval between the input and output events, where a floor function converts the real-valued time interval to an integer. To control the precision of the difference of time intervals, the unit of time (e.g., a second, hour, or day) should be decided depending on the type of medical records or analysis target. Now, with Eq. 5, 6, and 7, the model can consider the time interval and/or the temporal correlation among events during the sequence embedding.
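The role of the floor function and the time unit can be illustrated with a small sketch. It only shows how a real-valued time difference might be discretized into an integer number of units; it does not reproduce the temporal self-attention equations themselves. The function name `interval_bucket`, the use of timestamps in seconds, and the absolute difference are assumptions made for this example.

```python
import math

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def interval_bucket(t_a, t_b, unit_seconds):
    """Whole number of time units between two timestamps given in seconds."""
    # Assumption: the interval is symmetric, so the absolute difference is used.
    return math.floor(abs(t_a - t_b) / unit_seconds)


# Two events that are 25 hours apart.
gap_in_days = interval_bucket(0, 90000, SECONDS_PER_DAY)
gap_in_hours = interval_bucket(0, 90000, SECONDS_PER_HOUR)
```

A coarser unit maps more distinct intervals to the same integer, which is the precision trade-off mentioned above.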
Lastly, we use the output of the last encoder’s hidden state as the embedded fixed-length vector of the original input sequence, which is also an input vector to the first decoder cell. The training objective of the model is to minimize the error between the input and output sequences.
Our model has another difference with the original model in . While the model in  used a bidirectional RNN , we use a one-directional LSTM to keep the model simple and to compute the result faster. However, this is a minor difference and we can replace it with a bidirectional RNN if necessary.
For our dataset, with the sequence embedding step in this section, we have converted each of the 854 neonates’ records into a fixed-length vector with a size of 128.
3.4 Similarity Calculation with Embedded Vectors
From the above two embedding steps, we have a fixed-length embedded vector for each patient’s records. Now, we can compute the sequence similarities based on these vectors with a certain distance metric, such as the Euclidean distance or cosine distance. For our dataset, we tested both the Euclidean and cosine distances and the resultant performance was similar; thus, we simply selected the Euclidean distance in our system.
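A minimal sketch of this final step is given below: given one embedded vector per patient, it ranks the other patients by distance to a focal patient and returns the closest ones. The names `top_k_similar`, `euclidean`, and `cosine_distance`, as well as the dictionary layout, are hypothetical; the cosine variant additionally assumes that no embedding is the zero vector.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    # Assumes neither vector is all zeros (otherwise the norm is 0).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_k_similar(embeddings, focal_id, k=3, distance=euclidean):
    """Return the ids of the k records closest to the focal record."""
    focal = embeddings[focal_id]
    scored = [(distance(focal, vec), pid)
              for pid, vec in embeddings.items() if pid != focal_id]
    scored.sort(key=lambda pair: pair[0])  # smaller distance = more similar
    return [pid for _, pid in scored[:k]]
```

Passing `distance=cosine_distance` switches the metric without changing the ranking logic, which makes it straightforward to compare the two choices on the same embeddings.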
3.5 Comparison with Dynamic Time Warping
To demonstrate the effectiveness of our method for calculating the similarity of medical records with different lengths, we compare our method with a baseline similarity measurement: the Dynamic Time Warping (DTW) . We perform the task of identifying similar records in the NICU dataset described in Sec. 2.1.1. Specifically, we use both methods to find the top-3 similar sequences to the given focal neonate. Then, we visually compare the focal and identified similar records to see which method provides more reasonable results.
Figure 3 shows the results for two different focal neonates (ID 80 and ID 21). We only present the results for these two focal neonates here; the same task is performed on all the neonates’ records except for those with fewer than two events, and all results similar to Figure 3 are available at https://drive.google.com/drive/folders/1ypDt00VJKZqvSgW8GFcHtZUjDfZcAagM?usp=sharing. Because the NICU dataset is multivariate data consisting of 239 different medical tests, instead of displaying temporal sequences for every medical test, we randomly choose 6 tests (Calcium, Carbon dioxide, Chloride, Creatinine, Hemoglobin, and Potassium) to make it possible to visually compare and describe the concrete differences in the results. In Figure 3, we have not applied any sequence alignment, to avoid including the effect of the alignment. In Figure 3a and b, the first and second rows show the results using our model and DTW, respectively.
From these results, we can see that our method generally selects temporal sequences that take similar values and patterns to those of the focal neonates. For example, for Calcium in Figure 3a, all sequences have “valleys” (i.e., lower values than other close time steps) around the beginning and “peaks” (i.e., higher values than other close time steps) around the middle time steps. Another clear example can be seen in Hemoglobin in Figure 3a. The values in all sequences tend to decrease as time progresses but have some peaks around the middle. On the other hand, we can see that the similar sequences selected by DTW tend to be short and do not show the same patterns as the focal neonate. For example, in Carbon dioxide in Figure 3b, while the focal sequence has a slight increase in its values, DTW selects two short decreasing lines as the top-2 and top-3 similar sequences. This could happen because, during the similarity calculation, DTW tries to find the best alignment between two sequences based on simple dynamic programming. Then, if many time steps have little dissimilarity to the corresponding aligned time steps, the total dissimilarity would be small.
In summary, the results above show the strength of our method when compared with DTW. The strength is derived from using deep neural networks to capture the overall transition and temporal patterns of sequences, instead of merely focusing on comparing individual event pairs.
4 Visual Analytics System
We present a visual analytics system that is designed to support the analysis tasks described in Sec. 2.2. We first describe an overview of the system with an analysis workflow and then details of each view of the system. We use our dataset of the neonates to make the system description more concrete. However, the system can be applied to other datasets consisting of multivariate values for each event.
4.1 System Overview and Analysis Workflow
We describe an overview of our visual analytics system shown in Figure 4 by going through the explanation of the analysis workflow in Figure 5. First, the user (e.g., a clinician) can see the overviews of similarity information of neonates’ medical test results (Figure 5A). Our system provides two overviews of all neonates: one is for neonates’ similarities based on the combination of tests taken during the recorded period (Figure 4a) and another is neonates’ similarities based on the test values (Figure 4b). Then, by referring to these views, the user selects one focal neonate. For example, in Figure 4a, from the orange cluster, the neonate highlighted with the black outer-ring is selected.
Based on the focal neonate, the system automatically selects the top-k similar neonates (specifically, k = 3 as a default) and then visualizes overviews of the focal and top-k neonates, as shown in Figure 4c and Figure 4d (corresponding to Figure 5B and D). Because medical records can be represented as temporal multivariate event sequences, to help the user decide a time step or a variable to be reviewed, in Figure 4c and Figure 4d, we provide overviews from temporal and variable aspects, respectively. More specifically, while Figure 4c shows temporal changes of dissimilarities of the top-k neonates’ test results to the focal neonate, Figure 4d visualizes the statistical summary of each test’s values across the recorded time period.
When the user wants to review the test values in detail based on the temporal changes, the user can select one time step from the view at Figure 4c. For example, in Figure 4c, the last time step is selected because the dissimilarity goes higher. Then, to review why they are (dis)similar, the system shows each neonate’s all test values at the selected time, as shown in Figure 4e (corresponding to Figure 5C). Furthermore, from the result shown in Figure 4e, the user can select a specific test item in which he/she wants to see temporal changes in detail. For example, in Figure 4e, we select the test item ‘12: Lymphs’ because its values have the largest difference between the focal and other neonates. The result is visualized in Figure 4f (corresponding to Figure 5E). From Figure 4f, for example, the user can understand the focal neonate’s high ‘lymphs’ at the last time step have increased the dissimilarity observed in Figure 4c.
On the other hand, when the user starts to review the details from the overview of variables visualized in Figure 4d (corresponding to Figure 5D), the user can select a test item of interest from the result in Figure 4d and the details will be displayed in Figure 4f.
While we have described the two major analysis flows above, Figure 5 A → B → C → E (focusing on temporal changes) and A → D → E (focusing on differences across test items), each of these analysis steps is tightly connected and, in practice, we often go back and forth across these different steps. Also, we want to note that the arrangement of the views in our system is designed to match the order of the two major analysis flows.
4.2 Similarity of Tests Taken (Figure 4a)
The views in Figure 4a and Figure 4b are used to perform T1 in Sec. 2.2 with our similarity calculation described in Sec. 3. As discussed, EMR data, including our neonate dataset, is often sparse. We have addressed this problem to some extent with the similarity calculation using the two-step embedding. However, due to the mean imputation in the data preprocessing step, as described in Sec. 3.1, our method still may judge two neonates who have taken a quite different set of tests to have high similarity. Figure 4a can be used to deal with this problem.
In this view, we visualize similarities of the neonates’ records based on the test items they took during the collected time period. First, we employ the Jaccard index to compute the similarity. Then, to visualize all the neonates’ similarity relationships in a single 2D plot, we apply a dimensionality reduction (DR) method, specifically t-SNE. Afterwards, to extract clusters from the DR result, we use HDBSCAN, one of the density-based clustering methods. Lastly, to convey the clustering information, we color each point (i.e., each neonate’s record) based on the assigned cluster ID. We use categorical colors with enough differences in their hues to distinguish each cluster. Also, because density-based clustering, including HDBSCAN, may not assign some points (e.g., outliers or noise) to any specific cluster, we color such points gray and label them as ‘uncategorized’.
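The first step of this pipeline, the Jaccard index over the sets of tests taken, can be sketched as follows. The function names, the dictionary layout, and the convention for two empty sets are assumptions for illustration; the resulting matrix of Jaccard distances (one minus the index) is the kind of input that a dimensionality reduction implementation accepting precomputed distances could consume.

```python
def jaccard_index(tests_a, tests_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(tests_a), set(tests_b)
    if not a and not b:
        return 1.0  # assumption: two empty test sets count as identical
    return len(a & b) / len(a | b)


def jaccard_distance_matrix(tests_by_neonate):
    """Pairwise Jaccard distances, with rows and columns ordered by sorted id."""
    ids = sorted(tests_by_neonate)
    matrix = [[1.0 - jaccard_index(tests_by_neonate[p], tests_by_neonate[q])
               for q in ids] for p in ids]
    return ids, matrix
```

Because only the sets of tests are compared, this measure is unaffected by the imputed values and therefore complements the value-based similarity from Sec. 3.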
The user can select a focal neonate from the input dialog placed on the top left (e.g., ID 175 is selected in Figure 4a) or by clicking a point in the view. We indicate the focal neonate with a black outer-ring, as shown in Figure 4a. Then, the system automatically searches for the top-3 similar neonates’ records based on the pre-computed similarities with the method described in Sec. 3 (note that these similarities are based on the test values). The selected top-3 neonates are indicated with outer-rings of blue colors with different saturation. The darker blue represents a neonate with a higher similarity. We have chosen this single hue encoding to ensure these colors do not share the same hue with the cluster colors to avoid misleading the user. From an example of Figure 4a, we can see that while the focal and first- and second-top neonates are selected from the orange cluster (i.e., Cluster A), the third-top neonate is selected from the purple cluster (i.e., Cluster D). This is an example of the problem stemming from mean imputation. In this case, the user might want to select similar neonates only within the orange cluster.
To support such selection, we provide an interaction to restrict the identification of similar neonates only within the selected cluster. An example of the interaction result is shown in Figure 6. First, the user can specify a certain cluster from the legend of clusters placed at the bottom of the view. In Figure 6a1, the mouse-hovered cluster is indicated with a dot at the center of the circle (analogous to the design of the commonly-used radio button). Then, the user can select a cluster by mouse-clicking. The system immediately selects and updates the similar neonates only within the selected cluster if the focal patient belongs to the selected cluster, as shown in Figure 6b1. Additionally, the system highlights the selected cluster by reducing the opacity for others. Also, we allow the user to update a focal neonate after the cluster selection. In this case, the system selects similar neonates within the selected cluster. The user can cancel the selection of the cluster by clicking on the selected cluster again.
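The cluster-restricted search described above might look like the following sketch. It assumes precomputed embeddings and cluster assignments stored in dictionaries, uses the Euclidean distance, and returns `None` when the focal neonate is outside the selected cluster to signal that the current selection is left unchanged. All names here are hypothetical and do not reflect the system's actual code.

```python
import math


def top_k_in_cluster(embeddings, cluster_of, focal_id, selected_cluster, k=3):
    """Top-k similar records restricted to one cluster, or None if not applicable."""
    if cluster_of[focal_id] != selected_cluster:
        return None  # focal record is outside the cluster: keep the old selection
    candidates = [pid for pid in embeddings
                  if pid != focal_id and cluster_of[pid] == selected_cluster]
    # Rank the remaining candidates by distance to the focal embedding.
    candidates.sort(key=lambda pid: math.dist(embeddings[focal_id], embeddings[pid]))
    return candidates[:k]
```

In this sketch a record that is very close in test values but belongs to another cluster is simply excluded, which mirrors the purple-cluster example discussed above.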
4.3 Similarity of Overall Test Records (Figure 4b)
This view shows an overview of the similarities of all the neonate test records. We compute the similarity of each pair of the test records with the method described in Sec. 3. Then, similar to Figure 4a, we apply t-SNE to visualize these similarity relationships. By providing this overview, the user can find specific groups in which all the neonates have similar test values (i.e., identifying cohorts of neonates) or outliers from other neonates. For example, in Figure 6a2, we can see three neonates are placed far from the others, as indicated with teal arrows.
We color each point based on the corresponding cluster assigned in Figure 4a to provide better linking between the views. As shown in Figure 7, with these colors, we can see neonates tend to cluster together based on the test they took except for Cluster D (purple) and E (brown).
This view supports the same interactions implemented for Figure 4a, such as the neonate and clustering selections. Also, the views in Figure 4a and Figure 4b are fully linked with each other. For example, as shown in Figure 6, when the selection is updated in the view of similarity of tests taken, the view of similarity of overall test records will be updated accordingly, and vice versa.
4.4 Transition of Test Records’ Dissimilarities (Figure 4c)
The transition of test records’ dissimilarities in Figure 4c shows an overall (dis)similarity progression of the focal neonate and its top-3 similar records. The view helps users understand how the top-3 similar records are similar to the focal record, and when they are most or least similar (i.e., T2 in Sec. 2.2).
To show the information above, we need to compute the similarity of these neonates’ test results for each time step. One possible option is to use the vector representation after the sequence embedding in Sec. 3.3. However, in this case, sequences with different numbers of time steps have been embedded into vectors of the same length. As a result, it is difficult to select a specific time step from the result, which is necessary to perform the ensuing tasks (i.e., T3 and T5). Instead, we use the vector representation after the event embedding (i.e., the first-step embedding) in Sec. 3.2. However, because the selected neonates’ records could have different numbers of time steps, we cannot calculate the similarity of each time step directly. Thus, similar to the work by Jin et al., which applies DTW to the vectors after the event embedding, we first align the focal and top-3 similar records to the record with the most time steps. Then, based on the alignment, we compute the similarity of the top-3 similar records’ event vectors to the focal record for each time step. Note that while Jin et al. used DTW for computing similarities of records across all time steps, we use DTW only for the alignment of two sequences of different lengths for visual comparison.
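The idea of using DTW only for alignment can be sketched as follows. This simplified example aligns one similar record directly to the focal record (rather than to the longest record, as in the system) and averages the distances of all events matched to each focal time step; both simplifications, the Euclidean event distance, and the function names are assumptions for illustration.

```python
import math


def dtw_path(a, b):
    """Warping path between two sequences of vectors as (i, j) index pairs."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1],
                                    cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def stepwise_dissimilarity(focal, other):
    """Mean distance of the aligned events for each focal time step."""
    per_step = [[] for _ in focal]
    for i, j in dtw_path(focal, other):
        per_step[i].append(math.dist(focal[i], other[j]))
    return [sum(d) / len(d) for d in per_step]
```

The returned list has one value per focal time step, so it can be drawn directly as one polyline of a line chart.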
To keep the visualizations simple, we use a line chart to show the computed transition of test records’ dissimilarities, as shown in Figure 4c. The x- and y-coordinates represent a time step and the computed dissimilarity, respectively. Each polyline corresponds to one of the top-3 neonates and is colored with the same scheme used for the outer ring in Figure 4a and Figure 4b (e.g., the darkest blue shows the most similar neonate). In the example in Figure 4c, we can see that all the top-3 neonates maintain relatively small dissimilarities across time steps (about 0.2 or less). This suggests that our similarity calculation described in Sec. 3 properly identifies similar neonate records. On the other hand, we can also see a clear increase in dissimilarities after time step 7.
Also, while we show y-coordinates with a range of 0–1 as a default, the system allows the user to toggle zooming into a range from 0 to the maximum dissimilarity by clicking the ‘zoom’ button placed at the top of the y-axis. Lastly, the user can select a time step of interest by mouse-clicking. The selected time step is indicated with a black dashed line (e.g., time step 9 in Figure 4c). As we describe in Sec. 4.6, the view in Figure 4e will be updated based on the selected time step.
4.5 Overview of Test Records (Figure 4d)
This view shows the statistical overview of test results of the focal and top-3 similar neonates. More specifically, for each neonate, we visualize the average value of each key test item across time steps. We select 32 key test items out of 239 medical tests by filtering out the tests that were taken less frequently. Specifically, we pick the 8 most frequently taken tests for each of the 7 clusters shown in Figure 4a. Because some of the test items overlap across clusters, we obtain 42 test items in total. Then, we remove 10 test items because they have zero or near-zero standard deviations across neonates and time steps. This view helps the user find, at a glance, medical tests that tend to have (dis)similar results across the focal and top-3 similar neonates (T4 in Sec. 2.2).
To make it possible to compare across different medical tests, we apply standardization to each test item, so that every test item has zero mean and unit variance. The y-coordinates of this view represent the standardized value. Similar to parallel coordinates, we use a vertical axis colored light gray to show the values of each test item. Because of the space limitation, we only show the index of the test item at the bottom of each vertical axis. The corresponding test names are listed in the test items shown in Figure 4g. Each polyline corresponds to one of the top-3 neonates and uses the same color scheme as Figure 4c. Also, the polyline for the focal neonate is colored black, similar to the outer rings in Figure 4a and Figure 4b. As an interaction, the user can select a test item of interest by mouse-clicking. The selected test item is indicated with a teal dashed bar, as shown in Figure 4d (‘12: Lymphs’). The related views (i.e., Figure 4e, f, and g) are also updated based on the selected item.
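The test item selection and standardization described in this subsection can be sketched as follows. The data layout (a dictionary from a neonate ID to a list of time steps, each mapping a test name to a value) and all function names are hypothetical; the sketch only mirrors the three steps in the text: take the most frequent tests per cluster, drop tests with near-zero standard deviation, and convert the rest to zero mean and unit variance.

```python
from collections import Counter
from statistics import mean, pstdev


def top_tests_per_cluster(records, clusters, k=8):
    """Union of the k most frequently taken tests within each cluster."""
    counts = {}
    for pid, steps in records.items():
        counter = counts.setdefault(clusters[pid], Counter())
        for step in steps:
            counter.update(step.keys())
    selected = set()
    for counter in counts.values():
        selected.update(name for name, _ in counter.most_common(k))
    return selected


def fit_standardizer(records, tests, min_std=1e-9):
    """Mean and standard deviation per test; near-constant tests are dropped."""
    stats = {}
    for name in sorted(tests):
        values = [s[name] for steps in records.values() for s in steps if name in s]
        sd = pstdev(values)
        if sd > min_std:
            stats[name] = (mean(values), sd)
    return stats


def standardize(step, stats):
    """Standardized values for one time step, restricted to the kept tests."""
    return {n: (v - stats[n][0]) / stats[n][1] for n, v in step.items() if n in stats}
```

With standardized values, a negative number means that the result is below the average over all neonates and time steps, which is how the values in the overview can be read.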
In the visualized example in Figure 4d, we can see that, generally, the focal and top-3 similar neonates’ test results are similar to each other. However, three test items, ‘10: Hematocrit’, ‘11: Hemoglobin’, and ‘25: Red cell count’, show clear differences between the focal neonate and the others. The user may want to see how these three test items change over time. The view of the transition of selected test’s records described in Sec. 4.7 can be used for this purpose.
4.6 Test Records at Selected Time (Figure 4e)
This view can be used to perform T4 in Sec. 2.2 and shows all medical test results for the focal and top-3 similar neonates at a selected time step in the view in Figure 4c or Figure 4f. The visualizations and interactions for this view are the same as Figure 4d. The only difference is that this view shows the actual test results instead of the average values across time steps.
From the result in Figure 4e, we can understand the cause of the dissimilarity between the focal and other neonates at time step 9 in Figure 4c. The values for most tests of the top-3 similar records are similar to the focal neonate, whereas tests ‘10: Hematocrit’, ‘11: Hemoglobin’, ‘12: Lymphs’, etc. have a large difference, especially ‘12: Lymphs’. The user may want to further investigate the overall transition patterns of these tests across time. Figure 4f in the next subsection can be used for this analysis.
4.7 Transition of Selected Test’s Records (Figure 4f)
This view supports T5 in Sec. 2.2 by providing a result to compare the neonates’ detailed values of a certain test item across time. We employ visualizations and interactions similar to Figure 4c except for two differences. On the y-axis, this view shows the standardized results, as explained in Sec. 4.5, of the selected test item. The name of the selected test item is also shown as the y-axis label. Additionally, because we want to compare the top-3 neonates’ records with the focal neonate, this view includes the black polyline corresponding to the focal neonate’s record.
From the result in Figure 4f, we can see that while the focal neonate’s ‘Lymphs’ shows a decrease at time step 8, immediately after that, it sharply increases to about 6. Because ‘Lymphs’ is closely related to the immune system of the human body, the focal neonate seems to have had some drastic changes in his/her immune system.
4.8 Linked Interactions across Multiple Views
As described, we have carefully designed visualizations and interactions to show linked information across views. Here we summarize all the linked interactions across multiple views.
- Selection of a focal neonate.
The user can select a focal neonate by entering a neonate ID into the input dialog placed at the top left of Figure 4a or mouse-clicking a point displayed either in Figure 4a or Figure 4b. After the selection, the system automatically selects the top-3 similar neonates’ records and updates all views except for Figure 4g, accordingly.
- Selection of a cluster.
The user can select/unselect a cluster by mouse-clicking from the cluster legend placed in Figure 4a or Figure 4b. This updates the top-3 similar neonates’ records to be selected within the selected cluster. Similar to the selection of a focal neonate, all views except for Figure 4g will be updated.
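- Selection of a time step.
The user can select a time step of interest by mouse-clicking in Figure 4c or Figure 4f. The selected time step is indicated with a black dashed line, and the view in Figure 4e is updated based on the selected time step.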
- Selection of a test item.
A test item of interest can be selected in Figure 4d or Figure 4e by clicking one of the vertical axes. Also, this can be done with Figure 4g by clicking a test item. Afterward, the teal-dashed vertical lines in Figure 4d and Figure 4e and the rectangle with teal-stroke in Figure 4g will be updated. Additionally, the system updates the visualized result in Figure 4f.
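The effect of the focal neonate and cluster selections on the top-3 similar records can be sketched as below. The sketch assumes that each record has already been embedded into a fixed-length vector and that similarity is measured with Euclidean distance; both the distance and the names `embeddings`, `clusters`, and `top_k_similar` are illustrative assumptions rather than the system’s actual implementation.

```python
import math


def top_k_similar(focal_id, embeddings, clusters=None, cluster=None, k=3):
    """IDs of the k records closest to the focal record.

    When `cluster` is given, candidates are restricted to that cluster,
    mirroring the cluster selection interaction.
    """
    focal = embeddings[focal_id]
    candidates = [
        (math.dist(focal, vec), pid)
        for pid, vec in embeddings.items()
        if pid != focal_id and (cluster is None or clusters[pid] == cluster)
    ]
    return [pid for _, pid in sorted(candidates)[:k]]
```

After either selection changes, every linked view would be redrawn from the focal ID and the returned list of similar IDs.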
5 Case Studies
We demonstrate the effectiveness of our visual analytics system in identifying and analyzing similar neonates’ records. (An introduction video to the system regarding the case study can be found at https://youtu.be/Ng89PDPcQpc.)
5.1 Comparison of Similar Neonates Suffering from the Same Symptom
We have already shown several findings by analyzing the records of ID 175 and similar patients while showing an example of usage of the system in Sec. 4. Here, we demonstrate a more comprehensive analysis of this neonate.
Based on the documentation by a clinician in charge, ID 175 suffered from thrombocytopenia (i.e., a low platelet count in the blood). As shown in the overview of test records in a, looking at the values of ‘21: Platelet count’ for these four neonates, we notice that the test values for all of them are lower than zero. Because we have applied standardization to each test item, this result indicates that these neonates have lower platelet counts than the average. First, we select ‘21: Platelet count’ from this view and analyze the detailed temporal changes of platelet counts in the view of the transition of selected test’s records. The result is shown in b. We can see that all four neonates’ platelet counts gradually increased and reached about the average level. Thus, we can expect that these neonates no longer suffer from thrombocytopenia.
However, as described in Sec. 4, recall that ID 175 shows an increase in dissimilarity from the others at time step 9 and a high ‘12: Lymphs’ value at that time step. We review the same plot, as shown in Figure 9. We can see that, in addition to ‘12: Lymphs’, ID 175 has high values in several tests, such as ‘10: Hematocrit’, ‘11: Hemoglobin’, ‘14: MCH’ (mean corpuscular hemoglobin), ‘16: MCV’ (mean corpuscular volume), ‘18: MPV’ (mean platelet volume), and ‘25: Red cell count’. On the other hand, ‘17: Monocytes’, ‘19: Neutrophil’, and ‘30: White blood cell’ show low values. Note that both ‘17: Monocytes’ and ‘19: Neutrophil’ are specific types of white blood cells.
We review temporal changes in all of these test items with the view of the transition of selected test’s records; Figure 10 shows some of these results. From the values of ‘30: White blood cell’ shown in a, we can see that the total number of white blood cells of ID 175 keeps moving away from the average value. Also, from b and c, we can see that ID 175 keeps high values for ‘11: Hemoglobin’ and ‘25: Red cell count’ while the other neonates have values around the average (i.e., zero) at time step 9. This indicates that ID 175 still suffered from polycythemia at the last hospital visit. More importantly, in d, we can see an increase of MPV in ID 175, which indicates that the average size of the platelets has become large. High MPV implies a potential cause of thrombocytopenia. Therefore, unlike the other similar neonates, even though we observed improvement in ‘21: Platelet count’, as shown in b, the blood of ID 175 should continue to be monitored.
5.2 Analysis of Neonate Groups
We perform an analysis of similar neonates selected within a cluster. As we have shown in Figure 6, several distinct clusters of neonates can be observed in the view of similarity of overall test records (Sec. 4.3). In this case study, we analyze two distinct clusters, Cluster C (green) and F (red).
First, with the interaction of selection of a cluster (refer to Sec. 4.8), we select the focal and top-3 similar patients within Cluster F, as shown in a. Then, the related views are immediately updated. The results of the transition of test records’ dissimilarities and overview of values of all test items are visualized in b and c. From b, we can see these four neonates keep high similarity with each other across time steps. Also, from c, we can see the neonates tend to have high ‘2: Bilirubin total’; low ‘10: Hematocrit’ and ‘11: Hemoglobin’. This implies that the patients in Cluster F seem to be suffering from the destruction of hemoglobin and resultant high bilirubin. When we review the transition of ‘2: Bilirubin total’, as shown in d, most of the selected neonates tend to have high bilirubin in the early or middle stage and then show a decrease to a lower value as signs of recovery.
Next, similarly, we select the neonates from Cluster C, as shown in a. From b, which depicts the transition of dissimilarities, we can see that these neonates also maintain high similarity across time. In the overview of values of all test items shown in c, we can see that the neonates have salient values in several test items. For example, ‘0: Alkaline phosphatase’ and ‘17: Monocytes’ have much lower values than the average of all neonates (i.e., zero). d shows the transition of values of ‘17: Monocytes’. From this, we can see that the neonates’ values approach the average of all neonates over time, although they previously suffered from low monocytes.
This case study demonstrates the functionality of our system to analyze and compare groups of patients clustered by their symptoms by using the cluster selection.
6 Discussion and Limitations
Extensive Usage of Our Embedding Method. We have developed our two-step embedding described in Sec. 3 to identify and compare similar medical records. The fundamental contribution of our embedding is enabling the similarity calculation of event sequences with different lengths. Similarities obtained from event sequences can be used not only for the identification of similar event sequences but also for many other data mining tasks, such as clustering and classification. Therefore, we plan to extend the usage of our embedding for these types of analyses.
Generality of Our Methods. We use our embedding method and visual analytics system to analyze neonates’ medical test results. However, our methods are generic enough to apply, without any major modifications, to other medical records that can be represented as multivariate time sequences. Moreover, we can use our methods for other applications, such as career decision support, financial analysis, and cybersecurity systems. For instance, similar to existing works [7, 8, 9], in a career decision support system, we can suggest similar students or workers to help the user select the next career.
Limitations of Our Embedding Method. Our embedding method is designed for event sequence data in which each event contains multivariate values. Thus, our method cannot be applied directly to event sequences that do not have any value at each event (e.g., each event only has an event name, such as a ‘hospital visit’). For such data, we need to modify the first step of embedding (i.e., the event embedding). For example, we can use the skip-gram based embedding as used in the work by Guo et al., or the CBOW based embedding. A more challenging situation arises when we need to handle both types of event sequences above, that is, when some events contain multivariate values while others have only an event name. We would like to address this challenge in our future work. For example, we can assign a unique event name to an event with multivariate values based on the range of each value (e.g., high blood pressure and low heartbeat) and then apply the skip-gram or CBOW based embedding.
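As a rough illustration of the last idea, an event with multivariate values could be converted into a single event name by binning each value into a range. The function name, the three-level binning, and the threshold values below are hypothetical and are not clinical reference ranges; the resulting names could then serve as tokens for a skip-gram or CBOW based embedding.

```python
def name_event(values, thresholds):
    """Build an event name from the range that each value falls into.

    `values` maps a measurement name to its value, and `thresholds`
    maps the same name to a (low, high) pair bounding the normal range.
    """
    parts = []
    for key in sorted(values):
        low, high = thresholds[key]
        v = values[key]
        level = "low" if v < low else "high" if v > high else "normal"
        parts.append(f"{level}_{key}")
    return "+".join(parts)


# Illustrative thresholds only (assumed for this example).
example_thresholds = {"blood_pressure": (90, 140), "heartbeat": (60, 100)}
example_name = name_event({"blood_pressure": 150, "heartbeat": 50}, example_thresholds)
```

Sorting the measurement names keeps the generated event name independent of the order in which the values were recorded, so identical events always map to the same token.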
Limitations of Our Dataset and Analysis. Throughout the paper, we use neonate medical records consisting only of values for each medical test item. Because this dataset does not include any medical judgements as events and we did not fully incorporate medical documentation (e.g., clinicians’ comments), the insights obtained through the performed analyses are limited and need additional evaluation. For example, by coupling the test values with the clinician’s judgement for each time step, we can correlate the transitions of test values with the diagnoses for the neonates to gain more detailed medical insights. Also, we would like to further understand, with expert knowledge, the clinical rationale behind the cluster formation of similar neonate records observed in Figure 7.
Limitations of Our Visual Analytics System. Our visual analytics system focuses on effectively supporting exploration of similar medical records from both temporal and multivariate perspectives. However, the system does not provide an interface to investigate raw events, including clinicians’ comments and the detailed date and time of tests taken. We plan to integrate such functionalities into the system in the future.
We have chosen to use simple charts (scatterplots and line charts) to provide easy-to-understand visualizations in our visual analytics system. However, both scatterplots and line charts have a scalability problem when we need to visualize more information. For example, scatterplots are used to show the similarities of all 854 neonates’ records. However, some medical datasets contain a much larger number of patients, for example, over 40,000 patients in the MIMIC-III dataset, an open-access clinical database. To handle such a large dataset, we should couple our system with data reduction methods. For example, we can use the data sampling method developed for scatterplots. As for line charts, we use them for showing the focal and top-3 similar neonates. When the user wants to visualize more lines, for example, showing the top-10 similar neonates, the visualized results could suffer from cluttering of lines and/or difficulty in distinguishing the line colors. For example, we have used colors with the same hue but different saturation to avoid using the same hues as the cluster colors. However, we expect that this color scheme works only when we visualize five or fewer similar neonates. When more lines need to be visualized, we can consider using aggregation. For example, based on the similarity of each line, we can aggregate multiple similar lines into one line.
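One simple way to realize the line aggregation mentioned above is a greedy grouping, sketched below. It assumes that all lines have the same number of time steps, that the mean absolute difference is used as the line distance, and that each group is drawn as its pointwise mean; the function names and the threshold are hypothetical, and other clustering methods could be used instead.

```python
def mean_line(lines):
    """Pointwise mean of several lines of equal length."""
    return [sum(vals) / len(vals) for vals in zip(*lines)]


def line_distance(a, b):
    """Mean absolute difference between two lines of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def aggregate_lines(lines, threshold):
    """Greedily merge lines whose distance to a group mean is small.

    Returns one (aggregated line, number of merged lines) pair per group.
    """
    groups = []
    for line in lines:
        for group in groups:
            if line_distance(mean_line(group), line) <= threshold:
                group.append(line)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([line])
    return [(mean_line(group), len(group)) for group in groups]
```

The number of merged lines returned for each group could be encoded, for example, as the line width, so that the user can still see how many neonates an aggregated line represents.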
Identifying and analyzing similar patients’ medical records are fundamental needs in clinical decision making. Our work provides an unsupervised learning-based method for measuring the similarity of medical records, which can deal with high dimensionality, irregularity, and sparsity in medical records. The visual analytics system built on top of our methods enables effective analysis of medical records from both temporal and multivariate perspectives. In the future, we would like to conduct user studies with clinical researchers to understand which visualization designs they prefer. With the contributions above, we believe that our work can guide future research on the comparative analysis of event sequence data in various domains.
The authors wish to thank Dr. Mark A. Underwood at UC Davis Children’s Hospital. This research is sponsored in part by the U.S. National Science Foundation through grant IIS-1741536 and a 2019 Seed Fund Award from CITRIS and the Banatao Institute at the University of California.
-  I. D. Arad, G. Alpan, S. D. Sznajderman, and A. Eldor. The mean platelet volume (MPV) in the neonatal period. American J. Perinatology, 3(01):1–3, 1986.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
-  E. S. Berner and T. J. La Lande. Overview of Clinical Decision Support Systems, chap. 1, pp. 1–17. Springer Int. Publishing, 2016.
-  R. J. G. B. Campello, D. Moulavi, and J. Sander. Density-based clustering based on hierarchical density estimates. In J. Pei, V. S. Tseng, L. Cao, H. Motoda, and G. Xu, eds., Proc. Advances in Knowledge Discovery and Data Mining, pp. 160–172. Springer, 2013.
-  L. Chen and R. Ng. On the marriage of LP-norms and edit distance. In Proc. Int. Conf. on Very Large Data Bases-Volume 30, pp. 792–803. VLDB Endowment, 2004.
-  L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 491–502, 2005.
-  F. Du, C. Plaisant, N. Spring, and B. Shneiderman. EventAction: Visual analytics for temporal event sequence recommendation. In Proc. IEEE Conf. on Visual Analytics Science and Technology, pp. 61–70, 2016.
-  F. Du, C. Plaisant, N. Spring, and B. Shneiderman. Finding similar people to guide life choices: Challenge, design, and evaluation. In Proc. CHI Conf. on Human Factors in Computing Systems, pp. 5498–5544. ACM, 2017.
-  F. Du, C. Plaisant, N. Spring, and B. Shneiderman. Visual interfaces for recommendation systems: Finding similar and dissimilar peers. ACM Trans. on Intelligent Systems and Technology, 10(1):9, 2019.
-  D. Gotz and H. Stavropoulos. DecisionFlow: Visual analytics for high-dimensional temporal event sequence data. IEEE Trans. on Visualization and Computer Graphics, 20(12):1783–1792, 2014.
-  J. C. Gower and M. J. Warrens. Similarity, Dissimilarity, and Distance, Measures of, chap. 1, pp. 1–11. American Cancer Society, 2017.
-  S. Guo, Z. Jin, D. Gotz, F. Du, H. Zha, and N. Cao. Visual progression analysis of event sequence data. IEEE Trans. on Visualization and Computer Graphics, 25(1):417–426, 2018.
-  S. Guo, K. Xu, R. Zhao, D. Gotz, H. Zha, and N. Cao. EventThread: Visual summarization and stage analysis of event sequence data. IEEE Trans. on Visualization and Computer Graphics, PP:1–1, 08 2017.
-  G. Harerimana, J. W. Kim, H. Yoo, and B. Jang. Deep learning for electronic health records analytics. IEEE Access, 7:101245–101259, 2019.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  R. Hu, T. Sha, O. Van Kaick, O. Deussen, and H. Huang. Data sampling in multi-view and multi-class scatterplots via set cover optimization. IEEE Trans. on Visualization and Computer Graphics, 2019.
-  L. Huang, A. L. Shea, H. Qian, A. Masurkar, H. Deng, and D. Liu. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. J. Biomedical Informatics, 99:103291, 2019.
-  A. Inselberg and B. Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional geometry. In Proc. IEEE Conf. on Visualization, pp. 361–378, 1990.
-  X. Ji, H. Shen, A. Ritter, R. Machiraju, and P. Yen. Visual exploration of neural document embedding in information retrieval: Semantics and feature selection. IEEE Trans. on Visualization and Computer Graphics, 25(6):2181–2192, 2019.
-  Z. Jin, S. Cui, S. Guo, D. Gotz, J. Sun, and N. Cao. CarePre: An intelligent clinical decision assistance system. ACM Trans. on Computing for Healthcare, 2019.
-  A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
-  I. T. Jolliffe. Principal component analysis and factor analysis. In Principal Component Analysis, pp. 115–128. Springer, 1986.
-  E. Kawaler, A. Cobian, P. Peissig, D. Cross, S. Yale, and M. Craven. Learning to predict post-hospitalization VTE risk from EHR data. In AMIA Annual Symp. Proc., vol. 2012, p. 436. American Medical Informatics Association, 2012.
-  J. L. Koyner, K. A. Carey, D. P. Edelson, and M. M. Churpek. The development of a machine learning inpatient acute kidney injury prediction model. Critical Care Medicine, 46(7):1070–1077, 2018.
-  M. A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE J., 37(2):233–243, 1991.
-  B. C. Kwon, V. Anand, K. A. Severson, S. Ghosh, Z. Sun, B. I. Frohnert, M. Lundgren, and K. Ng. DPVis: Visual exploration of disease progression pathways. arXiv:1904.11652, 2019.
-  B. C. Kwon, M.-J. Choi, J. T. Kim, E. Choi, Y. B. Kim, S. Kwon, J. Sun, and J. Choo. RetainVis: Visual analytics with interpretable and interactive recurrent neural networks on electronic medical records. IEEE Trans. on Visualization and Computer Graphics, 25(1):299–309, 2018.
-  C. Lee, Z. Luo, K. Y. Ngiam, M. Zhang, K. Zheng, G. Chen, B. C. Ooi, and W. L. J. Yip. Big healthcare data analytics: Challenges and applications. In Handbook of Large-Scale Distributed Computing in Smart Healthcare, pp. 11–41. Springer, 2017.
-  O. Li, H. Liu, C. Chen, and C. Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. arXiv:1710.04806, 2017.
-  X. Li, H. Wang, H. He, J. Du, J. Chen, and J. Wu. Intelligent diagnosis with Chinese electronic medical records based on convolutional neural networks. BMC Bioinformatics, 20(1):62, 2019.
-  X. Li, K. Zhao, G. Cong, C. S. Jensen, and W. Wei. Deep representation learning for trajectory similarity computation. In Proc. Int. Conf. on Data Engineering, pp. 617–628. IEEE, 2018.
-  Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. arXiv:1703.03130, 2017.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.
-  Y. Ming, P. Xu, H. Qu, and L. Ren. Interpretable and steerable sequence learning via prototypes. Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, 2019.
-  M. Monroe, R. Lan, H. Lee, C. Plaisant, and B. Shneiderman. Temporal event sequence simplification. IEEE Trans. on Visualization and Computer Graphics, 19(12):2227–2236, 2013.
-  P. Nguyen, T. Tran, N. Wickramasinghe, and S. Venkatesh. Deepr: A convolutional net for medical records. IEEE J. Biomedical and Health Informatics, 21(1):22–30, 2016.
-  A. Perer, F. Wang, and J. Hu. Mining and exploring care pathways from electronic medical records with visual analytics. J. Biomedical Informatics, 56:369–378, 2015.
-  T. Pham, T. Tran, D. Phung, and S. Venkatesh. Predicting healthcare trajectories from medical records: A deep learning approach. J. Biomedical Informatics, 69:218–229, 2017.
-  C. A. Ratanamahatana and E. Keogh. Three myths about dynamic time warping data mining. In Proc. Int. Conf. on Data Mining, pp. 506–510. SIAM, 2005.
-  D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc., 1987.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
-  M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Press, 1997.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Proc. Int. Conf. on Neural Information Processing Systems, pp. 3104–3112, 2014.
-  L. van der Maaten and G. Hinton. Visualizing data using t-SNE. J. Machine Learning Research, 9(Nov):2579–2605, 2008.
-  M. Verleysen and D. François. The curse of dimensionality in data mining and time series prediction. In Proc. Int. Work-Conf. on Artificial Neural Networks, pp. 758–770. Springer, 2005.
-  M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In Proc. Int. Conf. on Data Engineering, pp. 673–684. IEEE, 2002.
-  B. Wells, A. Nowacki, K. Chagin, and M. Kattan. Strategies for handling missing data in electronic health record derived data. Generating Evidence and Methods to Improve Patient Outcomes, 1:Article 7, 12 2013.
-  K. Wongsuphasawat and D. Gotz. Outflow: Visualizing patient flow by symptoms and outcome. In Proc. IEEE VisWeek Workshop on Visual Analytics in Healthcare, pp. 25–28. American Medical Informatics Association, 2011.
-  K. Wongsuphasawat and D. Gotz. Exploring flow, factors, and outcomes of temporal event sequences with the outflow visualization. IEEE Trans. on Visualization and Computer Graphics, 18(12):2659–2668, 2012.
-  S. Yadav, A. Ekbal, S. Saha, and P. Bhattacharyya. Deep learning architecture for patient data de-identification in Proc. Clinical Natural Language Processing Workshop, pp. 32–41, 2016.
-  K. Yazhini and D. Loganathan. A state of art approaches on deep learning models in healthcare: An application perspective. In Proc. Int. Conf. on Trends in Electronics and Informatics, pp. 195–200. IEEE, 2019.
-  B.-K. Yi, H. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In Proc. Int. Conf. on Data Engineering, pp. 201–208. IEEE, 1998.
-  Z. Zhu, C. Yin, B. Qian, Y. Cheng, J. Wei, and F. Wang. Measuring patient similarities via a deep architecture with medical concept embedding. In Proc. Int. Conf. on Data Mining, pp. 749–758. IEEE, 2016.