The Analysis from Nonlinear Distance Metric to Kernel-based Drug Prescription Prediction System

02/04/2021
by Der-Chen Chang, et al.
Georgetown University

Distance metrics and their nonlinear variants play a crucial role in machine learning based real-world problem solving. We demonstrate how Euclidean and cosine distance measures differ not only theoretically but also in a real-world medical application, namely outcome prediction of drug prescription. Euclidean distance exhibits favorable properties for problems with local geometry; in this regard, it can be applied to short-term diseases with low-variation outcome observations. For highly variable chronic diseases, cosine distance is preferable. These different geometric properties lead to different submanifolds in the original embedded space and, hence, to different optimized nonlinear kernel embedding frameworks. We first establish the geometric properties needed in these frameworks and, from these properties, interpret their differences from several perspectives. Our evaluation on real-world, large-scale electronic health records and embedding space visualization empirically validates our approach.


1. Introduction

Distance metrics and their nonlinear variants play a fundamental role in machine learning tasks. They measure the degree of linear or nonlinear similarity between objects, grouping similar objects together (e.g., $k$-means clustering) or assigning class labels based on nearest neighbors (e.g., $k$-nearest neighbor classification) [11]. There are multiple choices of distance functions for specific problems. Generally, Euclidean distance is used for the majority of classification problems, whereas cosine distance is more suitable for document classification [23, 22]. Distance metrics can also be part of kernel function design [6]. In [30], the authors use Euclidean distance as part of a graph kernel to capture time-difference similarity between patients. In [12], the authors demonstrate the positive definiteness of distance substitution kernels. Usually, the kernel function for two data objects $x$ and $y$ and a user-defined time parameter $t$ is defined as

$$K_t(x, y) = \exp\!\left(-\frac{d(x, y)^2}{c_t}\right),$$

where $d(x, y)$ is the Euclidean distance between $x$ and $y$, and $c_t$ is a constant depending on the time $t$.
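As a quick illustration (a minimal sketch, not the authors' exact kernel; the parameter name c_t is our assumption), such a distance-substitution kernel can be computed directly:

```python
import numpy as np

def heat_kernel(x, y, c_t=1.0):
    """Distance-substitution kernel: exp(-d(x, y)^2 / c_t) with Euclidean d."""
    d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return np.exp(-d**2 / c_t)

# Similarity decays smoothly with Euclidean distance.
print(heat_kernel([0.0, 0.0], [1.0, 1.0], c_t=2.0))  # exp(-1) ~ 0.3679
```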

Distance metric learning, on the other hand, tries to find an optimal distance metric or embedding space given a task-specific objective for a set of data objects [17]. Under this scope, semantically similar objects are encouraged to be closer to each other, while dissimilar objects are pushed further apart. The prevalence of deep learning has greatly promoted the development of deep metric learning, in which nonlinear relations can be captured, specifically in unsupervised representation learning for images (e.g., computer vision [13, 21]). It has also motivated deep kernel learning [15, 28]. In [31], a deep metric learning based graph kernel is proposed to solve the outcome prediction problem in complex chronic disease treatment planning, where cosine distance was shown to be superior to the Euclidean distance measure.

In practice, both distance metrics and kernel functions are used to solve complex real-world problems, especially in medicine. A problem of interest is drug prescription efficacy prediction [1, 16]. Accurate predictive models for drug prescription improve healthcare: they can reduce medication errors and identify possible drug prescription pathways for clinical personnel to pursue. In [29], a framework is proposed to predict success and failure outcomes of a given drug prescription for antibiotic treatment-based diseases. The approach is further extended to overcome biased data distributions [30] and chronic disease treatment plans [31]. Moreover, Euclidean and cosine distance measures exhibit different performance behaviors. In [31], cosine distance is found superior to the Euclidean measure under highly imbalanced chronic disease data; however, Euclidean distance is used for short-term diseases in [30]. To further investigate such differences, we establish a unified framework, which integrates the two model structures proposed in the aforementioned works, to conduct a rigorous empirical evaluation on all diseases investigated in previous studies [30, 31], in addition to a theoretical discussion. The aforementioned prescription efficacy prediction approaches are now under commercial licensing.

Our contributions are as follows:

  • We propose a scalable unified framework for prescription efficacy prediction.

  • We evaluate performance using 10-fold cross validation on a large-scale, real-world electronic health record dataset that includes common and rare, short-term and chronic illnesses.

  • We investigate the difference between Euclidean and cosine distance on the learned embedding space.

  • We provide a theoretical explanation from a geometric perspective, generalizing Euclidean and cosine distance.

2. Preliminaries

Our previous efforts [29, 30, 31] present a graph kernel-based system for outcome prediction of drug prescription, in particular the success or failure of a treatment, on short-term and chronic diseases. In [30], Multiple Graph Kernel Fusion (MGKF) is proposed to overcome noise effects in short-term disease. A deep graph kernel learning approach, the Cross-Global Attention Graph Kernel Network (Cross-Global), is proposed in [31] to handle long-term chronic disease. In short, we first identify success and failure patients for the target disease treatment as training data within a user-defined time quantum, and extract the set of medical events falling in this time period. Then, we construct a patient graph, a graphical representation of the patient's EHRs, from the extracted medical events. Finally, we perform binary graph classification as prediction through a graph kernel and a kernel-based classifier. We detail each part of the prediction framework in this section.

2.1. Outcome Selection

We define a drug prescription or treatment plan for a disease diagnosis as a failure if certain events occur within a predefined time period, and as a success otherwise. For short-duration diseases, the observed event is a similar or identical disease diagnosis, while for chronic diseases, the observed target is a severe complication defined via medication guidelines. We call a disease short-term if it involves only a single medication with immediate outcome observation and recent medical history (e.g., 2 months prior to the diagnosis). For a chronic disease, we consider a multiple-medication treatment plan with long-term outcome observation and medical history (e.g., 10 years prior to the diagnosis). We refer readers to MGKF [30] and Cross-Global [31] for greater detail.

2.2. Patient Graph

A subset of a patient's EHRs is formulated as a directed acyclic graph where a node represents each medical event and an edge, with the time difference (e.g., in days) used as a weight, connects two consecutive medical events. Patient demographic information, such as gender and age, is included by connecting it to the first medical event with age as the edge weight. (To simplify model assumptions, we only use gender and age as demographic information.) We define a patient graph here as in [30] and [31]:

Definition 2.1 (Patient Graph).

Given $n$ medical events, the set $P = \{(e_1, t_1), \ldots, (e_n, t_n)\}$ represents a patient's EHR, with $e_i$ denoting a medical event such as a diagnosis, and $t_i$ denoting the time for $e_i$. The patient graph of events is a weighted directed acyclic graph $G = (V, E)$ with its vertices $V$ containing all events and its edges $E$ containing all pairs of consecutive events $(e_i, e_{i+1})$. The edge weight from node $i$ to node $j$ is defined as $w_{ij} = t_j - t_i$, which is the time interval between $e_i$ and $e_j$.
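As an illustration (a minimal sketch under assumed data shapes: the EHR is given as a hypothetical time-ordered list of (event code, date) pairs, which is not the authors' data format), a patient graph per Definition 2.1 can be built with networkx:

```python
from datetime import date
import networkx as nx

def build_patient_graph(events, gender, age):
    """Weighted DAG: consecutive medical events linked by time-difference edges."""
    g = nx.DiGraph()
    g.add_node("demo", gender=gender)                # demographic node
    for idx, (code, t) in enumerate(events):
        g.add_node(idx, event=code, time=t)
    g.add_edge("demo", 0, weight=age)                # age as edge weight to first event
    for i in range(len(events) - 1):
        dt = (events[i + 1][1] - events[i][1]).days  # time interval in days
        g.add_edge(i, i + 1, weight=dt)
    return g

# Hypothetical example: a diagnosis followed by a prescription three days later.
events = [("ICD9:599.0", date(2012, 3, 1)), ("ATC:J01CA04", date(2012, 3, 4))]
print(build_patient_graph(events, gender="F", age=42).edges(data=True))
```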

2.3. Graph Kernel

For kernel-based binary graph classification, building a pairwise kernel matrix between patient graphs is the first step. A graph kernel computes the similarity between pairs of graphs. It is a positive definite or semidefinite kernel defined on graphs, which implicitly performs an inner product by mapping data points from an input space to a Hilbert space. It can also be treated as a similarity measurement between two data objects (e.g., graphs). We point readers to [18] for a more in-depth graph kernel discussion and to [6] for a better understanding of graph kernel design principles and the associated feature maps. In [29, 30], several graph kernels are proposed to solve the drug prescription outcome prediction problem as patient graph classification. Please refer to [29, 30] for more in-depth descriptions of the kernel definitions.
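To make the pairwise construction concrete, the following sketch builds a Gram matrix from any graph kernel; the toy vertex-histogram-style kernel (a dot product of node-label counts) merely stands in for the kernels used in [24, 25, 30] and assumes the networkx patient graphs sketched above.

```python
from collections import Counter
import numpy as np

def gram_matrix(graphs, kernel_fn):
    """Symmetric pairwise kernel matrix K[i, j] = k(G_i, G_j)."""
    n = len(graphs)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel_fn(graphs[i], graphs[j])
    return K

def vertex_histogram_kernel(g1, g2):
    """Dot product of node-label count vectors (here, labels are event codes)."""
    c1 = Counter(d["event"] for _, d in g1.nodes(data=True) if "event" in d)
    c2 = Counter(d["event"] for _, d in g2.nodes(data=True) if "event" in d)
    return float(sum(c1[label] * c2[label] for label in c1.keys() & c2.keys()))
```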

2.4. Prediction Framework

We then formulate a binary graph classification problem on the resulting patient graphs by using a kernelized Support Vector Machine (K-SVM) [29], Multiple Graph Kernel Fusion (MGKF) [30], or the Cross-Global Attention Graph Kernel Network (Cross-Global) [31].

As mentioned in Section 1, cosine distance is superior to its Euclidean counterpart under highly imbalanced chronic disease data (Cross-Global), while both the Euclidean and cosine distance measures achieve high prediction performance under short-term diseases (MGKF). We now ask how they relate to each other. It is not suitable to directly compare Cross-Global and MGKF, for the following reasons:

  • Different datasets (from different database providers) are used in MGKF and Cross-Global.

  • Different model structures and optimization perspectives. In MGKF, optimization aims at generating an optimal kernel fusion, while Cross-Global aims at finding an optimal graph embedding.

  • Different class balance and imbalance ratios between short-term and chronic diseases.

To fairly compare MGKF and Cross-Global with Euclidean and cosine distance under short-term and chronic disease, a unified framework is required. Here, we extend and generalize our previous efforts to differentiate the behavior of Euclidean and cosine distance, in addition to the theoretical discussion. A unified framework for graph kernel-based drug prescription outcome prediction is presented to conduct a rigorous empirical evaluation on all diseases from previous works, on very large-scale, real-world EHRs.

3. Discussion from the Geometric Point of View

3.1. Riemannian and SubRiemannian geometries

To discuss differences between Euclidean and cosine distance, we first establish some mathematical properties of these distances from a geometric point of view. In fact, we may consider this problem in a more general setting.

Let $X_1, \ldots, X_k$ be linearly independent vector fields on an $n$-dimensional real manifold $M$, spanning a subbundle $\mathcal{H}$ of the tangent bundle $TM$. To find a good kernel function describing the diffusion (energy flow) between two points in $M$, we need to solve the heat equation associated to the sum of squares of the $X_j$'s:

$$\frac{\partial u}{\partial t} = \Delta_X u, \qquad \Delta_X = \sum_{j=1}^{k} X_j^2.$$

When $k = n$, the operator $\Delta_X$ is elliptic. Assume that the $X_j$'s form an orthonormal frame for a Riemannian metric on $M$. In this case, we have a natural volume element, which yields the adjoint vector fields $X_j^{*}$ for $j = 1, \ldots, n$. More precisely, for suitable functions $u$ and $v$,

$$\int_M (X_j u)\, v \, dV = \int_M u\, (X_j^{*} v)\, dV,$$

whence the operator $\sum_{j} X_j^{*} X_j$ is, up to sign and lower-order terms, the classical Laplace-Beltrami operator whose second-order part agrees with the operator $\Delta_X$. This suggests that we may use the given differential operator to introduce a geometry on $M$ which may help us solve $\Delta_X u = f$ and hence the heat equation. Hence, for small $t > 0$, the solving kernel for the heat operator takes the form

$$P_t(x, y) \approx \frac{1}{t^{n/2}}\, e^{-\frac{d(x, y)^2}{4t}} \sum_{j \ge 0} a_j(x, y)\, t^{j}.$$

The $a_j$'s are functions of $x$ and $y$. Here $d(x, y)$ represents the induced Riemannian distance between the points $x$ and $y$ in $M$, and the truncation of the series stands for a negligible error. Furthermore,

$$\frac{\partial f}{\partial t} + H(x, \nabla_x f) = 0, \qquad f(x, y, t) = \frac{d(x, y)^2}{4t},$$

i.e., $f$ is a solution of the Hamilton-Jacobi equation. Here $H$ is the Hamiltonian function associated with $\Delta_X$.

The simplest example is the Euclidean distance, for which the kernel is the Gaussian (see [3]). In this paper, we use another non-trivial example of a Riemannian metric. Given a large sample space, we first embed the samples in an $(n-1)$-dimensional sphere $S^{n-1} \subset \mathbb{R}^n$. Given two points $x$ and $y$ on $S^{n-1}$, we define the "distance" as the angle

$$\theta(x, y) = \arccos \frac{\langle x, y \rangle}{\|x\|\, \|y\|},$$

where $\|x\|$ is the Euclidean distance between the point $x$ and the origin. This is the so-called cosine metric. In other words, $x$ and $y$ lie on the same sphere; hence, $x$ and $y$ are located on a "great circle" determined by the center of the sphere and these two points. Instead of measuring the arc-length (which may be huge), we consider the angle between $x$ and $y$. This metric provides better estimates for the kernel in applications of the drug prescription prediction system for long-term disease.
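For intuition, a small numerical sketch (our own example values) comparing the angular (cosine) and Euclidean measures:

```python
import numpy as np

def angular_distance(x, y):
    """Angle between x and y, i.e., the cosine metric on the sphere through them."""
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))

x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(angular_distance(x, y))   # pi/4 ~ 0.7854 (insensitive to vector magnitude)
print(np.linalg.norm(x - y))    # Euclidean distance ~ 1.0 (magnitude-sensitive)
```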

When $k < n$, the operator $\Delta_X$ is non-elliptic. In this case, the subspace $\mathcal{H}_x = \mathrm{span}\{X_1(x), \ldots, X_k(x)\}$ is called the "horizontal subspace" of $T_xM$, and the vectors in $\mathcal{H}_x$ are called horizontal vectors at $x$. Sometimes, we call the distribution $\mathcal{H}$ the horizontal distribution. The sections of the horizontal bundle are called horizontal vector fields; they are smooth assignments $x \mapsto X(x) \in \mathcal{H}_x$. The set of horizontal vector fields on $M$ will be denoted by $\Gamma(\mathcal{H})$. If $U$ is an open subset of $M$, the set of horizontal vector fields on $U$ will be denoted by $\Gamma(\mathcal{H}, U)$. We call the complement of $\mathcal{H}_x$ in $T_xM$ the "missing directions" at $x$.

Now we encounter new problems, since $\mathcal{H}_x \neq T_xM$ and we cannot measure arc-length for arbitrary curves in general. We overcome this difficulty by assuming the bracket generating property: "the horizontal vector fields and their iterated brackets span $T_xM$ at every point". Chow's theorem [10] then allows us to conclude that given any two points $x, y \in M$, there is a piecewise horizontal curve $\gamma: [0, 1] \to M$ such that $\gamma(0) = x$ and $\gamma(1) = y$. This yields a distance, and therefore a geometry, which we shall call subRiemannian geometry.

SubRiemannian geometry was first discussed in the field of thermodynamics in the 1800s. Carnot discovered the principle of an engine in 1824 involving two isotherms and two adiabatic processes, Joule studied adiabatic processes, and Clausius formulated the existence of the entropy in the second law of thermodynamics in 1854. In 1909 Carathéodory made the point regarding the relationship between the connectivity of two states by adiabatic processes and the nonintegrability of a distribution, which is defined by the one-form of work. Chow proved the general global connectivity in 1934, which was then used in the study of partial differential equations. There are significant differences between Riemannian and subRiemannian geometries. Nevertheless, this geometry can be applied in many situations in our daily life. For more details, readers can consult the book by Calin and Chang [2].

A subRiemannian structure over a manifold $M$ is a pair $(\mathcal{H}, g)$, where $\mathcal{H}$ is a bracket generating distribution and $g$ is a fibre inner product defined on $\mathcal{H}$. The length of a horizontal curve $\gamma: [0, 1] \to M$ is

$$\ell(\gamma) = \int_0^1 \sqrt{g\big(\dot\gamma(t), \dot\gamma(t)\big)}\, dt.$$

The shortest length is called the Carnot-Carathéodory distance between $x$ and $y$, which is given by

$$d_{CC}(x, y) = \inf_{\gamma} \ell(\gamma),$$

where the infimum is taken over all absolutely continuous horizontal curves joining $x$ and $y$ [5].

Here we mention examples of subRiemannian geometry. Let $X_1 = \partial_x$ and $X_2 = x\, \partial_y$ be the Grushin vector fields [4, 8] on $\mathbb{R}^2$, which satisfy the bracket generating property, i.e.,

$$[X_1, X_2] = \partial_y, \qquad \mathrm{span}\{X_1(p), X_2(p), [X_1, X_2](p)\} = T_p\mathbb{R}^2 \ \text{for every point } p.$$

Vector fields of this kind can be used to describe parallel parking and even self-driving cars [7].
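As a quick sanity check of the bracket generating condition, the Lie bracket can be computed symbolically; a minimal SymPy sketch (the coefficient-list representation of the fields is our own convention):

```python
import sympy as sp

x, y = sp.symbols("x y")
# Vector fields as coefficient lists over (d/dx, d/dy): X1 = d/dx, X2 = x * d/dy.
X1 = [sp.Integer(1), sp.Integer(0)]
X2 = [sp.Integer(0), x]

def lie_bracket(A, B, coords):
    """[A, B]^i = sum_j (A^j dB^i/dx^j - B^j dA^i/dx^j)."""
    return [sp.simplify(sum(A[j] * sp.diff(B[i], coords[j]) - B[j] * sp.diff(A[i], coords[j])
                            for j in range(len(coords)))) for i in range(len(coords))]

print(lie_bracket(X1, X2, [x, y]))  # [0, 1], i.e. d/dy, so X1, X2, [X1, X2] span R^2
```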

3.2. Horizontal Connectivity

In the outcome prediction task for drug prescription, one of the main difficulties is to distinguish features under short-term versus long-term disease progression. Moreover, for long-term diseases, we need to avoid low-efficacy (or even useless or dangerous) drugs. Mathematically, the first task is to address the following question: given any two points on a topologically connected subRiemannian manifold, under what conditions can we join them by a horizontal curve? The answer to this question not only helps us better characterize the embedding spaces induced by the Euclidean and cosine (subRiemannian) distances, but also leads to different optimization formulations based on their embedding properties.

To answer this question, we need to prove the following two results. Readers can find more detailed discussions in the book by Calin and Chang [2].

Proposition 3.1. Let $U$ be an open set and $\mathcal{H}$ be a differentiable distribution of rank $k$ on $U$. Then for any point $x_0 \in U$ there is a manifold $S$ such that

(i). $x_0 \in S$;

(ii). $\dim S = k$;

(iii). any two points of $S$ can be joined by a piecewise horizontal curve.

Proof.

Let be the vector fields in local coordinates. Consider the ODE system

(3.1)

where with , is a system with parameters.

The solutions of (3.1) are horizontal curves with controls . Let be the initial conditions of system (3.1). Standard theorems of ODE system provide the existence and local uniqueness of the solutions, which can be expressed by

for , with . Since the vector components are differentiable, a general theorem states that the functions are twice differentiable with respect to and locally continuously differentiable with respect to .

Since system (3.1) is autonomous, a simple application of the chain rule shows that the functions

verify the relations

where and .

Applying the theorem on differentiability with respect to a parameter to system (3.1) yields that the functions are continuously differentiable with respect to the parameters when these are sufficiently small.

If we let , for , then the formulas

for define a -dimensional manifold passing through the point . To finish the proof, we need to show that the rank of the Jacobian is maximal, i.e., equal to the dimension. This is equivalent to the fact that the vector fields

are linearly independent. Since , it suffices to show that

are linearly independent. Since

it follows that

which are linearly independent vector fields for . It follows that

The proof of this proposition is therefore complete. ∎

Proposition 3.2. Let be a nonintegrable distribution. Assume that through each point of the domain there passes a -connected -dimensional manifold defined by the equations

(3.2)

where are continuously differentiable functions on a domain , such that

Then there is a domain such that

(i). for all , there is a -connected -dimensional manifold passing through ;

(ii). the functions that define the manifolds on have the same properties as the functions ’s in (3.2).

Proof.

Let be the horizontal distribution and

be the extrinsic ideal associated with . Since the distribution is not integrable, the Pfaff system is not integrable; i.e., it cannot have integral manifolds of dimension .

Proof of statement . For any , there is a horizontal vector such that , i.e., not tangent to the manifold .

The proof is by contradiction. Let be a fixed point. Assume that every horizontal vector field about is tangent to the manifold . Then

Therefore , and since , it follows that the inclusion is in fact an identity; i.e., for all . Since the one-forms vanish on , it follows that the tangent plane is an integral plane for the Pfaff system and hence the manifold is an integral manifold, which is a contradiction, because the distribution is not integrable. Hence we have proved the assertion.

Let be a point with coordinates and be the vector given by ; i.e., and . Let be such that

The numbers will be kept constant for the rest of the proof.

Proof of statement . The matrix

has rank at the point .

The first rows of the matrix are the components of the coordinate vector fields on the manifold , which are tangent to , linearly independent, and span the tangent space . The last row of has the component of the vector , which is transversal to , so all vectors are linearly independent at and hence .

Since all the elements of the matrix are continuous functions of the coordinates of the point , while are still kept constant, there is a subdomain such that and on .

From the non-vanishing Jacobian condition on it follows that the following vector fields

(3.3)

are linearly independent on .

From the preceding discussion, the following vector fields

(3.4)

are linearly independent on . We can complete system (3.4) with elements of set (3.3), say

(3.5)

are linearly independent on .

In the following we shall deal with the construction of a -dimensional manifold passing through , which depends on parameters. In equation (3.2) consider the parameters frozen. Let be the coordinates on this new manifold . Then

(3.6)

where is continuously differentiable with respect to and , and

The equation of the integral curves of the vector field on are given by

(3.7)

We shall construct a -dimensional manifold by pushing the manifold in the direction of the integral curves of . This can be done by substituting the variables given by (3.6) into the expressions provided by (3.7). Let be the variable. We obtain

where the parameters are kept constant. These are continuously differentiable functions of .

To show that the equations

(3.8)

define a manifold of dimension , we need to show that

(3.9)

on some neighborhood of included in .

Applying the chain rule yields

Since on a neighborhood of , using that vector fields (3.4) are linearly independent yields

are linearly independent, which means that (3.9) holds.

Using that (3.5) are linearly independent on , it follows that the vector fields

are linearly independent on a subdomain , which contains . Therefore

and hence the functions have the same properties as the functions in (3.2).

In conclusion, through each point of passes a -connected manifold defined by equations (3.8), and each manifold depends on parameters. This finishes the proof of the proposition. ∎

Now we are in a position to prove the local connectivity property. This result was proved in 1957 by Teleman [26] for Pfaff systems that do not contain integrable combinations. Here we shall prove it from the point of view of distributions.

Theorem 3.3. Let be a nonintegrable differentiable distribution of rank on the open set . Then any domain contains a subdomain such that for any , , there is a piecewise horizontal curve that joins the points and .

Proof.

From Proposition 3.1, for any , there is a -dimensional -connected manifold passing through . Applying Proposition 3.2 times yields a subdomain such that for all , there is an -dimensional -connected manifold passing through .

Let be two arbitrary points. Let be a path joining and contained in (not necessarily a horizontal curve). Since covers the compact set , there is a finite subcovering; i.e., we can choose points on

such that

We can choose the points such that any two consecutive points and belong to the same manifold . Since the manifolds are -connected , the points and can be joined by a horizontal curve. This way, the points and can be joined by a piecewise horizontal curve. ∎

3.3. Subelliptic Heat Kernel

Now we need to use the Hamilton or Lagrange formalism to construct the fundamental solution of the subelliptic heat operator. In other words, we are interested in finding the solving kernels for these operators. Inspired by the Gaussian, it is reasonable to expect that the kernel has the form:

See e.g.,  [4],  [8] and  [9].

The modified complex action function can be written as which plays the role of and satisfies the Hamilton-Jacobi equation

In general, when we deal with a subelliptic heat operator, the heat kernel will depend on parameters (or Lagrange multipliers). Furthermore, after calculation, one may see that the action function can be written as

We look for a heat kernel of this form. The heat kernel should not depend on these parameters, so we use an age-old technique to remove them by summing over them; since they are continuous parameters, the sum becomes an integral. Thus we shall look for a heat kernel in the following form:

Here is the so-called volume element, an appropriate measure that makes the integral convergent. Now we may apply properties of the heat kernel and reduce the problem to solving the transport equation to find the remaining factor. Once we obtain it, the index can be determined, which depends on the Hausdorff dimension of the subRiemannian manifold . When , the manifold is an -dimensional Riemannian manifold and the Hausdorff dimension coincides with the topological dimension of the manifold. In this case, the volume element is just the zero section, which recovers the results in the elliptic case. For more details, readers may consult the books [3], [4] and a forthcoming research article.

4. Unified Framework

To compare how distance metrics affect prediction performance, we present a unified framework for a graph kernel-based drug prescription prediction system in support of a rigorous empirical evaluation. We consider all disease and distance metric configurations for both balanced and imbalanced data ratios. Motivated by MGKF and Cross-Global, a hybrid model is formulated to leverage the advantages of these two models. Following the same three-kernel MGKF architecture, namely a Weisfeiler-Lehman subtree kernel [24], a temporal topological kernel [30], and a vertex histogram kernel [25], we generate a fused kernel embedding with the distance metric loss from Cross-Global as regularization.

Specifically, three pairwise kernel matrices are constructed via the aforementioned graph kernels. A single representation for a fused kernel embedding is generated through a deep neural network for subsequent classification. The distance regularization, achieved by a contrastive loss [14], is integrated to bring in the power of deep metric learning, and we force the kernel embedding to preserve the chosen distance property. Semantically similar embeddings are encouraged to be closer to each other, and dissimilar ones further apart, in the kernel space. With this setting, the kernel embedding is optimized jointly with the classification loss and the contrastive loss, deriving a single representation with multiple views and the selected distance property. We discuss how the embedding and the prediction performance differ under different distance metrics in the next section.
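As an illustration of the distance regularization, the standard contrastive loss of [14] and a joint objective can be sketched as follows; the weighting constant, pairing scheme, and function names are our assumptions, not the exact form of (4.1)-(4.3):

```python
import numpy as np

def contrastive_loss(d, same_label, margin=1.0):
    """Pull same-class embedding pairs together; push different-class pairs
    apart until their distance exceeds the margin.
    d: pairwise distance between two kernel embeddings (Eq. 4.4 or 4.5);
    same_label: 1 if the pair shares a class label, 0 otherwise."""
    return same_label * d**2 + (1 - same_label) * np.maximum(0.0, margin - d)**2

def joint_loss(classification_loss, pair_distances, pair_labels, lam=0.1, margin=1.0):
    """Sketch of the joint objective: classification loss + distance regularization."""
    reg = np.mean(contrastive_loss(np.asarray(pair_distances),
                                   np.asarray(pair_labels), margin))
    return classification_loss + lam * reg
```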

Given a set of patients with their patient graphs and associated class labels (success or failure), we compute the pairwise kernel Gram matrices for the three graph kernels above. Letting the fusion model be a deep neural network parameterized by its weights and the classifier be a single-layer sigmoid function with its own parameters, we define the unified framework as the following optimization problem:

(4.1)
(4.2)
(4.3)

where the margin is a constant threshold, the pair indicator equals 1 if the two graphs share a class label and 0 otherwise, and the pairwise distance between kernel embeddings is calculated by (4.4) or (4.5).

Letting $h_i$ and $h_j$ denote the kernel embeddings of two patients, we also have:

(4.4)   $D_{\cos}(h_i, h_j) = 1 - \dfrac{\langle h_i, h_j \rangle}{\|h_i\|\, \|h_j\|}$
(4.5)   $D_{\mathrm{euc}}(h_i, h_j) = \|h_i - h_j\|_2$

where $D_{\cos}$ is the cosine distance and $D_{\mathrm{euc}}$ is the Euclidean distance. As usual, $\langle \cdot, \cdot \rangle$ is the standard inner product. The distance used in (4.2) can be either $D_{\cos}$ or $D_{\mathrm{euc}}$.

The optimization problem in (4.1) can be solved by mini-batch Stochastic Gradient Descent (SGD). Once we find the optimal network and classifier parameters, we can perform prediction. Given a new incoming patient with its patient graph, we compute the pairwise kernel matrices between this graph and all patient graphs in the training set. Then, we have the following decision function:

(4.6)

where the output is the predicted class label (e.g., success or failure) of the new patient.
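A minimal sketch of this decision step, under assumed names for the trained components (the three kernel functions, the fusion network f, and the sigmoid classifier), none of which are taken from the authors' code:

```python
import numpy as np

def predict(G_new, train_graphs, kernels, f, sigmoid_clf, threshold=0.5):
    """Kernel rows between the new graph and the training graphs feed the trained model."""
    rows = [np.array([k(G_new, G) for G in train_graphs]) for k in kernels]
    embedding = f(rows)                   # fused kernel embedding (Section 4)
    p_success = sigmoid_clf(embedding)    # probability of treatment success
    return int(p_success >= threshold)    # 1 = success, 0 = failure (cf. Eq. 4.6)
```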

The problem reduces to finding good Riemannian or subRiemannian structures to handle a huge and complicated data set under certain constraints. In other words, we need to handle the related optimization problem (4.1) by finding horizontal vector fields and then constructing the solving kernel of the heat operator associated to the subelliptic operator. In this paper, we consider Reproducing Kernel Hilbert Spaces (RKHS) with certain geometric properties derived from Euclidean or cosine distances.

Disease Number of cases Number of failures Number of successes Failure:success ratio
Urinary tract infection 1,501,310 703,646 797,664 47%:53%
Acute otitis media 151,522 72,264 79,258 48%:52%
Pneumonia 95,796 37,724 58,072 39%:61%
Acute cystitis 733,119 301,902 431,217 41%:59%
Hypertension 235,695 104,936 130,759 45%:55%
Hyperlipidemia 123,380 26,043 97,337 21%:79%
Diabetes 131,997 34,414 97,583 26%:74%
Table 1. Dataset / Disease statistics

5. Dataset and Evaluation Protocol

To investigate how the distance metric relates to kernel embedding and prediction, we conduct a rigorous empirical evaluation with our proposed unified framework under different data balance-imbalance ratios on very large-scale, real-world EHRs, a subset of the Taiwanese National Health Insurance Research Database (NHIRD, https://nhird.nhri.org.tw/en/).

Our sample of the NHIRD contains 20-plus years of complete medical history for over one million randomly sampled patients. The database is provided by Taiwan's National Health Insurance Administration and the Ministry of Health and Welfare. The data are composed of registration files and original claim data for reimbursement to hospitals that participate in the National Health Insurance (NHI) program. The International Classification of Diseases, 9th Revision, Clinical Modification (ICD9-CM) code indicates the disease diagnosed. A unique identifier is used per drug and can be further linked to the Anatomical Therapeutic Chemical (ATC) code. For privacy purposes, the NHIRD contains no personal patient information such as name, contact information, and exact birth date (only birth year and month are available). All identification numbers for patients and hospitals are also de-identified to prevent possible information leakage. Institutional Review Board (IRB) approvals for our research were granted by all associated institutions.

We select four short-term diseases and the three most prevalent chronic diseases in Taiwan. Following our previous efforts, we set an observation window for each type of disease. Refer to Table 1 for the complete disease list, data statistics, and outcome observation setup. To validate the claim from Cross-Global [31] that cosine distance is superior under data imbalance in chronic disease, and to examine the corresponding case for short-term disease in MGKF [30], we prepare balanced and imbalanced data. For the balanced setting, we downsample the majority cases to the size of the minority cases, emulating rare diseases; for the imbalanced setting, we keep a 70 percent majority to 30 percent minority ratio. A pairwise t-test with the p-value threshold set to 0.01 is used to reject the null hypothesis and measure the statistical significance of comparisons.
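A small sketch of how such balanced and imbalanced cohorts could be drawn; the 70/30 reading of the imbalanced setting and the helper names are our assumptions:

```python
import numpy as np

def make_cohort(majority_ids, minority_ids, balanced=True, seed=0):
    """Balanced: downsample the majority class to the minority size.
    Imbalanced: sample so that majority:minority is roughly 70:30."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority_ids)
    if balanced:
        n_major = len(minority)
    else:
        n_major = min(int(np.ceil(len(minority) * 7 / 3)), len(majority_ids))
    majority = rng.choice(np.asarray(majority_ids), size=n_major, replace=False)
    return majority, minority
```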

We compare Euclidean (Euclidean) and cosine (Cosine) distance on our unified framework. We set 5000 dimensions for the first embedding layer of each kernel and 50 dimensions for the kernel fusion layer, as a two-layer multi-layer perceptron (MLP), and 50 dimensions for the sigmoid classifier. During training, we use the Adamax optimizer [19] with a fixed learning rate of 0.0001 and a batch size of 64 for 1000 epochs, with an early stopping criterion on the batch loss. Two machine learning models are included as baselines, namely Support Vector Machine (SVM) and Logistic Regression (LR), with the regularization constant set to 1. All patients are represented as documents containing all medical codes from all visits and are transformed into low-dimensional embeddings via Paragraph Vectors [20] with embedding size 512. Accuracy (ACC), Macro F1-score (F1), and the area under the receiver operating characteristic curve (AUC) are used as our evaluation metrics. All models are developed with the TensorFlow and scikit-learn packages in Python. The experiments are executed on an Intel Core i9 CPU with 64GB memory and one Nvidia Titan-RTX GPU with 24GB memory. We do not perform hyper-parameter tuning for any model. Accuracy comparisons with other learning models, including a diversity of the latest neural configurations, can be found in [30, 31].
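For reference, a minimal Keras sketch matching the stated dimensions and optimizer settings; the layer wiring and input handling are our assumptions, and the contrastive regularizer of Section 4 is omitted:

```python
import tensorflow as tf

def build_model(n_train):
    # One input per pairwise kernel matrix row (WL subtree, temporal, vertex histogram).
    inputs = [tf.keras.Input(shape=(n_train,)) for _ in range(3)]
    embedded = [tf.keras.layers.Dense(5000, activation="relu")(x) for x in inputs]
    fused = tf.keras.layers.Dense(50, activation="relu")(
        tf.keras.layers.Concatenate()(embedded))
    output = tf.keras.layers.Dense(1, activation="sigmoid")(fused)
    model = tf.keras.Model(inputs=inputs, outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```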

6. Main Result

Tables 2, 3, 4, and 5 show evaluation results under short-term diseases with balanced and imbalanced data, under different models; chronic diseases are reported in Tables 6, 7, and 8. (Shaded regions indicate statistical equivalence (light) and significant difference (dark) between the Euclidean and cosine measures. A designation indicates statistical significance over all baselines (SVM and LR) with p-value at 0.01.)

Under a balanced setting, Euclidean and cosine distance perform similarly in both the short-term and chronic disease evaluations. Euclidean distance even outperforms cosine distance in 5 out of 7 diseases, implying that Euclidean distance can achieve favorable results when data variation is small, regardless of short- or long-term disease progression. Under an imbalanced setting, the Euclidean distance measure is superior to the cosine measure for all short-term diseases. This confirms our premise that Euclidean distance is applicable to local problems, namely short disease progression. On the other hand, cosine distance is preferable for imbalanced long-term chronic disease, where it outperforms Euclidean distance especially in F1 score (an indicator of model performance on imbalanced data sets). It is worth noting that the evaluation margin between Euclidean and cosine distance is large in all chronic diseases. The degree of outcome variation (e.g., co-morbidity) in the long-term chronic disease patient group is larger than in the short-term disease group, which again reflects that Euclidean distance is more applicable to low-variation data sets. The comparison to the baseline models validates our unified framework. Note that we did not perform any hyper-parameter tuning, nor did we customize the model to any specific disease group; the purpose of this evaluation is strictly to investigate model behavior under different distance metrics.

Urinary tract infection
Balanced
Model ACC F1 AUC
Euclidean 0.6220 ± 0.0212 0.6186 ± 0.0229 0.6220 ± 0.0212
Cosine 0.6243 ± 0.0284 0.6216 ± 0.0283 0.6243 ± 0.0284
SVM 0.5047 ± 0.0257 0.5046 ± 0.0258 0.5047 ± 0.0257
LR 0.5210 ± 0.0240 0.4874 ± 0.0300 0.4988 ± 0.0296
Imbalanced
Euclidean 0.6280 ± 0.0469 0.5465 ± 0.0290 0.5895 ± 0.0249
Cosine 0.6165 ± 0.0532 0.5632 ± 0.0356 0.6194 ± 0.0182
SVM 0.5023 ± 0.0266 0.5051 ± 0.0266 0.5053 ± 0.0265
LR 0.5208 ± 0.0240 0.4872 ± 0.0260 0.4896 ± 0.0297
Table 2. Evaluation Results for Urinary tract infection (mean ± std).
Acute otitis media
Balanced
Model ACC F1 AUC
Euclidean 0.6245 ± 0.0200 0.6224 ± 0.0218 0.6245 ± 0.0200
Cosine 0.6138 ± 0.0183 0.6097 ± 0.0185 0.6137 ± 0.0185
SVM 0.5023 ± 0.0204 0.5021 ± 0.0203 0.5023 ± 0.0203
LR 0.5011 ± 0.0211 0.5010 ± 0.0211 0.5011 ± 0.0212
Imbalanced
Euclidean 0.6570 ± 0.0203 0.5453 ± 0.0342 0.6037 ± 0.0324
Cosine 0.6238 ± 0.0306 0.5554 ± 0.0258 0.6042 ± 0.0346
SVM 0.5165 ± 0.0196 0.4803 ± 0.0212 0.4899 ± 0.0246
LR 0.5170 ± 0.0177 0.4804 ± 0.0201 0.4898 ± 0.0237
Table 3. Evaluation Results for Acute otitis media (mean ± std).
Pneumonia
Balanced
Model ACC F1 AUC
Euclidean 0.6013 ± 0.0279 0.5922 ± 0.3112 0.6013 ± 0.0279
Cosine 0.6023 ± 0.0211 0.5918 ± 0.0263 0.6023 ± 0.0211
SVM 0.4976 ± 0.0127 0.4975 ± 0.0127 0.4976 ± 0.0126
LR 0.4979 ± 0.0130 0.4978 ± 0.0130 0.4979 ± 0.0129
Imbalanced
Euclidean 0.6398 ± 0.0688 0.5626 ± 0.0423 0.6028 ± 0.0270
Cosine 0.6255 ± 0.0470 0.5712 ± 0.0209 0.6220 ± 0.0250
SVM 0.5430 ± 0.0243 0.5070 ± 0.0246 0.5179 ± 0.0268
LR 0.5430 ± 0.0243 0.5074 ± 0.0242 0.5186 ± 0.0261
Table 4. Evaluation Results for Pneumonia (mean ± std).
Acute cystitis
Balanced
Model ACC F1 AUC
Euclidean 0.6143 ± 0.0189 0.6087 ± 0.0245 0.6143 ± 0.0189
Cosine 0.6095 ± 0.0182 0.6068 ± 0.0199 0.6095 ± 0.0182
SVM 0.5049 ± 0.0231 0.5048 ± 0.0231 0.5049 ± 0.0231
LR 0.5037 ± 0.0228 0.5037 ± 0.0228 0.5037 ± 0.0228
Imbalanced
Euclidean 0.6353 ± 0.0346 0.5607 ± 0.0231 0.5763 ± 0.0325
Cosine 0.6280 ± 0.0405 0.5632 ± 0.0201 0.5839 ± 0.0267
SVM 0.5235 ± 0.0227 0.4871 ± 0.0219 0.4965 ± 0.0233
LR 0.5230 ± 0.0248 0.4860 ± 0.0232 0.4957 ± 0.0242
Table 5. Evaluation Results for Acute cystitis (mean ± std).
Hypertension
Balanced
Model ACC F1 AUC
Euclidean 0.7315 ± 0.0126 0.7305 ± 0.0131 0.7315 ± 0.0126
Cosine 0.7290 ± 0.0131 0.7282