On Sharing Models Instead of Data Using Mimic Learning for Smart Health Applications

12/24/2019 ∙ by Mohamed Baza, et al.

Electronic health record (EHR) systems contain vast amounts of medical information about patients. These data can be used to train machine learning models that predict health status, as well as to help prevent future diseases or disabilities. However, obtaining patients' medical data to build well-trained machine learning models is a challenging task, because sharing patients' medical records is prohibited by law in most countries due to privacy concerns. In this paper, we tackle this problem by sharing models instead of the original sensitive data using the mimic learning approach. The idea is to first train a model on the original sensitive data, called the teacher model. Then, using this model, we transfer its knowledge to another model, called the student model, without the student ever accessing the original data used to train the teacher. The student model is then shared with the public and can be used to make accurate predictions. To assess the mimic learning approach, we evaluated our scheme using different medical datasets. The results indicate that the student model mimics the teacher model's prediction accuracy without requiring access to the patients' original data records.


I Introduction

Over the past few years, the adoption of electronic health records (EHRs) by health care systems has increased significantly [birkhead2015uses]. According to [shickel2017deep], nearly 84% of hospitals have adopted at least a basic EHR system. The main goal of EHR systems is to store detailed patient data such as demographic information, diagnoses, laboratory tests and results, prescriptions, radiological images, and clinical notes [arndt2017tethered].

Recently, the research community has increasingly incorporated machine learning algorithms into the EHR domain [rajkomar2018scalable]. The primary goal of these algorithms is to develop models that physicians can use to predict health status and help prevent future diseases or disabilities. As an example, Google has developed a machine learning algorithm to help identify cancerous tumours on mammograms [ref:google]. Also, Google's DeepMind Health project [ref:google2] aims to create the best treatment plans for cancer patients by rapidly analyzing their medical test results and instantly referring them to the right specialist.

It is well known that data drives machine learning [obermeyer2016predicting]: the more data available, the more likely machine learning algorithms are to give accurate predictions that doctors can use. However, patient records maintained by hospitals cannot be shared due to patient privacy concerns. In most countries, privacy protection laws have been passed to protect patients' data from being shared or leaked. For instance, in 1996, the Health Insurance Portability and Accountability Act (HIPAA) Title II was enacted in the U.S.A. [2]. One of its primary objectives is to increase the protection of patients' medical records against unauthorized usage and disclosure. These laws prevent hospitals from sharing medical data records since, with detailed person-specific records, sensitive information about patients may be easily revealed by analyzing the shared data. Even if the data is anonymized before sharing, research has shown that patients can be identified by combining specific pieces of information (such as age, address, and sex). For example, [samarati2001protecting] shows that linking medication records with voter lists can uniquely identify a person's name and his/her medical information. Therefore, privacy concerns hinder sharing patients' records, which makes creating well-trained machine learning models a challenging task.

Fig. 4: Machine learning classifiers used in our scheme: (a) SVM classifier, (b) KNN classifier, (c) RF classifier.

In this paper, we tackle the aforementioned problem by sharing machine learning models instead of the data using the mimic learning approach. The main idea of mimic learning is to enable the transfer of knowledge from sensitive private EHR records to a shareable model while protecting the original patients' data from being shared. In a nutshell, a teacher model, trained on the original sensitive patient data, is used to annotate a large set of unlabeled public data. Then, these labeled data are used to train a new model called the student model. The student model can be shared to make accurate predictions without the need to share the original data or even the teacher model. To give empirical evaluations of the mimic learning approach in the EHR domain, we used three different datasets of patient data. The results indicate that the student model follows the teacher model in making accurate predictions; moreover, as the teacher's accuracy increases, the student model's performance on unseen data follows.

The remainder of this paper is organized as follows. The background is presented in Section II. Our proposed scheme is discussed in detail in Section III. The experimental results are discussed in Section IV. Section V discusses the previous related work. Our conclusions are presented in Section VI.

II Background

Our scheme includes the following four machine learning algorithms: Support Vector Machine (SVM), k-Nearest Neighbor (KNN), Random Forests (RF), and Naïve Bayes (NB). In this section, we provide an overview of these algorithms; a minimal scikit-learn sketch instantiating the four classifiers follows this list.

  • Support Vector Machine (SVM). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible [Cortes:1995:SN:218919.218929]. In the SVM model, decision hyperplanes are formed based on identified support vectors to create a separation gap that divides the instances of the two classes with the maximal margin, as shown in Fig. 4(a). This is done by finding the hyperplane that places the largest fraction of points of the same class on the same side.

  • K-Nearest Neighbor (KNN). The KNN classifier is a supervised learning algorithm that assigns a class to a given instance based on the majority class among its $k$ nearest neighbors [shruthi2019review]. The algorithm takes as input the training examples $(x_i, y_i)$, where $x_i$ is the attribute-value representation of the $i$-th example and $y_i$ is its class label (e.g., benign or malignant), and a testing point $x_t$ that we want to classify. KNN first computes the distance $d(x_t, x_i)$ to every training example $x_i$. Then, it selects the $k$ closest instances $\{x_{i_1}, \dots, x_{i_k}\}$ and their labels $\{y_{i_1}, \dots, y_{i_k}\}$. Finally, it outputs the class that is the most frequent among $\{y_{i_1}, \dots, y_{i_k}\}$. As illustrated in Fig. 4(b), the point to be classified is (5, 2.45). When applying the KNN algorithm with $k = 7$ using Euclidean distance, the resulting neighborhood is shown with a dotted circle. There are two possible cases: the malignant class with two instances and the benign class with five instances. The algorithm assigns the point to the benign class since it represents the majority of the data within the radius.

  • Random Forests (RF). A random forest is a classifier that consists of multiple decision trees, each of which provides a vote for a specific class [Breiman2001]. Combining a large number of trees in a random forest leads to more reliable predictions, whereas a single decision tree may overfit the data. As illustrated in Fig. 4(c), an instance is labeled with the class selected by the majority of the trees' votes in the forest. We adopted RF for its high capability of avoiding the overfitting problem.

    Fig. 5: Illustration of the system model under consideration.
    Fig. 6: Teacher model generation process.
    Fig. 7: Student model generation process.
    Fig. 8: Illustration of the training process in the teacher/student model generation.
  • Naïve Bayes (NB). NB is a statistical classifier that assumes no dependency between attributes. It attempts to maximize the posterior probability in determining the class by assuming that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 4" in diameter. Even if these features depend on one another, a naïve Bayes classifier considers all of them to contribute independently to the probability that the fruit is an apple.
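To make the above descriptions concrete, the following is a minimal sketch (not the authors' code) showing how the four classifiers could be instantiated with scikit-learn; the hyperparameter choices (RBF kernel, $k = 7$ neighbors, 100 trees) are illustrative assumptions rather than the paper's reported settings.

```python
# Hypothetical instantiation of the four classifiers used in the scheme.
# All hyperparameters below are assumptions for illustration.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "SVM": SVC(kernel="rbf", probability=True),       # maximal-margin hyperplane
    "KNN": KNeighborsClassifier(n_neighbors=7,        # majority vote among the
                                metric="euclidean"),  # k nearest neighbors
    "RF": RandomForestClassifier(n_estimators=100),   # majority vote of many trees
    "NB": GaussianNB(),                               # feature-independence assumption
}
```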

III Proposed Scheme

In this section, we describe our proposed scheme. We first present the system architecture and then the generation processes of the teacher and student models.

Iii-a System Architecture

As illustrated in Fig. 5, the system architecture includes two main entities, namely data owners and end users. Data owners can be large hospitals that own the patients' medical data records; they do not want to share these data with others due to patients' privacy concerns, and they are also responsible for generating the teacher and student models. The end users can be physicians or even other hospitals that want to make use of machine learning models for early prediction of diseases and for creating better treatment plans.

III-B Teacher Model Generation

The first step is to generate the teacher model. As illustrated in Fig. 6, data owners use their original sensitive data to train a set of machine learning classifiers and obtain the corresponding models. These models are then compared according to their performance, and the most accurate model is selected as the teacher model.

The training process is illustrated in Fig. 8. We first split the labeled data into training and testing data. The training data is then used for training the candidate teacher models, evaluating the results, and adjusting the models' parameters to predict the data more accurately. Once the training process is done, the test data are classified, and the best classifier is selected as the teacher model.
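This split-train-select procedure can be sketched as follows, assuming the `classifiers` dictionary from Section II and a sensitive labeled dataset `(X, y)`; the 80/20 split and accuracy-based selection are plausible assumptions consistent with the description, not the authors' exact code.

```python
# Minimal sketch of teacher-model selection under the assumptions above.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)          # hold out test data

teacher, best_acc = None, 0.0
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                      # train on the sensitive records
    acc = accuracy_score(y_test, clf.predict(X_test))
    if acc > best_acc:                             # keep the most accurate classifier
        teacher, best_acc = clf, acc
# `teacher` remains private; only the student derived from it is shared.
```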

Notice that the teacher model is kept private and cannot be shared with the public, because an adversary with access to the teacher model could learn sensitive information about individuals using some kind of model inversion attack [fredrikson2014privacy].

III-C Student Model Generation

The generation of the student model starts after the teacher model is generated. The process of computing the student model is illustrated in Fig. 7. The first step is the labeling/annotation process, in which the teacher model is used to label (or annotate) an unlabeled public dataset to generate new labeled training data.

After the labeling/annotation process is performed, the generated labeled data is used for training and selecting the student model, as illustrated in Fig. 7. The training process is similar to that in Fig. 8. The major difference between training the student and teacher models is that the training process of the student model is essentially a knowledge-transfer process: the knowledge that the teacher model has gained from the sensitive patient data is transferred to the student model using the publicly available data labeled in the annotation process.
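A minimal sketch of this knowledge-transfer step follows, assuming the selected `teacher` from the previous sketch and a hypothetical unlabeled public feature matrix `X_public`; using RF for the student mirrors the choice in Section IV but is otherwise an assumption.

```python
# Minimal sketch of student-model generation via teacher annotation.
from sklearn.ensemble import RandomForestClassifier

y_pseudo = teacher.predict(X_public)       # teacher annotates the public data

student = RandomForestClassifier(n_estimators=100)
student.fit(X_public, y_pseudo)            # student learns only from pseudo-labels
# `student` never saw the sensitive records and can be shared with the public.
```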

Finally, once the student model is obtained, its performance needs to be evaluated against the teacher model to ensure its accuracy before sharing it.

IV Experiments

In this section, we first explain the data and the process used to evaluate our scheme. We also present and discuss the results of this evaluation.

Iv-a Datasets Description

We used three different datasets in our experiments. Table I gives an overview of these datasets; a hypothetical loading sketch follows the table.

  • The breast-cancer dataset is obtained from the UCI machine learning repository [UCI] and contains breast cancer diagnostic data for 699 patients. In this dataset, features are computed from digitized images of a fine needle aspirate of a breast mass and describe the characteristics of the cell nuclei present in the image.

  • The cardiovascular dataset is obtained from the Kaggle data science platform [Kaggle]. The dataset contains detailed medical examinations of about 70,000 patients with cardiovascular diseases. The data includes factual information (age, gender, height, weight), examination results (systolic blood pressure, diastolic blood pressure, cholesterol, glucose levels), and subjective information provided by the patient (smoking, alcohol intake, activity level).

  • The heart-disease dataset is obtained from the UCI machine learning repository [UCI2]. The data includes information such as age and sex, test results such as resting blood pressure, serum cholesterol, fasting blood sugar, and resting electrocardiographic results, and subjective information such as chest pain type.

Disease        | No. of samples | No. of features
Breast cancer  | 699            | 18
Cardiovascular | 70,000         | 11
Heart disease  | 303            | 14

TABLE I: Datasets used in the evaluations.
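For illustration, the three datasets could be loaded as follows; the file names and the `label` column are hypothetical placeholders, not the repositories' actual formats.

```python
# Hypothetical loading of the three datasets (file/column names are assumptions).
import pandas as pd

breast = pd.read_csv("breast_cancer.csv")    # 699 samples, 18 features
cardio = pd.read_csv("cardiovascular.csv")   # 70,000 samples, 11 features
heart = pd.read_csv("heart_disease.csv")     # 303 samples, 14 features

# Feature matrix and labels for one dataset, assuming a `label` column.
X = breast.drop(columns=["label"]).values
y = breast["label"].values
```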

IV-B Performance Metrics

Classifier |           Teacher model            |           Student model
           | Precision Recall F1-Score Accuracy | Precision Recall F1-Score Accuracy
SVM        | 0.849     0.862  0.822    0.8546   | 0.945     0.941  0.935    0.9382
KNN        | 0.969     0.967  0.964    0.9673   | 0.962     0.958  0.956    0.9636
RF         | 0.966     0.967  0.965    0.9697   | 0.969     0.966  0.964    0.9673
NB         | 0.963     0.961  0.961    0.9618   | 0.945     0.94   0.942    0.94

TABLE II: The experiment results of the teacher and student models for the breast-cancer data.

Classifier |           Teacher model              |           Student model
           | Precision Recall   F1-Score Accuracy | Precision Recall F1-Score Accuracy
SVM        | 0.8633    0.84     0.836667 0.8387   | 0.826     0.813  0.816    0.8172
KNN        | 0.86      0.85     0.85     0.8495   | 0.807     0.783  0.78     0.7850
RF         | 0.903     0.893    0.893    0.8925   | 0.843     0.84   0.84     0.8387
NB         | 0.8       0.793    0.793    0.8172   | 0.8433    0.84   0.8433   0.8387

TABLE III: The experiment results of the teacher and student models for the heart-disease data.

Classifier |           Teacher model              |           Student model
           | Precision Recall   F1-Score Accuracy | Precision Recall F1-Score Accuracy
SVM        | 0.6333    0.6833   0.66     0.6420   | 0.63      0.61   0.59667  0.6107
KNN        | 0.71667   0.70667  0.70667  0.7083   | 0.61      0.6    0.5933   0.6004
RF         | 0.73      0.7267   0.7267   0.7264   | 0.6867    0.68   0.6767   0.6798
NB         | 0.6633    0.59     0.5367   0.5883   | 0.6867    0.6833 0.6833   0.6842

TABLE IV: The experiment results of the teacher and student models for the cardiovascular-disease data.

For the performance evaluation in the experiments, we first denote TP, FP, TN, and FN as true positives (the number of instances correctly predicted as positive for individuals who are diagnosed with the disease), false positives (the number of instances incorrectly predicted as positive for individuals who are not diagnosed with the disease), true negatives (the number of instances correctly predicted as negative for individuals who are not diagnosed with the disease), and false negatives (the number of instances incorrectly predicted as negative for individuals who are diagnosed with the disease), respectively. Then, we define the following key performance metrics used in our evaluation (a code sketch after the list shows how they can be computed):

  • Accuracy is the ratio of correctly predicted instances (true positives and true negatives) to the overall number of patients:

    $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$  (1)
  • Precision is the proportion of correct positive classifications (true positives) among the cases that are predicted as positive:

    $\text{Precision} = \dfrac{TP}{TP + FP}$  (2)
  • Recall is the proportion of correct positive classifications (true positives) among the cases that are actually positive for a specific disease:

    $\text{Recall} = \dfrac{TP}{TP + FN}$  (3)
            |      Breast cancer                 |      Heart disease                 |   Cardiovascular disease
            | Precision Recall F1-Score Accuracy | Precision Recall F1-Score Accuracy | Precision Recall F1-Score Accuracy
    Teacher | 0.97      0.97   0.97     0.96     | 0.90      0.89   0.89     0.89     | 0.73      0.72   0.72     0.72
    Student | 0.97      0.97   0.964    0.96     | 0.843     0.84   0.84     0.83     | 0.68      0.68   0.67     0.67

    TABLE V: Comparison of the results of the teacher and student models using the RF classifier for the breast cancer, heart, and cardiovascular diseases data.

    Fig. 12: ROC curve comparison using the RF classifier for both the teacher and student models for all diseases used in our experiment: (a) breast cancer, (b) heart disease, (c) cardiovascular disease.
  • F1-Score is the weighted harmonic mean of precision and recall; we use the F1-score to represent the overall performance of the classifier:

    $\text{F1} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$  (4)
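Eqs. (1)-(4) can be computed directly from a confusion matrix, as in this minimal sketch; it assumes the fitted `student` model and the held-out `(X_test, y_test)` from the earlier sketches, with the positive class encoded as 1.

```python
# Minimal sketch: compute Eqs. (1)-(4) from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, student.predict(X_test)).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (1)
precision = tp / (tp + fp)                                  # Eq. (2)
recall = tp / (tp + fn)                                     # Eq. (3)
f1_score = 2 * precision * recall / (precision + recall)    # Eq. (4)
```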

Besides the aforementioned evaluation metrics, we use the receiver operating characteristic (ROC) curve to evaluate the pros and cons of both the teacher and student models for different diseases. The ROC curve shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR), which are defined as follows:

$\text{TPR} = \dfrac{TP}{TP + FN}, \qquad \text{FPR} = \dfrac{FP}{FP + TN}$

The closer the ROC curve is to the upper left corner of the graph, the better the model performs. The AUC is the area under the ROC curve; the closer the area is to one, the better the model.
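A minimal sketch of how the ROC/AUC comparison in Fig. 12 could be produced follows, assuming both models expose `predict_proba` (true of the scikit-learn classifiers sketched in Section II) and binary labels with the positive class encoded as 1.

```python
# Minimal sketch of a teacher-vs-student ROC comparison for one dataset.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

for label, model in [("Teacher", teacher), ("Student", student)]:
    scores = model.predict_proba(X_test)[:, 1]   # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc_score(y_test, scores):.2f})")

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```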

IV-C Results and Discussion

To evaluate the effectiveness of the mimic learning approach, we used the different classifiers mentioned in Section II, namely SVM, KNN, RF, and NB. We also used $k$-fold cross-validation (CV) to reduce the chances of getting biased testing datasets. The idea of $k$-fold CV is to divide the training data into $k$ equal portions; the model is then trained on $k - 1$ folds, and the remaining fold is used for evaluation, with the process repeated so that each fold serves once as the evaluation set. In all classifiers, we selected the parameter $k$ to be ten [Han:2011:DMC:1972541].
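The 10-fold cross-validation described above could be run as follows, again assuming the `classifiers` dictionary and the labeled data `(X, y)` from the earlier sketches.

```python
# Minimal sketch of 10-fold cross-validation over the four classifiers.
from sklearn.model_selection import cross_val_score

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # train on 9 folds, test on the 10th
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```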

Each of the four classifiers is used to generate the teacher and student models for each disease. Tables II, III, and IV give the results of evaluating the four classifiers for the breast cancer, heart, and cardiovascular diseases, respectively. It is clearly seen that the RF classifier outperforms the other classifiers when evaluated on the original sensitive data for all diseases. We also observe that SVM gives the lowest accuracy in the case of the breast cancer dataset, and NB shows the lowest performance in the case of the heart and cardiovascular datasets. Therefore, we selected the RF classifier as the teacher model. Similarly, in the case of the student models, the results clearly indicate that the RF classifier performs best among this set of classifiers, while SVM shows the lowest accuracy on the breast cancer dataset and KNN gives the lowest accuracy for the heart and cardiovascular diseases. Therefore, the RF model is chosen as the student model.

Then, we compared the performance of the teacher and student models using the RF classifier on test data. The results of the comparison for the different diseases are given in Table V. The results indicate that the performance of the teacher and student models is nearly identical. This confirms our assertion that unlabeled data annotated by a teacher model can be used to transfer knowledge to a student model without revealing data that is considered sensitive.

In another part of our evaluations, we used ROC curves to visualize the performance of both the teacher and student models, as shown in Fig. 12. The student model has almost identical performance to the teacher model for the breast cancer and heart diseases. However, for the cardiovascular disease, the gap between the student and teacher models becomes more apparent. This is because the teacher model's accuracy is the lowest among the three EHR datasets; the student model follows the teacher model, causing the gap between the student and the teacher to increase compared to the other two diseases, as shown in Fig. 12(c).

V Related Work

Several works have been proposed in the literature to study how machine learning can be used for medical diagnosis in the EHR domain. However, few works have addressed the problem of sharing patients' sensitive data.

In [choi2017generating], Choi et al. proposed an approach, called the medical Generative Adversarial Network (medGAN), to generate realistic synthetic patient records. Based on input real patient records, medGAN can generate high-dimensional discrete variables (e.g., binary and count features) via a combination of an autoencoder and generative adversarial networks. However, the scheme shows privacy risks in both identity and attribute disclosure.

The mimic learning approach has also been used in the information retrieval (IR) domain, in which a user can query a huge collection of documents. In the IR domain, access to large-scale datasets is crucial for designing effective IR systems, but due to privacy issues, obtaining such access is a real challenge. In [DBLP2], a mimic learning scheme has been proposed to train a shareable model using two different techniques, namely weak and full supervision [dehghani2017neural]. Then, using the shareable model, it is easy to create large-scale datasets. Unfortunately, current research has not yet studied the use of mimic learning for electronic health records, a domain-specific environment comprising patients' medical records that the law does not allow to be shared with others.

Footnote 1: Many works have studied data security and privacy [baza2019b, baza2018blockchain, parkccnc, omar1, omar2, parksmarnet, yilmaz2019expansion, shafee2019mimic, baza2019detecting, baza2019blockchain, pazos2019privacy, firmware2, Lightride].
Footnote 2: Communication protocols for secure content delivery in networks of drones have also been proposed [Delay-Analysis, 8647532, 8885505, 8886101, 8412262, amer2018optimizing, amer2019performance, amer2020caching, 8756296, amer2019mobility, chaccour2019Reliability, 7925123, 7605066].

VI Conclusions

In this paper, we tackled the problem of sharing patients' medical records using the mimic learning approach. A knowledge-transfer methodology has been proposed to enable hospitals to share models without the need to share sensitive patient records. We evaluated the mimic learning approach using extensive experiments on three different patient-disease datasets. The evaluation results indicate that the teacher and student models have nearly identical performance. Moreover, as the teacher model's accuracy increases, the student model also follows the teacher. These results indicate that mimic learning succeeds in transferring the knowledge of the teacher model to a shareable model without the need to share the patients' sensitive data. We believe that this work can incentivize large hospitals to share models so that others can make use of them to further improve people's health.

References