A Disease Diagnosis and Treatment Recommendation System Based on Big Data Mining and Cloud Computing

10/17/2018 ∙ by Jianguo Chen, et al.

It is crucial to provide compatible treatment schemes for a disease according to its various symptoms at different stages. However, most classification methods might be ineffective in accurately classifying a disease that holds the characteristics of multiple treatment stages, various symptoms, and multi-pathogenesis. Moreover, there are limited exchanges and cooperative actions in disease diagnosis and treatment between different departments and hospitals. Thus, when new diseases occur with atypical symptoms, inexperienced doctors might have difficulty in identifying them promptly and accurately. Therefore, to maximize the utilization of the advanced medical technology of developed hospitals and the rich medical knowledge of experienced doctors, a Disease Diagnosis and Treatment Recommendation System (DDTRS) is proposed in this paper. First, to identify disease symptoms more accurately, a Density-Peaked Clustering Analysis (DPCA) algorithm is introduced for disease-symptom clustering. In addition, association analyses on Disease-Diagnosis (D-D) rules and Disease-Treatment (D-T) rules are conducted separately using the Apriori algorithm. Appropriate diagnosis and treatment schemes are recommended to patients and inexperienced doctors, even in a limited therapeutic environment. Moreover, to reach the goals of high performance and low-latency response, we implement a parallel solution for DDTRS using the Apache Spark cloud platform. Extensive experimental results demonstrate that the proposed DDTRS realizes disease-symptom clustering effectively and derives disease treatment recommendations intelligently and accurately.


1 Introduction

1.1 Motivation

Technological advancements and cost reduction in medical equipment and disease diagnosis have greatly accelerated the adoption of state-of-the-art technologies in various hospitals [29, 30]. The benefits of obtaining interactive and intelligent medical services based on knowledge discovery are rapidly growing. The accurate classification of different disease symptoms is essential in helping doctors carry out compatible treatment schemes for a disease. However, traditional disease classification methods usually follow naive practices based on limited disease information, which might fail to further classify a disease according to its symptoms at different treatment stages. In particular, for diseases with the characteristics of multiple similar treatment stages, various symptoms, and multi-pathogenesis, the accuracy and effectiveness of traditional classification algorithms are significantly reduced. Therefore, it is crucial to find suitable approaches to accurately classify disease symptoms based on inspection reports.

In general, medical doctors diagnose diseases and select treatment schemes based mostly on their personal experience and knowledge. Inadequate communication, experience exchange, and cooperation between young and senior doctors prevent young doctors from learning from the experience, diagnoses, and treatment plans of experienced senior doctors. For example, the determinants of patients' and doctors' delays in the diagnosis and treatment of colorectal cancer were discussed in [35]. Despite the generation and availability of abundant medical data related to patients, diseases, treatment plans, and their results, these data are neither appropriately analyzed to extract useful knowledge nor efficiently shared among doctors and hospitals. Due to the lack of diagnostic experience, fledgling medical doctors might have difficulty in correctly diagnosing a disease in a patient who presents an atypical symptom, and thus they struggle to prescribe effective treatment plans. Hence, the sharing and recommendation of medical knowledge can help fledgling doctors improve their disease diagnosis and treatment skills. A disease diagnosis and treatment recommendation system is therefore developed to bridge the gap between the medical resources of developed and underdeveloped hospitals and between the medical knowledge of experienced and inexperienced doctors.

Because of the massive volume, variety, and continuous updating of medical data, the efficient processing of medical data and the real-time response of treatment recommendations have become important issues. Fortunately, parallel computing and cloud computing provide powerful capabilities for coping with large-scale data. Various hospitals have developed cloud-computing-based solutions for treatment guidance and have implemented various improvement measures for medical services. For example, Abbas et al. discussed a cloud-based health insurance plan recommendation system that implements a user-centered approach [1]. Apache Hadoop [2] is a well-known cloud platform that is widely utilized in big data mining. Li et al. proposed an efficient tool for de novo peptide sequencing that utilizes the Hadoop cloud computing environment [19]. Apache Spark [3] is an excellent cloud platform that is suitable for data mining with iterative computation. Spark supports the parallel programming models of Resilient Distributed Datasets (RDDs) and Directed Acyclic Graphs (DAGs), which are built on an in-memory computing framework. Benefitting from the RDD and DAG models, data caches are kept in memory, and iterations are performed on the same dataset directly from memory. Hence, the Spark platform is well suited for data mining with iterative computation because it saves huge amounts of disk I/O time.
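As a brief illustration of why in-memory caching benefits iterative mining, the following minimal PySpark sketch (the file path and record layout are hypothetical, not from the authors' system) caches a parsed RDD once and reuses it across iterations instead of re-reading it from disk:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("iterative-mining-sketch")
sc = SparkContext(conf=conf)

# Hypothetical HDFS path; each line is a comma-separated numeric record.
records = sc.textFile("hdfs:///medical/inspection/blood.csv") \
            .map(lambda line: [float(v) for v in line.split(",")])
records.cache()  # keep the parsed RDD in memory for repeated passes

# Each iteration reuses the cached RDD directly from memory,
# avoiding the repeated disk I/O a MapReduce-style job would incur.
for _ in range(10):
    stats = records.map(lambda r: (r[0], 1)) \
                   .reduceByKey(lambda a, b: a + b) \
                   .collect()
```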

1.2 Our Contributions

In this paper, we focus on medical resource sharing and treatment intelligence in big data and cloud computing environments and propose a Disease Diagnosis and Treatment Recommendation System (DDTRS). Large-scale historical inspection datasets are analyzed to derive disease-symptom clusters. Association relationships between diseases, diagnoses, and treatments are discovered in historical treatment records. Based on these relationships, valuable diagnosis and treatment plans of diseases are recommended to medical doctors and patients according to the current disease stages. Extensive experimental analysis indicates that the system recommends solutions effectively and accurately. Our contributions in this paper are summarized as follows.

  • To identify disease symptoms more accurately and effectively, especially for diseases with multiple treatment stages and multi-pathogenesis, a Density-Peaked Clustering Analysis (DPCA) algorithm is introduced for patients’ disease-symptom clustering, based on the symptoms extracted from large-scale historical inspection data.

  • To fully utilize and share the valuable disease diagnosis and treatment knowledge of experienced doctors and developed hospitals, Disease-Diagnosis (D-D) and Disease-Treatment (D-T) association rules are defined and analyzed with the Apriori algorithm.

  • Interactive recommendation interfaces of DDTRS are designed and implemented for medical doctors and patients. Medical doctors and patients can access the inspection reports and the corresponding treatment recommendations at different treatment stages. They can update the inspection results in DDTRS to obtain recommendations.

  • To achieve the goals of high performance and low latency response, we parallelize DDTRS on the Apache Spark cloud computing platform. Massive volumes of medical data are stored in the Hadoop Distributed File System (HDFS), and a parallel solution is employed based on the RDD programming model.

The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 introduces a disease diagnosis and treatment recommendation system, which consists of three core modules: a DPCA-based disease-symptom clustering process, a disease-treatment association analysis process, and recommendation interfaces. To reach the goals of efficiency and timeliness, the proposed system is parallelized in Section 4 using the Apache Spark cloud computing platform. Experimental and application results are presented in Section 5 with respect to recommendation effectiveness and performance. Finally, Section 6 concludes the paper with a discussion of future work and research directions.

2 Related Work

Benefitting from the development of medical technology and information technology, numerous studies focus on the fields of disease prevention, disease treatment, hospital informatization, and drug discovery. Applications of big data analytics in hospitals were presented in [9, 5]. Patients’ healthcare datasets were stored digitally in the form of Electronic Health Records (EHRs), and realistic and valuable information was obtained from these records by using felicitous analysis techniques and software tools. Diverse patterns, trends, associations, visualization, querying, information privacy, and predictive analytics of the healthcare datasets were analyzed. Evidence-Based Medicine (EBM) is a medical method that establishes best-practice recommendations based on graded treatment schemes for diagnostic and therapeutic issues in health care [18]. In EBM, decisions about the care of individual patients are made based on the current best available clinical evidence, combined with the doctor’s clinical experience and taking into account the patient’s values and aspirations. As a valuable and applicable medical method, EBM has been utilized in various specialties, such as neurology [27], pediatric urology [34], and burn care [9]. In addition, numerous efforts have been made to use big data and data mining technology for EBM [41, 30]. For example, in [41], Yesha et al. introduced a personalized decision support system to enhance EBM using big data analytics. Reports about medication accidents and treatment failures due to diagnosis and treatment delays or inappropriate diagnoses were reviewed in [29]. Makwakwa et al. considered health system delays in the diagnosis, new treatment, and retreatment of pulmonary tuberculosis in Malawi [22]. The median patient delay was 14.0 days for both new treatment and retreatment of tuberculosis cases, and the median health system delay was 59.0 days for new treatment and 40.5 days for retreatment cases. The authors concluded that effective management and new diagnostic techniques were needed for both new treatment and retreatment cases. In our previous work, we explored a parallel patient-treatment-time prediction algorithm and its applications in hospital queuing recommendations in big data environments [5]. However, research and applications on disease diagnosis and treatment recommendation based on large-scale historical medical data remain limited.

With a focus on clustering analysis, abundant notable achievements have been contributed in existing studies [12, 15]. Multifarious traditional clustering algorithms were reviewed in [42], consisting of partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Distinct from these traditional clustering approaches, a group of novel density-peak-based clustering algorithms was explored in [37, 38, 39]. A density-peak-based hierarchical clustering method (DenPEHC) was proposed, in which clusters can be generated directly at each possible clustering layer. Rodriguez et al. addressed the DPCA algorithm based on the idea that a cluster center is surrounded by neighbors with lower local density and lies at a relatively large distance from any point with a higher density [31]. A fast density-based data stream clustering algorithm for mixed data with self-determined cluster centers was presented in [6]. Other valuable clustering algorithms were introduced in [14, 32]. Compared with iterative clustering algorithms, the computational complexity of DPCA is very low. Benefitting from its robustness on rough data and its non-iterative training, we introduce the DPCA algorithm to cluster disease symptoms in this work.

Efforts have also been made in the fields of association analysis and information recommendation [43, 45, 16]. In [43], Yuan et al. discussed a personalized recommendation algorithm based on user models and a user-project matrix. A random-walk-based recommendation algorithm that considers item categories was suggested in [45]. Jia et al. improved the Apriori algorithm based on association analysis [16]. In [33], Sing et al. reviewed automatic tag recommendation algorithms for social recommender systems. Corbellini et al. [8] explored an architecture and platform for developing distributed recommendation algorithms, in which the recommendation time is reduced through job distribution and non-invasive tuning strategies. In the medical field, valuable patient-oriented recommendations were introduced in [46, 7, 17]. Zheng et al. introduced a context-neighbor recommender platform, which integrates contexts via neighbors for recommendations [46]. Chen et al. proposed an autocratic decision-making system using group recommendation methods [7]. In [17], Kefalas et al. explored a graph-based taxonomy approach for recommendation algorithms and systems. A wearable assistant for gait training in Parkinson’s disease, which addresses freezing of gait in out-of-the-laboratory environments, was introduced in [24]. In addition, large-scale medical data were analyzed in [25] to identify frequent diseases using the Apriori algorithm; the frequencies of diseases were identified for patients living in various geographical locations during different periods. Taking advantage of the high accuracy of the Apriori algorithm, we exploit it to extract association rules between diseases and their diagnosis and treatment schemes.

The rapid development of big data and cloud computing technologies provides powerful computing capability for mining large-scale medical data, from which valuable knowledge can be extracted in each application field. Yao et al. designed and developed a big medical data processing system [40]. Shunmei et al. reviewed recommendation methods for big data applications based on cloud computing [26]. Numerous machine-learning and data-mining algorithms have been implemented on various cloud platforms [10, 47, 36], such as Apache Hadoop and Apache Spark. Li et al. introduced a computing resource management framework for cloud computing platforms, which can be applied in the field of medical data mining [20]. In [10], del Río et al. used the MapReduce model of Apache Hadoop to enhance the performance of the Random Forest (RF) algorithm for imbalanced big data. In [36], Wang et al. developed a new Distributed Trajectory R-Tree (DTR-Tree) index algorithm on the Apache Spark platform and achieved high efficiency. Heuristically, we implement the parallel solution of DDTRS on the Apache Spark cloud computing platform.

3 Disease Diagnosis and Treatment Recommendation System

To identify disease symptoms more accurately and to fully utilize and share the rich medical knowledge of experienced doctors, we propose a disease diagnosis and treatment recommendation system. DDTRS comprises two core modules: (a) the disease-symptom clustering analysis module, in which a DPCA-based clustering algorithm is introduced to classify disease symptoms based on the given inspection reports, and (b) the disease diagnosis and treatment recommendation module, in which effective association rules of disease diagnosis and treatment are defined and analyzed with the Apriori algorithm. Valuable diagnosis and treatment plans are recommended to medical doctors and patients via the interactive interfaces of DDTRS.

In the early stages of a patient’s treatment, the patient usually undergoes a series of inspection tests, as advised by the attending doctor. Once the inspection reports are available, they are submitted to DDTRS to obtain a disease-symptom cluster, and then further obtain recommendations of disease diagnosis and treatment plans. The workflow of the disease diagnosis and treatment recommendation is presented in Figure 1.

Figure 1: Workflow of disease diagnosis and treatment recommendation

3.1 Standardization Process of Medical Data

Currently, different hospitals use different standards for some inspection tasks, and the standards of one hospital might not be acceptable to another; few hospitals follow the unified inspection-report standards issued by the superior health administrative departments. Therefore, a standardization process is required for the medical datasets gathered from different hospitals. Large-scale historical medical datasets are gathered from our cooperating hospitals. Private patient information, such as ID code, patient name, telephone, and address, is filtered out before the data collection process. In the standardization process, a series of operations, such as data integration and data cleaning, is performed on the inspection datasets before the clustering and association analyses.

3.1.1 Preprocessing of Patient Inspection Data

Benefitting from precise instruments and quantifiable results, inspection datasets are utilized as the data source for disease-symptom clustering. In the preprocessing phase, inspection datasets from different departments and hospitals are collected. Owing to their diverse contents, the inspection data come in multiple formats, e.g., numeric, text, and image formats.

(1) Inspection data in numeric format.

Reports generated by some inspection tasks are stored in numeric format, such as those for routine blood inspections, routine urine inspections, and bone marrow examinations. Each inspection task might involve multiple inspection items to be tested. An example report of a routine blood inspection task is shown in Table 1.

No.  Inspection items                                   Result  Unit      Reference value range  Remark
1    White blood cell (WBC)                             5.86    10^9/L    4-10
2    Red blood cell (RBC)                               5.6     10^12/L   3.5-5.5
3    Platelet (PLT)                                     341     10^9/L    100-300
4    Hematocrit (HCT)                                   47.3    %         37-52
5    Hemoglobin (HGB)                                   148.0   g/L       110-170
6    Mean corpuscular hemoglobin (MCH)                  28.3    pg        27-35
7    Mean corpuscular hemoglobin concentration (MCHC)   314     g/L       320-362
8    Mean corpuscular volume (MCV)                      85.9    fL        82.6-99.1
9    Mean platelet volume (MPV)                         11.5    fL        7.6-13.2
10   Platelet larger cell ratio (P-LCR)                 28.5    %         13-43
Table 1: Example of a blood inspection report

As shown in Table 1, more than 10 inspection items are required in a blood inspection task. The patient information and the corresponding inspection items of each inspection task are selected as the feature variables of the patient’s disease data. Due to differences in inspection items, the numbers of feature variables of the disease data are different among inspection tasks. The feature variables of the inspection datasets selected for disease-symptom clustering analysis are presented in Table 2. Detailed descriptions of the inspection item names and abbreviations are given in Table 3.

No.  Inspection task               Feature variables
1    Routine blood                 specimen, patient gender, age, WBC, RBC, PLT, HCT, HGB, MCH, MCHC, MCV, MPV, P-LCR, …
2    Routine urine                 specimen, patient gender, age, BLD, BIL, URO, KET, PRO, NIT, GLU, PH, SG, WBC, RBC, …
3    Tumor marker                  specimen, patient gender, age, AFP, AFU, LN, CEA, FER, CA50, CA199, CA125, CA153, TNF-α, NSE, …
4    Hepatic function              specimen, patient gender, age, ALT, AST, TP, ALB, GLO, TBIL, DBIL, IBIL, ALP, CHE, ADA, MAO, LDH, …
5    Hepatitis B surface antigen   specimen, patient gender, age, ALT, ALP, GGT, TP, ALB, TBA, HBsAb, MAO, HBeAb, HBcAb, AFU, …
Table 2: Feature variables of the inspection data (partial lists)
No. Abbreviation  Full name of inspection items       No. Abbreviation  Full name of inspection items
1   ADA     Adenosine deaminase                       27  HCT     Hematocrit
2   AFP     Alpha-fetoprotein                         28  HGB     Hemoglobin
3   AFU     α-fucosidase                              29  IBIL    Indirect bilirubin
4   ALB     Albumin                                   30  KET     Urine acetone bodies (ketone)
5   ALP     Alkaline phosphatase                      31  LDH     Lactate dehydrogenase
6   ALT     Alanine aminotransferase                  32  LN      Laminin
7   AST     Aspartate aminotransferase                33  MAO     Monoamine oxidase
8   BIL     Bilirubin                                 34  MCH     Mean corpuscular hemoglobin
9   BLD     Urine occult blood                        35  MCHC    Mean corpuscular hemoglobin concentration
10  CA50    Carbohydrate antigen 50                   36  MCV     Mean corpuscular volume
11  CA125   Carbohydrate antigen 125                  37  MPV     Mean platelet volume
12  CA153   Carbohydrate antigen 153                  38  NIT     Nitrite
13  CA199   Carbohydrate antigen 199                  39  NSE     Neuron-specific enolase
14  CEA     Carcinoembryonic antigen                  40  RBC     Red blood cell
15  CHE     Cholinesterase                            41  PH      pH value
16  DBIL    Direct bilirubin                          42  P-LCR   Platelet larger cell ratio
17  EC      Epithelial cells                          43  PLT     Platelet
18  FER     Ferritin                                  44  PRO     Urine protein
19  GGT     Gamma-glutamyl transpeptidase             45  SG      Specific gravity of urine
20  GLO     Globulin                                  46  TBA     Total bile acid
21  GLU     Urine glucose                             47  TBIL    Total bilirubin
22  GOL     Urine color                               48  TNF-α   Tumor necrosis factor-α
23  HBcAb   Hepatitis B core antibody                 49  TP      Total protein
24  HBeAb   Hepatitis B e antibody                    50  URO     Urobilinogen
25  HBsAb   Hepatitis B surface antibody              51  WBC     White blood cell
26  HBsAg   Hepatitis B surface antigen               52  β-HCG   β-human chorionic gonadotropin
Table 3: Inspection item names and abbreviations
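To make the feature-variable extraction concrete, here is a minimal sketch (the field names, encoding choice, and example record are hypothetical illustrations, not the authors' dataset layout) that maps a routine-blood report into the numeric feature vector used for clustering:

```python
# Order of feature variables for the "routine blood" task (cf. Table 2);
# gender is encoded numerically since DPCA operates on numeric vectors.
BLOOD_FEATURES = ["gender", "age", "WBC", "RBC", "PLT", "HCT",
                  "HGB", "MCH", "MCHC", "MCV", "MPV", "P-LCR"]

def report_to_vector(report: dict) -> list[float]:
    """Flatten one inspection report (a dict of item -> value) into
    a fixed-order numeric feature vector for clustering."""
    gender = 1.0 if report.get("gender") == "male" else 0.0
    values = {**report, "gender": gender}
    return [float(values[name]) for name in BLOOD_FEATURES]

# Hypothetical record resembling Table 1.
example = {"gender": "male", "age": 35, "WBC": 5.86, "RBC": 5.6,
           "PLT": 341, "HCT": 47.3, "HGB": 148.0, "MCH": 28.3,
           "MCHC": 314, "MCV": 85.9, "MPV": 11.5, "P-LCR": 28.5}
print(report_to_vector(example))
```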

(2) Inspection data in text format.

Distinct from the inspection tasks addressed above, there exists another type of inspection task, namely, imaging tasks, e.g., Computed Tomography (CT) scanning, Magnetic Resonance Imaging (MRI) scanning, X-ray, type-B sonography, and color Doppler ultrasound. The symptom descriptions of these inspection tasks are expressed in the form of images or texts. The patient information and the corresponding symptom results of each inspection task are chosen as the feature variables of the patient inspection data.

3.1.2 Preprocessing of Disease Diagnosis and Treatment Data

(1) Disease diagnosis-scheme data.

In this work, the diagnosis-scheme data refer to the detailed diagnosis descriptions of diseases recorded by doctors rather than the simple definitions of the diseases. Diverse diagnosis schemes are provided at different treatment stages of the same disease. These schemes are recorded by medical doctors or assistants in text format.

$C = \{c_1, c_2, \ldots\}$ is assumed to be a dataset of disease-symptom clusters. The diagnosis schemes of the same disease-symptom cluster are collected for association analysis. Let $G = \{g_1, g_2, \ldots\}$ be a dataset of diagnosis schemes. The dataset of the association items between the disease-symptom clusters and the corresponding diagnosis schemes is denoted as $A_{CG}$. An association item $(c_i \Rightarrow g_j)$ is defined when there exists a diagnosis scheme $g_j$ for disease-symptom cluster $c_i$, where $c_i \in C$ and $g_j \in G$.

(2) Disease treatment-scheme data.

Available treatment schemes in hospitals include injections, intravenous infusions, surgical treatments, needle therapy, cupping therapy, and physical therapy. During the treatment of a disease, the medical doctor might adjust the treatment plan for the patient depending on the current inspection results. Taking a disease as an example, treatment scheme $t_1$ might be suggested for a serious condition, $t_2$ for a moderate condition, and $t_3$ for a mild condition. Namely, a specific treatment scheme or a combination of treatment schemes is applied at the current treatment stage of a disease.

Let $T = \{t_1, t_2, \ldots\}$ be a dataset of treatment schemes. The dataset of the associated items between the disease-symptom clusters and the corresponding treatment schemes is denoted as $A_{CT}$. An associated item $(c_i \Rightarrow t_j)$ is defined when there exists a treatment scheme $t_j$ for disease-symptom cluster $c_i$, where $c_i \in C$ and $t_j \in T$.

3.2 DPCA-based Disease-Symptom Clustering

Although the traditional classification measures are widely used in disease classification and have achieved notable results, they might have difficulty in further classifying a disease accurately according to the symptoms at different treatment stages. To identify disease symptoms more accurately and effectively, the DPCA algorithm is introduced for patients’ disease-symptom clustering. The DPCA algorithm is appropriate for diseases that have the characteristics of multiple similar treatment stages, various symptoms, and multi-pathogenesis. Taking advantage of the non-iterative training and the robustness of DPCA on rough data, we introduce DPCA to cluster the disease symptoms in this section. DPCA is an innovative density-peak-based clustering algorithm, based on the idea that a cluster center is surrounded by neighbors with lower local density and lies at a relatively large (delta) distance from any point with a higher density. After the cluster centers are determined, each of the remaining data points is assigned to the cluster of its nearest neighbor with higher density.

3.2.1 Clustering Analysis on Inspection Data in Numeric Format

Because of the different dimensions of the datasets in separate inspection tasks, the pre-processed inspection datasets are grouped by task name. The dataset of each inspection task is clustered by DPCA to obtain disease-symptom clusters separately. The workflow of the clustering analysis on the inspection data in numeric format is illustrated in Figure 2.

Figure 2: Workflow of clustering analysis on the inspection data

Let $X$ be a dataset of the pre-processed inspection data in numeric format, defined as follows:

$X = \{X_1, X_2, \ldots, X_n\}$,

where $n$ is the number of inspection tasks and $X_i$ is the data subset of the $i$-th inspection task. Due to the different dimensions of these data subsets, each data subset is clustered by the DPCA algorithm separately. The steps of the disease-symptom clustering are described as follows.

(1) Calculate the local density for each data point.

Assume that inspection data subset $X_i$ contains $m$ feature variables; namely, the dimension of $X_i$ is $m$. There are $N$ inspection records in $X_i$, namely, $X_i = \{x_1, x_2, \ldots, x_N\}$. We calculate the distance matrix $D_i$ of $X_i$, where $d_{ab}$ denotes the distance between records $x_a$ and $x_b$, as defined in Eq. (1):

$d_{ab} = \sqrt{\sum_{k=1}^{m} \left(x_{ak} - x_{bk}\right)^{2}},$  (1)

where $x_{ak}$ and $x_{bk}$ are the $k$-th feature variables of $x_a$ and $x_b$, respectively. In the original DPCA algorithm, for each data point $x_a$ in $X_i$, the local density $\rho_a$ is calculated based on the distances $d_{ab}$ between data points. The local density $\rho_a$ of $x_a$ is defined in Eq. (2):

$\rho_a = \sum_{b \neq a} \chi(d_{ab} - d_c),$  (2)

where $d_c$ is a cutoff distance and $\chi(z) = 1$ if $z < 0$; otherwise, $\chi(z) = 0$.

Several shortcomings can be observed in Eq. (2), such as low robustness, a discrete density distribution, and inaccurate density-peak points. $\rho_a$ is equal to the number of data points whose distance from $x_a$ is smaller than $d_c$. Therefore, the cutoff distance $d_c$ greatly influences the local density $\rho_a$, leading to low robustness of the algorithm. In addition, under the calculation method of Eq. (2), the local density is an integer value, and the local density values of all data points in $X_i$ obey a discrete distribution. Based on these discrete local density values, it is difficult to detect local density-peak points and cluster centers accurately. Moreover, in cases of multi-density and non-uniform distributions, there might exist multiple inaccurate density-peak points in each cluster.

To address the limitations of Eq. (2) and effectively cluster datasets with multi-density and non-uniform distributions, we improve the DPCA algorithm by proposing an optimized calculation of the local density. A Radial Basis Function (RBF) kernel is introduced to measure the local density. An RBF kernel is a monotonic function of the Euclidean distance between any data point and a data center in a space. The Gauss Kernel Function (GKF), defined in Eq. (3), is a widely used RBF kernel:

$K(x, x_c) = \exp\left(-\frac{\|x - x_c\|^2}{2\sigma^2}\right),$  (3)

where $x_c$ is the center of the kernel function and $\sigma$ is a width parameter, which controls the radial range of the function; $\sigma$ is set according to the cutoff distance $d_c$ in this work. The calculation method of the local density is optimized based on GKF; namely, Eq. (2) is modified to Eq. (4):

$\rho_a = \sum_{b \neq a} \exp\left(-\left(\frac{d_{ab}}{d_c}\right)^{2}\right).$  (4)

In Eq. (2), the density of a data point is obtained by counting its neighbors. In contrast, in Eq. (4), the density of a data point is obtained by calculating a monotonic function of the distances of all the other data points to the current data point, which can accurately reflect the density distribution of the data points in the entire dataset. GKF maps the data sample to a higher-dimensional space, and it can cope with a nonlinear relationship between the class labels and data features. Owing to the continuous distribution of density values, it can accurately determine the densities of different data points and thereby identify the density peaks. Compared with Eq. (2), the local density based on GKF in Eq. (4) is minimally impacted by the cutoff distance, and the improved DPCA algorithm achieves higher robustness. Benefitting from the advantages of GKF, even for a dataset under a uniform density distribution, the local density peak points can be obtained accurately.

Based on the local density $\rho$, the delta distance $\delta_a$ of each point is calculated. $\delta_a$ is obtained by computing the minimum distance between $x_a$ and any data point with a higher density. The delta distance $\delta_a$ of data point $x_a$ is defined in Eq. (5):

$\delta_a = \min_{b:\, \rho_b > \rho_a} d_{ab}.$  (5)

If the data point $x_a$ has the highest density, then $\delta_a = \max_b d_{ab}$.

(2) Identify cluster centers based on $\rho$ and $\delta$.

On the basis of $\rho$ and $\delta$, disease-symptom cluster centers are identified. Data points with relatively high values of both $\rho$ and $\delta$ are designated cluster centers, while data points with small densities and high delta distances are considered outliers. A decision graph of $\rho$ and $\delta$ is drawn for disease-symptom cluster detection. An example of a decision graph for disease-symptom clustering is shown in Figure 3.

Figure 3: Decision graph for the DPCA-based disease-symptom clustering

In Figure 3, there are three decision points with high values of both local density $\rho$ and delta distance $\delta$, which are shown as red squares in the figure. These decision points are identified as candidate disease-symptom cluster centers of the current inspection data subsets. In contrast, the decision points indicated by black triangles have low values of local density and high values of delta distance. These points are identified as outliers and removed from the clustering results. The decision points shown as blue circles in the figure are the remaining data points, which are neither density peaks nor noise data. These remaining data points will be assigned to the related disease-symptom clusters in the next step.

(3) Assign the remaining data points to the nearest disease-symptom clusters.

After the disease-symptom cluster centers are detected, each of the remaining data points is assigned to the cluster to which its nearest higher-density neighbor belongs. For each remaining data point $x_a$, the set of neighbors with higher density is denoted as $H_a$. The data point $x_b \in H_a$ with the minimum distance $d_{ab}$ is found from the distance matrix $D_i$. If $x_b$ has been assigned to a cluster $dsc_j$, then $x_a$ is also assigned to $dsc_j$; otherwise, the cluster of $x_b$ is determined recursively in the same manner. This step is repeated until all the remaining data points are assigned to clusters. The DPCA-based clustering analysis process of the inspection data in numeric format is then complete, and the disease-symptom clusters are obtained. An example of the assignment of the remaining data points is shown in Figure 4. The process of the DPCA-based disease-symptom clustering analysis of the inspection data in numeric format is presented in Algorithm 1.

Figure 4: Example of the assignment of the remaining data points
Input: $X$: the inspection data in numeric format; $p$: the percentage of distances used to determine the cutoff distance; $\rho_{\min}$: threshold value of the local density for the cluster-center decision; $\delta_{\min}$: threshold value of the delta distance for the cluster-center decision.
Output: $DSC$: the disease-symptom clusters.
1:  for each inspection task $X_i$ in $X$ do
2:      calculate distance matrix $D_i$ from $X_i$;
3:      arrange the distances in $D_i$ in ascending order;
4:      $d_c \leftarrow$ the distance at the $p$-th percentile position in $D_i$;
5:     for each $x_a$ in $X_i$ do
6:         calculate local density $\rho_a$;
7:         calculate delta distance $\delta_a$;
8:        if ($\rho_a$ bigger than $\rho_{\min}$ and $\delta_a$ bigger than $\delta_{\min}$) then
9:            identify cluster center ($x_a \rightarrow DSC_i$);
10:        end if
11:     end for
12:     for each remaining $x_b$ in $X_i$ do
13:        append $x_b$ to the nearest cluster $dsc_j$;
14:     end for
15:  end for
16:  return $DSC$.
Algorithm 1 DPCA-based disease-symptom clustering analysis on the inspection data in numeric format

In DPCA, the cluster assignment process is executed in a single pass. The time complexity of Algorithm 1 is $O(n \times N^2)$, where $n$ is the number of inspection tasks and $N$ is the average number of records in the dataset of each inspection task. The space complexity of the DPCA algorithm, dominated by the distance matrices, is $O(N^2)$. Compared with iterative clustering algorithms, the computational complexity of DPCA is very low.
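The following is a minimal NumPy sketch of the clustering steps above (Eqs. (1), (4), and (5), plus the center selection and assignment); the percentile rule for the cutoff distance and the median-based default thresholds are illustrative assumptions rather than the authors' settings, and outlier removal is omitted for brevity:

```python
import numpy as np

def dpca(X, p=0.02, rho_min=None, delta_min=None):
    """Cluster one inspection-task dataset X (N x m) by density peaks."""
    N = X.shape[0]
    # Eq. (1): pairwise Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # Cutoff distance: the p-th percentile of all pairwise distances.
    d_c = np.percentile(D[np.triu_indices(N, k=1)], 100 * p)
    # Eq. (4): Gaussian-kernel local density (self-term exp(0)=1 removed).
    rho = np.exp(-(D / d_c) ** 2).sum(axis=1) - 1.0
    # Eq. (5): delta distance to the nearest higher-density point.
    delta = np.zeros(N)
    nearest_higher = np.zeros(N, dtype=int)
    order = np.argsort(-rho)              # indices by decreasing density
    delta[order[0]] = D[order[0]].max()   # densest point: max distance
    nearest_higher[order[0]] = order[0]
    for rank in range(1, N):
        a = order[rank]
        higher = order[:rank]             # all points denser than x_a
        b = higher[np.argmin(D[a, higher])]
        delta[a], nearest_higher[a] = D[a, b], b
    # Centers: high rho AND high delta (illustrative default thresholds).
    rho_min = np.median(rho) if rho_min is None else rho_min
    delta_min = np.median(delta) if delta_min is None else delta_min
    centers = np.where((rho > rho_min) & (delta > delta_min))[0]
    labels = np.full(N, -1)
    labels[centers] = np.arange(len(centers))
    if labels[order[0]] < 0:              # densest point must seed a cluster
        labels[order[0]] = len(centers)
    # Assignment: follow the nearest higher-density neighbor chain;
    # decreasing-density order guarantees the neighbor is already labeled.
    for a in order:
        if labels[a] < 0:
            labels[a] = labels[nearest_higher[a]]
    return labels, rho, delta

labels, rho, delta = dpca(np.random.rand(200, 12))  # e.g. blood features
```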

3.2.2 Clustering Analysis on Inspection Data in Text Format

In the disease diagnosis process, in addition to quantitative inspection tasks, admitting doctors record the conditions of patients’ diseases by observing and consulting about the patients’ situations. These records of disease diagnostic information are saved in text format and constitute another important basis for judging the characteristics of patients’ diseases. Considering the specific terminology and rigor of the medical domain, we introduce a text clustering analysis algorithm for inspection data in text format based on a medical ontology model in this section. First, inspection data in text format are collected and preprocessed for medical word segmentation, and a medical domain ontology library is constructed. Second, medical ontologies of the pre-processed inspection data are extracted based on the medical domain ontology library. In addition, the ontological similarity of these medical ontologies and the text similarity of these textual samples are calculated. Finally, on these bases, the DPCA algorithm is applied to these text data to obtain the corresponding disease-symptom clusters, in which the local density peak is calculated based on the ontological and textual similarities. The workflow of the clustering analysis on the inspection data in text format is shown in Figure 5.

Figure 5: Workflow of clustering analysis on the inspection data in text format

(1) Construct medical domain ontology library.

Before the ontology extraction, the inspection dataset in text format is preprocessed, and the private information of patients and doctors is filtered out. Then, the specific words of the inspection text are segmented using the Natural Language Processing and Information Retrieval (NLPIR) segmentation system [44]. In NLPIR, the text contents are segmented using the $N$-shortest-path rough segmentation method, based on a dictionary with word frequency and part-of-speech statistics, to obtain the best rough coverage results. In addition, the parts of speech of the segmented words are marked automatically after the overall optimal word segmentation results are obtained.

An ontology is a formal expression of a set of concepts and their relationships in a specific field. As a method of expressing knowledge, ontologies have been widely used in various application fields, among which biomedicine is one of the most active areas of ontology applications [21, 23]. Protégé [4] is a free and open-source ontology editor and knowledge-based framework developed by the Stanford biomedical informatics research center. Rim et al. introduced a medical domain ontology construction approach for a medical decision-support system [11]. In [21], Lu et al. proposed a medical ontology-enhanced text processing method for infectious disease informatics. In this work, we construct a medical domain ontology library for the text clustering analysis of the inspection data in text format. The medical domain ontology library is built based on data from PubMed [28], a database that provides free searching of biomedical papers and abstracts. We extracted numerous useful phrases and words from the related medical literature in the PubMed biomedical database. The noun phrases are extracted and marked as the concepts of biomedical applications, and the verb phrases are transformed into the corresponding relationships among these concepts. An example of the architecture of the medical domain ontology library in this work is shown in Figure 6.

Figure 6: Architecture of the medical domain ontology library (partial)

(2) Extract the medical ontology of the text inspection dataset.

Based on the medical domain ontology library, we extract various medical ontology objects from the preprocessed text inspection datasets. Because an ontology consists of concepts and relationships, the medical concepts and relationships are extracted separately. Let $O = (MC, MR)$ be the medical ontology set in this work, where $MC$ is the set of medical concepts and $MR$ is the set of relationships. Each medical concept of the text inspection dataset is composed of semantic elements, as defined in Eq. (6):

$c = \{e_1, e_2, \ldots, e_q\},$  (6)

where $e_k$ refers to the $k$-th semantic element in the medical domain. Medical ontology relationships consist of medical concepts and their relationships, namely, $MR = \{(c_a, r, c_b) \mid c_a, c_b \in MC\}$, where $r$ denotes the relationship between $c_a$ and $c_b$.

(3) Calculate the similarity of medical ontologies and text samples.

We introduce an ontology-based similarity measure to calculate the similarity of medical ontologies and text samples. As described above, a medical ontology is composed of medical concepts and relationships. Hence, the ontological similarity metrics include a concept similarity measure and a relationship similarity measure. For two ontology objects $O_a$ and $O_b$, the concept similarity between them is defined in Eq. (7):

$Sim_C(O_a, O_b) = \frac{\sum_{i=1}^{N_a} \sum_{j=1}^{N_b} s(c_{ai}, c_{bj})}{N_a \times N_b},$  (7)

where $N_a$ and $N_b$ are the numbers of medical concepts of $O_a$ and $O_b$, respectively. If $c_{ai} = c_{bj}$, then $s(c_{ai}, c_{bj}) = 1$; otherwise, $s(c_{ai}, c_{bj}) = 0$. The relationship similarity of ontology objects $O_a$ and $O_b$ is defined in Eq. (8):

$Sim_R(O_a, O_b) = \frac{\sum_{i=1}^{M_a} \sum_{j=1}^{M_b} s(r_{ai}, r_{bj})}{M_a \times M_b},$  (8)

where $M_a$ and $M_b$ are the numbers of medical ontology relationships of $O_a$ and $O_b$, respectively. Based on the concept and relationship similarity measures, the similarity of two ontology objects is defined in Eq. (9):

$Sim_O(O_a, O_b) = \alpha\, Sim_C(O_a, O_b) + (1 - \alpha)\, Sim_R(O_a, O_b),$  (9)

where $\alpha \in [0, 1]$ is a weight that balances the concept and relationship similarities. Each inspection record in text format is composed of a series of ontologies. Assume that there are two inspection text samples $s_a$ and $s_b$. The similarity of $s_a$ and $s_b$ is defined in Eq. (10):

$Sim(s_a, s_b) = \frac{\sum_{i=1}^{P_a} \sum_{j=1}^{P_b} Sim_O(O_{ai}, O_{bj})}{P_a \times P_b},$  (10)

where $P_a$ and $P_b$ are the numbers of medical ontologies of text samples $s_a$ and $s_b$, respectively.
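As a small illustration of Eqs. (7)-(10), the sketch below computes the concept, relationship, ontology, and text-sample similarities; the weight alpha and the toy ontology contents are assumptions for illustration, not values from the paper:

```python
def pair_similarity(items_a, items_b):
    """Average exact-match score over all cross pairs (Eqs. (7)/(8))."""
    if not items_a or not items_b:
        return 0.0
    hits = sum(1 for x in items_a for y in items_b if x == y)
    return hits / (len(items_a) * len(items_b))

def ontology_similarity(o_a, o_b, alpha=0.5):
    """Eq. (9): weighted blend of concept and relationship similarity."""
    sim_c = pair_similarity(o_a["concepts"], o_b["concepts"])
    sim_r = pair_similarity(o_a["relations"], o_b["relations"])
    return alpha * sim_c + (1 - alpha) * sim_r

def text_similarity(sample_a, sample_b, alpha=0.5):
    """Eq. (10): average ontology similarity over all ontology pairs."""
    total = sum(ontology_similarity(oa, ob, alpha)
                for oa in sample_a for ob in sample_b)
    return total / (len(sample_a) * len(sample_b))

# Toy ontologies with hypothetical contents.
o1 = {"concepts": ["hepatomegaly", "jaundice"],
      "relations": [("hepatomegaly", "indicates", "liver disease")]}
o2 = {"concepts": ["jaundice", "ascites"],
      "relations": [("hepatomegaly", "indicates", "liver disease")]}
print(text_similarity([o1], [o2]))  # one ontology per sample here
```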

(4) Perform clustering analysis of the text inspection data based on the DPCA algorithm.

Based on the medical ontological similarity, we analyze the disease symptoms from the text inspection data by using the DPCA algorithm. The main steps in the clustering analysis of the inspection data in text format are the same as those used for the inspection data in numeric format. We calculate the local density and delta distance for each data sample. Unlike for the inspection data in numeric format, the distance between a pair of data points is measured by the similarity of the documents instead of the Euclidean metric. Therefore, the local density of data sample $s_a$ based on GKF can be calculated by Eq. (11):

$\rho_a = \sum_{b \neq a} \exp\left(-\left(\frac{1 - Sim(s_a, s_b)}{1 - s_c}\right)^{2}\right),$  (11)

where $d_{ab} = 1 - Sim(s_a, s_b)$ is the similarity-based distance and $s_c$ is a cutoff similarity. Afterwards, the delta distance $\delta_a$ of each data sample is calculated with Eq. (12):

$\delta_a = \min_{b:\, \rho_b > \rho_a} \left(1 - Sim(s_a, s_b)\right).$  (12)

If data sample $s_a$ has the highest density, then $\delta_a = \max_b \left(1 - Sim(s_a, s_b)\right)$. We identify cluster centers based on $\rho$ and $\delta$. Data points with relatively high values of both $\rho$ and $\delta$ are designated cluster centers, while data points with a small value of $\rho$ and a high value of $\delta$ are considered outliers. Afterwards, the remaining data samples are assigned to the disease-symptom clusters to which their nearest higher-density neighbors belong. The clustering analysis process of the inspection data in text format, based on the medical domain ontology and the DPCA algorithm, is thereby completed, and the disease-symptom clusters are obtained.
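Continuing the sketches above, the text variant changes only the distance definition; the small illustration below (assuming the text_similarity helper from the previous sketch and a list of ontology-annotated samples) builds the similarity-based distance matrix that would replace the Euclidean matrix in the density and delta computations:

```python
import numpy as np

def text_distance_matrix(samples, alpha=0.5):
    """Distance d_ab = 1 - Sim(s_a, s_b) over ontology-annotated samples."""
    n = len(samples)
    D = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            d = 1.0 - text_similarity(samples[a], samples[b], alpha)
            D[a, b] = D[b, a] = d
    return D
```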

3.3 Disease Diagnosis and Treatment Recommendation

Based on the disease-symptom clustering, the multifarious human intelligence of medical doctors is accumulated in the form of valuable diagnosis and treatment knowledge. We perform association analysis of the D-D and D-T rules. Moreover, appropriate disease treatment plans are recommended to medical doctors and patients via DDTRS.

3.3.1 Association Analysis of Disease Diagnosis and Treatment Schemes

After preprocessing the disease diagnosis- and treatment-scheme data, the Apriori algorithm is introduced for the association analysis of the D-D rules and the D-T rules, respectively. The Apriori algorithm is used to detect the association relationships between the disease-symptom clusters and the disease diagnosis and treatment schemes that have been utilized successfully. The workflow of the disease diagnosis- and treatment-scheme association analysis is illustrated in Figure 7. Owing to the similar principles of these rules, we take the D-T rules as an example to elaborate the association analysis process. The detailed steps of the Apriori-based disease treatment-scheme association analysis are described as follows.

Figure 7: Workflow of disease diagnosis- and treatment-scheme association analysis

In general, a patient requires multiple visits during the treatment process of a disease. The disease diagnosis and treatment data collected during each patient visit are considered an association record in this work. Each association record contains the inspection reports, diagnosis schemes, and corresponding treatment schemes. These records are collected and grouped by disease-symptom cluster, and the grouped records are utilized as the data source of the association analysis.

(1) Set the minimum $Support$ and the minimum $Confidence$.

To obtain the association rules between each disease and its treatment schemes, the treatment schemes of the same disease-symptom cluster are analyzed separately. The D-T rules of disease-symptom cluster $c_i$ are defined as $R = \{(c_i \Rightarrow t_j)\}$, where $c_i \in C$, $t_j \in T$, and $c_i \cap t_j = \emptyset$.

Definition 1: (Support). The support of an association rule $(c_i \Rightarrow t_j)$ in the association rule set refers to the ratio of the number of treatment records that contain both $c_i$ and $t_j$ to the total number of treatment records for disease-symptom cluster $c_i$.

The $Support$ of association rule $(c_i \Rightarrow t_j)$ is denoted as $Support(c_i \Rightarrow t_j)$, as defined in Eq. (13):

$Support(c_i \Rightarrow t_j) = \frac{N(c_i \cup t_j)}{N_{c_i}},$  (13)

where $N(c_i \cup t_j)$ is the number of treatment records containing both $c_i$ and $t_j$, and $N_{c_i}$ is the total number of treatment records for cluster $c_i$.

Definition 2: (Confidence). The confidence of an association rule $(c_i \Rightarrow t_j)$ in the association rule set refers to the ratio of the number of treatment records that contain both $c_i$ and $t_j$ to the number of records that contain $c_i$.

The $Confidence$ of association rule $(c_i \Rightarrow t_j)$ is denoted as $Confidence(c_i \Rightarrow t_j)$, as defined in Eq. (14):

$Confidence(c_i \Rightarrow t_j) = \frac{N(c_i \cup t_j)}{N(c_i)},$  (14)

where $N(c_i)$ is the number of treatment records containing $c_i$.

Definition 3: (Frequent Itemset). If the support and confidence of an association rule in $R$ are greater than the minimum support and the minimum confidence, respectively, the rule is defined as a frequent item $l_k$.

For a dataset of treatment schemes, the aim of association rule mining is to extract the D-T rules that satisfy the minimum $Support$ and the minimum $Confidence$ at the same time. According to the results of the disease-symptom clustering analysis and the treatment schemes, the minimum support is set as $s_{\min}$ and the minimum confidence is defined as $c_{\min}$.

(2) Generate Frequent Itemsets.

The association rule itemset $L = \{l_1, l_2, \ldots\}$ is generated as the frequent itemsets of the D-T rules, where $Support(l_k) \geq s_{\min}$ and $Confidence(l_k) \geq c_{\min}$ for each $l_k \in L$.

(3) Extract strong association rules from frequent itemsets.

After obtaining the maximum frequent itemsets of the treatment-scheme data, candidate D-T rules are extracted from these frequent itemsets. These D-T rules are utilized in the disease treatment recommendation. The Apriori-based association analysis of the disease diagnosis and treatment schemes is presented in Algorithm 2.

Input: $C$: the dataset of disease-symptom clusters; $T$: the dataset of disease treatment-scheme records; $s_{\min}$: the predefined value for the minimum $Support$; $c_{\min}$: the predefined value for the minimum $Confidence$.
Output: $L$: the frequent itemsets of the association rules.
1:  create frequent itemset $L \leftarrow \emptyset$;
2:  for each $c_i$ in $C$ do
3:     for each $t_j$ in $T$ do
4:         get treatment association count $N(c_i \cup t_j)$;
5:         get treatment association count $N(c_i)$;
6:         $Support(c_i \Rightarrow t_j) \leftarrow N(c_i \cup t_j) / N_{c_i}$;
7:        if $Support(c_i \Rightarrow t_j) \geq s_{\min}$ then
8:           $Confidence(c_i \Rightarrow t_j) \leftarrow N(c_i \cup t_j) / N(c_i)$;
9:           if $Confidence(c_i \Rightarrow t_j) \geq c_{\min}$ then
10:               append frequent itemset $L \leftarrow L \cup \{(c_i \Rightarrow t_j)\}$;
11:           end if
12:        end if
13:     end for
14:  end for
15:  return $L$.
Algorithm 2 Apriori-based association analysis of disease diagnosis and treatment schemes

Similar to the case of the D-T rules, the association analysis of the D-D rules is carried out using the Apriori algorithm. Thus, strong association rules among the disease-symptom clusters, diagnosis schemes, and treatment schemes are extracted.
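For concreteness, here is a minimal sketch of one-to-one D-T rule mining in the spirit of Algorithm 2 and Eqs. (13)-(14); the (cluster, treatment) record layout and the threshold values are illustrative assumptions, and support is computed over the full record set rather than per cluster, a common simplification that may differ slightly from the paper's normalization:

```python
from collections import Counter

def mine_dt_rules(records, s_min=0.05, c_min=0.6):
    """records: list of (cluster_id, treatment_id) visit pairs.
    Returns strong (cluster => treatment) rules with their metrics."""
    n_total = len(records)
    pair_counts = Counter(records)                     # N(c_i and t_j)
    cluster_counts = Counter(c for c, _ in records)    # N(c_i)
    rules = []
    for (c, t), n_ct in pair_counts.items():
        support = n_ct / n_total                       # cf. Eq. (13)
        if support >= s_min:
            confidence = n_ct / cluster_counts[c]      # cf. Eq. (14)
            if confidence >= c_min:
                rules.append((c, t, support, confidence))
    return rules

# Hypothetical visit records: (disease-symptom cluster, treatment scheme).
visits = [("liver_cancer", "TACE"), ("liver_cancer", "TACE"),
          ("liver_cancer", "resection"), ("hepatitis_b", "antiviral")]
print(mine_dt_rules(visits, s_min=0.25, c_min=0.5))
```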

3.3.2 Recommendation Interfaces

On the basis of the disease diagnosis- and treatment-scheme association analysis, we implement a Client/Server (C/S) application for medical doctors and a mobile application for patients to provide treatment recommendations. Through the applications’ interactive interfaces, medical doctors and patients can access the inspection reports and the corresponding treatment recommendations at different treatment stages. They can also update the inspection results in the system to obtain new recommendations; DDTRS adjusts its output according to the input changes.

(1) Interactive recommendation interfaces for doctors.

We provide interactive recommendation interfaces of DDTRS to medical doctors in hospitals. An example of one such interface is shown in Figure 8. To facilitate use by local patients, the application is designed in Chinese; however, it can be easily customized to other languages. For the readers’ convenience, we illustrate the English version of the application here. From the interactive interface, medical doctors can review the inspection reports of their patients and submit the inspection data to the disease-symptom clustering module. Based on the results of the disease-symptom clustering, the diagnosis scheme and treatment plans for each disease are recommended to the medical doctors.

Figure 8: Recommendation interface of DDTRS for medical doctors

In Figure 8, the inspection report shows that abnormal results were obtained for multiple inspection items, such as CEA, AFP, and HCG-β. Among them, CEA reaches 5.56, slightly above the normal reference range of 0.00 - 5.00; AFP reaches 121000, far above the normal reference range of 0.00 - 25.00; and HCG-β reaches 4.17, above the normal reference range of 0.00 - 3.00. After clustering the submitted inspection report, a disease-symptom cluster is obtained from the similar historical inspection data. The corresponding items of the cluster center are as follows: CEA = 6.62, AFP = 113052, and HCG-β = 5.28. The cluster is termed Liver Cancer, and it is described in the form of mean symptom values. Based on the disease-symptom cluster Liver Cancer, related diagnosis schemes and treatment plans are recommended, as shown in the right region of Figure 8. A further case analysis of liver cancer is presented in Section 5.3.1.

From a medical point of view, inappropriate treatment recommendations might lead to misdiagnosis and endanger the patient’s health. Therefore, the correctness and effectiveness of the recommendations are the key issues for this kind of application. The quality of a treatment scheme for a patient can be evaluated by tracking the changes of the item values in the patient’s inspection reports. However, due to different pathologies and diverse patient health conditions, it is difficult to implement a common evaluation standard for the quality of each treatment scheme. Doctors might provide feedback if incorrect recommendations are made, so a feedback collection and analysis module should be utilized to improve the effectiveness of the application. Evaluation is performed based on the quality indicators of the treatment recommendations, which are fed back by medical doctors.

When a doctor receives a recommended treatment scheme, the doctor can adopt and apply it completely or in part to the patient. After tracking and comparing the changes in the inspection reports and the patient’s health condition, the doctor determines whether the recommended treatment scheme is effective for the patient’s disease. Afterwards, via the application interface, the doctor submits the quality indicators of the treatment scheme, such as effectiveness, chronergy, absence of harmful side effects, economy, and patient satisfaction. The value of each indicator is in the range of 1 to 5. A detailed description of the quality-evaluation method for treatment schemes is presented in Section 5.2.3.

(2) Interactive recommendation interfaces for patients.

A mobile application of DDTRS is provided to patients with interactive recommendation interfaces. After each visit, when a patient completes the inspection tasks under his medical doctor’s preliminary advice, he can obtain the inspection reports through the mobile interfaces. Examples of interactive mobile recommendation interfaces for patients are shown in Figure 9.

Figure 9: Mobile recommendation interfaces of DDTRS for patients: (a) inspection report; (b) disease-symptom clustering; (c) treatment recommendation; (d) interactive consulting

From the recommendation interface, patients can understand and obtain the details of their health conditions via the disease-symptom clustering function. Moreover, they can further access the corresponding diagnosis and treatment schemes through the interface. However, there are two cases in which the recommendations might have negative influences on patients. First, some patients might feel sensitive and anxious when they receive their disease treatment recommendations; that is, the recommendations could have adverse psychological effects on them. To avoid this problem, before the recommendation messages are sent to a patient, an option is provided on the doctor’s interface that allows the doctor to decide whether the message should be accessible to the patient. Second, some patients might have difficulty in understanding the meanings of the recommendations and their health conditions because they lack the doctors’ medical knowledge. In this case, they can further consult their doctors or access references from websites. In consideration of patients’ lack of medical knowledge and incomplete feedback, evaluations from patients are not considered in this work.

4 Parallel Solution of DDTRS

The performance of disease diagnosis and treatment recommendation is the focus of this section. The efficiency of the disease-symptom clustering and the latency response of the treatment recommendations are the critical issues of the proposed DDTRS. After submitting the symptom data of a patient’s disease, one expects to receive an appropriate and timely treatment recommendation. It would be convenient, useful, and preferable if the patients could receive appropriate diagnosis and treatment plans through an interactive mobile application in real time. Therefore, to reach the goals of high performance and low latency response, we parallelize DDTRS on the Apache Spark cloud computing platform. The clustering process of disease symptoms and the association analysis process of the disease treatment scheme are parallelized separately.

4.1 Parallel Clustering Process of Disease Symptoms

In the parallel solution of the disease-symptom clustering module of DDTRS, large-scale historical inspection datasets are gathered from the cooperating hospitals of this work at a fixed time interval. DDTRS is deployed on the Spark cloud platform at the National Supercomputing Center in Changsha (NSCC), China. Afterwards, these datasets are stored in the Hadoop Distributed File System (HDFS) on the Spark cloud platform. The parallel clustering process of disease symptoms is implemented on the Spark computing cluster with the RDD programming model. Apache Spark is a popular parallel data processing platform that is especially suitable for big data mining and machine learning. In contrast to Hadoop, Spark supports the RDD and DAG parallel programming models, which are built on an in-memory computing framework. These models reduce the volume of data transmission operations in the distributed environment without reducing the algorithm’s accuracy.

4.1.1 RDD Dependence for Large-scale Inspection Data

Before the parallel clustering process, massive volumes of inspection data are loaded into the Tachyon system of the Spark platform as RDD objects. The RDD model is the core programming model of Spark and represents a collection of distributed items; RDD objects are manipulated in parallel across many computing nodes. As discussed above, because the numbers of inspection items differ, the dimensions of the datasets for different inspection tasks are different. Therefore, the clustering process of each inspection task is executed in parallel separately. An RDD object is created for the inspection data subset of each inspection task. Because the clustering of each inspection task is unconstrained by the others, the RDD objects of different inspection tasks are mutually independent. Afterwards, each of the RDD objects is allocated to one or multiple adjacent computing nodes of the Spark platform.

In Spark, each RDD object offers two types of operations: (a) transformations and (b) actions. Transformations are operations on RDD objects that return a new RDD object, such as the map and groupByKey functions. Actions are operations that compute a result based on an RDD object and return it to the driver program or save it to an external storage system (e.g., HDFS). At the transformation stage, RDD objects are created to store the data subsets of the inspection tasks, which are split from the original RDD object by grouping on the name of the inspection task. In Spark, these objects are computed in a lazy fashion; namely, they are materialized only when an action is invoked. At the action stage, the defined RDD objects are created in the Tachyon system.

In the subsequent process of the disease-symptom clustering analysis, each $RDD_{X_i}$ is calculated, and multiple new RDD objects are generated from it. Data dependencies occur among the RDD objects generated for the same inspection task in the clustering process. The RDD dependencies of the disease-symptom clustering analysis are presented in Figure 10.

Figure 10: RDD dependencies of disease-symptom clustering analysis

In Spark, there are two types of RDD dependencies: (a) narrow dependencies and (b) wide dependencies. In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD. In contrast, in a wide dependency, multiple child partitions might depend on one partition of the parent RDD. As is evident in Figure 10, there are various RDD dependency relationships in the clustering process. Obviously, a narrow dependency exists between $RDD_X$ and each inspection data subset $RDD_{X_i}$. Afterwards, for each $RDD_{X_i}$, an object $RDD_{D_i}$ is created to save the distance matrix; similarly, a narrow dependency occurs between $RDD_{X_i}$ and $RDD_{D_i}$. In the subsequent process of the disease-symptom clustering, objects $RDD_{\rho}$ and $RDD_{\delta}$ are created to save the values of the local density and the delta distance separately. Owing to the cross-partition calculation, wide dependencies exist between $RDD_{D_i}$ and $RDD_{\rho}$ and between $RDD_{D_i}$ and $RDD_{\delta}$, respectively. Finally, an object $RDD_{DSC}$ is created to save the disease-symptom clusters, which is calculated based on $RDD_{X_i}$, $RDD_{\rho}$, and $RDD_{\delta}$. Therefore, wide dependencies occur between $RDD_{DSC}$ and each of $RDD_{X_i}$, $RDD_{\rho}$, and $RDD_{\delta}$.
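The following minimal PySpark fragment (the path and the "::"-delimited record layout are hypothetical) illustrates both dependency types in this pipeline: map creates a narrow dependency, groupByKey shuffles data and creates a wide one, and nothing executes until the collect action triggers the DAG:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-dependency-sketch")

# Hypothetical input: "taskname::v1,v2,..." per line.
rdd_x = sc.textFile("hdfs:///medical/inspection/all_tasks.txt")

# Narrow dependency: each output partition depends on one input partition.
parsed = rdd_x.map(lambda line: (line.split("::")[0],
                                 [float(v) for v in line.split("::")[1].split(",")]))

# Wide dependency: grouping by inspection task forces a shuffle.
by_task = parsed.groupByKey()

# The transformations above are lazy; this action runs the whole DAG.
task_sizes = by_task.mapValues(lambda recs: len(list(recs))).collect()
```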

4.1.2 Parallel Clustering Process Based on the DAG Model

According to the RDD dependency, the master computing node of the Spark cluster constructs a task-scheduling DAG for the disease-symptom clustering process. The parallel clustering process based on the Apache Spark platform is presented in Algorithm 3. The detailed steps of the process are described as follows.

Input: $path$: the path of the inspection datasets stored on HDFS; $p$: the percentage of distances used to determine the cutoff distance; $\rho_{\min}$: threshold value of the local density for the cluster-center decision; $\delta_{\min}$: threshold value of the delta distance for the cluster-center decision.
Output: $RDD_{DSC}$: RDD objects of the disease-symptom clusters.
1:   $conf \leftarrow$ new SparkConf(“DPCA”, “SparkMaster”);
2:   $sc \leftarrow$ new SparkContext($conf$);
3:  inspection datasets $RDD_X \leftarrow sc$.textFile($path$);
4:   $RDD_X$.map { $x \rightarrow$
5:    $record \leftarrow x$.split(“::”);
6:   return inspection record ($taskname$, $record$);
7:  } end map;
8:  .groupByKey().foreach { $RDD_{X_i} \rightarrow$
9:   calculate distance matrix $D_i$ from $RDD_{X_i}$;
10:    arrange the distances in $D_i$ in ascending order;
11:    $d_c \leftarrow$ the distance at the $p$-th percentile position in $D_i$;
12:   ($\rho$, $\delta$) $\leftarrow RDD_{X_i}$.map { $x_a \rightarrow$
13:     calculate local density $\rho_a$;
14:     calculate delta distance $\delta_a$;
15:    return ($\rho_a$, $\delta_a$);
16:   } end map.reduce().collect();
17:    $centers \leftarrow$ identify cluster centers().map { $x_a \rightarrow$
18:    if ($\rho_a$ bigger than $\rho_{\min}$ and $\delta_a$ bigger than $\delta_{\min}$) then
19:      identify cluster center ($x_a \rightarrow DSC_i$);
20:     return $DSC_i$;
21:    end if
22:   } end map
23:    $RDD_{DSC_i} \leftarrow$ .flatMap { $x_b \rightarrow$
24:    append $x_b$ to the nearest cluster $dsc_j$;
25:   } end flatMap.reduceByKey();
26:  } end foreach.collect();
27:  return $RDD_{DSC}$.
Algorithm 3 Parallel process of the DPCA-based disease-symptom clustering

(1) The execution environment of the Spark platform is configured, including the name of the program and the address of the Spark master. Then, massive volumes of historical inspection datasets are loaded from HDFS into the Spark Tachyon memory system as an RDD object (termed $RDD_X$). In the groupByKey function, the records of $RDD_X$ are divided into a series of RDD objects (each termed $RDD_{X_i}$) by inspection task name.

(2) In the parallel foreach function, the disease-symptom clustering processes of the inspection subsets are executed simultaneously. For each subset, an RDD object is created to save its distance matrix. The local density and delta distance of each record are then measured in parallel in the map function, and the corresponding density and delta-distance objects are obtained in the reduce and collect functions.

(3) Records with high values of both local density and delta distance are identified as the centers of the disease-symptom clusters. Afterwards, each of the remaining inspection records is assigned to the nearest cluster in the parallel flatMap function. Owing to the independence of the records during this assignment step, it is executed in parallel. Moreover, there is no data-transmission overhead among the computing nodes where the dataset is located. Finally, an RDD object holding the detected disease-symptom clusters is obtained.
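A compact PySpark sketch of steps (1)-(3) is given below. It clusters each inspection task's records sequentially inside mapValues rather than parallelizing the density computation itself, and the record layout, thresholds, and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "DPCA-sketch")

# Hypothetical record layout: "taskName::AFP::CEA" per line.
lines = sc.parallelize([
    "tumor::1.06::10.70", "tumor::2.32::13.50", "tumor::4.55::20.45",
    "tumor::4.68::18.23", "tumor::5.01::19.94", "tumor::5.11::18.14",
])

def parse(line):
    task, afp, cea = line.split("::")
    return task, (float(afp), float(cea))

def dpca(records, percent=0.02, rho_min=1.0, delta_min=2.0):
    """Cluster one inspection task's records; thresholds are illustrative."""
    x = np.array(list(records))
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)  # distance matrix
    tri = np.sort(d[np.triu_indices(len(x), k=1)])             # ordered distances
    dc = tri[max(0, int(percent * len(tri)) - 1)]              # cutoff distance
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0             # GKF local density
    delta = np.array([d[i][rho > rho[i]].min() if (rho > rho[i]).any()
                      else d[i].max() for i in range(len(x))]) # delta distance
    centers = np.where((rho > rho_min) & (delta > delta_min))[0]
    if len(centers) == 0:                      # fallback: a single cluster
        centers = np.array([int(rho.argmax())])
    assign = [int(centers[d[i][centers].argmin()]) for i in range(len(x))]
    return list(zip(assign, x.tolist()))

clusters = (lines.map(parse)       # (taskName, (AFP, CEA)) pairs
                 .groupByKey()     # one group per inspection task
                 .mapValues(dpca)  # tasks are clustered in parallel
                 .collect())
print(clusters)
```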

4.2 Parallel Association Analysis of D-T Rules

To increase the speed of the disease treatment-scheme association analysis and achieve a low-latency response for the treatment recommendation, we parallelize the association analysis process on the Apache Spark cloud computing platform.

4.2.1 RDD Dependence for Large-scale Treatment Data

Analogous to the inspection data, large-scale historical treatment-scheme datasets are gathered from the cooperating hospitals and stored on HDFS at a fixed time interval. Before the association analysis, the treatment-scheme datasets are loaded into the Spark Tachyon memory system as an RDD object. At the same time, the disease-symptom clustering results obtained from Algorithm 3 are loaded into the Tachyon system as another RDD object. In the subsequent disease-treatment association analysis, the records of these two objects are processed, and multiple new RDD objects are generated according to the data dependencies between them. The RDD dependencies of the treatment-scheme association analysis are presented in Figure 11.

Figure 11: RDD dependencies of the treatment-scheme association analysis

As shown in Figure 11, various RDD dependency relationships arise in the association analysis process. The RDD object of the support values is created from the treatment schemes and the disease-symptom clusters. Because each partition of the cluster object is used by at most one partition of the support object, a narrow dependency exists between them. In contrast, since multiple partitions of the child support object depend on one partition of the treatment-scheme object, a wide dependency exists between these two. In the subsequent step of the association analysis, an RDD object is created from the support object to hold the confidence values, with a narrow dependency between them. Finally, an RDD object is created to save the frequent items, based on a wide dependency on the confidence object.

4.2.2 Parallel Association Analysis Process of Disease-Treatment Rules

Similar to the parallel clustering process, the execution environment is first configured for the parallel association analysis. According to the RDD dependencies in Figure 11, the association analysis of the treatment schemes corresponding to each disease-symptom cluster is executed in parallel in the foreach function, and the association rules between each cluster and its related treatment schemes are extracted. In the parallel flatMap function, the support value of each association rule is measured by the Apriori algorithm. If the support of an association rule is larger than the minimum support, the confidence value of the rule is derived. If the confidence is larger than the minimum confidence, the current association rule is marked as a strong rule. In the reduce and collect functions, the strong rules of the disease-symptom clusters are appended to the frequent itemset simultaneously. Hence, strong association rules of disease treatment schemes are produced. The parallel process of the Apriori-based disease treatment-scheme association analysis is presented in Algorithm 4.

0: Input: RDD_CC: the disease-symptom clusters obtained from Algorithm 3; T_path: the path of the treatment-scheme records stored on HDFS; s_min: the preset value of the minimum support; c_min: the preset value of the minimum confidence.
0: Output: FIS: the frequent itemsets of the association rules.
1:  conf ← new SparkConf("Apriori", "SparkMaster");
2:  sc ← new SparkContext(conf);
3:  RDD_T ← sc.textFile(T_path);
4:  create frequent itemset FIS;
5:  sc.parallelize(RDD_CC).foreach
6:    get disease-symptom cluster c_i;
7:    RDD_T.flatMap
8:      get treatment scheme t_j;
9:      get treatment association rule (c_i ⇒ t_j);
10:     get treatment association rule (t_j ⇒ c_i);
11:     calculate the support value s of each rule;
12:     if s > s_min then
13:       calculate confidence c;
14:       if c > c_min then
15:         append the rule to frequent itemset FIS;
16:       end if
17:     end if
18:   end flatMap.reduce();
19: end foreach.collect().groupBy();
20: return FIS.
Algorithm 4 Parallel process of the Apriori-based disease treatment-scheme association analysis
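As a toy PySpark illustration of the support/confidence filtering at the heart of Algorithm 4 (covering only single-treatment rules of the form cluster ⇒ treatment; the data and thresholds are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "apriori-sketch")

# Hypothetical (cluster, treatment) records gathered from historical data.
records = sc.parallelize([("c1", "t1"), ("c1", "t1"), ("c1", "t2"),
                          ("c2", "t3"), ("c2", "t3"), ("c1", "t1")])

MIN_SUP, MIN_CONF = 0.3, 0.6
n = records.count()

# Count each (cluster, treatment) pair and each cluster on its own.
pair_cnt = records.map(lambda ct: (ct, 1)).reduceByKey(lambda a, b: a + b)
clus_cnt = dict(records.map(lambda ct: (ct[0], 1))
                       .reduceByKey(lambda a, b: a + b).collect())

def strong(kv):
    (c, t), cnt = kv
    sup = cnt / n              # support of {cluster, treatment}
    conf = cnt / clus_cnt[c]   # confidence of cluster => treatment
    return sup >= MIN_SUP and conf >= MIN_CONF

print(pair_cnt.filter(strong).collect())   # strong disease-treatment rules
```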

The computational complexity of Algorithm 4 is determined by the number of disease-symptom clusters, the average number of treatment records per cluster, and the number of computing nodes: the clusters are analyzed independently, and the per-cluster work is divided across the nodes. Benefiting from the distributed data allocation and the parallel computing mechanism, the time complexity of the Apriori-based disease treatment-scheme association analysis is markedly reduced.

5 Experiments and Applications

We evaluate the proposed DDTRS in terms of clustering accuracy, recommendation quality, and performance. Section 5.1 presents the experimental settings. Clustering accuracy and recommendation quality evaluations are presented in Section 5.2. Section 5.3 analyzes two cases of disease treatment recommendations. A performance assessment of DDTRS is provided in Section 5.4.

5.1 Experimental Setup

DDTRS is developed in a client/server software model, including a data-collection terminal (client) and a data-analysis terminal (server). The data-collection terminal is deployed in our cooperating hospitals. Using the data-collection terminal, massive volumes of historical and current medical datasets are gathered, and the results of clustering analysis and treatment recommendations are fed back to the medical doctors in these hospitals. The data-analysis terminal is deployed at the National Supercomputing Center in Changsha (NSCC) for the disease-symptom clustering, association analysis, and treatment recommendation.

The experimental environment of DDTRS is an Apache Spark cloud platform at the NSCC comprising 30 computing nodes. Each node has eight Intel Xeon Nehalem EX CPUs (8 cores, 2.27 GHz) and 32 GB of memory. The nodes are connected by a high-speed Gigabit network, and each is configured with Ubuntu 15.10 and Apache Spark 1.6.0.

5.2 Accuracy Evaluation

5.2.1 Accuracy of Disease-Symptom Clustering

The accuracy of disease-symptom clustering is a crucial issue in the medical sciences. Inaccurate clustering results might lead to incorrect diagnoses and inappropriate treatment recommendations, endangering patients' health. Considering that predicting a disease from a given inspection result is a standard classification problem, two typical classification algorithms (C4.5 and Random Forest (RF)) and a typical clustering algorithm (K-Means) are introduced as comparison algorithms in this experiment. The accuracy of the proposed DPCA-based disease-symptom clustering algorithm is assessed by comparing its results with those of the C4.5, RF, and K-Means algorithms.

Although a clustering algorithm is a kind of unsupervised learning algorithm, for which it is not necessary to label the samples in advance, we pre-defined the class labels of all samples in this experiment to enable comparison with the classification algorithms. Cluster Accuracy (CA) is introduced to evaluate both the clustering and the classification algorithms [13]. CA measures the ratio of the number of correctly classified/clustered instances to the number of pre-defined class labels. Let $X$ be the inspection dataset in this experiment, let $C = \{C_1, \ldots, C_K\}$ be the set of classes/clusters detected by the corresponding algorithm, and let $L$ be the set of pre-defined class labels. CA is defined in Eq. (15):

$$CA = \frac{1}{|X|} \sum_{i=1}^{K} a_i, \qquad (15)$$

where $C_i$ is the set of data points in the $i$-th class/cluster, $L_i$ is the set of pre-defined class labels of the data points in $C_i$, $|X|$ is the size of $X$, and $a_i$ is the number of data points in $C_i$ that carry the majority label of $C_i$. The greater the value of $CA$, the higher the accuracy of the classification/clustering algorithm and the greater the purity that each cluster achieves. The comparison results of disease-symptom clustering based on the different algorithms are illustrated in Figure 12.
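Before examining those results, Eq. (15) can be made concrete with a short Python helper (the cluster assignments and labels below are hypothetical toy data):

```python
from collections import Counter

def cluster_accuracy(clusters, labels):
    """CA per Eq. (15): the fraction of points carrying the majority
    label of their assigned class/cluster."""
    by_cluster = {}
    for idx, cid in enumerate(clusters):
        by_cluster.setdefault(cid, []).append(labels[idx])
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in by_cluster.values())
    return majority / len(labels)

# Toy example: two clusters, one mislabeled point -> CA = 5/6.
print(cluster_accuracy([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"]))
```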

Figure 12: Accuracy evaluation of different algorithms for disease-symptom clustering

As is evident from Figure 12, for diseases that have few treatment stages or pathogeneses, such as influenza and Diabetes Mellitus (DM), the traditional classification algorithms obtain higher accuracy than the clustering algorithms. For example, in the case of influenza, the accuracies of C4.5 and RF are 85.32% and 92.82%, while those of K-Means and DPCA are 70.92% and 84.73%, respectively. In the case of DM, the accuracies of C4.5 and RF are 86.48% and 90.23%, while those of K-Means and DPCA are 74.38% and 86.53%, respectively. This is because such diseases have fewer treatment stages and fewer changes in their symptoms. In contrast, for diseases with the characteristics of multiple treatment stages, various symptoms, or multi-pathogenesis, the clustering algorithms achieve higher accuracy than the traditional classification algorithms. For example, in the case of anemia, the accuracy of DPCA is higher than those of the comparison algorithms, peaking at 88.35%, while those of K-Means, RF, and C4.5 are 72.36%, 66.21%, and 57.92%, respectively. The experimental results demonstrate that our DPCA-based disease-symptom clustering algorithm achieves stable and high accuracy.

5.2.2 Robustness of the DPCA Algorithm

To intuitively describe the DPCA clustering process of disease symptoms, we take a dataset of the tumor-marker inspection task as an example. The disease-symptom clustering analysis of this dataset is expected to identify whether the patients corresponding to the data have liver cancer. Liver cancer is a malignant tumor that occurs in the liver, including primary liver cancer and metastatic liver cancer. The tumor-marker inspection task contains more than 10 inspection items, such as alpha-fetoprotein (AFP), carcinoembryonic antigen (CEA), the carbohydrate antigens CA724, CA125, CA153, and CA199, neuron-specific enolase (NSE), globulin (GLO), human chorionic gonadotropin-β (HCG-β), and gamma-glutamyl transpeptidase (GGT). Among them, AFP and CEA are the most crucial items for identifying liver cancer. AFP measurement is one of the most specific methods for the diagnosis of hepatocellular carcinoma. CEA is a tumor index of digestive-tract cancers, which frequently metastasize to the liver. When AFP is more than 25 ng/ml and CEA is more than 5 ng/ml, a tumor is often suggested, as is common in liver cancer, colorectal cancer, and breast cancer.

To facilitate expression and understanding, a dataset of the tumor-marker inspection task containing 40 samples is collected, and two inspection items (AFP and CEA) are selected to construct a two-dimensional data view. Afterwards, we analyze the calculation processes, such as the distance calculation and the density-peak calculation. The dataset of the tumor-marker inspection task is shown in Table 4.

No. AFP CEA No. AFP CEA No. AFP CEA No. AFP CEA
1 1.06 10.70 11 6.24 19.42 21 4.75 26.85 31 6.34 27.11
2 2.32 13.50 12 7.38 10.35 22 4.91 27.98 32 6.34 25.32
3 4.55 20.45 13 8.00 8.10 23 4.82 25.48 33 6.03 23.37
4 4.68 18.23 14 3.57 27.73 24 4.24 23.32 34 6.65 27.37
5 5.01 19.94 15 3.78 25.06 25 4.70 22.65 35 6.60 25.83
6 5.11 18.14 16 4.08 25.93 26 5.30 27.62 36 6.60 24.29
7 5.32 20.96 17 4.50 27.26 27 5.11 26.09 37 6.65 21.73
8 5.37 19.42 18 4.54 26.03 28 5.21 24.04 38 6.96 25.06
9 5.62 17.89 19 4.70 22.65 29 5.73 28.65 39 7.11 26.60
10 5.73 20.19 20 4.65 28.14 30 5.88 26.09 40 7.47 22.75
Table 4: Dataset of the tumor-marker inspection task (partial)

To compare the effects of the different local-density calculation methods on the robustness of the algorithm, we set different values of the cutoff distance $d_c$ in the experiment. For the 40 data points of the dataset in Table 4, there are 780 distance values between data points, which are calculated by Eq. (1). These distances are arranged in ascending order, and the cutoff distance is set to the distance values at the 1.0%, 2.0%, 3.0%, and 4.0% positions of the ordered distances; namely, $d_c$ is set to 0.57, 0.76, 0.85, and 0.88, respectively. The influence of the cutoff distance on the local density is evaluated by analyzing the variation of the data density under the different cutoff distances. We calculate the local density $\rho$ of the dataset by Eq. (2) and by Eq. (4) for each cutoff distance, and the corresponding delta distances $\delta$ are subsequently calculated from the related local densities. The comparison of the two local-density calculation methods is presented in Figure 13.
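The two estimators can be contrasted directly in a few lines of Python. The sketch below assumes Eq. (2) is the hard-cutoff neighbor count and Eq. (4) is the Gaussian kernel function (GKF), as the surrounding discussion indicates, and uses a toy distance matrix:

```python
import numpy as np

def local_density_cutoff(d, dc):
    """Eq. (2)-style hard cutoff: rho_i counts neighbors closer than dc."""
    return (d < dc).sum(axis=1) - 1            # subtract the point itself

def local_density_gaussian(d, dc):
    """Eq. (4)-style Gaussian kernel: a smooth, more robust estimate."""
    return np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0

# Toy symmetric distance matrix for three points; dc chosen arbitrarily.
d = np.array([[0.0, 0.5, 2.0],
              [0.5, 0.0, 1.8],
              [2.0, 1.8, 0.0]])
print(local_density_cutoff(d, dc=1.0), local_density_gaussian(d, dc=1.0))
```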

(a) Eq. (2) ($d_c = 0.57$)
(b) Eq. (2) ($d_c = 0.76$)
(c) Eq. (2) ($d_c = 0.85$)
(d) Eq. (2) ($d_c = 0.88$)
(e) Eq. (4) ($d_c = 0.57$)
(f) Eq. (4) ($d_c = 0.76$)
(g) Eq. (4) ($d_c = 0.85$)
(h) Eq. (4) ($d_c = 0.88$)
Figure 13: Comparison of decision graphs obtained by Eq. (2) and Eq. (4)

It can be observed from Figure 13 (a) to (d) that, using the calculation method of Eq. (2), the value of the cutoff distance has a great influence on the local density of the data points. When $d_c$ is equal to 0.57 (1.0%), the densities of most data points are 1.0, while the delta distances range from 0.0 to 10.0. There is only one point with both high density (2.0) and high delta distance (20.1), which is the sole candidate cluster center. As the value of $d_c$ changes, the distributions of the decision points in Figure 13 (b), (c), and (d) change markedly. In contrast, it can be seen from Figure 13 (e) to (h) that, using the GKF of Eq. (4), the value of the cutoff distance has minimal effect on the local density of the data points. When $d_c$ increases from 0.57 (1.0%) to 0.88 (4.0%), the density distribution of the data points remains basically stable. There are three decision points with high density and high delta distance (shown as red squares) in all four cases of (e) to (h), which are the candidate cluster centers. At the same time, there are four decision points with low density and high delta distance (shown as black triangles), which correspond to the outliers. Moreover, the three candidate cluster centers and the four outliers refer to the same data points in all four cases. The four outliers are the 1st (AFP = 1.06, CEA = 10.70), 2nd (AFP = 2.32, CEA = 13.50), 12th (AFP = 7.38, CEA = 10.35), and 13th (AFP = 8.00, CEA = 8.10) data points in Table 4. The three candidate cluster centers are the 18th (AFP = 4.54, CEA = 26.03), 22nd (AFP = 4.91, CEA = 27.98), and 26th (AFP = 5.30, CEA = 27.62) data points, which correspond to different degrees of liver-cancer symptoms. Therefore, we can cautiously conclude that the DPCA algorithm achieves high robustness by using the GKF and obtains accurate disease-symptom clusters.

5.2.3 Quality Evaluation of Treatment Recommendation

To evaluate the quality of the treatment recommendations provided by DDTRS, an evaluation is performed via feedback from medical doctors. Five indicators are available on the application interface to evaluate the quality of a treatment scheme: (a) effectiveness, (b) chronergy, (c) non-harmful side-effects, (d) economy, and (e) patient satisfaction. The quality $Q$ of a treatment scheme is defined as a function of the five indicator scores in Eq. (16):

$$Q = f(q_1, q_2, q_3, q_4, q_5), \qquad (16)$$

where $q_1$ is the effectiveness of the treatment scheme, $q_2$ is the chronergy, $q_3$ is the non-harmful side-effects, $q_4$ is the economy, and $q_5$ is the patient satisfaction. The scoring range of each indicator is (1, 5). The quality of a treatment scheme is drawn as a radar graph, as shown in Figure 14.

Figure 14: Radar graph of the quality of treatment schemes

The radar graph of $Q$ is a pentagon in which the distance from the center to each vertex equals 5. It is easy to obtain that each side length is approximately 5.88 and that the area of the pentagon is $\frac{5}{2} \times 5^2 \times \sin 72^{\circ} \approx 59.44$. Hence, the value of $Q$ lies within the range (0, 59.44]. According to the radar graph, the value of $Q$ is the area enclosed by the five indicators (Eq. (17)):

$$Q = \frac{1}{2} \sin 72^{\circ} \left( q_1 q_2 + q_2 q_3 + q_3 q_4 + q_4 q_5 + q_5 q_1 \right). \qquad (17)$$
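A minimal Python sketch of the Eq. (17) area computation, with hypothetical perfect scores, confirms the upper bound of roughly 59.44:

```python
import math

def treatment_quality(q):
    """Area of the radar pentagon spanned by five indicator scores,
    each in (1, 5], per Eq. (17)."""
    assert len(q) == 5
    s = sum(q[i] * q[(i + 1) % 5] for i in range(5))
    return 0.5 * math.sin(math.radians(72)) * s

print(round(treatment_quality([5, 5, 5, 5, 5]), 2))   # ~59.44, the maximum
```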

In Eq. (17), the value of each indicator is computed from the scores submitted by the doctors and their weights. Taking the effectiveness indicator $q_1$ of a treatment plan as an example, assume that $n$ scores are fed back from $n$ doctors. Then, $q_1$ is defined as their weighted mean in Eq. (18):

$$q_1 = \frac{\sum_{j=1}^{n} w_j s_j}{\sum_{j=1}^{n} w_j}, \qquad (18)$$

where $s_j$ is the score given by the $j$-th doctor and $w_j$ is his weight, with a value in the range (0, 1).
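A minimal sketch of Eq. (18), with hypothetical scores and weights:

```python
def indicator_score(scores, weights):
    """Weight-normalized mean of doctor feedback, per Eq. (18)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(indicator_score([4, 5, 3], [0.9, 0.7, 0.4]))   # 4.15
```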

The weight of a doctor is defined based on his professional authority in his field of expertise. Namely, its value is calculated by combining his professional title and the average quality of his historical treatment schemes. Assuming that the treatment schemes presented by the $i$-th doctor are utilized in the recommendation, let $\bar{Q}_i$ be the average quality of all treatment schemes of that doctor. The weight of the $i$-th doctor is defined in Eq. (19):