Landscape of Big Medical Data: A Pragmatic Survey on Prioritized Tasks

01/03/2019
by   Zhifei Zhang, et al.

Big medical data poses great challenges to life scientists, clinicians, computer scientists, and engineers. In this paper, a group of life scientists, clinicians, computer scientists, and engineers sit together to discuss several fundamental issues. First, what are the unique characteristics of big medical data that distinguish it from data in other domains? Second, what are the prioritized tasks in clinical research and practice utilizing big medical data, and do we have enough publicly available data sets for performing those tasks? Third, do the state-of-the-practice and state-of-the-art algorithms perform a good job? Fourth, are there any benchmarks for measuring algorithms and systems for big medical data? Fifth, what are the performance gaps of state-of-the-practice and state-of-the-art systems handling big medical data, now and in the future? Last but not least, are we, life scientists, clinicians, computer scientists, and engineers, ready to work together? We believe answering these issues will help define and shape the landscape of big medical data.


I Introduction

Unlike physics or chemistry, where the natural laws governing molecules are successful in describing phenomena [1], medical science is not founded on first principles from which a healthy or unhealthy human being or animal can be derived. Thus, by nature, one of the important features of medical science is its data-driven mode: massive medical data stem from a wide range of experiments or clinical practices that spit out many types of information [2], and they provide the basis for clinical research and practice.

Big medical data poses great challenges to life scientists, clinicians, computer scientists, and engineers. Even considering only the computing requirements, without delving into medical details, Stephens et al. [3] compared genomics data (one portion of big medical data) with three other major data sources: astronomy, YouTube, and Twitter, and concluded that big medical data is either on par with or the most demanding of these domains in terms of data acquisition, storage, distribution, and analysis [3]. Unfortunately, big medical data has many other dimensions of complexity besides data volume. For example, medical data is much more heterogeneous than data in other domains [2]. Taking Alzheimer's disease (AD), the most common age-related neurodegenerative disease, as an example, clinicians and researchers [4] need to collect several types of data for AD diagnosis: clinical, genetic, imaging, and biospecimen data. The heterogeneity of multi-source data not only raises the difficulty of cognition (for both clinician and computer scientist practitioners), but also poses challenges in managing and analyzing those data (for computer scientists and engineers). Worst of all, the knowledge and skills in both fields are highly specialized, which seriously challenges multi-disciplinary collaboration.

Fig. 1: Relationships of Prioritized Tasks with Different Medical Data.

The purpose of this survey is to bridge the gap among life scientists, clinicians, computer scientists, and engineers. To define and shape the landscape of big medical data, we, a group of life scientists, clinicians, computer scientists, and engineers, sat down together. We know it is impossible to perform an exhaustive survey of all fields because of our knowledge and time budget limits. Instead, we take a pragmatic approach and focus on the prioritized tasks in clinical research and practice: Quantified Self, a specific movement to collect and analyze different aspects of a person's daily life; Disease Classification; Disease Diagnosis; and Drug Discovery. Figure 1 summarizes the relationships between different types of medical data and those prioritized tasks.

Our pragmatic approach also shapes how we draft this survey, and we keep the readers in mind: for each prioritized task, we help readers from all of these communities answer the following questions. What data sets are publicly available? What are the state-of-the-practice and state-of-the-art algorithms and systems? Do they perform a good job? If not, how large is the performance gap? Are there any comprehensive benchmark suites to evaluate the algorithms and systems?

As big medical data is a fast-evolving field, another purpose of this survey is to act as a framework for defining and shaping the landscape of big medical data. For example, understanding the root causes of disease is an important task in utilizing medical data; currently, we do not include it because of its complexity and immaturity. Meanwhile, for one prioritized task, disease diagnosis, we include only three representative diseases to demonstrate how to utilize medical data: Alzheimer's disease (AD), an age-related neurodegenerative disease; acute lymphoblastic leukemia (ALL), the most common cancer in children; and breast cancer, one of the most common diseases among women. In a word, we will keep expanding and updating the state of the art and state of the practice across the full spectrum of big medical data.

Table I presents a comprehensive comparison of our survey with previous ones. To the best of our knowledge, existing big medical data surveys [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] target specific problems and thus fail to cover the whole spectrum of the above issues. Significantly different from previous ones, our survey covers the four prioritized tasks in clinical research and practice (quantified self, disease classification, disease diagnosis, and drug discovery) from the perspectives of data sets, algorithms, systems, and benchmarks.

Reference Publication year Prioritized tasks Data sets Algorithms Systems Benchmarks
Our 2018 Quantified self
Disease classification
Disease diagnosis
Drug discovery

[5]
2018 Quantified self -
Disease diagnosis - -
Drug discovery - - -

[6]
2018 Disease diagnosis -
Disease surgery -

[7]
2018 Disease diagnosis

[8]
2018 Disease diagnosis

[9]
2018 Quantified self

[10]
2017 Disease diagnosis -

[11]
2016 Quantified self -
Disease diagnosis
Drug discovery - -

[12]
2016 Quantified self - -
Disease diagnosis -

[13]
2016 Disease diagnosis

[14]
2016 Big data index

[15]
2015 Quantified self - - -
Drug discovery - - - -

TABLE I: Summary of medical data surveys

After thoughtful discussion and a comprehensive survey within our multi-disciplinary group, we reach several consensuses and insights as follows:

  1. Big medical data is heterogeneous, high-dimensional, and embodies a large mixture of signals and errors [16]. The widely used data mining or machine learning techniques heavily depend upon identifying weak associations instead of strong causation. The noisy nature of experimental data may amplify the side effect of our current ability to identify weak associations at the cost of tolerating larger error thresholds [16]. From this perspective, we may need to develop new computing models and approaches for handling noisy big medical data.

  2. The publicly available big medical data sets are limited not only in scale, but also in that each typically covers a single data source. For example, much previous work utilizes deep learning algorithms to analyze imaging data to automatically diagnose disease. Unfortunately, all clinical practices and research, like disease classification, disease diagnosis, and drug discovery, need to utilize comprehensive data sources. So, we need to build multi-data-source knowledge bases to advance the state of the art and state of the practice for disease classification, disease diagnosis, and drug discovery.

  3. Previous work demonstrates the potential of incorporating machine learning techniques into clinical practice. However, its high accuracy is achieved on static data. In reality, clinical practitioners work in an open environment and handle open problems, so we need to set up realistic benchmarks that mimic the way clinical practitioners handle dynamic data for different clinical purposes. Otherwise, the accuracy achieved on static data is not meaningful for clinical practice.

  4. The sources and types of medical data are usually multifarious and integrated. These dimensions are not processed and learned individually; rather, they are combined to detect and diagnose diseases cooperatively. Under this circumstance, storage and processing systems are required to integrate different data sources and types. To the best of our knowledge, no system exists that can support multi-source and heterogeneous data storage and processing in the big medical domain, or even in other domains.

  5. We discuss several prerequisites for the purpose of an effective and efficient multi-disciplinary cooperation.

In the following sections, the related terminologies and unique characteristics of big medical data are described in Section II and Section III, respectively. In Section IV, we discuss four prioritized tasks in clinical research and practice, and the related publicly available data sets. In Section V, the state-of-the-art and state-of-the-practice algorithms are discussed. Benchmarks for measuring algorithms and systems are presented in Section VI. The performance gaps of state-of-the-art and state-of-the-practice systems handling big medical data are explained in Section VII. We also discuss multidisciplinary collaboration in Section VIII, and conclusions are drawn in Section IX.

II Terminologies

For the readers with different background, this section explains several important terminologies in both medical sciences and computer sciences.

Genetics. Genetics [17] is a term that refers to the study of genes and their roles in inheritance. As genes (units of heredity) carry the instructions for making proteins, directing the activities of cells and functions of the body, genetics involves scientific studies of genes and their effects [17].

Omics. Omics [18] aims at the collective characterization and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms. (The English-language neologism omics [18] informally refers to a field of study in biology ending in -omics, such as genomics, proteomics, or metabolomics.)

High throughput technologies. The generic term refers to the technologies that allow exact and simultaneous examinations of thousands of genes, proteins and metabolites [19]. For example, sequencing technologies [20] include a number of methods that are grouped broadly as template preparation, sequencing and imaging, and data analysis. The unique combination of specific protocols distinguishes one technology from another and determines the type of data produced from each platform [20]. In general, the automated Sanger method is considered as a first-generation technology [20]. Second-generation sequencing (SGS) refers to sequencing of an ensemble of DNA molecules with wash-and-scan techniques [21]. Third-generation sequencing (TGS) refers to sequencing single DNA molecules without the need to halt between read steps [21].

Systems biology. The basic purpose of systems biology [22] is the system-level understanding of a cell or an organism in the context of molecular networks. The four purposes of systems biology are as follows [22]: (1) understand the structure of all the components of a cell/organism up to a molecular level, (2) predict the future state of the cell/organism under a normal environment, (3) predict the output responses for a given input stimulus, and (4) estimate the changes in system behavior upon perturbation of the components or the environment.

Comparative Medicine. Comparative medicine [23] is a distinct discipline of experimental medicine that uses animal models of human and animal disease in translational and biomedical research. Basically, it relates and leverages biological similarities and differences among species to better understand the mechanism of human and animal disease [24].

Precision Medicine. Precision medicine [25] uses clinicopathological indexes and molecular profiling to create diagnostic, prognostic, and therapeutic strategies individually tailored to a patient.

Personalized medicine. Personalized medicine [26] refers to the prescription of specific therapeutics that is the best suitable for an individual on the basis of pharmacogenetic and pharmacogenomic information.

Data integration. The term data integration [27] refers to the situation where, for a given system, multiple data sources are available and studied integratively for knowledge discovery.

Benchmarks. A benchmark [28] is the act of running a set of computer programs or other operations, so as to assess the relative performance of an object, i.e., a system or an algorithm.

III What are the characteristics of Big Medical Data?

There are four significant characteristics of big medical data.

First, big medical data is either on par with or the most demanding of the major data-generating domains in terms of data acquisition, storage, distribution, and analysis [3]. (Stephens et al. [3] compared genomics data, one portion of big medical data, with three other major generators of big data: astronomy, YouTube, and Twitter.) There are several reasons underlying the boom of big medical data. On one hand, a significant fraction of the world's human population will have their genomes sequenced because of the promise of precision medicine or personalized medicine [3]. On the other hand, for medicine, just having the genome will not be sufficient, and other relevant omics data sets will certainly be collected, i.e., transcriptome, epigenome, proteome, metabolome, and microbiome sequencing, from different tissues to compare healthy and diseased states [29] [30] [3].

Second, medical data is much more heterogeneous than data in other domains [2]. On one hand, it stems from a wide range of experiments that spit out many types of information [2]. In summary, extensively used medical data consist of four types: sequence data, 3D-structure data, multivariate data, and network data [22]. The challenge of integrating heterogeneous data lies in deriving meaningful, interpretable correlations and causation [16]. For example, direct correlation analyses between transcriptomics and proteomics profiles are not valid in eukaryotic organisms [16]. On the other hand, there is a diversity of existing data types and formats, each compliant with a different standard, which results in data heterogeneity [27] [31].

Third, medical data is high-dimensional [16]. It is widely recognized that multiple dimensions must be considered simultaneously to understand biological systems from the perspective of systems biology, yet performing analytics on high-dimensional data often results in poor interpretability [32]. The reliability of models probably decreases with each added dimension (i.e., increased model complexity) for a fixed sample size (i.e., the bias-variance dilemma) [33]. Estimate instability, model overfitting, local convergence, and large standard errors all compromise the prediction advantage provided by multiple measures [16].

Fourth, medical data is coupled with the noisy nature of experimental data; omics data in particular embody a large mixture of signals and errors [16]. The widely used data mining or machine learning techniques heavily depend upon identifying weak associations instead of strong causation. The noisy nature of experimental data, which is often found in comparative medicine, may amplify the side effect of our current ability to identify weak associations at the cost of tolerating larger error thresholds [16]. In other words, the popular data analytics techniques may fail in handling mixtures of signals and errors. Biological systems include non-linear interactions and joint effects of multiple factors that make it difficult to distinguish signals from random errors [16], which has two implications: first, it is important to minimize sources of error in omics data [16]; second, we need to develop new computing models and approaches for handling noisy big medical data.

IV What are the prioritized tasks in clinician research and practices?

In this section, we summarize the prioritized tasks in clinician research and practices utilizing big medical data sets. We mainly focus on four prioritized tasks: quantified self, disease classification, disease diagnosis, and drug discovery. For each task, we answer the same question: do we have enough publicly available data sets?

IV-A Quantified Self

So far, there is no uniform definition of quantified self (QS for short). Wolf et al. [34] first proposed QS in 2007, referring to the use of technologies and equipment to record and analyze the body [35, 36, 37]. In this paper, we adopt the definition of QS in Wikipedia [38]: QS is a specific movement that integrates technology into data acquisition in a person's daily life [38]. Data are collected in terms of biological, physical, behavioural, or environmental information, such as food consumed, air quality, mood, and mental or physical performance [38, 39]. After analysing the data, QS devices or apps provide suggestions or warnings for users.

Zhu et al. [39] summarize four applications of QS: daily health management [40, 41, 42], disease prevention [43, 44, 45], chronic disease management [46, 47, 48], and out-of-hospital rehabilitation [49, 50, 51]. Personal data are continuously tracked and recorded through smart wearable devices [40]. In daily life, tracking records are analyzed to provide suggestions and warnings, and can also be shared among friends and families [41, 42]. Data collected by QS tools help pinpoint the root cause of disease early and achieve early treatment to prevent disease [43]. For chronic disease, QS apps transfer the data to the hospital's medical record system and monitoring center, and provide early warnings and corresponding diagnosis and treatment opinions [48]. And with portable quantitative instruments, patients can monitor themselves at home and use computer software to ask doctors for advice remotely for regular review after surgery [51].

IV-A1 Publicly available data sets

With economic improvement and the development of technology, QS data is becoming more and more important in the fields of health management and personalized medicine [52, 26]. However, most QS data sets are limited to human behaviour. Moreover, the data sets are small, collected from only a few to a dozen people. There is no comprehensive, publicly accessible database for user-contributed QS data [52]. The community should establish a public repository where individuals could upload any type of QS data [53]. Several public databases of QS data are described below.

HAR (Human Activity Recognition) data is an important data category of QS, which is of great significance in improving walking stability, recognizing motor disorders, evaluating surgical outcomes, and reducing joint loading [54]. At present, several HAR datasets have been released. The Daphnet freezing of gait dataset [55] is collected from patients with Parkinson's disease; it includes 10 users and 1,140,835 samples [55], each marked as "frozen" or "non-frozen" [55]. The WISDM Actitracker dataset [56] contains 1,098,213 samples belonging to 28 users and six distinct activities, including walking, jogging, sitting, standing, and climbing stairs; the samples were collected from Android phones [56]. Actitracker [57] contains several daily activities, including "jogging", "walking", "ascending stairs", and "descending stairs", collected using a cell phone in the user's pocket at a 20 Hz sampling rate. The Hand Gesture dataset [58] contains data on different types of human hand movements. Table II gives an overview of the public QS datasets.
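To make the dataset descriptions above concrete, the sketch below shows one common way such accelerometer recordings are prepared for activity recognition: the raw time series is cut into fixed-length, half-overlapping windows, each labeled by its dominant activity. The array shapes, window length, and sampling rate are illustrative assumptions, not properties of any specific dataset listed here.

```python
import numpy as np

def segment_windows(signal, labels, window=128, step=64):
    """Cut a (T, 3) accelerometer series into fixed-length windows.

    signal: array of shape (T, 3) with x/y/z acceleration samples.
    labels: array of shape (T,) with one activity id per sample.
    Returns (n_windows, window, 3) segments and (n_windows,) labels,
    where each window is labeled by its most frequent activity.
    """
    segments, window_labels = [], []
    for start in range(0, len(signal) - window + 1, step):
        end = start + window
        segments.append(signal[start:end])
        # Majority label inside the window (an assumption; datasets differ).
        values, counts = np.unique(labels[start:end], return_counts=True)
        window_labels.append(values[np.argmax(counts)])
    return np.stack(segments), np.array(window_labels)

# Toy usage with synthetic data sampled at 20 Hz (as in Actitracker).
rng = np.random.default_rng(0)
acc = rng.normal(size=(2000, 3))
act = rng.integers(0, 6, size=2000)
X, y = segment_windows(acc, act)
print(X.shape, y.shape)   # e.g. (30, 128, 3) (30,)
```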

The National Health and Nutrition Examination Survey (NHANES) [59] is a program that assesses the health and nutritional status of people in the United States (www.cdc.gov/nchs/nhanes/). The NHANES data set is available on the Internet through an extensive series of publications and articles in scientific and technical journals [59]. Open Humans [60] is a platform that allows you to upload, connect, and privately store your personal data, such as genetic, activity, or social media data (www.openhumans.org/).

Name Task Instance Types Year
Open Humans [60] Data Analysis - genetic, activity, or social media data 2015-present
Hand Gesture [58] HAR sampled at 32 samples per second time series data 2014-present
WISDM [56] HAR 1,098,213 samples time series data 2011-present
Actitracker [57] HAR 29,000 frames time series data 2011-present
Daphnet [55] HAR 1,140,835 samples time series data 2010-present
NHANES [59] Food Consumption - text 1960-present
TABLE II: Overview of QS datasets. HAR means Human Activity Recognition.

IV-B Disease Classification

Disease classification is the task of grouping entities together according to their similarities; it is an important step in precision medicine and of great significance to the quantitative study of medical phenomena [61, 62]. Nowadays, it is widely used in academic medicine [63]. However, existing disease taxonomies often focus on the physiological characterization and clinical appearance of diseases, with little reference to disease mechanisms [64, 65]. Constructing a more accurate classification of diseases that reflects disease mechanisms is essential for fully understanding these entities.

With the development of biotechnology, computer technology, and medical technology, molecular data is growing rapidly. This data provides new knowledge of diseases at the molecular level and deepens our understanding of disease. Thus, we can reclassify diseases according to the new input: molecular biology information (such as the genomic data of a tumour).

IV-B1 Publicly available data sets

Encouraged by the concept of precision medicine, most research on reclassifying diseases focuses on molecular biology. Besides, according to [66], building a new, effective system of disease taxonomy also requires other comprehensive data concerning clinical medicine, the environment, and the state of individual health, which is expected to be accumulated. As shown in Table III, there are many publicly available data sets suitable for disease classification. So far, the publicly available data mainly include the genomic information of diseases such as different kinds of cancers, plus clinical reports. However, datasets relating to the environment and the state of individual health are rarely publicly accessible since they involve patient privacy. Several representative public datasets containing genomic, clinical, environmental, and patient health-status data are described below.

Name Task Instances Types Year
APGI [67] Expediting the transformation from research detection to the improvement of treatment for pancreatic cancer patients More than 4,000 cases genomic and clinical data 2009 - present
EGA [68] Offering free data and services of biological information Over 6.25 million cases phenotype and gene data 2007 - present
ICGC [69] Describing cancer genomes 1.52 PB genomic and clinical data 2007 - present
TCGA [70] Enhancing the avoidance, diagnosis, and therapy of cancer 33,096 cases genomic data 2000 - present
GEO [71] Offering a robust, multi-functional database and effective tools for querying and analyzing the database 2,680,676 samples genomic data 2000 - present
CCLE [72] Transforming the genomics of cancer cell lines into cancer stratification Over 1,100 cell lines containing genetic information genomic and cell line data 2000 - present
Sanger [73] Genomic detection and comprehension on the Earth More than 1,000 human genomes genomic data 1992 - present
Truven Marketscan [74] Providing health improvement solutions built from complete data, progressive analysis and related specialists Roughly 55 million claims every year genetic, environmental and other medical insurance data 1970s - present

TABLE III: Data sets for disease classification

The Cancer Genome Atlas data set

The Cancer Genome Atlas (TCGA) data set is the most popular genetic dataset, focusing on collecting data from cancer patients. So far, TCGA has collected data on approximately 7,000 human tumors [70], and it contains many types of data such as measurements of somatic mutations, copy number variation, and mRNA expression [70]. One of the most important purposes of TCGA is to obtain valuable insights into the heterogeneity of different cancer subtypes [75, 76, 77, 78].

European Genome-phenome Archive (EGA)

EGA provides over 4 thousand datasets, consisting of individually identifiable phenotype and gene data. The data are collected from biomedical research and can only be used for legitimate research purposes, with permission obtained by creating an account [68]. Specifically, EGA contains approximately 710 plasma cell-free DNA samples and 428 white blood cell samples collected from over 400 patients with metastatic prostate cancer.

Australian Pancreatic Cancer Genome Initiative (APGI) data set

APGI is a part of the International Cancer Genome Consortium, a global research enterprise of over 100 scientists, clinicians, and allied health professionals involved in pancreatic cancer research and care [67]. APGI contains data on over 4,000 pancreatic cancer patients, with a range of prospectively collected and archived biospecimens [67]. Every biospecimen is coupled with detailed clinico-pathological data, including past medical history, treatment data, and detailed disease outcomes [67].

Cancer Cell Line Encyclopedia (CCLE) data set

CCLE is a collaboration among the Novartis Institutes for Biomedical Research, the Genomics Institute of the Novartis Research Foundation, and the Broad Institute [72]. The project has three purposes: to characterize the genetic and pharmacologic features of a large panel of cancer models, to promote integrated analyses that link specific pharmacologic susceptibilities to genomic patterns, and to translate comprehensive cancer cell genomics into the stratification of human cancer [72]. The website [72] provides a publicly available genetic dataset that can be displayed with visualization methods. The data set, covering more than 1,000 cell lines, can be exploited for quantitative analysis of cancer and research on cancer reclassification according to genes and cell lines.

Truven Marketscan data sets

Truven Marketscan is an integrated platform of the IBM Watson Health business providing health improvement solutions [74]. It includes the clinical information of patients and an accumulation of over 50 million medical insurance claims every year, some of which, according to the description of the platform, are accessible for research purposes for a fee. The medical insurance information can be applied to various aspects of medical research, especially disease classification in precision medicine, thanks to the genetic and environmental information of a large number of patients.

IV-C Disease Diagnosis

Disease diagnosis is the process of determining which disease or condition explains a person's symptoms and signs [79]. Early diagnosis saves time and money for patients [80]. Nowadays, with the development of new techniques, diagnostic methods have improved greatly. First, the development of genomics techniques makes genetic data play an important role in diagnosis. For example, gene mutations are used to classify acute myelocytic leukemia (AML) [81, 82], and two gene mutations have been included in the WHO classification of myeloid neoplasms and AML [83]. Second, electronic health records (EHRs) have increased dramatically in recent years; for example, 75.5% of US hospitals had a basic EHR system by 2014 [84]. Based on EHRs, Rajkomar et al. [85] propose a fast healthcare interoperability resources (FHIR) format, making it possible to make the most of EHRs, including free-text notes. In the following, three common diseases are discussed in detail: Alzheimer's disease (AD), an age-related neurodegenerative disease [86]; acute lymphoblastic leukemia (ALL), the most common cancer in children [87]; and breast cancer, one of the most common diseases among women.

IV-C1 Representative disease diagnosis

Alzheimer’s disease (AD) diagnosis

AD is the most common age-related neurodegenerative disease, resulting in an irreversible loss of memory and other cognitive functions in elderly people worldwide [86]. In 2006, the number of individuals with AD was 26.6 million [88]. By 2050, this number will quadruple, by which time 1 in 85 persons worldwide will be living with AD [88]. According to severity, patients are classified as normal, mild cognitive impairment (MCI), or AD [4]. Medical imaging of the brain is usually used to diagnose AD, which is time-consuming if the work is done manually. To tackle the problem, many automatic diagnostic systems have been developed. Among them, the convolutional neural network (CNN) is the most prevalent method and has good performance [10].

Acute lymphoblastic leukemia (ALL) diagnosis

ALL is the most common cancer in children [87], with a peak incidence at 2-5 years of age [80]. Without timely treatment, children with this serious blood pathology will die within a few weeks [80]. Early diagnosis helps provide timely and proper treatment for patients [80]. Microscopic examination of blood or bone marrow smears is the only effective way to diagnose leukemia [89]. Generally, the method can be tackled by a classic sequence of steps: (1) image enhancement, (2) identification of white cells, (3) feature extraction, and (4) classification [80]. Besides image analysis, genomics studies have been introduced to inform disease classification in recent years [82].
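As a rough illustration of the classic pipeline above, the sketch below segments candidate white cells from a stained smear image with Otsu thresholding and extracts simple shape features per cell. It is a minimal, hypothetical example (the threshold choice, area cutoff, and feature set are assumptions), not the method used in the ALL-IDB studies.

```python
import numpy as np
from skimage import color, filters, measure

def extract_cell_features(rgb_image):
    """Segment candidate white cells and compute simple shape features."""
    gray = color.rgb2gray(rgb_image)                 # (1) image preparation
    thresh = filters.threshold_otsu(gray)            # global Otsu threshold
    mask = gray < thresh                             # (2) dark, stained nuclei
    labeled = measure.label(mask)
    features = []
    for region in measure.regionprops(labeled):
        if region.area < 200:                        # drop tiny artifacts
            continue
        circularity = 4 * np.pi * region.area / (region.perimeter ** 2 + 1e-9)
        features.append([region.area, region.eccentricity, circularity])
    return np.array(features)                        # (3) features -> classifier

# A downstream classifier (e.g. an SVM) would then label each cell
# as lymphoblast vs. normal lymphocyte in step (4).
```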

Breast cancer diagnosis

Breast cancer has become one of the most common diseases leading to death among women. Breast cancer can be diagnosed by classifying tumors. There are two different types of tumors: malignant and benign. Doctors need a reliable diagnostic procedure to distinguish between these tumors [10]. Generally, it is very difficult to distinguish tumors, even for experts; therefore, an automatic diagnostic system is needed. The detection of breast cancer consists of three subtasks [90]: (1) detection and classification of mass-like lesions, (2) detection and classification of micro-calcifications, and (3) breast cancer risk scoring of images. By using and extending results from the fields of machine learning, statistics, image processing, and optimization, highly accurate diagnosis of breast cancer is expected to become possible even for untrained users.

IV-C2 Publicly available data sets

Alzheimer’s disease (AD) diagnosis

Different materials such as clinical, cognitive, imaging, genetic, and biochemical biomarkers can all be used to define the progression of AD, and several studies try to determine the relationships between those data [4]. ADNI and BioFINDER are two longitudinal studies of AD, which provide comprehensive AD data sets and have been used widely by researchers. However, several studies indicate that the data are still inadequate for solving the real problem. For example, the AD DREAM Challenge [91] aims to benchmark state-of-the-art algorithms in predicting AD based on publicly available genetic and imaging data. However, the results are not yet satisfactory, and one possible reason is that the data used to train the models are inadequate.

As a longitudinal multi-center study designed to develop clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of AD, the Alzheimer's Disease Neuroimaging Initiative (ADNI) [4] provides comprehensive data for AD. Since its foundation in 2003, it has made major contributions to AD research. It now contains 483 subjects diagnosed as elderly controls, 1001 subjects with MCI, and 437 subjects with AD. ADNI researchers collect several types of data: clinical data, genetic data, imaging data, and biospecimen data. The clinical dataset comprises recruitment, demographics, physical examination, and cognitive assessment data saved as comma-separated values (CSV) files. Genetic data contain genotyping and sequencing data. Images such as magnetic resonance imaging (MRI) and positron emission tomography (PET) are available. Biospecimens include blood, urine, and cerebrospinal fluid (CSF).

The Swedish Biomarkers For Identifying Neurodegenerative Disorders Early and Reliably (BioFINDER) study [92] is another longitudinal study, aiming to develop methods for early and accurate diagnosis of AD and Parkinson's disease (PD). It comprises more than 1600 subjects who undergo examinations including advanced MRI, CSF and plasma analysis, amyloid and tau PET, clinical assessments, and neuropsychological examinations [92].

Name Task Instances Types Modality Year
ADNI Detect AD at early stage 483 EC subjects, 1001 MCI subjects, 437 AD subjects Clinical, genetic, imaging, biospecimen MRI, PET, TXT 2003-2018
BioFINDER Detect AD and PD at early stage more than 1600 subjects Clinical, genetic, imaging, biospecimen MRI, PET, TXT 2013-2018
  • Abbreviation: AD = Alzheimer disease; EC = elderly controls; MCI = mild cognitive impairment; PD = Parkinson’s disease.

TABLE IV: Data set for AD diagnosis

Acute lymphoblastic leukemia (ALL) diagnosis

There are many comprehensive datasets for cancers (e.g., TCGA), but as a branch of cancer, the data for ALL are relatively decentralized. Blood or bone marrow smears are key materials for diagnosing ALL [89]. ALL-IDB is a public image database for ALL that has been studied widely. However, several researchers [93, 94] claim that hundreds of images are not enough to build a robust CNN, so more public data are still needed for ALL diagnosis. Besides, genomic information for ALL is also provided by projects such as TARGET [87] and BioGPS [95].

The Acute lymphoblastic leukemia image database (ALL-IDB) [80] is a public dataset of microscopic images of blood samples, on which researchers can evaluate their segmentation and classification algorithms. The images are in JPG format with 24-bit color depth and a resolution of 2592 x 1944. ALL-IDB1 contains 108 images with 39,000 blood elements, where lymphocytes are manually labeled by experts. Cropped areas of interest of cells belonging to ALL-IDB1 are collected as the ALL-IDB2 dataset.

Therapeutically applicable research to generate effective treatments (TARGET) [87], a project of the national institutes of health (NIH), determines the molecular changes that drive childhood cancers by genomic approaches. The ALL pilot phase (Phase I) has produced genomic profiles of nearly 200 B-cell ALL patient cases for molecular alterations. Nucleic acid sample data, extracted from peripheral blood and bone marrow tissues, are included in each fully-characterized case. The dataset consists of clinical information, tissue pathology data, chromosome-specific copy number alterations, sequence data of single amplicons, and mutations. BioGPS [95] is a gene annotation portal, which supplies several ALL genetic datasets, the biggest of which contains 207 samples provided by the children's oncology group (COG) study P9906 for high-risk pediatric ALL [96]. For each subject, the dataset provides genome structure information such as BCR-ABL, E2A-PBX1, and TEL-AML, and clinical information such as central nervous system (CNS) status, white blood cell (WBC) count, age, gender, etc.

Name Task Instances Types Modality Year
TARGET Phase I Determine molecular changes that drive childhood cancers 200 subjects Clinical, genetic TXT 2009-2018
BioGPS COG P9906 Evaluate a regimen in patients with high risk B-precursor ALL 267 subjects Clinical, genetic TXT 2011
ALL-IDB Evaluate the algorithms for image segmentation and classification 108 images Imaging JPG 2010
TABLE V: Data set for ALL diagnosis

Breast Cancer Diagnosis

Digital imaging databases are needed for mammographic image analysis research. For accurate labeling of images, free-text report databases are necessary as well; they can be leveraged to turn the reports into accurate annotations automatically for network training [10]. Besides, it is advisable to use gene expression databases to be able to connect cancer phenotypes to genotypes.

So far, all three types of breast databases mentioned above have been developed. Unfortunately, most of the large databases are not publicly available, and many stale databases are still in use.

INbreast, mini-MIAS, DDSM, BCDR-FMR, the Breast Cancer Wisconsin Dataset, and TCGA-BRCA are the most frequently used mammographic mass classification datasets. These databases represent a constructive and practical contribution to computer vision research in mammography (MG for short), and it is expected that they will encourage the production of more extensive collections of data. Table VI shows the most commonly used datasets for diagnosing breast cancer.

Name Task Instance Types Modality
INbreast develop breast cancer CAD systems 115 cases, 410 images imaging MG
MIAS mammographic image analysis research 322 images imaging MG
DDSM mammographic image analysis research 2620 cases, 43 volumes imaging, expert ground-truth, metadata MG
BCDR-FMR lesion classification 1010 cases, 3703 images imaging, metadata MG
BCW lesion classification 569 instances, 32 attributes imaging, metadata MG
TCGA-BRCA connecting cancer phenotypes to genotypes 139 cases, 230167 images imaging, clinical and genomic data MR, MG
TABLE VI: Overview of datasets for breast cancer diagnosis. MG stands for mammography.

The INbreast database contains 115 cases, including 90 cases from women with both breasts affected and 25 cases from mastectomy patients [97]. The INbreast database includes many types of lesions [97]. The comparative advantages of the INbreast database are its number of examples together with accurate labels [97]. This database is expected to strongly support future work on breast cancer diagnosis [97].

The Mammographic Image Analysis Society (MIAS) database is a digital mammography (MG) dataset. It includes 322 digital images and contains both abnormal and normal examples [98]. The entire database, when compressed, occupies less than 2 GBytes and fits onto a single 8 mm magnetic tape. Copies are available for research purposes. The mini-MIAS database is available for scientific research at no cost, provided that users abide by the licence agreement when using the imagery.

The Digital Database for Screening Mammography (DDSM) contains digitized mammograms together with related labels and other detailed information [99]. The DDSM database is freely available through the website [99].

The Breast Cancer Digital Repository (BCDR-FMR) is a comprehensive labeled dataset, which provides digital content (digitized film mammography images) and associated metadata (clinical history, segmented lesions classified according to BI-RADS, image-based descriptors, biopsy proof, etc.). The BCDR-FMR establishes a new reference for developing breast cancer diagnosis methods [100].

The Breast Cancer Wisconsin (Diagnostic) Dataset is comparatively abundant in examples, including 569 patients; each instance is described by 32 attributes and ten measured features. The dataset includes both qualitative and quantitative features, and all feature values are recorded with four significant digits. It consists of 212 malignant cases and 357 benign cases [101].
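Because the Wisconsin diagnostic data are small and tabular, a standard classifier already gives a reasonable baseline. The sketch below, using the copy of this dataset shipped with scikit-learn, trains a logistic regression model on its 30 numeric features; it is only a baseline illustration, not a clinically validated diagnostic procedure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 569 instances, 30 numeric features, labels: malignant vs. benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```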

Nowadays, tumor gene expression analysis techniques based on DNA microarrays have been applied to diagnose breast cancer [102]. The TCGA-BRCA project has built the most comprehensive gene expression database. However, analytical algorithms that can solve gene expression-based diagnosis problems have yet to be established [102].

Traditional PACS (picture archiving and communication) systems preserve structured reports described by radiologists [97]. To optimally leverage free-text reports for network training, we can automatically turn these reports into precise labels or structured annotations [97]. However, most PACS databases are unavailable.

IV-D Drug discovery

Drug discovery is the process of finding drug candidates that can be used as new drugs [103]. The motivation for drug discovery is that there are no suitable medical products for certain diseases [104]. Drug discovery is generally divided into the following steps: target identification and validation, screening and lead discovery, and lead optimization and retrosynthetic analysis [105, 103]. Despite advances in biotechnology, drug discovery remains an expensive, difficult, and inefficient process [106].

In target identification and validation phase, we need to determine the pathogenic factors of the disease, from DNA to RNA to protein characterization [107].

Lead discovery and drug screening refer to the process of assessing the biological activity, pharmacological effects, and medicinal value of a substance that may be used as a drug, using an appropriate method [108]. Drug screening mainly includes high-throughput screening (HTS) and virtual drug screening [104]. HTS can be performed by robots on millions of tests at the same time, so the cost is very high: real drug screening requires the construction of large-scale compound libraries, the extraction or cultivation of large numbers of the target enzymes or target cells necessary for the experiments, and complex device support, so it requires a huge investment. Virtual drug screening instead simulates the screening process on a computer, predicts the likely activity of compounds, and then performs targeted physical screening of the compounds that are most likely to become drugs.
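To illustrate the virtual screening idea in the previous paragraph, the sketch below scores unseen compounds with a model trained on known actives and inactives, using RDKit Morgan fingerprints as features. The SMILES strings and labels are made-up placeholders, and a real campaign would use a curated assay set (e.g., from PubChem BioAssay or ChEMBL).

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def fingerprint(smiles, n_bits=2048):
    """Morgan (ECFP-like) bit fingerprint for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(fp)

# Placeholder training set: SMILES of known actives (1) and inactives (0).
train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
train_labels = [0, 1, 1, 0]

X = np.array([fingerprint(s) for s in train_smiles])
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, train_labels)

# Rank a (placeholder) library by predicted probability of activity,
# so that only the top-scoring compounds go on to physical screening.
library = ["CCCCO", "c1ccccc1N", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
scores = model.predict_proba(np.array([fingerprint(s) for s in library]))[:, 1]
for smi, score in sorted(zip(library, scores), key=lambda t: -t[1]):
    print(f"{score:.2f}  {smi}")
```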

The final stage of drug discovery is lead optimization and retrosynthetic analysis. The purpose of lead optimization is to maintain the advantageous properties of the lead while improving the defects of its structure. Retrosynthetic analysis generates synthetic routes for a given target molecule [109]: given a desired target molecule, it proposes several possible synthetic routes built from molecular compounds that can be directly synthesized [110]. The difficulty is that if the molecule to be synthesized is very complicated, a chemist may have to consult a great deal of relevant literature and carry out repeated analysis in order to finally obtain a few possible synthetic routes [111]. Moreover, chemists may not be able to find a reasonable synthetic route at all because of knowledge or time constraints, or may only find a few routes.
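The sketch below illustrates the search problem described above with a toy, entirely hypothetical template library: starting from a target "molecule", it recursively replaces intermediates by precursor sets until every leaf is a purchasable building block, enumerating candidate routes. Real retrosynthesis planners work on reaction templates learned from the literature and use far stronger search strategies.

```python
# Toy retrosynthetic search: templates map a product to possible precursor sets.
TEMPLATES = {                      # hypothetical, illustrative reactions only
    "target": [["intermediate_a", "reagent_x"], ["intermediate_b"]],
    "intermediate_a": [["building_block_1", "building_block_2"]],
    "intermediate_b": [["building_block_3", "reagent_x"]],
}
PURCHASABLE = {"reagent_x", "building_block_1", "building_block_2", "building_block_3"}

def routes(molecule, depth=0, max_depth=5):
    """Enumerate synthesis routes as nested (molecule, [sub-routes]) trees."""
    if molecule in PURCHASABLE:
        return [(molecule, [])]                       # leaf: can be bought
    if depth >= max_depth or molecule not in TEMPLATES:
        return []                                     # dead end
    found = []
    for precursors in TEMPLATES[molecule]:
        sub = [routes(p, depth + 1, max_depth) for p in precursors]
        if all(sub):                                  # every precursor is makeable
            # Take the first route of each precursor for a simple enumeration.
            found.append((molecule, [s[0] for s in sub]))
    return found

for route in routes("target"):
    print(route)
```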

Many types of data are needed for drug discovery. At the stages of target identification and validation as well as lead optimization and retrosynthetic analysis, gene expression data and molecular-level data are needed, including compound structure, properties, and related chemical reactions. In drug screening, we need a drug sensitivity database and a toxicity database [107]. Several databases related to drug discovery are listed in Table VII.

Name Task Instance Types
Reaxys discover chemical structures, properties and reactions more than 28 million reactions, more than 18 million substances, and more than 4 million documents relevant literature, precise compound properties and reaction data
SIOC for chemical research - compound structure and identification, natural products and pharmaceutical chemistry, chemical literature, chemical reactions and comprehensive information
PubChem BioAssay deliver free and easy access to all deposited data, and to provide intuitive data analysis tools [112] 500,000 descriptions of assay protocols, covering 5000 protein targets, 30,000 gene targets and providing over 130 million bioactivity outcomes [112] chemical structure and biological properties of small molecules and RNAi reagents [112]
TCM construct the first traditional Chinese medicine database for molecular docking simulation [113] 20,000 pure compounds isolated from 453 TCM components [113] molecular attributes, substructures, TCM components, and TCM classifications
ChEMBL address a wide range of drug discovery problems 2,275,906 compound records, 12,091 targets compound structure, biological or physicochemical measurements of these compounds and information on the goals of these assays are recorded in a structured form [114]
TABLE VII: Overview of datasets for drug discovery

IV-D1 Gene expression database and molecular-level database

The Reaxys database is produced by Elsevier and is a rich database of chemical values and facts. Reaxys integrates the contents of Beilstein, Patent, and Gmelin into a unified resource that includes more than 28 million reactions, more than 18 million substances, and more than 4 million documents. It helps users identify promising new projects, terminate ineffective lead compounds, and design economical, high-yield synthetic routes that maximize time and cost savings.

The Shanghai Institute of Organic Chemistry's (SIOC) database group (http://www.organchem.csdb.cn) is a comprehensive information system for chemical research and development. It provides compound structure and identification, natural products and pharmaceutical chemistry, chemical literature, chemical reactions, and comprehensive information. Chemical reaction condition retrieval searches for matching chemical reactions in the database by conditions such as reactants, products, catalysts, solvents, and reagents; users can search for a relevant reaction by the reactants, the English name of the product, the reaction conditions, the catalyst, and so on.

The Taiwan traditional Chinese medicine (TCM) database [113] is currently the largest non-commercial TCM database in the world. This web-based database contains more than 20,000 pure compounds isolated from 453 TCM components [113]. All data are easily accessible to all researchers. Over the past eight years, many volunteers have spent time analyzing Chinese medicine ingredients in the Chinese medical literature and building structural files for each of the isolated compounds.

IV-D2 Lead discovery and drug screening database

PubChem BioAssay database [112] is a public resource for archiving the chemical structure and biological properties of small molecules and RNAi reagents. The PubChem BioAssay database currently includes bioactivity from high-throughput screening and medicinal chemistry studies [112]. In addition, the PubChem BioAssay database contains dozens of high-throughput RNAi screens for complete genomes. These data, combined with other NCBI resources, make PubChem a public information system widely used in chemical biology and drug discovery research [112].

ChEMBL [114] is an open, large-scale bioactivity database containing information manually extracted from the medicinal chemistry literature (https://www.ebi.ac.uk/chembl). The ChEMBL database currently contains information extracted from more than 51,000 publications, as well as bioactive data sets from 18 other databases [114]. The data mainly includes screening results and bioactivity data.

V Do state-of-the-art and state-of-the-practice algorithms perform a good job?

V-A Quantified Self

With wearable devices widely deployed in recent years, more and more physiological and functional data are captured continuously for healthcare applications [52]. In previous work, traditional statistical methods are widely used. There are two challenges and opportunities in handling QS data. On one hand, although deep learning has been applied successfully on high-performance platforms, it does not perform well on low-power wearable devices due to resource limits [115]. On the other hand, QS sensor data are mostly time series [52].

In previous work, the data are usually analysed with traditional techniques like linear regression or other statistical methods. For example, Angeles et al. [116] use statistical algorithms to distinguish between non-mimicked and mimicked tests for all the primary symptoms of Parkinson's disease, with very convincing differences. To evaluate patients' progress in rehabilitation recovery [117], Chen et al. [118] use validation techniques such as 10-fold cross-validation. Their technique can classify the types of exercise and determine whether the postures are appropriate; the overall accuracy of posture recognition is 88.26% and that of type classification is 97.29% [118]. It is believed to help patients rehabilitate effectively.
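As a small, hedged illustration of the 10-fold cross-validation mentioned above (not the actual features or data of Chen et al.), the sketch below estimates classification accuracy on synthetic posture features with scikit-learn.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for posture features (e.g. joint-angle statistics).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = rng.integers(0, 4, size=300)          # four hypothetical exercise types

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```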

We summarize state-of-the-practice and state-of-the-art work applying machine learning to QS in Table VIII. To identify common daily living activities for chronic disease management, Atallah et al. [119] use a two-stage Bayesian classifier to evaluate the condition of patients with chronic obstructive pulmonary disease (COPD). The Bayesian classification framework can account for errors in the sensor data and explain the classification accuracy of different activities [119]. Methods like SVMs (support vector machines) and decision trees are trained to classify the data [120, 121, 122]. For human physical activity recognition, Ignatov et al. [121] propose a method using k-nearest neighbors and DNNs as an alternative way to process time-series data. Their method has high accuracy: when using a set of segmentation and KNN, it achieves nearly 96% recognition accuracy [121]. To recognize activity, Catal et al. [123] propose a method that combines multiple classification methods such as the J48 decision tree, Multi-Layer Perceptrons (MLP), and Logistic Regression. Using deep learning methods, including CNNs (convolutional neural networks), RBMs (Restricted Boltzmann machines), and DBNs (deep belief networks), the machine can directly learn a set of discriminative features from the input data [124, 125, 115]. Alsheikh et al. [126] use a method based on DBNs and RBMs that employs multiple hidden layers to recognize activity. This work shows that these models achieve better recognition accuracy of human activities by using a large number of unlabeled acceleration samples to extract unsupervised features, avoiding the expensive manual feature design in existing systems [126].
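A minimal sketch of the shallow-learning route described above: hand-crafted statistics (mean, standard deviation, magnitude, extremes) are computed per accelerometer window and fed to a k-nearest-neighbor classifier. It is an illustrative baseline only; the published systems differ in features, windowing, and tuning, and the data here are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def window_features(windows):
    """windows: (n, length, 3) accelerometer segments -> (n, 9) statistics."""
    mean = windows.mean(axis=1)
    std = windows.std(axis=1)
    magnitude = np.linalg.norm(windows, axis=2).mean(axis=1, keepdims=True)
    return np.hstack([mean, std, magnitude, windows.min(axis=1)[:, :1],
                      windows.max(axis=1)[:, :1]])

# Synthetic stand-in for labeled activity windows (6 activity classes).
rng = np.random.default_rng(1)
windows = rng.normal(size=(600, 128, 3))
labels = rng.integers(0, 6, size=600)

X = window_features(windows)
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```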

Nowadays, the volume of data requires efficient ways of classification and analysis, and deep learning is a good choice in some cases [117]. Deep learning is a promising technique that can extract information and make inferences from big data by using multiple processing layers [127].

Reference Method Modality Application Database Year
 [120] GMM, HC, K-means, K-medoids, SC time series data HAR none 2017
 [116] statistical time series data Quantify Parkinsonian symptoms none 2016
 [121] LR, NN, SVM, J48, KNN time series data HAR none 2016
 [122] adaboost time series data HAR WISDM 2016
 [126] DBN and RBM time series data HAR WISDM, Daphnet, Skoda 2016
 [118] cross-validation time series data Rehabilitation Exercise Assessment for Knee Osteoarthritis none 2016
 [123] ensemble of classifiers time series data HAR WISDM 2015
 [125] DCNN time series data HAR Opp, Hand Gesture 2015
 [115] CNN time series data HAR none 2015
 [124] CNN time series data HAR Opp, Skoda, Actitracker 2014
 [119] Bayesian classifier time series data chronic obstructive pulmonary disease none 2009
TABLE VIII: The summary of algorithms used in QS.

V-B Disease classification

Disease classification, which groups diseases together based on their similarities, is expected to promote the understanding and curing of human diseases [62]. The conventional methods classify human diseases along four dimensions: pathogen, the original component causing the disease, pathology, and clinical patient behavior. However, with the progress of human gene analysis and molecular biology, researchers have obtained a large amount of relevant information. With this molecular-level information, our knowledge network of diseases has been refreshed and our understanding of diseases has also improved. Consequently, these two significant changes provide an appropriate opportunity for researchers to develop more precise methods to classify human diseases at the molecular level.

Table IX shows a variety of mature algorithms for classification and clustering. Although many algorithms for disease taxonomy have been developed, researchers need to develop more effective and robust algorithms to meet the needs of disease classification as different types of molecular biology data continue to increase.

Reference Method Modality Application Data Year
[128] iCluster [129] genomic Clustering cancer from chromosome, DNA, mRNA and protein level TCGA 2018
[130] random forest, DIANA clustering, two-dimensional hierarchical clustering genomic Classifying colorectal cancer with gene information of cancer cells GEO 2017
[131] non-negative matrix factorization genomic Clustering pancreatic cancer information of recurring altering genes APGI 2016
[132] Markov cluster (MCL) algorithm genomic Reclassifying 6 classification systems of colorectal cancer into 4 consensus molecular subtypes CCLE, GSK, Sanger 2015
[133] cluster-of-cluster assignments (COCA) algorithm genomic Clustering 12 cancer types from gene and protein level to find new subtypes of the cancer TCGA 2014
[134] iCluster [129] genomic Cancer subtype classification and discovery TCGA 2012

TABLE IX: Summary of algorithms for disease classification

Hoadley et al. [133] reclassified 12 kinds of cancer, originating from different organs, at the molecular level. In this research, the cluster-of-cluster assignments (COCA) algorithm, an agglomerative hierarchical clustering method using Pearson correlation as the distance measurement, is adopted to cluster the genomic and proteomic data from different platforms. According to the results, the 12 cancers conventionally classified according to their organs of origin are reclassified into 11 major subtypes with respect to their underlying genes and proteins, which provides a new idea for cancer clinical treatment strategies.
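The core operation in COCA, agglomerative hierarchical clustering with Pearson correlation as the distance, can be sketched as follows with SciPy; the sample matrix here is random placeholder data standing in for the per-platform cluster membership indicators, not TCGA data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Placeholder matrix: rows = tumor samples, columns = binary membership
# indicators coming from the per-platform cluster assignments.
rng = np.random.default_rng(0)
membership = rng.integers(0, 2, size=(100, 25)).astype(float)

# Pearson-correlation distance (1 - r), then average-linkage agglomeration.
dist = pdist(membership, metric="correlation")
tree = linkage(dist, method="average")
labels = fcluster(tree, t=11, criterion="maxclust")   # ask for 11 groups
print(np.bincount(labels))
```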

Hoadley et al. [135] conduct a molecular reclassification of 33 cancer types from the TCGA platform. During the reclassification, the researchers utilize the iCluster algorithm, a joint latent variable model-based clustering algorithm that can deal with different types of data, to cluster molecular data involving chromosome, DNA, mRNA, miRNA, and protein from TCGA. Based on the results of the study, the researchers re-cluster the 33 cancer types into 28 cancer subtypes. Besides, they discover and verify the dominant role of the cell-of-origin pattern in cancer molecular classification. The molecular similarities of cancer subtypes contribute to the improvement of future cancer therapy [135].

Bailey et al. [131] identify various subtypes of pancreatic ductal adenocarcinoma through genetic analysis. Non-negative matrix factorization clustering, which uses matrix factorization to cluster samples, is exploited to cluster the recurring altered genes of 456 pancreatic cancer patients. This study discovers 4 subtypes of pancreatic cancer, each with specific pathological characteristics, which provides beneficial information for inferring the development of pancreatic cancer and a new idea for clinically therapeutic strategies for pancreatic cancer [131].
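A minimal sketch of the non-negative matrix factorization clustering mentioned above, using scikit-learn on a random non-negative placeholder matrix (the gene and sample counts are illustrative): each sample is assigned to the factor on which it loads most heavily.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder non-negative matrix: rows = samples, columns = values for
# recurrently altered genes; the numbers here are synthetic.
rng = np.random.default_rng(0)
expression = rng.random(size=(456, 50))

model = NMF(n_components=4, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(expression)      # sample loadings on the 4 factors
subtype = W.argmax(axis=1)               # assign each sample to its top factor
print(np.bincount(subtype))              # sizes of the 4 candidate subtypes
```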

Guinney et al. [132] construct a consensus classification of colorectal cancer (CRC), since the translational and clinical utility of gene expression-based subtyping had been hampered by discrepant results across studies. First, the Markov cluster (MCL) algorithm is applied to derive consensus molecular subtypes (CMS) from 6 CRC classification systems. A random forest model is then used to classify new samples into a CMS. The study classifies CRC into 4 consensus molecular subtypes: microsatellite instability immune, canonical, metabolic, and mesenchymal [132]. Each subtype has a clear biological interpretation, which is very important for patient treatment.
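The second stage of the pipeline in [132], assigning new samples to a consensus molecular subtype with a random forest, could look roughly like the following sketch; the synthetic expression matrix, the labels, and the hyperparameters are placeholders, not the authors' settings.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy data: gene expression profiles with known CMS labels (1-4).
    rng = np.random.default_rng(2)
    X = rng.normal(size=(600, 1000))          # samples x genes
    y = rng.integers(1, 5, size=600)          # CMS1..CMS4 labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # Train a random forest on consensus-labelled samples, then
    # classify previously unseen samples into one of the four CMS.
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))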

V-C Disease diagnosis

V-C1 Alzheimer’s disease (AD) diagnosis

Depending on severity, patients are usually diagnosed as normal controls (NC), MCI, or AD. Different materials such as neuroimaging, biospecimen, and genetic data are used to diagnose AD. In previous work, deep learning techniques are extensively used by many researchers to analyze medical imaging and have achieved good performance. Although some studies [136, 137] show that biospecimens and genetic data provide alternatives to neuroimaging in AD diagnosis, such data are rarely used to diagnose AD in practice. Researchers should therefore pay more attention to developing comprehensive, integrative diagnosis methods based on different materials, which are likely to achieve much higher precision.

Analyzing neuroimaging data such as MRI and PET images is a prevalent method for AD diagnosis. Jie et al. [138] propose a manifold regularized multitask feature learning method and classify patients into three categories: AD, MCI, and NC. The algorithm reaches 95.03% accuracy tested on MRI and PET images from ADNI, which contains 202 subjects: 51 AD patients, 99 MCI patients, and 52 NC. Suk et al. [139] utilize sparse regression models as target-level representation learners and build a deep convolutional neural network for AD identification. The algorithm reaches 90.28% accuracy on a baseline MRI dataset of 805 subjects, including 186 AD, 393 MCI, and 226 NC, from the ADNI database. Shi et al. [140] use thin-plate spline (TPS) based nonlinear feature transformation and stacked denoising sparse auto-encoder (DSAE) deep fusion for AD staging analysis, and the approach reaches 91.95% accuracy. The experiment is performed on a subset of MRIs together with their whole-brain masks selected from ADNI, which contains 338 subjects: 94 patients with AD, 121 with MCI, and 123 NC. Shi et al. [141] develop a multi-modal stacked deep polynomial networks (MM-SDPN) algorithm to fuse and learn feature representations from multi-modal neuroimaging data for AD diagnosis, and the approach reaches 97.13% accuracy. Data from ADNI are used, consisting of MRI and PET images from 202 subjects: 51 AD patients, 99 MCI patients, and 52 NC. The performances of different classifiers are summarized in Table X.
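The studies above differ in the feature learners they use (multitask feature learning, sparse regression plus a CNN, DSAE, MM-SDPN), but all ultimately fuse modality-specific features and feed them to a classifier. The following sketch only illustrates that generic fusion step, concatenating precomputed MRI and PET region features and cross-validating an SVM; it is not any of the cited methods, and the synthetic features and binary label setup are assumptions.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Toy subject-level features: e.g., 90 regional gray-matter volumes (MRI)
    # and 90 regional glucose-metabolism values (PET) per subject.
    rng = np.random.default_rng(3)
    n = 202
    mri = rng.normal(size=(n, 90))
    pet = rng.normal(size=(n, 90))
    y = rng.integers(0, 2, size=n)      # 0 = NC, 1 = AD (binary for simplicity)

    # Early fusion: concatenate modality features, standardize, classify.
    X = np.hstack([mri, pet])
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
    print(cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean())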

Several studies indicate that CSF biomarkers provide alternatives to MRI and PET images in AD diagnosis. Hansson et al. [136] indicate that CSF tau/Aβ42 ratios are as accurate as semiquantitative PET image assessment. Mattsson et al. [142] provide evidence that, when identifying early AD, CSF tau and 18F-AV-1451 PET have similar performance, whereas MRI measures have a lower area under the receiver operating characteristic curve (AUROC). Besides, they find that when identifying mild to moderate AD, 18F-AV-1451 PET is superior to CSF tau.

Although genetic information is provided in ADNI, studies using it are limited, mainly because validating correlations between genetic variants and phenotype is nontrivial. However, several studies show that certain genes are related to AD. For instance, Lorenzi et al. [137] identify a link between tribbles pseudokinase 3 (TRIB3) and the stereotypical pattern of gray matter loss in AD.

Reference AD vs. NC (%) MCI vs. NC (%)
ACC SEN SPE ACC SEN SPE
[141] 97.13 95.93 98.53 87.24 97.91 67.04
[140] 91.95 89.49 93.82 83.72 84.74 82.72
[139] 90.28 92.65 89.05 74.20 78.74 66.30
[138] 95.03 94.90 95.00 79.27 85.86 66.54


  • Abbreviation: AD = Alzheimer disease; MCI = mild cognitive impairment; NC = normal controls; ACC = Accuracy; SEN = Sensitivity; SPE = specificity.

TABLE X: Classification result of different algorithms for AD diagnosis
Reference Method Modality Application Data Year
[141] MM-SDPN MRI, PET AD diagnosis ADNI 2018
[136] Comparison CSF, PET Clinical progression prediction ADNI, BioFINDER 2018
[142] Comparison CSF, PET AD diagnosis BioFINDER 2018
[137] Statistical analysis Genetic, PET Genetic underpinnings of AD ADNI 2018
[140] DSAE MRI AD diagnosis ADNI 2017
[139] CNN MRI AD diagnosis ADNI 2016
[138] Laplacian regularizer MRI, PET, CSF AD diagnosis ADNI 2015

  • Abbreviation: CNN = convolutional neural network; DSAE = stacked denoising sparse auto-encoder; MM-SDPN = multimodal stacked deep polynomial networks.

TABLE XI: Overview of papers for AD diagnosis

V-C2 Acute lymphoblastic leukemia (ALL) diagnosis

Microscopic examination of blood or bone marrow smears is the only effective way to diagnose leukemia [89]. Several methods [143, 144, 89, 145] have been proposed to classify cells as cancerous or noncancerous, and most achieve accuracy above 95%. For instance, Abdeldaim et al. [145] present a computer-aided ALL diagnosis system, which first segments each cell in the microscopic images and then classifies each segmented cell as normal or affected. The experiment based on ALL-IDB2 achieves an accuracy of 96.42% with a KNN classifier. Gene data may also be used to diagnose ALL. For instance, by examining gene expression profiles, Willman et al. [96] find clusters that are associated with either specific clinical features or treatment response characteristics in children with high-risk B-precursor ALL.
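As an illustration of the final classification step in a system like [145] (segmentation and feature extraction are assumed to have already happened, and the per-cell features below are synthetic placeholders), a KNN classifier labelling segmented cells as normal or affected might look like this sketch.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    # Toy per-cell features extracted after segmentation, e.g. area,
    # perimeter, circularity, and mean nucleus intensity of each cell.
    rng = np.random.default_rng(4)
    features = rng.normal(size=(260, 4))
    labels = rng.integers(0, 2, size=260)   # 0 = normal, 1 = lymphoblast

    # k-nearest-neighbor classification of segmented cells.
    knn = KNeighborsClassifier(n_neighbors=5)
    print(cross_val_score(knn, features, labels, cv=5).mean())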

Reference Method Modality Application Data Year
[145] K-NN JPG Image segmentation ALL-IDB 2018
[89] Fuzzy C Means JPG Image segmentation Taken from Google 2017
[144] K-Means JPG ALL classification Isfahan Al-Zahra and Omid hospital 2015
[143] Ensemble classifier JPG ALL classification Ispat General Hospital 2014
[96] Cluster TXT ALL feature selection TARGET 2013

TABLE XII: Overview of papers for ALL diagnosis

V-C3 Breast Cancer Diagnosis

Machine learning approaches have been extensively used in the diagnosis of breast cancer. Researchers have focused on devising better algorithms to automate the detection of cancerous cells. Table XIII summarizes state-of-the-art algorithms for diagnosing breast cancer.

Reference Technique Modality Application DB Year
[146] GAN MG Mass segmentation INbreast, DDSM-BCRP 2018
[147] MIL-CNN MG Lesion classification INbreast 2017
[148] CNN MG Lesion classification INbreast, DDSM 2017
[149] CNN MG Semi-supervised CNN for classification of masses FFDM 2017
[150] CNN MRI Breast and fibro glandular tissue segmentation Self-produced data 2017
[151] CNN MG Detection of cardiovascular disease based on vessel calcification Self-produced data 2017
[152] M-CNN H&E Mitosis detection AMIDA 2016
[153] SAE US, CT Lesion classification Self-produced data 2016
[154] CAE MG Breast density segmentation, breast cancer risk scoring Self-produced data 2016
[155] RBM US Lesion classification Self-produced data 2016
[156] CNN TS Mass detection DDSM 2016
[157] CNN MG Lesion classification BCDR 2016
[158] CNN MG Tissue classification using regular CNNs Self-produced data 2016
[159] CNN MG Lesion classification INbreast 2016
[160] CNN MG Mass localization TCGA 2016
[161] CNN MG Mass classification Collected from University of Chicago Medical Center 2016
[162] CNN MG Lesion classification Self-produced data 2016
[163] CNN MG Cancer risk score Self-produced data 2016
[164] CNN TS Micro calcification detection Collected from the University of Michigan 2016
[165] CNN TS Transfer mammographic masses to tomosynthesis Self-produced data 2016
[166] CNN MG Mass classification INbreast 2016
[167] CNN MG Lesion classification MIAS 2015
[168] ADN MG,US Mass classification Self-produced data 2012

  • Abbreviation: M-CNN= Multi-Stream CNN; MIL-CNN= Multi-instance Learning CNN; CAE= Convolutional Auto-Encoders; SAE= Stacked Auto-Encoders; H&E= Hematoxylin & Eosin Histology Images; MG= Mammography; US= Ultrasound; CT= Computed Tomography.

TABLE XIII: Summary of breast cancer diagnosis.

Huynh et al. [161] learn features on mammography images using a CNN, and an SVM model is used to classify the derived features into three categories: benign, cystic, and malignant. They apply their algorithm to a dataset that includes 607 breast images and obtain an AUC (Area Under the Curve) of 86% [161].
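The two-stage design in [161], extracting features with a pretrained CNN and classifying them with an SVM, can be sketched roughly as follows; extract_cnn_features is a hypothetical placeholder for whatever pretrained network is used, and the images and labels are synthetic stand-ins.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def extract_cnn_features(images):
        """Hypothetical placeholder: run a pretrained CNN and return one
        fixed-length feature vector per mammographic region of interest."""
        return np.array([img.mean(axis=(0, 1)) for img in images])  # stand-in

    # Toy "images": small RGB patches standing in for mammographic ROIs.
    rng = np.random.default_rng(5)
    images = rng.random((120, 64, 64, 3))
    labels = rng.integers(0, 3, size=120)   # benign / cystic / malignant

    # Stage 1: CNN features; stage 2: SVM trained on the frozen features.
    X = extract_cnn_features(images)
    svm = SVC(kernel='linear')
    print(cross_val_score(svm, X, labels, cv=5).mean())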

Wang et al. [151] develop a 12-layer CNN for breast arterial calcification (BAC) detection. The calcification detection and diagnosis algorithm yields 96.24% accuracy, and the inferred calcification regions are close to the ground truth [151].

Sun et al. [149] present a graph-based semi-supervised learning method using a CNN to diagnose breast cancer. They obtain an AUC of 88.18% on a dataset that contains both labeled and unlabeled data [149].

Kooi et al. [148] develop a computer-aided diagnosis technique to discriminate benign solitary cysts from malignant masses in digital mammograms. The algorithm obtains an AUC of 87%, outperforming the other algorithms compared [148].

Zhu et al. [147] propose deep end-to-end networks for lesion classification on digital mammography images, and explore three different schemes for building deep CNNs for whole-mammogram diagnosis [147]. They apply their algorithm to the INbreast dataset, and the experimental results show the robustness of their networks [147].

So far, mammography has been analyzed by much previous work with deep learning algorithms [10], but little previous work analyzes breast MRI, US, or digital breast tomosynthesis. In the near future, these other modalities will probably receive much attention [10].

Because most large digital datasets are not freely available, many research efforts apply their algorithms to stale, small databases, which results in unreliable AUC values [149]. Much previous work has tackled this issue by employing semi-supervised learning [149], weakly supervised learning [160], and transfer learning [148].

V-D Drug discovery

Target identification and validation. For a disease, determining its target is a challenging and time-consuming task. Lamb et al. [169] create the first reference collection of gene expression profiles derived from cultured human cells treated with biologically active small molecules, together with pattern-matching software to mine these data. Madhukar et al. [170] develop a platform that integrates multiple data types into a Bayesian machine learning framework to predict the targets and mechanisms of small molecules. Using publicly available data, their platform (BANDIT) achieves approximately 90% accuracy on more than 2,000 different small molecules, significantly better than any other published target identification platform.
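The BANDIT platform itself is not something we can reproduce here, but the general idea of combining evidence from multiple data types in a probabilistic fashion can be caricatured with a naive Bayes classifier over per-data-type similarity scores; everything in this sketch (the feature types and synthetic scores) is an assumption for illustration only.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    # Toy evidence: for each (small molecule, candidate target) pair, one
    # similarity score per data type, e.g. growth inhibition, gene expression,
    # chemical structure, and bioassay overlap with annotated drugs.
    rng = np.random.default_rng(6)
    scores = rng.random((2000, 4))
    shares_target = rng.integers(0, 2, size=2000)   # 1 = same target

    # Naive Bayes combines the per-data-type evidence into one prediction.
    nb = GaussianNB()
    print(cross_val_score(nb, scores, shares_target, cv=5, scoring='roc_auc').mean())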

Retrosynthetic analysis. Retrosynthetic analysis has a huge search space, which is its biggest challenge in drug discovery. Law et al. [171] design a retrosynthetic analysis tool based on automated retrosynthetic rule generation, but the algorithm is limited in both efficiency and effectiveness. Inspired by AlphaGo, Segler et al. [110] use Monte Carlo tree search (MCTS) and symbolic artificial intelligence to discover retrosynthetic routes. The Waller team [110] integrates concepts such as deep neural networks and reinforcement learning into a common architecture and proposes an algorithmic framework that combines three different neural networks with MCTS (3N-MCTS). The deep neural networks predict which molecules will participate in a reaction, and the Monte Carlo tree search predicts the likelihood of a reaction. Compared with traditional rule-based retrosynthesis analysis, this work borrows many ideas from deep neural networks and reinforcement learning, an important improvement over traditional methods. The 3N-MCTS method is applied to the Reaxys database: chemical reactions recorded before 2015 are used as training data, and reactions recorded after 2015 are used as test data. Compared with two traditional approaches, heuristic BFS and neural BFS, 3N-MCTS solves 87.12% of the test-set problems, while neural BFS solves 45.6% and heuristic BFS solves 84.24% [171] [110].
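To make the MCTS part of 3N-MCTS less abstract, here is a generic, heavily simplified Monte Carlo tree search skeleton (UCB1 selection, random rollout) on a toy state; in the actual method [110] the expansion and rollout policies are neural networks trained on reaction data, which this sketch does not attempt to model.

    import math, random

    class State:
        """Toy 'retrosynthesis' stand-in: reach 0 from n by steps of 1 or 2."""
        def __init__(self, n): self.n = n
        def actions(self): return [1, 2] if self.n > 0 else []
        def step(self, a): return State(self.n - a)
        def is_terminal(self): return self.n <= 0
        def reward(self): return 1.0 if self.n == 0 else 0.0

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children, self.visits, self.value = {}, 0, 0.0

    def ucb(child, parent, c=1.4):
        return (child.value / (child.visits + 1e-9)
                + c * math.sqrt(math.log(parent.visits + 1) / (child.visits + 1e-9)))

    def mcts(root_state, iterations=200):
        root = Node(root_state)
        for _ in range(iterations):
            node = root
            # 1. Selection: descend while the node is fully expanded.
            while node.children and len(node.children) == len(node.state.actions()):
                node = max(node.children.values(), key=lambda ch: ucb(ch, node))
            # 2. Expansion: add one untried child if possible.
            untried = [a for a in node.state.actions() if a not in node.children]
            if untried:
                a = random.choice(untried)
                node.children[a] = Node(node.state.step(a), parent=node)
                node = node.children[a]
            # 3. Rollout: random play to a terminal state.
            s = node.state
            while not s.is_terminal():
                s = s.step(random.choice(s.actions()))
            reward = s.reward()
            # 4. Backpropagation.
            while node:
                node.visits += 1
                node.value += reward
                node = node.parent
        # Return the most-visited first action (the best first "reaction step").
        return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

    print(mcts(State(7)))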

Reference Technique Application DB Year
[170] Bayesian machine learning framework target identification and validation public data 2017
[171] automated retrosynthetic rule generation retrosynthetic analysis MOS 2009
[110] MCTS, DNN retrosynthetic analysis Reaxys chemistry database 2018
[108] Multitask networks drug screening 259 datasets 2015
[172] DCNN bioactivity prediction ChEMBL, DUDE 2015
  • Abbreviation: MOS = Accelrys and the Beilstein Crossfire reaction database; MCTS = Monte Carlo tree search; DNN = deep neural network; DCNN = deep convolutional neural network.

TABLE XIV: Summary of state-of-the-art work for drug discovery

Drug screening. In drug screening, virtual screening has become a hot topic because of the high time overhead and cost of high-throughput screening. Researchers at Google and Stanford [108] develop virtual screening techniques using deep learning to replace or enhance traditional high-throughput screening processes and to increase the speed and success rate of screening. By applying deep learning, researchers can share information across numerous experiments on multiple targets. They obtain data from 259 public datasets and divide the data into four groups; 5-fold cross-validation on the PCBA group achieves an AUC of 0.873, and the 5-fold cross-validation AUC scores on the MUV and Tox21 groups reach 0.841 and 0.818, respectively. In drug screening, it is also important to predict the binding affinity between molecules. Wallach et al. [172] use AtomNet, a structure-based deep convolutional neural network, to predict bioactivity. They evaluate the accuracy of the model on the well-known Directory of Useful Decoys Enhanced (DUDE) benchmark, and AtomNet reaches or exceeds 0.9 AUC on 59 targets. These two research efforts [108] [172] use state-of-the-art deep learning techniques to effectively reduce the cost of drug screening and improve accuracy.
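The evaluation protocol described here, 5-fold cross-validated AUC over grouped assay datasets, is easy to reproduce in outline; the sketch below (with a synthetic dataset and a generic classifier, not the multitask network of [108]) shows only the mechanics of that protocol.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Toy binary bioactivity data: molecular fingerprints vs. active/inactive.
    rng = np.random.default_rng(7)
    X = rng.integers(0, 2, size=(5000, 1024)).astype(float)  # 1024-bit fingerprints
    y = rng.integers(0, 2, size=5000)

    # 5-fold cross-validated AUC, as reported for the PCBA/MUV/Tox21 groups.
    clf = LogisticRegression(max_iter=1000)
    auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
    print(auc.mean())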

In short, each step in drug discovery has relatively mature algorithms and platforms [173]. We summarize the related algorithms in Table XIV.

VI Are there any benchmarks for measuring algorithms and systems for big medical data?

Benchmarks are the foundation of system and algorithm design, optimization, and evaluation. Many algorithms and systems have been widely used in medical domains. To explore processing efficiency and tackle potential bottlenecks, domain-specific benchmarks are essential for the development and optimization of a domain. However, in the big medical data domain, the diversity of disease taxonomies, the complexity of clinician research tasks, and the heterogeneity of medical data pose great challenges to constructing a comprehensive and fair benchmark.

An ideal medical benchmark should cover a broad spectrum of big medical data processing. To cover data diversity, a benchmark should include electronic health records, laboratory data, and QS data. To assure algorithm diversity, a benchmark should include not only traditional machine learning algorithms but also deep learning algorithms.

At present, there is no comprehensive big medical data benchmark suite. The previous benchmarking efforts only cover limited perspectives of big medical data. Table XV compares different medical benchmarks from the perspectives of application domain, data type, data size, algorithm, system, metric and publishing year.

Reference Application Domain Data Type Data Size Algorithm System Metric Year
 [9] Fall detection sensor data 9,379 files artificial neural network, k-nearest neighbor, support vector machine, and kernel Fisher discriminant none Accuracy, Specificity, Sensitivity 2018
 [174] Abnormality detection in musculoskeletal radiographs image data 40,561 images 169-layer DenseNet baseline model none AUC, Specificity, Sensitivity 2017
 [175] Gene unstructured text data gene sequence: 20MB-7GB, gene assembly: 100MB-13GB offline analysis Work Queue, MPI system and architecture metrics 2016
 [91] Alzheimer's Disease (AD) prediction Clinical data; Genotype data; Magnetic resonance image data 767 training samples for question one; 176 CN samples for question two; 628 training samples for question three AD prediction algorithms none Balanced accuracy, AUC 2016
 [176] Blood transfusion process none none none none none 2010
 [177] Biological image analysis Image (TIFF format) 9 datasets, 4,073 images WND-CHARM multi-purpose image classification none Accuracy 2008
 [178] Nucleotide or protein sequences Nucleotide, Peptide 209,775,348 loci, 263,957,884,539 bases, from 209,775,348 reported sequences parallel basic local alignment search MPI execution time 2003
 [179] Genome assembly Text 4 datasets, 41,861,131 reads candidate filtering and alignment Work Queue none -
TABLE XV: Overview of medical data benchmarks.

Liu et al. [9] propose a benchmark database for fall detection. This database [9] collects data from 50 male and female subjects ranging from 21 to 60 years of age, 1.55 to 1.90 m in height, and 40 to 85 kg in weight. They use four baseline algorithms (ANN, KNN, SVM, and kernel Fisher discriminant) to evaluate the reliability of the database compared with previous ones.
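A minimal version of the baseline evaluation in [9], reporting accuracy, sensitivity, and specificity for several classic classifiers on windowed wearable-sensor features, might look like the sketch below; the synthetic features and the specific classifier settings stand in for the real accelerometer data and the authors' configurations.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier

    # Toy windowed features (e.g., mean/std/energy of tri-axial acceleration).
    rng = np.random.default_rng(8)
    X = rng.normal(size=(2000, 12))
    y = rng.integers(0, 2, size=2000)    # 1 = fall, 0 = activity of daily living

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    for name, clf in [("KNN", KNeighborsClassifier()),
                      ("SVM", SVC()),
                      ("ANN", MLPClassifier(max_iter=500))]:
        clf.fit(X_tr, y_tr)
        tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
        acc = (tp + tn) / (tp + tn + fp + fn)
        sen = tp / (tp + fn)    # sensitivity: detected falls
        spe = tn / (tn + fp)    # specificity: correctly kept non-falls
        print(f"{name}: ACC={acc:.3f} SEN={sen:.3f} SPE={spe:.3f}")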

MURA [174] is a benchmark database of musculoskeletal radiographs containing 40,561 multi-view radiographic images collected from 12,173 patients, with a total of 14,863 studies covering seven study types—elbow, finger, forearm, hand, humerus, shoulder, and wrist. Each study is labelled as normal or abnormal manually.

ChestX-ray8 [180] is a hospital-scale chest X-ray database with benchmarks on weakly-supervised classification and localization of common thorax diseases. It comprises 108,948 frontal-view X-ray images of 32,717 unique patients, with eight disease labels text-mined from the associated radiological reports using natural language processing [180].

AD DREAM Challenge [91] is a benchmark suite aiming to evaluate state-of-the-art algorithms for predicting AD based on high-dimensional, publicly available genetic and structural imaging data. The training data consist of individuals participating in the Alzheimer's Disease Neuroimaging Initiative (ADNI) [181].

Christov et al. [176] present a medical benchmark based on a blood transfusion process. The benchmark [176] consists of a blood transfusion process definition, a set of properties or requirements, and a set of bindings between the blood transfusion properties and process definition.

IICBU 2008 [177] is a benchmark suite for biological image analysis. It [177] provides biological image datasets and a set of practical, real-life imaging problems in biology, including examples of organelles, cells, and tissues, which can be used to evaluate different biological image analysis methods.

BigDataBench [182, 175] provides a big data and AI benchmark covering the search engine, e-commerce, social network, multimedia, and bioinformatics domains. For bioinformatics, it provides two workloads, SAND and BLAST. SAND [179] is a set of modules for genome assembly; it performs distributed computing and is easily deployed on large-scale clusters, clouds, or grids. It consists of two steps: candidate filtering and alignment. For datasets, it provides genome sequence data at three scales: small, medium, and large. BLAST [178] is a parallel basic local alignment search workload, used to compare nucleotide or protein sequences against a database and find similarities.

VII What is the performance gap of state-of-the-practice and state-of-the-art systems for handling big medical data currently or in the future?

Many state-of-the-art and state-of-the-practice systems have been widely used for big medical data. However, the characteristics of medical data pose great challenges to both data storage and processing.

Multiple medical data processing systems are proposed to handle large-scale medical data. For medical imaging data analysis, PAIS (Pathology Analytical Imaging Standards) [183] provides a data model to manage image data, using a spatial DBMS based architecture. Hadoop-GIS [184] adopts a MapReduce-based solution to support complex queries for analytical pathology imaging. For genome data analysis, GATK [185] is a MapReduce-based genome analysis toolkit for analyzing next-generation DNA sequencing data. IMG [186] is an integrated microbial genomes database and comparative analysis system. For medical text data, Neamatullah et al. [187] provide a system to process text-based patient medical records.

However, the sources and types of medical data are usually multifarious and integrated. For example, the electronic health record (EHR) dataset used in a recent research effort [85] has thousands of feature dimensions and contains patient demographics, provider orders, diagnoses, procedures, medications, laboratory values, vital signs, and flowsheet data [85], covering image, text, structured, and unstructured data types and sources. These dimensions are not processed and learned individually; rather, they are combined to detect and diagnose diseases cooperatively. Under this circumstance, storage and processing systems are required to integrate different data sources and types. Existing systems targeting a specific data type, such as medical images, cannot process other data types. To the best of our knowledge, there exists no system that supports multi-source and heterogeneous data storage and processing in the big medical domain. Previous work [85] adopts the FHIR (Fast Healthcare Interoperability Resources) standard as the data structure for its heterogeneous data. However, considering the diversity of disease taxonomies and the variety of medical data, adopting a new data structure or building a new system for one or a few data types or sources only provides a case-by-case solution. Hence, there is an urgent need for a new comprehensive system that satisfies the processing requirements of big medical data.
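To illustrate why a common representation such as FHIR helps, the sketch below shows, in schematic and non-normative form, how heterogeneous items for one patient (a diagnosis, a laboratory value, and a medication order) can be expressed as a single list of typed resources that one pipeline can consume; the field values are invented and this is not a complete FHIR implementation.

    import json

    # Schematic FHIR-like resources for one patient (values are invented).
    patient_bundle = [
        {"resourceType": "Patient", "id": "pat-001",
         "gender": "female", "birthDate": "1952-04-17"},
        {"resourceType": "Condition", "subject": "pat-001",
         "code": {"text": "Type 2 diabetes mellitus"}, "onsetDateTime": "2010-06-01"},
        {"resourceType": "Observation", "subject": "pat-001",
         "code": {"text": "Hemoglobin A1c"}, "valueQuantity": {"value": 7.8, "unit": "%"}},
        {"resourceType": "MedicationRequest", "subject": "pat-001",
         "medication": {"text": "Metformin 500 mg"}, "authoredOn": "2018-03-02"},
    ]

    # A single, uniform pass over heterogeneous data types becomes possible
    # because every item carries its type and its link to the patient.
    events = [(r["resourceType"], r.get("onsetDateTime") or r.get("authoredOn"))
              for r in patient_bundle if r["resourceType"] != "Patient"]
    print(json.dumps(events, indent=2))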

Nowadays, deep learning systems, e.g., TensorFlow [188] and Caffe [189], are used to handle medical data. However, since these systems are general-purpose deep learning frameworks without specific optimizations for big medical data, they are inefficient for complex clinician research tasks. As illustrated above, the EHR dataset [85] has thousands of correlated feature dimensions covering image, text, structured, and unstructured data types and sources, so researchers not only need to conduct time-consuming data conversion to fit the input requirements of these systems, but also need to correlate the data manually to gather all of the data for each patient.

VIII Are we ready for working together?

The trend toward utilizing computing technologies in medical science is inevitable for two reasons: on one hand, a large amount of high-quality data has accumulated in the medical field with fast-paced medical development; on the other hand, machine learning techniques perform well on routine tasks, and researchers from both fields have already made progress in many aspects of medicine, e.g., automatic disease diagnosis.

However, we notice that the wide practical application of machine learning techniques in clinical therapy is limited by several significant problems. First, the development of data-driven AI for clinical medicine has been restricted by the low degree of digitization in clinical practice and the difficulty of sharing data. Second, clinical practice covers exceedingly complex scenarios, while current machine learning techniques only perform well in scenarios whose conditions are described explicitly. Thus, we believe several prerequisites should be met for effective and efficient multidisciplinary cooperation. First, the degree of digitization in clinical practice and the sharing of clinical data should be improved, because these two issues limit computer experts from deepening their understanding of clinical medicine. Second, unified criteria for applying AI and other machine learning techniques to clinical tasks should be established to help experts in both fields validate and share knowledge with each other; benchmarks will play a unique role in establishing those criteria. Third, multidisciplinary education spanning computing and medical sciences should be promoted to provide sufficient talent with comprehensive abilities.

IX Conclusion

In this paper, we, a group of life scientists, clinicians, computer scientists, and engineers, perform a comprehensive survey on big medical data, which is heterogeneous, high-dimensional, embodies a large mixture of signals and errors, and differs significantly from data in other domains. We investigate the prioritized tasks in clinician practices and research: quantified self (QS), disease classification, disease diagnosis, and drug discovery. We find that the publicly available data sets that can be utilized for those tasks are limited not only in scale but also by their single data sources. Previous work demonstrates the potential of incorporating machine learning techniques into clinician practices; however, its high accuracy is achieved on static data. In reality, clinician practitioners work in open environments, so we need to set up realistic benchmarks that mimic the way clinician practitioners handle data for different medical purposes. The sources and types of medical data are usually multifarious and integrated; these dimensions are not processed and learned individually but are combined to detect and diagnose diseases cooperatively. To the best of our knowledge, there exists no system that supports multi-source and heterogeneous data storage and processing in the big medical domain. Finally, we discuss several prerequisites for effective and efficient cooperation among life scientists, clinicians, computer scientists, and engineers.

Acknowledgment

This work is supported by the National Key Research and Development Plan of China (Grant No. 2016YFB1000600, 2016YFB1000605, and 2016YFB1000601).

References

  • [1] N. P. Tatonetti, “Translational medicine in the age of big data,” Briefings in bioinformatics, 2017.
  • [2] V. Marx, “Biology: The big challenges of big data,” 2013.
  • [3] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson, “Big data: astronomical or genomical?” PLoS biology, vol. 13, no. 7, p. e1002195, 2015.
  • [4] http://adni.loni.usc.edu.
  • [5] O. R. Shishvan, D.-S. Zois, and T. Soyata, “Machine intelligence in healthcare and medical cyber physical systems: A survey,” IEEE Access, vol. 6, pp. 46 419–46 494, 2018.
  • [6] L. Lévêque, H. Bosmans, L. Cockmartin, and H. Liu, “State of the art: Eye-tracking studies in medical imaging,” IEEE Access, vol. 6, pp. 37 023–37 034, 2018.
  • [7] M. I. Razzak, S. Naz, and A. Zaib, “Deep learning for medical image processing: Overview, challenges and the future,” in Classification in BioApps.   Springer, 2018, pp. 323–350.
  • [8] J. Ker, L. Wang, J. Rao, and T. Lim, “Deep learning applications in medical image analysis,” IEEE Access, vol. 6, pp. 9375–9389, 2018.
  • [9] Z. Liu, Y. Cao, L. Cui, J. Song, and G. Zhao, “A benchmark database and baseline evaluation for fall detection based on wearable sensors for the internet of medical things platform,” IEEE Access, 2018.
  • [10] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  • [11] R. Fang, S. Pouyanfar, Y. Yang, S.-C. Chen, and S. Iyengar, “Computational health informatics in the big data age: A survey,” ACM Computing Surveys (CSUR), vol. 49, no. 1, p. 12, 2016.
  • [12] H. Kashyap, H. A. Ahmed, N. Hoque, S. Roy, and D. K. Bhattacharyya, “Big data analytics in bioinformatics: architectures, techniques, tools and issues,” Network Modeling Analysis in Health Informatics and Bioinformatics, vol. 5, no. 1, p. 28, 2016.
  • [13] R. Sharma, S. N. Singh, and S. Khatri, “Medical data mining using different classification and clustering techniques: a critical survey,” in Computational Intelligence & Communication Technology (CICT), 2016 Second International Conference on.   IEEE, 2016, pp. 687–691.
  • [14] A. Gani, A. Siddiqa, S. Shamshirband, and F. Hanum, “A survey on indexing techniques for big data: taxonomy and performance evaluation,” Knowledge and Information Systems, vol. 46, no. 2, pp. 241–284, 2016.
  • [15] S. R. Islam, D. Kwak, M. H. Kabir, M. Hossain, and K.-S. Kwak, “The internet of things for health care: a comprehensive survey,” IEEE Access, vol. 3, pp. 678–708, 2015.
  • [16] A. Alyass, M. Turcotte, and D. Meyre, “From big data analysis to personalized medicine for all: challenges and opportunities,” BMC medical genomics, vol. 8, no. 1, p. 33, 2015.
  • [17] N. H. G. R. Institute, “A brief guide to genomics,” https://www.genome.gov/19016904/, Retrieved 2018-08-19.
  • [18] wikipedia, “Omics,” https://en.wikipedia.org/wiki/Omics, Retrieved 2018-08-19.
  • [19] M. Blankenburg, L. Haberland, H.-D. Elvers, C. Tannert, and B. Jandrig, “High-throughput omics technologies: potential tools for the investigation of influences of emf on biological systems,” Current genomics, vol. 10, no. 2, pp. 86–92, 2009.
  • [20] M. L. Metzker, “Sequencing technologies – the next generation,” Nature reviews genetics, vol. 11, no. 1, p. 31, 2010.
  • [21] E. E. Schadt, S. Turner, and A. Kasarskis, “A window into third-generation sequencing,” Human molecular genetics, vol. 19, no. R2, pp. R227–R240, 2010.
  • [22] M. Altaf-Ul-Amin, F. M. Afendi, S. K. Kiboi, and S. Kanaya, “Systems biology in the context of big data and networks,” BioMed research international, vol. 2014, 2014.
  • [23] Wikipedia, “Comparative medicine,” https://en.wikipedia.org/wiki/Comparative_medicine, Retrieved 2018-08-19.
  • [24] J. Macy and T. L. Horvath, “Focus: Comparative medicine: Comparative medicine: An inclusive crossover discipline,” The Yale journal of biology and medicine, vol. 90, no. 3, p. 493, 2017.
  • [25] R. Mirnezami, J. Nicholson, and A. Darzi, “Preparing for precision medicine,” New England Journal of Medicine, vol. 366, no. 6, pp. 489–491, 2012.
  • [26] K. Jain, “Personalized medicine.” Current opinion in molecular therapeutics, vol. 4, no. 6, pp. 548–558, 2002.
  • [27] D. Gomez-Cabrero, I. Abugessaisa, D. Maier, A. Teschendorff, M. Merkenschlager, A. Gisel, E. Ballestar, E. Bongcam-Rudloff, A. Conesa, and J. Tegnér, “Data integration in the era of omics: current and future challenges,” 2014.
  • [28] P. J. Fleming and J. J. Wallace, “How not to lie with statistics: the correct way to summarize benchmark results,” Communications of the ACM, vol. 29, no. 3, pp. 218–221, 1986.
  • [29] R. Chen, G. I. Mias, J. Li-Pook-Than, L. Jiang, H. Y. Lam, R. Chen, E. Miriami, K. J. Karczewski, M. Hariharan, F. E. Dewey et al., “Personal omics profiling reveals dynamic molecular and medical phenotypes,” Cell, vol. 148, no. 6, pp. 1293–1307, 2012.
  • [30] W. W. Soon, M. Hariharan, and M. P. Snyder, “High-throughput sequencing for biology and medicine,” Molecular systems biology, vol. 9, no. 1, p. 640, 2013.
  • [31] C. Goble and R. Stevens, “State of the nation in data integration for bioinformatics,” Journal of biomedical informatics, vol. 41, no. 5, pp. 687–693, 2008.
  • [32] R. J. Loos and E. E. Schadt, “This I believe: gaining new insights through integrating “old” data,” Frontiers in genetics, vol. 3, p. 137, 2012.
  • [33] R. Clarke, H. W. Ressom, A. Wang, J. Xuan, M. C. Liu, E. A. Gehan, and Y. Wang, “The properties of high-dimensional data spaces: implications for exploring gene and protein expression data,” Nature Reviews Cancer, vol. 8, no. 1, p. 37, 2008.
  • [34] “Wolfqs,” http://antephase.com/quantifiedself.
  • [35] M. Almalki, K. Gray, and F. M. Sanchez, “The use of self-quantification systems for personal health information: big data management activities and prospects,” Health information science and systems, vol. 3, no. S1, p. S1, 2015.
  • [36] R. Meleady, D. Abrams, J. Van de Vyver, T. Hopthrow, L. Mahmood, A. Player, R. Lamont, and A. C. Leite, “Surveillance or self-surveillance? behavioral cues can increase the rate of drivers’ pro-environmental behavior at a long wait stop,” Environment and behavior, vol. 49, no. 10, pp. 1156–1172, 2017.
  • [37] S. Wolfram, “The personal analytics of my life,” Stephen Wolfram blog, 2012.
  • [38] “Wikiqs,” https://en.wikipedia.org/wiki/Quantified_self.
  • [39] Q. Zhu, D. Hu, and Y. Zhang, “Current applications of quantified self in the health service,” Library forum, vol. 38, no. 2, pp. 17–21, 2018.
  • [40] J. Van den Bulck, “Sleep apps and the quantified self: blessing or curse?” Journal of sleep research, vol. 24, no. 2, pp. 121–123, 2015.
  • [41] S. S. Coughlin and J. Stewart, “Use of consumer wearable devices to promote physical activity: a review of health intervention studies,” Journal of environment and health sciences, vol. 2, no. 6, 2016.
  • [42] H. Liu, R. Li, S. Liu, S. Tian, and J. Du, “Smartcare: Energy-efficient long-term physical activity tracking using smartphones,” Tsinghua Science and Technology, vol. 20, no. 4, pp. 348–363, 2015.
  • [43] M. A. Barrett, O. Humblet, R. A. Hiatt, and N. E. Adler, “Big data and disease prevention: from quantified self to quantified communities,” Big data, vol. 1, no. 3, pp. 168–175, 2013.
  • [44] J. J. Oresko, Z. Jin, J. Cheng, S. Huang, Y. Sun, H. Duschl, and A. C. Cheng, “A wearable smartphone-based platform for real-time cardiovascular disease detection via electrocardiogram processing,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 3, pp. 734–740, 2010.
  • [45] F. Axisa, P. M. Schmitt, C. Gehin, G. Delhomme, E. McAdams, and A. Dittmar, “Flexible technologies and smart clothing for citizen medicine, home healthcare, and disease prevention,” IEEE Transactions on information technology in biomedicine, vol. 9, no. 3, pp. 325–336, 2005.
  • [46] L. Allet, R. H. Knols, K. Shirato, and E. D. d. Bruin, “Wearable systems for monitoring mobility-related activities in chronic disease: a systematic review,” Sensors, vol. 10, no. 10, pp. 9026–9052, 2010.
  • [47] B. G. Celler, N. H. Lovell, J. Basilakis et al., “Using information technology to improve the management of chronic disease,” Medical Journal of Australia, vol. 179, no. 5, pp. 242–246, 2003.
  • [48] S. Yan and D. Qi-rui, “Research on the relationship between wearable devices and the health care industry and analysis on the development trend,” China Digital Medicine, vol. 10, no. 8, pp. 25–28, 2015.
  • [49] S. A. Bernard, T. W. Gray, M. D. Buist, B. M. Jones, W. Silvester, G. Gutteridge, and K. Smith, “Treatment of comatose survivors of out-of-hospital cardiac arrest with induced hypothermia,” New England Journal of Medicine, vol. 346, no. 8, pp. 557–563, 2002.
  • [50] S. Wenhui, W. Yi, W. Bo, Z. Xianbo, and X. Liqun, “Analysis of the application of wearable and portable devices in healthcare sector,” China Internet, vol. 12, no. 08, pp. 26–32, 2015.
  • [51] O. Fey, A. El-Banayosy, L. Arosuglu, H. Posival, and R. Körfer, “Out-of-hospital experience in patients with implantable mechanical circulatory support: present and future trends,” European Journal of Cardio-Thoracic Surgery, vol. 11, no. Supplement, pp. S51–S53, 1997.
  • [52]

    M. Swan, “The quantified self: Fundamental disruption in big data science and biological discovery,”

    Big Data, vol. 1, no. 2, pp. 85–99, 2013.
  • [53] C. Hogg, “Value of a qs data commons (and data standards) for personal health data. 100plus 2012,” 2013.
  • [54] P. B. Shull, W. Jirattigalachote, M. A. Hunt, M. R. Cutkosky, and S. L. Delp, “Quantified self and human movement: a review on the clinical impact of wearable sensing and feedback for gait analysis and intervention,” Gait & posture, vol. 40, no. 1, pp. 11–19, 2014.
  • [55] M. Bachlin, M. Plotnik, D. Roggen, I. Maidan, J. M. Hausdorff, N. Giladi, and G. Troster, “Wearable assistant for parkinson’s disease patients with the freezing of gait symptom,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 2, pp. 436–446, 2010.
  • [56] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, “Activity recognition using cell phone accelerometers,” ACM SigKDD Explorations Newsletter, vol. 12, no. 2, pp. 74–82, 2011.
  • [57] J. W. Lockhart, G. M. Weiss, J. C. Xue, S. T. Gallagher, A. B. Grosner, and T. T. Pulickal, “Design considerations for the wisdm smart phone-based sensor mining architecture,” in Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data.   ACM, 2011, pp. 25–33.
  • [58] A. Bulling, U. Blanke, and B. Schiele, “A tutorial on human activity recognition using body-worn inertial sensors,” ACM Computing Surveys (CSUR), vol. 46, no. 3, p. 33, 2014.
  • [59] “Nhanes,” https://www.cdc.gov/nchs/nhanes/.
  • [60] “Openhumans,” https://www.openhumans.org/.
  • [61] W. H. Organization et al., “Manual of the international statistical classification of diseases, injuries, and causes of death: based on the recommendations of the seventh revision conference, 1955, and adopted by the ninth world health assembly under the who nomenclature regulations,” 1957.

  • [62] N. R. Council et al., Toward precision medicine: building a knowledge network for biomedical research and a new taxonomy of disease.   National Academies Press, 2011.
  • [63] K. Wang, H. Gaitsch, H. Poon, N. J. Cox, and A. Rzhetsky, “Classification of common human diseases derived from shared genetic and environmental determinants,” Nature genetics, vol. 49, no. 9, p. 1319, 2017.
  • [64] J. Bell, “The new genetics: The new genetics in clinical practice,” BMJ: British Medical Journal, vol. 316, no. 7131, p. 618, 1998.
  • [65] J. Park, B. J. Hescott, and D. K. Slonim, “Towards a more molecular taxonomy of disease,” Journal of biomedical semantics, vol. 8, no. 1, p. 25, 2017.
  • [66] R. Mirnezami, J. Nicholson, and A. Darzi, “Preparing for precision medicine,” New England Journal of Medicine, vol. 366, no. 6, pp. 489–491, 2012.
  • [67] “Australian pancreatic cancer genome initiative,” http://www.pancreaticcancer.net.au, accessed August 25, 2018.
  • [68] I. Lappalainen, J. Almeida-King, V. Kumanduri, A. Senf, J. D. Spalding, G. Saunders, J. Kandasamy, M. Caccamo, R. Leinonen, B. Vaughan et al., “The european genome-phenome archive of human data consented for biomedical research,” Nature genetics, vol. 47, no. 7, p. 692, 2015.
  • [69] “International genome cancer consortium,” https://icgc.org/, accessed October 11, 2018.
  • [70] “The cancer genome atlas,” https://cancergenome.nih.gov/, accessed August 25, 2018.
  • [71] “Gene expression omnibus,” https://www.ncbi.nlm.nih.gov/geo/, accessed October 11, 2018.
  • [72] “The broad institute of mit & harvard,” http://truvenhealth.com/markets/life-sciences/products/data-tools/marketscan-databases, accessed August 25, 2018.
  • [73] “The sanger institute,” https://www.sanger.ac.uk/, accessed October 11, 2018.
  • [74] “Ibm,” https://cancergenome.nih.gov/, accessed August 25, 2018.
  • [75] R. Akbani, P. K. S. Ng, H. M. Werner, M. Shahmoradgoli, F. Zhang, Z. Ju, W. Liu, J.-Y. Yang, K. Yoshihara, J. Li et al., “A pan-cancer proteomic perspective on the cancer genome atlas,” Nature communications, vol. 5, p. 3887, 2014.
  • [76] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, C. G. A. R. Network et al., “The cancer genome atlas pan-cancer analysis project,” Nature genetics, vol. 45, no. 10, p. 1113, 2013.
  • [77] C. G. A. R. Network et al., “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, no. 7216, p. 1061, 2008.
  • [78] C. G. A. Network et al., “Comprehensive molecular characterization of human colon and rectal cancer,” Nature, vol. 487, no. 7407, p. 330, 2012.
  • [79] https://en.wikipedia.org/wiki/Medical_diagnosis.
  • [80] https://homes.di.unimi.it/scotti/all/.
  • [81] E. Papaemmanuil, M. Gerstung, L. Bullinger, V. I. Gaidzik, P. Paschka, N. D. Roberts, N. E. Potter, M. Heuser, F. Thol, N. Bolli et al., “Genomic classification and prognosis in acute myeloid leukemia,” New England Journal of Medicine, vol. 374, no. 23, pp. 2209–2221, 2016.
  • [82] L. Bullinger, K. Döhner, and H. Döhner, “Genomics of acute myeloid leukemia diagnosis and pathways,” Journal of Clinical Oncology, vol. 35, no. 9, pp. 934–946, 2017.
  • [83] D. A. Arber, A. Orazi, R. Hasserjian, J. Thiele, M. J. Borowitz, M. M. Le Beau, C. D. Bloomfield, M. Cazzola, and J. W. Vardiman, “The 2016 revision to the world health organization (who) classification of myeloid neoplasms and acute leukemia,” Blood, pp. blood–2016, 2016.
  • [84] D. Charles, M. Gabriel, and M. F. Furukawa, “Adoption of electronic health record systems among us non-federal acute care hospitals: 2008-2012,” ONC data brief, vol. 9, pp. 1–9, 2013.
  • [85] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun et al., “Scalable and accurate deep learning with electronic health records,” npj Digital Medicine, vol. 1, no. 1, p. 18, 2018.
  • [86] G. McKhann, D. Drachman, M. Folstein, R. Katzman, D. Price, and E. M. Stadlan, “Clinical diagnosis of alzheimer’s disease report of the nincds-adrda work group* under the auspices of department of health and human services task force on alzheimer’s disease,” Neurology, vol. 34, no. 7, pp. 939–939, 1984.
  • [87] https://ocg.cancer.gov/programs/target/acute-lymphoblastic-leukemia.
  • [88] R. Brookmeyer, E. Johnson, K. Ziegler-Graham, and H. M. Arrighi, “Forecasting the global burden of alzheimer’s disease,” Alzheimer’s & dementia, vol. 3, no. 3, pp. 186–191, 2007.
  • [89] T. Karthikeyan and N. Poornima, “Microscopic image segmentation using fuzzy c means for leukemia diagnosis,” Leukemia, vol. 4, no. 1, 2017.
  • [90] B. Gayathri, C. Sumathi, and T. Santhanam, “Breast cancer diagnosis using machine learning algorithms-a survey,” International Journal of Distributed and Parallel Systems, vol. 4, no. 3, p. 105, 2013.
  • [91] G. I. Allen, N. Amoroso, C. Anghel, V. Balagurusamy, C. J. Bare, D. Beaton, R. Bellotti, D. A. Bennett, K. L. Boehme, P. C. Boutros et al., “Crowdsourced estimation of cognitive decline and resilience in alzheimer’s disease,” Alzheimer’s & Dementia, vol. 12, no. 6, pp. 645–653, 2016.
  • [92] http://biofinder.se/.
  • [93] D. Mahapatra, P. K. Roy, S. Sedai, and R. Garnavi, “Retinal image quality classification using saliency maps and cnns,” in International Workshop on Machine Learning in Medical Imaging.   Springer, 2016, pp. 172–179.
  • [94] J. I. Orlando, E. Prokofyeva, M. del Fresno, and M. B. Blaschko, “Convolutional neural network transfer for automated glaucoma identification,” in 12th International Symposium on Medical Information Processing and Analysis, vol. 10160.   International Society for Optics and Photonics, 2017, p. 101600U.
  • [95] http://biogps.org/#goto=welcome.
  • [96] C. L. Willman, R. Harvey, G. S. Davidson, X. Wang, S. R. Atlas, E. J. Bedrick, and I. L. Chen, “Identification of novel subgroups of high-risk pediatric precursor b acute lymphoblastic leukemia, outcome correlations and diagnostic and therapeutic methods related to same,” Oct. 29 2013, U.S. Patent 8,568,974.
  • [97] I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso, and J. S. Cardoso, “Inbreast: toward a full-field digital mammographic database,” Academic radiology, vol. 19, no. 2, pp. 236–248, 2012.
  • [98] J. Suckling, J. Parker, D. Dance, S. Astley, I. Hutt, C. Boggis, I. Ricketts, E. Stamatakis, N. Cerneaz, S. Kok et al., “The mammographic image analysis society digital mammogram database,” in Exerpta Medica. International Congress Series, vol. 1069, 1994, pp. 375–378.
  • [99] C. Rose, D. Turi, A. Williams, K. Wolstencroft, and C. Taylor, “Web services for the ddsm and digital mammography research,” in International Workshop on Digital Mammography.   Springer, 2006, pp. 376–383.
  • [100] M. G. Lopez, N. Posada, D. C. Moura, R. R. Pollán, J. M. F. Valiente, C. S. Ortega, M. Solar, G. Diaz-Herrero, I. Ramos, J. Loureiro et al., “Bcdr: a breast cancer digital repository,” in 15th International Conference on Experimental Mechanics, 2012.
  • [101] D. C. Maddix, “Diagnosing malignant versus benign breast tumors via machine learning techniques in high dimensions,” 2014.
  • [102] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15 149–15 154, 2001.
  • [103] J. Drews, “Drug discovery: a historical perspective,” Science, vol. 287, no. 5460, pp. 1960–1964, 2000.
  • [104] J. P. Hughes, S. Rees, S. B. Kalindjian, and K. L. Philpott, “Principles of early drug discovery,” British journal of pharmacology, vol. 162, no. 6, pp. 1239–1249, 2011.
  • [105] G. Sliwoski, S. Kothiwale, J. Meiler, and E. W. Lowe, “Computational methods in drug discovery,” Pharmacological reviews, vol. 66, no. 1, pp. 334–395, 2014.
  • [106] H. C. Kolb and K. B. Sharpless, “The growing impact of click chemistry on drug discovery,” Drug discovery today, vol. 8, no. 24, pp. 1128–1137, 2003.
  • [107] B. Chen and A. Butte, “Leveraging big data to transform target selection and drug discovery,” Clinical Pharmacology & Therapeutics, vol. 99, no. 3, pp. 285–297, 2016.
  • [108] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, “Massively multitask networks for drug discovery,” arXiv preprint arXiv:1502.02072, 2015.
  • [109] K. Nicolaou, C. N. Boddy, H. Li, A. E. Koumbis, R. Hughes, S. Natarajan, N. F. Jain, J. M. Ramanjulu, S. Bräse, and M. E. Solomon, “Total synthesis of vancomycin—part 2: Retrosynthetic analysis, synthesis of amino acid building blocks and strategy evaluations,” Chemistry–A European Journal, vol. 5, no. 9, pp. 2602–2621, 1999.
  • [110] M. H. Segler, M. Preuss, and M. P. Waller, “Planning chemical syntheses with deep neural networks and symbolic ai,” Nature, vol. 555, no. 7698, p. 604, 2018.
  • [111] K. Nicolaou, B. S. Safina, M. Zak, S. H. Lee, M. Nevalainen, M. Bella, A. A. Estrada, C. Funke, F. J. Zécri, and S. Bulat, “Total synthesis of thiostrepton. retrosynthetic analysis and construction of key building blocks,” Journal of the American Chemical Society, vol. 127, no. 31, pp. 11 159–11 175, 2005.
  • [112] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, Z. Zhou, L. Han, K. Karapetyan, S. Dracheva, B. A. Shoemaker et al., “Pubchem’s bioassay database,” Nucleic acids research, vol. 40, no. D1, pp. D400–D412, 2011.
  • [113] C. Y.-C. Chen, “Tcm database@ taiwan: the world’s largest traditional chinese medicine database for drug screening in silico,” PloS one, vol. 6, no. 1, p. e15939, 2011.
  • [114] A. P. Bento, A. Gaulton, A. Hersey, L. J. Bellis, J. Chambers, M. Davies, F. A. Krüger, Y. Light, L. Mak, S. McGlinchey et al., “The chembl bioactivity database: an update,” Nucleic acids research, vol. 42, no. D1, pp. D1083–D1090, 2014.
  • [115] Y. Chen and Y. Xue, “A deep learning approach to human activity recognition based on single accelerometer,” in Systems, man, and cybernetics (smc), 2015 ieee international conference on.   IEEE, 2015, pp. 1488–1492.
  • [116] P. Angeles, M. Mace, M. Admiraal, E. Burdet, N. Pavese, and R. Vaidyanathan, “A wearable automated system to quantify parkinsonian symptoms enabling closed loop deep brain stimulation,” in Conference Towards Autonomous Robotic Systems.   Springer, 2016, pp. 8–19.
  • [117] D. Ravi, C. Wong, B. Lo, and G. Yang, “A deep learning approach to on-node sensor data analytics for mobile or wearable devices,” 2016.
  • [118] K.-H. Chen, P.-C. Chen, K.-C. Liu, and C.-T. Chan, “Wearable sensor-based rehabilitation exercise assessment for knee osteoarthritis,” Sensors, vol. 15, no. 2, pp. 4193–4211, 2015.
  • [119] L. Atallah, B. Lo, R. Ali, R. King, and G.-Z. Yang, “Real-time activity classification using ambient and wearable sensors,” IEEE Transactions on Information Technology inBiomedicine, vol. 13, no. 6, p. 1031, 2009.
  • [120] Y. Lu, Y. Wei, L. Liu, J. Zhong, L. Sun, and Y. Liu, “Towards unsupervised physical activity recognition using smartphone accelerometers,” Multimedia Tools and Applications, vol. 76, no. 8, pp. 10 701–10 719, 2017.
  • [121] A. D. Ignatov and V. V. Strijov, “Human activity recognition using quasiperiodic time series collected from a single tri-axial accelerometer,” Multimedia tools and applications, vol. 75, no. 12, pp. 7257–7270, 2016.
  • [122] K. H. Walse, R. V. Dharaskar, and V. M. Thakare, “A study of human activity recognition using adaboost classifiers on wisdm dataset,” The Institute of Integrative Omics and Applied Biotechnology Journal, vol. 7, no. 2, pp. 68–76, 2016.
  • [123] C. Catal, S. Tufekci, E. Pirmit, and G. Kocabag, “On the use of ensemble of classifiers for accelerometer-based activity recognition,” Applied Soft Computing, vol. 37, pp. 1018–1022, 2015.
  • [124] M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, and J. Zhang, “Convolutional neural networks for human activity recognition using mobile sensors,” in Mobile Computing, Applications and Services (MobiCASE), 2014 6th International Conference on.   IEEE, 2014, pp. 197–205.
  • [125] J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy, “Deep convolutional neural networks on multichannel time series for human activity recognition.” in Ijcai, vol. 15, 2015, pp. 3995–4001.
  • [126] M. A. Alsheikh, A. Selim, D. Niyato, L. Doyle, S. Lin, and H.-P. Tan, “Deep activity recognition models with triaxial accelerometers.” in AAAI Workshop: Artificial Intelligence Applied to Assistive Technologies and Smart Environments, 2016.
  • [127] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
  • [128] K. A. Hoadley, C. Yau, T. Hinoue, D. M. Wolf, A. J. Lazar, E. Drill, R. Shen, A. M. Taylor, A. D. Cherniack, V. Thorsson, R. Akbani, R. Bowlby, C. K. Wong, M. Wiznerowicz, F. Sanchez-Vega, A. G. Robertson, B. G. Schneider, M. S. Lawrence, H. Noushmehr, T. M. Malta, Cancer Genome Atlas Network, J. M. Stuart, C. C. Benz, and P. W. Laird, “Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer,” Cell, vol. 173, no. 2, pp. 291–304.e6, April 2018. [Online]. Available: https://doi.org/10.1016/j.cell.2018.03.022
  • [129] R. Shen, A. B. Olshen, and M. Ladanyi, “Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis,” Bioinformatics, vol. 25, no. 22, p. 2906, 2009.
  • [130] P. D. Dunne, M. Alderdice, P. G. O’Reilly, A. C. Roddy, A. M. B. Mccorry, S. Richman, T. Maughan, S. S. Mcdade, P. G. Johnston, and D. B. Longley, “Cancer-cell intrinsic gene expression signatures overcome intratumoural heterogeneity bias in colorectal cancer patient classification,” Nature Communications, vol. 8, p. 15657, 2017.
  • [131] P. Bailey, D. K. Chang, K. Nones, A. L. Johns, A.-M. Patch, M.-C. Gingras, D. K. Miller, A. N. Christ, T. J. Bruxner, M. C. Quinn et al., “Genomic analyses identify molecular subtypes of pancreatic cancer,” Nature, vol. 531, no. 7592, p. 47, 2016.
  • [132] J. Guinney, R. Dienstmann, X. Wang, R. A. De, A. Schlicker, C. Soneson, L. Marisa, P. Roepman, G. Nyamundanda, and P. Angelino, “The consensus molecular subtypes of colorectal cancer,” Nature Medicine, vol. 21, no. 11, p. 1350, 2015.
  • [133] K. A. Hoadley, C. Yau, D. M. Wolf, A. D. Cherniack, D. Tamborero, S. Ng, M. D. Leiserson, B. Niu, M. D. McLellan, V. Uzunangelov et al., “Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin,” Cell, vol. 158, no. 4, pp. 929–944, 2014.
  • [134] R. Shen, Q. Mo, N. Schultz, V. E. Seshan, A. B. Olshen, J. Huse, M. Ladanyi, and C. Sander, “Integrative subtype discovery in glioblastoma using icluster,” Plos One, vol. 7, no. 4, p. e35236, 2012.
  • [135] K. A. Hoadley, C. Yau, T. Hinoue, D. M. Wolf, A. J. Lazar, E. Drill, R. Shen, A. M. Taylor, A. D. Cherniack, V. Thorsson et al., “Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer,” Cell, vol. 173, no. 2, pp. 291–304, 2018.
  • [136] O. Hansson, J. Seibyl, E. Stomrud, H. Zetterberg, J. Q. Trojanowski, T. Bittner, V. Lifke, V. Corradini, U. Eichenlaub, R. Batrla et al., “Csf biomarkers of alzheimer’s disease concord with amyloid-β pet and predict clinical progression: A study of fully automated immunoassays in biofinder and adni cohorts,” Alzheimer’s & dementia: the journal of the Alzheimer’s Association, 2018.
  • [137] M. Lorenzi, A. Altmann, B. Gutman, S. Wray, C. Arber, D. P. Hibar, N. Jahanshad, J. M. Schott, D. C. Alexander, P. M. Thompson et al., “Susceptibility of brain atrophy to trib3 in alzheimer’s disease, evidence from functional prioritization in imaging genetics,” Proceedings of the National Academy of Sciences, vol. 115, no. 12, pp. 3162–3167, 2018.
  • [138] B. Jie, D. Zhang, B. Cheng, D. Shen, and A. D. N. Initiative, “Manifold regularized multitask feature learning for multimodality disease classification,” Human brain mapping, vol. 36, no. 2, pp. 489–507, 2015.
  • [139] H.-I. Suk and D. Shen, “Deep ensemble sparse regression network for alzheimer’s disease diagnosis,” in International Workshop on Machine Learning in Medical Imaging.   Springer, 2016, pp. 113–121.
  • [140] B. Shi, Y. Chen, P. Zhang, C. D. Smith, J. Liu, A. D. N. Initiative et al., “Nonlinear feature transformation and deep fusion for alzheimer’s disease staging analysis,” Pattern recognition, vol. 63, pp. 487–498, 2017.
  • [141] J. Shi, X. Zheng, Y. Li, Q. Zhang, and S. Ying, “Multimodal neuroimaging feature learning with multimodal stacked deep polynomial networks for diagnosis of alzheimer’s disease,” IEEE journal of biomedical and health informatics, vol. 22, no. 1, pp. 173–183, 2018.
  • [142] N. Mattsson, R. Smith, O. Strandberg, S. Palmqvist, M. Schöll, P. S. Insel, D. Hägerström, T. Ohlsson, H. Zetterberg, K. Blennow et al., “Comparing 18f-av-1451 with csf t-tau and p-tau for diagnosis of alzheimer disease,” Neurology, vol. 90, no. 5, pp. e388–e395, 2018.
  • [143] S. Mohapatra, D. Patra, and S. Satpathy, “An ensemble classifier system for early diagnosis of acute lymphoblastic leukemia in blood microscopic images,” Neural Computing and Applications, vol. 24, no. 7-8, pp. 1887–1904, 2014.
  • [144] M. M. Amin, S. Kermani, A. Talebi, and M. G. Oghli, “Recognition of acute lymphoblastic leukemia cells in microscopic images using k-means clustering and support vector machine classifier,” Journal of medical signals and sensors, vol. 5, no. 1, p. 49, 2015.
  • [145] A. M. Abdeldaim, A. T. Sahlol, M. Elhoseny, and A. E. Hassanien, “Computer-aided acute lymphoblastic leukemia diagnosis system based on image analysis,” in Advances in Soft Computing and Machine Learning in Image Processing.   Springer, 2018, pp. 131–147.
  • [146] W. Zhu, X. Xiang, T. D. Tran, G. D. Hager, and X. Xie, “Adversarial deep structured nets for mass segmentation from mammograms,” in Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on.   IEEE, 2018, pp. 847–850.
  • [147] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, “Deep multi-instance networks with sparse label assignment for whole mammogram classification,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2017, pp. 603–611.
  • [148] T. Kooi, B. van Ginneken, N. Karssemeijer, and A. den Heeten, “Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network,” Medical physics, vol. 44, no. 3, pp. 1017–1027, 2017.
  • [149] W. Sun, T.-L. B. Tseng, J. Zhang, and W. Qian, “Enhancing deep convolutional neural network scheme for breast cancer diagnosis with unlabeled data,” Computerized Medical Imaging and Graphics, vol. 57, pp. 4–9, 2017.
  • [150] S. V. Fotin, Y. Yin, H. Haldankar, J. W. Hoffmeister, and S. Periaswamy, “Detection of soft tissue densities from digital breast tomosynthesis: comparison of conventional and deep learning approaches,” in Medical Imaging 2016: Computer-Aided Diagnosis, vol. 9785.   International Society for Optics and Photonics, 2016, p. 97850X.
  • [151] J. Wang, H. Ding, F. A. Bidgoli, B. Zhou, C. Iribarren, S. Molloi, and P. Baldi, “Detecting cardiovascular disease from mammograms with deep learning.” IEEE Trans. Med. Imaging, vol. 36, no. 5, pp. 1172–1181, 2017.
  • [152] S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, and N. Navab, “AggNet: deep learning from crowds for mitosis detection in breast cancer histology images,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1313–1321, 2016.
  • [153] J.-Z. Cheng, D. Ni, Y.-H. Chou, J. Qin, C.-M. Tiu, Y.-C. Chang, C.-S. Huang, D. Shen, and C.-M. Chen, “Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans,” Scientific reports, vol. 6, p. 24454, 2016.
  • [154] M. Kallenberg, K. Petersen, M. Nielsen, A. Y. Ng, P. Diao, C. Igel, C. M. Vachon, K. Holland, R. R. Winkel, N. Karssemeijer et al., “Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1322–1331, 2016.
  • [155] Q. Zhang, Y. Xiao, W. Dai, J. Suo, C. Wang, J. Shi, and H. Zheng, “Deep learning based classification of breast tumors with shear-wave elastography,” Ultrasonics, vol. 72, pp. 150–157, 2016.
  • [156] T. Kooi, G. Litjens, B. van Ginneken, A. Gubern-Mérida, C. I. Sánchez, R. Mann, A. den Heeten, and N. Karssemeijer, “Large scale deep learning for computer aided detection of mammographic lesions,” Medical image analysis, vol. 35, pp. 303–312, 2017.
  • [157] J. Arevalo, F. A. González, R. Ramos-Pollán, J. L. Oliveira, and M. A. G. Lopez, “Representation learning for mammography mass lesion classification with convolutional neural networks,” Computer methods and programs in biomedicine, vol. 127, pp. 248–257, 2016.
  • [158] A. Dubrovina, P. Kisilev, B. Ginsburg, S. Hashoul, and R. Kimmel, “Computational mammography using deep neural networks,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 6, no. 3, pp. 243–247, 2018.
  • [159] N. Dhungel, G. Carneiro, and A. P. Bradley, “The automated learning of deep features for breast mass classification from mammograms,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2016, pp. 106–114.
  • [160] K. Paeng, S. Hwang, S. Park, and M. Kim, “A unified framework for tumor proliferation score prediction in breast histopathology,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support.   Springer, 2017, pp. 231–239.
  • [161] B. Q. Huynh, H. Li, and M. L. Giger, “Digital mammographic tumor classification using transfer learning from deep convolutional neural networks,” Journal of Medical Imaging, vol. 3, no. 3, p. 034501, 2016.
  • [162] P. Kisilev, E. Sason, E. Barkan, and S. Hashoul, “Medical image description using multi-task-loss cnn,” in Deep Learning and Data Labeling for Medical Applications.   Springer, 2016, pp. 121–129.
  • [163] Y. Qiu, Y. Wang, S. Yan, M. Tan, S. Cheng, H. Liu, and B. Zheng, “An initial investigation on developing a new method to predict short-term breast cancer risk based on deep learning technology,” in Medical Imaging 2016: Computer-Aided Diagnosis, vol. 9785.   International Society for Optics and Photonics, 2016, p. 978521.
  • [164] R. K. Samala, H.-P. Chan, L. M. Hadjiiski, K. Cha, and M. A. Helvie, “Deep-learning convolution neural network for computer-aided detection of microcalcifications in digital breast tomosynthesis,” in Medical Imaging 2016: Computer-Aided Diagnosis, vol. 9785.   International Society for Optics and Photonics, 2016, p. 97850Y.
  • [165] R. K. Samala, H.-P. Chan, L. Hadjiiski, M. A. Helvie, J. Wei, and K. Cha, “Mass detection in digital breast tomosynthesis: Deep convolutional neural network with transfer learning from mammography,” Medical physics, vol. 43, no. 12, pp. 6654–6666, 2016.
  • [166] A. Akselrod-Ballin, L. Karlinsky, S. Alpert, S. Hasoul, R. Ben-Ari, and E. Barkan, “A region based convolutional network for tumor detection and classification in breast mammography,” in Deep Learning and Data Labeling for Medical Applications.   Springer, 2016, pp. 197–205.
  • [167] P. Fonseca, J. Mendoza, J. Wainer, J. Ferrer, J. Pinto, J. Guerrero, and B. Castaneda, “Automatic breast density classification using a convolutional neural network architecture search procedure,” in Medical Imaging 2015: Computer-Aided Diagnosis, vol. 9414.   International Society for Optics and Photonics, 2015, p. 941428.
  • [168] A. R. Jamieson, K. Drukker, and M. L. Giger, “Breast image feature learning with adaptive deconvolutional networks,” in Medical Imaging 2012: Computer-Aided Diagnosis, vol. 8315.   International Society for Optics and Photonics, 2012, p. 831506.
  • [169] J. Lamb, E. D. Crawford, D. Peck, J. W. Modell, I. C. Blat, M. J. Wrobel, J. Lerner, J.-P. Brunet, A. Subramanian, K. N. Ross et al., “The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease,” science, vol. 313, no. 5795, pp. 1929–1935, 2006.
  • [170] N. S. Madhukar, P. Khade, L. Huang, K. Gayvert, G. Galletti, M. Stogniew, J. E. Allen, P. Giannakakou, and O. Elemento, “A new big-data paradigm for target identification and drug discovery,” bioRxiv, p. 134973, 2017.
  • [171] J. Law, Z. Zsoldos, A. Simon, D. Reid, Y. Liu, S. Y. Khew, A. P. Johnson, S. Major, R. A. Wade, and H. Y. Ando, “Route designer: a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation,” Journal of chemical information and modeling, vol. 49, no. 3, pp. 593–602, 2009.
  • [172] I. Wallach, M. Dzamba, and A. Heifets, “AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery,” arXiv preprint arXiv:1510.02855, 2015.
  • [173] F. F. Costa, “Big data in biomedicine,” Drug discovery today, vol. 19, no. 4, pp. 433–440, 2014.
  • [174] P. Rajpurkar, J. Irvin, A. Bagul, D. Ding, T. Duan, H. Mehta, B. Yang, K. Zhu, D. Laird, R. L. Ball et al., “MURA dataset: Towards radiologist-level abnormality detection in musculoskeletal radiographs,” arXiv preprint arXiv:1712.06957, 2017.
  • [175] J. Zhan, W. Gao, L. Wang, J. Li, K. Wei, C. Luo, R. Han, X. Tian, C. Jiang et al., “BigDataBench: An open-source big data benchmark suite,” Chinese Journal of Computers, vol. 39, no. 1, pp. 196–211, 2016.
  • [176] S. C. Christov, G. S. Avrunin, L. A. Clarke, L. J. Osterweil, and E. A. Henneman, “A benchmark for evaluating software engineering techniques for improving medical processes,” in Proceedings of the 2010 ICSE Workshop on Software Engineering in Health Care.   ACM, 2010, pp. 50–56.
  • [177] L. Shamir, N. Orlov, D. M. Eckley, T. J. Macura, and I. G. Goldberg, “IICBU 2008: a proposed benchmark suite for biological image analysis,” Medical & biological engineering & computing, vol. 46, no. 9, pp. 943–947, 2008.
  • [178] A. E. Darling, L. Carey, and W. C. Feng, “The design, implementation, and evaluation of mpiBLAST,” Los Alamos National Laboratory, Tech. Rep., 2003.
  • [179] “SAND,” http://ccl.cse.nd.edu/software/sand/.
  • [180] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on.   IEEE, 2017, pp. 3462–3471.
  • [181] “AD DREAM Challenge training data,” https://www.synapse.org/#!Synapse:syn2290704/wiki/64710.
  • [182] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang et al., “BigDataBench: A big data benchmark suite from internet services,” in High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on.   IEEE, 2014, pp. 488–499.
  • [183] D. J. Foran, L. Yang, W. Chen, J. Hu, L. A. Goodell, M. Reiss, F. Wang, T. Kurc, T. Pan, A. Sharma et al., “ImageMiner: a software system for comparative analysis of tissue microarrays using content-based image retrieval, high-performance computing, and grid technology,” Journal of the American Medical Informatics Association, vol. 18, no. 4, pp. 403–415, 2011.
  • [184] F. Wang, A. Aji, Q. Liu, and J. Saltz, “Hadoop-GIS: A high performance spatial query system for analytical medical imaging with MapReduce,” Center for Comprehensive Informatics, Technical Report, 2011. Available at: http://www3.cs.stonybrook.edu/~fuswang/papers/CCI-TR-2011-3.pdf (accessed 21 September 2015).
  • [185] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly et al., “The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome research, 2010.
  • [186] V. M. Markowitz, I.-M. A. Chen, K. Palaniappan, K. Chu, E. Szeto, Y. Grechkin, A. Ratner, B. Jacob, J. Huang, P. Williams et al., “IMG: the integrated microbial genomes database and comparative analysis system,” Nucleic acids research, vol. 40, no. D1, pp. D115–D122, 2011.
  • [187] I. Neamatullah, M. M. Douglass, H. L. Li-wei, A. Reisner, M. Villarroel, W. J. Long, P. Szolovits, G. B. Moody, R. G. Mark, and G. D. Clifford, “Automated de-identification of free-text medical records,” BMC medical informatics and decision making, vol. 8, no. 1, p. 32, 2008.
  • [188] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: a system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
  • [189] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia.   ACM, 2014, pp. 675–678.