Owing to a series of successes in clinical research, AI is considered a promising technology for providing high-quality, low-cost diagnostic services [1, 2, 3, 4, 5, 6, 7]. However, there is little evidence that this research can be translated into real-world clinical settings (in short, real-world settings) and improve medical services [8, 9, 10]. Figs. 1, 2, and 3 show, qualitatively and quantitatively, that state-of-the-art and state-of-the-practice AI systems achieve acceptable performance only under stringent conditions. We call those stringent conditions closed clinical settings (in short, closed settings). Closed settings rest on the following primary assumptions: all categories of subjects are known a priori; the same diagnostic strategy is applied to all subjects, e.g., every subject requires a nuclear magnetic resonance scan (MRI); and the state-of-the-art AI systems can only be deployed at medical institutions that are able to execute the prescribed diagnostic strategy [4, 13, 14]. Conversely, if a medical institution cannot meet the prerequisites for completing the prescribed diagnostic strategy, the corresponding AI system cannot be deployed. In this context, the diagnosis problem is an artificially simplified closed set recognition problem [3, 4, 14, 5, 15].
Closed settings are too idealized for real-world settings. The real-world setting is open, uncertain, and complex. Subjects in real-world settings do not all belong to pre-known categories; many belong to unknown or unfamiliar categories. Every subject is different, and there is no one-size-fits-all diagnosis strategy. The conditions of medical institutions differ and are not known in advance, e.g., some hospitals have positron emission tomography (PET), whereas most hospitals in underdeveloped areas are not equipped with PET. The diagnosis problem in real-world settings is therefore an open set recognition problem.
Essentially, the diagnosis task in the closed setting is to find the optimal solution for classifying different categories of subjects in a limited space (a so-called supervised task), with the help of the ground truth of every subject. The real-world setting, however, is open and places the diagnosis task in an unlimited space, which greatly expands the scale of the problem. Moreover, supervised learning loses efficacy because some categories of subjects, and their ground truth, are unknown during the development of the AI model. Hence, the main problem of the diagnosis task in the real-world setting becomes efficiently locating the known subjects within an uncertain and complex environment. Furthermore, as shown in Fig. 2a and Fig. 3b, c, solving the diagnosis task well in the closed setting does little to help solve it in the real-world setting. Compared with the diagnosis task in the closed setting, the task in the real-world setting is a new and challenging one that must be treated differently.
This paper calls for turning the attention of medical AI from algorithmic research in closed settings to systematic study in real-world settings. Specifically, we construct a clinical AI benchmark named Clinical AIBench, which contains real-world and closed settings, to promote the deployment of AI in real-world settings. To tackle the uncertainty and complexity of real-world settings, we propose an open, dynamic machine learning framework (Fig. S1) and a diagnostic system named OpenClinicalAI that can be embedded in current healthcare systems, as shown in Fig. 1b.
The first versions of Clinical AIBench and OpenClinicalAI target Alzheimer’s disease (AD), as AD is an incurable disease that places a heavy burden on society (the total payment for individuals with AD or other dementias is estimated in the billions) [17, 18, 19, 20]. Early and accurate AD diagnosis leads to the correct management of AD or other dementias, saving up to trillions in medical and care costs. However, it is estimated that millions of the world’s people with dementia do not receive a diagnosis because of limited medical resources and experts.
The current version of Clinical AIBench includes two clinical settings, a closed setting and a real-world setting, both curated from a large, enriched dataset, the Alzheimer’s Disease Neuroimaging Initiative (ADNI). OpenClinicalAI is composed of multiple independent parts that cooperate to handle unknown subjects in real-world settings and dynamically adjust diagnosis strategies according to specific subjects and medical institutions. OpenClinicalAI provides an opportunity to embed an AI-based diagnostic system into current healthcare systems, where it can cooperate with clinicians to improve healthcare services.
In the real-world setting of Clinical AIBench, we evaluate the performance of OpenClinicalAI against the state-of-the-art AI diagnosis system. Our evaluations show that the performance of OpenClinicalAI exceeds that of the state-of-the-art AI diagnosis system in the real-world setting. Additionally, OpenClinicalAI can develop personalized diagnosis strategies for every subject in the real-world setting, maximizing the patient benefit.
Clinical AIBench contains real-world and closed settings to develop and evaluate the AI system designed for real-world settings. The first version targets Alzheimer’s disease. In this section, we focus on real-world settings.
Diagnosis in a real-world setting requires clinicians to use both individual clinical expertise and the best available external evidence, which is usually obtained by clinical examination, to make a clinical decision for every specific subject. This means that at least two main factors must be considered in the diagnosis task in real-world settings: the subject and the clinical examinations available in the medical institution.
As shown in Fig. S2, the real-world setting is open with uncertainty and complexity. The primary characteristics of the real-world setting are as follows:
Real-world settings are open, and clinicians or AI systems often encounter unknown and unfamiliar categories. Thus, the subject’s categories are not all pre-known and familiar. Clinicians differ in expertise and may be unfamiliar with some diseases. In the real-world setting of Clinical AIBench, an unknown subject category is one that is not familiar to the clinician or AI system. Thus, we mark both unknown categories and unfamiliar categories as unknown. In this work, Clinical AIBench places all mild cognitive impairment (MCI) and significant memory concern (SMC) subjects in the test set, so these categories are unknown during the development of the AI system.
Subjects in real-world settings are in different situations. In this work, subjects with varying conditions come from 67 sites in two countries (Table S1). For every subject, data from all visits are included in Clinical AIBench (Table S2). The interval between two consecutive visits of a subject is usually more than six months.
Medical institutions in real-world settings differ widely in their ability to perform examinations, and neither the specific institutions nor their examination capabilities are known in advance. In this work, missing data for subjects are not filled in in the real-world setting of Clinical AIBench. In the real world, most subjects do not have data for all examination categories. Missing examination categories are kept as missing in order to preserve the varied examination capabilities of different medical institutions. That is to say, in the real-world setting of Clinical AIBench, the absence of a specific category of examination data indicates that the medical institution lacks that examination capability.
Specifically, in this work, the examination data in ADNI is divided into 13 categories: base information (Base), cognition information (Cog), cognition testing (CE), neuropsychiatric information (Neur), function and behavior information (FB), physical neurological examination (PE), blood testing (Blood), urine testing (Urine), nuclear magnetic resonance scan (MRI), positron emission computed tomography scan with 18-FDG (FDG), positron emission computed tomography scan with AV45 (AV45), gene analysis (Gene), and cerebral spinal fluid analysis (CSF).
Details of the dataset in the real-world setting are as follows.
All subjects with labels in ADNI are included.
80% of AD and cognitively normal (CN) subjects are assigned to the training set. 5% of AD and CN subjects are assigned to the validation set. The remaining 15% of AD and CN subjects, 100% of MCI subjects, and 100% of SMC subjects are assigned to the test set.
For every subject, different diagnosis strategies are combined according to the presence of different examination data, and the data of each diagnosis strategy forms a sample.
The test set is not accessible during the training of the AI system. In addition, since each subject may have multiple visits (each visit of the subject is treated as an independent subject), we stipulate that each subject’s visit data can appear in only one of the training set, validation set, and test set.
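The constraint that all visits of a subject stay in a single partition can be sketched as a subject-level split. This is a minimal illustration under stated assumptions, not the authors' code: the function name `split_by_subject`, the fixed random seed, and the toy fractions in the usage lines are hypothetical.

```python
import random
from collections import defaultdict


def split_by_subject(visits, train_frac, val_frac, seed=0):
    """Assign whole subjects, with all of their visits, to one partition.

    visits: list of (subject_id, visit_code) tuples.
    Returns a dict mapping "train", "val", and "test" to lists of visits.
    """
    by_subject = defaultdict(list)
    for subject_id, visit_code in visits:
        by_subject[subject_id].append((subject_id, visit_code))

    # Shuffle subjects, not visits, so a subject never straddles two sets.
    subject_ids = sorted(by_subject)
    random.Random(seed).shuffle(subject_ids)

    n_train = int(len(subject_ids) * train_frac)
    n_val = int(len(subject_ids) * val_frac)
    groups = {
        "train": subject_ids[:n_train],
        "val": subject_ids[n_train:n_train + n_val],
        "test": subject_ids[n_train + n_val:],
    }
    return {
        name: [visit for sid in ids for visit in by_subject[sid]]
        for name, ids in groups.items()
    }


# Toy data: 10 hypothetical subjects with two visits each.
toy_visits = [(f"s{i}", code) for i in range(10) for code in ("bl", "m06")]
parts = split_by_subject(toy_visits, train_frac=0.6, val_frac=0.2)
```

Shuffling subject identifiers rather than individual visits is what guarantees that no subject contributes data to more than one set.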
Since previous AD diagnosis studies were developed in closed settings, the closed setting in Clinical AIBench is similar to that of previous research [24, 25, 26, 27, 28, 12, 29, 30, 31]. Only AD and CN subjects are included in the closed setting, and only the nuclear magnetic resonance instrument and historical medical records are available. 80% of subjects are assigned to the training set, 5% to the validation set, and 15% to the test set.
The performance of OpenClinicalAI on Alzheimer’s disease diagnosis
Most of these works are based on MRI data, and those using transfer learning obtain the best results. In addition, among recent AI diagnosis studies, the transfer learning framework of a pre-trained model followed by a classifier achieves state-of-the-art performance in many diagnosis tasks based on medical images [14, 1, 32, 33, 3]. Thus, based on the state-of-the-art transfer learning framework and MRI data, we use a pre-trained model named DenseNet201 and a classifier called XGBoost to develop an Alzheimer’s disease diagnosis AI system, which serves as the baseline system against which OpenClinicalAI is compared in the rest of this paper.
We validate the effectiveness of OpenClinicalAI in two ways. First, we compare OpenClinicalAI to the baseline system in the closed setting. Second, we compare OpenClinicalAI to the baseline system in the real-world setting. Our comparison metrics are the area under the receiver operating characteristic (ROC) curve (AUC) and sensitivity. The larger the AUC and sensitivity values, the better the AI system.
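For readers less familiar with the two metrics, the following sketch shows one common way to compute them. It is illustrative only: the function names and the toy labels and scores are assumptions, and the pairwise AUC formula is a standard equivalent of the area under the ROC curve rather than the evaluation code used in this work.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    labels: 1 for the target class (e.g., AD), 0 for everything else.
    scores: predicted probability of the target class.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity(true_labels, predicted_labels, target):
    """Fraction of subjects of class `target` that are predicted as `target`."""
    predictions = [p for t, p in zip(true_labels, predicted_labels) if t == target]
    return sum(p == target for p in predictions) / len(predictions)


# Toy example: three of the four positive/negative pairs are ranked correctly.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
# Toy example: one of the two AD subjects is recognized as AD.
toy_sens = sensitivity(
    ["AD", "AD", "CN", "unknown"],
    ["AD", "unknown", "CN", "unknown"],
    target="AD",
)
```

In a multi-class setting such as AD, CN, and unknown, each class is scored one-versus-rest, which is how a separate AUC and sensitivity can be reported per category.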
The performance of OpenClinicalAI against the baseline system in the closed setting.
To the best of our knowledge, all state-of-the-art and state-of-the-practice Alzheimer’s disease diagnosis AI studies are developed and evaluated in closed settings [27, 28, 12, 29, 30, 31]. We first assess the baseline AI system in the closed setting, and then evaluate OpenClinicalAI in the same closed setting but without the restriction that only the nuclear magnetic resonance instrument and historical medical records are available.
As shown in Fig. 2a, the baseline system obtains a high AUC score (95% confidence interval (CI) 0.9722-0.9827), and there is not much room for improvement. OpenClinicalAI achieves a state-of-the-art AUC score (95% CI 0.9907-0.9945). However, the essential improvement from the baseline system to OpenClinicalAI is that the latter can dynamically develop personalized diagnosis strategies according to specific subjects and medical institutions. As shown in Fig. 2b, less than 10% of the subjects require a nuclear magnetic resonance scan, and most subjects require only harmless examinations such as cognitive examination. We conclude that OpenClinicalAI can avoid unnecessary examinations for subjects and suit medical institutions with different examination abilities (different hospitals have various clinical settings, e.g., community hospitals without nuclear magnetic resonance machines and big hospitals with multiple facilities).
The performance of OpenClinicalAI against the baseline system in the real-world setting.
Our goal is to develop an AI diagnosis system that can be embedded in the current medical system and cooperate with clinicians. In this work, if the predicted probability of AD or CN is smaller than the probability threshold (0.95), the subject is marked as unknown and referred to the clinician. For comparison, we use the same baseline system discussed above. In addition, we also consider OpenClinicalAI without the OpenMax mechanism (Algorithms S2 and S3) as a comparison system.
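The referral rule can be written down in a few lines. The sketch below is a simplified reading of the rule stated above, assuming the system outputs one probability per known category; the function name `triage` and the dictionary interface are hypothetical, and the OpenMax mechanism itself is not modeled here.

```python
def triage(probabilities, threshold=0.95):
    """Return the most probable known category, or "unknown" for referral.

    probabilities: dict mapping a known category ("AD", "CN") to its
    predicted probability. A subject whose best probability is below the
    threshold is marked as unknown and referred to the clinician.
    """
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob >= threshold else "unknown"


confident = triage({"AD": 0.97, "CN": 0.03})   # high-confidence AD prediction
uncertain = triage({"AD": 0.60, "CN": 0.40})   # below threshold, referred
```

A high threshold trades sensitivity on known categories for fewer confident misdiagnoses, which matches the intent of sending uncertain cases to clinicians.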
As shown in Fig. 3a, b, and c, compared to the baseline system, OpenClinicalAI demonstrates a significant improvement in the AUC of identification of AD subjects (+0.1102) and the AUC of identification of CN subjects (+0.1148). It is worth noting that OpenClinicalAI also shows a large improvement in the sensitivity of AD, CN, and unknown at the operating point.
For the baseline system, the sensitivity of known (AD and CN) subjects is low. The sensitivity of AD is just 0.5483 (95% CI 0.4604-0.6301), and the sensitivity of CN is just 0.3333 (95% CI 0.2663-0.3979). This indicates that a large share of known subjects will be marked as unknown and sent to the clinician for diagnosis. Moreover, the sensitivity of unknown subjects is 0.8888 (95% CI 0.8753-0.9018), meaning that 11.12% of unknown subjects will be misdiagnosed. In addition, the baseline system requires that every subject has a nuclear magnetic resonance scan, and every medical institution that deploys the baseline system must be equipped with a nuclear magnetic resonance apparatus.
For OpenClinicalAI without the OpenMax mechanism, the sensitivity of known (AD and CN) subjects is as good as that of OpenClinicalAI with the OpenMax mechanism. In contrast, the sensitivity of unknown subjects is much worse than that of OpenClinicalAI with the OpenMax mechanism. This means that most unknown subjects will be misdiagnosed, which is unacceptable in real-world settings.
OpenClinicalAI diagnoses most of the known (AD and CN) subjects correctly, marks most of the rest as unknown, and sends them to the clinician for further diagnosis. Besides, most unknown subjects are correctly identified, and few are misdiagnosed. This means that OpenClinicalAI has enormous potential for application in real-world settings. In addition, as shown in Fig. 3d, similar to its behavior in the closed setting, OpenClinicalAI can develop and adjust diagnosis strategies for every subject dynamically in the real-world setting. Only a small proportion of subjects require a nuclear magnetic resonance scan or other examinations with higher costs (economic cost and harm).
Development of diagnosis strategies
For every subject, OpenClinicalAI first acquires the base information of the subject. Second, according to the current data of the subject, it either gives a final diagnosis or requests further examination information. Third, it repeats the previous step until the diagnosis is finalized or there is no further examination.
As shown in Fig. 4a, the diagnosis strategies of subjects are not the same (Table S3). OpenClinicalAI dynamically develops 35 diagnosis strategies according to the different subject situations and all 40 examination abilities in the test set (Table S4). For the known (AD and CN) subjects, as shown in Fig. 4b and c, most subjects require only low-cost examinations (such as cognition testing (CE)), and a small proportion require high-cost examinations (such as cerebral spinal fluid analysis (CSF)). For unknown subjects, as shown in Fig. 4d, identification is more complex and depends more on high-cost examinations than the diagnosis of known (AD and CN) subjects. The reason is that, by design, OpenClinicalAI does its best to determine whether a subject belongs to one of the known categories, and marks the subject as unknown only when it fails. Consequently, an unknown subject undergoes more examinations than a known subject. The details of the high-cost examination requirements are as follows.
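The three steps above amount to a simple loop. The sketch below is one possible rendering under stated assumptions, not the OpenClinicalAI implementation: the callables `classify`, `next_exam`, and `acquire` and the 0.95 stopping threshold are hypothetical stand-ins for the model, its examination recommendation, and the data collection step.

```python
def run_diagnosis(base_info, classify, next_exam, acquire, threshold=0.95):
    """Iteratively gather examinations until a diagnosis can be finalized.

    classify(collected)  -> (label, confidence) from the data gathered so far.
    next_exam(collected) -> name of the next examination, or None if none is left.
    acquire(exam)        -> data for the requested examination.
    Returns the final label and the list of examinations that were used.
    """
    collected = {"Base": base_info}              # step 1: base information
    while True:
        label, confidence = classify(collected)  # step 2: try to finalize
        if confidence >= threshold:
            return label, list(collected)
        exam = next_exam(collected)              # otherwise request more data
        if exam is None:                         # step 3: nothing left to request
            return "unknown", list(collected)
        collected[exam] = acquire(exam)


# Toy stand-ins: confidence grows by 0.25 with every extra examination.
def toy_classify(collected):
    return "AD", 0.5 + 0.25 * (len(collected) - 1)


def toy_next_exam(collected):
    for exam in ("Cog", "CE", "MRI"):
        if exam not in collected:
            return exam
    return None


result = run_diagnosis({"age": 70}, toy_classify, toy_next_exam, lambda exam: f"{exam}-data")
```

In the toy run the loop stops after the cognition examinations, so the MRI is never requested; this is the mechanism by which per-subject strategies avoid unnecessary high-cost examinations.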
33.94% of unknown subjects require a nuclear magnetic resonance scan (versus 12.43% of known subjects).
13.95% of unknown subjects require a positron emission computed tomography scan with 18-FDG (versus 4.75% of known subjects).
8.67% of unknown subjects require a positron emission computed tomography scan with AV45 (versus 5.87% of known subjects).
9.38% of unknown subjects require a gene analysis (versus 1.96% of known subjects).
5.13% of unknown subjects require a cerebral spinal fluid analysis (versus 0.28% of known subjects).
Potential clinical applications
OpenClinicalAI enables an AD diagnosis system to be implemented in uncertain and complex clinical settings, reducing the workload of AD diagnosis and minimizing the cost to subjects.
To identify known (AD and CN) subjects with high confidence, OpenClinicalAI operates with a high decision threshold (0.95). On the test set, OpenClinicalAI achieves an accuracy of 92.47% (95% CI 91.36%-93.44%), an AD sensitivity of 84.92% (95% CI 78.91%-90.51%), and a CN sensitivity of 81.27% (95% CI 75.51%-86.67%), while retaining an unknown sensitivity of 93.96% (95% CI 92.90%-94.92%). In addition, OpenClinicalAI can cooperate with senior clinicians to identify known subjects. In this work, 15.08% (95% CI 9.49%-21.09%) of AD subjects and 18.73% (95% CI 13.33%-24.49%) of CN subjects are marked as unknown and sent to senior clinicians for diagnosis. This work pattern is significant for underdeveloped areas: it is a promising way to connect developed and underdeveloped areas in order to reduce workload, improve overall medical services, and promote medical equity. To minimize the subject’s cost and maximize the subject’s benefit, OpenClinicalAI dynamically develops personalized diagnosis strategies according to the subject’s situation and the existing medical conditions.
For the subject, OpenClinicalAI judges whether it can finalize the diagnosis according to the information obtained so far. If the current data are not enough to support a diagnosis, it recommends the most suitable further examination for the subject. This mitigates over-testing, minimizes the subject’s cost, and maximizes the subject’s benefit. On the test set, OpenClinicalAI applies different diagnosis strategies to different subjects (Table S3). The details of the high-cost examinations are as follows.
31.07% of subjects require a nuclear magnetic resonance scan.
12.72% of subjects require a positron emission computed tomography scan with 18-FDG.
8.29% of subjects require a positron emission computed tomography scan with AV45.
8.39% of subjects require a gene analysis.
4.48% of subjects require a cerebral spinal fluid analysis.
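These overall rates can be cross-checked against the separate rates reported for known and unknown subjects. The sketch below assumes, as an interpretation rather than a statement from the text, that the percentages are computed over test-set visits, and it takes the visit counts from the Materials and Methods (305 AD, 411 CN, 4357 MCI, and 280 SMC visits).

```python
known_visits = 305 + 411      # AD + CN visits in the test set
unknown_visits = 4357 + 280   # MCI + SMC visits in the test set

# exam: (known %, unknown %, reported overall %)
rates = {
    "MRI": (12.43, 33.94, 31.07),
    "FDG": (4.75, 13.95, 12.72),
    "AV45": (5.87, 8.67, 8.29),
    "Gene": (1.96, 9.38, 8.39),
    "CSF": (0.28, 5.13, 4.48),
}


def pooled_rate(known_pct, unknown_pct):
    """Visit-weighted average of the known and unknown examination rates."""
    total = known_visits + unknown_visits
    return (known_pct * known_visits + unknown_pct * unknown_visits) / total


for exam, (known_pct, unknown_pct, reported) in rates.items():
    pooled = pooled_rate(known_pct, unknown_pct)
    print(f"{exam}: pooled {pooled:.2f}% vs reported {reported:.2f}%")
```

Under this assumption the pooled values agree with the reported overall rates to within rounding, which suggests the two lists describe the same test set at the visit level.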
For the medical institution, before the system recommends an examination for a subject, OpenClinicalAI inquires whether the medical institution can execute the examination. If the institution cannot perform it, OpenClinicalAI recommends other examinations until the current information is enough to support a diagnosis, or until all common examinations have been suggested and the subject is marked as unknown. This allows OpenClinicalAI to be deployed in medical institutions with various examination abilities. In this work, OpenClinicalAI diagnoses subjects under 40 conditions of medical institutions (Table S4). In addition, for subjects in the test set, when the information for a recommended examination is missing (which may correspond to the medical institution being unable to execute that examination), OpenClinicalAI repeatedly adjusts its diagnostic strategies.
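The institution check can be illustrated as a filter over a ranked list of candidate examinations. This is a hedged sketch: the function `next_available_exam`, the ranking, and the example capability sets are hypothetical and only mirror the behavior described above.

```python
def next_available_exam(ranked_exams, institution_exams, already_done):
    """Return the best-ranked examination the institution can perform.

    ranked_exams: examinations ordered from most to least preferred.
    institution_exams: set of examinations the institution can execute.
    already_done: set of examinations already performed for the subject.
    Returns None when nothing remains, in which case the subject would be
    marked as unknown.
    """
    for exam in ranked_exams:
        if exam in already_done:
            continue
        if exam in institution_exams:
            return exam
    return None


ranked = ["MRI", "CE", "Cog"]                  # hypothetical preference order
community_hospital = {"Base", "Cog", "CE"}     # no MRI scanner available

first_choice = next_available_exam(ranked, community_hospital, {"Base"})
exhausted = next_available_exam(ranked, community_hospital, {"Base", "CE", "Cog"})
```

Here the community hospital cannot run the preferred MRI, so the cognition testing is recommended instead; once every feasible examination has been used, the function signals that no further recommendation is possible.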
Currently, the media overhype AI-assisted diagnosis systems, which are in fact far from mature enough to be implemented in real-world clinical settings. Many clinicians are gradually losing faith in medical AI [36, 37, 9, 38, 39, 40]. Similar to the first trough of AI, the gap between high expectations and unsatisfactory practical implementation may severely hinder the development of medical AI. In addition, a comparison of the performance of state-of-the-art AI systems under stringent conditions and in real-world settings shows that solving the diagnosis task well under stringent conditions does little to help solve it in the real-world setting. It is time to shift attention from pure algorithm research in closed settings to systematic study in real-world settings, focusing on the challenge of tackling their uncertainty and complexity. In this work, we propose an open, dynamic machine learning framework that enables an AI diagnosis system to deal directly with the uncertainty and complexity of the real-world setting. Based on our framework, an AD diagnostic system demonstrates great potential for deployment in real-world settings with different medical environments, reducing the workload of AD diagnosis and minimizing the cost to the subject.
Although many AI diagnostic systems have been proposed, how to embed these systems into the current health care system to improve medical services remains an open issue [2, 41, 42, 43]. OpenClinicalAI provides a reasonable way to do so. It can collaborate with clinicians to improve the quality of clinical services, especially in underdeveloped areas. On the one hand, OpenClinicalAI can directly deal with the diagnosis task in the uncertain and complex real-world setting. On the other hand, OpenClinicalAI can diagnose typical patients of known categories while sending challenging or atypical patients of known categories to clinicians for diagnosis. Although AI technology differs from traditional statistics, the model of an AI system still learns patterns from training data. For typical patients, these patterns are easy for the model to learn, whereas for atypical patients they are difficult to learn. Thus, every atypical and unknown patient needs special attention from clinicians. In this work, most of the known subjects are diagnosed by OpenClinicalAI, and the rest are marked as unknown and sent to the senior clinician.
Over-testing has always been a concern and has been exacerbated by current AI-based diagnostic systems [44, 45]. For example, the systems proposed by Lu et al., Ding et al., and Liu et al. achieved state-of-the-art performance, but they required every subject to have a positron emission computed tomography scan, which is unnecessary for most subjects in real-world settings [46, 31, 47]. In contrast, OpenClinicalAI develops personalized diagnosis strategies to avoid unnecessary testing, and thus provides a possible way to effectively reduce over-testing under strict supervision.
Notably, the experiments in this work do not include a comparison with clinicians, for two main reasons. First, OpenClinicalAI obtains an AUC value of 0.9927 (95% CI 0.9854-0.9981) in the closed setting, which is so close to the ground truth that a comparison with clinicians is unnecessary. Second, the diagnosis pattern in real-world settings aims to diagnose typical patients of known categories (who are usually easier to diagnose) and to send atypical patients of known categories (who are generally difficult to diagnose) and unknown subjects to clinicians. The task of OpenClinicalAI is therefore quite different from that of clinicians. Unlike current AI-based diagnostic systems, OpenClinicalAI acts as a new part of the whole healthcare system instead of replacing the role of clinicians. Therefore, it is not necessary to compare OpenClinicalAI to clinicians.
Although OpenClinicalAI is promising for future research on diagnosis systems, several limitations remain. First, prospective clinical studies of Alzheimer’s disease diagnosis will be required to prove the effectiveness of our system. Second, data collection and processing are required to follow the ADNI standards.
- [1] A. Esteva, et al., Nature 542, 115 (2017).
- [2] S. M. McKinney, et al., Nature 577, 89 (2020).
- [3] D. S. Kermany, et al., Cell 172, 1122 (2018).
- [4] J. De Fauw, et al., Nature medicine 24, 1342 (2018).
- [5] K. Ning, et al., Neurobiology of aging 68, 151 (2018).
- [6] Z. Tang, et al., Nature communications 10, 1 (2019).
- [7] C. Lian, M. Liu, Y. Pan, D. Shen, IEEE Transactions on Cybernetics (2020).
- [8] J. He, et al., Nature medicine 25, 30 (2019).
- [9] P. Brocklehurst, et al., The Lancet 389, 1719 (2017).
- [10] M. Roberts, et al., Nature Machine Intelligence 3, 199 (2021).
- [11] A. Bendale, T. Boult, (2015), pp. 1893–1902.
- [12] S. Qiu, et al., Brain 143, 1920 (2020).
- [13] J. J. Titano, et al., Nature medicine 24, 1337 (2018).
- [14] H. Lee, et al., Nature biomedical engineering 3, 173 (2019).
- [15] X. Mei, et al., Nature medicine 26, 1224 (2020).
- [16] C. Geng, S.-j. Huang, S. Chen, IEEE transactions on pattern analysis and machine intelligence (2020).
- [17] L. E. Hebert, L. A. Beckett, P. A. Scherr, D. A. Evans, Alzheimer Disease & Associated Disorders 15, 169 (2001).
- [18] L. E. Hebert, J. Weuve, P. A. Scherr, D. A. Evans, Neurology 80, 1778 (2013).
- [19] A. Association, et al., Alzheimer’s & Dementia 14, 367 (2018).
- [20] C. S. Frigerio, et al., Cell reports 27, 1293 (2019).
- [21] M. Prince, R. Bryce, C. Ferri (2018).
- [22] S. G. Mueller, et al., Neuroimaging Clinics 15, 869 (2005).
- [23] D. L. Sackett, W. M. Rosenberg, J. M. Gray, R. B. Haynes, W. S. Richardson, BMJ 312, 71 (1996).
- [24] H. Li, et al., Alzheimer’s & Dementia (2019).
- [25] H. Choi, et al., EBioMedicine 43, 447 (2019).
- [26] T. Zhou, M. Liu, K.-H. Thung, D. Shen, IEEE transactions on medical imaging (2019).
- [27] M. A. Ebrahimighahnavieh, S. Luo, R. Chiong, Computer methods and programs in biomedicine 187, 105242 (2020).
- [28] M. Tanveer, et al., ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 1 (2020).
- [29] R. Sharma, T. Goel, M. Tanveer, S. Dwivedi, R. Murugan, Applied Soft Computing 106, 107371 (2021).
- [30] M. Tanveer, et al., IEEE Journal of Biomedical and Health Informatics (2021).
- [31] Y. Ding, et al., Radiology 290, 456 (2019).
- [32] P. Tschandl, et al., Nature Medicine 26, 1229 (2020).
- [33] R. Poplin, et al., Nature Biomedical Engineering 2, 158 (2018).
- [34] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Proceedings of the IEEE conference on computer vision and pattern recognition (2017), pp. 4700–4708.
- [35] T. Chen, C. Guestrin, Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (2016), pp. 785–794.
- [36] M. van Assen, L. J. Cornelissen, Jacc-Cardiovascular imaging 13, 1172 (2020).
- [37] C. G. Weaver, F. A. McAlister, Canadian Journal of Cardiology (2021).
- [38] J. H. Chen, S. M. Asch, The New England journal of medicine 376, 2507 (2017).
- [39] T. M. Maddox, J. S. Rumsfeld, P. R. Payne, Jama 321, 31 (2019).
- [40] H. T. Head, Bmj p. 363 (2018).
- [41] C.-Y. Kuo, H.-M. Chiu, Journal of Gastroenterology and Hepatology 36, 267 (2021).
- [42] J. Schneider, M. Agus, arXiv preprint arXiv:2103.01149 (2021).
- [43] J. Bullock, A. Luccioni, K. H. Pham, C. S. N. Lam, M. Luengo-Oroz, Journal of Artificial Intelligence Research 69, 807 (2020).
- [44] M. O’Keeffe, et al., JAMA Internal Medicine 181, 865 (2021).
- [45] J. W. O’Sullivan, et al., BMJ open 8, e018557 (2018).
- [46] D. Lu, et al., Medical image analysis 46, 26 (2018).
- [47] M. Liu, D. Cheng, W. Yan, A. D. N. Initiative, et al., Frontiers in neuroinformatics 12, 35 (2018).
We thank Weibo Pan and Fang Li for downloading the raw data sets from Alzheimer’s Disease Neuroimaging Initiative. Funding: This work is supported by the Project of Guangxi Science and Technology (No. GuiKeAD20297004 to Y. H.) and the National Natural Science Foundation of China (No. 61967002 to S. T.). Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (https://www.fnih.org/). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Author contributions: Y.H. conceptualized the study, designed the models, wrote the code, collected and analyzed the data, and wrote the manuscript. N.W., S.T., L.M., T.H., and Z.J. conceptualized the study and revised the manuscript. F.Z., G.K., X.M., X.G., and R.Z. collected and analyzed the data. Z.Z. and J.Z. directed the project and revised the manuscript. Competing interests: The authors declare no competing financial interest. Data and materials availability: The data from Alzheimer’s Disease Neuroimaging Initiative was used under license for the current study. Applications for access to the dataset can be made at http://adni.loni.usc.edu/data-samples/access-data/. All original code has been deposited at the website BenchCouncil and is publicly available as of the date of publication.
Materials and Methods
Figs. S1 to S3
Tables S1 to S6
Algorithms S1 to S4
Supplementary Materials for
OpenClinicalAI: enabling AI to diagnose diseases in real-world clinical settings
Materials and Methods
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. For up-to-date information, see http://www.adni-info.org.
The data were collected from 67 sites in the United States and Canada [48, 49, 50, 51]. The subject in the dataset aged between and at the first visit. The interval between follow-up visits of a subject is usually greater than six months. Generally, the longer the follow-up time, the longer the interval. The first visit is marked as bl, and each later visit is marked as mxx according to its timing (for example, a visit that takes place six months after the first visit is marked as m06). Detailed characteristics of the subjects are shown in Tables S1 and S2.
The data contain study data, image data, and genetic data compiled by ADNI between 2005 and 2019. Considering the examinations commonly used by clinicians and those of particular concern in AD diagnosis, 13 categories of data are selected.
Base information, usually obtained through consultation, includes demographics, family history, medical history, and symptoms.
Cognition information, usually obtained through consultation and testing, includes Alzheimer’s Disease Assessment Scale, Mini-Mental State Exam, Montreal Cognitive Assessment, Clinical Dementia Rating, and Cognitive Change Index.
Cognition testing, usually obtained through testing, includes ANART, Boston Naming Test, Category Fluency-Animals, Clock Drawing Test, Logical Memory-Immediate Recall, Logical Memory-Delayed Recall, Rey Auditory Verbal Learning Test, and Trail Making Test.
Neuropsychiatric information, usually obtained through consultation, includes Geriatric Depression Scale, Neuropsychiatric Inventory, and Neuropsychiatric Inventory Questionnaire.
Function and behavior information, usually obtained through consultation, includes Function Assessment Question, Everyday Cognition Participant Self Report, and Everyday Cognition Study Partner Report.
Physical and neurological examination, usually obtained through testing, includes Physical Characteristics, Vitals, and neurological examination.
The rest of the examinations include blood testing, urine testing, nuclear magnetic resonance scan, positron emission computed tomography scan with 18-FDG, positron emission computed tomography scan with AV45, gene analysis, and cerebrospinal fluid analysis. It is worth noting that not all categories of information are obtained at every visit of a subject, and the information in each category is often incomplete.
All labeled subjects with at least one of the above categories of information are considered in this study. In total, 2127 subjects with 9593 visits are included in our work. A subject in a visit may require different categories of examination, and every combination of those examinations represents a diagnosis strategy. Thus, for the subject, strategies are generated. The AD and CN subjects are randomly assigned to the training, validation, and test sets. The training set contains 1025 subjects with 3986 visits and generates 180682 strategies; within it, 587 subjects with 1781 visits are AD and generate 80022 strategies, and 466 subjects with 2205 visits are CN and generate 100660 strategies. The validation set contains 73 subjects with 254 visits and generates 11898 strategies; within it, 44 subjects with 127 visits are AD and generate 6008 strategies, and 31 subjects with 127 visits are CN and generate 5890 strategies. The test set contains 1460 subjects with 5353 visits: 109 subjects with 305 visits are AD, 92 subjects with 411 visits are CN, 1082 subjects with 4357 visits are MCI, and 280 subjects with 280 visits are SMC. Notably, the label of a subject may differ between visits, so the per-label subject counts within a set can sum to more than the number of subjects in that set.
Randomization and blinding.
AD and CN subjects, as the known categories, are randomized into training, validation, and test sets using a random function provided by Python 3. The assignment is determined by a float value generated by the random function: subjects whose values fall in [0, 0.8) are assigned to the training set, those in [0.8, 0.85) to the validation set, and those in [0.85, 1] to the test set. All visits belonging to the same subject are only allowed to appear in the same set. MCI and SMC subjects, as the unknown categories, are placed directly into the test set. During the development of the AI system, the test set is inaccessible.
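The subject-level assignment described above can be sketched as follows. This is a minimal illustration that assumes Python's standard `random` module; the function name `assign_split` and the fixed seed are hypothetical and are not reported in the study.

```python
import random

def assign_split(subject_ids, seed=0):
    """Assign each subject (not each visit) to exactly one split."""
    rng = random.Random(seed)  # seed is an assumption made for reproducibility
    splits = {"train": [], "validation": [], "test": []}
    for sid in subject_ids:
        value = rng.random()  # float drawn from [0.0, 1.0)
        if value < 0.8:
            splits["train"].append(sid)
        elif value < 0.85:
            splits["validation"].append(sid)
        else:
            splits["test"].append(sid)
    return splits

# Hypothetical subject IDs; every visit of a subject follows the subject's split.
splits = assign_split(range(1000))
```

Because the draw is made once per subject, all visits of that subject land in the same set, which is the constraint stated above.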
For each category of study data that contains more than one sub-category, all sub-category data are concatenated by RID (the ID of the subject) and VISCODE (the mark of the subject’s visit). For the medical images, we first convert the data from the DICOM format to the NIfTI format with the dcm2nii library. Second, we register the images with the ANTs library [52, 53, 54]. Third, we convert each 3D image to 2D slices and convert the slices from grayscale to RGB. Finally, a pretrained DenseNet201 model is used to extract the features of the 2D slices. For the genetic data, we extract 70 single nucleotide polymorphisms (SNPs) that are strongly associated with AD (Table S5) and use one-hot encoding to represent each SNP [55, 56, 57]. This work proposes a unified data representation framework because the dimensions of each category of data differ, the number of data categories included in each visit differs, and the number of historical visits included for each subject also differs. We represent an examination category in the subject’s visit by an array with a shape of . The shape of our data is , is the number of categories of data for the subject (Fig. S3).
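The concatenation of sub-category data by RID and VISCODE can be illustrated with a small sketch. It assumes each sub-category is available as a list of row dictionaries; the function name `merge_subcategories` and the field names `MMSCORE` and `MOCA` are hypothetical placeholders rather than the study's actual column names.

```python
from collections import defaultdict

def merge_subcategories(tables):
    """Merge rows from several sub-category tables on the (RID, VISCODE) key."""
    merged = defaultdict(dict)
    for rows in tables:
        for row in rows:
            key = (row["RID"], row["VISCODE"])
            for name, value in row.items():
                if name not in ("RID", "VISCODE"):
                    merged[key][name] = value
    return dict(merged)

# Two hypothetical sub-categories of the same data category.
mmse = [{"RID": 1, "VISCODE": "bl", "MMSCORE": 29}]
moca = [
    {"RID": 1, "VISCODE": "bl", "MOCA": 26},
    {"RID": 1, "VISCODE": "m06", "MOCA": 25},
]
merged = merge_subcategories([mmse, moca])
```

A visit that lacks one sub-category simply keeps the fields it has, which matches the observation that the information in each category is often incomplete.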
The proposed model
Our model consists of five parts: , , , , and ( Fig. S1 ). We name the model consisting of , , and as , which can identify the subject from open clinical settings [16, 58, 11]. We name the model consisting of , and as , which can dynamically develop and adjust the diagnosis strategy according to the situation of subjects and existing medical conditions.
The is a multi-task learning model, which simultaneously optimizes the model’s disease diagnosis and data reconstruction ability. The data reconstruction task can improve the diagnosis ability of the model in the open world. The loss function of the model is. The is categorical cross-entropy, and the is mean squared logarithmic error. The is also a multi-task learning model, which simultaneously optimizes, for each of the 12 examinations, whether it should be selected as the next examination for the subject. We introduce a loss function that combines the binary cross-entropy (BCE) loss with uncertainty-based loss weighting [59, 14]. The modified loss function is given by Equation 1:
where is the total number of examinations as the subsequent examination, is the total number of other examinations as the following examination. is an observation noise scalar of the output of examination .
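As a rough illustration of uncertainty-based loss weighting in general, the sketch below applies one common form from the multi-task literature [59] to per-examination BCE terms: each term is scaled by the inverse of a learned noise variance and a log-noise regularizer is added. This is not Equation 1. It omits the counts of positive and negative examinations mentioned above, and the function names and the `log_sigma` parameterization are assumptions.

```python
import math

def bce(y_true, p, eps=1e-7):
    """Binary cross-entropy for one examination output."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def uncertainty_weighted_bce(y_true, y_prob, log_sigma):
    """Sum over examinations of BCE_i / sigma_i^2 + log(sigma_i)."""
    total = 0.0
    for y, p, s in zip(y_true, y_prob, log_sigma):
        precision = math.exp(-2.0 * s)      # 1 / sigma_i^2
        total += precision * bce(y, p) + s  # weighted loss plus log(sigma_i)
    return total

# Three hypothetical examination outputs; the third has a larger noise scalar.
loss = uncertainty_weighted_bce([1, 0, 0], [0.9, 0.2, 0.4], [0.0, 0.0, 0.5])
```

With all `log_sigma` values at zero the expression reduces to a plain sum of BCE terms; a larger noise scalar down-weights the corresponding examination while the regularizer keeps the scalar from growing without bound.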
Although researchers have devoted much effort to the interpretability and internal logic of deep learning, the behavior of deep learning models is still difficult to understand [60, 61]. We do not know whether the diagnosis strategy of the AI model needs to be consistent with that of human experts. Thus, we do not ask clinicians to label the subsequent examination for each examination strategy, nor do we train a model to imitate clinicians’ behavior. In this work, the subsequent examination is labeled by the examination label algorithm (Algorithm S1). An examination is selected as the subsequent examination for the subject if it makes the prediction model () obtain a greater predicted probability for the correct category and smaller predicted probabilities for the other categories.
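The labeling rule stated above can be sketched as follows. This is only an illustration of the stated criterion, not a reproduction of Algorithm S1; the function name, the dictionary-based interface, and the example probabilities are assumptions.

```python
def label_next_examinations(current_probs, probs_with_exam, true_label):
    """Return {exam: 0 or 1}; 1 marks an examination as a useful next step.

    current_probs: class probabilities under the current strategy.
    probs_with_exam: class probabilities after adding each candidate examination.
    """
    labels = {}
    for exam, new_probs in probs_with_exam.items():
        correct_up = new_probs[true_label] > current_probs[true_label]
        others_down = all(
            new_probs[c] < current_probs[c]
            for c in current_probs
            if c != true_label
        )
        labels[exam] = int(correct_up and others_down)
    return labels

# Hypothetical probabilities for a subject whose true label is AD.
labels = label_next_examinations(
    {"AD": 0.6, "CN": 0.4},
    {"mri": {"AD": 0.8, "CN": 0.2}, "blood": {"AD": 0.55, "CN": 0.45}},
    "AD",
)
```

In this toy case the MRI raises the probability of the correct category and lowers the other, so it is labeled 1, while the blood test is labeled 0.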
OpenMax is a modified SoftMax layer that adopts the concept of Meta-Recognition [62, 11, 63]. OpenMax uses the distance between the activation vector (AV) of the sample and the mean activation vector (the mean computed over only the correctly classified training examples) to identify subjects of unknown categories. The deep learning network can be regarded as a feature extractor, and the output of the AV layer can be regarded as the characteristics of the sample. However, the AV layer usually retains only the features most relevant to the classification task, and the features related to unknown categories are not guaranteed to be retained. To alleviate this problem, we replace the output of the AV layer with the abnormal patterns of 14 selected indicators of known categories, chosen according to the Alzheimer’s diagnosis guidelines, to improve the performance of the AI model [64, 65, 66, 67] (Table S6). The OpenMax modified with abnormal patterns is shown in Algorithms S2 and S3.
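The distance computation at the core of this description can be sketched as below. The sketch covers only the mean activation vector and the distance step; the Euclidean metric is an assumption, and the extreme-value calibration that OpenMax applies to these distances is omitted. The same functions apply whether the input vectors are AV-layer outputs or abnormal-pattern indicators.

```python
import math

def mean_activation_vectors(activations, labels, predictions):
    """Per-class mean vector over correctly classified training examples only."""
    sums, counts = {}, {}
    for av, y, y_hat in zip(activations, labels, predictions):
        if y != y_hat:
            continue  # misclassified examples do not contribute to the mean
        acc = sums.setdefault(y, [0.0] * len(av))
        for i, v in enumerate(av):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def distance_to_mav(av, mav):
    """Euclidean distance between a sample vector and a class mean vector."""
    return math.sqrt(sum((a - m) ** 2 for a, m in zip(av, mav)))

# Hypothetical two-dimensional vectors; the last CN example is misclassified.
mavs = mean_activation_vectors(
    [[1, 0], [3, 0], [0, 2], [9, 9]],
    ["AD", "AD", "CN", "CN"],
    ["AD", "AD", "CN", "AD"],
)
d = distance_to_mav([2, 3], mavs["AD"])
```

A sample that lies far from the mean vector of every known category is a candidate for the unknown category.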
The training of our model consists of two stages. The first stage is training the , in which the uses a SoftMax layer as the output layer. The dimension of the output of the in this training stage is 2, corresponding to AD and CN. After training the , a modified OpenMax layer, which estimates the probability of an input being an unknown class, is used to replace the SoftMax layer. The dimension of the output of in the prediction stage is 3, corresponding to AD, CN, and unknown. According to the prediction probabilities of subjects by the , every examination strategy in the training set and validation set is labeled by Algorithm S1. The second stage is training the . The input of the contains the raw data and the prediction probability, and the dimension of the output of the is 12, corresponding to the 12 categories of examination. The model is optimized using mini-batch stochastic gradient descent with Adam and a base learning rate of 0.0005. The experiments are conducted on a Linux server equipped with Tesla P40 and Tesla P100 GPUs.
Because historical information has a significant influence on the diagnosis of Alzheimer’s disease, diagnosis at the first visit, where no historical information is available, differs greatly from diagnosis at later visits with historical data. Therefore, using the training method above, we additionally trained a model for diagnosing Alzheimer’s disease at the first visit based on the subject’s data at the first visit.
Unlike other state-of-the-art AI models, the predictions of our model are dynamic. The prediction algorithm jointly considers the situation of the subject, the condition of the medical institution, and the ability of our model in order to dynamically adjust the diagnosis strategy (Algorithm S4). First, our model generates the probability for every category (AD vs. CN vs. unknown) according to the current input data of the subject. Second, if the probability of a category exceeds its threshold (AD , CN , unknown ), the model outputs the corresponding label. Otherwise, it adjusts the examination strategy by selecting the subsequent examination according to the situation of the subject and the medical institution, and returns to the first step. Finally, if all diagnostic strategies have been tried and the model still cannot obtain a probability that exceeds the threshold, it outputs unknown.
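The loop described above can be sketched as follows. This is a schematic illustration rather than Algorithm S4: the callables `predict`, `select_next_exam`, and `acquire`, the threshold values, and the toy stand-ins below are all hypothetical.

```python
def dynamic_diagnosis(predict, select_next_exam, acquire, data, available_exams, thresholds):
    """Request examinations until some class probability exceeds its threshold."""
    remaining = set(available_exams) - set(data)
    while True:
        probs = predict(data)  # step 1: class probabilities from the current data
        # Step 2: check classes from most to least probable against their thresholds.
        for label, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
            if p > thresholds[label]:
                return label, data
        if not remaining:  # every available examination has been tried
            return "unknown", data
        exam = select_next_exam(data, probs, remaining)
        remaining.discard(exam)
        data = {**data, exam: acquire(exam)}

# Toy stand-ins: confidence grows with the number of examinations performed.
def toy_predict(data):
    confidence = min(0.5 + 0.2 * len(data), 0.99)
    rest = (1.0 - confidence) / 2
    return {"AD": confidence, "CN": rest, "unknown": rest}

def toy_select(data, probs, remaining):
    return sorted(remaining)[0]

label, used = dynamic_diagnosis(
    toy_predict, toy_select, acquire=lambda exam: 1.0,
    data={"base": 1.0}, available_exams=["base", "cognition", "mri"],
    thresholds={"AD": 0.85, "CN": 0.85, "unknown": 0.85},
)
```

Restricting `available_exams` to what a given institution can perform is how the sketch reflects differing medical conditions: the loop never requests an examination outside that set.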
To estimate the uncertainty of each evaluation index of the AI model, a non-parametric bootstrap method is applied to calculate the confidence intervals (CI) for the evaluation index. In this work, we calculate the 95% CI for every evaluation index. We randomly sample cases from the test set and evaluate the AI model on the sampled set for every evaluation index. repeated trials are executed, and values of the evaluation index are generated. The 95% CI is obtained from the 2.5 and 97.5 percentiles of the distribution of the evaluation index values.
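A percentile bootstrap of this kind can be sketched as below. The number of trials, the sample size, the seed, and the use of accuracy as the evaluation index are assumptions chosen for illustration; they are not the settings of the study.

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_trials=1000, sample_size=None, seed=0):
    """Percentile bootstrap 95% CI for a metric evaluated on resampled cases."""
    rng = random.Random(seed)
    n = len(y_true)
    sample_size = sample_size or n
    values = []
    for _ in range(n_trials):
        # Sample case indices with replacement from the test set.
        idx = [rng.randrange(n) for _ in range(sample_size)]
        values.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    values.sort()
    # Nearest-rank approximation of the 2.5 and 97.5 percentiles.
    lower = values[int(0.025 * (n_trials - 1))]
    upper = values[int(0.975 * (n_trials - 1))]
    return lower, upper

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 80 of 100 predictions are correct.
y_true = [1] * 80 + [0] * 20
y_pred = [1] * 100
low, high = bootstrap_ci(y_true, y_pred, accuracy)
```

Any other evaluation index can be passed as `metric`, provided it takes the resampled labels and predictions and returns a single number.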
| ||Data set||Training set||Validation set||Test set|
|More than one||25||10||0||18|
|Visit||Data set||Training set||Validation set||Test set|
|Diagnosis strategies||Visit number of subject|
An examination marked as 1 means that the medical institution cannot perform this examination for the subject. An examination marked as 0 indicates that (1) the medical institution can perform this examination for the subject, or (2) OpenClinicalAI does not request this examination during the diagnosis of the subject, even though the medical institution may not be able to perform it. It is worth noting that the examination ability in the test set may therefore differ from that assumed for other AI systems, since 0 may mean that OpenClinicalAI does not request the examination while the medical institution may be unable to perform it.
Nausea, Vomiting, Diarrhea, Constipation, Abdominal discomfort, Sweating, Dizziness, Low energy, Drowsiness, Blurred vision, Headache, Dry mouth, Shortness of breath, Coughing, Palpitations, Chest pain, Urinary discomfort (e.g., burning), Urinary frequency, Ankle swelling, Musculoskeletal pain, Rash, Insomnia, Depressed mood, Crying, Elevated mood, Wandering, Fall, Other.
Nausea to Rash
Nausea to Other
The CCI scale is in https://adni.bitbucket.io/reference/cci.html.
CCI1 to CCI12
CCI1 to CCI20
The CDR scale is in https://adni.bitbucket.io/reference/cdr.html.
The Alzheimer’s Disease Assessment Scale-Cognitive scale is in https://adni.bitbucket.io/reference/adas.html.
Q1 to Q11
Q1 to Q13
The Mini Mental State Exam scale is in https://adni.bitbucket.io/reference/mmse.html.
The Montreal Cognitive Assessment scale is in https://adni.bitbucket.io/reference/moca.html.
The calculation method of Preclinical Alzheimer’s Cognitive Composite is in https://ida.loni.usc.edu/pages/access/studyData.jsp?categoryId=16&subCategoryId=43.
-  R. C. Petersen, et al., Neurology 74, 201 (2010).
-  M. W. Weiner, et al., Alzheimer’s & Dementia 6, 202 (2010).
-  M. W. Weiner, et al., Alzheimer’s & Dementia 11, 865 (2015).
-  M. W. Weiner, et al., Alzheimer’s & Dementia 13, 561 (2017).
-  S. Darkner, Fdg-pet template mni152 1mm (2013).
-  S. M. Smith, et al., Neuroimage 23, S208 (2004).
-  N. Seneca, C. Burger, I. Florea pp. 1–3 (2011).
-  J.-C. Lambert, et al., Nature genetics 45, 1452 (2013).
-  B. W. Kunkle, et al., Nature genetics 51, 414 (2019).
-  R. S. Desikan, et al., PLoS medicine 14, e1002258 (2017).
-  P. Perera, et al., Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 11814–11823.
-  A. Kendall, Y. Gal, R. Cipolla, Proceedings of the IEEE conference on computer vision and pattern recognition (2018), pp. 7482–7491.
-  Y. Zhang, Q. V. Liao, R. K. Bellamy, Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020), pp. 295–305.
-  P. Linardatos, V. Papastefanopoulos, S. Kotsiantis, Entropy 23, 18 (2021).
-  Z. Ge, S. Demyanov, Z. Chen, R. Garnavi, British Machine Vision Conference 2017 (British Machine Vision Association and Society for Pattern Recognition, 2017).
-  W. J. Scheirer, A. Rocha, R. Michaels, T. E. Boult, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 33, 1689 (2011).
-  C. R. Jack Jr, et al., Alzheimer’s & dementia 7, 257 (2011).
-  R. A. Sperling, et al., Alzheimer’s & dementia 7, 280 (2011).
-  M. S. Albert, et al., Alzheimer’s & dementia 7, 270 (2011).
-  M. C. Donohue, et al., JAMA neurology 71, 961 (2014).
-  D. P. Kingma, J. Ba, ICLR (Poster) (2015).
-  B. Efron, R. J. Tibshirani, An introduction to the bootstrap (CRC press, 1994).
-  D. Sculley, Proceedings of the 19th international conference on World wide web (2010), pp. 1177–1178.