
Clinical Recommender System: Predicting Medical Specialty Diagnostic Choices with Neural Network Ensembles

by Morteza Noshad, et al.

The growing demand for key healthcare resources such as clinical expertise and facilities has motivated the emergence of artificial intelligence (AI) based decision support systems. We address the problem of predicting clinical workups for specialty referrals. As an alternative to manually-created clinical checklists, we propose a data-driven model that recommends the necessary set of diagnostic procedures based on the patients' most recent clinical record extracted from the Electronic Health Record (EHR). This has the potential to enable health systems to expand timely access to initial medical specialty diagnostic workups for patients. The proposed approach is based on an ensemble of feed-forward neural networks and achieves significantly higher accuracy compared to the conventional clinical checklists.



1. Introduction

The growing limitation of the scarcest healthcare resource, clinical expertise, is an issue that has long been at the front line of the healthcare industry. This shortage of clinician expertise is particularly acute in access to medical specialty care. In some locations, patients wait several months for outpatient specialty consultation visits, which contributes to higher mortality in the US (Prentice and Pizer, 2007). However, potential solutions have been slow to come.

Our goal is to develop a radically different paradigm for specialty consultations by developing a tier of automated guides that proactively enable initial workup that would otherwise be delayed awaiting an in-person visit. We focus on recommending the clinical orders for medications and diagnostic tests from outpatient consultations that any clinician could initiate with adequate support. This system can consolidate specialty consultation needs and open greater access to effective care for more patients. A key scientific barrier to realizing this vision is the lack of clinically acceptable tools powered by robust methods for collating clinical knowledge, with continuous improvement through clinical experience, crowdsourcing, and machine learning. Existing tools include electronic consults that allow clinicians to email specialists for advice, but their scale remains constrained by the availability of human clinical experts. Electronic order checklists (order sets) are in turn limited by the effort to maintain and adapt content to individual patient contexts (Middleton et al., 2016).

Machine learning approaches are revolutionizing various healthcare areas such as medical imaging (Giger, 2018), diagnostic models (Choi et al., 2016; Miotto et al., 2016), and virtual health assistants (Kenny et al., 2008) by introducing more accurate, low-cost, fast, and scalable solutions. Automated diagnostic workflow recommendation is another emerging application of machine learning, which has so far mainly focused on predicting the need for specific medical imaging (Merdan et al., 2018). However, only a few previous studies have explored the possibility of using machine learning approaches to design a scalable intelligent system that can recommend diagnostic procedures of any type to patients, as an alternative to the conventional clinical checklists. Lakshmanaprabu et al. (2019) and Chen et al. (2017) apply recommender systems based on probabilistic topic modeling and neural networks to predict inpatient clinical order patterns. Beyond predicting workflows, recommender systems have also been used for diagnosis in several previous papers (Komkhao and Halang, 2013; Hao and Blair, 2016).

In this work, we address the problem of predicting outpatient specialty workflows. Specifically, our objective is to predict which procedures would be ordered at the first specialty visit for a patient referred by a primary care physician (PCP), based on their medical records. Such predictions could provide automated decision support and recommendations at primary care visits or specialist pre-visit screenings, allowing diagnostic procedures to be completed while the patient is awaiting their in-person specialist visit. As opposed to manually-created medical checklists, which are mainly based on diagnosis (e.g., common laboratory and imaging tests a clinician can order to evaluate diabetes), the proposed data-driven algorithm utilizes the patient’s previous lab results, diagnosis codes, and the most recent procedures as input and recommends follow-up lab orders and procedures. The proposed recommender model offers several key features: scalability, to answer unlimited queries on demand; maintainability, through automated statistical learning; adaptability, to respond to evolving clinical practices; and personalizability of individual suggestions, with greater accuracy than manually-authored checklists.

We categorized the input EHR data into three groups: diagnostic data, including the diagnosis codes and lab results; procedures ordered by the referring PCP; and the specialist being referred to (recognized by their ID). This grouping of the data lets us use appropriate base models for each of the input data categories and process them separately. The first base model is a neural network based multi-label classifier with diagnostic data as input and specialty procedures as labels. The second model is a collaborative filtering AutoEncoder (AE) with the PCP and specialty procedures as input and output, respectively. The designed collaborative filtering AutoEncoder is similar to the deep learning based collaborative models proposed in (Zhang et al., 2019; Kuchaiev and Ginsburg, 2017). The predictions from the base models are then fed into an ensemble neural network to improve the predictions from each of the base learners. Unlike traditional ensemble methods that use only the ratings from base learners to improve predictions (Moghimi et al., 2016), the proposed approach leverages the specialist ID number as side information to personalize the recommendations for both the patient and the specialty provider. Here, we develop and measure the potential advantages of the proposed method compared to clinical checklists and several other baselines.

2. Cohort and Data Description

In this work, we address the prediction of future clinical diagnostic steps for outpatients referred to the Stanford Health Care Endocrinology Clinic between Jan 2008 and Dec 2018. To have adequate access to the patients’ clinical records, we only consider those referred by a PCP within the Stanford Health Care Alliance network, which totally includes patients. We aimed to predict the procedures (primarily lab and imaging tests) the endocrinologist would order at the first in-person visit. Because the procedures ordered could depend on the time window between the referral and the first specialist visit, we restricted the cohort to only those patients with a first specialist visit within 4 months after referral.

For each patient in our cohort we used electronic health record (EHR) data to extract all of the lab results within two months before the referral as well as the procedures ordered by the referring PCP. We further include the receiving specialist’s identity (specialist ID) as side information to allow the model to personalize predictions per specialist as well.
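The two time windows described above can be sketched as simple date filters. This is an illustrative sketch only: the function names and the tuple layout for lab records are hypothetical, and approximating a month as 30 days is an assumption, since the text does not say how the windows were computed.

```python
from datetime import date, timedelta

# Assumption: a month is approximated as 30 days; the paper does not state
# how the 4-month and 2-month windows were measured.
MAX_REFERRAL_TO_VISIT = timedelta(days=4 * 30)
LAB_LOOKBACK = timedelta(days=2 * 30)


def in_cohort(referral_date, first_visit_date):
    """True if the first specialist visit is within the window after referral."""
    if first_visit_date is None:
        return False  # patient never had a specialist visit
    gap = first_visit_date - referral_date
    return timedelta(0) <= gap <= MAX_REFERRAL_TO_VISIT


def labs_before_referral(labs, referral_date):
    """Keep (lab_name, result_date, value) tuples inside the lookback window."""
    start = referral_date - LAB_LOOKBACK
    return [lab for lab in labs if start <= lab[1] <= referral_date]
```

For example, with a referral on 2015-03-01, a first visit on 2015-05-15 would be kept under these assumptions, while a first visit on 2015-09-01 would be excluded.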

3. Proposed Method

The proposed method is an ensemble model that takes the patient’s clinical information and the specialist ID as input and predicts the future procedures. In order to feed the data into the model and train the base and ensemble models, we need to pre-process the data to the appropriate format.

3.1. Data Pre-Processing

The defined cohort includes patients and, within the defined cohort, there are unique labs, unique procedures, and unique diagnosis codes. Given that it would not be practical to train a model with several thousand data dimensions and output labels using only samples, we restricted each data category to only the top most frequent types. Specifically, we only considered the top most common labs and top procedures. We also restricted the diagnosis codes to the top most-prevalent codes related to endocrinology: Diabetes mellitus Type I or II, Hypercalcemia, Hyperlipidemia, Hypothyroidism, Hyperthyroidism, Osteopenia, Thyroid cancer, Thyroid nodule, and Obesity.

The raw lab results in the EHR data are mainly continuous values, which we converted into a one-hot encoded format using the clinical laboratory-defined “normal range” for each value. Thus, each lab value is embedded into a three-dimensional binary vector, where the first dimension represents whether the lab value is available for the patient and the second and third dimensions indicate whether the lab value is low or high (in the case of a normal result, both are 0). Thus, if a patient has any missing clinical information, the encoding accounts for it in the encoded data format. Finally, the samples are randomly shuffled and split into the train and test sets with and of the entire sample sizes, respectively.
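The three-dimensional lab encoding described above can be illustrated with a short sketch. The function names, the dictionary layout, and the example normal ranges are hypothetical and chosen only for illustration (they are not clinical reference values); treating the range bounds themselves as normal is also an assumption.

```python
def encode_lab(value, low, high):
    """Encode one lab result as [available, is_low, is_high].

    `value` is None when the lab was not measured. `low` and `high` are the
    bounds of the laboratory-defined normal range (assumed inclusive).
    """
    if value is None:
        return [0, 0, 0]  # missing lab: all three dimensions are zero
    return [1, int(value < low), int(value > high)]


def encode_patient(labs, normal_ranges):
    """Concatenate the 3-dim encodings of every lab in a fixed (sorted) order."""
    features = []
    for name in sorted(normal_ranges):
        low, high = normal_ranges[name]
        features.extend(encode_lab(labs.get(name), low, high))
    return features


# Illustrative ranges only, not clinical reference values.
ranges = {"Calcium": (8.5, 10.5), "TSH": (0.4, 4.0)}
print(encode_patient({"TSH": 6.1}, ranges))  # [0, 0, 0, 1, 0, 1]
```

A missing lab and a normal lab differ only in the first dimension, which is how the encoding keeps missing values distinguishable from in-range results.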

3.2. Ensemble Model

Figure 1. The proposed model consists of two base models which are trained separately and an ensemble model that combines the prediction results from the base models using the trained neural network.

The proposed model consists of two base models which are trained separately and an ensemble model that combines the prediction results from the base models using the trained neural network (Figure 1). The first base model is a neural network based multi-label classifier with diagnostic data as input and specialty procedures as labels. The neural network consists of fully connected layers with the dimensions and rectified linear unit (ReLU) activations. The network is trained using stochastic gradient descent (SGD) with the learning rate and mean square error (MSE) loss function. The network is trained for epochs with the batch size of . After each layer, a dropout regularization with is used to prevent overfitting. We refer to this network as the diagnostic model (abbreviated as DM). The second base model is an AutoEncoder (AE) based collaborative filtering architecture with the PCP and specialty procedures as input and output. The AE consists of fully connected layers with dimensions . The predictions from the base models are then fed into an ensemble neural network which includes the specialist ID as side information to get the final predicted specialty procedures. The ensemble neural network consists of fully connected layers with the dimensions and each output neuron represents the score for a procedure ID. For all of the neural network based methods we performed several hyperparameter optimizations.

The scores are normalized within the range [0, 1] and could be interpreted as an uncalibrated probability that the corresponding procedure is ordered by the specialist. Based on the predicted scores for the procedures, we can take two different recommendation approaches. The first method applies a fixed threshold: if the score of a given procedure is above the threshold, that procedure is recommended. Therefore, different numbers of procedures may be recommended for different patients. In the second approach, the algorithm always recommends a fixed number of top-scoring procedures. Thus, in this approach only the order of the scores is important, not their values. In all of our experiments we used the recommendation based on a fixed score threshold since it resulted in better performance (reported in the Results section).
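The two recommendation approaches can be contrasted in a few lines. This is a minimal sketch: the function names, the dictionary of scores, and the example threshold and count are hypothetical, and the score values are made up for illustration.

```python
def _ranked(scores):
    """Procedure IDs ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def recommend_by_threshold(scores, threshold):
    """Recommend every procedure whose score is above a fixed threshold."""
    return [pid for pid in _ranked(scores) if scores[pid] > threshold]


def recommend_top_k(scores, k):
    """Always recommend the k highest-scoring procedures."""
    return _ranked(scores)[:k]


# Made-up scores for one patient.
scores = {"TSH": 0.91, "HbA1c": 0.40, "Thyroid ultrasound": 0.75, "Lipid panel": 0.12}
print(recommend_by_threshold(scores, 0.5))  # ['TSH', 'Thyroid ultrasound']
print(recommend_top_k(scores, 3))           # ['TSH', 'Thyroid ultrasound', 'HbA1c']
```

The threshold rule lets the number of recommendations vary per patient, whereas the fixed-count rule depends only on the ranking and can include low-scoring items, as with the third item in the example.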

4. Experiment Design

The problem of predicting the specialty procedures using the lab results, diagnosis codes, and the PCP procedures is, in general, a multi-label classification problem, and recommender system methods cannot be directly applied. However, we can split the clinical data into two major groups such that we can separately apply a multi-label classification model to the first group (lab results and diagnosis codes) and a collaborative filtering model to the second group (PCP procedures), which is of the same type as the output labels (specialty procedures). We compared the results to two standard collaborative filtering methods, i.e., singular value decomposition (SVD) and probabilistic matrix factorization (PMF). We also compared the performance of the proposed ensemble method to each of the base models, i.e., the diagnostic model (DM) and AutoEncoder (AE), as well as to a conventional clinical checklist. The clinical checklist was mainly retrieved and reviewed by our clinical author Ivana Jankovic from clinical guideline documents for each of the main referral diagnoses, to collate a checklist of relevant diagnostic procedure orders that should be considered for each. We also compared the results to an aggregate multi-label classifier based on neural networks (abbreviated as ANN in the figures) with fully connected layers, which utilizes all the lab results, diagnosis codes, PCP procedures, and specialist ID as a unified input and predicts the specialist-ordered procedures.

5. Results

By varying the score threshold used to convert each method's predicted scores into binary predictions for each procedure order, we can obtain different performance metrics, including precision (positive predictive value, the fraction of predicted procedure orders the specialist actually ordered) and recall (sensitivity, the fraction of orders the specialist actually ordered that were predicted). The methods are therefore evaluated in terms of precision, recall, and area under the receiver operating characteristic curve (AUROC). Figure 2 shows the precision-recall graph of the proposed ensemble method compared to the base models (diagnostic model and AutoEncoder), collaborative filtering methods (SVD and PMF), the aggregate neural network model (ANN), and a clinical checklist. Precision at different fixed values of recall is shown in Figure 3. The ensemble method achieves a better precision-recall trade-off compared to the other models. The methods are also compared in terms of AUROC (Figure 4). The ensemble method achieves the highest AUROC of compared to the other methods.
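As a concrete reading of these definitions, the sketch below computes precision and recall at a given score threshold. The function name and data layout are hypothetical, and pooling all (patient, procedure) pairs before computing the ratios (micro-averaging) is an assumption, since the text does not state how the metrics were aggregated across patients.

```python
def precision_recall(scores, truth, threshold):
    """Precision and recall pooled over all (patient, procedure) pairs.

    scores: one list of procedure scores per patient.
    truth:  matching lists of 0/1 flags (1 = specialist ordered it).
    Pooling across patients (micro-averaging) is an assumption.
    """
    tp = fp = fn = 0
    for patient_scores, patient_truth in zip(scores, truth):
        for s, y in zip(patient_scores, patient_truth):
            predicted = s > threshold
            if predicted and y:
                tp += 1  # recommended and actually ordered
            elif predicted:
                fp += 1  # recommended but not ordered
            elif y:
                fn += 1  # ordered but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [[0.9, 0.2, 0.6], [0.4, 0.8, 0.1]]
truth = [[1, 0, 0], [1, 1, 0]]
print(precision_recall(scores, truth, 0.3))  # (0.75, 1.0)
```

Sweeping `threshold` over a grid of values and collecting the resulting pairs traces out a precision-recall curve of the kind shown in Figure 2.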

Figure 2. Precision-Recall graph of the proposed ensemble method compared to the base models (diagnostic model and AutoEncoder), collaborative filtering methods (SVD and PMF), the aggregate neural network model (ANN), and clinical checklist.
Figure 3. Precision at fixed recall for the proposed ensemble method compared to the base models (diagnostic model and AE), collaborative filtering methods (SVD and PMF), the aggregate neural network model (ANN).
Figure 4. AUROC of the proposed ensemble method (EM) compared to the base models (diagnostic model and AE), collaborative filtering methods (SVD and PMF), the aggregate neural network model (ANN). Error bars show the confidence interval computed using bootstrapped resampling.
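
The bootstrapped confidence interval mentioned in the Figure 4 caption can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, the percentile bootstrap is one common variant chosen here as an assumption, and the 95% level (`alpha=0.05`) and 1000 resamples are likewise assumed defaults.

```python
import random


def auroc(scores, labels):
    """AUROC as the chance a random positive outranks a random negative.

    Ties count as half a win. Returns None if either class is absent.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed variant and level)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_resamples):
        # Resample (score, label) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        value = auroc([scores[i] for i in idx], [labels[i] for i in idx])
        if value is not None:  # skip resamples containing a single class
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.4, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(auroc(scores, labels))  # 0.875
```

The pairwise AUROC above is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be preferable on a full test set.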

Figure 5 shows example model inputs and outputs: an example patient’s data up to the time of specialty referral, the actual subsequent specialist procedure orders, and the procedure orders predicted from a diagnosis-based clinical checklist or from our proposed ensemble method with a fixed score threshold. Finally, we compare the performance of the ensemble method using the two selection approaches based on the predicted scores (discussed in Section 3.2). As shown in Figure 6, the selection method based on a fixed threshold performs better than the selection method based on a fixed number of top-scoring procedures.

Figure 5. A real-world example of a patient with the true specialist orders, the predicted procedures based on clinical checklist and the proposed ensemble method.
Figure 6. Performance comparison of two different recommendation approaches based on the predicted scores for the ensemble method.

6. Discussion

The generalizability of the proposed model to more diverse types of patients with different conditions depends on several key assumptions. As mentioned in Section 3.1, due to the model’s learning limitations with respect to the number of patients, we could include only a portion of the labs, diagnosis codes, and procedures as features and labels in our data, which degrades the performance of the recommendation model. Further, the recommendations are learned from the specialists’ ordering preferences, so they do not necessarily represent correct or incorrect orders.

7. Conclusion

In this work we addressed the problem of predicting outpatient specialty diagnostic workups, specifically the diagnostic procedure orders for adult Endocrinology referrals. We proposed a data-driven model that recommends follow-up procedure orders based on patients’ clinical information. Several evaluations illustrate that the proposed method can outperform conventional clinical checklists and baseline methods.


  • J. H. Chen, M. K. Goldstein, S. M. Asch, L. Mackey, and R. B. Altman (2017) Predicting inpatient clinical order patterns with probabilistic topic models vs conventional order sets. Journal of the American Medical Informatics Association 24 (3), pp. 472–480. Cited by: §1.
  • E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun (2016) Doctor AI: predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pp. 301–318. Cited by: §1.
  • M. L. Giger (2018) Machine learning in medical imaging. Journal of the American College of Radiology 15 (3), pp. 512–520. Cited by: §1.
  • F. Hao and R. H. Blair (2016) A comparative study: classification vs. user-based collaborative filtering for clinical prediction. BMC medical research methodology 16 (1), pp. 172. Cited by: §1.
  • P. Kenny, T. Parsons, J. Gratch, and A. Rizzo (2008) Virtual humans for assisted health care. In Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments, pp. 1–4. Cited by: §1.
  • M. Komkhao and W. A. Halang (2013) Recommender systems in telemedicine. IFAC Proceedings Volumes 46 (28), pp. 28–33. Cited by: §1.
  • O. Kuchaiev and B. Ginsburg (2017) Training deep autoencoders for collaborative filtering. arXiv preprint arXiv:1708.01715. Cited by: §1.
  • S. Lakshmanaprabu, S. N. Mohanty, S. Krishnamoorthy, J. Uthayakumar, K. Shankar, et al. (2019) Online clinical decision support system using optimal deep neural networks. Applied Soft Computing 81, pp. 105487. Cited by: §1.
  • S. Merdan, K. Ghani, and B. Denton (2018) Integrating machine learning and optimization methods for imaging of patients with prostate cancer. In Machine Learning for Healthcare Conference, pp. 119–136. Cited by: §1.
  • B. Middleton, D. Sittig, and A. Wright (2016) Clinical decision support: a 25 year retrospective and a 25 year vision. Yearbook of medical informatics 25 (S 01), pp. S103–S116. Cited by: §1.
  • R. Miotto, L. Li, B. A. Kidd, and J. T. Dudley (2016) Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific reports 6 (1), pp. 1–10. Cited by: §1.
  • M. Moghimi, S. J. Belongie, M. J. Saberian, J. Yang, N. Vasconcelos, and L. Li (2016) Boosted convolutional neural networks. In BMVC, Vol. 5, pp. 6. Cited by: §1.
  • J. C. Prentice and S. D. Pizer (2007) Delayed access to health care and mortality. Health services research 42 (2), pp. 644–662. Cited by: §1.
  • S. Zhang, L. Yao, A. Sun, and Y. Tay (2019) Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR) 52 (1), pp. 1–38. Cited by: §1.