RadFusion: Benchmarking Performance and Fairness for Multimodal Pulmonary Embolism Detection from CT and EHR

11/23/2021 · by Yuyin Zhou, et al.

Despite the routine use of electronic health record (EHR) data by radiologists to contextualize clinical history and inform image interpretation, the majority of deep learning architectures for medical imaging are unimodal, i.e., they only learn features from pixel-level information. Recent research revealing how race can be recovered from pixel data alone highlights the potential for serious biases in models which fail to account for demographics and other key patient attributes. Yet the lack of imaging datasets which capture clinical context, inclusive of demographics and longitudinal medical history, has left multimodal medical imaging underexplored. To better assess these challenges, we present RadFusion, a multimodal, benchmark dataset of 1794 patients with corresponding EHR data and high-resolution computed tomography (CT) scans labeled for pulmonary embolism. We evaluate several representative multimodal fusion models and benchmark their fairness properties across protected subgroups, e.g., gender, race/ethnicity, age. Our results suggest that integrating imaging and EHR data can improve classification performance and robustness without introducing large disparities in the true positive rate between population groups.


1 Introduction

With recent advances in deep learning, AI systems have received unprecedented attention for processing both medical imaging data, e.g., computed tomography (CT) scans, and clinical data found in the electronic health record (EHR). However, architectures which jointly learn feature representations from images and structured EHR data remain underexplored. Clinical data availability during image interpretation is particularly important in radiology, where accurate diagnosis on imaging relies significantly on pre-test probability, prior diagnoses, clinical and laboratory data, and prior imaging [24]. For example, a survey showed that more than 85% of radiologists consider clinical context vital for radiological exam interpretation [33].

While medical imaging benchmarks such as LiTS [10], BTCV [32], and TCIA-Pancreas [45] are ubiquitous in the community, the number of multimodal medical datasets remains limited. Existing multimodal datasets largely focus on enriching a single modality, such as including multiple imaging protocols or longitudinal image captures for tracking disease progression [40], or combining medical images with radiology reports (MIMIC-CXR [28], CheXpert [26], Chest-Xray8 [50]) for chest X-ray diagnosis. Few datasets provide longitudinal health record data together with imaging, with the notable exception of the UK Biobank Imaging Study, which includes multi-organ magnetic resonance imaging (MRI), X-ray, and ultrasound imaging in addition to health record data for 100,000 individuals [38]. The status quo of not providing demographic and clinical history data as part of imaging datasets is concerning, as recent research reveals the degree to which pixel data alone encodes protected attributes like race [4], potentially creating hidden biases in medical imaging models.

To help address these challenges, we present RadFusion, a large-scale multimodal pulmonary embolism database to advance future research into multimodal fusion strategies for integrating 3D medical imaging data and patient EHR data. Our dataset contains high-quality CT images and patient EHR data collected for 1837 pulmonary embolism cases. To ensure that all collected cases are representative and of high quality, the training, validation and testing sets were selected by stratified random sampling from an original cohort pool of 108,991 studies, followed by careful removal of studies with incorrect protocols, significant artifacts, and other quality issues. The ground truth label for each study was curated by two board-certified radiologists and verified by a senior radiologist.

To understand the role of each modality, we benchmark the performance of an imaging-only, an EHR-only, and a multimodal fusion model using six different evaluation metrics. We further examine the robustness and fairness of these three architectures by stratifying performance across key protected patient subgroups (i.e., gender, age and race). Specifically, model fairness is quantified by analyzing equality of opportunity [19], i.e., the difference in true positive rate (TPR) across patient groups. Our empirical results suggest that the multimodal fusion model not only achieves consistent performance gains compared with the imaging-based and EHR-based models, but also demonstrates stronger robustness across different population groups without introducing large TPR disparities. This finding suggests that, beyond developing more advanced medical image representation models, more research attention should be directed toward designing better multimodal fusion models (e.g., integrating EHR data and CT). To facilitate future research in this direction, we release our RadFusion dataset and benchmark different methods in this work. To summarize, our contributions are three-fold:


  • We release a large-scale multimodal pulmonary embolism detection dataset, RadFusion, consisting of 1837 CT imaging studies (comprising 600,000+ 2D slices) for 1794 patients and their corresponding EHR summary data. To the best of our knowledge, this is the first public dataset combining 3D medical imaging with longitudinal health information extracted from EHRs.

  • We benchmark representative imaging-based, EHR-based and multimodal fusion models on the RadFusion dataset, and provide a rigorous ablation study based on six different evaluation measures for a better understanding of the effect of each modality.

  • We report TPR disparities among different demographic groups to measure model fairness. Our results suggest that the integration of imaging and EHR modalities is a promising direction for improving model performance and robustness without introducing large TPR gaps.

2 Related Works

Pulmonary embolism detection.

Pulmonary Embolism (PE) is a life-threatening medical condition, accounting for almost 300,000 hospitalizations and 180,000 deaths in the United States every year [21]. Even though the mortality rate among PE patients is high, studies have shown that prompt recognition and immediate initiation of treatment for PE can significantly decrease morbidity and mortality [35, 34]. Definitive diagnosis of PE is made via computed tomography pulmonary angiography (CTPA), which is interpreted manually by radiologists. Unfortunately, patients with PE often experience more than 6 days of delay in diagnosis, and a quarter of patients are misdiagnosed during their first visit [20, 2]. The long delay and high misdiagnosis rate are in part due to the rapid increase in utilization of CTPA (27-fold in emergency settings). Many studies have attempted to automate PE diagnosis and patient triage to alleviate the burden on radiologists [52, 47, 22]. However, few studies have directly included patient clinical history and demographic information as inputs to their models, even though patient EHR is crucial for accurate interpretation of medical images [16].

Multimodal fusion.

Pertinent clinical context is essential for accurate diagnostic decisions during medical image interpretation. Studies have repeatedly shown that clinical history, vitals and laboratory data are crucial for accurate interpretation of medical images, and that a lack of access to patient EHR significantly hinders a radiologist's diagnostic ability. Like radiologists, medical imaging models that can leverage patient EHR may become more clinically relevant and accurate [23]. Several studies have attempted to integrate clinical context to overcome the limitations of imaging-only models and observed a boost in performance [9, 36, 31, 51, 3, 25]. However, most of these studies rely on only a few manually selected clinical features.

Fairness in healthcare.

While machine learning models can be powerful tools for automated clinical decision making, they may also exhibit large performance disparities across protected subgroups, leading to differences in treatment and care delivery [14]. Equitable use of healthcare data for training machine learning models and applications of algorithmic fairness have received considerable research attention [8, 13, 42, 43, 48]. With recent advances in machine learning, different types of bias have also been identified, such as gender [17] and racial bias [15]. In the healthcare domain, many studies have indicated that health disparities exist across protected attributes such as race, sex and age [30, 41, 14, 49]. In the medical imaging domain, Seyyed-Kalantari et al. examined the TPR disparities of state-of-the-art X-ray classifiers among different racial groups and identified that leveraging multiple sources of data can lead to smaller disparities [46]. However, to date, the fairness of multimodal medical models has not been studied. Our study provides the first fairness evaluation of a multimodal medical model in the use case of pulmonary embolism detection from CT and EHR.

3 Dataset

Data Acquisition.

We retrieved a total of 108,991 CTPA studies from Stanford University Medical Center (SUMC) with the approval of the Stanford Institutional Review Board (IRB). All studies were conducted between 2000 and 2016 and performed under the pulmonary embolism protocol. We applied an NLP model by Banerjee et al. [5, 6] to the corresponding radiology reports to generate pseudo labels (positive or negative for PE) for all studies. Using the generated labels, we retrieved 2500 1.25 mm axial CT studies, with approximately equal distribution of positive and negative PE labels, from the local picture archiving and communication system (PACS) for manual review and labeling. Two radiologists reviewed each study to remove cases with incorrect protocols, significant artifacts, poor imaging quality, and non-diagnostic studies. We provide the data characteristics for the 1837 studies that remained in Table 1. In addition to CT images, we created a view of each patient's EHR from the SUMC Epic database within an observational window of 12 months prior to their CT examination date, including demographics, vitals, inpatient/outpatient medications, ICD codes and lab test results. We processed these structured EHR data (described in Section 4.1) and provide a summary of each patient's health record as part of our dataset. Note that radiology reports are not included in our dataset due to patient privacy concerns and HIPAA compliance.

Annotation.

Ground truth labels for all studies were generated by manual review. Two board-certified radiologists (with 8 and 10 years of experience) separately reviewed each CT scan to diagnose the presence of PE, classify the subtype of PE, and annotate the slices that contain PE. The standard descriptions by Remy-Jardin et al. [44] were used to define central positive, segmental positive, subsegmental positive and negative PE, indicating the location or arterial branch in which the PE resides. We made slight modifications to define subsegmental-only PE as cases where the largest defect is located at the subsegmental level on a spiral CT allowing satisfactory visualization of all pulmonary arteries at the segmental level or higher. The two radiologists had high inter-rater reliability, with a Cohen's kappa score of 0.959. For the few cases with conflicting annotations, labels were determined by the more senior of the two interpreting radiologists. We randomly partitioned the studies into train/validation/test splits of 80%/10%/10% and ensured that no patient appears in more than one split. We provide the detailed data characteristics and splits in Table 1.
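For reference, this level of agreement can be reproduced with a standard Cohen's kappa computation. A minimal sketch in Python, assuming the two radiologists' per-study PE labels are available as arrays (the toy labels below are illustrative, not part of the released annotations):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-study PE labels from the two annotators (1 = positive, 0 = negative).
radiologist_a = [1, 0, 1, 1, 0, 0, 1, 0]
radiologist_b = [1, 0, 1, 1, 0, 1, 1, 0]

# Cohen's kappa corrects raw agreement for the agreement expected by chance.
kappa = cohen_kappa_score(radiologist_a, radiologist_b)
print(f"Cohen's kappa: {kappa:.3f}")
```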

| Category | Sub-category | Overall | Train | Validation | Test |
|---|---|---|---|---|---|
| CTPA exams | # of studies | 1837 | 1454 | 193 | 190 |
| | # of patients | 1794 | 1414 | 190 | 190 |
| | Median # of slices (IQR) | 386 (134) | 385 (136) | 388 (132) | 388 (139) |
| | # of negative PE | 1111 (60.48%) | 946 (65.06%) | 85 (44.04%) | 80 (42.10%) |
| | # of positive PE | 726 (39.52%) | 508 (34.94%) | 108 (55.96%) | 110 (57.89%) |
| PE | Central | 257 (35.40%) | 202 (39.76%) | 27 (25.00%) | 28 (25.45%) |
| | Segmental | 387 (53.31%) | 281 (55.31%) | 52 (48.15%) | 54 (49.09%) |
| | Subsegmental | 82 (11.29%) | 25 (4.91%) | 29 (26.85%) | 28 (25.45%) |
| Vitals | BMI (mean : std) | 28.37 : 9.65 | 28.36 : 10.03 | 27.11 : 6.78 | 29.60 : 9.22 |
| | Pulse (mean : std) | 81.62 : 14.99 | 81.53 : 15.64 | 83.05 : 11.86 | 80.50 : 13.06 |
| D-dimer | D-dimer test taken | 580 (30.62%) | 461 (30.90%) | 58 (28.71%) | 61 (30.50%) |
| | D-dimer positive | 496 (26.18%) | 389 (26.07%) | 51 (25.25%) | 56 (28.00%) |

Table 1: Data characteristics of our RadFusion database. Statistics of the training, validation and test sets are listed.

Data usage.

Our RadFusion dataset can be used for different research purposes, such as 1) building better clinical decision models for detecting pulmonary embolism, a life-threatening condition that represents the third most common cause of cardiovascular-related death after myocardial infarction and stroke; and 2) developing multimodal fusion models using both CT scans and patient EHR, which remains relatively under-explored in medical AI even though utilizing both modalities is crucial for medical image interpretation in real-world clinical settings. Notably, we also release patient demographic information (e.g., race, gender), which enables researchers to study the fairness of different machine learning models. We split the testing set into Female and Male (based on gender), White and Others (based on race; Others includes Black, Asian, Pacific Islander and other groups), and Age > 80 and Age ≤ 80 (based on age). We chose to dichotomize age into categories of older than 80 years and 80 years or younger because prior studies have determined this cutoff to be clinically meaningful [27, 39]. The number of instances in each group can be found in Table 2. We believe that the broad multimodal data in RadFusion can enable a wide range of research related to foundation models [11], with potential benefits for various downstream tasks as well.

To the best of our knowledge, our work is the first to provide a fairness evaluation of multimodal fusion models in the medical imaging domain. We release all de-identified cases, annotations, splits and patient attribute information for each study at https://stanfordmedicine.box.com/s/q6lm1iwauyspyuicq4rlz35bqsnrwle0.

| Patient attribute | Sub-category | Including subsegmental-only PE | Excluding subsegmental-only PE |
|---|---|---|---|
| Gender | Female | 99 | 87 |
| | Male | 91 | 75 |
| Race | White | 119 | 95 |
| | Others | 71 | 67 |
| Age | > 80 years | 57 | 47 |
| | ≤ 80 years | 133 | 115 |

Table 2: Demographic statistics on the testing set.

4 Benchmarks

4.1 Approaches

To understand the effect of fusing multiple modalities when training medical imaging models, we build 1) an imaging-only model which considers only the pixel values of the medical images, 2) an EHR-only model which relies only on the patient's medical records, and 3) a multimodal fusion model which takes both medical images and EHR data as input to predict the final outcome. Figure 1 illustrates the different models used in this study.

Figure 1: Different models used for benchmarking: (a) the imaging-only model, (b) the EHR-only model, and (c) the multimodal fusion model.

Imaging-only model.

Each CT scan is preprocessed by extracting the pixel data from the original Digital Imaging and Communications in Medicine (DICOM) files and rescaling each slice to a fixed pixel resolution. We then apply a viewing window optimized for pulmonary arteries (window center = 400, window width = 1000), clipping the Hounsfield units to the corresponding range. Lastly, we normalize each CT scan to be zero-centered.
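As a concrete illustration of this preprocessing, the sketch below clips a Hounsfield-unit volume to the stated window and zero-centers it; the clip range [-100, 900] follows from center ± width/2, the resizing step is omitted because the target resolution is not reproduced here, and the random volume is only a placeholder.

```python
import numpy as np

def window_and_center(volume_hu: np.ndarray, center: float = 400.0, width: float = 1000.0) -> np.ndarray:
    """Clip a CT volume (in Hounsfield units) to the pulmonary-artery viewing window
    [center - width/2, center + width/2] = [-100, 900], then zero-center it."""
    lo, hi = center - width / 2.0, center + width / 2.0
    windowed = np.clip(volume_hu, lo, hi)
    return windowed - windowed.mean()

# Placeholder volume of shape (slices, height, width) in Hounsfield units.
volume = np.random.randint(-1024, 3000, size=(300, 512, 512)).astype(np.float32)
preprocessed = window_and_center(volume)
print(preprocessed.mean())  # approximately 0 after zero-centering
```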

For our imaging-only model, we use PENet [22] (code is publicly available at https://github.com/marshuang80/penet), a 77-layer 3D convolutional neural network (CNN) capable of detecting PE with high accuracy. PENet is primarily composed of 3D convolution layers with skip connections and squeeze-and-excitation blocks. Instead of using the entire CT scan at once, PENet takes a sliding window of CT slices as input and makes its prediction based on the sliding window with the highest prediction probability. The detailed model architecture and training procedure can be found in the original manuscript. We pretrain PENet on the Kinetics-600 video dataset [12].
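The study-level aggregation described above (taking the maximum window probability) can be sketched as follows; `penet` stands in for a trained model returning a single PE logit per window, and the window length and stride are illustrative assumptions rather than the values used in the original work.

```python
import torch

def predict_study(penet: torch.nn.Module, ct_volume: torch.Tensor,
                  window_size: int = 24, stride: int = 12) -> float:
    """Slide a window of slices over the CT volume and keep the highest PE probability.

    ct_volume: tensor of shape (num_slices, H, W), already preprocessed.
    Returns the study-level probability, i.e., the maximum over all windows.
    """
    penet.eval()
    num_slices = ct_volume.shape[0]
    best_prob = 0.0
    with torch.no_grad():
        for start in range(0, max(num_slices - window_size + 1, 1), stride):
            window = ct_volume[start:start + window_size]   # (window_size, H, W)
            window = window.unsqueeze(0).unsqueeze(0)        # (1, 1, D, H, W) for a 3D CNN
            prob = torch.sigmoid(penet(window)).item()       # assumes a single-logit output
            best_prob = max(best_prob, prob)
    return best_prob
```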

EHR-only model.

We feature-engineered and parsed our EHR data based on the processing steps described by Banerjee et al. [7]. For demographic features, we one-hot encoded gender, race and smoking habits while keeping age as a numeric variable. All vital features were computed by taking the derivative of the vital values along the temporal axis to represent their sensitivity to change. We represent each of the 641 inpatient and outpatient medications as 1) a binary label indicating whether the drug was prescribed to the patient and 2) the frequency of prescription within the 12-month window. Similarly, ICD codes are represented as a binary presence/absence label as well as a frequency value. For the 22 lab tests, we represent each test with a binary presence/absence label as well as the latest value of the test. We removed ICD codes with less than 1% occurrence in the training set, resulting in 141 diagnosis groups. To prevent data leakage, we dropped all ICD codes recorded within 24 hours prior to the CTPA study, in addition to ICD codes recorded with the same encounter number as the patient's CTPA exam. All input features are normalized by subtracting the mean and dividing by the standard deviation as a preprocessing step. Following [24], we use an ElasticNet [53] model that takes a concatenation of all EHR features as our EHR-only model.
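A compact sketch of this pipeline (standardization of the concatenated EHR features followed by an elastic-net-penalized classifier) is shown below; the feature matrix, its dimensionality, and the elastic-net logistic regression stand-in are illustrative assumptions, not the released schema or the exact model configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: demographics, vital derivatives, medication/ICD
# presence + frequency, and latest lab values, already concatenated per study.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1454, 900)), rng.integers(0, 2, size=1454)
X_test = rng.normal(size=(190, 900))

# Standardize features, then fit an elastic-net-penalized logistic classifier.
ehr_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000),
)
ehr_model.fit(X_train, y_train)
ehr_probs = ehr_model.predict_proba(X_test)[:, 1]  # PE probability per test study
```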

| Metric | Imaging-only Model | EHR-only Model | Multimodal Fusion Model |
|---|---|---|---|
| Accuracy | 0.689 | 0.837 | 0.890 |
| AUROC | 0.796 | 0.922 | 0.946 |
| Specificity | 0.863 | 0.888 | 0.900 |
| Sensitivity | 0.564 | 0.800 | 0.882 |
| PPV | 0.849 | 0.907 | 0.924 |
| NPV | 0.590 | 0.763 | 0.847 |

Table 3: Performance comparison of the imaging-only model, the EHR-only model, and the multimodal fusion model on the testing set.

Multimodal fusion model.

Features from different modalities can provide complementary information or separately contribute to the decision making of a machine learning model. Several strategies can be leveraged to fuse features from different modalities, including early fusion, late fusion and joint fusion [23]. Early fusion combines features from separate modalities at the input level. Joint fusion trains a decision-making model using features extracted by single-modality models, while propagating the loss back to the feature-extracting models. Late fusion trains separate models for each modality and aggregates the predicted probabilities from all single-modality models into the final prediction. In our prior study [24], we showed that late fusion works best for fusing CT and EHR to diagnose PE. Therefore, for our multimodal fusion design, we take the average of the predicted probabilities from our EHR-only model (ElasticNet) and imaging-only model (PENet) as our final prediction, as shown in Figure 1(c).
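This late-fusion step reduces to a single averaging operation once both single-modality probabilities are available; the arrays below are placeholders tied to the sketches above.

```python
import numpy as np

# Predicted PE probabilities from the two single-modality models for the same studies.
imaging_probs = np.array([0.91, 0.12, 0.45])   # e.g., PENet's sliding-window maximum
ehr_probs = np.array([0.78, 0.05, 0.62])       # e.g., the elastic-net EHR model

# Late fusion: average the probabilities, then threshold at the chosen operating point.
fusion_probs = (imaging_probs + ehr_probs) / 2.0
print(fusion_probs)  # [0.845 0.085 0.535]
```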

4.2 Results and Discussion

To comprehensively evaluate the performance of the imaging-only model, the EHR-only model and the multimodal fusion model, several evaluation metrics were calculated over the entire test set, including area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV). We picked the operating point based on the widely adopted Youden's J statistic [18], which maximizes the sum of specificity and sensitivity on our validation set. All six evaluation metrics are reported in the following experiments. We use two GTX 1080 GPUs for all experiments.
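For reference, a minimal sketch of selecting the operating point via Youden's J on the validation set and then computing the six reported metrics on the test set (array names are placeholders):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

def youden_threshold(y_val, p_val):
    """Pick the threshold maximizing sensitivity + specificity - 1 on the validation set."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    return thresholds[np.argmax(tpr - fpr)]

def report_metrics(y_test, p_test, threshold):
    """Compute the six reported metrics at the chosen operating point."""
    y_pred = (p_test >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "AUROC": roc_auc_score(y_test, p_test),
        "Specificity": tn / (tn + fp),
        "Sensitivity": tp / (tp + fn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }
```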

Imaging-only vs. EHR-only vs. multimodal fusion.

To understand the importance of each modality, we test the three models on all 190 testing cases. The results are summarized in Table 3. Of the two single-modality models, we observe that the EHR-only model consistently achieves better performance on pulmonary embolism detection under all six evaluation measures. When incorporating both the imaging and the EHR modalities, the multimodal fusion model further improves performance by a large margin. For instance, on the testing set, the imaging-only model and the EHR-only model yield accuracies of only 0.689 and 0.837 respectively, whereas the multimodal fusion model achieves a much higher accuracy of 0.890. This suggests that both modalities are vital for clinical decision making.

| Metric | Imaging-only Model | EHR-only Model | Multimodal Fusion Model |
|---|---|---|---|
| Accuracy | 0.759 | 0.877 | 0.895 |
| AUROC | 0.842 | 0.932 | 0.962 |
| Specificity | 0.863 | 0.888 | 0.838 |
| Sensitivity | 0.659 | 0.866 | 0.951 |
| PPV | 0.831 | 0.888 | 0.857 |
| NPV | 0.711 | 0.866 | 0.944 |

Table 4: Performance comparison of the imaging-only model, the EHR-only model, and the multimodal fusion model on non-subsegmental-only PE.

Diagnosis on the non-subsegmental-only PE set.

Diagnosing subsegmental-only PE is known to have questionable clinical significance, and such cases are often left untreated [1]. Therefore, we also computed the same evaluation metrics for the 162 non-subsegmental-only PE cases in the testing set to understand the clinical utility of our model. As shown in Table 4, the EHR-only model again outperforms the imaging-only model, but the multimodal fusion model achieves the best results of the three models despite the slightly higher specificity and PPV of the EHR-only model.

| Metric | Imaging-only (Female) | Imaging-only (Male) | EHR-only (Female) | EHR-only (Male) | Fusion (Female) | Fusion (Male) |
|---|---|---|---|---|---|---|
| Accuracy | 0.747 | 0.626 | 0.869 | 0.802 | 0.899 | 0.879 |
| AUROC | 0.844 | 0.730 | 0.918 | 0.924 | 0.953 | 0.936 |
| Specificity | 0.953 | 0.757 | 0.884 | 0.892 | 0.884 | 0.919 |
| Sensitivity | 0.589 | 0.537 | 0.857 | 0.741 | 0.911 | 0.852 |
| PPV | 0.943 | 0.763 | 0.906 | 0.909 | 0.911 | 0.939 |
| NPV | 0.641 | 0.528 | 0.826 | 0.702 | 0.884 | 0.810 |

Table 5: Performance comparison of the imaging-only model, the EHR-only model and the multimodal fusion model for different gender groups on the testing set. Bold underline denotes the best results in the Female group, bold denotes the best results in the Male group.
| Metric | Imaging-only (Female) | Imaging-only (Male) | EHR-only (Female) | EHR-only (Male) | Fusion (Female) | Fusion (Male) |
|---|---|---|---|---|---|---|
| Accuracy | 0.828 | 0.680 | 0.885 | 0.867 | 0.897 | 0.893 |
| AUROC | 0.897 | 0.769 | 0.934 | 0.930 | 0.967 | 0.958 |
| Specificity | 0.953 | 0.757 | 0.884 | 0.892 | 0.837 | 0.838 |
| Sensitivity | 0.705 | 0.605 | 0.886 | 0.842 | 0.955 | 0.947 |
| PPV | 0.939 | 0.719 | 0.886 | 0.889 | 0.857 | 0.857 |
| NPV | 0.759 | 0.651 | 0.884 | 0.846 | 0.947 | 0.939 |

Table 6: Performance comparison of the imaging-only model, the EHR-only model and the multimodal fusion model for different gender groups on the non-subsegmental-only PE set. Bold underline denotes the best results in the Female group, bold denotes the best results in the Male group.

5 Fairness Evaluation

Assessing the fairness and robustness of machine learning systems is critical prior to deployment [46, 29, 41]. Although the multimodal fusion model achieves better performance than the imaging-only and EHR-only models, it is not clear whether this benefit is preserved across population groups defined by sensitive attributes. In addition, it remains unknown whether the integration of both modalities exacerbates existing model biases with respect to patient attributes such as gender and race, which would raise further ethical concerns when deploying such models [29].

To answer these questions, in this section we provide a rigorous fairness analysis of the three models across various patient groups, on both the full test set and the non-subsegmental-only test set. We first report the performance of the different models in terms of AUROC, sensitivity, specificity, accuracy, PPV and NPV for each gender group in Table 5 and Table 6. We observe that, under both evaluation settings, the multimodal fusion model achieves better results for both gender groups than the imaging-only and EHR-only models by a large margin. In particular, on the Male group, the multimodal fusion model achieves accuracies of 0.879 on the testing set and 0.893 on the non-subsegmental-only PE set, outperforming the EHR-only model (0.802 and 0.867 respectively). Similarly, we show that the multimodal fusion model achieves the best performance for different age groups in Table 7 and Table 8, and for different racial groups in Table 9 and Table 10. This indicates that the integration of the imaging and EHR modalities not only improves overall performance but also consistently yields more robust results across different population groups.

| Metric | Imaging-only (> 80) | Imaging-only (≤ 80) | EHR-only (> 80) | EHR-only (≤ 80) | Fusion (> 80) | Fusion (≤ 80) |
|---|---|---|---|---|---|---|
| Accuracy | 0.737 | 0.669 | 0.825 | 0.842 | 0.895 | 0.887 |
| AUROC | 0.886 | 0.766 | 0.933 | 0.917 | 0.979 | 0.936 |
| Specificity | 1.0 | 0.823 | 0.944 | 0.871 | 0.944 | 0.887 |
| Sensitivity | 0.615 | 0.535 | 0.769 | 0.817 | 0.872 | 0.887 |
| PPV | 1.0 | 0.776 | 0.968 | 0.879 | 0.971 | 0.900 |
| NPV | 0.545 | 0.607 | 0.654 | 0.806 | 0.773 | 0.873 |

Table 7: Performance comparison of the imaging-only model, the EHR-only model and the multimodal fusion model for different age groups on the testing set. Bold underline denotes the best results in the Age > 80 group, bold denotes the best results in the Age ≤ 80 group.
| Metric | Imaging-only (> 80) | Imaging-only (≤ 80) | EHR-only (> 80) | EHR-only (≤ 80) | Fusion (> 80) | Fusion (≤ 80) |
|---|---|---|---|---|---|---|
| Accuracy | 0.809 | 0.739 | 0.894 | 0.870 | 0.957 | 0.870 |
| AUROC | 0.925 | 0.814 | 0.935 | 0.930 | 0.989 | 0.953 |
| Specificity | 1.0 | 0.823 | 0.944 | 0.871 | 0.944 | 0.806 |
| Sensitivity | 0.690 | 0.642 | 0.862 | 0.868 | 0.966 | 0.943 |
| PPV | 1.0 | 0.756 | 0.962 | 0.852 | 0.966 | 0.806 |
| NPV | 0.667 | 0.729 | 0.810 | 0.885 | 0.944 | 0.943 |

Table 8: Performance comparison of the imaging-only model, the EHR-only model and the multimodal fusion model for different age groups on the non-subsegmental-only PE set. Bold underline denotes the best results in the Age > 80 group, bold denotes the best results in the Age ≤ 80 group.
| Metric | Imaging-only (White) | Imaging-only (Others) | EHR-only (White) | EHR-only (Others) | Fusion (White) | Fusion (Others) |
|---|---|---|---|---|---|---|
| Accuracy | 0.622 | 0.803 | 0.832 | 0.845 | 0.882 | 0.901 |
| AUROC | 0.76 | 0.852 | 0.932 | 0.914 | 0.944 | 0.959 |
| Specificity | 0.825 | 0.900 | 0.900 | 0.875 | 0.900 | 0.900 |
| Sensitivity | 0.519 | 0.677 | 0.797 | 0.806 | 0.873 | 0.903 |
| PPV | 0.854 | 0.840 | 0.940 | 0.833 | 0.945 | 0.875 |
| NPV | 0.465 | 0.783 | 0.692 | 0.854 | 0.783 | 0.923 |

Table 9: Performance comparison of the imaging-only model, the EHR-only model and the multimodal fusion model for different racial groups on the testing set. Bold underline denotes the best results in the White group, bold denotes the best results in the Others group.
| Metric | Imaging-only (White) | Imaging-only (Others) | EHR-only (White) | EHR-only (Others) | Fusion (White) | Fusion (Others) |
|---|---|---|---|---|---|---|
| Accuracy | 0.695 | 0.851 | 0.895 | 0.851 | 0.905 | 0.881 |
| AUROC | 0.809 | 0.892 | 0.944 | 0.91 | 0.964 | 0.962 |
| Specificity | 0.825 | 0.900 | 0.900 | 0.875 | 0.850 | 0.825 |
| Sensitivity | 0.600 | 0.778 | 0.891 | 0.815 | 0.945 | 0.963 |
| PPV | 0.825 | 0.840 | 0.925 | 0.815 | 0.897 | 0.788 |
| NPV | 0.600 | 0.857 | 0.857 | 0.875 | 0.919 | 0.971 |

Table 10: Performance comparison of the imaging-only model, the EHR-only model and the multimodal fusion model for different racial groups on the non-subsegmental-only PE set. Bold underline denotes the best results in the White group, bold denotes the best results in the Others group.

In addition to overall model performance for different patient groups, we also assess fairness by reporting the equal opportunity difference (EOD), which measures the difference in TPR (i.e., sensitivity) between the privileged and under-privileged groups, following the evaluation protocol in [19, 46, 37]. We choose the TPR gap as our fairness metric based on the needs of the clinical diagnostic setting: a high TPR disparity indicates that sick members of one demographic group would not be given correct diagnoses at the same rate as the general population, which can be dangerous for clinical deployment. The evaluation results on the test set and on non-subsegmental-only PE cases are summarized in Table 11 and Table 12 respectively. They suggest that while the imaging-only model and the EHR-only model can yield large TPR gaps for different gender and racial groups, the multimodal fusion model consistently yields TPR gaps below 0.06. For instance, the TPR gap between racial groups for the imaging-only model is as large as 0.158 on the testing set and 0.178 on non-subsegmental-only cases, and the TPR gap between gender groups for the EHR-only model reaches 0.116 on the testing set. In comparison, the largest TPR gap observed for the multimodal fusion model is only 0.059, for the gender groups on the testing set. However, we note that the multimodal fusion model does not consistently improve fairness compared with the single-modality models. How to design fairer multimodal fusion models remains an open question.
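A short sketch of the EOD computation for a binary protected attribute is given below; the group labels, true labels, and thresholded predictions are toy placeholders.

```python
import numpy as np

def true_positive_rate(y_true, y_pred):
    """Sensitivity: fraction of positive cases that are correctly flagged."""
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute TPR gap between the two values of a binary protected attribute."""
    g0, g1 = np.unique(group)
    tpr0 = true_positive_rate(y_true[group == g0], y_pred[group == g0])
    tpr1 = true_positive_rate(y_true[group == g1], y_pred[group == g1])
    return abs(tpr0 - tpr1)

# Toy example: labels, thresholded predictions, and gender for six test studies.
y_true = np.array([1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
gender = np.array(["F", "F", "F", "M", "M", "M"])
print(equal_opportunity_difference(y_true, y_pred, gender))  # |0.5 - 1.0| = 0.5
```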

Discussion.

Although our results demonstrate the potential of integrating the imaging and EHR modalities to improve model performance and robustness without introducing large TPR disparities between population groups, several gaps remain to be addressed. Future studies should examine fusion model designs that not only improve overall performance but also reduce model bias when integrating different modalities. Moreover, given the inherent biases in existing large public datasets, it will be important to investigate the performance of multimodal fusion models on diverse multi-source datasets in an effort to improve algorithmic fairness across gender, racial and age groups.

Limitation.

One limitation of our work is that the different population groups in the dataset are not well balanced, which may have affected the estimation of the TPR gaps. Additionally, future research should extend the dataset to external institutions with different scanners and protocols.

| Patient Attribute | Imaging-only Model | EHR-only Model | Multimodal Fusion Model |
|---|---|---|---|
| Gender | 0.052 | 0.116 | 0.059 |
| Race | 0.158 | 0.009 | 0.030 |
| Age | 0.080 | 0.048 | 0.015 |

Table 11: EOD under various patient attributes (i.e., gender, race, age) on the testing set. Lower EOD indicates less bias.
| Patient Attribute | Imaging-only Model | EHR-only Model | Multimodal Fusion Model |
|---|---|---|---|
| Gender | 0.100 | 0.044 | 0.008 |
| Race | 0.178 | 0.076 | 0.017 |
| Age | 0.047 | 0.006 | 0.023 |

Table 12: EOD under various patient attributes (i.e., gender, race, age) on the non-subsegmental-only PE set. Lower EOD indicates less bias.

6 Conclusion

Despite the pervasiveness of medical imaging benchmarks, the number of multimodal medical datasets remains limited, and datasets pairing medical imaging with EHR patient data are even scarcer. To advance research in this field, we introduce RadFusion, a large-scale multimodal dataset consisting of paired CT images and EHR patient data from over 1800 studies. We demonstrate the importance of both imaging data and EHR patient data for pulmonary embolism detection by benchmarking representative imaging-based, EHR-based and multimodal fusion models on RadFusion. Through extensive evaluation on both the testing set and non-subsegmental-only PE cases, our initial results suggest that, compared with single-modality methods, the multimodal fusion model can significantly improve performance and robustness without introducing large TPR disparities between population groups.

To the best of our knowledge, RadFusion is the first public database focused on combining medical imaging data with large-scale patient EHR for advancing clinical diagnosis. To support better clinical utility, RadFusion also provides opportunities to consider fairness and robustness across different patient attributes when designing multimodal fusion models. We hope our study can serve as an important baseline and facilitate future research in this direction.

Acknowledgement

Research reported in this publication was supported by the National Heart, Lung, And Blood Institute of the National Institutes of Health under Award Number R01HL155410 and the National Library Of Medicine of the National Institutes of Health under Award Number R01LM012966. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  • [1] M. H. Albrecht, M. W. Bickford, J. W. Nance Jr, L. Zhang, C. N. De Cecco, J. L. Wichmann, T. J. Vogl, and U. J. Schoepf (2017) State-of-the-art pulmonary ct angiography for acute pulmonary embolism. American Journal of Roentgenology 208 (3), pp. 495–504. Cited by: §4.2.
  • [2] J. L. Alonso-Martínez, F. A. Sánchez, and M. U. Echezarreta (2010) Delay and misdiagnosis in sub-massive and non-massive acute pulmonary embolism. European journal of internal medicine 21 (4), pp. 278–282. Cited by: §2.
  • [3] G. An, K. Omodaka, S. Tsuda, Y. Shiga, N. Takada, T. Kikawa, T. Nakazawa, H. Yokota, and M. Akiba (2018) Comparison of machine-learning classification models for glaucoma management. Journal of healthcare engineering 2018. Cited by: §2.
  • [4] I. Banerjee, A. R. Bhimireddy, J. L. Burns, L. A. Celi, L. Chen, R. Correa, N. Dullerud, M. Ghassemi, S. Huang, P. Kuo, et al. (2021) Reading race: ai recognises patient’s racial identity in medical images. arXiv preprint arXiv:2107.10356. Cited by: §1.
  • [5] I. Banerjee, M. C. Chen, M. P. Lungren, and D. L. Rubin (2018) Radiology report annotation using intelligent word embeddings: applied to multi-institutional chest ct cohort. Journal of biomedical informatics 77, pp. 11–20. Cited by: §3.
  • [6] I. Banerjee, Y. Ling, M. C. Chen, S. A. Hasan, C. P. Langlotz, N. Moradzadeh, B. Chapman, T. Amrhein, D. Mong, D. L. Rubin, et al. (2019) Comparative effectiveness of convolutional neural network (cnn) and recurrent neural network (rnn) architectures for radiology text report classification. Artificial intelligence in medicine 97, pp. 79–88. Cited by: §3.
  • [7] I. Banerjee, M. Sofela, J. Yang, J. H. Chen, N. H. Shah, R. Ball, A. I. Mushlin, M. Desai, J. Bledsoe, T. Amrhein, et al. (2019) Development and performance of the pulmonary embolism result forecast model (perform) for computed tomography clinical decision support. JAMA network open 2 (8), pp. e198719–e198719. Cited by: §4.1.
  • [8] A. L. Beam and I. S. Kohane (2018) Big data and machine learning in health care. Jama 319 (13), pp. 1317–1318. Cited by: §2.
  • [9] N. Bhagwat, J. D. Viviano, A. N. Voineskos, M. M. Chakravarty, A. D. N. Initiative, et al. (2018) Modeling and prediction of clinical symptom trajectories in alzheimer’s disease using longitudinal data. PLoS computational biology 14 (9), pp. e1006376. Cited by: §2.
  • [10] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C. Fu, X. Han, P. Heng, J. Hesser, et al. (2019) The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056. Cited by: §1.
  • [11] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. (2021) On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. Cited by: §3.
  • [12] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman (2018) A short note about kinetics-600. arXiv preprint arXiv:1808.01340. Cited by: §4.1.
  • [13] I. Y. Chen, E. Pierson, S. Rose, S. Joshi, K. Ferryman, and M. Ghassemi (2020) Ethical machine learning in healthcare. Annual Review of Biomedical Data Science 4. Cited by: §2.
  • [14] I. Y. Chen, P. Szolovits, and M. Ghassemi (2019) Can ai help reduce disparities in general medical and mental health care?. AMA journal of ethics 21 (2), pp. 167–179. Cited by: §2.
  • [15] A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data 5 (2), pp. 153–163. Cited by: §2.
  • [16] M. D. Cohen (2007) Accuracy of information on imaging requisitions: does it matter?. Journal of the American College of Radiology 4 (9), pp. 617–621. Cited by: §2.
  • [17] M. De-Arteaga, A. Romanov, H. Wallach, J. Chayes, C. Borgs, A. Chouldechova, S. Geyik, K. Kenthapadi, and A. T. Kalai (2019) Bias in bios: a case study of semantic representation bias in a high-stakes setting. In proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 120–128. Cited by: §2.
  • [18] R. Fluss, D. Faraggi, and B. Reiser (2005) Estimation of the youden index and its associated cutoff point. Biometrical Journal: Journal of Mathematical Methods in Biosciences 47 (4), pp. 458–472. Cited by: §4.2.
  • [19] M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. Advances in neural information processing systems 29, pp. 3315–3323. Cited by: §1, §5.
  • [20] J. M. Hendriksen, M. Koster-van Ree, M. J. Morgenstern, R. Oudega, R. E. Schutgens, K. G. Moons, and G. Geersing (2017) Clinical characteristics associated with diagnostic delay of pulmonary embolism in primary care: a retrospective observational study. BMJ open 7 (3), pp. e012789. Cited by: §2.
  • [21] K. T. Horlander, D. M. Mannino, and K. V. Leeper (2003) Pulmonary embolism mortality in the united states, 1979-1998: an analysis using multiple-cause mortality data. Archives of internal medicine 163 (14), pp. 1711–1717. Cited by: §2.
  • [22] S. Huang, T. Kothari, I. Banerjee, C. Chute, R. L. Ball, N. Borus, A. Huang, B. N. Patel, P. Rajpurkar, J. Irvin, et al. (2020) PENet—a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric ct imaging. NPJ digital medicine 3 (1), pp. 1–9. Cited by: §2, §4.1.
  • [23] S. Huang, A. Pareek, S. Seyyedi, I. Banerjee, and M. P. Lungren (2020) Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ digital medicine 3 (1), pp. 1–9. Cited by: §2, §4.1.
  • [24] S. Huang, A. Pareek, R. Zamanian, I. Banerjee, and M. P. Lungren (2020) Multimodal fusion with deep neural networks for leveraging ct imaging and electronic health record: a case-study in pulmonary embolism detection. Scientific reports 10 (1), pp. 1–9. Cited by: §1, §4.1, §4.1.
  • [25] S. Huang, L. Shen, M. P. Lungren, and S. Yeung (2021) GLoRIA: a multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3942–3951. Cited by: §2.
  • [26] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp. 590–597. Cited by: §1.
  • [27] D. Jiménez, D. Aujesky, L. Moores, V. Gómez, J. L. Lobo, F. Uresandi, R. Otero, M. Monreal, A. Muriel, R. D. Yusen, et al. (2010) Simplification of the pulmonary embolism severity index for prognostication in patients with acute symptomatic pulmonary embolism. Archives of internal medicine 170 (15), pp. 1383–1389. Cited by: §3.
  • [28] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6 (1), pp. 1–8. Cited by: §1.
  • [29] A. Kadambi (2021) Achieving fairness in medical devices. Science 372 (6537), pp. 30–31. Cited by: §5.
  • [30] I. Kawachi, N. Daniels, and D. E. Robinson (2005) Health disparities by race and class: why both matter. Health Affairs 24 (2), pp. 343–352. Cited by: §2.
  • [31] J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh (2018) Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE journal of biomedical and health informatics 23 (2), pp. 538–546. Cited by: §2.
  • [32] B. Landman, Z. Xu, J. E. Igelsias, M. Styner, T. Langerak, and A. Klein (2015) MICCAI multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proc. MICCAI: Multi-Atlas Labeling Beyond Cranial Vault-Workshop Challenge, Cited by: §1.
  • [33] A. Leslie, A. Jones, and P. Goddard (2000) The influence of clinical information on the reporting of ct by radiologists.. The British journal of radiology 73 (874), pp. 1052–1055. Cited by: §1.
  • [34] A. N. Leung, T. M. Bull, R. Jaeschke, C. J. Lockwood, P. M. Boiselle, L. M. Hurwitz, A. H. James, L. B. McCullough, Y. Menda, M. J. Paidas, et al. (2011) An official american thoracic society/society of thoracic radiology clinical practice guideline: evaluation of suspected pulmonary embolism in pregnancy. American journal of respiratory and critical care medicine 184 (10), pp. 1200–1208. Cited by: §2.
  • [35] A. N. Leung, T. M. Bull, R. Jaeschke, C. J. Lockwood, P. M. Boiselle, L. M. Hurwitz, A. H. James, L. B. McCullough, Y. Menda, M. J. Paidas, et al. (2012) American thoracic society documents: an official american thoracic society/society of thoracic radiology clinical practice guideline—evaluation of suspected pulmonary embolism in pregnancy. Radiology 262 (2), pp. 635–646. Cited by: §2.
  • [36] H. Li and Y. Fan (2019) Early prediction of alzheimer’s disease dementia based on baseline hippocampal mri and 1-year follow-up cognitive measures using deep recurrent neural networks. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 368–371. Cited by: §2.
  • [37] X. Li, Z. Cui, Y. Wu, L. Gu, and T. Harada (2021) Estimating and improving fairness with adversarial learning. arXiv preprint arXiv:2103.04243. Cited by: §5.
  • [38] T. J. Littlejohns, J. Holliday, L. M. Gibson, S. Garratt, N. Oesingmann, F. Alfaro-Almagro, J. D. Bell, C. Boultwood, R. Collins, M. C. Conroy, et al. (2020) The uk biobank imaging enhancement of 100,000 participants: rationale, data collection, management and future directions. Nature communications 11 (1), pp. 1–12. Cited by: §1.
  • [39] L. López-Jiménez, M. Montero, J. A. González-Fajardo, J. I. Arcelus, C. Suárez, J. L. Lobo, M. Monreal, R. Investigators, et al. (2006) Venous thromboembolism in very elderly patients: findings from a prospective registry (riete). Haematologica 91 (8), pp. 1046–1051. Cited by: §3.
  • [40] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al. (2014) The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34 (10), pp. 1993–2024. Cited by: §1.
  • [41] Z. Obermeyer and S. Mullainathan (2019) Dissecting racial bias in an algorithm that guides health decisions for 70 million people. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 89–89. Cited by: §2, §5.
  • [42] S. R. Pfohl, A. Foryciarz, and N. H. Shah (2021) An empirical characterization of fair machine learning for clinical risk prediction. Journal of biomedical informatics 113, pp. 103621. Cited by: §2.
  • [43] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, et al. (2018) Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1 (1), pp. 1–10. Cited by: §2.
  • [44] M. Remy-Jardin, J. Remy, D. Artaud, F. Deschildre, and A. Duhamel (1997) Peripheral pulmonary arteries: optimization of the spiral ct acquisition protocol.. Radiology 204 (1), pp. 157–163. Cited by: §3.
  • [45] H. R. Roth, L. Lu, A. Farag, H. Shin, J. Liu, E. B. Turkbey, and R. M. Summers (2015) Deeporgan: multi-level deep convolutional networks for automated pancreas segmentation. In International conference on medical image computing and computer-assisted intervention, pp. 556–564. Cited by: §1.
  • [46] L. Seyyed-Kalantari, G. Liu, M. McDermott, I. Y. Chen, and M. Ghassemi (2020) CheXclusion: fairness gaps in deep chest x-ray classifiers. In BIOCOMPUTING 2021: Proceedings of the Pacific Symposium, pp. 232–243. Cited by: §2, §5, §5.
  • [47] N. Tajbakhsh, M. B. Gotway, and J. Liang (2015) Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 62–69. Cited by: §2.
  • [48] N. Tomašev, X. Glorot, J. W. Rae, M. Zielinski, H. Askham, A. Saraiva, A. Mottram, C. Meyer, S. Ravuri, I. Protsyuk, et al. (2019) A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572 (7767), pp. 116–119. Cited by: §2.
  • [49] J. Walter, A. Tufman, R. Holle, and L. Schwarzkopf (2019) “Age matters”—german claims data indicate disparities in lung cancer care between elderly and young patients. PloS one 14 (6), pp. e0217434. Cited by: §2.
  • [50] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §1.
  • [51] A. Yala, C. Lehman, T. Schuster, T. Portnoi, and R. Barzilay (2019) A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 292 (1), pp. 60–66. Cited by: §2.
  • [52] X. Yang, Y. Lin, J. Su, X. Wang, X. Li, J. Lin, and K. Cheng (2019) A two-stage convolutional neural network for pulmonary embolism detection from ctpa images. IEEE Access 7, pp. 84849–84857. Cited by: §2.
  • [53] H. Zou and T. Hastie (2005) Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology) 67 (2), pp. 301–320. Cited by: §4.1.