1 Introduction
Machine learning models have shown great promise for precision diagnosis, treatment prediction, and a number of other clinical applications yu2018artificial, myszczynska2020applications, abrol2021deep, zhou2021review, singh2022machine. This has led to increasing interest in building systems in which such models aid human experts in accurate and efficient decision making in clinical settings thompson2020enigma, marek2022reproducible, bethlehem2022brain. However, there are key challenges to achieving this goal dinsdale2021challenges, rajpurkar2022ai, varoquaux2022machine. In particular, clinical data are highly heterogeneous. For example, heterogeneity in neurological disorders such as Alzheimer’s disease stems not only from the diverse anatomies, overlapping clinical phenotypes, and genomic traits of different subjects, but also from operational, demographic, and social factors such as data acquisition protocols pomponio2020harmonization, wang2021harmonization, perkonigg2021dynamic, and the paucity of data for minorities davatzikos2019machine. As a consequence, machine learning models often reproduce poorly across the population.
This paper focuses on one particular aspect of this issue, namely the fact that machine learning models make biased predictions, i.e., their accuracy differs across sex, age, and racial groups, and across cohorts from different clinical studies. This issue has recently received wide attention larrazabal2020gender, gao2020deep, wilkinson2020time, seyyed2021underdiagnosis, finlayson2021clinician, dockes2021preventing, li2022cross. For example, a machine learning model trained on X-ray images consistently made inaccurate predictions on underrepresented genders when the imbalance in the training data exceeded a threshold larrazabal2020gender. Similarly, for classification of chest X-ray pathologies, minority groups such as female African-Americans and patients with low socioeconomic status are prone to being incorrectly diagnosed as healthy seyyed2021underdiagnosis. Such results have raised concerns about whether machine learning models can provide unbiased predictions, and whether they can eventually be deployed in clinical settings.
This paper provides results from relatively large and diverse datasets pertaining to three neurological disorders, namely Alzheimer’s disease (AD), schizophrenia (SZ), and autism spectrum disorder (ASD), which alleviate these concerns. We build machine learning models using 3D magnetic resonance (MR) images along with demographic, clinical, and genetic factors, as well as cognitive scores, from large-scale consortia—iSTAGING habes2021brain for AD, PHENOM satterthwaite2010association, zhang2015heterogeneity, chand2020two for SZ, and ABIDE di2014autism for ASD; see Table 1. These datasets are highly imbalanced, e.g., 13% and 37.4% female subjects respectively in ABIDE and PHENOM, and 70.6% European Americans, 8.8% African Americans, and 1.5% Asian Americans in iSTAGING. We show that, when trained with appropriate data pre-processing techniques and hyper-parameter tuning, machine learning models do not make biased predictions across different subgroups. This is not the case for a baseline deep neural network, which provides accurate predictions on-average across the population but suffers from bias when predictions are stratified by different attributes.
2 Results
[Figure 1: Accuracy stratified by subgroup for AD, SZ, and ASD. The p-values shown in the figure indicate that we cannot reject the null hypothesis that the accuracy for different subgroups has the same mean (at significance level < 0.01). This is not the case for the baseline deep network; see Section 2.1.]

2.1 Baseline machine learning models have accurate predictions on-average, but can be biased
The accuracy of a deep network on held-out data is 85.83% ± 1.76% for AD, 73.29% ± 2.85% for SZ, and 55.61% ± 3.61% for ASD. These numbers are comparable to published results in the literature wen2020convolutional, rozycki2018multisite, katuwal2015predictive. As Fig. 1 shows, for all three neurological disorders, we find large discrepancies in the accuracy of this model on different subgroups of the population for all attributes. For example, for AD the model has a higher accuracy on females than males (p-value 1.59); for SZ the model predicts more accurately for Native Americans than Asian Americans (p-value 1.49). One may be inclined to hypothesize that this bias in accuracy arises because one subgroup has a larger sample size than the other (Table 1). This hypothesis does not hold when predictions are stratified by age. AD subjects older than 80 years, SZ subjects older than 35 years, and ASD subjects younger than 20 years have a lower accuracy (p-values 8.76, 3.54, and 2.80 respectively), even though these subgroups are not the ones with the smallest sample size (Table 1). For all three disorders, predictions of the baseline deep network are biased (p-value < 0.01) except in four cases: race and clinical studies in AD, sex in SZ, and clinical studies in ASD.
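The paper does not spell out the statistical test behind these p-values. One plausible sketch is a one-way ANOVA over per-fold subgroup accuracies; the subgroup names and accuracy values below are illustrative placeholders, not the study's data:

```python
from scipy import stats

# Illustrative per-fold accuracies (5 held-out folds) for two subgroups.
# These numbers are made up for demonstration; they are not the study's data.
acc_female = [0.88, 0.87, 0.90, 0.86, 0.89]
acc_male = [0.81, 0.79, 0.83, 0.80, 0.82]

# One-way ANOVA: the null hypothesis is that both subgroups share the same
# mean accuracy. With two groups this is equivalent to a two-sample t-test.
f_stat, p_value = stats.f_oneway(acc_female, acc_male)

# At significance level 0.01, a small p-value indicates biased predictions.
biased = p_value < 0.01
```

Rejecting the null at the 0.01 level, as above, corresponds to declaring the model biased for that attribute.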
2.2 Appropriately designed machine learning models are not necessarily biased
The accuracy of the ensemble learned using the methodology discussed in Section 4 on held-out data is 87.35% ± 1.18% for AD, 74.76% ± 3.32% for SZ, and 60.54% ± 2.13% for ASD; all three are slightly better than those of the baseline deep network. As Fig. 1 shows visually, the accuracy of the ensemble is more consistent. For all three neurological disorders, and for all attributes, we cannot reject the null hypothesis that the accuracy for different subgroups has the same mean (at significance level < 0.01). In other words, the ensemble does not exhibit bias, up to statistically indistinguishable levels. Remarkably, this observation holds even in situations with extreme imbalance in the data, e.g., for ASD there are 87% males as compared to 13% females.
Using the same preprocessing and hyper-parameter tuning methodology as that of the ensemble, the bias of the deep network can also be reduced. We obtained an accuracy of 86.78% ± 1.90% on AD, 70.44% ± 2.78% on SZ, and 54.70% ± 4.03% on ASD; these numbers, except the one for SZ, are about the same as those of the baseline deep network. The p-values for bias across subgroups are now 0.665 (sex), 0.090 (age), 0.690 (race), and 0.768 (study) for AD; 0.089 (sex), 0.771 (age), 0.381 (race), and 7.86 (study) for SZ; 4.77 (sex), 0.356 (age), and 0.880 (study) for ASD. With a significance level of 0.01, this indicates that the deep network trained with better data preprocessing achieves a similar average accuracy as the baseline deep network but is not biased (except across clinical studies in SZ, which is likely due in part to differences in clinical characteristics across patient cohorts).
2.3 Machine learning models trained on multi-source data are also unbiased and predict more accurately on average, but their subgroup-specific accuracy may not always be better than models trained on single-source data
We next trained the ensemble using demographic features, clinical variables, genetic factors, and cognitive scores in addition to MR imaging features. The accuracy of this multi-source ensemble on held-out data is 91.22% ± 1.76% for AD, 79.18% ± 3.29% for SZ, and 63.83% ± 3.18% for ASD. All three values are better than the corresponding ones for the ensemble trained only on structural MR imaging features (p-value < 6.44 for all three). Therefore, using multi-source data improves the average accuracy of machine learning models for the three neurological disorders. This ensemble also makes unbiased predictions across different subgroups; the p-values are 0.106 (sex), 0.732 (age), 0.819 (race), and 0.275 (study) for AD; 0.808 (sex), 0.845 (age), 0.772 (race), and 0.012 (study) for SZ; 0.811 (sex), 0.523 (age), and 0.862 (study) for ASD.
We performed a two-way ANOVA rutherford2011anova to check whether the improved average accuracy translates into improved accuracy for the subgroups pertaining to each attribute. The p-values are 7.91 (sex), 1.07 (age), 8.24 (race), and 8.10 (study) for AD; 7.40 (sex), 1.68 (age), 0.566 (race), and 5.24 (study) for SZ; 0.940 (sex), 0.150 (age), and 0.213 (study) for ASD. At a significance level of 0.01, we find that using multi-source data improves the accuracy of the ensemble, as compared to using features from only structural measures, for subgroups pertaining to sex and clinical studies in AD, but this does not hold for the other cases.
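A minimal sketch of such a two-way ANOVA, computed by hand for a balanced design with two factors (data source and subgroup) and tested for the data-source main effect; all accuracy values below are illustrative placeholders, not the study's data:

```python
import numpy as np
from scipy import stats

# Balanced two-way layout: accuracy samples per (data source, subgroup) cell.
# All numbers are made-up placeholders for demonstration.
cells = {
    ("single", "female"): np.array([0.85, 0.86, 0.87, 0.85, 0.86]),
    ("single", "male"):   np.array([0.84, 0.85, 0.86, 0.84, 0.85]),
    ("multi",  "female"): np.array([0.90, 0.91, 0.92, 0.90, 0.91]),
    ("multi",  "male"):   np.array([0.89, 0.90, 0.91, 0.89, 0.90]),
}
n = 5                        # replicates (folds) per cell
a_levels = ["single", "multi"]   # factor A: data source
b_levels = ["female", "male"]    # factor B: subgroup

grand = np.mean([v for arr in cells.values() for v in arr])
mean_a = {a: np.mean([cells[(a, b)] for b in b_levels]) for a in a_levels}

# Sum of squares for the data-source main effect, and the within-cell error.
ss_a = n * len(b_levels) * sum((mean_a[a] - grand) ** 2 for a in a_levels)
ss_err = sum(((cells[(a, b)] - cells[(a, b)].mean()) ** 2).sum()
             for a in a_levels for b in b_levels)

df_a = len(a_levels) - 1
df_err = len(a_levels) * len(b_levels) * (n - 1)
f_a = (ss_a / df_a) / (ss_err / df_err)
p_a = stats.f.sf(f_a, df_a, df_err)   # p-value for the data-source effect
```

A small `p_a` means the multi-source model's accuracy differs from the single-source model's for that attribute's subgroups; interaction and factor-B terms are computed analogously.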
3 Discussion
Relationship of this work to existing literature on identifying and mitigating bias.
As machine learning models are applied to diverse problems in the clinical sciences, there is increasing discussion of bias in particular, and of ethical issues in general wiens2019no, wilkinson2020time, finlayson2021clinician, gichoya2022ai, vasey2022reporting. This literature is playing a crucial role in shaping policies for the deployment of automated diagnostic models. Consequently, there is also a large body of recent work on identifying such biases larrazabal2020gender, seyyed2021underdiagnosis, li2022cross, and on developing techniques to mitigate them gao2020deep, zhao2020training, wang2022embracing. In this study, we have shown that when machine learning models are trained using well-established data preprocessing and hyper-parameter optimization techniques on data from large-scale multi-site studies for three neurological disorders, namely Alzheimer’s disease, schizophrenia, and autism spectrum disorder, the predictions of these models need not be biased. Baseline machine learning models, even performant deep neural networks, that do not use this preprocessing and hyper-parameter optimization do suffer from bias. This observation holds when predictions are stratified across four different attributes (sex, age, race, and the clinical study that collected the data). Our results do not diminish the value of the existing work on bias. Instead, they provide evidence that unbiased machine learning-based diagnostic models can be developed given ample, proper, and diverse training data. They therefore offer hope that, once appropriate checks and balances are in place, machine learning models can eventually be deployed to obtain predictions that are both accurate and unbiased.
Investigating existing machine learning techniques thoroughly is as important as discovering new ways to mitigate bias.
The machine learning literature has well-established safeguards against poor generalization, and disregarding these procedures can lead to biased predictions. Our experiments indicate that a rigorous preprocessing, training, and evaluation methodology makes it possible to build machine learning models that do not suffer from biased predictions. Our work therefore provides a sound “baseline” against which to benchmark the performance of automated diagnostic systems. The accuracy obtained herein on three neurological disorders, across large-scale multi-source cohorts, is comparable to the state of the art of large-scale multi-site studies wen2020convolutional, rozycki2018multisite, katuwal2015predictive, wang2022embracing. This suggests that popular methods used to mitigate bias in the literature, e.g., learning invariant features arjovsky2019invariant, zhao2020training, chyzhyk2022remove, or domain-adaptation techniques gao2020deep, ganin2016domain, tzeng2017adversarial, may end up removing useful information from the data, e.g., correlations of gender or age with the pathology, even if they are robust to covariate shift. This tradeoff, between invariance to the heterogeneity in the data and predictive ability, is seen in the harmonization literature more broadly, and it has been argued previously that it is unavoidable moyer2021harmonization, wang2022embracing.
Large-scale multi-source cohorts of data from diverse populations are desirable for building robust and accurate machine learning models, even if they are not balanced.
Balanced datasets, where each subgroup has an equal number of samples, are desirable if we are to build unbiased models chawla2002smote. But it is extremely difficult to obtain balanced data in practice. There are long-standing problems in recruiting volunteers across gender, age, and race. For example, females, minority ethnic groups, and older subjects are less likely to participate in clinical trials than young white men murthy2004participation, chastain2020racial. Even when we are able to obtain balanced data, a machine learning model may still be biased due to unobserved confounders, e.g., severity of the disease or genetic factors. This thinking has inspired recent studies which argue for training models on extensive multi-source neuroimaging datasets he2022meta, schulz2022performance. Our results corroborate these findings; large-scale cohorts of multi-source data enable the training of robust and unbiased machine learning models, and, equally importantly, also enable thorough evaluation across different attributes.
4 Methods
We summarize our technical approach in this section; Appendix A provides more details. We use 3D magnetic resonance (MR) images along with demographic (gender, age, race, education level, marital status, employment status, handedness, smoking status), clinical (diabetes, hypertension, hyperlipidemia, systolic/diastolic blood pressure, and body mass index), and genetic factors (apolipoprotein E (APOE) alleles 2, 3, and 4), as well as cognitive scores, from three large consortia—iSTAGING habes2021brain for AD, PHENOM satterthwaite2010association, zhang2015heterogeneity, chand2020two for SZ, and ABIDE di2014autism for ASD. We use a standard processing pipeline tustison2010n4itk, doshi2013multi, doshi2016muse to compute structural features from T1-weighted MR images. All accuracies reported in this paper are calculated using 5 independent held-out test sets.
Pre-processing pipeline
Some features, predominantly clinical and genetic factors and cognitive scores, are sparsely populated; see Table 2. Continuous-valued features are normalized to have zero mean and unit variance after median imputation; quantile normalization is used for features with highly skewed distributions. For discrete-valued features, we introduce an “unknown” category for missing values. Corresponding to each feature with missing values, we introduce an additional Boolean feature which indicates whether the value was missing. No harmonization tools pomponio2020harmonization, wang2021harmonization were used.
Hyper-parameter optimization methodology
We use a machine learning framework called AutoGluon erickson2020autogluon which gives an easy way to train a large number of different types of models (deep networks lecun2015deep, k-nearest neighbor classifiers cover1967nearest, random forests breiman2001random, CatBoost prokhorenkova2018catboost, and LightGBM ke2017lightgbm) and to perform hyper-parameter search.
The baseline deep network used in Section 2.1 has three fully-connected layers and is also built within AutoGluon. This network is trained using data that is normalized to have zero mean and unit standard deviation after dropping missing values.
Acknowledgments
This work was supported by the National Institute on Aging (RF1AG054409 and U01AG068057), the National Institute of Mental Health (R01MH112070), the National Science Foundation (2145164) and cloud computing credits from Amazon Web Services.
Appendix A Detailed methods
Data
We use 3D magnetic resonance (T1-weighted) images along with demographic (gender, age, race, education level, marital status, employment status, handedness, smoking status), clinical (diabetes, hypertension, hyperlipidemia, systolic/diastolic blood pressure, and body mass index), and genetic factors (apolipoprotein E (APOE) alleles 2, 3, and 4), as well as cognitive scores (mini-mental state exam (MMSE), full-scale intelligence quotient (FIQ), verbal intelligence quotient (VIQ), and performance intelligence quotient (PIQ)) from three large consortia—iSTAGING habes2021brain for AD, PHENOM satterthwaite2010association, zhang2015heterogeneity, chand2020two for SZ, and ABIDE di2014autism for ASD. The subset of iSTAGING data used here consists of multiple clinical studies: the AD Neuroimaging Initiative (ADNI) jack2008alzheimer, the Penn Memory Center cohort (PENN), and the Australian Imaging, Biomarkers and Lifestyle study (AIBL) ellis2010addressing. In PHENOM, scans were acquired at five different sites, namely Penn (United States), China, Munich, Utrecht, and Melbourne. Each task in this study is a binary classification problem with two labels (healthy controls and patients). We only use the baseline (first time point) scans from each cohort; all follow-up sessions are excluded. This ensures there is no data leakage for the same participant between the training and test sets. For AD, we select stable cognitively normal (CN) subjects and AD patients based on each participant’s longitudinal diagnosis status: we only include subjects who were diagnosed as CN or AD at baseline and whose diagnosis remained stable during the follow-up sessions.
Methodology for creating features from structural measures
We compute features from T1-weighted MR images using a standard pipeline. Scans are bias-field corrected tustison2010n4itk and skull-stripped with a multi-atlas algorithm doshi2013multi, and a multi-atlas label fusion segmentation method doshi2016muse is then used to obtain anatomical region-of-interest (ROI) masks for 119 grey matter ROIs, 20 white matter ROIs, and 6 ventricle ROIs (145 in total). We further segment white matter hyperintensities (WMH) using a deep learning-based algorithm.
Evaluation methodology
We report accuracy on held-out test sets as follows. We split the data into 5 equal-sized folds (stratified by labels), use four for training and validation (80%) and the fifth for testing (20%). All hyper-parameter tuning is performed using a further 5-fold cross-validation within the 80% split. This way, the remaining 20% is a completely independent test set used only for reporting the final accuracy. We report the mean and standard deviation of the accuracy over the 5 independent test sets (one for each outer fold). This is a computationally expensive, but rigorous, evaluation methodology. The data for each neurological disorder comes from multiple clinical studies; we create the training, validation, and test sets for each study independently and then concatenate them.
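The nested splitting described above can be sketched with scikit-learn's `StratifiedKFold`; the feature and label arrays here are random placeholders, not the study's data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # placeholder features (e.g., ROI volumes)
y = rng.integers(0, 2, size=200)    # placeholder labels: control (0) vs patient (1)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, test_idx in outer.split(X, y):
    # 80% of the data: all hyper-parameter tuning happens inside this split,
    # via a further (inner) 5-fold cross-validation.
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for tr_idx, val_idx in inner.split(X[train_idx], y[train_idx]):
        pass  # fit candidate models on tr_idx, score them on val_idx
    # The remaining 20% is an independent test set, used only once
    # to report the final accuracy for this outer fold.
    fold_sizes.append(len(test_idx))
```

Because the inner loop never touches `test_idx`, each of the 5 reported accuracies comes from data the tuned model has never seen.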
Pre-processing pipeline
Our data contains structural features such as ROI and WMH volumes in addition to demographic, clinical, and genetic factors, and cognitive scores. Some of these features (predominantly the last three) are sparsely populated; see Table 2. For continuous-valued features, we first impute missing values with the median of each variable and then normalize the feature to have zero mean and unit variance; quantile normalization is applied to features with highly skewed distributions. For discrete-valued features, we introduce an “unknown” category for missing values. Corresponding to each feature with missing values, we introduce an additional Boolean feature which indicates whether the value was missing. This way we preserve the evidence of absence (rather than the absence of evidence) erickson2020autogluon. We did not use any harmonization tools pomponio2020harmonization, wang2021harmonization.
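These imputation and encoding steps can be sketched in pandas; the column names and values below are illustrative, not the study's actual variables, and the quantile-normalization branch is omitted for brevity:

```python
import numpy as np
import pandas as pd

# Toy table mixing continuous and categorical features with missing values.
df = pd.DataFrame({
    "hippocampus_volume": [3.1, np.nan, 2.8, 3.4],
    "bmi": [24.0, 31.5, np.nan, 27.2],
    "smoker": ["no", None, "yes", "no"],
})

for col in ["hippocampus_volume", "bmi"]:
    missing = df[col].isna()
    df[col + "_missing"] = missing.astype(int)            # preserve evidence of absence
    df[col] = df[col].fillna(df[col].median())            # median imputation
    df[col] = (df[col] - df[col].mean()) / df[col].std()  # zero mean, unit variance

# Discrete features: map missing values to an explicit "unknown" category.
df["smoker_missing"] = df["smoker"].isna().astype(int)
df["smoker"] = df["smoker"].fillna("unknown")
```

The paired `*_missing` indicator columns let a downstream model distinguish an imputed value from a genuinely observed one.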
Hyper-parameter optimization methodology
We compare results from an optimized machine learning model, for which hyper-parameter optimization and ensemble learning were performed, with a baseline network. For the former, we use a machine learning framework called AutoGluon erickson2020autogluon which gives an easy way to train a large number of different types of models (deep networks lecun2015deep, k-nearest neighbor classifiers cover1967nearest, random forests breiman2001random, CatBoost prokhorenkova2018catboost, and LightGBM ke2017lightgbm) and to perform hyper-parameter search. For deep network models we create an input layer that concatenates the embeddings of continuous-valued and categorical features; the other models natively handle both types of features. Using AutoGluon we can also build ensembles of these models via bagging breiman1996bagging, boosting bartlett1998boosting, and stacking wolpert1992stacked. For each neurological disorder, and for each of the 5 outer folds, we train the above types of models with different hyper-parameters in parallel across multiple CPUs and 4 GPUs for 1 hour, and build the ensemble that obtains the best classification log-likelihood on the validation data.
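AutoGluon performs this model search and ensembling internally; the stacking idea it builds on can be sketched as a much-simplified scikit-learn analogue, with synthetic data and a far smaller model zoo than the study's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the tabular features (ROI volumes, demographics, ...).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stack heterogeneous base models, with a logistic-regression meta-learner
# fit on their out-of-fold predictions (cv=5).
ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
ensemble.fit(X_train, y_train)
test_accuracy = ensemble.score(X_test, y_test)
```

In AutoGluon itself the corresponding entry point is `TabularPredictor(label=...).fit(train_data, time_limit=3600)`, which trains the model zoo and stacks it automatically.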
The baseline deep network used in Section 2.1 has three fully-connected layers and is also built within the same software framework. This network is trained using data that is normalized to have zero mean and unit standard deviation after dropping missing values. It does not use the pre-processing pipeline described above.
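A sketch of such a baseline network, using scikit-learn's `MLPClassifier` on synthetic data; the three hidden-layer widths are assumptions, since the paper does not list them:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data; the study's baseline uses the real tabular features
# with missing values dropped rather than imputed.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)   # zero mean, unit standard deviation

# Three fully-connected hidden layers, mirroring the baseline's architecture;
# the widths (128, 64, 32) are illustrative assumptions.
baseline = MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                         max_iter=500, random_state=0)
baseline.fit(X, y)
train_accuracy = baseline.score(X, y)
```

Note that this baseline deliberately skips the missingness indicators and "unknown" categories described above, which is the methodological gap the optimized pipeline closes.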
Appendix B Summary of the data
Alzheimer’s Disease | ADNI-1 | ADNI-2/3 | PENN | AIBL | Total | |
(22.81%) | (30.58%) | (33.40%) | (13.22%) | |||
Subjects | ||||||
Control | 173 | 261 | 228 | 119 | 781 | |
Patient | 191 | 227 | 305 | 92 | 815 | |
Sex (%) | ||||||
Female | 10.90 | 15.66 | 21.37 | 8.08 | 56.02 | |
Male | 11.90 | 14.91 | 12.03 | 5.14 | 43.98 | |
Age (%, years) | ||||||
0–65 | 1.50 | 3.51 | 6.33 | 1.82 | 13.16 | |
65–70 | 1.94 | 8.15 | 7.02 | 2.57 | 19.67 | |
70–75 | 7.27 | 6.95 | 6.83 | 3.76 | 24.81 | |
75–80 | 6.52 | 6.89 | 6.58 | 2.57 | 22.56 | |
> 80 | 5.58 | 5.08 | 6.64 | 2.51 | 19.80 | |
Race (%) | ||||||
White | 21.18 | 16.98 | 25.13 | 7.33 | 70.61 | |
Black | 1.19 | 0.88 | 6.70 | - | 8.77 | |
Asian | 0.31 | 0.56 | 0.63 | - | 1.50 |
Schizophrenia | Penn | China | Munich | Utrecht | Melbourne | Total | |
(22.28%) | (13.94%) | (29.64%) | (20.12%) | (14.03%) | |||
Subjects | |||||||
Control | 131 | 76 | 157 | 115 | 84 | 563 | |
Patient | 96 | 66 | 145 | 90 | 59 | 456 | |
Sex (%) | |||||||
Female | 11.87 | 6.77 | 7.75 | 6.97 | 4.02 | 37.39 | |
Male | 10.40 | 7.16 | 21.88 | 13.15 | 10.01 | 62.61 | |
Age (%, years) | |||||||
0–25 | 5.79 | 4.91 | 9.42 | 10.11 | 5.89 | 36.11 | |
25–30 | 6.28 | 2.36 | 7.26 | 4.12 | 2.16 | 22.18 | |
30–35 | 3.53 | 2.16 | 5.99 | 2.85 | 1.37 | 15.90 | |
> 35 | 6.67 | 4.51 | 6.97 | 3.04 | 4.61 | 25.81 | |
Race (%) | |||||||
Native | 10.50 | - | - | - | - | 10.50 | |
Asian | 7.36 | - | - | - | - | 7.36 |
Autism Spectrum | ABIDE-1 | ABIDE-2 | Total | |
Disorder | (62.78%) | (37.22%) | ||
Subjects | ||||
Control | 224 | 138 | 362 | |
Patient | 196 | 111 | 307 | |
Sex (%) | ||||
Female | 6.73 | 6.28 | 13.00 | |
Male | 56.05 | 30.94 | 87.00 |
Age (%, years) | ||||
0–20 | 21.97 | 11.66 | 33.63 | |
20–25 | 17.49 | 11.06 | 28.55 | |
> 25 | 23.32 | 14.50 | 37.82 |
Variables | iSTAGING | PHENOM | ABIDE | |
MR imaging | ||||
Region-of-interest volumes | ||||
White matter lesion volume | ||||
Demographics | ||||
Gender | ||||
Age | ||||
Race | ||||
Education level | ||||
Marital status | ||||
Employment status | ||||
Handedness | ||||
Smoking status | ||||
Clinical | ||||
Diabetes | ||||
Hypertension | ||||
Hyperlipidemia | ||||
Blood pressure (systolic/diastolic) | ||||
Body mass index | ||||
Genetic factor | ||||
Apolipoprotein E alleles 2, 3 and 4 | ||||
Cognitive scores | ||||
Mini-mental state exam | ||||
Full-scale intelligence quotient | ||||
Verbal intelligence quotient | ||||
Performance intelligence quotient |