Machine Learning Models Are Not Necessarily Biased When Constructed Properly: Evidence from Neuroimaging Studies

05/26/2022
by   Rongguang Wang, et al.
University of Pennsylvania

Despite the great promise that machine learning has offered in many fields of medicine, it has also raised concerns about potential biases and poor generalization across genders, age distributions, races and ethnicities, hospitals, and data acquisition equipment and protocols. In the current study, and in the context of three brain diseases, we provide experimental data which support that, when properly trained, machine learning models can generalize well across diverse conditions and do not suffer from biases. Specifically, by using multi-study magnetic resonance imaging consortia for diagnosing Alzheimer's disease, schizophrenia, and autism spectrum disorder, we find that the accuracy of well-trained models is consistent across different subgroups pertaining to attributes such as gender, age, and racial groups, as well as across different clinical studies. We find that models that incorporate multi-source data from demographic, clinical, and genetic factors, as well as cognitive scores, are also unbiased. In some cases these models have better predictive accuracy across subgroups than those trained only with structural measures, but there are also situations in which these additional features do not help.


1 Introduction

Machine learning models have shown great promise for precision diagnosis, treatment prediction, and a number of other clinical applications yu2018artificial, myszczynska2020applications, abrol2021deep, zhou2021review, singh2022machine. This has led to increasing interest in building systems in which such models can aid human experts in accurate and efficient decision making in clinical settings thompson2020enigma, marek2022reproducible, bethlehem2022brain. However, there are some key challenges to achieving this goal dinsdale2021challenges, rajpurkar2022ai, varoquaux2022machine. In particular, clinical data is highly heterogeneous. For example, heterogeneity in neurological disorders such as Alzheimer’s disease stems not only from the diverse anatomies, overlapping clinical phenotypes, and genomic traits of different subjects, but also from operational, demographic and social aspects such as data acquisition protocols pomponio2020harmonization, wang2021harmonization, perkonigg2021dynamic, and the paucity of data for minorities davatzikos2019machine. As a consequence, machine learning models often have poor reproducibility across the population.

This paper focuses on one particular aspect of this issue, namely the fact that machine learning models can make biased predictions, i.e., they have different accuracy for different genders, age and racial groups, and cohorts from different clinical studies. Recently, this issue has received wide attention larrazabal2020gender, gao2020deep, wilkinson2020time, seyyed2021underdiagnosis, finlayson2021clinician, dockes2021preventing, li2022cross. For example, a machine learning model trained on X-ray images consistently made inaccurate predictions on underrepresented genders when the imbalance in the training data was beyond a threshold larrazabal2020gender. Similarly, for classification of chest X-ray pathologies, minority groups such as female African-Americans and patients with low socioeconomic status are prone to being incorrectly diagnosed as healthy seyyed2021underdiagnosis. Such results have raised concerns about whether machine learning models can provide unbiased predictions, and whether they can eventually be deployed in clinical settings.

This paper provides results from relatively large and diverse datasets pertaining to three neurological disorders, namely Alzheimer’s disease (AD), schizophrenia (SZ), and autism spectrum disorder (ASD), which alleviate these concerns. We build machine learning models using 3D magnetic resonance (MR) images along with demographic, clinical, and genetic factors, as well as cognitive scores, from large-scale consortia—iSTAGING habes2021brain for AD, PHENOM satterthwaite2010association, zhang2015heterogeneity, chand2020two for SZ, and ABIDE di2014autism for ASD; see Table 1. There is a large imbalance in these datasets, e.g., 13% and 37.4% female subjects respectively in ABIDE and PHENOM, and 70.6% European Americans, 8.8% African Americans, and 1.5% Asian Americans in iSTAGING. We show that, when trained with appropriate data pre-processing techniques and hyper-parameter tuning, machine learning models do not make biased predictions across different subgroups. This is not the case for a baseline deep neural network, which provides accurate predictions on average across the population but suffers from bias when predictions are stratified by different attributes.

2 Results

Figure 1: Evaluating machine learning models on subjects from different gender, age, racial groups, and clinical studies. For each of the three disorders (Alzheimer’s disease, schizophrenia and autism spectrum disorder), we built an ensemble of machine learning models that uses data from multiple sources (structural measures, demographic and clinical variables, genetic factors, and cognitive scores). This ensemble was trained using the pre-processing and hyper-parameter optimization discussed in Section 4. For comparison, we also trained an ensemble that only uses structural measures as features. Data from different studies (e.g., ADNI-1, ADNI-2/3, PENN and AIBL for Alzheimer’s disease) was used for training, without any harmonization preprocessing to remove the batch/site effects. Bar plots denote the size of each subgroup/study (%); in many cases, there is strong imbalance in the amount of data for different subgroups. Violin plots denote the test accuracy (%) on five different held-out subsets of the data. Solid colors indicate models that used all features, while translucent colors indicate models trained only on structural measures. Translucent gray denotes the accuracy of a baseline deep network (without appropriate preprocessing and hyper-parameter tuning). White dots denote the average accuracy for each subgroup/study. For the ensemble trained on multi-source data, the p-values shown in the figure indicate that we cannot reject the null hypothesis that the accuracy for different subgroups has the same mean (at significance level < 0.01). This is not the case for the baseline deep network; see Section 2.1.

2.1 Baseline machine learning models have accurate predictions on-average, but can be biased

The accuracy of a deep network on held-out data is 85.83% ± 1.76% for AD, 73.29% ± 2.85% for SZ, and 55.61% ± 3.61% for ASD. These numbers are comparable to published results in the literature wen2020convolutional, rozycki2018multisite, katuwal2015predictive. As Fig. 1 shows, for all three neurological disorders, we find large discrepancies in the accuracy of this model on different subgroups of the population for all attributes. For example, for AD the model has a higher accuracy on females than males (p-value 1.59); for SZ the model predicts more accurately for Native Americans than Asian Americans (p-value 1.49). One may be inclined to hypothesize that this bias in accuracy comes from one subgroup having a larger sample size than the other (Table 1). This hypothesis does not hold when predictions are stratified by age. AD subjects older than 80 years, SZ subjects older than 35 years, and ASD subjects younger than 20 years have a lower accuracy (p-values are 1.76, 3.54 and 2.80 respectively), even though these subgroups are not the ones with the smallest sample size (Table 1). For all three disorders, predictions of the baseline deep network are biased (p-value < 0.01) except in four cases: race and clinical studies in AD, sex in SZ, and clinical studies in ASD.

2.2 Appropriately designed machine learning models are not necessarily biased

The accuracy of the ensemble learned using the methodology discussed in Section 4 on held-out data is 87.35% ± 1.18% for AD, 74.76% ± 3.32% for SZ and 60.54% ± 2.13% for ASD; all three are slightly better than that of the baseline deep network. As Fig. 1 shows visually, the accuracy of the ensemble is more consistent. For all three neurological disorders, and for all attributes, we cannot reject the null hypothesis that the accuracy for different subgroups has the same mean (at significance level < 0.01). In other words, the ensemble does not exhibit a bias, up to statistically indistinguishable levels. It is remarkable that this observation holds even in situations with extreme imbalance in the data, e.g., for ASD there are 87% males as compared to 13% females.
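The subgroup comparison above tests whether per-subgroup accuracies share a common mean. A minimal sketch of such a test, assuming a one-way ANOVA over per-fold subgroup accuracies (the exact test statistic is not spelled out in the text, and the accuracy numbers below are made up for illustration):

```python
# Hypothetical per-fold test accuracies (%) for three subgroups;
# one value per held-out fold. These numbers are illustrative only.
from scipy.stats import f_oneway

acc_female = [87.1, 86.4, 88.0, 87.5, 86.9]
acc_male   = [86.8, 87.3, 86.1, 87.9, 86.5]
acc_other  = [87.7, 86.0, 87.2, 86.6, 88.1]

# One-way ANOVA: null hypothesis is that all subgroup accuracies
# have the same mean. A p-value >= 0.01 means we cannot reject it,
# i.e., no bias is detected at that significance level.
stat, p = f_oneway(acc_female, acc_male, acc_other)
print(f"F = {stat:.3f}, p = {p:.3f}")
```

At significance level 0.01, a large p-value here is evidence of consistent accuracy across the subgroups, mirroring the stratified evaluation in Fig. 1.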

Using the same preprocessing and hyper-parameter tuning methodology as that of the ensemble, the bias of the deep network can be reduced slightly. We obtained an accuracy of 86.78% ± 1.90% on AD, 70.44% ± 2.78% on SZ and 54.70% ± 4.03% on ASD; these numbers, except the one for SZ, are about the same as those of the baseline deep network. The p-values for there being bias across subgroups are now 0.665 (sex), 0.090 (age), 0.690 (race) and 0.768 (study) for AD; 0.089 (sex), 0.771 (age), 0.381 (race) and 7.86 (study) for SZ; 4.77 (sex), 0.356 (age) and 0.880 (study) for ASD. With a significance level of 0.01, this indicates that the deep network trained with better data preprocessing achieves a similar average accuracy to the baseline deep network but is not biased (except across clinical studies in SZ, which is likely to be partially due to differences in clinical characteristics across patient cohorts).

2.3 Machine learning models trained on multi-source data are also unbiased and predict more accurately on average, but their subgroup-specific accuracy may not always be better than models trained on single-source data

We next trained the ensemble using demographic features, clinical variables, genetic factors and cognitive scores in addition to MR imaging features. The accuracy of this multi-source ensemble on held-out data is 91.22% ± 1.76% for AD, 79.18% ± 3.29% for SZ and 63.83% ± 3.18% for ASD. All three values are better than the corresponding ones for the ensemble trained only on structural MR imaging features (p-value < 6.44 for all three). Therefore, using multi-source data improves the average accuracy of machine learning models for the three neurological disorders. This ensemble also makes unbiased predictions across different subgroups; the p-values are 0.106 (sex), 0.732 (age), 0.819 (race), and 0.275 (study) for AD; 0.808 (sex), 0.845 (age), 0.772 (race), and 0.012 (study) for SZ; 0.811 (sex), 0.523 (age), and 0.862 (study) for ASD.

We performed a two-way ANOVA rutherford2011anova to check whether the improved average accuracy translates into improved accuracy for the subgroups pertaining to each attribute. The p-values are 7.91 (sex), 1.07 (age), 8.24 (race), and 8.10 (study) for AD; 7.40 (sex), 1.68 (age), 0.566 (race), and 5.24 (study) for SZ; 0.940 (sex), 0.150 (age), and 0.213 (study) for ASD. At a significance level of 0.01, we find that using multi-source data improves the accuracy of the ensemble, as compared to using features from only structural measures, for subgroups pertaining to sex and clinical studies in AD, but this does not hold for the other cases.

3 Discussion

Relationship of this work to existing literature on identifying and mitigating bias.

As machine learning models are being applied to diverse problems in the clinical sciences, there is an increasing amount of discussion on bias in particular, and ethical issues in general wiens2019no, wilkinson2020time, finlayson2021clinician, gichoya2022ai, vasey2022reporting. This literature is playing a crucial role in shaping policies for deployment of automated diagnostic models. Consequently, there is also a large amount of recent work on identifying such biases larrazabal2020gender, seyyed2021underdiagnosis, li2022cross, and developing techniques to mitigate these biases gao2020deep, zhao2020training, wang2022embracing. In this study, we have shown that when machine learning models are trained using well-established data preprocessing and hyper-parameter optimization techniques on data from large-scale multi-site studies for three neurological disorders, namely Alzheimer’s disease, schizophrenia, and autism spectrum disorder, the predictions of these models need not be biased. Baseline machine learning models, even performant deep neural networks, that do not use this preprocessing and hyper-parameter optimization do suffer from bias. This observation holds when predictions are stratified across four different attributes (sex, age, race and the clinical study that collected the data). Our results do not diminish the value of the existing work on bias. Instead, they provide evidence that we might be able to develop unbiased machine learning-based diagnostic models with ample, proper, and diverse training. Our results therefore provide the hope that, after appropriate checks and balances are met, we might be able to deploy machine learning models in the future, and obtain both accurate and unbiased predictions.

Investigating existing machine learning techniques thoroughly is as important as discovering new ways to mitigate bias.

The machine learning literature has well-established ways to safeguard against poor generalization. Disregarding these procedures can lead to bias in the predictions. Our experiments indicate that a rigorous preprocessing, training and evaluation methodology can enable us to build machine learning models that do not suffer from biased predictions. Our work therefore provides a sound “baseline” against which to benchmark the performance of automated diagnostic systems. The fact that the accuracy obtained herein, on three neurological disorders and across large-scale multi-source cohorts, is comparable to the state of the art of large-scale multi-site studies wen2020convolutional, rozycki2018multisite, katuwal2015predictive, wang2022embracing suggests that popular methods used to mitigate bias, e.g., learning invariant features arjovsky2019invariant, zhao2020training, chyzhyk2022remove, or domain-adaptation techniques gao2020deep, ganin2016domain, tzeng2017adversarial, may end up removing useful information from the data, e.g., correlations between gender or age and the pathology, even if they are robust to covariate shift. This tradeoff, between being invariant to the heterogeneity in the data and predictive ability, is seen in the harmonization literature more broadly, and it has been argued previously that it is unavoidable moyer2021harmonization, wang2022embracing.

Large-scale multi-source cohorts of data from diverse populations are desirable for building robust and accurate machine learning models, even if they are not balanced.

Balanced datasets, in which each subgroup has an equal number of samples, are desirable if we are to build unbiased models chawla2002smote. But it is extremely difficult to obtain balanced data in practice. There are long-standing problems in recruiting volunteers across gender, age, and race. For example, females, minority ethnic groups, and older subjects are less likely to participate in clinical trials than young white men murthy2004participation, chastain2020racial. Even when we are able to obtain balanced data, a machine learning model may still be biased due to unobserved confounders, e.g., severity of the disease or genetic factors. This thinking has inspired recent studies which argue for training models on extensive multi-source neuroimaging datasets he2022meta, schulz2022performance. Our results corroborate these findings; large-scale cohorts of multi-source data can enable training robust and unbiased machine learning models, and, equally importantly, also enable thorough evaluation across different attributes.

4 Methods

We summarize our technical approach in this section and Appendix A provides more details. We use 3D magnetic resonance (MR) images along with demographic (gender, age, race, education level, marital status, employment status, handedness, smoker), clinical (diabetes, hypertension, hyperlipidemia, systolic/diastolic blood pressure and body mass index), and genetic factors (apolipoprotein E (APOE) alleles 2, 3 and 4) and cognitive scores from three large consortia—iSTAGING habes2021brain for AD, PHENOM satterthwaite2010association, zhang2015heterogeneity, chand2020two for SZ, and ABIDE di2014autism for ASD. We use a standard processing pipeline tustison2010n4itk,doshi2013multi,doshi2016muse to compute structural features from T1-weighted MR images. All accuracies reported in this paper are calculated using 5 independent held-out test sets.

Pre-processing pipeline

Some features, predominantly clinical and genetic factors and cognitive scores, are sparsely populated; see Table 2. Continuous-valued features are normalized to have zero mean and unit variance after median imputation; quantile normalization is used for features with highly skewed distributions. For discrete-valued features, we introduce an “unknown” category for missing values. Corresponding to each feature with missing values, we introduce an additional Boolean feature which indicates whether the value was missing. No harmonization tools pomponio2020harmonization, wang2021harmonization are used to remove the batch/site effects.
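These imputation and normalization steps can be sketched as follows (helper names are hypothetical, and quantile normalization for skewed features is omitted here):

```python
import numpy as np

def preprocess_continuous(x):
    """Median-impute missing values, then z-score normalize.
    Returns the normalized feature and a Boolean missingness indicator,
    matching the extra Boolean feature described in the text."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    x = np.where(missing, np.nanmedian(x), x)   # median imputation
    x = (x - x.mean()) / (x.std() + 1e-12)      # zero mean, unit variance
    return x, missing

def preprocess_discrete(x):
    """Map missing categories to an explicit 'unknown' level and
    return a missingness indicator alongside."""
    missing = [v is None for v in x]
    filled = ["unknown" if v is None else v for v in x]
    return filled, missing

vals, miss = preprocess_continuous([1.0, np.nan, 3.0, 4.0])
cats, cmiss = preprocess_discrete(["F", None, "M"])
```

Keeping the indicator column preserves the evidence of absence rather than silently discarding it.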

Hyper-parameter optimization methodology

We use a machine learning framework called AutoGluon erickson2020autogluon which gives an easy way to train a large number of different types of models (deep networks lecun2015deep, k-nearest neighbor classifiers cover1967nearest, random forests breiman2001random, CatBoost prokhorenkova2018catboost, and LightGBM ke2017lightgbm) and perform hyper-parameter search. Using AutoGluon we can also build ensembles of these models via bagging breiman1996bagging, boosting bartlett1998boosting and stacking wolpert1992stacked.
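AutoGluon assembles such ensembles internally; as a minimal stand-in, the same idea of stacking several model families can be sketched with scikit-learn (the models, hyper-parameters, and synthetic data below are placeholders, not the paper's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for tabular features.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stack heterogeneous base models; a linear meta-learner combines their
# cross-validated predictions (cv=5 plays the role of bagged folds).
ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
ensemble.fit(X_tr, y_tr)
score = ensemble.score(X_te, y_te)
```

The design point is that disagreement among diverse base learners is resolved by a meta-learner fit on out-of-fold predictions, which is the same mechanism AutoGluon uses at larger scale.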

The baseline deep network used in Section 2.1 has three fully-connected layers and is also built within AutoGluon. This network is trained using data that is normalized to have zero mean and unit standard deviation after dropping missing values.
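As an illustration of such a baseline (the actual network is built inside AutoGluon; the layer widths and synthetic data here are assumptions), a comparable three-layer fully-connected network can be sketched with scikit-learn:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic complete-case data; in the baseline, rows with missing
# values are dropped before normalization.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Zero mean, unit standard deviation per feature.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Three fully-connected hidden layers (widths are placeholders).
net = MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                    max_iter=500, random_state=0)
net.fit(X, y)
acc = net.score(X, y)
```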

Acknowledgments

This work was supported by the National Institute on Aging (RF1AG054409 and U01AG068057), the National Institute of Mental Health (R01MH112070), the National Science Foundation (2145164) and cloud computing credits from Amazon Web Services.


Appendix A Detailed methods

Data

We use 3D magnetic resonance (T1-weighted) images along with demographic (gender, age, race, education level, marital status, employment status, handedness, smoking status), clinical (diabetes, hypertension, hyperlipidemia, systolic/diastolic blood pressure and body mass index), and genetic factors (apolipoprotein E (APOE) alleles 2, 3 and 4), as well as cognitive scores (mini-mental state exam (MMSE), full-scale intelligence quotient (FIQ), verbal intelligence quotient (VIQ), and performance intelligence quotient (PIQ)) from three large consortia—iSTAGING habes2021brain for AD, PHENOM satterthwaite2010association, zhang2015heterogeneity, chand2020two for SZ, and ABIDE di2014autism for ASD. The subset of iSTAGING data used here consists of multiple clinical studies: the AD Neuroimaging Initiative (ADNI) jack2008alzheimer, the Penn Memory Center cohort (PENN), and the Australian Imaging, Biomarkers and Lifestyle study (AIBL) ellis2010addressing. In PHENOM, scans were acquired at five different sites, namely Penn (United States), China, Munich, Utrecht, and Melbourne. Each task in this study is a binary classification problem with two labels (healthy controls and patients). We only use the baseline (first time point) scans from each cohort; all follow-up sessions are excluded. This way there is no data leakage for the same participant between training and test sets. For AD, we select stable cognitively normal (CN) and AD patients based on each participant’s longitudinal diagnosis status. We only include subjects who were diagnosed as CN or AD at baseline and whose diagnosis stayed stable during the follow-up sessions.

Methodology for creating features from structural measures

We compute features from T1-weighted MR images using a standard pipeline. Scans are bias-field corrected tustison2010n4itk, skull-stripped with a multi-atlas algorithm doshi2013multi, and then a multi-atlas label fusion segmentation method doshi2016muse is used to obtain anatomical region-of-interest (ROI) masks for 119 grey matter ROIs, 20 white matter ROIs and 6 ventricle ROIs of the brain (145 in total). We further segment white matter hyperintensities (WMH) using a deep learning-based algorithm doshi2019deepmrseg on fluid-attenuated inversion recovery (FLAIR) and T1-weighted images. White matter lesion (WML) volumes are obtained by summing up the WMH mask voxels.
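The lesion-volume step amounts to counting mask voxels and scaling by the physical voxel volume. A toy sketch (the mask and the 1 mm isotropic voxel size are assumptions for illustration):

```python
import numpy as np

# Toy binary WMH mask: a 2x2x2 lesion inside a 4x4x4 volume.
wmh_mask = np.zeros((4, 4, 4), dtype=np.uint8)
wmh_mask[1:3, 1:3, 1:3] = 1

# Voxel volume comes from the scan header; 1 mm isotropic is assumed here.
voxel_volume_mm3 = 1.0 * 1.0 * 1.0

# WML volume = number of lesion voxels times the volume of one voxel.
wml_volume = wmh_mask.sum() * voxel_volume_mm3
```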

Evaluation methodology

We report accuracy on held-out test sets as follows. We split data into 5 equal-sized folds (stratified by labels), use four for training and validation (80%) and the fifth for testing (20%). All hyper-parameter tuning is performed using a further 5-fold cross-validation within the 80% data. This way, the 20% data is a completely independent test set which is used only for reporting the final accuracy. We report mean and standard deviation of the accuracy over 5 independent test sets (one for each outer fold). This is a computationally expensive, but rigorous, evaluation methodology. The three neurological disorders consist of data from multiple clinical studies; we create the training, validation and test sets for each study independently and then concatenate them.
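The nested evaluation described above can be sketched as follows, assuming scikit-learn utilities and a placeholder model and grid (the paper's actual models are trained via AutoGluon):

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the labeled tabular data.
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Outer loop: 5 stratified folds; each fold serves once as the
# completely held-out 20% test set.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_scores = []
for train_idx, test_idx in outer.split(X, y):
    # Inner loop: hyper-parameter tuning via 5-fold CV on the 80%.
    inner = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    )
    inner.fit(X[train_idx], y[train_idx])
    test_scores.append(inner.score(X[test_idx], y[test_idx]))

# Mean and standard deviation over the 5 independent test folds.
mean_acc = statistics.mean(test_scores)
std_acc = statistics.stdev(test_scores)
```

Because tuning only ever sees the inner folds, each outer test score is an unbiased estimate of held-out accuracy; the per-study splitting and concatenation described above would be applied before the outer loop.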

Pre-processing pipeline

Our data contains structural features such as ROI and WML volumes in addition to demographic, clinical and genetic factors, and cognitive scores. Some of these features (predominantly the last three) are sparsely populated; see Table 2. For continuous-valued features, we first impute missing values with the median of each variable and normalize the feature to have zero mean and unit variance. We apply quantile normalization to features with skewed distributions. For discrete-valued features, we introduce an “unknown” category for missing values. Corresponding to each feature with missing values, we introduce an additional Boolean feature which indicates whether the value was missing. This way we preserve the evidence of absence (rather than the absence of evidence) erickson2020autogluon. We did not use any harmonization tools pomponio2020harmonization, wang2021harmonization.

Hyper-parameter optimization methodology

We compare results from an optimized machine learning model, in which hyper-parameter optimization and ensemble learning were performed, with a basic network. For the former, we use a machine learning framework called AutoGluon erickson2020autogluon which gives an easy way to train a large number of different types of models (deep networks lecun2015deep, k-nearest neighbor classifiers cover1967nearest, random forests breiman2001random, CatBoost prokhorenkova2018catboost, and LightGBM ke2017lightgbm) and perform hyper-parameter search. For deep network models we create an input layer that concatenates the embeddings of continuous-valued and categorical features; the other models can natively handle both of these feature types. Using AutoGluon we can also build ensembles of these models via bagging breiman1996bagging, boosting bartlett1998boosting and stacking wolpert1992stacked. For each neurological disorder, and for each of the 5 outer folds, we train the above types of models with different hyper-parameters in parallel across multiple CPUs and 4 GPUs for 1 hour, and build an ensemble that obtains the best classification log-likelihood on the validation data.

The baseline deep network used in Section 2.1 has three fully-connected layers and is also built within the same software framework. This network is trained using data that is normalized to have zero mean and unit standard deviation after dropping missing values. It does not use the pre-processing pipeline described above.

Appendix B Summary of the data

Alzheimer’s Disease ADNI-1 ADNI-2/3 PENN AIBL Total
(22.81%) (30.58%) (33.40%) (13.22%)
Subjects
Control 173 261 228 119 781
Patient 191 227 305 92 815
Sex (%)
Female 10.90 15.66 21.37 8.08 56.02
Male 11.90 14.91 12.03 5.14 43.98
Age (%, years)
0–65 1.50 3.51 6.33 1.82 13.16
65–70 1.94 8.15 7.02 2.57 19.67
70–75 7.27 6.95 6.83 3.76 24.81
75–80 6.52 6.89 6.58 2.57 22.56
> 80 5.58 5.08 6.64 2.51 19.80
Race (%)
White 21.18 16.98 25.13 7.33 70.61
Black 1.19 0.88 6.70 - 8.77
Asian 0.31 0.56 0.63 - 1.50
Schizophrenia Penn China Munich Utrecht Melbourne Total
(22.28%) (13.94%) (29.64%) (20.12%) (14.03%)
Subjects
Control 131 76 157 115 84 563
Patient 96 66 145 90 59 456
Sex (%)
Female 11.87 6.77 7.75 6.97 4.02 37.39
Male 10.40 7.16 21.88 13.15 10.01 62.61
Age (%, years)
0–25 5.79 4.91 9.42 10.11 5.89 36.11
25–30 6.28 2.36 7.26 4.12 2.16 22.18
30–35 3.53 2.16 5.99 2.85 1.37 15.90
> 35 6.67 4.51 6.97 3.04 4.61 25.81
Race (%)
Native 10.50 - - - - 10.50
Asian 7.36 - - - - 7.36
Autism Spectrum ABIDE-1 ABIDE-2 Total
Disorder (62.78%) (37.22%)
Subjects
Control 224 138 362
Patient 196 111 307
Sex (%)
Female 6.73 6.28 13.00
Male 56.05 30.94 87.0
Age (%, years)
0–20 21.97 11.66 33.63
20–25 17.49 11.06 28.55
> 25 23.32 14.50 37.82
Table 1: Summary of participant demographics of the iSTAGING consortium (Alzheimer’s disease), the PHENOM consortium (Schizophrenia), and the ABIDE datasets (Autism spectrum disorder) used in this study.
Variables iSTAGING PHENOM ABIDE
MR imaging
Region-of-interest volumes
White matter lesion volume
Demographics
Gender
Age
Race
Education level
Marital status
Employment status
Handedness
Smoking status
Clinical
Diabetes
Hypertension
Hyperlipidemia
Blood pressure (systolic/diastolic)
Body mass index
Genetic factor
Apolipoprotein E alleles 2, 3 and 4
Cognitive scores
Mini-mental state exam
Full-scale intelligence quotient
Verbal intelligence quotient
Performance intelligence quotient
Table 2: Summary of variables in the data from the iSTAGING consortium (Alzheimer’s disease), the PHENOM consortium (Schizophrenia), and the ABIDE datasets (Autism spectrum disorder) used in this study.