It's easy to fool yourself: Case studies on identifying bias and confounding in bio-medical datasets

by Subhashini Venugopalan, et al.

Confounding variables are a well known source of nuisance in biomedical studies. They present an even greater challenge when we combine them with black-box machine learning techniques that operate on raw data. This work presents two case studies. In one, we discovered biases arising from systematic errors in the data generation process. In the other, we found a spurious source of signal unrelated to the prediction task at hand. In both cases, our prediction models performed well but under careful examination hidden confounders and biases were revealed. These are cautionary tales on the limits of using machine learning techniques on raw data from scientific experiments.




1 Introduction

Deep learning provides powerful tools to unravel hidden signal in data. The field has had tremendous success applying it to a number of problems in the medical domain, from detecting cancers Esteva et al. (2017); Liu et al. (2017) and diabetic retinopathy Gulshan et al. (2016) to predicting cardiovascular risk factors Poplin et al. (2018). But along with the power comes the peril of hunting for the source of signal in our data. In experimental sciences, confounders pose well-known pitfalls. With hand-engineered features extracted from raw observations, these have been less of a concern, since models built on such features have fewer parameters with which to learn spurious signal. However, deep neural nets built on raw high-dimensional data such as images have a greatly amplified ability to exploit confounding variables. For example, any variation in image acquisition settings, such as camera type, background noise levels, or illumination, can be quickly exploited by these models, while carefully hand-engineered features might be immune to a certain degree. These factors are usually not causally connected to the prediction task at hand.

In this work, we show some pitfalls in using deep neural nets on microscopic images of cells. We have used deep neural net models to find novel biomarkers for identifying cell types. We show how some biases could be identified by careful experimental design, visualization of model outputs and low dimensional projections of embeddings. We also employed a few model interpretability techniques to drill down to the source of our signal in our target classification model. Our work covers two case studies that try to identify biomarkers for: (i) Spinal Muscular Atrophy (SMA) from human fibroblast cells and (ii) Amyotrophic Lateral Sclerosis (ALS) from induced pluripotent stem cell (iPSC) derived motor neurons.

2 Data generation

Figure 1: Data generation, shown in two panels: (a) data acquisition and (b) preprocessing. Data generation starts with acquiring a piece of tissue or cells from a person. The cells are cultured, plated, stained, and then imaged using a microscope. The images are processed by segmenting cells (unsupervised), identifying cell centers, and cropping centered patches.

Data generation (Figure 1) starts with acquiring a piece of tissue or cells from a person (cell line). Cells are cultured, plated, then fixed, and stained. They are imaged through a high-throughput microscope at 5 frequency bands (which are analogous to channels) at various sites (physical locations) in a well. The same experimental setup is replicated on several plates at multiple time-points a few weeks apart. Each time-point replicate is termed a “batch” (or experimental batch). A typical experimental batch could contain 12 plates, each plate containing 96 wells, each imaged at multiple locations/sites. Each image has 5 channels. In our datasets, each well on a plate contains cells for one experimental condition, i.e., fibroblasts or iPSC cells from a cell line from a single individual. These cell lines were strategically distributed across the plates to mitigate known plate covariate nuisance effects.

Pre-processing A microscope image acquired at a given site may contain 10 to 100 individual cells. We follow the unsupervised segmentation approach from Ando et al. (2017) to detect cell nuclei and obtain fixed-size crops centered around the nuclei, yielding the cell “patch” images. Image analysis can be done with either the entire site image or the cell patch images.
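The cropping step can be illustrated with a minimal sketch. It assumes the nucleus centers have already been detected and are given as integer (row, col) coordinates; the function name, the patch size, and the choice to skip cells near the image border are illustrative assumptions, not details taken from the pipeline described here.

```python
import numpy as np

def crop_patches(image, centers, patch_size):
    """Crop fixed-size square patches centered on (row, col) nucleus centers.

    image: array of shape (height, width, channels).
    centers: iterable of (row, col) integer coordinates.
    Centers too close to the border to fit a full patch are skipped.
    """
    half = patch_size // 2
    height, width = image.shape[:2]
    patches = []
    for row, col in centers:
        top, left = row - half, col - half
        bottom, right = top + patch_size, left + patch_size
        if top < 0 or left < 0 or bottom > height or right > width:
            continue  # assumption: drop cells whose patch would leave the image
        patches.append(image[top:bottom, left:right, :])
    if not patches:
        return np.empty((0, patch_size, patch_size, image.shape[2]), dtype=image.dtype)
    return np.stack(patches)

# Synthetic 5-channel site image with three hypothetical nucleus centers.
site_image = np.zeros((64, 64, 5), dtype=np.float32)
nuclei = [(32, 32), (10, 50), (2, 2)]  # the last one is too close to the border
cell_patches = crop_patches(site_image, nuclei, patch_size=16)
print(cell_patches.shape)  # (2, 16, 16, 5)
```

Padding the image instead of skipping border cells would be an equally reasonable choice; which one is appropriate depends on how many cells sit near the edge of a site image.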

2.1 Datasets

For this work we used two types of datasets: SMA and ALS. The SMA data, from Yang et al. (2019), consists of images of human fibroblast cells used to detect spinal muscular atrophy, and is split into two versions: a pilot dataset (SMA Pilot dataset) and the main dataset (SMA Main dataset). For ALS, we obtained images of motor neurons differentiated from iPSCs (cell preparation similar to the one in Wainger et al. (2014)) from healthy / isogenic pairs with a TDP43 mutation introduced. Cells were allowed to mature after plating and were imaged after 3-5 days. In both cases, the cells were fixed, stained, and imaged through a process similar to the one described in Bray et al. (2016).

SMA Pilot Dataset Six primary fibroblast cell lines were distributed and cultured in wells on twelve 96-well plates in a single experimental batch. This dataset was primarily used to study and correct biases in the acquisition process.

SMA Main Dataset 27 primary fibroblast cell lines, from 12 subjects affected by SMA and 15 otherwise healthy, demographically matched (age, race, sex) subjects, were used in an experiment with two batch replicates conducted weeks apart, yielding about 2 million images in total.

ALS dataset For this dataset, we had 5 independently differentiated batches that were imaged at 9 sites per well. Each batch contained 1-2 plates of 96 wells each, holding two types of cells, TDPwt and TDPmut, an isogenic pair of motor neurons without and with ALS, respectively. We split the dataset into 5 folds, leaving one batch out in each fold. Hence each fold contained 4 batches in train and 1 in validation. We did not use a separate test set for these experiments.
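The leave-one-batch-out split can be sketched as follows. This is a minimal illustration that assumes each image carries a batch identifier; the function name and the toy batch labels are hypothetical.

```python
def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_indices, validation_indices) per fold."""
    for held_out in sorted(set(batch_ids)):
        train = [i for i, b in enumerate(batch_ids) if b != held_out]
        validation = [i for i, b in enumerate(batch_ids) if b == held_out]
        yield held_out, train, validation

# Hypothetical batch label for each image; real data would have many more rows.
image_batches = [1, 1, 2, 3, 3, 4, 5, 5, 2, 4]
folds = list(leave_one_batch_out(image_batches))
for held_out, train, validation in folds:
    train_batches = sorted({image_batches[i] for i in train})
    print(f"validate on batch {held_out}, train on batches {train_batches}")
```

Splitting by batch rather than by image keeps every image from the held-out batch out of training, so a model cannot score well in validation merely by recognizing batch-specific artifacts it saw during training.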

3 Approach

Quality analysis We first evaluate the focus quality of the images using a pre-trained CNN model described in Yang et al. (2018). The model is trained on biological images that are artificially blurred to different extents. It uses a ranked probability score to determine overall focus quality of the image and predicts a rating with a numeric score ranging from 0 to 1 (1 indicating in-focus). Figure 2 visualizes the scores in the spatial layout of the acquired site images in wells on the 96-well plates.

Detecting nuisance using unsupervised embeddings For a first pass at identifying biases in the data, we use t-SNE, an unsupervised low-dimensional projection in which similar images cluster together, to identify nuisance variables visually (Figure 3). We apply t-SNE to image embeddings of cell patches obtained from an Inception-v4 model Szegedy et al. (2017) pre-trained on ImageNet Deng et al. (2009). We color the points in the resulting space by various known covariate factors, such as plate and location within a plate. We also quantitatively assess these nuisance covariate factors by predicting them from the embeddings with logistic regression (Appendix Figure 7).

Supervised prediction of disease condition

To assess whether the disease state of the subjects can be inferred from the microscopy images of the cells, we used a k-fold cross-validation scheme in which the test set of each fold is a pair of cell lines (one from a healthy individual, the other from a diseased individual) unseen in the training set. We use logistic classifiers and CNNs (a modified Inception-v4 with prediction heads) to predict the “disease condition” of the cell line. The logistic classifiers were applied to the unsupervised embeddings and also to a set of 63 hand-engineered features computed from image statistics such as foreground area and foreground mean intensity.
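A minimal sketch of the held-out-pair cross-validation scheme is given below. How healthy and disease lines are paired is not specified here, so pairing by list position is an assumption, as are the line identifiers and the choice to keep unpaired healthy lines in every training set.

```python
def paired_line_folds(healthy_lines, disease_lines):
    """Build folds whose test set is one healthy and one disease cell line.

    Pairing by position is an assumption for illustration; healthy lines left
    without a partner stay in the training set of every fold.
    """
    all_lines = list(healthy_lines) + list(disease_lines)
    folds = []
    for healthy, disease in zip(healthy_lines, disease_lines):
        test = {healthy, disease}
        train = [line for line in all_lines if line not in test]
        folds.append({"test": sorted(test), "train": train})
    return folds

# Hypothetical identifiers: 15 healthy and 12 SMA lines, as in the SMA Main dataset.
healthy = [f"H{i:02d}" for i in range(15)]
sma = [f"S{i:02d}" for i in range(12)]
folds = paired_line_folds(healthy, sma)
print(len(folds), folds[0]["test"], len(folds[0]["train"]))  # 12 ['H00', 'S00'] 25
```

Because the split is made over cell lines, every cell image follows its line into either the training or the test side, so a fold measures generalization to an unseen person rather than to unseen cells from a known person.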

Model Interpretation We used visual explanation tools to understand which regions in the image our deep learning model is most influenced by when predicting healthy or disease. We present the saliency maps obtained by GradCAM Selvaraju et al. (2017) in Figure 6. We also use Partial Dependency Plots (PDP) Breiman (2001) on hand-engineered features to learn associations between target features and target responses.
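For readers unfamiliar with Grad-CAM, its core computation can be sketched in a few lines. The sketch assumes the activations of one convolutional layer and the gradients of the class score with respect to those activations have already been extracted from the network; the array shapes, the random inputs, and the final rescaling to [0, 1] are illustrative assumptions.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from one convolutional layer.

    activations, gradients: arrays of shape (channels, height, width), where
    gradients holds d(class score)/d(activations) for the class of interest.
    """
    weights = gradients.mean(axis=(1, 2))             # global-average-pool the gradients
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam            # scale to [0, 1] for display

# Random stand-ins for a real layer's activations and gradients.
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
saliency = grad_cam(acts, grads)
print(saliency.shape)  # (7, 7)
```

In practice the coarse map is upsampled to the input resolution and overlaid on the image, which is how maps like those in Figure 6 are typically displayed.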

4 Results

4.1 Detecting bias from quality analysis visualization

Figure 2: Visualization of the results from applying the focus quality analysis on six 96-well plates from the SMA Pilot dataset (left) and the SMA Main dataset (right). Each pixel corresponds to a site within a well, and the color represents the focus quality, yellow being in-focus and purple depicting out-of-focus. Nine sample cell images from 2 wells are presented in the center.

As Figure 2 shows, the detected focus quality in our preliminary SMA Pilot dataset reveals a clear spatial trend: images of cells located towards the center of each 96-well plate had better focus quality. To address this, we acquired a z-stack of confocal images, instead of a single widefield image, when creating the SMA Main dataset. This yielded images with much better focus.
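A spatial trend like this one can also be checked numerically by arranging focus scores in the plate layout and comparing edge wells with interior wells. The sketch below assumes per-site scores keyed by well names such as "A01"; the function names and the made-up scores are illustrative, not output of the focus model.

```python
from collections import defaultdict
from statistics import mean

ROWS, COLS = "ABCDEFGH", range(1, 13)  # standard 96-well layout: 8 rows x 12 columns

def well_focus_grid(site_scores):
    """Average per-site focus scores into an 8 x 12 grid of per-well means.

    site_scores: iterable of (well_name, score) pairs, e.g. ("A01", 0.93).
    Wells with no scored site are left as None.
    """
    per_well = defaultdict(list)
    for well, score in site_scores:
        per_well[well].append(score)
    return [[mean(per_well[f"{r}{c:02d}"]) if f"{r}{c:02d}" in per_well else None
             for c in COLS] for r in ROWS]

def edge_vs_interior(grid):
    """Return (mean score of edge wells, mean score of interior wells)."""
    edge, interior = [], []
    for i, row in enumerate(grid):
        for j, value in enumerate(row):
            if value is None:
                continue
            on_edge = i in (0, len(grid) - 1) or j in (0, len(row) - 1)
            (edge if on_edge else interior).append(value)
    return mean(edge), mean(interior)

# Made-up scores: interior wells in focus, edge wells blurred, two sites per well.
scores = [(f"{r}{c:02d}", 0.9 if r in "BCDEFG" and 2 <= c <= 11 else 0.4)
          for r in ROWS for c in COLS for _ in range(2)]
grid = well_focus_grid(scores)
edge_mean, interior_mean = edge_vs_interior(grid)
print(round(edge_mean, 2), round(interior_mean, 2))  # 0.4 0.9
```

A large gap between the two means flags a position-dependent acquisition bias before any model is trained on the images.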

4.2 Detecting bias based on unsupervised clustering

Results from clustering of the unsupervised embeddings in Figure 3 show that in both the SMA Main dataset and the ALS dataset, there is a strong bias for images to cluster based on the experimental batch. This was also confirmed by quantitative assessment (Appendix Figure 7).
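One simple way to put a number on "images cluster by batch" is to measure how often a point's nearest neighbors in embedding space share its label. This neighbor-purity score is not the method used in this work (which relies on t-SNE plots and logistic regression); it is a swapped-in illustration, and the synthetic 2-D embeddings and all names below are assumptions.

```python
import math
import random

def knn_label_purity(points, labels, k=5):
    """Mean fraction of each point's k nearest neighbors that share its label."""
    purities = []
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors = [j for _, j in others[:k]]
        purities.append(sum(labels[j] == labels[i] for j in neighbors) / k)
    return sum(purities) / len(purities)

# Synthetic 2-D embeddings: two batches shifted apart, disease labels unrelated to position.
rng = random.Random(0)
points, batch, disease = [], [], []
for b, offset in enumerate((0.0, 5.0)):
    for n in range(40):
        points.append((rng.gauss(offset, 1.0), rng.gauss(0.0, 1.0)))
        batch.append(b)
        disease.append(n % 2)

batch_purity = knn_label_purity(points, batch)
disease_purity = knn_label_purity(points, disease)
print(f"batch purity {batch_purity:.2f}, disease purity {disease_purity:.2f}")
```

When purity for a nuisance label such as batch is far above purity for the label of interest, the embedding space is organized mainly by the nuisance factor.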

Figure 3: Clustering on unsupervised CNN embeddings using t-SNE on two datasets: (a) SMA Main dataset and (b) ALS dataset. In each image, the data points are colored based on either disease condition or the batch.

4.3 Examination of supervised prediction models reveals confounders

Figure 4: Plots comparing performance of supervised classifiers to predict healthy and disease lines on the SMA Main dataset. (a) Supervised models: comparison of unsupervised embedding (logistic classifier) and supervised CNN. Each point represents ROC AUC performance on one cross-validation fold. (b) Supervised CNN, examining the folds: performance of the fully supervised CNN on the SMA Main dataset on each of the 12 folds (held-out patient pairs). (right) The disease condition (healthy, sma1-3) and the lab source (A or B, grey square) where the patient’s cells were acquired.

Predict healthy/disease

We sought to assess whether the disease state of the individuals from whom the cells originated could be inferred from the cell images, for cells from an unseen person. The results on the SMA Main dataset, shown in Figure 4, indicate that on an unseen pair of cell lines, one from a healthy and one from a diseased individual, the model predicts with better-than-chance accuracy in all but one of the cross-validation folds.

While these results are promising, one significant covariate is the source lab of the cell lines: the only cross-validation fold with worse-than-random generalization is also the one in which both the healthy and the disease cell lines were obtained from the same, rather than different, lab sources. As seen in Figure 4, either the cell lines in this fold were outliers, or the selectivity of the model may depend on a combination of cell line source and disease state.
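The per-fold ROC AUC values reported in Figure 4 can be computed without any library by using the rank interpretation of AUC: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The fold names and predictions below are hypothetical numbers for illustration only.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical per-fold predictions: (true labels, predicted disease probabilities).
fold_predictions = {
    "fold_1": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "fold_2": ([0, 0, 1, 1], [0.7, 0.6, 0.2, 0.3]),
}
fold_auc = {name: roc_auc(y, s) for name, (y, s) in fold_predictions.items()}
for name, auc in fold_auc.items():
    print(name, auc, "better than chance" if auc > 0.5 else "not better than chance")
```

An AUC well below 0.5 on a fold, as in the second toy fold, means the model ranks the classes in the wrong order there, which is the pattern one would expect if it had learned a covariate such as lab source rather than disease state.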

Figure 5: [ALS Dataset] Comparison of logistic regression model trained on cell density, image statistics features, unsupervised embeddings, and the fully supervised CNN models. Each point represents ROC AUC performance on one cross-validation fold.

ALS Dataset On the ALS dataset (Figure 5), the supervised CNN model performs best, but the model based on 63 hand-engineered image-statistics features and a model that uses cell density alone (i.e., the number of cells in the site image) perform comparably. Partial dependency plots (PDP) Breiman (2001) applied to the hand-engineered features pointed to cell density as the confounder.
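A partial dependence curve is obtained by forcing one feature to a fixed value for every example, averaging the model's predictions, and repeating over a grid of values. The sketch below is a minimal illustration: the toy model, the feature names, and the feature values are all hypothetical stand-ins for a trained classifier and the 63 hand-engineered features.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in turn.

    predict: function mapping a feature dict to a score.
    rows: list of feature dicts (the dataset the curve is averaged over).
    """
    curve = []
    for value in grid:
        forced = [dict(row, **{feature: value}) for row in rows]
        curve.append(sum(predict(r) for r in forced) / len(forced))
    return curve

# Toy stand-in for a trained classifier: the score depends on cell count only.
def toy_model(row):
    return 1.0 if row["cell_count"] < 30 else 0.0

# Hypothetical per-site features.
sites = [
    {"cell_count": 12, "foreground_mean_intensity": 0.31},
    {"cell_count": 55, "foreground_mean_intensity": 0.29},
    {"cell_count": 80, "foreground_mean_intensity": 0.35},
]
density_curve = partial_dependence(toy_model, sites, "cell_count", [10, 20, 40, 80])
intensity_curve = partial_dependence(toy_model, sites, "foreground_mean_intensity", [0.2, 0.3, 0.4])
print(density_curve)    # [1.0, 1.0, 0.0, 0.0]
print(intensity_curve)  # flat: the toy model ignores this feature
```

A curve that swings strongly for one feature while staying flat for the others is the kind of evidence that singles out a single variable, here cell density, as the driver of the predictions.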

4.4 Model interpretation points to confounding

Saliency maps obtained by applying GradCAM on the ALS dataset are presented in Figure 6. The saliency maps indicate that the model is looking at empty regions when correctly predicting lines to be of type TDPwt. This further indicates density of cells as a confounding variable that the models exploit.

Figure 6: Saliency maps from GradCAM overlaid on the different image channels/stains for 2 sets of images from the ALS dataset. Yellow regions represent where the model is “looking” to correctly predict TDPwt cell lines.

5 Conclusion

In our work, we observed our models exploiting differences in the cell line source, experimental batch, plate, relative location of wells in a plate, image acquisition settings, cell density, etc. to identify the batch / plate / well containing the input, thereby gaining insight into the possible target label. We used deep neural nets to understand the focus quality differences among various wells. We also used supervised classifiers to identify the extent to which the models can memorize nuisance factors, and employed model interpretability techniques to provide further evidence of the source of (spurious) signal in our trained models. Our work emphasizes the need to be cognizant of these pitfalls, and we urge the community to carefully examine the source of signal, especially in the case of a novel discovery.


Acknowledgments

The authors would like to thank Minjie Fan, Zan Armstrong, Thorsten M. Schlaeger, Liyong Deng, Wendy K. Chung, Liadan O’Callaghan, Dosh Whye, Jon Hazard, Brian Patrick Williams, D. Michael Ando, and Philip Nelson for their help with earlier versions of this work.


References

  • [1] D. M. Ando, C. McLean, and M. Berndl (2017) Improving phenotypic measurements in high-content imaging screens. bioRxiv. Cited by: §2.
  • [2] M. Bray, S. Singh, H. Han, C. T. Davis, B. Borgeson, C. Hartland, M. Kost-Alimova, S. M. Gustafsdottir, C. C. Gibson, and A. E. Carpenter (2016) Cell painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nature protocols 11 (9), pp. 1757. Cited by: §2.1.
  • [3] L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32. Cited by: §3, §4.3, Appendix.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §3.
  • [5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115. Cited by: §1.
  • [6] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, et al. (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 (22), pp. 2402–2410. Cited by: §1.
  • [7] Y. Liu, K. Gadepalli, M. Norouzi, G. E. Dahl, T. Kohlberger, A. Boyko, S. Venugopalan, A. Timofeev, P. Q. Nelson, G. S. Corrado, et al. (2017) Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442. Cited by: §1.
  • [8] R. Poplin, A. V. Varadarajan, K. Blumer, Y. Liu, M. V. McConnell, G. S. Corrado, L. Peng, and D. R. Webster (2018) Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2 (3), pp. 158. Cited by: §1.
  • [9] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §3.
  • [10] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §3.
  • [11] B. J. Wainger, E. Kiskinis, C. Mellin, O. Wiskow, S. S. Han, J. Sandoe, N. P. Perez, L. A. Williams, S. Lee, G. Boulting, et al. (2014) Intrinsic membrane hyperexcitability of amyotrophic lateral sclerosis patient-derived motor neurons. Cell reports 7 (1), pp. 1–11. Cited by: §2.1.
  • [12] S. J. Yang, M. Berndl, D. M. Ando, M. Barch, A. Narayanaswamy, E. Christiansen, S. Hoyer, C. Roat, J. Hung, C. T. Rueden, et al. (2018) Assessing microscope image focus quality with deep learning. BMC bioinformatics 19 (1), pp. 77. Cited by: §3.
  • [13] S. J. Yang, S. L. Lipnick, N. R. Makhortova, S. Venugopalan, M. Fan, Z. Armstrong, T. M. Schlaeger, L. Deng, W. K. Chung, L. O’Callaghan, et al. (2019) Applying deep neural network analysis to high-content image-based assays. SLAS DISCOVERY: Advancing Life Sciences R&D 24 (8), pp. 829–841. Cited by: §2.1.


Appendix

Identifying bias by predicting nuisance covariates We used the unsupervised embeddings of cell images from the "control" wells to predict nuisance factors such as batch, plate, and well position using logistic regression. We compare this to a baseline in which the nuisance factors are predicted after permuting the embedding feature columns Breiman (2001), and present the results for the SMA Main dataset in Figure 7. As seen in the figure, these nuisance variables can be predicted with higher-than-chance accuracy, indicating the underlying bias in the data.
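The comparison against a permuted baseline can be sketched on synthetic data. Everything below is an illustrative assumption rather than the setup used here: the embeddings are random vectors with an artificial batch shift in one dimension, the nuisance label has only two classes, the logistic regression is a small gradient-descent implementation, and the train/test split is a simple index cut.

```python
import numpy as np

def fit_logistic(X, y, steps=500, lr=0.1):
    """Binary logistic regression fitted by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

def permute_columns(X, rng):
    """Shuffle every feature column independently, breaking its link to the labels."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

rng = np.random.default_rng(0)
n, d = 400, 16
batch = rng.integers(0, 2, size=n)   # hypothetical nuisance label (two batches)
X = rng.standard_normal((n, d))      # stand-in for CNN embeddings
X[:, 0] += 2.0 * batch               # synthetic batch effect in one embedding dimension

train, test = np.arange(n) < 300, np.arange(n) >= 300
w, b = fit_logistic(X[train], batch[train])
real_acc = accuracy(w, b, X[test], batch[test])

X_perm = permute_columns(X, rng)
w_p, b_p = fit_logistic(X_perm[train], batch[train])
perm_acc = accuracy(w_p, b_p, X_perm[test], batch[test])
print(f"embeddings: {real_acc:.2f}, permuted baseline: {perm_acc:.2f}")
```

The permuted baseline keeps the marginal distribution of every feature but destroys its association with the label, so any accuracy gap between the two models reflects nuisance information carried by the embeddings.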

Figure 7: Comparing prediction of nuisance factors (batch, plate, row, column) on the SMA Main dataset. (leftmost) A plate with experimental (E) wells, and control (C) wells highlighted in yellow. (right plots) Accuracy of logistic regression models on unsupervised CNN embeddings of cells from the "control" wells, compared with that of models on embeddings whose feature columns are permuted (the baseline).