
Mitigating the Effect of Dataset Bias on Training Deep Models for Chest X-rays

by   Yundong Zhang, et al.

Deep learning has gained tremendous attention in CAD (Computer-aided Diagnosis) applications, particularly biomedical imaging analysis. We analyze three large-scale publicly available CXR (Chest X-ray) datasets and find that vanilla training of deep models for diagnosing common thorax diseases is subject to dataset bias, leading to severe performance degradation when evaluated on an unseen test set. In this work, we frame the problem as a multi-source domain generalization task and make two contributions to handle dataset bias: 1. we improve the classical max-margin loss function by making it more general and smooth; 2. we propose a new training framework named MCT (Multi-layer Cross-gradient Training) for unseen data augmentation. Empirical studies show that our methods significantly improve model generalization and robustness to dataset bias.


I Introduction

Despite the recent success of exploiting biomedical big data, researchers have found that naive use of those data may lead to significant biases. Those biases can arise from every aspect of the healthcare process due to human-related or systematic factors: for example, diagnosis standards vary across clinicians, the policies of provider organizations may encourage more screening tests, and the work hours of hospitals may affect the timing of time-related data [1]. Meanwhile, the dynamics of biases evolve over time and vary across population demographics [12]: it is found in [11] that organizations reported patient safety incidents inconsistently, and opioid prescribing increased at rates that differed by practice and patient population [6].

Those biases can lead to misleading research outcomes and pose significant challenges to discovering validated medical findings as well as designing robust statistical models. In such scenarios, simply increasing the amount of training data does not help, as biases are induced in the data collection process and data-driven learning algorithms can easily exploit the subtle biases in the data to make predictions. In [1], researchers found that the laboratory test orders of two large hospitals in Boston were significantly correlated with the patients' odds of survival, regardless of other information about the test; as a result, models trained on those electronic health records (EHRs) over-weighted the importance of test order. Another group of researchers analyzed the generalization capability of deep learning models in screening pneumonia on Chest X-ray images and reported the following: (1) Convolutional Neural Network (CNN) models performed significantly worse (about 8% to 10% drop) in an external hospital system than on the internal hold-out test set; (2) CNNs could differentiate the origin of data with extremely high accuracy and calibrate their predictions accordingly, utilizing confounding information such as small objects or text labels in the images.

To accommodate the biomedical data biases, we survey the classical machine learning literature and frame the problem as domain generalization, where each dataset is a subset of a common domain. Our contributions are threefold: 1. we show that data biases are significantly embedded in biomedical imaging (specifically, Chest X-ray), even though the images belong to the same imaging modality; 2. we design a new loss function to undo bias by modelling each training dataset explicitly; 3. we propose a new data augmentation method to improve model generalization by generating domain-guided perturbed hidden activations. Extensive experiments on several publicly available datasets demonstrate the superior performance of our methods.

II Related Work

Dataset bias is observed when a well-designed and optimized model for one dataset exhibits significant performance degradation on another. Various metrics have been proposed to quantify the biases of datasets. The pioneering work in [36] suggested using cross-dataset generalization: measure the relative performance drop between the original test set and the new dataset, as long as they come from the same domain. [35] proposed to replace the relative measure by a direct difference followed by a sigmoid function, in order to better preserve the information about the internal test set. Cross-dataset generalization is an intuitive and interpretable measure, hence it is widely used in the machine learning community [17, 27]. On the other hand, one can also perform a Classifier Two-sample Test (C2ST) to verify whether two datasets are drawn from an identical distribution [24]. The idea is that if two datasets are in the same domain, a binary classifier trained on their union should predict only at chance level which dataset a sample is drawn from.

Besides the quantitative measures, several visualization techniques can be applied to qualitatively understand the source of biases. In [17], the trained weights of a linear-SVM classifier were overlaid on the original images to discover the pattern of their spatial distribution; [45] generated class activation heatmaps (CAM) to visualize the regions of the input image that contribute most to a trained CNN; [33] proposed the guided backpropagation gradient activation heatmap (guided Grad-CAM) to provide pixel-level attention of CNNs. In other fields such as natural language processing, the attention mechanism [38, 41] is an effective way to visualize the focus of the model; one can also use the Local Interpretable Model-Agnostic Explanations method (LIME) [16] to understand the model's attention. By comparing the model "attention" with human sense, we can verify whether the learning algorithm is learning the correct representation features and infer the source of data biases. If a human attention heatmap is given, one can also use Spearman's rank correlation coefficient [30] or the earth mover's distance [20] to provide quantitative measures [44, 13].

Several frameworks have been developed to address dataset bias or domain generalization, where the goal is to train a model that generalizes to unseen datasets or domains. One of the earlier works [17] was based on max-margin learning (SVM), which modeled biases as per-dataset bias vectors in classifier space. During training, the SVM maximized the objective of each dataset by constructing a classifier as the sum of a dataset-specific bias vector and a bias-free vector; then for inference, the bias-free vector was used alone as the bias-removed classifier. Building on top of it, [35] conducted more extensive experiments by using DECAF features [5] as input to the model. The authors found that the bias removal technique in [17] worked better when using classical BOWSift features [25], while for DECAF features the opposite held. [21] further extended this shallow bias modelling structure to end-to-end training of a low-rank parametrized deep model and observed better performance.

Another series of works on domain generalization focuses on the feature level and aims to learn a domain-invariant feature representation. In [29], a kernel-based method was used to project the data into a common feature space where domain dissimilarity was minimized while the functional relationship with the label was preserved. In [8], domain-robust features were learnt by multi-task data reconstruction autoencoders. Domain adversarial training could also be used for learning domain-independent features by fooling a domain classifier [7] or aligning distributions among different domains [22].

There are also efforts on addressing domain generalization through modifying the input data. [2] shuffled the original image patches and added an auxiliary recomposition task to the model to improve generalization. [3] used a generative adversarial network (GAN) to generate domain-independent images. [23] developed a new dataset resampling paradigm (REPAIR) that improved model generalization by training on a re-weighted dataset. The one most similar to ours is cross-gradient training [34], which generated inter-domain data by domain-guided perturbations of the inputs.

A similar work that attempted to address dataset bias of Chest X-ray (CXR) data is [40], where the authors collected ten CXR datasets internationally. However, they did not provide an effective method for handling dataset bias apart from directly training and testing on them in a leave-one-out scheme. Also, their task (predicting normal or abnormal) is simpler than ours.

III Problem Statement

Our goal is to train a model that performs well on in-domain datasets and generalizes well to unseen domains. Formally, denote $\mathcal{D} = \{D_1, \dots, D_N\} = \mathcal{D}_{in} \cup \mathcal{D}_{ex}$, where $D_i$ is a dataset or sub-domain in a shared domain $\mathcal{D}$, $\mathcal{D}_{in}$ are the internal sets that are available during training and $\mathcal{D}_{ex}$ are the external sets, which are completely hidden until test time. Here we focus on the classification task and assume all the datasets share common labels. Then we aim to

$$\max_{\theta} \; \sum_{D_i \in \mathcal{D}_{in}} \sum_{(x, y) \in D_i} M\big(f_\theta(x), y\big) \; + \sum_{D_j \in \mathcal{D}_{ex}} \sum_{(x, y) \in D_j} M\big(f_\theta(x), y\big) \qquad (1)$$

where $f_\theta(x)$ is the model prediction for sample $x$ parametrized by $\theta$ and $M$ is our evaluation metric. The first double summation of (1) is the internal set performance and the second is that of the external sets. Notice that we have no access to the latter part of (1) during training and hence our optimization can only focus on the former part.

In this work, we are specifically interested in the bias of biomedical imaging. We observe that current advanced deep learning models are subject to dataset bias and suffer from a generalization problem: an AlexNet model trained on large-scale CXR datasets can exhibit a relative AUC drop of about 9% when tested on unseen data (Table IV). Also, by looking at the feature embedding maps, we find that biases are learnt even without supervision (Fig. 3). To close the generalization gap and mitigate the effect of dataset bias, we propose two strategies in the following sections.

IV Proposed Method

IV-A Classifier-level Bias Modelling

We start by revisiting the undoing-bias framework in [17]. Formally, let $z = f_\theta(x)$ be the extracted feature of sample $x$; we aim to solve the soft-constrained max-margin (SVM) optimization problem of [17], referred to as (2). Its objective combines regularization of the classifier weights with two hinge-loss margin penalties weighted by balancing hyper-parameters: one for the debiased (visual world) classifier $w_{vw}$ and one for the biased classifier $w_i = w_{vw} + \Delta_i$ of each dataset $D_i$, where $\Delta_i$ is the dataset-specific bias vector. Unlike [35] and [17], which use BOWSift [25] or ImageNet pre-trained features for $z$, we propose to train the feature extractor end-to-end using a deep model. That is, $z = f_\theta(x)$, where $f_\theta$ is a neural net feature extractor parametrized by $\theta$. During training, $\theta$ is updated by back-propagating the gradient through $z$.

The above SVM framework has several drawbacks: 1. the hinge loss is not optimizer-friendly since it is not differentiable everywhere; 2. we cannot have a probabilistic interpretation of the prediction, which is crucial in assisting medical diagnosis; 3. it models the bias weights with only an additive relation. Therefore, we propose to train the network using cross-entropy loss to address 1 and 2. For 3, we introduce a vector $\gamma_i$ for each dataset $D_i$ as an additional trainable parameter, such that $w_i = \gamma_i \odot w_{vw} + \Delta_i$, where $w_{vw}$ is the visual world classifier, $\Delta_i$ is the additive bias vector and $\odot$ represents the element-wise product. Here $\gamma_i$ models the multiplicative relation between the model bias and the visual world. This enables the model to capture both the feature shifts and the scaling of the biased datasets. With those changes, our proposed cross-entropy training objective (3) keeps the regularization terms on the classifier weights and replaces each hinge penalty of (2) with the negative log-likelihood between the output of the last linear layer and the ground-truth label.

We highlight the importance of the regularization term of the visual world classifier $w_{vw}$ in our proposed cross-entropy loss, because $w_{vw}$ can easily overfit to a solution that takes advantage of all the bias features, leading to poor generalization performance on the external set. To see this, consider a learned feature embedding $z = [z_c, z_{b_1}, z_{b_2}]$, where $z_c$ is the common feature, and $z_{b_1}$ and $z_{b_2}$ are the bias features present in datasets $D_1$ and $D_2$. Ideally, we want $w_{vw}$ to place its weight only on $z_c$, so that it can generalize to some unseen dataset or domain $D_3$. However, without proper regularization, a $w_{vw}$ that places weight on all three feature groups can still be a valid solution if $z_{b_2}$ vanishes for samples of $D_1$ and $z_{b_1}$ vanishes for samples of $D_2$. By penalizing the norm of $w_{vw}$ more than those of the dataset-specific bias vectors $\Delta_1$ and $\Delta_2$, we push the bias learning to those bias vectors instead of the visual world classifier.
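The composition of a dataset-specific classifier from shared and per-dataset parameters can be illustrated with a minimal Python sketch. It assumes the form "element-wise scale times visual-world weights plus additive shift" and an L2 regularizer with separate strengths; the names `compose_bias_classifier`, `l2_penalty`, `lam_vw` and `lam_bias`, as well as all numbers, are hypothetical and not taken from the paper.

```python
def compose_bias_classifier(w_vw, scale, shift):
    """Dataset-specific weights: element-wise scale of the shared weights plus a shift."""
    return [g * w + d for w, g, d in zip(w_vw, scale, shift)]

def score(weights, features):
    """Linear score of one feature vector."""
    return sum(w * f for w, f in zip(weights, features))

def l2_penalty(w_vw, shifts, lam_vw, lam_bias):
    """L2 regularizer; lam_vw > lam_bias penalizes the shared weights more heavily."""
    vw_term = 0.5 * lam_vw * sum(w * w for w in w_vw)
    bias_term = 0.5 * lam_bias * sum(d * d for shift in shifts for d in shift)
    return vw_term + bias_term

# Toy features are ordered as [common, bias_of_dataset_1, bias_of_dataset_2].
w_vw = [1.0, 0.0, 0.0]                 # shared classifier uses only the common feature
shift_1 = [0.0, 0.5, 0.0]              # dataset 1 absorbs its own bias feature
w_1 = compose_bias_classifier(w_vw, scale=[1.0, 1.0, 1.0], shift=shift_1)

sample_from_d1 = [2.0, 1.0, 0.0]
unseen_sample = [2.0, 0.0, 0.0]        # no dataset-specific cue available
score_d1 = score(w_1, sample_from_d1)
score_unseen = score(w_vw, unseen_sample)
penalty = l2_penalty(w_vw, [shift_1], lam_vw=1.0, lam_bias=0.1)
print(score_d1, score_unseen, penalty)
```

In this toy setting the shared weights ignore the dataset-specific features, so the score on the unseen sample depends only on the common feature, while the dataset-1 classifier can still exploit its own bias feature.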

IV-B Feature-level Bias Mitigation

The above framework models the bias in the higher-level classifier space. However, since we have limited training domains, our feature extractor can still overfit to the in-domain data and suffer from a significant performance drop when tested on external data. One question is: can we synthesize samples from unseen domains to make our feature extractor more robust?

One idea is to use the Mix-up [43, 37] strategy, where synthetic data are generated by linearly mixing samples from different datasets. However, as we will show in the following section, when the dataset bias is severe, this method suffers from convergence problems. Also, the diagnosis of medical images usually relies on fine-grained features of the image. In this case, naively mixing samples will destroy crucial details in the data.

Another idea is to augment the training data by the Cross-gradient Training method [34]. Formally, consider a dataset classifier $g$ and a data point $(x, y, d)$ with label $y$ and dataset index $d$. We can generate a new sample $x_d = x + \epsilon \nabla_x J_d$, where $\epsilon$ is the step size and $J_d$ is the dataset classification loss. Then we can train the model with this synthetic data point $(x_d, y)$. To ensure the gradient change has minimal effect on the label $y$, we also augment the training of $g$ with $x_l = x + \epsilon \nabla_x J_l$, where $J_l$ is the label classification loss. This makes the dataset classifier insensitive to the labels, so that perturbing along its gradient does not change $y$.
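The domain-guided perturbation can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the implementation of [34]: the dataset classifier is a linear softmax layer acting directly on the input, so the input gradient of the cross-entropy loss has a closed form, and all names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def dataset_loss(weights, bias, x, dataset_id):
    """Cross-entropy of the linear dataset classifier for the true dataset index."""
    return -math.log(softmax(logits_of(weights, bias, x))[dataset_id])

def input_gradient(weights, bias, x, dataset_id):
    """Closed-form gradient of the dataset loss with respect to the input x."""
    probs = softmax(logits_of(weights, bias, x))
    residual = [p - (1.0 if k == dataset_id else 0.0) for k, p in enumerate(probs)]
    return [sum(residual[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(x))]

def cross_gradient_sample(weights, bias, x, dataset_id, eps):
    """Move x along the gradient of the dataset loss, i.e. away from its own dataset."""
    grad = input_gradient(weights, bias, x, dataset_id)
    return [xi + eps * g for xi, g in zip(x, grad)]

# Hypothetical two-dataset classifier over three input features.
weights = [[1.0, -0.5, 0.0], [-1.0, 0.5, 0.2]]
bias = [0.0, 0.0]
x = [0.8, 0.1, 0.3]
x_aug = cross_gradient_sample(weights, bias, x, dataset_id=0, eps=0.5)
loss_before = dataset_loss(weights, bias, x, 0)
loss_after = dataset_loss(weights, bias, x_aug, 0)
print(loss_before, loss_after)
```

The perturbed sample keeps the original label and is harder for the dataset classifier to attribute to its source dataset, which is the sense in which it acts as inter-domain data.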

Our proposed method is built upon this idea. However, in contrast to [34], we do not train a separate network for the dataset classifier $g$. Instead, $g$ is just a linear layer which directly takes the feature embedding $z = f_\theta(x)$ as input and outputs the dataset classification logits. During training, the gradient of the dataset classification loss $J_d$ does not propagate through the feature extractor $f_\theta$ and is only used to update $g$. The reasons for this design are twofold. First, training a separate feature extractor for $g$ or propagating the gradient through $f_\theta$ leads to vanishing gradients, because the dataset classifier is significantly easier to train. Second, and more importantly, since $J_d$ is now differentiable with respect to the intermediate activations of $f_\theta$, we can augment the intermediate features in addition to the input $x$, leading to a new multi-layer augmentation training paradigm. Specifically, we are given a pre-determined set of layer outputs, e.g. $L = \{h_{conv}, h_{fc}\}$, where $h_{conv}$ and $h_{fc}$ are the outputs of the last convolution and fully-connected layer, respectively. At each training step we sample one of the layer outputs, say $h_k$, compute $\nabla_{h_k} J_d$, generate a new augmented feature point $h_k^d = h_k + \epsilon \nabla_{h_k} J_d$ and feed it to the following layers. We also follow Cross-gradient Training to generate $h_k^l = h_k + \epsilon \nabla_{h_k} J_l$ and augment the training of $g$. Because the augmented data point does not belong to any specific dataset, it is only fed to the visual world classifier for training $\theta$ and $w_{vw}$. In this way, we improve the robustness of our feature extractor to unseen data. We name this novel domain-guided augmentation method Multi-layer Cross-gradient Training (MCT). Together with the classifier bias modelling, we summarize the overall model pipeline in Fig. 1 and give the training pseudo-code in the accompanying algorithm listing.
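The layer-sampling step of MCT can be sketched as follows. This is a hedged toy version rather than the authors' implementation: the feature extractor is a two-layer ReLU network written in plain Python, a central finite difference stands in for back-propagation, and every name (`mct_augment`, `candidate_layers`, `head`) and number is hypothetical. Index `k` denotes the activation after the first `k` layers, so `k = 0` is the raw input.

```python
import math
import random

def relu_layer(weights):
    def apply(v):
        return [max(0.0, sum(w * x for w, x in zip(row, v))) for row in weights]
    return apply

def forward_from(layers, start, activation):
    """Run layers[start:] on an activation produced just before layers[start]."""
    for layer in layers[start:]:
        activation = layer(activation)
    return activation

def domain_loss(head, embedding, dataset_id):
    """Cross-entropy of a linear dataset classifier applied to the embedding."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in head]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[dataset_id]

def numeric_grad(fn, point, h=1e-5):
    """Central finite differences; a stand-in for automatic differentiation."""
    grad = []
    for j in range(len(point)):
        up = point[:j] + [point[j] + h] + point[j + 1:]
        down = point[:j] + [point[j] - h] + point[j + 1:]
        grad.append((fn(up) - fn(down)) / (2 * h))
    return grad

def mct_augment(layers, head, x, dataset_id, candidate_layers, eps, rng):
    """Sample one candidate activation and perturb it along the domain-loss gradient."""
    k = rng.choice(candidate_layers)
    activation = forward_from(layers[:k], 0, x)
    loss_at = lambda a: domain_loss(head, forward_from(layers, k, a), dataset_id)
    grad = numeric_grad(loss_at, activation)
    perturbed = [a + eps * g for a, g in zip(activation, grad)]
    return k, forward_from(layers, k, perturbed)

rng = random.Random(0)
layers = [relu_layer([[0.5, 0.2], [0.1, 0.4], [0.3, 0.3]]),
          relu_layer([[0.2, 0.1, 0.3], [0.4, 0.2, 0.1]])]
head = [[1.0, -1.0], [-1.0, 1.0]]      # linear dataset classifier on the embedding
x = [1.0, 2.0]
clean_embedding = forward_from(layers, 0, x)
k, augmented_embedding = mct_augment(layers, head, x, dataset_id=0,
                                     candidate_layers=[0, 2], eps=0.5, rng=rng)
print(k, clean_embedding, augmented_embedding)
```

In a full training loop the augmented embedding would be passed only to the shared (visual world) classifier with the original label, while the clean embedding would feed both the shared and the dataset-specific classifiers.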

Fig. 1: Model structure of our proposed MCT with classifier-bias modelling. The bias classifier produces a prediction for the original sample of its own dataset; the visual world classifier produces predictions for both the original sample and the domain-guided augmented sample; the domain classifier likewise produces predictions for the original and the augmented sample. During training, a layer activation is randomly chosen from the pre-determined set and its domain-guided perturbed versions are generated to augment the training.


V Experiments

V-A Datasets

We use three large-scale Chest X-ray datasets, all of which are publicly available: NIH ChestX-ray14 from the NIH Clinical Center [39], Stanford CheXpert from Stanford Hospital [14] and MIMIC-CXR from Beth Israel Deaconess Medical Center [15, 9]. Since these datasets have different label categories, we select the 5 diseases (Atelectasis, Cardiomegaly, Consolidation, Edema and Effusion) that they have in common. We also discard all the lateral scans in CheXpert and MIMIC-CXR, as NIH only has frontal-view images. Table I summarizes the basic information of each processed dataset. We use a roughly 7:1:2 split for the train, validation and test sets of each dataset, except for NIH, which has an official split with the same ratio. We also ensure that X-ray scans of the same patient belong to the same split, preventing information leakage.
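A patient-level split of this kind can be sketched as below. The function name, the scan and patient identifiers and the toy metadata are hypothetical; the sketch only illustrates assigning whole patients to a split so that no patient contributes scans to two splits.

```python
import random

def split_by_patient(scan_to_patient, ratios=(0.7, 0.1, 0.2), seed=0):
    """Assign whole patients to train/val/test; test receives the remainder."""
    patients = sorted(set(scan_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    split_of = {}
    for idx, pid in enumerate(patients):
        if idx < n_train:
            split_of[pid] = "train"
        elif idx < n_train + n_val:
            split_of[pid] = "val"
        else:
            split_of[pid] = "test"
    splits = {"train": [], "val": [], "test": []}
    for scan, pid in scan_to_patient.items():
        splits[split_of[pid]].append(scan)
    return splits

# Toy metadata: 10 hypothetical patients with 3 scans each.
scan_to_patient = {f"scan_{p}_{k}": f"patient_{p}" for p in range(10) for k in range(3)}
splits = split_by_patient(scan_to_patient)
print({name: len(scans) for name, scans in splits.items()})
```

Because the ratio is applied to patients rather than scans, the scan-level proportions are only approximately 7:1:2 when patients have different numbers of scans.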

We also include three popular datasets of domain generalization to specifically verify our proposed Multi-layer Cross-gradient Training algorithm. Dataset details and experiments can be found in Appendix A.

Datasets # Patients # Scans # Atelectasis # Cardiomegaly # Consolidation # Edema # Effusion
NIH 30806 112120 11559 2776 4667 2303 13317
CheXpert 64534 191027 59583 23385 12983 61493 76899
MIMIC 62592 248236 60681 48894 11733 43559 58731
TABLE I: Summary of Three CXR Datasets

V-B Bias Measurements Metrics

In this section we introduce two quantitative metrics for measuring biases of trained models.

V-B1 Generalization-Based Metrics

The generalization-based metric works by evaluating how the model performs when trained on the internal sets and tested on the external sets. Following [36, 35], let $P_{in}$ be the internal test set performance and $P_{ex}$ be the average external test set performance. The cross-dataset performance drop for a particular model can be defined as:

$$PD = \frac{P_{in} - P_{ex}}{P_{in}}$$

Intuitively, $PD$ measures the change of cross-dataset performance, normalized by the internal set score. $PD > 0$ indicates that biases are present, and the bias becomes more severe as $PD$ gets closer to 1. If $PD < 0$, internal performance is sub-optimal and no informative conclusion can be drawn from cross-dataset evaluation.
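As a worked example, the sketch below computes the drop as (internal − mean external) / internal, which is an assumed form of the definition that reproduces the values reported in Table IV when fed the first-column AUC scores of Tables II and III. The function name is hypothetical.

```python
def performance_drop(internal_score, external_scores):
    """Relative cross-dataset drop: (internal - mean external) / internal."""
    if internal_score <= 0:
        raise ValueError("internal_score must be positive")
    external_mean = sum(external_scores) / len(external_scores)
    return (internal_score - external_mean) / internal_score

# AUC values from the first column of Tables II (internal) and III (external).
alexnet_drop = performance_drop(0.817, [0.740])
mct_drop = performance_drop(0.817, [0.755])
print(f"AlexNet: {alexnet_drop:.2%}, E2E-CE+MCT: {mct_drop:.2%}")
```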

V-B2 Classifier Two-sample Test

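The Classifier Two-sample Test (C2ST) aims to determine whether two datasets are drawn from the same distribution by training a binary classifier to tell them apart. Formally, given two datasets $D_a = \{a_i\}_{i=1}^{n}$ and $D_b = \{b_i\}_{i=1}^{n}$, we can construct a new dataset [24]

$$D = \{(a_i, 0)\}_{i=1}^{n} \cup \{(b_i, 1)\}_{i=1}^{n} =: \{(z_i, l_i)\}_{i=1}^{2n}$$

and a binary classifier $f$ whose output $f(z_i)$ is an estimate of the conditional probability $p(l_i = 1 \mid z_i)$. Then we can obtain the classification accuracy on a held-out test split $D_{te}$ of $D$ according to

$$t = \frac{1}{|D_{te}|} \sum_{(z_i, l_i) \in D_{te}} \mathbb{I}\Big[\mathbb{I}\big(f(z_i) > \tfrac{1}{2}\big) = l_i\Big]$$

Intuitively, if the two datasets are from the same distribution, the test set accuracy $t$ should be close to 0.5, i.e. no better than random guessing. Otherwise, there must be some distinct features in one of the datasets that are exploited by the classifier.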
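The test can be illustrated on synthetic one-dimensional data. The sketch below is a simplified stand-in for the setup used in the experiments: it trains a logistic regression classifier by gradient descent instead of a neural network, and the Gaussian samples, function names and hyper-parameters are all hypothetical.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def c2st_accuracy(sample_p, sample_q, seed=0, steps=300, lr=0.1):
    """Held-out accuracy of a logistic classifier separating two 1-D samples."""
    rng = random.Random(seed)
    data = [(x, 0) for x in sample_p] + [(x, 1) for x in sample_q]
    rng.shuffle(data)
    cut = int(0.75 * len(data))
    train, test = data[:cut], data[cut:]
    mean = sum(x for x, _ in train) / len(train)  # centre with training data only
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, label in train:
            err = sigmoid(w * (x - mean) + b) - label
            grad_w += err * (x - mean)
            grad_b += err
        w -= lr * grad_w / len(train)
        b -= lr * grad_b / len(train)
    correct = sum((sigmoid(w * (x - mean) + b) > 0.5) == bool(label)
                  for x, label in test)
    return correct / len(test)

rng = random.Random(1)
same_a = [rng.gauss(0.0, 1.0) for _ in range(400)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(400)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(400)]
acc_same = c2st_accuracy(same_a, same_b)
acc_shifted = c2st_accuracy(same_a, shifted)
print(acc_same, acc_shifted)
```

Two samples from the same Gaussian yield held-out accuracy near chance, whereas a mean shift of three standard deviations is detected with high accuracy, mirroring the interpretation of the test statistic.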

V-C Results on Large-scale CXRs

In this section, we evaluate our proposed methods on large-scale CXR datasets. For training, we resize all the images to 256×256, followed by a random crop. Unlike [14], we do not use random horizontal flips, since some diseases (e.g. Cardiomegaly) rely on spatial information. AlexNet [18] pretrained on ImageNet [4] is used as the feature extraction backbone for all the models compared in the following sections unless otherwise specified. As each patient can have multiple diseases at the same time, our task is essentially multi-label classification. We use a binary cross-entropy loss for each disease. Hyperparameters are determined on the validation set, and the selected layer set of MCT here consists of the input and the last dense layer of the feature extraction network. We implement all our experiments with PyTorch.


V-C1 Name the Dataset

We first perform the Name the Dataset study as in [36] and [35]. In this task, we build a simple 3-layer CNN to classify an input image into one of the three collected datasets. Fig. 2 shows the classification result on a random subset. Surprisingly, we find that although these three datasets belong to the same modality and scan the same part of the human body, nearly perfect classification accuracy is obtained, meaning that severe dataset biases are induced during the creation of the final image.

Fig. 2: Confusion Matrix of Name that Dataset Experiments.

V-C2 Classification on Seen Datasets

We now study whether dataset bias affects the learning of chest disease prediction. Recall that we want our model to generalize well not only on unseen sets, but also on the available internal sets. Thus, we first evaluate how our proposed methods perform on seen datasets. We use a leave-one-out scheme for splitting the domains and run experiments on all possible dataset combinations. Table II shows the internal performance of various models.

Alexnet 0.817±0.002 0.811±0.001 0.831±0.001
DANN 0.758±0.003 0.789±0.002 0.806±0.003
REPAIR 0.812±0.002 0.808±0.000 0.829±0.001
Mixup 0.805±0.002 0.809±0.001 0.825±0.002
CrossGrad 0.816±0.001 0.811±0.000 0.831±0.000
E2E-SVM(bias) 0.812±0.001(0.815±0.000) 0.801±0.002(0.809±0.001) 0.822±0.002(0.828±0.001)
E2E-CE(bias) 0.816(0.819±0.001) 0.804±0.000(0.814±0.000) 0.823±0.000(0.833±0.000)
E2E-CE+Cg(bias) 0.816(0.818±0.001) 0.804±0.000(0.814±0.001) 0.823±0.001(0.832±0.000)
E2E-CE+MCT(bias) 0.817±0.001(0.820±0.000) 0.805±0.000(0.814±0.000) 0.824±0.000(0.833±0.000)
  • For the last four models, which have both a bias classifier and a visual world classifier, we show the AUC score of both; the value in brackets is the bias classifier result.

TABLE II: AUC Score of Different Models on Internal Set of CXR Datasets (one column per leave-one-out domain split)

We choose vanilla AlexNet as our baseline and also compare our methods to several advanced models for domain adaptation and generalization. Specifically, in domain adversarial training (DANN) [7] we want to learn a domain-invariant feature representation by fooling a dataset discriminator. In REPAIR [23], we fix a trained feature extractor and assign a trainable weight to each sample to minimize the dataset representation bias; then we resample the dataset according to the weights and retrain the network. In Mixup [43], inputs and labels are modified to be weighted sums of data from different domains. In CrossGrad [34], adversarial inputs are synthesized under the guidance of domain perturbations. We found that DANN fails to converge because the gradient of the domain discriminator dominates the feature learning; REPAIR does not help internal performance; Mixup is worse than the vanilla baseline; and CrossGrad suffers from the vanishing gradient problem. In contrast, the bias classifier of our proposed undoing-bias framework with cross-entropy loss (E2E-CE) matches or surpasses all the alternatives, suggesting that there is a performance gain from modelling dataset bias carefully in multi-source data training. The performance is further increased by using our proposed MCT augmentation. Notice that the bias classifier in our proposed model performs better than the visual world classifier on the internal set, indicating that our model effectively encodes dataset-specific information in the bias model.

V-C3 Classification on unseen dataset

Table III reports the AUC score of each model tested on the external set. Unlike the finding in [35], where the undoing-bias framework [17] performed worse with DECAF features, we show that by training the model end-to-end we can in fact get better external generalization. Moreover, we observe results similar to the internal set performance. Our proposed method surpasses all the compared methods in every domain split, narrowing the performance gap between the internal and external domains. We also find that popular domain adaptation methods such as DANN [7] and data augmentation methods such as Mixup [43] do not work well for CXR data.

Alexnet 0.740±0.002 0.800±0.001 0.756±0.000
DANN 0.705±0.006 0.788±0.003 0.723±0.005
REPAIR 0.741±0.001 0.799±0.002 0.757±0.001
Mixup 0.735±0.002 0.797±0.000 0.755±0.001
CrossGrad 0.742±0.000 0.801±0.000 0.755±0.001
E2E-SVM 0.748±0.002 0.801±0.002 0.758±0.001
E2E-CE 0.752±0.001 0.805±0.000 0.761±0.001
E2E-CE+Cg 0.752±0.001 0.804±0.001 0.760±0.002
E2E-CE+MCT 0.755±0.000 0.807±0.001 0.763±0.001
TABLE III: AUC Score of Different Models for Common Chest Diseases on External Set (one column per leave-one-out domain split)

V-C4 Measuring The Bias

In this section, we further present the bias measurements of each model. We extract the hidden representation of the trained models at the last layer of the feature extractor and train a dataset classifier on top of it to perform the Classifier Two-sample Test. Table IV summarizes the results. Several observations can be drawn: 1. although CXR data seem to be very similar to each other regardless of origin, training a deep model naively can induce significant dataset bias, leading to a large generalization gap when testing on other sources of data; 2. our proposed methods obtain the best average performance with a smaller performance drop between internal and external sets than the AlexNet baseline (7.59% versus 9.42%); 3. our proposed methods exhibit less dataset bias than the AlexNet baseline (92.86% versus 99.37% Classifier Test accuracy).

Models Performance Drop Classifier Test Rank Correlation
Alexnet 9.42% 99.37% 0.05
DANN 6.99% 77.58% 0.02
REPAIR 8.74% 98.21% 0.04
CrossGrad 9.07% 97.49% 0.07
E2E-CE+MCT 7.59% 92.86% 0.12
TABLE IV: Model Bias Comparison

We plot the feature representations of the baseline model and our proposed methods using t-SNE visualization [26] for better demonstration. In Fig. 3, we can observe that the dataset bias is clearly present in the vanilla baseline model. On the other hand, the t-SNE embeddings of different datasets are mixed together in our proposed methods, indicating the effectiveness of the bias mitigation of our model.

Fig. 3: t-SNE visualization of various models. Different colors represent different datasets (blue: CheXpert, red: MIMIC). Left: AlexNet baseline; right: our proposed E2E-CE with MCT augmentation. It can be seen that the embeddings are separated in the baseline method, suggesting that biases are heavily present.

V-D Grad-CAM

To understand how the bias affects our disease prediction, we plot the gradient-guided activation heatmap (Grad-CAM) [33][31] for the proposed models. Figure 4 visualizes two sets of results randomly chosen from the external set. We can see that the vanilla AlexNet may take advantage of unrelated tags and is subject to noise, while our proposed model is more discriminative and robust.

We also provide a quantitative measure for our generated heatmaps by comparing them to the ground-truth annotation using Spearman's rank correlation coefficient [30], which measures the agreement in rank order between two sets. The results are shown in Table IV. It can be seen that our proposed method has better correlation with human annotations (0.12 versus 0.05 for the AlexNet baseline).
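The rank correlation between a heatmap and an annotation mask can be computed as in the following sketch. It assumes both are flattened to equal-length lists and that ties (frequent in a binary mask) receive average ranks; the function names and the toy values are hypothetical, and the paper does not state how its heatmaps and annotations were flattened or aligned.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

# Toy flattened heatmap and binary lesion mask over six pixels.
heatmap = [0.1, 0.4, 0.9, 0.8, 0.2, 0.05]
mask = [0, 0, 1, 1, 0, 0]
rho = spearman(heatmap, mask)
print(round(rho, 3))
```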

Original Image Grad-CAM of Alexnet Grad-CAM of Our Model
Fig. 4: Gradient-weighted Class Activation Map (Grad-CAM) of the AlexNet baseline model and our proposed model for Cardiomegaly (enlarged heart) classification. From left to right: original CXR scan, Grad-CAM of AlexNet overlaid on the original image, and Grad-CAM of our proposed model. The green boxes in the original CXR mark the lesion regions identified by radiologists (not in the NIH dataset), whereas in the Grad-CAM figures brighter regions indicate higher contribution to the prediction (as shown by the colorbar in the last row).

VI Conclusion

Dataset bias is present even in biomedical images that belong to the same modality and anatomy. Naively training and using deep models in medical applications can be dangerous, as a severe generalization problem is observed. Our proposed MCT with classifier-bias modelling framework effectively utilizes multi-source training data, mitigating the damage of dataset bias and narrowing the performance gap between internal domains and unseen domains. Future work includes evaluating MCT in scenarios where more sources of training data are available and identifying the source of bias during data creation.

Appendix A

We explicitly test our novel MCT domain-guided augmentation method on three popular domain-generalization datasets. They are

  • Google Fonts [34]: the task is to classify 36 characters collected from 109 fonts;

  • Rotated-MNIST [8]: this dataset is created by rotating the original MNIST dataset by 6 different angles: 0, 15, 30, 45, 60 and 75 degrees. Each image now has a digit label, and its rotation angle serves as its domain;

  • Office-Caltech [10]: there are in total ten common object categories from four domains (Amazon, Caltech, DSLR and Webcam). Domains tend to have very different viewing angles, object backgrounds, etc.

The experiment details are as follows: for Google Fonts and Rotated-MNIST (R-MNIST), we use the same configuration as in [34]; for Office-Caltech, we use the same setting as Google Fonts. The selected layer set for the three experiments contains all layers, including the input, except the first convolution and the last dense layer. The generalization results are shown in Table V. Here the baseline method for Google Fonts and Office-Caltech is LeNet [19] without special training, while that for R-MNIST is CCSA [28]. It can be seen that our proposed MCT augmentation method surpasses the compared models on all three datasets, with the largest gain on Office-Caltech (47.8 versus 44.3 for CrossGrad).

Models Fonts R-MNIST Office-Caltech
Baseline 68.5* 95.6* 41.6
DANN [7] 68.9* 98.0* 43.8
CrossGrad [34] 72.6* 98.6* 44.3
MCT (ours) 73.1 99.6 47.8
* Results are taken directly from [34].
TABLE V: Experiments on Toy Dataset


  • [1] D. Agniel, I. S. Kohane, and G. M. Weber (2018) Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. bmj 361, pp. k1479. Cited by: §I, §I.
  • [2] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 2229–2238. Cited by: §II.
  • [3] F. M. Carlucci, P. Russo, T. Tommasi, and B. Caputo (2018) Hallucinating agnostic images to generalize across domains. External Links: arXiv:1808.01102 Cited by: §II.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §V-C.
  • [5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pp. 647–655. Cited by: §II.
  • [6] R. Foy, B. Leaman, C. McCrorie, D. Petty, A. House, M. Bennett, P. Carder, S. Faulkner, L. Glidewell, and R. West (2016) Prescribed opioids in primary care: cross-sectional and longitudinal analyses of influence of patient and practice characteristics. BMJ open 6 (5), pp. e010276. Cited by: §I.
  • [7] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: TABLE V, §II, §V-C2, §V-C3.
  • [8] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi (2015-12) Domain generalization for object recognition with multi-task autoencoders. In The IEEE International Conference on Computer Vision (ICCV), Cited by: 2nd item, §II.
  • [9] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C. Peng, and H. E. Stanley (2000) PhysioBank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220. Cited by: §V-A.
  • [10] B. Gong, Y. Shi, F. Sha, and K. Grauman (2012) Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066–2073. Cited by: 3rd item.
  • [11] F. Healey, S. Scobie, D. Oliver, A. Pryce, R. Thomson, and B. Glampson (2008) Falls in English and Welsh hospitals: a national observational study based on retrospective analysis of 12 months of patient safety incident reports. BMJ Quality & Safety 17 (6), pp. 424–430. Cited by: §I.
  • [12] J. B. Homer and G. B. Hirsch (2006) System dynamics modeling for public health: background and opportunities. American journal of public health 96 (3), pp. 452–458. Cited by: §I.
  • [13] D. Huk Park, L. Anne Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach (2018) Multimodal explanations: justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8779–8788. Cited by: §II.
  • [14] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. arXiv preprint arXiv:1901.07031. Cited by: §V-A, §V-C.
  • [15] A. E. Johnson, T. J. Pollard, S. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-cxr: a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. Cited by: §V-A.
  • [16] G. J. Katuwal and R. Chen (2016) Machine learning model interpretability for precision medicine. arXiv preprint arXiv:1610.09045. Cited by: §II.
  • [17] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba (2012) Undoing the damage of dataset bias. In European Conference on Computer Vision, pp. 158–171. Cited by: §II, §II, §II, §IV-A, §V-C3.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §V-C.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: Appendix A.
  • [20] E. Levina and P. Bickel (2001) The earth mover’s distance is the mallows distance: some insights from statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 251–256. Cited by: §II.
  • [21] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542–5550. Cited by: §II.
  • [22] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot (2018) Domain generalization with adversarial feature learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II.
  • [23] Y. Li and N. Vasconcelos (2019) REPAIR: removing representation bias by dataset resampling. arXiv preprint arXiv:1904.07911. Cited by: §II, §V-C2.
  • [24] D. Lopez-Paz and M. Oquab (2016) Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545. Cited by: §II, §V-B2.
  • [25] D. G. Lowe (1999) Object recognition from local scale-invariant features. In ICCV, pp. 1150–1157. Cited by: §II, §IV-A.
  • [26] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §V-C4.
  • [27] N. McLaughlin, J. M. Del Rincon, and P. Miller (2015) Data-augmentation for reducing dataset bias in person re-identification. In 2015 12th IEEE International conference on advanced video and signal based surveillance (AVSS), pp. 1–6. Cited by: §II.
  • [28] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017) Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5715–5725. Cited by: Appendix A.
  • [29] K. Muandet, D. Balduzzi, and B. Schölkopf (2013) Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18. Cited by: §II.
  • [30] J. L. Myers, A. D. Well, and R. F. Lorch Jr (2013) Research design and statistical analysis. Routledge. Cited by: §II, §V-D.
  • [31] U. Ozbulak (2019) PyTorch cnn visualizations. GitHub. Cited by: §V-D.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §V-C.
  • [33] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §II, §V-D.
  • [34] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi (2018) Generalizing across domains via cross-gradient training. arXiv preprint arXiv:1804.10745. Cited by: 1st item, TABLE V, Appendix A, §II, §IV-B, §V-C2, footnote 1.
  • [35] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars (2017) A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications, pp. 37–55. Cited by: §II, §II, §IV-A, §V-B1, §V-C1, §V-C3.
  • [36] A. Torralba, A. A. Efros, et al. (2011) Unbiased look at dataset bias. In CVPR, Vol. 1, pp. 7. Cited by: §II, §V-B1, §V-C1.
  • [37] V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio (2018) Manifold mixup: learning better representations by interpolating hidden states. Cited by: §IV-B.
  • [38] S. Wang and J. Jiang (2016) Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905. Cited by: §II.
  • [39] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §V-A.
  • [40] L. Yao, J. Prosky, B. Covington, and K. Lyman (2019) A strong baseline for domain adaptation and generalization in medical imaging. arXiv preprint arXiv:1904.01638. Cited by: §II.
  • [41] Z. Yu, J. Yu, J. Fan, and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 1821–1830. Cited by: §II.
  • [42] J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann (2018) Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine 15 (11), pp. e1002683. Cited by: §I.
  • [43] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §IV-B, §V-C2, §V-C3.
  • [44] Y. Zhang, J. C. Niebles, and A. Soto (2019) Interpretable visual question answering by visual grounding from attention supervision mining. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 349–357. Cited by: §II.
  • [45] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §II.