Tasting the cake: evaluating self-supervised generalization on out-of-distribution multimodal MRI data

by   Alex Fedorov, et al.
Georgia Institute of Technology

Self-supervised learning has enabled significant improvements on natural image benchmarks. However, there is less work in the medical imaging domain, the optimal models for it have not yet been determined, and little work has evaluated the current applicability limits of novel self-supervised methods. In this paper, we evaluate a range of current contrastive self-supervised methods on out-of-distribution generalization in order to assess their applicability to medical imaging. We show that self-supervised models are not as robust as their results on natural imaging benchmarks suggest and can be outperformed by supervised learning with dropout. We also show that this behavior can be countered with extensive augmentation. Our results highlight the need for out-of-distribution generalization standards and benchmarks for the medical imaging community to adopt self-supervised methods.



Code Repositories


Fusion is a self-supervised framework for data with multiple sources — specifically, this framework aims to support neuroimaging applications.


1 Introduction

Self-supervised learning has fueled recent advances in image recognition (Oord et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019; Chen et al., 2020; Grill et al., 2020; Chen & He, 2020) and spurred great interest and high expectations in neuroimaging (Fedorov et al., 2019; Mahmood et al., 2020; Jeon et al., 2020; Taleb et al., 2020; Fedorov et al., 2020b). The expectations are generally high even outside neuroimaging, so much so that in Yann LeCun's metaphor of learning as a cake (LeCun, 2019), self-supervised learning makes up the tastiest part of the cake: the filling.

However, the observed similarity in performance of self-supervised and supervised methods on natural images (Geirhos et al., 2020b) does not guarantee the same for other domains, and work on the generalization of these methods to neuroimaging data is generally lacking. To fill this gap, we investigate out-of-distribution generalization using simulated distortions and a natural, race-based distributional shift on multimodal human MRI data. We consider multimodal data both as a natural case of multi-view data and because it contains a wealth of complementary information regarding the healthy and dysfunctional brain (Calhoun & Sui, 2016). We show that on neuroimaging data, contrastive multimodal self-supervised learning leads to models that differ from models trained in a supervised way. (For brevity, we use the terms supervised models and self-supervised models, but we mean the models trained by these approaches to learning.) The models disagree in how they react to distortions and modality. When using Dropout (Tompson et al., 2015), supervised models, counter to our expectations, can significantly outperform self-supervised models in out-of-distribution generalization. Further, we show that the class of methods inspired by Deep InfoMax (DIM) (Hjelm et al., 2018) tends to struggle with intensity-based distortions, which we attempt to address with additional data augmentation. Our findings further reinforce the advantages of multimodal models over unimodal ones. We also show that pure maximization of similarity (e.g., CMC (Tian et al., 2019), SimCLR (Chen et al., 2020)) performs poorly on multimodal medical imaging data because it requires an additional step to learn the modality.

Finally, we argue that the medical imaging community needs standards and benchmarks for out-of-distribution generalization. Introducing them would move researchers toward a more reliable and standardized evaluation of newly proposed methods, because similar performance on downstream tasks does not necessarily imply robust generalization. Standards could also lead to a better understanding of methodological trade-offs in medical imaging. For example, out-of-distribution generalization to distortions (e.g., affine scale distortion) may drive models to learn trivial discriminative features (see shortcut learning (Geirhos et al., 2020a)).

2 Methodology

2.1 Dataset and modeling out-of-distribution generalization

We evaluate the models on the multimodal neuroimaging dataset OASIS-3 (LaMontagne et al., 2019). The modalities we selected are T1 and resting-state fMRI (rs-fMRI), which capture the brain's anatomy and functional dynamics, respectively. First, T1 volumes were masked to only include the brain, and the rs-fMRI volumes were used to compute the fractional amplitude of low-frequency fluctuation (fALFF) in the 0.01 to 0.1 Hz power band. Full details can be found in Appendix A.2.

To model out-of-distribution data, we utilize the following random data transformations available in TorchIO (Pérez-García et al., 2020): Affine, Elastic, Gamma, Motion, Spike, Ghosting, BiasField, GaussianNoise. The transformation details can be found in Appendix A.1. The samples are shown in the Appendix in Figure 4. To model natural distributional shift from the training set, we selected African-American subjects.
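The robustness evaluation amounts to sweeping a distortion strength and scoring the model at each level; the sketch below shows that pattern with illustrative stand-ins (additive Gaussian noise, matching the paper's RandomNoise distortion, and correlation with the clean volume as a stand-in score), not the paper's actual pipeline:

```python
import numpy as np

def distortion_sweep(volume, distort, strengths, score_fn):
    """Score a volume after distorting it at increasing strengths,
    mirroring the sweeps behind Figures 2 and 3."""
    return [score_fn(distort(volume, s)) for s in strengths]

# Illustrative stand-ins for a real volume, distortion, and metric.
rng = np.random.default_rng(0)
clean = rng.standard_normal((32, 32, 32))

def add_noise(vol, std):
    # Additive Gaussian noise at a given standard deviation.
    return vol + rng.normal(0.0, std, vol.shape)

def similarity(vol):
    # Correlation with the clean volume, as a stand-in quality score.
    return float(np.corrcoef(clean.ravel(), vol.ravel())[0, 1])

scores = distortion_sweep(clean, add_noise, [0.0, 0.5, 1.0, 2.0], similarity)
assert scores[0] > scores[-1]  # quality degrades as distortion strengthens
```

In the paper itself, the distortions come from TorchIO's transform classes and the score is ROCAUC of the downstream classifier.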

Figure 1: Left: The combined scheme of CL–CS and S–AE. The red arrow denotes CS, pink CL, purple S, and green the reconstruction for AE. The decoder is only required for S–AE, as CL–CS is a decoder-free model. Please refer to the text for the notation. Right: ROCAUC on holdout test sets with Non-Hispanic Caucasian and African American subjects for T1 and fALFF.

2.2 Multimodal self-supervised models

Let D = {(x^1_i, x^2_i)} be a dataset with paired T1 and fALFF volumes. We want to learn a latent representation z^m = E^m(x^m), where E^m is the encoder part of the 3D convolutional DCGAN architecture (Radford et al., 2015) for modality m.

Based on the performance of the proposed taxonomy in Fedorov et al. (2020a), we selected the following multimodal self-supervised models: CL–CS (Cross-Local and Cross-Spatial connections) and S–AE (Similarity AutoEncoder) (Figure 1). Additionally, we compare these models with Supervised and AutoEncoder (AE) unimodal models, and with multimodal models based only on the maximization of similarity between latent representations (S).

CL–CS is an AMDIM (Bachman et al., 2019) inspired model. Its inductive bias is to maximize mutual information between pairs of "global" and "local" variables across modalities. The first part of the objective, called CL, is defined between a "local" variable c^m_{l,j} (the embedding at location j of the convolutional feature map in layer l of modality m) and a "global" (latent) variable z^n, where j is the location index and m and n are modality indices. The second part of the objective, called CS, is defined on pairs in which the location index runs over the feature map of the other modality. The CL–CS objective sums InfoNCE (Oord et al., 2018) based estimators of these terms with a separable critic f(u, v) = φ(u)ᵀψ(v)/√d (Bachman et al., 2019), where d is the dimensionality of the embeddings.
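As a concrete reference for the objective described above, here is a minimal numpy sketch of an InfoNCE estimator with a separable dot-product critic scaled by the square root of the embedding dimensionality; the other items in the batch serve as negatives, and all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def info_nce(phi, psi):
    """InfoNCE lower bound with a separable critic phi(x)^T psi(y) / sqrt(d).
    phi, psi: (batch, d) arrays of paired embeddings; row i of phi is the
    positive match for row i of psi, and all other rows act as negatives."""
    d = phi.shape[1]
    scores = phi @ psi.T / np.sqrt(d)            # (batch, batch) critic scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_softmax)))  # mean log-ratio for positives

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 64))
# Perfectly aligned pairs score higher than random, unrelated pairings.
aligned = info_nce(z, z)
shuffled = info_nce(z, rng.standard_normal((8, 64)))
assert aligned > shuffled
```

In CL–CS the same estimator is applied to local-global (CL) and cross-spatial (CS) pairs rather than to global embeddings alone.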

S–AE is a fusion of CMC (Tian et al., 2019) and SimCLR (Chen et al., 2020) with DCCAE (Wang et al., 2015), in which the CCA objective is replaced by maximization of the similarity (mutual information) between the pair of latent variables (z^1, z^2). The similarity is approximated with a DIM objective, which allows the model to be trained end-to-end and improves numerical stability compared to SVD-based solutions of CCA (Wang et al., 2015). Combining an AE with similarity maximization has been shown to avoid collapse of the representation in the multimodal scenario (Fedorov et al., 2020a). The S–AE objective combines this similarity term with a reconstruction loss for a DCGAN decoder for each modality m.
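The structure of the combined objective can be sketched as a similarity term on the latents plus per-modality reconstruction errors; in this toy version a cosine similarity stands in for the DIM-based estimator and mean-squared error for the decoders' reconstruction losses (hypothetical helper, not the paper's implementation):

```python
import numpy as np

def s_ae_loss(z1, z2, x1, x2, xr1, xr2, alpha=1.0):
    """Sketch of the S-AE objective: negative latent similarity (cosine
    here, standing in for the DIM estimator) plus mean-squared
    reconstruction error of each modality's decoder output."""
    cos = np.sum(z1 * z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    rec = np.mean((x1 - xr1) ** 2) + np.mean((x2 - xr2) ** 2)
    return -cos + alpha * rec  # minimized jointly, end to end

rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal((2, 64))
z = rng.standard_normal(16)
# Identical latents and perfect reconstructions give the minimum value -1.
assert np.isclose(s_ae_loss(z, z, x1, x2, x1, x2), -1.0)
```

The reconstruction term is what keeps the two latents from collapsing onto a trivial shared solution.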

3 Discussion

In our experiments, we use a frozen encoder and a linear projection trained on Non-Hispanic Caucasian subjects. First, we pretrain the encoder on all possible pairs; then, we train a linear projection from the encoder's output to classify Healthy Controls (HC) and patients with Alzheimer's Disease (AD). Finally, we evaluate out-of-distribution generalization on a natural race-based distributional shift and on simulated distortions applied to volumes in standardized MNI space. All the models follow the same pipeline and use the same hyperparameters (Appendix A.3).


Figure 2: ROCAUC performance for out-of-distribution generalization on T1 data with simulated distortions. Each title gives the name of the random distortion, and the x-axis label gives its parameter.

3.1 Do self-supervised multimodal models produce robust representations?

In Figure 2, we compare the selected models and their baselines on out-of-distribution generalization with simulated distortions. In most cases, the supervised model performs well and better than the self-supervised models, though it tends to break down at stronger levels of RandomGhosting, RandomSpike, RandomBlur, and RandomNoise. The unimodal AE and multimodal S perform much worse than the combined approach, S–AE. CL–CS performs much worse than the other models on intensity-based noise: RandomGhosting, RandomSpike, RandomBlur, and RandomNoise. We hypothesize that adding intensity-based data augmentation should help DIM-inspired models: because they maximize mutual information between features across depth, they can learn spurious correlations.

In Figure 3, we compare the models' out-of-distribution performance on fALFF. In this case, S completely fails to represent fALFF, and the unimodal AE fails in most cases. The multimodal self-supervised models, S–AE and CL–CS, outperform the supervised baseline in most cases.

We note that fALFF is a "harder" modality because it reduces rs-fMRI to a voxel-wise, hand-engineered feature that discards temporal information. Visually (Figure 1 (left)), it looks highly noisy compared to T1. Importantly, most of the distortions used to augment the data are not natural variations in fALFF, because it has already undergone a heavy preprocessing pipeline.

It is unclear what generalization requirements we should satisfy in medical imaging and which out-of-distribution generalization is meaningful. For example, looking at the first subfigure in Figure 2 (RandomAffine with the scale parameter), it is not clear whether generalization to scaling is desirable. When we scale the volume by 1.7 or more, we can only see the center of the brain, which might force some models (Supervised, CL–CS) to learn trivial features: ventricle size, for example, is a well-known biomarker for Alzheimer's disease (Frisoni et al., 2010). In contrast, other models (AE, S, S–AE) try to utilize information from the whole brain for decision making, because they do not directly optimize an objective that maximizes classification accuracy.

Figure 3: ROCAUC performance for out-of-distribution generalization on fALFF data with simulated distortions. Each title gives the name of the random distortion, and the x-axis label gives its parameter.

These findings suggest that contrastive multimodal self-supervised learning produces models that differ strongly from models trained in a supervised way, but that there is room for improvement for both. We explore improvements to the Supervised and DIM-inspired models in the next subsection.

3.1.1 Improving out-of-distribution generalization of supervised and DIM inspired models

To improve the supervised model's out-of-distribution generalization, we utilized volumetric (3D) Dropout with p = 0.5. The Supervised model with dropout shows a significant boost in generalization for T1 volumes compared to the self-supervised models (Supervised (p=0.5), Figure 2). Additionally, dropout pushed the model to extract features beyond the center of the brain (Figure 2, RandomAffine scale distortion), which suggests that dropout can be a simple solution to the problem. Dropout, however, does not work for the noisy, hand-engineered fALFF data (Figure 3).
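Volumetric dropout differs from element-wise dropout in that it drops entire feature channels rather than individual voxels; a minimal numpy sketch of the train-time behavior, with the usual inverted scaling (illustrative, not the model code):

```python
import numpy as np

def dropout3d(x, p, rng):
    """Channel-wise (volumetric) dropout for a (channels, D, H, W) array:
    each channel is dropped as a whole with probability p, and survivors
    are rescaled by 1/(1-p) so the expected activation is unchanged."""
    keep = rng.random(x.shape[0]) >= p                # one decision per channel
    mask = keep.astype(x.dtype).reshape(-1, 1, 1, 1)  # broadcast over the volume
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((16, 8, 8, 8))
y = dropout3d(x, p=0.5, rng=rng)
# Each channel is either entirely zeroed or uniformly rescaled to 2.0.
per_channel = y.reshape(16, -1)
assert all(np.all(c == 0.0) or np.all(c == 2.0) for c in per_channel)
```

Dropping whole channels makes the regularization respect the strong spatial correlation within a feature map, which element-wise dropout ignores.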

To improve the out-of-distribution generalization of CL–CS, we added RandomNoise (applied with a fixed maximum standard deviation and probability) to the data augmentation pipeline during the encoder's pretraining. This data augmentation partly improves generalization for T1 (CL–CS Noisy, Figure 2) but can reduce out-of-distribution generalization for some distortions and does not work for fALFF, so it requires additional finetuning. Possible improvements to the data augmentation include curriculum learning (e.g., Sinha et al. (2020)), which improves convergence, and addressing catastrophic forgetting (Kirkpatrick et al., 2017), since augmentation can move the model away from the original data distribution.

3.2 Race-based distributional shift

The performance of the selected models is shown in Figure 1 (right). Comparing the models' performance across races visually, there is no evident racial-bias trend. The S model, however, fails to generalize to African American subjects on fALFF; the reduced classification performance is likely due to its representations collapsing during pretraining.

4 Conclusions

Self-supervised medical imaging models are only now beginning to be developed, and we hope that our analysis will facilitate robust and fair self-supervised models. Additionally, we hope to see more exhaustive benchmarks to evaluate out-of-distribution generalization in medical imaging.


This work is supported by NIH R01 EB006841.

Data were provided in part by OASIS-3: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352. AV-45 doses were provided by Avid Radiopharmaceuticals, a wholly-owned subsidiary of Eli Lilly.


Appendix A Appendix

A.1 Simulated distortions

Figure 4: Samples of the simulated distortions applied to a T1 image.

The default parameters for data transformation are:

"affine": {
  "scales": (1, 1), "degrees": (0, 0), "translation": (0, 0)
"anisotropy": {"downsampling": (2, 2)}
"motion": {
  "degrees": (0, 0), "translation": (0, 0), "num_transforms": 1
"ghost": {"num_ghosts": (1, 1), "intensity": (0.5, 0.5)}
"spike": {"num_spikes": (1, 1), "intensity": (0.5, 0.5)}
"blur": {"std": (0.25, 0.25)}
"bias": {"coefficients": (0.5, 0.5), "order": 3}
"noise": {"mean": (0, 0), "std": (0.25, 0.25)}

For transformations that are not listed, we used the default parameters from the TorchIO package (Pérez-García et al., 2020).

The parameter space is defined as:

import numpy as np

class ParameterSpace:
    def __init__(self, left, right, n_steps, data_type, eps=10e-8):
        # Inclusive sweep from left to right in n_steps increments;
        # eps keeps the right endpoint despite floating-point error.
        step = (right - left) / n_steps
        self.range = np.round(np.arange(left, right + eps, step), 2)
        self.range = self.range.astype(data_type)
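For example (restating the class so the snippet is self-contained), a sweep from 0 to 2 in 10 steps yields 11 values including both endpoints, since eps compensates for floating-point error at the right boundary:

```python
import numpy as np

class ParameterSpace:
    def __init__(self, left, right, n_steps, data_type, eps=10e-8):
        # Inclusive sweep from left to right in n_steps increments.
        step = (right - left) / n_steps
        self.range = np.round(np.arange(left, right + eps, step), 2)
        self.range = self.range.astype(data_type)

space = ParameterSpace(0, 2, 10, data_type=float)
assert len(space.range) == 11                      # both endpoints included
assert space.range[0] == 0.0 and space.range[-1] == 2.0
```

Without the eps slack, accumulated floating-point error in np.arange can silently drop the right endpoint of the sweep.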

Then we can define search space as:

search_space = {
    "affine": {
        "class": tio.transforms.RandomAffine,
        "space": {
            "scales": ParameterSpace(0.5, 2.5, 10, data_type=float),
            "degrees": ParameterSpace(-90, 90, 10, data_type=int),
            "translation": ParameterSpace(-45, 45, 10, data_type=int),
        },
    },
    "elastic": {
        "class": tio.transforms.RandomElasticDeformation,
        "space": {
            "num_control_points": ParameterSpace(
                5, 16, 10, data_type=int),
            "max_displacement": ParameterSpace(
                1, 32, 10, data_type=float),
        },
    },
    "motion": {
        "class": tio.transforms.RandomMotion,
        "space": {
            "degrees": ParameterSpace(-90, 90, 10, data_type=int),
            "translation": ParameterSpace(-9, 9, 10, data_type=int),
        },
    },
    "ghost": {
        "class": tio.transforms.RandomGhosting,
        "space": {
            "intensity": ParameterSpace(0.1, 0.9, 10, data_type=float),
        },
    },
    "spike": {
        "class": tio.transforms.RandomSpike,
        "space": {
            "intensity": ParameterSpace(0, 2, 10, data_type=float),
        },
    },
    "blur": {
        "class": tio.transforms.RandomBlur,
        "space": {
            "std": ParameterSpace(1, 10, 10, data_type=float),
        },
    },
    "bias": {
        "class": tio.transforms.RandomBiasField,
        "space": {
            "coefficients": ParameterSpace(
                0.1, 2, 10, data_type=float),
            "order": ParameterSpace(1, 8, 8, data_type=int),
        },
    },
    "noise": {
        "class": tio.transforms.RandomNoise,
        "space": {
            "mean": ParameterSpace(-2., 2., 10, data_type=float),
            "std": ParameterSpace(0, 2., 10, data_type=float),
        },
    },
    "gamma": {
        "class": tio.transforms.RandomGamma,
        "space": {
            "log_gamma": ParameterSpace(-0.9, 0.9, 10, data_type=float),
        },
    },
}

Each data transformation is defined by a class in the TorchIO package (Pérez-García et al., 2020). Sample images for each transformation are shown in Figure 4.

A.2 Dataset preprocessing details

To preprocess rs-fMRI, we registered the time series to the first image in the series using mcflirt (FSL v6.0.2) (Jenkinson et al., 2002) with a 3-stage search level, a 20mm field-of-view, matching with 256 histogram bins, 6 degrees of freedom (dof) for the transformation, a 6mm scaling factor, and normalized correlation across the volumes as the cost function (smoothed to 1mm). The final transformations and outputs were interpolated using splines. The fALFF volume was then computed in the 0.01 to 0.1 Hz power band using REST (Song et al., 2011).

To preprocess the T1 volumes, we first removed 15 subjects after visual inspection. The T1 volumes were brain-masked with bet (FSL v6.0.2) (Jenkinson et al., 2002). The brain-masked volumes were then linearly warped (7 dof) to MNI space and resampled to 3mm resolution, giving a final volume of 64×64×64 voxels.

The samples were z-normalized and normalized with histogram standardization based on the training set before being fed into the deep neural network. During training, we apply random flips and random crops as data augmentation.
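The z-normalization step can be sketched as follows; histogram standardization is available in TorchIO as tio.HistogramStandardization and is omitted here:

```python
import numpy as np

def z_normalize(volume, eps=1e-8):
    """Zero-mean, unit-variance normalization over the whole volume;
    eps guards against division by zero for constant inputs."""
    return (volume - volume.mean()) / (volume.std() + eps)

rng = np.random.default_rng(0)
vol = rng.normal(5.0, 3.0, (64, 64, 64))  # stand-in for a brain volume
normed = z_normalize(vol)
assert abs(normed.mean()) < 1e-6
assert abs(normed.std() - 1.0) < 1e-3
```

Fitting the histogram-standardization landmarks on the training set only, as the paper does, keeps test-set statistics from leaking into preprocessing.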

We selected non-Hispanic Caucasian subjects, comprising healthy controls (HC), patients with Alzheimer's Disease (AD), and subjects with other disorders. Because subjects can have multiple scans, pairing the two modalities across all scans yields more pairs than subjects. We split the subjects into stratified training and validation folds and a hold-out test set. A separate subset of African American subjects (HC and AD) models the race-based distributional shift. In the downstream tasks, we only use the first pair of multimodal volumes per subject.

A.3 Architecture, optimization and hyperparameters

The main architectures for the encoder and decoder are based on the fully convolutional DCGAN architecture  (Radford et al., 2015). The final convolutional layer in the encoder produces a 64x1x1x1 feature map. We initialize all layers with Xavier initialization.
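Xavier initialization draws weights so that activation variance is roughly preserved across layers; a numpy sketch for a DCGAN-style 3D convolution kernel (the shapes and function name are illustrative):

```python
import numpy as np

def xavier_uniform(shape, rng):
    """Xavier/Glorot uniform init: bound = sqrt(6 / (fan_in + fan_out)).
    For a conv kernel (out_ch, in_ch, *kernel), the fan counts include
    the receptive-field size."""
    receptive = int(np.prod(shape[2:])) if len(shape) > 2 else 1
    fan_in = shape[1] * receptive
    fan_out = shape[0] * receptive
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=shape)

rng = np.random.default_rng(0)
w = xavier_uniform((64, 128, 4, 4, 4), rng)  # illustrative 3D conv kernel
assert w.shape == (64, 128, 4, 4, 4)
```

PyTorch provides the same scheme as torch.nn.init.xavier_uniform_.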

The CL–CS method also uses a convolutional projection head to map the 128×8×8×8 feature map to 64×8×8×8, giving 8×8×8 locations with 64-dimensional representations. The projection head consists of one ResNet (He et al., 2016) block, which combines two paths: an identity path and two convolutional layers with ReLU activation and 3D Batch Normalization in between. The projections are shared between the CL and CS objectives. The layers of the convolutional projection are initialized from a uniform distribution, with the diagonal set to 1 where the input and output dimensions match, similar to AMDIM (Bachman et al., 2019).

Similarly to AMDIM (Bachman et al., 2019), each InfoNCE objective is penalized using the squared critic scores, and the critic values are clipped following AMDIM.

We pretrain the encoders and then train the linear projection layers using the RAdam (Liu et al., 2019) optimizer with a OneCycleLR (Smith & Topin, 2019) scheduler.
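A one-cycle schedule ramps the learning rate up to a maximum and then anneals it back down; a simplified sketch of the cosine variant (PyTorch's torch.optim.lr_scheduler.OneCycleLR implements the full version, and the constants here are illustrative defaults):

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0):
    """Simplified one-cycle schedule: cosine warm-up from max_lr/div_factor
    to max_lr over the first pct_start of training, then cosine annealing
    back down toward the initial rate."""
    initial_lr = max_lr / div_factor
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        t = step / max(1, warmup_steps)
        return initial_lr + (max_lr - initial_lr) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return initial_lr + (max_lr - initial_lr) * (1 + math.cos(math.pi * t)) / 2

lrs = [one_cycle_lr(s, 100, 1e-3) for s in range(100)]
assert max(lrs) <= 1e-3   # never exceeds the maximum learning rate
assert lrs[0] < max(lrs)  # warms up first, then anneals
```

The warm-up phase acts as implicit regularization early in training, while the long annealing tail stabilizes convergence.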

A.4 Implementation and computational resources

The experiments are implemented using the PyTorch (Paszke et al., 2019) and Catalyst (Kolesnikov, 2018) frameworks and were run on an NVIDIA DGX-1 with V100 GPUs. The code will be made available.