
Self-supervised multimodal neuroimaging yields predictive representations for a spectrum of Alzheimer's phenotypes

by   Alex Fedorov, et al.

Recent neuroimaging studies that focus on predicting brain disorders via modern machine learning approaches commonly include a single modality and rely on supervised over-parameterized models. However, a single modality provides only a limited view of the highly complex brain. Critically, supervised models in clinical settings lack accurate diagnostic labels for training. Coarse labels do not capture the long-tailed spectrum of brain disorder phenotypes, which leads to a loss of model generalizability that makes the models less useful in diagnostic settings. This work presents a novel multi-scale coordinated framework for learning multiple representations from multimodal neuroimaging data. We propose a general taxonomy of informative inductive biases to capture unique and joint information in multimodal self-supervised fusion. The taxonomy forms a family of decoder-free models with reduced computational complexity and a propensity to capture multi-scale relationships between local and global representations of the multimodal inputs. We conduct a comprehensive evaluation of the taxonomy using functional and structural magnetic resonance imaging (MRI) data across a spectrum of Alzheimer's disease phenotypes and show that self-supervised models reveal disorder-relevant brain regions and multimodal links without access to the labels during pre-training. The proposed multimodal self-supervised learning yields representations with improved classification performance for both modalities. The concomitant rich and flexible unsupervised deep learning framework captures complex multimodal relationships and provides predictive performance that meets or exceeds that of a narrower supervised classification analysis. We present elaborate quantitative evidence of how this framework can significantly advance our search for missing links in complex brain disorders.





1 Introduction

The brain is a vastly complex organ whose proper function relies on the simultaneous operation of multitudes of distinct biological processes. As a result, individual imaging techniques often capture only a single facet of the information necessary to understand a dysfunction or perform a diagnosis. As an illustration, structural MRI (sMRI) captures static but relatively precise anatomy, while fMRI measures the dynamics of the hemodynamic response but with substantial noise. Brain imaging analyses with a single modality have been shown to potentially lead to misleading conclusions calhoun2016multimodal ; plis2011effective , which is unsurprising given the fundamental differences in the information the modalities measure and the fact that every modality is flawed on its own in some way.

To address the limitations of unimodal analyses, it is natural to turn to multimodal data to leverage a wealth of complementary information, which is key to enhancing our knowledge of the brain and developing robust biomarkers. Unfortunately, multimodal modeling is often a challenge, as finding points of convergence between different multimodal views of the brain is a nontrivial problem. We propose self-supervised approaches to improve our ability to model joint information between modalities, thereby allowing us to achieve the following three goals. Our primary goal in addressing multimodal modeling is to understand how to represent multimodal neuroimaging data by exploiting unique and joint information in two modalities. As the second goal, we want to understand the links between different modalities (e.g., T1-weighted structural measurements and resting functional MRI data). The final goal is to understand how to exploit co-learning baltruvsaitis2018multimodal , especially in the case where one modality is particularly hard to learn. By achieving these three goals in this work, we address three of the five multimodal challenges baltruvsaitis2018multimodal : representation, alignment, and co-learning, leaving only generative translation and fused prediction for future work. Moreover, we present a general framework for self-supervised multimodal neuroimaging. The proposed approach can capitalize on the available joint information to show competitive performance relative to supervised methods. Our approach opens the door to additional data discovery. It enables characterizing subject heterogeneity in the context of imperfect or missing diagnostic labels and, finally, can facilitate visualization of complex relationships.

1.1 Related work

In neuroimaging, linear independent component analysis (ICA) comon1994independent and canonical correlation analysis (CCA) hotelling1992relations are commonly used for latent representation learning and the investigation of inter-modal links. Joint ICA (jICA) moosmann2008joint performs ICA on the concatenated representations of the modalities. jICA has been extended with multiset canonical correlation plus ICA (mCCA+ICA) sui2011discriminating and spatial CCA (sCCA) sui2010cca , which mitigate the limitations of jICA or CCA applied separately. These jICA-adjacent methods all estimate a joint representation. Another approach, parallel ICA (paraICA) liu2009combining , simultaneously learns independent components for fMRI and SNP data that maximize the correlation between specific multimodal pairs of columns in the mixing matrices of the different modalities. A recent improvement over paraICA, aNy-way ICA duan2020any , can scale to any number of modalities and requires fewer assumptions.

Most of the available multimodal imaging analysis approaches, including those mentioned above, rely on linear decompositions of the data. However, recent work suggests the presence of nonlinearities in neuroimaging data that can be exploited by deep learning (DL) abrol2021deep . Likewise, the correspondence between modalities is unlikely to be linear calhoun2016multimodal . These findings motivate the need for deep nonlinear models. Within neuroimaging, supervised DL models have mostly proven successful due to their ease of use. These models show largely unparalleled performance mainly when training data are abundant: data samples paired with corresponding labels. However, supervised models are prone to the shortcut learning phenomenon geirhos2020shortcut , whereby a model homes in on trivial patterns in the training set that suffice to classify the available data but do not generalize to data unseen at training. Next, it has been shown that supervised models can memorize noisy labels arpit2017closer , which are commonplace in healthcare pechenizkiy2006class ; rokham2020addressing . Furthermore, supervised methods have also been shown to be data-inefficient CPCv2 , while labels in medical studies are costly and scarce. Finally, in many cases, diagnostic labels are based on self-reports and interviews and thus may not accurately reflect the underlying biology rokham2020addressing . Many of these problems can be addressed with unsupervised learning and, more recently, self-supervised learning (SSL) dosovitskiy2014discriminative . In SSL, a model trains on a proxy task that does not require externally provided labels. SSL has been shown to improve robustness NEURIPS2019_a2b15837 and data-efficiency CPCv2 , and can outperform supervised approaches on image recognition tasks SWAV .

In the early days of the current wave of unsupervised deep learning, common approaches were based on deep belief networks (DBNs) srivastava2012learning ; plis2014deep and deep Boltzmann machines (DBMs) srivastava2012multimodal ; hjelm2014restricted ; SUK2014569 . However, DBNs and DBMs are difficult to train. Later, deep canonical correlation analysis (DCCA) dcca was introduced for multiview unsupervised learning. DCCA dcca and its successor, the deep canonically correlated autoencoder (DCCAE) DCCAE , are trained in a two-stage procedure. In the first stage, a neural network is trained unimodally via layer-wise pretraining or using an autoencoder. In the second stage, CCA is used to capture joint information between modalities. Due to the need for a decoder, the autoencoding approaches impose high computational and memory requirements for full-brain data, which is why most brain segmentation models still operate on 3D patches fedorov2017end ; fedorov2017almost ; henschel2020fastsurfer .

Among many self-supervised learning approaches, we are specifically interested in methods that maximize mutual information, such as Deep InfoMax (DIM) DIM and contrastive predictive coding (CPC) CPCv1 . Compared to other self-supervised pre-text tasks misra2020self (e.g., relative position doersch2015unsupervised , rotation gidaris2018unsupervised , colorization zhang2016colorful ), these methods can naturally be extended to model multimodal data fedorov2021self . The maximization of mutual information in these methods uses a predictive relationship between representations at different levels as a learning signal for training. Specifically, the learning signal in DIM DIM is the relationship between the intermediate representation of a convolutional neural network (CNN) and the whole representation of the input. In CPC CPCv1 , for example, this is done between the context and a future intermediate state. Both DIM and CPC have been successfully extended and applied unimodally for the prediction of Alzheimer's disease from sMRI fedorov2019prediction , transfer learning with fMRI mahmood2019transfer ; mahmood2020whole , and brain tumor segmentation, pancreas tumor segmentation, and diabetic retinopathy detection NEURIPS2020_d2dc6368 . In addition, these models do not reconstruct the input as part of their learning objective, unlike autoencoders. Forgoing reconstruction saves substantial compute and memory, especially for volumetric medical imaging applications.

In multiview and multimodal settings, self-supervised learning has enabled state-of-the-art results on various computer vision problems via maximization of mutual information between different views of the same image amdim ; cmc ; simclr , and between multimodal data (e.g., visual, audio, text) miech2020end ; alayrac2020self ; Radford2021LearningTV ; CMIM . These SSL approaches capture the joint information between two corresponding distorted or augmented images, or image-text, video-audio, or video-text pairs. In the multiview case amdim ; cmc ; simclr , the models learn transformation-invariant representations by capturing the joint information while discarding information unique to a transformation. In the multimodal case, the models learn modality-invariant demian representations, known as retrieval models. The same ideas have been extended to the multidomain scenario to learn domain-invariant feng2019self representations. However, in neuroimaging, where one modality captures anatomy and the other captures brain dynamics, the joint information alone will not be sufficient because of the differing information content each modality measures. We hypothesize that we additionally need to capture unique modality-specific information.

Most of the described multiview, multidomain, and multimodal work can be viewed as coordinated representation learning baltruvsaitis2018multimodal . In coordinated representation learning, we learn separate representations for each view, domain, or modality, but the representations are coordinated through an objective function by optimizing a similarity measure with possible additional constraints (e.g., orthogonality in CCA). In this case, the objective function mainly captures joint information between the global latent representations of modalities that summarize the whole input. However, such a framework considers only a global-to-global relationship between modalities. To resolve this limitation, we can consider intermediate representations, namely local representations that capture local information about the input. That would allow us to capture local-to-local, global-to-global, and local-to-global multimodal relationships. Previously, augmented multiscale DIM (AMDIM) amdim , cross-modal DIM (CM-DIM) sylvain2019locality ; sylvain2020zeroshot , and spatio-temporal DIM (ST-DIM) anand2019unsupervised used the local intermediate representations of convolutional layers to capture multi-scale relationships between multiple views, modalities, or time frames. Thus, we hypothesize that multiscale relationships between modalities can also be used to learn representations from multimodal neuroimaging data. To verify this, we extend the coordinated representation learning framework to a multiscale coordinated representation framework for multimodal data.

1.2 Contributions

First, we propose a multiscale coordinated framework as a family of models. The family is inspired by many published SSL approaches based on the maximization of mutual information, which we combine into a complete taxonomy. The methods within this taxonomy cover multiple inductive biases that can capture joint inter-modal and unique intra-modal information. In addition, they cover multiscale multimodal relationships in the input data.

Secondly, we provide a methodology to exhaustively evaluate the learned representations. We thoroughly investigate the models on the multimodal OASIS-3 dataset OASIS3 by evaluating performance on two classification tasks, measuring the amount of joint information in the representations of the two modalities, and interpreting the representations in the brain voxel-space.

Our results provide strong evidence that self-supervised models yield useful predictive representations for classifying a spectrum of Alzheimer's phenotypes. We show that self-supervised models are able to uncover regions of interest, such as the hippocampus yang2022human , thalamus elvsaashagen2021genetic , parietal lobule greene2010subregions , and occipital gyrus yang2019study on T1, and the hippocampus 10.3389/fnagi.2018.00037 ; zhang2021regional , middle temporal gyrus hu2022brain , subthalamus and hypothalamus chan2021induction , and superior medial frontal gyrus cheung2021diagnostic on fALFF. Furthermore, self-supervised models can uncover multimodal links between anatomy and brain dynamics that are also supported by previous literature, such as the thalamus (on fALFF) and precuneus (on T1) cunningham2017structural , the precuneus (T1) and hippocampus (fALFF) kim2013hippocampus ; ryu2010measurement , and the precuneus (T1) and middle cingulate cortex (fALFF) rami2012distinct ; bailly2015precuneus .

2 Materials and Methods

In this section, we first describe the foundation of our framework and then describe the evaluation of the learned representations.

2.1 Overview of coordinated representation learning

To help the reader understand the foundation of our framework, we start with the idea of coordinated representation learning baltruvsaitis2018multimodal .

Let $x = (x^{1}, x^{2})$ be an arbitrary sample of related input modalities collected from the same subject. Taken together, a set of $N$ samples comprises a multimodal dataset $D = \{x_i\}_{i=1}^{N}$. In our case, the modalities of interest are sMRI and rs-fMRI, represented as T1 and fALFF volumes, respectively; each modality is thus a 3D volumetric tensor. While inter-modal relationships can be as complex as a concurrent acquisition of neuroimaging modalities, we pair sMRI and rs-fMRI simply based on the temporal proximity between sessions of the same subject from the OASIS-3 OASIS3 dataset.

For each modality $m$, coordinated representation learning introduces an independent encoder $E_m$ that maps the input $x^{m}$ to a lower-dimensional representation vector $z^{m} = E_m(x^{m})$. Thus, each modality has a separate encoder and a corresponding representation.

In this work, we parameterize the encoder for each modality with a volumetric deep convolutional neural network. To learn each encoder's parameters, we optimize an objective function that incorporates the representations of all modalities and coordinates them. The objective encourages each encoder to learn an encoding for its respective modality that is informed by the other modalities. This cross-modal influence is learned during training and is captured in the parameters of each encoder. Hence, the representation of an unseen subject's modality will capture cross-modal influences without a contingency on the availability of the other modalities. A common choice is to coordinate representations across modalities by maximizing a similarity metric frome2013devise or the correlation via CCA dcca between the representation vectors.

To summarize, in coordinated representation learning, modality-specific encoders learn to generate representations in a cross-coordinated manner guided by an objective function.

The main limitation of the coordinated representation framework is its exclusive focus on capturing joint information between modalities instead of also capturing information that is exclusive to each modality. Thus, DCCA dcca and DCCAE DCCAE employ coordinated learning only as a secondary stage after pretraining the encoder. The first stage in these methods focuses on learning modality-specific information via layer-wise pretraining of the encoder in DCCA, or pretraining of the encoder with an autoencoder (AE) in DCCAE. On the other hand, deep collaborative learning (DCL) DCL attempts to capture modality-specific information using a supervised objective for phenotypical information with respect to each modality, in addition to CCA.

An additional limitation that previous work has not considered is the use of intermediate representations in the encoder. Intermediate representations have been integral to the success of the U-Net architecture in biomedical image segmentation ronneberger2015u and brain segmentation tasks henschel2020fastsurfer , to achieving state-of-the-art results with self-supervised learning on natural image benchmarks with DIM DIM and AMDIM amdim , and to achieving near-supervised performance in Alzheimer's disease progression prediction with self-supervised pretraining fedorov2019prediction .

In our work, we address these limitations by proposing a multi-scale coordinated learning framework.

2.1.1 Multi-scale coordinated learning

To motivate multi-scale coordinated learning, we re-introduce intermediate representations and explain how they can benefit multimodal modeling.

Each encoder produces intermediate representations. Specifically, if the encoder is a convolutional neural network (CNN) with $L$ layers, each subsequent layer $l$ in the CNN represents a larger part of the original input data. Furthermore, each of these scales, which correspond to the depth of a layer, is an increasingly non-linear transformation of the input and produces a more abstract representation of that input relative to the previous scales. The intermediate representations of layer $l$ are the convolutional features $C_l$, with $M_l$ locations and $n_l$ channels. These features are also often referred to as activation maps within the network. For example, if the input is a 3D cube, an activation map within a layer of the CNN will have size $s \times s \times s \times n_l$, and that intermediate representation will thus have $M_l = s^{3}$ locations. Each location in the intermediate representation has a receptive field araujo2019computing that captures a certain subset of the input sample. Each intermediate representation thus captures some of the input's local information, while the latent representation $z$ captures the input's global information.
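As a concrete illustration, the locations of a 3D convolutional activation map can be enumerated by flattening its spatial axes. This is a minimal sketch with hypothetical sizes (32 channels, a 4x4x4 spatial grid), not the encoder's actual dimensions:

```python
import numpy as np

# Hypothetical sizes: n_l channels over an s x s x s spatial grid of a 3D input.
n_l, s = 32, 4
rng = np.random.default_rng(0)
activations = rng.normal(size=(n_l, s, s, s))  # one layer's activation map

# Flatten the spatial axes: each row is one n_l-dimensional local
# representation ("location") with its own receptive field in the input.
local_reps = activations.reshape(n_l, -1).T    # shape (M_l, n_l)
M_l = local_reps.shape[0]                      # number of locations, s**3
```

Each of the `M_l` rows is a candidate local representation for the objectives defined below, while a separate latent vector summarizes the whole volume globally.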

With two scales and two modalities, we can define multi-scale coordinated learning based on four objectives, which are schematically shown in Figure 1. The Convolution-to-Representation (CR) objective captures modality-specific information as local-to-global intra-modal interactions. The Cross Convolution-to-Representation (XX) objective captures joint inter-modal local-to-global interactions between the local representations of one modality and the global representation of another modality. The Representation-to-Representation (RR) objective captures joint information between global inter-modal representations as global-to-global interactions. The Convolution-to-Convolution (CC) objective captures joint information between local inter-modal representations as local-to-local interactions.

Thus, we can capture modality-specific information and multimodal relationships at multiple scales. These extensions cover two of the previously mentioned limitations in the coordinated learning framework. Our extensions also allow us to define a full taxonomy of models that can be constructed based on these four principal interactions and to show how these compare to or supersede related work.

To construct an effective objective based on these multi-scale coordinated interactions, first, we will define an estimator of mutual information. This estimator will be used to define each of the four objectives as a mutual information maximization problem that can be used to encourage the interactions between the corresponding representations. Lastly, we explain how one can construct an objective for a multi-scale coordinated representation learning problem, based on a combination of the four basic objectives between global and local features, and, additionally, show how these compare to related work.

2.1.2 Mutual Information Maximization

To estimate the mutual information between random variables $u$ and $v$, we use a lower bound based on the noise-contrastive estimator (InfoNCE) CPCv1 :

$$\mathcal{I}(u; v) \geq \hat{\mathcal{I}}(u; v) = \mathbb{E}\left[ \log \frac{e^{f(u_i, v_i)}}{\frac{1}{N} \sum_{j=1}^{N} e^{f(u_i, v_j)}} \right], \tag{1}$$

where the samples construct pairs: $(u_i, v_i)$ sampled from the joint distribution (a positive pair) and $(u_i, v_j)$ with $j \neq i$ sampled from the product of the marginals (a negative pair). The $u$ and $v$ represent $d$-dimensional representation vectors and can be local or global representations. The function $f$ in equation 1 is a scoring function that maps its input vectors to a scalar value and is supposed to reflect the goodness of fit; it is also known as the critic function Tschannen2020On . The encoder is optimized to maximize the critic function for a positive pair and minimize it for a negative pair, such that $f(u_i, v_i) \gg f(u_i, v_j)$. Our choice of the critic function is the scaled dot-product amdim , defined as:

$$f(u, v) = \frac{u^{\top} v}{\sqrt{d}}. \tag{2}$$
2.1.3 Taxonomy for multi-scale coordinated learning

Figure 1: The concept behind multi-scale coordinated learning based on four principal relationships: Convolution-to-Representation (CR), Cross Convolution-to-Representation (XX), Representation-to-Representation (RR), and Convolution-to-Convolution (CC). Each colored vector in the convolutional activation maps corresponds to an arbitrary location in the feature maps of a layer of each modality's encoder. We use "local representation" to denote each location in the convolutional activation map: the vector spanning the channels. The latent representation vector $z$ is the $d$-dimensional global representation. To avoid clutter we display only a slice of the data, but layer activations occupy a volume per channel in our applications.
Figure 2: The complete taxonomy of interactions, based on the four principal interactions. The lower dots are the convolutional activations, whereas the upper dots are the global representations. The interactions are defined between the 1st modality (left) and the 2nd modality (right). The combinations represent the names of the models that are built from the four basic interactions: CR, XX, RR, and CC. The colors follow the colormap from Figure 1.

Given the mutual information estimator, we can construct four basic objectives and then use those to construct a full taxonomy of interactions, which is shown in Figure 2. Each option within the taxonomy specifies a unique optimization objective. Notably, the first row of the figure shows the principal losses, CR, XX, RR, and CC, that we defined before. The remaining parts of the taxonomy are constructed by summing compositions of the principal losses. For example, the 5th combination, RR-CC, is the sum of the two basic objectives RR and CC.

To discuss the options in the taxonomy, we first reintroduce some notation. A local representation is any arbitrary location $i$ in the convolutional feature map $C_l$ from a convolutional layer $l$ with $M_l$ locations. A location is represented as an $n_l$-dimensional vector, where $n_l$ is the number of channels of the convolutional activation map for layer $l$. The choice of the layer $l$ is a hyperparameter and can be guided by the following intuition. First, the chosen layer should not be the last layer, or be too close to it, because it will then capture information similar to the global representation; in this case, the local and global content of the input could be very similar, with almost or exactly the same receptive field. Although a single-layer difference between the local and global representations can be used as a strategy for layer-wise pretraining, as in Greedy Deep InfoMax (GIM) lowe2019putting , the limiting case is not meaningful. Secondly, the chosen layer should not be the first layer, or be too close to it. This would lead to a local representation with a very small receptive field that captures only hyper-local information of the input. The first layer essentially captures the intensity of a voxel, which has not worked well, as previously shown in DIM DIM .

The global representation $z$ is the encoder's latent representation that summarizes the whole input. The global representation is a $d$-dimensional vector, where $d$ is also a hyperparameter. This global representation defines a $d$-dimensional space wherein we compute scores with the critic function $f$. The local representation, however, is an $n_l$-dimensional vector. To overcome the difference in size, we add a local projection head $\phi_l$. This projection head takes the $n_l$-dimensional local representation from layer $l$ of the encoder and projects it into the $d$-dimensional space, so that we can compute scores with $f$ in this space. The projection is parameterized by a neural network, separate for each modality. In addition, we introduce a global projection head $\phi_g$. The local and global projection heads have been shown to improve training performance in DIM DIM and SimCLR simclr , respectively.
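To illustrate the role of the local projection head, here is a minimal sketch of a one-hidden-layer MLP mapping an $n_l$-dimensional local feature into the $d$-dimensional critic space. The sizes and the two-layer ReLU form are our assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_projection(x, W1, W2):
    """Hypothetical projection head phi: maps n_l-dimensional local
    representations into the d-dimensional space where critic scores
    are computed (one such head per modality)."""
    return np.maximum(x @ W1, 0.0) @ W2      # ReLU hidden layer, linear output

n_l, hidden, d = 32, 64, 16                  # assumed sizes
W1 = rng.normal(size=(n_l, hidden)) * 0.1
W2 = rng.normal(size=(hidden, d)) * 0.1

local_feats = rng.normal(size=(10, n_l))     # ten local representations
projected = mlp_projection(local_feats, W1, W2)  # now comparable with z
```

After projection, a local representation and a global representation live in the same $d$-dimensional space, so the scaled dot-product critic applies to any pair of them.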

The first objective (CR), in the top left corner of Figure 2, trains two independent encoders, one for each modality, with a unimodal loss function that maximizes the mutual information between the local and global representations. This objective directly implements the Deep InfoMax (DIM) DIM objective. The idea behind this approach is to maximize the information between the lowest and the highest scales of the encoder. In other words, the local representations are driven to be predictive of the global representation. The objective for an arbitrary layer $l$ is defined as:

$$\mathcal{L}_{CR}^{m} = \frac{1}{M_l} \sum_{i=1}^{M_l} \hat{\mathcal{I}}\left( \phi_l\left(C_{l,i}^{m}\right);\; \phi_g\left(z^{m}\right) \right),$$

where we define the objective for a single modality $m$; the objective has to be computed for each modality.

The CR objective can be extended to the multimodal case by measuring the inter-modal mutual information between the local and global representations of modalities $a$ and $b$, respectively. We call this multimodal objective Cross Convolution-to-Representation (XX); it is shown second from the top left in Figure 2. This objective has previously been used in the context of augmented multiscale DIM (AMDIM) amdim , cross-modal DIM (CM-DIM) sylvain2020zeroshot , and spatio-temporal DIM (ST-DIM) anand2019unsupervised . We define it as

$$\mathcal{L}_{XX}^{a \to b} = \frac{1}{M_l} \sum_{i=1}^{M_l} \hat{\mathcal{I}}\left( \phi_l\left(C_{l,i}^{a}\right);\; \phi_g\left(z^{b}\right) \right),$$

where the objective is defined for a pair of modalities $a$ and $b$ and has to be computed for all possible pairs of modalities. In the case of symmetric coordinated fusion, the symmetry has to be preserved for modalities $a$ and $b$ by computing both $\mathcal{L}_{XX}^{a \to b}$ and $\mathcal{L}_{XX}^{b \to a}$, whereas for asymmetric fusion this is not the case.

The third elementary objective measures the mutual information between the global representation $z^{a}$ of one modality and the global representation $z^{b}$ of another. This objective is called Representation-to-Representation (RR) and is the third in the top row of Figure 2. This interaction has been used in much prior contrastive multiview work cmc ; simclr ; moco ; SWAV and in DCCA dcca . The RR objective is defined as

$$\mathcal{L}_{RR}^{a,b} = \hat{\mathcal{I}}\left( \phi_g\left(z^{a}\right);\; \phi_g\left(z^{b}\right) \right),$$

where the objective is defined for a pair of modalities $a$ and $b$.

The fourth elementary objective is similar to RR, but maximizes the mutual information between two inter-modal local representations $C_{l,i}^{a}$ and $C_{k,j}^{b}$, where $i$ and $j$ are arbitrary locations in layers $l$ and $k$ of modalities $a$ and $b$, respectively. This objective is called Convolution-to-Convolution (CC) and is shown fourth from the top left in Figure 2. The CC objective has been used in AMDIM amdim , CM-DIM sylvain2020zeroshot , and ST-DIM anand2019unsupervised . Due to the large number of possible pairs of locations between the activation maps of the two encoders, we reduce the computational cost by sampling arbitrary locations, as proposed in AMDIM amdim . Thus, after sampling an arbitrary location from the convolutional activation map of one modality, we compute the objective analogously to XX, treating the sampled location as the global representation:

$$\mathcal{L}_{CC}^{a,b} = \frac{1}{M_l} \sum_{i=1}^{M_l} \hat{\mathcal{I}}\left( \phi_l\left(C_{l,i}^{a}\right);\; \phi_l\left(C_{k,j}^{b}\right) \right),$$

where $j$ is a location sampled from the $M_k$ locations. The objective is defined for a pair of modalities $a$ and $b$.
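The location-sampling trick can be sketched as follows; the feature sizes are illustrative assumptions, and the features are taken to be already projected into the shared $d$-dimensional critic space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Projected local features of two modalities in the shared d-dim critic space.
M_l, M_k, d = 64, 64, 16                     # illustrative sizes
A = rng.normal(size=(M_l, d))                # modality a, layer l
B = rng.normal(size=(M_k, d))                # modality b, layer k

# AMDIM-style reduction: instead of scoring all M_l * M_k location pairs,
# sample one location of modality b, treat it like a global representation,
# and score it against every location of modality a with the critic.
j = rng.integers(0, M_k)
scores = A @ B[j] / np.sqrt(d)               # (M_l,) critic scores vs B[j]
```

The sampled scores then feed the same InfoNCE estimator as in the XX objective, so the cost scales with $M_l$ rather than $M_l \cdot M_k$.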

By combining these four primary objectives, we can construct more complicated objectives, as shown in Figure 2. For example, the XX-CC objective for two modalities $a$ and $b$ can be written as

$$\mathcal{L}_{XX\text{-}CC} = \mathcal{L}_{XX}^{a \to b} + \mathcal{L}_{XX}^{b \to a} + \mathcal{L}_{CC}^{a,b} + \mathcal{L}_{CC}^{b,a}.$$

The goal is to find parameters that maximize $\mathcal{L}_{XX\text{-}CC}$. The objective is repeated with the modalities flipped to preserve the symmetry of XX and CC. Removing the symmetry is intuitively similar to guiding the representations of one modality by the representations of the other, which may be interesting for future work on asymmetric fusion. The XX-CC objective coordinates representations locally with the CC objective on the convolutional activation maps and coordinates representations across scales in the encoder with XX. The local representations of one modality should be predictive of the global and local representations of the other modality.

2.1.4 Baselines and other objectives

Figure 3: This figure shows schemes for the following models: Supervised, autoencoder (AE), deep canonically correlated autoencoder (DCCAE), the CR objective combined with CCA (CR-CCA), and the RR objective combined with an autoencoder (RR-AE). The Supervised and CR-CCA objectives follow a structure similar to the schemes represented in our taxonomy; see Figure 2. The AE-based models (AE, DCCAE, RR-AE) are represented by a three-dot scheme in which the middle dot is the representation, the lower dots are the convolutional activation maps of the input, and the upper dots are reconstructions of the input.

We compare our method to an autoencoder (AE), a deep canonically correlated autoencoder (DCCAE) DCCAE , and a supervised model. Each is a high-performing representative of the three main categories of alternative approaches to our framework. The AE and supervised models are trained separately for each modality, while the DCCAE is trained jointly on all modalities. By Supervised, we refer to a unimodal model that is trained to predict a target using the cross-entropy loss.

In addition to defining a unified framework that covers multiple existing approaches, our taxonomy contains a novel unpublished approach that combines different combinations of the four objectives. One novel approach combines the CR objective with the objective of the DCCAE, which we call CR-CCA. The CR objective allows us to train using modality-specific information, and the CCA objective aligns the representations between modalities. This leads to the following objective:

$\mathcal{L}_{\text{CR-CCA}} = \mathcal{L}_{\text{CR}}^{(1)} + \mathcal{L}_{\text{CR}}^{(2)} + \mathcal{L}_{\text{CCA}}$

A second novel approach combines the AE objective with our RR objective to create the RR-AE objective. The AE objective ensures the learning of modality-specific representations, and the RR objective enforces the alignment of representations across modalities, similar to the CCA objective in the DCCAE. The final objective of the RR-AE is as follows:


$\mathcal{L}_{\text{RR-AE}} = \mathcal{L}_{\text{RR}} + \mathcal{L}_{\text{AE}}^{(1)} + \mathcal{L}_{\text{AE}}^{(2)},$

where $\mathcal{L}_{\text{AE}}^{(m)}$ is the mean squared reconstruction error for the AE with an additional decoder $d_m$ for modality $m$:

$\mathcal{L}_{\text{AE}}^{(m)} = \lVert x_m - d_m(E_m(x_m)) \rVert_2^2 .$

The baseline schemes are shown in Figure 3.

2.2 Analysis of the model representation

It is important to validate the representations that the model learns, which in the proposed framework is done in three steps. The first step evaluates the representations using classification tasks with logistic regression on the features extracted by the encoder from the input; the goal is to verify the discriminative power of the pre-trained features. In the second step, we compute the similarity between representations to measure how much joint information the model has captured. The third and last step consists of two analyses that relate the latent space to the brain space, assessing voxel-wise group differences based on saliency gradients for each of the 64 dimensions of the representation.

2.2.1 Classification evaluation

To evaluate the discriminative performance of the representations captured by the model, we train logistic regression on frozen representations from the last layer of the encoder. Note that most self-supervised learning algorithms evaluate the discriminative power of representations with a linear evaluation protocol based on linear probes linear_probes . We chose logistic regression instead because of its faster training time.

2.2.2 Alignment analysis

To evaluate the alignment between representations of different modalities, we use centered kernel alignment (CKA) CKA . CKA has been shown to be effective CKA at identifying the correspondence between representations of networks with different initializations, compared to CCA-based similarity measures SVCCA ; PWCCA . CKA can be considered a normalized version of the Hilbert-Schmidt Independence Criterion (HSIC) gretton2005measuring . The CKA measure for a pair of modalities with representation matrices $X$ and $Y$ is defined as:

$\mathrm{CKA}(X, Y) = \frac{\lVert Y^\top X \rVert_F^2}{\lVert X^\top X \rVert_F \, \lVert Y^\top Y \rVert_F}$

where $d$ is the dimension of the latent representation, $X, Y \in \mathbb{R}^{n \times d}$ are matrices of $d$-dimensional representations for $n$ samples, and $\lVert \cdot \rVert_F$ is the Frobenius norm.

Our results are only evaluated using CKA since we find it fedorov2021self to be the most robust to noise, which reinforces findings in previous literature CKA ; CKAstructure that suggest the same.
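The linear form of CKA defined above can be sketched in a few lines of NumPy; the function name is ours, and the matrices are assumed to hold one sample per row:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X, Y: (n_samples, dim) arrays of representations; each feature
    dimension is centered across samples before computing the score.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```

A useful sanity check is that linear CKA equals 1 for identical representations and is invariant to orthogonal transformations and isotropic scaling, which is what makes it suitable for comparing encoders with different initializations.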

2.2.3 Saliency explanation of the representation in brain space

To explain the representations in brain space, we adapt the integrated gradients algorithm sundararajan2017axiomatic . We want to understand the representations rather than the saliency of a specific label. Hence we propose a simple adaptation. Instead of using a target variable, we compute gradients with respect to each dimension of the representation. This is done by setting the specific dimension in the vector to 1 and all other dimensions to 0.
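The adaptation above can be sketched as follows. The encoder here is a hypothetical toy stand-in, and numerical central differences replace the autodiff gradients an actual implementation would use; the one-hot selection of a representation dimension corresponds to indexing the output with `dim`:

```python
import numpy as np

def encoder(x, W):
    # Toy stand-in for a pre-trained encoder: a nonlinear map to a
    # low-dimensional representation (hypothetical, for illustration).
    return np.tanh(W @ x)

def integrated_gradients_for_dim(f, x, baseline, dim, steps=64):
    """Integrated gradients of representation dimension `dim` w.r.t. x.

    Instead of a class logit, the scalar target is f(.)[dim], i.e. the
    dimension selected by setting one entry of the output vector to 1
    and all others to 0, as described above.
    """
    eps = 1e-5
    total = np.zeros_like(x)
    for step in range(1, steps + 1):
        # Midpoint rule along the straight path from baseline to x.
        point = baseline + ((step - 0.5) / steps) * (x - baseline)
        grad = np.zeros_like(x)
        for i in range(x.size):
            bump = np.zeros_like(x)
            bump[i] = eps
            grad[i] = (f(point + bump)[dim] - f(point - bump)[dim]) / (2 * eps)
        total += grad
    return (x - baseline) * total / steps
```

By the completeness axiom of integrated gradients, the attributions sum approximately to f(x)[dim] - f(baseline)[dim], which gives a quick correctness check.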

3 Experiments and results

Figure 4: The figures show the learning framework for the CR-based objective with the T1 image. It includes an encoder with DCGAN dcgan architecture, local and global projection heads, and the computation of the critic function. The evaluation of the representation is performed with a frozen pre-trained encoder. First, with logistic regression, we evaluate the downstream performance. Secondly, with alignment analysis, we explore the multimodal properties of the representation. Finally, we interpret the representation in the brain space with saliency gradients.

In this section, we present our findings on two classification tasks with the OASIS-3 dataset, investigating the relative performance of self-supervised and fully supervised approaches. In addition, to understand the inductive biases of the different multimodal objectives, we compute CKA to measure the joint information captured in representations across modalities. Second, we compare group differences between the supervised and the best self-supervised model from the first stage. Lastly, we explore multimodal links between T1 and fALFF.

The overall scheme of our experimental setup is shown in Figure 4; it consists of initial pretraining followed by evaluation on the classification task, alignment of the representations, and analysis of the representations through saliency in brain space.

3.1 Dataset

Here we validate our method on OASIS-3 OASIS3 , which is a multimodal neuroimaging dataset including multiple Alzheimer’s disease phenotypes.

Each subject in this dataset is represented by a T1 volume and a fractional amplitude of low-frequency fluctuation (fALFF) zou2008improved volume, which is generated from T1w and resting-state fMRI (rs-fMRI) images. The T1 volume accounts for the anatomy of the brain, while the fALFF volume captures the resting-state dynamics. Both T1 and fALFF volumes have previously been shown to be informative for studying not only Alzheimer's disease he2007regional but also other conditions (e.g., chronic smoking wang2017altered ).
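For intuition, the fALFF of a single voxel time series can be sketched as the fraction of FFT amplitude falling into the low-frequency band, following the definition of zou2008improved ; the function name and defaults are illustrative:

```python
import numpy as np

def falff(timeseries, tr, low=0.01, high=0.1):
    """Fractional ALFF of one voxel time series (illustrative sketch).

    Sum of FFT amplitudes in the low-frequency band divided by the sum
    of amplitudes over the whole detectable frequency range.
    `tr` is the repetition time in seconds.
    """
    ts = timeseries - timeseries.mean()       # remove the DC component
    amplitudes = np.abs(np.fft.rfft(ts))
    freqs = np.fft.rfftfreq(ts.size, d=tr)
    band = (freqs >= low) & (freqs <= high)
    return amplitudes[band].sum() / amplitudes.sum()
```

A signal dominated by fluctuations inside 0.01-0.1 Hz yields a value near 1, while a signal dominated by faster fluctuations yields a value near 0.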

The T1w images were brain-masked with BET in FSL fsl (v 6.0.20), linearly transformed to MNI space, and subsampled to 3mm after preprocessing. 15 T1w images were discarded because they did not pass the initial visual quality check. The rs-fMRI was registered to the first image using MCFLIRT in FSL fsl (v 6.0.20). The specific parameters for MCFLIRT are: a 3-stage search level (8mm, 4mm, 4mm); 20mm field-of-view; 256 histogram bins (for matching); 6 degrees-of-freedom (DOF) for transformation; a scaling factor of 6mm; the normalized correlation values across the volumes as the cost function (smoothed to 1mm); and spline interpolation. The fALFF maps were computed using REST rest within the 0.01 to 0.1 Hz power band. The final volumes have the same size for both modalities. Although registration to MNI space is not required, we perform it to simplify the analysis and interpretability of our method. Otherwise, we minimized the preprocessing of the data so that the neural network can learn from as much of the information in the original data as possible. In addition, the subsampling to 3mm was done to reduce computational needs, while applications at the original finer resolution are considered future work.

Non-Hispanic Caucasian subjects form the largest cohort () in the dataset. Thus, we selected the non-Hispanic Caucasian subjects ( HC, AD, unlabeled) as the main set for pretraining. OASIS-3 OASIS3 contains a large number of subjects that are neither classified as AD nor can readily be called controls. These subjects belong to one of 21 diagnostic categories, including some forms of cognitive impairment, frontotemporal dementia (FTD), diffuse Lewy body disease (DLBD), and vascular dementia, drawn from a preclinical cohort with longitudinal follow-up. We combined all such subjects into a separate third class.

After matching up the scans of each modality that are closest in date out of all available scans for a subject, the final dataset contains pairs. The pairs are split into stratified folds ( subjects ( pairs), ()), and hold-out — (). The number of pairs is greater than the number of subjects because some subjects have multiple scans. Thus, we utilized more pairs during pretraining but used only one pair of images for each subject in the final evaluation. For the 2-way classification, we do not use unlabeled data, while for 3-way, we use unlabeled data as a "noisy" phenotypic third-class.

Before feeding the images into the neural network, the intensities of the T1 and fALFF volumes were normalized using min-max rescaling to the unit interval ([0, 1]). During pretraining, we augment the dataset with random crops of size 64 after reflective padding of size eight on all sides. The decisions for preprocessing and augmentation were based on evaluations of the supervised baseline. We also considered histogram standardization, z-normalization, random flips, and a balanced data sampler balanced_data_sampler . However, the results were not substantially different; thus, to reduce the computational cost, we use only simple min-max rescaling and random crops.
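The augmentation step above can be sketched as follows (a simplified NumPy version of the reflective-pad-then-crop scheme; in practice this would run inside the data loader):

```python
import numpy as np

def random_crop_with_reflect_pad(volume, crop=64, pad=8, rng=None):
    """Reflective padding of `pad` voxels on all sides, followed by a
    random crop of size `crop`^3, as used during pretraining."""
    if rng is None:
        rng = np.random.default_rng()
    padded = np.pad(volume, pad, mode="reflect")
    # Choose a random corner so the crop fits inside the padded volume.
    starts = [rng.integers(0, s - crop + 1) for s in padded.shape]
    return padded[tuple(slice(st, st + crop) for st in starts)]
```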

3.2 Learning and evaluating representation

3.2.1 Pretraining

To train our model, shown schematically in Figure 4, we have to choose an architecture for the encoder and for the global and local projection heads. The local projection head projects the local representation to a -dimensional channel space so that the critic scores between the global and local representations are computed in the same space. The global projection head is needed for the optimization process: as shown by the authors of SimCLR simclr , the last projection to the representation can develop a low-rank condition that is beneficial to the optimization of the objective but can be destructive to the representation.

For our encoder, we choose the architecture from deep convolutional generative adversarial networks (DCGAN) dcgan . This architecture provides a simple, fully convolutional structure and has a specialized decoder, which is important for the performance of generative approaches. We used volumetric convolutional layers for the experiments with the neuroimaging OASIS-3 dataset. Most of the hyperparameters were left as in the original work dcgan . We swapped the last tanh activation function in the decoder with a sigmoid because the intensities of the input images are scaled to the unit interval. The last layer projects the activations of the previous layer to the final -dimensional representation vector, which we call global. All convolutional layers are initialized with Xavier uniform xavier_uniform initialization and a gain related to the activation function paszke2019pytorch . Each modality has its own encoder with the DCGAN architecture.

For the local projection head, we choose an architecture similar to AMDIM amdim . The projection head is one ResNet block applied to the third layer of the DCGAN architecture with feature size . One direction in the block consists of convolutional layers (kernel size , number of output and hidden channels , Xavier uniform initialization xavier_uniform ). The second direction consists of one convolutional layer (kernel size , number of output channels , initialized as identity). The projection heads are separate for each modality, and we add them only if the model has a CC objective.

For the global projection head, we follow SimCLR simclr . We perform a hyperparameter search over the number of hidden layers in the projection head for each model that can use one (except Supervised, AE, and CC). We considered the following cases: no projection head, a linear projection head, and projection heads with 1, 2, or 3 hidden layers. The number of output dimensions in the projection layers equals .

In addition, following AMDIM amdim , we regularize the InfoNCE objective by penalizing the squared scores computed by the critic function with weight , and by clipping the scores with .
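This regularized objective can be sketched as follows. The penalty weight and clipping threshold below are illustrative placeholders (the actual values are elided above), and tanh is used as a soft clip in the AMDIM style:

```python
import numpy as np

def infonce_with_score_regularization(scores, lam=4e-2, clip=20.0):
    """InfoNCE loss with score regularization (illustrative sketch).

    `scores` is an (n, n) matrix of critic scores for n positive pairs
    (diagonal entries) and n*(n-1) negatives (off-diagonal entries).
    `lam` and `clip` are placeholder hyperparameters, not the paper's.
    """
    # Penalize large squared scores before clipping.
    penalty = lam * np.mean(scores ** 2)
    # Soft-clip the scores to [-clip, clip] with tanh.
    clipped = clip * np.tanh(scores / clip)
    # Standard InfoNCE: cross-entropy of each row against its diagonal.
    logits = clipped - clipped.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs)) + penalty
```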

We train the models on the OASIS-3 dataset with the RAdam optimizer liu2019variance with learning rate (). The pretraining step in our framework was performed for epochs. For each trained model, we saved checkpoints based on the best validation loss.

3.2.2 Evaluation

After the pretraining step, to evaluate the discriminative performance of the representations learned by the various objectives in our taxonomy, we perform two classification tasks. The first task is binary classification of Alzheimer's disease (AD) vs. the healthy cohort (HC). The second task is ternary classification with an additional phenotypic class, which makes it the harder of the two.

Logistic regression is used to evaluate the discriminative performance of the learned representations. It is trained on the global representation extracted with a pre-trained encoder; we use the implementation from scikit-learn scikit-learn . The hyperparameters of the logistic regression were optimized with Optuna optuna_2019 for iterations, with selections based on the validation dataset. The search space is defined as follows: the inverse regularization strength is sampled log-uniformly from the interval , and the mixing parameter of the elastic net penalty is sampled uniformly from the unit interval. The logistic regression is trained using the SAGA solver defazio2014saga . We use ROC AUC and one-vs-one (OVO) macro ROC AUC hand2001simple as the scoring functions of the hyperparameter search for binary and ternary classification, respectively. The OVO strategy for ROC AUC in multiclass classification computes the average AUC over all possible pairwise combinations of classes; with macro averaging, it is insensitive to class imbalance scikit-learn . Classification is performed separately for each modality by training logistic regression on the representations extracted from that modality with the corresponding convolutional encoder.
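The OVO macro metric described above can be sketched with a rank-based AUC; this is an illustration of the averaging scheme, not the scikit-learn code used in the experiments:

```python
import numpy as np
from itertools import combinations

def roc_auc(y_true, y_score):
    """Binary ROC AUC via the rank (Mann-Whitney) formulation."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Probability that a random positive outranks a random negative.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def ovo_macro_auc(y_true, scores, classes):
    """One-vs-one macro ROC AUC: average AUC over all class pairs.

    `scores` is an (n_samples, n_classes) matrix of per-class scores.
    For each unordered pair (a, b), restrict to samples of those two
    classes, average AUC(a vs b) and AUC(b vs a), then macro-average.
    """
    aucs = []
    for a, b in combinations(range(len(classes)), 2):
        mask = np.isin(y_true, [classes[a], classes[b]])
        auc_ab = roc_auc((y_true[mask] == classes[a]).astype(int), scores[mask, a])
        auc_ba = roc_auc((y_true[mask] == classes[b]).astype(int), scores[mask, b])
        aucs.append(0.5 * (auc_ab + auc_ba))
    return float(np.mean(aucs))
```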

After running logistic regression for each of the ten checkpoints, we select the checkpoint with the maximum ROC AUC on the validation set for the binary case and the maximum OVO macro ROC AUC for the ternary case. After choosing the checkpoint, for models that need a projection head during pre-training, we choose the projection head based on cross-validation. Since multimodal models are paired, we select the checkpoint based on the average performance on both modalities, while for unimodal models, we pair checkpoints by the number of training epochs, so models trained equally long are paired together.

Lastly, the CKA alignment score is computed on the global representations to measure the joint information content between modalities, as a measure of the inductive bias of the training objective.

3.2.3 Results

Figure 5: The ROC AUC performance of logistic regression on the binary classification (top right corner) and ternary classification (bottom left corner) tasks. Markers correspond to the median ROC AUC, and error bars to the IQR. The X- and Y-axes correspond to the ROC AUC on the T1 and fALFF modalities. The ROC AUC was measured as a one-versus-one (OVO) macro metric for ternary classification. The two classification tasks are shown on the same plot to visualize the generalizability of the learned representations across tasks of different difficulty. The dashed line represents the diagonal of balanced performance between T1 and fALFF. The CKA metric shows the alignment of the representations between modalities as a measure of joint information: lower CKA values mean less joint information between representations, and higher values mean more. Most of the self-supervised models outperform Supervised on both classification tasks. The objectives that include the multi-scale local-global relationship XX perform robustly on both tasks and retain their ranking relative to the other models, showing that XX is an important building block for multimodal data. A higher content of joint information measured via CKA seems to help models on binary classification but less on the harder ternary classification; thus it does not explain the predictive performance, while the multi-scale local-global relationship XX can.

The classification results on the hold-out test dataset are shown for both tasks in Figure 5. The performance is reported as the median and interquartile range (IQR) of the ROC AUC and one-versus-one (OVO) macro ROC AUC hand2001simple metrics for the binary and ternary classification tasks, respectively. Additionally, we report CKA as the measure of joint information between the representations of different modalities for each model.

Overall, the Supervised model outperforms self-supervised models on T1 for both tasks: 86.2 (86.1-86.8) for 2-way classification and 72.9 (72.0-77.0) for 3-way classification. However, the performance gap on T1 is small as the self-supervised multimodal model RR-XX-CC achieves 84.0 (83.0-84.9) for binary and 70.9 (67.9-72.1) for ternary classifications.

Most self-supervised models achieve better classification performance on the "noisier" fALFF than the Supervised model in the 2-way classification task. For 3-way classification the gap is reduced, while the RR-XX model achieves 63.2 (59.1-67.8) versus 62.4 (58.7-70.1) for Supervised. This supports the benefits of multimodal learning, which could be seen as a regularization effect.

The unimodal autoencoder (AE) model performs well on the simple binary classification task; however, its performance drops significantly on the harder ternary classification task. Unimodal models such as CR and AE are outperformed by most of the multimodal models. Evidently, the multimodal extension of the AE with CCA (the DCCAE) improves performance. However, the DCCAE is outperformed by most self-supervised decoder-free models on the 2-way classification task and by XX-CC, RR-XX, RR-XX-CC, and XX on the 3-way classification task. We therefore achieve more robust performance with the proposed models while avoiding the computational cost of a decoder for each modality.

Overall, the proposed self-supervised models XX, XX-CC, RR-XX and RR-XX-CC perform robustly on both tasks and retain their ranking relative to the other models. Additionally, judging by the higher CKA alignment measure (0.63-0.73), these models capture joint information between modalities. Other models (RR-AE, RR-CC and RR) achieve even higher CKA alignment yet are not as robust. We hypothesize that joint information alone is not the answer; the architecture of the model also matters. Note that the XX, XX-CC, RR-XX and RR-XX-CC models capture local-global relationships between modalities, while the RR-AE, RR-CC and RR models only capture joint information at the global-global or local-local representation level. Thus, given the empirical evidence in Figure 5, the local-global relationship XX is an essential building block for multimodal data because it captures complex multi-scale relationships between modalities.

3.3 Interpretability

3.3.1 Explaining group differences between HC and AD

In this subsection, we explain the performance of the models by analyzing saliency maps. As points of interest and comparison, we select the Supervised and RR-XX models. The Supervised model performs best on T1 input volumes and utilizes the target labels; thus it is a solid baseline for analyzing group differences. The RR-XX model performs best in the ternary classification task and does particularly well on fALFF input volumes. We use these two models to generate saliency maps and interpret what the models have learned.

For each selected model, we compute integrated gradients sundararajan2017axiomatic along each dimension of the 64-dimensional representation and discard the negative gradients. After computing saliency gradients for each dimension of the latent representation, we apply brain masking, rescale the gradient values to the unit interval, and smooth them with a Gaussian filter (). Then we perform a voxel-wise two-sided Mann-Whitney U test and compute the rank-biserial correlation (RBC) as an effect size. After selecting the voxels with , we find clusters with at least 200 voxels using 3dClusterize from AFNI cox1996afni . We then apply whereami from AFNI cox1996afni to match those clusters with the ROIs defined in the template used in the Neuromark pipeline du2020neuromark . We call this template the Neuromark atlas in the rest of the text. To create the Neuromark atlas, the spatial ICA components of the Neuromark template were combined by simple overlapping, and the atlas was then added to the AFNI cox1996afni environment.
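The effect-size step can be sketched as follows; RBC is computed here from the Mann-Whitney U statistic through the common-language effect size, a standard identity:

```python
import numpy as np

def rank_biserial_correlation(x, y):
    """Rank-biserial correlation as the effect size of a Mann-Whitney U test.

    Computed from the common-language effect size f = U1 / (n1 * n2) as
    RBC = 2*f - 1, ranging from -1 (all y > x) to +1 (all x > y).
    Here x and y are the values of one voxel in the two groups.
    """
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    u1 = greater + 0.5 * ties
    return 2.0 * u1 / (x.size * y.size) - 1.0
```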

We select only the top brain ROI for each cluster based on the overlap and include only the ROIs found in at least two folds. The overlap of the clustered saliencies with the ROIs is measured using DICE. The final results are summarized over all dimensions in Figure 6, where we report the maximum DICE overlap for both models and both modalities.
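The DICE overlap between a saliency cluster and an atlas ROI is the standard set-overlap measure; a minimal sketch over binary voxel masks:

```python
import numpy as np

def dice_overlap(cluster_mask, roi_mask):
    """DICE coefficient between a binary saliency-cluster mask and an
    ROI mask: 2|A intersect B| / (|A| + |B|)."""
    a = cluster_mask.astype(bool)
    b = roi_mask.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 0.0
    return 2.0 * np.logical_and(a, b).sum() / denom
```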

Figure 6: The figure shows the maximum DICE overlap of the saliency clusters with regions in the Neuromark atlas. The first row shows the DICE maps for sMRI data, and the second row shows the maps for fALFF. The first column shows maps for the Supervised model, and the second column shows maps for the RR-XX model. Supervised shows a sparse choice of discriminative regions on T1, overall higher DICE, and stronger contrast. RR-XX shows a sparse choice of discriminative regions on fALFF, overall lower DICE, and lower contrast.

The results in Figure 6 suggest that higher discriminative performance is related to a sparse choice of ROIs. Specifically, the Supervised model seems to be sparser than the RR-XX model on T1 volumes, where its binary ROC AUC performance is higher. The RR-XX model, in turn, seems sparser than Supervised on fALFF data, where its 2-way ROC AUC performance is higher.

Another result in Figure 6 is that the Supervised model has higher DICE and stronger contrast compared to RR-XX. This suggests stronger localization of the saliency maps for the Supervised model, which can be explained by its use of labels to learn the representation. Since a self-supervised model learns without labels, the lower contrast supports the idea that self-supervised learning yields a less task-specific, more general representation.

In addition, in Figure 6, the models tend to capture information from the frontal lobe regions and less from the posterior part of the brain in their fALFF representations.

3.3.2 Detailed analysis of group differences with best self-supervised and supervised models

Figure 7: The figures show the regions of interest of the Neuromark atlas that correspond to clusters (>200 voxels) in the voxel-wise RBC maps with the highest DICE overlap for the Supervised model. We show the peak RBC value of each cluster. The maps are chosen for the dimensions of the representation that correspond to the highest positive and lowest negative betas in the trained logistic regression. The regions found for T1 are shown in the left column and those for fALFF in the right column. Supervised finds the precuneus and anterior cingulate cortex.
Figure 8: The figures show the regions of interest of the Neuromark atlas that correspond to clusters ( voxels) in the voxel-wise RBC maps with the highest DICE overlap for the RR-XX model. We show the peak RBC value of each cluster. The maps are chosen for the dimensions of the representation that correspond to the highest positive and lowest negative betas in the trained logistic regression. The regions found for T1 are shown in the left column and those for fALFF in the right column. RR-XX finds many discriminative regions that are supported by the literature, such as the hippocampus, thalamus, parietal lobule, occipital gyrus, middle temporal gyrus, superior medial frontal gyrus, and subthalamus/hypothalamus.

In this analysis, we compare saliency maps for group differences between the Supervised and RR-XX models, specifically for the most discriminative dimensions of the representation vector. The discriminative dimensions are those with the highest and lowest beta coefficients in the trained logistic regression. After selecting the dimensions, we compute saliencies and perform a voxel-wise test as in the previous subsection to get RBC maps with significant voxels.

In Figure 7 we show RBC maps for the Supervised model, and in Figure 8 we show RBC maps for RR-XX, for the first fold on the hold-out test set only.

The Supervised model has bigger clusters on T1, while the self-supervised model RR-XX has more local, smaller clusters. Given that Supervised outperforms RR-XX on T1, these RBC maps might explain the performance gap in 2-way classification. However, given the reduced gap in 3-way classification, they might also indicate that the Supervised model overfits the task and uses more regions than needed.

As we can see in Figure 8, the self-supervised model RR-XX is able to pick out discriminative regions on T1 that are supported by the literature, such as the hippocampus yang2022human , thalamus elvsaashagen2021genetic , parietal lobule greene2010subregions , and occipital gyrus yang2019study . The regions found for Supervised are also supported by the literature: the precuneus guennewig2021defining and anterior cingulate cortex yu2021human .

On fALFF, the Supervised and RR-XX models share similar behavior, including more local, smaller clusters. The Supervised model highlights the precuneus, consistent with prior work focused on fALFF wang2021comparative . The RR-XX model highlights the hippocampus 10.3389/fnagi.2018.00037 ; zhang2021regional , middle temporal gyrus hu2022brain , subthalamus/hypothalamus chan2021induction , and superior medial frontal gyrus cheung2021diagnostic .

Figure 9: The figure shows multimodal links between T1 and fALFF ROIs of the Neuromark atlas for the RR-XX model, with the Supervised model shown in the top left corner. The ROIs for T1 are shown on the left side in shades of blue, and the ROIs for fALFF on the right side in shades of red. The edge weights are defined by the correlation between dimensions of the representation vectors of T1 and fALFF and colored according to the spectral color bar. RR-XX finds multiple multimodal links between regions supported by the literature, such as thalamus-precuneus, precuneus-hippocampus, and precuneus-middle cingulate cortex.

3.3.3 Exploring multimodal links

This section explores multimodal links between the T1 and fALFF modalities. To perform this analysis, we compute an asymmetric correlation matrix between all pairs of dimensions of the -dimensional global representations of T1 and fALFF. We then select one ROI in the Neuromark atlas and find the dimension of the representation vector whose cluster in the RBC maps has the highest DICE overlap with this ROI. After finding this dimension, we find a second dimension from the other modality with the highest positive and negative correlation in the correlation matrix. We then connect the first ROI to each ROI captured by the second dimension with an edge weighted by the correlation value. We repeat the same procedure for each of the 53 ROIs in the Neuromark atlas and each modality.
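The first steps of this procedure can be sketched as follows; the helper names are ours, and `strongest_partner` mirrors the selection of the most positively and most negatively correlated dimension in the other modality:

```python
import numpy as np

def cross_modal_correlation(z_t1, z_falff):
    """Asymmetric correlation matrix between representation dimensions.

    Entry (i, j) is the Pearson correlation between T1 dimension i and
    fALFF dimension j across subjects; inputs are (n_subjects, dim).
    """
    zt = (z_t1 - z_t1.mean(0)) / z_t1.std(0)
    zf = (z_falff - z_falff.mean(0)) / z_falff.std(0)
    return (zt.T @ zf) / z_t1.shape[0]

def strongest_partner(corr, dim):
    """For one T1 dimension, return the fALFF dimensions with the
    highest positive and the most negative correlation."""
    return int(np.argmax(corr[dim])), int(np.argmin(corr[dim]))
```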

The final summary of the multimodal relationships is shown in Figure 9. Note that we show the top 64 edges with the largest absolute weights, and we focus specifically on the self-supervised multimodal model RR-XX. We also show the same diagram for the Supervised model in a restricted way, because its correlations are much lower and the model learns representations unimodally, so the relationships are more likely to be spurious. In the figure, we show ROIs for T1 on the left side in blue hues and for fALFF on the right side in red hues. Additionally, we group ROIs by the functional networks defined in the Neuromark atlas.

One positively correlated link (Pearson's ) found by RR-XX is thalamus 5 (fALFF) - precuneus 48 (T1), which has been associated with changes in consciousness cunningham2017structural . Another, negatively correlated link (Pearson's ) between precuneus 48 (T1) and hippocampus 37 (fALFF) has been associated with Alzheimer's disease kim2013hippocampus ; ryu2010measurement . The negatively correlated link (Pearson's ) between precuneus 48 (T1) and middle cingulate cortex 39 (fALFF) could be related to findings in rami2012distinct ; bailly2015precuneus .

Overall, the self-supervised model RR-XX can learn meaningful multimodal relationships that clinicians can further explore.

3.4 Hardware, reproducibility, and code

The experiments were performed using an NVIDIA V100. The code is implemented mainly using the PyTorch (paszke2019pytorch, ) and Catalyst (catalyst, ) frameworks. The code is available at for reproducibility and further exploration by the scientific community.

4 Discussion

4.1 Multi-scale coordinated self-supervised models

The proposed self-supervised multimodal multi-scale coordinated models capture useful representations and multimodal relationships in the data. Compared to their existing unimodal (CR and AE) and multimodal (DCCAE) counterparts, these models achieve higher discriminative ROC AUC performance on downstream tasks, while some of them capture more joint information between modalities as measured by CKA. Furthermore, compared to the Supervised model, these models produce representations with competitive performance on T1 that outperform it on fALFF.

We show strong empirical evidence that XX is the most important relationship to encourage when high discriminative performance is the goal. This result is evidence of the importance and existence of multi-scale local-to-global multimodal relationships in functional and structural neuroimaging data, which the other relationships cannot capture.

However, not all multimodal variants from the taxonomy of Figure 2 result in robust and useful representations. Specifically, our experiments show that the CC relationship should not be used separately from the other objectives: CC alone optimizes only the layers below the chosen one, because the last layer then behaves as a random projection. We include it only for a complete picture of the classification performance achievable with all objectives in the taxonomy.

The CCA-based objectives did not show results as good as would be expected based on the current literature. However, our taxonomy revealed that the DCCA is related to SimCLR simclr , which led us to develop the RR model. While CCA maximizes correlations, SimCLR maximizes cosine similarity between the representations of different modalities. The SimCLR objective has one more important difference: it performs an additional discrimination step on the cosine similarity scores. Thus it does two things: it maximizes the similarity between modalities and simultaneously discriminates pairs based on that similarity. This task is more challenging because the model must capture richer information to classify pairs of representations from different modalities based on similarity. In addition, the CCA objective is prone to numerical instability due to its implementation in the DCCAE DCCAE ; RR does not have such issues. We recommend using "softer" optimization based on mutual information estimators with deep neural networks rather than the "exact" linear-algebra solutions of the DCCAE DCCAE .

While the AE imposes additional computational complexity due to its decoder, it has not shown benefits to the discriminative performance of the model; specifically, the AE struggles with the ternary classification task. Multimodal models from our proposed taxonomy have a reduced computational burden because they lack a volumetric decoder. These findings concur with the poor performance of autoencoders on datasets of natural images DIM . We hypothesize that autoencoders may require encoders and decoders of high capacity to achieve greater performance; however, this would considerably increase the difficulty of training large volumetric models.

4.2 Future Work

The models constructed in this work do not disentangle the representation into joint and unique modality-specific parts. The analysis of CKA and downstream performance shows the existence of a joint subspace between modalities and that a certain amount of joint information, as measured by CKA, is important for learning representations valuable for downstream tasks. Future work could consider models that explicitly represent joint and unique factors. Related ideas have been explored for natural images in disentangling content and style von2021self ; lyu2021understanding and, similarly, for neural data with variational autoencoders liu2021drop .

In our analysis, we do not consider the family of multimodal generative variational models kingma2013auto . Volumetric variational models are currently computationally expensive, and the field is under active development, with many models proposed recently. Including every possible model with every underlying technique was not our goal and would make the already extensive list of models hard to analyze. Future work may consider variational models under the same taxonomy for a fair comparison and detailed analysis in multimodal fusion applications.

There is more that can be done concerning the explainability of the models. Currently, a common choice for modeling neuroimaging data is a convolutional neural network (CNN) abrol2021deep . However, the straightforward application of CNNs yields representations in which each dimension captures multiple ROIs. This creates difficulties in analyzing cross-modal relationships: the multimodal links can only be measured as correlations between dimensions of the representations in different modalities, so the measured links connect dimensions rather than one ROI to another. Future work may consider ROI-based representations.
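The dimension-level link measurement described above can be sketched as a correlation matrix between the two modalities' representation dimensions, computed across subjects. A minimal NumPy sketch, with hypothetical names:

```python
import numpy as np

def dimensionwise_links(z_a, z_b):
    """Pearson correlation between every representation dimension of
    modality A and every dimension of modality B, across subjects.
    Note: each entry links dimensions, not individual ROIs."""
    z_a = (z_a - z_a.mean(axis=0)) / z_a.std(axis=0)  # z-score per dimension
    z_b = (z_b - z_b.mean(axis=0)) / z_b.std(axis=0)
    return (z_a.T @ z_b) / z_a.shape[0]               # (dims_A, dims_B)
```

Because each dimension may mix several ROIs, a strong entry in this matrix indicates a cross-modal relationship but not a specific region-to-region link, which motivates the ROI-based representations suggested above.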

In addition, because we focus on unsupervised models that do not use group labels, we used the HC and AD groups only to show group differences and to identify ROIs in our analysis. The data, however, may contain phenotypically small subgroups of patients that are not well represented by the HC or AD groups; without labels, group analysis in such a scenario is difficult. Future work can therefore consider clustering the representations to find such subgroups, which explainability methods can then analyze further.
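The clustering step suggested above could be sketched with k-means plus a silhouette-score heuristic for the number of subgroups. This is an illustrative scikit-learn sketch under assumed names, not part of the paper's pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_subgroups(representations, max_clusters=6, seed=0):
    """Cluster learned representations and pick the number of clusters
    with the best silhouette score (a simple model-selection heuristic)."""
    best_labels, best_score = None, -1.0
    for k in range(2, max_clusters + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(representations)
        score = silhouette_score(representations, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```

The resulting cluster assignments could then be passed to explainability methods (e.g., saliency analysis per cluster) to characterize each candidate subgroup.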

5 Conclusions

In this work, we presented a novel multi-scale coordinated framework for representation learning from multimodal neuroimaging data. We showed that self-supervised approaches can learn meaningful and useful representations that capture regions of interest with group differences without accessing group labels during the pre-training stage. We developed evaluation methodologies to assess the properties of the representations learned by models within the family through downstream task analysis, measurements of the joint subspace, and explainability evaluations.

We outperformed the previous unsupervised models AE and DCCAE on all classification tasks and modalities. In addition, our family of models does not require a decoder, which saves computation and memory. We can also outperform the Supervised model on fALFF; this result suggests future use of the proposed multimodal objectives for asymmetric fusion as a regularization technique. Further, our findings suggest the importance of multi-scale local-to-global multimodal relationships, captured by the XX objectives, which considerably improve the performance and multimodal alignment over previous methods and within the proposed family of models. This result suggests that there exist multi-scale relationships between the local structure and the global summary of the inputs in different modalities that have previously been neglected in multimodal representation learning.
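The local-to-global relationship can be sketched as a score tensor between one modality's local feature vectors and the other modality's global summaries, in the spirit of DIM-style local-to-global contrast. An illustrative NumPy sketch with our own names and shapes:

```python
import numpy as np

def local_to_global_scores(local_a, global_b):
    """Similarity of each local feature vector of modality A, shape
    (N, L, D), to the global summary of modality B, shape (N, D).
    Matching subjects (i == j) should score higher than mismatched ones."""
    local_a = local_a / np.linalg.norm(local_a, axis=-1, keepdims=True)
    global_b = global_b / np.linalg.norm(global_b, axis=-1, keepdims=True)
    # scores[i, l, j]: location l of subject i (A) vs. subject j's summary (B)
    return np.einsum('ild,jd->ilj', local_a, global_b)
```

A contrastive objective over these scores would treat the (i, l, i) entries as positives and all (i, l, j != i) entries as negatives, tying every local patch of one modality to the global summary of the other.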

The RR-XX model, selected for its best classification performance and higher joint information content via CKA, captured important regions of interest related to Alzheimer’s disease such as the hippocampus yang2022human ; 10.3389/fnagi.2018.00037 ; zhang2021regional , thalamus elvsaashagen2021genetic , parietal lobule greene2010subregions , occipital gyrus yang2019study , middle temporal gyrus hu2022brain , subthalamus and hypothalamus chan2021induction , and superior medial frontal gyrus cheung2021diagnostic . Importantly, the RR-XX model captures multimodal links between regions that are supported by the literature, such as thalamus-precuneus cunningham2017structural , precuneus-hippocampus kim2013hippocampus ; ryu2010measurement , and precuneus-middle cingulate cortex rami2012distinct ; bailly2015precuneus .

The showcased benefits of applying a comprehensive approach, evaluating a taxonomy of methods, and performing extensive qualitative and quantitative evaluation suggest that multimodal representation learning is a field with significant potential in neuroimaging, despite being in a nascent state. Our work lays a foundation for future robust and increasingly more interpretable multimodal models.


Acknowledgments

This work was funded by the National Institutes of Health (NIH) grants R01MH118695, RF1AG063153, 2R01EB006841, RF1MH121885, and the National Science Foundation (NSF) grant 2112455.

Data were provided by OASIS-3: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352. AV-45 doses were provided by Avid Radiopharmaceuticals, a wholly owned subsidiary of Eli Lilly.


  • (1) V. D. Calhoun, J. Sui, Multimodal fusion of brain imaging data: a key to finding the missing link (s) in complex mental illness, Biological psychiatry: cognitive neuroscience and neuroimaging 1 (3) (2016) 230–244.
  • (2) S. M. Plis, M. P. Weisend, E. Damaraju, T. Eichele, A. Mayer, V. P. Clark, T. Lane, V. D. Calhoun, Effective connectivity analysis of fmri and meg data collected under identical paradigms, Computers in biology and medicine 41 (12) (2011) 1156–1165.
  • (3) T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE transactions on pattern analysis and machine intelligence 41 (2) (2018) 423–443.
  • (4) P. Comon, Independent component analysis, a new concept?, Signal processing 36 (3) (1994) 287–314.
  • (5) H. Hotelling, Relations between two sets of variates, in: Breakthroughs in statistics, Springer, 1992, pp. 162–190.
  • (6) M. Moosmann, T. Eichele, H. Nordby, K. Hugdahl, V. D. Calhoun, Joint independent component analysis for simultaneous eeg–fmri: principle and simulation, International Journal of Psychophysiology 67 (3) (2008) 212–221.
  • (7) J. Sui, G. Pearlson, A. Caprihan, T. Adali, K. A. Kiehl, J. Liu, J. Yamamoto, V. D. Calhoun, Discriminating schizophrenia and bipolar disorder by fusing fmri and dti in a multimodal cca+ joint ica model, Neuroimage 57 (3) (2011) 839–855.
  • (8) J. Sui, T. Adali, G. Pearlson, H. Yang, S. R. Sponheim, T. White, V. D. Calhoun, A cca+ ica based model for multi-task brain imaging data fusion and its application to schizophrenia, Neuroimage 51 (1) (2010) 123–134.
  • (9) J. Liu, G. Pearlson, A. Windemuth, G. Ruano, N. I. Perrone-Bizzozero, V. Calhoun, Combining fmri and snp data to investigate connections between brain function and genetics using parallel ica, Human brain mapping 30 (1) (2009) 241–255.
  • (10) K. Duan, V. D. Calhoun, J. Liu, R. F. Silva, any-way independent component analysis, in: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), IEEE, 2020, pp. 1770–1774.
  • (11) A. Abrol, Z. Fu, M. Salman, R. Silva, Y. Du, S. Plis, V. Calhoun, Deep learning encodes robust discriminative neuroimaging representations to outperform standard machine learning, Nature communications 12 (1) (2021) 1–17.
  • (12) R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F. Wichmann, Shortcut learning in deep neural networks, arXiv preprint arXiv:2004.07780.
  • (13) D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al., A closer look at memorization in deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 233–242.
  • (14) M. Pechenizkiy, A. Tsymbal, S. Puuronen, O. Pechenizkiy, Class noise and supervised learning in medical domains: The effect of feature extraction, in: 19th IEEE symposium on computer-based medical systems (CBMS’06), IEEE, 2006, pp. 708–713.
  • (15) H. Rokham, G. Pearlson, A. Abrol, H. Falakshahi, S. Plis, V. D. Calhoun, Addressing inaccurate nosology in mental health: A multilabel data cleansing approach for detecting label noise from structural magnetic resonance imaging data in mood and psychosis disorders, Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 5 (8) (2020) 819–832.
  • (16) O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, A. v. d. Oord, Data-efficient image recognition with contrastive predictive coding, in: International Conference on Machine Learning, PMLR, 2020, pp. 4182–4192.
  • (17) A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, Advances in neural information processing systems 27.
  • (18) D. Hendrycks, M. Mazeika, S. Kadavath, D. Song, Using self-supervised learning can improve model robustness and uncertainty, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
  • (19) M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 9912–9924.
  • (20) N. Srivastava, R. Salakhutdinov, Learning representations for multimodal data with deep belief nets, in: International conference on machine learning workshop, Vol. 79, 2012, p. 3.
  • (21) S. M. Plis, D. R. Hjelm, R. Salakhutdinov, E. A. Allen, H. J. Bockholt, J. D. Long, H. J. Johnson, J. S. Paulsen, J. A. Turner, V. D. Calhoun, Deep learning for neuroimaging: a validation study, Frontiers in neuroscience 8 (2014) 229.
  • (22) N. Srivastava, R. R. Salakhutdinov, Multimodal learning with deep boltzmann machines, Advances in neural information processing systems 25.
  • (23) R. D. Hjelm, V. D. Calhoun, R. Salakhutdinov, E. A. Allen, T. Adali, S. M. Plis, Restricted boltzmann machines for neuroimaging: an application in identifying intrinsic networks, NeuroImage 96 (2014) 245–260.
  • (24) H.-I. Suk, S.-W. Lee, D. Shen, Hierarchical feature representation and multimodal fusion with deep learning for ad/mci diagnosis, NeuroImage 101 (2014) 569–582.
  • (25) G. Andrew, R. Arora, J. Bilmes, K. Livescu, Deep canonical correlation analysis, in: International conference on machine learning, PMLR, 2013, pp. 1247–1255.
  • (26) W. Wang, R. Arora, K. Livescu, J. Bilmes, On deep multi-view representation learning, in: International conference on machine learning, PMLR, 2015, pp. 1083–1092.
  • (27) A. Fedorov, J. Johnson, E. Damaraju, A. Ozerin, V. Calhoun, S. Plis, End-to-end learning of brain tissue segmentation from imperfect labeling, in: 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 3785–3792.
  • (28) A. Fedorov, E. Damaraju, V. Calhoun, S. Plis, Almost instant brain atlas segmentation for large-scale studies, arXiv preprint arXiv:1711.00457.
  • (29) L. Henschel, S. Conjeti, S. Estrada, K. Diers, B. Fischl, M. Reuter, Fastsurfer-a fast and accurate deep learning based neuroimaging pipeline, NeuroImage 219 (2020) 117012.
  • (30) R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, in: International Conference on Learning Representations, 2019.
  • (31) A. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748.
  • (32) A. Fedorov, T. Sylvain, E. Geenjaar, M. Luck, L. Wu, T. P. DeRamus, A. Kirilin, D. Bleklov, V. D. Calhoun, S. M. Plis, Self-supervised multimodal domino: in search of biomarkers for alzheimer’s disease, in: 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI), IEEE, 2021, pp. 23–30.
  • (33) I. Misra, L. v. d. Maaten, Self-supervised learning of pretext-invariant representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
  • (34) C. Doersch, A. Gupta, A. A. Efros, Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.
  • (35) S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, arXiv preprint arXiv:1803.07728.
  • (36) R. Zhang, P. Isola, A. A. Efros, Colorful image colorization, in: European conference on computer vision, Springer, 2016, pp. 649–666.
  • (37) A. Fedorov, R. D. Hjelm, A. Abrol, Z. Fu, Y. Du, S. Plis, V. D. Calhoun, Prediction of progression to alzheimer’s disease with deep infomax, in: 2019 IEEE EMBS International conference on biomedical & health informatics (BHI), IEEE, 2019, pp. 1–5.
  • (38) U. Mahmood, M. M. Rahman, A. Fedorov, Z. Fu, S. Plis, Transfer learning of fmri dynamics, arXiv preprint arXiv:1911.06813.
  • (39) U. Mahmood, M. M. Rahman, A. Fedorov, N. Lewis, Z. Fu, V. D. Calhoun, S. M. Plis, Whole milc: generalizing learned dynamics across tasks, datasets, and populations, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 407–417.
  • (40) A. Taleb, W. Loetzsch, N. Danz, J. Severin, T. Gaertner, B. Bergner, C. Lippert, 3d self-supervised methods for medical imaging, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 18158–18172.
  • (41) P. Bachman, R. D. Hjelm, W. Buchwalter, Learning representations by maximizing mutual information across views, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
  • (42) Y. Tian, D. Krishnan, P. Isola, Contrastive multiview coding, arXiv preprint arXiv:1906.05849.
  • (43) T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607.
  • (44) A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, A. Zisserman, End-to-end learning of visual representations from uncurated instructional videos, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9879–9889.
  • (45) J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, A. Zisserman, Self-supervised multimodal versatile networks, Advances in Neural Information Processing Systems 33 (2020) 25–37.
  • (46) A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: ICML, 2021.
  • (47) T. Sylvain, F. Dutil, T. Berthier, L. Di Jorio, M. Luck, D. Hjelm, Y. Bengio, Cmim: Cross-modal information maximization for medical imaging, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 1190–1194. doi:10.1109/ICASSP39728.2021.9414132.
  • (48) K. Saito, Y. Mukuta, Y. Ushiku, T. Harada, Demian: Deep modality invariant adversarial network, arXiv preprint arXiv:1612.07976.
  • (49) Z. Feng, C. Xu, D. Tao, Self-supervised representation learning from multi-domain data, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3245–3255.
  • (50) T. Sylvain, L. Petrini, D. Hjelm, Locality and compositionality in zero-shot learning, arXiv preprint arXiv:1912.12179.
  • (51) T. Sylvain, L. Petrini, D. Hjelm, Zero-shot learning from scratch (zfs): leveraging local compositional representations (2020). arXiv:2010.13320.
  • (52) A. Anand, E. Racah, S. Ozair, Y. Bengio, M. Côté, D. Hjelm, Unsupervised state representation learning in atari, in: NeurIPS, 2019.
  • (53) P. J. LaMontagne, T. L. Benzinger, J. C. Morris, S. Keefe, R. Hornbeck, C. Xiong, E. Grant, J. Hassenstab, K. Moulder, A. G. Vlassenko, M. E. Raichle, C. Cruchaga, D. Marcus, Oasis-3: Longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and alzheimer disease, medRxiv, doi:10.1101/2019.12.13.19014902.
  • (54) A. C. Yang, R. T. Vest, F. Kern, D. P. Lee, M. Agam, C. A. Maat, P. M. Losada, M. B. Chen, N. Schaum, N. Khoury, et al., A human brain vascular atlas reveals diverse mediators of alzheimer’s risk, Nature 603 (7903) (2022) 885–892.
  • (55) T. Elvsåshagen, A. Shadrin, O. Frei, D. van der Meer, S. Bahrami, V. J. Kumar, O. Smeland, L. T. Westlye, O. A. Andreassen, T. Kaufmann, The genetic architecture of the human thalamus and its overlap with ten common brain disorders, Nature communications 12 (1) (2021) 1–9.
  • (56) S. J. Greene, R. J. Killiany, A. D. N. Initiative, et al., Subregions of the inferior parietal lobule are affected in the progression to alzheimer’s disease, Neurobiology of aging 31 (8) (2010) 1304–1311.
  • (57) H. Yang, H. Xu, Q. Li, Y. Jin, W. Jiang, J. Wang, Y. Wu, W. Li, C. Yang, X. Li, et al., Study of brain morphology change in alzheimer’s disease and amnestic mild cognitive impairment compared with normal controls, General psychiatry 32 (2).
  • (58) X. Liu, W. Chen, Y. Tu, H. Hou, X. Huang, X. Chen, Z. Guo, G. Bai, W. Chen, The abnormal functional connectivity between the hypothalamus and the temporal gyrus underlying depression in alzheimer’s disease patients, Frontiers in Aging Neuroscience 10 (2018) 37. doi:10.3389/fnagi.2018.00037.
  • (59) F. Zhang, B. Hua, M. Wang, T. Wang, Z. Ding, J.-R. Ding, Regional homogeneity abnormalities of resting state brain activities in children with growth hormone deficiency, Scientific Reports 11 (1) (2021) 1–7.
  • (60) Q. Hu, Y. Li, Y. Wu, X. Lin, X. Zhao, Brain network hierarchy reorganization in alzheimer’s disease: A resting-state functional magnetic resonance imaging study, Human Brain Mapping.
  • (61) D. Chan, H.-J. Suk, B. Jackson, N. Milman, D. Stark, S. Beach, L.-H. Tsai, Induction of specific brain oscillations may restore neural circuits and be used for the treatment of alzheimer’s disease, Journal of Internal Medicine 290 (5) (2021) 993–1009.
  • (62) E. Y. Cheung, Y. Shea, P. K. Chiu, J. S. Kwan, H. K. Mak, Diagnostic efficacy of voxel-mirrored homotopic connectivity in vascular dementia as compared to alzheimer’s related neurodegenerative diseases—a resting state fmri study, Life 11 (10) (2021) 1108.
  • (63) S. I. Cunningham, D. Tomasi, N. D. Volkow, Structural and functional connectivity of the precuneus and thalamus to the default mode network, Human Brain Mapping 38 (2) (2017) 938–956.
  • (64) J. Kim, Y.-H. Kim, J.-H. Lee, Hippocampus–precuneus functional connectivity as an early sign of alzheimer’s disease: A preliminary study using structural and functional magnetic resonance imaging data, Brain research 1495 (2013) 18–29.
  • (65) S.-Y. Ryu, M. J. Kwon, S.-B. Lee, D. W. Yang, T.-W. Kim, I.-U. Song, P. S. Yang, H. J. Kim, A. Y. Lee, Measurement of precuneal and hippocampal volumes using magnetic resonance volumetry in alzheimer’s disease, Journal of Clinical Neurology 6 (4) (2010) 196–203.
  • (66) L. Rami, R. Sala-Llonch, C. Solé-Padullés, J. Fortea, J. Olives, A. Lladó, C. Pena-Gómez, M. Balasa, B. Bosch, A. Antonell, et al., Distinct functional activity of the precuneus and posterior cingulate cortex during encoding in the preclinical stage of alzheimer’s disease, Journal of Alzheimer’s disease 31 (3) (2012) 517–526.
  • (67) M. Bailly, C. Destrieux, C. Hommet, K. Mondon, J.-P. Cottier, E. Beaufils, E. Vierron, J. Vercouillie, M. Ibazizene, T. Voisin, et al., Precuneus and cingulate cortex atrophy and hypometabolism in patients with alzheimer’s disease and mild cognitive impairment: Mri and 18f-fdg pet quantitative analysis using freesurfer, BioMed research international 2015.
  • (68) A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, Devise: A deep visual-semantic embedding model, Advances in neural information processing systems 26.
  • (69) W. Hu, B. Cai, A. Zhang, V. D. Calhoun, Y.-P. Wang, Deep collaborative learning with application to the study of multimodal brain development, IEEE Transactions on Biomedical Engineering 66 (12) (2019) 3346–3359.
  • (70) O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.
  • (71) A. Araujo, W. Norris, J. Sim, Computing receptive fields of convolutional neural networks, Distill. doi:10.23915/distill.00021.
  • (72) M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, M. Lucic, On mutual information maximization for representation learning, in: International Conference on Learning Representations, 2020.
  • (73) S. Löwe, P. O’Connor, B. Veeling, Putting an end to end-to-end: Gradient-isolated learning of representations, Advances in neural information processing systems 32.
  • (74) K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
  • (75) G. Alain, Y. Bengio, Understanding intermediate layers using linear classifier probes, arXiv preprint arXiv:1610.01644.
  • (76) S. Kornblith, M. Norouzi, H. Lee, G. Hinton, Similarity of neural network representations revisited, in: International Conference on Machine Learning, PMLR, 2019, pp. 3519–3529.
  • (77) M. Raghu, J. Gilmer, J. Yosinski, J. Sohl-Dickstein, Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., 2017.
  • (78) A. S. Morcos, M. Raghu, S. Bengio, Insights on representational similarity in neural networks with canonical correlation, in: NeurIPS, 2018, pp. 5732–5741.
  • (79) A. Gretton, O. Bousquet, A. Smola, B. Schölkopf, Measuring statistical dependence with hilbert-schmidt norms, in: International conference on algorithmic learning theory, Springer, 2005, pp. 63–77.
  • (80) T. Nguyen, M. Raghu, S. Kornblith, Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth, in: International Conference on Learning Representations, 2021.
  • (81) M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: International conference on machine learning, PMLR, 2017, pp. 3319–3328.
  • (82) A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434.
  • (83) Q.-H. Zou, C.-Z. Zhu, Y. Yang, X.-N. Zuo, X.-Y. Long, Q.-J. Cao, Y.-F. Wang, Y.-F. Zang, An improved approach to detection of amplitude of low-frequency fluctuation (alff) for resting-state fmri: fractional alff, Journal of neuroscience methods 172 (1) (2008) 137–141.
  • (84) Y. He, L. Wang, Y. Zang, L. Tian, X. Zhang, K. Li, T. Jiang, Regional coherence changes in the early stages of alzheimer’s disease: a combined structural and resting-state functional mri study, Neuroimage 35 (2) (2007) 488–500.
  • (85) C. Wang, Z. Shen, P. Huang, H. Yu, W. Qian, X. Guan, Q. Gu, Y. Yang, M. Zhang, Altered spontaneous brain activity in chronic smokers revealed by fractional amplitude of low-frequency fluctuation analysis: a preliminary study, Scientific reports 7 (1) (2017) 1–7.
  • (86) M. Jenkinson, P. Bannister, M. Brady, S. Smith, Improved optimization for the robust and accurate linear registration and motion correction of brain images, Neuroimage 17 (2) (2002) 825–841.
  • (87) X.-W. Song, Z.-Y. Dong, X.-Y. Long, S.-F. Li, X.-N. Zuo, C.-Z. Zhu, Y. He, C.-G. Yan, Y.-F. Zang, Rest: a toolkit for resting-state functional magnetic resonance imaging data processing, PloS one 6 (9) (2011) e25031.
  • (88) A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv preprint arXiv:1703.07737.
  • (89) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
  • (90) A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019) 8026–8037.
  • (91) L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, J. Han, On the variance of the adaptive learning rate and beyond, arXiv preprint arXiv:1908.03265.
  • (92) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in python, Journal of Machine Learning Research 12 (2011) 2825–2830.
  • (93) T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
  • (94) A. Defazio, F. Bach, S. Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, arXiv preprint arXiv:1407.0202.
  • (95) D. J. Hand, R. J. Till, A simple generalisation of the area under the roc curve for multiple class classification problems, Machine learning 45 (2) (2001) 171–186.
  • (96) R. W. Cox, Afni: software for analysis and visualization of functional magnetic resonance neuroimages, Computers and Biomedical research 29 (3) (1996) 162–173.
  • (97) Y. Du, Z. Fu, J. Sui, S. Gao, Y. Xing, D. Lin, M. Salman, A. Abrol, M. A. Rahaman, J. Chen, et al., Neuromark: An automated and adaptive ica based pipeline to identify reproducible fmri markers of brain disorders, NeuroImage: Clinical 28 (2020) 102375.
  • (98) B. Guennewig, J. Lim, L. Marshall, A. N. McCorkindale, P. J. Paasila, E. Patrick, J. J. Kril, G. M. Halliday, A. A. Cooper, G. T. Sutherland, Defining early changes in alzheimer’s disease from rna sequencing of brain regions differentially affected by pathology, Scientific Reports 11 (1) (2021) 1–15.
  • (99) M. Yu, O. Sporns, A. J. Saykin, The human connectome in alzheimer disease—relationship to biomarkers and genetics, Nature Reviews Neurology 17 (9) (2021) 545–563.
  • (100) S.-M. Wang, N.-Y. Kim, D. W. Kang, Y. H. Um, H.-R. Na, Y. S. Woo, C. U. Lee, W.-M. Bahk, H. K. Lim, A comparative study on the predictive value of different resting-state functional magnetic resonance imaging parameters in preclinical alzheimer’s disease, Frontiers in psychiatry 12.
  • (101) S. Kolesnikov, Accelerated deep learning R&D, 2018.
  • (102) J. Von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, F. Locatello, Self-supervised learning with data augmentations provably isolates content from style, Advances in Neural Information Processing Systems 34.
  • (103) Q. Lyu, X. Fu, W. Wang, S. Lu, Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective, in: International Conference on Learning Representations, 2021.
  • (104) R. Liu, M. Azabou, M. Dabagia, C.-H. Lin, M. Gheshlaghi Azar, K. Hengen, M. Valko, E. Dyer, Drop, swap, and generate: A self-supervised approach for generating neural activity, Advances in Neural Information Processing Systems 34.
  • (105) D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.