Fusion is a self-supervised framework for data with multiple sources — specifically, this framework aims to support neuroimaging applications.
Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors and get combined based on the factors they share. This system motivated the design of powerful unsupervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We empirically show on two versions of multimodal MNIST and a multimodal brain imaging dataset that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to performance on a downstream task, and (3) maximization of the similarity between representations has a regularizing effect on a neural network, which can sometimes reduce downstream performance but can still reveal multimodal relations. Consequently, we outperform previous unsupervised encoder-decoder methods based on CCA or on variational mixtures (MMVAE) on various datasets under the linear evaluation protocol.
The idealized tasks on which machine learning models are benchmarked commonly involve a single data source and readily available labels. By contrast, real-life data is often composed of multiple sources: different MRI modalities in medical imaging, LiDAR and video for self-driving cars, or data influenced by confounders. In addition, labels can be poor or scarce, which leads to the need for unsupervised or at least semi-supervised learning. In this work, we address both constraints simultaneously by analyzing unsupervised learning approaches to multi-source data and contrasting them with more classical methods. The unsupervised methods we consider rely on contrastive learning.
Classical approaches to multi-source data include canonical correlation analysis (CCA), which finds maximally correlated linear projections of two data sources. More recently, CCA has been extended to representations obtained from neural networks in works such as DCCA and DCCAE.
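As an illustration of the classical baseline, linear CCA can be sketched in a few lines of NumPy by whitening each source and taking an SVD of the cross-covariance. This is a minimal textbook sketch, not the code used in our experiments; the function name `linear_cca` and the ridge term `eps` are assumptions introduced here.

```python
import numpy as np


def linear_cca(x, y, k, eps=1e-8):
    """Return the top-k canonical correlations and the two projection matrices.

    x: (n_samples, dx) array, y: (n_samples, dy) array.
    `eps` is a small ridge term (an assumption) for numerical stability.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    sxx = x.T @ x / (n - 1) + eps * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + eps * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)

    def inv_sqrt(m):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    wx, wy = inv_sqrt(sxx), inv_sqrt(syy)
    # Singular values of the whitened cross-covariance are the correlations.
    u, s, vt = np.linalg.svd(wx @ sxy @ wy)
    return s[:k], wx @ u[:, :k], wy @ vt.T[:, :k]
```

DCCA-style methods keep the same correlation objective but replace the raw inputs `x` and `y` with the outputs of neural networks.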
In addition to this family of methods, there are generative variational approaches. Specifically, a variational multi-modal mixture of experts (MMVAE) has resulted in large performance improvements.
Contrastive objectives have in recent years become essential components of a large number of unsupervised learning methods. Mutual information estimation has inspired a number of successful applications to single-view (DIM, CPC) and multi-view (AMDIM, CMC, SimCLR) image classification, reinforcement learning (ST-DIM), and zero-shot learning (CM-DIM [29, 30]). Such systems have resulted in large improvements to representation learning by considering different views of a single input. These and other related methods mostly operate in an unsupervised fashion, where the goal is to encourage similarity between transformed representations of a single input. These objectives can also be readily applied to the present context, in which different sources can be understood as different views of the same data point. In addition to the relative paucity of literature on this topic, we note that there are no studies that consider explicit combinations of these objectives. This work aims to address both issues. Our contributions are as follows:
We show empirically that multimodal contrastive learning has significant benefits over its unimodal counterpart.
By analyzing the effect of adding different contrastive objectives, we show that correctly combining such objectives has a critical effect on performance.
Even in cases where the similarity metric has a detrimental effect, we propose a means by which it can be used for similarity analysis.
Let $\mathcal{D} = \{D_1, \dots, D_n\}$ be a set of $n$ datasets with $N$ samples in each. For the $i$-th dataset we define a sampled image $x_i$, a CNN encoder $E_i$, a convolutional feature $c_i$ taken from a fixed layer of the encoder, and a latent representation $z_i = E_i(x_i)$. For AE-based approaches we also need a reconstruction $\hat{x}_i$ of the sample $x_i$.
To learn the set of encoders $\{E_i\}_{i=1}^{n}$ we maximize the objective
$$\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\, \ell(D_i, D_j),$$
where $\ell$ is a loss and $w_{ij}$ is a weight coefficient between datasets $D_i$ and $D_j$. There are multiple loss functions one can choose from. In this study, we specifically explore self-supervised contrastive objectives based on the maximization of mutual information as the choice for $\ell$:
$$\ell(D_i, D_j) = \frac{1}{N} \sum_{l=1}^{N} \log \frac{\exp f(u_i^{l}, v_j^{l})}{\sum_{k=1}^{N} \exp f(u_i^{l}, v_j^{k})},$$
where $f$ is a critic function and $u_i^{l}$, $v_j^{k}$ are embeddings. The embeddings are obtained through additional projections (also known as projection heads) of a location of the convolutional feature or of the latent (e.g. $c_i$ and $z_j$), parametrized by separate neural networks.
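A contrastive mutual-information objective of this kind is commonly estimated with the InfoNCE bound. The sketch below is a minimal illustration under the assumption that the critic values for a batch are already collected in a square score matrix whose diagonal holds the positive pairs; the name `info_nce` is ours and the code is not taken from our implementation.

```python
import numpy as np


def info_nce(scores):
    """Mean log-softmax of the positive pairs in an N x N score matrix.

    scores[l, k] is the critic value between embedding l of the first source
    and embedding k of the second source; positives sit on the diagonal.
    The returned value is at most 0; adding log(N) gives the usual
    lower bound on mutual information.
    """
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_softmax)))
```

Maximizing this quantity pushes each positive score above the scores of the other samples in the same row, which act as negatives.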
To describe a critic function, we define positive and negative pairs. A pair $(x_i, x_j)$ is called positive if it is sampled from the joint distribution $p(x_i, x_j)$ and negative if it is sampled from the product of marginals $p(x_i)\,p(x_j)$. For example, a single entity can be represented differently in datasets $D_i$ and $D_j$. More specifically, the digit "1" can be represented by images in multiple domains: as a handwritten digit in MNIST and as a house number in SVHN. A pair constructed from MNIST and SVHN by sampling the digit "1" from both is then positive, and it is negative if we choose a different digit from one of the datasets.
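The MNIST and SVHN example can be made concrete with a small pairing routine. This is a hedged sketch: it assumes class labels are available only to build the paired dataset, and the helper name `pair_by_label` is hypothetical.

```python
import numpy as np


def pair_by_label(labels_a, labels_b, rng):
    """For every sample of source A, pick an index in source B with the same label.

    Returns an integer array idx such that labels_b[idx] == labels_a, so
    (a[n], b[idx[n]]) is a positive pair. Any other alignment of the two
    sources within a batch can then serve as a negative pair.
    """
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    idx_b = np.empty(len(labels_a), dtype=int)
    for cls in np.unique(labels_a):
        candidates = np.flatnonzero(labels_b == cls)
        if candidates.size == 0:
            raise ValueError(f"class {cls} is missing from the second source")
        mask = labels_a == cls
        idx_b[mask] = rng.choice(candidates, size=int(mask.sum()))
    return idx_b
```

Note that negatives drawn at random from a batch may occasionally share a digit with the anchor; this sketch does not filter such cases.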
The idea behind a critic function is to assign higher values to positive pairs and lower values to negative pairs. Our choice of critic for this study is the separable critic, as in the AMDIM implementation (other possible choices include bilinear and concatenated critics). Such a critic is equivalent to the scaled dot product used in transformers.
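The scaled dot-product form of the critic can be sketched as follows. This is a minimal illustration, assuming both embedding sets have already been projected to the same dimension; the function name is ours.

```python
import numpy as np


def scaled_dot_critic(u, v):
    """Score every pair of embeddings with a scaled dot product.

    u: (N, d) embeddings from one source, v: (M, d) embeddings from another.
    Returns an (N, M) matrix with entry [l, k] = <u[l], v[k]> / sqrt(d).
    """
    d = u.shape[1]
    return u @ v.T / np.sqrt(d)
```

When `u` and `v` come from the same batch in the same order, the diagonal of the returned matrix holds the scores of the positive pairs.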
All recent contrastive self-supervised methods incorporate in some way an estimator of mutual information, which we show schematically in Figure 1. These methods originate from the idea of Local DIM, which we denote L. This method maximizes the mutual information between a location of the convolutional feature $c_i$ (where the representation is taken along channels) and the latent $z_i$ of the same source. Further, AMDIM, ST-DIM and CM-DIM incorporate Cross-Local (CL) and Cross-Spatial (CS) objectives. In the CL setting, one pairs the latent $z_i$ with locations of $c_j$, $i \neq j$, and in the CS setting one pairs locations in $c_i$ and $c_j$, $i \neq j$. The last connections were introduced by CMC and SimCLR. These two methods connect the latents $z_i$ and $z_j$, which we call cross-latent similarity (S). The first two objectives, L and CL, follow the Deep InfoMax principle, while S and CS follow a similarity principle that is closely connected to the idea of CCA.
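The four edge types can be summarized by which tensors they compare. The sketch below is illustrative only: it assumes feature maps of shape (N, C, H, W), latents of shape (N, D), and dictionaries `z` and `c` keyed by source index; projection heads are omitted and all names are hypothetical.

```python
import numpy as np


def locations(feature_map):
    """Flatten a feature map (N, C, H, W) into per-location vectors (N, H*W, C)."""
    n, ch, h, w = feature_map.shape
    return feature_map.reshape(n, ch, h * w).transpose(0, 2, 1)


def objective_pairs(name, z, c, i, j):
    """Return the two embedding sets compared by objective `name`.

    z[i] is the latent of source i and c[i] its convolutional feature map.
    """
    if name == "L":   # local: latent vs. locations of the same source
        return z[i], locations(c[i])
    if name == "CL":  # cross-local: latent of i vs. locations of j
        return z[i], locations(c[j])
    if name == "CS":  # cross-spatial: locations of i vs. locations of j
        return locations(c[i]), locations(c[j])
    if name == "S":   # cross-latent similarity: latent of i vs. latent of j
        return z[i], z[j]
    raise ValueError(f"unknown objective: {name}")
```

Each returned pair would then be passed through the projection heads and the critic before entering the contrastive loss.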
Given these basic contrastive objectives, one can combine them as in Figure 2. We treat each edge as a type of objective. For a full picture, one can also combine previous approaches such as AE and DCCA with contrastive objectives, giving S-AE and L-CCA. We show them schematically in Figure 3 together with the other baselines: uni-source Supervised and AE, and multi-source DCCAE and MMVAE with the loose IWAE estimator. We trained the Supervised model specifically to get a discriminative bound on a multi-modal dataset.
For our experiments, we incorporate diverse datasets with multiple sources (shown in Figure 1).
Two-View MNIST is inspired by prior work in which each view represents a corrupted version of an original MNIST digit. First, the intensities of the original image are rescaled to the unit interval. Then we resize the image to fit the DCGAN architecture. Lastly, to generate the first view we rotate the image by a random angle, and for the second view we add unit uniform noise and rescale the intensities to the unit interval again.
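The view-generation steps can be sketched as below. Several details are assumptions made for illustration: the rotation range `max_angle`, nearest-neighbour interpolation, and adding the noise to the unrotated image are not specified in the text above, and the resizing step is omitted.

```python
import numpy as np


def rescale_unit(img):
    """Linearly rescale intensities to the unit interval [0, 1]."""
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(img, dtype=float)
    return (img - lo) / (hi - lo)


def rotate_nearest(img, angle_deg):
    """Rotate a 2-D image about its centre using nearest-neighbour sampling."""
    h, w = img.shape
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    # Inverse mapping: find the source pixel for every output pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img, dtype=float)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out


def two_views(img, rng, max_angle=45.0):
    """Create a rotated view and a noisy view of one digit image."""
    base = rescale_unit(img.astype(float))
    view1 = rotate_nearest(base, rng.uniform(-max_angle, max_angle))
    view2 = rescale_unit(base + rng.uniform(0.0, 1.0, size=base.shape))
    return view1, view2
```

Both views share the digit identity, so a pair produced from the same image is a positive pair in the sense used above.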
The multi-domain dataset MNIST-SVHN was used in prior work, where the first view is a grayscale MNIST digit and the second view is an RGB street-view house number sampled from the SVHN dataset. We only modified the MNIST digits by resizing them for use with the DCGAN encoder. This dataset represents the more complicated case in which the digit is represented in different underlying domains. All intensities are scaled to the unit interval.
For experiments with a multi-modal dataset, we utilize data provided by OASIS-3 to evaluate representations on Alzheimer's disease (AD) classification. We preprocessed fMRI into fALFF (in the 0.01 to 0.1 Hz power band) and T1w into structural MRI (sMRI) by brain masking, using REST and FSL (v 6.0.2). All images are linearly transformed to MNI space and resampled to 3 mm resolution. After careful selection (removing bad images and selecting the most represented subset, non-Hispanic Caucasian) we were left with 826 subjects. Combining sMRI and fALFF per subject yielded 4021 pairs. We held out 100 subjects and used the others in stratified 5-fold splits for training and validation. We defined 3 groups: healthy cohort (HC), AD, and others (subjects with other brain problems). During pretraining we employ all groups and pairs, while during linear evaluation we take only one pair per subject and use only HC or AD subjects. As additional preprocessing, we applied histogram standardization and z-normalization. During pretraining, we also use simple data augmentation: random crops and flips. For the last two steps, we used the TorchIO library. Since this dataset is highly unbalanced, we utilize a class-balanced data sampler.
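For readers unfamiliar with fALFF, the quantity is the fraction of spectral amplitude that falls inside the low-frequency band. The sketch below computes it for a single voxel time series; it is a simplified illustration rather than the REST toolbox procedure, and the repetition time `tr` is a free parameter assumed here.

```python
import numpy as np


def falff(ts, tr, low=0.01, high=0.1):
    """Fractional amplitude of low-frequency fluctuations for one time series.

    ts: 1-D signal sampled every `tr` seconds.
    Returns amplitude in [low, high] Hz divided by amplitude over all
    non-zero frequencies.
    """
    ts = np.asarray(ts, dtype=float)
    ts = ts - ts.mean()
    amplitude = np.abs(np.fft.rfft(ts))
    freqs = np.fft.rfftfreq(len(ts), d=tr)
    band = (freqs >= low) & (freqs <= high)
    total = amplitude[1:].sum()  # skip the zero-frequency bin
    return float(amplitude[band].sum() / total) if total > 0 else 0.0
```

Applying this voxel by voxel inside a brain mask yields one fALFF volume per scan.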
We evaluate representations with the linear evaluation protocol. The methodology implies training a linear mapping from a latent representation to the classes. During this process, the encoder is kept frozen. For our task, we evaluate each encoder separately. For the OASIS-3 dataset we use logistic regression instead. The hyperparameters of the logistic regression (inverse regularization strength, penalty) were optimized using Optuna.
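The protocol amounts to fitting a linear classifier on features from an encoder whose weights never change. The following is a toy NumPy sketch: `frozen_encoder` is a stand-in for a pretrained network, and the learning rate and step count are arbitrary assumptions, not our settings.

```python
import numpy as np


def frozen_encoder(x, proj):
    """Stand-in for a pretrained encoder: a fixed, non-trainable mapping."""
    return np.tanh(x @ proj)


def fit_linear_probe(feats, labels, n_classes, lr=0.5, steps=500):
    """Train a softmax linear classifier on fixed features by gradient descent."""
    n, d = feats.shape
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b
```

Only `w` and `b` are updated; the encoder parameters (`proj` in this toy example) are never touched, so the probe accuracy reflects the quality of the representation itself.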
To better understand the underlying inductive bias of a specific objective, we compare the similarity between the representations of the sources using SVCCA, which is the mean correlation of aligned directions, and CKA, which has been shown to reliably identify the relationship between representations of networks.
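As a reference point, the linear variant of CKA has a compact closed form. The sketch below implements that variant in NumPy; whether the linear or a kernel variant matches the analysis reported here is an assumption, and the function name is ours.

```python
import numpy as np


def linear_cka(x, y):
    """Linear centered kernel alignment between two representation matrices.

    x: (n_samples, dx), y: (n_samples, dy); rows must refer to the same samples.
    Returns a value in [0, 1]; 1 means identical up to rotation and scaling.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

In the multi-source setting, `x` and `y` would be the latent representations produced by the two encoders for the same set of paired samples.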
The architecture and hyperparameters of the encoder for each source are based entirely on DCGAN. For decoder-based methods, we also utilized the DCGAN decoder. However, we removed one layer for experiments with natural images to match their input size. Each encoder projects its input to a latent vector of fixed dimension. All layers are initialized with uniform Xavier initialization.
The projection layer for the latent representation is the identity. For convolutional features we use a 2-layer convolutional neural network with a number of hidden channels equal to the dimension of the latent representation. The convolutional features are taken from a layer chosen by the side size of its feature map. The critic function then computes the score in the resulting embedding space. The parameters of the projection layers are shared for each loss, but not between sources.
We use the RAdam optimizer with a OneCycleLR scheduler. The number of pretraining epochs differs between natural images and OASIS-3. The linear classifier for natural images is trained for a fixed number of epochs, and for logistic regression we run Optuna for a fixed number of iterations. All experiments were performed with a fixed batch size. In some runs, we noticed that the CCA objective and MMVAE are not stable. On the OASIS-3 dataset, MMVAE was only able to train on 2 folds out of 5, and only with a smaller batch size due to memory constraints. For contrastive objectives, we additionally apply a penalty (except for OASIS-3) and clipping as in AMDIM.
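The two stabilizers for contrastive scores can be sketched as follows. This is an illustration in the style of AMDIM-like code, not our exact configuration: the clipping value, the penalty weight, and the choice of a squared-score penalty are assumptions.

```python
import numpy as np


def tanh_clip(scores, clip_val=10.0):
    """Softly clip critic scores to the range [-clip_val, clip_val]."""
    return clip_val * np.tanh(scores / clip_val)


def score_penalty(scores, weight=5e-2):
    """Penalty on the raw (unclipped) scores, discouraging large magnitudes."""
    return float(weight * np.mean(np.asarray(scores, dtype=float) ** 2))
```

In such a setup the clipped scores enter the contrastive loss, while the penalty on the raw scores is added to the total loss.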
The code was written using PyTorch and the Catalyst framework. For data transforms of the brain images we utilized TorchIO, and for CKA analysis of the representations we used anatome. We also relied on existing code for SVCCA, AMDIM, DCCAE, and MMVAE with MNIST-SVHN. The experiments were performed on an NVIDIA DGX-1 V100.
As we can see in Figures 4 and 5, the presence of cross-modal contrastive losses has a strong positive impact on downstream test performance, across different architecture and model choices. We also note that the formulations have different performance across settings and datasets, leading to the conclusion that applying them in practice requires careful adaptation to a given problem.
While in the simple multi-view case contrastive methods are the absolute leaders, in multi-domain experiments reconstruction-based models such as MMVAE and S-AE stand out. However, most of the models perform within a small margin of S-AE. Thus one can choose decoder-free self-supervised approaches to reduce computational cost. Uni-source AE and L, and multi-source DCCAE and L-CCA, are clearly not able to learn SVHN.
Figure 6 shows the results on OASIS-3. Multi-source approaches exhibit strong performance, but the difference from other methods is less noticeable than for the previous tasks. As we discussed for the multi-domain experiment, here too the AE is important for learning the modality. The similarity-based method S has the highest metric on T1 but performs significantly worse on fALFF. This might indicate that the T1 modality dominated fALFF during training. We think that method S is related to CCA but is a more stable objective. Thus S can be a good candidate to substitute for CCA. Adding reconstruction to the contrastive objective (as with CCA and AE in DCCAE) yields S-AE, which helps to regularize the model. S-AE is equivalent to AE in downstream performance. However, it has a higher CKA similarity between representations, which can be used in multi-modal fusion, and it clearly substitutes for DCCAE.
On the fALFF modality, the absolute leader is L-CL: it is better than the Supervised model and comparable to AE and S-AE. Thus a multi-source objective can help learn a representation of a struggling modality.
By the SVCCA metric, most self-supervised methods and the Supervised model score lower than MMVAE, DCCAE, S, S-AE, and AE. However, SVCCA does not contrast the relationship between the representations in different modalities as clearly as CKA does.
In this work, we proposed a unifying view on contrastive methods and benchmarked them alongside baselines when applied to multi-source data. We believe that this unifying view will boost further understanding of how to learn powerful representations from multiple sources. Hopefully, instead of combining the similarities in various ways and publishing the winning combinations as individual methods, the field will move toward a broader perspective on the problem.
We empirically demonstrated that multi-modal contrastive approaches result in performance improvements over methods that rely on a single modality for contrastive learning. We also showed that downstream performance is highly dependent on how such objectives are composed. We argue that maximizing similarity does not guarantee higher downstream performance. In some cases, it may weaken the representation or have a regularization effect on the objective. High similarity between representations can nonetheless be important for other applications, e.g. multimodal analysis.
Not having to train the decoder, self-supervised models significantly reduce computation costs. While keeping comparably high downstream performance they can democratize medical imaging by lowering the hardware requirements.
For future work, we are interested in considering how the conclusions we draw here hold in different learning settings with scarcer data or annotations such as few-shot or zero-shot learning cases.
This work is supported by NIH R01 EB006841.
Data were provided in part by OASIS-3: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352. AV-45 doses were provided by Avid Radiopharmaceuticals, a wholly-owned subsidiary of Eli Lilly.
On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.
Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Variational mixture-of-experts autoencoders for multi-modal deep generative models. In NeurIPS.