ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics

by Aiham Taleb, et al.
Hasso Plattner Institute

High annotation costs are a substantial bottleneck in applying modern deep learning architectures to clinically relevant medical use cases, substantiating the need for novel algorithms to learn from unlabeled data. In this work, we propose ContIG, a self-supervised method that can learn from large datasets of unlabeled medical images and genetic data. Our approach aligns images and several genetic modalities in the feature space using a contrastive loss. We design our method to integrate multiple modalities of each individual person in the same model end-to-end, even when the available modalities vary across individuals. Our procedure outperforms state-of-the-art self-supervised methods on all evaluated downstream benchmark tasks. We also adapt gradient-based explainability algorithms to better understand the learned cross-modal associations between the images and genetic modalities. Finally, we perform genome-wide association studies on the features learned by our models, uncovering interesting relationships between images and genetic data.





1 Introduction

Figure 1: Overview of our contrastive learning method for imaging and genomic data. It learns representations by pulling the modalities of each individual closer together in the embedding space, and pushing them apart from those of other individuals. In this example, the modalities are retinal fundus images (in brown), SNP data (in green), and polygenic risk scores (PGS) (in purple). Our method handles missing modalities (e.g. the missing PGS for the person in the upper right).

Medical imaging plays a vital role in patient healthcare: it aids in disease prevention, early detection, diagnosis, and treatment. However, efforts to employ machine learning algorithms to support clinicians are often hampered by the high costs of required expert annotations [annotate]. At the same time, large-scale biobank studies have recently started to aggregate unprecedented scales of multimodal data on human health. For example, the UK Biobank (UKB) [ukbio] contains data on roughly 500,000 individuals, including a wide range of imaging modalities such as retinal fundus images and cardiac, abdominal, and brain MRI. Similar studies are currently underway in other countries, such as the Nationale Kohorte (NaKo) [nako], BioMe [biome], FinnGen [FinnGen], the Estonian Biobank [est_biobank], and others. While some of these studies also include phenotypic descriptions, e.g. a person's medical history, such data tend to be both highly incomplete and biased due to clinical practices and assessment methods [o2005measuring], making learning from them challenging and error-prone. Genetic data, on the other hand, is increasingly abundant. While chip-based genotyping technology has enabled the study of common genetic variation at scale [verlouw2021comparison], the exponentially decreasing costs of genomic sequencing are driving progress on rare genetic variation [park2016trends]. Due to these advances, the UKB and other biobanks often contain a rich array of genetic and genomic measurements. Genetic data is generally less susceptible to bias factors, and most diseases have at least a partially genetic cause, with some disorders attributable exclusively to genetic mutations [witte2014contribution]. Similarly, many traits not directly related to disease, e.g. height and human personality, are also strongly influenced by genetics [lippert2017identification, zwir2020uncovering]. Complementary imaging-genetics datasets are increasingly available in other application settings as well, e.g. plant breeding [yang2020crop].

Unlabelled medical images carry valuable information about organ structures, and an organism's genome is the blueprint for the biological functions of the individual's body. Clearly, integrating these distinct yet complementary data modalities can help create a more holistic picture of physical and disease traits. Integrating these data types, however, is non-trivial. The human genome consists of three billion base pairs, yet most genetic differences between individuals have little effect. This leads to challenges both in terms of computation and in terms of statistical efficiency, and it is not clear a priori which parts of the genome are relevant. Typically, genome-wide association studies (GWAS) [gwas1, gwas2] use statistical inference techniques to discover relationships between genetic variations and particular physical or disease traits. To date, thousands of scientific works have reported genotype-phenotype associations [gwas_cat]. However, a large portion of the known or presumed heritability of traits is still not accounted for by individual genome-trait associations, a phenomenon known as "missing heritability" [manolio2009finding]. Better methods to find – and explain – the relationships between genetic and imaging modalities may help close this gap.

Therefore, the growing number of biobanks of unlabeled multimodal (i.e. imaging-genetics) data calls for solutions that can: (i) learn semantic data representations without requiring expensive expert annotations, (ii) integrate these data modalities end-to-end in an efficient manner, and (iii) explain discovered cross-modal correspondences (associations). Self-supervised (unsupervised) representation learning offers a viable solution when unlabeled data is abundant and labels are scarce. These methods have seen a surge of interest after proving successful in several application domains [survey_self_supervised]. The representations they learn facilitate data-efficient fine-tuning on supervised downstream tasks, significantly reducing the burden of manual annotation. Furthermore, such methods allow for integrating multiple data modalities as distinct views, which can lead to considerable performance gains. Despite recent advances in self-supervised methods, e.g. contrastive learning, little work has been done to adopt them in the medical domain. In fact, we are not aware of any prior work that leverages self-supervised representation learning on combined imaging and genetic modalities. We believe self-supervised learning has the potential to address the above challenges inherent to the medical domain.

Contributions. (i) We propose a self-supervised method, called ContIG, that can learn from multimodal datasets of unlabeled medical images and genetic data. ContIG aligns these modalities in the representation space using a contrastive loss, which enables learning semantic representations in the same model end-to-end. Our approach handles the case of multiple genetic modalities, in conjunction with images, even when the available modalities vary across individuals. (ii) We adapt gradient-based explainability algorithms to better understand the learned cross-modal correspondences (associations) between the images and genetic modalities. Our method discovers interesting associations, and we confirm their relevance by cross-referencing biomedical literature.

Our work presents a framework on how to exploit inexpensive self-supervised solutions on large corpora (e.g. Biobanks) of (medical) images and genetic data.

2 Related Work

Self-supervised learning with pretext tasks. These methods learn an embedding (representation) space by deriving a proxy (pretext) task from the data itself, requiring no human labels. The learned embeddings then remain useful for real-world downstream tasks. A large body of work relies on such proxy tasks [w2v, context_prediction, jig, color, deep_cluster, rotations]; a comprehensive review is provided in [survey_self_supervised]. The limitation of these methods is the need to design handcrafted proxy tasks to learn representations.

Contrastive learning approaches [CPC1, CPC2, simclr, byol, simsiam, moco, dim, pirl, wu2018unsupervised, barlow, swav, mocov2, nnclr] circumvent this limitation by maximizing the mutual information between related signals while contrasting them against unrelated ones, typically by employing Noise Contrastive Estimation. Contrastive methods have substantially advanced the results of unsupervised learning on ImageNet [imagenet_cvpr09]. However, unlike our method, they process uni-modal images only.

Multimodal learning. Learning from multimodal data poses several inherent challenges, such as: multimodal fusion, alignment, and representation [multimodal_survey, multimodal_dl]. Prior works, some of which are self-supervised, learn from a variety of modalities, such as: image and text (vision and language) [videobert, cbt, vilbert, lxmert, visualbert, image_caption1, image_caption2, cross_modal_scenes], image and audio [audio_image1, audio_image2, audio_image3, audio_image4, audio_image5, soundnet], audio and text [audio_text1, audio_text2], and multi-view (multimodal) images [cmc, cross_learn, rgb_optical1, rgb_optical2]. More recent works employed contrastive learning for multimodal inputs (image and text captions) [contrastive_sound, contrastive_text1, contrastive_text2, contrastive_text3, contrastive_text4, contrastive_text_audio]. We follow this line of work, and we extend contrastive pretraining to novel modalities, i.e. images and genetics, for the first time.

Self-supervision on medical images. Early works of self-supervision in the medical context [depth, surgery, register, body, disc, Yan2018DeepLG, cardiac_self_supervised, endoscopic_videos, Cytoarchitectonic_segmentation] made assumptions about input data, limiting their generalization to other target tasks. Then, many works proposed employing proxy tasks for self-supervision from medical scans [orientation_prediction_tajbakhsh, spitzer_3d_distance, ssl_models_genesis, ultrasound_video, taleb_multimodal_puzzles, rubik, image_context_medical, ultrasound_video, image_context_medical2]. A review of similar works is in [review_annotation_efficient]. Recently, contrastive learning [chaitanya2020contrastive, taleb_3D_paper, contrastive_registration1, contrastive_registration2] has been applied to medical scans, where it also showed promising results. Our work, as opposed to these works, utilizes multiple modalities (images and one or more genetic modalities) to improve the learned representations by capturing imaging-genetic relationships.

Deep learning from both genetics and images. In addition to its successful applications to medical imaging [dl_medical_imaging], deep learning has also found success in genomics [dl_in_genetics1, dl_in_genetics2, dl_in_genetics3, dl_in_genetics4, dl_in_genetics5]. A growing number of recent works utilize deep neural networks to jointly learn from both modalities [joint_ad1, joint_ad2, multimodal_ad, ash2021joint, gundersen2020end, badea2020identifying, transferGWAS, chang2018deep, fujinami2021prediction, dai2021multi]. However, these methods are all either highly application-specific or fully supervised. Notably, we are not aware of any prior work leveraging the self-supervised framework (with contrastive loss functions) to improve representation learning from imaging and genetic data jointly.

3 Method

In Sec. 3.1 we first review some biomedical foundations and motivate the genetic modalities chosen in this work. Then, we describe our contrastive method in Sec. 3.2, and different modality aggregation types. Finally, we detail the explanation methods for genetic features in Sec. 3.3.

3.1 Modalities of Genetic Data

The basic building blocks of DNA, which encodes the biological functions needed for the development of an organism, are called nucleotides. A long sequence of the four nucleotides Adenine (A), Thymine (T), Cytosine (C), and Guanine (G) makes up the genome – the "recipe" needed to build an organism [dna_seq]. A relatively small fraction of the genome codes for proteins, while the remaining parts have regulatory or structural functions. Over generations, genetic mutations occur, for example substituting one nucleotide for another, e.g. A to C. Some of these genetic changes can alter physical traits (e.g. eye color) or cause disease (e.g. Alzheimer's). "Genotyping" is the process of measuring these genetic changes [snp_genotyping]. The most frequently measured type of change is the single-nucleotide polymorphism (SNP), where a single pair of nucleotides is altered at a specific position in the genome.

There are three billion base pairs in the human genome, but typically only a small fraction of them is measured, due to cost and technological constraints. Even if large parts of the sequence are available, as is the case for whole-genome sequencing studies, working with the raw data is not feasible, both in terms of statistical efficiency – most of those base pairs carry no causal signal and only add noise to the estimation process – and in terms of computational efficiency. For these reasons, most studies record only a small subset of all nucleotides, usually on the order of several hundred thousand to several million SNPs. Furthermore, human traits of interest arise from a spectrum of different genetic architectures. At the same time, due to evolutionary dynamics, some SNPs exhibit their possible variations frequently in a population ("common" variants), while other SNPs are identical for the overwhelming majority of the population, with only few individuals carrying mutations ("rare" variants) – a form of class imbalance. Therefore, in this work we consider three different ways to encode the genetic modalities, emphasizing different aspects of human physiology.

Complex traits are traits that are influenced by a large number of causal factors, including relatively common genetic variations. One example is height, which is determined to a large degree by many SNPs all across the human genome [yang2010common]. Many common diseases and impairments are complex traits, which makes them especially relevant to human health applications [frazer2009human]. To best encode genetic architectures associated with complex traits, we utilize polygenic risk scores (PGS) [pgs]. PGS aggregate many, mostly common, SNPs into a single score that reflects a person’s inherited susceptibility to a specific disease [pgs_nature]. The individual SNPs are weighted based on their strength of association with the disease. By using many different PGS for different traits and diseases we can get a multi-faceted view of an individual’s complex trait predisposition.
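The aggregation described above can be sketched in a few lines: a PGS is a weighted sum of per-SNP genotype dosages (0, 1, or 2 copies of the effect allele), with weights (effect sizes) taken from published association studies. A minimal illustration with made-up numbers, not the actual PGS Catalog weights:

```python
import numpy as np

# Toy sketch: a polygenic risk score (PGS) is a weighted sum of genotype
# dosages. Weights here are invented for illustration only.
def polygenic_score(dosages, effect_sizes):
    """dosages: (n_individuals, n_snps) values in {0, 1, 2};
    effect_sizes: (n_snps,) per-SNP weights from a published GWAS."""
    return dosages @ effect_sizes

dosages = np.array([[0, 1, 2],
                    [2, 2, 0]])
effect_sizes = np.array([0.5, -0.2, 0.1])
scores = polygenic_score(dosages, effect_sizes)  # one score per individual
```

In practice, a separate set of effect sizes is used per trait, yielding one score per (individual, trait) pair.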

Recent advances in DNA sequencing have also enabled assessing the contribution of rare genetic variants to heritable traits [rare]. Rare variants occur at low frequencies (measured as the minor allele frequency, MAF) in a population. Large genetic effects often negatively affect an individual's health and are strongly selected against by evolution. Hence, in contrast to common variants, many rare variants have a large effect size and predispose for genetic diseases. Rare variants are usually not included in PGS, and due to their low frequencies they pose a challenge for robust statistical models. In this work, we use burden scores [burden_test], which aggregate several rare variants within a localized genetic region.
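As a toy sketch (not the authors' pipeline), a binary burden score per gene can be computed by checking whether an individual carries at least one rare, potentially damaging variant among the variants mapped to that gene:

```python
import numpy as np

# Illustrative sketch: binary burden scores aggregate rare variants per gene.
# All names and the gene-to-variant mapping are invented for this example.
def burden_scores(rare_variant_counts, gene_index):
    """rare_variant_counts: (n_individuals, n_variants) rare-allele counts;
    gene_index: list of variant-column index arrays, one per gene.
    Returns (n_individuals, n_genes) binary scores."""
    return np.stack(
        [(rare_variant_counts[:, cols].sum(axis=1) > 0).astype(int)
         for cols in gene_index],
        axis=1)

counts = np.array([[0, 1, 0, 0],
                   [0, 0, 0, 0]])
genes = [np.array([0, 1]), np.array([2, 3])]  # two genes, two variants each
b = burden_scores(counts, genes)
```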

Finally, we also employ a uniformly sampled cross-section of the whole genome, by including evenly spaced SNPs among those genotyped in the respective study. These raw SNPs are mostly common variants (due to the biological sampling procedure) and give a broad representation of an individual's genetic composition. This representation likely carries population structure such as ancestry [lippert2011fast], but also tags highly diverse functional information.

These three genetic modalities – polygenic risk scores, burden scores, and raw SNPs – capture complementary aspects and together paint a broad description of an individual’s genetic predisposition. We employ them both individually and jointly as contrastive views to medical images.

Figure 2: Schematic illustration of the steps of our proposed method. (a) We assume one imaging modality (retinal fundus, shown in brown) and three genetic modalities (single-nucleotide polymorphisms (SNP) in green, polygenic risk scores (PGS) in purple, burden scores in yellow). Note that different genetic modalities exhibit different variant frequencies (denoted by the histogram in blue): SNP and PGS use common variants (high frequency), while burdens use rare variants (low frequency). (b) We extract features from each modality with deep neural networks, i.e. convolutional networks for images and fully connected (MLP) networks for genomic data. We use a projection head (MLP) for each modality, which produces equally sized modality embeddings. (c) We use these embeddings in the contrastive loss computation. The embeddings of each individual are encouraged to come closer in the feature space (depicted by the gray circle) and to move farther from those of other individuals. The dotted gray lines illustrate the contrasting mechanism between modalities.

3.2 Contrastive Learning from Images & Genetics

We assume a dataset of multimodal samples, one for each individual person. Each sample consists of a medical image paired with multiple genetic modalities. We denote the image of individual $i$ by $x_i^I$, and the corresponding genetic modalities by $x_i^{G_m}$, where $m$ indexes the genetic modality. We group images and genetic modalities into batches of size $N$ by modality. The number of available genetic modalities may vary across individuals.

Our method, illustrated in Fig. 2, processes these input modalities with a set of neural network encoders, one per modality. We denote the image encoder by $f^I$ and the genetic encoders by $f^{G_m}$, with one distinct encoder per genetic modality. The resulting $d$-dimensional vector representations are then processed with projection heads $h^I$ and $h^{G_m}$, respectively, yielding the embeddings $z_i^I$ and $z_i^{G_m}$. Following [simclr], each projection head is a non-linear MLP with one hidden layer.

Contrastive Loss with Two Modalities.

We first define the contrastive loss assuming pairs of an image and one genetic modality, with respective embeddings $z_i^I$ and $z_i^G$ for individual $i$. For the image in the $i$-th pair, we consider $z_i^G$ the positive (true) sample among the negative genetic samples $z_j^G$ ($j \neq i$) of other individuals in the same batch. Similarly, $z_i^I$ is the positive sample of $z_i^G$, amongst the negative image samples $z_j^I$ ($j \neq i$). The contrastive loss is therefore the sum of two parts: i) image-to-genetics (fix the image and contrast the genetic samples), and ii) genetics-to-image (fix the genetics and contrast the images). Formally, in each training step we select a random batch of size $N$ and use the batch-wise loss function:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[ \lambda \log\frac{\exp(\mathrm{sim}(z_i^I, z_i^G)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(z_i^I, z_j^G)/\tau)} + (1-\lambda)\log\frac{\exp(\mathrm{sim}(z_i^G, z_i^I)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(z_i^G, z_j^I)/\tau)} \right] \quad (1)$$

where $\tau$ is a temperature parameter, $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert\, \lVert v \rVert)$ is the cosine similarity, and $\lambda$ is a loss weighting hyperparameter.
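A minimal NumPy sketch of this two-sided, InfoNCE-style objective; all function and variable names are ours, and the actual model would of course compute this inside a deep learning framework:

```python
import numpy as np

# Illustrative two-modality contrastive loss: image-to-genetics plus
# genetics-to-image, each an InfoNCE term over a batch of N individuals.
def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(z_img, z_gen, tau=0.1, lam=0.5):
    """z_img, z_gen: (N, d) projected embeddings for the same N individuals."""
    z_img, z_gen = l2_normalize(z_img), l2_normalize(z_gen)
    logits = z_img @ z_gen.T / tau  # pairwise cosine similarities / temperature

    def nll(mat):
        # mean negative log-softmax of the diagonal (the positive pairs)
        mat = mat - mat.max(axis=1, keepdims=True)
        logp = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return lam * nll(logits) + (1 - lam) * nll(logits.T)

rng = np.random.default_rng(0)
z_i = rng.normal(size=(8, 16))
z_g = rng.normal(size=(8, 16))
loss_random = contrastive_loss(z_i, z_g)     # unrelated embeddings: high loss
loss_aligned = contrastive_loss(z_i, z_i)    # perfectly aligned: low loss
```

Perfectly aligned modalities give a much lower loss than random pairings, which is exactly the pressure that pulls each individual's modalities together in the embedding space.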

Generalizing to Multiple Genetic Modalities.

We generalize the above contrastive loss to the case where multiple genetic modalities are available for the same image sample. Since we mainly aim to improve the learned visual representations, the image modality is placed at the center of this training scheme (we leave alternative contrasting schemes to future work). In other words, we contrast the image with each of the $M$ genetic modalities, and the generalized multimodal contrastive loss becomes:

$$\mathcal{L}_{\mathrm{multi}} = \sum_{m=1}^{M} \mathcal{L}(I, G_m) \quad (2)$$

where $\mathcal{L}(I, G_m)$ denotes the two-modality loss of Eq. 1 applied to the image modality and the $m$-th genetic modality. This formulation ensures that the learned visual representations capture useful information from all available genetic modalities. However, it assumes that every individual has all the genetic modalities, which is normally not the case. Hence, we define two aggregation schemes to handle missing genetic modalities: i) the "inner" aggregation scheme, which uses only those individuals for which all modalities exist, and ii) the "outer" aggregation scheme, which covers all individuals, even those with missing genetic modalities. In particular, for each modality $m$ in Eq. 2, the "outer" aggregation only includes individuals with non-missing data for that specific modality. The "outer" scheme can thus make better use of all available data. Both schemes allow training on combinations of existing modalities.
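The "outer" aggregation can be sketched as follows, assuming a two-modality loss function is available; the masking logic and all names are illustrative:

```python
import numpy as np

# Sketch of the "outer" aggregation: for each genetic modality, the
# image-genetics contrast includes only the individuals with non-missing
# data for that modality. loss_fn stands in for a two-modality loss.
def multimodal_loss(z_img, z_gens, masks, loss_fn):
    """z_img: (N, d) image embeddings; z_gens: list of (N, d) genetic
    embeddings; masks: list of (N,) booleans marking available data."""
    total, n_terms = 0.0, 0
    for z_g, mask in zip(z_gens, masks):
        if mask.sum() < 2:  # a contrastive term needs at least one negative
            continue
        total += loss_fn(z_img[mask], z_g[mask])
        n_terms += 1
    return total / max(n_terms, 1)

# Toy usage with a stub loss that just counts the included individuals.
stub = lambda a, b: float(len(a))
z = np.zeros((4, 2))
masks = [np.array([True, True, True, True]),
         np.array([True, True, False, False])]
avg = multimodal_loss(z, [z, z], masks, stub)
```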

3.3 Genetic Features Explanation

For a given multimodal tuple of image and genetic representations, we perform feature explanations to understand the contribution of each genetic feature to the model output. Standard deep learning explainability approaches are not directly applicable in this setting, as they require a simple one-to-one relation from input to output, while the contrastive loss of Eq. 2 is computed over batches. Instead, we utilize a fixed reference batch of individuals with images and genetic modalities, and define the explainer function as the loss of Eq. 2 with all batch entries except the instance to be explained held fixed to this reference batch. We can then use standard feature attribution methods such as Integrated Gradients [integrated_gradients] or DeepLift [deep_lift] to explain the contribution of all input elements towards the full batch loss. We can additionally fix the input image to consider only the attribution of the genetic effects. Note that the explanation is sensitive to the choice of the reference batch; to minimize this effect, we choose the reference batch to be relatively large (1,000 individuals in our experiments).

In addition to these local instance-specific attributions, we are especially interested in understanding the behavior of our models globally. For this, we aggregate many individual explanations, all using the same (independent) reference batch. Feature importance both in negative and positive direction is important in our setting, and therefore we consider the mean absolute value for each feature dimension as a measure of global attribution.
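As an illustration of both steps – instance-level attributions and their global aggregation – here is a toy NumPy sketch of Integrated Gradients against a fixed baseline, using a simple linear stand-in model rather than the actual contrastive explainer:

```python
import numpy as np

# Toy sketch of Integrated Gradients plus the mean-absolute global
# aggregation described above. grad_fn stands in for the gradient of the
# explainer function; here we use a linear model with constant gradient.
def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Approximate IG: (x - baseline) times the path-averaged gradient."""
    alphas = np.linspace(0.0, 1.0, steps)
    path_grads = np.array([grad_fn(baseline + a * (x - baseline))
                           for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

def global_attribution(attributions):
    """attributions: (n_samples, n_features) -> mean |attribution| per feature."""
    return np.abs(attributions).mean(axis=0)

# For f(x) = w . x the gradient is constant, so IG equals (x - baseline) * w.
w = np.array([1.0, -2.0, 0.0])
ig = integrated_gradients(lambda x: w, np.array([1.0, 1.0, 1.0]), np.zeros(3))
glob = global_attribution(np.array([[1.0, -2.0, 0.0],
                                    [-1.0, 0.0, 0.0]]))
```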

The setting with missing values can be handled analogously to the "outer" aggregation scheme in Sec. 3.2, by simply omitting the respective modalities.

4 Experimental Results

In this section, we present the evaluation results of our method. First, we detail the datasets used for pretraining and evaluation (Sec. 4.1). Then, we assess the quality of the learned representations (Sec. 4.2) by: i) fine-tuning (i.e. transfer learning) on four downstream tasks, and ii) performing a genome-wide association study (GWAS) on the model features. Finally, we present the genetic feature explanation results (Sec. 4.3) and check the relevance of the findings against the medical literature.

4.1 Datasets

We pretrain our models (and the unsupervised baselines) on data obtained from the UK Biobank (UKB) dataset [ukbio]. This dataset contains multimodal data for almost 500k individuals, although imaging data is only available for a subset of those. The UKB consists to an overwhelming majority of individuals of European descent; we restrict our dataset to these individuals, as including more diverse ancestries would likely introduce very large confounding effects [lippert2011fast]. For pretraining, we utilize the retinal fundus images of both left and right eyes. The genetic modalities we employ (see Sec. 3.1) comprise raw-SNP samples (available for all individuals), PGS samples, and burden scores. For the raw SNPs, we uniformly subsample the genotyped SNPs across the autosomal chromosomes (excluding the X and Y chromosomes). For PGS, we used 481 scores for a wide variety of different traits, downloaded from the PGS Catalog [lambert2021polygenic]. We created burden scores for protein-coding genes [monti2021identifying]; these binary scores indicate whether a participant has at least one potentially damaging rare variant within a given gene. We hold out a test split from the UKB dataset and use the remaining data for training and validation. Each person appears in only one split.

For the downstream tasks, we employ: i) the APTOS 2019 Blindness Detection dataset [APTOS] for diabetic retinopathy detection (Sec. 4.2.1); ii) the Retinal Fundus Multi-disease Image Dataset (RFMiD) [rfmid] for disease classification (Sec. 4.2.2); iii) images from the UKB [ukbio] training split, for which we now predict cardiovascular risk factors (Sec. 4.2.4); and iv) the Pathologic Myopia challenge dataset (PALM) [palm] for pathological myopia segmentation (Sec. 4.2.3), which provides images with segmentation masks. More dataset details are in the appendix.

4.2 Transfer Learning Results

In this section, we evaluate the quality of the learned representations by fine-tuning on downstream tasks. We find that a linear evaluation protocol [CPC1, simclr] (encoder weights kept frozen) behaves similarly to fine-tuning; see the appendix.

Models & architectures.

Across the following experiments, we employ a ResNet50 [resnet] architecture as the encoder for image data (the image encoder in Fig. 2). For the genetic encoders, we vary the number of fully connected hidden layers: none ("None"), one hidden layer ("H1"), or two hidden layers ("H12"). We also vary the combination of genetic modalities (detailed in Sec. 3.1) used in pretraining, along with the modality aggregation schemes (explained in Sec. 3.2).


We compare to the following baselines:

  • Training from scratch (Baseline): we train the same model on each downstream task, but initialized with random weights. Comparing to this baseline provides insight into the benefits of pretraining.

  • State-of-the-art contrastive methods: we compare to self-supervised (contrastive) methods from literature by training on the same data splits, and using the same experimental setup. Namely, we compare to models pretrained with SimCLR [simclr], BYOL [byol], Barlow Twins [barlow], SimSiam [simsiam], and NNCLR [nnclr].

| Model | Genetics Encoder | APTOS (QwKappa ↑) | RFMiD (ROC-AUC ↑) | PALM (Dice ↑) | Cardio. Risk (MSE ↓) | Cardio. Risk (ROC-AUC ↑) |
|---|---|---|---|---|---|---|
| Baseline | – | 80.47 | 91.64 | 77.25 | 3.440 | 56.29 |
| SimCLR [simclr] | – | 81.83 | 91.88 | 70.41 | 3.451 | 59.38 |
| SimSiam [simsiam] | – | 75.44 | 91.28 | 72.26 | 3.442 | 57.37 |
| BYOL [byol] | – | 71.09 | 89.88 | 66.32 | 3.414 | 59.73 |
| Barlow Twins [barlow] | – | 72.28 | 92.03 | 70.53 | 3.430 | 59.05 |
| NNCLR [nnclr] | – | 77.93 | 91.89 | 72.06 | 3.426 | 61.95 |
| ContIG (Raw-SNP) | H1 | 84.01 | 93.22 | 76.98 | 3.254 | 70.10 |
| ContIG (PGS) | H1 | 85.93 | 93.31 | 78.47 | 3.176 | 72.72 |
| ContIG (Burden) | H1 | 83.22 | 93.03 | 76.49 | 3.160 | 72.37 |
| ContIG (Inner RPB) | H1 | 81.52 | 92.95 | 77.34 | 3.202 | 70.80 |
| ContIG (Outer RPB) | H1 | 84.22 | 93.62 | 76.97 | 3.187 | 71.80 |

Table 1: Downstream evaluation results from fine-tuning on each task. RPB denotes the genetic modalities used: Raw-SNPs, PGS scores, and Burden scores. ↑ means higher is better; ↓ means lower is better.

4.2.1 Diabetic Retinopathy Detection (APTOS)

Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-age adults. The APTOS dataset [APTOS] contains 2D fundus images, rated by a clinician on a severity scale of 0 to 4; these levels define a five-way classification task. We fine-tune the image encoder of our models and the baselines on this dataset and evaluate on a fixed test split. The metric used for this task, as in the official Kaggle challenge, is the Quadratic Weighted Kappa (QwKappa [qwkappa]), which measures the agreement between two ratings: its values range from random agreement (0) to complete agreement (1), and can become negative when there is less agreement than expected by chance. The evaluation results in Tab. 1 support the effectiveness of our proposed contrastive method (ContIG): our pretrained models outperform all baselines on this task, demonstrating the quality of the learned representations.
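For reference, the Quadratic Weighted Kappa can be computed from the observed confusion matrix O, the expected matrix E under independent marginals, and quadratic disagreement weights W. A small self-contained sketch (equivalent in spirit to sklearn's `cohen_kappa_score` with `weights='quadratic'`):

```python
import numpy as np

# Quadratic Weighted Kappa: 1 - sum(W*O) / sum(W*E), where W penalizes
# rating disagreements quadratically by their distance.
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    W = np.array([[(i - j) ** 2 for j in range(n_classes)]
                  for i in range(n_classes)]) / (n_classes - 1) ** 2
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()

k_perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)
k_reversed = quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0], 5)
```

Perfect agreement yields 1, while systematically reversed ratings score below 0, illustrating the metric's range.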

4.2.2 Retinal Fundus Disease Classification (RFMiD)

The Retinal Fundus Multi-disease Image Dataset (RFMiD) [rfmid] also contains 2D fundus images, captured using three different cameras. It has 46 class labels representing disease conditions, annotated through adjudicated consensus of two experts. To evaluate on this task, we fine-tune the image encoders on this dataset and measure performance on the test set. Note that this is a multi-label classification task, since patients may have multiple conditions at the same time. As an evaluation metric, we compute the area under the ROC curve (ROC-AUC) with a micro averaging scheme [micro_roc_auc]. The results for this task in Tab. 1 again demonstrate the performance gains obtained by pretraining with ContIG; our models outperform the self-supervised baselines on this task as well.
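Micro averaging for multi-label ROC-AUC flattens all (label, score) pairs into a single binary problem before computing the AUC. A minimal sketch (sklearn's `roc_auc_score(..., average='micro')` behaves analogously); names are ours:

```python
import numpy as np

# Binary AUC via pairwise comparison of positive and negative scores
# (suitable for small examples; rank-based methods scale better).
def binary_auc(y_true, scores):
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def micro_roc_auc(Y, S):
    """Y: (n_samples, n_labels) binary labels; S: same-shape scores."""
    return binary_auc(Y.ravel(), S.ravel())

Y = np.array([[1, 0], [0, 1]])
S = np.array([[0.9, 0.1], [0.2, 0.8]])
auc = micro_roc_auc(Y, S)
```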

4.2.3 Pathological Myopia Segmentation (PALM)

Myopia has become a global public health burden. Pathologic myopia causes irreversible visual impairment, which can be detected through the changes it causes in the optic disc, including peripapillary atrophy, tilting, and others. The PALM dataset [palm] contains segmentation masks for these lesions, from which we evaluate on disc and atrophy segmentation tasks. As in the downstream tasks above, we fine-tune the image encoder on this dataset and evaluate on the test split. To predict segmentation masks, we add a U-Net decoder [UNET] on top of the ResNet50 encoder. As the evaluation metric, we use the Dice score [dice_score]. The results for this task in Tab. 1 demonstrate the quality of the representations learned by ContIG for semantic segmentation.
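The Dice score for binary masks is twice the intersection divided by the sum of the mask sizes; a minimal sketch:

```python
import numpy as np

# Dice score for binary segmentation masks: 2|A ∩ B| / (|A| + |B|).
def dice_score(pred, target):
    """pred, target: binary masks of identical shape."""
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * intersection / denom if denom else 1.0

a = np.array([[1, 1], [0, 0]])
b = np.array([[1, 0], [0, 0]])
d = dice_score(a, b)  # one pixel overlaps out of 2 + 1 predicted/true pixels
```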

4.2.4 Cardiovascular Risk Prediction

Previous work has shown that retinal fundus images can predict a range of risk factors for cardiovascular disease [google_cardio_prediction]. In particular, retinal fundus images have been found to carry information about age, sex, smoking status, systolic and diastolic blood pressure (SBP, DBP), and body mass index (BMI). We predict these six risk factors using a subset of the UK Biobank [ukbio] dataset, by fine-tuning the image encoder on these values. As evaluation metrics, we use the Mean Squared Error (MSE) for the numerical factors (age, BMI, SBP, DBP) and the ROC-AUC for the categorical factors (sex and smoking status). As Tab. 1 shows, models pretrained with ContIG outperform the baseline models on both the regression and classification targets.

4.2.5 Genome-wide Association Study Results

A GWAS is a statistical analysis that correlates individual genetic markers sampled along the full genome with a trait of interest, such as a specific disease. GWASs usually require a low-dimensional, well-defined trait for association analysis; there is still little work on leveraging full medical imaging data in a GWAS setting [ash2021joint, transferGWAS]. Here, we follow the transferGWAS [transferGWAS] framework to evaluate the embeddings learned by ContIG. In this framework, images are projected onto their latent-space embeddings, whose dimensionality is then further reduced with a Principal Component Analysis (PCA). These low-dimensional image representations can then be efficiently associated with SNPs using statistical association tools such as PLINK [purcell2007plink, chang2015second]. To compare different training methods, we count how many independent genetic regions each method finds; a more expressive image representation is expected to find more associated regions. We defer the analysis details to the appendix.
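The embed-then-reduce step of this framework can be sketched as follows. This is a minimal illustration with synthetic data standing in for real image embeddings and genotypes; the final association step is shown as an ordinary least-squares fit rather than a PLINK run, and all variable names are our own.

```python
import numpy as np

# Sketch of the transferGWAS-style pipeline: encoder embeddings are
# reduced with PCA, then each retained dimension is associated with a
# SNP. Synthetic stand-in data; real analyses use PLINK for step 3.

rng = np.random.default_rng(0)
n_individuals, embed_dim, n_pcs = 500, 128, 10

# 1) Image embeddings from the trained encoder (random stand-in here).
embeddings = rng.normal(size=(n_individuals, embed_dim))

# 2) Reduce dimensionality with PCA (via SVD on centered data).
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:n_pcs].T          # (n_individuals, n_pcs)

# 3) Associate one SNP (additive 0/1/2 coding) with each PCA dimension
#    via a per-dimension linear model (intercept + SNP effect).
snp = rng.integers(0, 3, size=n_individuals).astype(float)
design = np.column_stack([np.ones(n_individuals), snp])
betas = [np.linalg.lstsq(design, reduced[:, d], rcond=None)[0][1]
         for d in range(n_pcs)]

print(len(betas))  # 10 -- one effect size per retained PCA dimension
```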

Tab. 2 shows the number of independent regions found for each pretraining method. Genetic pretraining increases the statistical power of the genetic association study considerably: only BYOL [byol] achieves near-competitive results, and all other self-supervised methods are outperformed by a large margin. We also looked up the found regions in the GWAS catalog [gwas_cat] of published association results. Many of the regions were already known to be associated with skin pigmentation. This is not surprising, as the retina is itself pigmented, which in turn is likely correlated with actual skin pigmentation. Besides pigmentation, the GWAS catalog records associations with an array of cardiovascular traits (such as BMI, pulse pressure, large artery stroke, and blood biomarkers), as well as eye-specific associations (cataract and astigmatism). Similar results were found by [transferGWAS], albeit with a larger sample size.

Model | Found Regions
SimCLR [simclr] | 4
SimSiam [simsiam] | 2
BYOL [byol] | 17
Barlow Twins [barlow] | 8
NNCLR [nnclr] | 3
ContIG (Raw-SNP) | 16
ContIG (PGS) | 20
ContIG (Burden) | 19
ContIG (Inner RPB) | 22
ContIG (Outer RPB) | 18
Table 2: GWAS results. Shown is the number of independent regions associated with the image embeddings for each model.

4.3 Genetic Feature Explanation Results

In this section, we investigate the representations learned by ContIG using the explanation methods developed in Sec. 3.3. We first validate that our explanation approach can distinguish meaningful features from noise features (see appendix). Next, we analyze the models trained with a single genetic modality. Fig. 3 shows the 30 PGS with the strongest attributions, aggregated over 1,000 examples with a reference batch of size 1,000. The most important features are different kinds of skin cancers (basal & squamous cell carcinoma, cutaneous melanoma, and melanoma). This can be explained by the fact that the retina is pigmented and skin pigmentation is highly correlated with skin cancer.

Figure 3: Global explanations for genetic features in ContIG (PGS only). Recorded is the mean absolute attribution per feature, aggregated over 1000 individuals, and the 30 PGS with highest associations are shown. Repeated traits (e.g. Melanoma) are due to multiple different risk scores published in the PGS catalog.

Besides that, glaucoma, which is a disease of the optic nerve, is a highly relevant PGS, and many of the other traits are linked to cardiovascular functions (abnormal EKG, HDL cholesterol, blood protein measurements, QT interval), smoking status (lung adenocarcinoma, FEV/FEC ratio, response to bronchodilator) and liver and kidney function (triglyceride & serum urea measurements). This is in line with previous studies which found strong signals with similar biomarkers in retinal fundus images [google_cardio_prediction]. Interestingly, ContIG also finds correlations with neurological conditions such as Parkinson’s disease and autism, which have previously been linked to retinal changes as well [satue2014retinal, gialloreti2014reduction].

Similarly, among the 15 strongest associations for raw SNPs, these SNPs were previously associated with cardiovascular traits (rs10807207, rs228416, rs1886785, rs10415889, rs3851381), pigmentation (rs228416), neurological and psychological conditions (rs1886785, rs1738895, rs6533374), and smoking status (rs6533374).

Figure 4: Local explanation attributions (signed) of genetic features for one image-PGS pair. Only the risk scores with highest values in either positive direction are shown. Retinal fundus image reproduced by kind permission of UK Biobank ©.

In addition to the global attributions, Fig. 4 shows the local attributions for one image/PGS pairing. The retinal fundus image shows strong signs of vascular tortuosity, a known and important biomarker for cardiovascular conditions [cheung2011retinal]. Analogously, for this instance there is a large number of PGSs very strongly related to cardiovascular health (insulin resistance, many blood biomarkers, type II diabetes, Brugada syndrome, thromboembolism).

These local and global explanations together provide further evidence that self-supervised pretraining with ContIG is able to learn semantically meaningful image representations without the need for manual annotations. We provide additional explanatory results in the appendix.

5 Discussion & Limitations

We presented ContIG, a self-supervised representation learning algorithm for imaging-genetics datasets. Our evaluation results show that including genetic information in the pretraining process can considerably boost the performance of image models in a variety of downstream tasks relevant for clinical practice and genetic research. We additionally conjecture that the self-supervised baseline methods' reliance on image augmentations alone may be disadvantageous in medical applications, because medical images are more uniform (e.g. in their color distributions) than natural images. We also leveraged interpretability methods to understand the relationship between the imaging and genetic modalities in more detail, and found interesting associations.

Naturally, there are a number of limitations to our proposed approach. First, ContIG requires datasets that capture both imaging and genetics data, and is thus not applicable to pure-imaging datasets. In recent years, however, an increasing number of imaging-genetics studies have started, and proprietary datasets of joint imaging and genetics data are available in some large-scale health systems. With the decreasing prices of both imaging and genotyping technology, this trend is likely to continue. A second limitation lies in the potentially limited application fields of our method: ContIG is not applicable to standard natural images, as there are no corresponding genetic features. On the other hand, large-scale biobanks often include multiple imaging modalities, such as different MRI sequences and histopathology images, to which ContIG naturally extends. Our method is also applicable to imaging-genetics applications in livestock and plant breeding, and may be useful in basic science studies.

Unfortunately, most large-scale imaging-genetics datasets to date have been collected in European and North American countries. One limitation of the presented results is therefore that the UKB consists mostly of individuals of European ancestry and may carry a biased representation. We have shown that ContIG nevertheless improves downstream performance in other populations, e.g. APTOS (collected in India), RFMiD (collected in India), and PALM (collected in China). We leave extending ContIG to other medical imaging datasets and genetic populations to future work.


Appendix A Training & Implementation Details

a.1 Imaging Preprocessing

a.1.1 Image Quality Control

The UK Biobank contains a relatively large number of retinal fundus images with bad quality (e.g. completely black or extremely overexposed). To filter out extreme outliers, we performed two steps of quality control. First, we only included images in which a simple circle-detection algorithm [illingworth1987adaptive] could find a circle. In the second step, we filtered out the brightest and the darkest of the remaining images.

a.1.2 Image transformations

We cropped images to the circles detected in Sec. A.1.1 and rescaled to pixels. During training, we randomly transform images by a rotation of up to and flip the image horizontally with a probability. We also follow the common practice of normalizing (standardizing) all the image intensities using the mean and standard deviation from ImageNet [imagenet_cvpr09].

a.2 Genetics Preprocessing

In all our experiments we used the genetic data provided by the UK Biobank. The three different genetic modalities require different preprocessing steps, which we detail in this section.

a.2.1 Raw SNPs

The raw SNPs are a cross-section of SNPs measured on microarray chips, comprising approximately 800k genetic variants in total across all chromosomes.

The individual SNPs are coded in additive format, i.e. 0 stands for no deviation from the reference genome, 1 means that one of the two chromosome copies carries a deviation and the other does not, and 2 means that both chromosome copies deviate from the reference genome. We treated SNPs as continuous variables (as opposed to, e.g., separating them into three classes each) and imputed missing values by mode imputation. Since 800k feature dimensions are challenging to handle, and SNPs are highly spatially correlated along the genome [reich2001linkage], we only sampled every 100-th SNP from the full microarray. We also only included SNPs on the 22 autosomal (i.e. non-sex-specific) chromosomes, as handling sex chromosomes requires special statistical care and leads to non-shared features between genetic males and females. Together, this means we include 7,854 SNPs in our models.
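The preprocessing described above (additive 0/1/2 coding, mode imputation of missing calls, every-100-th-SNP subsampling) can be sketched as follows. The data is synthetic and the variable names are ours, not from the released code.

```python
import numpy as np

# Sketch of raw-SNP preprocessing: additive genotypes with missing
# calls, per-SNP mode imputation, and keeping every 100-th SNP.

rng = np.random.default_rng(0)
n_individuals, n_snps = 100, 1000

# Additive genotypes (0/1/2) with ~1% of calls missing (NaN).
genotypes = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)
genotypes[rng.random(genotypes.shape) < 0.01] = np.nan

# Mode imputation, computed per SNP over the observed values.
for j in range(n_snps):
    col = genotypes[:, j]                  # view into the matrix
    observed = col[~np.isnan(col)]
    values, counts = np.unique(observed, return_counts=True)
    col[np.isnan(col)] = values[np.argmax(counts)]

# Subsample every 100-th SNP to reduce dimensionality.
subsampled = genotypes[:, ::100]
print(subsampled.shape)  # (100, 10)
```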

a.2.2 Polygenic Risk Scores

For computing polygenic risk scores, we downloaded all PGS weight files included in the PGS Catalog [lambert2021polygenic] (last accessed October 11, 2021; at the time of writing, a large batch of new scores has since been added to the catalog), a collection of published PGS. The PGS files provide weights for a linear model that computes risk scores from the raw genetic data. To obtain a large intersection between the SNPs available for our UKB population and the weights provided by the PGS catalog, we used imputed data instead of the raw microarray data from Sec. A.2.1. Imputation uses prior knowledge about correlations between SNPs collected and not collected on the respective microarray ("linkage disequilibrium", LD) to infer the missing features with high accuracy. Imputed data was precomputed by the UKB. Using the imputed data, we computed 481 polygenic scores for our cohort with the PLINK software [purcell2007plink], ignoring scores that gave errors or that only recorded genome positions in a different reference genome build.

For some traits, there are multiple distinct risk scores in the PGS catalog, as multiple independent studies have been performed on the same trait. For example, the trait "melanoma" appears 9 times in our subset of selected PGS scores, while other traits, such as "insomnia", appear only once. The scores contain partially overlapping genetic markers, and the number of SNPs used for each individual score varies from only 1 to several million.
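Conceptually, each polygenic score is a linear combination of genotype dosages with the published per-SNP weights; the toy sketch below shows that computation (in practice, PLINK's scoring functionality performs this over up to millions of variants, with handling for allele matching and missingness that is omitted here).

```python
import numpy as np

# Toy sketch of polygenic score computation: a PGS is the dot product
# of per-individual genotype dosages with published effect sizes.
# Synthetic data; real weights come from PGS Catalog score files.

rng = np.random.default_rng(0)
n_individuals, n_snps = 50, 200

dosages = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)
weights = rng.normal(size=n_snps)    # per-SNP effect sizes from a PGS file

pgs = dosages @ weights              # one risk score per individual
print(pgs.shape)  # (50,)
```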

a.2.3 Burden Scores

We ran the Functional Annotation and Association Testing Pipeline [monti2021identifying] to functionally annotate all genetic variants present in the UK Biobank 200k exome sequencing release [szustakowski2021advancing]. Protein loss-of-function variants and missense variants predicted to be damaging were used to construct burden scores across all protein-coding genes. We considered only rare variants with minor allele frequencies below 1%; of these variants, 41% were "singletons", i.e. observed only once in our sample. Specifically, each participant was assigned a binary vector of length 18,574, corresponding to the number of protein-coding genes. For every gene, the entry in this vector is 1 if the participant harbored at least one potentially damaging variant in that gene, and 0 otherwise. This coding has been applied in rare-variant association studies to aggregate the effects of many rare variants within genes, where it can boost statistical power and reduce the burden of multiple testing [burden_test, monti2021identifying].

a.3 Downstream Datasets Preprocessing

a.3.1 Diabetic Retinopathy detection (APTOS)

In this task we use the APTOS 2019 Blindness Detection [APTOS] dataset, which has retinal fundus training samples. As explained in the main paper, the labels in this dataset have five levels of disease severity, defining five classes. However, these classes are not mutually exclusive: a severity level of e.g. four implies the characteristics of level three and below. Hence, we employ a multi-hot encoding scheme for the labels; for instance, class three is encoded as [1,1,1,0,0] and class two as [1,1,0,0,0]. We split the dataset into training (60%), validation (20%), and test (20%) splits, with no overlap of patients across splits.

a.3.2 Retinal Fundus Disease Classification (RFMiD)

For this task, we use the Retinal Fundus Multi-disease Image Dataset (RFMiD) [rfmid], which has images. The overall number of disease classes is 45; however, two classes ("HR" and "ODPM") have no positive cases, so we exclude them and work with the remaining 43 classes. As mentioned before, we convert these classes to multi-hot labels and solve the task as multilabel classification. We use this dataset's official training, validation, and test splits.

a.3.3 Pathological Myopia Segmentation (PALM)

We use the Pathologic Myopia challenge dataset [palm] for this task, which has 400 image samples with segmentation masks. As for segmentation labels, this dataset has three annotated areas: i) peripapillary atrophy (available for 311 cases), ii) optic disc (available for all cases), and iii) detachment (available for 12 cases only). Given that detachment is rarely available, we omit it from this task and only predict the atrophy and disc classes. We stratify the patients using the atrophy labels, to ensure equal representation of classes in train (60% of dataset size) / val (20%) / test (20%) splits.

a.3.4 Cardiovascular Risk Prediction (UKB)

To predict the cardiovascular risk factors (sex, age, BMI, SBP, DBP, smoking status) from retinal fundus scans, we use images from the UKB [ukbio]. This corresponds to the training split ( of the UKB dataset size); we use the remaining scans for the validation ( of dataset size) and test ( ) splits. Each person only appears in one split. The training for this task uses two models: i) one model to classify the categorical labels (sex and smoking status, each as binary labels {0,1}), and ii) a second model to predict, as a regression task, the remaining continuous variables (age, BMI, SBP, and DBP). We use two models because the loss values of the two tasks have different scales. We preprocess the continuous factors by standardization (removing the mean and scaling to unit variance). Finally, we impute missing values using the mean for continuous factors and the median for discrete factors.
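The target preprocessing for the regression model can be sketched as follows, with synthetic data standing in for the real risk-factor values; this illustrates mean imputation and standardization for one continuous factor (median imputation would be used for the discrete factors).

```python
import numpy as np

# Sketch of regression-target preprocessing: mean-impute missing
# values of a continuous risk factor, then standardize to zero mean
# and unit variance. Synthetic data, illustration only.

rng = np.random.default_rng(0)
age = rng.normal(55, 8, size=200)
age[rng.random(200) < 0.05] = np.nan       # simulate missing entries

# Mean imputation over the observed values.
age[np.isnan(age)] = np.nanmean(age)

# Standardize: remove the mean, scale to unit variance.
age_std = (age - age.mean()) / age.std()
print(age_std.shape)  # (200,) -- mean ~0, std ~1
```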

a.4 Training Details

We provide the training details for all pretraining (self-supervised) and downstream tasks in this section.


  • Batch sizes: we use a unified batch size of 64 across all pretraining and downstream tasks.

  • Optimizers: we use the Adam optimizer [Adam_supp] in all pretraining and downstream tasks.

  • Schedulers: during self-supervised pretraining (with ContIG and the baselines), we decay the learning rate with the cosine decay schedule without restarts [cosine_annealing].

  • Learning rates: we use an initial learning rate of across all tasks. However, in the PALM semantic segmentation task we reduce the learning rate to after 10 warmup epochs.

  • Weight decay: in pretraining tasks, we use a weight decay factor of . In downstream tasks, we use a weight decay factor of .

  • Number of epochs: in pretraining tasks, we train all models for 100 epochs. In downstream tasks, we fine-tune for:


    • For the PALM, APTOS, and RFMiD tasks: we train all models for 50 epochs.

    • For Cardiovascular risk prediction tasks: we fine-tune all models for 5 epochs ( steps).

  • Network architectures: for the image encoder, as mentioned before, we use a ResNet50 [resnet] architecture across all pretraining and downstream tasks. For the genetics encoders, we vary between the following choices:


    • None: here we do not have any hidden fully-connected layer for the genetics, and we feed them as inputs to the projection head directly.

    • H1: we process the genetic inputs with one hidden layer of size 2048, followed by a ReLU activation and a BatchNorm1d layer.

    • H12: we process the genetics with two hidden layers, both of size 2048, each followed by a ReLU activation and a BatchNorm1d layer.

    For the projection head, we follow [simclr] in using two fully-connected layers. The first has a size of 2048 and is followed by a ReLU. The second has a size of 128, which is the projection embedding size. Finally, for classification and regression downstream tasks we add one fully-connected linear layer on top to perform the task. For the PALM segmentation task, however, we add a U-Net [UNET] decoder on top of the ResNet50 encoder. For the upsampling layers in the decoder, we use transposed convolutions (ConvTranspose2d).

  • Loss functions: the used loss functions for each task are as follows:


    • ContIG: for training our method, we use a contrastive loss (NTXentLoss). This loss is implemented via a cross-entropy loss, where the model is trained to identify the positive sample in each mini-batch. However, our version of the NTXentLoss only contrasts inter-modally, not intra-modally. We set in this loss (Eq. 1 in the main paper), and the temperature . Note that a larger value of gives more importance to image features than to genetic features.

    • APTOS & RFMiD: we use the binary cross-entropy loss in both tasks.

    • PALM: we use a weighted combined loss of Dice-loss [dice_score] (weight=0.8) and binary cross-entropy (weight=0.2).

    • Cardiovascular risk classification (sex & smoking status): we use a binary cross-entropy loss.

    • Cardiovascular risk prediction (age & BMI & SBP & DBP): we use the Mean Square Error (MSE) loss.

    • SimCLR [simclr]: this method uses the contrastive NTXentLoss too. We similarly set the temperature .

    • NNCLR [nnclr]: this method uses the contrastive NTXentLoss too. We similarly set the temperature .

    • Simsiam [simsiam]: this method does not use negative sampling; instead, it uses a Siamese network to maximize the similarity between two augmented views of the same image. Hence, the loss function used is the negative cosine similarity.

    • BYOL [byol]: this method has the same loss used in Simsiam, which is the negative cosine similarity.

    • Barlow Twins [barlow]: this method modifies the contrastive objective by computing the cross-correlation matrix between the embeddings of two differently augmented views of the same batch of images, and driving this matrix towards the identity matrix.
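As an illustration of the inter-modal contrastive objective used by ContIG, here is a NumPy sketch. The actual implementation uses PyTorch; the weighting `lam` and the temperature below are placeholder values, not the paper's settings, and the function name is ours.

```python
import numpy as np

# NumPy sketch of an inter-modal NT-Xent loss: the projected image and
# genetics embeddings of the same individual form the positive pair,
# and only cross-modal pairs serve as negatives (no intra-modal
# contrasting), as described in the text.

def inter_modal_ntxent(z_img, z_gen, temperature=0.1, lam=0.5):
    # L2-normalize both sets of embeddings.
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_gen = z_gen / np.linalg.norm(z_gen, axis=1, keepdims=True)
    # Cross-modal similarity matrix; the diagonal holds the positives.
    logits = z_img @ z_gen.T / temperature
    n = logits.shape[0]

    def cross_entropy(mat):
        mat = mat - mat.max(axis=1, keepdims=True)  # numerical stability
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Image-to-genetics and genetics-to-image terms, weighted by lam.
    return lam * cross_entropy(logits) + (1 - lam) * cross_entropy(logits.T)

rng = np.random.default_rng(0)
loss = inter_modal_ntxent(rng.normal(size=(8, 128)), rng.normal(size=(8, 128)))
print(loss > 0)  # True
```

Perfectly aligned embeddings (identical image and genetics projections) drive this loss toward zero, while unrelated embeddings yield a loss near log of the batch size.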

a.5 Implementation Details

We implement all of our methods using Python. The libraries we rely on are PyTorch v1.9.1, Pytorch-Lightning v1.4.8, torchvision v0.10.0, torchmetrics v0.4.0, and Lightly [lightly] (for baseline self-supervised implementations). We also follow the reproducibility instructions for Pytorch-Lightning [reproducibility], i.e. by setting a unified random seed of for all scripts and workers, and by using deterministic algorithms. We attach our source code with this supplementary material submission.

Appendix B Additional Downstream Results

b.1 Complete Finetuning Results

In this section, we present the full set of results for fine-tuning our ContIG models against the baselines. These extended evaluation results, in Tab. 3, show that ContIG is advantageous compared to the baselines. The rows in Tab. 3 are grouped in the following order: i) baseline trained from scratch, ii) self-supervised baselines, iii) ContIG trained on single genetic modalities with the images, and iv) ContIG trained on multiple genetic modalities with images.

b.2 Linear Evaluation Results

In this section, we follow a linear evaluation protocol [color, CPC1, simclr]: the encoder weights are kept frozen and we only train a linear classifier / regressor on top. Linear evaluation gauges the quality of the semantic representations stored in the frozen encoder. As shown in Tab. 4, models trained with our method ContIG consistently outperform the baselines.
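The linear evaluation protocol can be sketched as follows, with a fixed random projection standing in for the frozen trained encoder and a least-squares fit standing in for the gradient-trained linear head; the real evaluation uses the pretrained ResNet50 features.

```python
import numpy as np

# Sketch of linear evaluation: the encoder is frozen, and only a
# linear head is trained on its output features. A random projection
# plus ReLU stands in for the frozen encoder here.

rng = np.random.default_rng(0)
n, d_in, d_feat = 300, 64, 32

x = rng.normal(size=(n, d_in))
frozen_w = rng.normal(size=(d_in, d_feat))   # frozen "encoder" weights
features = np.maximum(x @ frozen_w, 0)       # frozen forward pass (no updates)

y = rng.integers(0, 2, size=n).astype(float)
# Train only the linear head on the frozen features.
head, *_ = np.linalg.lstsq(features, y, rcond=None)
preds = (features @ head) > 0.5
print(preds.shape)  # (300,)
```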

Model | Genetics Encoder | APTOS (QwKappa) | RFMiD (ROC-AUC) | PALM (Dice-Score) | Cardio. (MSE) | Cardio. (ROC-AUC)
Baseline - 80.47 91.64 77.25 3.440 56.29
SimCLR [simclr] - 81.83 91.88 70.41 3.451 59.38
SimSiam [simsiam] - 75.44 91.28 72.26 3.442 57.37
BYOL [byol] - 71.09 89.88 66.32 3.414 59.73
Barlow Twins [barlow] - 72.28 92.03 70.53 3.430 59.05
NNCLR [nnclr] - 77.93 91.89 72.06 3.426 61.95
ContIG (Raw-SNP) None 81.99 92.27 74.96 3.366 64.71
ContIG (Raw-SNP) H1 84.01 93.22 76.98 3.254 70.10
ContIG (Raw-SNP) H12 82.56 93.09 77.02 3.201 69.58
ContIG (PGS) None 83.84 91.63 76.86 3.257 69.81
ContIG (PGS) H1 85.93 93.31 78.47 3.176 72.72
ContIG (PGS) H12 86.44 93.04 77.04 3.216 70.69
ContIG (Burden) None 82.92 93.68 76.89 3.273 71.91
ContIG (Burden) H1 83.22 93.03 76.49 3.160 72.37
ContIG (Burden) H12 83.61 93.14 76.72 3.236 71.50
ContIG (Inner RPB) None 83.49 93.31 77.11 3.195 71.68
ContIG (Inner RPB) H1 81.52 92.95 77.34 3.202 70.80
ContIG (Inner RPB) H12 80.24 92.94 75.37 3.235 68.89
ContIG (Outer RPB) None 82.93 93.01 76.31 3.260 69.16
ContIG (Outer RPB) H1 84.22 93.62 76.97 3.187 71.80
ContIG (Outer RPB) H12 84.21 93.41 77.51 3.233 71.13
Table 3: Downstream evaluation results by fine-tuning on each task. Bold indicates the best result, underlined the second best. RPB in our method names stands for the genetic modalities used: Raw-SNPs, PGS scores, and Burden scores. For MSE, lower is better; for all other metrics, higher is better.
Model | Genetics Encoder | APTOS (QwKappa) | RFMiD (ROC-AUC) | PALM (Dice-Score) | Cardio. (MSE) | Cardio. (ROC-AUC)
SimCLR [simclr] - 35.02 86.53 59.77 3.998 52.26
SimSiam [simsiam] - 21.25 87.91 56.58 3.998 53.13
BYOL [byol] - 17.39 87.84 54.04 4.009 52.29
Barlow Twins [barlow] - 44.75 87.65 59.52 3.952 54.28
NNCLR [nnclr] - 24.76 85.80 66.25 3.870 54.17
ContIG (Raw-SNP) None 59.14 89.24 72.82 3.683 59.07
ContIG (Raw-SNP) H1 69.85 89.99 75.25 3.443 64.36
ContIG (Raw-SNP) H12 68.72 90.47 74.39 3.439 69.58
ContIG (PGS) None 66.34 88.16 75.03 3.488 62.64
ContIG (PGS) H1 72.38 90.43 76.35 3.426 63.98
ContIG (PGS) H12 70.20 90.01 77.13 3.481 63.27
ContIG (Burden) None 70.29 91.08 75.31 3.453 64.72
ContIG (Burden) H1 70.67 90.62 75.42 3.421 64.70
ContIG (Burden) H12 71.22 91.10 76.09 3.434 64.84
ContIG (Inner RPB) None 70.26 89.94 75.27 3.439 63.84
ContIG (Inner RPB) H1 66.94 88.65 75.00 3.404 64.73
ContIG (Inner RPB) H12 68.41 90.56 73.08 3.457 63.45
ContIG (Outer RPB) None 66.94 90.38 75.29 3.448 65.20
ContIG (Outer RPB) H1 66.60 89.46 77.04 3.398 64.59
ContIG (Outer RPB) H12 68.57 90.51 76.50 3.388 65.20
Table 4: Downstream evaluation results by linear evaluation on each task. The results obtained by ContIG again outperform all baselines. Bold indicates the best result, underlined the second best. RPB in our method names stands for the genetic modalities used: Raw-SNPs, PGS scores, and Burden scores. For MSE, lower is better; for all other metrics, higher is better.

b.3 Data-Efficiency Results

In this section, we assess the quality of the semantic representations in a semi-supervised experimental scheme. We randomly choose 1% and 10% of the labels provided by the UK Biobank (UKB) [ukbio] and perform the downstream task of cardiovascular risk factor prediction, evaluating on the same fixed test split of 20% of the UKB dataset size. We choose this particular downstream task because the UKB dataset is large enough to simulate an expert annotation collection process: 1% of the overall labels is approximately 1,000 samples, a realistic annotation budget. The other benchmark datasets (APTOS [APTOS], RFMiD [rfmid], and PALM [palm]) are comparatively small. The evaluation results in Tab. 5 compare models trained with ContIG to models trained with the self-supervised baselines; ContIG outperforms the baselines in this evaluation scheme too. Note that all models are trained on the exact same subset of individuals and evaluated on the same test set. The results for this data-efficient evaluation scheme especially confirm the advantages of pretraining with multiple genetic modalities using the "Outer" aggregation scheme. Notably, ContIG fine-tuned with only 1% labeled data still outperforms the self-supervised baselines fine-tuned with the full 10% label fraction.

Model | 1% labels (MSE, ROC-AUC) | 10% labels (MSE, ROC-AUC)
SimCLR [simclr] 4.029 51.43 3.762 54.29
SimSiam [simsiam] 3.861 53.35 3.564 57.45
BYOL [byol] 3.894 51.68 3.505 56.71
Barlow Twins [barlow] 3.788 51.89 3.558 56.86
NNCLR [nnclr] 3.913 52.20 3.643 55.99
ContIG (Raw-SNP) 3.541 60.11 3.414 64.81
ContIG (PGS) 3.521 59.23 3.391 65.86
ContIG (Burden) 3.540 59.74 3.393 65.41
ContIG (Inner RPB) 3.511 59.95 3.397 65.71
ContIG (Outer RPB) 3.490 60.39 3.378 65.99
Table 5: Data-efficient evaluation results by fine-tuning on subsets of UKB samples. All ContIG models use the "H1" genetic encoder variant. Bold indicates the best result, underlined the second best. For each label fraction, the first column reports MSE (lower is better) and the second ROC-AUC (higher is better).

Appendix C Additional Feature Explanation Results

c.1 Method Validation

We ran a baseline experiment to validate that our feature explanation method properly attributes to meaningful features. In this experiment, instead of genetic features, we use phenotypic covariates such as age, sex, systolic and diastolic blood pressure (SBP and DBP), which can be predicted reliably from retinal fundus images. Additionally, we include the first 40 principal components, which mostly capture population structure information. As control variables, we also feed five random noise variables into the training process, which have no association with the images at all. Fig. 5 shows the aggregated feature explanations. As expected, the noise variables (noise0, ... noise4) get assigned very low explanation scores, while all other variables have considerable influence. This validates that our feature explanation approach can distinguish between variables that carry true information relevant to the network and variables that are unrelated to the images.

Figure 5: Explanation method validation. Shown is the mean absolute attribution for each feature aggregated over a batch-size of 1,000 individuals. noise0, ..., noise4 don’t carry any information and also get downweighted by our attribution method.
(a) Absolute attribution for each modality, aggregated by mean.
(b) Absolute attribution for each modality, aggregated by sum.
Figure 6: Absolute attributions by modality for ContIG (Outer RPB).
Figure 7: Manhattan plot for the GWAS with ContIG (Inner RPB). The x-axis shows the position of each SNP on the genome; the y-axis is the negative base-10 logarithm of the p-value for each SNP. Higher values correspond to lower p-values and thus to stronger signals. The red line corresponds to a significance threshold of 0.05, Bonferroni-adjusted for the number of SNPs; the green line corresponds to "genome-wide significance" (). P-values are clamped at for clearer visualization (only relevant for the loci on chromosome 15 with a minimum p-value of ).

c.2 Multimodal Explanation Results

Fig. 6 shows the aggregated attribution scores for each of the three modalities (Raw-SNPs, PGS, and Burdens) for ContIG with the "Outer" training scheme. Fig. 5(a) shows that PGS scores on average have more influence than individual SNPs or burden scores. However, Fig. 5(b) shows that, in aggregate, raw SNPs and burden scores have more total influence on the model. This is likely because PGS has only 481 features, while raw SNPs and Burdens have 7,854 and 18,574 features, respectively. This may also explain the small but counterintuitive performance drop from ContIG (PGS) to ContIG (Outer RPB): the strongest signal, PGS, gets "drowned out" by the less important but overabundant signal in the raw SNPs and burden scores.

Appendix D GWAS Analysis Details

We produced feature vectors by computing the hidden-layer embedding for each image in the test split of our dataset (10% of the whole dataset, 7,079 individuals). In contrast to the main training, we only used embeddings of the left eye and only included each individual once. Feature vectors were reduced to 10 dimensions using a PCA. Before computing the association results, we also applied an inverse-normal transform [sofer2019fully] after conditioning on the potential confounders sex and age, as well as the first 15 genetic PCs. This ensures that the residuals of the marginal distributions are approximately normally distributed, so that outlier deviations from normality do not artificially inflate the type-1 error rate and lead to spurious correlations. We performed the genetic association study with the PLINK2 software [chang2015second], using a linear model for each of the ten dimensions individually. We again correct for the same confounders in the linear model. Finally, we aggregate the summary statistics of the ten individual features into a single p-value for each SNP by using a Bonferroni correction of factor 10, following [transferGWAS].
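Two steps of this pipeline, the rank-based inverse-normal transform and the factor-10 Bonferroni aggregation, can be sketched as follows; the conditioning on confounders is omitted for brevity, the rank offsets are a simple midpoint scheme, and the p-values below are invented for illustration.

```python
import numpy as np
from statistics import NormalDist

# Sketch of two GWAS-pipeline steps: a rank-based inverse-normal
# transform of one feature dimension, and a factor-10 Bonferroni
# aggregation of the ten per-dimension p-values into one p-value
# per SNP. Illustrative values only.

def inverse_normal_transform(x):
    """Map values to standard-normal quantiles by rank."""
    ranks = np.argsort(np.argsort(x)) + 1            # 1-based ranks
    quantiles = (ranks - 0.5) / len(x)               # midpoint offsets
    return np.array([NormalDist().inv_cdf(q) for q in quantiles])

rng = np.random.default_rng(0)
feature = rng.exponential(size=1000)                 # heavily skewed feature
transformed = inverse_normal_transform(feature)      # now ~ standard normal

# Aggregate ten per-dimension p-values for one SNP (invented values).
p_values = np.array([0.004, 0.2, 0.5, 0.9, 0.03, 0.6, 0.7, 0.1, 0.8, 0.4])
p_snp = min(10 * p_values.min(), 1.0)
print(round(p_snp, 3))  # 0.04
```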

Genetic variants are locally highly correlated. Therefore, we group significantly associated SNPs that are spatially close and in LD together using the PLINK [purcell2007plink] clumping functionality (using parameters , , , ). We reported the number of independent associated regions returned by this procedure in the main document.

Fig. 7 shows the Manhattan plot of genome-wide associations from the GWAS with ContIG (Inner RPB) pretraining. A number of very strong signals, e.g. on chromosomes 15 and 5, are known to be associated with skin pigmentation and cardiovascular traits. Manhattan plots for the other pretrained models look similar but with less signal. Almost all models found the very strong signals on chromosome 15. Interestingly, the Manhattan plots for both SimCLR and BYOL (not displayed here) show clear signs of an ill-fitted association model, with many (for BYOL) small, most likely spurious associations distributed over the whole genome, but no signal in the chromosome-15 pigmentation region. This happens even after applying the inverse-normal transformation to counteract outliers and is likely due to different forms of confounding. This finding also explains the surprisingly large number of hits for BYOL: they are most likely false positives. A more careful analysis with mixed-effect models [lippert2011fast] and an in-depth inspection of the image features is beyond the scope of this article.