MedAug: Contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation

02/21/2021 ∙ by Yen Nhi Truong Vu, et al. ∙ Stanford University

Self-supervised contrastive learning between pairs of multiple views of the same image has been shown to successfully leverage unlabeled data to produce meaningful visual representations for both natural and medical images. However, there has been limited work on determining how to select pairs for medical images, where the availability of patient metadata can be leveraged to improve representations. In this work, we develop a method to select positive pairs coming from views of possibly different images through the use of patient metadata. We compare strategies for selecting positive pairs for chest X-ray interpretation, including requiring them to be from the same patient, imaging study, or laterality. We evaluate downstream task performance by fine-tuning a linear layer on 1% of the labeled data. Our best performing positive pair selection strategy, which involves using images from the same patient and the same study across all lateralities, achieves a performance increase of 3.4% and 14.4% in mean AUC over the previous contrastive method and ImageNet pretrained baseline respectively. Our controlled experiments show that the keys to improving downstream performance on disease classification are (1) using patient metadata to appropriately create positive pairs from different images with the same underlying pathologies, and (2) maximizing the number of different images used in query pairing. In addition, we explore leveraging patient metadata to select hard negative pairs for contrastive learning, but do not find improvement over baselines that do not use metadata. Our method is broadly applicable to medical image interpretation and allows flexibility for incorporating medical insights in choosing pairs for contrastive learning.


1 Introduction

Self-supervised contrastive learning has recently made significant strides in enabling the learning of meaningful visual representations from unlabeled data [instancedisc, deepinfomax, mocov2, simclrv2]. In medical imaging, previous work has found performance improvements when applying contrastive learning to chest X-ray interpretation [sowrirajan, sriram2021covid, azizi], dermatology classification [azizi], and MRI segmentation [localglobalcontrastive]. Despite the early success of these applications, there is only limited work on determining how to improve upon standard contrastive algorithms using medical information [sowrirajan, localglobalcontrastive, textimagecontrastive, clocs].

In contrastive learning, the selection of pairs controls the information contained in learned representations: the loss function dictates that representations of positive pairs are pulled together while those of negative pairs are pushed apart [cpc]. For natural images, where no other types of annotations are available, positive pairs are created from different augmented views of the same image while negative pairs are views of different images [simclrv2]. In that setting, goodview argue that good positive pairs are those that contain minimal mutual information apart from common downstream task information, while tamkin2020viewmaker train a generative model that learns to produce multiple positive views from a single input. However, previous contrastive learning studies on medical imaging have not systematically investigated how to leverage the patient metadata available in medical imaging datasets to select positive pairs that go beyond crops of the same image while still containing common downstream information.

In this work, we propose a method that treats different images sharing common properties found in patient metadata as positive pairs for contrastive learning, and we demonstrate the application of this method to a chest X-ray interpretation task. Similar to the concurrent work by azizi, we experiment with requiring positive pairs to come from the same patient, as such images likely share highly similar visual features. However, our method incorporates these positive pairs, which may consist of different images, directly into the view generation scheme of a single contrastive pretraining stage, whereas azizi add a second pretraining stage in which a positive pair must be formed by two distinct images. Further, we go beyond the simple strategy in azizi; clocs of forming positive pairs from any two data points coming from the same patient, and experiment with other metadata such as study number and laterality. Although study number has also been leveraged successfully by sriram2021covid to create a sequence of pretrained embeddings representing patient disease progression, our work differs in that we use this information specifically to choose positive pairs during the contrastive pretraining stage.

We conduct MoCo-pretraining mocov2 using these different criteria and evaluate the quality of pretrained representations by freezing the base model and fine-tuning a linear layer on 1% of the labeled dataset for the task of pleural effusion classification. Our contributions are:

  1. We develop a method, MedAug, to use patient metadata to select positive pairs in contrastive learning, and apply this method to chest X-rays for the downstream task of pleural effusion.

  2. Our best pretrained representation achieves a performance increase of 3.4% and 14.4% in mean AUC compared to sowrirajan and the ImageNet pretrained baseline respectively, showing that using patient metadata to select positive pairs from multiple images can significantly improve representations.

  3. We perform comparative empirical analysis to show that (1) using positive pairs that share underlying pathologies improves pretrained representations, and (2) increasing the number of distinct images selected to form positive pairs per image query improves the quality of pretrained representations.

  4. We perform an exploratory analysis on strategies to select negative pairs using patient metadata, and do not find improvement over the default strategy that does not use metadata.


Figure 1: Selecting positive pairs for contrastive learning with patient metadata

2 Methods

2.1 Chest X-ray dataset and task

We use CheXpert, a large collection of de-identified chest X-ray images. The dataset consists of 224,316 images from 65,240 patients labeled for the presence or absence of 14 radiological observations. Following sowrirajan, we perform experiments on pleural effusion classification to provide a head-to-head comparison.

2.2 Selecting positive pairs for contrastive learning with patient metadata

Given an input image $x$, an encoder $g$, and a set of augmentations $T$, most contrastive learning algorithms involve minimizing the InfoNCE loss

$$L(x) = -\log \frac{\exp[g(\tilde{x}_1) \cdot g(\tilde{x}_2)]}{\exp[g(\tilde{x}_1) \cdot g(\tilde{x}_2)] + \sum_{i=1}^{K} \exp[g(\tilde{x}_1) \cdot g(z_i)]}.$$

Here, the positive pair $(\tilde{x}_1, \tilde{x}_2)$ consists of augmentations of the input image $x$, while the negative pairs $(\tilde{x}_1, z_i)$, $i = 1, \ldots, K$, pair $\tilde{x}_1$ with augmentations $z_i$ of different images, with the $z_i$ coming from either a queue in the case of MoCo or the minibatch in the case of SimCLR. Recognizing that many augmentation strategies available for natural images are not applicable to medical images, sowrirajan restrict $T$ to simple augmentations such as horizontal flipping and random rotation between -10 and 10 degrees. As a result, their method can be thought of as instance discrimination, since $\tilde{x}_1$ and $\tilde{x}_2$ must come from the same input image.
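For concreteness, the loss above can be computed as a cross-entropy over logits. Below is a minimal PyTorch sketch, assuming L2-normalized representations and omitting the temperature scaling that MoCo v2 applies; all names are ours, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue):
    """InfoNCE loss for one batch (sketch).

    q:      (B, D) encoded queries g(x~1), assumed L2-normalized
    k_pos:  (B, D) encoded positive keys g(x~2), assumed L2-normalized
    queue:  (K, D) negative key representations z_i (MoCo queue)
    """
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)   # (B, 1) positive logits
    l_neg = q @ queue.t()                          # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1)      # (B, 1+K)
    # Cross-entropy with target index 0 reproduces -log(exp(l_pos) / sum exp)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```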

In this work, we propose MedAug, a method that uses multiple images to increase the number of positive pair choices. Beyond the disease labels, we can use patient metadata such as patient number, study number, laterality, patient historical record, etc. to create appropriate positive pairs. Formally, we use patient metadata to obtain an enhanced augmentation set dependent on $x$:

$$T_{\text{enhanced}}(x) = \begin{cases} \{\, t_i(x') \mid t_i \in T,\ x' \in S_c(x) \,\} & \text{if } S_c(x) \neq \emptyset, \\ T(x) & \text{otherwise,} \end{cases}$$

where $S_c(x)$ is the set of all images satisfying some predefined criteria $c$ in relation to the properties of $x$. The criteria for using the metadata can be informed by clinical insights about the downstream task of interest.
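As an illustration, here is a sketch of how $S_c(x)$ and the enhanced augmentation set might be implemented over a metadata table. The field names (patient_id, study, laterality) and function signatures are our own assumptions, not the paper's released code:

```python
import random

def sample_enhanced_view(x, metadata, criterion, load, augment, rng=random):
    """Draw one augmented view t_i(x') with x' sampled from S_c(x).

    x:         metadata record of the query image, e.g. a dict with
               hypothetical fields 'path', 'patient_id', 'study', 'laterality'
    metadata:  list of such records, one per image in the dataset
    criterion: predicate c(x, r) -> bool defining membership of r in S_c(x)
    load:      callable mapping a record to a loaded image
    augment:   callable applying a random augmentation t_i drawn from T
    """
    s_c = [r for r in metadata if criterion(x, r)]
    # Fall back to T(x), i.e. augmenting the query image itself, when S_c(x) is empty.
    source = rng.choice(s_c) if s_c else x
    return augment(load(source))

# Example criterion: same patient and same study, regardless of laterality.
def same_patient_same_study(x, r):
    return r["patient_id"] == x["patient_id"] and r["study"] == x["study"]
```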

We apply this method to chest X-ray interpretation and pretrain ResNet-18 models using MoCo v2 with the hyperparameter choices of sowrirajan. Since the downstream task is disease classification, we experiment with taking $S_c(x)$ to be images from the same patient, since images from the same patient are likely to share a high amount of visual features. We also experiment with further criteria on study number as well as laterality. An example application of the method is illustrated in Figure 1.

2.3 Fine-tuning and evaluation

We evaluate the pretrained representations by (1) training a linear classifier on outputs of the frozen encoder using labeled data and (2) end-to-end fine-tuning. Pretrained checkpoints are selected with k-nearest neighbors algorithm

wu2018unsupervised based on Faiss similarity search and clustering library JDH17

. To simulate label scarcity encountered in medical contexts, we fine-tune using only 1% of the labeled dataset. The fine-tuning experiments are repeated on 5 randomly drawn 1% splits from the labeled dataset to provide an understanding of the model’s performance variance. We report the mean AUC and standard deviation over these five 1% fine-tuning splits. Following

sowrirajan and chexpert, we use a learning rate of

, batch size of 16 and 95 epochs for training.
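A minimal sketch of protocol (1), the linear evaluation, follows. All names are ours; the optimizer choice and the multi-label loss are assumptions, and the learning rate is left as a parameter since its value follows sowrirajan and chexpert:

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feat_dim, num_classes, loader, lr, epochs=95):
    """Train a linear classifier on top of a frozen pretrained encoder."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False              # freeze the base model

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)  # optimizer is an assumption
    loss_fn = nn.BCEWithLogitsLoss()         # multi-label chest X-ray targets

    for _ in range(epochs):
        for images, labels in loader:        # batch size 16 per the text
            with torch.no_grad():
                feats = encoder(images)      # frozen features
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```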

3 Experiments

3.1 Positive pair selection

Our formulation, which uses a set of images from the same patient to enhance the set of augmentations for contrastive learning, provides the flexibility to experiment with different criteria for constraining $S_c(x)$. We experiment with limiting $S_c(x)$ using properties found in the metadata of the query $x$. In particular, we focus on two properties, which the sketch following the two lists below expresses as predicates:

Study number.

The study number of an image associated with a particular patient reflects the session in which the image was taken. We experiment with three different criteria on study number:

  1. All studies: no restriction dependent on the study number of $x$ is placed on $S_c(x)$.

  2. Same study: only images from the same study as $x$ belong to $S_c(x)$.

  3. Distinct studies: only images with a study number different from that of $x$ belong to $S_c(x)$.

Laterality.

Chest X-rays can be of either frontal (AP/PA) view or lateral view.

  1. All lateralities: no restriction dependent on the laterality of $x$ is placed on $S_c(x)$.

  2. Same laterality: only images with the same laterality as $x$ belong to $S_c(x)$.

  3. Distinct lateralities: only images with a different laterality from that of $x$ belong to $S_c(x)$.
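Expressed as predicates in the style of the Section 2.2 sketch (field names remain our assumption), these criteria might look like:

```python
# r is a candidate image record, x the query's record (hypothetical fields).
same_patient = lambda x, r: r["patient_id"] == x["patient_id"]

# Study number criteria, each combined with the same-patient requirement.
all_studies      = lambda x, r: same_patient(x, r)
same_study       = lambda x, r: same_patient(x, r) and r["study"] == x["study"]
distinct_studies = lambda x, r: same_patient(x, r) and r["study"] != x["study"]

# Laterality criteria, composable with any of the above.
same_laterality       = lambda x, r: r["laterality"] == x["laterality"]
distinct_lateralities = lambda x, r: r["laterality"] != x["laterality"]

# Example: "same patient, same study, distinct lateralities".
criterion = lambda x, r: same_study(x, r) and distinct_lateralities(x, r)
```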

Table 1: Downstream pleural effusion AUC for linear and end-to-end fine-tuning. Baseline models: ImageNet baseline; MoCo v2 baseline sowrirajan; MoCo v2 baseline with random crop scale. Criteria for creating $S_c(x)$: same patient, same study, same laterality; same patient, same study, distinct lateralities; same patient, same study; same patient, all studies; same patient, distinct studies; same patient, same study with random crop scale. Except for criteria that involve images from different studies, using images from the same patient to select positive pairs results in improved AUC on downstream pleural effusion classification.
Results.

We report the results of experiments using these criteria in Table 1. Except when $S_c(x)$ includes images with study numbers different from that of $x$, where there is a drop in performance, we see consistent, large improvements over the baseline of sowrirajan. The best result is obtained when $S_c(x)$ is the set of images from the same patient and same study as $x$, regardless of laterality. Incorporating this augmentation strategy while holding all other settings from sowrirajan constant results in respective gains of 0.029 (3.4%) and 0.021 (2.4%) in AUC for the linear and end-to-end models. We also experiment with including the random crop augmentation from MoCo v2 mocov2, with the scaling restricted to a narrower range in order to avoid cropping out areas of interest in the lungs. Adding this augmentation to the same patient, same study strategy, we obtain our best pretrained model, which achieves a linear fine-tuning AUC of 0.883 and an end-to-end fine-tuning AUC of 0.906 on the test set, significantly outperforming previous baselines.

3.2 Comparative Empirical Analysis

We perform comparative analyses to understand how the different criteria on patient metadata affect the downstream performance results in Table 1.

3.2.1 All studies vs. same study

We hypothesize that the drop in transfer performance when moving from images with the same study number to images regardless of study number occurs because $S_c(x)$ may contain images with a different disease pathology than that seen in $x$. As a result, the model is asked to push the representation of a diseased image close to the representation of a non-diseased image, causing poor downstream performance. To test this hypothesis, we carry out an oracle experiment in which $S_c(x)$ is the set of images from the same patient with the same downstream label as $x$, regardless of study number.
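In the predicate style used earlier, the oracle criterion amounts to the following; the "label" field is a hypothetical stand-in for the downstream pleural effusion annotation, which a practical pretraining setup would not have:

```python
# Oracle S_c(x): same patient, any study, same downstream disease label.
oracle = lambda x, r: (r["patient_id"] == x["patient_id"]
                       and r["label"] == x["label"])
```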

Results.

Table 2 shows that the model pretrained with the oracle criterion achieves respective improvements of 0.034 and 0.022 in AUC over the same patient, all studies strategy for the linear and end-to-end models. This experiment supports our hypothesis that positive pairs built from images with different downstream labels hurt performance.

Table 2: Downstream AUC for linear and end-to-end fine-tuning with $S_c(x)$ created from the same patient, all studies, with and without the additional oracle requirement of sharing the same disease label as $x$. The comparison shows that positive pairs with different labels hurt downstream classification performance.

3.2.2 All studies vs. distinct studies

There is a further performance drop when moving from using images across all studies of the same patient to using only images with a study number different from that of the current query image (Table 1). This finding may also support our hypothesis, because a larger proportion of positive pairs have different disease pathologies when pairs are drawn strictly from different studies (see Appendix A). To make sure this result holds independent of the number of images available to form pairs per query, we repeat these experiments while forcing $S_c(x)$ to a fixed size via random subset pre-selection. Further, we only use distinct images as a pair, i.e., we skip any $x' = x$ in the enhanced augmentation set, in order to remove any possible contribution from positive pairs formed from the same image.
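A sketch of this random subset pre-selection, reusing the record lists from Section 2.2; the cap value itself is not stated in the text:

```python
import random

def size_controlled(s_c, x, max_size, rng=random):
    """Cap |S_c(x)| by random subset pre-selection, using distinct images only."""
    s_c = [r for r in s_c if r["path"] != x["path"]]  # skip x' = x ('path' is hypothetical)
    if len(s_c) > max_size:                           # max_size is not specified in the text
        s_c = rng.sample(s_c, max_size)
    return s_c
```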

Results.

Table 3 shows that the same patient, all studies strategy (AUC = 0.848) outperforms the same patient, distinct studies strategy (AUC = 0.792) even when the size of $S_c(x)$ is controlled. This supports the hypothesis that a higher proportion of positive pairs with different disease pathologies hurts downstream task performance.

Table 3: Downstream AUC for linear and end-to-end fine-tuning with $S_c(x)$ created from the same patient, distinct studies versus the same patient, all studies with size control. Forcing positive pairs to come from different images while controlling the size of $S_c(x)$ shows that a higher proportion of pairs with different downstream labels contributes to lower downstream performance.

3.2.3 All lateralities vs. distinct lateralities vs. same laterality

First, we hypothesize that the drop in performance from the all lateralities strategy to the same laterality strategy could be due to $S_c(x)$ having a smaller size. To test this, we carry out an experiment in which the size of $S_c(x)$ under the all lateralities criterion is capped at the number of images from the same study with the same laterality as $x$.

Table 4: Downstream AUC for linear and end-to-end fine-tuning with $S_c(x)$ created from the same patient, same study, same laterality; the same patient, same study, all lateralities with size control; and the same patient, same study, all lateralities without control. Controlling the size of $S_c(x)$ shows that its size affects downstream performance.

Table 5: Downstream AUC for linear and end-to-end fine-tuning comparing same versus distinct lateralities with a size restriction on $S_c(x)$, showing no significant difference.

Our second hypothesis is that images with different lateralities share less mutual information, which helps retain only information important to the downstream task, as shown by goodview. We test this by training two models on images that have at least one counterpart from the other laterality. We pretrain one model with $S_c(x)$ containing only images with the same laterality as $x$, and the other with $S_c(x)$ containing only images with a different laterality from $x$. To remove the effect of different sizes of $S_c(x)$, we force both sets to the same size via random subset pre-selection.

Results.

Table 4 shows that once we control for the size of $S_c(x)$, there is no significant difference between using images from the same laterality (AUC = 0.862) and from all lateralities (AUC = 0.860). However, the model pretrained on all images from all lateralities achieves a much larger downstream AUC of 0.876. This supports our first hypothesis that the size of $S_c(x)$ influences pretrained representation quality. Table 5 shows that once we control for the size of $S_c(x)$, the model pretrained with images from different lateralities gains only a small amount of AUC in linear fine-tuning performance and a non-significant amount in end-to-end performance. This experiment suggests that the effect of mutual information from different lateralities on pretrained representation quality is less pronounced.

3.3 Negative pair selection

We explore strategies that use metadata in the CheXpert dataset to define negative pairs. As with our method for defining positive pairs, we take advantage of metadata available in the dataset to select the negative pairs. However, unlike positive pair selection, where only a single pair is required for each image, an image has to pair with the entire queue to form negative pairs. This makes selecting negative pairs from the same patient, as done for positive pairs, unsuitable, because only a small number of images are available per patient. We instead use a more general property, laterality, across patients to define negative pairs, so that sufficient negative pairs remain in the loss function of Section 2.2. Other metadata such as age and sex could be exploited for the same purpose.

The default negative pair selection strategy is to select all keys from the queue that are not views of the query image. However, we hypothesize that negative pairs with the same laterality are "hard" negative pairs that are more difficult to distinguish and therefore yield more accurate pretrained representations for the downstream task. We describe our four strategies briefly here and in more detail in Appendix B. Our first strategy is to select only images from the queue with the same laterality as the query to create negative pairs. Our second strategy is to reweight the negative logits based on laterality so that, in effect, keys of each laterality (frontal and lateral) contribute equally to the loss while the queue size remains fixed as in the original MoCo approach. Following a similar idea in kalantidis2020hard, our third strategy is to sample a portion of negative pairs with the same laterality for each query and append them to the queue for the loss computation. Our fourth strategy is to create synthetic negatives as additional hard negative pairs. Unlike kalantidis2020hard, we do not determine the hardness of negative pairs from similarities of representations; instead, we use existing metadata (image laterality) to approximate the hardness of a negative pair. We evaluate each of these negative pair strategies combined with the positive pair strategy of "same patient, same study, all lateralities".

Results.

Results are given in Table 6. The default negative pair selection strategy (AUC = 0.876) is not outperformed by any of the metadata-based negative pair selection strategies: same laterality only (AUC = 0.872), same laterality reweighted (AUC = 0.864), same laterality appended (AUC = 0.875), and same laterality synthetic (AUC = 0.870). Thus, our exploratory analysis does not find sufficient evidence of performance improvement from strategies that incorporate metadata, though further experiments with other metadata sources may be required to understand this relationship fully.

Table 6: Linear fine-tuning AUC for the default negative pair definition (different images) and the laterality-based negative pair selection strategies: same laterality only, same laterality (reweighted), same laterality (appended), and same laterality (synthetic).

4 Discussion

We introduce MedAug, a method to use patient metadata to select positive pairs for contrastive learning, and demonstrate the utility of this method on a chest X-ray interpretation task.

Can we improve performance by leveraging metadata to choose positive pairs? Yes. Our best pretraining strategy, which uses multiple images from the same patient and the same study, obtains an increase of 3.4% in linear fine-tuning AUC over the instance discrimination approach implemented in sowrirajan. A similar result was shown by clocs for ECG signal interpretation. azizi also found improvement in dermatology classification when applying a second contrastive pretraining stage in which strictly distinct images from the same patient are selected as positive pairs.

Unlike previous work, our empirical analysis comparing images from all studies and from distinct studies shows that simply choosing images from the same patient may hurt downstream performance. We show that using appropriate metadata, such as study number, to select positive pairs that share underlying disease information is needed to obtain the best representations for the downstream task of disease classification. For future studies, it would be of interest to experiment with other metadata, such as age group and medical history, and to examine how they can inform tasks other than disease classification.

Our analysis of different criteria on laterality shows that the number of images selected to form positive pairs plays an important role, while the effect of mutual information is less clear. Given time and resources, it would be informative to study how the maximum number of distinct images chosen per query affects downstream performance.

Can we improve performance by leveraging metadata to choose hard negative pairs? Not necessarily. We perform an exploratory analysis of strategies to leverage patient metadata to select negative pairs, and do not find them to outperform the baseline.

In closing, our work demonstrates the potential benefits of incorporating patient metadata into self-supervised contrastive learning for medical images, and can be extended to a broader set of tasks [rajpurkar2020appendixnet, uyumazturk2019deep].


Appendix A Proportions of positive pairs with different disease labels

In Section 3.2.2, we argue that downstream performance of the distinct studies strategy is lower than that of the all studies strategy because there is likely a higher proportion of positive pairs with different disease labels under the distinct studies criterion. Figure 2 shows that for almost 9% of queries $x$ under the distinct studies criterion, $S_c(x)$ contains only images with a disease label different from that of $x$, a scenario that does not occur under the all studies criterion.

Figure 2: Histogram showing the distribution of the proportions of positive pairs with different disease labels under the distinct studies criterion versus the all studies criterion.

Appendix B Negative Pairs

Following the loss function in Section 2.2, we write the contribution of the negative pairs as

$$L(x) = -\log \frac{\exp[g(\tilde{x}_1) \cdot g(\tilde{x}_2)]}{\exp[g(\tilde{x}_1) \cdot g(\tilde{x}_2)] + G(\tilde{x}_1, z_i)}, \qquad G(\tilde{x}_1, z_i) = \sum_{z_i \in Q} \exp[g(\tilde{x}_1) \cdot g(z_i)],$$

where, following the MoCo setup, $Q$ denotes the queue of size $K$. Let $S(x)$ be the set of image representations in $Q$ that have the same laterality as $x$, and let $\oplus$ denote list concatenation. We describe each of our negative pair selection strategies below; a code sketch follows the list.

  1. (Same laterality only) For each query, we select only keys in the queue that have the same laterality as the query. Specifically, we replace $G$ above by

     $$G^{l}(\tilde{x}_1, z_i) = \sum_{z_i \in S(x)} \exp[g(\tilde{x}_1) \cdot g(z_i)].$$

  2. (Same laterality reweighted) The first strategy excludes the keys in the queue that have a different laterality from the query. Here we instead set a target hard negative weight $w$ and reweight each term to achieve the target weight. Let

     $$G^{w}(\tilde{x}_1, z_i) = \sum_{z_i \in S(x)} w_i^{s} \exp[g(\tilde{x}_1) \cdot g(z_i)] + \sum_{z_i \in S(x)^{c}} w_i^{d} \exp[g(\tilde{x}_1) \cdot g(z_i)],$$

     where $w$ is the target hard negative weight and $p$ is the proportion of the negative keys in the queue that have the same laterality as $x$; the weights are chosen as $w_i^{s} = w/p$ and $w_i^{d} = (1-w)/(1-p)$ for all $i$, so that same-laterality keys collectively receive weight $w$. In our experiments, we set $w = 0.9$ to allocate 90% of the weight to hard negatives. This allows us to include all negative pairs in the contrastive loss while placing emphasis on hard negative pairs with the same laterality.

  3. (Same laterality appended) For each query, we select a random sample of the keys that have the same laterality and append them to the existing queue $Q$ of size $K$. Let $A = \{z_{i_1}, z_{i_2}, \ldots, z_{i_m}\} \subset S(x)$ be the random sample of keys with the same laterality as the query. The new queue is $Q' = Q \oplus A$, and

     $$G^{a}(\tilde{x}_1, z_i) = \sum_{z_i \in Q'} \exp[g(\tilde{x}_1) \cdot g(z_i)]$$

     replaces $G$ above.

  4. (Same laterality synthetic) For each query, in addition to appending samples of the keys from $S(x)$, we use the samples to generate synthetic keys and append them to the queue. We randomly sample pairs from $A$ and call this set of pairs $P$. For each pair $(z_i, z_j) \in P$, we uniformly sample a number $\alpha$ between 0 and 1 and let $\tilde{z} = \alpha z_i + (1 - \alpha) z_j$. A synthetic image representation is then defined as the normalized vector $\tilde{z} / \lVert \tilde{z} \rVert$. Let $V$ be the set of these synthetic image representations; the new queue is $Q' = Q \oplus A \oplus V$, and $G$ in the loss is replaced by

     $$G^{s}(\tilde{x}_1, z_i) = \sum_{z_i \in Q'} \exp[g(\tilde{x}_1) \cdot g(z_i)].$$

     Note that unlike kalantidis2020hard, we only construct synthetic images once.
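To make the reweighted and synthetic variants concrete, here is a sketch under the same assumptions as the earlier snippets (L2-normalized representations; all names are ours). The reweighting enters the log-sum-exp as an additive log-weight on the negative logits, and the synthetic keys are normalized convex combinations of sampled same-laterality keys:

```python
import torch
import torch.nn.functional as F

def reweighted_neg_logits(l_neg, same_lat_mask, w=0.9):
    """Same laterality (reweighted): fold the weights w_i into the negative logits.

    l_neg:         (K,) negative logits g(x~1) . g(z_i) for one query
    same_lat_mask: (K,) bool, True where z_i has the query's laterality
    Assumes both lateralities occur in the queue (0 < p < 1).
    """
    p = same_lat_mask.float().mean()   # proportion of hard negatives in the queue
    weights = torch.where(same_lat_mask, w / p, (1 - w) / (1 - p))
    return l_neg + weights.log()       # since w_i * exp(l_i) = exp(l_i + log w_i)

def synthetic_negatives(A, num_pairs):
    """Same laterality (synthetic): normalized convex mixes of sampled keys.

    A: (m, D) sampled same-laterality key representations (the set A above)
    """
    i = torch.randint(A.size(0), (num_pairs,))
    j = torch.randint(A.size(0), (num_pairs,))
    alpha = torch.rand(num_pairs, 1)          # alpha ~ Uniform(0, 1)
    z = alpha * A[i] + (1 - alpha) * A[j]     # convex combination per sampled pair
    return F.normalize(z, dim=1)              # the set V of synthetic keys
```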