Continual Active Learning Using Pseudo-Domains for Limited Labelling Resources and Changing Acquisition Characteristics

by   Matthias Perkonigg, et al.
MedUni Wien

Machine learning in medical imaging during clinical routine is impaired by changes in scanner protocols, hardware, or policies, resulting in a heterogeneous set of acquisition settings. When training a deep learning model on an initial static training set, model performance and reliability suffer from changes of acquisition characteristics as data and targets may become inconsistent. Continual learning can help to adapt models to the changing environment by training on a continuous data stream. However, continual manual expert labelling of medical imaging requires substantial effort. Thus, ways to use labelling resources efficiently on a well-chosen subset of new examples are necessary to render this strategy feasible. Here, we propose a method for continual active learning operating on a stream of medical images in a multi-scanner setting. The approach automatically recognizes shifts in image acquisition characteristics (new domains), selects optimal examples for labelling, and adapts training accordingly. Labelling is subject to a limited budget, resembling typical real-world scenarios. To demonstrate generalizability, we evaluate the effectiveness of our method on three tasks: cardiac segmentation, lung nodule detection and brain age estimation. Results show that the proposed approach outperforms other active learning methods, while effectively counteracting catastrophic forgetting.




1 Introduction

The performance of deep learning models in the clinical environment is hampered by frequent changes in scanner hardware, imaging protocols, and heterogeneous composition of acquisition routines. Ideally, models trained on a large data set should be continuously adapted to the changing characteristics of the data stream acquired in imaging departments. However, training on a data stream of images acquired solely by recent acquisition technology can lead to catastrophic forgetting (McCloskey and Cohen, 1989), a deterioration of performance on preceding domains or tasks, see Figure 1 (a). Therefore, a continual learning strategy is required to counteract forgetting. Model training in a medical context requires expert labelling of data in new domains. This is often prohibitively expensive and time-consuming. Therefore, reducing the number of cases requiring labelling, while still providing training with the variability necessary to generalize well, is a key challenge in active learning on medical images (Budd et al., 2019). Here, we propose an active learning approach to make efficient use of annotation resources during continual machine learning. In a continual data stream of examples from an unlabelled distribution, it identifies those that are most informative if labelled next.

Figure 1: Experimental setup: a model is pre-trained on scanner A data (base training) and then subsequently updated on a data stream gradually including data of scanners B, C and D. (a) The accuracy of a model trained on a static data set of only scanner A drops as data from other scanners appears in the data stream. (b) Continual learning can incorporate new knowledge, but requires all samples in the data stream to be labelled. (c) Active continual learning actively chooses the samples to annotate from the stream and is able to keep the model up to date while limiting the annotation effort.

We focus on accounting for domain shifts occurring in a continual data stream, without knowledge about when those shifts occur. Figure 1 depicts the scenario our method is designed for. A deep learning model is trained on a base data set of labelled data of one domain (scanner A); afterwards it is exposed to the data stream in which scanners B, C and D occur. For each sample of the data stream, continual active learning has to decide whether or not labelling is required for the given image. Labelled images are then used for continual learning with a rehearsal memory. Previously proposed continual active learning methods either disregard domain shifts in the training distribution or assume that the domain membership of images is known (Lenga et al., 2020; Özgün et al., 2020; Karani et al., 2018). However, this knowledge cannot be assumed in clinical practice due to the variability in encoding the meta data (Gonzalez et al., 2020). Therefore, a technique to detect those domain shifts in a continuous data stream is needed. Combining continual active learning with automatic detection of domain shifts ensures that models can deal with a diverse and growing number of image acquisition settings, while minimizing the manual effort and resources needed to keep the models up to date.


Here, we propose a continual active learning method. The approach operates without domain membership knowledge and learns by selecting informative samples to annotate from a continuous data stream. We first run a base training on data of a single scanner. Subsequently, the continuous data stream is observed and domain shifts in the stream are detected. Such a detection triggers the labelling of samples of the newly detected pseudo-domain, and knowledge about the new samples is incorporated into the model. At the same time, the model should not forget knowledge about previous domains; thus we evaluate the final model on data from all observed domains. Our approach combines continual active learning with a novel domain detection method for continual learning. We refer to our approach as Continual Active Learning for Scanner Adaptation (CASA). CASA uses a rehearsal method to alleviate catastrophic forgetting and an active labelling approach without prior domain knowledge. CASA is designed to learn on a continuous stream of medical images under the restriction of a labelling budget, keeping the required manual annotation effort low. The present paper expands on prior work on continual active learning (Perkonigg et al., 2021).

2 Related Work

The performance of machine learning models can be severely hampered by changes in image acquisition settings (Castro et al., 2020; Glocker et al., 2019; Prayer et al., 2021). Harmonization can counter this in medical imaging (Fortin et al., 2018; Beer et al., 2020), but requires all data to be available at once, a condition not feasible in an environment where data arrives continually. Domain adaptation (DA) addresses domain shifts, and approaches dealing with continuously shifting domains in particular are related to the proposed method. Wu et al. (2019) showed how to adapt a machine learning model for semantic segmentation of street scenes under different lighting conditions. Rehearsal methods for domain adaptation have been shown to perform well on benchmark data sets such as rotated MNIST (Bobu et al., 2018) or Office-31 (Lao et al., 2020). In the area of medical imaging, DA is used to adapt between different image acquisition settings (Guan and Liu, 2021). A method dealing with shifting domains for lung segmentation in X-rays was proposed by Venkataramani et al. (2019). However, similar to harmonization, most DA methods require that source and target domains are accessible at the same time.

Continual learning (CL) is used to incorporate new knowledge into ML models without forgetting knowledge about previously seen data. For a detailed review of CL see (Parisi et al., 2019; Delange et al., 2021); an overview of the potential of CL in medical imaging is given in (Pianykh et al., 2020). Ozdemir et al. (2018) used continual learning to incrementally add new anatomical regions to a segmentation model. Related to this work, CL has been used for domain adaptation in chest X-ray classification (Lenga et al., 2020) and in brain MRI segmentation (Özgün et al., 2020). Karani et al. (2018) proposed a simple yet effective approach to lifelong learning for brain MRI segmentation by using separate batch normalization layers for each protocol. Most closely related to the proposed approach is Hofmanninger et al. (2020), where an image-style-based rehearsal memory is used for a classification task on lung CT images.

Active learning is an area of research whose goal is to identify which samples to label next, minimizing annotation effort while maximizing training efficiency. A detailed review of active learning in medical imaging is given in (Budd et al., 2019); in the context of that review, our work is most closely related to stream-based selective sampling. Pianykh et al. (2020) also discuss human-in-the-loop concepts for continual learning, similar to the approach presented in this work. Active learning was used by Smailagic et al. (2020) to classify fundus and histopathological images in an incremental learning setting. Zhou et al. (2021) combine transfer learning and active learning to choose samples for labelling based on entropy and diversity, and show the benefits of their method on polyp detection and pulmonary embolism detection. In contrast to the proposed method, these approaches do not take data distribution shifts during training into account.

3 Methods

The continual active learning method CASA uses a rehearsal memory and performs active training sample selection from an unlabelled, continuous data stream to keep a task model up to date in the presence of domain shifts, while at the same time countering forgetting. For active sample labelling, an oracle can be queried to return task annotations; in a real-world clinical setting this oracle can be thought of as a radiologist. Due to the cost of manual labelling, the queries to the oracle are limited by a labelling budget. CASA aims at training a task network on a continuous data stream under this budget restriction, while alleviating catastrophic forgetting. To keep a diverse set of training samples in the rehearsal memory, CASA detects pseudo-domains: groups of examples with similar appearance, where similarity is measured as the style difference between images. The proposed method consists of a pseudo-domain module, a task module and two memories (an outlier memory and a training rehearsal memory), controlled by the CASA algorithm described in the following.

Figure 2: Overview of the CASA algorithm. Each sample from the data stream is processed by the pseudo-domain module to decide whether it is routed to the oracle or to the outlier memory. Whenever a new item is added to the outlier memory, it is evaluated whether a new pseudo-domain (pd) should be created. The oracle labels a sample and stores it in the training memory, from which the task module trains a network. Binary decision alternatives resulting in discarding the sample are omitted for clarity.

3.1 CASA Training Scheme

Before continual training starts, the task module is pre-trained on a labelled data set of image-label pairs acquired on a particular scanner (base training). This base training is a conventional epoch-based training procedure, so that continual training starts from a model that performs well for a single scanner. After base training is finished, continual active training follows the scheme depicted in Figure 2 and outlined in Algorithm 1. First, an input-mini-batch is drawn from the data stream, and the pseudo-domain module evaluates the style embedding (see Section 3.2) of each image. Based on this embedding, a decision is taken to store the image in one of the memories (training or outlier memory) or to discard it. The fixed-size training memory holds triplets of an image, its corresponding label and its assigned pseudo-domain, on which the task network can be trained. Labels can only be generated by querying the oracle, and are subject to the limited labelling budget. The training memory is initialized with a random subset of the base training set before continual training starts. Pseudo-domain detection is performed within the outlier memory, which holds a set of unlabelled images, each together with a counter of how long the image has been part of the memory. Details about the outlier memory are given in Section 3.5. Given that training has not saturated on all pseudo-domains, a training step is performed by sampling training-mini-batches from the training memory and training the task module (Section 3.3) for one step. This process continues by drawing the next mini-batch from the stream.

Input : pre-trained task model, continual data stream, limited labelling budget, training memory, outlier memory, number of training steps per input-mini-batch
while the data stream is not exhausted do
      draw an input-mini-batch from the stream
      for each image in the mini-batch do
            compute the style embedding and assign the image to the closest pseudo-domain (Section 3.2)
            if the embedding lies outside all known pseudo-domains then
                  add the image to the outlier memory and check whether a new pseudo-domain can be established (Section 3.5)
            else if the assigned pseudo-domain has not completed training and labelling budget remains then
                  query the oracle for a label and insert the sample into the training memory (Section 3.4)
            else
                  discard the image
            end if
      end for
      if training has not saturated on all pseudo-domains then
            for each of the training steps per input-mini-batch do
                  sample a training-mini-batch from the training memory and train the task module for one step (Section 3.3)
            end for
      end if
end while
Algorithm 1 CASA Training Algorithm

3.2 Pseudo-domain module

CASA does not assume direct knowledge about domains (e.g. scanner vendor or scanning protocol). Instead, the pseudo-domain module evaluates the style of each image and assigns it to a pseudo-domain. Pseudo-domains represent groups of images which exhibit similar style. Each pseudo-domain is defined by its style embedding center and the maximum distance from that center within which an image is considered to belong to the pseudo-domain. In addition, a running average of the task performance is stored for each pseudo-domain.

Style embedding

A style embedding is calculated for an image based on a style network pre-trained on a different dataset (not necessarily related to the task) and not updated during training. From this network, we evaluate the style of an image based on the gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $N_l$ is the number of feature maps in layer $l$. Following (Gatys et al., 2016; Hofmanninger et al., 2020), the entry $G^l_{ij}$ is defined as the inner product between the vectorized activations $f^l_i(x)$ and $f^l_j(x)$ of two feature maps $i$ and $j$ in a layer $l$, given a sample image $x$:

$$G^l_{ij}(x) = \frac{1}{m_l} \, f^l_i(x)^\top f^l_j(x),$$

where $m_l$ denotes the number of elements in the vectorized feature map. Based on the gram matrices, a style embedding is defined: for a set of convolutional layers of the style network, gram matrices are calculated and Principal Component Analysis (PCA) is applied to reduce the dimensionality of the embedding. The PCA is fitted on the base training set.
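The style statistics above can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' implementation: the function names and the choice of the upper triangle of each gram matrix as the pre-PCA feature vector are illustrative assumptions.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of one layer's activations.

    features: array of shape (C, H, W), i.e. C feature maps.
    Returns a (C, C) matrix of inner products between vectorized feature
    maps, normalised by the number of elements per map.
    """
    c = features.shape[0]
    v = features.reshape(c, -1)   # vectorized activations, shape (C, m)
    m = v.shape[1]                # number of elements per feature map
    return v @ v.T / m

def style_features(layer_activations):
    """Concatenate the upper triangles of the gram matrices of several
    layers; in CASA this vector would then be reduced with a PCA fitted
    on the base training set."""
    parts = []
    for f in layer_activations:
        g = gram_matrix(f)
        iu = np.triu_indices(g.shape[0])
        parts.append(g[iu])
    return np.concatenate(parts)
```

The gram matrix discards spatial arrangement and keeps only feature co-activation statistics, which is why it captures acquisition "style" rather than anatomy.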

Pseudo-domain assignment

CASA uses pseudo-domains to assess whether training for a specific style is needed and to diversify the training memory. A new image $x$ is assigned to the pseudo-domain $d^*$ minimizing the distance between the center $\mu_d$ of the pseudo-domain and the style embedding $e(x)$:

$$d^* = \arg\min_{d} \; \lVert \mu_d - e(x) \rVert_2 .$$

If this minimal distance exceeds the maximum distance of the pseudo-domain, the image is added to the outlier memory, from which new pseudo-domains are detected (see Section 3.5). If the pseudo-domain is known and has completed training, the image is discarded; otherwise it is added to the training memory according to the strategy described in Section 3.4.
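The assignment rule can be sketched as follows; the function name, the list-based representation of centers and per-domain radii, and the return convention (`None` routes the sample to the outlier memory) are illustrative assumptions.

```python
import numpy as np

def assign_pseudo_domain(embedding, centers, radii):
    """Assign a style embedding to the nearest pseudo-domain.

    centers: list of embedding centers mu_d.
    radii:   per-domain maximum distances r_d.
    Returns the index of the nearest pseudo-domain, or None if the
    embedding lies outside every domain's radius (-> outlier memory).
    """
    if not centers:
        return None
    dists = [np.linalg.norm(embedding - c) for c in centers]
    d = int(np.argmin(dists))
    return d if dists[d] <= radii[d] else None
```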

Average performance metric

For each pseudo-domain, a running average of a performance metric of the target task is calculated on the last elements of the pseudo-domain that have been labelled by the oracle; the metric is measured before training on each sample. This running average is used to evaluate whether the pseudo-domain has completed training: training is considered complete when the average exceeds a fixed performance threshold for classification tasks, or falls below it for regression tasks.
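The saturation check can be sketched with a small windowed average. The class name, window length and threshold values are hypothetical, not the paper's settings; only the rule (higher-is-better metrics must exceed the threshold, lower-is-better metrics must fall below it) follows the text.

```python
from collections import deque

class RunningPerformance:
    """Running average of the task metric over the last k labelled samples
    of one pseudo-domain, measured before training on each sample."""

    def __init__(self, k, threshold, higher_is_better=True):
        self.values = deque(maxlen=k)
        self.threshold = threshold
        self.higher_is_better = higher_is_better

    def update(self, metric_value):
        self.values.append(metric_value)

    def training_complete(self):
        # only decide once a full window of labelled samples is available
        if len(self.values) < self.values.maxlen:
            return False
        avg = sum(self.values) / len(self.values)
        if self.higher_is_better:      # e.g. dice score, average precision
            return avg >= self.threshold
        return avg <= self.threshold   # e.g. mean absolute error
```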

3.3 Task module

The task module is responsible for learning the target task (e.g. cardiac segmentation); its main component is the task network, mapping an input image to a target label. During base training, this module is trained on a labelled data set. During continual active training, the module is updated in every step by drawing training-input-batches from the training memory and performing a training step on each of the batches. The aim of CASA is to train a task module that performs well on images of all image acquisition settings present in the data stream, without suffering catastrophic forgetting.

3.4 Training memory

The fixed-size training memory is balanced between the pseudo-domains currently known. Each of the pseudo-domains can occupy up to an equal share of the elements in the memory. If a new pseudo-domain is added (see Section 3.5), a random subset of elements of all previous domains is flagged for deletion, so that only each domain's share of elements is kept protected. If a new element is inserted and the memory is not full, an element currently flagged for deletion is replaced by the new element. Otherwise, the new element replaces the stored element of the same pseudo-domain that minimizes the distance between the style embeddings. Formally, for a new sample $x$ with pseudo-domain $d(x)$, the element with index

$$i^* = \arg\min_{i \,:\, d_i = d(x)} \; \lVert e(m_i) - e(x) \rVert_2$$

is replaced, where $e(\cdot)$ denotes the style embedding.
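The replacement rule can be sketched as a small search over the memory. The function name and the representation of the memory as a list of (embedding, pseudo-domain) pairs are illustrative assumptions.

```python
import numpy as np

def replacement_index(memory, new_embedding, new_domain):
    """Index of the stored element an incoming sample should replace:
    among elements of the same pseudo-domain, the one whose style
    embedding is closest to the new sample's embedding.

    memory: list of (embedding, pseudo_domain) pairs.
    Returns None if no element of that pseudo-domain is stored.
    """
    candidates = [i for i, (_, d) in enumerate(memory) if d == new_domain]
    if not candidates:
        return None
    dists = [np.linalg.norm(memory[i][0] - new_embedding) for i in candidates]
    return candidates[int(np.argmin(dists))]
```

Replacing the stylistically closest element keeps the memory diverse: near-duplicates are evicted first, while unusual samples of a domain survive.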
3.5 Outlier memory and pseudo-domain detection

The outlier memory holds candidate examples that do not fit an already identified pseudo-domain and might form a new pseudo-domain by themselves. Whether they form a pseudo-domain is determined based on their proximity in the style embedding space. Examples are stored until they are assigned a new pseudo-domain or until a fixed number of training steps is reached; if no new pseudo-domain is discovered for an image within that number of steps, it is considered a 'real' outlier and removed from the outlier memory. New pseudo-domains are discovered within the outlier memory and subsequently added to the set of known pseudo-domains. The discovery process is started once the outlier memory holds a minimum number of elements. To detect a dense region in the style embedding space of examples in the outlier memory, the pairwise Euclidean distances between all elements are calculated. If there is a group of images for which all pairwise distances are below a threshold, a new pseudo-domain is established from these images. For all elements belonging to the new pseudo-domain, labels are queried from the oracle and the elements are transferred from the outlier memory to the training memory.
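The dense-region search can be sketched greedily as follows. The function name, the greedy seed-and-grow strategy, and the parameter values are illustrative assumptions; the paper only specifies that all pairwise distances within the group must fall below a threshold.

```python
import numpy as np

def discover_pseudo_domain(outlier_embeddings, min_group=3, dist_threshold=1.0):
    """Look for a dense group in the outlier memory: at least `min_group`
    embeddings whose pairwise Euclidean distances are all below
    `dist_threshold`. Greedy sketch: grow a group around each seed point.
    Returns the indices of one such group, or None."""
    n = len(outlier_embeddings)
    for seed in range(n):
        group = [seed]
        for j in range(n):
            if j == seed:
                continue
            # only admit j if it is close to every current group member
            if all(np.linalg.norm(outlier_embeddings[j] - outlier_embeddings[g])
                   < dist_threshold for g in group):
                group.append(j)
        if len(group) >= min_group:
            return sorted(group)
    return None
```

In CASA, the members of the returned group would be labelled by the oracle and moved to the training memory, and their embedding mean would become the new pseudo-domain center.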

4 Experimental Setup

We evaluate CASA on data streams containing imaging data sampled from different scanners. To demonstrate the generalizability of CASA to a range of different areas in medical imaging, three different tasks are evaluated:

  • Cardiac segmentation on cardiovascular magnetic resonance (CMR) data

  • Lung nodule detection in computed tomography (CT) images of the lung

  • Brain Age Estimation on T1-weighted MRI data

For all tasks, the performance of CASA is compared to several baseline techniques (see Section 4.3).

4.1 Data set

(a) Cardiac segmentation data set
            Siemens (C1)   GE (C2)   Philips (C3)   Canon (C4)   Total
Base        1120           0         0              0            1120
Continual   614            720       2206           758          4298
Validation  234            248       220            258          960
Test        228            246       216            252          942

(b) Lung nodule detection data set
            GE/L (L1)   GE/H (L2)   Siemens (L3)   LNDb (L4)   Total
Base        253         0           0              0           253
Continual   136         166         102            479         883
Validation  53          23          10             55          141
Test        85          26          18             91          220

(c) Brain age estimation data set
            1.5T IXI (B1)   1.5T OASIS (B2)   3.0T IXI (B3)   3.0T OASIS (B4)   Total
Base        201             0                 0               0                 201
Continual   52              190               146             1504              1892
Validation  31              23                18              187               259
Test        31              23                18              187               259

Table 1: Splitting of the data sets into base training, continual training, validation, and test sets. The number of cases in each split is shown.

Cardiac segmentation

2D cardiac segmentation experiments were performed on data from a multi-center, multi-vendor challenge data set (Campello et al., 2021). The data set includes CMR data from four different scanner vendors (Siemens, General Electric, Philips and Canon), where we considered each vendor as a different domain. We split the data into base training, continual training, validation and test sets on a patient level. Table 1 (a) shows the number of slices for each domain in those splits. Manual annotations for the left ventricle, right ventricle and left ventricular myocardium were provided. 2D images were center-cropped to 240×196 px and normalized to the range [0, 1]. In the continual data set, the scanners appeared in the order Siemens, GE, Philips and Canon and are referred to as scanners C1-C4.

Lung nodule detection

For lung nodule detection, we used two data sources: the LIDC database (Armato et al., 2011), with the annotations as provided for the LUNA16 challenge (Setio et al., 2017), and the LNDb challenge data set (Pedrosa et al., 2019). Lung nodule detection was performed as 2D bounding box detection; therefore, bounding boxes were constructed around all available lung nodule annotations. From LIDC, the three most common domains, in terms of scanner vendor and reconstruction kernel, were used to collect a data set suitable for continual learning with shifting domains. Those domains were GE MEDICAL SYSTEMS with low frequency reconstruction algorithm (GE/L), GE MEDICAL SYSTEMS with high frequency reconstruction algorithm (GE/H) and Siemens with B30f kernel (Siemens). In addition, data from LNDb was used as a fourth domain, comprising data from multiple Siemens scanners. For LNDb, nodules below a minimum diameter were excluded to match the nodule definition in LIDC. Image intensities were clipped to the range [-1024, 1024] and normalized to [0, 1]. From the volumes, 2D slices were extracted and split into base training, continual training, validation and test sets according to Table 1 (b). For all continual learning experiments, the order of the domains was GE/L, GE/H, Siemens and LNDb, referred to as L1-L4.

Brain age estimation

Data pooled from two different data sets, acquired on four different scanners, was used for brain age estimation. The IXI data set and data from OASIS-3 (LaMontagne et al., 2019) were used to collect a continual learning data set. From IXI, we used data from a Philips Gyroscan Intera 1.5T and a Philips Intera 3.0T scanner; from OASIS-3, we used data from a Siemens Vision 1.5T and a Siemens TrioTim 3.0T scanner. Images were resized to 64×128×128 px and normalized to a range between 0 and 1. Data was split into base training, continual training, validation and test sets (see Table 1 (c)). In continual training, data occurred in the order Philips 1.5T, Siemens 1.5T, Philips 3.0T and Siemens 3.0T; the scanner domains are referred to as B1-B4 in the following.

4.2 Experimental setup

Cardiac segmentation

For segmentation, a 2D U-Net (Ronneberger et al., 2015) was used as task network. The style network was a ResNet-50 (He et al., 2016), pretrained on ImageNet and provided in the torchvision package. For segmentation, the performance metric used in all experiments was the mean dice score (DSC) over the three annotated anatomical regions (left ventricle, right ventricle and left ventricular myocardium).

Lung nodule detection

As task network, a Faster R-CNN with a ResNet-50 backbone was used (Ren et al., 2017). For evaluating the style, we used a ResNet-50 pretrained on ImageNet. For lung nodule detection, we used average precision (AP) as the performance metric to evaluate the models with a single value, following the AP definition of Everingham et al. (2010).

Brain age estimation

As task network, a simple 3D feed-forward network was used (Dinsdale et al., 2020). The style network used in the pseudo-domain module was a 3D ModelGenesis model, pre-trained on computed tomography images of the lung (Zhou et al., 2020). The main performance measure used for brain age estimation was the mean absolute error (MAE) between predicted and true age.

4.3 Methods compared

Throughout the experiments, five methods were evaluated and compared:

  1. Joint model (JM): a model trained in a standard, epoch-based approach on samples from all scanners in the experiment jointly.

  2. Domain specific models (DSM): a separate model is trained for each domain in the experiment with standard epoch-based training, and each domain is evaluated with its own model.

  3. Naive AL (NAL): a naive continuously trained active learning approach that labels every n-th sample from the data stream, where n depends on the labelling budget.

  4. Uncertainty AL (UAL): a common type of active learning which labels samples for which the task network is uncertain about the output (Budd et al., 2019). Here, uncertainty is calculated using dropout at inference time as an approximation of Bayesian inference (Gal and Ghahramani, 2016).

  5. CASA (proposed method): The method described in this work.

Joint models and DSM require the whole training data set to be labelled, and thus serve as an upper bound for the continual learning methods. CASA, UAL and NAL use an oracle to label specific samples only. The comparison to NAL and UAL evaluates whether the detection of pseudo-domains, and labelling based on them, is beneficial in an active learning setting. Note that the aim of our experiments is to show the gains of CASA compared to other active learning methods, not to develop new state-of-the-art methods for any of the three tasks evaluated.
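The acquisition score behind the UAL baseline, dropout at inference time (Gal and Ghahramani, 2016), can be illustrated with a minimal numpy sketch. The single linear layer, dropout rate and sample count here are illustrative assumptions; a real task network would apply dropout between its hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_uncertainty(weights, x, p=0.5, n_samples=50):
    """Monte-Carlo dropout uncertainty for a single linear layer: keep
    dropout active at inference, run repeated stochastic forward passes,
    and use the mean predictive variance as the uncertainty score."""
    preds = []
    for _ in range(n_samples):
        mask = (rng.random(x.shape) > p).astype(float)   # Bernoulli dropout
        preds.append(weights @ (x * mask) / (1.0 - p))   # inverted-dropout scaling
    preds = np.stack(preds)
    return float(preds.var(axis=0).mean())
```

In a stream-based setting, samples whose score exceeds a threshold would be sent to the oracle, budget permitting.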

4.4 Experimental evaluation

We evaluate different aspects of CASA:

  1. Performance across domains: For all tasks, we evaluate the performance across domains at the end of training, and highlight specific properties of CASA in comparison to the baseline methods. Furthermore, we evaluate backward transfer (BWT), the ability of continual learning to improve accuracy on existing domains by adding new domains, and forward transfer (FWT), the contribution of previous domains in the training data to the accuracy on new domains (Lopez-Paz and Ranzato, 2017). BWT measures how learning a new domain influences the performance on previous domains; FWT quantifies the influence on future domains. Negative BWT values indicate catastrophic forgetting, thus avoiding negative BWT is especially important for continual learning.

  2. Influence of the labelling budget: For cardiac segmentation, the influence of the labelling budget is studied. The labelling budget is an important parameter in clinical practice, since labelling new samples is expensive. We express the budget as a fraction of the continual data set and analyse different budget settings. To isolate the influence of the budget, the memory size in these experiments is fixed across all settings.

  3. Influence of the memory size: For cardiac segmentation, different settings for the memory size are evaluated. The memory size is the number of samples stored for rehearsal, which might be limited due to privacy concerns and/or storage space. Different memory sizes are evaluated for a fixed labelling budget.

  4. Memory composition and pseudo-domains: We study whether our proposed method of detecting pseudo-domains keeps samples in memory that are representative of the whole training set for cardiac segmentation. In addition, we evaluate how the detected pseudo-domains relate to the real domains determined by the scanner types.

  5. Learning on a random stream: We study how CASA performs on a random stream of data, where images of different acquisition settings appear randomly in the data stream, in contrast to the standard setting, where acquisition settings appear sequentially with a phase of transition in between.
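The transfer metrics in item 1 follow Lopez-Paz and Ranzato (2017). Writing $R_{i,j}$ for the test performance on domain $j$ after training on the $i$-th of $T$ domains, backward and forward transfer are defined as:

```latex
\mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right),
\qquad
\mathrm{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} \left( R_{i-1,i} - \bar{b}_i \right),
```

where $\bar{b}_i$ is the performance on domain $i$ of a reference model that has not been trained on it. Negative BWT indicates forgetting; positive FWT means earlier domains help performance on later ones.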

5 Results

5.1 Performance across domains

Here, the quantitative results at the end of continual training are compared for a fixed memory size and labelling budget. Different settings for the budget and the memory size are evaluated in Sections 5.2 and 5.3, respectively.

Cardiac segmentation

Performance for cardiac segmentation was measured using the mean dice score. Continual learning with CASA applied to cardiac segmentation outperformed UAL and NAL for scanners C2, C3 and C4 (Table 2). For scanner C1, the performance of the model trained with CASA was slightly below UAL and NAL. This was due to the distribution in the rehearsal memory: CASA balanced between all four scanner domains, while for UAL and NAL a majority of the rehearsal memory was filled with C1 images (further details are discussed in Section 5.4). Compared to the base model, which corresponds to the model performance prior to continual training, the performance of CASA remained constant for C1 and at the same time rose significantly for C2, C3 and C4, showing that CASA was able to perform continual learning without forgetting the knowledge acquired in base training. UAL and NAL were also able to learn without forgetting during continual learning, which is reflected in a BWT of around zero for all compared methods. However, UAL and NAL performed worse in terms of FWT and overall dice for C2 to C4. As expected, JM outperformed all other training strategies, since it has access to the fully labelled training set at once and can thus perform epoch-based deep learning.

Meth. C1 C2 C3 C4 BWT FWT
Table 2: Cardiac segmentation: quantitative results measured in mean dice score. ± marks the standard deviation over five independent training runs.

Lung nodule detection

In Table 3, results for lung nodule detection, measured as average precision, are compared. CASA performed significantly better than NAL and UAL for all scanners. For L4, the images extracted from LNDb, the distribution of nodules was different: for scanners L1-L3 the mean lesion diameter was 8.29mm, while for L4 it was 5.99mm on average. This led to a worse performance on L4 for all approaches. Nevertheless, CASA was the only active learning approach that labelled a large enough number of L4 images to significantly outperform the base model, as well as NAL and UAL.

Meth. L1 L2 L3 L4 BWT FWT
Table 3: Lung nodule detection: quantitative results measured in average precision. ± marks the standard deviation over five independent training runs.

Brain age estimation

Table 4 shows the results for brain age estimation in terms of MAE. CASA was able to perform continual learning without forgetting, and outperformed UAL and NAL for all scanners (B1-B4) at the end of the continuous data stream. Comparing the MAE on B1 data for UAL and NAL with that of the base model shows that forgetting occurred for UAL and NAL. For CASA, the MAE for B2 and B3 was notably higher than for B1 and B4, respectively. Due to the composition of the continual training set, B2 (n=190) and B3 (n=146) occurred less often than B4 (n=1504) in the data stream, leading to fewer B2 and B3 images seen during training and consequently a worse performance. Nevertheless, CASA was able to handle this data set composition better than UAL and NAL.

Meth. B1 B2 B3 B4 BWT FWT
Table 4: Brain age estimation: quantitative results measured in mean absolute error. ± marks the standard deviation over five independent training runs.

5.2 Influence of labelling budget

Figure 3: Influence of the labelling budget for cardiac segmentation, comparing CASA, uncertainty AL and naive AL. Performance was measured in mean DSC.

The influence of the labelling budget on cardiac segmentation performance is shown in Figure 3. For the first scanners C1 and C2, similar performance can be observed for all methods and budget settings, since all methods have a sufficient amount of budget to adapt from C1 to C2. For C3, the performance of CASA was slightly higher compared to UAL and NAL. The most striking difference between the methods can be seen for scanner C4, where CASA performed significantly better for the larger budget settings. For an intermediate budget, CASA outperformed UAL and NAL on average; however, a large deviation between the individual five runs is observable. Investigating this further revealed that, for one of the random seeds, CASA ran out of budget before C4 data appeared in the stream and was therefore unable to adapt to C4. For the smallest budget, CASA consumed the whole labelling budget before C4 data occurred in the stream and thus could not adapt to C4 properly. UAL had only little budget left when C4 data came in and ran out shortly afterwards; its performance was therefore significantly worse than with larger labelling budgets. NAL performed best in the lowest-budget setting, because it labels every 20th sample and therefore did not run out of budget before the end of the stream was reached.

5.3 Influence of memory size

For cardiac segmentation, we investigated the influence of the training memory size.

Figure 4: Influence of the memory size on cardiac segmentation with a fixed labelling budget, comparing CASA, uncertainty AL and naive AL. Performance was measured as mean DSC.

The memory size influences the adaptation to new domains in the continuous data stream. In Figure 4, CASA, UAL and NAL are compared for different memory sizes. For the first scanners in the stream, C1 and C2, performance was similar across AL methods and memory settings. For the smallest memory, CASA could not gain any improvement over UAL and NAL, indicating that the detection of pseudo-domains and balancing based on them is only useful for reasonably large memory sizes. All methods performed best for the largest memory size evaluated; however, at the end of the stream that memory was primarily filled with C3 and C4 data, which would lead to forgetting effects if training continued.

5.4 Evaluation of the Memory and Pseudo-Domains

Figure 5: t-SNE visualization of style embeddings for cardiac segmentation. (a) shows the distribution of the different domains C1-C4 in the embedding space. (b) For CASA, UAL and NAL the memory elements at the end of continual training are marked in the embedding space, showing a balanced distribution for CASA. (c) Counts of elements in the rehearsal memory at the end of training for CASA, UAL and NAL.

We analyzed the balancing of our memory at the end of training and the detection of different pseudo-domains by extracting the style embedding for all samples in the training set (combined base and continual set). These embeddings were mapped to two dimensions using t-SNE (Maaten and Hinton, 2008) for plotting. Figure 5 (a) shows that different scanners are located in different areas of the embedding space; scanner C4 in particular forms a compact cluster separated from the other scanners. Furthermore, a comparison of the distribution of the samples in the rehearsal memory at the end of training (Figure 5 (b)) shows that for CASA the samples are spread over the whole embedding space, including scanner C4. For UAL and NAL, the memory was dominated by scanner C1 samples (the base training scanner), and fewer images of later scanners were kept. Note that this does not mean that UAL and NAL labelled primarily scanner C1 samples, but that these methods did not balance the rehearsal memory, so labelled images from C2-C4 might be lost in the process of choosing which samples to keep. As shown in Appendix Figure 8, these observations were stable over different test runs. Figure 5 (c) confirms this finding: the memory distribution over five independent runs (with different random seeds) demonstrates the capability of CASA to balance the memory across scanners, although the real domains are not known during training.
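As an illustration of this embedding analysis, the following sketch computes a Gram-matrix style descriptor per image (in the spirit of the style features of Gatys et al. referenced in Section 3.2) and projects it to two dimensions. To keep the example dependency-free we project with PCA instead of the t-SNE used in the paper; all names, shapes and parameters are illustrative.

```python
import numpy as np

def style_embedding(feature_maps):
    """Gram-matrix style descriptor of conv feature maps of shape (C, H, W)."""
    c = feature_maps.shape[0]
    f = feature_maps.reshape(c, -1)            # (C, H*W)
    gram = f @ f.T / f.shape[1]                # channel correlation matrix
    return gram[np.triu_indices(c)]            # upper triangle as a vector

def project_2d(embeddings):
    """Linear 2-D projection via PCA/SVD (the paper uses t-SNE instead)."""
    x = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T

rng = np.random.default_rng(0)
# Simulated feature maps from two "scanners" with different intensity
# statistics; their style descriptors separate in the projected space,
# analogous to the scanner clusters visible in Figure 5 (a).
maps = [rng.normal(loc=0.0 if i < 10 else 2.0, scale=1.0, size=(8, 16, 16))
        for i in range(20)]
emb = np.stack([style_embedding(m) for m in maps])    # (20, 36)
xy = project_2d(emb)                                  # (20, 2)
```

Because the Gram matrix captures channel statistics rather than content, images from the same acquisition setting land close together regardless of the anatomy shown.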

Figure 6: Distribution of images of specific scanners (C1-C4) over the discovered pseudo-domains for one run of CASA training.

Pseudo-domain discovery in CASA resulted in 6-8 pseudo-domains, showing that pseudo-domains do not exactly coincide with domains defined by scanners. In addition, the detection was influenced by the order of the continuous data stream. Figure 6 shows the distribution of images from each scanner over the pseudo-domains for one training run of CASA (results for all independent runs are shown in Appendix Figure 9). The first two pseudo-domains were dominated by samples from scanner C1, while the last pseudo-domain consisted mainly of scanner C4. Pseudo-domains 3-6 represented a mix of C2 and C3 data. This is consistent with Figure 5, where the distributions of C2 and C3 overlap while C1 and, especially, C4 data are more clearly separated.
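The interplay of outlier memory and pseudo-domain creation can be sketched as follows (a simplified toy version with assumed thresholds and function names, not the exact CASA procedure): embeddings that fit no existing pseudo-domain accumulate as outliers, and a new pseudo-domain is created once enough similar outliers have been collected.

```python
import numpy as np

def discover_pseudo_domains(embeddings, fit_threshold=1.0, min_cluster=5):
    """Assign each style embedding to the nearest pseudo-domain if close
    enough; otherwise collect it in an outlier memory and create a new
    pseudo-domain once min_cluster similar outliers have accumulated."""
    centers, outliers, assignments = [], [], []
    for e in embeddings:
        dists = [np.linalg.norm(e - c) for c in centers]
        if dists and min(dists) < fit_threshold:
            assignments.append(int(np.argmin(dists)))
        else:
            outliers.append(e)
            assignments.append(-1)             # unassigned for now
            close = [o for o in outliers
                     if np.linalg.norm(o - e) < fit_threshold]
            if len(close) >= min_cluster:      # enough similar outliers:
                centers.append(np.mean(close, axis=0))  # new pseudo-domain
                outliers = [o for o in outliers
                            if np.linalg.norm(o - e) >= fit_threshold]
    return centers, assignments

rng = np.random.default_rng(1)
# An ordered stream (one "scanner" after the other) yields one pseudo-domain
# per cluster; on a random stream, outliers of a new domain accumulate more
# slowly, which matches the drop in effectiveness reported in Section 5.5.
stream = np.concatenate([rng.normal([0.0, 0.0], 0.1, size=(20, 2)),
                         rng.normal([5.0, 5.0], 0.1, size=(20, 2))])
centers, assignments = discover_pseudo_domains(stream)
```

On this ordered toy stream, two pseudo-domains emerge and later samples are assigned to them directly without consuming the outlier memory.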

5.5 Learning on a random stream of data

To analyze the influence of the sequential nature of the stream, we evaluated how CASA performs on a random stream of data, with no sequential order of the scanners. Results are given in Table 5. CASA performed well on a random stream; however, for scanners with few samples in the training set (C2, C4), a drop in performance was observed. This is because the pseudo-domain detection based on the outlier memory was not as effective as on a continuous stream, where we expect an accumulation of outliers as new domains start to occur. NAL also performed well on a random stream, outperforming both NAL on an ordered stream and CASA. Due to the randomness of the stream and the sampling strategy of NAL (taking samples at a fixed interval), NAL learned on a diverse set of samples and achieved more balanced training.

Meth. C1 C2 C3 C4
CASA - Random
NAL - Random
Table 5: Dice scores for cardiac segmentation on a random stream. CASA is compared to naive active learning. Standard deviations are computed over independent training runs.

6 Discussion and Conclusion

We propose a continual active learning method to adapt deep learning models to changes of medical imaging acquisition settings. By detecting novel pseudo-domains occurring in the data stream, our method is able to keep the number of required annotations low, while improving the diversity of the training set. Pseudo-domains represent groups of images with similar but new imaging characteristics. Balancing the rehearsal memory based on pseudo-domains ensures a diverse set of samples is kept for retraining on both new and preceding domains.
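A minimal sketch of pseudo-domain-balanced rehearsal, assuming a simple evict-from-largest policy (illustrative, not the exact CASA memory update):

```python
import random
from collections import defaultdict

class BalancedRehearsalMemory:
    """Fixed-size rehearsal memory balanced over pseudo-domain labels:
    when full, a random element of the most represented pseudo-domain
    is evicted before the new sample is appended."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []                      # (sample, pseudo_domain) pairs
        self.rng = random.Random(seed)

    def domain_counts(self):
        counts = defaultdict(int)
        for _, d in self.items:
            counts[d] += 1
        return dict(counts)

    def insert(self, sample, pseudo_domain):
        if len(self.items) >= self.capacity:
            counts = self.domain_counts()
            largest = max(counts, key=counts.get)   # most represented domain
            victims = [i for i, (_, d) in enumerate(self.items)
                       if d == largest]
            self.items.pop(self.rng.choice(victims))
        self.items.append((sample, pseudo_domain))

# A stream dominated by pseudo-domain 0 followed by domain 1 still leaves
# both domains represented in the final memory, instead of the memory being
# flooded by whichever domain appears most often.
mem = BalancedRehearsalMemory(capacity=8)
for i in range(20):
    mem.insert("sample_%d" % i, 0)
for i in range(20):
    mem.insert("sample_%d" % (20 + i), 1)
```

The evict-from-largest rule is what keeps under-represented domains such as B2/B3 in the brain age experiment from being pushed out of memory.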

Experiments showed that the proposed approach improves model accuracy across different medical imaging tasks: segmentation, detection and regression. Performance of the models is improved across all domains for each task, while catastrophic forgetting is effectively counteracted. Extensive experiments on the composition of the rehearsal memory showed that CASA successfully balances training data between the real, but unknown, domains.

An open question for future research is that CASA needs to store samples from preceding domains in the memory until the end of training, which could raise data privacy concerns. A possible direction is to combine the concepts presented in this work with pseudo-rehearsal methods that do not store samples directly, but rather a privacy-preserving representation of previously seen samples. Furthermore, in our experiments the memory size was fixed, while real-world applications running for a potentially unlimited time will need strategies to expand the memory so that it holds a sufficient number of samples to cover the whole data distribution.

This work was partially supported by the Austrian Science Fund (FWF): P 35189, and by the Vienna Science and Technology Fund (WWTF): LS20-065, and by Novartis Pharmaceuticals Corporation.

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.

M.P. and J.H. declare no conflicts of interests. C.H.: Research Consultant for Siemens Healthineers and Bayer Healthcare, Stock holder at Hologic Inc. H.P.: Speakers Honoraria for Boehringer Ingelheim and Roche. Received a research grant by Boehringer Ingelheim. G.L.: Co-founder and stock holder at contextflow GmbH. Received research funding by Novartis Pharmaceuticals Corporation.


  • S. G. Armato, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman, E. A. Kazerooni, H. MacMahon, E. J.R. Van Beek, D. Yankelevitz, A. M. Biancardi, P. H. Bland, M. S. Brown, R. M. Engelmann, G. E. Laderach, D. Max, R. C. Pais, D. P.Y. Qing, R. Y. Roberts, A. R. Smith, A. Starkey, P. Batra, P. Caligiuri, A. Farooqi, G. W. Gladish, C. M. Jude, R. F. Munden, I. Petkovska, L. E. Quint, L. H. Schwartz, B. Sundaram, L. E. Dodd, C. Fenimore, D. Gur, N. Petrick, J. Freymann, J. Kirby, B. Hughes, A. Vande Casteele, S. Gupte, M. Sallam, M. D. Heath, M. H. Kuhn, E. Dharaiya, R. Burns, D. S. Fryd, M. Salganicoff, V. Anand, U. Shreter, S. Vastagh, B. Y. Croft, and L. P. Clarke (2011) The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A completed reference database of lung nodules on CT scans. Medical Physics 38 (2), pp. 915–931. External Links: Document, ISSN 00942405 Cited by: §4.1.
  • J. C. Beer, N. J. Tustison, P. A. Cook, C. Davatzikos, Y. I. Sheline, R. T. Shinohara, and K. A. Linn (2020) Longitudinal ComBat: A method for harmonizing longitudinal multi-scanner imaging data. NeuroImage 220. External Links: Document, ISSN 10959572 Cited by: §2.
  • A. Bobu, E. Tzeng, J. Hoffman, and T. Darrell (2018) Adapting to continuously shifting domains. In ICLR Workshop, Cited by: §2.
  • S. Budd, E. C. Robinson, and B. Kainz (2019) A Survey on Active Learning and Human-in-the-Loop Deep Learning for Medical Image Analysis. External Links: Link Cited by: §1, §2, item 4.
  • V. M. Campello, P. Gkontra, C. Izquierdo, C. Martin-Isla, A. Sojoudi, P. M. Full, K. Maier-Hein, Y. Zhang, Z. He, J. Ma, M. Parreno, A. Albiol, F. Kong, S. C. Shadden, J. C. Acero, V. Sundaresan, M. Saber, M. Elattar, H. Li, B. Menze, F. Khader, C. Haarburger, C. M. Scannell, M. Veta, A. Carscadden, K. Punithakumar, X. Liu, S. A. Tsaftaris, X. Huang, X. Yang, L. Li, X. Zhuang, D. Vilades, M. L. Descalzo, A. Guala, L. La Mura, M. G. Friedrich, R. Garg, J. Lebel, F. Henriques, M. Karakas, E. Cavus, S. E. Petersen, S. Escalera, S. Segui, J. F. Rodriguez-Palomares, and K. Lekadir (2021) Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation: The M&Ms Challenge. IEEE Transactions on Medical Imaging, pp. 1–1. External Links: Document Cited by: §4.1.
  • D. C. Castro, I. Walker, and B. Glocker (2020) Causality matters in medical imaging. Nature Communications 11 (1), pp. 1–10. External Links: Link, Document, ISSN 20411723 Cited by: §2.
  • M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021) A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: Link, Document, ISSN 0162-8828 Cited by: §2.
  • N. K. Dinsdale, M. Jenkinson, and A. I.L. Namburete (2020) Unlearning Scanner Bias for MRI Harmonisation in Medical Image Segmentation. Communications in Computer and Information Science 1248 CCIS, pp. 15–25. External Links: ISBN 9783030527907, Document, ISSN 18650937 Cited by: §4.2.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010) The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88 (2), pp. 303–338. External Links: Link, Document, ISSN 0920-5691 Cited by: §4.2.
  • J. P. Fortin, N. Cullen, Y. I. Sheline, W. D. Taylor, I. Aselcioglu, P. A. Cook, P. Adams, C. Cooper, M. Fava, P. J. McGrath, M. McInnis, M. L. Phillips, M. H. Trivedi, M. M. Weissman, and R. T. Shinohara (2018) Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167, pp. 104–120. External Links: Document, ISSN 10959572 Cited by: §2.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning Zoubin Ghahramani. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1050–1059. Cited by: item 4.
  • L. Gatys, A. Ecker, and M. Bethge (2016) A Neural Algorithm of Artistic Style. Journal of Vision 16 (12), pp. 326. External Links: Document, ISSN 1534-7362 Cited by: §3.2.
  • B. Glocker, R. Robinson, D. C. Castro, Q. Dou, and E. Konukoglu (2019) Machine Learning with Multi-Site Imaging Data: An Empirical Study on the Impact of Scanner Effects. arXiv Preprint. External Links: Link Cited by: §2.
  • C. Gonzalez, G. Sakas, and A. Mukhopadhyay (2020) What is Wrong with Continual Learning in Medical Image Segmentation?. arXiv Preprint. External Links: Link Cited by: §1.
  • H. Guan and M. Liu (2021) Domain Adaptation for Medical Image Analysis: A Survey. External Links: Link Cited by: §2.
  • J. Hofmanninger, M. Perkonigg, J. A. Brink, O. Pianykh, C. Herold, and G. Langs (2020) Dynamic Memory to Alleviate Catastrophic Forgetting in Continuous Learning Settings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12262 LNCS, pp. 359–368. External Links: ISBN 9783030597122, Document, ISSN 16113349 Cited by: §2, §3.2.
  • N. Karani, K. Chaitanya, C. Baumgartner, and E. Konukoglu (2018) A lifelong learning approach to brain MR segmentation across scanners and protocols. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 11070 LNCS, pp. 476–484. External Links: ISBN 9783030009274, Document, ISSN 16113349 Cited by: §1, §2.
  • P. J. LaMontagne, T. L. S. Benzinger, J. C. Morris, S. Keefe, R. Hornbeck, C. Xiong, E. Grant, J. Hassenstab, K. Moulder, A. G. Vlassenko, M. E. Raichle, C. Cruchaga, and D. Marcus (2019) OASIS-3: Longitudinal Neuroimaging, Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease. medRxiv, pp. 2019.12.13.19014902. External Links: Link, Document Cited by: §4.1.
  • Q. Lao, X. Jiang, M. Havaei, and Y. Bengio (2020) Continuous Domain Adaptation with Variational Domain-Agnostic Feature Replay. External Links: Link Cited by: §2.
  • M. Lenga, H. Schulz, and A. Saalbach (2020) Continual Learning for Domain Adaptation in Chest X-ray Classification. In Conference on Medical Imaging with Deep Learning (MIDL), External Links: Link Cited by: §1, §2.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, pp. 6468–6477. External Links: ISSN 10495258 Cited by: item 1.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §5.4.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation - Advances in Research and Theory 24 (C), pp. 109–165. External Links: Document, ISSN 00797421 Cited by: §1.
  • F. Ozdemir, P. Fuernstahl, and O. Goksel (2018) Learn the New, Keep the Old: Extending Pretrained Models with New Anatomy and Images. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11073 LNCS, pp. 361–369. External Links: ISBN 9783030009366, Document, ISSN 16113349 Cited by: §2.
  • S. Özgün, A. Rickmann, A. G. Roy, and C. Wachinger (2020) Importance Driven Continual Learning for Segmentation Across Domains. pp. 423–433. External Links: Link, Document Cited by: §1, §2.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: A review. Neural Networks 113, pp. 54–71. External Links: Link, Document, ISSN 08936080 Cited by: §2.
  • J. Pedrosa, G. Aresta, C. Ferreira, M. Rodrigues, P. Leitão, A. S. Carvalho, J. Rebelo, E. Negrão, I. Ramos, A. Cunha, and A. Campilho (2019) LNDb: A lung nodule database on computed tomography. arXiv, pp. 1–12. External Links: ISSN 23318422 Cited by: §4.1.
  • M. Perkonigg, J. Hofmanninger, and G. Langs (2021) Continual Active Learning for Efficient Adaptation of Machine Learning Models to Changing Image Acquisition. In Advances in Information Processing in Medical Imaging, IPMI, Cited by: §1.
  • O. S. Pianykh, G. Langs, M. Dewey, D. R. Enzmann, C. J. Herold, S. O. Schoenberg, and J. A. Brink (2020) Continuous learning AI in radiology: Implementation principles and early applications. Radiology 297 (1), pp. 6–14. External Links: Document, ISSN 15271315 Cited by: §2, §2.
  • F. Prayer, J. Hofmanninger, M. Weber, D. Kifjak, A. Willenpart, J. Pan, S. Röhrich, G. Langs, and H. Prosch (2021) Variability of computed tomography radiomics features of fibrosing interstitial lung disease: A test-retest study. Methods 188, pp. 98–104. Cited by: §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. External Links: Document, ISSN 01628828 Cited by: §4.2, §4.2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. pp. 1–8. External Links: Link, ISBN 9783319245737, Document, ISSN 16113349 Cited by: §4.2.
  • A. A. A. Setio, A. Traverso, T. de Bel, M. S.N. Berens, C. v. d. Bogaard, P. Cerello, H. Chen, Q. Dou, M. E. Fantacci, B. Geurts, R. v. d. Gugten, P. A. Heng, B. Jansen, M. M.J. de Kaste, V. Kotov, J. Y. Lin, J. T.M.C. Manders, A. Sóñora-Mengana, J. C. García-Naranjo, E. Papavasileiou, M. Prokop, M. Saletta, C. M. Schaefer-Prokop, E. T. Scholten, L. Scholten, M. M. Snoeren, E. L. Torres, J. Vandemeulebroucke, N. Walasek, G. C.A. Zuidhof, B. v. Ginneken, and C. Jacobs (2017) Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Medical Image Analysis 42, pp. 1–13. External Links: Link, Document, ISSN 13618415 Cited by: §4.1.
  • A. Smailagic, P. Costa, A. Gaudio, K. Khandelwal, M. Mirshekari, J. Fagert, D. Walawalkar, S. Xu, A. Galdran, P. Zhang, A. Campilho, and H. Y. Noh (2020) O-MedAL: Online active deep learning for medical image analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 (4), pp. 1–15. External Links: Document, ISSN 19424795 Cited by: §2.
  • R. Venkataramani, H. Ravishankar, and S. Anamandra (2019) Towards continuous domain adaptation for medical imaging. In Proceedings - International Symposium on Biomedical Imaging, Vol. 2019-April, pp. 443–446. External Links: ISBN 9781538636411, Document, ISSN 19458452 Cited by: §2.
  • Z. Wu, X. Wang, J. Gonzalez, T. Goldstein, and L. Davis (2019) ACE: Adapting to changing environments for semantic segmentation. Proceedings of the IEEE International Conference on Computer Vision, pp. 2121–2130. External Links: ISBN 9781728148038, Document, ISSN 15505499 Cited by: §2.
  • Z. Zhou, J. Y. Shin, S. R. Gurudu, M. B. Gotway, and J. Liang (2021) Active, continual fine tuning of convolutional neural networks for reducing annotation efforts. Medical Image Analysis 71. External Links: Document, ISSN 13618423 Cited by: §2.
  • Z. Zhou, V. Sodha, J. Pang, M. B. Gotway, and J. Liang (2020) Models Genesis. Medical Image Analysis, pp. 101840. External Links: Document, ISSN 13618415 Cited by: §4.2.

Appendix A.

Figure 7: Style embeddings of CASA training memories for different runs with different labelling budgets .
Figure 8: Style embeddings of training memories for different runs of CASA, UAL and NAL.
Figure 9: Distribution of images acquired with a specific scanner (C1-C4) to the discovered pseudo-domains for five runs of CASA training with , .