Continual Active Learning for Efficient Adaptation of Machine Learning Models to Changing Image Acquisition

by   Matthias Perkonigg, et al.
MedUni Wien

Imaging in clinical routine is subject to changing scanner protocols, hardware, or policies in a typically heterogeneous set of acquisition hardware. Accuracy and reliability of deep learning models suffer from those changes as data and targets become inconsistent with their initial static training set. Continual learning can adapt to a continuous data stream of a changing imaging environment. Here, we propose a method for continual active learning on a data stream of medical images. It recognizes shifts or additions of new imaging sources - domains -, adapts training accordingly, and selects optimal examples for labelling. Model training has to cope with a limited labelling budget, resembling typical real world scenarios. We demonstrate our method on T1-weighted magnetic resonance images from three different scanners with the task of brain age estimation. Results demonstrate that the proposed method outperforms naive active learning while requiring less manual labelling.




1 Introduction

The frequently changing scanner hardware, imaging protocols, and heterogeneous composition of acquisition routines in the clinical environment hamper the longevity and utility of deep learning models. After initial training on a static data set they need to be continuously adapted to the changing characteristics of the data stream acquired in imaging departments. This is challenging, since training on a data stream without a continual learning strategy can lead to catastrophic forgetting [McCloskey1989], when adapting a model to a new domain or task leads to a deterioration of performance on preceding domains or tasks. Additionally, for continual learning in a medical context manual labelling is required for new cases. However, labelling is expensive and time-consuming, requiring extensive medical knowledge. Here, active learning is a possible solution, where the aim is to identify samples from an unlabelled distribution that are most informative if labelled next. Keeping the number of cases requiring labelling as low as possible is a key challenge in active learning with medical images [Budd2019AAnalysis].

Most currently proposed continual active learning methods either do not take domain shifts in the training distributions into account or assume knowledge about the domain membership of data points. However, due to the variability in how metadata is encoded, knowledge about the domain membership cannot be assumed in clinical practice [Gonzalez2020WhatSegmentation]. Therefore, we need a technique to reliably detect domain shifts in a continuous data stream. Combining continual and active learning with the detection of domain shifts can ensure that models perform well on a growing repertoire of image acquisition technology, while at the same time minimizing the resource requirements and the day-to-day effort necessary for continued model training.


We propose an online active learning method in a setting where the domain membership (i.e. which scanner an image was acquired with) is unknown and in which we learn continuously from a data stream, by selecting an informative set of training examples to be annotated. Figure 1 illustrates the experimental setup reflecting this scenario. A model is pre-trained on data of one scanner. Subsequently, our method observes a continual stream and automatically detects if domain shifts occur, activating the learning algorithm to incorporate knowledge about the new domain in the form of labelled training examples. We evaluate the model on data from all observed domains, to assess if the model performs well on a specific task for all observed scanners. The proposed algorithm combines online active learning and novel domain recognition for continual learning. The proposed Continual Active Learning for Scanner Adaptation (CASA) approach uses a rehearsal method and an active labelling approach without any prior knowledge of the domain membership of each data sample. CASA learns on a stream of radiology images with a limited labelling budget, which is desirable to keep the required expert effort low.

Figure 1: Experimental setup: A model is pre-trained on scanner A and then subsequently updated on a data stream gradually including data of scanner B and scanner C. The model has to recognize the domain shifts, and identify cases whose annotation will contribute best to the model accuracy. The final model is evaluated on test sets of all three scanners.
Related work

Our work is related to three research areas: first, continual learning, with the goal of adapting a deep learning system to a continual stream of data while mitigating catastrophic forgetting; second, active learning, which aims to wisely select samples to be labelled by an oracle to keep the number of required labels low; third, domain adaptation, with the goal of adapting knowledge learned on one domain to a new target domain.

An overview of continual learning in medical imaging is given in [Pianykh2020ContinuousApplications]. Here, we focus on the group of continual learning methods most closely related to ours, namely rehearsal and pseudo-rehearsal methods, which store a subset of training samples and periodically use them for training. Our memory interprets different domains as images having a different style, which is closely related to [Hofmanninger2020DynamicSettings]. They use a dynamic memory that updates the subset used for training, without assuming domain knowledge, based on a gram-matrix-based distance. Furthermore, [Ozdemir2018LearnImages] incrementally add different anatomy to a segmentation network by keeping a representative set of images from previous tasks and adding a new segmentation head for each new task. [Lenga2020] used the pseudo-rehearsal technique Learning-without-Forgetting (LwF) [Li2018] on a domain adaptation problem on chest X-ray.

A detailed review and discussion of active learning and human-in-the-loop systems in medical imaging can be found in [Budd2019AAnalysis]. In addition, [Pianykh2020ContinuousApplications] describes how continual learning approaches combined with human-in-the-loop algorithms can be useful in clinical practice. The setting in which our method operates is closely related to stream-based selective sampling as described in [Budd2019AAnalysis], where a continual stream of unannotated data is assumed. There, the decision of whether or not labelling is required is made for every item on the stream independently. However, the authors claim that this family of methods has limited benefits due to the isolated decision for each sample. In this work, we alleviate this limitation by taking the collected information about the observed data distribution into account.

In the area of domain adaptation, various approaches were proposed to adapt to continuously shifting domains. [Wu2019ACE:Segmentation] proposed an approach for semantic segmentation in street scenes under different lighting conditions. [Lao2020ContinuousReplay, Bobu2018AdaptingDomains] proposed replay methods for continuous domain adaptation and showed performance on benchmark data sets (rotated MNIST [Bobu2018AdaptingDomains] and Office-31 [Lao2020ContinuousReplay], respectively). In the area of medical imaging, an approach for lung segmentation in X-rays with shifting domains was proposed in [Venkataramani2019].

2 Method

The proposed method CASA is a continual rehearsal method with active sample selection. We assume an unlabelled, continual image data stream with unknown domain shifts. An oracle can be queried with an image and returns the corresponding task label. In a clinical setting, this oracle would be a radiologist who labels an image. Since expert annotations are expensive, label generation by the oracle is limited by a label budget. The goal of CASA is to train a task network on such a data stream within the label budget, while mitigating catastrophic forgetting. We achieve that by detecting pseudo-domains, defined as groups of images similar in terms of style. The identification of pseudo-domains helps to keep a diverse set of training examples. The proposed approach consists of two modules and two memories, which are connected by the CASA algorithm described in the following.

Figure 2: Overview of the CASA algorithm. Each sample from the data stream is processed by the pseudo-domain module to decide whether it is routed to the oracle or to the outlier memory. Whenever a new item is added to the outlier memory, it is evaluated whether a new pseudo-domain (pd) should be created. The oracle labels a sample and stores it in the training memory, from which the task module trains a network. Binary decision alternatives resulting in discarding the sample are left out for clarity of the figure.

2.1 CASA Training Scheme

Before starting continual training, the task module is pre-trained on a labelled data set of image-label pairs of a particular scanner. This pre-training is done in a conventional epoch-based training procedure, so that continual training starts from a model that performs well for a given scanner.

After pre-training is finished, continual training follows the scheme depicted in Figure 2. As a first step, an input mini-batch is drawn from the data stream, and the pseudo-domain module decides for each image whether to add it to one of the memories (the training memory or the outlier memory) or to discard it. The training memory is a fixed-size memory of labelled samples, each consisting of the image, the corresponding task label, and the pseudo-domain the image was assigned to. Task labels are generated by calling the oracle. Pseudo-domains are defined and assigned by the process described in Section 2.2.1. The training memory is initialized with a random subset of the pre-training data set before starting continual training. The outlier memory holds a set of unlabelled images, each stored together with a count of how long (measured in training steps) the image has been a member of the outlier memory.

If there are pseudo-domains for which training is not completed, a training step is performed by sampling a training mini-batch from the training memory and then training the task module for one step on this mini-batch. This process continues with the next input mini-batch drawn from the data stream until the end of the stream.
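The routing logic of this training scheme can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the stream is processed one image at a time instead of in mini-batches, the label budget is expressed as a maximum number of oracle calls, the training memory is unbounded, pseudo-domain discovery is omitted, and all callables (`assign`, `is_completed`, `oracle`, `train_step`) are hypothetical stand-ins.

```python
import random


def casa_stream_loop(stream, assign, is_completed, oracle, train_step,
                     label_budget, batch_size=4):
    """Route each stream image to a memory (or discard it), then train.

    All callables are hypothetical stand-ins:
      assign(image)        -> pseudo-domain id, or None if no domain matches
      is_completed(domain) -> True if training for that domain is done
      oracle(image)        -> task label (each call uses up label budget)
      train_step(batch)    -> one optimisation step on labelled samples
    """
    training_memory = []   # entries: (image, label, pseudo_domain)
    outlier_memory = []    # entries: [image, training steps spent in memory]
    labels_used = 0
    seen_domains = set()

    for image in stream:
        domain = assign(image)
        if domain is None:
            outlier_memory.append([image, 0])
        elif not is_completed(domain) and labels_used < label_budget:
            training_memory.append((image, oracle(image), domain))
            labels_used += 1
            seen_domains.add(domain)
        # Otherwise the image is discarded (completed domain or budget spent).

        # Train only while some pseudo-domain has not completed training.
        if training_memory and any(not is_completed(d) for d in seen_domains):
            k = min(batch_size, len(training_memory))
            train_step(random.sample(training_memory, k))
            for entry in outlier_memory:
                entry[1] += 1  # age outliers in units of training steps

    return training_memory, outlier_memory, labels_used
```

In the full method, the training memory has a fixed size and the outlier memory is periodically searched for new pseudo-domains; both are left out here to keep the control flow visible.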

2.2 Pseudo-domain module

The pseudo-domain module is responsible for evaluating the style of an image and assigning the image to a pseudo-domain. Pseudo-domains are defined as groups of images that are similar in terms of style. Since our method does not assume direct domain (i.e. scanner and/or protocol) knowledge, we use the procedure described below to identify pseudo-domains as part of continual learning. We define the set of pseudo-domains as a set of pairs, one per pseudo-domain. The first element of each pair is a trained Isolation Forest (IF) [TonyLiu2008IsolationForest] used as a one-class anomaly detector for the pseudo-domain. We use IFs because of their simplicity and their good performance on small sample sizes. The second element is the running average of a classification or regression performance metric of the target task, calculated on the most recent elements of the pseudo-domain that have been labelled by the oracle. The performance metric is measured before training on the sample. The running average is used to evaluate whether the pseudo-domain has completed training, that is, whether it lies above a fixed performance threshold for classification tasks or below it for regression tasks. CASA then uses pseudo-domains to assess whether training for a specific style is needed and to diversify the training memory.
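The completion check for a pseudo-domain can be illustrated with a small tracker. This is a sketch under assumptions: the class name, the window length of 3, and the strict inequalities are choices made for illustration and are not taken from the paper.

```python
from collections import deque


class PseudoDomainPerformance:
    """Running average of a task metric over the most recent labelled samples."""

    def __init__(self, window, threshold, higher_is_better):
        self.values = deque(maxlen=window)  # keeps only the last `window` values
        self.threshold = threshold
        self.higher_is_better = higher_is_better

    def update(self, metric_value):
        """Record the metric measured before training on the sample."""
        self.values.append(metric_value)

    @property
    def running_average(self):
        return sum(self.values) / len(self.values) if self.values else None

    def completed(self):
        """Classification: average above threshold. Regression: below it."""
        average = self.running_average
        if average is None:
            return False
        if self.higher_is_better:
            return average > self.threshold
        return average < self.threshold


# Regression example: absolute errors of consecutive labelled samples.
tracker = PseudoDomainPerformance(window=3, threshold=5.0, higher_is_better=False)
for absolute_error in [9.0, 6.0, 5.0, 4.0, 3.0]:
    tracker.update(absolute_error)
```

Because the metric is recorded before the training step on the sample, the running average reflects how well the current model already handles unseen images of that pseudo-domain.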

2.2.1 Pseudo-domain assignment

To assign an image to an existing pseudo-domain, we use a style network, which is pre-trained on a different data set (not necessarily related to the task) and not updated during training. From this network, we evaluate the style of an image based on the gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $N_l$ is the number of feature maps in layer $l$. Following [Gatys2016, Hofmanninger2020DynamicSettings], $G^l_{ij}(x)$ is defined as the inner product between the vectorized activations $f_{il}(x)$ and $f_{jl}(x)$ of two feature maps $i$ and $j$ in a layer $l$ given a sample image $x$:

$$G^l_{ij}(x) = \frac{1}{N_l M_l} f_{il}(x)^\top f_{jl}(x)$$

where $M_l$ denotes the number of elements in the vectorized feature map. Based on the gram matrix, we define a gram embedding $e(x)$: for a set of convolutional layers $\mathcal{C}$ of the style network, we calculate the gram matrices ($G^l \mid l \in \mathcal{C}$) and apply a dimensionality reduction using Sparse Random Projection (SRP) [Li2007VeryReduction]. The isolation forests of the pseudo-domains are trained on those embeddings $e(x)$. We assign an image $x$ to the pseudo-domain $j$ whose isolation forest maximizes its decision function $s_j$:

$$p(x) = \arg\max_{j} \, s_j(e(x))$$

If the image cannot be assigned to any existing pseudo-domain, it is added to the outlier memory, from which new pseudo-domains are detected (see Section 2.5). If the pseudo-domain is known and has completed training, we discard the image; otherwise it is added to the training memory according to the strategy described in Section 2.4.
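The computation of a gram embedding and the assignment rule can be sketched with NumPy. Several parts are assumptions made for illustration: the normalisation of the gram matrix by the number of feature maps times the number of elements per map, the hand-written sparse projection (a simplified stand-in for a library SRP implementation), and the convention that a positive decision score marks an inlier. The decision functions are passed in as plain callables so that any one-class detector can be plugged in.

```python
import numpy as np


def gram_matrix(feature_maps):
    """Gram matrix of one layer; feature_maps has shape (N_l, *spatial)."""
    n_maps = feature_maps.shape[0]
    flat = feature_maps.reshape(n_maps, -1)   # vectorise each feature map
    n_elements = flat.shape[1]                # elements per vectorised map
    return flat @ flat.T / (n_maps * n_elements)


def sparse_projection_matrix(n_features, n_components, rng):
    """Sparse random projection: entries are 0 or +/- one constant."""
    density = 1.0 / np.sqrt(n_features)
    scale = np.sqrt(1.0 / (density * n_components))
    signs = rng.choice([-1.0, 1.0], size=(n_components, n_features))
    mask = rng.random((n_components, n_features)) < density
    return scale * signs * mask


def gram_embedding(layer_activations, projection):
    """Concatenate the flattened gram matrices of all layers and project."""
    grams = np.concatenate([gram_matrix(a).ravel() for a in layer_activations])
    return projection @ grams


def assign_pseudo_domain(embedding, decision_functions):
    """Index of the best-scoring pseudo-domain, or None if no score is positive.

    Treating scores <= 0 as "outlier" is an assumption of this sketch.
    """
    scores = [f(embedding) for f in decision_functions]
    if not scores or max(scores) <= 0:
        return None                           # candidate for the outlier memory
    return int(np.argmax(scores))


# Random stand-ins for the activations of two layers of a style network.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 8, 8, 8)), rng.normal(size=(6, 4, 4, 4))]
projection = sparse_projection_matrix(4 * 4 + 6 * 6, 10, rng)
embedding = gram_embedding(layers, projection)
```

The projection matrix is drawn once and reused for all images, so that embeddings of different images are comparable.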

2.3 Task module

The task module is responsible for learning the target task (e.g. age estimation in brain MRI). The main component of this module is the task network, which learns a mapping from input image to target label. This module is pre-trained on a labelled data set. During continual training, the module is updated by drawing training mini-batches from the training memory and performing a training step on each of the batches. At the end of training, the aim of CASA is that the task module performs well on images of all scanners present in the data stream without suffering catastrophic forgetting.

2.4 Training memory

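The training memory has a fixed size and is balanced between the pseudo-domains currently known. Each pseudo-domain can occupy up to an equal share of the memory. If a new pseudo-domain is added (see Section 2.5), a random subset of the elements of every previous pseudo-domain is flagged for deletion, so that only the reduced share per pseudo-domain remains protected in the memory. If a new element is to be inserted and the share of its pseudo-domain is not yet reached, an element currently flagged for deletion is replaced by the new element. Otherwise, the new element replaces the element in the memory that is of the same pseudo-domain and minimizes the distance between the gram embeddings. Formally, for a new element with image $x_k$ and pseudo-domain $d_k$, we replace the memory element with index

$$\xi = \arg\min_{n \,:\, d_n = d_k} \lVert e(x_k) - e(x_n) \rVert$$

where $n$ runs over the elements of the training memory and $e(\cdot)$ denotes the gram embedding.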
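The replacement strategy can be sketched in a few lines. The sketch assumes that every pseudo-domain is entitled to an equal share of the memory (memory size divided by the number of pseudo-domains) and that the distance between gram embeddings is Euclidean; the dictionary field names are hypothetical.

```python
import math
import random


def flag_for_new_domain(memory, n_domains_after, rng=random):
    """Flag random elements of each existing domain so only its new share is kept."""
    keep = len(memory) // n_domains_after
    for domain in {m["domain"] for m in memory}:
        protected = [i for i, m in enumerate(memory)
                     if m["domain"] == domain and not m["flagged"]]
        for i in rng.sample(protected, max(0, len(protected) - keep)):
            memory[i]["flagged"] = True


def insert_into_memory(memory, new_item, share_per_domain):
    """Insert new_item in place and return the index that was overwritten.

    Items are dicts with the (hypothetical) keys 'embedding', 'domain',
    'label' and 'flagged'.
    """
    same_domain = [i for i, m in enumerate(memory)
                   if m["domain"] == new_item["domain"] and not m["flagged"]]
    flagged = [i for i, m in enumerate(memory) if m["flagged"]]

    if len(same_domain) < share_per_domain and flagged:
        index = flagged[0]        # reuse a slot that was flagged for deletion
    elif same_domain:
        # Replace the element of the same pseudo-domain with the closest embedding.
        index = min(same_domain,
                    key=lambda i: math.dist(memory[i]["embedding"],
                                            new_item["embedding"]))
    else:
        raise ValueError("no slot available for this pseudo-domain")

    memory[index] = dict(new_item, flagged=False)
    return index
```

Replacing the closest element of the same pseudo-domain keeps the memory spread out in embedding space, since near-duplicates overwrite each other instead of accumulating.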


2.5 Outlier memory and pseudo-domain identification

The outlier memory holds candidate examples that do not fit an already identified pseudo-domain and might form a new one. Examples are stored until they are assigned to a new pseudo-domain or until they have been in the memory for a fixed number of training steps. If no new pseudo-domain is discovered for an image within that time, it is considered a ’real’ outlier and removed from the outlier memory. Within the outlier memory, we discover new pseudo-domains to add to the set of pseudo-domains. The discovery process is started when the number of elements in the outlier memory reaches a fixed threshold. Whether a dense region is present in the memory is checked by calculating the pairwise Euclidean distances of all elements in the outlier memory. If there is a group of images whose distances are below a threshold, a new IF is fitted to the gram embeddings of the dense region and the set of pseudo-domains is updated by adding the new pseudo-domain. For all elements belonging to the new pseudo-domain, labels are queried from the oracle and the elements are transferred from the outlier memory to the training memory.
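One simple way to check for a dense region is sketched below. The grouping rule (a point plus all neighbours within the distance threshold) and the `min_group_size` parameter are assumptions of this sketch; the text above only states that pairwise Euclidean distances are compared against a threshold.

```python
import math


def find_dense_region(embeddings, distance_threshold, min_group_size):
    """Return the indices of a dense group of embeddings, or [] if none exists.

    For every point, collect all points closer than distance_threshold
    (the point itself included). The largest such group is returned if it
    has at least min_group_size members.
    """
    best = []
    for centre in embeddings:
        group = [j for j, other in enumerate(embeddings)
                 if math.dist(centre, other) < distance_threshold]
        if len(group) > len(best):
            best = group
    return best if len(best) >= min_group_size else []


# Three nearby gram embeddings and two isolated ones (toy 2-D values).
outliers = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [9.0, 9.0]]
dense = find_dense_region(outliers, distance_threshold=0.5, min_group_size=3)
```

The returned indices would then be used to fit a new isolation forest on the corresponding gram embeddings and to move the elements to the training memory once the oracle has labelled them.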

3 Experiments and Results

We evaluated our method on the task of brain age estimation on T1-weighted magnetic resonance imaging (MRI) and compared CASA to different baseline techniques, described in Section 3.2. First, we evaluate the task performance on all domains to show how the mean absolute error between predictions and ground truth changes on the validation set (Section 3.4). Furthermore, we evaluate backward transfer (BWT), the ability of continual learning to improve accuracy on existing domains by adding new domains, and forward transfer (FWT), the contribution of previous domains in the training data to improving the accuracy on new domains [Lopez-Paz2017]. BWT measures how learning a new domain influences the performance on previous tasks; FWT quantifies the influence on future tasks. Negative BWT values indicate catastrophic forgetting, thus avoiding negative BWT is especially important for continual learning. In Section 3.5, we analyze the memory elements at the end of training to evaluate whether the detected pseudo-domains match the real domains determined by the scanner types.
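For reference, the two transfer measures can be computed from a matrix of results as in the sketch below, which follows the definitions of [Lopez-Paz2017] for accuracy-like metrics. The `higher_is_better` switch, which flips the sign for error metrics such as MAE so that negative BWT still indicates forgetting, is an assumption of this sketch, as are the toy numbers.

```python
def backward_transfer(results, higher_is_better=True):
    """BWT from results[i][j]: performance on domain j after learning domain i."""
    n = len(results)
    sign = 1.0 if higher_is_better else -1.0
    total = sum(results[n - 1][i] - results[i][i] for i in range(n - 1))
    return sign * total / (n - 1)


def forward_transfer(results, baseline, higher_is_better=True):
    """FWT relative to baseline[j], the performance of a reference model on j."""
    n = len(results)
    sign = 1.0 if higher_is_better else -1.0
    total = sum(results[i - 1][i] - baseline[i] for i in range(1, n))
    return sign * total / (n - 1)


# Toy MAE values for three domains learned in sequence (lower is better).
mae = [
    [6, 9, 10],   # after learning domain 0
    [5, 6, 8],    # after learning domain 1
    [5, 5, 6],    # after learning domain 2
]
reference_mae = [20, 20, 20]
bwt = backward_transfer(mae, higher_is_better=False)
fwt = forward_transfer(mae, reference_mae, higher_is_better=False)
```

In this toy example the error on earlier domains decreases after later domains are learned, so BWT is positive.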

3.1 Data set

Split        1.5T IXI   3.0T IXI   3.0T OASIS   Total
Pre-train    201        0          0            201
Continual    52         146        1504         1702
Validation   31         18         187          236
Test         31         18         187          236
Table 1: Data: Splitting of the data into a pre-train, continual, validation, and test set. The number of cases in each split is shown.

We use data pooled from two different data sets containing three different scanners. We use the IXI data set and data from OASIS-3 [LaMontagne2019OASIS-3:Disease]. From IXI, we use data from a Philips Gyroscan Intera 1.5T and a Philips Intera 3.0T scanner; from OASIS-3, we use data from a Siemens TrioTim 3.0T scanner. Data was split into a pre-train, a continual training, a validation, and a test set (see Table 1). Images are resized to 64x128x128 and normalized to a range between 0 and 1.

3.2 Methods compared in the evaluation

We compared four methods in our evaluation:

  1. Joint model: a model trained in a standard, epoch-based approach on samples from all scanners in the experiment jointly.

  2. Scanner models: a separate model for each scanner in the experiment trained with standard epoch-based training. The evaluation for a scanner is done for each scanner setting separately.

  3. Naive AL: a naive, continuously trained active learning approach that labels every n-th sample from the data stream, where n depends on the labelling budget.

  4. CASA (proposed method): The method described in this work. The settings for experimental parameters for CASA are described in Section 3.3.
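The sample selection of the naive AL baseline amounts to a fixed stride over the stream. The sketch below assumes that the oracle is queried for every n-th sample counted from the start of the stream; the stream length of 1702 is the size of the continual set in Table 1, and the stride of 20 is the value mentioned in the caption of Figure 3.

```python
def naive_al_indices(stream_length, every_nth):
    """Indices of the samples a fixed-stride strategy sends to the oracle."""
    return [i for i in range(stream_length) if (i + 1) % every_nth == 0]


selected = naive_al_indices(stream_length=1702, every_nth=20)
n_labelled = len(selected)  # number of oracle calls over the whole stream
```

Because the stride ignores the content of the stream, a scanner that contributes few images, such as 3.0T IXI with 146 continual cases, receives correspondingly few labels.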

The joint and scanner-specific models (1, 2) require the whole training data set to be labelled and available at once, thus they are an upper limit against which our method is compared. Naive AL and CASA use an oracle to label specific samples only. The comparison with naive AL evaluates the gains of our method of choosing samples to label by detecting domain shifts. Note that, as the goal of our experiments is to show the impact of CASA in comparison to baseline active continual learning strategies and not to develop the best network for brain age estimation, we do not compare to state-of-the-art brain age estimation methods.

3.3 Experimental setup

The task model is a simple feed-forward network as described in [Dinsdale2020UnlearningSegmentation] and provided on GitHub. The style network used in the pseudo-domain module is a 3D-ModelGenesis model pre-trained on computed tomography images of the lung [Zhou2020ModelsGenesis]. We run experiments with different parameter settings, evaluating the influence of the memory size, the task performance threshold, and the labelling budget, expressed as a fraction of the continual training set. We test the influence of the labelling budget with four settings, corresponding to n=85, n=170, n=212, and n=340 labelled cases. For the performance threshold, we tested 5.0 and 7.0 to demonstrate the influence of the threshold on the labelling need of CASA. Values for the threshold were set after observing the performance of the baseline models. The main performance measure we use for brain age estimation is the mean absolute error (MAE) between predicted and true age.

3.4 Model Accuracy Across Domains

Figure 3 shows how the mean absolute error on the validation set changes during training for CASA and the naive AL approach. Adaptation to new domains is much faster when using our approach, even if the labelling budget is low. Furthermore, as seen from the curves, training is more stable. At the end of training, CASA reaches lower validation errors than naive AL across all scanners.

Figure 3: Evaluation for a fixed memory size and performance threshold with different labelling budgets. The y-axis shows the mean absolute error (MAE, lower is better) of the model on the validation set. Zoomed-in sections show training steps of particular interest: (a) CASA detects the domain shift to 3.0T IXI and trains on the new domain; this also leads to a large forward transfer for 3.0T OASIS. Naive AL only takes every 20th image for training, thus failing to adapt to 3.0T IXI. (b) CASA is relatively stable, while naive AL incorporates more knowledge about 3.0T OASIS images and starts to overfit on those, while showing slight forgetting for 1.5T IXI and 3.0T IXI. (c) At the end of the continuous stream, CASA shows an equal performance for all three scanners, while naive AL leads to good performance on the last domain but significantly poorer results on previous domains.

Table 2 compares different parameter settings. The performance gap between CASA and naive AL is especially large for images of 3.0T IXI. There, our approach successfully identified the new domain as a pseudo-domain and trained the model accordingly. Naive AL takes every n-th element according to the labelling budget, thus samples from scanner 3.0T IXI are seen less often and the model cannot adapt to this domain. The advantage of CASA increases as the labelling budget is reduced. Evaluating the two memory sizes 64 and 128 shows that CASA could not gain performance over a naive AL approach when the memory is small. Comparing CASA with the performance thresholds 5.0 and 7.0 demonstrates that a more challenging threshold leads to better overall performance; furthermore, CASA takes more advantage of a larger labelling budget when the more challenging threshold is used.

Labelled   Meth.   Memory size   Threshold   1.5T IXI   3.0T IXI   3.0T OASIS   BWT   FWT
[74-85]    CASA    128           5.0
[70-85]    CASA    128           7.0
85         NAL     128           -
[90-170]   CASA    128           5.0
[74-75]    CASA    128           7.0
170        NAL     128           -
[91-212]   CASA    128           5.0
[74-75]    CASA    128           7.0
212        NAL     128           -
[91-112]   CASA    128           5.0
[69-110]   CASA    128           7.0
340        NAL     128           -
[62-66]    CASA    64            7.0
85         NAL     64            -
[63-69]    CASA    64            7.0
170        NAL     64            -
[63-64]    CASA    64            7.0
212        NAL     64            -
[62-67]    CASA    64            7.0
340        NAL     64            -
Table 2: Results for age estimation on a test set reported as mean absolute error (MAE, lower is better; intervals give the range of results over 3 independent runs with different seeds). The table compares CASA, naive active learning (NAL), individual scanner models (ScM), and a joint model trained from all data (JM). The column Labelled denotes the amount of labelling by the oracle needed during training.

The BWT and FWT comparison in Table 2 shows positive values for all approaches. Since the task remains the same and only the imaging domains change, this is the expected behaviour. Backward transfer is similar between CASA and naive AL over all settings. For a subset of the settings, we see a clear FWT gap between CASA and naive AL. This shows that CASA is able to choose meaningful samples that are also helpful for subsequent tasks.

Figure 4: Detected pseudo-domains capture scanner differences. Left: Distribution of pseudo-domains in the training memory after training over 3 scanners. Right: TSNE embedding of the gram embeddings for the whole training set, with marked positions of the elements in the training memory.

3.5 Evaluation of the Memory and Pseudo-Domains

Here, we analyze the training memory for one parameter setting with a memory size of 128. Other parameter settings show similar trends. The final training memory with CASA consists of 47 1.5T IXI, 36 3.0T IXI and 45 3.0T OASIS images. In comparison, for naive AL, 96 1.5T IXI, 7 3.0T IXI and 25 3.0T OASIS images are stored in the memory. This demonstrates that with CASA the memory is more diverse and represents all three scanners in roughly equal proportions, while for naive AL images from 3.0T IXI are heavily underrepresented. Detecting and balancing between the pseudo-domains, as done in CASA, is beneficial to represent the diversity of the training data.

Figure 4 illustrates the capability of pseudo-domain detection to match real domains. We detect 5 pseudo-domains with the parameter setting mentioned above. Each pseudo-domain is present mainly in a single scanner, with the third scanner being represented by three pseudo-domains (2, 3 and 4). This might be related to appearance variability within a scanner. We plot a t-distributed Stochastic Neighbor Embedding (TSNE) [maaten2008visualizing] of the gram embeddings of all samples in the training set. The shaded areas represent the density of samples from the individual scanners. The markers show where the elements of the final training memory are located in the embedding of the whole data set and their assignments to pseudo-domains. This embedding demonstrates that samples in the final training memory are distributed over the embedding space, with areas of high sampling density for the underrepresented scanner 3.0T IXI and less dense areas for 3.0T OASIS, which has more samples.

4 Conclusion

We propose a continual active learning method for the adaptation of deep learning models to changing image acquisition technology. It detects emerging domains corresponding to new imaging characteristics, and optimally assigns training examples for labeling in a continuous stream of imaging data. Continual learning is necessary to enable models that cover a changing set of scanners. Experiments show that the proposed approach improves model accuracy over all domains, avoids catastrophic forgetting, and exploits a limited budget of annotations well.


This work was supported by Novartis Pharmaceuticals Corporation and received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765148.