In clinical practice, medical images are produced under continuously changing policies, protocols, scanner hardware and settings, resulting in a different visual appearance of scans despite the same underlying biology. In most cases, such a shift in visual appearance is intuitive for clinicians and does not lessen their ability to assess a scan. However, the performance of machine-learning-based methods can deteriorate significantly when the data distribution changes. Continuous learning adapts a model to changing data and new tasks by sequentially updating the model on novel cases. The ground-truth labels of these cases can, for instance, be acquired by corrective labelling. However, catastrophic forgetting is a major undesired phenomenon affecting continuous learning approaches. That is, when a model is continuously updated on a new task or a different data distribution, its performance deteriorates on the preceding tasks. Alleviating catastrophic forgetting is one of the major challenges in continuous learning.
Here, we propose an approach for a scenario where new data become sequentially available for model training. We aim to utilize such a data stream to frequently update an existing model without the requirement of keeping the entire data available. We assume a real-world clinical setting where changes in the visual domain and in the appearance of classification targets can occur gradually over time, and where information about such a shift is not available to the continuous learning system. Figure 1 illustrates the scenario and the experimental setup. A base model has been trained to perform well on a certain classification task. Subsequently, a continuous data stream is used to update the model with the aim of learning variations of the initial task given new data. This model should become accurate on new data while, at the same time, staying accurate on data generated by previous technology. Accordingly, the final model is evaluated on all tasks to assess the effect of catastrophic forgetting and the classification performance. Note that here we use the term task to denote the detection of the same target on shifted visual domains, and not additional target classes.
There are various groups of methods dealing with the problem of continuously updating a machine learning model over time, such as continuous learning, continuous domain adaptation and active learning. They operate on similar but distinct assumptions about the problem setting, the data available and the level of supervision required. For example, continuous domain adaptation assumes novel data to be shifted but closely related to the previous data. Continuous learning makes no assumptions about domain shifts and is not limited to performing a specific task on new data (e.g. incremental learning). Active learning is characterized by the task of automatically selecting examples for which supervision is beneficial. In this work, we propose a continuous learning technique which can also be categorized as a supervised continuous domain adaptation method.
Various methods for continuous learning have been proposed to alleviate catastrophic forgetting in scenarios where multiple tasks are learned sequentially.
A popular method to retain previous knowledge in a network is elastic weight consolidation (EWC). EWC is a regularization technique aiming to constrain parameters of the model which are critical for performing previous tasks during the training of new tasks. Alternative methods attempt to overcome catastrophic forgetting by rehearsing past examples or proxy information (pseudorehearsal) when new information is added. In the field of medical imaging, continuous learning has been demonstrated to reduce catastrophic forgetting on segmentation and classification tasks. Karani et al. proposed domain-specific batch norm layers to adapt to new domains (different MR protocols) while learning segmentations of various brain regions. Baweja et al. applied EWC to sequentially learn normal brain structure and white matter lesion segmentation. Ravishankar et al. proposed a pseudorehearsal technique and training of task-specific dense layers for pneumothorax classification. These current approaches expect that the information about the domain to which a training example belongs is available to the learning system. In real-world medical imaging data, such information may not be available at the image level (e.g. a change in treatment policies, or different hardware updates across departments). At the same time, changes in protocol or scanner manufacturer may not automatically lead to a loss of model performance, and considering each protocol as a novel domain may lead to adverse effects such as overfitting.
We propose an approach for continuous learning of continuously or repeatedly shifting domains. This is in contrast to most previous methods treating updates as sequentially adding well specified tasks. In contrast to existing rehearsal methods, we propose a technique that automatically infers data shifts without explicit knowledge about them. To this end, the method maintains a diverse memory of previously seen examples that is dynamically updated based on high level representations in the network.
We continuously update the parameters of an already trained model with new training data. Our approach composes this training data to capture novel data characteristics while sustaining the diversity of the overall training corpus. It chooses examples from previously seen data (dynamic memory $\mathcal{M}$) and new examples (input-mini-batch $\mathcal{B}$) to form the training data (training-mini-batch $\mathcal{T}$) for the model update.
Our approach is a rehearsal method to counter catastrophic forgetting in continuous learning. We adopt a dynamic memory (DM) $\mathcal{M}$ of fixed size $M$ holding image-label pairs that are stored and eventually replaced during continuous training. To alleviate catastrophic forgetting, a subset of cases of $\mathcal{M}$ is used for rehearsal during every update step. It is critical that the diversity of $\mathcal{M}$ is representative of the visual variation across all tasks, even without explicit knowledge about the task membership of training examples. As the size of $\mathcal{M}$ is fixed, the most critical step of such an approach is to decide which samples to keep in $\mathcal{M}$ and which to replace with a new sample. To this end, we define a memory update strategy based on the following rules: (1) every novel case is stored in the memory, (2) a novel case can only replace a case in memory of the same class, and (3) the case in memory that is replaced is close to the novel case according to a high-level metric. Rule 1 allows the memory to dynamically adapt to changing variation. Rule 2 prevents class imbalance in the memory, and rule 3 prevents the replacement of previous cases if they are visually distant. The metric used in rule 3 is critical, as it ensures that cases of previous tasks are kept in memory and not fully replaced over time. We define a high-level metric based on the gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $N_l$ is the number of feature maps in layer $l$.
$G^l_{ij}$ is defined as the inner product between the vectorized activations $f^l_i(x)$ and $f^l_j(x)$ of two feature maps $i$ and $j$ in a layer $l$ given a sample image $x$:

$$G^l_{ij}(x) = \frac{1}{N_l M_l} f^l_i(x)^\top f^l_j(x),$$

where $M_l$ denotes the number of elements in the vectorized feature map (width $\times$ height). For a set of convolution layers $\mathcal{C}$ we define a gram distance between two images $x$ and $y$ as:

$$\delta(x, y) = \sum_{l \in \mathcal{C}} \frac{1}{N_l^2} \sum_{i=1}^{N_l} \sum_{j=1}^{N_l} \left( G^l_{ij}(x) - G^l_{ij}(y) \right)^2.$$
The rationale behind using the gram matrix is the fact that it encodes high-level style information. Here, we are interested in this style information to maintain a memory that is diverse not only with respect to content but also with respect to different visual appearances. Similar gram distances have been used in computer vision in the area of neural style transfer as a way to compare the style of two natural images.
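To make the metric concrete, the gram matrices and the gram distance can be sketched as follows. This is a minimal NumPy version that assumes the activations have already been extracted as arrays of shape (N_l, H, W); the function names are illustrative, not part of our implementation:

```python
import numpy as np

def gram_matrix(feats):
    """Normalized gram matrix of a layer's activations of shape (N_l, H, W)."""
    n_l, h, w = feats.shape
    f = feats.reshape(n_l, h * w)        # vectorize each feature map
    return (f @ f.T) / (n_l * h * w)     # inner products, scaled by N_l * M_l

def gram_distance(feats_x, feats_y):
    """Gram distance over a set of layers.

    feats_x / feats_y map layer names to activation arrays of shape
    (N_l, H, W) for images x and y, respectively.
    """
    d = 0.0
    for layer in feats_x:
        g_x = gram_matrix(feats_x[layer])
        g_y = gram_matrix(feats_y[layer])
        # squared differences, weighted by 1 / N_l^2 per layer
        d += np.sum((g_x - g_y) ** 2) / g_x.shape[0] ** 2
    return d
```

Identical images have zero distance, and stylistically distant images yield larger values, which is what the replacement rule relies on.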
During continuous training, a memory update is performed after an input-mini-batch $\mathcal{B}$ of $b$ sequential cases (images $x_i$ with labels $y_i$) is taken from the data stream. Sequentially, each element of $\mathcal{B}$ replaces an element of $\mathcal{M}$. More formally, given an input sample $(x, y)$, the sample will replace the element in $\mathcal{M}$ with index

$$i^* = \operatorname*{arg\,min}_{i \,:\, n_i = y} \delta(x, m_i),$$

i.e., the stored case of the same class that is closest in terms of the gram distance.
During the initial phase of continuous training, the memory is filled with elements of the data stream. Only after the desired proportion of a class in the memory is reached is the replacement strategy applied. After the memory is updated, a model update is done by assembling a training-mini-batch $\mathcal{T}$ of size $t$. Each element of $\mathcal{B}$ for which the model predicted the wrong label is added to $\mathcal{T}$, and additional cases are randomly drawn from $\mathcal{M}$ until $|\mathcal{T}| = t$. Finally, using the training-mini-batch $\mathcal{T}$, a forward and backward pass is performed to update the parameters of the model.
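The memory update and training-batch assembly described above can be sketched as follows. This is a simplified, illustrative implementation: `gram_distance` is assumed to be a callable returning the distance defined earlier, the per-class fill target is taken as an equal share of the memory, and all names are hypothetical:

```python
import random

class DynamicMemory:
    """Sketch of the dynamic memory update (rules 1-3) and batch assembly."""

    def __init__(self, size, num_classes, gram_distance):
        self.size = size
        self.num_classes = num_classes
        self.gram_distance = gram_distance
        self.images, self.labels = [], []

    def update(self, x, y):
        """Store a novel case, replacing the closest same-class case if full."""
        # Initial phase: fill until the desired class proportion is reached.
        target = self.size // self.num_classes
        if self.labels.count(y) < target:
            self.images.append(x)
            self.labels.append(y)
            return
        # Rules 2 and 3: replace the same-class element with minimal distance.
        same_class = [i for i, l in enumerate(self.labels) if l == y]
        idx = min(same_class, key=lambda i: self.gram_distance(x, self.images[i]))
        self.images[idx] = x

    def training_batch(self, misclassified, t):
        """Misclassified stream cases plus random memory draws, up to size t."""
        batch = list(misclassified)
        while len(batch) < t:
            i = random.randrange(len(self.images))
            batch.append((self.images[i], self.labels[i]))
        return batch
```

A real implementation would store tensors and cache the gram matrices of memory elements; the control flow, however, follows the update rules above.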
3 Experiments and Results
We evaluated and studied the DM method in a realistic setting using medical images from clinical routine. To this end, we collected a representative dataset described in Section 3.1. Based on these data, we designed a learning scenario with three tasks (A, B and C) in which a classifier pre-trained on task A is continuously updated with changing input data over time. Figure 1 illustrates the learning scenario, the data, and the experimental setup. Within the continuous dataset, we created two shifts: (1) a modality-shift between scanner protocols and (2) a target-shift by changing the target structure from high to low intensity.
Table 1 (columns: Task A, Task B, Task C, Total)
In total, we collected 8883 chest CT scans from individual studies and, for each, extracted an axial slice at the center of the segmented lung. Each scan was performed on a Siemens scanner with either a B3 reconstruction kernel and 3mm slice thickness (B3/3) or a B6 reconstruction kernel and 1mm slice thickness (B6/1). We collected 3784 cases with the B3/3 protocol and 5099 cases with the B6/1 protocol. We imprinted a synthetic target structure in the form of a cat at random locations, rotations and scales in 50% of the cases (see also Figure 1). The high-intensity target structures were engraved by randomly adding an offset between 200 and 400 Hounsfield units (HU), and the low-intensity target structures by subtracting between 200 and 400 HU. A synthetic target was chosen to facilitate dataset collection, as well as to create a dataset without label noise. Table 1 lists the data collected and shows the partitioning into base, continuous, validation and test splits and the stratification into the three tasks.
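The engraving of the synthetic targets can be illustrated with a short NumPy sketch. The generation of the mask (the cat shape at a random location, rotation and scale) is omitted here, and the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def imprint_target(slice_hu, mask, low_intensity=False):
    """Engrave a synthetic target into a CT slice (values in Hounsfield units).

    `mask` is a boolean array marking the target shape. A random offset
    between 200 and 400 HU is added for high-intensity targets and
    subtracted for low-intensity targets.
    """
    offset = rng.uniform(200, 400)
    out = slice_hu.astype(np.float64).copy()
    out[mask] += -offset if low_intensity else offset
    return out
```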
We created the base model by fine-tuning a pre-trained ResNet50 model, as provided by the pytorch-torchvision library (https://pytorch.org), on the base training set (Task A). Given this base model, we continuously updated the model parameters on the continuous training set using different strategies:
Naive: The naive approach serves as a baseline method by training sequentially on the data stream, without any specific strategy to counter catastrophic forgetting.
Elastic Weight Consolidation (EWC): As a reference method, we used EWC, which regularizes weights that are critical for previous tasks based on the Fisher information. We calculated the Fisher information after training on the base set (Task A) so that, in further updates, weights that are important for this task are regularized.
EWC-fBN: In preliminary experiments, we found that EWC is not suitable for networks using batch norm layers in scenarios where a modality-shift occurs. The reason is that the regularization of EWC does not affect the batch norm statistics (mean and variance), which are constantly updated by novel data. Thus, to quantify this effect, we show results where we fixed the batch norm layers once the base training was completed.
Dynamic Memory (DM): The method as described in this paper. The input-mini-batch size $b$ and training-mini-batch size $t$ were set to 8 for all experiments. If not stated differently, results were computed with a memory size of $M = 32$. The gram matrices were calculated on feature maps throughout the network, covering multiple scales. Specifically, we used the feature maps of the last convolution layers before the number of channels is increased; that is, for ResNet50 we calculate the matrices on four maps with 256, 512, 1024 and 2048 features.
In addition to the continuous scenario, we trained an upper-bound network in a conventional, epoch-based way using all training data at once (full training). All training processes were run five times to assess variability and robustness, and the results were averaged. All models were trained using the Adam optimizer and binary cross-entropy as the loss function for the classification task.
With the described methods, we

- studied the dynamics during training by calculating the classification accuracies on the validation set every 30 iterations;
- calculated quantitative results after training on the test set for each task separately. Classification accuracy, backward transfer (BWT) and forward transfer (FWT) were used as metrics. BWT and FWT measure the influence that learning a new task has on the performance of previously learned and future tasks, respectively. Thus, larger BWT and FWT values are preferable; specifically, negative BWT quantifies catastrophic forgetting;
- studied the influence of memory size on the performance of our method during and after the training process. We included experiments with $M$ set to 16, 32, 64, 80, 128 and 160.
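The BWT and FWT metrics can be computed from a task-accuracy matrix as sketched below, assuming $R_{i,j}$ holds the test accuracy on task $j$ after sequential training up to task $i$, and $b_i$ the accuracy of the randomly initialized model on task $i$ (a minimal NumPy sketch):

```python
import numpy as np

def backward_transfer(R):
    """BWT: mean of R[T-1, i] - R[i, i] over previously learned tasks i < T-1.

    Negative BWT quantifies catastrophic forgetting.
    """
    T = R.shape[0]
    return float(np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)]))

def forward_transfer(R, b):
    """FWT: mean of R[i-1, i] - b[i] over future tasks i > 0, where b[i] is
    the accuracy of the untrained model on task i."""
    T = R.shape[0]
    return float(np.mean([R[i - 1, i] - b[i] for i in range(1, T)]))
```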
Dynamics during training are shown in Figure 2. As expected, the naive approach shows catastrophic forgetting for task A (drop of accuracy from 0.91 to 0.51) and task B (drop from 0.93 to 0.70) after new tasks are introduced. Without any method to counteract forgetting, knowledge of previous tasks is lost after sequential updates to the model. EWC exhibits catastrophic forgetting for task A (0.92 to 0.54) after the introduction of task B and C data into the data stream. The shift from task A to B is a modality shift: the batch norm layers of the network adapt to the second modality, and knowledge about the first modality is lost. Although EWC protected the weights that are important for task A, those weights were no longer useful after the batch norm layers were adapted. EWC-fBN avoids this problem by fixing the batch norm layers together with the weights that are relevant for task A. In this setting, a forgetting effect for task B (accuracy drop from 0.90 to 0.70) can be observed after the target-shift to task C. This effect is due to the requirement of EWC to know when shifts occur; thus, in our scenario, EWC only regularizes weights that are important for task A. As described previously, DM does not have this requirement, as it dynamically adapts to changing data. Using the DM approach, only mild forgetting occurs and all three tasks reach a comparable performance of about 0.85 accuracy after training.
Table 2 (columns: ACC Task A, ACC Task B, ACC Task C, BWT, FWT)
Quantitative results are shown in Table 2. The large negative BWT values for the naive approach (-0.32) and EWC (-0.20) indicate that these methods suffer from catastrophic forgetting. Using EWC-fBN mitigates the forgetting for task A, but the model forgets part of the knowledge for task B when task C is introduced (observable in Figure 2). Both DM and EWC-fBN show comparable backward and forward transfer capabilities. The accuracy values in Table 2 show that DM performs equally well on all tasks, while the other approaches show signs of forgetting for task A (naive and EWC) and task B (naive and EWC-fBN).
The influence of memory size during training is shown in Figure 3(a). For a small $M$ of 16, adapting to new tasks is fast. However, such a limited memory can only store limited data variability, leading to catastrophic forgetting effects. Increasing the memory size decreases catastrophic forgetting but slows the adaptation to new tasks. For the two largest investigated sizes, 128 and 160, adaptation, especially after the target-shift (task C), is slower. The reason is that more elements of task A are stored in the memory and more iterations are needed to fill the memory with samples of task C. This reduces the probability that elements from task C are drawn and, in turn, slows down training for task C. For the same reason, with higher memory sizes there is an increase in task A accuracy, since examples from the first task are seen more often by the network. The results indicate that setting the memory size is a trade-off between faster adaptation to novel data and more catastrophic forgetting. In our setting, training worked comparably well with $M$ of 32, 64 and 80. Considering memory sizes between 16 and 160, the best-performing setting reached the highest average accuracy of 0.87 on the validation set (Figure 3(b)). Therefore, we used this model for the comparison to other continuous learning methods on the test set.
Here, we presented a continuous learning approach to deal with modality- and target-shifts induced by changes in protocols, parameter settings or different scanners in a clinical setting. We showed that maintaining a memory of diverse training samples mitigates catastrophic forgetting of a deep learning classification model. We proposed a memory update strategy that is able to automatically handle shifts in the data distribution without explicit information about domain membership or the moment such a shift occurs.
This work was supported by Austrian Science Fund (FWF) I 2714B31 and Novartis Pharmaceuticals Corporation.
- Baweja, C., Glocker, B., Kamnitsas, K. (2018) Towards continual learning in medical imaging. Cited by: §1.
- Gatys, L.A., Ecker, A.S., Bethge, M. (2016) A Neural Algorithm of Artistic Style. Journal of Vision 16 (12), pp. 326. Cited by: §2.
- He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.2.
- Hofmanninger, J., et al. (2020) Automatic lung segmentation in routine imaging is a data diversity problem, not a methodology problem. Cited by: §3.1.
- Karani, N., Chaitanya, K., Baumgartner, C., Konukoglu, E. (2018) A lifelong learning approach to brain MR segmentation across scanners and protocols. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 11070 LNCS, pp. 476–484. Cited by: §1.
- Kingma, D.P., Ba, J. (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR). Cited by: §3.2.
- Kirkpatrick, J., et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: §1, 2nd item.
- Lopez-Paz, D., Ranzato, M. (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6468–6477. Cited by: §1, 2nd item.
- McCloskey, M., Cohen, N.J. (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation 24, pp. 109–165. Cited by: §1.
- Ravishankar, H., et al. (2019) Feature Transformers: Privacy Preserving Lifelong Learners for Medical Imaging. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pp. 347–355. Cited by: §1.
- Robins, A. (1995) Catastrophic Forgetting, Rehearsal and Pseudorehearsal. Connection Science 7 (2), pp. 123–146. Cited by: §1.