
CLOPS: Continual Learning of Physiological Signals

Deep learning algorithms are known to experience destructive interference when instances violate the assumption of being independent and identically distributed (i.i.d.). This violation, however, is ubiquitous in clinical settings where data are streamed temporally and from a multitude of physiological sensors. To overcome this obstacle, we propose CLOPS, a healthcare-specific replay-based continual learning strategy. In three continual learning scenarios based on three publicly available datasets, we show that CLOPS can outperform its multi-task learning counterpart. Moreover, we propose end-to-end trainable parameters, which we term task-instance parameters, that can be used to quantify task difficulty and similarity. This quantification yields insights into both network interpretability and clinical applications, where task difficulty is poorly quantified.



1 Introduction

Deep learning algorithms typically expect instances from a dataset to be independent and identically distributed (i.i.d.). Therefore, violating these assumptions can be detrimental to the training behaviour and performance of an algorithm. For instance, the independence assumption can be violated when data are streamed temporally from a particular sensor. On the other hand, the introduction of multiple sensors in a changing environmental context can introduce covariate shift, arguably the ‘Achilles heel’ of machine learning model deployment (Quiñonero-Candela et al., 2009).

A plethora of realistic scenarios violate the i.i.d. assumption. This is particularly true in healthcare where the multitude of physiological sensors generate time-series recordings that may vary temporally (due to seasonal diseases; e.g. flu), across patients (due to different hospitals or hospital settings), and in their modality. Tackling the challenges posed by such scenarios is the focus of continual learning (CL) whereby a learner, when exposed to tasks in a sequential manner, is expected to perform well on current tasks without compromising performance on previously seen tasks. The outcome is a single algorithm that can reliably solve a multitude of tasks. Given the potential impact of designing such an algorithm and the machine learning community’s efforts towards achieving artificial general intelligence, research on continual learning has increased (Parisi et al., 2019). However, most, if not all, research in this field has been limited to a small handful of imaging datasets (Lopez-Paz and Ranzato, 2017; Aljundi et al., 2019b, a). Although understandable from a benchmarking perspective, this approach fails to explore the utility and design of healthcare-specific continual learning methodologies. This is despite the potential impact of CL on medical diagnostics as mentioned in Farquhar and Gal (2018). To the best of our knowledge, we are the first to explore and propose a CL approach in the context of physiological signals.

The dynamic and chaotic environment that characterizes healthcare necessitates the availability of algorithms that are dynamically reliable; those that can adapt to potential covariate shift without catastrophically forgetting how to perform tasks from the past. Such dynamic reliability has a twofold effect. Firstly, a CL algorithm no longer needs to be retrained on data or tasks to which it has been exposed in the past, thus improving its data-efficiency. Secondly, consistently strong performance by an algorithm across a multitude of tasks increases its trustworthiness, which is a desirable trait sought by medical professionals (Spiegelhalter, 2020).

Our Contributions. In this paper, we propose a healthcare-specific replay-based continual learning methodology that is based on the following:

  1. Importance-Guided Storage: task-instance parameters, one scalar for each instance in each task, that serve as informative signals for loss-weighting and buffer-storage.

  2. Uncertainty-Based Acquisition: an active learning inspired methodology that determines the degree of informativeness of an instance and thus acts as a buffer-acquisition mechanism.

2 Related Work

Continual learning (CL) has resurfaced in recent years with van de Ven and Tolias (2019) suggesting a three-tier categorization of methods: those based on 1) dynamic architectures, 2) regularization, and 3) memory. Although we only review memory-based methods as they are most similar to our approach, an extensive summary of all methods can be found in (Parisi et al., 2019). Memory-based methods replay instances from previously seen tasks while training on the current task. For instance, in Learning without Forgetting (LwF) Li and Hoiem (2017), parameters from the previous task are used to generate soft targets for current task inputs, forming an auxiliary loss. Methods that involve a replay buffer include iCaRL Rebuffi et al. (2017), CLEAR Rolnick et al. (2019), GEM Lopez-Paz and Ranzato (2017), and aGEM Chaudhry et al. (2018) where the latter two naively populate their replay buffer with the last m examples observed for a particular task. A more sophisticated buffer storage strategy is employed by Isele and Cosgun (2018) and Aljundi et al. (2019b) where the latter solves a quadratic programming problem in the absence of task boundaries. By way of contrast, we propose task-instance parameters to guide the storage of instances from each task. Aljundi et al. (2019a) store instances using reservoir sampling and propose sampling instances that incur the greatest change in loss if parameters were to be updated on the subsequent task. Such a process is computationally expensive as it requires two forward and backward passes per batch. In contrast to previous research that independently investigated buffer storage and acquisition strategies, we focus on a dual storage and acquisition strategy. Moreover, the only work that focuses on CL for medical purposes is that of Lenga et al. (2020) wherein existing methodologies are simply implemented on chest X-ray datasets.

Active learning (AL) and healthcare have been relatively under-explored within machine learning. Early work in AL acquires instances using a mixture of Gaussians to minimize the variance of a learner (Cohn et al., 1996) and a support vector machine (SVM) to reduce the size of the version space (Tong and Koller, 2001). A more complete review of active learning methodologies can be found in Settles (2009). In the healthcare domain, Gong et al. (2019) propose a Bayesian deep latent Gaussian model to acquire important features from electronic health record (EHR) samples in the MIMIC dataset (Johnson et al., 2016) to improve mortality prediction. Also dealing with EHR data, Chen et al. (2013) use the distance of unlabelled samples from the hyperplane in an SVM to acquire datapoints. Wang et al. (2019) implement an RNN with active learning to acquire ECG samples during training. Zhou et al. (2017) propose using transfer learning in conjunction with a convolutional neural network to acquire biomedical images in an online manner. Smailagic et al. (2018) actively acquire unannotated medical images by measuring their distance in a latent space to images in the training set. Such similarity metrics, however, are sensitive to the amount of available labelled training data. This work was extended to the online domain by Smailagic et al. (2019). Gal et al. (2017) adopt BALD (Houlsby et al., 2011) in the context of Monte Carlo Dropout to acquire datapoints that maximize the Jensen-Shannon divergence (JSD) across MC samples. BatchBALD (Kirsch et al., 2019) is an extension to this work which considers correlations between samples within the acquired batch. To the best of our knowledge, we are the first to employ AL-inspired acquisition functions in the context of CL.

3 Background

3.1 Continual Learning

In this work, we consider a learner, $f_\omega$, a neural network parameterized by $\omega$ that, for each task $\mathcal{T}$, maps inputs $x_\mathcal{T} \in \mathbb{R}^m$ of dimension $m$ to outputs $y_\mathcal{T} \in \mathbb{R}^C$, where $C$ is the number of classes. This learner is exposed to tasks sequentially whereby new tasks are only tackled once previous tasks are mastered. In this paper, we formulate our tasks based on a modification of the three-tier categorization of continual learning proposed by van de Ven and Tolias (2019). In all of the following cases, task identities are absent during both training and testing and neural architectures are single-headed.

  1. Class Incremental Learning (Class-IL) - in this case, mutually-exclusive pairs of classes belonging to the same dataset are presented to the learner in a sequential manner.

  2. Time Incremental Learning (Time-IL) - in this case, although the same dataset and prediction problem are used for each task, the time of year at which the data were collected differs from one task to another. Such seasonality is most common in healthcare applications.

  3. Domain Incremental Learning (Domain-IL) - in this case, although the same dataset and prediction problem are used for each task, the modality of the inputs differs from one task to another.

4 Methods

The two key ideas behind our proposal are the storage and acquisition of instances from a buffer such that destructive interference is mitigated. We describe these in more detail below.

4.1 Task-Instance Parameters, $\beta$

Task-instance parameters, $\beta_{i\mathcal{T}}$, are learnable parameters assigned to each instance, i, in each task, $\mathcal{T}$. We use them to weight instance losses and populate a replay buffer. Such a buffer, $\mathcal{D}_B$, of finite size, $M$, consists of instances that augment the training set of subsequent tasks.

Loss-Weighting Mechanism. We incorporate $\beta_{i\mathcal{T}}$ as a coefficient of the loss, $\mathcal{L}_{i\mathcal{T}}$, for each instance of the current task. The objective function of a mini-batch of size, $B$, where $B_k$ is the number of samples from each task, k, is as follows.


In this setup, the gradient of the objective with respect to each task-instance parameter is proportional to its (non-negative) instance loss, so lower values of $\beta$ are indicative of instances that are relatively difficult to train on. After discovering that $\beta$ decays to zero and thus hinders learning, we introduced a regularization term that allows for informative loss signals while also maintaining differences between instances, as shown in Sec. 6.1.
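One way to write the weighted objective and its regularized variant is sketched below. The notation is assumed: $\beta_i$ is the task-instance parameter of instance $i$, $\mathcal{L}_i$ its loss, $B$ the mini-batch size, and $\lambda$ the regularization coefficient. The quadratic penalty pulling $\beta_i$ towards 1 is an illustrative assumption; the text states only that a regularization term keeps the parameters from decaying to zero.

```latex
% Sketch under assumed notation; not necessarily the authors' exact formulation.
\mathcal{L}_{\text{weighted}} = \frac{1}{B} \sum_{i=1}^{B} \beta_{i}\, \mathcal{L}_{i},
\qquad
\frac{\partial \mathcal{L}_{\text{weighted}}}{\partial \beta_{i}} = \frac{1}{B}\, \mathcal{L}_{i} \;\geq\; 0

% Assumed regularizer: a quadratic pull of each \beta_i towards 1.
\mathcal{L}_{\text{regularized}} = \mathcal{L}_{\text{weighted}}
  + \frac{\lambda}{B} \sum_{i=1}^{B} \left( \beta_{i} - 1 \right)^{2}
```

Because the gradient with respect to $\beta_i$ is the instance loss itself, gradient descent shrinks $\beta_i$ fastest for instances with persistently high loss, which matches the reading of low values as difficult instances.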


For every task after the first, extra loss terms are required to account for the instances that are replayed from the buffer (see Sec. 4.2). These instances, in contrast to those from the current task, are not weighted.
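The combination of weighted current-task losses and unweighted replayed losses can be sketched in plain Python as follows. The function name `minibatch_objective`, the argument names, and the quadratic penalty on the task-instance parameters are assumptions made for illustration; a real implementation would operate on PyTorch tensors so that the parameters receive gradients.

```python
def minibatch_objective(current_losses, betas, replay_losses=(), lam=1.0):
    """Mini-batch objective with loss-weighting (illustrative sketch).

    current_losses: per-instance losses for the current task
    betas:          task-instance parameters, one per current-task instance
    replay_losses:  per-instance losses for instances replayed from the buffer
                    (these are NOT weighted)
    lam:            coefficient of the assumed quadratic penalty (beta - 1)^2
    """
    if len(current_losses) != len(betas):
        raise ValueError("need exactly one beta per current-task instance")
    batch_size = len(current_losses) + len(replay_losses)
    if batch_size == 0:
        raise ValueError("empty mini-batch")

    # Current-task instances are weighted by their task-instance parameter.
    weighted = sum(b * l for b, l in zip(betas, current_losses))
    # Replayed instances contribute their plain, unweighted loss.
    replayed = sum(replay_losses)
    # Assumed regularizer that keeps the parameters from collapsing to zero.
    penalty = lam * sum((b - 1.0) ** 2 for b in betas)

    return (weighted + replayed + penalty) / batch_size
```

For example, two current-task instances with losses 2.0 and 4.0, parameters 1.0 and 0.5, one replayed instance with loss 3.0, and `lam=2.0` give (4.0 + 3.0 + 0.5) / 3 = 2.5.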


Buffer-Storage Mechanism. Prior work suggests that task-instance parameters may converge to similar values (Saxena et al., 2019). This would limit their ability to discriminate between instances and the utility of the buffer. To circumvent this potential issue, we propose tracking the task-instance parameter after each training epoch, t, until the final epoch, $\tau$, for the task at hand. This results in the storage function, $s_{i\mathcal{T}}$, approximated using the trapezoidal rule.


Once training is complete on the task at hand, instances are ranked in descending order based on the value of the storage function, $s_{i\mathcal{T}}$, before the top b fraction are acquired. We opted for this strategy over storing the bottom b fraction due to its superiority, as shown in Sec. 6.4. Each task is allotted a fixed portion of the buffer.
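The storage step can be sketched as follows: the tracked parameter of each instance is integrated over epochs with the trapezoidal rule (unit epoch spacing assumed), and the indices of the top b fraction are returned. The function names `storage_scores` and `select_for_buffer`, and the layout of `beta_history`, are hypothetical.

```python
def storage_scores(beta_history):
    """Trapezoidal-rule area under each instance's tracked parameter curve.

    beta_history[i] holds the task-instance parameter of instance i recorded
    after each epoch (unit spacing between consecutive epochs is assumed).
    """
    scores = []
    for trace in beta_history:
        area = sum((trace[t] + trace[t + 1]) / 2.0 for t in range(len(trace) - 1))
        scores.append(area)
    return scores


def select_for_buffer(beta_history, b):
    """Indices of the top b fraction of instances, ranked by storage score."""
    scores = storage_scores(beta_history)
    n_keep = int(b * len(scores))
    # Rank instances in descending order of their storage score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_keep]


history = [
    [1.0, 0.5, 0.25],  # parameter decays quickly: a relatively hard instance
    [1.0, 1.0, 1.0],   # parameter stays high: a relatively easy instance
    [1.0, 0.8, 0.6],
    [1.0, 0.2, 0.1],
]
stored = select_for_buffer(history, b=0.5)  # indices of the two highest scores
```

Integrating over the whole trajectory, rather than using only the final value, keeps instances distinguishable even if their parameters end up close together.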

4.2 Acquisition Function, $\alpha$

Uncertainty-based acquisition functions such as BALD (Houlsby et al., 2011; Gal and Ghahramani, 2016) acquire unlabelled instances for which a set of decision boundaries disagree the most.

Buffer-Acquisition Mechanism. At epoch number, $\tau_{MC}$, which we refer to as Monte Carlo (MC) epochs, each instance, x, in the buffer, $\mathcal{D}_B$, undergoes T forward passes through the network. Each forward pass is associated with a stochastic binary dropout mask and generates a posterior probability distribution over the classes, C. These distributions are stored in a matrix $G \in \mathbb{R}^{M \times T \times C}$. An acquisition function, $\alpha$, is thus a function $\alpha: G \in \mathbb{R}^{M \times T \times C} \rightarrow \mathbb{R}^{M}$.

$$\alpha = \mathrm{JSD}(p_1, \ldots, p_T) = H\big(p(y \mid x)\big) - \mathbb{E}_{p(\omega \mid \mathcal{D}_{train})}\Big[ H\big(p(y \mid x, \hat{\omega})\big) \Big] \qquad (6)$$

Here, $H(p(y \mid x))$ represents the entropy of the posterior predictive distribution averaged across the MC samples, and $\hat{\omega} \sim p(\omega \mid \mathcal{D}_{train})$ as in Gal and Ghahramani (2016). At sample epochs, $\tau_S$, instances are ranked in descending order based on the acquisition function before the top a fraction from each task in the buffer are sampled. This replay procedure exposes the network to instances from previous tasks that it is most confused about and thus nudges it to avoid destructive interference in a data-efficient manner. Algorithms 1-4 illustrate the entire procedure.
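A minimal sketch of a BALD-style score computed from Monte Carlo dropout outputs is given below: the entropy of the averaged distribution minus the average entropy of the individual distributions. The function names and the nested-list layout of the MC outputs are assumptions; `acquire_from_buffer` ranks the instances of a single task, so it would be called once per task stored in the buffer.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def bald_score(mc_probs):
    """Disagreement across T stochastic forward passes for one instance.

    mc_probs: list of T probability vectors, one per dropout mask.
    Returns H(mean prediction) - mean(H(each prediction)).
    """
    num_passes = len(mc_probs)
    num_classes = len(mc_probs[0])
    mean_p = [sum(p[c] for p in mc_probs) / num_passes for c in range(num_classes)]
    mean_entropy = sum(entropy(p) for p in mc_probs) / num_passes
    return entropy(mean_p) - mean_entropy


def acquire_from_buffer(task_mc_probs, a):
    """Indices of the top a fraction of one task's buffered instances."""
    scores = [bald_score(m) for m in task_mc_probs]
    n_take = int(a * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_take]


agree = [[0.5, 0.5], [0.5, 0.5]]     # passes agree: score is zero
disagree = [[1.0, 0.0], [0.0, 1.0]]  # passes disagree: score is log(2)
picked = acquire_from_buffer([agree, disagree], a=0.5)
```

Note that an instance whose every pass is uniformly uncertain scores zero: the score rewards disagreement between passes, not uncertainty within a single pass.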

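Algorithm 1 CLOPS
  Input: MC epochs $\tau_{MC}$, sample epochs $\tau_S$, MC samples T, storage fraction b, acquisition fraction a, task data $\mathcal{D}_\mathcal{T}$, buffer $\mathcal{D}_B$, training epochs per task $\tau$
  for each mini-batch of the task data do
     calculate the loss using eq. 4
     update the parameters
     if epoch = $\tau$ then
        $\mathcal{D}_B$ = StoreInBuffer($\beta$, b)
     end if
     if epoch in $\tau_{MC}$ then
        G = MonteCarloSamples($\mathcal{D}_B$)
     end if
     if epoch in $\tau_S$ then
        replayed instances = AcquireFromBuffer($\mathcal{D}_B$, G, a)
     end if
  end for

Algorithm 2 StoreInBuffer
  Input: task-instance parameters $\beta$, b
  calculate $s$ using eq. 5
  SortDescending($s$)
  store the top b fraction of instances in the buffer

Algorithm 3 MonteCarloSamples
  Input: $\mathcal{D}_B$
  for each instance in $\mathcal{D}_B$ do
     for MC sample in T do
        obtain the posterior distribution and store in G
     end for
  end for

Algorithm 4 AcquireFromBuffer
  Input: $\mathcal{D}_B$, MC posterior distributions G, a
  calculate $\alpha$ using eq. 6
  SortDescending($\alpha$)
  acquire the top a fraction of instances from each task in the buffer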
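The control flow of Algorithm 1 within a single task can be sketched as a schedule of actions per epoch. The function name `clops_schedule` and the string labels are hypothetical, and the actual training, Monte Carlo sampling, and buffer operations are left out; the sketch shows only when each step fires.

```python
def clops_schedule(num_epochs, mc_epochs, sample_epochs):
    """List the actions taken at each epoch of one task (illustrative only).

    num_epochs:    training epochs for the task
    mc_epochs:     set of epochs at which MC forward passes are run on the buffer
    sample_epochs: set of epochs at which instances are acquired from the buffer
    """
    plan = []
    for epoch in range(1, num_epochs + 1):
        actions = ["train"]              # compute the loss and update parameters
        if epoch == num_epochs:
            actions.append("store")      # populate the buffer after the last epoch
        if epoch in mc_epochs:
            actions.append("mc_sample")  # T stochastic passes per buffered instance
        if epoch in sample_epochs:
            actions.append("acquire")    # replay the most uncertain instances
        plan.append((epoch, actions))
    return plan


plan = clops_schedule(num_epochs=3, mc_epochs={2}, sample_epochs={2, 3})
```

In this toy schedule, acquisition at epoch 3 reuses the MC outputs gathered at epoch 2, which is why MC epochs must not come after the sample epochs that depend on them.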

5 Experimental Design

5.1 Datasets and Network

Experiments were implemented in PyTorch (Paszke et al., 2019). Given our emphasis on healthcare settings, we evaluate our approach on three publicly available datasets that include physiological time-series data such as the electrocardiogram (ECG) alongside cardiac arrhythmia labels. We use $\mathcal{D}_1$ = Cardiology ECG (Hannun et al., 2019) (12-way), $\mathcal{D}_2$ = Chapman ECG (Zheng et al., 2020) (4-way), and $\mathcal{D}_3$ = PhysioNet 2020 ECG (9-way, multi-label). The same network was used for all formulations and can be found in Appendix B.

5.2 Hyperparameters

In all experiments, we chose the number of training epochs for each task, , as we found that to achieve strong performance on the validation set. We chose the MC epochs and the sample epochs where in order to sample data from the buffer at every epoch following the first task. The only constraint that these values must satisfy is . For computational reasons, we chose the storage fraction of the size of the training dataset and the acquisition fraction of the number of samples per task in the buffer. To calculate the acquisition function, we chose the number of Monte Carlo samples, . We also explore the effect of changing these values on the performance. We chose the regularization coefficient, .

5.3 Continual Learning Scenarios

Here, we outline the three primary continual learning scenarios we use for our experiments. Sample sizes corresponding to the tasks within these scenarios can be found in Appendix A.

  1. Class-IL - in this case, the dataset is split into six tasks, each comprising a mutually-exclusive pair of classes. This scenario allows us to evaluate the sensitivity of a network to new classes.

  2. Time-IL - in this case, dataset is split into three tasks; Term 1, Term 2, and Term 3 corresponding to mutually-exclusive times of the year during which patient data were collected. This scenario allows us to evaluate the effect of temporal non-stationarity on a network’s performance.

  3. Domain-IL - in this case, dataset is split according to the 12 leads of an ECG; 12 different projections of the same electrical signal generated by the heart. This scenario allows us to evaluate how robust a network is to the input distribution.

5.4 Baseline Methods

Multi-Task Learning (MTL) (Caruana, 1993) is a strategy whereby all datasets are assumed to be available at the same time and thus can be simultaneously used for training. Although this assumption may not hold in clinical settings due to the nature of data collection, privacy or memory constraints, it is nonetheless a strong baseline.

Fine-tuning is a strategy that involves updating all parameters when training on subsequent tasks as they arrive without explicitly dealing with catastrophic forgetting.

5.5 Evaluation Metrics

To evaluate our methods, we exploit metrics suggested by Lopez-Paz and Ranzato (2017), such as the average AUC and Backward Weight Transfer (BWT), and propose two additional evaluation metrics.

t-Step Backward Weight Transfer. To determine how performance changes ‘t-steps into the future’, we propose $\text{BWT}_t$, which evaluates the performance of the network on a previously-seen task after having trained on t tasks after it.
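A hedged sketch of the t-step metric, borrowing the results-matrix convention of Lopez-Paz and Ranzato (2017): let $R_j^{k}$ denote performance on task $j$ after training on the first $k$ tasks, with $N$ tasks in total. The symbol names and the averaging over the $N - t$ eligible tasks are assumptions consistent with the description above.

```latex
% Sketch under assumed notation; R_j^k = performance on task j after training through task k.
\text{BWT}_t = \frac{1}{N - t} \sum_{j=1}^{N - t} \left( R_j^{\,j+t} - R_j^{\,j} \right)
```

Negative values indicate that performance on a task has dropped t tasks after it was learned.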


Lambda Backward Weight Transfer. $\text{BWT}_t$ can be extended to all possible time-steps, t, to generate $\text{BWT}_\lambda$. Using this, one can identify potential improvements in methodology at the task-level.
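Both metrics can be sketched from a results matrix in which `R[k][j]` is the performance on task `j` after training through task `k` (zero-indexed). This layout, the function names, and the exact averaging in `bwt_lambda` (first over all later time-steps for each task, then over tasks) are assumptions that follow the verbal description rather than a confirmed formula.

```python
def bwt_t(R, t):
    """Average change in performance on a task, t tasks after it was learned."""
    n = len(R)
    if not 1 <= t < n:
        raise ValueError("t must satisfy 1 <= t < number of tasks")
    diffs = [R[j + t][j] - R[j][j] for j in range(n - t)]
    return sum(diffs) / len(diffs)


def bwt_lambda(R):
    """Average, over tasks, of the mean change across all later time-steps."""
    n = len(R)
    if n < 2:
        raise ValueError("need at least two tasks")
    per_task = []
    for j in range(n - 1):
        # Change in performance on task j after each subsequent task.
        diffs = [R[j + t][j] - R[j][j] for t in range(1, n - j)]
        per_task.append(sum(diffs) / len(diffs))
    return sum(per_task) / len(per_task)


# Toy results for three tasks; entries above the diagonal are unused here.
R = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.90],
]
one_step = bwt_t(R, 1)       # mean of (0.80 - 0.90) and (0.80 - 0.85)
all_steps = bwt_lambda(R)    # mean of -0.15 (task 0) and -0.05 (task 1)
```

The intermediate `per_task` list is what allows a task-level diagnosis: it shows which tasks are forgotten most over the remainder of training.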


6 Experiments

6.1 Class-IL

Short and long-term destructive interference is experienced by a learner when exposed to binary tasks involving novel classes. In Fig. 1(a), after achieving an AUC ≈ 0.92 on the green task, the network quickly forgets how to perform this task when shown the red task, as shown by the significant drop in the AUC. Moreover, the final performance of the network at epoch 120 on the first task (AUC ≈ 0.78) is below the best achieved performance. CLOPS, shown in Fig. 1(b), alleviates this interference: on the same task, the AUC curve is ‘flatter’ and ends at a higher value (AUC ≈ 0.85). Such improvements in BWT are quantified in Table 1. To better pinpoint the cause of this benefit, we perform ablation experiments in Section 6.5. Furthermore, task order can significantly affect the degree of destructive interference. We illustrate this phenomenon and the robustness of our method to task order in Appendix E.

Figure 1: Mean validation AUC of (a) fine-tuning and (b) CLOPS strategy in the Class-IL scenario. Each task belongs to a mutually-exclusive pair of classes from . Storage and acquisition fractions are and , respectively. Thicker curves indicate tasks on which the learner is currently being trained. The shaded area represents one standard deviation from the mean across 5 seeds.

Method        Average AUC      BWT              BWT_t              BWT_λ
MTL           0.701 ± 0.014    -                -                  -
Fine-tuning   0.770 ± 0.020    0.037 ± 0.037    (0.076) ± 0.064    (0.176) ± 0.080
CLOPS         0.796 ± 0.013    0.053 ± 0.023    0.018 ± 0.010      0.008 ± 0.016
Table 1: Performance of CL strategies in the Class-IL scenario. Storage and acquisition fractions are and , respectively. Mean and standard deviation are shown across five seeds.

Qualitative Assessment of Task-Instance Parameters. Tracked task-instance parameters, , result in distinguishable distributions as shown in Fig. 2. Since lower values are indicative of more difficult instances, the different distributions suggest that tasks differ in difficulty with being the most difficult. To support our proposed interpretation of task-instance parameters, we plot two ECG tracings that correspond to the lowest and highest values. Although both tracings belong to the class Normal Sinus Rhythm, the tracing with the lower value contains more noise and exhibits an elongated P-wave, both potential sources of confusion for the network.

Figure 2: Distribution of the tracked task-instance parameter values, , corresponding to CLOPS in the Class-IL scenario. Each colour corresponds to a different task. Storage and acquisition fractions are and , respectively with results shown for one seed. The ECG tracings associated with the lowest and highest values are shown.

6.2 Time-IL

Time-incremental learning is arguably the most relevant to medical applications. Upon introducing the first two tasks, minimal destructive interference is observed. We show this in Fig. 3(a) where the AUC of Term 1 remains unchanged after having trained on Term 2. Such behaviour, in the context of Fig. 3(b), indicates that CLOPS does not disturb performance during these phases of the training procedure.

Notably, this CL scenario sheds light on the forward weight transfer (FWT) capabilities of CLOPS. In the spirit of ‘doing more with less’, CLOPS generates an AUC ≈ 0.62 on the Term 3 (blue) task after only one epoch of training. In contrast, the fine-tuning strategy requires a 20-fold increase in training time to reach equivalent performance. We attribute this FWT to the loss-weighting role played by the task-instance parameters. By placing greater emphasis on more useful instances, the generalization performance of the algorithm is improved.

Figure 3: Mean validation AUC of (a) fine-tuning and (b) CLOPS strategy in the Time-IL scenario. Each task represents datasets collected at different times of the year. Thicker curves indicate tasks on which the learner is currently being trained. The shaded area represents one standard deviation from the mean across 5 seeds.

6.3 Domain-IL

When exposing a learner to ECG inputs belonging to different leads, a significant amount of destructive interference is experienced. This is supported by the large magnitude and negative BWT values shown for the fine-tuning strategy in Table 2. The reduction in the magnitude of the BWT values (less negative) once again illustrates the ability of CLOPS to mitigate destructive interference. This can be explained by the incorporation of instances from previous tasks about which the network is most confused. In doing so, the version space is adjusted to account for such instances, and thus allows for generalization to those tasks.

Method        Average AUC      BWT                BWT_t              BWT_λ
MTL           0.730 ± 0.016    -                  -                  -
Fine-tuning   0.687 ± 0.007    (0.041) ± 0.008    (0.047) ± 0.004    (0.070) ± 0.007
CLOPS         0.731 ± 0.001    (0.011) ± 0.002    (0.020) ± 0.004    (0.019) ± 0.009
Table 2: Performance of CL strategies in the Domain-IL scenario. Storage and acquisition fractions are and , respectively. Mean and standard deviation are shown across five seeds.

6.4 Effect of Storage Fraction, b, and Acquisition Fraction, a

Larger storage and acquisition fractions expose the learner to a more representative distribution of data from previous tasks. Arguably, this should further alleviate destructive interference and improve generalization performance. We quantify this graded response in Fig. 4(a) where the AUC increases from 0.758 to 0.832 as b and a increase from 0.1 to 1. We also claim that a performance bottleneck lies at the storage phase. This can be seen by the larger improvement in performance as a result of an increased storage fraction compared to that observed for a similar increase in acquisition fraction. Despite this, a strategy with fractions as low as and is sufficient to outperform the fine-tuning strategy.

In addition to exploring the graded effect of fraction values, we wanted to explore the effect of storing the bottom, most difficult, b fraction of instances in a buffer. The intuition is that if a learner can perform well on these difficult replayed instances, then strong performance should be expected on the remaining relatively easier instances. We show in Fig. 4(b) that although the performance seems to be on par with that in Fig. 4(a) for , the most notable differences arise at fractions (red box). We believe this is due to the extreme ‘hard-to-classify’ nature of such replayed instances. These findings justify our storing of the top b fraction of instances.

(a) Storing top b fraction
(b) Storing bottom b fraction
Figure 4: Mean validation AUC of CLOPS in the Class-IL scenario at combinations of storage fractions, b, and acquisition fractions, a. Results are shown as an average across five seeds.

6.5 Effect of Task-Instance Parameters and Acquisition Function

To understand better the root cause of CLOPS’ benefits, we conduct two ablation studies, 1) Random Storage dispenses with task-instance parameters and instead randomly stores instances into the buffer and 2) Random Acquisition dispenses with acquisition functions and instead randomly acquires instances from the buffer. We illustrate the effect of both of these strategies on generalization performance in Fig. 5. Their effect on BWT can be found in Appendix F.

(a) Random Storage
(b) Random Acquisition
Figure 5: Mean validation AUC of random strategies in the Class-IL scenario at combinations of storage fractions b and acquisition fractions a. Results are shown as an average across five seeds.

Dispensing with task-instance parameters eliminates the graded performance response observed in Fig. 4(a). This implies that, with CLOPS, more storage is better. To isolate the loss-weighting contribution of our parameters, we look to the row (yellow rectangle) in Fig. 5(a) and Fig. 4(a). At , , the absence of this loss-weighting mechanism worsens performance (AUC 0.826 vs. 0.770). We hypothesize that this mechanism is analogous to an attention mechanism placed on instance losses. Consequently, the network learns which instances to learn more from.

To isolate the effect of the acquisition function, we direct your attention to the column (green rectangle) in Fig. 5(b) and Fig. 4(a). At , , the absence of an acquisition function worsens performance (AUC 0.798 vs. 0.766). This makes sense as the acquisition function samples instances that lie in the region of uncertainty, thus reducing the size of the version space (Settles, 2009).

6.6 Quantifying Task Similarity via Task-Instance Parameters, $\beta$

Inspired by the work of Silver and Mercer (1996), Nguyen et al. (2019), and Rostami et al. (2020), we aim to use the tracked task-instance parameters to compare tasks. To do so, we first fit a Gaussian to each of the distributions in Fig. 2. We then define cross-task similarity using the Hellinger distance, $D_H$, between two Gaussian distributions parameterized by $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$, respectively.
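The comparison can be sketched with the standard closed form of the Hellinger distance between two univariate Gaussians, $D_H^2 = 1 - \sqrt{2\sigma_1\sigma_2 / (\sigma_1^2 + \sigma_2^2)}\, \exp\!\big(-(\mu_1 - \mu_2)^2 / (4(\sigma_1^2 + \sigma_2^2))\big)$. The function names are hypothetical, and fitting each Gaussian by the sample mean and population standard deviation is an assumption about how the fit is done.

```python
import math
import statistics


def fit_gaussian(values):
    """Fit a Gaussian by moments: (mean, population standard deviation)."""
    return statistics.fmean(values), statistics.pstdev(values)


def hellinger_gaussians(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Returns a value in [0, 1]: 0 for identical distributions and values
    approaching 1 for distributions with almost no overlap.
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient of the two Gaussians.
    coeff = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - coeff))


# Hypothetical tracked parameter values for two tasks.
task_a = fit_gaussian([1.0, 2.0, 3.0])
task_b = fit_gaussian([1.5, 2.5, 3.5])
distance = hellinger_gaussians(*task_a, *task_b)
```

Smaller distances mark pairs of tasks whose parameter distributions overlap more, which is the sense in which tasks are treated as similar here.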


To validate our hypothesis, a learner is presented with an easy task followed by similar tasks. We choose the first task based on the highest mean value and subsequent tasks based on cross-task similarity values (see Appendix K). These decisions are influenced by curriculum learning (Bengio et al., 2009) and the intuition that training on similar tasks is less likely to lead to destructive interference. Indeed, by following this procedure, constructive interference is improved (BWT 0.053 vs. 0.087), as shown in Table 3. Such an outcome lends support to our proposed definition of task difficulty and similarity derived from task-instance parameters.

Task Order     Average AUC      BWT              BWT_t              BWT_λ
Random         0.796 ± 0.013    0.053 ± 0.023    0.018 ± 0.010      0.008 ± 0.016
Easy to Hard   0.744 ± 0.009    0.087 ± 0.011    0.038 ± 0.021      0.076 ± 0.037
Hard to Easy   0.783 ± 0.022    0.058 ± 0.016    (0.013) ± 0.013    (0.003) ± 0.014
Table 3: Performance of CLOPS in the Class-IL scenario with different task orderings. Storage and acquisition fractions are and , respectively. Results are shown across five seeds.

7 Discussion and Future Work

In this paper, we introduced CLOPS, a healthcare-specific replay-based method to mitigate destructive interference during continual learning. CLOPS consists of an importance-guided buffer-storage and active-learning inspired buffer-acquisition mechanism. We showed that our approach can outperform multi-task training and a naive fine-tuning strategy when evaluated based on forward and backward weight transfer. Furthermore, our proposed task-instance parameters can assist with network interpretability, be used to quantify task difficulty and cross-task similarity, and guide training procedures positively. We now elucidate promising future paths.

Extensions to Task Similarity. The notion of task similarity was initially explored by Thrun and O’Sullivan (1996); Silver and Mercer (1996). In this work we briefly proposed a novel definition of task similarity and used it to guide the order of presentation of tasks. Exploring more robust definitions, validating them through domain knowledge, and exploiting them to further improve generalization is an exciting extension.

Dynamic Storage and Acquisition Fraction. In this work, the fraction of data stored into and sampled from the buffer was fixed throughout the lifetime of the learner. An interesting line of research could focus on identifying an optimal strategy that dynamically changes those fractions during training. Such decisions can be based on task similarity and the relative amount of data in each task.

Predicting Destructive Interference. This work, and most research in CL, deals reactively with the notion of destructive interference. We observe this phenomenon and attempt to design approaches to alleviate it. An impactful research question can focus on predicting the degree of forgetting that may occur if one were to sequentially train on a particular dataset. With such knowledge, one can deploy tools to alleviate the problem proactively.

8 Broader Impact

The exploration and design of continual learning algorithms in the context of healthcare is more critical than ever. Nowadays, common clinical scenarios involve heterogeneous data streaming from a multitude of physiological sensors over time. Exploiting this data reliably without forgetting what was previously learned in a situation where privacy is of utmost importance will lead to more trustworthy algorithms. This, in turn, should help increase the adoption rate of clinical decision support systems by medical practitioners.

Although our work exploits several publicly available datasets and a diverse set of continual learning formulations, it still suffers from various limitations that may have dire consequences. Firstly, it explores neither destructive interference nor its alleviation for tasks beyond cardiac arrhythmia classification. Even in our existing formulations, destructive interference is not completely eradicated. As a result, the algorithm may generate incorrect predictions for patients seen in the past, thus negatively affecting clinical decision making and potentially patient outcomes.

As the US Food and Drug Administration begins to explore best practices for modifying algorithms (Feng et al., 2019), we hope that our work will encourage researchers and policymakers to more seriously consider the role of continual learning in healthcare.


  • R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, and L. Page-Caccia (2019a) Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, pp. 11849–11860. Cited by: §1, §2.
  • R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio (2019b) Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, pp. 11816–11825. Cited by: §1, §2.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §6.6.
  • R. A. Caruana (1993) Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, Cited by: §5.4.
  • A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2018) Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420. Cited by: §2.
  • Y. Chen, R. J. Carroll, E. R. M. Hinz, A. Shah, A. E. Eyler, J. C. Denny, and H. Xu (2013) Applying active learning to high-throughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association 20 (e2), pp. e253–e259. Cited by: §2.
  • D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1996) Active learning with statistical models. Journal of Artificial Intelligence Research 4, pp. 129–145. Cited by: §2.
  • S. Farquhar and Y. Gal (2018) Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733. Cited by: §1.
  • J. Feng, S. Emerson, and N. Simon (2019) Approval policies for modifications to machine learning-based software as a medical device: a study of bio-creep. arXiv preprint arXiv:1912.12413. Cited by: §8.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §4.2.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183–1192. Cited by: §2.
  • W. Gong, S. Tschiatschek, R. Turner, S. Nowozin, and J. M. Hernández-Lobato (2019) Icebreaker: element-wise active information acquisition with bayesian deep latent gaussian model. arXiv preprint arXiv:1908.04537. Cited by: §2.
  • A. Y. Hannun, P. Rajpurkar, M. Haghpanahi, G. H. Tison, C. Bourn, M. P. Turakhia, and A. Y. Ng (2019) Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine 25 (1), pp. 65. Cited by: §5.1.
  • N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §2, §4.2.
  • D. Isele and A. Cosgun (2018) Selective experience replay for lifelong learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035. Cited by: §2.
  • A. Kirsch, J. van Amersfoort, and Y. Gal (2019) BatchBALD: efficient and diverse batch acquisition for deep bayesian active learning. arXiv preprint arXiv:1906.08158. Cited by: §2.
  • M. Lenga, H. Schulz, and A. Saalbach (2020) Continual learning for domain adaptation in chest x-ray classification. arXiv preprint arXiv:2001.05922. Cited by: §2.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §2.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §1, §2, §5.5.
  • C. V. Nguyen, A. Achille, M. Lam, T. Hassner, V. Mahadevan, and S. Soatto (2019) Toward understanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091. Cited by: §6.6.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §1, §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §5.1.
  • J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (2009) Dataset shift in machine learning. The MIT Press. Cited by: §1.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §2.
  • D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019) Experience replay for continual learning. In Advances in Neural Information Processing Systems, pp. 348–358. Cited by: §2.
  • M. Rostami, D. Isele, and E. Eaton (2020) Using task descriptions in lifelong machine learning for improved performance and zero-shot transfer. Journal of Artificial Intelligence Research 67, pp. 673–704. Cited by: §6.6.
  • S. Saxena, O. Tuzel, and D. DeCoste (2019) Data parameters: a new family of parameters for learning a differentiable curriculum. In Advances in Neural Information Processing Systems, pp. 11093–11103. Cited by: §4.1.
  • B. Settles (2009) Active learning literature survey. Technical report University of Wisconsin-Madison, Department of Computer Sciences. Cited by: §2, §6.5.
  • D. L. Silver and R. E. Mercer (1996) The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. In Learning to learn, pp. 213–233. Cited by: §6.6, §7.
  • A. Smailagic, P. Costa, A. Gaudio, K. Khandelwal, M. Mirshekari, J. Fagert, D. Walawalkar, S. Xu, A. Galdran, P. Zhang, et al. (2019) O-medal: online active deep learning for medical image analysis. arXiv preprint arXiv:1908.10508. Cited by: §2.
  • A. Smailagic, P. Costa, H. Y. Noh, D. Walawalkar, K. Khandelwal, A. Galdran, M. Mirshekari, J. Fagert, S. Xu, P. Zhang, et al. (2018) MedAL: accurate and robust deep active learning for medical image analysis. In IEEE International Conference on Machine Learning and Applications, pp. 481–488. Cited by: §2.
  • D. Spiegelhalter (2020) Should we trust algorithms? Harvard Data Science Review 2 (1). Cited by: §1.
  • S. Thrun and J. O’Sullivan (1996) Discovering structure in multiple learning tasks: the tc algorithm. In ICML, Vol. 96, pp. 489–497. Cited by: §7.
  • S. Tong and D. Koller (2001) Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2 (Nov), pp. 45–66. Cited by: §2.
  • G. M. van de Ven and A. S. Tolias (2019) Three scenarios for continual learning. arXiv preprint arXiv:1904.07734. Cited by: §2, §3.1.
  • G. Wang, C. Zhang, Y. Liu, H. Yang, D. Fu, H. Wang, and P. Zhang (2019) A global and updatable ECG beat classification system based on recurrent neural networks and active learning. Information Sciences 501, pp. 523–542. Cited by: §2.
  • J. Zheng, J. Zhang, S. Danioko, H. Yao, H. Guo, and C. Rakovski (2020) A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients. Scientific Data 7 (1), pp. 1–8. Cited by: §5.1.
  • Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang (2017) Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7340–7351. Cited by: §2.