PyTorch implementation of various methods for continual learning (XdG, EWC, online EWC, SI, LwF, DGR, DGR+distill, RtF, iCaRL).
Standard artificial neural networks suffer from the well-known issue of catastrophic forgetting, making continual or lifelong learning problematic. Recently, numerous methods have been proposed for continual learning, but due to differences in evaluation protocols it is difficult to directly compare their performance. To enable more meaningful comparisons, we identified three distinct continual learning scenarios based on whether task identity is known and, if it is not, whether it needs to be inferred. Performing the split and permuted MNIST task protocols according to each of these scenarios, we found that regularization-based approaches (e.g., elastic weight consolidation) failed when task identity needed to be inferred. In contrast, generative replay combined with distillation (i.e., using class probabilities as "soft targets") achieved superior performance in all three scenarios. In addition, we reduced the computational cost of generative replay by integrating the generative model into the main model by equipping it with generative feedback connections. This Replay-through-Feedback approach substantially shortened training time with no or negligible loss in performance. We believe this to be an important first step towards making the powerful technique of generative replay scalable to real-world continual learning applications.
Current state-of-the-art deep neural networks can be trained to impressive performance on a wide variety of individual tasks. Learning multiple tasks in sequence, however, remains a substantial challenge for deep learning. When trained on a new task, standard neural networks forget most information related to previously learned tasks, a phenomenon referred to as “catastrophic forgetting”.
In recent years, numerous methods for alleviating catastrophic forgetting have been proposed. However, due to the wide variety of experimental protocols used to evaluate them, many of these methods claim “state-of-the-art” performance [e.g., 1, 2, 3, 4, 5, 6]. To obscure things further, some methods shown to perform well in some experimental settings are reported to fail dramatically in others (e.g., the reported performance of elastic weight consolidation differs markedly between studies).
To enable a fairer and more structured comparison of methods for reducing catastrophic forgetting, as a first contribution this paper identifies three distinct continual learning scenarios with increasing level of difficulty. These scenarios are distinguished by whether at test time task identity is provided and, if it is not, whether task identity needs to be inferred (see section 2). We show that such differences in experimental design can explain seemingly contradictory results reported in the recent literature: even for experimental protocols involving the relatively simple classification of MNIST-digits, methods that perform well in one continual learning scenario can completely fail in another.
Using these three scenarios, a second contribution of this paper is to provide an extensive comparison of recently proposed methods. The methods to be compared are discussed in section 3, the implementation of our experiments in section 4 and the results in section 5. These experiments reveal that generative replay, especially when combined with distillation techniques, has the capability to perform well on all three scenarios. An important disadvantage of this approach, however, is that it can be computationally very costly.
As a third contribution, in section 6 we propose a way to reduce these computational costs. Current approaches using generative replay train two separate models: a main model for solving the tasks and a generative model for sampling examples representative of previous tasks. We merge the generative model into the main model by equipping it with feedback connections that are trained to have generative capability. We demonstrate that this substantially reduces training time, with no or negligible loss in performance.
We consider the continual learning problem in which a single model needs to sequentially learn a series of tasks, whereby it is not allowed to store raw data. This continual learning framework has been actively studied in recent years: many methods for alleviating catastrophic forgetting are being proposed, with almost as many different experimental protocols being used for their evaluation. We found that an important difference between these experimental protocols is whether at test time information about the task identity is available and, if it is not, whether the model is required to identify the task it has to solve. Yet, this crucial experimental design consideration is not always clearly stated and differences in this regard are sometimes not appreciated. For example, a substantial improvement over the state of the art has been reported for a method that assumes task identity is always available, while the compared methods operate without this assumption. To enable more meaningful comparisons, we identify three distinct scenarios for continual learning of increasing difficulty (see Table 1).
Table 1: The three continual learning scenarios.

| Scenario | Required at test time |
|---|---|
| Incremental task learning | Solve tasks, task identity provided |
| Incremental domain learning | Solve tasks, task identity not provided |
| Incremental class learning | Solve tasks and infer task identity |
In the first scenario, models are always informed about which task needs to be performed. This is the easiest continual learning scenario, and we refer to it as incremental task learning. Since task identity is always provided, it is possible to train models with task-specific components. A typical neural network architecture used in this scenario has a “multihead” output-layer, meaning that each task has its own output units but the rest of the network is (potentially) shared between tasks.
In the second scenario, which we refer to as incremental domain learning, task identity is not available at test time. Models, however, only need to solve the task at hand; they are not required to infer which task it is. Typical examples of this scenario are protocols whereby the structure of the tasks is always the same, but the input-distribution is changing. A classical example of such a task protocol is ‘permuted MNIST’, in which all tasks involve classifying MNIST-digits but with a different permutation applied to the pixels for each new task (see Figure 2). Although permuted MNIST is most naturally performed according to the incremental domain learning scenario, it can be performed according to the other scenarios too (see Table 3).
Finally, in the third scenario, models need to be able both to solve each task seen so far and to infer which task they are presented with. We refer to this last scenario as incremental class learning, as it includes protocols in which a model incrementally needs to learn to recognize new classes. An example of a task protocol most naturally performed under this scenario is sequentially learning to classify MNIST-digits (‘split MNIST’; see Figure 1), although this protocol has also been performed according to the other two scenarios (see Table 2).
A simple and intuitive explanation for catastrophic forgetting is that after a neural network is trained on a new task, its parameters are now optimized for the new task and no longer for the previous one(s). This formulation highlights two strategies for alleviating catastrophic forgetting: (1) not freely optimizing the entire network on each task, and (2) modifying the training data to make it more representative for previous tasks.
A straightforward way of not optimizing the full network on every task is to explicitly define a different sub-network for each task to be learned. A variety of recent papers have utilized this strategy, with different approaches as to how the parts of the network for each task are selected. A simple approach is to randomly and a priori assign which nodes will participate in each task (Context-dependent Gating [XdG; 4]). Other approaches use evolutionary algorithms or gradient descent to learn which sets of units to employ for each task. By design, however, these approaches are limited to the incremental task learning scenario, as they require knowledge of task identity to select the correct task-specific components.
A modification to make this strategy applicable in the other scenarios is to preferentially train a different part of the network for each task, but to always use the entire network for execution. One way to do this is by differently regularizing the network’s parameters during training on each new task, which is the approach of Elastic Weight Consolidation [EWC; 1] and Synaptic Intelligence [SI; 7]. Both methods estimate for all parameters of the network how important they are for the previously learned tasks and penalize future changes to them accordingly (i.e., learning is slowed down for parts of the network important for previous tasks).
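The shared idea behind these regularization-based methods can be illustrated with a minimal sketch. It assumes a quadratic penalty on each parameter's deviation from a stored value, weighted by a per-parameter importance estimate; how that importance is estimated differs between EWC and SI and is not shown. The function names and the omission of constant scaling factors are choices made for this illustration, not taken from the paper's code.

```python
def quadratic_penalty(params, anchor_params, importance):
    """Importance-weighted squared distance to the stored parameter values.

    params:        current parameter values (flat list of floats)
    anchor_params: values stored after training on the previous task(s)
    importance:    non-negative importance estimate per parameter
    """
    return sum(
        w * (p - p_old) ** 2
        for p, p_old, w in zip(params, anchor_params, importance)
    )


def regularized_loss(current_loss, penalty, strength):
    """Loss on the current task plus the penalty, scaled by a hyperparameter."""
    return current_loss + strength * penalty
```

A parameter with high importance is expensive to move, so gradient descent on the regularized loss changes it less than an unimportant one.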
A second strategy is to complement the training data for each new task to be learned with “pseudo-data” representative of the previous tasks. We refer to this strategy as replay. An early implementation of this strategy, called pseudo-rehearsal, generated completely random inputs as pseudo-data and labeled them based on the predictions of a copy of the model stored after finishing training on the previous task. This approach had some success with very simple, artificial inputs, but does not work with more complicated inputs.
An alternative is to take the input data of the current task, label them using the model trained on the previous tasks, and use the resulting input-target pairs as pseudo-data. This is the approach of Learning without Forgetting [LwF; 15]. Another important aspect of this method is that instead of labeling the inputs to be replayed as the most likely category according to the previous tasks’ model (i.e., “hard targets”), it pairs them with the probabilities that this model predicts for all target classes (i.e., “soft targets”). The objective for the replayed data is then to match the probabilities predicted by the model being trained to these target probabilities. The approach of matching predicted (and typically temperature-raised) probabilities of one network to those of another network had previously been used to compress (or “distill”) information from one (large) network to another (smaller) network.
Another option is to generate the input data to be replayed. For this, besides the main model for task performance (e.g., classification), a separate generative model is sequentially trained on all tasks to generate samples from their input data distributions. For the first application of this approach, which was called Deep Generative Replay [DGR; 17], the generated input samples were paired with “hard targets” provided by the main model. We note that it is possible to combine LwF and DGR by replaying input samples from a generative model and pairing them with soft targets [see also 6, 18]. We include this hybrid method in our comparison under the name DGR+distill.
A final option is to store examples from previous tasks and replay those. Such “exact replay” can substantially boost performance [2, 3, 5], but due to for example privacy concerns or memory constraints, it is not always possible to do so. In this paper we restrict ourselves to the case where storing raw data is not allowed (but see Appendix C for further discussion).
We used two different task protocols to compare the performance of the approaches discussed above. To illustrate the importance of whether task identity is provided or needs to be inferred, both task protocols were performed according to all three continual learning scenarios defined in section 2.
Table 2: Split MNIST according to each scenario.

| Scenario | Question at test time |
|---|---|
| Incremental task learning | With task given, is it the first or second class? (e.g., ‘0’ or ‘1’) |
| Incremental domain learning | With task unknown, is it a first or second class? (e.g., in [‘0’,‘2’,‘4’,‘6’,‘8’] or in [‘1’,‘3’,‘5’,‘7’,‘9’]) |
| Incremental class learning | With task unknown, which digit is it? (i.e., choice from ‘0’ to ‘9’) |
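To make the difference between the three scenarios concrete, the sketch below maps a digit to the target a model has to predict under each of them. It assumes the five split MNIST tasks are the consecutive digit pairs (0,1), (2,3), ..., (8,9), which matches the examples in the table above; the function name and the scenario strings are hypothetical.

```python
def split_mnist_target(digit, scenario):
    """Return (task_id, target) for a digit under a given scenario.

    Assumes task t contains the digits 2*t and 2*t + 1.
    task_id is None whenever task identity is not provided to the model.
    """
    if not 0 <= digit <= 9:
        raise ValueError("digit must be in 0..9")
    task_id = digit // 2      # which of the five tasks the digit belongs to
    within_task = digit % 2   # first (0) or second (1) class of that task
    if scenario == "task":
        # Task identity is given; only the within-task class must be predicted.
        return task_id, within_task
    if scenario == "domain":
        # Same two-way decision, but without being told which task it is.
        return None, within_task
    if scenario == "class":
        # The digit itself must be predicted, which implies inferring the task.
        return None, digit
    raise ValueError(f"unknown scenario: {scenario}")
```

For the digit 7, for example, the task scenario asks for class 1 of task 3, the domain scenario asks for class 1 without naming the task, and the class scenario asks for 7 out of ten options.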
The first task protocol was split MNIST (see Figure 1). For this protocol, the original MNIST-dataset was split into five tasks, where each task was a two-digit classification. The original 28x28 pixel grey-scale images were used without pre-processing. The standard training/test-split was used, resulting in 60,000 training images (6000 per digit) and 10,000 test images (1000 per digit).
The second task protocol was permuted MNIST (see Figure 2). The tasks of this protocol were classifying MNIST-digits (every task now had all ten digits), whereby in each task the pixels of the MNIST-images were permuted in a different way. We used a sequence of ten such tasks. To generate the permuted images, the original images were first zero-padded to 32x32 pixels. For each task, a random permutation was then generated and applied to these 1024 pixels. No other pre-processing was performed. Again the standard training/test-split was used, resulting in 60,000 training and 10,000 test images per task.
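The construction of the permuted inputs can be sketched in a few lines. The sketch assumes the zero-padding is symmetric (two pixels on each side), which the text does not specify, and it uses a per-task seed purely for reproducibility; all function names are hypothetical.

```python
import random


def pad_to_32x32(image):
    """Zero-pad a 28x28 image (list of 28 rows) to 32x32.

    Assumption: the padding is symmetric, 2 pixels on every side.
    """
    padded = [[0] * 32 for _ in range(32)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            padded[r + 2][c + 2] = value
    return padded


def make_permutation(seed, n_pixels=32 * 32):
    """Draw one fixed random permutation of the pixel indices for a task."""
    rng = random.Random(seed)
    perm = list(range(n_pixels))
    rng.shuffle(perm)
    return perm


def permute_image(image, perm):
    """Pad the image, flatten it, and reorder its 1024 pixels."""
    flat = [v for row in pad_to_32x32(image) for v in row]
    return [flat[i] for i in perm]


# One permutation per task, reused for every image of that task.
task_permutations = [make_permutation(seed=t) for t in range(10)]
```

Because each task reuses a single permutation for all of its images, the classification problem stays the same across tasks while the input distribution changes.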
Table 3: Permuted MNIST according to each scenario.

| Scenario | Question at test time |
|---|---|
| Incremental task learning | Given permutation X was applied, which digit is it? |
| Incremental domain learning | With permutation unknown, which digit is it? |
| Incremental class learning | Which digit is it and which permutation was applied? |
For a fair comparison, the same neural network architecture was used for all methods. For the split MNIST experiments, this was a multi-layer perceptron with 2 hidden layers of 400 nodes each, followed by a softmax output layer. ReLU non-linearities were used in all hidden layers. For the permuted MNIST experiments each hidden layer consisted of 1000 nodes.
All methods used the standard cross entropy classification loss for the model’s predictions on the current task ($\mathcal{L}_{\text{current}}$; see Appendix A.1.1). The regularization-based methods (i.e., EWC, online EWC and SI) added a regularization term to this loss, with regularization strength controlled by a hyperparameter: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{current}} + \lambda \mathcal{L}_{\text{regularization}}$. The value of this hyperparameter was set by a grid search, even though it could be argued that this is problematic in the context of continual learning as it violates the principle that tasks, including their validation set, should be visited in sequence and only once (see Appendix B). The replay-based methods (i.e., LwF, DGR and DGR+distill) instead added a loss-term for the replayed data. In this case a hyperparameter could be avoided, as the loss for the current and replayed data could be weighted according to how many tasks the model has been trained on so far: $\mathcal{L}_{\text{total}} = \frac{1}{N_{\text{tasks so far}}} \mathcal{L}_{\text{current}} + \left(1 - \frac{1}{N_{\text{tasks so far}}}\right) \mathcal{L}_{\text{replay}}$.
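The hyperparameter-free weighting of current and replayed data can be illustrated as follows. The sketch assumes the current-task loss receives weight 1/N and the replay loss weight 1 - 1/N, with N the number of tasks trained on so far (including the current one); the function name is hypothetical.

```python
def combine_current_and_replay(loss_current, loss_replay, n_tasks_so_far):
    """Weight the two loss terms by the number of tasks seen so far.

    Assumption: the current task gets weight 1/N and replay gets 1 - 1/N,
    so every task seen so far contributes roughly equally.
    """
    if n_tasks_so_far < 1:
        raise ValueError("n_tasks_so_far must be at least 1")
    if n_tasks_so_far == 1:
        # Nothing to replay yet on the first task.
        return loss_current
    weight_current = 1.0 / n_tasks_so_far
    return weight_current * loss_current + (1.0 - weight_current) * loss_replay
```

On the fourth task, for instance, the current data carries a quarter of the total weight and the replayed data, standing in for three earlier tasks, carries the rest.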
We compared the following approaches:
None: The model was sequentially trained on all tasks in the standard way. This is also called fine-tuning, and can be seen as a lower bound.
XdG: Following the original XdG approach [4], for each task a random subset of $X\%$ of the units in each hidden layer was fully gated (i.e., their activations set to zero), with $X$ a hyperparameter whose value was set by a grid search (see Appendix B). As this method requires availability of task identity at test time, it could only be used in the incremental task learning scenario.
DGR: A separate generative model was trained to generate the images to be replayed. Following the original DGR approach [17], the replayed images were labeled with the most likely category predicted by a copy of the main model stored after training on the previous task (i.e., hard targets).
DGR+distill: A separate generative model was trained to generate the images to be replayed, but these were then paired with soft targets (as in LwF) instead of hard targets (as in DGR).
Offline: The model was always trained using the data of all tasks so far. This is also called joint training, and was included as it can be seen as an upper bound.
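The gating used by XdG, as described in the list above, can be sketched as a set of fixed binary masks, one per task and hidden layer. The gated fraction used in the usage note (0.8) is only a placeholder, since the actual value was set by a grid search; the function names and the seed argument are hypothetical.

```python
import random


def make_gating_masks(n_tasks, layer_sizes, gated_fraction, seed=0):
    """Draw, once and a priori, which hidden units are gated for each task.

    Returns masks[task][layer], a list with 0.0 for gated units and 1.0
    for units that take part in that task.
    """
    rng = random.Random(seed)
    masks = []
    for _ in range(n_tasks):
        task_masks = []
        for size in layer_sizes:
            n_gated = round(gated_fraction * size)
            gated = set(rng.sample(range(size), n_gated))
            task_masks.append([0.0 if i in gated else 1.0 for i in range(size)])
        masks.append(task_masks)
    return masks


def apply_gate(activations, mask):
    """Set the activations of gated units to zero."""
    return [a * m for a, m in zip(activations, mask)]
```

Calling `make_gating_masks(5, [400, 400], 0.8)` would give five tasks their own random sub-network of 80 active units per hidden layer. Selecting the right mask requires knowing the task, which is why this method only applies in the incremental task learning scenario.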
For the split MNIST protocol, all models were trained for 2000 iterations per task using the ADAM-optimizer with learning rate 0.001. The same optimizer was used for the permuted MNIST protocol, but with 5000 iterations and learning rate 0.0001. For each iteration, the loss on the current task (and, where applicable, the regularization term) was calculated as the average over 128 examples from the current task and, if replay was used, an additional 128 replayed examples (equally divided over all previous tasks) were used to calculate the replay loss. Importantly, since the total number of replayed examples does not depend on the number of previous tasks, for our implementation of the replay-based methods the computational costs per task do not need to increase with the number of tasks so far.
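A minimal sketch of how a fixed replay budget can be divided over previous tasks is given below. The text only states that the 128 replayed examples are equally divided; how a remainder is handled when 128 is not divisible by the number of previous tasks is an assumption of this sketch, as is the function name.

```python
def replay_counts(batch_size, n_previous_tasks):
    """Split a fixed number of replayed examples over the previous tasks.

    Assumption: when the batch does not divide evenly, the first few
    tasks each receive one extra example.
    """
    if n_previous_tasks < 1:
        return []
    base, remainder = divmod(batch_size, n_previous_tasks)
    return [base + (1 if t < remainder else 0) for t in range(n_previous_tasks)]
```

The total stays at the batch size regardless of how many tasks have been seen, which is what keeps the per-iteration cost of replay constant.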
For DGR and DGR+distill, a separate generative model was sequentially trained on all tasks. A symmetric variational autoencoder [VAE; 22] was used as generative model, with 2 fully connected hidden layers of 400 (split MNIST) or 1000 (permuted MNIST) units and a stochastic latent variable layer of size 100. A standard normal distribution was used as prior. See Appendix A.1.3 for more details. Training of the generative model was also done with generative replay (provided by its own copy stored after finishing training on the previous task) and with the same hyperparameters (i.e., learning rate, optimizer, iterations, batch sizes) as for the main model.
Table 4: Performance (%) on the split MNIST task protocol for each scenario.

| Method | Incremental task learning | Incremental domain learning | Incremental class learning |
|---|---|---|---|
| None – lower bound | 85.15 (± 1.00) | 57.33 (± 1.66) | 19.90 (± 0.02) |
| XdG | 98.74 (± 0.31) | - | - |
| EWC | 85.48 (± 1.20) | 57.80 (± 1.61) | 19.90 (± 0.02) |
| Online EWC | 85.22 (± 1.06) | 57.60 (± 1.66) | 19.90 (± 0.02) |
| SI | 99.14 (± 0.11) | 63.77 (± 1.18) | 20.04 (± 0.08) |
| LwF | 99.60 (± 0.03) | 71.02 (± 1.26) | 24.17 (± 0.51) |
| DGR | 99.47 (± 0.03) | 95.74 (± 0.23) | 91.24 (± 0.33) |
| DGR+distill | 99.59 (± 0.03) | 96.94 (± 0.14) | 91.84 (± 0.27) |
| RtF | 99.66 (± 0.03) | 97.31 (± 0.11) | 92.56 (± 0.21) |
| Offline – upper bound | 99.64 (± 0.03) | 98.41 (± 0.06) | 97.93 (± 0.04) |
Table 5: Performance (%) on the permuted MNIST task protocol for each scenario.

| Method | Incremental task learning | Incremental domain learning | Incremental class learning |
|---|---|---|---|
| None – lower bound | 81.79 (± 0.48) | 78.51 (± 0.24) | 17.26 (± 0.19) |
| XdG | 91.40 (± 0.23) | - | - |
| EWC | 94.74 (± 0.05) | 94.31 (± 0.11) | 25.04 (± 0.50) |
| Online EWC | 95.96 (± 0.06) | 94.42 (± 0.13) | 33.88 (± 0.49) |
| SI | 94.75 (± 0.14) | 95.33 (± 0.11) | 29.31 (± 0.62) |
| LwF | 69.84 (± 0.46) | 72.64 (± 0.52) | 22.64 (± 0.23) |
| DGR | 92.52 (± 0.08) | 95.09 (± 0.04) | 92.19 (± 0.09) |
| DGR+distill | 97.51 (± 0.01) | 97.35 (± 0.02) | 96.38 (± 0.03) |
| RtF | 97.31 (± 0.01) | 97.06 (± 0.02) | 96.23 (± 0.04) |
| Offline – upper bound | 97.68 (± 0.01) | 97.59 (± 0.01) | 97.59 (± 0.02) |
For the split MNIST task protocol, we found a clear difference in difficulty between the three continual learning scenarios (see Table 4). Perhaps surprisingly, for all three scenarios with the split MNIST protocol, EWC and online EWC barely outperformed fine-tuning. SI performed better: it reduced catastrophic forgetting in the incremental task learning and incremental domain learning scenarios, but it also failed in the incremental class learning scenario. Strikingly, replaying images from the current task (LwF; e.g., replaying ‘2’s and ‘3’s in order not to forget how to recognize ‘0’s and ‘1’s) prevented the forgetting of previous tasks better than SI. Importantly, only the methods using generative replay retained good performance (above 90%) in the incremental class learning scenario, and DGR+distill outperformed DGR in all scenarios.
For the permuted MNIST protocol (see Table 5), there was less difference between EWC, online EWC and SI: they performed reasonably well in the incremental task learning and incremental domain learning scenarios, but failed again in the incremental class learning scenario. While LwF had some success with the split MNIST protocol, this method did not work with the permuted MNIST protocol. The methods using generative replay were again the only ones successful in the incremental class learning scenario, and DGR+distill again always outperformed DGR.
Finally, although in the incremental task learning scenario XdG succeeded in reducing catastrophic forgetting on both task protocols, on both it was outperformed by SI (and thus by DGR+distill).
Generative replay with distillation consistently outperformed the competing methods and even obtained excellent results in the challenging incremental class learning scenario. However, an important disadvantage of generative replay is that it is usually computationally expensive, among other reasons because a separate generative model is trained. Indeed, in our experiments the training time for DGR and DGR+distill was roughly twice as long as for SI (see below). To reduce the computational cost of generative replay, we propose to integrate the generative model into the main model by equipping it with generative feedback connections.
To enable the main model to generate replay itself, we add (1) feedback connections that are trained to reconstruct inputs from their hidden representations and (2) a layer of stochastic latent variables that are trained to follow a known distribution from which it is easy to sample. In case of classification, the resulting network is for example a symmetrical VAE with an additional softmax classification layer from the final hidden layer of the encoder (see Figure 3). Besides removing the need for a separate generative model, it is possible that regularization provided by the added generative objective helps to train a more robust classifier [23, 24].
The loss function for the data of the current task now has two terms: $\mathcal{L}_{\text{current}} = \mathcal{L}^{C} + \mathcal{L}^{G}$, whereby $\mathcal{L}^{C}$ is the standard cross-entropy classification loss and $\mathcal{L}^{G}$ is the VAE loss (see Appendix A.1.3). On later tasks, the training data of the current task is supplemented with replayed data. For the replayed data, as for LwF and DGR+distill, the classification term is replaced by a distillation term: $\mathcal{L}_{\text{replay}} = \mathcal{L}^{D} + \mathcal{L}^{G}$. The loss terms for the current and replayed data are again weighted according to how many tasks the model has seen so far.
For a fair comparison with the methods in section 4.2, the model we used for RtF also had 2 fully connected hidden layers with 400 (split MNIST) or 1000 (permuted MNIST) units. Similar to the VAE used for DGR and DGR+distill, the stochastic latent variable layer was of size 100 with a standard normal prior. Also the same hyperparameters (i.e., learning rate, optimizer, iterations, batch sizes) were used for training.
We found that RtF slightly outperformed DGR+distill on all experiments with the split MNIST task protocol (see Table 4), while it performed slightly worse than DGR+distill on the experiments with the permuted MNIST task protocol (see Table 5). These differences were relatively small and, similar to DGR+distill, RtF comfortably outperformed all other tested methods. To assess the extent to which RtF succeeded in reducing the computational cost of generative replay, and to compare the resulting cost with that of the other methods, in Figures 4 and 5 we plotted for each method its performance against total training time on an NVIDIA GeForce GTX 1080 GPU. As expected, the training time was always longest for DGR and DGR+distill, and this time was substantially reduced (for most experiments almost halved) by RtF.
Catastrophic forgetting is a major obstacle to the development of artificial intelligence applications capable of true lifelong learning [25, 26], and enabling neural networks to sequentially learn multiple tasks has become a topic of intense research. Despite its scope, this research field lacks common benchmarks—even though the same datasets tend to be used, which makes direct comparisons between published methods difficult. We found that an important difference between currently used experimental protocols is in whether task identity is provided and—if it is not—in whether it needs to be inferred. For each of the resulting three scenarios, we performed a comprehensive comparison of recently proposed methods. An important conclusion is that for the incremental class learning scenario (i.e., when task identity needs to be inferred), only replay-based methods are capable of producing acceptable results. In this scenario, even for relatively simple task protocols involving the classification of MNIST-digits, regularization-based methods such as EWC and SI completely failed. Moreover, also in the other scenarios, generative replay combined with distillation consistently outperformed all other tested methods. These results establish generative replay as a promising general strategy for lifelong learning.
However, an important limitation of the current study is that generating MNIST-digits is relatively easy. We leave it for future work to empirically address whether generative replay can scale to task protocols with more complicated inputs, but here we highlight several reasons why we believe this will be the case. First, with the permuted MNIST protocol, we observed that even when the quality of the replayed samples had substantially declined (see Figure 6), they still helped to prevent catastrophic forgetting. Second, under some conditions (e.g., with the split MNIST protocol), replaying inputs from the current task (i.e., LwF) works reasonably well, further indicating that the replayed samples need not be perfect and that “good enough” can suffice. We hypothesize that the use of distillation is especially important to make generative replay more robust to the quality of the replayed inputs. Finally, of course, the capabilities of generative models are improving at a rapid pace [e.g., 27, 28, 29].
This last point however also warrants caution. Although the latest developments in for example generative adversarial networks, auto-regressive decoders or flow-based models enable training high quality generative models for increasingly complicated input distributions, this can come at high computational costs. Especially in a lifelong learning setting, where models continually need to be trained on new tasks and where training sometimes has to be in real-time, efficiency is important. We therefore emphasize that continual learning methods should not only be evaluated in terms of their performance, but also in terms of for example their training time (e.g., Figures 4 and 5). Here, we improved the efficiency of generative replay by merging the generator into the main model. We also want to highlight that in our implementation of replay the number of replayed examples did not increase with number of tasks. We hypothesize that a relatively small number of examples per task can be acceptable because information on the previous tasks is also contained in the initiation bias (i.e., training on each new task starts with a network that is already optimized for the previous tasks).
To conclude, we believe that generative replay brings more to the table than simply “shift[ing] the catastrophic forgetting problem to the training of the generative model” [19; p.3], and we envision that a small amount of good enough replay generated by the model’s own feedback connections could become a valuable tool for real-world continual learning applications.
We thank Mengye Ren and Zhe Li for comments on earlier versions. This research project has been supported by an IBRO-ISN Research Fellowship, by the Lifelong Learning Machines (L2M) program of the Defence Advanced Research Projects Agency (DARPA) via contract number HR0011-18-2-0025 and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoI/IBC, or the U.S. Government.
The standard per-sample cross entropy loss function for an input $x$ labeled with a hard target $y$ is given by:

$$\mathcal{L}^{C}(x, y; \theta) = -\log p_{\theta}(Y = y \mid x)$$

where $p_{\theta}$ is the conditional probability distribution defined by the neural network whose trainable bias and weight parameters are collected in $\theta$. An important note is that in this paper this probability distribution is not always defined over all output nodes of the network, but only over the “active nodes”. This means that the normalization performed by the final softmax layer only takes into account these active nodes, and that learning is thus restricted to those nodes. For experiments performed according to the incremental task learning scenario, for which we use a “multihead” softmax layer, always only the nodes of the task under consideration are active. Typically this is the current task, but for replayed data it is the task that is (intended to be) replayed. For the incremental domain learning scenario always all nodes are active. For the incremental class learning scenario, the nodes of all tasks seen so far are active, both when training on current and on replayed data.
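The notion of active nodes can be illustrated with a softmax that normalizes only over a chosen subset of output units. The sketch assumes that output units are ordered by task, with each task owning a contiguous block of units; the function names and argument layout are hypothetical.

```python
import math


def active_nodes(scenario, n_outputs, classes_per_task,
                 task_id=None, n_tasks_so_far=None):
    """Indices of the output units that take part in the softmax.

    Assumption: units are ordered by task in contiguous blocks.
    """
    if scenario == "task":
        # Multihead: only the units of the task under consideration.
        start = task_id * classes_per_task
        return list(range(start, start + classes_per_task))
    if scenario == "domain":
        # All output units are always active.
        return list(range(n_outputs))
    if scenario == "class":
        # Units of all tasks seen so far.
        return list(range(n_tasks_so_far * classes_per_task))
    raise ValueError(f"unknown scenario: {scenario}")


def masked_softmax(logits, active):
    """Softmax normalized over the active units; inactive units get 0."""
    peak = max(logits[i] for i in active)
    exps = {i: math.exp(logits[i] - peak) for i in active}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def cross_entropy(logits, target, active):
    """Per-sample cross entropy for a hard target, restricted to active units."""
    return -math.log(masked_softmax(logits, active)[target])
```

Because inactive units receive zero probability and do not enter the normalization, they receive no gradient from this loss, which is what restricts learning to the active nodes.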
For the method DGR, there are also some subtle differences between the continual learning scenarios when generating hard targets for the inputs to be replayed. With the incremental task learning scenario, only the classes of the task that is intended to be replayed can be predicted (in each iteration the available replays are equally divided over the previous tasks). With the incremental domain learning scenario always all classes can be predicted. With the incremental class learning scenario only classes from up to the previous task can be predicted.
The methods LwF, DGR+distill and RtF use distillation loss for their replayed data. For this, each input $x$ to be replayed is labeled with a “soft target”, which is a vector containing a probability for each active class. This target probability vector is obtained using a copy of the main model stored after finishing training on the most recent task, and the training objective is to match the probabilities predicted by the model being trained to these target probabilities (by minimizing the cross entropy between them). Moreover, as is common for distillation, these two probability distributions that we want to match are made softer by temporarily raising the temperature $T$ of their models’ softmax layers. (The same temperature should be used for calculating the target probabilities and for calculating the probabilities to be matched during training, but during testing the temperature should be set back to 1.) A typical value for this temperature is 2, which is the value used in this paper.

This means that before the softmax normalization is performed on the logits, these logits are first divided by $T$. For an input $x$ to be replayed during training of task $K$, the soft targets are given by the vector $\tilde{y}$ whose $c^{\text{th}}$ element is given by:

$$\tilde{y}_c = p^{T}_{\hat{\theta}^{(K-1)}}(Y = c \mid x)$$

where $\hat{\theta}^{(K-1)}$ is the vector with parameter values at the end of training of task $K-1$ and $p^{T}_{\theta}$ is the conditional probability distribution defined by the neural network with parameters $\theta$ and with the temperature of its softmax layer raised to $T$. The distillation loss function for an input $x$ labeled with a soft target vector $\tilde{y}$ is then given by:

$$\mathcal{L}^{D}(x, \tilde{y}; \theta) = -T^{2} \sum_{c=1}^{N_{\text{classes}}} \tilde{y}_c \log p^{T}_{\theta}(Y = c \mid x)$$

where the scaling by $T^{2}$ is included to ensure that the relative contribution of this objective matches that of a comparable objective with hard targets.
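The distillation objective described above can be sketched as follows. The sketch assumes the scaling factor is the square of the temperature, as is conventional for distillation, and uses a default temperature of 2 as stated in the text; the function names are hypothetical, and "teacher" refers to the stored copy of the model while "student" refers to the model being trained.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits that are first divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross entropy between temperature-softened teacher and student outputs.

    Assumption: the loss is scaled by temperature squared.
    """
    soft_targets = softmax_with_temperature(teacher_logits, temperature)
    student_probs = softmax_with_temperature(student_logits, temperature)
    cross_entropy = -sum(
        t * math.log(p) for t, p in zip(soft_targets, student_probs)
    )
    return temperature ** 2 * cross_entropy
```

The same temperature is applied to both sets of logits during training; at test time the student would be evaluated with a temperature of 1.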
When generating soft targets for the inputs to be replayed, there are again subtle differences between the three continual learning scenarios. With the incremental task learning scenario, the soft target probability distribution is defined only over the classes of the task intended to be replayed. With the incremental domain learning scenario this distribution is always over all classes. With the incremental class learning scenario, the soft target probability distribution is first generated only over the classes from up to the previous task and then zero probabilities are added for all classes in the current task.
The separate generative model that is used for DGR and DGR+distill is a variational autoencoder (VAE; 22), of which both the encoder network and the decoder network are multi-layer perceptrons with 2 hidden layers containing 400 (split MNIST) or 1000 (permuted MNIST) units with ReLU non-linearity. The stochastic latent variable layer has 100 units and the prior over them is the standard normal distribution. Following , the “latent variable regularization term” of this VAE is given by:
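whereby $\mu_j^{(\boldsymbol{x})}$ and $\sigma_j^{(\boldsymbol{x})}$ are the $j^{\text{th}}$ elements of respectively $\boldsymbol{\mu}^{(\boldsymbol{x})}$ and $\boldsymbol{\sigma}^{(\boldsymbol{x})}$, which are the outputs of the encoder network given input $\boldsymbol{x}$. The output layer of the decoder network has a sigmoid non-linearity and the “reconstruction term” is given by the binary cross entropy between the original and decoded pixel values, summed over the $N_{\text{pixels}}$ pixels of the image:
$$\mathcal{L}^{\text{recon}}(\boldsymbol{x}) = \sum_{p=1}^{N_{\text{pixels}}} \left( -x_p \log \tilde{x}_p - (1 - x_p) \log (1 - \tilde{x}_p) \right) \tag{5}$$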
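whereby $x_p$ is the value of the $p^{\text{th}}$ pixel of the original input image $\boldsymbol{x}$ and $\tilde{x}_p$ is the value of the $p^{\text{th}}$ pixel of the decoded image $\tilde{\boldsymbol{x}}$, obtained by passing $\boldsymbol{z}^{(\boldsymbol{x})} = \boldsymbol{\mu}^{(\boldsymbol{x})} + \boldsymbol{\sigma}^{(\boldsymbol{x})} \odot \boldsymbol{\epsilon}$ through the decoder network, whereby $\boldsymbol{\epsilon}$ is sampled from $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$. The per-sample VAE loss for an input $\boldsymbol{x}$ is then given by:
$$\mathcal{L}^{\text{G}}(\boldsymbol{x}) = \mathcal{L}^{\text{recon}}(\boldsymbol{x}) + \mathcal{L}^{\text{latent}}(\boldsymbol{x}) \tag{6}$$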
The two terms of this loss function correspond to the two objectives mentioned in the main text: $\mathcal{L}^{\text{recon}}$ encourages the feedback connections to be able to reconstruct the inputs from their hidden representations, and $\mathcal{L}^{\text{latent}}$ encourages the distribution of the latent variables, given the observed inputs, to be close to a standard normal distribution.
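The two terms of the per-sample generative loss can be sketched in plain Python as below. This is only an illustration: the function names are assumptions, the encoder outputs `mu` and `sigma` and the decoded image are passed in as plain lists rather than produced by a network, and the clipping constant `eps` is added here solely to keep the logarithms finite.

```python
import math

def latent_regularization(mu, sigma):
    """KL divergence between N(mu, diag(sigma^2)) and a standard normal prior."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

def reconstruction_loss(x, x_decoded, eps=1e-7):
    """Binary cross entropy between original and decoded pixel values, summed over pixels."""
    total = 0.0
    for xp, xd in zip(x, x_decoded):
        xd = min(max(xd, eps), 1.0 - eps)  # clip to avoid log(0); an assumption of this sketch
        total += -xp * math.log(xd) - (1.0 - xp) * math.log(1.0 - xd)
    return total

def generative_loss(x, x_decoded, mu, sigma):
    """Per-sample VAE loss: reconstruction term plus latent regularization term."""
    return reconstruction_loss(x, x_decoded) + latent_regularization(mu, sigma)

# Hypothetical two-pixel image and two-unit latent layer.
x = [1.0, 0.0]
good = generative_loss(x, [0.9, 0.1], mu=[0.0, 0.0], sigma=[1.0, 1.0])
bad = generative_loss(x, [0.1, 0.9], mu=[2.0, -2.0], sigma=[0.5, 0.5])
```

The latent term is zero exactly when the encoder outputs match the prior (zero mean, unit standard deviation) and is positive otherwise.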
The model used for RtF is identical to the symmetrical VAE described above, except that a softmax classification layer is added to the last hidden layer of the encoder. As explained in the main text, the per-sample loss of this model is simply the sum of a classification term (either standard cross-entropy loss or distillation loss) and the above generative loss term. It would be possible to add a hyperparameter to set the relative weight of these two terms (which would presumably further increase performance), but given the issue associated with hyperparameters in a continual learning setting (see Appendix B), we choose to avoid this.
The regularization term of elastic weight consolidation [EWC; 1] consists of a quadratic penalty term for each previously learned task, whereby each task’s term penalizes the parameters for how different they are compared to their value directly after finishing training on that task. The strength of each parameter’s penalty depends for every task on how important that parameter was estimated to be for that task, with higher penalties for more important parameters. For EWC, a parameter’s importance is estimated for each task by the parameter’s corresponding diagonal element of that task’s Fisher Information matrix, evaluated at the optimal parameter values after finishing training on that task. With $N_{\text{params}}$ the number of parameters, the EWC regularization term for task $K > 1$ is given by:
$$\mathcal{L}^{\text{reg}}_{\text{EWC}}(\boldsymbol{\theta}) = \sum_{k=1}^{K-1} \left( \frac{1}{2} \sum_{i=1}^{N_{\text{params}}} F^{(k)}_{ii} \left( \theta_i - \hat{\theta}^{(k)}_i \right)^2 \right) \tag{7}$$
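whereby $\hat{\theta}^{(k)}_i$ is the $i^{\text{th}}$ element of $\hat{\boldsymbol{\theta}}^{(k)}$, which is the vector with parameter values at the end of training of task $k$, and $F^{(k)}_{ii}$ is the $i^{\text{th}}$ diagonal element of $\boldsymbol{F}^{(k)}$, which is the Fisher Information matrix of task $k$ evaluated at $\hat{\boldsymbol{\theta}}^{(k)}$. The $i^{\text{th}}$ diagonal element of $\boldsymbol{F}^{(k)}$ is defined as:
$$F^{(k)}_{ii} = \mathbb{E}_{\boldsymbol{x} \sim Q^{(k)}_{\boldsymbol{x}}} \left[ \mathbb{E}_{y \sim p_{\hat{\boldsymbol{\theta}}^{(k)}}(\cdot \mid \boldsymbol{x})} \left[ \left( \left. \frac{\partial \log p_{\boldsymbol{\theta}}(Y = y \mid \boldsymbol{x})}{\partial \theta_i} \right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}^{(k)}} \right)^2 \right] \right] \tag{8}$$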
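whereby $Q^{(k)}_{\boldsymbol{x}}$ is the (theoretical) input distribution of task $k$ and $p_{\boldsymbol{\theta}}$ is the conditional distribution defined by the neural network with parameters $\boldsymbol{\theta}$. Note that in the original EWC paper [1] it is not specified exactly how these $F^{(k)}_{ii}$ are calculated (except that it is said to be “easy”); here we calculate them as the diagonal elements of the “empirical Fisher”:
$$F^{(k)}_{ii} = \frac{1}{|S^{(k)}|} \sum_{(\boldsymbol{x}, y) \in S^{(k)}} \left( \left. \frac{\partial \log p_{\boldsymbol{\theta}}(Y = y \mid \boldsymbol{x})}{\partial \theta_i} \right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}^{(k)}} \right)^2 \tag{9}$$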
whereby $S^{(k)}$ is the training data of task $k$. This calculation can be interpreted as an approximation under the assumption that the predictions made by $p_{\hat{\boldsymbol{\theta}}^{(k)}}$ on the training data of task $k$ are near-perfect. [Footnote 2: An alternative way to calculate $F^{(k)}_{ii}$ is to replace in equation 9 the provided label $y$ by $c^{(\boldsymbol{x})} = \arg\max_c p_{\hat{\boldsymbol{\theta}}^{(k)}}(Y = c \mid \boldsymbol{x})$, the label predicted by the model with parameters $\hat{\boldsymbol{\theta}}^{(k)}$ given $\boldsymbol{x}$. Another option is, instead of taking for each training input $\boldsymbol{x}$ only the most likely label predicted by model $p_{\hat{\boldsymbol{\theta}}^{(k)}}$, to sample for each $\boldsymbol{x}$ multiple labels from the entire conditional distribution defined by this model (i.e., to approximate the inner expectation of equation 8 for each training sample with Monte Carlo sampling from $p_{\hat{\boldsymbol{\theta}}^{(k)}}(\cdot \mid \boldsymbol{x})$).] The calculation of the Fisher Information is time-consuming, especially if tasks have a lot of training data. [Footnote 3: It should be noted that it might be possible to improve the efficiency of our implementation of the Fisher Information calculation. This calculation requires the gradients for each individual data point (as they need to be squared before being summed), but batch-wise operations in PyTorch do not allow access to the unaggregated gradients. We therefore performed the Fisher Information calculation with a batch size of 1.] In practice it might therefore sometimes be beneficial to trade accuracy for speed by using only a subset of a task’s training data for this calculation (e.g., by introducing another hyperparameter that sets the maximum number of samples to be used in equation 9).
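To make the per-sample nature of the empirical Fisher calculation concrete, the sketch below computes it for a small linear softmax classifier, for which the gradient of the log-likelihood is available in closed form. This is an illustrative assumption: the models in this paper are multi-layer networks whose gradients are obtained by backpropagation, and the function name `empirical_fisher_diagonal` is hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def empirical_fisher_diagonal(W, data):
    """Diagonal empirical Fisher for a linear softmax classifier.

    W is a list of weight rows (one row per class); data is a list of (x, y)
    pairs with x a list of inputs and y the provided label. The gradient is
    squared per sample before averaging over the data set.
    """
    n_classes, n_inputs = len(W), len(W[0])
    fisher = [[0.0] * n_inputs for _ in range(n_classes)]
    for x, y in data:
        logits = [sum(w * xj for w, xj in zip(row, x)) for row in W]
        probs = softmax(logits)
        for c in range(n_classes):
            # d log p(Y=y|x) / d W[c][j] = (1[c == y] - p_c) * x_j
            coeff = (1.0 if c == y else 0.0) - probs[c]
            for j in range(n_inputs):
                fisher[c][j] += (coeff * x[j]) ** 2
    return [[f / len(data) for f in row] for row in fisher]

# Hypothetical two-class model with all-zero weights and one training sample.
W = [[0.0, 0.0], [0.0, 0.0]]
fisher = empirical_fisher_diagonal(W, [([1.0, 0.0], 0)])
```

Because each sample's gradient is squared before the average is taken, the squared gradient of a mini-batch loss cannot be used as a substitute.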
A disadvantage of the original formulation of EWC is that the number of quadratic terms in its regularization term grows linearly with the number of tasks. This is an important limitation, as for a method to be applicable in a true lifelong learning setting its computational cost should not increase with the number of tasks seen so far. It has been pointed out that a slightly stricter adherence to the approximate Bayesian treatment of continual learning, which had been used as motivation for EWC, actually results in only a single quadratic penalty term on the parameters that is anchored at the optimal parameters after the most recent task and with the weight of the parameters’ penalties determined by a running sum of the previous tasks’ Fisher Information matrices. This insight was adopted in a modification to EWC called online EWC. The regularization term of online EWC when training on task $K > 1$ is given by:
$$\mathcal{L}^{\text{reg}}_{\text{online EWC}}(\boldsymbol{\theta}) = \frac{1}{2} \sum_{i=1}^{N_{\text{params}}} \tilde{F}^{(K-1)}_{ii} \left( \theta_i - \hat{\theta}^{(K-1)}_i \right)^2 \tag{10}$$
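whereby $\hat{\theta}^{(K-1)}_i$ is the value of parameter $i$ after finishing training on task $K-1$ and $\tilde{F}^{(K-1)}_{ii}$ is a running sum of the $i^{\text{th}}$ diagonal elements of the Fisher Information matrices of the first $K-1$ tasks, with a hyperparameter $\gamma \leq 1$ that governs a gradual decay of each previous task’s contribution. That is: $\tilde{F}^{(k)}_{ii} = \gamma \tilde{F}^{(k-1)}_{ii} + F^{(k)}_{ii}$, with $\tilde{F}^{(1)}_{ii} = F^{(1)}_{ii}$, where $F^{(k)}_{ii}$ is the $i^{\text{th}}$ diagonal element of the Fisher Information matrix of task $k$ calculated according to equation 9.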
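The running-sum bookkeeping of online EWC can be sketched as follows. This is a minimal illustration with hypothetical function names and toy numbers; parameters and Fisher diagonals are represented as flat lists.

```python
def update_running_fisher(running_fisher, new_fisher, gamma=1.0):
    """Decay the running sum by gamma and add the newest task's Fisher diagonal."""
    if running_fisher is None:  # first task: the running sum is just its Fisher diagonal
        return list(new_fisher)
    return [gamma * r + f for r, f in zip(running_fisher, new_fisher)]

def online_ewc_penalty(theta, theta_prev, running_fisher):
    """Single quadratic penalty anchored at the parameters after the most recent task."""
    return 0.5 * sum(f * (t - p) ** 2
                     for f, t, p in zip(running_fisher, theta, theta_prev))

# Toy Fisher diagonals for two consecutive tasks (hypothetical values).
running = update_running_fisher(None, [1.0, 2.0])
running = update_running_fisher(running, [0.5, 0.5], gamma=0.9)
penalty = online_ewc_penalty([1.0, 0.0], [0.0, 0.0], running)
```

Only one Fisher vector and one parameter vector need to be stored, regardless of how many tasks have been seen, which is what removes the linear growth of the original formulation.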
Similar to online EWC, the regularization term of synaptic intelligence [SI; 7] consists of only one quadratic term that penalizes changes to parameters away from their values after finishing training on the previous task, with the strength of each parameter’s penalty depending on how important that parameter is thought to be for the tasks learned so far. To estimate parameters’ importance, for every new task $k$ a per-parameter contribution $\omega^{(k)}_i$ to the change of the loss is first calculated for each parameter $i$ as follows:
$$\omega^{(k)}_i = \sum_{t=1}^{N_{\text{iters}}} \left( \theta_i[t^{(k)}] - \theta_i[(t-1)^{(k)}] \right) \frac{-\partial \mathcal{L}_{\text{total}}[t^{(k)}]}{\partial \theta_i} \tag{11}$$
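with $N_{\text{iters}}$ the total number of iterations per task, $\theta_i[t^{(k)}]$ the value of the $i^{\text{th}}$ parameter after the $t^{\text{th}}$ training iteration on task $k$ and $\frac{\partial \mathcal{L}_{\text{total}}[t^{(k)}]}{\partial \theta_i}$ the gradient of the loss with respect to the $i^{\text{th}}$ parameter during the $t^{\text{th}}$ training iteration on task $k$. For every task, these per-parameter contributions are normalized by the square of the total change of that parameter during training on that task plus a small dampening term $\xi$ (set to 0.1, to bound the resulting normalized contributions when a parameter’s total change goes to zero), after which they are summed over all tasks so far. The estimated importance of parameter $i$ for the first $K-1$ tasks is thus given by:
$$\Omega^{(K-1)}_i = \sum_{k=1}^{K-1} \frac{\omega^{(k)}_i}{\left( \Delta^{(k)}_i \right)^2 + \xi} \tag{12}$$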
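with $\Delta^{(k)}_i = \theta_i[N_{\text{iters}}^{(k)}] - \theta_i[0^{(k)}]$, where $\theta_i[0^{(k)}]$ indicates the value of parameter $i$ right before starting training on task $k$. (An alternative formulation is $\Delta^{(k)}_i = \hat{\theta}^{(k)}_i - \hat{\theta}^{(k-1)}_i$, with $\hat{\theta}^{(0)}_i$ the value parameter $i$ was initialized with and $\hat{\theta}^{(k)}_i$ its value after finishing training on task $k$.) The regularization term of SI to be used during training on task $K$ is then given by:
$$\mathcal{L}^{\text{reg}}_{\text{SI}}(\boldsymbol{\theta}) = \sum_{i=1}^{N_{\text{params}}} \Omega^{(K-1)}_i \left( \theta_i - \hat{\theta}^{(K-1)}_i \right)^2 \tag{13}$$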
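The three steps of the SI calculation described above (accumulating per-parameter contributions during training, normalizing them at the end of a task, and applying the quadratic penalty) can be sketched as below. The function names are hypothetical and the toy trajectory corresponds to two steps of plain gradient descent with learning rate 0.5 on the loss $\frac{1}{2}\theta_0^2$, with a second parameter that never changes.

```python
def si_task_contribution(theta_trajectory, gradients):
    """Sum over iterations of (parameter change) * (negative gradient).

    theta_trajectory has one more entry than gradients: entry 0 holds the
    parameter values right before training on the task starts, and
    gradients[t - 1] is the gradient used during iteration t.
    """
    n_params = len(theta_trajectory[0])
    omega = [0.0] * n_params
    for t in range(1, len(theta_trajectory)):
        for i in range(n_params):
            change = theta_trajectory[t][i] - theta_trajectory[t - 1][i]
            omega[i] += change * (-gradients[t - 1][i])
    return omega

def si_update_importance(Omega, omega, theta_start, theta_end, xi=0.1):
    """Add the task's normalized contributions to the running importance estimate."""
    return [O + w / ((e - s) ** 2 + xi)
            for O, w, s, e in zip(Omega, omega, theta_start, theta_end)]

def si_penalty(theta, theta_prev, Omega):
    """Quadratic penalty on changes away from the values after the previous task."""
    return sum(O * (t - p) ** 2 for O, t, p in zip(Omega, theta, theta_prev))

trajectory = [[1.0, 0.0], [0.5, 0.0], [0.25, 0.0]]
grads = [[1.0, 0.0], [0.5, 0.0]]
omega = si_task_contribution(trajectory, grads)
Omega = si_update_importance([0.0, 0.0], omega, trajectory[0], trajectory[-1])
penalty = si_penalty([0.0, 1.0], trajectory[-1], Omega)
```

In this toy case only the first parameter contributed to lowering the loss, so only changes to that parameter are penalized while the second parameter remains free to move.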
As discussed in section 4.2 and Appendix A.2, several of the continual learning methods compared in this paper have one or more hyperparameters. The typical way of setting the value of hyperparameters is by training models on the training set for a range of hyperparameter-values, and selecting those that result in the best performance on a separate validation set. This strategy has been adapted to the continual learning setting as training models on the full protocol with different hyperparameter-values using only every task’s training data, and comparing their overall performances using separate validation sets (or sometimes the test sets) for each task [e.g., 10, 1, 8, 19]. However, here we would like to stress that this means that these hyperparameters are set (or learned) based on an evaluation using data from all tasks, which violates the continual learning principle of only being allowed to visit each task once and in sequence. Although it is tempting to think that it is acceptable to relax this principle for tasks’ validation data, we argue here that it is not. A clear example of how using each task’s validation data continuously throughout an incremental training protocol can lead to what is in our opinion an unfair advantage is provided by an approach in which, after finishing training on each task, a “bias-removal parameter” is set that optimizes performance on the validation sets of all tasks seen so far (see section 3.3 of that paper). Although the hyperparameters of the methods compared here are much less influential than those in that approach, we believe that it is important to realize this issue associated with traditional grid searches in a continual learning setting and that, at a minimum, influential hyperparameters should be avoided in methods for continual learning.
Nevertheless, to give the competing methods of generative replay the best possible chance—and to explore how influential their hyperparameters are—we do perform grid searches to set the values of their hyperparameters (see Figures 7 and 8). Given the issue discussed above we do not see much value in using validation sets for this, and we evaluate the performances of all hyperparameter(-combination)s using the tasks’ test sets. For this grid search each experiment is run once, after which 20 new runs are executed using the selected hyperparameter-values to obtain the results in Tables 4 and 5 and Figures 4 and 5.
The Replay-through-Feedback framework we introduce here has some similarity with one of the components of FearNet. The “medial prefrontal cortex (mPFC) network” of this brain-inspired method is also a generative autoencoder with added classification capability. At specific times (whenever their model “sleeps”), this autoencoder is also used to generate pseudo-examples representative of previously learned tasks. This autoencoder is however not a VAE; instead, its generative capability relies on the storage of a mean feature vector and a covariance matrix for each encountered class. FearNet further consists of a “hippocampal complex (HC) network”, which temporarily stores the data from previous tasks [Footnote 4: The data from previous tasks is stored in the HC until their model “sleeps”, which for their main reported results is every ten tasks. Confusingly, in the abstract of the FearNet paper it is claimed that their method does not store previous examples; this was probably intended to mean that their method does not permanently store previous examples.], and a “basolateral amygdala (BLA) network” that decides whether a newly-encountered example should be classified by the HC or the mPFC. The FearNet authors show good results with this method on a variant of the incremental class learning scenario that they introduced: models are first non-incrementally trained on half of all classes to be learned, after which in the incremental learning part of their paradigm the remaining classes are presented one-by-one (i.e., each new task only consists of one class). As this experimental paradigm is quite specific, it remains unclear how generally applicable their approach is. Moreover, due to the complexity of their method, the specific contribution of replay generated by their autoencoder remains unclear. Indeed, it seems likely that especially the storage of data (the temporary storage of original data and/or the permanent storage of hidden summary statistics) also had an important role in FearNet’s good performance.
In the current paper, based on the argument that storing data is not always possible due to privacy concerns or memory constraints, we only considered methods that do not store data. However, as indicated by methods such as iCaRL and FearNet, when possible, storing data can substantially boost performance, especially in the incremental class learning scenario. Of particular note is that these two methods point out that it is not necessary that all of the original data is stored permanently. Indeed, iCaRL demonstrates that storing a relatively small number of well-chosen examples can be helpful, while, as discussed above, FearNet’s good performance seems to suggest that temporary storage of data can already be useful. Both these reductionist approaches to storing data of course reduce memory storage demands. Finally, an interesting aspect of FearNet is that it also stored hidden summary statistics. The promising idea of storing hidden representations, which besides reducing memory storage demands could also address the privacy issue, has been worked out further in other work. We expect that the sparse and/or temporary storage of hidden representations could be a useful complement to generative replay, which might help it to scale up to real-world continual learning problems. However, we want to stress that when allowing the storage of data, it is important to take extra care to ensure fair comparisons between methods (i.e., that they all have the same rules regarding how much, for how long and what data can be stored).