1 Introduction
Finetuning has been established as the most common method for learning a new task on top of an already learned one. This works well if the system is no longer required to perform the previous task. However, in many real-world situations one is interested in learning consecutive tasks, all of which the system should eventually be able to perform. This is the setting studied in lifelong learning, also referred to as sequential, incremental or continual learning. In this setting, the popular approach of finetuning suffers from catastrophic forgetting french1999catastrophic ; goodfellow2013empirical ; mccloskey1989catastrophic ; li2020baseline ; mermillod2013stability : all network capacity is used for learning the new task, which leads to forgetting of the previous ones.
A popular strategy to avoid this is to use importance-weight loss proxies or regularizers aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming . These approaches compute an importance score for each of the weight parameters of the model based on previous tasks and use this to decide which weights can be modified for the current task. A drawback of these methods is that they need to store an extra variable (the importance score) for each weight. This leads to an overhead of a float per weight parameter, i.e. double the number of parameters which have to be stored. Other methods work with a binary mask to select part of the model for each task mallya2018piggyback ; mallya2018packnet . This leads to an overhead of one bit per task per weight parameter. Finally, some methods directly make a copy of the network rusu2016progressive or rely on the storage of exemplars lopez2017gradient ; rebuffi2016icarl ; chaudhry2018efficient ; riemer2018learning , which again increases memory consumption and renders these methods unsuitable when privacy requirements forbid storing of data.
In this paper, we advocate computing a mask at the level of the features (activations) instead of at the level of the weights. (With features or activations we refer to the inputs of a layer that are also the outputs of a previous layer. The input images could also be considered features for the network to perform the task at hand, but here we only use the term features for the activations between layers.) We need the mask to be ternary, i.e. it adds a third state that allows features to be used during the forward pass while being masked during the backward pass. This allows reusing the representations from previous tasks without modifying them and without introducing forgetting, and it drastically reduces the number of extra parameters that need to be stored. As an example, the popular AlexNet architecture krizhevsky2012imagenet has around 60 million weights, but fewer than 10k features. One earlier method that builds on this idea is HAT serra2018overcoming , which stores an attention value for each feature for each task. Recently, SSL aljundi2018selfless
also brings attention to the activation neurons by promoting sparsity with losses inspired by lateral inhibition in the mammalian brain. These two recent works stress the importance of focusing on the features instead of the weights, not only because of the reduction in memory overhead, but also because it allows for better performance and less forgetting. However, both methods still allow some forgetting as new tasks are learned.
The forgetting typically increases with the number of tasks aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; lopez2017gradient ; rebuffi2016icarl ; serra2018overcoming . However, for many practical systems it is undesirable that accuracy deteriorates over time while the system learns new tasks. Moreover, under these settings the user typically has no control over the amount of forgetting, i.e. there are no guarantees on the performance of the system after new tasks have been added. Even worse: the user does not know how much the system has actually forgotten or how well it still performs on older tasks.
For these reasons, some works have studied systems which perform continual learning without any forgetting at all. Currently, apart from the methods that make copies of the network for each task rusu2016progressive , mask-based approaches are the only ones that guarantee no forgetting mallya2018piggyback ; mallya2018packnet . Indeed, all methods that allow backward propagation into the parameters of previously learned tasks have no control over the amount of forgetting. Using a binary mask to avoid updating the weights of previous tasks prevents any forgetting. In the case of the recent approaches PackNet mallya2018packnet and PiggyBack mallya2018piggyback
, this is enforced by binary-masking weights or learning masks that will be binarized after the task is learned. However, both these non-forgetting methods mask weights and therefore have a larger memory overhead than methods which would put the masks on the features instead. Another drawback of
mallya2018piggyback is that it requires a backbone network as a starting point.

In this article, we propose a method for continual learning which does not suffer from any forgetting. Due to the nature of our proposed mask-based approach, we focus our evaluation on task-aware experimental setups. Instead of applying masks to all weights in the network, we propose to move the masks to the feature level, thereby significantly reducing the memory overhead. Our initial method requires only a 2-bit mask value per activation per task. In addition, we introduce a task-dependent normalization on the features. This allows previously learned features to be adjusted so that they are of more optimal use for later tasks, without changing the performance on or the weights assigned to previous tasks. It introduces a further memory overhead of two floats per activation per task. Nevertheless, this method still has a significantly lower memory overhead than any method which stores additional parameters per weight aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; mallya2018piggyback ; mallya2018packnet ; liu2018rotate ; zenke2017continual .
The remainder of the paper is organized as follows. In the next section we review existing continual learning methods. In Section 3, we discuss masks on the features and introduce the proposed method Ternary Feature Masks (TFM). Then, we present the experimental setup and analysis of the results in Section 4 and conclude in Section 5.
2 Related work
Lifelong learning in the proposed continual learning setup has been addressed in multiple prior works. A large part of these use regularization-based techniques to reduce catastrophic forgetting without having to store raw input data. They can be divided into two main families. Distillation approaches use teacher-student setups that aim at preserving the output of the teacher model on the new data distribution rusu2016progressive ; li2017learning ; jung2016less ; rannen2017encoder ; zhang2019class . The output for each of the previous tasks is encoded using targets, exemplars or other representations and constrained when learning the new task. Model-based approaches aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; aljundi2018selfless ; liu2018rotate ; zenke2017continual ; chaudhry2018riemannian focus on overcoming catastrophic forgetting by defining the importance of the weights in the network. By penalizing changes to important weights, i.e. those with a big impact on the network output or performance, the loss protects previously learned information.
Learning without Forgetting (LwF) proposes to use the knowledge distillation loss li2017learning to preserve the performance of previous tasks. However, if the data distribution of the new task is very different from that of the previous tasks, performance drops drastically aljundi2016expert . To solve that, rebuffi2016icarl (iCaRL) stores a subset of each task's data as exemplars, while rannen2017encoder
(EBLL) solves the issue by learning undercomplete autoencoders for each task. Less Forgetting Learning (LFL) is also similar to LwF, preserving the previous tasks’ performance by penalizing changes on the shared representation
jung2016less . This approach argues that the task-specific decision boundaries should not change, and freezes the last layer instead of mitigating the change with a loss. Expert Gate aljundi2016expert (EGate) learns a model for each task and an autoencoder gate which chooses the model to be used. Recently, Deep Model Consolidation (DMC) proposed to learn new classes separately and then learn a final model with double distillation and extra unlabelled data zhang2019class . However, most of these methods need a pre-processing step before each task. Furthermore, another main issue is scalability when learning many tasks, since the described methods have to store data, autoencoders, or larger models for each new task.

When learning a new task, most model-based approaches apply a smooth penalty for changing weights, proportional to their importance for previous tasks aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; liu2018rotate ; zenke2017continual
. One of the main issues for these methods is that, depending on the task relatedness and the capacity of the network, they might over- or under-estimate the importance of those weights. The main difference among those methods is how the importance of weights is calculated. In Elastic Weight Consolidation (EWC) an approximation of the diagonal of the Fisher Information Matrix (FIM) is used
kirkpatrick2017overcoming . Rotated EWC proposes a rotation of the weight space to get a better approximation of the FIM liu2018rotate. In Incremental Moment Matching (IMM) the moments of the posterior distribution are matched incrementally
lee2017overcoming. They assume a diagonal covariance matrix which means that there is no correlation among the parameters, and selectively balance between two weights using variance information. Synaptic Intelligence (SI) computes the importance weights in an online fashion by storing how much the loss would change for each parameter over the training
zenke2017continual . However, this method is prone to under-estimating the importance when using pretrained networks, and to over-estimating it by relying too much on the gradients of some batches. Memory Aware Synapses (MAS) computes the weight importance in an online unsupervised way, connecting their approach with Hebbian learning
aljundi2018memory . It promotes the importance of those weights that, even with small changes, have a big effect on the network predictions. Finally, approaches that learn an importance parameter for each weight can be combined with Selfless Sequential Learning (SSL), which imposes sparsity or local neuron inhibition on the activations through a loss term aljundi2018selfless .

Family | Method | Revisit data | Require Backbone net | Easily expandable | Overhead | Forgetting | Forward transfer | Features or weights
---|---|---|---|---|---|---|---|---
Baseline | Finetune | No | No | Yes | None | Yes | Little | neither
Baseline | Joint | Yes | No | Yes | None | No | Little | neither
Baseline | Freeze | No | Yes | Yes | None | No | Backbone only | neither
Distillation | LwF li2017learning | No | No | No | 1 float pp | Some | Yes | weights
Distillation | LFL jung2016less | No | No | No | 1 float pp | Some | Yes | weights
Distillation | PNN rusu2016progressive | No | No | Yes | duplicate pt | Some | Yes | weights
Distillation | P&C schwarz2018progress | No | No | No | extra network | Some | Little | weights
Model-based | EWC kirkpatrick2017overcoming | No | No | No | 1 float pp | Some | Yes | weights
Model-based | R-EWC liu2018rotate | No | No | No | 1 float pp | Some | Yes | weights
Model-based | IMM lee2017overcoming | No | No | No | 1 float pp pt | Some | Yes | weights
Model-based | SI zenke2017continual | No | No | No | 1 float pp | Some | Yes | weights
Model-based | MAS aljundi2018memory | No | No | No | 1 float pp | Some | Yes | weights
Model-based | SSL aljundi2018selfless | No | No | No | 1 float pfp | Some | Yes | both
Mask-based | PackNet mallya2018packnet | No | No | No | 1 int pp pt | No | Yes | weights
Mask-based | PiggyBack mallya2018piggyback | No | Yes | No | 1 bit pp pt | No | Backbone only | weights
Mask-based | HAT serra2018overcoming | No | No | No | 1 float pf pt | Some | Yes | features
Mask-based | TFM w/o FN (Ours) | No | No | Yes | 2 bits pf pt | No | Yes | features
Mask-based | TFM (Ours) | No | No | Yes | 2 bits + 2 floats pf pt | No | Yes | features

(Overhead abbreviations: pp = per parameter, pf = per feature, pt = per task.)
Some more works use other underlying methods. Progressive Neural Networks (PNN) add lateral connections at each layer of the network to a duplicate of that layer
rusu2016progressive . The new column then learns the new task while the old one keeps its weights fixed, meaning that resources are duplicated each time a task is added. This approach leads to no forgetting while making the knowledge of previous tasks available during the learning of a new one through the lateral connections. However, as each new task adds a column with the corresponding connections, the overhead scales quadratically with the number of tasks. Progress and Compress (P&C) expands the idea of PNN with the use of EWC while keeping the number of parameters constant schwarz2018progress . It proposes a two-component setup with a knowledge base and an active column that follows a similar design to PNN with lateral connections. Recently, Learn to Grow (LtG) proposed a two-part approach with a neural structure optimization component and a learning component which finetunes the parameters li2019learn . The neural structure component allows each layer to reuse existing weights, adapt them or grow the network. In the worst-case scenario, layers have to be added, which makes the growth linear in the number of tasks. The finetuning component performs parameter optimization and can be fixed or use an existing approach such as EWC to avoid catastrophic forgetting.

Apart from the above mentioned families, some recent works use masks to directly influence or completely remove forgetting for each parameter. We refer to this family of approaches as mask-based. These approaches give better control over the flow of gradients through the network and have the benefit of reducing or removing catastrophic forgetting more than the previously mentioned alternatives, at the cost of depleting network capacity faster as new tasks are learned. PathNet uses evolutionary strategies to learn selective routing through the weights fernando2017pathnet ; however, it is not end-to-end differentiable and is computationally very expensive. PackNet trains the network with the available weights, then prunes the less relevant weights and retrains with a smaller subset of them mallya2018packnet . Those weights are then not available for further learning of new tasks, which quickly reduces the capacity of the network. This results in fewer free parameters and in performance dropping quickly on longer sequences. Piggyback proposes to use a pretrained network as a backbone and then uses binary masks on the weights to create different sub-networks for each task mallya2018piggyback . The main drawback of that approach is the backbone network itself, which is crucial for being able to learn each task on top of it and whose distribution cannot be too different from the new tasks. In the case of PackNet, a backbone network is not needed but is recommended, since learning tasks from scratch (especially with larger networks) is more difficult than learning from a network finetuned on a large-scale dataset. In these last two methods, although a binary mask for each parameter adds almost no overhead during network usage, storing a mask for each task does not scale well after a certain number of tasks. Finally, Hard Attention to the Task (HAT) proposes a hard attention mechanism on the features after each layer serra2018overcoming
. The attention embeddings are non-binary and are learned together with each task and conditioned by the previous tasks’ attentions. Because of the annealing of the slope of the sigmoid used on the embeddings, they also define a different backward propagation through their attention mechanism with gradient compensation. This approach offers plasticity to the embeddings in order to learn them, but also allows the possibility to forget previous tasks during the backpropagation step. A no-forgetting idea is discussed in the appendices of their manuscript with a note on binary masks, connecting the removal of plasticity to the
inhibitory synapses idea mcculloch1943logical .

In our proposed approach we take the rigid side of that trade-off, using fixed masks that reduce plasticity but ensure non-forgetting of previous tasks. Our approach also provides a natural way to expand the capacity of the network, which is not addressed in HAT and most of the previous related work. We use masks on the features of the network to gain better control over which weights can be modified while learning new tasks. At the same time, since the mask is ternary, weights fixed for previous tasks can still be used by new tasks without being modified. This masking strategy ensures that the network does not forget anything from previous tasks while reducing the memory overhead in comparison to masking the weights.
We show in Table 1 a comprehensive overview of the characteristics that we consider most relevant to the experimental setup we propose. Our proposed method is unique in combining being expandable, having a low overhead cost and having no forgetting; all other methods satisfy at most one of those three characteristics. Finally, we note that some methods have been proposed that consider a memory budget rebuffi2016icarl or an online setup losing2018incremental , and methods with exemplars where each sample is used only once lopez2017gradient ; chaudhry2018efficient ; riemer2018learning . However, those setups are significantly different from the sequential setup we propose and therefore we do not compare to them. The same holds for Learn to Grow, which belongs to the architecture search setting and can be combined with or extended by several of the above mentioned approaches.
3 Learning without any Forgetting
Here we propose our method for task-aware continual learning, designed to learn new tasks without any forgetting of previously learned tasks. As discussed in the introduction, using masks that create rigid states is an efficient way to enforce non-forgetting of previous tasks: freezing the weights once learned keeps the knowledge fixed without any possibility of forgetting. Works which have addressed this problem have focused on weight-masks, where an additional parameter is learned for each weight in the network mallya2018piggyback ; mallya2018packnet . From a network-overhead point of view, we argue that it is better to work with feature-masks, which learn an additional parameter for each feature in the network. In Table 2 we compare the number of weights and features in several popular networks. The table clearly shows that the overhead is significantly lower: on average, weight-masks are a quadratic factor bigger than feature-masks (see also the short calculation after the table).
Network | #weights | #features |
---|---|---|
LeNet lecun1998gradient | 59,956 | 226 |
AlexNet krizhevsky2012imagenet | 54,547,712 | 9,344 |
VGGNet simonyan2014very | 119,579,904 | 10,880 |
ResNet-50 he2016deep | 19,330,304 | 22,720 |
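To make the gap concrete, here is a quick back-of-the-envelope calculation (our own illustration, assuming a 1-bit weight mask versus a 2-bit ternary feature mask per task, using the AlexNet counts from Table 2):

```python
# Per-task mask overhead for AlexNet using the counts from Table 2, assuming
# 1 bit per weight for a weight-level mask and 2 bits per feature for a
# ternary feature-level mask (illustrative accounting only).
n_weights, n_features = 54_547_712, 9_344

weight_mask_kib = n_weights * 1 / 8 / 1024    # ~6.5 MiB per task
feature_mask_kib = n_features * 2 / 8 / 1024  # ~2.3 KiB per task

print(f"weight-level mask : {weight_mask_kib / 1024:.1f} MiB per task")
print(f"feature-level mask: {feature_mask_kib:.2f} KiB per task")
```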
First, we will discuss binary masks and how those can easily encode the parts that we want to learn and the parts that we want to fix. Afterwards, we will explore what happens when we want to learn more than one task and how the binary masks need to be extended to ternary masks to make room for a new state. Finally, we will explore the use of feature normalization to allow for less rigid learning of new tasks.
3.1 Binary feature masks
Using binary feature masks on a neural network means that each masked neuron has one of two states (0 or 1). Since the masks are multiplied directly with the neuron activations, the corresponding filters are either used or not (the same holds for the backward pass, which is either applied or not). For each task we thus have a binary mask indicating which neurons can be used. Since we do not allow any forgetting, those masks have to be disjoint. In Fig. 1 we show an example with two tasks, where each task is only allowed to use a different set of neurons. A large number of connections remain completely unused, making the two sub-networks fully separable from one another.
Consider a fully connected layer (the theory can easily be extended to convolutional layers). The output of the layer is $a^{l} = h\left(W^{l} a^{l-1}\right)$, where $a^{l-1} \in \mathbb{R}^{n}$, $a^{l} \in \mathbb{R}^{m}$ and $W^{l} \in \mathbb{R}^{m \times n}$. The binary feature mask for the forward pass is defined as follows:

$$\hat{a}^{l}_{t} = m^{l}_{t} \odot a^{l}, \qquad\qquad (1)$$

where $m^{l}_{t} \in \{0,1\}^{m}$ refers to the mask for task $t$ at layer $l$ and $\odot$ is an element-wise multiplication. Masks from different tasks are forced to select different features ($m^{l}_{t} \odot m^{l}_{t'} = 0$ for $t \neq t'$). The backward pass for training task $t$ is defined as:

$$\frac{\partial \mathcal{L}}{\partial W^{l}_{ij}} \leftarrow \frac{\partial \mathcal{L}}{\partial W^{l}_{ij}} \left( m^{l}_{t,i} \wedge m^{l-1}_{t,j} \right), \qquad\qquad (2)$$

where $\wedge$ is the AND logical operator and there are only non-zero gradients for those weights that join two features which are both masked as active for task $t$.
[Figure 1: example of binary feature masks for two tasks, each using a disjoint set of neurons.]
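As an illustration (our own PyTorch-style sketch, not the authors' released code), a layer with per-task binary feature masks can be implemented by multiplying its output with the task mask; since masked activations are zero, the incoming weights of masked features receive no gradient, and when every layer is masked this way the effect matches Eq. 2:

```python
import torch
import torch.nn as nn

class BinaryMaskedLinear(nn.Module):
    """Linear layer whose output features are gated by a per-task binary mask."""
    def __init__(self, in_features, out_features, n_tasks):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # one binary mask per task over the output features (m_t^l); tasks stay disjoint
        self.register_buffer("masks", torch.zeros(n_tasks, out_features))

    def forward(self, x, task):
        # multiplying the activations by the mask removes masked features from the
        # forward pass and, via autograd, zeroes the gradients of their incoming weights
        return self.masks[task] * self.linear(x)

# toy usage: two tasks, each owning half of the 8 output features
layer = BinaryMaskedLinear(4, 8, n_tasks=2)
layer.masks[0, :4] = 1.0
layer.masks[1, 4:] = 1.0
out = layer(torch.randn(3, 4), task=0)   # only the first 4 features are non-zero
```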
This setup allows the weights associated to a neuron to be either used-and-learnable, or neither. If used, they contribute forward to the next layers (which is good, as it promotes forward transfer, i.e. sharing of knowledge from previous tasks). Yet at the same time this implies that they can be modified (which is bad, as it introduces catastrophic forgetting on the previous tasks). With only binary masks, you cannot have one without the other. Alternatively, one could define two separate binary states: "used" and "learnable". This has been used for a long time in deep learning by freezing weights oquab2014learning . Freezing weights is a mask-based way of switching the learning of a layer on and off. In this case, in both states the layer would contribute to the outcome of the network, but the weights would only be updated on those layers that are not masked. Here, we further explore this idea. We advocate that, in a sequential setup where the capacity of the network might increase when learning new tasks, the best way to mask the neurons is by having three states: "used", "learnable" and "unused". This can be achieved by using ternary masks on the neurons.

3.2 Ternary feature masks (TFM)
Being able to use the connections between the neurons of the previous tasks and the neurons of the newly added task is important to reuse the learned information and reduce the amount of capacity that needs to be added. By using a ternary mask we can define three states:
- forward only: the features are used during the forward pass so that the learned information from previous tasks is used, but the backward pass step is removed in order to keep the weights fixed and prevent forgetting. This state is used for the features from previous tasks.
- normal: forward and backward passes are applied as usual in order to learn the task at hand. This state is used for the new features created by the expansion of the network.
- masked: neither forward nor backward passes are allowed; the features do not contribute to the network inference and the weights associated with them are frozen. This state is used at test time only, when evaluating an old task after a new task has been added: when extending the capacity of the network, the new features are not used during inference on the previous tasks, since they did not exist at the moment of their training.
Similar to the binary case, we assign features to tasks with a mask $m^{l}_{t}$ (with $l$ the corresponding layer). Again, overlap in the selected features is not allowed. However, different from before, we now define a second mask $b^{l}_{t}$ per task, which covers the features of all tasks learned so far:

$$b^{l}_{t} = \bigvee_{s \leq t} m^{l}_{s}. \qquad\qquad (3)$$

The forward and backward pass are now given by:

$$\hat{a}^{l}_{t} = b^{l}_{t} \odot a^{l} \qquad\qquad (4)$$

and

$$\frac{\partial \mathcal{L}}{\partial W^{l}_{ij}} \leftarrow \frac{\partial \mathcal{L}}{\partial W^{l}_{ij}} \left[ \left( b^{l}_{t,i} \wedge b^{l-1}_{t,j} \right) - \left( b^{l}_{t-1,i} \wedge b^{l-1}_{t-1,j} \right) \right] \qquad\qquad (5)$$

respectively. During the forward pass, features which were selected by previously learned tasks can be used in the current task. During the backward pass, we make sure that all new weights can be updated while forcing the existing ones from previous tasks to remain the same. In Fig. 2 we show an example with two tasks, where adding features to the layer allows for more connections to be used than in the binary case. The term $b^{l}_{t,i} \wedge b^{l-1}_{t,j}$ in Eq. 5 corresponds to all connections available at task $t$; similarly, the term $b^{l}_{t-1,i} \wedge b^{l-1}_{t-1,j}$ corresponds to all connections available at task $t-1$. Subtracting both terms allows us to mask out the connections that contain the already learned content and apply backpropagation only to the new connections.
[Figure 2: example of ternary feature masks for two tasks; the features added for task 2 can use the features learned for task 1 without modifying them.]
Note that this definition also allows the same forward and backward pass to be used in case we would want to re-train one of the previous tasks. However, since we do not contemplate this option in our proposed setup, we can simplify Eq. 5 for non-revisiting task-incremental learning. In this case, the forward pass remains the same as in Eq. 4, and the backward pass can be rewritten as:

$$\frac{\partial \mathcal{L}}{\partial W^{l}_{ij}} \leftarrow \frac{\partial \mathcal{L}}{\partial W^{l}_{ij}} \left( m^{l}_{t,i} \vee m^{l-1}_{t,j} \right), \qquad\qquad (6)$$

where $\vee$ is the OR logical operator, which makes the mask active when either of its operands is active.
Since $b^{l}_{t,i}$ can never be 0 if the current or any previous $m^{l}_{s,i}$ is 1, both masks $m^{l}_{t}$ and $b^{l}_{t}$ can be combined into a single ternary mask. This is because weights associated with a feature that is not used in the forward pass are never updated. With this ternary mask, the states are associated as follows: when $m^{l}_{t,i}=1$ and $b^{l}_{t,i}=1$ the neuron is used and learnable (normal state); when $m^{l}_{t,i}=0$ and $b^{l}_{t,i}=1$ the neuron is used and gradients flow through it during the backward pass, but its previously learned weights are not updated (forward-only state); and finally, when $m^{l}_{t,i}=0$ and $b^{l}_{t,i}=0$ the neuron is unused, taking no part in the inference or the update of the network (masked state). In our implementation we encode these three states in a single ternary mask value per feature and per task. We will provide code and make it public upon acceptance of this manuscript.
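To illustrate how Eqs. 4 and 6 combine with the single ternary mask described above, the following PyTorch sketch is our own simplified re-implementation; the numeric state encoding (2/1/0) and the bias handling are our assumptions, not necessarily those of the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NORMAL, FWD_ONLY, MASKED = 2, 1, 0   # illustrative encoding of the three states

def split_masks(ternary):
    b = (ternary != MASKED).float()  # forward mask b_t of Eq. 4 (old + new features)
    m = (ternary == NORMAL).float()  # m_t: features that are new/learnable for this task
    return m, b

class TernaryMaskedLinear(nn.Module):
    """Linear layer with ternary feature masks on its input and output features."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x, tern_in, tern_out):
        m_out, b_out = split_masks(tern_out)
        m_in, _ = split_masks(tern_in)
        # Eq. 6: a weight is trainable iff its output or its input feature is new
        grad_mask = torch.clamp(m_out[:, None] + m_in[None, :], max=1.0)
        w, bias = self.linear.weight, self.linear.bias
        w_eff = w * grad_mask + (w * (1.0 - grad_mask)).detach()   # frozen part: no grad
        b_eff = bias * m_out + (bias * (1.0 - m_out)).detach()     # old biases frozen too
        return b_out * F.linear(x, w_eff, b_eff)                   # Eq. 4: forward mask
```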
Allowing previously learned parameters to be used in the forward pass, while only updating the network parameters assigned to the current task in the backward pass, is also applied in PackNet mallya2018packnet and HAT serra2018overcoming . However, in contrast to us, PackNet puts the masks on the weights and not on the features, and HAT applies a soft activation mask, which permits forgetting of previous tasks. We further distinguish ourselves from these methods through the task-specific feature normalization (discussed in the next section), which is a crucial ingredient of our method and allows not only exploiting previously learned features but also adapting them to the current task. This is possible for neither PackNet nor HAT.
3.3 Task-specific feature normalization (FN)
Since the binary or ternary masks freeze the filters learned on previous tasks, those filters have no flexibility to adapt to small changes in the features. As a consequence, even when the filters needed for a new task are very similar to existing ones, the network will tend to learn near-duplicates that differ only by a shift or a scale. This phenomenon is similar to the one observed when learning several styles in style transfer networks. A way to reuse learned filters more efficiently while keeping the non-forgetting property is to follow an approach similar to conditional instance normalization dumoulin2017learned , which transforms a set of features into a normalized version depending on the task.
Let $a^{l}$ be the features of layer $l$, and $\gamma^{l}_{t}$, $\beta^{l}_{t}$ the learnable parameters for each feature of each layer given a fixed task $t$. We define the task-specific feature normalization of $a^{l}$ as:

$$\mathrm{FN}\left(a^{l}; t\right) = \gamma^{l}_{t} \odot a^{l} + \beta^{l}_{t}, \qquad\qquad (7)$$

which applies a conditional normalization depending on the task without applying an instance normalization of the mean and standard deviation across the spatial dimensions. These parameters allow slightly adjusting the learned filters to the new tasks without modifying existing parameters (thus no forgetting happens) and with little overhead to the network capacity, since the $\gamma$ and $\beta$ parameters are defined per feature and not per weight.
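A minimal sketch of how such a task-specific feature normalization could be implemented (our own illustration, applied per channel for convolutional features); since only the current task's parameters enter the computation graph, the parameters of other tasks receive no gradient and nothing is forgotten:

```python
import torch
import torch.nn as nn

class TaskFeatureNorm(nn.Module):
    """Per-task affine rescaling of features (Eq. 7): gamma_t * a + beta_t.
    Unlike conditional instance normalization, no mean/std statistics are used."""
    def __init__(self, n_features, n_tasks):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_tasks, n_features))
        self.beta = nn.Parameter(torch.zeros(n_tasks, n_features))

    def forward(self, a, task):
        g, b = self.gamma[task], self.beta[task]
        if a.dim() == 4:                     # conv features: (N, C, H, W)
            g, b = g[None, :, None, None], b[None, :, None, None]
        return g * a + b                     # only gamma[task], beta[task] get gradients

# toy usage
fn = TaskFeatureNorm(n_features=64, n_tasks=3)
y = fn(torch.randn(2, 64, 8, 8), task=1)
```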
3.4 Growing Ternary Feature Masks
One of the core characteristics of our proposed method is that it can easily grow and expand the capacity of the network as required. Methods that allow the weights to be modified will, as more tasks are added, eventually modify them, regardless of how the importance weights are defined. Capacity is limited, and when the network comes close to running out of it for learning the new task, the trade-off will allow forgetting to take place. At that point most methods cannot easily be expanded to accommodate more weights and grow the network before too much forgetting happens, because most approaches cannot predict the performance drop on previous tasks before learning the new task. With our approach, expansion is a core part of the system: it provides the capacity needed to avoid forgetting at a very small overhead cost, while growing the number of parameters only when necessary.
Given a network with $L$ layers, any layer $l$ with its $m$ learned features can be expanded if those features are not enough to represent the new task. When expanding a layer by $k$ new features, the output of the layer grows to $a^{l} \in \mathbb{R}^{m+k}$. This affects only the newly added forward mask values,

$$b^{l}_{s,i} = \begin{cases} 1 & \text{if } s = t \\ 0 & \text{if } s < t \end{cases} \qquad \text{for } m < i \leq m+k, \qquad\qquad (8)$$

so that the added features can be seen while learning the new task but are ignored by previous tasks. The backward mask is extended in the same way,

$$m^{l}_{s,i} = \begin{cases} 1 & \text{if } s = t \\ 0 & \text{if } s < t \end{cases} \qquad \text{for } m < i \leq m+k, \qquad\qquad (9)$$
so that it only affects the new connections without modifying previous knowledge.
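In practice, growing a layer can be done by allocating a wider layer and copying the already learned weights into it; the sketch below (our own, for a fully connected layer) shows the idea, with the ternary masks of Eqs. 8 and 9 taking care that the new features are only visible to and learnable by the new task:

```python
import torch
import torch.nn as nn

def grow_linear(layer: nn.Linear, extra_out: int = 0, extra_in: int = 0) -> nn.Linear:
    """Return a wider Linear layer that keeps the previously learned weights.
    Newly created rows/columns are freshly initialised; the ternary masks make
    them learnable only for the new task and invisible to older tasks."""
    grown = nn.Linear(layer.in_features + extra_in, layer.out_features + extra_out)
    with torch.no_grad():
        grown.weight[:layer.out_features, :layer.in_features] = layer.weight
        grown.bias[:layer.out_features] = layer.bias
    return grown

# toy usage: add 4 output features (the next layer would then be grown with extra_in=4)
old = nn.Linear(16, 8)
new = grow_linear(old, extra_out=4)
assert torch.equal(new.weight[:8, :16], old.weight)
```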
Training small tasks on large networks at the beginning of a continual learning sequence usually leads to overfitting or to excessive repetition of filters, and feature usage on these tasks looks very unbalanced compared to learning larger tasks masana2017domain . We believe that learning tasks with an appropriate capacity and growing only when more is needed is a much better way to avoid overfitting. This observation is backed by the better results some other approaches obtain when pruning and retraining on smaller sub-networks than when directly pruning or learning in larger sub-networks mallya2018packnet ; rusu2016progressive .
3.5 Ternary Mask Implementation
The formulation of our proposed method in Sec. 3.2 states that we can combine both masks $m$ and $b$ into a single ternary mask. To make the implementation easier and more efficient, this is done by creating a ternary mask that is set to the normal state for all features of task 1. This means that the first task is learned as in a normal network, as if we were using finetuning. Then, when moving to task 2, we grow the network and add new features. The task-1 mask entries associated with the new features are set to the masked state, so that they are neither used when evaluating task 1 nor learnable for it. The mask for task 2 is then created by setting the previously existing features to the forward-only state and the newly added features to the normal state. Features in the forward-only state are used during the forward pass only, while features in the normal state contribute to both the forward and the backward pass (using Eq. 6). This process is explained in Algorithm 1.
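The bookkeeping described above can be summarized by the following self-contained sketch (our paraphrase of the procedure, reusing the illustrative 2/1/0 encoding for normal, forward-only and masked; the per-task growth of 4 features is an arbitrary example):

```python
import torch

NORMAL, FWD_ONLY, MASKED = 2.0, 1.0, 0.0
n_tasks, growth = 3, 4

masks, n_feats = [], 0
for t in range(n_tasks):
    n_feats += growth
    # previous tasks never see the newly added features
    masks = [torch.cat([m, torch.full((growth,), MASKED)]) for m in masks]
    new_mask = torch.full((n_feats,), FWD_ONLY)   # previously existing features
    new_mask[-growth:] = NORMAL                   # features added for this task
    masks.append(new_mask)                        # for t = 0 everything ends up NORMAL

for t, m in enumerate(masks):
    print(f"task {t}: {m.tolist()}")
```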
4 Experimental results
In this section we report on a range of experiments to quantify the effectiveness of our proposed approach and compare with other state-of-the-art methods and baselines. Code will be made available upon acceptance.
4.1 Experimental Setup
Datasets. We evaluate our approaches on a large, lower-resolution dataset (tiny ImageNet ILSVRC2012 deng2009imagenet ), on a large-scale dataset (ImageNet russakovsky2015imagenet ) and on several fine-grained classification datasets: Oxford 102 Flowers nilsback2008automated , CUB-200-2011 Birds wah2011caltech and Stanford Actions yao2011human . Statistics of those datasets are summarized in Table 3. For all experiments we take a fixed random subset of 10% of the images for validation. The validation set is equally distributed among the classes and fixed for each experiment to ensure a fair comparison. Since the test set of ImageNet ILSVRC2012 is not labelled, we use its validation set for testing instead.
Tiny ImageNet is a resized, low-resolution version of ImageNet containing a fifth of its classes. For ImageNet we apply random cropping of the resized inputs during training for data augmentation. In the case of Birds we resize the bounding-box crops of the objects to a fixed input size for all splits. We do the same for Actions, but without using the bounding box annotations. For Flowers we resize the images and also perform data augmentation by random cropping of patches during training, using the central crop for evaluation. In all experiments we perform random horizontal flips during training for data augmentation.
We decided not to run experiments on permuted MNIST, since it has been shown not to allow a fair comparison between different approaches lee2017overcoming . MNIST images contain a very large fraction of zero-valued pixels, which makes it easy to identify important weights that can be frozen without overlapping with the other tasks. Furthermore, MNIST might be too simple to represent more realistic scenarios. For the same reason we do not consider low-resolution datasets other than tiny ImageNet.
Dataset | #Train | #Eval | #Classes |
---|---|---|---|
tiny ImageNet deng2009imagenet | 100,000 | 10,000 | 200 |
ImageNet ILSVRC2012 russakovsky2015imagenet | 1,280,861 | 50,000 | 1000 |
Oxford Flowers nilsback2008automated | 2,040 | 6,149 | 102 |
CUB-200-2011 Birds wah2011caltech | 5,994 | 5,794 | 200 |
Stanford Actions yao2011human | 4,000 | 5,532 | 40 |
Network architectures. For tiny ImageNet we use VGG-16, which has been proven to provide high performance results simonyan2014very
. Since tiny ImageNet has a low resolution, the last max-pool layer and the last three convolutional layers from the feature extractor are removed. For ImageNet and the fine-grained datasets we use AlexNet
krizhevsky2012imagenet . No pretrained weights are used and the models are trained from scratch using only samples from the training set. All approaches except ours have access to the full network from the beginning. Our proposed TFM instead starts with a network that is smaller at each layer (a reduced number of output filters) and grows, as explained in Sec. 3.4, as more features are added every time a new task is learned. We limit the growth of the network to the total size of the network used by all other approaches. Therefore, at the end of learning all tasks, all approaches will have had access to the same network capacity, apart from the extra parameters or regularizers that each method requires (see Overhead in Table 1).

Training details. All experiments are trained using backpropagation with plain Stochastic Gradient Descent, following a setup similar to HAT serra2018overcoming . With a fixed batch size, the learning rate starts at an initial value and is decayed by a constant factor whenever a number of consecutive epochs shows no improvement on the validation loss, until either the learning rate drops below a minimum value or 200 epochs have passed. Data splits, task sequence, data-loader shuffling and network initialization are fixed for all approaches given a seed. Following the results in de2019continual , we use dropout.

Baselines.
Finetuning has no extra hyperparameters and simply uses the cross-entropy loss to learn each task as it comes, without using data from previous tasks nor avoiding catastrophic forgetting. Joint training breaks the no-revisiting data rule and learns with data from the current task as well as all the previous tasks, serving as an upper-bound to compare all approaches. Finally, we propose to use Freezing as a baseline where we learn the first task and then freeze all layers except the head for the remaining tasks.
Hyperparameters. Distillation and model-based approaches use hyperparameters to control the trade-off between forgetting and intransigence with respect to the knowledge of previous tasks. On top of that, LwF has a temperature scaling hyperparameter for the cross-entropy loss. Among the mask-based models, HAT also has a trade-off hyperparameter and a maximum for the sigmoid gate steepness, and PackNet has a pruning percentage for the layers.
TFM has a growth percentage which is equal for all layers. This percentage is set on the validation set according to the following protocol. For each new task, several growth percentages are evaluated without knowledge of previous or future tasks. Rather than choosing the growth rate with the best performance, we pick the lowest growth rate which obtains a performance within a margin of the best one (we set the margin to 1.5% for tiny ImageNet and 0.1% for the fine-grained datasets). For ImageNet this scheme would be computationally demanding, so we use a fixed growth schedule, starting from 55% of the weights for the first task and adding 5% for each remaining task. We then fix that hyperparameter, run the final experiment on the training set and evaluate on the test set.
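The growth-rate selection can be summarized by a small helper like the one below (our sketch; `evaluate` stands for a hypothetical routine that trains on the current task with a given growth rate and returns validation accuracy):

```python
def select_growth(candidates, evaluate, margin=1.5):
    """Pick the smallest growth rate whose validation accuracy is within
    `margin` points of the best candidate."""
    accs = {g: evaluate(g) for g in candidates}
    best = max(accs.values())
    return min(g for g, a in accs.items() if a >= best - margin)

# toy usage with a made-up accuracy table (hypothetical numbers)
dummy = {5: 40.2, 10: 41.5, 20: 41.9, 40: 42.0}.get
print(select_growth([5, 10, 20, 40], dummy))   # -> 10, within 1.5 of the best (42.0)
```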
4.2 Fine-grained datasets
A common setup to evaluate continual learning over a number of learning sessions is to use disjoint splits (tasks) of the same classification dataset. However, some approaches report results on only two tasks lee2017overcoming ; jung2016less ; fernando2017pathnet , which becomes too similar to a transfer learning setup and does not allow evaluating the true potential of continual learning. Because of that, we choose to evaluate our proposed approach and its variations on more than two tasks. It should be noted that we train from scratch, resulting in lower scores than reported by papers which train from a pretrained network. However, because of the large number of classes in ImageNet (which includes a subset of bird classes), we consider training from scratch to provide a more natural setting for continual learning.
We compare our proposed method (TFM) and an ablated version without the task-specific feature normalization (TFM w/o FN) with the mask-based approaches (HAT, PackNet), a well-known model-based approach (EWC) and the baselines (Finetune, Freezing, Joint) on three fine-grained datasets (Flowers, Birds, Actions). As explained earlier, to make the comparison fair, our approach learns the first task on a smaller network and then grows while learning the next tasks, with the growth capped at the size of the network used by the other approaches.
Oxford 102 Flowers | |||||
Method | Task 1 | Task 2 | Task 3 | Task 4 | Avg. |
Finetuning | 10.0 (-20.3) | 5.1 (-17.1) | 6.7 (-13.6) | 17.3 (0.0) | 9.8 |
Freezing | 30.3 (0.0) | 39.8 (0.0) | 32.0 (0.0) | 33.1 (0.0) | 33.8 |
Joint | 54.6 (+24.3) | 58.9 (+11.5) | 57.7 (+4.5) | 47.0 (0.0) | 54.6 |
EWC kirkpatrick2017overcoming | 12.1 (-18.2) | 11.6 (-38.1) | 9.3 (-24.4) | 25.8 (0.0) | 14.7 |
HAT serra2018overcoming | 17.2 (-12.7) | 19.3 (-28.5) | 28.6 (+1.4) | 31.6 (0.0) | 24.2 |
PackNet mallya2018packnet | 32.0 (0.0) | 53.7 (0.0) | 43.6 (0.0) | 37.9 (0.0) | 41.8 |
TFM w/o FN | 36.4 (0.0) | 54.1 (0.0) | 38.6 (0.0) | 39.0 (0.0) | 42.0 |
TFM | 36.4 (0.0) | 53.8 (0.0) | 45.5 (0.0) | 37.6 (0.0) | 43.3 |
CUBS 200 Birds | |||||
Method | Task 1 | Task 2 | Task 3 | Task 4 | Avg. |
Finetuning | 7.4 (-30.2) | 2.6 (-30.0) | 29.7 (-3.4) | 43.1 (0.0) | 20.7 |
Freezing | 37.6 (0.0) | 35.1 (0.0) | 35.4 (0.0) | 38.4 (0.0) | 36.6 |
Joint | 48.7 (+11.1) | 52.1 (+6.0) | 50.7 (+1.5) | 51.9 (0.0) | 50.8 |
EWC kirkpatrick2017overcoming | 16.2 (-21.4) | 19.0 (-21.2) | 24.2 (-14.0) | 41.7 (0.0) | 25.3 |
HAT serra2018overcoming | 18.7 (-1.8) | 19.4 (-0.4) | 28.5 (-0.6) | 31.2 (0.0) | 24.4 |
PackNet mallya2018packnet | 35.3 (0.0) | 42.8 (0.0) | 44.4 (0.0) | 45.9 (0.0) | 42.1 |
TFM w/o FN | 42.9 (0.0) | 44.1 (0.0) | 48.3 (0.0) | 49.1 (0.0) | 46.1 |
TFM | 42.9 (0.0) | 43.1 (0.0) | 49.9 (0.0) | 48.8 (0.0) | 46.2 |
Stanford 40 Actions | |||||
Method | Task 1 | Task 2 | Task 3 | Task 4 | Avg. |
Finetuning | 24.4 (-10.5) | 26.5 (-7.7) | 17.6 (-16.8) | 28.9 (0.0) | 24.4 |
Freezing | 34.9 (0.0) | 29.4 (0.0) | 30.1 (0.0) | 30.5 (0.0) | 31.2 |
Joint | 45.7 (+10.8) | 40.3 (+4.8) | 43.2 (-1.1) | 40.2 (0.0) | 42.4 |
EWC kirkpatrick2017overcoming | 24.2 (-10.7) | 28.2 (-2.0) | 25.2 (-5.6) | 34.3 (0.0) | 28.0 |
HAT serra2018overcoming | 25.7 (-1.0) | 25.5 (-2.7) | 30.1 (-2.1) | 34.4 (0.0) | 28.9 |
PackNet mallya2018packnet | 32.5 (0.0) | 32.9 (0.0) | 36.7 (0.0) | 34.3 (0.0) | 34.1 |
TFM w/o FN | 35.3 (0.0) | 38.3 (0.0) | 39.2 (0.0) | 38.0 (0.0) | 37.7 |
TFM | 35.3 (0.0) | 37.2 (0.0) | 42.0 (0.0) | 37.2 (0.0) | 38.0 |
As can be seen in Table 4, our proposed approach outperforms the other approaches on all three datasets. Among these datasets, only on Flowers is a considerable performance gain observed when adding task-specific feature normalization. Only PackNet obtains competitive results; however, on both Birds and Flowers TFM does significantly better, while having a much lower memory overhead than PackNet (0.2 MB versus 27.3 MB, respectively). It is also interesting to note how well Freezing works as a non-forgetting baseline.
Tiny ImageNet - classes randomly split | |||||||||||
Approach | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Task 9 | Task 10 | Avg. |
(1-20) | (21-40) | (41-60) | (61-80) | (81-100) | (101-120) | (121-140) | (141-160) | (161-180) | (181-200) | all |
Finetuning | 38.1 (-13.6) | 36.0 (-13.7) | 43.2 (-16.0) | 44.1 (-18.6) | 45.5 (-12.6) | 54.5 (-13.6) | 50.3 (-15.7) | 50.5 (-13.4) | 51.0 (-13.1) | 61.2 (0.0) | 47.4 |
Freezing | 51.7 (0.0) | 36.4 (0.0) | 39.5 (0.0) | 41.7 (0.0) | 42.9 (0.0) | 46.2 (0.0) | 45.7 (0.0) | 41.1 (0.0) | 41.2 (0.0) | 40.9 (0.0) | 42.7 |
Joint | 58.6 (+6.9) | 53.9 (+8.3) | 59.1 (+3.7) | 61.8 (+7.9) | 57.7 (+2.9) | 66.0 (+2.6) | 64.0 (+3.1) | 60.2 (+5.9) | 57.9 (+1.0) | 53.8 (0.0) | 59.3 |
LfL jung2016less | 32.4 (-18.9) | 35.4 (-17.0) | 43.4 (-15.7) | 44.1 (-20.2) | 45.0 (-15.0) | 55.9 (-14.5) | 49.4 (-16.1) | 51.1 (-12.4) | 58.6 (-8.0) | 61.4 (0.0) | 47.7 |
LwF li2017learning | 45.1 (-6.6) | 45.5 (-2.2) | 53.5 (-4.6) | 57.6 (-2.6) | 56.2 (0.0) | 65.7 (+0.4) | 63.5 (-0.3) | 58.4 (-1.9) | 59.6 (-0.3) | 58.5 (0.0) | 56.4 |
IMM-mode lee2017overcoming | 50.6 (-1.1) | 38.5 (+0.3) | 44.7 (-0.1) | 49.2 (+0.3) | 47.5 (+1.1) | 51.9 (-1.4) | 53.7 (-0.6) | 47.7 (-0.4) | 50.0 (-2.2) | 48.7 (0.0) | 48.3 |
EWC kirkpatrick2017overcoming | 33.9 (-17.4) | 35.4 (-14.4) | 43.6 (-15.4) | 46.7 (-15.9) | 49.5 (-9.1) | 52.5 (-15.8) | 47.8 (-20.0) | 50.2 (-13.8) | 56.6 (-9.9) | 61.4 (0.0) | 47.8 |
HAT serra2018overcoming | 46.8 (-0.2) | 49.1 (+0.8) | 55.8 (+0.2) | 58.0 (-0.2) | 53.7 (+0.3) | 61.0 (+0.1) | 58.7 (0.0) | 54.0 (-0.1) | 54.6 (-0.1) | 50.3 (0.0) | 54.2 |
PackNet mallya2018packnet | 52.5 (0.0) | 49.7 (0.0) | 56.5 (0.0) | 59.8 (0.0) | 55.0 (0.0) | 64.7 (0.0) | 61.7 (0.0) | 55.9 (0.0) | 55.2 (0.0) | 52.5 (0.0) | 56.4 |
TFM w/o FN (Ours) | 49.6 (0.0) | 47.2 (0.0) | 54.8 (0.0) | 58.2 (0.0) | 55.0 (0.0) | 64.0 (0.0) | 59.3 (0.0) | 53.6 (0.0) | 55.5 (0.0) | 51.9 (0.0) | 54.9 |
TFM (Ours) | 48.2 (0.0) | 47.7 (0.0) | 56.7 (0.0) | 58.2 (0.0) | 54.8 (0.0) | 62.2 (0.0) | 61.5 (0.0) | 57.3 (0.0) | 58.5 (0.0) | 54.8 (0.0) | 56.0 |
4.3 Task-similarity effects on tiny ImageNet
Next we experiment on several ten-task splits of tiny ImageNet. We compare our proposed approach (TFM) with two distillation methods with low overhead (LFL jung2016less , LwF li2017learning ), two of the best known model-based methods (EWC kirkpatrick2017overcoming , IMM lee2017overcoming ) and two of the most recent mask-based methods (HAT serra2018overcoming , PackNet mallya2018packnet ). We also compare against three baselines (Finetune, Freezing, Joint). The setup uses the VGGnet introduced in Sec. 4.1. The model is trained from scratch on 10 tasks with the same number of classes each. We evaluate these approaches under the same conditions on a random tiny ImageNet partition (see Table 5) and on a semantically similar partition (see Table 6). For further information on the latter, see Appendix B.
Tiny ImageNet - classes semantically split | |||||||||||
Approach | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Task 9 | Task 10 | Avg. |
fly anim. | small artif. | hobbies | land anim. | big artif. | food | pets/aquatic | wearables | transport | scenes | all | |
Finetuning | 17.1 (-34.2) | 19.7 (-17.7) | 20.9 (-24.5) | 16.7 (-30.5) | 20.8 (-28.7) | 29.2 (-22.0) | 30.7 (-21.0) | 25.2 (-17.8) | 40.2 (-18.9) | 59.9 (0.0) | 28.0 |
Freezing | 51.3 (0.0) | 28.5 (0.0) | 27.2 (0.0) | 29.6 (0.0) | 29.0 (0.0) | 35.0 (0.0) | 31.7 (0.0) | 23.9 (0.0) | 37.0 (0.0) | 34.7 (0.0) | 32.8 |
Joint | 55.0 (+3.7) | 41.9 (+7.9) | 46.2 (+6.3) | 44.9 (+4.9) | 44.7 (+7.1) | 49.0 (+4.8) | 46.6 (+4.9) | 36.4 (+4.7) | 51.2 (+5.7) | 51.1 (0.0) | 46.7 |
LfL jung2016less | 17.2 (-34.1) | 18.4 (-21.0) | 21.5 (-24.0) | 18.7 (-30.5) | 20.2 (-28.6) | 27.4 (-23.9) | 28.4 (-22.3) | 26.0 (-18.4) | 41.2 (-17.6) | 59.1 (0.0) | 27.8 |
LwF li2017learning | 34.0 (-13.9) | 18.4 (-14.5) | 32.6 (-0.8) | 36.5 (-5.6) | 40.1 (-0.5) | 43.1 (-2.5) | 41.8 (-1.3) | 32.7 (-1.1) | 50.3 (-0.5) | 48.1 (0.0) | 37.8 |
IMM-mode lee2017overcoming | 42.3 (-9.0) | 28.8 (+0.1) | 26.5 (-3.1) | 30.7 (-3.6) | 32.5 (-3.1) | 28.8 (-13.0) | 35.4 (-6.2) | 27.3 (-3.7) | 43.6 (-4.9) | 42.7 (0.0) | 33.9 |
EWC kirkpatrick2017overcoming | 20.2 (-31.1) | 18.5 (-19.3) | 20.2 (-26.3) | 20.9 (-28.9) | 24.7 (-22.8) | 25.5 (-27.5) | 28.7 (-23.4) | 23.0 (-19.6) | 39.8 (-20.2) | 56.8 (0.0) | 27.8 |
HAT serra2018overcoming | 44.6 (+0.4) | 34.8 (+0.2) | 40.8 (-0.1) | 45.4 (+0.4) | 40.8 (-2.5) | 49.8 (0.0) | 44.9 (-0.2) | 33.1 (-1.7) | 51.9 (0.0) | 53.8 (0.0) | 44.0 |
PackNet mallya2018packnet | 47.0 (0.0) | 35.7 (0.0) | 42.7 (0.0) | 48.6 (0.0) | 45.8 (0.0) | 48.1 (0.0) | 45.9 (0.0) | 38.3 (0.0) | 51.2 (0.0) | 49.1 (0.0) | 45.2 |
TFM w/o FN (Ours) | 46.4 (0.0) | 34.7 (0.0) | 38.8 (0.0) | 44.1 (0.0) | 42.0 (0.0) | 48.3 (0.0) | 46.5 (0.0) | 35.7 (0.0) | 52.0 (0.0) | 54.8 (0.0) | 44.3 |
TFM (Ours) | 46.4 (0.0) | 37.2 (0.0) | 40.4 (0.0) | 44.1 (0.0) | 44.2 (0.0) | 48.2 (0.0) | 46.4 (0.0) | 37.5 (0.0) | 53.5 (0.0) | 54.7 (0.0) | 45.3 |
In the case of the random splits (see Table 5), most methods obtain quite good results on the last tasks, with minor to no forgetting. LFL, IMM and EWC provide some improvement over Finetuning. LwF performs very well because the tasks are quite similar. All mask-based models perform similarly, with PackNet performing best.
In the semantically similar splits (see Table 6), where the task distributions differ more from each other than in the random case, some approaches have difficulties avoiding catastrophic forgetting as the sequence gets longer. It is interesting to see that the good results of LwF on the random split are not repeated on the semantic splits. As observed before, LwF fails when there are large changes in the feature distributions between tasks aljundi2016expert . Mask-based models again outperform all other approaches, with TFM performing best.
Both Tables 5 and 6 show the accuracy and forgetting of each task after training all 10 tasks, and thus after having learned all 200 tiny ImageNet classes. The results show that mask-based approaches achieve a better overall performance than other approaches on both splits, getting close to the joint-training baseline. Freezing the network after the first task and learning only the head for the remaining tasks works better in the semantically similar splits than in the random splits. Furthermore, in Table 6 the Freezing baseline offers better results than LFL, and is better than or competitive with the model-based approaches.
Tiny ImageNet - larger first task | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Approach | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Task 9 | Task 10 | Avg. |
(1-110) | (111-120) | (121-130) | (131-140) | (141-150) | (151-160) | (161-170) | (171-180) | (181-190) | (191-200) | (111+) | |
Finetuning | 18.4 (-33.2) | 39.6 (-34.4) | 56.6 (-21.0) | 58.0 (-23.8) | 44.6 (-32.8) | 63.0 (-21.8) | 51.0 (-27.2) | 51.4 (-19.0) | 70.4 (-14.0) | 80.0 (0.0) | 57.2 |
Freezing | 51.6 (0.0) | 68.6 (0.0) | 70.2 (0.0) | 77.9 (0.0) | 68.1 (0.0) | 78.8 (0.0) | 72.9 (0.0) | 64.2 (0.0) | 76.7 (0.0) | 70.2 (0.0) | 69.9 |
LfL jung2016less | 16.7 (-33.9) | 58.3 (-6.6) | 59.5 (-2.9) | 64.6 (-2.2) | 58.2 (-1.4) | 64.7 (-0.6) | 63.4 (+0.5) | 54.4 (+0.1) | 61.0 (0.0) | 56.5 (0.0) | 60.0 |
LwF li2017learning | 10.8 (-40.0) | 29.5 (-43.5) | 44.6 (-30.4) | 61.2 (-21.2) | 55.5 (-15.1) | 73.3 (-8.8) | 71.7 (-3.5) | 62.3 (-1.8) | 77.8 (-2.2) | 74.3 (0.0) | 61.1 |
IMM-mode lee2017overcoming | 26.1 (-26.3) | 50.4 (-13.8) | 59.5 (-19.3) | 61.6 (22.8) | 54.0 (-21.8) | 63.2 (-22.5) | 58.2 (-19.9) | 56.0 (-13.9) | 75.0 (-6.7) | 79.9 (0.0) | 62.0 |
EWC kirkpatrick2017overcoming | 51.8 (-0.7) | 24.8 (-0.1) | 43.3 (-1.2) | 61.8 (-0.5) | 55.5 (-0.3) | 70.8 (-0.7) | 67.6 (-0.1) | 53.8 (-0.6) | 70.1 (-1.2) | 61.7 (0.0) | 56.6 |
HAT serra2018overcoming | 46.1 (0.0) | 60.1 (-0.1) | 68.2 (+0.2) | 73.2 (+0.1) | 63.2 (+0.1) | 76.2 (+0.1) | 67.4 (0.0) | 58.1 (-0.1) | 73.2 (0.0) | 59.1 (0.0) | 66.5 |
PackNet mallya2018packnet | 47.6 (0.0) | 74.0 (0.0) | 74.2 (0.0) | 79.0 (0.0) | 65.2 (0.0) | 76.2 (0.0) | 69.4 (0.0) | 61.4 (0.0) | 73.4 (0.0) | 64.8 (0.0) | 70.8 |
TFM w/o FN (Ours) | 49.6 (0.0) | 69.8 (0.0) | 71.1 (0.0) | 79.8 (0.0) | 68.4 (0.0) | 78.4 (0.0) | 72.9 (0.0) | 64.8 (0.0) | 75.9 (0.0) | 70.2 (0.0) | 72.4 |
TFM (Ours) | 49.9 (0.0) | 70.4 (0.0) | 71.4 (0.0) | 80.8 (0.0) | 70.5 (0.0) | 79.4 (0.0) | 73.9 (0.0) | 64.4 (0.0) | 76.5 (0.0) | 72.1 (0.0) | 73.3 |
4.4 Effect of starting-task size on tiny ImageNet
We propose an experimental setup where the first task of tiny ImageNet uses 110 classes (55%) while the remaining 9 tasks use 10 classes (5%) each. This allows most of the methods to start with a rich representation after learning the first task. In this setup, comparing existing methods with the Freezing baseline is interesting. In Table 7
we show the results of this scenario. Note that in the last column we average only over the smaller tasks (T2-T10). EWC shows little forgetting, keeping knowledge while learning new tasks. However, keeping the first task from being forgotten makes the model too rigid, so the remaining tasks are learned with more difficulty and the overall performance is lower. Lowering the trade-off hyperparameter instead leads to severe catastrophic forgetting of the first task. Distillation approaches try to keep representations the same as new tasks are learned; however, small changes in the weights cause forgetting later in the sequence. HAT works well, but with a limited capacity for change it ends up not learning the new tasks as easily. Freezing the network after the first task is one of the best options in this setup, since the rich representation of the first 110 classes is a good starting point for learning the rest of the tasks with a simple classifier. We therefore advocate that the Freezing baseline should always be included in continual learning comparisons, since it often provides a much harder baseline than Finetuning. Only PackNet and TFM are able to improve over that baseline, even though they start from a smaller capacity, with TFM obtaining the best results.
ImageNet - classes randomly split | |||||||||||
Approach | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8 | Task 9 | Task 10 | Avg. |
(1-100) | (101-200) | (201-300) | (301-400) | (401-500) | (501-600) | (601-700) | (701-800) | (801-900) | (901-1000) | all | |
Finetuning | 25.8 (-43.0) | 32.2 (-36.2) | 31.4 (-35.3) | 37.8 (-27.7) | 39.1 (-27.7) | 43.7 (-25.7) | 46.0 (-22.8) | 50.0 (-16.5) | 53.4 (-12.1) | 63.7 (0.0) | 42.3 |
Freezing | 68.8 (0.0) | 53.5 (0.0) | 52.0 (0.0) | 51.2 (0.0) | 51.3 (0.0) | 53.9 (0.0) | 52.2 (0.0) | 53.9 (0.0) | 51.7 (0.0) | 51.2 (0.0) | 54.0 |
LwF li2017learning | 27.6 (-41.2) | 37.2 (-19.9) | 42.0 (-22.6) | 44.4 (-20.9) | 50.5 (-14.1) | 56.6 (-11.3) | 57.9 (-9.1) | 61.2 (-5.0) | 62.0 (-1.3) | 62.7 (0.0) | 50.2 |
IMM-mode lee2017overcoming | 68.5 (-0.3) | 53.6 (0.0) | 52.1 (0.0) | 51.7 (-0.1) | 52.5 (+0.3) | 55.5 (+0.2) | 54.7 (+0.1) | 53.5 (0.0) | 54.2 (+0.1) | 51.8 (0.0) | 54.8 |
EWC kirkpatrick2017overcoming | 21.8 (-47.0) | 26.5 (-41.7) | 29.5 (-36.5) | 32.9 (-32.6) | 35.6 (-30.9) | 40.4 (-28.1) | 40.0 (-26.2) | 44.7 (-20.7) | 47.8 (-16.2) | 61.1 (0.0) | 38.0 |
PackNet mallya2018packnet | 67.5 (0.0) | 65.8 (0.0) | 62.2 (0.0) | 58.4 (0.0) | 58.6 (0.0) | 58.7 (0.0) | 56.0 (0.0) | 56.5 (0.0) | 54.1 (0.0) | 53.6 (0.0) | 59.1 |
TFM (Ours) | 63.6 (0.0) | 62.2 (0.0) | 60.1 (0.0) | 61.6 (0.0) | 62.6 (0.0) | 64.5 (0.0) | 64.0 (0.0) | 63.7 (0.0) | 63.0 (0.0) | 59.9 (0.0) | 62.5 |
[Figure 3: evolution of the per-task results on ImageNet as more tasks are learned.]
[Figure 4: comparison of PackNet and TFM on each ImageNet task.]
4.5 ImageNet
Most of the compared task-aware approaches have not been evaluated on a large-scale dataset such as ImageNet. We therefore compare our proposed method (TFM) with some of those state-of-the-art approaches. In Table 8 we can see that TFM outperforms all the other approaches at the end of learning ImageNet split into 10 tasks of random classes. The evolution of the results for each task is shown in Fig. 3. LwF does well when learning each new task with the help of the representation of the previous tasks; however, as more tasks are included, the older tasks are forgotten more. IMM (mode) shows the opposite behaviour: it focuses on intransigence and tries to keep the knowledge of the older tasks, running out of capacity for the newer ones. This allows the approach to forget very little and even show a small backward transfer, but at the cost of performing worse on newer tasks. EWC has one of the worst performances, possibly due to the difficulty of obtaining a good approximation of the FIM when there are so many classes per task. Both PackNet and TFM have a good overall performance with no forgetting, and rely on the capacity of the network more than the other approaches. As shown in Fig. 4, PackNet performs better during the first three tasks, taking advantage of the compression power of pruning and finetuning. However, as the remaining capacity of the network gets smaller, TFM is capable of growing at a more scalable pace, obtaining better performance on the remaining seven tasks and achieving the best results overall.
4.6 Comparison of memory usage
Previous experiments show that mask-based approaches are better at overcoming catastrophic forgetting in task-aware settings. However, unlike most distillation and model-based approaches, they use some additional memory during inference. In Fig. 5 we visualize (in log scale) the absolute memory overhead used by the mask-based approaches in the ImageNet experiment of Sec. 4.5. Considering that the network used is around 220 MB, the approaches that put embeddings or masks on the features (HAT, TFM) have a negligible overhead in comparison to the weight-masked approach (PackNet). It should also be noted that mask-based approaches keep the same overhead during training, while distillation and model-based approaches usually at least duplicate the network size. In conclusion, our method has a similar memory usage to HAT, yet outperforms it on all proposed experiments, and it is significantly more memory efficient than PackNet, whose performance we either match or outperform.
5 Conclusions
For many practical applications, it is important that network accuracy on previous tasks does not deteriorate when learning new tasks. Therefore, in this paper we propose a new method for continual learning which does not suffer from any forgetting. Unlike previous methods, which apply masks to the weights, we propose to move the mask to the features (activations). This greatly reduces the number of extra parameters added per task and avoids the overhead that other approaches incur. In addition, we propose a task-specific normalization of features, which allows adjusting previously learned features to new tasks. In ablation experiments this was found to improve the results of the ternary feature masks. Furthermore, when compared to a wide range of other continual learning techniques, our method consistently outperforms them on a variety of datasets.
[Figure 5: absolute memory overhead (log scale) of the mask-based approaches on the ImageNet experiment.]
Appendix A A note on choosing expansion rates
The continual learning philosophy states the rule of not using data from previous tasks when learning new ones; only data of the current task can be used at each step of the setup. It is also common in machine learning setups to use a part of the training set as validation in order to choose the best hyper-parameters. Therefore, when learning each task, a validation set of that specific task at hand can be used to train the network avoiding overfitting, but no other data can be used (neither from test nor from other previous or future tasks). It is important to state that we strictly comply to these rules in the experiments we propose.
As explained in Section 3.4, our proposed approach can be expanded as needed in order to learn new tasks without modifying the connections of previous tasks. However, the flexibility of choosing how many features to add to each layer can easily turn into a rabbit hole of architecture optimization. Because of that, we adopt a simple setup in order to keep our approach comparable to the other state-of-the-art methods: we take the maximum layer size for all approaches to be the same as in the VGGnet or AlexNet architectures used for the experiments in Section 4. This way, all approaches have a similar number of parameters.
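To make the per-task bookkeeping concrete before the walkthrough below, the following is a minimal sketch of a layer gated by ternary feature masks, assuming three states per (task, feature): unused, forward-only (reused from earlier tasks without gradient), and learnable. The class name, the integer encoding of the states, and the PyTorch realization are our own illustration, not the implementation used in the paper.

```python
import torch
import torch.nn as nn

class TernaryMaskedLinear(nn.Module):
    """Sketch of a fully connected layer with per-task ternary feature masks.
    Mask states per (task, output feature):
    0 = unused, 1 = forward-only (reused, no gradient), 2 = learnable."""

    def __init__(self, in_features: int, out_features: int, max_tasks: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Two bits of information per entry; stored here in an int8 buffer.
        self.register_buffer(
            "mask", torch.zeros(max_tasks, out_features, dtype=torch.int8))

    def set_task_mask(self, task: int, learnable: torch.Tensor) -> None:
        """Mark the boolean `learnable` features as trainable for `task`;
        features already used by earlier tasks become forward-only."""
        previously_used = (self.mask[:task] > 0).any(dim=0)
        new_mask = torch.zeros_like(previously_used, dtype=torch.int8)
        new_mask[previously_used] = 1
        new_mask[learnable] = 2
        self.mask[task] = new_mask

    def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
        h = self.linear(x)
        m = self.mask[task]
        used = (m > 0).float()     # features visible to this task
        frozen = (m == 1).float()  # reused features: forward pass only
        # Forward-only features contribute to the forward pass but are detached,
        # so no gradient reaches the weights that produced them; learnable
        # features keep their gradient path; unused features are zeroed out.
        return used * (frozen * h.detach() + (1.0 - frozen) * h)
```

In this sketch, when a new task arrives the newly added features are marked as learnable for that task, while the features belonging to earlier tasks become forward-only, which mirrors the grey-border/green-border walkthrough that follows.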
The differences between Figures 1 and 2 are further explained in Fig. 6. During the first task, the only features used are those with a grey border. This means that for the two shown layers, task 1 is learned using 8 features (with 12 connections). Once task 2 arrives, we fix the grey-border features and expand the network with the green-border ones. The new task then uses the existing network and expands it with 5 features (with 24 new connections). The masks of the green-border features are set to unused for task 1, while they are learnable for task 2. Finally, when task 3 arrives, 5 more features are added (with 36 new connections). The masks corresponding to these new features are set to unused for tasks 1 and 2, and to learnable for task 3. This way of expanding the network also shows that, as we learn more tasks, more knowledge is available from previous ones. This opens the possibility of adding fewer and fewer features over time, since each added feature creates more connections to learn. In practice, one can imagine that when learning new tasks that are very similar to previous ones, no new features will have to be added, and the existing network knowledge together with a task-specific head will be enough. Further research and analysis of the specific details of layer expansion for each architecture is left for future work.

Appendix B Tiny ImageNet semantic splits

The table below lists the ten semantic task splits of Tiny ImageNet and the classes assigned to each.

Task | Semantic group | Classes
---|---|---
1 | Animals (flying & insects) | scorpion, black widow, tarantula, spider web, centipede, trilobite, grasshopper, stick insect, cockroach, mantis, ladybug, dragonfly, monarch butterfly, sulphur butterfly, fly, bee, goose, black stork, king penguin, albatross.
2 | Artifacts (smaller) | abacus, binoculars, candle, chain, chest, dumbbell, hourglass, lampshade, magnetic compass, nail, pill bottle, computer keyboard, acorn, plunger, syringe, teddy bear, torch, comic book, remote control, umbrella.
3 | Music, Sport and Kitchen | basketball, punching bag, rugby ball, scoreboard, stopwatch, volleyball, CD player, drumstick, iPod, oboe, organ, refrigerator, cask, plate, wooden spoon, teapot, frying pan, beaker, bucket, dining table.
4 | Animals (land) | brown bear, red panda, koala, pig, ox, bison, bighorn sheep, gazelle, dromedary, African elephant, orangutan, chimpanzee, baboon, cougar, lion, European fire salamander, bullfrog, tailed frog, American alligator, boa constrictor.
5 | Artifacts (bigger) | altar, maypole, bannister, flagpole, fountain, parking meter, pay-phone, pole, cash machine, birdhouse, bathtub, rocking chair, potter’s wheel, sewing machine, space heater, turnstile, memorial tablet, desk, vestment, reel.
6 | Food | water jug, wok, guacamole, ice cream, lollipop, pretzel, mashed potato, cauliflower, bell pepper, mushroom, orange, lemon, banana, pomegranate, meat loaf, pizza, potpie, espresso, soda bottle, beer bottle.
7 | Animals (water & pets) | dugong, goldfish, jellyfish, brain coral, American lobster, spiny lobster, sea slug, sea cucumber, guinea pig, snail, slug, poodle, Chihuahua, Yorkshire terrier, golden retriever, Labrador retriever, German shepherd, tabby cat, Persian cat, Egyptian cat.
8 | Clothes and wearables | academic gown, poncho, apron, backpack, bikini, bow tie, cardigan, fur coat, gasmask, kimono, military uniform, miniskirt, neck brace, Christmas stocking, sandal, snorkel, sock, sombrero, sunglasses, swimming trunks.
9 | Transport | bullet train, station wagon, freight car, go-kart, rickshaw, lifeboat, limousine, moving van, police van, school bus, convertible, crane, trolleybus, sports car, tractor, gondola, broom, cannon, lawn mower, missile.
10 | Buildings and scenes | barbershop, barn, lighthouse, butcher shop, candy store, water tower, triumphal arch, suspension bridge, steel arch bridge, viaduct, thatched roof, cliff dwelling, dam, obelisk, picket fence, cliff, coral reef, lakeside, seacoast, alp.
References
- (1) R. M. French, Catastrophic forgetting in connectionist networks, in: Trends in cognitive sciences, Vol. 3, Elsevier, 1999, pp. 128–135.
- (2) I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, Y. Bengio, An empirical investigation of catastrophic forgetting in gradient-based neural networks, arXiv preprint arXiv:1312.6211.
- (3) M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of learning and motivation, Vol. 24, Elsevier, 1989, pp. 109–165.
- (4) X. Li, Y. Grandvalet, F. Davoine, A baseline regularization scheme for transfer learning with convolutional neural networks, Pattern Recognition 98 (2020) 107049.
- (5) M. Mermillod, A. Bugaiska, P. Bonin, The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects, Frontiers in psychology 4 (2013) 504.
- (6) R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, T. Tuytelaars, Memory aware synapses: Learning what (not) to forget, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- (7) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences (2017) 3521–3526.
- (8) S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, B.-T. Zhang, Overcoming catastrophic forgetting by incremental moment matching, in: Advances in Neural Information Processing Systems, 2017, pp. 4655–4665.
- (9) A. Mallya, D. Davis, S. Lazebnik, Piggyback: Adapting a single network to multiple tasks by learning to mask weights, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- (10) A. Mallya, S. Lazebnik, Packnet: Adding multiple tasks to a single network by iterative pruning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- (11) A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks, arXiv preprint arXiv:1606.04671.
- (12) D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, in: Advances in Neural Information Processing Systems, 2017.
- (13) S.-A. Rebuffi, A. Kolesnikov, C. H. Lampert, iCaRL: Incremental classifier and representation learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- (14) A. Chaudhry, M. Ranzato, M. Rohrbach, M. Elhoseiny, Efficient lifelong learning with a-gem, in: Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- (15) M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, G. Tesauro, Learning to learn without forgetting by maximizing transfer and minimizing interference, in: Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- (16) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
- (17) J. Serra, D. Suris, M. Miron, A. Karatzoglou, Overcoming catastrophic forgetting with hard attention to the task, in: International Conference on Machine Learning (ICML), 2018.
- (18) R. Aljundi, M. Rohrbach, T. Tuytelaars, Selfless sequential learning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- (19) X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. Lopez, A. D. Bagdanov, Rotate your networks: Better weight consolidation and less catastrophic forgetting, in: International Conference on Pattern Recognition (ICPR), 2018.
- (20) F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, in: International Conference on Machine Learning (ICML), 2017, pp. 3987–3995.
- (21) Z. Li, D. Hoiem, Learning without forgetting, IEEE transactions on pattern analysis and machine intelligence 40 (12) (2017) 2935–2947.
- (22) H. Jung, J. Ju, M. Jung, J. Kim, Less-forgetting learning in deep neural networks, arXiv preprint arXiv:1607.00122.
- (23) A. Rannen, R. Aljundi, M. B. Blaschko, T. Tuytelaars, Encoder based lifelong learning, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1320–1328.
- (24) J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, C.-C. J. Kuo, Class-incremental learning via deep model consolidation, arXiv preprint arXiv:1903.07864.
- (25) A. Chaudhry, P. K. Dokania, T. Ajanthan, P. H. Torr, Riemannian walk for incremental learning: Understanding forgetting and intransigence, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- (26) R. Aljundi, P. Chakravarty, T. Tuytelaars, Expert gate: Lifelong learning with a network of experts, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- (27) J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, R. Hadsell, Progress & compress: A scalable framework for continual learning, in: International Conference on Machine Learning (ICML), 2018.
- (28) X. Li, Y. Zhou, T. Wu, R. Socher, C. Xiong, Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting, in: International Conference on Machine Learning (ICML), 2019.
- (29) C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, D. Wierstra, Pathnet: Evolution channels gradient descent in super neural networks, arXiv preprint arXiv:1701.08734.
- (30) W. S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, The bulletin of mathematical biophysics 5 (4) (1943) 115–133.
- (31) V. Losing, B. Hammer, H. Wersing, Incremental on-line learning: A review and comparison of state of the art algorithms, Neurocomputing 275 (2018) 1261–1274.
- (32) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
- (33) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
- (34) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- (35) M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- (36) V. Dumoulin, J. Shlens, M. Kudlur, A learned representation for artistic style, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- (37) M. Masana, J. van de Weijer, L. Herranz, A. D. Bagdanov, J. M. Alvarez, Domain-adaptive deep network compression, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
- (38) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
- (39) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
- (40) M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: IEEE Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
- (41) C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 dataset.
- (42) B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, L. Fei-Fei, Human action recognition by learning bases of action attributes and parts, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
- (43) M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, T. Tuytelaars, Continual learning: A comparative study on how to defy forgetting in classification tasks, arXiv preprint arXiv:1909.08383.