Ternary Feature Masks: continual learning without any forgetting

01/23/2020 ∙ by Marc Masana, et al. ∙ Universitat Autònoma de Barcelona

In this paper, we propose an approach to continual learning without any forgetting for the task-aware regime, where the task label is known at inference. By using ternary masks we can upgrade a model to new tasks, reusing knowledge from previous tasks while not forgetting anything about them. Using masks prevents both catastrophic forgetting and backward transfer. We argue, and show experimentally, that avoiding the former largely compensates for the lack of the latter, which is rarely observed in practice. In contrast to earlier works, our masks are applied to the features (activations) of each layer instead of the weights. This considerably reduces the number of mask parameters to be added for each new task, by more than three orders of magnitude for most networks. Encoding the ternary masks into two bits per feature creates very little overhead for the network, avoiding scalability issues. Our masks do not permit any changes to features which are used by previous tasks. As this may be too restrictive to allow learning of new tasks, we add task-specific feature normalization. This way, already learned features can adapt to the current task without changing the behavior of these features for previous tasks. Extensive experiments on several fine-grained datasets and ImageNet show that our method outperforms the current state of the art while reducing memory overhead in comparison to weight-based approaches.




1 Introduction

Finetuning has been established as the most common method for learning a new task on top of an already learned one. This works well if the system is no longer required to perform the previous task. However, in many real-world situations one is interested in learning consecutive tasks, all of which the system should be able to perform in the end. This is the setting studied in lifelong learning, also referred to as sequential, incremental or continual learning. In this setting, the popular approach of finetuning suffers from catastrophic forgetting french1999catastrophic ; goodfellow2013empirical ; mccloskey1989catastrophic ; li2020baseline ; mermillod2013stability : all network capabilities are used for learning the new task, which leads to forgetting of the previous ones.

A popular strategy to avoid this is to use importance-weight loss proxies or regularizers aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming . These approaches compute an importance score for each of the weight parameters of the model based on previous tasks and use this to decide which weights can be modified for the current task. A drawback of these methods is that they need to store an extra variable (the importance score) for each weight. This leads to an overhead of a float per weight parameter, i.e. double the number of parameters which have to be stored. Other methods work with a binary mask to select part of the model for each task mallya2018piggyback ; mallya2018packnet . This leads to an overhead of one bit per task per weight parameter. Finally, some methods directly make a copy of the network rusu2016progressive or rely on the storage of exemplars lopez2017gradient ; rebuffi2016icarl ; chaudhry2018efficient ; riemer2018learning , which again increases memory consumption and renders these methods unsuitable when privacy requirements forbid storing of data.

In this paper, we advocate computing a mask at the level of the features (activations) instead of at the level of the weights. By features or activations we refer to the inputs of a layer that are also the output of a previous layer; the input images can also be considered features for the network to perform the task at hand, but here we only use the term for the activations between layers. We need the mask to be ternary, i.e. adding a third state that allows features to be used during the forward pass while being masked during the backward pass. This allows the representations from previous tasks to be reused without modifying them and introducing forgetting. It also drastically reduces the number of extra parameters that need to be stored. As an example, the popular AlexNet architecture krizhevsky2012imagenet has around 60 million weights, while having fewer than 10k features. One earlier method that builds on this idea is HAT serra2018overcoming , which stores an attention value for each feature for each task. Recently, SSL aljundi2018selfless also brings attention to the activation neurons by promoting sparsity with different losses inspired by lateral inhibition in the mammalian brain. Those two recent works stress the importance of focusing on the features instead of the weights, not only because of the reduction in memory overhead, but also because they allow for better performance and less forgetting. However, both methods still allow some forgetting as new tasks are learned.

Over time, the forgetting typically increases with the number of tasks aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; lopez2017gradient ; rebuffi2016icarl ; serra2018overcoming . However, for many practical systems, it is undesirable for the accuracy of the system to deteriorate over time while it learns new tasks. Moreover, under these settings, the user typically has no control over the amount of forgetting, i.e. there are no guarantees on the performance of the system after new tasks have been added. Even worse, the user does not know how much the system has actually forgotten or how well it still performs on older tasks.

For these reasons, some works have studied systems which perform continual learning without any forgetting at all. Currently, apart from the methods that make copies of the network for each task rusu2016progressive , mask-based approaches are the only ones that guarantee no forgetting mallya2018piggyback ; mallya2018packnet . Indeed, all methods that allow backward propagation into the parameters of previously learned tasks have no control over the amount of forgetting. Using a binary mask to leave the weights of previous tasks untouched prevents any forgetting. In the recent approaches PackNet mallya2018packnet and PiggyBack mallya2018piggyback , this is enforced by binary-masking weights or by learning masks that are binarized after the task is learned. However, both of these non-forgetting methods mask weights and therefore have a larger memory overhead than methods that put the masks on the features instead. Another drawback of PiggyBack mallya2018piggyback is that it requires a backbone network as a starting point.

In this article, we propose a method for continual learning which does not suffer from any forgetting. Due to the nature of our proposed mask-based approach, we focus only on evaluation in task-aware experimental setups. Instead of applying masks to all weights in the network, we propose to move the masks to the feature level, thereby significantly reducing the memory overhead. Our initial method requires only a 2-bit mask value for each activation for each task. In addition, we introduce a task-dependent normalization on the features. This makes it possible to adjust previously learned features so that they are more useful for later tasks, without changing the performance or weights assigned to previous tasks. This introduces a further memory overhead of two additional floats per activation per task. Nevertheless, this method still has a significantly lower memory overhead than any method which stores additional parameters per weight aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; mallya2018piggyback ; mallya2018packnet ; liu2018rotate ; zenke2017continual .

The remainder of the paper is organized as follows. In the next section we review existing continual learning methods. In Section 3, we discuss masks on the features and introduce the proposed method Ternary Feature Masks (TFM). Then, we present the experimental setup and analysis of the results in Section 4 and conclude in Section 5.

2 Related work

Lifelong learning in the proposed continual learning setup has been addressed in multiple prior works. A large part of these use regularization-based techniques to reduce catastrophic forgetting without having to store raw input. They can be divided into two main families. Distillation approaches use teacher-student setups that aim at preserving the output of the teacher model on the new data distribution rusu2016progressive ; li2017learning ; jung2016less ; rannen2017encoder ; zhang2019class . The output for each of the previous tasks is encoded using targets, exemplars or other representations and constrained when learning the new task. Model-based approaches aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; aljundi2018selfless ; liu2018rotate ; zenke2017continual ; chaudhry2018riemannian focus on overcoming catastrophic forgetting by defining the importance of the weights in the network. By penalizing changes to important weights, that have a big impact on the network output or performance, the loss protects previous information.

Learning without Forgetting (LwF) proposes to use the knowledge distillation loss li2017learning to preserve the performance of previous tasks. However, if the data distribution of the new task is very different from the previous tasks, performance drops drastically aljundi2016expert . In order to solve that, rebuffi2016icarl (iCaRL) stores a subset of each task's data as exemplars, while rannen2017encoder (EBLL) solves the issue by learning undercomplete autoencoders for each task. Less Forgetting Learning (LFL) is also similar to LwF, preserving the previous tasks' performance by penalizing changes on the shared representation jung2016less . This approach argues that the task-specific decision boundaries should not change, and freezes the last layer instead of mitigating the change with a loss. Expert Gate aljundi2016expert (EGate) learns a model for each task and an autoencoder gate which chooses the model to be used. Recently, Deep Model Consolidation (DMC) proposes to learn new classes separately and then learn a final model with double distillation and extra unlabelled data zhang2019class . However, most of these methods need a pre-processing step before each task. Furthermore, scalability is a main issue when learning many tasks, since the described methods have to store data, autoencoders, or larger models for each new task.
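The distillation idea used by LwF-style methods can be illustrated with a minimal sketch: the outputs of the old model on the new data, softened by a temperature, serve as targets for the updated model. This is not the implementation of any of the cited works; the function names and the temperature value of 2 are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(old_logits, new_logits, temperature=2.0):
    """Cross-entropy between softened old-model and new-model outputs."""
    p_old = softmax(old_logits, temperature)
    log_p_new = np.log(softmax(new_logits, temperature))
    return float(-(p_old * log_p_new).sum(axis=-1).mean())

# Hypothetical logits of the old-task head before and after learning a new task.
old = np.array([[2.0, 0.5, -1.0]])
drifted = np.array([[-1.0, 0.5, 2.0]])
print(distillation_loss(old, old), distillation_loss(old, drifted))
```

The loss is smallest when the updated model reproduces the old outputs, which is what keeps the previous tasks' behaviour in place while the new task is learned.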

When learning a new task, most model-based approaches apply a smooth penalty for changing weights, proportional to their importance for previous tasks aljundi2018memory ; kirkpatrick2017overcoming ; lee2017overcoming ; liu2018rotate ; zenke2017continual . One of the main issues for these methods is that, depending on the task relatedness and the capacity of the network, they might over- or under-estimate the importance of those weights. The main difference among those methods is how the importance of weights is calculated. In Elastic Weight Consolidation (EWC) an approximation of the diagonal of the Fisher Information Matrix (FIM) is used kirkpatrick2017overcoming . Rotated EWC proposes a rotation of the weight space to get a better approximation of the FIM liu2018rotate . In Incremental Moment Matching (IMM) the moments of the posterior distribution are matched incrementally lee2017overcoming . They assume a diagonal covariance matrix, which means that there is no correlation among the parameters, and selectively balance between two weights using variance information. Synaptic Intelligence (SI) computes the importance weights in an online fashion by storing how much the loss would change for each parameter over the training zenke2017continual . However, this method is prone to under-estimating the importance when using pretrained networks, and to over-estimating it by relying too much on the gradient descent of some batches. Memory Aware Synapses (MAS) computes the weight importance in an online unsupervised way, connecting their approach with Hebbian learning aljundi2018memory . It promotes the importance of those weights that have a big effect on the network predictions even with small changes. Finally, those approaches that learn an importance parameter for each weight can be combined with Selfless Sequential Learning (SSL), which imposes sparsity or local neuron inhibition on the neuron activations with a loss term aljundi2018selfless .
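The shared structure of these model-based regularizers can be sketched as a quadratic penalty on weight changes, scaled by a per-weight importance. The sketch below is generic and assumes the importance values are already given; how they are obtained (diagonal FIM, path integral, output sensitivity) is exactly where the methods above differ. The names `importance_penalty` and `strength` are illustrative.

```python
import numpy as np

def importance_penalty(theta, theta_prev, importance, strength=1.0):
    """0.5 * strength * sum_i importance_i * (theta_i - theta_prev_i)^2."""
    diff = theta - theta_prev
    return 0.5 * strength * float(np.sum(importance * diff ** 2))

theta_prev = np.array([1.0, -2.0, 0.5])   # weights after the previous task
importance = np.array([10.0, 0.0, 1.0])   # one float stored per weight
theta = np.array([1.1, 3.0, 0.5])         # weights while learning the new task

# The large change in the second weight is free because its importance is zero.
penalty = importance_penalty(theta, theta_prev, importance)
print(penalty)
```

Note that the penalty only discourages changes to important weights; it does not forbid them, which is why these methods cannot guarantee zero forgetting.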

Family       | Method                       | Revisit data | Require backbone net | Easily expandable | Overhead                | Forgetting | Forward transfer | Features or weights
Baseline     | Finetune                     | No  | No  | Yes | None                    | Yes  | Little        | neither
             | Joint                        | Yes | No  | Yes | None                    | No   | Little        | neither
             | Freeze                       | No  | Yes | Yes | None                    | No   | Backbone only | neither
Distillation | LwF li2017learning           | No  | No  | No  | 1 float pp              | Some | Yes           | weights
             | LFL jung2016less             | No  | No  | No  | 1 float pp              | Some | Yes           | weights
             | PNN rusu2016progressive      | No  | No  | Yes | duplicate pt            | Some | Yes           | weights
             | P&C schwarz2018progress      | No  | No  | No  | extra network           | Some | Little        | weights
Model-based  | EWC kirkpatrick2017overcoming | No | No  | No  | 1 float pp              | Some | Yes           | weights
             | R-EWC liu2018rotate          | No  | No  | No  | 1 float pp              | Some | Yes           | weights
             | IMM lee2017overcoming        | No  | No  | No  | 1 float pp pt           | Some | Yes           | weights
             | SI zenke2017continual        | No  | No  | No  | 1 float pp              | Some | Yes           | weights
             | MAS aljundi2018memory        | No  | No  | No  | 1 float pp              | Some | Yes           | weights
             | SSL aljundi2018selfless      | No  | No  | No  | 1 float pfp             | Some | Yes           | both
Mask-based   | PackNet mallya2018packnet    | No  | No  | No  | 1 int pp pt             | No   | Yes           | weights
             | PiggyBack mallya2018piggyback | No | Yes | No  | 1 bit pp pt             | No   | Backbone only | weights
             | HAT serra2018overcoming      | No  | No  | No  | 1 float pf pt           | Some | Yes           | features
             | TFM w/o FN (Ours)            | No  | No  | Yes | 2 bits pf pt            | No   | Yes           | features
             | TFM (Ours)                   | No  | No  | Yes | 2 bits + 2 floats pf pt | No   | Yes           | features
Table 1: Summary of related work characteristics. pf: per feature, pp: per parameter, pt: per task, pfp: per feature and parameter.

Some more works use other underlying methods. Progressive Neural Networks (PNN) add lateral connections at each layer of the network to a duplicate of that layer rusu2016progressive . Then, the new column learns the new task while the old one keeps the weights fixed, meaning that resources are duplicated each time a task is added. This approach leads to non-forgetting while making the knowledge of previous tasks available during the learning of a new one through distillation. However, as each new task adds a column with the corresponding connections, the overhead scales quadratically with the number of tasks. Progress and Compress (P&C) expands the idea of PNN with the use of EWC but keeps the number of parameters constant schwarz2018progress . They propose a two-component setup with a knowledge base and an active column that follows a similar setup as PNN with lateral connections. Recently, Learn to Grow (LtG) proposes a two-part approach with a neural structure optimization component and a learning component which finetunes the parameters li2019learn . The neural structure component allows each layer to reuse existing weights, adapt them or grow the network. In the worst-case scenario, layers have to be added, which makes the growth linear in the number of tasks. The finetuning component does parameter optimization and can be fixed or use an existing approach such as EWC to avoid catastrophic forgetting.

Apart from the above-mentioned families, some recent works use masks to directly influence or completely remove forgetting for each parameter. We refer to this family of approaches as mask-based. These approaches give better control over the flow of gradients through the network and have the benefit of reducing or removing catastrophic forgetting more than the previously mentioned alternatives, at the cost of depleting network capacity faster as new tasks are learned. PathNet uses evolutionary strategies to learn selective routing through the weights fernando2017pathnet . However, it is not end-to-end differentiable and is computationally very expensive. PackNet trains the network with available weights, then prunes the less relevant weights and retrains with a smaller subset of them mallya2018packnet . Those weights are then not available for further learning of new tasks, which quickly reduces the capacity of the network. This results in fewer parameters being free and performance dropping quickly on longer sequences. Piggyback proposes to use a pretrained network as a backbone and then uses binary masks on the weights to create different sub-networks for each task mallya2018piggyback . The main drawback of that approach is the backbone network itself, which is crucial for learning each task on top of it and whose data distribution cannot be too different from that of the tasks. In the case of PackNet, the use of a backbone network is not needed but recommended, since it is more difficult to learn tasks from scratch (especially with larger networks) than from a network finetuned on a large-scale dataset. In these last two methods, although the binary mask for each parameter adds almost no overhead to the network usage, storing a mask for each task does not scale very well after a certain number of tasks. Finally, Hard Attention to the Task (HAT) proposes a hard attention mechanism on the features after each layer serra2018overcoming . The attention embeddings are non-binary and are learned together with each task and conditioned by the previous tasks' attentions. Because of the annealing of the slope of the sigmoid used on the embeddings, they also define a different backward propagation through their attention mechanism with gradient compensation. This approach offers plasticity to the embeddings in order to learn them, but also allows the possibility of forgetting previous tasks during the backpropagation step. A no-forgetting idea is discussed in the appendices of their manuscript with a note on binary masks, connecting the removal of plasticity to the inhibitory synapses idea mcculloch1943logical .

In our proposed approach we take the latter side of that balance, using rigid masks that reduce plasticity but also ensure non-forgetting of previous tasks. Our approach also focuses on a natural expansion of the capacity of the network, which is not addressed in HAT and most of the previous related work. Our proposed approach uses masks on the features of the network to have a better control over which weights can be modified while learning new tasks. At the same time, the mask being ternary allows weights fixed for previous tasks to be used on new tasks without modifying those weights. This masking strategy allows the network to not forget anything from previous tasks and reduce the computational overhead in comparison to masking the weights.

Table 1 gives an overview of the characteristics that we consider most relevant to the experimental setup we propose. Our proposed method is unique in that it combines being expandable, having a low overhead cost and having no forgetting. All other methods offer at most one of those three characteristics. Finally, we want to mention that some methods have been proposed that consider a memory budget rebuffi2016icarl , or an online setup losing2018incremental , as well as methods with exemplars where each sample is used only once lopez2017gradient ; chaudhry2018efficient ; riemer2018learning . However, those setups are significantly different from the sequential setup we propose and therefore we do not compare to them. The same holds for Learn to Grow, which belongs to the optimal architecture search setup and can be mixed or extended with several of the above-mentioned approaches.

3 Learning without any Forgetting

Here we propose our method for task-aware continual learning, designed to learn new tasks without any forgetting of previously learned tasks. As discussed in the introduction, masks that create rigid states are an efficient way to enforce non-forgetting of previous tasks. Freezing the weights once learned keeps the knowledge fixed, without the possibility of forgetting. Works which have addressed this problem have focused on weight-masks, where an additional parameter is learned for each weight in the network mallya2018piggyback ; mallya2018packnet . From a network overhead point of view, we argue that it is, however, better to work with feature-masks, which learn an additional parameter for each feature in the network. In Table 2 we compare the number of weights and features in several popular networks. The table clearly shows that the overhead is significantly lower: since the number of weights in a layer scales with the product of its input and output feature counts, weight-masks are on average a quadratic factor bigger than feature-masks.

Network                         | #weights    | #features
LeNet lecun1998gradient         | 59,956      | 226
AlexNet krizhevsky2012imagenet  | 54,547,712  | 9,344
VGGNet simonyan2014very         | 119,579,904 | 10,880
ResNet-50 he2016deep            | 19,330,304  | 22,720
Table 2: Difference between weights and features for different common network architectures. The last fully-connected layer is not taken into account since it depends on the number of classes.
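To make the comparison concrete, the per-task storage implied by the counts in Table 2 can be computed directly. The sketch below assumes 1 bit per weight for a binary weight mask, 2 bits per feature for a ternary feature mask, and two 32-bit floats per feature for feature normalization; the 32-bit float size is an assumption, as the storage precision is not stated here.

```python
# (#weights, #features) copied from Table 2
NETWORKS = {
    "LeNet": (59_956, 226),
    "AlexNet": (54_547_712, 9_344),
    "VGGNet": (119_579_904, 10_880),
    "ResNet-50": (19_330_304, 22_720),
}
FLOAT_BITS = 32  # assumption: floats are stored with 32 bits

def per_task_overhead_bits(n_weights, n_features):
    """Bits that must be stored for every additional task."""
    return {
        "weight_mask": n_weights,                   # 1 bit per weight
        "tfm_no_fn": 2 * n_features,                # 2 bits per feature
        "tfm": n_features * (2 + 2 * FLOAT_BITS),   # 2 bits + 2 floats per feature
    }

overheads = {name: per_task_overhead_bits(w, f) for name, (w, f) in NETWORKS.items()}
for name, bits in overheads.items():
    ratio = bits["weight_mask"] / bits["tfm_no_fn"]
    print(f"{name:10s} weight mask is {ratio:8.1f}x larger than a ternary feature mask")
```

Even with the two extra floats per feature, the feature-level overhead stays far below one bit per weight for the larger networks in the table.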

First, we will discuss binary masks and how those can easily encode the parts that we want to learn and the parts that we want to fix. Afterwards, we will explore what happens when we want to learn more than one task and how the binary masks need to be extended to ternary masks to make room for a new state. Finally, we will explore the use of feature normalization to allow for less rigid learning of new tasks.

3.1 Binary feature masks

Using binary feature masks on neural networks means that the masked neuron will have one of two states (0 or 1). When the masks are directly multiplied by the neuron activations, the network will either use the corresponding filters or not (the same holds for the backward pass, which will either be applied or not). Then, for each task we have a binary mask that indicates which neurons can be used. Since we do not allow for any forgetting, those masks have to be disjoint. In Fig. 1 we show an example with two tasks where each of them is only allowed to use different neurons. A large number of connections are completely unused, making the two sub-networks totally separable from one another.

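Consider a fully connected layer (the theory can easily be extended to convolutional layers). The output of the layer is $y_l = W_l x_{l-1}$ where $x_{l-1} \in \mathbb{R}^{p}$, $y_l \in \mathbb{R}^{q}$ and $W_l \in \mathbb{R}^{q \times p}$. The binary feature mask for the forward pass is defined as follows:

$$ y_l = (W_l x_{l-1}) \odot m_l^t \qquad (1) $$

where $m_l^t \in \{0,1\}^{q}$ refers to the mask for task $t$ at layer $l$ and $\odot$ is an element-wise multiplication. Masks from different tasks are forced to select different features ($m_l^t \odot m_l^s = 0$ for $s \neq t$). The backward pass for training task $t$ is defined as:

$$ \frac{\partial \mathcal{L}}{\partial W_l} \leftarrow \frac{\partial \mathcal{L}}{\partial W_l} \odot \left( m_l^t \wedge (m_{l-1}^t)^\top \right) \qquad (2) $$

where $\mathcal{L}$ is the loss, $\wedge$ is the AND logical operator (applied to every pair of output and input features, giving a matrix of the same size as $W_l$), and there are only non-zero gradients for those weights which connect features that are selected by the mask for task $t$.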

Figure 1: Binary masks encode two states: used or unused. In this case, neurons in grey are learnable for task 1 but neurons in green are not, and the opposite is true for task 2. All grey weights are unused by both tasks.
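The binary masking described above can be sketched in a few lines of NumPy for a single fully connected layer. This is an illustration under assumed layer sizes and hand-picked masks, not the authors' code: the forward pass multiplies the activations by the task mask, and the gradient is kept only for weights whose input and output features are both selected for the task.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 6                       # input / output sizes of the layer (illustrative)
W = rng.normal(size=(q, p))
x = rng.normal(size=p)

# Disjoint binary masks over the layer's output features, one per task.
m_out = {1: np.array([1, 1, 1, 0, 0, 0]), 2: np.array([0, 0, 0, 1, 1, 1])}
# Masks over the previous layer's features (the input of this layer).
m_in = {1: np.array([1, 1, 0, 0]), 2: np.array([0, 0, 1, 1])}

def forward(W, x, task):
    """Masked forward pass: features not selected for the task output zero."""
    return (W @ x) * m_out[task]

def mask_gradient(grad_W, task):
    """Keep gradients only for weights joining two features selected for the task."""
    allowed = np.logical_and(m_out[task][:, None], m_in[task][None, :])
    return grad_W * allowed

y1 = forward(W, x * m_in[1], task=1)
g2 = mask_gradient(np.ones_like(W), task=2)
print(y1)
print(g2)
```

With these masks, only 6 of the 24 weights can be updated for task 2, and the 12 weights that cross between the two sub-networks are never used, which mirrors the unused connections in Fig. 1.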

This setup allows the weights associated with a neuron to be either used-and-learnable, or neither. If used, they will contribute forward to the next layers (which is good, as it promotes forward transfer, i.e. sharing of knowledge from previous tasks). Yet at the same time this also implies that it will be possible to modify them (which is bad, as it introduces catastrophic forgetting on the previous tasks). With only binary masks, you cannot have one without the other. Alternatively, one could also define two separate binary states: “used” and “learnable”. This has been used for a long time in deep learning by freezing weights oquab2014learning . Freezing weights is a mask-based way of switching the learning of a layer on and off. In this case, in both states the layer would contribute to the outcome of the network, but the update of the weights would only be done on those layers that are not masked. Here, we further explore this idea. We advocate that, in a sequential setup where the capacity of the network might increase when learning new tasks, the best way to mask the neurons is by having three states: “used”, “learnable” and “unused”. This can be achieved by using ternary masks on the neurons.

3.2 Ternary feature masks (TFM)

Being able to use the connections between the neurons of the previous tasks and the neurons of the newly added task is important to reuse the learned information and reduce the amount of capacity that needs to be added. By using a ternary mask we can define three states:

  • forward only: the features are used during the forward pass so that the learned information from previous tasks is used; but the backward pass step is removed in order to keep the weights and prevent forgetting. This state is used on the features from previous tasks.

  • normal: forward and backward passes are applied as usual in order to learn the task at hand. This state is used on the new features created by the expansion of the network.

  • masked: neither forward nor backward passes are allowed, the features do not contribute to the network inference and the weights associated to it are frozen. This state is used at test time only when evaluating an old task after a new task is added. When extending the capacity of the network, the new features will not be used when doing inference on the previous tasks since those did not exist at the moment of their training.

As in the case of the binary mask, we assign features to tasks with a mask $m_l^t$ (with $l$ the corresponding layer). Again, overlap in the selected features is not allowed. However, unlike before, we now define a second mask per task, which accumulates the features of the current and all previous tasks:

$$ n_l^t = \bigvee_{s \le t} m_l^s \qquad (3) $$

The forward and backward pass are now given by:

$$ y_l = (W_l x_{l-1}) \odot n_l^t \qquad (4) $$

$$ \frac{\partial \mathcal{L}}{\partial W_l} \leftarrow \frac{\partial \mathcal{L}}{\partial W_l} \odot \left( n_l^t (n_{l-1}^t)^\top - n_l^{t-1} (n_{l-1}^{t-1})^\top \right) \qquad (5) $$

respectively, where $W_l$ and $x_{l-1}$ are the weights and input of layer $l$, $\mathcal{L}$ is the loss and $\odot$ is an element-wise multiplication. During the forward pass, features which were selected by previously learned tasks can be used in the current task. During the backward pass, we make sure that all new weights can be updated while forcing the existing ones from previous tasks to remain the same. In Fig. 2 we show an example with two tasks, where adding features to the layer allows for more connections to be used than in the binary case. The part of the mask corresponding to $n_l^t (n_{l-1}^t)^\top$ covers all available connections at task $t$. In a similar way, the part of the mask corresponding to $n_l^{t-1} (n_{l-1}^{t-1})^\top$ covers all available connections at task $t-1$. Subtracting both terms allows us to mask the connections that contain the already learned content and apply backpropagation only on the new connections.

Figure 2: Ternary masks encode three states: masked (frozen), forward only or normal (forward and backward). Compared to Fig. 1, all unused connections can now be learned without forgetting previous knowledge.
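The construction of the two masks can be sketched as follows: the forward mask is the union of the features assigned to the current and all previous tasks, and the backward mask is obtained by subtracting the connections that were already available at the previous task from those available now. The feature assignments below are made up for illustration, and the helper names are not taken from the paper.

```python
import numpy as np

# Features assigned to each task (disjoint), for two consecutive layers.
m_prev = {1: np.array([1, 1, 0, 0]), 2: np.array([0, 0, 1, 1])}        # layer l-1
m_curr = {1: np.array([1, 1, 1, 0, 0]), 2: np.array([0, 0, 0, 1, 1])}  # layer l

def forward_mask(m, t):
    """Union of the features assigned to tasks 1..t."""
    n = np.zeros_like(m[1])
    for s in range(1, t + 1):
        n = np.logical_or(n, m[s]).astype(int)
    return n

def backward_mask(t):
    """Connections available at task t minus those already available at task t-1."""
    now = np.outer(forward_mask(m_curr, t), forward_mask(m_prev, t))
    before = np.outer(forward_mask(m_curr, t - 1), forward_mask(m_prev, t - 1))
    return now - before

B2 = backward_mask(2)
print(B2)
```

In this example the 3 x 2 block of weights learned for task 1 stays frozen during task 2, while every other connection of the grown layer can be trained.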

Note that this definition also allows the same forward and backward pass to be used in case we wanted to re-train one of the previous tasks. However, since we do not contemplate this option for our proposed setup, we can simplify equation 5 to non-revisiting task incremental learning. In this case, the forward pass remains the same as in equation 4, and the backward pass can be rewritten as:

$$ \frac{\partial \mathcal{L}}{\partial W_l} \leftarrow \frac{\partial \mathcal{L}}{\partial W_l} \odot \left( m_l^t \vee (m_{l-1}^t)^\top \right) \qquad (6) $$

where $\vee$ is the OR logical operator, which makes the mask active when either operand is active.

Since the forward mask $n_l^t$ can never be 0 if one of the current or previous task masks $m_l^s$ is 1, both masks $m$ and $n$ can be combined in a single ternary mask. This is because weights associated with a feature that is not used in the forward pass are never updated. With this ternary mask, the states are associated as follows: when $m = 1$ and $n = 1$ the neuron is used and learnable (normal state), when $m = 0$ and $n = 1$ the neuron is used and contributes to the backward pass but the associated weights are not updated (forward only state), and finally when $m = 0$ and $n = 0$ the neuron is unused, not taking part in the inference or the update of the network (masked state). In our implementation we assign these three states to a ternary mask as 2, 1 and 0 respectively. We will provide code and make it public upon acceptance of this manuscript.
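A possible way to work with the combined ternary state is sketched below, assuming the values 2 (normal), 1 (forward only) and 0 (masked). The sketch derives which weights may be updated from the states of the two features a weight connects, and shows one way of packing the states into two bits each. The packing layout (four states per byte, lowest bits first) is an assumption made for illustration, not a description of the authors' storage format.

```python
import numpy as np

MASKED, FORWARD_ONLY, NORMAL = 0, 1, 2

def update_mask(state_out, state_in):
    """Weights that may be updated: both ends are used, at least one end is learnable."""
    used = np.outer(state_out > MASKED, state_in > MASKED)
    learnable = np.logical_or((state_out == NORMAL)[:, None],
                              (state_in == NORMAL)[None, :])
    return np.logical_and(used, learnable)

SHIFTS = np.array([0, 2, 4, 6], dtype=np.uint8)

def pack_2bit(states):
    """Pack ternary states into bytes, four 2-bit states per byte."""
    padded = np.zeros(-(-len(states) // 4) * 4, dtype=np.uint8)
    padded[: len(states)] = states
    return np.bitwise_or.reduce(padded.reshape(-1, 4) << SHIFTS, axis=1)

def unpack_2bit(packed, length):
    """Inverse of pack_2bit; `length` drops the padding."""
    return ((packed[:, None] >> SHIFTS) & 3).reshape(-1)[:length]

states = np.array([1, 1, 1, 2, 2])   # task-2 mask of a layer grown from 3 to 5 features
packed = pack_2bit(states)
restored = unpack_2bit(packed, len(states))
print(packed, restored)
```

Five features fit into two bytes here; a weight-level binary mask for the same layer would already need one bit for every incoming connection.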

Using previously learned parameters in the forward pass while only updating the network parameters assigned to the current task in the backward pass is also done in PackNet mallya2018packnet and HAT serra2018overcoming . However, in contrast to our method, PackNet has the masks on the weights and not on the features, and HAT applies a soft activation mask, which permits forgetting of previous tasks. We further differ from these methods through the task-specific feature normalization (discussed in the next section), which is a crucial ingredient of our method and which allows not only exploiting previously learned features, but also adapting them to the current task. This is not possible for either PackNet or HAT.

3.3 Task-specific feature normalization (FN)

Since the binary or ternary masks freeze filters learned on previous tasks, those filters have no room for flexibility to small changes in the features. This means that even when being very similar to the ones needed for a new task, they will tend to learn a similar version of those filters with shifted or scaled operators. This phenomenon is similar to the one observed when learning several styles for style transfer networks. A way of reusing learned filters in a more efficient way but still keeping the non-forgetting property would be to use a similar approach to conditional instance normalization dumoulin2017learned , which consists in transforming a set of features into a normalized version depending on the task.

Let $x_l$ be the features of layer $l$, and $\gamma_l^t$, $\beta_l^t$ the learnable parameters for each feature of each layer given a fixed task $t$. We define the task-specific feature normalization of $x_l$ as:

$$ \hat{x}_l = \gamma_l^t \odot x_l + \beta_l^t \qquad (7) $$

where we apply a conditional normalization on the task without applying an instance normalization on the mean and standard deviation across the spatial dimensions. These parameters allow the learned filters to be slightly adjusted to the new tasks without modifying existing parameters (thus no forgetting happens) and with little overhead to the network capacity, since the $\gamma_l^t$ and $\beta_l^t$ parameters are defined per feature and not per weight.
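A minimal sketch of such a task-conditional scale and shift is given below. The class name, the dictionary-based storage and the identity initialisation are assumptions made for illustration. The point of the sketch is that each task owns its two parameters per feature, so training the parameters of a new task cannot change the output computed for an earlier one.

```python
import numpy as np

class TaskFeatureNorm:
    """Per-task scale and shift applied to the features of one layer."""

    def __init__(self, num_features):
        self.num_features = num_features
        self.gamma = {}   # task id -> scale, one value per feature
        self.beta = {}    # task id -> shift, one value per feature

    def add_task(self, task):
        # Identity initialisation (an assumption, not stated in the text).
        self.gamma[task] = np.ones(self.num_features)
        self.beta[task] = np.zeros(self.num_features)

    def __call__(self, x, task):
        return self.gamma[task] * x + self.beta[task]

fn = TaskFeatureNorm(num_features=3)
fn.add_task(1)
fn.add_task(2)
x = np.array([0.5, -1.0, 2.0])
before = fn(x, task=1)

# Pretend that training on task 2 changed its parameters.
fn.gamma[2] = np.array([2.0, 1.0, 0.5])
fn.beta[2] = np.array([0.1, 0.0, -0.1])
after = fn(x, task=1)
adapted = fn(x, task=2)
```

Here `before` and `after` are identical, while `adapted` shows the same frozen features rescaled and shifted for the new task.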

3.4 Growing Ternary Feature Masks

One of the core characteristics of our proposed method is that it can easily grow and expand the capacity of the network as required. Methods that allow the weights to be modified will eventually modify them as more tasks are added, regardless of which approach is chosen to define the importance of the weights. Capacity is limited, and when the network is close to running out of it for learning the new task, the trade-off will allow forgetting to take place. At that point, most methods cannot easily be expanded to accommodate more weights and grow the network before too much forgetting happens. This is because most of the approaches cannot predict the performance drop on previous tasks before learning the new task. With our approach, we make growing a core part of the system. This provides the capacity needed to avoid forgetting at a very small overhead cost, while allowing the network to grow in parameters only when necessary.

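Given a network with $L$ layers, any layer $l$ with the corresponding learned features $y_l \in \mathbb{R}^{q}$ can be expanded if those learned features are not enough to represent the new task. When expanding a layer by $k_l$ new features, the output of the layer grows to $y_l \in \mathbb{R}^{q + k_l}$. That affects only the newly added forward mask values, i.e. those at the new feature indices $i \in \{q+1, \dots, q+k_l\}$:

$$ n_l^s[i] = \begin{cases} 1 & \text{if } s = t \\ 0 & \text{if } s < t \end{cases} \qquad (8) $$

so that all features can be seen while learning the new task $t$ but the new ones are ignored by previous tasks. And then the backward mask:

$$ m_l^s[i] = \begin{cases} 1 & \text{if } s = t \\ 0 & \text{if } s < t \end{cases} \qquad (9) $$

so that it only affects the new connections without modifying previous knowledge.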

Training small tasks on large networks at the beginning of a continual learning setup usually leads to overfitting or too much repetition of filters. Feature usage on the new tasks looks very unbalanced in comparison to learning larger tasks masana2017domain . We believe that learning tasks at their correct capacity and growing when more is needed is a much better approach to avoid overfitting. This observation is backed by the better results that some other approaches obtain when pruning and retraining on smaller sub-networks than when directly pruning or learning in larger sub-networks mallya2018packnet ; rusu2016progressive .

  Input: ternary mask for each task and layer
  Input: number of features to add per layer
  Require: Network with layers
  Require: Tasks, current task
  % Loop for each layer %
  for each layer do
      old_size ← current_output_size()
      new_size ← old_size + number of features to add to this layer
      % For each previous task %
      for each previous task do
          mask of that task [old_size:new_size] ← 0
      end for
      % For the current task %
      mask of the current task [0:old_size] ← 1
      mask of the current task [old_size:new_size] ← 2
  end for
Algorithm 1: Growing Ternary Feature Masks

3.5 Ternary Mask Implementation

The formulation of our proposed method in Sec. 3.2 states that we can combine the forward and backward masks into a single ternary mask. To make the implementation easier and more efficient, we create a ternary mask that is set to state 2 for all features at task 1. The first task therefore behaves like a normal network that learns the task at hand as if we were using finetuning. Then, when moving to task 2, we grow the network and add new features. The masks for task 1 associated with the new features are set to state 0, so these features are neither used when evaluating task 1 nor learnable for it. The masks for task 2 are then created by setting the previously existing features to state 1 and the newly added features to state 2. This allows the features with state 1 to be used during the forward pass, while the features with state 2 contribute to both the forward pass and the backward pass (using Eq. 6). This process is summarized in Algorithm 1.
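The growing procedure can be illustrated with a minimal Python sketch for a single layer. It assumes the state encoding used in Algorithm 1 (0 for unused, 1 for forward only, 2 for forward and learnable); the function names and the list-based mask layout are hypothetical and are not taken from the authors' implementation.

```python
# Ternary states, following Algorithm 1:
# 0 = unused, 1 = used in the forward pass only, 2 = used and learnable.
UNUSED, FORWARD_ONLY, LEARNABLE = 0, 1, 2


def grow_masks(masks, new_features):
    """Grow one layer by `new_features` and append a mask for the new task.

    masks: list of per-task ternary masks (lists of ints) for one layer,
           all of the same length (hypothetical layout).
    Returns a new list of masks with one extra entry for the new task.
    """
    old_size = len(masks[0]) if masks else 0
    # Previous tasks never see the newly added features.
    grown = [m + [UNUSED] * new_features for m in masks]
    if not masks:
        # First task: every feature is learnable, as in plain finetuning.
        current = [LEARNABLE] * new_features
    else:
        # Later tasks: reuse old features without changing them, learn the new ones.
        current = [FORWARD_ONLY] * old_size + [LEARNABLE] * new_features
    grown.append(current)
    return grown


def forward_mask(ternary):
    """1 where the feature is used in the forward pass (states 1 and 2)."""
    return [1 if s != UNUSED else 0 for s in ternary]


def backward_mask(ternary):
    """1 where the feature may receive gradient updates (state 2 only)."""
    return [1 if s == LEARNABLE else 0 for s in ternary]


masks = grow_masks([], 4)      # task 1: the layer starts with 4 features
masks = grow_masks(masks, 2)   # task 2: add 2 features
masks = grow_masks(masks, 1)   # task 3: add 1 feature
for task_id, m in enumerate(masks, start=1):
    print(task_id, m, forward_mask(m), backward_mask(m))
```

In this toy run the mask of task 1 ends as `[2, 2, 2, 2, 0, 0, 0]`, so features added later never change how task 1 is evaluated.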

4 Experimental results

In this section we report on a range of experiments to quantify the effectiveness of our proposed approach and compare with other state-of-the-art methods and baselines. Code will be made available upon acceptance.

4.1 Experimental Setup

Datasets.   We evaluate our approach on a larger, lower-resolution dataset (tiny ImageNet ILSVRC2012 deng2009imagenet ), on a large-scale dataset (ImageNet russakovsky2015imagenet ) and on several fine-grained classification datasets: Oxford 102 Flowers nilsback2008automated , CUB-200-2011 Birds wah2011caltech and Stanford Actions yao2011human . Statistics for these datasets are summarized in Table 3. For all experiments we take a fixed random set of 10% of the images for validation. The validation set is evenly distributed across classes and fixed for each experiment to ensure a fair comparison. Since the test set of ImageNet ILSVRC2012 is not labelled, we use its validation set for testing instead.
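A fixed, class-balanced validation split of this kind could be drawn as in the following sketch. The function name, the seed value and the toy label list are assumptions made for illustration; the paper does not describe how the split was implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Return (train_idx, val_idx), holding out `val_fraction` of every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so that every method sees the same split
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = [c for c in range(5) for _ in range(20)]  # toy data: 5 classes x 20 images
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 90 10
```

Because the random generator is seeded, calling the function again with the same labels returns the same indices, which is what keeps the split fixed across approaches.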

Tiny ImageNet is a resized version of ImageNet with of the classes. ImageNet uses inputs that use random cropping to during training for data augmentation. In the case of Birds we resize the bounding box annotations of the objects to for all splits. We do the same for Actions but without using the bounding box annotations. For Flowers we resize to and also do data augmentation by random cropping patches during training and using the central crop for evaluation. In all experiments we perform random horizontal flips during training for data augmentation.

We decided not to run experiments on permuted MNIST, since it has been shown not to allow a fair comparison between different approaches lee2017overcoming . MNIST inputs contain too many zeros, which makes it easy to identify important weights that can be frozen so that they do not overlap with those of the other tasks. Furthermore, the MNIST data might be too simple to represent more realistic scenarios. For the same reason, we do not consider low-resolution datasets other than tiny ImageNet.

Dataset #Train #Eval #Classes
tiny ImageNet deng2009imagenet 100,000 10,000 200
ImageNet ILSVRC2012 russakovsky2015imagenet 1,280,861 50,000 1000
Oxford Flowers nilsback2008automated 2,040 6,149 102
CUB-200-2011 Birds wah2011caltech 5,994 5,794 200
Stanford Actions yao2011human 4,000 5,532 40
Table 3: Summary of datasets used.

Network architectures.   For tiny ImageNet we use VGG-16 simonyan2014very , which has been shown to provide high performance. Since tiny ImageNet has a low resolution, the last max-pool layer and the last three convolutional layers of the feature extractor are removed. For ImageNet and the fine-grained datasets we use AlexNet krizhevsky2012imagenet . No pretrained weights are used and the models are trained from scratch using only samples from the training set. All approaches except ours have access to the full network from the beginning. In contrast, our proposed TFM starts with a network that is smaller at each layer (reduced number of output filters) than the one used by the other approaches. It then grows as explained in Sec. 3.4, with more features added every time a new task is learned. We limit the growth of the network to the total size of the one used by all other approaches. Therefore, at the end of learning all tasks, all approaches will have had access to the same amount of network capacity, apart from the extra parameters or regularizers that each method requires (see Overhead in Table 1).

Training details.   All experiments are trained using backpropagation with plain Stochastic Gradient Descent, following a setup similar to HAT serra2018overcoming . With a batch size of , learning rate starts at , decaying by a factor of when consecutive epochs have no improvement on the validation loss, until either the learning rate is reduced below or 200 epochs have passed. Data splits, task sequence, data loader shuffle and network initialization are fixed for all approaches given a seed. Following the results in de2019continual , we use dropout with .
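The stopping and decay rule described above can be sketched as a small function that replays a sequence of validation losses. The function name and all numeric values in the example (starting learning rate, decay factor, patience and minimum learning rate) are placeholders chosen for illustration, not the values used in the experiments.

```python
def run_schedule(val_losses, lr_start, decay_factor, patience, lr_min, max_epochs=200):
    """Replay a plateau-based learning rate schedule over recorded validation losses.

    Returns the learning rate used in each epoch that was actually run.
    """
    lr, best, bad_epochs, used = lr_start, float("inf"), 0, []
    for epoch, loss in enumerate(val_losses):
        if epoch >= max_epochs or lr < lr_min:
            break  # stop: epoch budget exhausted or learning rate too small
        used.append(lr)
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                # No improvement for `patience` consecutive epochs: decay.
                lr, bad_epochs = lr / decay_factor, 0
    return used


# Placeholder losses and hyperparameters (illustrative only).
losses = [1.0, 0.9, 0.95, 0.96, 0.8, 0.81, 0.82, 0.83]
used = run_schedule(losses, lr_start=0.1, decay_factor=10, patience=2, lr_min=0.005)
print(len(used))  # 7
```

In this toy run the learning rate is decayed twice, and training stops before the eighth epoch because the learning rate has dropped below the minimum.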


   Finetuning has no extra hyperparameters and simply uses the cross-entropy loss to learn each task as it comes, without using data from previous tasks nor avoiding catastrophic forgetting. Joint training breaks the no-revisiting data rule and learns with data from the current task as well as all the previous tasks, serving as an upper-bound to compare all approaches. Finally, we propose to use Freezing as a baseline where we learn the first task and then freeze all layers except the head for the remaining tasks.

Hyperparameters.   Distillation and model-based approaches use hyperparameters to control the trade-off between forgetting and intransigence on the knowledge of previous tasks. On top of that, LwF has a temperature scaling hyperparameter for the cross-entropy loss. From the mask-based models, HAT has a trade-off hyperparameter too and a maximum for the sigmoid gate steepness. PackNet has a prune percentage of the layers.

TFM has a growth percentage, which is the same for all layers. This percentage is set on the validation set according to the following protocol. For each new task, several growth percentages are evaluated without knowledge of previous or future tasks. Rather than choosing the growth rate with the best performance, we pick the lowest growth rate whose performance is within a margin of the best (we set the margin to 1.5% for tiny ImageNet and 0.1% for the fine-grained datasets). For ImageNet this scheme would be computationally demanding, so we use a fixed growth schedule, starting with 55% of the weights for the first task and adding 5% for each remaining task. Once this hyperparameter is fixed, we run the final experiment on the training set and evaluate on the test set.
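The selection rule for the growth percentage can be written in a few lines. The sketch below assumes that a validation accuracy has already been measured for each candidate growth rate; the function name and the accuracy values are hypothetical.

```python
def select_growth_rate(val_accuracy, margin):
    """Pick the lowest growth rate whose accuracy is within `margin` of the best.

    val_accuracy: dict mapping growth percentage -> validation accuracy (in %).
    margin: tolerated accuracy gap to the best candidate (in percentage points).
    """
    best = max(val_accuracy.values())
    eligible = [g for g, acc in val_accuracy.items() if acc >= best - margin]
    return min(eligible)


# Hypothetical validation results for one new task (illustrative numbers only).
candidates = {5: 52.1, 10: 54.0, 20: 55.2, 40: 55.4}
print(select_growth_rate(candidates, margin=1.5))  # 10
print(select_growth_rate(candidates, margin=0.1))  # 40
```

A wider margin favors smaller networks, since any candidate close enough to the best one is accepted and the smallest of them is chosen.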

4.2 Fine-grained datasets

A common setup to evaluate continual learning over a number of learning sessions uses disjoint splits (tasks) inside the same classification dataset. However, some approaches report results on only two tasks lee2017overcoming ; jung2016less ; fernando2017pathnet , which is too similar to transfer learning setups and does not allow evaluating the true potential of continual learning. Because of that, we choose to evaluate our proposed approach and its variations on more than two tasks. It should be noted that we train from scratch, which results in lower scores than those reported by papers that start from a pretrained network. However, because of the large number of classes in ImageNet (including a subset of Birds), we consider that training from scratch provides a more natural setting for continual learning.

We compare our proposed method (TFM) and an ablated version of it without the task-specific feature normalization (TFM w/o FN) with the mask-based approaches (HAT, PackNet), a well-known model-based approach (EWC) and the baselines (Finetune, Freezing, Joint) on three fine-grained datasets (Flowers, Birds, Actions). As explained earlier, to make the comparison fair, our approach learns the first task on a smaller network and then grows to learn the next tasks, with growth capped at the size of the network used by the other approaches.

Oxford 102 Flowers
Method Task 1 Task 2 Task 3 Task 4 Avg.
Finetuning 10.0 (-20.3) 5.1 (-17.1) 6.7 (-13.6) 17.3 (0.0) 9.8
Freezing 30.3 (0.0) 39.8 (0.0) 32.0 (0.0) 33.1 (0.0) 33.8
Joint 54.6 (+24.3) 58.9 (+11.5) 57.7 (+4.5) 47.0 (0.0) 54.6
EWC kirkpatrick2017overcoming 12.1 (-18.2) 11.6 (-38.1) 9.3 (-24.4) 25.8 (0.0) 14.7
HAT serra2018overcoming 17.2 (-12.7) 19.3 (-28.5) 28.6 (+1.4) 31.6 (0.0) 24.2
PackNet mallya2018packnet 32.0 (0.0) 53.7 (0.0) 43.6 (0.0) 37.9 (0.0) 41.8
TFM w/o FN 36.4 (0.0) 54.1 (0.0) 38.6 (0.0) 39.0 (0.0) 42.0
TFM 36.4 (0.0) 53.8 (0.0) 45.5 (0.0) 37.6 (0.0) 43.3
CUBS 200 Birds
Method Task 1 Task 2 Task 3 Task 4 Avg.
Finetuning 7.4 (-30.2) 2.6 (-30.0) 29.7 (-3.4) 43.1 (0.0) 20.7
Freezing 37.6 (0.0) 35.1 (0.0) 35.4 (0.0) 38.4 (0.0) 36.6
Joint 48.7 (+11.1) 52.1 (+6.0) 50.7 (+1.5) 51.9 (0.0) 50.8
EWC kirkpatrick2017overcoming 16.2 (-21.4) 19.0 (-21.2) 24.2 (-14.0) 41.7 (0.0) 25.3
HAT serra2018overcoming 18.7 (-1.8) 19.4 (-0.4) 28.5 (-0.6) 31.2 (0.0) 24.4
PackNet mallya2018packnet 35.3 (0.0) 42.8 (0.0) 44.4 (0.0) 45.9 (0.0) 42.1
TFM w/o FN 42.9 (0.0) 44.1 (0.0) 48.3 (0.0) 49.1 (0.0) 46.1
TFM 42.9 (0.0) 43.1 (0.0) 49.9 (0.0) 48.8 (0.0) 46.2
Stanford 40 Actions
Method Task 1 Task 2 Task 3 Task 4 Avg.
Finetuning 24.4 (-10.5) 26.5 (-7.7) 17.6 (-16.8) 28.9 (0.0) 24.4
Freezing 34.9 (0.0) 29.4 (0.0) 30.1 (0.0) 30.5 (0.0) 31.2
Joint 45.7 (+10.8) 40.3 (+4.8) 43.2 (-1.1) 40.2 (0.0) 42.4
EWC kirkpatrick2017overcoming 24.2 (-10.7) 28.2 (-2.0) 25.2 (-5.6) 34.3 (0.0) 28.0
HAT serra2018overcoming 25.7 (-1.0) 25.5 (-2.7) 30.1 (-2.1) 34.4 (0.0) 28.9
PackNet mallya2018packnet 32.5 (0.0) 32.9 (0.0) 36.7 (0.0) 34.3 (0.0) 34.1
TFM w/o FN 35.3 (0.0) 38.3 (0.0) 39.2 (0.0) 38.0 (0.0) 37.7
TFM 35.3 (0.0) 37.2 (0.0) 42.0 (0.0) 37.2 (0.0) 38.0
Table 4: Comparison with the state-of-the-art. Accuracy after learning 4 tasks on AlexNet from scratch. Number between brackets indicates forgetting.

As can be seen in Table 4, our proposed approach outperforms the other approaches on all three datasets. Adding task-specific feature normalization yields a considerable gain only on Flowers. Only PackNet obtains competitive results; however, TFM improves on its average accuracy on all three datasets (by 1.5 points on Flowers, 4.1 on Birds and 3.9 on Actions), while having a much lower memory overhead than PackNet (0.2Mb versus 27.3Mb, respectively). It is also interesting to note how well Freezing works as a non-forgetting baseline.
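The per-task numbers in these tables can be derived from a matrix of accuracies measured after each learning session. The sketch below assumes that forgetting is the final accuracy on a task minus the accuracy measured right after that task was learned, which is consistent with the bracketed values but is not spelled out in this section. The function name and the toy matrix are hypothetical; the last lines recompute one reported average from Table 4.

```python
def final_accuracy_and_forgetting(acc):
    """acc[i][j]: accuracy (in %) on task j right after learning task i (j <= i).

    Returns a list of (final accuracy, forgetting) per task and the average
    final accuracy. Forgetting is the final accuracy minus the accuracy right
    after the task was learned, so negative values mean forgetting.
    """
    last = len(acc) - 1
    per_task = [
        (acc[last][j], round(acc[last][j] - acc[j][j], 1)) for j in range(len(acc))
    ]
    average = round(sum(a for a, _ in per_task) / len(per_task), 1)
    return per_task, average


# Toy accuracy matrix for three tasks (illustrative numbers only).
acc = [
    [30.0],
    [25.0, 40.0],
    [20.0, 35.0, 50.0],
]
per_task, average = final_accuracy_and_forgetting(acc)
print(per_task, average)

# Recomputing the TFM average on Flowers from the per-task values in Table 4.
tfm_flowers = [36.4, 53.8, 45.5, 37.6]
print(round(sum(tfm_flowers) / len(tfm_flowers), 1))  # 43.3
```

For a method without forgetting, every row of the matrix repeats the earlier entries unchanged, so all bracketed values are 0.0.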

Tiny ImageNet - classes randomly split
Approach Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 Task 8 Task 9 Task 10 Avg.
(1-20) (21-40) (41-60) (61-80) (81-100) (101-120) (121-140) (141-160) (161-180) (181-200) all
Finetuning 38.1 (-13.6) 36.0 (-13.7) 43.2 (-16.0) 44.1 (-18.6) 45.5 (-12.6) 54.5 (-13.6) 50.3 (-15.7) 50.5 (-13.4) 51.0 (-13.1) 61.2 (0.0) 47.4
Freezing 51.7 (0.0) 36.4 (0.0) 39.5 (0.0) 41.7 (0.0) 42.9 (0.0) 46.2 (0.0) 45.7 (0.0) 41.1 (0.0) 41.2 (0.0) 40.9 (0.0) 42.7
Joint 58.6 (+6.9) 53.9 (+8.3) 59.1 (+3.7) 61.8 (+7.9) 57.7 (+2.9) 66.0 (+2.6) 64.0 (+3.1) 60.2 (+5.9) 57.9 (+1.0) 53.8 (0.0) 59.3
LfL jung2016less 32.4 (-18.9) 35.4 (-17.0) 43.4 (-15.7) 44.1 (-20.2) 45.0 (-15.0) 55.9 (-14.5) 49.4 (-16.1) 51.1 (-12.4) 58.6 (-8.0) 61.4 (0.0) 47.7
LwF li2017learning 45.1 (-6.6) 45.5 (-2.2) 53.5 (-4.6) 57.6 (-2.6) 56.2 (0.0) 65.7 (+0.4) 63.5 (-0.3) 58.4 (-1.9) 59.6 (-0.3) 58.5 (0.0) 56.4
IMM-mode lee2017overcoming 50.6 (-1.1) 38.5 (+0.3) 44.7 (-0.1) 49.2 (+0.3) 47.5 (+1.1) 51.9 (-1.4) 53.7 (-0.6) 47.7 (-0.4) 50.0 (-2.2) 48.7 (0.0) 48.3
EWC kirkpatrick2017overcoming 33.9 (-17.4) 35.4 (-14.4) 43.6 (-15.4) 46.7 (-15.9) 49.5 (-9.1) 52.5 (-15.8) 47.8 (-20.0) 50.2 (-13.8) 56.6 (-9.9) 61.4 (0.0) 47.8
HAT serra2018overcoming 46.8 (-0.2) 49.1 (+0.8) 55.8 (+0.2) 58.0 (-0.2) 53.7 (+0.3) 61.0 (+0.1) 58.7 (0.0) 54.0 (-0.1) 54.6 (-0.1) 50.3 (0.0) 54.2
PackNet mallya2018packnet 52.5 (0.0) 49.7 (0.0) 56.5 (0.0) 59.8 (0.0) 55.0 (0.0) 64.7 (0.0) 61.7 (0.0) 55.9 (0.0) 55.2 (0.0) 52.5 (0.0) 56.4
TFM w/o FN (Ours) 49.6 (0.0) 47.2 (0.0) 54.8 (0.0) 58.2 (0.0) 55.0 (0.0) 64.0 (0.0) 59.3 (0.0) 53.6 (0.0) 55.5 (0.0) 51.9 (0.0) 54.9
TFM (Ours) 48.2 (0.0) 47.7 (0.0) 56.7 (0.0) 58.2 (0.0) 54.8 (0.0) 62.2 (0.0) 61.5 (0.0) 57.3 (0.0) 58.5 (0.0) 54.8 (0.0) 56.0
Table 5: Comparison with the state-of-the-art. Tiny ImageNet on VGGnet from scratch. Accuracy of each task after learning all tasks. Number between brackets indicates forgetting. Classes are randomly split and fixed for all approaches.

4.3 Task-similarity effects on tiny ImageNet

Next, we experiment on several ten-task splits of tiny ImageNet. We compare our proposed approach (TFM) with two distillation methods with low overhead (LfL jung2016less , LwF li2017learning ), two of the best known model-based methods (EWC kirkpatrick2017overcoming , IMM lee2017overcoming ) and two of the most recent mask-based methods (HAT serra2018overcoming , PackNet mallya2018packnet ). We also compare the approaches to three baselines (Finetune, Freezing, Joint). The setup uses the VGGnet introduced in Sec. 4.1. The model is trained from scratch on 10 tasks with the same number of classes. We evaluate all approaches under the same conditions on a random tiny ImageNet partition (see Table 5) and on a semantically similar partition (see Table 6). For further information on the latter, see Appendix B.

Tiny ImageNet - classes semantically split
Approach Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 Task 8 Task 9 Task 10 Avg.
fly anim. small artif. hobbies land anim. big artif. food pets/aquatic wearables transport scenes all
Finetuning 17.1 (-34.2) 19.7 (-17.7) 20.9 (-24.5) 16.7 (-30.5) 20.8 (-28.7) 29.2 (-22.0) 30.7 (-21.0) 25.2 (-17.8) 40.2 (-18.9) 59.9 (0.0) 28.0
Freezing 51.3 (0.0) 28.5 (0.0) 27.2 (0.0) 29.6 (0.0) 29.0 (0.0) 35.0 (0.0) 31.7 (0.0) 23.9 (0.0) 37.0 (0.0) 34.7 (0.0) 32.8
Joint 55.0 (+3.7) 41.9 (+7.9) 46.2 (+6.3) 44.9 (+4.9) 44.7 (+7.1) 49.0 (+4.8) 46.6 (+4.9) 36.4 (+4.7) 51.2 (+5.7) 51.1 (0.0) 46.7
LfL jung2016less 17.2 (-34.1) 18.4 (-21.0) 21.5 (-24.0) 18.7 (-30.5) 20.2 (-28.6) 27.4 (-23.9) 28.4 (-22.3) 26.0 (-18.4) 41.2 (-17.6) 59.1 (0.0) 27.8
LwF li2017learning 34.0 (-13.9) 18.4 (-14.5) 32.6 (-0.8) 36.5 (-5.6) 40.1 (-0.5) 43.1 (-2.5) 41.8 (-1.3) 32.7 (-1.1) 50.3 (-0.5) 48.1 (0.0) 37.8
IMM-mode lee2017overcoming 42.3 (-9.0) 28.8 (+0.1) 26.5 (-3.1) 30.7 (-3.6) 32.5 (-3.1) 28.8 (-13.0) 35.4 (-6.2) 27.3 (-3.7) 43.6 (-4.9) 42.7 (0.0) 33.9
EWC kirkpatrick2017overcoming 20.2 (-31.1) 18.5 (-19.3) 20.2 (-26.3) 20.9 (-28.9) 24.7 (-22.8) 25.5 (-27.5) 28.7 (-23.4) 23.0 (-19.6) 39.8 (-20.2) 56.8 (0.0) 27.8
HAT serra2018overcoming 44.6 (+0.4) 34.8 (+0.2) 40.8 (-0.1) 45.4 (+0.4) 40.8 (-2.5) 49.8 (0.0) 44.9 (-0.2) 33.1 (-1.7) 51.9 (0.0) 53.8 (0.0) 44.0
PackNet mallya2018packnet 47.0 (0.0) 35.7 (0.0) 42.7 (0.0) 48.6 (0.0) 45.8 (0.0) 48.1 (0.0) 45.9 (0.0) 38.3 (0.0) 51.2 (0.0) 49.1 (0.0) 45.2
TFM w/o FN (Ours) 46.4 (0.0) 34.7 (0.0) 38.8 (0.0) 44.1 (0.0) 42.0 (0.0) 48.3 (0.0) 46.5 (0.0) 35.7 (0.0) 52.0 (0.0) 54.8 (0.0) 44.3
TFM (Ours) 46.4 (0.0) 37.2 (0.0) 40.4 (0.0) 44.1 (0.0) 44.2 (0.0) 48.2 (0.0) 46.4 (0.0) 37.5 (0.0) 53.5 (0.0) 54.7 (0.0) 45.3
Table 6: Comparison with the state-of-the-art. Tiny ImageNet on VGGnet from scratch. Accuracy of each task after learning all tasks. Number between brackets indicates forgetting. Classes are split by semantic closeness and fixed for all approaches.

In the case of the random splits (see Table 5), most methods obtain quite good results on the last tasks, with minor to no forgetting. LfL, IMM and EWC provide some improvement over Finetune. LwF performs very well because the tasks are quite similar. All mask-based models perform similarly, with PackNet performing best.

In the semantically similar splits (see Table 6), where the distribution differs more between tasks than in the random case, some approaches have difficulty avoiding catastrophic forgetting as the sequence gets longer. Interestingly, the good results of LwF on the random split are not repeated on the semantic splits. As observed before, LwF fails when there are large changes in the feature distributions between tasks aljundi2016expert . Mask-based models again outperform all other approaches, with TFM performing best.

Both Tables 5 and 6 show the accuracy and forgetting of each task after training on all 10 tasks, and thus after having learned the 200 tiny ImageNet classes. The results show that mask-based approaches achieve a better overall performance than other approaches on both splits, getting close to the joint training baseline. Relative to the other approaches, freezing the network after the first task and learning only the head for the remaining tasks works better in the semantically similar splits than in the random splits. Furthermore, in Table 6, the Freezing baseline offers better results than LfL, and is better than or competitive with the model-based approaches.

Tiny ImageNet - larger first task
Approach Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 Task 8 Task 9 Task 10 Avg.
(1-110) (111-120) (121-130) (131-140) (141-150) (151-160) (161-170) (171-180) (181-190) (191-200) (111+)
Finetuning 18.4 (-33.2) 39.6 (-34.4) 56.6 (-21.0) 58.0 (-23.8) 44.6 (-32.8) 63.0 (-21.8) 51.0 (-27.2) 51.4 (-19.0) 70.4 (-14.0) 80.0 (0.0) 57.2
Freezing 51.6 (0.0) 68.6 (0.0) 70.2 (0.0) 77.9 (0.0) 68.1 (0.0) 78.8 (0.0) 72.9 (0.0) 64.2 (0.0) 76.7 (0.0) 70.2 (0.0) 69.9
LfL jung2016less 16.7 (-33.9) 58.3 (-6.6) 59.5 (-2.9) 64.6 (-2.2) 58.2 (-1.4) 64.7 (-0.6) 63.4 (+0.5) 54.4 (+0.1) 61.0 (0.0) 56.5 (0.0) 60.0
LwF li2017learning 10.8 (-40.0) 29.5 (-43.5) 44.6 (-30.4) 61.2 (-21.2) 55.5 (-15.1) 73.3 (-8.8) 71.7 (-3.5) 62.3 (-1.8) 77.8 (-2.2) 74.3 (0.0) 61.1
IMM-mode lee2017overcoming 26.1 (-26.3) 50.4 (-13.8) 59.5 (-19.3) 61.6 (22.8) 54.0 (-21.8) 63.2 (-22.5) 58.2 (-19.9) 56.0 (-13.9) 75.0 (-6.7) 79.9 (0.0) 62.0
EWC kirkpatrick2017overcoming 51.8 (-0.7) 24.8 (-0.1) 43.3 (-1.2) 61.8 (-0.5) 55.5 (-0.3) 70.8 (-0.7) 67.6 (-0.1) 53.8 (-0.6) 70.1 (-1.2) 61.7 (0.0) 56.6
HAT serra2018overcoming 46.1 (0.0) 60.1 (-0.1) 68.2 (+0.2) 73.2 (+0.1) 63.2 (+0.1) 76.2 (+0.1) 67.4 (0.0) 58.1 (-0.1) 73.2 (0.0) 59.1 (0.0) 66.5
PackNet mallya2018packnet 47.6 (0.0) 74.0 (0.0) 74.2 (0.0) 79.0 (0.0) 65.2 (0.0) 76.2 (0.0) 69.4 (0.0) 61.4 (0.0) 73.4 (0.0) 64.8 (0.0) 70.8
TFM w/o FN (Ours) 49.6 (0.0) 69.8 (0.0) 71.1 (0.0) 79.8 (0.0) 68.4 (0.0) 78.4 (0.0) 72.9 (0.0) 64.8 (0.0) 75.9 (0.0) 70.2 (0.0) 72.4
TFM (Ours) 49.9 (0.0) 70.4 (0.0) 71.4 (0.0) 80.8 (0.0) 70.5 (0.0) 79.4 (0.0) 73.9 (0.0) 64.4 (0.0) 76.5 (0.0) 72.1 (0.0) 73.3
Table 7: Comparison with the state-of-the-art. Tiny ImageNet on VGGnet from scratch. Accuracy of each task after learning all tasks. Numbers between brackets indicate forgetting. Average over the smaller tasks 2 to 10.

4.4 Effect of starting-task size on tiny ImageNet

We propose an experimental setup where the first task of tiny ImageNet uses 110 classes (55%) while the remaining 9 tasks use 10 classes (5%) each. This allows most of the methods to start with a rich representation after learning the first task. In this setup, comparing existing methods with the Freezing baseline is interesting. Table 7 shows the results of this scenario. Note that in the last column we average only over the smaller tasks (T2-T10). EWC shows little forgetting, keeping knowledge while learning new tasks. However, protecting the first task from forgetting makes the model too rigid, so it learns the rest of the tasks with more difficulty and has a lower overall performance. Lowering the trade-off hyperparameter instead leads to severe catastrophic forgetting of the first tasks. Distillation approaches try to keep representations the same as new tasks are learned. However, small changes in the weights cause forgetting later in the sequence. HAT works fine but, with a limited capacity to make changes, ends up not learning the new tasks as easily. Freezing the network after the first task seems to be one of the best options in this setup, since the rich representation of the first 110 classes is a good starting point to learn the rest of the tasks with a simple classifier. We therefore advocate that the Freezing baseline should always be included in continual learning comparisons, since it often provides a much harder baseline than Finetuning. Only PackNet and TFM are able to improve over that baseline even though they start from a smaller capacity, with TFM having the best results.
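The task layout of this setup (one large first task followed by equally sized small tasks) can be generated as in the following sketch. The function name is hypothetical, and class identifiers are 0-indexed here, whereas the header of Table 7 counts classes from 1.

```python
def make_task_splits(num_classes, first_task_size, num_tasks):
    """Split class ids 0..num_classes-1 into a larger first task and equal later tasks."""
    remaining = num_classes - first_task_size
    if remaining % (num_tasks - 1) != 0:
        raise ValueError("remaining classes must divide evenly among the later tasks")
    step = remaining // (num_tasks - 1)
    splits = [list(range(first_task_size))]
    for start in range(first_task_size, num_classes, step):
        splits.append(list(range(start, start + step)))
    return splits


splits = make_task_splits(num_classes=200, first_task_size=110, num_tasks=10)
print([len(s) for s in splits])  # [110, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```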

ImageNet - classes randomly split
Approach Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 Task 8 Task 9 Task 10 Avg.
(1-100) (101-200) (201-300) (301-400) (401-500) (501-600) (601-700) (701-800) (801-900) (901-1000) all
Finetuning 25.8 (-43.0) 32.2 (-36.2) 31.4 (-35.3) 37.8 (-27.7) 39.1 (-27.7) 43.7 (-25.7) 46.0 (-22.8) 50.0 (-16.5) 53.4 (-12.1) 63.7 (0.0) 42.3
Freezing 68.8 (0.0) 53.5 (0.0) 52.0 (0.0) 51.2 (0.0) 51.3 (0.0) 53.9 (0.0) 52.2 (0.0) 53.9 (0.0) 51.7 (0.0) 51.2 (0.0) 54.0
LwF li2017learning 27.6 (-41.2) 37.2 (-19.9) 42.0 (-22.6) 44.4 (-20.9) 50.5 (-14.1) 56.6 (-11.3) 57.9 (-9.1) 61.2 (-5.0) 62.0 (-1.3) 62.7 (0.0) 50.2
IMM-mode lee2017overcoming 68.5 (-0.3) 53.6 (0.0) 52.1 (0.0) 51.7 (-0.1) 52.5 (+0.3) 55.5 (+0.2) 54.7 (+0.1) 53.5 (0.0) 54.2 (+0.1) 51.8 (0.0) 54.8
EWC kirkpatrick2017overcoming 21.8 (-47.0) 26.5 (-41.7) 29.5 (-36.5) 32.9 (-32.6) 35.6 (-30.9) 40.4 (-28.1) 40.0 (-26.2) 44.7 (-20.7) 47.8 (-16.2) 61.1 (0.0) 38.0
PackNet mallya2018packnet 67.5 (0.0) 65.8 (0.0) 62.2 (0.0) 58.4 (0.0) 58.6 (0.0) 58.7 (0.0) 56.0 (0.0) 56.5 (0.0) 54.1 (0.0) 53.6 (0.0) 59.1
TFM (Ours) 63.6 (0.0) 62.2 (0.0) 60.1 (0.0) 61.6 (0.0) 62.6 (0.0) 64.5 (0.0) 64.0 (0.0) 63.7 (0.0) 63.0 (0.0) 59.9 (0.0) 62.5
Table 8: Comparison with the state-of-the-art. ImageNet on AlexNet from scratch. Accuracy of each task after learning all tasks. Number between brackets indicates forgetting.
Figure 3: Per task accuracy for ImageNet on AlexNet from scratch. Graph style from de2019continual . Best viewed in color.
Figure 4: Average accuracy progress after learning each task from ImageNet on AlexNet from scratch. Best viewed in color.

4.5 ImageNet

Most of the compared task-aware approaches have not been evaluated on a large-scale dataset such as ImageNet. We therefore compare our proposed method (TFM) with some of those state-of-the-art approaches. In Table 8 we can see that TFM outperforms all the other approaches at the end of learning ImageNet split into 10 tasks of random classes. The evolution of the results for each task is shown in Fig. 3. LwF does well when learning each new task with the help of the representation of the previous tasks. However, as more tasks are included, the older tasks are forgotten more. IMM (mode) has the opposite effect: it focuses on intransigence and tries to keep the knowledge of the older tasks, running out of capacity for the newer tasks. This allows the approach to forget little and even show a small backward transfer, but at the cost of performing worse on newer tasks. EWC has one of the worst performances, possibly due to the difficulty of obtaining a good approximation of the Fisher Information Matrix (FIM) when there are so many classes per task. Both PackNet and TFM have a good overall performance without forgetting, and rely on the capacity of the network more than the other approaches. As shown in Fig. 4, PackNet performs better during the first three tasks, taking advantage of the compression power of pruning and finetuning. However, as the remaining capacity of the network gets smaller, TFM is capable of growing at a more scalable pace, performing better on the remaining seven tasks and achieving the best results overall.

4.6 Comparison of memory usage

Previous experiments show that mask-based approaches are better at overcoming catastrophic forgetting in task-aware settings. However, unlike most distillation and model-based approaches, they use some overhead memory during inference. In Fig. 5 we visualize (in log scale) the absolute memory overhead used by the mask-based approaches on the ImageNet experiment in Sec. 4.5. Considering that the network used is around 220Mb, we can observe that the approaches that use embeddings or masks on the features (HAT, TFM) have a negligible overhead in comparison to the weight-masked approaches (PackNet). It should also be noted that mask-based approaches keep the same overhead during training, while distillation and model-based approaches usually at least duplicate the network size. In conclusion, our method has a memory usage similar to HAT while outperforming it in all proposed experiments, and it is significantly more memory efficient than PackNet, whose performance we either match or outperform.
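To give a sense of where the difference in overhead comes from, the sketch below packs a ternary feature mask into two bits per feature and compares the mask size of a single convolutional layer under feature masking and weight masking. The layer dimensions are hypothetical, and the weight mask is assumed to cost one bit per weight per task, which simplifies how weight-based methods store their masks.

```python
def pack_ternary(mask):
    """Pack ternary states (0, 1, 2) into bytes, four states per byte."""
    packed = bytearray((len(mask) + 3) // 4)
    for i, state in enumerate(mask):
        packed[i // 4] |= state << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed, length):
    """Recover the first `length` ternary states from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(length)]


mask = [1, 1, 1, 1, 2, 2, 0]
packed = pack_ternary(mask)
assert unpack_ternary(packed, len(mask)) == mask
print(len(packed))  # 2 bytes for 7 features

# Per-task mask size for one hypothetical convolutional layer.
out_channels, in_channels, kernel = 256, 384, 3
feature_mask_bits = 2 * out_channels  # two bits per feature
weight_mask_bits = out_channels * in_channels * kernel * kernel  # assumed 1 bit per weight
print(weight_mask_bits // feature_mask_bits)  # 1728
```

For this layer the feature mask is roughly three orders of magnitude smaller than the weight mask, because its size depends only on the number of output features and not on the number of connections.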

5 Conclusions

For many practical applications, it is important that network accuracy on previous tasks does not deteriorate when learning new tasks. Therefore, in this paper, we propose a new method for continual learning which does not suffer from any forgetting. Unlike previous methods, which apply masks to the weights, we propose to move the mask to the features (activations). This greatly reduces the number of extra parameters added per task and reduces the overhead that other approaches incur. In addition, we propose to apply a task-specific feature normalization, which allows adjusting previously learned features to new tasks. In ablation experiments this was found to improve the results of the ternary feature masks. Furthermore, when compared to a wide range of other continual learning techniques, our method consistently outperforms these methods on a variety of datasets.

Figure 5: Log scale overhead growth for ImageNet on AlexNet. Best viewed in color.

Appendix A A note on choosing expansion rates

The continual learning philosophy states the rule of not using data from previous tasks when learning new ones; only data of the current task can be used at each step of the setup. It is also common in machine learning setups to use part of the training set as validation in order to choose the best hyperparameters. Therefore, when learning each task, a validation set of the task at hand can be used to train the network while avoiding overfitting, but no other data can be used (neither from the test set nor from previous or future tasks). It is important to state that we strictly comply with these rules in the experiments we propose.

As explained in Section 3.4, our proposed approach can be expanded as needed in order to learn new tasks without having to change the connections from previous tasks. However, the flexibility of choosing how many features to add to each layer can easily become a rabbit hole of architecture optimization. Because of that, we propose a simple setup that makes our approach comparable to the other state-of-the-art approaches. We take the maximum layer size for all approaches to be the same as in the VGGnet or AlexNet architectures for the experiments in Section 4. This way, all approaches have a similar number of parameters.

Task Semantic group Classes
1 Animals (flying & insects): scorpion, black widow, tarantula, spider web, centipede, trilobite, grasshopper, stick insect, cockroach, mantis, ladybug, dragonfly, monarch butterfly, sulphur butterfly, fly, bee, goose, black stork, king penguin, albatross.
2 Artifacts: abacus, binoculars, candle, chain, chest, dumbbell, hourglass, lampshade, magnetic compass, nail, pill bottle, computer keyboard, acorn, plunger, syringe, teddy bear, torch, comic book, remote control, umbrella.
3 Music, Sport and Kitchen: basketball, punching bag, rugby ball, scoreboard, stopwatch, volleyball, CD player, drumstick, iPod, oboe, organ, refrigerator, cask, plate, wooden spoon, teapot, frying pan, beaker, bucket, dining table.
4 Animals: brown bear, red panda, koala, pig, ox, bison, bighorn sheep, gazelle, dromedary, African elephant, orangutan, chimpanzee, baboon, cougar, lion, European fire salamander, bullfrog, tailed frog, American alligator, boa constrictor.
5 Artifacts: altar, maypole, bannister, flagpole, fountain, parking meter, pay-phone, pole, cash machine, birdhouse, bathtub, rocking chair, potter’s wheel, sewing machine, space heater, turnstile, memorial tablet, desk, vestment, reel.
6 Food: water jug, wok, guacamole, ice cream, lollipop, pretzel, mashed potato, cauliflower, bell pepper, mushroom, orange, lemon, banana, pomegranate, meat loaf, pizza, potpie, espresso, soda bottle, beer bottle.
7 Animals (water & pets): dugong, goldfish, jellyfish, brain coral, American lobster, spiny lobster, sea slug, sea cucumber, guinea pig, snail, slug, poodle, Chihuahua, Yorkshire terrier, golden retriever, Labrador retriever, German shepherd, tabby cat, Persian cat, Egyptian cat.
8 Clothes and: academic gown, poncho, apron, backpack, bikini, bow tie, cardigan, fur coat, gasmask, kimono, military uniform, miniskirt, neck brace, Christmas stocking, sandal, snorkel, sock, sombrero, sunglasses, swimming trunks.
9 Transport: bullet train, station wagon, freight car, go-kart, rickshaw, lifeboat, limousine, moving van, police van, school bus, convertible, crane, trolleybus, sports car, tractor, gondola, broom, cannon, lawn mower, missile.
10 Buildings and scenes: barbershop, barn, lighthouse, butcher shop, candy store, water tower, triumphal arch, suspension bridge, steel arch bridge, viaduct, thatched roof, cliff dwelling, dam, obelisk, picket fence, cliff, coral reef, lakeside, seacoast, alp.
Table 9: Semantically similar splits for tiny ImageNet.
Figure 6: Network growth with ternary feature masks over three tasks.

The differences between Figures 1 and 2 are further explained in Fig. 6. During the first task, the only features used are those with a grey border. This means that for the two layers shown, task 1 is learned using 8 features (with 12 connections). Once task 2 arrives, we fix the grey-border features and expand the network with the green-border ones. The new task then uses the existing network and expands it with 5 features (with 24 new connections). The masks for the green-border features are set to unused for task 1, while they are learnable for task 2. Finally, when task 3 arrives, 5 more features are added (with 36 new connections). The masks corresponding to the new features are set to unused for tasks 1 and 2, and to learnable for task 3. This way of expanding the network also shows that, as we learn more tasks, more knowledge is available from previous ones. This opens the possibility of adding fewer and fewer features over time, since each added feature creates more connections to learn. In practice, one can imagine that when learning new tasks that are very similar to previous ones, no new features will have to be added, and the current network knowledge together with a task-specific head will be enough. Further research and analysis on the specific details of each layer expansion and architecture is left for future work.

Appendix B Tiny ImageNet semantic splits

The semantically similar splits of tiny ImageNet used for the experiments in Table 6 are grouped as described in Table 9.


  • (1) R. M. French, Catastrophic forgetting in connectionist networks, in: Trends in cognitive sciences, Vol. 3, Elsevier, 1999, pp. 128–135.
  • (2) I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, Y. Bengio, An empirical investigation of catastrophic forgetting in gradient-based neural networks, arXiv preprint arXiv:1312.6211.
  • (3) M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of learning and motivation, Vol. 24, Elsevier, 1989, pp. 109–165.
  • (4) X. Li, Y. Grandvalet, F. Davoine, A baseline regularization scheme for transfer learning with convolutional neural networks, Pattern Recognition 98 (2020) 107049.
  • (5) M. Mermillod, A. Bugaiska, P. Bonin, The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects, Frontiers in psychology 4 (2013) 504.
  • (6) R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, T. Tuytelaars, Memory aware synapses: Learning what (not) to forget, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • (7) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences (2017) 3521–3526.
  • (8) S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, B.-T. Zhang, Overcoming catastrophic forgetting by incremental moment matching, in: Advances in Neural Information Processing Systems, 2017, pp. 4655–4665.
  • (9) A. Mallya, D. Davis, S. Lazebnik, Piggyback: Adapting a single network to multiple tasks by learning to mask weights, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • (10) A. Mallya, S. Lazebnik, Packnet: Adding multiple tasks to a single network by iterative pruning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • (11) A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks, arXiv preprint arXiv:1606.04671.
  • (12) D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, in: Advances in Neural Information Processing Systems, 2017.
  • (13) S.-A. Rebuffi, A. Kolesnikov, C. H. Lampert, iCaRL: Incremental classifier and representation learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • (14) A. Chaudhry, M. Ranzato, M. Rohrbach, M. Elhoseiny, Efficient lifelong learning with a-gem, in: Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • (15) M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, G. Tesauro, Learning to learn without forgetting by maximizing transfer and minimizing interference, in: Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • (16) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
  • (17) J. Serra, D. Suris, M. Miron, A. Karatzoglou, Overcoming catastrophic forgetting with hard attention to the task, in: International Conference on Machine Learning (ICML), 2018.
  • (18) R. Aljundi, M. Rohrbach, T. Tuytelaars, Selfless sequential learning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • (19) X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. Lopez, A. D. Bagdanov, Rotate your networks: Better weight consolidation and less catastrophic forgetting, in: International Conference on Pattern Recognition (ICPR), 2018.
  • (20) F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, in: International Conference on Machine Learning (ICML), 2017, pp. 3987–3995.
  • (21) Z. Li, D. Hoiem, Learning without forgetting, IEEE transactions on pattern analysis and machine intelligence 40 (12) (2017) 2935–2947.
  • (22) H. Jung, J. Ju, M. Jung, J. Kim, Less-forgetting learning in deep neural networks, arXiv preprint arXiv:1607.00122.
  • (23) A. Rannen, R. Aljundi, M. B. Blaschko, T. Tuytelaars, Encoder based lifelong learning, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1320–1328.
  • (24) J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, C.-C. J. Kuo, Class-incremental learning via deep model consolidation, arXiv preprint arXiv:1903.07864.
  • (25) A. Chaudhry, P. K. Dokania, T. Ajanthan, P. H. Torr, Riemannian walk for incremental learning: Understanding forgetting and intransigence, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • (26) R. Aljundi, P. Chakravarty, T. Tuytelaars, Expert gate: Lifelong learning with a network of experts, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • (27) J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, R. Hadsell, Progress & compress: A scalable framework for continual learning, in: International Conference on Machine Learning (ICML), 2018.
  • (28) X. Li, Y. Zhou, T. Wu, R. Socher, C. Xiong, Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting, in: International Conference on Machine Learning (ICML), 2019.
  • (29) C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, D. Wierstra, Pathnet: Evolution channels gradient descent in super neural networks, arXiv preprint arXiv:1701.08734.
  • (30) W. S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, The bulletin of mathematical biophysics 5 (4) (1943) 115–133.
  • (31) V. Losing, B. Hammer, H. Wersing, Incremental on-line learning: A review and comparison of state of the art algorithms, Neurocomputing 275 (2018) 1261–1274.
  • (32) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
  • (33) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
  • (34) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • (35) M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • (36) V. Dumoulin, J. Shlens, M. Kudlur, A learned representation for artistic style, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • (37) M. Masana, J. van de Weijer, L. Herranz, A. D. Bagdanov, J. M. Alvarez, Domain-adaptive deep network compression, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • (38) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • (39) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
  • (40) M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: IEEE Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
  • (41) C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The caltech-ucsd birds-200-2011 dataset.
  • (42) B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, L. Fei-Fei, Human action recognition by learning bases of action attributes and parts, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
  • (43) M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, T. Tuytelaars, Continual learning: A comparative study on how to defy forgetting in classification tasks, arXiv preprint arXiv:1909.08383.