Compacting, Picking and Growing for Unforgetting Continual Learning

10/15/2019 ∙ by Steven C. Y. Hung, et al.

Continual lifelong learning is essential to many applications. In this paper, we propose a simple but effective approach to continual deep learning. Our approach leverages the principles of deep model compression with weight pruning, critical weights selection, and progressive network expansion. By enforcing their integration in an iterative manner, we introduce an incremental learning method that is scalable to the number of sequential tasks in a continual learning process. Our approach is easy to implement and has several favorable characteristics. First, it can avoid forgetting (i.e., learn new tasks while remembering all previous tasks). Second, it allows model expansion but can maintain the model compactness when handling sequential tasks. Besides, through our compaction and selection/expansion mechanism, we show that the knowledge accumulated through learning previous tasks helps build a better model for the new tasks compared to training the models independently for each task. Experimental results show that our approach can incrementally learn a deep model to tackle multiple tasks without forgetting, while maintaining model compactness and achieving better performance than individual task training.




1 Introduction

Continual lifelong learning Thrun (1995); Parisi et al. (2019) has received much attention in recent deep learning studies. In this research track, we hope to learn a model capable of handling unknown sequential tasks while keeping the performance of the model on previously learned tasks. In continual lifelong learning, the training data of previous tasks are assumed to be unavailable when new tasks arrive. Although the model learned can be used as a pre-trained model, fine-tuning a model for the new task will force the model parameters to fit the new data, which causes catastrophic forgetting McClelland et al. (1995); Pfulb and Gepperth (2019) on previous tasks.

To lessen the effect of catastrophic forgetting, techniques that regularize gradients or weights during training have been studied Kirkpatrick et al. (2017); Zenke et al. (2017); Lee et al. (2017); Ritter et al. (2018). In Kirkpatrick et al. (2017) and Zenke et al. (2017), the proposed algorithms regularize the network weights in the hope of finding a common solution for the current and previous tasks. Schwarz et al. (2018) introduce a network-distillation method for regularization, which imposes constraints on the neural weights adapted from the teacher to the student network and applies elastic weight consolidation (EWC) Kirkpatrick et al. (2017) for incremental training. The regularization-based approaches reduce the effect of catastrophic forgetting. However, as the training data of previous tasks are missing during learning and the network capacity is fixed (and limited), the regularization approaches often forget the learned skills gradually, and earlier tasks tend to be forgotten more severely. Hence, they are not a favorable choice for continual learning problems when the number of sequential tasks is unlimited.

To address the data-missing issue (i.e., the lack of training data for old tasks), data-preserving and memory-replay techniques have been introduced. Data-preserving approaches (such as Rebuffi et al. (2017); Brahma and Othon (2018); Hu et al. (2019); Riemer et al. (2019b)) are designed to directly save important data or latent codes in an efficient form, while memory-replay approaches Shin et al. (2017); Wu et al. (2018b); Kemker and Kanan (2018); Wu et al. (2018a); Ostapenko et al. (2019) introduce additional memory models such as GANs for keeping data information or distribution in an indirect way. The memory models have the ability to replay previous data. Based on past data information, we can then train a model such that the performance on the old tasks is recovered to a considerable extent. However, a general issue of memory-replay approaches is that they require explicit re-training using the accumulated old information, which leads to either a large working memory or a compromise between the information memorized and forgetting.

This paper introduces an approach for learning sustainable but compact deep models, which can handle an unlimited number of sequential tasks while avoiding forgetting. As a fixed architecture cannot be guaranteed to remember the skills incrementally learned from unlimited tasks, our approach allows growing the architecture to some extent. Nevertheless, we also remove redundancy from the model during continual learning, and thus our approach can compact an increasing number of tasks into a neural network with very limited model expansion.

Besides, pre-training or gradually fine-tuning the models from a starting task only incorporates prior knowledge at initialization; hence, the knowledge of past tasks diminishes as new tasks are learned. As humans have the ability to continually acquire, fine-tune and transfer knowledge and skills throughout their lifespan Parisi et al. (2019), in lifelong learning we would hope that the experience accumulated from previous tasks helps in learning a new task. As the model incrementally learned by our method serves as a compact, unforgetting base, it generally yields a better model for the subsequent tasks than training the tasks independently. Experimental results reveal that our lifelong learning method can leverage the knowledge accumulated from the past to enhance the performance of new tasks.

Figure 1: Compacting, Picking, and Growing (CPG) continual learning. Given a well-trained model, gradual pruning is applied to compact the model to release redundant weights. The compact model weights are kept to avoid forgetting. Then a learnable binary weight-picking mask is trained along with previously released space for new tasks to effectively reuse the knowledge of previous tasks. The model can be expanded for new tasks if it does not meet the performance goal. Best viewed in color.

Motivation of Our Method Design:

Our method is designed by combining the ideas of deep model compression via weight pruning (Compacting), critical weights selection (Picking), and ProgressiveNet extension (Growing). We refer to it as CPG; its rationale is given below.

As stated above, although the regularization or memory-replay approaches lessen the effect of forgetting, they often cannot guarantee that the performance on previous tasks is preserved. To exactly avoid forgetting, a promising way is to keep the old-task weights already learned Rusu et al. (2016); Xiao et al. (2014) and enlarge the network by adding nodes or weights for training new tasks. In ProgressiveNet Rusu et al. (2016), to ease the training of new tasks, the old-task weights are shared with the new ones but remain fixed, and only the new weights are adapted for the new task. As the old-task weights are kept, the performance of learned tasks is ensured. However, as the complexity of the model architecture is proportional to the number of tasks learned, this results in a highly redundant structure for keeping multiple models.

Motivated by ProgressiveNet, we design a method that also allows the architecture to be sustained. To avoid constructing a complex and huge structure like ProgressiveNet, we perform model compression for the current task every time so that a condensed model is established for the old tasks. According to deep-net compression Han et al. (2016), there is much redundancy in a neural network, and removing the redundant weights does not affect the network performance. Our approach exploits this property and compresses the current task by deleting negligible weights. This yields a compressing-and-growing loop for a sequence of tasks. Following the idea of ProgressiveNet, the weights preserved for the old tasks are kept invariant to avoid forgetting in our approach. However, unlike ProgressiveNet, where the architecture is always grown for a new task, the weights deleted from the current task are released for use by new tasks, so we do not have to grow the architecture every time but can employ the previously released weights for learning the next task. Therefore, in the growing step of our CPG approach, two possible choices are provided. The first is to use the previously released weights for the new task. If the performance goal is still not fulfilled when all the released weights are used, we then proceed to the second choice, where the architecture is expanded and both the released and expanded weights are used for the new-task training.

Another distinction of our approach is the “picking” step. The idea is motivated as follows. In ProgressiveNet, the preserved old-task weights are all co-used (yet remain fixed) for learning the new tasks. However, as the number of tasks increases, the amount of old-task weights grows too. When all of them are co-used with the weights newly added in the growing step, the old weights (which are fixed) act like inertia, since only the fewer new weights are allowed to be adapted. In our experience, this tends to slow down the learning process and leads to an immature solution.

To address this issue, inspired by Mallya et al. (2018), we do not employ all the old-task weights but pick only some critical ones from them via a differentiable mask. In the picking step of our CPG approach, the old weights’ picking mask and the new weights added in the growing step are both adapted to learn an initial model for the new task. Then, likewise, the initial model obtained is compressed via weight pruning, and the compressed model is preserved for the new task as well.

To compress the weights for a task, a main difficulty is the lack of prior knowledge for determining the pruning ratio. To solve this problem, in the compacting step of our CPG approach, we employ the gradual pruning procedure Zhu and Gupta (2017), which iteratively prunes a small portion of weights and retrains the remaining weights to restore the performance. The procedure stops when a pre-defined accuracy goal can no longer be met. Note that only the newly added weights (from the released and/or expanded ones in the growing step) are allowed to be pruned, whereas the old-task weights remain unchanged.
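As a rough illustration of how sparsity can be raised gradually, the sketch below implements the cubic schedule proposed by Zhu and Gupta (2017). The parameter names and default values are assumptions chosen for illustration only. In CPG the final sparsity is not known in advance, since pruning is gated by the accuracy goal, so the schedule should be read as one possible way to pace the pruning steps rather than as the setting used in the experiments.

```python
def target_sparsity(step, s_init=0.0, s_final=0.9, t0=0, n=10, dt=100):
    """Cubic sparsity schedule of Zhu and Gupta (2017).

    step    : current training step
    s_init  : sparsity at the start of pruning (assumed default)
    s_final : sparsity at the end of pruning (assumed default)
    t0      : step at which pruning begins
    n, dt   : number of pruning steps and the spacing between them
    """
    # Clamp the step into the pruning window [t0, t0 + n * dt].
    t = min(max(step, t0), t0 + n * dt)
    progress = (t - t0) / (n * dt)
    # Sparsity rises quickly at first and flattens out near the end.
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3


# Sparsity targets at the start, middle and end of the window.
schedule = [target_sparsity(s) for s in (0, 500, 1000)]
```

The schedule prunes aggressively while the network still has abundant redundancy and slows down as fewer weights remain.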

Method Overview:

The CPG method alternately compresses the deep model and (selectively) expands the architecture. First, a compressed model is built by pruning. Given a new task, the weights of the old-task models are kept fixed. Next, we pick and re-use some of the old-task weights critical to the new task via a differentiable mask, and use the previously released weights for learning together. If the accuracy goal is not yet attained, the architecture can be expanded by adding filters or nodes to the model and resuming the procedure. Then we repeat the gradual pruning Zhu and Gupta (2017) (i.e., iteratively removing a portion of weights and retraining) to compact the model of the new task. An overview of our CPG approach is given in Figure 1. The new-task weights are formed by a combination of two parts: the first part is picked via a learnable mask on the old-task weights, and the second part is learned by gradual pruning/retraining of the extra weights. As the old-task weights are only picked and remain fixed, we can integrate the required function mappings in a compact model without affecting their accuracy in inference.

Our CPG approach includes several methods as its special cases. If the model is not compressed and the old weights are all picked, our approach is analogous to ProgressiveNet Rusu et al. (2016). If the model is compressed and the old weights are all picked, but the architecture expansion is forbidden, then our approach degenerates to PackNet Mallya and Lazebnik (2018), which packs multiple tasks sequentially in a single model, although PackNet does not employ gradual pruning but uses a fixed ratio for pruning the model directly. If the model is not compressed and the architecture expansion is forbidden, our approach degenerates to Piggyback Mallya et al. (2018), which selects weights from the pre-trained model of a base task to handle other tasks. The main characteristics of our approach are summarized as follows.

Avoid forgetting: Our approach ensures unforgetting. The function mappings previously built are maintained as exactly the same when new tasks are incrementally added.

Expand with shrinking: Our method allows expansion but keeps the compactness of the architecture, which can potentially handle unlimited sequential tasks. Experimental results reveal that multiple tasks can be condensed in a model with slight or no architecture growing.

Compact knowledge base: Experimental results show that the condensed model recorded for previous tasks serves as a knowledge base with accumulated experience for weights picking in our approach, which yields performance enhancement for learning new tasks.

2 Related Work

Continual lifelong learning Parisi et al. (2019) can be divided into three main categories: network regularization, memory or data replay, and dynamic architecture. Besides, task-free settings Aljundi et al. (2019) and formulations of continual learning as program synthesis Valkov et al. (2018) have also been studied recently. In the following, we give a brief review of works in the main categories; readers are referred to a recent survey paper Parisi et al. (2019) for more studies.

Network regularization: The key idea of network regularization approaches Kirkpatrick et al. (2017); Zenke et al. (2017); Lee et al. (2017); Ritter et al. (2018); Schwarz et al. (2018); Chaudhry et al. (2018); Dhar et al. (2019) is to restrict the updates of learned model weights. To keep the learned task information, penalties are added to the change of weights. EWC Kirkpatrick et al. (2017) uses Fisher information to evaluate the importance of weights for old tasks, and updates weights according to their degree of importance. Based on similar ideas, the method in Zenke et al. (2017) calculates the importance from the learning trajectory. Online EWC Schwarz et al. (2018) and EWC++ Chaudhry et al. (2018) improve the efficiency of EWC. Learning without Memorizing (LwM) Dhar et al. (2019) presents an information-preserving penalty. The approach builds an attention map and encourages the attention regions of the previous and current models to be consistent. These works alleviate catastrophic forgetting but cannot guarantee that the previous-task accuracy is exactly preserved.

Memory replay: Memory or data replay methods Rebuffi et al. (2017); Shin et al. (2017); Kemker and Kanan (2018); Brahma and Othon (2018); Wu et al. (2018b, a); Hu et al. (2019); Riemer et al. (2019b, a); Ostapenko et al. (2019) use additional models to remember data information. Generative Replay Shin et al. (2017) introduces GANs to lifelong learning. It uses a generator to sample fake data that have a distribution similar to that of previous data. New tasks can be trained with these generated data. Memory Replay GANs (MeRGANs) Wu et al. (2018a) shows that the forgetting phenomenon still exists in a generator, and that the quality of generated data becomes worse with incoming tasks. They use replay data to enhance the generator quality. Dynamic Generative Memory (DGM) Ostapenko et al. (2019) uses neural masking to learn connection plasticity in conditional generative models, and sets a dynamic expansion mechanism in the generator for sequential tasks. Although these methods can exploit data information, they still cannot guarantee the exact performance of past tasks.

Dynamic architecture: Dynamic-architecture approaches Rusu et al. (2016); Li and Hoiem (2018); Rosenfeld and Tsotsos (2018); Parisi et al. (2018); Yoon et al. (2018) adapt the architecture with a sequence of tasks. ProgressiveNet Rusu et al. (2016) expands the architecture for new tasks and keeps the function mappings by preserving the previous weights. LwF Li and Hoiem (2018) divides the model layers into two parts, shared and task-specific, where the former are co-used by tasks and the latter are grown with further branches for new tasks. DAN Rosenfeld and Tsotsos (2018) extends the architecture per new task, while each layer in the new-task model is a sparse linear combination of the original filters in the corresponding layer of a base model. Architecture expansion has also been adopted in a recent memory-replay approach Ostapenko et al. (2019) on GANs. These methods can considerably lessen or avoid catastrophic forgetting via architecture expansion, but the model grows monotonically and a redundant structure results.

As continually growing the architecture retains the model redundancy, some approaches perform model compression before expansion Yoon et al. (2018) so that a compact model can be built. The method most related to ours is Dynamic-expansion Net (DEN) Yoon et al. (2018). DEN reduces the weights of the previous tasks via sparse regularization. Newly added weights and old weights are both adapted for the new task with sparse constraints. However, DEN does not ensure non-forgetting. As the old-task weights are jointly trained with the new weights, part of the old-task weights are selected and modified under the sparse setting. Hence, a "Split & Duplication" step is introduced to further ‘restore’ some of the modified old weights, so as to lessen the forgetting effect.

Our approach is accomplished by a compacting → picking (→ growing) loop, which selects critical weights from old tasks without modifying them, and thus avoids forgetting. Besides, our approach does not have to restore the old-task performance as DEN does, because the performance is already kept. It thus avoids the tedious "Split & Duplication" process, which takes extra time for model adjustment and affects the new-task performance. Our approach is hence simpler and easier to implement. In the experimental results, we show that our approach also outperforms DEN to a considerable extent.

3 The CPG approach for Continual Lifelong Learning

Without loss of generality, our work follows a task-based sequential learning setup that is a common setting in continual learning. In the following, we present our method in the sequential-task manner.

Task 1: Given the first task (task-1) and an initial model trained via its dataset, we perform gradual pruning Zhu and Gupta (2017) on the model to remove its redundancy while keeping the performance. Instead of pruning weights to the target pruning ratio in one shot, gradual pruning iteratively removes a portion of the weights and retrains the model to restore the performance until the pruning criterion is met. The gradual pruning steps stabilize the compacting process. Thus, we compact the current model so that the redundancy among the model weights is removed (or released). The weights in the compact model are then set unalterable and remain fixed to avoid forgetting. After gradual pruning, the model weights can be divided into two parts: one is preserved for task 1; the other is released and can be employed by the subsequent tasks.
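The accuracy-gated pruning loop described for task 1 can be sketched as follows. This is a minimal toy version under stated assumptions, not the authors' implementation: the model is a flat list of floats, weights are ranked by magnitude, and `evaluate` and `retrain` are hypothetical callbacks standing in for validation and fine-tuning. The function name `gradual_prune` and the `step_frac` parameter are likewise made up for illustration.

```python
def gradual_prune(weights, prunable, evaluate, retrain, goal, step_frac=0.2):
    """Accuracy-gated magnitude pruning on a toy, flattened model.

    weights  : list of floats
    prunable : indices that may be pruned (old-task weights would be excluded)
    evaluate : callable(weights) -> accuracy            (hypothetical helper)
    retrain  : callable(weights, trainable) -> weights  (hypothetical helper)
    goal     : accuracy that must still hold after each pruning round
    """
    kept, released = set(prunable), set()
    while kept:
        # Remove a small portion of the smallest-magnitude remaining weights.
        n_cut = max(1, int(len(kept) * step_frac))
        cut = set(sorted(kept, key=lambda i: abs(weights[i]))[:n_cut])
        trial = [0.0 if i in cut else w for i, w in enumerate(weights)]
        # Retrain the surviving weights to restore the performance.
        trial = retrain(trial, kept - cut)
        if evaluate(trial) < goal:
            break  # this round would violate the goal, so it is discarded
        weights, kept, released = trial, kept - cut, released | cut
    return weights, kept, released


# Toy usage: accuracy is the fraction of three "important" weights still alive.
toy = [0.01, -0.02, 0.5, -0.7, 0.03, 0.9]
important = [2, 3, 5]
accuracy = lambda w: sum(w[i] != 0.0 for i in important) / len(important)
no_retrain = lambda w, trainable: w  # identity stand-in for fine-tuning
pruned, kept, released = gradual_prune(toy, range(len(toy)), accuracy, no_retrain, goal=1.0)
```

In this toy run the three small weights end up released and the three large ones preserved, which mirrors the split into task-1 weights and released weights.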

Task k to k+1: Assume that in task-$k$, a compact model that can handle tasks $1$ to $k$ has been built and is available. The model weights preserved for tasks $1$ to $k$ are denoted as $W^P_{1:k}$. The released (redundant) weights associated with task-$k$ are denoted as $W^E_k$, and they are extra weights that can be used for subsequent tasks. Given the dataset of task-$(k+1)$, we apply a learnable mask $M \in \{0,1\}^D$ to pick the old weights $W^P_{1:k}$, with $D$ the dimension of $W^P_{1:k}$. The weights picked are then represented as $M \odot W^P_{1:k}$, the element-wise product of the 0-1 mask $M$ and $W^P_{1:k}$. Without loss of generality, we use the piggyback approach Mallya et al. (2018), which learns a real-valued mask $\hat{M}$ and applies a threshold for binarization to construct $M$. Hence, given a new task, we pick a set of weights (known as the critical weights) from the compact model via a learnable mask. Besides, we also use the released weights $W^E_k$ for the new task. The mask $M$ and the additional weights $W^E_k$ are learned together on the training data of task-$(k+1)$ with the loss function of task-$(k+1)$ via back-propagation.

As in the piggyback approach Mallya et al. (2018), since the binarized mask $M$ is not differentiable, when training the binary mask $M$ we update the real-valued mask $\hat{M}$ in the backward pass; then $M$ is updated with a threshold on $\hat{M}$ and applied in the forward pass. If the performance is still unsatisfactory, the model architecture can be grown to include more weights for training. That is, $W^E_k$ can be augmented with additional weights (such as new filters in convolutional layers and nodes in fully-connected layers), after which the training of both $M$ and $W^E_k$ resumes. Note that during training, the mask $M$ and the new weights $W^E_k$ are adapted, but the old weights $W^P_{1:k}$ are only “picked” and remain fixed. Thus, old tasks can be exactly recalled.
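The mask training described here can be illustrated with a tiny linear model. The sketch below is a simplified assumption-laden example, not the piggyback or CPG code: it uses two fixed old weights, a single trainable released weight, a squared loss, and plain SGD. The initial mask value, threshold, learning rates and data are all made up for illustration. The forward pass uses the thresholded 0-1 mask, while the gradient computed with respect to that binary mask is applied to the real-valued mask (a straight-through update).

```python
def binarize(real_mask, threshold=5e-3):
    """Threshold the real-valued mask to obtain the 0-1 picking mask."""
    return [1 if m > threshold else 0 for m in real_mask]


def train_picking_mask(w_old, data, epochs=200, lr_mask=0.01, lr_new=0.05):
    """Learn which fixed old weights to pick, plus one new (released) weight."""
    real_mask = [0.01] * len(w_old)  # real-valued mask, one entry per old weight
    w_new = 0.0                      # a single released weight, free to train
    for _ in range(epochs):
        for x_old, x_new, y in data:
            mask = binarize(real_mask)  # the forward pass uses the binary mask
            pred = sum(m * w * x for m, w, x in zip(mask, w_old, x_old)) + w_new * x_new
            err = pred - y              # derivative of 0.5 * err**2 w.r.t. pred
            # Straight-through step: the gradient for the binary mask entry
            # (err * w * x) is applied to the real-valued mask instead.
            real_mask = [r - lr_mask * err * w * x
                         for r, w, x in zip(real_mask, w_old, x_old)]
            w_new -= lr_new * err * x_new  # only the new weight is adapted
    return binarize(real_mask), w_new


# Toy task: y = 2 * x_old[0] + 1 * x_new, so the first old weight (2.0) is
# useful, the second (-3.0) is harmful, and the new weight should approach 1.
w_old = [2.0, -3.0]
data = [((1.0, 1.0), 1.0, 3.0), ((2.0, 1.0), 0.5, 4.5), ((0.5, 2.0), 1.0, 2.0)]
mask, w_new = train_picking_mask(w_old, data)
```

The old weights are never written to, so whatever they computed for earlier tasks is unchanged; only the mask and the new weight move.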

Compaction of task k+1: After $M$ and $W^E_k$ are learned, an initial model of task-$(k+1)$ is obtained. Then, we fix the mask $M$ and apply gradual pruning to compress $W^E_k$, so as to get the compact model $W^P_{k+1}$ and the redundant (released) weights $W^E_{k+1}$ for task-$(k+1)$. The compact model of old tasks then becomes $W^P_{1:(k+1)} = W^P_{1:k} \cup W^P_{k+1}$. The compacting and picking/growing loop is repeated from task to task. Details of CPG continual learning are listed in Algorithm 1.

Input: given task 1 and an original model trained on task 1.
Set an accuracy goal for task 1;
Alternately remove small weights and re-train the remaining weights for task 1 via gradual pruning Zhu and Gupta (2017), as long as the accuracy goal still holds;
Let the model weights preserved for task 1 be $W^P_1$ (referred to as task-1 weights), and those that are removed by the iterative pruning be $W^E_1$ (referred to as the released weights);
for task $k = 2 \cdots K$ (let the released weights of task $k$ be $W^E_k$) do
        Set an accuracy goal for task $k$;
        Apply a mask $M$ to the weights $W^P_{1:k-1}$; train both $M$ and $W^E_{k-1}$ for task $k$, with $W^P_{1:k-1}$ fixed;
        If the accuracy goal is not achieved, expand the number of filters (weights) in the model, reset $W^E_{k-1}$ and go to the previous step;
        Gradually prune $W^E_{k-1}$ to obtain $W^E_k$ (with $W^P_{1:k-1}$ fixed) for task $k$, until meeting the accuracy goal;
        $W^P_k = W^E_{k-1} \setminus W^E_k$ and $W^P_{1:k} = W^P_{1:k-1} \cup W^P_k$;
end for
Algorithm 1 Compacting, Picking and Growing Continual Learning
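The bookkeeping behind Algorithm 1 can be sketched with index sets. The example below is a deliberately simplified assumption: training, mask picking and accuracy checks are not simulated, and the number of weight slots each task must keep after pruning is supplied directly through the hypothetical `needs` list. It only illustrates how released slots are consumed first, how the model is expanded when they run out, and how each task's preserved slots stay disjoint from those of earlier tasks.

```python
def cpg_bookkeeping(initial_size, needs, expand_by):
    """Track which weight slots each task owns across a task sequence.

    initial_size : number of weight slots in the starting model
    needs        : needs[i] is the number of slots task i+1 keeps after pruning
                   (a stand-in for the outcome of training + gradual pruning)
    expand_by    : number of slots added per expansion step (assumed constant)
    """
    released = set(range(initial_size))  # free slots available to the next task
    preserved = {}                       # task id -> slots frozen for that task
    total = initial_size
    for task, need in enumerate(needs, start=1):
        # Growing: expand only when the released slots are not enough.
        while len(released) < need:
            released |= set(range(total, total + expand_by))
            total += expand_by
        # Compacting: the task keeps `need` slots and releases the rest.
        kept = set(sorted(released)[:need])
        preserved[task] = kept
        released -= kept
    return preserved, released, total


preserved, released, total = cpg_bookkeeping(initial_size=10, needs=[4, 3, 5], expand_by=4)
```

In this run the first two tasks fit into the original ten slots, and only the third task triggers an expansion, leaving two slots released for future tasks.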

4 Experiments and Results

We perform three experiments to verify the effectiveness of our approach. The first experiment contains 20 tasks organized from the CIFAR-100 dataset. In the second experiment, we follow the same settings as the PackNet Mallya and Lazebnik (2018) and Piggyback Mallya et al. (2018) approaches, where several fine-grained datasets are chosen for classification in an incremental manner. In the third experiment, we start from face verification and compact three further facial-informatic tasks (expression, gender, and age) incrementally to examine the performance of our continual learning approach in a realistic scenario. We implement our CPG approach and independent task learning (from scratch or fine-tuning) via PyTorch Paszke et al. (2017) in all experiments, but realize DEN Yoon et al. (2018) using TensorFlow Abadi et al. (2016) with its official code.

Figure 2: The accuracy of DEN, Finetune and CPG for the sequential tasks 1, 5, 10, 15 on CIFAR-100. Panels: (a) Task-1, (b) Task-5, (c) Task-10, (d) Task-15.

4.1 Twenty Tasks of CIFAR-100

We divide the CIFAR-100 dataset into 20 tasks. Each task has 5 classes, 2500 training images, and 500 testing images. In the experiment, the VGG16-BN model (VGG16 with batch normalization layers) is employed to train the 20 tasks sequentially. First, we compare our approach with DEN Yoon et al. (2018) (as it also uses an alternating mechanism of compression and expansion) and fine-tuning. To implement fine-tuning, we train task-1 from scratch by using VGG16-BN; then, assuming the models of task 1 to task $k$ are available, we train the model of task-$(k+1)$ by fine-tuning one of the models randomly selected from tasks 1 to $k$. We repeat this process 5 times and take the average accuracy (referred to as Finetune Avg). To implement our CPG approach, task-1 is also trained by using VGG16-BN, and this initial model is adapted for the sequential tasks following Algorithm 1. DEN is implemented via the official code provided by the authors and modified for VGG16-BN.
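The task construction can be sketched in a few lines. How the 100 classes are grouped into tasks is not stated here, so the consecutive-index grouping below is an assumption made purely for illustration; the per-class image counts (500 training, 100 testing) are the standard CIFAR-100 figures and reproduce the per-task totals quoted above.

```python
def split_into_tasks(num_classes=100, classes_per_task=5):
    """Group class indices into equally sized tasks (consecutive grouping assumed)."""
    return [list(range(start, start + classes_per_task))
            for start in range(0, num_classes, classes_per_task)]


tasks = split_into_tasks()

# Standard CIFAR-100 per-class image counts.
TRAIN_PER_CLASS, TEST_PER_CLASS = 500, 100
train_per_task = len(tasks[0]) * TRAIN_PER_CLASS
test_per_task = len(tasks[0]) * TEST_PER_CLASS
```

Each task then becomes an independent 5-way classification problem with its own output layer.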

Figure 2 shows the classification accuracy of DEN, fine-tuning, and our CPG. Figure 2(a) shows the accuracy of task-1 as all 20 tasks are trained. Initially, the accuracy of DEN is higher than that of CPG and fine-tuning although the same model is trained from scratch. We conjecture that this is because they are implemented on different platforms (TensorFlow vs. PyTorch). Nevertheless, in DEN the performance of task-1 gradually drops as the other tasks (2 to 20) are learned, as shown in Figure 2(a), and the drops are particularly significant for tasks 15 to 20. In Figure 2(b), the initial accuracy of DEN on task-5 becomes a little worse than that of CPG and fine-tuning. This reveals that DEN could not employ the previously learned model (tasks 1-4) to enhance the performance of the current task (task 5). Besides, the accuracy of task-5 still drops as new tasks (6-20) are learned. Similarly, for tasks 10 and 15, shown in Figures 2(c) and (d) respectively, DEN has a larger performance gap on the initial model, and its accuracy also drops as more tasks are learned.

We attribute the phenomenon to the following reasons. As DEN does not guarantee unforgetting, a "Split & Duplication" step is enforced to recover the old-task performance. Though DEN tries to preserve the learned tasks as much as possible by optimizing weight sparsity, the hyperparameters in its loss function make it non-intuitive to balance learning the current task against remembering the previous tasks. The performance thus drops although we have tried our best to tune it. On the other hand, fine-tuning and our CPG have roughly the same accuracy initially on task-1 (both are trained from scratch), whereas CPG gradually outperforms fine-tuning on tasks 5, 10, and 15 in Figure 2. The results suggest that our approach can exploit the accumulated knowledge base to enhance the new-task performance. After model growing for 20 tasks, the final amount of weights is increased by 1.5 times (compared to VGG16-BN) for both DEN and CPG. Hence, our approach can not only ensure that the old-task performance is maintained (as the horizontal lines in Figure 2 show), but also effectively accumulate the weights for knowledge picking.

Unlike ProgressiveNet, which uses all of the weights kept for the old tasks when training the new task, our method only picks the old-task weights critical to the new tasks. To evaluate the effectiveness of the weights-picking mechanism, we compare CPG with PAE Hung et al. (2019) and PackNet Mallya and Lazebnik (2018). In our method, if all of the old weights are always picked, we refer to it as the pack-and-expand (PAE) approach. If we further restrict PAE such that the architecture expansion is forbidden, it degenerates to an existing approach, PackNet Mallya and Lazebnik (2018). Note that both PAE and PackNet ensure unforgetting. As shown in Table 1, except for the first two tasks, CPG consistently performs more favorably than PAE and PackNet. The results reveal that the critical-weights picking mechanism in CPG not only reduces the unnecessary weights but also boosts the performance for new tasks. As PackNet does not allow model expansion, its amount of weights remains the same (1×). However, when proceeding with more tasks, the available space in PackNet gradually shrinks, which limits the effectiveness of PackNet in learning new tasks. PAE uses all the previous weights during learning. With more tasks, the weights from previous tasks dominate the whole network and become a burden in learning new tasks. Finally, as shown in the Expand (Exp.) field in Table 1, PAE grows the model and uses 2× the weights for the 20 tasks. Our CPG expands to 1.5× the weights (with 0.41× redundant weights that can be released to future tasks). Hence, CPG finds a more compact and sustainable model with better accuracy when the picking mechanism is enforced.

Table 2 shows the performance of different settings of our CPG method, together with their comparison to independent task learning (including learning from scratch and fine-tuning from a pre-trained model). In this table, ‘scratch’ means learning each task independently from scratch via the VGG16-BN model. As described before, ‘fine-Avg’ means the average accuracy over 5 trials of fine-tuning from a randomly selected previous model, and ‘fine-Max’ means the maximum accuracy of these 5 random trials. In the implementation of our CPG algorithm, an accuracy goal has to be set for the gradual-pruning and model-expansion steps. In this table, ‘avg’, ‘max’, and ‘top’ correspond to setting the accuracy goal to fine-Avg, fine-Max, and a slight increment of the maximum of both, respectively. The upper bound of model weights expansion is set as 1.5× in this experiment. As can be seen in Table 2, CPG gets better accuracy than both the average and maximum of fine-tuning in general. CPG also performs more favorably than learning from scratch on average. This reveals again that the knowledge previously learned with our CPG can help learn new tasks.

Besides, the results show that a higher accuracy goal generally yields better performance and consumes more weights. In Table 2, the average accuracy achieved by ‘CPG avg’, ‘CPG max’, and ‘CPG top’ increases in that order. The first retains 0.41× redundant weights that are saved for future use, whereas the latter two consume all weights. The model size includes not only the backbone model weights, but also the overhead of final layers added for new classes, batch-normalization parameters, and the binary masks. Including all overheads, the model sizes of CPG for the three settings are 2.16×, 2.40× and 2.41× that of the original VGG16-BN, as shown in Table 3. Compared to independent models (learning from scratch or fine-tuning), which require 20× the size to maintain the old-task accuracy, our approach yields a far smaller model while achieving exact unforgetting.
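The size ratios can be checked against the megabyte figures reported in the model-size table. The short calculation below is only a sanity check on those reported numbers; since the table values are rounded to whole megabytes, the recomputed ratios agree with the quoted factors only to about two decimal places.

```python
# Model sizes in MB as reported in the model-size table for CIFAR-100.
sizes_mb = {
    "VGG16-BN": 128.25,
    "Individual Models": 2565,
    "CPG avg": 278,
    "CPG max": 308,
    "CPG top": 310,
}

base = sizes_mb["VGG16-BN"]
# Size of each configuration relative to a single VGG16-BN model.
ratios = {name: size / base for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Twenty independent models cost exactly twenty times the base size, whereas every CPG setting stays below two and a half times the base size with all overheads included.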

Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Avg. Exp. Red.
PackNet 66.4 80.0 76.2 78.4 80.0 79.8 67.8 61.4 68.8 77.2 79.0 59.4 66.4 57.2 36.0 54.2 51.6 58.8 67.8 83.2 67.5 1 0
PAE 67.2 77.0 78.6 76.0 84.4 81.2 77.6 80.0 80.4 87.8 85.4 77.8 79.4 79.6 51.2 68.4 68.6 68.6 83.2 88.8 77.1 2 0
CPG 65.2 76.6 79.8 81.4 86.6 84.8 83.4 85.0 87.2 89.2 90.8 82.4 85.6 85.2 53.2 74.4 70.0 73.4 88.8 94.8 80.9 1.5 0.41
Table 1: The performance of PackNet, PAE and CPG on CIFAR-100 twenty tasks. We use Avg., Exp. and Red. as abbreviations for Average accuracy, Expansion weights and Redundant weights.

Methods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Avg. Exp. Red.
Scratch 65.8 78.4 76.6 82.4 82.2 84.6 78.6 84.8 83.4 89.4 87.8 80.2 84.4 80.2 52.0 69.4 66.4 70.0 87.2 91.2 78.8 20 0
fine-Avg 65.2 76.1 76.1 77.8 85.4 82.5 79.4 82.4 82.0 87.4 87.4 81.5 84.6 80.8 52.0 72.1 68.1 71.9 88.1 91.5 78.6 20 0
fine-Max 65.8 76.8 78.6 80.0 86.2 84.8 80.4 84.0 83.8 88.4 89.4 83.8 87.2 82.8 53.6 74.6 68.8 74.4 89.2 92.2 80.2 20 0
CPG avg 65.2 76.6 79.8 81.4 86.6 84.8 83.4 85.0 87.2 89.2 90.8 82.4 85.6 85.2 53.2 74.4 70.0 73.4 88.8 94.8 80.9 1.5 0.41
CPG max 67.0 79.2 77.2 82.0 86.8 87.2 82.0 85.6 86.4 89.6 90.0 84.0 87.2 84.8 55.4 73.8 72.0 71.6 89.6 92.8 81.2 1.5 0
CPG top 66.6 77.2 78.6 83.2 88.2 85.8 82.4 85.4 87.6 90.8 91.0 84.6 89.2 83.0 56.2 75.4 71.0 73.8 90.6 93.6 81.7 1.5 0
Table 2: The performance of CPGs and individual models on CIFAR-100 twenty tasks. We use fine-Avg and fine-Max as abbreviations for Average and Max accuracy of the 5 fine-tuning models.
Methods Model Size (MB)
VGG16-BN 128.25
Individual Models 2565
CPG avg 278
CPG max 308
CPG top 310
Table 4: Statistics of the fine-grained datasets
Dataset #Train #Eval #Classes
ImageNet 1,281,167 50,000 1,000
CUBS 5,994 5,794 200
Stanford Cars 8,144 8,041 196
Flowers 2,040 6,149 102
WikiArt 42,129 10,628 195
Sketch 16,000 4,000 250
Table 5: Statistics of the facial-informatic datasets
Dataset #Train #Eval #Classes
VGGFace2 3,137,807 0 8,6301
LFW 0 13,233 5,749
FotW 6,171 3,086 3
IMDB-Wiki 216,161 0 3
AffectNet 283,901 3,500 7
Adience 12,287 3,868 8
Table 3: Model sizes on CIFAR-100 twenty tasks.

4.2 Fine-grained Image Classification Tasks

In this experiment, following the same settings as PackNet Mallya and Lazebnik (2018) and Piggyback Mallya et al. (2018), six image classification datasets are used. The statistics are summarized in Table 4. ImageNet Krizhevsky et al. (2012) is the first task, followed by the fine-grained classification tasks CUBS Wah et al. (2011), Stanford Cars Krause et al. (2013) and Flowers Nilsback and Zisserman (2008), and finally by WikiArt Saleh and Elgammal (2015) and Sketch Eitz et al. (2012), which contain artificial images drawn in various styles and of various objects. Unlike the previous experiment, where the first task consists of five classes from CIFAR-100, the first-task classifier in this experiment is trained on ImageNet, which is a strong base for fine-tuning. Hence, in the fine-tuning setting of this experiment, tasks 2 to 6 are all fine-tuned from task 1, instead of from a randomly selected previous task. All tasks use the same image size, and the architecture used in this experiment is ResNet50.

The performance is shown in Table 6. Five methods are compared with CPG: training from scratch, fine-tuning, ProgressiveNet, PackNet, and Piggyback. For the first task (ImageNet), CPG and PackNet perform slightly worse than the others, since both methods have to compress the model (ResNet50) via pruning. For tasks 2 to 6, CPG outperforms the others in almost all cases, which shows the advantage of our method in building a compact and unforgetting base for continual learning. As for the model size, ProgressiveNet enlarges the model with every task, while learning-from-scratch and fine-tuning need 6 separate models to achieve unforgetting; their model sizes are thus large. CPG yields a small model size comparable to that of Piggyback, which is favorable when both accuracy and model size are considered.
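The claim that CPG outperforms the other methods "in almost all cases" on tasks 2 to 6 can be made concrete by finding the best method per dataset. The snippet below is a small illustrative check, not part of the original experiments; the accuracies are copied from the fine-grained accuracy table and the shortened method names are ours.

```python
# Accuracy (%) on tasks 2 to 6, copied from the fine-grained accuracy table.
methods = ["Scratch", "Finetune", "ProgNet", "PackNet", "Piggyback", "CPG"]
accuracy = {
    "CUBS":          [40.96, 82.83, 78.94, 80.41, 81.59, 83.59],
    "Stanford Cars": [61.56, 91.83, 89.21, 86.11, 89.62, 92.80],
    "Flowers":       [59.73, 96.56, 93.41, 93.04, 94.77, 96.62],
    "WikiArt":       [56.50, 75.60, 74.94, 69.40, 71.33, 77.15],
    "Sketch":        [75.40, 80.78, 76.35, 76.17, 79.91, 80.33],
}

# For each dataset, pick the method with the highest accuracy.
best = {
    task: methods[max(range(len(methods)), key=row.__getitem__)]
    for task, row in accuracy.items()
}
cpg_wins = [task for task, method in best.items() if method == "CPG"]

for task, method in best.items():
    print(f"{task}: best = {method}")
print(f"CPG is best on {len(cpg_wins)} of {len(accuracy)} tasks")
```

With these numbers CPG is the best method on four of the five tasks; the exception is Sketch, where fine-tuning is ahead by 0.45 points.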

| Dataset | Train from Scratch | Finetune | Prog. Net | PackNet | Piggyback | CPG |
|---|---|---|---|---|---|---|
| ImageNet | 76.16 | - | 76.16 | 75.71 | 76.16 | 75.81 |
| CUBS | 40.96 | 82.83 | 78.94 | 80.41 | 81.59 | 83.59 |
| Stanford Cars | 61.56 | 91.83 | 89.21 | 86.11 | 89.62 | 92.80 |
| Flowers | 59.73 | 96.56 | 93.41 | 93.04 | 94.77 | 96.62 |
| WikiArt | 56.50 | 75.60 | 74.94 | 69.40 | 71.33 | 77.15 |
| Sketch | 75.40 | 80.78 | 76.35 | 76.17 | 79.91 | 80.33 |
| Model Size (MB) | 554 | 554 | 563 | 115 | 121 | 121 |

Table 6: Accuracy on fine-grained dataset.

| Task | Train from Scratch | Finetune | CPG |
|---|---|---|---|
| Face | | - | |
| Gender | 83.70 | 90.80 | 89.66 |
| Expression | 57.64 | 62.54 | 63.57 |
| Age | 46.14 | 57.27 | 57.66 |
| Exp. (×) | 4 | 4 | 1 |
| Red. (×) | 0 | 0 | 0.003 |

Table 7: Accuracy on facial-informatic tasks.

4.3 Facial-informatic Tasks

To reflect a realistic scenario, four facial-informatic tasks are used: face verification, gender classification, expression classification, and age classification. The datasets are summarized in Table 5. For face verification, we use VGGFace2 Cao et al. (2018) for training and LFW Learned-Miller et al. (2016) for testing. For gender classification, we combine the FotW Escalera et al. (2016) and IMDB-Wiki Rothe et al. (2015) datasets and classify faces into three categories: male, female and other. The AffectNet dataset Mollahosseini et al. (2017) is used for expression classification, which assigns faces to seven primary emotions. Finally, the Adience dataset Eidinger et al. (2014) contains faces labeled with eight different age groups. For all these datasets, faces are aligned using MTCNN Zhang et al. (2016) and cropped to a fixed output size. We use the 20-layer CNN in SphereFace Liu et al. (2017) and train a model for the face verification task accordingly. We compare CPG with the models fine-tuned from the face verification task. The results are reported in Table 7. Compared with individual models (training-from-scratch and fine-tuning), CPG achieves comparable or more favorable results without additional expansion. After learning the four tasks, CPG still has 0.003× of the released weights available for new tasks.

5 Conclusion and Future Work

We introduce a simple but effective method, CPG, for continual learning that avoids forgetting. Compacting a model prevents the model complexity from becoming unaffordable as the number of tasks increases. Picking learned weights with binary masks and training them together with newly added weights is an effective way to reuse previous knowledge. The weights for old tasks are preserved, which prevents the old tasks from being forgotten. Growing the model for new tasks helps the model learn unknown or unrelated tasks. CPG is easy to implement and applicable to real situations. Experiments show that CPG can achieve similar or better accuracy with little additional space. Currently, our method compacts a model by weight pruning, and we plan to include channel pruning in the future.
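The three ingredients summarized above can be illustrated on a single flat weight vector. The following NumPy sketch is a toy illustration under simplifying assumptions, not the authors' implementation: the pick mask is fixed by hand rather than learned, the gradient is a dummy constant, and the function names (`masked_forward_weights`, `sgd_step`, `magnitude_prune`) are hypothetical.

```python
import numpy as np

def masked_forward_weights(weights, frozen, pick_mask):
    """Weights seen by the new task: picked old (frozen) weights plus all free weights."""
    active = (frozen & pick_mask) | ~frozen
    return np.where(active, weights, 0.0)

def sgd_step(weights, grad, frozen, lr=0.1):
    """Update only non-frozen weights, so old-task weights never change."""
    return np.where(frozen, weights, weights - lr * grad)

def magnitude_prune(weights, trainable, keep_ratio):
    """Keep the largest-magnitude fraction of the trainable weights.

    Returns a boolean array; trainable weights that are not kept would be
    released for future tasks.
    """
    idx = np.flatnonzero(trainable)
    n_keep = int(round(keep_ratio * idx.size))
    order = idx[np.argsort(-np.abs(weights[idx]))]
    kept = np.zeros_like(trainable, dtype=bool)
    kept[order[:n_keep]] = True
    return kept

# Toy layer: the first three weights belong to old tasks, the rest are free.
weights = np.array([0.9, -0.7, 0.5, 0.05, -0.02, 0.3, -0.6, 0.01])
frozen = np.array([True, True, True, False, False, False, False, False])
# Hand-set pick mask (assumption): the new task reuses old weights 0 and 2.
pick_mask = np.array([True, False, True, False, False, False, False, False])

used = masked_forward_weights(weights, frozen, pick_mask)              # picking
updated = sgd_step(weights, np.ones_like(weights), frozen)             # training
kept = magnitude_prune(updated, trainable=~frozen, keep_ratio=0.4)     # compacting
```

In this sketch the frozen entries of `updated` are identical to those of `weights`, which is the property that makes forgetting impossible by construction; growing the model would correspond to appending new free entries when the released weights are insufficient.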


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv. Cited by: §4.
  • [2] R. Aljundi, K. Kelchtermans, and T. Tuytelaars (2019) Task-free continual learning. In Proceedings of CVPR, Cited by: §2.
  • [3] P. P. Brahma and A. Othon (2018) Subset replay based continual learning for scalable improvement of autonomous systems. In Proceedings of the IEEE CVPRW, Cited by: §1, §2.
  • [4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2018) VGGFace2: a dataset for recognising faces across pose and age. In Proceedings of IEEE FG, Cited by: §4.3.
  • [5] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of ECCV, Cited by: §2.
  • [6] P. Dhar, R. V. Singh, K. Peng, Z. Wu, and R. Chellappa (2019) Learning without memorizing. Proceedings of CVPR. Cited by: §2.
  • [7] E. Eidinger, R. Enbar, and T. Hassner (2014) Age and gender estimation of unfiltered faces. IEEE TIFS 9 (12), pp. 2170–2179. Cited by: §4.3.
  • [8] M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects?. ACM Trans. Graph. 31 (4), pp. 44–1. Cited by: §4.2.
  • [9] S. Escalera, M. Torres Torres, B. Martinez, X. Baró, H. Jair Escalante, I. Guyon, G. Tzimiropoulos, C. Corneou, M. Oliu, M. Ali Bagheri, et al. (2016) Chalearn looking at people and faces of the world: face analysis workshop and challenge 2016. In Proceedings of IEEE CVPRW, Cited by: §4.3.
  • [10] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In Proceedings of ICLR, Cited by: §1.
  • [11] W. Hu, Z. Lin, B. Liu, C. Tao, Z. Tao, J. Ma, D. Zhao, and R. Yan (2019) Overcoming catastrophic forgetting via model adaptation. In Proceedings of ICLR, Cited by: §1, §2.
  • [12] S. C. Hung, J. Lee, T. S. Wan, C. Chen, Y. Chan, and C. Chen (2019) Increasingly packing multiple facial-informatics modules in a unified deep-learning model via lifelong learning. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 339–343. Cited by: §4.1.
  • [13] R. Kemker and C. Kanan (2018) Fearnet: brain-inspired model for incremental learning. In Proceedings of ICLR, Cited by: §1, §2.
  • [14] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences. Cited by: §1, §2.
  • [15] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: §4.2.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.2.
  • [17] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua (2016) Labeled faces in the wild: a survey. In Advances in face detection and facial image analysis, Cited by: §4.3.
  • [18] S. Lee, J. Kim, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In NIPS, Cited by: §1, §2.
  • [19] Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, pp. 2935–2947. Cited by: §2.
  • [20] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6738–6746. Cited by: §4.3.
  • [21] A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Proceedings of ECCV, Cited by: §1, §1, §3, §4.2, §4.
  • [22] A. Mallya and S. Lazebnik (2018) Packnet: adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE CVPR, Cited by: §1, §4.1, §4.2, §4.
  • [23] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review. Cited by: §1.
  • [24] A. Mollahosseini, B. Hasani, and M. H. Mahoor (2017) AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affective Comput.. Cited by: §4.3.
  • [25] M-E. Nilsback and A. Zisserman (2008-12) Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Cited by: §4.2.
  • [26] O. Ostapenko, M. Puscas, T. Klein, P. Jähnichen, and M. Nabi (2019) Learning to remember: a synaptic plasticity driven framework for continual learning. In Proceedings of CVPR, Cited by: §1, §2, §2, §4.1, §4.
  • [27] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §1, §1, §2.
  • [28] G. I. Parisi, X. Ji, and S. Wermter (2018) On the role of neurogenesis in overcoming catastrophic forgetting. In Proceedings of NeurIPS Workshop on Continual Learning, Cited by: §2.
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In Proceedings of NeurIPS, Cited by: §4.
  • [30] B. Pfulb and A. Gepperth (2019) A comprehensive, application-oriented study of catastrophic forgetting in dnns. In ICLR 2019, Cited by: §1.
  • [31] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of IEEE CVPR, Cited by: §1, §2.
  • [32] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In Proceedings of ICLR, Cited by: §2.
  • [33] M. Riemer, T. Klinger, D. Bouneffouf, and M. Franceschini (2019) Scalable recollections for continual lifelong learning. In AAAI 2019, Cited by: §1, §2.
  • [34] H. Ritter, A. Botev, and D. Barber (2018) Online structured laplace approximations for overcoming catastrophic forgetting. In NeurIPS, Cited by: §1, §2.
  • [35] A. Rosenfeld and J. K. Tsotsos (2018) Incremental learning through deep adaptation. IEEE transactions on pattern analysis and machine intelligence, Early Access. Cited by: §2.
  • [36] R. Rothe, R. Timofte, and L. Van Gool (2015) Dex: deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 10–15. Cited by: §4.3.
  • [37] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv. Cited by: §1, §1, §2.
  • [38] B. Saleh and A. Elgammal (2015) Large-scale classification of fine-art paintings: learning the right metric on the right feature. In ICDMW, Cited by: §4.2.
  • [39] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. In Proceedings of ICML, Cited by: §1, §2.
  • [40] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Proceedings of NeurIPS, Cited by: §1, §2.
  • [41] S. Thrun (1995) A lifelong learning perspective for mobile robot control. In Intelligent Robots and Systems, Cited by: §1.
  • [42] L. Valkov, D. Chaudhari, A. Srivastava, C. A. Sutton, and S. Chaudhuri (2018) HOUDINI: lifelong learning as program synthesis. In NeurIPS, Cited by: §2.
  • [43] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §4.2.
  • [44] C. Wu, L. Herranz, X. Liu, Y. Wang, J. van de Weijer, and B. Raducanu (2018) Memory replay gans: learning to generate new categories without forgetting. In Proceedings of NeurIPS, Cited by: §1, §2.
  • [45] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, Z. Zhang, and Y. Fu (2018) Incremental classifier learning with generative adversarial networks. CoRR abs/1802.00853. Cited by: §1, §2.
  • [46] T. Xiao, J. Zhang, K. Yang, Y. Peng, and Z. Zhang (2014) Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of ACM-MM, Cited by: §1.
  • [47] J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2018) Lifelong learning with dynamically expandable networks. In Proceedings of ICLR, Cited by: §2, §2.
  • [48] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of ICML, Cited by: §1, §2.
  • [49] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23, pp. 1499–1503. Cited by: §4.3.
  • [50] M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. In Proceedings of NeurIPS Workshop on Machine Learning of Phones and other Consumer Devices, Cited by: §1, §1, §3, 1.