Adversarial Continual Learning

03/21/2020 ∙ by Sayna Ebrahimi, et al.

Continual learning aims to learn new tasks without forgetting previously learned ones. We hypothesize that representations learned to solve each task in a sequence have a shared structure while containing some task-specific properties. We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks. Our model combines architecture growth to prevent forgetting of task-specific skills and an experience replay approach to preserve shared skills. We demonstrate that our hybrid approach is effective in avoiding forgetting and show it is superior to both architecture-based and memory-based approaches on class-incremental learning of a single dataset as well as a sequence of multiple datasets in image classification. Our code is available at <>.




1 Introduction

Humans can learn novel tasks by augmenting core capabilities with new skills learned based on information for a specific novel task. We conjecture that they can leverage a lifetime of previous task experiences in the form of fundamental skills that are robust to different task contexts. When a new task is encountered, these generic strategies form a base set of skills upon which task-specific learning can occur. We would like artificial learning agents to have the ability to solve many tasks sequentially under different conditions by developing task-specific and task-invariant skills that enable them to quickly adapt while avoiding catastrophic forgetting [catforgetting89] using their memory.

One line of continual learning approaches learns a single representation with a fixed capacity, detecting the weight parameters that are important for each task and minimizing their further alteration when learning new tasks. In contrast, structure-based approaches increase the capacity of the network to accommodate new tasks. However, these approaches do not scale well to a large number of tasks if they require a large amount of memory for each task. Another stream of approaches in continual learning relies on explicit or implicit experience replay by storing raw samples or training generative models, respectively.

Figure 1: Factorizing task-specific and task-invariant features in our method (ACL). Left: ACL at training time, where the Shared module is adversarially trained with the discriminator to generate task-invariant features ($z_S$) while the discriminator attempts to predict task labels. Architecture growth occurs at the arrival of each task by adding a task-specific Private module optimized to generate a representation ($z_P$) orthogonal to $z_S$. To prevent forgetting, 1) Private modules are stored for each task and 2) the Shared module, which is less prone to forgetting, is nevertheless retrained with experience replay using a limited number of exemplars. Right: At test time, the discriminator is removed and ACL uses the Private module of the specific task it is evaluated on.

In this paper, we propose a novel adversarial continual learning (ACL) method that learns a disjoint latent space representation: a task-specific or private latent space for each task and a task-invariant or shared feature space for all tasks, to enhance knowledge transfer as well as recall of the previous tasks. The intuition behind our method is that tasks in a sequence share a part of the feature representation but also have a part of the feature representation which is task-specific. The shared features are notably less prone to forgetting, and the task-specific features must be retained to avoid forgetting the corresponding task. Therefore, factorizing these features separates the part of the representation that forgets from the part that does not. To disentangle the features associated with each task, we propose a novel adversarial learning approach to enforce the shared features to be task-invariant and employ orthogonality constraints [factorized] to prevent the shared features from appearing in the task-specific space.

Once factorization is complete, minimizing forgetting in each space can be handled differently. In the task-specific latent space, due to the importance of these features in recalling the task, we freeze the private module and add a new one upon finishing learning a task. The shared module, however, is significantly less susceptible to forgetting, and we use the replay buffer mechanism in this space only to the extent that factorization is not perfect; i.e., when tasks have little overlap or a high domain shift between them, a tiny memory containing samples stored from prior tasks helps with better factorization and hence higher performance. We empirically found that, unlike other memory-based methods in which performance increases with the number of samples from prior tasks, our model requires a very tiny memory budget beyond which its performance remains constant. This alleviates the need to use old data, as in some applications it might not be possible to store a large amount of data, if any at all, after observing it. Instead, our approach leaves room for further use of memory, if available and needed, for architecture growth. Our approach is simple yet surprisingly powerful in not forgetting and achieves state-of-the-art results on visual continual learning benchmarks such as MNIST, CIFAR100, Permuted MNIST, and miniImageNet.

2 Related Work

2.1 Continual learning

The existing approaches to continual learning can be broadly divided into three categories: memory-based methods, structure-based methods, and regularization-based methods.

Memory-based methods: Methods in this category mitigate forgetting by storing previous experience explicitly or implicitly. In the former, raw samples [agem, gem, rehearsal, icarl, mer] are saved into memory for rehearsal, whereas in the latter a generative model such as a GAN [genreplay] or an autoencoder [fearnet] synthesizes them to perform pseudo-rehearsal. These methods allow for simultaneous multi-task learning on i.i.d. data, which can significantly reduce forgetting. A recent study on tiny episodic memories in CL [tinymem] compared methods such as GEM [gem], A-GEM [agem], MER [mer], and ER-RES [tinymem]. Similar to [mer], for ER-RES they used reservoir sampling with a single pass through the data. Reservoir sampling [reservoir] is a better sampling strategy for long input data than random selection. In this work, we explicitly store raw samples in a very tiny memory used as a replay buffer, and we differ from prior work in this line of research in how these stored examples are used by specific parts of our model (the discriminator and the shared module) to prevent forgetting in the features found to be shared across tasks.
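The single-pass reservoir sampling mentioned above can be sketched as follows. This is a minimal illustration of the classic algorithm, not the code used in [tinymem] or in ACL; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Keep a uniform random sample of `capacity` items from a stream in one pass."""
    rng = rng or random.Random()
    memory = []
    for seen, item in enumerate(stream):
        if len(memory) < capacity:
            # Fill the buffer until it reaches capacity.
            memory.append(item)
        else:
            # Keep the new item with probability capacity / (seen + 1),
            # overwriting a uniformly chosen slot.
            slot = rng.randint(0, seen)
            if slot < capacity:
                memory[slot] = item
    return memory


# Example: retain 10 items from a stream of 1000 without knowing its length in advance.
buffer = reservoir_sample(range(1000), capacity=10, rng=random.Random(0))
```

Because each item is inspected once and the buffer never exceeds `capacity`, the memory cost is independent of the stream length.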

Structure-based methods: These methods exploit modularity and attempt to localize inference to a subset of the network such as columns [pnn], neurons [pathnet, den], or a mask over parameters [packnet, hat]. The performance on previous tasks is preserved by storing the learned module while accommodating new tasks by augmenting the network with new modules. For instance, Progressive Neural Nets (PNNs) [pnn] statically grow the architecture while retaining lateral connections to previously frozen modules, resulting in guaranteed zero forgetting at the price of quadratic growth in the number of parameters. [den] proposed dynamically expandable networks (DEN), in which network capacity grows according to task relatedness by splitting/duplicating the most important neurons while time-stamping them so that they remain accessible and re-trainable at all times. This strategy, despite introducing computational cost, is inevitable in continual learning scenarios where a large number of tasks are to be learned and a fixed capacity cannot be assumed.

Regularization methods: In these methods [ewc, SI, mas, ucb], the learning capacity is assumed fixed and continual learning is performed such that the change in parameters is controlled, and reduced or prevented if it degrades performance on prior tasks. Parameter selection therefore requires a measure of weight importance to prioritize parameter usage. For instance, inspired by Bayesian learning, in the elastic weight consolidation (EWC) method [ewc] important parameters are those ranked highest in terms of the Fisher information matrix. HAT [hat] learns an attention mask over important parameters. The authors of [ucb] used per-weight uncertainty defined in Bayesian neural networks to control the change in parameters. Despite the success of these methods in maximizing the usage of a fixed capacity, they are often limited by the number of tasks.
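To make the idea of a weight-importance measure concrete, the following sketch shows the quadratic penalty commonly associated with EWC, assuming a diagonal Fisher approximation. It illustrates a baseline from the related work, not ACL itself; `ewc_penalty` and its argument names are hypothetical.

```python
def ewc_penalty(params, old_params, fisher_diag, strength):
    """Quadratic penalty: (strength / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    params, old_params, fisher_diag are flat lists of equal length;
    fisher_diag holds the (assumed) diagonal Fisher information per parameter.
    """
    return 0.5 * strength * sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(params, old_params, fisher_diag)
    )


# A parameter with high Fisher value is expensive to move; an unimportant one is cheap.
penalty = ewc_penalty([1.0, 2.0], [1.0, 0.0], [10.0, 0.5], strength=2.0)
```

The penalty is added to the loss of the new task, so parameters that mattered for earlier tasks are pulled back toward their previous values.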

2.2 Adversarial learning

Adversarial learning has been used for different problems such as generative models [gan], object composition [compositional], representation learning [makhzani2015adversarial], domain adaptation [adda], active learning [vaal], etc. The use of an adversarial network enables the model to train in a fully-differentiable manner by adjusting to solve the minimax optimization problem [gan]. Adversarial learning of the latent space has been extensively researched in domain adaptation [cycada], active learning [vaal], and representation learning [factorvae, makhzani2015adversarial]. While previous literature is concerned with modeling a single task or multiple tasks at once, here we extend this literature by considering the case of continual learning, where multiple tasks need to be learned in a sequential manner.

2.3 Latent Space Factorization

In the machine learning literature, multi-view learning aims at constructing and/or using different views or modalities for better learning performance [blum1998combining, xu2013survey]. The approaches to multi-view learning aim either at maximizing the mutual agreement on distinct views of the data or at obtaining a latent subspace shared by multiple views, assuming that the input views are generated from this latent subspace, using canonical correlation analysis and clustering [chaudhuri2009multi], Gaussian processes [shon2006learning], etc. Therefore, the concept of factorizing the latent space into shared and private parts has been extensively explored for different data modalities. Inspired by the practicality of factorized representation in handling different modalities, here we factorize the latent space learned for different tasks using adversarial learning and orthogonality constraints [factorized].

3 Adversarial Continual Learning (ACL)

We consider the problem of learning a sequence of $T$ data distributions denoted as $\mathcal{D}^{tr} = \{\mathcal{D}^{tr}_1, \cdots, \mathcal{D}^{tr}_T\}$, where $\mathcal{D}^{tr}_k = \{(X^k_i, Y^k_i, T^k_i)\}_{i=1}^{n_k}$ is the data distribution for task $k$ with $n_k$ sample tuples of input ($X^k \in \mathcal{X}$), output label ($Y^k \in \mathcal{Y}$), and task label ($T^k \in \mathcal{T}$). The goal is to sequentially learn the model $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$ for each task that can map each task input to its target output while maintaining its performance on all prior tasks. We aim to achieve this by learning a disjoint latent space representation composed of a task-specific latent space for each task and a task-invariant feature space for all tasks, to enhance knowledge transfer and to avoid catastrophic forgetting of prior knowledge. We mitigate catastrophic forgetting in each space differently. For the task-invariant feature space, we assume a limited memory budget $\mathcal{M}^k$ which stores $m$ samples from every single task prior to $k$.

We begin by learning $f_\theta^k$ as a mapping from $X^k$ to $Y^k$. For a $C$-way classification task with a cross-entropy loss, this corresponds to

$$\mathcal{L}_{task}(f_\theta^k, \mathcal{D}^{tr}_k) = -\mathbb{E}_{(x^k, y^k) \sim \mathcal{D}^{tr}_k} \sum_{c=1}^{C} \mathbb{1}_{[c = y^k]} \log\big(\sigma(f_\theta^k(x^k))_c\big) \tag{1}$$

where $\sigma$ is the softmax function and the subscript $i$ is dropped for simplicity. In the process of learning a sequence of tasks, an ideal $f^k$ is a model that maps the inputs to two independent latent spaces where one contains the shared features among all tasks and the other remains private to each task. In particular, we would like to disentangle the latent space into the information shared across all tasks ($z_S$) and the independent or private information of each task ($z_P$), which are as distinct as possible while their concatenation followed by a task-specific head outputs the desired targets.
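As a small numeric illustration of the cross-entropy task loss described above, the sketch below computes the mean negative log-probability of the correct class from raw logits. It uses only the standard library; `softmax` and `task_loss` are illustrative names, not functions from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_loss(batch_logits, labels):
    """Mean cross-entropy: average of -log softmax(logits)[correct class]."""
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(batch_logits, labels)
    ]
    return sum(losses) / len(losses)


# With uniform logits over 5 classes the loss equals log(5), the chance-level value.
chance_level = task_loss([[0.0] * 5], [2])
```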

To this end, we introduce a mapping called Shared ($S_{\theta_S}: \mathcal{X} \rightarrow z_S$) and train it to generate features that fool an adversarial discriminator $D$. Conversely, the adversarial discriminator ($D_{\theta_D}: z_S \rightarrow \mathcal{T}$) attempts to classify the generated features by their task labels ($T^k$). This is achieved when the discriminator is trained to maximize the probability of assigning the correct task label to generated features while simultaneously $S$ is trained to confuse the discriminator by minimizing $\log(D(S(x^k)))$. This corresponds to the following $T$-way classification cross-entropy adversarial loss for this minimax game

$$\mathcal{L}_{adv}(D, S, X^k, \mathcal{M}^k) = \min_{S} \max_{D} \sum_{k=0}^{T} \mathbb{1}_{[k = t^k]} \log\big(D(S(x^k))_k\big) \tag{2}$$

Note that the extra label zero is associated with the ‘fake’ task label paired with randomly generated noise features $z'_S \sim \mathcal{N}(\mu, \Sigma)$. In particular, we use adversarial learning in a regime different from the one that appears in most works related to generative adversarial networks [gan]: the generative modeling of input data distributions is not utilized here because the ultimate task is to learn a discriminative representation.
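The role of the extra label zero can be sketched as follows: the discriminator's training batch pairs real shared features with their task labels and pairs random noise features with label 0. This is an illustrative sketch only. Standard-normal noise, the one-based task labels, and the name `discriminator_batch` are assumptions made for the example.

```python
import random

FAKE_TASK_LABEL = 0  # label reserved for noise features, as described in the text


def discriminator_batch(shared_features, task_labels, num_noise, rng=None):
    """Build (features, labels) for the discriminator.

    shared_features: list of feature vectors produced by the shared module.
    task_labels: task label (assumed to be in 1..T) for each feature vector.
    num_noise: number of Gaussian noise vectors to add with the fake label 0.
    """
    rng = rng or random.Random()
    dim = len(shared_features[0])
    # Assumption: zero-mean, unit-variance noise with the same dimensionality.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_noise)]
    features = list(shared_features) + noise
    labels = list(task_labels) + [FAKE_TASK_LABEL] * num_noise
    return features, labels


features, labels = discriminator_batch(
    [[0.1, 0.2], [0.3, 0.4]], [1, 2], num_noise=3, rng=random.Random(0)
)
```

The discriminator is then trained as an ordinary classifier over these labels, while the shared module is trained against it.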

To facilitate training $S$, we use the Gradient Reversal layer [DA], which optimizes the mapping to maximize the discriminator loss directly (i.e., $S$ minimizes $-\mathcal{L}_D$). It acts as an identity function during forward propagation but negates the gradients passing through it during backpropagation. The training of $S$ and $D$ is complete when $S$ is able to generate features for which $D$ can no longer predict the correct task label, leading $z_S$ to become as task-invariant as possible. The private module ($P_{\theta_P}: \mathcal{X} \rightarrow z_P$), however, attempts to accommodate the task-invariant features by learning merely the features that are specific to the task at hand and do not exist in $z_S$. We further factorize $z_S$ and $z_P$ by using the orthogonality constraints introduced in [factorized], also known as the “difference” loss in the domain adaptation literature [DSN], to prevent the features shared between all tasks from appearing in the private encoded features. This corresponds to

$$\mathcal{L}_{diff}(S, P, X^k, \mathcal{M}^k) = \sum_{k=1}^{T} \big\| (S(x^k))^{\top} P^k(x^k) \big\|_F^2 \tag{3}$$

where $\|\cdot\|_F$ is the Frobenius norm and the sum runs over the encoded features of all $P$ modules encoding samples for the current tasks and the memory.
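The two factorization mechanisms described above can be sketched in plain Python: a gradient reversal layer that is the identity in the forward pass and flips the gradient sign in the backward pass, and a difference loss computed as the squared Frobenius norm of the product between shared and private feature batches. Both are illustrative sketches with hand-written forward and backward methods rather than a deep learning framework. The class and function names, and the `scale` parameter, are hypothetical.

```python
class GradientReversal:
    """Identity in the forward pass; negates the gradient in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale  # hypothetical scaling factor for the reversed gradient

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.scale * g for g in grad_output]


def difference_loss(shared, private):
    """Squared Frobenius norm of shared^T @ private.

    shared: n x d_s batch of shared features; private: n x d_p batch of private features.
    """
    d_s, d_p = len(shared[0]), len(private[0])
    total = 0.0
    for i in range(d_s):
        for j in range(d_p):
            # Entry (i, j) of shared^T @ private: inner product of two feature columns.
            entry = sum(s_row[i] * p_row[j] for s_row, p_row in zip(shared, private))
            total += entry ** 2
    return total


layer = GradientReversal()
reversed_grad = layer.backward([0.5, -1.0])
orthogonal = difference_loss([[1.0], [0.0]], [[0.0], [1.0]])
```

The loss vanishes when every shared feature column is orthogonal to every private feature column across the batch, which is the property the constraint is meant to encourage.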

Final output predictions for each task are then made by a task-specific multi-layer perceptron head which takes $z_P$ concatenated with $z_S$ ($z_P \oplus z_S$) as input.

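Taken together, these losses form the complete objective for ACL as

$$\mathcal{L}_{ACL} = \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{task} + \lambda_3 \mathcal{L}_{diff} \tag{4}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularizers that control the effect of each component. The full algorithm for ACL is given in Alg. 1.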

3.1 Avoiding forgetting in ACL

Catastrophic forgetting occurs when a representation learned through a sequence of tasks changes in favor of learning the current task, resulting in performance degradation on previous tasks. The main insight of our approach is decoupling the conventional single representation learned for a sequence of tasks into two parts: a part that must not change because it contains task-specific features without which complete performance retrieval is not possible, and a part that is less prone to change as it contains the core structure of all tasks.

To fully prevent catastrophic forgetting in the first part (private features), we use compact modules that can be stored in memory. If factorization is successfully performed, the second part remains highly immune to forgetting. However, we empirically found that when disentanglement cannot be fully accomplished, either because of little overlap or a large domain shift between the tasks, using a tiny replay buffer containing a few samples of old data can be beneficial to retain high ACC values and to mitigate forgetting.

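function TRAIN($\theta_P, \theta_S, \theta_D, \mathcal{D}^{tr}, \mathcal{D}^{ts}, m$)
    Hyper-parameters: $\lambda_1, \lambda_2, \lambda_3, \alpha_S, \alpha_P, \alpha_D$
    $R \leftarrow 0 \in \mathbb{R}^{T \times T}$
    $\mathcal{M} \leftarrow \{\}$
    $f_\theta^k = f(\theta_S \oplus \theta_P)$
    for $k = 1$ to $T$ do
        for $e = 1$ to epochs do
            Compute $\mathcal{L}_{adv}$ for $S$ using $(x, t) \in \mathcal{D}^{tr}_k \cup \mathcal{M}$
            Compute $\mathcal{L}_{task}$ using $(x, y) \in \mathcal{D}^{tr}_k \cup \mathcal{M}$
            Compute $\mathcal{L}_{diff}$ using $P^k$, $S$, and $x \in \mathcal{D}^{tr}_k$
            $\mathcal{L}_{ACL} = \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{task} + \lambda_3 \mathcal{L}_{diff}$
            $\theta'_S \leftarrow \theta_S - \alpha_S \nabla \mathcal{L}_{ACL}$
            $\theta'_{P^k} \leftarrow \theta_{P^k} - \alpha_P \nabla \mathcal{L}_{ACL}$
            Compute $\mathcal{L}_{adv}$ for $D$ using $(S(x), t)$ and $(z'_S \sim \mathcal{N}(\mu, \Sigma), t = 0)$
            $\theta'_D \leftarrow \theta_D + \alpha_D \nabla \mathcal{L}_{adv}$
        end for
        $\mathcal{M} \leftarrow$ UPDATEMEMORY($\mathcal{D}^{tr}_k, \mathcal{M}, C, m$)
        Store $\theta_{P^k}$
        $f_\theta^k \leftarrow f(\theta'_S \oplus \theta'_P)$
        $R_{k, \{1 \cdots k\}} \leftarrow$ EVAL($f_\theta^k, \mathcal{D}^{ts}_{\{1 \cdots k\}}$)
    end for
end function
function UPDATEMEMORY($\mathcal{D}^{tr}_k, \mathcal{M}, C, m$)
    $s \leftarrow m / C$ (samples per class)
    for $c = 1$ to $C$ do
        for $i = 1$ to $s$ do
            $(x^k_i, y^k_i, t^k_i) \sim \mathcal{D}^{tr}_k$ with $y^k_i = c$
            $\mathcal{M} \leftarrow \mathcal{M} \cup \{(x^k_i, y^k_i, t^k_i)\}$
        end for
    end for
    return $\mathcal{M}$
end function
function EVAL($f_\theta^k, \mathcal{D}^{ts}_{\{1 \cdots k\}}$)
    for $i = 1$ to $k$ do
        $R_{k,i} \leftarrow$ Accuracy($f_\theta^k(x, t), y$) for $(x, y, t) \in \mathcal{D}^{ts}_i$
    end for
    return $R$
end function
Algorithm 1 Adversarial Continual Learning (ACL)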
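The memory update step of Algorithm 1 stores a fixed number of samples per class after each task. A minimal sketch of that idea is given below, assuming samples are `(x, y, t)` tuples and that exemplars are chosen uniformly at random within each class. The selection rule and the name `update_memory` are assumptions for illustration.

```python
import random
from collections import defaultdict


def update_memory(memory, task_samples, samples_per_class, rng=None):
    """Append up to `samples_per_class` randomly chosen (x, y, t) tuples per class."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for sample in task_samples:
        by_class[sample[1]].append(sample)  # group by class label y
    for label in sorted(by_class):
        candidates = by_class[label]
        count = min(samples_per_class, len(candidates))
        memory.extend(rng.sample(candidates, count))
    return memory


# One exemplar per class for a 5-class task labeled t = 1.
task = [(index, index % 5, 1) for index in range(50)]
memory = update_memory([], task, samples_per_class=1, rng=random.Random(0))
```

With one sample per class and 5 classes per task, the buffer grows by only 5 entries per task.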

3.2 Evaluation metrics

After training for each new task, we evaluate the resulting model on all prior tasks. Similar to [gem, ucb], to measure ACL performance we use ACC, the average test classification accuracy across all tasks. To measure forgetting we report backward transfer, BWT, which indicates how much learning new tasks has influenced the performance on previous tasks. While $\text{BWT} < 0$ directly reports catastrophic forgetting, $\text{BWT} > 0$ indicates that learning new tasks has helped with the preceding tasks.

$$\text{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \big( R_{T,i} - R_{i,i} \big), \qquad \text{ACC} = \frac{1}{T} \sum_{i=1}^{T} R_{T,i} \tag{5}$$

where $R_{n,i}$ is the test classification accuracy on task $i$ after sequentially finishing learning the $n$-th task.
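Both metrics can be computed from a matrix of accuracies in which entry `R[n][i]` holds the test accuracy on task `i` after training on task `n`. The sketch below follows the definitions of ACC and BWT used in [gem], with zero-based indices. The function names and the toy matrix are illustrative.

```python
def average_accuracy(R):
    """ACC: mean accuracy over all tasks after training on the last task."""
    T = len(R)
    return sum(R[T - 1][i] for i in range(T)) / T


def backward_transfer(R):
    """BWT: mean change in accuracy on earlier tasks after training on the last task."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


# Toy 3-task example: accuracy on tasks 0 and 1 drops after later tasks are learned.
R = [
    [0.9, 0.0, 0.0],
    [0.8, 0.7, 0.0],
    [0.7, 0.6, 0.8],
]
acc = average_accuracy(R)
bwt = backward_transfer(R)  # negative here, i.e. forgetting occurred
```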

We also compare methods based on the memory used either in the network architecture growth or replay buffer. Therefore, we convert them into memory size assuming numbers are floating point which is equivalent to .
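The conversion from counts to memory size can be sketched as below. The constants are assumptions made for the example and are not taken from the text: 4 bytes per stored value (32-bit floats), 10^6 bytes per MB, and replay images stored at the same precision as parameters. All names are hypothetical.

```python
BYTES_PER_VALUE = 4  # assumption: every number is a 32-bit float
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def memory_mb(num_values, bytes_per_value=BYTES_PER_VALUE, bytes_per_mb=BYTES_PER_MB):
    """Memory in MB needed to store `num_values` numbers."""
    return num_values * bytes_per_value / bytes_per_mb


def replay_buffer_mb(num_images, height, width, channels):
    """Memory in MB of a replay buffer holding `num_images` images."""
    return memory_mb(num_images * height * width * channels)


# A hypothetical network with one million parameters occupies 4 MB under these assumptions.
network_size = memory_mb(1_000_000)
```

Expressing both architecture growth and the replay buffer in the same unit is what allows methods from different categories to be compared on a single memory axis.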

4 Experiments

In this section, we review the benchmark datasets and baselines used in our evaluation as well as the implementation details. We then report the obtained results, an ablation study, and a brief analysis of ACL details.

4.1 ACL on Vision Benchmarks

Datasets: We evaluate our approach on the commonly used benchmark datasets for class-incremental learning, where the entire dataset is divided into disjoint subsets or tasks. We use the common image classification datasets 5-Split MNIST and Permuted MNIST [mnist], previously used in [vcl, SI, ucb], 20-Split CIFAR100 [cifar] used in [SI, gem, agem], and 20-Split miniImageNet [mini] used in [tinymem, proto]. We also benchmark ACL on a sequence of 5-Datasets including SVHN, CIFAR10, not-MNIST, Fashion-MNIST, and MNIST and report average performance over multiple random task orderings. Dataset statistics are given in Table 5(a) in the appendix. No data augmentation of any kind has been used in our analysis.
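Dividing a dataset into disjoint class subsets, as in the split benchmarks above, can be sketched as follows. The example assumes classes are assigned to tasks in contiguous index order, which may differ from the ordering used in the actual experiments. `split_classes` is a hypothetical helper.

```python
def split_classes(num_classes, num_tasks):
    """Partition class ids 0..num_classes-1 into equally sized, disjoint tasks."""
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    per_task = num_classes // num_tasks
    return [
        list(range(task * per_task, (task + 1) * per_task))
        for task in range(num_tasks)
    ]


# A 100-class dataset split into 20 tasks yields 5 classes per task.
tasks = split_classes(100, 20)
```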

Baselines: From the prior work, we compare with state-of-the-art approaches in all the three categories described in Section 2 including Elastic Weight Consolidation (EWC) [ewc], Progressive neural networks (PNNs) [pnn], and Hard Attention Mask (HAT) [hat] using implementations provided by [hat] unless otherwise stated. For memory-based methods including A-GEM, GEM, and ER-RES, for Permuted MNIST, 20-Split CIFAR100, and 20-Split miniImageNet, we relied on the implementation provided by [tinymem], but changed the experimental setting from single to multi-epoch and without using for cross validation for a more fair comparison against ACL and other baselines. On Permuted MNIST results for SI [SI] are reported from [hat], for VCL [vcl] those are obtained using their original provided code, and for Uncertainty-based CL in Bayesian framework (UCB) [ucb] are directly reported from the paper.

We also perform fine-tuning, and joint training. In fine-tuning (ORD-FT), an ordinary single module network without the discriminator is continuously trained without any forgetting avoidance strategy in the form of experience replay or architecture growth. In joint training with an ordinary network (ORD-JT) and our ACL setup (ACL-JT) we learn all the tasks jointly in a multitask learning fashion using the entire dataset at once which serves as the upper bound for average accuracy on all tasks, as it does not adhere to the continual learning scenario.

Implementation details: For all ACL experiments except for Permuted MNIST and 5-Split MNIST we used a reduced AlexNet [alexnet] architecture as the backbone for the $S$ and $P$ modules. The architecture of $S$ is composed of convolutional and fully-connected (FC) layers, whereas $P$ is only a convolutional neural network (CNN) with a similar number of layers and half-sized kernels compared to those used in $S$. The private head modules ($p$) and the discriminator are all composed of a small perceptron. Due to the differences between the structure of our setup and a regular network with a single module, we used a CNN structure similar to $S$ followed by larger hidden FC layers to match the total number of parameters throughout our experiments with our baselines for fair comparisons. For 5-Split MNIST and Permuted MNIST where baselines use a two-layer perceptron with units in each and ReLU nonlinearity, we used a two-layer perceptron of size and with ReLU activation in between in the shared module and a single-layer of size and ReLU for each $P$. In each head, we also used an MLP with layers of size and , ReLU activations, and a -unit softmax layer. In all our experiments, no pre-trained model is used. We used stochastic gradient descent for ACL and baselines. Our code is provided as a zipped file included in the supplementary materials.

5 Results and Discussion

In the first set of experiments, we measure ACC, BWT, and the memory used by ACL and compare it against state-of-the-art methods with or without memory constraints on 20-Split miniImageNet. Next, we provide more insight and discussion on ACL and its components by performing an ablation study and visualizations on this dataset. In Section 6, we evaluate ACL on a more difficult continual learning setting where we sequentially train on different datasets. Finally, in Section 7, we present experiments on class-incremental learning of single datasets commonly used in the CL literature and compare the ACC and BWT metrics against prior work.

5.1 ACL Performance on 20-Split miniImageNet

Starting with 20-Split miniImageNet, we split it into 20 tasks with 5 classes at a time. Table 1(a) shows our results obtained for ACL compared to several baselines. We compare ACL with HAT as a regularization based method with no experience replay memory dependency that achieves ACC= with . Results for the memory-based methods ER-RES and A-GEM are (re)produced by us using the implementation provided in [tinymem], with modifications to the network architecture to match ACL in the backbone structure as well as the number of parameters. We only include A-GEM in Table 1(a), which is only a faster algorithm compared to its predecessor GEM with identical performance.

A-GEM and ER-RES use an architecture with parameters MB along with storing images of size per class MB resulting in total memory size of MB. ACL is able to outperform all baselines in ACC=, , using total memory of MB for architecture growth MB and storing 1 sample per class for replay buffer MB. In our ablation study in Section 5.2, we will show our performance without using replay buffer for this dataset is ACC=. However, ACL is able to overcome the gap by using only one image per class (5 per task) to achieve ACC= without the need to have a large buffer for old data in class-incremental learning of datasets like miniImageNet with diverse sets of classes.

Method ACC% BWT%
HAT[hat] - -
PNN [pnn] Zero -
A-GEM [agem] -
ORD-FT - -
ORD-JT - -
ACL-JT - -
ACL (Ours)
ACL (1 sample)
w/o Dis and Shared Zero
w/o Private -
w/o (w/o Dis) -
w/o Replay buffer -
Table 1: CL results on 20-Split miniImageNet measuring ACC, BWT, and Memory (MB). (**) denotes that methods do not adhere to the continual learning setup: ACL-JT and ORD-JT serve as the upper bound for ACC for ACL/ORD networks, respectively. denotes result is (re)produced by us using the original provided code. denotes result is obtained using the re-implementation setup by [hat]. All results are averaged over runs and standard deviation is given in parentheses. (b) Ablation study of ACL on the miniImageNet dataset.

5.2 Ablation Studies on 20-Split miniImageNet

We now analyze the major building blocks of our proposed framework such as the discriminator, the shared module, replay buffer effect and the difference loss on the miniImagenet dataset. Ablation results are summarized in Table 1(b) and are as follows:

Discriminator and shared modules: We begin by ablating the discriminator and shared module (w/o Dis and Shared), which means only using the private modules. They are used one at a time for each task and stored in memory for further recall (zero forgetting guaranteed). In the 20-Split miniImageNet experiment, we use a small convolutional network with parameters in $P$ and in $p$ (private head) for each task. The average accuracy (ACC) for this model is where random chance is 20% as it is a 5-class problem. This ablation shows that despite obtaining zero forgetting using $P$ modules one at a time, they are not capable of performing the task when solely used because of their limited capacity.

Private modules: In the first ablation we showed that none of the tasks can be performed well using only $P$ modules. Ablating the Private modules leads to a similar observation for the shared module as it has only parameters. We have performed this ablation by fine-tuning the $S$ module while it is adversarially trained with $D$. This results in ACC= and , confirming the important role of private modules in retaining tasks’ performance.

Discriminator: Now, we ablate the adversarial learning aspect of our method. Eliminating the discriminator, and hence the adversarial learning aspect of ACL, results in a drop in ACC. This result demonstrates the important role of $D$ in ACL.

Replay buffer: We then explore the effect of using previous samples as replay buffer in avoiding forgetting. We keep ACL in its full shape (S+P+D) but we do not use samples from previous tasks during training. Note that the discriminator still has to predict task labels for new task data as they become available, while it has no chance of seeing the previous examples. Comparing the ACC achieved with no memory access with the best results obtained for ACL with memory access (ACL, 1 sample) shows the effect of adding a single image per class (5 per task): a performance gain from to . Unlike the A-GEM and ER-RES approaches, in which performance increases with more episodic memory, in ACL, ACC remains nearly similar to its highest performance. We have visualized this effect in Fig. 2, where the left panel illustrates the memory effect for ACL and memory-dependent baselines when , , , and images per class are used during training. Numbers used to plot this figure with their standard deviation are given in Table 2. We also show how memory affects the BWT in ACL in Fig. 2 (right), which follows the same pattern as we observed for ACC. Being insensitive to the amount of old data is a remarkable feature of ACL, not because of the small memory it consumes, but mainly due to the fact that access to the old data might be prohibited or very limited in some real world applications. Therefore, for a fixed allowed memory size, a method that can effectively use it for architecture growth can be considered more practical for such applications.

Samples per class
ACL (ours) ACC
Table 2: Comparison of the effect of the replay buffer size between ACL and other baselines including A-GEM [agem], and ER-RES [tinymem] on 20-Split miniImageNet where unlike the baselines, ACL’s performance remains unaffected by the increase in number of samples stored per class as discussed in 5.2. The results from this table are used to generate Fig. 2 below.
Figure 2: Left: Comparing the replay buffer effect on ACC on 20-Split miniImageNet achieved by ACL against A-GEM [agem] and ER-RES [tinymem] when using , , , and samples per class within each task, as discussed in 5.2. Right: Insensitivity of ACC and BWT to replay buffer in ACL. Best viewed in color.

Orthogonality Constraint ($\mathcal{L}_{diff}$): Finally, we study the effect of the orthogonality constraint in our objective function, defined in Eq. 3. The results show an increase of in ACC compared to when adversarial learning is the only method we use for factorization. Comparing the performance increase obtained by adding the difference loss with that obtained by adding the discriminator confirms that adversarial learning plays an important role in latent space disentanglement in our approach. We hypothesize this is due to the discriminator having direct access to task labels while training $S$, whereas the difference loss performs the minimization using the outputs of $S$ and $P$ only.

5.3 Visualizing the effect of adversarial learning in ACL

Here we illustrate the role of adversarial learning in factorizing the latent space learned for a sequence of tasks in a continual learning setting using t-distributed Stochastic Neighbor Embedding (t-SNE) [tsne] plots for the 20-Split miniImageNet experiment.

Figure 3: Visualizing the effect of adversarial learning in ACL where the latent spaces of both private and shared modules are compared against the generated features by corresponding modules trained without a discriminator. This plot shows that the shared module has been successfully trained to generate task-invariant features using adversarial learning whereas in the fourth column from left, we observe that without the discriminator, the shared module was only able to generate a non-uniform embedding

Fig. 3 visualizes the latent spaces of the shared and private modules trained with and without the discriminator. In particular, we used the model trained on the entire sequence of 20-Split miniImageNet and evaluated it on the test-sets belonging to tasks , , and each including images for classes, total of samples, which are color-coded with their class labels.

We first compare the discriminator’s effect on the latent space generated by the shared modules. As shown in Fig. 3, the shared modules trained with adversarial loss (second column from left) consistently appear as a uniformly mixed distribution of encoded samples belonging to all classes for each task. In contrast, in the features generated by shared modules that were trained without a discriminator (fourth column from left), we observe a non-uniformly distributed mixture of features where small clusters can be found for some classes (e.g. tasks , ), showing an entangled representation within each task.

We now move on to show the effect of the discriminator on the private modules’ latent spaces. As can be observed in the third column from left, private modules that were trained in a latent space factorized with a discriminator appear to be nearly successful in uncovering class labels in their latent space, although the final classification has yet to happen in the private heads. In contrast, in the absence of the discriminator, private modules (shown in the fifth column) generate features as entangled as those generated by their shared module counterparts.

In the last row of Fig. 3, once again we used our final model trained on the entire sequence of 20-Split miniImageNet and tested it on the first tasks of the sequence one at a time and plotted them all in a single figure for both shared and private modules with and without the discriminator. Similar to the pattern we observed above in comparing the shared feature space for each task with/without $D$, in the first column we observe a uniformly distributed embedding, now color-coded with task labels, where distinguishing tasks is impossible, as expected. In other words, it shows that the shared module has been successfully trained to generate task-invariant features using adversarial learning, whereas in the fourth column from the left we observe that without the discriminator the shared module was only able to generate a non-uniform embedding. This indicates the impact of the discriminator in finding the true shared features across tasks. On the other hand, for the private module in a setup with a discriminator (third column from left), we observe that separate clusters are generated with samples belonging to the same task, which shows that $P$ is only able to uncover task-specific features when it is trained along with a task-invariant space. In contrast, a private module that was trained along with a non-adversarial shared module (last column) provides nearly the same feature space as its shared module counterpart, indicating that task factorization could not occur without the presence of a discriminator in this setting.

Note that none of the results shown in Fig. 3 use the orthogonality constraints, so that they isolate the role of adversarial learning as the main mechanism for generating task-specific and task-invariant features.
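For readers unfamiliar with orthogonality constraints between two feature spaces, the following is a minimal sketch of one common form, the squared Frobenius norm of the cross-correlation between shared and private features. This particular form, the function name `orthogonality_penalty`, and the toy inputs are assumptions made for illustration; the text above does not give the exact expression used in ACL.

```python
import numpy as np

def orthogonality_penalty(shared: np.ndarray, private: np.ndarray) -> float:
    """Squared Frobenius norm of shared^T @ private over a batch.

    shared has shape (batch, dim_s) and private has shape (batch, dim_p).
    The penalty is zero when every shared feature column is orthogonal
    to every private feature column over the batch.
    """
    cross = shared.T @ private
    return float(np.sum(cross ** 2))

# Toy features (hypothetical): the two modules respond to different samples.
s = np.array([[1.0, 0.0], [0.0, 0.0]])
p = np.array([[0.0, 0.0], [0.0, 1.0]])
zero_penalty = orthogonality_penalty(s, p)

# Identical features are fully redundant and therefore penalized.
redundant_penalty = orthogonality_penalty(s, s)
```

Adding such a term to the training loss discourages the private module from re-encoding what the shared module already captures; the figures discussed above were produced without it.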

Method ACC% BWT%
HAT [hat] -
PNN [pnn] Zero -
A-GEM [agem] -
ORD-FT - -
ORD-JT - -
ACL-JT - -
ACL (Ours) -
(a) 20-Split CIFAR100
Method ACC% BWT%
UCB [ucb] -
ORD-FT - -
ACL (Ours) -
(b) Sequence of 5 Datasets
Table 3: CL results on 20-Split CIFAR100 measuring ACC, BWT, and Memory (MB). (*) denotes that methods do not adhere to the continual learning setup: ACL-JT and ORD-JT serve as the upper bound for ACC for ACL/ORD networks, respectively. denotes result reported from original work. denotes result reported from [gem]. denotes result is reported by [tinymem]. denotes result is obtained using the re-implementation setup by [hat]. denotes result is obtained by using the original provided code. All results are averaged over runs and standard deviation is given in parentheses.

6 ACL Performance on a Sequence of 5-Datasets

In this section, we present our results for continual learning of tasks using ACL in Table 3(b). As in the previous experiment, we report both ACC and BWT for ACL, for finetuning, and for UCB as our baseline. Results for this sequence are averaged over random permutations of tasks, and standard deviations are given in parentheses. CL on a sequence of datasets has previously been performed with two regularization-based approaches, UCB and HAT, of which UCB was shown to be superior [ucb]. On this sequence, ACL outperforms UCB by reaching ACC= and using only half of the memory size and no replay buffer. Bayesian neural networks such as UCB have twice as many parameters as a regular model, since they represent both the mean and the variance of the network weights. It is very encouraging to see that ACL is able to learn continually not only on a single dataset, but also across diverse datasets.
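Since ACC and BWT are the two metrics reported throughout, a short sketch of how they are typically computed may help. It assumes the standard definitions popularized by [gem]: ACC is the mean accuracy over all tasks after training on the last one, and BWT is the mean change in accuracy on each earlier task between the moment it was learned and the end of training. The accuracy matrix `R` below is made up for illustration and is not a result from this work.

```python
import numpy as np

def acc_and_bwt(R: np.ndarray) -> tuple:
    """Compute (ACC, BWT) from an accuracy matrix.

    R[i, j] is the test accuracy on task j after training on tasks 0..i.
    """
    T = R.shape[0]
    # ACC: average over all tasks of the accuracy after the final task.
    acc = float(np.mean(R[T - 1]))
    # BWT: average of (final accuracy - accuracy right after learning),
    # taken over every task except the last one.
    bwt = float(np.mean(R[T - 1, :T - 1] - np.diag(R)[:T - 1]))
    return acc, bwt

# Hypothetical accuracies (in %) for a 3-task sequence.
R = np.array([
    [90.0,  0.0,  0.0],
    [85.0, 80.0,  0.0],
    [80.0, 75.0, 70.0],
])
acc, bwt = acc_and_bwt(R)
```

A negative BWT indicates forgetting; a method with zero forgetting has BWT equal to zero by construction.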

7 Additional Experiments

Method ACC% BWT%
EWC [ewc] (T=10) - -
HAT [hat] (T=10) - -
UCB [ucb] (T=10) - -
VCL[vcl] (T=10) - -
VCL-C [vcl](T=10) -
PNN [pnn] (T=20) Zero N/A -
ORD-FT (T=10) - -
ORD-JT (T=10) - -
ACL-JT (T=10) - -
ACL (Ours) (T=10) - -
ACL (Ours) (T=20) -
ACL (Ours) (T=30) -
ACL (Ours) (T=40) -
Table 4: CL results on Permuted MNIST, measuring ACC, BWT, and Memory (MB). (*) denotes that methods do not adhere to the continual learning setup: ACL-JT and ORD-JT serve as the upper bound for ACC for ACL/ORD networks, respectively. () denotes result reported from original work. () denotes result is obtained by using the original provided code. () denotes the results reported by [hat] and () denotes results are reported by [tinymem]; T shows the number of tasks. All results are averaged over runs, and the standard deviation is provided in parentheses.

20-Split CIFAR100: In this experiment we incrementally learn CIFAR100 in classes at a time in tasks. As shown in Table 3, HAT is the most competitive baseline; it does not depend on memory and uses MB to store its architecture, in which it learns task-based attention maps, reaching ACC=. PNN uses MB to store the lateral modules in memory and guarantees zero forgetting. Results for A-GEM and ER-Reservoir are (re)produced by us using a CNN similar to our shared module architecture. We use fully connected layers with more neurons to compensate for the remaining number of parameters, reaching MB of memory. We also stored images per class ( images of size () in total), which requires MB of memory. ACL, however, achieves ACC= using only MB to grow private modules with parameters (MB), without using memory for a replay buffer. As in the previous experiments on MNIST, old data is not used, mainly because of the overuse of parameters for CIFAR100, which is considered a relatively ‘easy’ dataset with all tasks (classes) sharing the same data distribution. While we leave further discussion of this to Section 5.2, we note that factorizing the shared and private parameters in ACL avoids redundant parameters by storing only task-specific parameters in private modules. In fact, in contrast to other memory-based methods, instead of starting from a large network and using memory to store samples, which might not be available in practice due to confidentiality issues (e.g. medical data), ACL uses memory to gradually add small modules to accommodate new tasks and relies on knowledge transfer through the learned shared module. The latter is what makes ACL different from architecture-based methods such as PNN, where the network grows by an entire column, which uses memory highly disproportionate to what is needed to learn a new task.
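The memory argument above can be made concrete with a small back-of-the-envelope calculation. The sketch below compares growing one small private module per task on top of a single shared module with growing a full column per task. All parameter counts are hypothetical, parameters are assumed to be 32-bit floats, 1 MB is taken as 2^20 bytes, and lateral connections between columns are ignored.

```python
BYTES_PER_PARAM = 4        # assumption: parameters stored as 32-bit floats
BYTES_PER_MB = 1024 ** 2   # assumption: 1 MB = 2**20 bytes

def params_to_mb(num_params: int) -> float:
    """Convert a parameter count to megabytes."""
    return num_params * BYTES_PER_PARAM / BYTES_PER_MB

def modular_growth_mb(shared_params: int, private_params_per_task: int,
                      num_tasks: int) -> float:
    """One shared module plus one small private module per task."""
    return params_to_mb(shared_params + private_params_per_task * num_tasks)

def column_growth_mb(column_params: int, num_tasks: int) -> float:
    """One full-size column per task (lateral connections ignored)."""
    return params_to_mb(column_params * num_tasks)

# Hypothetical sizes, chosen only to illustrate the scaling behavior.
modular = modular_growth_mb(shared_params=524_288,
                            private_params_per_task=65_536,
                            num_tasks=20)
columns = column_growth_mb(column_params=524_288, num_tasks=20)
```

With these made-up sizes the modular network needs 7 MB while the column-wise network needs 40 MB; the gap widens linearly with the number of tasks.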

Permuted MNIST: Another popular variant of the MNIST dataset in the CL literature is Permuted MNIST, where each task is created by randomly permuting the pixels of the entire MNIST dataset. To compare against values reported in prior work, we report ACC, BWT, and memory for ACL and the baselines on sequences of and tasks. To further evaluate ACL’s ability to handle more tasks, we continually learned up to tasks. As shown in Table 4, among the regularization-based methods, HAT achieves the highest performance of [hat] using an architecture of size MB. Vanilla VCL improves by in ACC and in BWT using a K-means core-set memory size of samples per task (MB) and an architecture size similar to HAT. PNN appears as a strong baseline, achieving ACC= with guaranteed zero forgetting. Finetuning (ORD-FT) and joint training (ORD-JT) results for an ordinary network similar to that of EWC and HAT (a two-layer MLP with units and ReLU activations) are also reported as reference values for the lowest BWT and highest achievable ACC, respectively. ACL achieves the highest accuracy among all baselines for both sequences of and tasks, with ACC= and ACC=, and , respectively, which shows that the performance of ACL drops only by as the number of tasks doubles. ACL also remains memory-efficient, growing the architecture compactly by adding only parameters (MB) for each task, for a total of MB and MB when and , respectively, for the entire network including the shared module and the discriminator. We also observed that the performance of our model does not change as the number of tasks increases to and if each new task is accommodated with a new private module. As in the 5-Split MNIST experiment, we did not store old data and used memory only to grow the architecture by parameters (MB).
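The construction of Permuted MNIST tasks described above can be sketched as follows. Each task draws one fixed random permutation of pixel positions and applies it to every image. The function name `make_permuted_tasks`, the seed handling, and the stand-in images are illustrative assumptions, not the authors' data pipeline.

```python
import numpy as np

def make_permuted_tasks(images: np.ndarray, num_tasks: int, seed: int = 0):
    """Create one permuted copy of the dataset per task.

    images has shape (n, height, width). Every task applies its own fixed
    pixel permutation to all images, so the labels stay unchanged.
    """
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(flat.shape[1])  # one permutation per task
        tasks.append(flat[:, perm])
    return tasks

# Stand-in for MNIST: two fake 28x28 images with distinct pixel values.
fake_images = np.arange(2 * 28 * 28, dtype=np.float32).reshape(2, 28, 28)
tasks = make_permuted_tasks(fake_images, num_tasks=3)
```

Because a permutation only reorders pixels, every task contains the same information as the original dataset, which is why the tasks are equally hard but share no spatial structure.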

Method ACC% BWT%
EWC [ewc] - -
HAT [hat] -
UCB [ucb] -
VCL [vcl] - -
iCaRL [icarl] -
GEM [gem] -
VCL-C [vcl] -
ORD-FT - -
ORD-JT - -
ACL-JT (Ours) - -
ACL (Ours) -
Table 5: Class Incremental Learning on 5-Split MNIST, measuring ACC, BWT, and Memory (MB). (*) denotes that methods do not adhere to the continual learning setup: ACL-JT and ORD-JT serve as the upper bound for ACC for ACL/ORD networks, respectively. denotes result reported from original work. denotes result is obtained by using the original provided code. All results are averaged over runs, and the standard deviation is provided in parentheses.

5-Split MNIST: As the last experiment in this section, we continually learn MNIST digits by following the conventional pattern of learning classes over sequential tasks [vcl, SI, ucb]. As shown in Table 5, we compare ACL with regularization-based methods with no memory dependency (EWC, HAT, UCB, Vanilla VCL), methods relying on memory only (GEM and iCaRL), and VCL with a K-means core-set (VCL-C), where samples are stored per task. ACL reaches ACC= with zero forgetting, outperforming UCB with ACC=, which uses nearly more memory. In this task, we only use architecture growth (no experience replay), where private parameters are added for each task, resulting in a memory requirement of MB to store all private modules. Our core architecture has a total number of parameters (). We also provide naive finetuning results for ACL and a regular single-module network with () parameters (MB). Joint training (multi-task learning) results for the regular network (ORD-JT) are computed as ACC= for ACL which requires MB for the entire dataset as well as the architecture. Joint training only serves as an upper bound and is not a continual learning baseline.
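The class-incremental split used here can be illustrated with a small helper that groups consecutive classes into tasks. The sketch assumes two classes per task, which follows from dividing the ten MNIST digits into five tasks; the function name and the toy label array are hypothetical.

```python
import numpy as np

def split_by_classes(labels: np.ndarray, classes_per_task: int):
    """Group sample indices into tasks of consecutive classes.

    Returns a list of (task_classes, sample_indices) pairs, one per task.
    """
    classes = np.unique(labels)
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = classes[start:start + classes_per_task]
        indices = np.flatnonzero(np.isin(labels, task_classes))
        tasks.append((task_classes.tolist(), indices))
    return tasks

# Toy labels standing in for MNIST targets.
labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3])
tasks = split_by_classes(labels, classes_per_task=2)
```

Each task then trains its own private module on the samples selected by its index array, while the shared module sees all tasks in sequence.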

8 Conclusion

In this work, we propose a novel hybrid continual learning algorithm, Adversarial Continual Learning (ACL), that factorizes the representation learned for a sequence of tasks into task-specific and task-invariant features, where the former must be fully preserved to avoid forgetting and the latter is empirically found to be remarkably less prone to forgetting. The novelty of our work is that we use adversarial learning along with orthogonality constraints to disentangle the shared and private latent representations, which results in compact private modules that can be stored in memory and hence efficiently prevent forgetting. A tiny replay buffer, although not critical, can also be integrated into our approach if forgetting occurs in the shared module. We evaluated ACL on CL benchmark datasets and established a new state of the art on 20-Split miniImageNet, 5-Datasets, 20-Split CIFAR100, Permuted MNIST, and 5-Split MNIST.


9 Datasets

Table 5(a) shows a summary of the datasets used in our work. From left to right, the columns are: dataset name, total number of classes in the dataset, number of tasks, image size, number of training images per task, number of validation images per task, number of test images per task, and number of classes in each task. Statistics of the 5-Split MNIST and 5-Datasets experiments are given in 5(b) and 5(c), respectively. We did not use data augmentation for any dataset.

Dataset Classes Tasks Input Size Train/Task Valid/Task Test/Task Class/Task
5-Split MNIST [mnist] see Tab. 5(b) see Tab. 5(b)
Permuted MNIST [pmnist]
20-Split CIFAR100 [cifar]
20-Split miniImageNet [imagenet]
5-Datasets see Tab. 5(c) see Tab. 5(c) see Tab. 5(c)
(a) Statistics of utilized datasets
Task number
T = 1
# Training samples 10766 10276 9574 10356 10030
# Validation samples 1899 1813 1689 1827 1770
# Test samples 2115 2042 1874 1986 1983
(b) Number of training, validation, and test samples per task for 5-Split MNIST
# Training samples 51,000 15,526 9,574 42,500 62,269
# Validation samples 9,000 2,739 1,689 7,500 10,988
# Test samples 10,000 459 1,874 10,000 26,032
(c) Statistics of utilized datasets in 5-Datasets

MNIST, notMNIST, and Fashion MNIST are padded with to become and have channels.
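The training and validation counts in tables (b) and (c) above are consistent with holding out 15% of each task's training pool for validation, rounded down. The split ratio is not stated in the text, so this is an inference from the reported counts rather than a documented setting; the short check below reproduces it.

```python
# (train, valid) counts per task, copied from tables (b) and (c) above.
SPLITS = {
    "5-Split MNIST": [(10766, 1899), (10276, 1813), (9574, 1689),
                      (10356, 1827), (10030, 1770)],
    "5-Datasets": [(51000, 9000), (15526, 2739), (9574, 1689),
                   (42500, 7500), (62269, 10988)],
}

def floor_holdout(total: int, percent: int) -> int:
    """Size of a hold-out set taking `percent` of `total`, rounded down."""
    return total * percent // 100

# True if every reported validation size equals floor(15% of train + valid).
consistent = all(
    floor_holdout(train + valid, 15) == valid
    for pairs in SPLITS.values()
    for train, valid in pairs
)
```

Integer arithmetic is used on purpose so that the check does not depend on floating-point rounding.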